Image Encoder

Also known as: vision encoder, image tower, vision backbone

TL;DR

An image encoder maps a raw image into a sequence of patch embeddings or a pooled vector. Modern multimodal stacks use Vision Transformer (ViT) encoders that tokenize the image into 16x16 or 14x14 patches.

The image encoder is the vision-side half of a multimodal model: the network that takes a raw image and produces a representation that downstream components (a contrastive head, an LLM, a classifier) can consume. In modern stacks it is almost always a Vision Transformer (ViT): the image is split into a grid of fixed-size patches, each patch is linearly projected into a token embedding, position embeddings are added, and a standard transformer encoder runs self-attention over the resulting sequence.

The output is either a sequence of patch embeddings (one per patch, often plus a special CLS token) or a single pooled vector. Which output you take depends on the consumer: a vision-language model wants the full sequence; a retrieval head wants the pooled vector. Both come from the same forward pass.
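To make the two output shapes concrete, here is a minimal pure-Python sketch of how both can be read off a single encoder output. It assumes the CLS token sits at index 0 of the sequence, which is a common convention but not a guarantee for every encoder; the function names and the toy numbers are illustrative only.

```python
from typing import List

Vector = List[float]


def cls_pool(tokens: List[Vector]) -> Vector:
    """Return the CLS embedding (assumption: CLS is the first token)."""
    return list(tokens[0])


def mean_pool(tokens: List[Vector], has_cls: bool = True) -> Vector:
    """Average the patch embeddings, skipping the CLS token if present."""
    patches = tokens[1:] if has_cls else tokens
    dim = len(patches[0])
    return [sum(p[i] for p in patches) / len(patches) for i in range(dim)]


# Toy encoder output: 1 CLS token + 4 patch tokens, hidden dim 2.
tokens = [[9.0, 9.0], [1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 6.0]]

pooled_cls = cls_pool(tokens)    # single vector for a retrieval head
pooled_mean = mean_pool(tokens)  # alternative single vector
patch_sequence = tokens[1:]      # full sequence for a token-consuming model
```

Real encoders may use attention pooling or a learned head instead of these two options; the point is only that every variant starts from the same token sequence.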

Architectures, briefly

Three families have shipped seriously in production.

Image encoder families

CNN-based: ResNet (2015) and EfficientNet (2019). Dominant pre-2021. Inductive bias toward local spatial patterns; strong on small datasets; scales less well than ViTs above roughly 10M training images.
ViT-based: ViT (Dosovitskiy et al., 2020), DeiT, BEiT, EVA, SigLIP image tower. Treats the image as a sequence of patches. Dominant from 2022 onward. Better scaling, but less built-in regularization on small datasets.
Hybrid: ConvNeXt (2022), Swin Transformer (2021), CoAtNet. Mixes convolutional and transformer design ideas: CoAtNet feeds convolutional early stages into attention blocks, Swin restricts attention to shifted local windows in a hierarchical layout, and ConvNeXt is a convolutional network rebuilt with ViT-era design choices. Aims for the inductive bias of CNNs plus the scaling of ViTs. Common in dense-prediction tasks (segmentation, detection).

For multimodal pretraining (CLIP, SigLIP, EVA-CLIP), the choice is universally ViT or a ViT-hybrid. The image tower in a vision-language model is essentially never a pure CNN today.

Patch tokenization, mechanically

A 224x224 RGB image with 14x14 patches yields a 16x16 grid, or 256 patches. Each patch is 14x14x3 = 588 values, linearly projected to whatever the transformer’s hidden dim is (768 for ViT-Base, 1024 for ViT-Large). Add learned 2D position embeddings, prepend a CLS token, run through 12-32 transformer blocks. Output: 257 token embeddings.
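The arithmetic above can be checked with a small pure-Python sketch of the patch-splitting step. It covers only the split into flattened patches, not the learned projection, position embeddings, or transformer blocks. The nested-list image layout (height x width x channels) and the name `patchify` are assumptions made for illustration; a real encoder would do this on tensors.

```python
from typing import List

Image = List[List[List[float]]]  # height x width x channels


def patchify(image: Image, patch_size: int) -> List[List[float]]:
    """Split an image into flattened patches, in row-major patch order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat: List[float] = []
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    flat.extend(pixel)  # append all channel values
            patches.append(flat)
    return patches


# A blank 224x224 RGB image is enough to check the shapes.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
patches = patchify(image, patch_size=14)

num_patches = len(patches)          # 256
values_per_patch = len(patches[0])  # 588
sequence_length = num_patches + 1   # 257 once a CLS token is prepended
```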

The patch size is a meaningful knob. Halving the patch side quadruples the token count, and self-attention cost grows with the square of the token count, so smaller patches (8x8 or even 4x4) capture finer detail at a steep compute cost. Larger patches (32x32) are cheap but blur small objects. The 14x14 / 16x16 default is the empirical sweet spot for 224-resolution inputs at base/large scale.
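A rough way to see the trade-off is to count tokens and pairwise attention interactions for each patch size. The sketch below is a back-of-the-envelope model only: it assumes a square image, ignores the CLS token, and counts just the tokens-squared attention term, not the MLP layers (which scale linearly with tokens). The function names are made up for this example.

```python
def token_count(image_size: int, patch_size: int) -> int:
    """Number of patches for a square image and square patches."""
    side = image_size // patch_size
    return side * side


def relative_attention_cost(image_size: int, patch_size: int,
                            baseline_patch_size: int = 16) -> float:
    """Pairwise attention interactions (tokens squared) relative to a baseline."""
    n = token_count(image_size, patch_size)
    base = token_count(image_size, baseline_patch_size)
    return (n * n) / (base * base)


for patch in (4, 8, 14, 16, 32):
    print(patch, token_count(224, patch), relative_attention_cost(224, patch))
```

At 224 resolution, going from 16x16 to 8x8 patches takes the sequence from 196 to 784 tokens and multiplies the attention term by 16; going to 32x32 drops it to 49 tokens and one sixteenth of the baseline.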

Distinction from text encoders

The two encoders in a CLIP-style model are similar in kind — both transformers, both producing sequences of token embeddings — but differ in the input pipeline and in inductive bias.

Input pipeline. A text encoder takes discrete token IDs (from BPE tokenization), looks them up in a learned embedding table, and adds 1D position embeddings. An image encoder takes continuous pixel values, applies a learned linear projection over patch pixels — that’s the “tokenization” step, but it has no embedding table — and adds 2D position embeddings.
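The difference between the two input pipelines fits in a few lines. This is a toy sketch with hand-written weights, not any library's API: `embed_text_tokens` indexes rows of a table by token ID, while `embed_patch` computes a linear projection over continuous pixel values. All names, sizes, and numbers here are hypothetical.

```python
from typing import List


def embed_text_tokens(token_ids: List[int],
                      table: List[List[float]]) -> List[List[float]]:
    """Text side: each discrete ID selects one row of a learned table."""
    return [table[token_id] for token_id in token_ids]


def embed_patch(patch_pixels: List[float],
                weight: List[List[float]],
                bias: List[float]) -> List[float]:
    """Image side: linear projection W x + b over flattened patch pixels."""
    return [
        sum(w * x for w, x in zip(row, patch_pixels)) + b
        for row, b in zip(weight, bias)
    ]


# Toy sizes: vocabulary of 3 tokens, hidden dim 2, patch of 4 pixel values.
table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
text_embeddings = embed_text_tokens([2, 0], table)

weight = [[1.0, 0.0, 0.0, 1.0],
          [0.0, 2.0, 0.0, 0.0]]
bias = [0.0, 1.0]
patch_embedding = embed_patch([0.5, 0.25, 0.0, 0.5], weight, bias)
```

The text path can only ever produce one of the table's rows; the image path can produce a different vector for every distinct patch, which is why no vocabulary is needed on the vision side.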

Sequence length. Text encoders typically see sequences of 32-128 tokens for captions; image encoders see 256-1024 patch tokens per image, depending on resolution and patch size. The image side has more tokens, which is part of why the modality balance is hard in CLIP-style training (more tokens = more compute on the image side).

Inductive bias. Text has strong sequential and hierarchical structure; image patches are spatially organized in 2D and don’t have a natural ordering. Position embeddings differ: 1D learned (text) vs 2D learned or sinusoidal (image). Some encoders (like Pixel-Aligned ViTs) push further with hierarchical attention, but the standard ViT just treats patches as a flat sequence with 2D position info.

Output usage. Text encoders typically pool to a single sentence embedding. Image encoders are increasingly kept as token sequences downstream — a VLM’s projector layer maps each patch token to the LLM’s hidden dim and feeds them in as input tokens. The pooled vector matters less in the modern stack than it did circa CLIP-2021.

The pooled-vector vs token-sequence question

For retrieval and classification, you want one vector per image; the CLS token or a mean-pooled patch sequence is fine. For vision-language models that need to reason about specific regions (“what’s in the upper left?”), you keep the full patch sequence so the LLM can attend to individual patches. ColPali extends this further: keep the patch sequence and apply ColBERT-style late interaction, treating each patch as a retrievable unit. That works strikingly well for document retrieval over PDF screenshots.
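Late interaction is easiest to understand from its scoring rule. The sketch below implements the MaxSim operator as used in ColBERT: for each query token, take the best dot-product match among the document's tokens, then sum those maxima. Applying it to image patch embeddings is the idea described above; the exact normalization and projection steps a given system uses are not covered here, and the vectors are toy values.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector],
                 patch_tokens: List[Vector]) -> float:
    """Sum over query tokens of the best match among patch tokens."""
    return sum(
        max(dot(q, p) for p in patch_tokens)
        for q in query_tokens
    )


# Two query tokens scored against the patch embeddings of two pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a strong match for each
page_b = [[0.5, 0.5], [0.5, 0.5]]              # only partial matches

score_a = maxsim_score(query, page_a)  # 2.0
score_b = maxsim_score(query, page_b)  # 1.0
```

Because each query token picks its own best patch, a page can score well when different regions answer different parts of the query, which a single pooled vector cannot express.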

For most production workloads — search, moderation, similarity — the pooled vector from a SigLIP-2 image tower is what you want. It’s cheap, it’s well-calibrated against the text tower, and it’s the format the rest of your retrieval stack expects.

Go further

ViT vs CNN — which wins for vision encoding?

ViT wins above ~10M training images. Below that scale CNNs match or beat ViTs because their convolutional inductive bias regularizes small datasets. Above it, ViTs scale better — more parameters and more data both help ViTs more than CNNs. Every state-of-the-art vision encoder since 2022 is a ViT or a hybrid; pure CNN backbones are now legacy except for edge inference.

Why patch tokenization at all? Why not pixel-level attention?

Self-attention is O(N^2) in sequence length. A 224x224 image has ~50,000 pixels — quadratic attention over that is infeasible. Splitting into 14x14 patches gives 256 tokens, which is tractable. The cost of patchification is a small loss of fine-grained spatial detail; the win is being able to use the full transformer toolkit.

Pooled vector or patch sequence — which output do I want?

Patch sequence for downstream models that consume tokens (VLMs, late-interaction retrieval, dense prediction). Pooled vector for retrieval, classification, and any application that wants a single fixed-size embedding. Most modern image encoders expose both — the patch sequence is the primary output, and the pooled vector is computed from it via a CLS token or attention pooling.
