Vision Transformer (ViT)

Also known as: ViT, image transformer, patch transformer

TL;DR

The Vision Transformer applies a standard transformer to image patches instead of words. An image is cut into a grid of patches (typically 16×16 pixels each), each patch is linearly embedded into a token, and the tokens, with positional encodings added, are fed to a transformer encoder.

The Vision Transformer (ViT, Dosovitskiy et al. 2020) is the result of asking “what if we just fed images to a transformer?” An image is split into a grid of fixed-size patches — typically 16×16 pixels — each patch is flattened and linearly projected to a token embedding, position encodings are added, and the resulting sequence is processed by a standard transformer encoder. The architecture is almost identical to BERT; the only image-specific piece is the patch embedding step.

ViT replaced ConvNets as the default vision backbone wherever scale is available. Every modern multimodal embedding model, every vision-language model, and every diffusion model conditioned on image features uses a ViT-shaped encoder underneath.

Patch embedding, the only image-specific piece

A 224×224 RGB image becomes a 14×14 grid of 16×16 patches — 196 tokens. Each patch is a 768-dim vector (16·16·3) linearly projected to the model dimension (also 768 in ViT-B). A learnable [CLS] token is prepended, learned position embeddings are added, and the 197-token sequence enters the transformer.
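The arithmetic above can be checked in a few lines of Python. This is a minimal sketch: the function name `vit_sequence_shape` is made up for illustration, and it assumes a square image whose side is divisible by the patch size.

```python
def vit_sequence_shape(image_size, patch_size, channels=3, extra_tokens=1):
    """Return (grid_side, num_patches, patch_dim, sequence_length).

    extra_tokens counts prepended tokens such as [CLS] (assumed to be 1 here).
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid_side = image_size // patch_size              # patches per row/column
    num_patches = grid_side * grid_side               # patch tokens
    patch_dim = patch_size * patch_size * channels    # flattened patch length
    seq_len = num_patches + extra_tokens              # tokens entering the encoder
    return grid_side, num_patches, patch_dim, seq_len


print(vit_sequence_shape(224, 16))  # (14, 196, 768, 197)
```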

Implementation note: the patchify-then-project step is mathematically equivalent to a single 2D convolution with kernel size and stride both equal to the patch size. Reference implementations use the conv form because it’s faster and avoids reshape gymnastics.
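The equivalence can be illustrated numerically. The NumPy sketch below is not taken from any reference implementation: the helper names, the tiny 32×32 image, and the 8-dim output are assumptions chosen to keep the example small. It projects patches two ways, once by flattening and multiplying, once by sliding the same weights as a stride-16 kernel, and compares the results.

```python
import numpy as np


def patchify_then_project(image, weight, patch):
    """Cut an (H, W, C) image into patches, flatten each, apply a linear map."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c)
    patches = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(gh * gw, patch * patch * c)
    return patches @ weight.T                      # (num_patches, d)


def strided_conv(image, weight, patch):
    """Same weights viewed as d kernels of shape (patch, patch, C), stride = patch."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    d = weight.shape[0]
    kernels = weight.reshape(d, patch, patch, c)
    out = np.empty((gh * gw, d))
    for i in range(gh):
        for j in range(gw):
            window = image[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch, :]
            # Sum of elementwise products between each kernel and the window.
            out[i * gw + j] = np.tensordot(kernels, window, axes=([1, 2, 3], [0, 1, 2]))
    return out


rng = np.random.default_rng(0)
patch, d = 16, 8
image = rng.normal(size=(32, 32, 3))
weight = rng.normal(size=(d, patch * patch * 3))   # one row per output channel

tokens_linear = patchify_then_project(image, weight, patch)
tokens_conv = strided_conv(image, weight, patch)
same = np.allclose(tokens_linear, tokens_conv)
print(tokens_linear.shape, same)
```

The explicit loops are only for readability; a framework convolution performs the same computation in a single fused call, which is where the speed advantage comes from.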

Attention without a spatial inductive bias

A ConvNet encodes “nearby pixels matter together” by construction — its kernels only see local neighborhoods. A ViT does not. Every patch can attend to every other patch from layer one. The model has to learn spatial structure from data, with position embeddings as the only hint that patch ordering matters.

This is both ViT’s strength and its weakness. Strength: at scale the model discovers attention patterns more flexible than fixed convolutional receptive fields. Weakness: with little data, it generalizes worse than a ConvNet of similar size, because it has to spend data learning the locality that a ConvNet gets for free.

Variants and sizes

The ViT family scales the standard transformer dimensions: ViT-B (86M params, 12 layers, 768 dim), ViT-L (307M, 24 layers, 1024 dim), ViT-H (632M, 32 layers, 1280 dim), and ViT-G (1.8B). Patch size is the other knob: ViT-B/16 uses 16×16 patches, while ViT-B/14 uses 14×14 patches, producing more tokens (256 instead of 196 at 224×224) at the cost of more compute.
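The parameter counts quoted above can be roughly reproduced from depth and width alone. The sketch below is a back-of-the-envelope estimate, not an exact count: it assumes an MLP hidden width of 4× the model dimension and ignores biases, layer norms, the patch embedding, position embeddings, and the classification head. Both function names are hypothetical.

```python
def approx_encoder_params(layers, dim, mlp_ratio=4):
    """Approximate weight count of the transformer blocks only."""
    attention = 4 * dim * dim          # Q, K, V and output projections
    mlp = 2 * mlp_ratio * dim * dim    # two linear layers, hidden width mlp_ratio * dim
    return layers * (attention + mlp)


def num_patch_tokens(image_size, patch_size):
    """Patch tokens for a square image (excludes [CLS])."""
    return (image_size // patch_size) ** 2


for name, layers, dim in [("ViT-B", 12, 768), ("ViT-L", 24, 1024), ("ViT-H", 32, 1280)]:
    print(name, round(approx_encoder_params(layers, dim) / 1e6, 1), "M")

# Smaller patches mean more tokens, and attention cost grows with tokens squared.
tokens_16 = num_patch_tokens(224, 16)
tokens_14 = num_patch_tokens(224, 14)
print(tokens_16, tokens_14, round(tokens_14**2 / tokens_16**2, 2))
```

Under these assumptions the estimates come out at roughly 85M, 302M, and 629M, within a few percent of the published 86M, 307M, and 632M; the remainder is the embeddings, biases, and norms that the sketch leaves out.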

ViT in the wild

CLIP image encoder — ViT-L/14 is the canonical OpenAI CLIP vision tower.
DINOv2 — self-supervised ViT producing strong general-purpose visual features without labels.
SigLIP / SigLIP 2 — ViT trained with sigmoid contrastive loss; the current open-weight default.
Stable Diffusion’s text conditioning comes from a CLIP text encoder, which was trained contrastively against a ViT image encoder.
LLaVA, Qwen-VL, Claude vision — every modern VLM uses a ViT to tokenize images for the language model.

Self-attention treats its input as a set, not a sequence: without position information, shuffling the patches would only shuffle the per-patch outputs in the same way and leave any pooled or [CLS] readout unchanged. Since spatial layout obviously matters for vision, ViT adds learned 1D position embeddings to each patch token. Curiously, 2D position embeddings (encoding row and column separately) don’t measurably help; the model recovers spatial structure from 1D embeddings just fine. Later work (RoPE-2D, axial attention) revisits this, but plain learned 1D position embeddings remain the dominant choice.
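The set-like behavior of self-attention is easy to see in a toy example. The NumPy sketch below uses a single attention head with random weights and random "patch tokens"; all sizes and names are illustrative assumptions, and the random `pos` array merely stands in for learned position embeddings. Without position embeddings, shuffling the input just shuffles the output rows, so a mean-pooled readout is unchanged. With position embeddings tied to slots, the same shuffle changes the result.

```python
import numpy as np


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over rows of x."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
n, d = 6, 4
tokens = rng.normal(size=(n, d))
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
pos = rng.normal(size=(n, d))                      # stand-in for learned position embeddings
perm = np.array([1, 2, 3, 4, 5, 0])                # a fixed, non-trivial shuffle

# No position information: outputs are permuted exactly like the inputs.
out = self_attention(tokens, wq, wk, wv)
out_shuffled = self_attention(tokens[perm], wq, wk, wv)
equivariant = np.allclose(out[perm], out_shuffled)
pooled_equal = np.allclose(out.mean(axis=0), out_shuffled.mean(axis=0))

# Position embeddings stay attached to slots, so shuffling content now matters.
out_pos = self_attention(tokens + pos, wq, wk, wv)
out_pos_shuffled = self_attention(tokens[perm] + pos, wq, wk, wv)
position_aware = not np.allclose(out_pos[perm], out_pos_shuffled)

print(equivariant, pooled_equal, position_aware)
```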

For pure classification on small datasets, modern ConvNets (ConvNeXt, EfficientNet) are still competitive and often easier to train. For self-supervised representation learning at scale, ViT wins: every state-of-the-art image encoder since 2022 is a ViT. For vision-language alignment, ViT wins by default because the downstream LLM is also a transformer and the architectures compose naturally. Hierarchical and hybrid designs (Swin, MaxViT) reintroduce locality through windowed attention or convolution blocks; they trade some scale ceiling for better data efficiency.

Why ViT matters for multimodal

Every concept in this topic depends on ViT or a close descendant. CLIP uses a ViT image encoder paired with a text transformer. VLMs feed ViT patch tokens directly into an LLM context. Diffusion models use ViT-derived features for conditioning. The architectural unification of vision and language under one transformer paradigm is what makes the modern multimodal stack possible.

Go further

Why do ViTs only beat ConvNets at scale?

ConvNets bake in a strong inductive bias (translation equivariance, local receptive fields) that lets them learn from small datasets efficiently. ViTs have no such bias and must learn spatial structure from data. Below ~10M images, ConvNets win; above ~100M, ViTs catch up and pass them. The Dosovitskiy et al. (2020) paper made this scaling crossover explicit.

What is the [CLS] token actually doing?

It's a learned vector prepended to the patch sequence whose final representation is read out as the image-level embedding. Self-attention lets it pool information from every patch. Some later ViT variants drop it entirely and pool patch tokens with mean or attention pooling — the choice barely affects accuracy.
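The two readout options can be sketched side by side. In the snippet below, `encoder_output` is a random stand-in for the final-layer token sequence of a ViT-B at 224×224 (197 tokens of 768 dims), and placing [CLS] at index 0 is an assumption that follows the "prepended" description above.

```python
import numpy as np

rng = np.random.default_rng(0)
encoder_output = rng.normal(size=(197, 768))       # stand-in for real encoder output

# Option 1: read the image embedding from the [CLS] position.
cls_embedding = encoder_output[0]

# Option 2: drop [CLS] and average the 196 patch tokens instead.
mean_embedding = encoder_output[1:].mean(axis=0)

print(cls_embedding.shape, mean_embedding.shape)
```

Either vector has the model dimension and can feed the same classification or projection head, which is why swapping one for the other is a small change in practice.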

How are patches actually embedded?

A 16×16×3 patch is flattened to a 768-dim vector and linearly projected to the model dimension. Equivalently, a single conv layer with kernel size 16 and stride 16 implements the same operation in one shot — most reference implementations use the conv form for speed.
