CLIP

Also known as: CLIP model, Contrastive Language-Image Pretraining

TL;DR

CLIP (Contrastive Language-Image Pretraining, Radford et al. 2021) is a dual-encoder model that embeds images and text into a shared vector space. It is trained contrastively on 400M (image, caption) pairs scraped from the web.

CLIP (Contrastive Language-Image Pretraining) is a 2021 OpenAI architecture that learns a joint embedding space for images and text by training two encoders to agree on which captions go with which images. A vision transformer embeds images, a text transformer embeds captions, and a symmetric InfoNCE loss pulls each (image, caption) pair together while pushing all the off-diagonal pairs in the batch apart. The training data is 400M pairs scraped from the public web — no human labels, no curated dataset.

CLIP is the dominant paradigm for joint multimodal embedding spaces, and the reason a single image search box can take “a red sports car at sunset” and return matching photos.

The recipe in one paragraph

Two towers. Image tower: a vision transformer (ViT-B/32 in smaller variants, ViT-L/14 in larger ones). Text tower: a 12-layer GPT-style transformer. Each tower's output is projected by a learned linear head into a shared space (512-dim for ViT-B/32, 768-dim for ViT-L/14) and L2-normalized. For a batch of N pairs, compute the N×N similarity matrix, scale by a learned temperature, and apply softmax cross-entropy along both rows and columns (image-to-text and text-to-image). Average the two losses. That’s the entire training objective.
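The objective described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the original training code: the function names are made up for this example, the temperature is fixed at 0.07 rather than learned, and the inputs are assumed to be already-projected embeddings. A real implementation would use a tensor library.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_loss(image_emb, text_emb, temperature=0.07):
    # image_emb[i] and text_emb[i] are assumed to be a matching pair.
    # temperature is fixed here for illustration; CLIP learns it during training.
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity between image i and caption j.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
              for img in imgs]
    # Image-to-text: softmax over each row, target is the diagonal entry.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: softmax over each column, target is again the diagonal.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2

# Toy batch of two pairs: aligned captions give a much lower loss than swapped ones.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = clip_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = clip_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

Both directions share the same N×N matrix; only the axis of the softmax changes, which is why the loss is called symmetric.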

Batch size is load-bearing. The original CLIP used 32,768. Larger batches mean more negatives in the denominator of the InfoNCE softmax, which yields a sharper signal. Below ~4k the model degrades noticeably.
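To make the batch-size point concrete, here is a small arithmetic sketch. It assumes the softmax formulation described above, where each image is scored against every caption in the batch; the helper names are illustrative only. The chance-level figure is the cross-entropy of a uniform guess over the batch, which is roughly where an untrained model starts.

```python
import math

def negatives_per_anchor(batch_size):
    # Every other caption in the batch acts as a negative for a given image
    # (and every other image for a given caption).
    return batch_size - 1

def chance_level_loss(batch_size):
    # Cross-entropy of a uniform guess over batch_size candidates: ln(N).
    return math.log(batch_size)

for n in (4096, 32768):
    print(n, negatives_per_anchor(n), round(chance_level_loss(n), 2))
```

A larger batch turns each row into a harder N-way classification problem, which is one way to read the "sharper signal" claim.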

What you can do with CLIP

CLIP capabilities, all zero-shot

Image classification: embed candidate class names ("a photo of a {class}") and pick the nearest text embedding.
Image retrieval from text: embed the query, find nearest image embeddings.
Text retrieval from images: reverse direction, useful for captioning datasets.
Image-image similarity: drop the text tower entirely; CLIP’s vision encoder is a strong general-purpose visual feature extractor.
Conditioning signal: feed CLIP text embeddings into a diffusion model to do text-to-image generation.
VLM image tokenizer: many VLMs start by passing the image through a CLIP or CLIP-style encoder, often kept frozen.
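Zero-shot classification from the list above reduces to a nearest-neighbor lookup over prompt embeddings. The sketch below assumes you already have an image embedding and some text-embedding function; `toy_embed_text` and its three-dimensional vectors are hypothetical stand-ins for a real CLIP text encoder, used only so the example runs on its own.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_embedding, class_names, embed_text,
                       template="a photo of a {}"):
    # Turn each class name into a caption-like prompt, embed it,
    # and pick the class whose prompt is closest to the image.
    prompts = [template.format(name) for name in class_names]
    scores = [cosine(image_embedding, embed_text(p)) for p in prompts]
    best = max(range(len(class_names)), key=scores.__getitem__)
    return class_names[best], scores

# Hypothetical text embeddings standing in for a real text tower.
TOY_TEXT_EMBEDDINGS = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}

def toy_embed_text(prompt):
    return TOY_TEXT_EMBEDDINGS[prompt]

# Hypothetical image embedding that lies closest to the "cat" prompt.
label, scores = zero_shot_classify([0.8, 0.2, 0.1], ["cat", "dog", "car"],
                                   toy_embed_text)
```

Text-to-image retrieval is the same computation with the roles reversed: one query embedding scored against many image embeddings.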

Where CLIP breaks

Three classes of failure.

Compositional reasoning: CLIP often retrieves images containing the right objects but the wrong relationships. “A red cube on a blue sphere” and “a blue cube on a red sphere” produce nearly identical embeddings. This bag-of-concepts behavior emerges from training on short captions.
Fine-grained text: CLIP can’t read text in images reliably. The model treats text as visual texture rather than semantic content; OCR is a separate problem.
Long captions: CLIP’s text encoder context is 77 tokens, and most captions in training are under 20 tokens. Feed it a paragraph and the result is a degraded vector. Long-CLIP and similar variants extend the context window, but the original training signal never covered long text.
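The bag-of-concepts failure can be illustrated with a deliberately crude toy. The encoder below is not CLIP's text tower: it is a hypothetical mean-of-word-vectors model, the limiting case of an encoder that ignores word order entirely. It shows why two captions with the same words but different relationships can end up with the same embedding.

```python
def toy_word_vector(word, dim=8):
    # Deterministic integer pseudo-embedding derived from character codes.
    # Purely illustrative; it carries no real semantics.
    seed = sum(ord(c) for c in word)
    return [(seed * (k + 1)) % 17 for k in range(dim)]

def bag_of_words_embedding(caption):
    # Average the word vectors, discarding all word-order information.
    vectors = [toy_word_vector(w) for w in caption.lower().split()]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

a = bag_of_words_embedding("a red cube on a blue sphere")
b = bag_of_words_embedding("a blue cube on a red sphere")
c = bag_of_words_embedding("a red cube")
d = bag_of_words_embedding("a blue cube")
```

Here `a` and `b` are identical because both captions contain the same words, while `c` and `d` differ because the word sets differ. CLIP is not this extreme, but the near-identical embeddings described above are the same effect in a weaker form.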

There is also a structural issue, the modality gap. After training, image embeddings and text embeddings occupy separate cones of the unit sphere. They’re closer to their cross-modal positives than to random negatives, but the two modalities never fully interleave. Liang et al. (2022) measured this carefully and observed it across the multimodal contrastive models they examined.

Two contributing factors. First, initialization: the encoders start in different regions of the embedding space and the contrastive loss is satisfied long before they meet. Second, the loss explicitly only pushes positive pairs together; it never forces intra-modal samples to mix with inter-modal ones. The gap is a measurable property of every CLIP-trained model and a known headache for cross-modal retrieval calibration.
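One simple way to put a number on the gap is the distance between the centroid of the image embeddings and the centroid of the text embeddings. The sketch below takes that approach as an assumption for illustration; the function names and toy vectors are made up, and other measures are possible.

```python
import math

def l2_normalize(v):
    # Project a vector onto the unit sphere.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def centroid(vectors):
    # Component-wise mean of a list of equal-length vectors.
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def modality_gap(image_embs, text_embs):
    # Euclidean distance between the mean image embedding and
    # the mean text embedding, both computed after L2 normalization.
    img_center = centroid([l2_normalize(v) for v in image_embs])
    txt_center = centroid([l2_normalize(v) for v in text_embs])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(img_center, txt_center)))

# Toy embeddings in two separate "cones" of a 2-D unit circle.
toy_images = [[1.0, 0.0], [0.9, 0.1]]
toy_texts = [[0.0, 1.0], [0.1, 0.9]]
gap = modality_gap(toy_images, toy_texts)
```

A gap of zero would mean the two modalities share a centroid; on real CLIP embeddings the value is expected to be clearly positive, per the discussion above.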

Why CLIP shaped everything that came after

The recipe — two encoders, one contrastive objective, web-scale paired data — generalizes. CLIP-style training has been applied to (audio, text), (video, text), (3D, text), and (protein, text) pairs. Vision-language models use CLIP-like image encoders as their visual front-end. Diffusion models use CLIP text embeddings as conditioning signal. The 2021 paper is one of the highest-leverage architectural contributions of the decade — its descendants are everywhere.

Go further

Why was CLIP a paradigm shift?

Before CLIP, vision models were typically trained on labeled classification datasets such as ImageNet and didn't transfer well to anything off-distribution. CLIP showed that contrastive training on noisy web-scale (image, caption) pairs produces a single model that does zero-shot classification and retrieval, and serves as a frozen feature extractor for downstream tasks. The recipe generalized everywhere.


What is the difference between CLIP and SigLIP?

SigLIP (Zhai et al. 2023) replaces CLIP's softmax-over-batch InfoNCE with a per-pair sigmoid loss. This decouples the loss from batch size: CLIP relies on very large batches to train well (the original used 32,768), while SigLIP works with much smaller ones, which means cheaper training and accessibility outside frontier labs. SigLIP-2 is a common open-weight default.
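The per-pair sigmoid loss can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the reference implementation: `t` and `b` stand for the scale and bias that SigLIP learns, fixed here at 10 and -10 for simplicity, and the function names are made up for this example.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def siglip_loss(image_emb, text_emb, t=10.0, b=-10.0):
    # Each (image i, caption j) pair is an independent binary decision:
    # label +1 on the diagonal (true pair), -1 everywhere else.
    # t and b are fixed here for illustration; SigLIP learns them.
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            logit = t * sum(x * y for x, y in zip(imgs[i], txts[j])) + b
            total += log_sigmoid(label * logit)
    return -total / n

images = [[1.0, 0.0], [0.0, 1.0]]
aligned = siglip_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = siglip_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

No term in the sum depends on a normalizer computed over the whole batch, which is the sense in which the loss is decoupled from batch size.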


Can CLIP be used for text-only retrieval?

Poorly. CLIP's text encoder is trained on short captions and never sees long-form text or fine-grained sentence semantics. It loses badly to a same-size text-only embedder on text retrieval benchmarks. Use CLIP's text encoder only for cross-modal queries against images; use a real text embedder for everything else.

