Also known as: sigmoid-loss CLIP, Sigmoid Language-Image Pretraining
TL;DR
SigLIP (Zhai et al. 2023) replaces CLIP's softmax contrastive loss with a per-pair sigmoid loss, decoupling each (image, text) pair from the rest of the batch.
SigLIP (Sigmoid Language-Image Pretraining) is the modern successor to CLIP for training multimodal embeddings. The architecture is identical (two encoders, one for images and one for text, projected into a shared space), but the training objective swaps softmax-normalized InfoNCE for a per-pair sigmoid binary classifier. That single change rewrites the gradient structure of the loss, and the result is a model that trains more stably, scales to larger batches, and reaches CLIP-equivalent quality with substantially less compute.
Zhai et al. published the recipe in 2023; by 2024 it had largely displaced softmax CLIP as the default open-source vision-language pretraining objective. SigLIP-2 (2025) extended the approach with a captioning-based loss, self-distillation, and masked-prediction auxiliary losses, and is a common default checkpoint for vision-language stacks.
What changes mechanically
In CLIP-style training, you compute an N x N similarity matrix between N image embeddings and N text embeddings in a batch, then apply a temperature-scaled softmax along each axis and read off the cross-entropy. Each pair’s loss depends on every other pair in the batch through the softmax denominator.
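The softmax objective described above can be sketched in plain Python. This is a minimal illustration, not CLIP's actual training code: the function names, the toy 3 x 3 similarity matrix, and the fixed temperature of 0.07 are all assumptions made for the example.

```python
import math

def log_softmax(row):
    # Subtract the max before exponentiating so exp() cannot overflow.
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]

def clip_softmax_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over an N x N similarity matrix (rows: images, cols: texts)."""
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    # Image -> text: softmax along each row; the matching text sits on the diagonal.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image: softmax along each column.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy cosine similarities (assumed values): matched pairs on the diagonal.
sim = [[0.9, 0.1, 0.0],
       [0.2, 0.8, 0.1],
       [0.0, 0.3, 0.7]]
loss = clip_softmax_loss(sim)
```

Every diagonal term is normalized by a sum over its whole row or column, which is where the dependence on the rest of the batch comes from.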
SigLIP throws out the softmax. Each (image, text) similarity becomes its own logit, fed through a sigmoid, and treated as a binary classifier — “are these two paired?” The diagonal entries of the similarity matrix are positives (label 1), the off-diagonal entries are negatives (label 0). The total loss is the sum of N^2 independent binary cross-entropies. There is no partition function; no pair sees any other pair.
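A minimal sketch of that per-pair loss, under a few assumptions: the function names are hypothetical, the logit scale and bias are fixed numbers rather than learned parameters, and the sum is divided by N, which is the normalization used in the paper's formulation.

```python
import math

def log_sigmoid(x):
    # Stable form of log(1 / (1 + exp(-x))).
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))

def siglip_loss(sim, scale=10.0, bias=-10.0):
    """Sum of N^2 independent binary cross-entropies, divided by N."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            logit = scale * sim[i][j] + bias
            if i == j:
                total -= log_sigmoid(logit)    # positive pair, label 1
            else:
                total -= log_sigmoid(-logit)   # negative pair, label 0
    return total / n

# Toy cosine similarities (assumed values): matched pairs on the diagonal.
sim = [[0.9, 0.1],
       [0.2, 0.8]]
loss = siglip_loss(sim)
```

Each term in the double loop reads exactly one entry of the similarity matrix, so the loss can be computed in chunks without ever materializing a batch-wide normalizer.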
Why it actually wins
Softmax contrastive loss leans heavily on a large pool of in-batch negatives to provide signal — the gradient for a positive pair is essentially “be more similar than the hardest of N-1 negatives.” When N is small, the hardest negative is often easy, so the gradient becomes weak and noisy.
Sigmoid loss has no such dependence. Each pair's gradient depends only on that pair's own logit, regardless of how the rest of the batch looks. The model still benefits from negatives, but the loss doesn’t require a representative negative pool to train. Empirically, Zhai et al. report that sigmoid loss clearly outperforms softmax at batch sizes below roughly 16K, and that the gap narrows as the batch grows toward 32K.
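The difference in gradient structure can be checked on a single positive pair. The sketch below is illustrative only: the helper names and the two rows of logits are assumptions, and it looks at the gradient with respect to the positive logit alone.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    z = sum(exps)
    return [e / z for e in exps]

def sigmoid_grad_positive(logit):
    # d/dlogit of -log(sigmoid(logit)): a function of this pair's logit only.
    return sigmoid(logit) - 1.0

def softmax_grad_positive(row_logits, pos_index):
    # d/dlogit_pos of -log(softmax(row)[pos]): a function of the whole row.
    return softmax(row_logits)[pos_index] - 1.0

# Same positive logit (index 0), different negatives (assumed values).
easy = [2.0, -3.0, -3.0]
hard = [2.0, 1.9, 1.8]

sig_easy, sig_hard = sigmoid_grad_positive(easy[0]), sigmoid_grad_positive(hard[0])
soft_easy, soft_hard = softmax_grad_positive(easy, 0), softmax_grad_positive(hard, 0)
```

The two sigmoid gradients are identical because the negatives never enter the expression, while the softmax gradient is much larger in magnitude for the row with hard negatives.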
There’s also a numerical-stability story. The softmax in InfoNCE is often computed in reduced precision across thousands of pairs; the log-sum-exp can overflow or underflow without careful tricks. Sigmoid plus binary cross-entropy needs no normalization across the batch: every term is a function of a single logit and has a standard numerically stable form.
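As a rough illustration of the per-term stability, here is a sketch of a naive log-sigmoid next to the usual stable rewrite. It runs in Python's float64 rather than float16, and the function names are made up for the example; the point is only that the stable form never exponentiates a positive number.

```python
import math

def naive_log_sigmoid(x):
    # Direct transcription of log(1 / (1 + exp(-x))); exp() overflows for very negative x.
    return math.log(1.0 / (1.0 + math.exp(-x)))

def stable_log_sigmoid(x):
    # Equivalent form that only ever exponentiates a non-positive number.
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))

def bce_with_logit(logit, label):
    # Binary cross-entropy for label in {0, 1}, computed from the raw logit.
    return -stable_log_sigmoid(logit) if label == 1 else -stable_log_sigmoid(-logit)

try:
    naive_log_sigmoid(-800.0)
    naive_ok = True
except OverflowError:
    naive_ok = False  # math.exp(800.0) is out of range for a float

stable_value = stable_log_sigmoid(-800.0)
```

Deep-learning frameworks typically ship a fused "binary cross-entropy from logits" operation built on the same identity, so in practice this is handled by the library rather than by hand.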
CLIP uses a learned temperature parameter to scale logits before softmax, initialized at 0.07 and typically ending near 0.01. The temperature controls how sharply the loss focuses on the hardest negative; tuning it badly tanks training.
SigLIP keeps a learned temperature and adds a learned bias. The bias compensates for the prior imbalance: in a batch of N pairs you have N positives and N(N-1) negatives, so with a zero bias the initial loss is dominated by the negatives and the first updates are spent correcting for that. Zhai et al. initialize the bias to -10 and the logit scale (the inverse temperature) to 10, and both shift during training. Both the temperature and bias are learned, but neither is brittle: SigLIP is much less sensitive to their initialization than CLIP is to its temperature.
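The imbalance argument is easy to check with arithmetic. The sketch below assumes an untrained model whose similarities are all zero, a batch of N = 4096, and a logit scale of 10; it compares a zero bias with the strongly negative bias of -10 given as the initial value by Zhai et al. The function name and the batch size are illustrative choices.

```python
import math

def softplus(x):
    # Stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def initial_loss_split(n, bias, logit_scale=10.0, sim=0.0):
    """Loss contributed by positives and by negatives when every similarity equals `sim`."""
    logit = logit_scale * sim + bias
    positives = n * softplus(-logit)           # -log(sigmoid(logit)) per positive
    negatives = n * (n - 1) * softplus(logit)  # -log(1 - sigmoid(logit)) per negative
    return positives / n, negatives / n        # normalized by batch size

for bias in (0.0, -10.0):
    pos, neg = initial_loss_split(n=4096, bias=bias)
    print(f"bias={bias:6.1f}  positives={pos:.3f}  negatives={neg:.3f}")
```

With a zero bias the negatives outweigh the positives by a factor of N-1; with the negative bias the negative term shrinks below the positive term, so early updates are not dominated by pushing every logit down.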
What ships with the recipe
Concrete SigLIP variants in the wild
SigLIP-Base / Large / SoViT-400m: Google’s original 2023 checkpoints, multiple resolutions, multiple patch sizes.
SigLIP-2 (2025): adds a captioning-based loss, self-distillation, and masked prediction; current default for new VLMs.
SigLIP image towers in PaliGemma, Idefics2, LLaVA-OneVision: many open-source vision-language models place a SigLIP image encoder (often frozen) upstream of an LLM.
mSigLIP: multilingual variant trained on 100+ languages of paired captions.
Where it sits in the broader stack
A SigLIP encoder pair is the natural drop-in for any pipeline that previously used CLIP: zero-shot classification, image retrieval, the visual frontend of a vision-language model, or as a candidate generator for visual question answering. The output dimension and pooling behavior match CLIP closely enough that adapter weights often transfer with minor finetuning.
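As an illustration of the zero-shot use case, the sketch below scores one image embedding against a handful of label prompts. Everything concrete in it is assumed for the example: the three-dimensional toy embeddings, the prompt strings, and the logit scale and bias (a trained checkpoint carries its own learned values). It does not call any real model or library API.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_scores(image_emb, text_embs, logit_scale=10.0, bias=-10.0):
    """Independent per-label probabilities from cosine similarity, scale, and bias."""
    img = l2_normalize(image_emb)
    scores = {}
    for label, emb in text_embs.items():
        txt = l2_normalize(emb)
        sim = sum(a * b for a, b in zip(img, txt))
        scores[label] = 1.0 / (1.0 + math.exp(-(logit_scale * sim + bias)))
    return scores

# Toy embeddings standing in for encoder outputs (assumed values).
image_emb = [0.9, 0.1, 0.2]
text_embs = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.0, 1.0, 0.1],
}
scores = zero_shot_scores(image_emb, text_embs)
best = max(scores, key=scores.get)
```

Because each score comes from its own sigmoid, the probabilities need not sum to one; ranking by score works as with CLIP, and a threshold can additionally be used to reject images that match none of the prompts.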
The honest summary: SigLIP is what CLIP would have been if Zhai et al. had been on the original team. Sigmoid loss is the kind of one-line change that looks trivial in hindsight and reshapes a field. If you are starting a new vision-language pretraining run today, you start from SigLIP-2.
Paper
Sigmoid Loss for Language Image Pre-Training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer
We propose a simple pairwise Sigmoid loss for Language-Image Pre-training (SigLIP). Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. The sigmoid loss simultaneously allows further scaling up the batch size, while also performing better at smaller batch sizes.
Go further
Why does sigmoid loss train more stably than softmax InfoNCE?
Softmax InfoNCE makes every pair's gradient depend on every other pair in the batch through the partition function, so loss curvature is sensitive to batch composition and to the temperature parameter. Sigmoid loss is a sum of independent binary cross-entropies — each pair is its own classifier — so gradients are local and well-conditioned even at very large batch size.
Is SigLIP actually better than CLIP?
Yes. SigLIP-Base is reported at ~73% zero-shot ImageNet vs ~68% for OpenCLIP-Base, and the gap widens at smaller batch sizes. Zhai et al. showed sigmoid loss reaching at batch 16K the quality that softmax needs batch 32K to match. They also scaled the batch size up to one million and found that performance saturates, with a batch of about 32K being sufficient.
When should I pick SigLIP over CLIP for production?
Almost always, today. SigLIP-2 checkpoints are open, the recipe is well-understood, and inference cost is identical to CLIP. The only reasons to stick with OpenAI CLIP are legacy pipelines or specific downstream models (some VLMs were trained against a CLIP image tower and finetuning them onto SigLIP is non-trivial).