In-Batch Negatives

Also known as: batch negatives, implicit negatives, siblings as negatives

TL;DR

The simplest way to scale contrastive training: treat every other example in the same batch as a negative for the current positive pair. Free supervision, no extra forward passes. The reason embedder training cares about batch size.

In-batch negatives are the cheap, dominant default for contrastive embedding training. For each positive pair in a batch of examples, the other documents in the batch are reused as negatives. No extra forward passes, no extra data loading: the negatives are already in GPU memory because you embedded them for their own positive pairs.

The result: for a batch of B=4096 query-document pairs, every query effectively has 4095 negatives. The InfoNCE softmax becomes a 4096-way classification problem with the right document as the correct answer. This is the entire reason embedder training pipelines pour engineering effort into running the largest batch the hardware can hold.
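The B-way classification view can be sketched in a few lines. This is a minimal, dependency-free illustration rather than a production loss: the function name `in_batch_info_nce`, the temperature of 0.05, and the assumption that embeddings are already L2-normalized are all choices made for this example. A real pipeline would compute the same quantity on a tensor library as one matrix multiply followed by a cross-entropy.

```python
import math

def in_batch_info_nce(query_embs, doc_embs, temperature=0.05):
    """Mean InfoNCE loss over a batch of (query, document) pairs.

    doc_embs[i] is the positive for query_embs[i]; every other document
    in the batch is reused as a negative. Embeddings are assumed to be
    L2-normalized, so the dot product is cosine similarity.
    """
    total = 0.0
    for i, q in enumerate(query_embs):
        # One row of the B x B similarity matrix, scaled by temperature.
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in doc_embs]
        # Numerically stable log-sum-exp over all B candidates.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the diagonal entry as the correct class.
        total += log_z - logits[i]
    return total / len(query_embs)

# Toy batch: three orthogonal unit vectors.
aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
loss_matched = in_batch_info_nce(aligned, aligned)                      # close to 0
loss_shuffled = in_batch_info_nce(aligned, aligned[1:] + aligned[:1])   # close to 1/temperature
```

When every query scores its own document highest, the loss is near zero. When the pairing is scrambled, each query is confidently wrong and the loss approaches the scaled similarity gap.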

Why batch size is the embedder hyperparameter

Increase batch size from 256 to 8192 and downstream retrieval metrics climb monotonically, usually 2-4 NDCG points across MTEB on the same model and data. The mechanism is information-theoretic: InfoNCE bounds mutual information between query and document representations as $I(q; d) \geq \log B - \mathcal{L}_{\text{InfoNCE}}$, where $B$ is the batch size. Bigger $B$, tighter bound, better embeddings.
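To make the batch-size dependence concrete, assume the standard form of the InfoNCE bound: mutual information is at least log B minus the loss. Because the loss cannot go below zero, the bound can never certify more than log B nats, whatever the model does. The helper name `mi_bound_ceiling` is made up for this sketch.

```python
import math

def mi_bound_ceiling(batch_size):
    """Largest mutual information (in nats) the InfoNCE bound can certify
    at this batch size: log(B) - loss, with the loss at its minimum of 0."""
    return math.log(batch_size)

for b in (256, 1024, 4096, 8192):
    print(f"B={b:>5}: ceiling = {mi_bound_ceiling(b):.2f} nats")
```

Going from 256 to 8192 raises the ceiling from about 5.5 to about 9.0 nats. The gain is logarithmic, so each doubling of the batch adds the same fixed increment of log 2.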

This is why state-of-the-art embedders train on hardware optimized for activation memory rather than parameter count. zembed-1, E5-Mistral, BGE-M3, GritLM all use four-digit batch sizes during contrastive pretraining. When you can’t fit a batch on one GPU, GradCache and DDP-with-AllGather are the standard tricks to grow effective batch size beyond per-device memory.

What in-batch negatives can’t fix

In-batch negatives are random relative to the query — they’re whatever happened to land in the same batch. Random samples from the data distribution are usually trivially separable: a query about diabetes treatment and a document about NBA scores have almost orthogonal embeddings already, and the loss gradient on that pair is near zero.

After the first few thousand training steps, in-batch loss saturates. The model perfects the easy negatives and stops learning. To keep training useful you have to introduce negatives that are deliberately similar to the positive (see hard-negative mining).
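The saturation argument can be checked numerically. In InfoNCE, the gradient of the loss with respect to a negative's logit equals the softmax probability assigned to that negative, so a probability near zero means almost no learning signal. The similarity values and the temperature of 0.05 below are illustrative assumptions, and `negative_gradient_weights` is a name invented for this sketch.

```python
import math

def negative_gradient_weights(pos_sim, neg_sims, temperature=0.05):
    """Softmax probability that InfoNCE assigns to each negative.

    For a negative j, the gradient of the loss with respect to its logit
    equals this probability, so a value near zero means the negative
    contributes almost no learning signal.
    """
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    z = sum(exps)
    return [e / z for e in exps[1:]]  # drop the positive, keep the negatives

# Positive at cosine 0.9, one unrelated negative (0.0), one near-miss (0.8).
easy, hard = negative_gradient_weights(pos_sim=0.9, neg_sims=[0.0, 0.8])
```

Under these assumptions the unrelated negative gets a weight on the order of 1e-8, while the near-miss gets roughly 0.12, which is why random negatives stop teaching the model anything once it has learned the coarse structure.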

The standard recipe stacks both: in-batch negatives for breadth, plus 5-15 explicitly mined hard negatives per query for depth. The hard negatives drive the actual learning signal once in-batch saturates.
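One way to stack the two sources of negatives is to append each query's mined negatives as extra columns next to the in-batch documents. The sketch below assumes L2-normalized embeddings and lets each query see only its own hard negatives. Some implementations instead share every mined negative across the whole batch, so treat this layout, the function names, and the default temperature as assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce_with_hard_negatives(query_embs, doc_embs, hard_neg_embs, temperature=0.05):
    """InfoNCE over in-batch documents plus per-query mined negatives.

    doc_embs[i] is the positive for query i.
    hard_neg_embs[i] is a (possibly empty) list of mined negatives for query i.
    """
    total = 0.0
    for i, q in enumerate(query_embs):
        in_batch = [dot(q, d) / temperature for d in doc_embs]       # breadth
        mined = [dot(q, h) / temperature for h in hard_neg_embs[i]]  # depth
        logits = in_batch + mined
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - in_batch[i]  # the positive is the diagonal in-batch entry
    return total / len(query_embs)

aligned = [[1.0, 0.0], [0.0, 1.0]]
loss_easy = info_nce_with_hard_negatives(aligned, aligned, [[], []])
# Add one mined negative at cosine 0.9 to the first query.
loss_hard = info_nce_with_hard_negatives(aligned, aligned, [[[0.9, 0.4359]], []])
```

With only the orthogonal in-batch documents the loss is already near zero. The single mined negative brings it back up, which is the extra learning signal the recipe is after.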

Cross-device gathering: the engineering bit

Each GPU naturally holds only its local microbatch. To make negatives from other GPUs available, you all_gather the embeddings across the data-parallel group before computing the loss. After the gather, every GPU sees every embedding in the global batch and scores its local queries against all of them. Whether gradients flow back through the gather depends on the implementation. The common trick: only the local-microbatch embeddings carry gradient on the local GPU, while the remote embeddings appear as additional negative columns in the local similarity matrix. This is how a 16-GPU run with per-device batch 256 trains as if batch size were 4096.
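The bookkeeping detail that is easiest to get wrong is where each device's positives land after the gather. The sketch below simulates the gather with plain lists instead of a real collective, so it runs on one machine. The function name, the rank-order concatenation, and the temperature are assumptions for illustration. In PyTorch, for example, the plain `torch.distributed.all_gather` collective does not track gradients for the gathered tensors, which is one reason implementations commonly put the local, gradient-carrying embeddings back into their own slot after gathering.

```python
def local_logits_and_labels(rank, local_queries, gathered_docs, temperature=0.05):
    """Build one device's slice of the global similarity matrix.

    local_queries: queries embedded on this device (the local batch).
    gathered_docs: document embeddings from every device, concatenated in
                   rank order, as an all-gather would return them.
    Returns (logits, labels): logits has shape [local_B][global_B], and
    labels[i] is the column holding the positive for local query i.
    """
    local_b = len(local_queries)
    logits = [
        [sum(a * b for a, b in zip(q, d)) / temperature for d in gathered_docs]
        for q in local_queries
    ]
    # Positives sit in this rank's own block of columns.
    labels = [rank * local_b + i for i in range(local_b)]
    return logits, labels

# Simulate 2 devices, each with a local batch of 2 (queries equal documents here).
shards = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],  # rank 0
    [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]],  # rank 1
]
gathered = [emb for shard in shards for emb in shard]  # stand-in for all_gather
logits, labels = local_logits_and_labels(rank=1, local_queries=shards[1], gathered_docs=gathered)
```

Rank 1 ends up with a 2 x 4 matrix whose correct columns are 2 and 3, not 0 and 1. Forgetting that offset silently trains every device except rank 0 against the wrong targets.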

Batch composition, not just size, silently determines model quality. Random shuffling is the default but produces topic clusters by chance. Topic-stratified sampling reduces false negatives but may shrink the effective batch diversity. Some papers use multi-source batch construction (mix many corpora per batch) to get topical diversity for free. The published “batch size = N” line hides whether N negatives were actually informative.
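As a rough illustration of multi-source batch construction, the sampler below draws an equal share of every batch from each corpus. The equal split, the stopping rule (stop when any corpus runs out), and the name `multi_source_batches` are assumptions made for this sketch. Published recipes differ in how they weight sources.

```python
import random

def multi_source_batches(sources, batch_size, seed=0):
    """Yield batches that draw evenly from every corpus.

    sources: dict mapping corpus name -> list of training examples.
    Stops as soon as any corpus cannot fill its share of a batch.
    """
    rng = random.Random(seed)
    names = sorted(sources)
    per_source = batch_size // len(names)
    if per_source == 0:
        raise ValueError("batch_size must be at least the number of sources")
    # Shuffle each corpus once, then consume it front to back.
    pools = {name: rng.sample(sources[name], len(sources[name])) for name in names}
    while all(len(pools[name]) >= per_source for name in names):
        batch = []
        for name in names:
            batch.extend(pools[name][:per_source])
            del pools[name][:per_source]
        rng.shuffle(batch)  # avoid a fixed source order inside the batch
        yield batch

corpora = {
    "forum": [("forum", i) for i in range(8)],
    "wiki": [("wiki", i) for i in range(8)],
}
batches = list(multi_source_batches(corpora, batch_size=4))
```

If `batch_size` is not divisible by the number of corpora, the batches come out slightly smaller than requested. A real sampler would need a policy for the remainder.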

Cost in one number

In-batch negatives are close to free: no additional encoder forward passes and no additional data loading. The only extra compute is the B×B similarity matrix, which is small next to the encoder itself. The real costs are the GPU memory to hold larger batches and the engineering of the all-gather. Mining hard negatives, by contrast, doubles or triples training cost per step, which makes in-batch the highest-leverage knob in the embedder training pipeline.

Go further

Why does the embedder community obsess over batch size?

Because every other example in the batch is a free negative. Doubling batch size doubles the number of negatives per positive, which tightens the InfoNCE mutual-information bound. State-of-the-art embedders train at batch sizes of 8K-32K precisely because the loss benefits keep climbing.

What happens when a batch accidentally contains true positives as negatives?

The loss actively trains the model to push apart what should be matched. At small batch sizes this is rare; at 32K batches with topic-correlated data, false negatives become the limiting factor. Mitigations include topic-stratified sampling, false-negative masking via similarity thresholds, and lower temperatures that focus gradient on the genuinely hardest negatives.
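False-negative masking via a similarity threshold can be sketched as follows. The rule used here, dropping any in-batch document whose cosine similarity to the query's own positive exceeds a threshold, is one of several possible variants. The threshold of 0.9, the temperature, and the function names are assumptions for illustration, and embeddings are assumed to be L2-normalized.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def masked_info_nce(query_embs, doc_embs, threshold=0.9, temperature=0.05):
    """In-batch InfoNCE that drops suspected false negatives.

    For query i, document j (j != i) is excluded from the softmax when it
    is nearly identical to the positive, i.e. cos(doc_i, doc_j) > threshold.
    """
    total = 0.0
    for i, q in enumerate(query_embs):
        pos_logit = dot(q, doc_embs[i]) / temperature
        neg_logits = [
            dot(q, d) / temperature
            for j, d in enumerate(doc_embs)
            if j != i and dot(doc_embs[i], d) <= threshold
        ]
        logits = [pos_logit] + neg_logits
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - pos_logit
    return total / len(query_embs)

# The first two pairs are duplicates of each other, so each is a false
# negative for the other.
queries = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
docs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
loss_masked = masked_info_nce(queries, docs, threshold=0.9)
loss_unmasked = masked_info_nce(queries, docs, threshold=1.1)  # threshold above 1 disables masking
```

Without masking, the duplicated pair is penalized for scoring its twin as highly as its own positive. With masking, that penalty disappears and the loss on this toy batch drops to nearly zero.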

Are in-batch negatives enough on their own?

For a decent embedder, yes. For a state-of-the-art one, no. In-batch negatives tend to be too easy — random examples from the same data distribution are usually obviously different from the positive. To push past that ceiling you need explicit hard negatives mined from the corpus.
