Multiple Negatives Ranking Loss

Also known as: MNRL, multiple negatives ranking, MultipleNegativesRankingLoss

TL;DR

MNRL is a contrastive ranking loss that scores a query against one positive and many negatives, then trains the positive to score highest. Popularized by sentence-transformers, it's the workhorse loss for fine-tuning bi-encoders on labeled pairs.

Multiple Negatives Ranking Loss (MNRL) is the loss most teams reach for when fine-tuning a bi-encoder on labeled (query, relevant document) pairs. For each positive pair in a batch of size B, treat the other B-1 documents in the batch as negatives and apply softmax cross-entropy over the similarity scores. The positive should be the highest-scoring entry; everything else should be pushed down.

loss = -log( exp(sim(q, d+) / τ) / Σⱼ exp(sim(q, dⱼ) / τ) )

This is mathematically identical to InfoNCE. The MNRL name comes from how sentence-transformers exposed the loss as MultipleNegativesRankingLoss in its first public release. In practice the terms get used interchangeably.
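To make the formula concrete, here is a minimal pure-Python sketch of the in-batch version. It assumes cosine similarity and τ = 0.05; the function names and the toy two-dimensional vectors are illustrative, not taken from any library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnrl_loss(queries, docs, tau=0.05):
    """Mean in-batch softmax cross-entropy; docs[i] is the positive for queries[i]."""
    losses = []
    for i, q in enumerate(queries):
        # One row of the similarity matrix: this query against every doc in the batch.
        logits = [cosine(q, d) / tau for d in docs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy batch of three pairs (made-up vectors, for illustration only).
queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled = [aligned[1], aligned[2], aligned[0]]  # positives moved off the diagonal

loss_aligned = mnrl_loss(queries, aligned)
loss_shuffled = mnrl_loss(queries, shuffled)
```

When each query sits closest to its own document the loss is near zero; moving the positives off the diagonal makes it much larger. If every document scores identically, the loss equals log(B), the chance level for a batch of B pairs.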

Why it’s the default for fine-tuning

Three reasons MNRL became the workhorse:

  1. No explicit negatives required. Feed it a list of positive pairs and it builds negatives from the batch itself. In-batch negatives are free supervision.
  2. Scales with batch size. The accuracy lift from MNRL keeps climbing as the batch size grows, so extra GPU memory can be spent on a larger batch and, with it, a stronger embedder.
  3. Composable with hard negatives. When in-batch negatives saturate, you append mined hard negatives to each row of the batch. The loss expression doesn’t change; you just have more columns in the similarity matrix.
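The third point can be sketched in a few lines. This is an illustration under stated assumptions, not a library implementation: embeddings are assumed to be unit-normalized (so a dot product equals cosine similarity), the function name is hypothetical, and every row's hard negatives are shared as candidate columns for all queries, which is one common choice.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mnrl_with_hard_negatives(queries, positives, hard_negatives, tau=0.05):
    """queries[i] pairs with positives[i]; hard_negatives[i] lists extra docs mined for row i.

    All positives and all hard negatives become columns, so the loss expression is
    unchanged and only the number of candidates per query grows.
    """
    columns = list(positives) + [d for row in hard_negatives for d in row]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, d) / tau for d in columns]
        m = max(logits)
        # Column i is still the positive for query i.
        total += m + math.log(sum(math.exp(x - m) for x in logits)) - logits[i]
    return total / len(queries)

# Toy unit vectors (assumed data).
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.8, 0.6], [0.6, 0.8]]
hard = [[[0.96, -0.28]], [[-0.28, 0.96]]]  # one mined negative per query

loss_in_batch = mnrl_with_hard_negatives(queries, positives, [[], []])
loss_with_hard = mnrl_with_hard_negatives(queries, positives, hard)
```

With only in-batch negatives the toy loss is already close to zero. Adding negatives that outscore the positive raises it again, which is the renewed gradient signal that hard negatives are meant to provide.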

MNRL is the lowest-friction way to teach a bi-encoder a new domain. Five thousand labeled positive pairs plus a sane batch size plus 1-3 epochs is usually enough to lift NDCG by 5-15 points on a specialized corpus.

The typical fine-tuning recipe

MNRL fine-tuning workflow
  • Collect 1K-50K (query, relevant-doc) pairs from search logs, click data, or synthetic generation.
  • Mine 5-15 hard negatives per query using a strong baseline first-pass model (BM25 + the current embedder).
  • Filter out false negatives — any “negative” whose label is ambiguous gets dropped, or you’ll train the model to push real matches apart.
  • Fine-tune with MNRL at the largest batch your hardware holds, temperature 0.05-0.1, 1-3 epochs.
  • Evaluate on a held-out eval set with NDCG@10 and recall@10. Stop the moment the eval curve plateaus — MNRL overfits fast on small datasets.
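The evaluation step of the workflow can be sketched as follows, assuming binary relevance labels. The `patience` and `min_delta` parameters of the plateau check are hypothetical knobs chosen for illustration; the recipe above does not prescribe specific values.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k])
              if doc_id in relevant_ids)
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg

def has_plateaued(history, patience=2, min_delta=0.001):
    """True when the last `patience` evals failed to beat the earlier best by min_delta."""
    if len(history) <= patience:
        return False
    best_before = max(history[:-patience])
    return max(history[-patience:]) < best_before + min_delta

# One query's ranking plus a made-up NDCG@10 history across checkpoints.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
recall = recall_at_k(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevant)
stop_now = has_plateaued([0.41, 0.47, 0.52, 0.52, 0.519])
```

In practice both metrics are averaged over every query in the held-out set, and the plateau check runs on that averaged curve after each evaluation.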

What MNRL does not do

It optimizes ranking, not calibration. After MNRL fine-tuning, the absolute similarity values still aren’t probabilities; they’re rank-correct numbers on an arbitrary scale. If you need calibrated relevance scores downstream (for thresholding, abstention, fusion), pair MNRL with a Platt or isotonic calibration step on a held-out set, or move to a training method that produces calibrated targets natively.
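A Platt-style calibration step can be sketched as a one-dimensional logistic regression fitted on held-out (score, label) pairs. The scores and labels below are invented, and the learning rate and step count are assumptions for this toy example rather than recommended settings.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_platt(scores, labels, lr=1.0, steps=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(a * s + b)
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrate(score, a, b):
    """Map a raw similarity score to an estimated probability of relevance."""
    return sigmoid(a * score + b)

# Held-out similarity scores with binary relevance labels (made-up data).
scores = [0.82, 0.75, 0.71, 0.64, 0.58, 0.55, 0.49, 0.41]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

a, b = fit_platt(scores, labels)
p_high = calibrate(0.82, a, b)
p_low = calibrate(0.41, a, b)
```

The mapping is monotonic, so ranking is preserved; only the scale changes. Isotonic regression is the non-parametric alternative when a sigmoid shape fits the held-out data poorly.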

The softmax temperature τ controls how sharply the loss focuses on the single hardest negative. Low τ (~0.05) makes the gradient come mostly from the most confusable negative: fast learning, but unstable if the hardest negative is actually a false negative (a relevant document that was never labeled). High τ (~0.2) spreads gradient across many negatives: stable, but the model never sharpens against near-duplicates.
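The effect is easy to check numerically. For the softmax loss, the gradient with respect to a negative's similarity is p_j / τ, where p_j is that negative's softmax probability, so each negative's share of the negative-side gradient is a softmax over the negatives alone. The sketch below uses invented similarity values and a hypothetical function name.

```python
import math

def negative_gradient_shares(neg_sims, tau):
    """Share of the total negative-side gradient that each negative receives."""
    logits = [s / tau for s in neg_sims]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Five negatives, hardest first (made-up similarities).
neg_sims = [0.70, 0.60, 0.50, 0.40, 0.30]
sharp = negative_gradient_shares(neg_sims, tau=0.05)
flat = negative_gradient_shares(neg_sims, tau=0.2)
```

On these toy values the hardest negative takes about 0.86 of the negative-side gradient at τ = 0.05 and about 0.43 at τ = 0.2.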

The sweet spot for retrieval embedders is empirically 0.05-0.1. Below that, models collapse to a degenerate solution where everything sits on a single hypersphere shell; above that, recall@1 stops improving. Most sentence-transformers recipes default to τ = 0.05 (the library expresses this as scale = 20, the inverse of τ).

Go further

How does MNRL differ from InfoNCE?

Mathematically they're the same softmax cross-entropy over similarity scores. MNRL is the name the sentence-transformers library picked when it shipped the loss as a one-liner for bi-encoder fine-tuning. So 'we trained with MNRL' and 'we trained with InfoNCE' usually describe identical optimization, just from different framings.

When do you supply explicit negatives instead of relying on in-batch?

Once the model saturates on random in-batch negatives — usually after a few thousand steps. Mined hard negatives (5-15 per query, scored by a strong first-pass model and filtered) give MNRL a non-trivial gradient signal long after in-batch loss flatlines. Most production fine-tuning recipes stack both.
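One way to sketch the "scored and filtered" step: keep the highest-scoring candidates from the first-pass model, but drop any that score almost as high as the labeled positive, since those are the likeliest false negatives. The function name and the `margin_ratio` cutoff are hypothetical, a heuristic chosen for illustration and not something this recipe specifies.

```python
def select_hard_negatives(candidates, positive_id, positive_score,
                          max_negatives=10, margin_ratio=0.95):
    """Pick top-scoring candidates that are not suspiciously close to the positive.

    candidates: list of (doc_id, first_pass_score) pairs for one query.
    margin_ratio: assumed cutoff; anything scoring above
    margin_ratio * positive_score is treated as a likely false negative.
    """
    ceiling = margin_ratio * positive_score
    kept = [(doc_id, score) for doc_id, score in candidates
            if doc_id != positive_id and score <= ceiling]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # hardest survivors first
    return [doc_id for doc_id, _ in kept[:max_negatives]]

# First-pass scores for one query (made-up values); "d1" is the labeled positive.
candidates = [("d1", 0.91), ("d2", 0.88), ("d3", 0.74), ("d4", 0.69), ("d5", 0.35)]
negatives = select_hard_negatives(candidates, positive_id="d1",
                                  positive_score=0.91, max_negatives=2)
```

Here "d2" is dropped because it scores within 5% of the positive, and the two hardest remaining candidates are kept. A ratio-based cutoff assumes positive scores; other score scales would need a different rule.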

What batch size does MNRL want?

As big as your hardware allows. The loss is a softmax across (1 positive + N negatives), so every extra negative tightens the ranking signal. Typical single-GPU sentence-transformers runs use batches of 64-128; serious embedder runs push to 4K-32K via gradient caching and cross-device gathers.
