Also known as: MNRL, multiple negatives ranking, MultipleNegativesRankingLoss
TL;DR
MNRL is a contrastive ranking loss that scores a query against one positive and many negatives, then trains the positive to score highest. Popularized by sentence-transformers, it's the workhorse loss for fine-tuning bi-encoders on labeled pairs.
Multiple Negatives Ranking Loss (MNRL) is the loss most teams reach for when fine-tuning a bi-encoder on labeled (query, relevant document) pairs. For each positive pair in a batch of B pairs, treat the other B - 1 documents as negatives and apply softmax cross-entropy over the similarity scores. The positive should be the highest-scoring entry; everything else should be pushed down.
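Written out, the loss for one batch looks like the expression below. The notation is an assumption made here for illustration: B is the batch size, s(q, d) is the similarity between a query and a document embedding (typically cosine), and τ is the softmax temperature.

```latex
\mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B} \log
  \frac{\exp\left( s(q_i, d_i) / \tau \right)}
       {\sum_{j=1}^{B} \exp\left( s(q_i, d_j) / \tau \right)}
```

The numerator holds the positive pair; the denominator sums over every document in the batch, so the B - 1 documents that belong to other queries act as the negatives.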
This is mathematically identical to InfoNCE — the name comes from how sentence-transformers exposed the loss as MultipleNegativesRankingLoss in its first public release. In practice the terms get used interchangeably.
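A minimal, dependency-free sketch of that computation follows. It is not the sentence-transformers implementation: the function names, the toy 2-d embeddings, and the default temperature of 0.05 are assumptions chosen for illustration. Row i of the similarity matrix is treated as a classification problem whose correct class is column i.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnrl_loss(query_vecs, doc_vecs, temperature=0.05):
    """Mean softmax cross-entropy where row i's correct column is i."""
    losses = []
    for i, q in enumerate(query_vecs):
        # Row i of the similarity matrix, scaled by 1 / temperature.
        logits = [cosine(q, d) / temperature for d in doc_vecs]
        # Numerically stable log-sum-exp over the row.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the positive (the diagonal entry) as the target.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy batch of 3 pairs with made-up 2-d embeddings.
queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
docs_aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
# Same documents, but each query is now paired with the wrong one.
docs_shuffled = [docs_aligned[1], docs_aligned[2], docs_aligned[0]]

loss_aligned = mnrl_loss(queries, docs_aligned)
loss_shuffled = mnrl_loss(queries, docs_shuffled)
print(loss_aligned, loss_shuffled)
```

The aligned batch produces a loss near zero because each positive already dominates its row; the shuffled batch is penalized heavily because another document outranks the labeled positive.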
Why it’s the default for fine-tuning
Three reasons MNRL became the workhorse:
No explicit negatives required. Feed it a list of positive pairs and it builds negatives from the batch itself. In-batch negatives are free supervision.
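Scales with batch size. Each query is contrasted against the other B - 1 documents in a batch of size B, so the number of negatives grows linearly with the batch. Accuracy usually keeps climbing as B grows, so extra GPU memory tends to turn into a stronger embedder.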
Composable with hard negatives. When in-batch saturates, you append mined hard negatives to each row of the batch. The loss expression doesn’t change — you just have more columns in the similarity matrix.
MNRL is the lowest-friction way to teach a bi-encoder a new domain. Five thousand labeled positive pairs plus a sane batch size plus 1-3 epochs is usually enough to lift NDCG by 5-15 points on a specialized corpus.
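The "more columns" point can be sketched as follows. This is an illustrative sketch, not library code: the function name is hypothetical, the vectors are assumed to be L2-normalized so a dot product equals cosine similarity, and every mined negative is assumed to be shared across all rows of the batch.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mnrl_with_hard_negatives(queries, positives, hard_negatives, temperature=0.05):
    """MNRL where mined negatives are appended as extra columns.

    hard_negatives[i] is the list of mined negatives for query i. All of
    them go into one shared pool, so every query is scored against them.
    Returns (mean loss, number of columns in the similarity matrix).
    """
    # Columns 0..B-1 are the positives; the rest are the mined negatives.
    columns = list(positives) + [neg for row in hard_negatives for neg in row]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, d) / temperature for d in columns]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]  # the target is still column i
    return total / len(queries), len(columns)

# Toy unit-length vectors (made-up numbers).
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.8, 0.6], [0.6, 0.8]]
hard = [[[0.96, 0.28]], [[0.28, 0.96]]]  # one mined negative per query

loss_in_batch, cols_in_batch = mnrl_with_hard_negatives(queries, positives, [[], []])
loss_hard, cols_hard = mnrl_with_hard_negatives(queries, positives, hard)
print(cols_in_batch, cols_hard, loss_in_batch, loss_hard)
```

The loss expression is the same in both calls; only the width of each row changes. In this toy data the mined negatives score above the positives, so the loss jumps, which is exactly the signal that random in-batch negatives stop providing once they become too easy.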
The typical fine-tuning recipe
MNRL fine-tuning workflow
Collect 1K-50K (query, relevant-doc) pairs from search logs, click data, or synthetic generation.
Mine 5-15 hard negatives per query using a strong baseline first-pass model (BM25 + the current embedder).
Filter out false negatives — any “negative” whose label is ambiguous gets dropped, or you’ll train the model to push real matches apart.
Fine-tune with MNRL at the largest batch your hardware holds, temperature 0.05-0.1, 1-3 epochs.
Evaluate on a held-out eval set with NDCG@10 and recall@10. Stop the moment the eval curve plateaus — MNRL overfits fast on small datasets.
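For the evaluation step, the two metrics can be computed per query as sketched below and then averaged over the eval set. This sketch assumes binary relevance labels; the function names and document ids are made up for illustration.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top k."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance (gain 1 if relevant, else 0)."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    # Ideal ordering puts every relevant document at the top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg

# One query: the retriever's ranking and the labeled relevant set.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d9"}

recall = recall_at_k(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevant)
perfect = ndcg_at_k(["d1", "d9", "d3"], relevant)
print(recall, ndcg, perfect)
```

Tracking the mean of both values after each evaluation interval gives the curve to watch for a plateau.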
What MNRL does not do
It optimizes ranking, not calibration. After MNRL fine-tuning, the absolute similarity values still aren’t probabilities; they’re rank-correct numbers on an arbitrary scale. If you need calibrated relevance scores downstream (for thresholding, abstention, fusion), pair MNRL with a Platt or isotonic calibration step on a held-out set, or move to a training method that produces calibrated targets natively.
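A Platt-style calibration step can be sketched as a one-dimensional logistic fit from similarity score to relevance probability. This is a simplified stand-in under stated assumptions: it uses plain gradient descent rather than the regularized targets of Platt's original method, and the held-out scores, labels, learning rate, and step count are all made up. In practice a library routine such as scikit-learn's logistic or isotonic regression would usually do this job.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_platt(scores, labels, lr=0.1, steps=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(a * s + b)
            grad_a += (p - y) * s  # d(log loss)/da
            grad_b += (p - y)      # d(log loss)/db
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrate(score, a, b):
    """Map a raw similarity score to an estimated relevance probability."""
    return sigmoid(a * score + b)

# Held-out similarity scores with binary relevance labels (toy data).
held_out_scores = [0.15, 0.30, 0.35, 0.45, 0.50, 0.60, 0.70, 0.85]
held_out_labels = [0, 0, 1, 0, 1, 0, 1, 1]

a, b = fit_platt(held_out_scores, held_out_labels)
p_low = calibrate(0.15, a, b)
p_high = calibrate(0.85, a, b)
print(a, b, p_low, p_high)
```

Because the mapping is monotonic, it leaves the ranking untouched and only changes how the scores can be interpreted and thresholded.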
The softmax temperature τ controls how sharply the loss focuses on the single hardest negative. Low τ (~0.05) makes the gradient come mostly from the most confusable negative: fast learning, but unstable if the hardest negative is actually a false negative (a real match labeled as a negative). High τ (~0.2) spreads gradient across many negatives: stable, but the model never sharpens against near-duplicates.
The sweet spot for retrieval embedders is empirically τ = 0.05-0.1. Below that, models can collapse to a degenerate solution where everything sits on a single hypersphere shell; above that, recall@1 stops improving. Most sentence-transformers recipes default to τ = 0.05, which MultipleNegativesRankingLoss exposes as scale = 20 (the inverse temperature).
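The effect of temperature on where the gradient goes can be checked numerically. For softmax cross-entropy, the gradient with respect to a negative's logit equals that negative's softmax probability, so the sketch below reports how much of the total weight on negatives lands on the hardest one. The similarity values and function names are made up for illustration.

```python
import math

def negative_weights(neg_sims, pos_sim, temperature):
    """Softmax probability assigned to each negative in one row."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps[1:]]  # drop the positive's entry

def hardest_share(neg_sims, pos_sim, temperature):
    """Share of the negatives' total weight carried by the hardest one."""
    weights = negative_weights(neg_sims, pos_sim, temperature)
    return max(weights) / sum(weights)

# One query: a positive at 0.75 and five negatives of decreasing difficulty.
neg_sims = [0.70, 0.55, 0.40, 0.25, 0.10]
pos_sim = 0.75

share_low = hardest_share(neg_sims, pos_sim, temperature=0.05)
share_high = hardest_share(neg_sims, pos_sim, temperature=0.2)
print(share_low, share_high)
```

At the low temperature almost all of the push on negatives goes to the single most confusable document; at the high temperature it is spread across the row.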
Go further
How does MNRL differ from InfoNCE?
Mathematically they're the same softmax cross-entropy over similarity scores. MNRL is the name the sentence-transformers library picked when it shipped the loss as a one-liner for bi-encoder fine-tuning. So 'we trained with MNRL' and 'we trained with InfoNCE' usually describe identical optimization, just from different framings.
When do you supply explicit negatives instead of relying on in-batch?
Once the model saturates on random in-batch negatives — usually after a few thousand steps. Mined hard negatives (5-15 per query, scored by a strong first-pass model and filtered) give MNRL a non-trivial gradient signal long after in-batch loss flatlines. Most production fine-tuning recipes stack both.
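The mine-then-filter step can be sketched for a single query as below. The margin-based filter is an assumption made for illustration, not a rule from any specific library: candidates that score within `margin` of the best labeled positive, or above it, are dropped as likely false negatives. The function name, document ids, and scores are hypothetical.

```python
def mine_hard_negatives(query_scores, positive_ids, num_negatives=5, margin=0.05):
    """Pick the highest-scoring non-positive documents as hard negatives.

    query_scores: dict mapping doc_id to first-pass similarity for one query.
    positive_ids: set of doc ids labeled relevant (must appear in query_scores).
    margin: candidates scoring at or above (best positive - margin) are
        treated as suspected false negatives and skipped.
    """
    best_positive = max(query_scores[doc_id] for doc_id in positive_ids)
    candidates = [
        (doc_id, score)
        for doc_id, score in query_scores.items()
        if doc_id not in positive_ids and score < best_positive - margin
    ]
    # Hardest first: highest first-pass score among the survivors.
    candidates.sort(key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in candidates[:num_negatives]]

# First-pass scores for one query (made-up numbers); d1 is the labeled positive.
scores = {"d1": 0.82, "d2": 0.80, "d3": 0.70, "d4": 0.55, "d5": 0.20, "d6": 0.65}
mined = mine_hard_negatives(scores, positive_ids={"d1"}, num_negatives=3)
print(mined)
```

Here d2 scores almost as high as the labeled positive, so it is skipped rather than used as a negative; the next three candidates by score are kept.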
How large should the batch be?
As big as your hardware allows. The loss is a softmax across (1 positive + N negatives), so every extra negative tightens the ranking signal. Sentence-transformers recipes commonly use batches of 64-128 on a single GPU; serious embedder runs push to 4K-32K via gradient caching and cross-device gathers.