Pairwise Reranker

Also known as: pairwise scoring, pairwise judge

TL;DR

A reranker that scores by comparing two candidate documents head-to-head: model(query, doc_A, doc_B) → which is more relevant. More accurate than pointwise (transitivity arbitrage, calibration-free) but quadratic in the number of candidates at inference.

A pairwise reranker takes a query plus two candidate documents and emits a single judgment: which one is more relevant. Its API is model(query, doc_A, doc_B) → p(A > B) — not a score per document, a preference per pair. To actually rerank a list, you run it across many pairs and aggregate.

[Figure: Pairwise reranker. Compare every pair, then sum the wins. A pairwise outcome matrix, cell (i, j) = p(dᵢ ≻ dⱼ), for four candidates A (d₁₇), B (d₄₂), C (d₂₃), D (d₈₁). Row-sum = Σⱼ p(dᵢ ≻ dⱼ); fit Thurstone for calibrated scores. Derived ranking by row-sum of wins: A 2.28, C 1.74, B 1.01, D 0.97. Cost is K(K−1)/2 pairs: 6 at k = 4 (this diagram, cheap), 45 at k = 10 (cascade top-k, ~1 ms), 4,950 at k = 100 (full live, ~7.4 s).]
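The aggregation step can be sketched in a few lines. This is a minimal illustration using the four-candidate outcome matrix from the diagram; the function name `rank_by_row_sum` is hypothetical, and row-sum is the naive aggregator rather than a Thurstone fit.

```python
# Pairwise outcome matrix from the four-candidate example:
# P[i][j] = p(doc_i beats doc_j); the diagonal is unused.
docs = ["A", "B", "C", "D"]
P = [
    [0.00, 0.78, 0.66, 0.84],  # A vs B, C, D
    [0.22, 0.00, 0.31, 0.48],  # B vs A, C, D
    [0.34, 0.69, 0.00, 0.71],  # C vs A, B, D
    [0.16, 0.52, 0.29, 0.00],  # D vs A, B, C
]


def rank_by_row_sum(names, matrix):
    """Sort documents by their summed win probabilities, best first."""
    totals = {
        name: sum(p for j, p in enumerate(row) if j != i)
        for i, (name, row) in enumerate(zip(names, matrix))
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_row_sum(docs, P)
for name, total in ranking:
    print(f"{name}: {total:.2f}")
```

The printed order is A, C, B, D, matching the derived ranking in the diagram.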

This sounds like an awkward shape for production retrieval, and it is. The reason pairwise rerankers matter is that they are the most accurate label source the field has — and most production rerankers are trained against their judgments, even when the production model itself is pointwise.

Pairwise vs pointwise

A pointwise reranker emits one number per (query, doc) pair. A pairwise reranker emits one preference per (query, doc_A, doc_B) triple. The two are interconvertible at the output level: given pointwise scores, you can derive pairwise preferences; given pairwise preferences, you can fit a Thurstone model to recover pointwise scores.
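The first direction (pointwise scores to a pairwise preference) can be sketched as follows. The logistic link is an assumption borrowed from Bradley-Terry, the scores are made up for illustration, and `preference_from_scores` is a hypothetical helper.

```python
import math


def preference_from_scores(score_a: float, score_b: float) -> float:
    """p(A > B) under a logistic (Bradley-Terry style) link on the score gap."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


# Illustrative pointwise scores for two candidates of the same query.
pointwise = {"doc_A": 1.2, "doc_B": 0.4}

p_ab = preference_from_scores(pointwise["doc_A"], pointwise["doc_B"])
p_ba = preference_from_scores(pointwise["doc_B"], pointwise["doc_A"])
print(f"p(A > B) = {p_ab:.3f}, p(B > A) = {p_ba:.3f}")
```

The two probabilities sum to one, so a single call per pair is enough when the preference comes from scores rather than from a judge.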

But they are not interconvertible at the learning level. Training a pairwise judge is much easier than training a calibrated pointwise scorer:

Why pairwise is the easier learning target
  • Calibration-free — the model never has to commit to “this is a 0.73”; it only has to know that A is better than B. No anchoring drift, no scale ambiguity.
  • Transitivity arbitrage — when the model judges A > B and B > C, the implicit A > C constraint propagates through Thurstone fitting. A pointwise model has no equivalent free signal.
  • Robust to per-rater bias — if the LLM judge that produced your training labels likes long documents 5% more than short ones, that bias is constant across pairs and largely cancels in the pairwise comparison. Pointwise targets carry the bias straight through.
  • Less noisy supervision: inter-rater agreement on “which is better” is typically 90%+; on “rate this 0-10” it can be below 0.6 correlation.

A pairwise reranker has to learn ordering. A pointwise reranker has to learn ordering plus a calibration. The first task is dramatically easier.
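A toy illustration of the calibration-free point: two judges who anchor their scales differently disagree on every pointwise label yet agree on the pairwise verdict. The judges and numbers here are invented for the example, and the sketch only covers order-preserving distortions of the scale.

```python
# Underlying relevance of two documents (illustrative values).
true_scores = {"A": 0.62, "B": 0.55}


def lenient_judge(score: float) -> float:
    """A judge whose numbers run high by a constant offset."""
    return score + 0.2


def harsh_judge(score: float) -> float:
    """A judge who compresses the whole scale."""
    return 0.5 * score


for judge in (lenient_judge, harsh_judge):
    a, b = judge(true_scores["A"]), judge(true_scores["B"])
    # The pointwise labels differ per judge, but the ordering does not.
    print(f"{judge.__name__}: A={a:.2f} B={b:.2f} prefers_A={a > b}")

verdicts = [
    judge(true_scores["A"]) > judge(true_scores["B"])
    for judge in (lenient_judge, harsh_judge)
]
```

Averaging the pointwise numbers would mix two scales; the pairwise verdicts need no reconciliation.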

Why pairwise is the better label source

The honest argument for pairwise rerankers is not that they ship better — it is that the labels they produce are better than any pointwise label you could collect directly.

If you ask three frontier LLMs to score (query, doc) pairs on a 0-10 scale, they will return correlated but inconsistently-anchored numbers: one judge’s “7” is another’s “5”, and the same judge drifts across a session. Average them and you get a target that is calibrated to nothing in particular.

If you ask the same three judges “which of A or B is more relevant,” agreement is sharp, position bias can be neutralized by swap-and-re-ask, and the resulting graph of preferences contains strictly more information than the pointwise scores would have — because pointwise scores can be recovered from pairwise via Thurstone, but the reverse direction loses information about how confident the model was on each comparison.
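Swap-and-re-ask can be sketched as below. The `Judge` signature and the toy judge with a fixed first-position bonus are assumptions made for the example, not a real model API.

```python
from typing import Callable

# Hypothetical judge signature: (query, first_doc, second_doc) -> p(first > second).
Judge = Callable[[str, str, str], float]


def debiased_preference(judge: Judge, query: str, doc_a: str, doc_b: str) -> float:
    """Ask in both orders and average, so a first-position bonus cancels."""
    forward = judge(query, doc_a, doc_b)   # estimate of p(A > B), A shown first
    backward = judge(query, doc_b, doc_a)  # estimate of p(B > A), B shown first
    return 0.5 * (forward + (1.0 - backward))


def toy_judge(query: str, first: str, second: str) -> float:
    """Stand-in judge that adds +0.1 to whichever document is shown first."""
    true_p = 0.7 if first == "relevant doc" else 0.3
    return min(1.0, true_p + 0.1)


raw = toy_judge("q", "relevant doc", "other doc")
fixed = debiased_preference(toy_judge, "q", "relevant doc", "other doc")
print(f"raw={raw:.2f} debiased={fixed:.2f}")
```

The single-order answer is inflated to about 0.8; the averaged answer recovers the underlying 0.7, at the price of two judge calls per pair.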

This is the asymmetry that makes pairwise the natural label source for modern reranker training.

The teacher-student distillation pattern (this is what we actually do)

The reason a pairwise reranker matters in practice is not that it is the production model — it almost never is — but that it is the teacher in a pipeline. Concretely:

  1. Train (or prompt) a pairwise judge. For ZeroEntropy this is a 4B cross-encoder distilled from an ensemble of frontier LLMs. Its only job is (q, d_A, d_B) → p(A > B).
  2. Run the pairwise judge over a sparse comparison graph. For each query, compare each candidate to only a few others rather than to every other candidate, so the number of edges grows linearly with the candidate count instead of quadratically. The graph stays connected with diameter 2 at the chosen sparsity.
  3. Fit Thurstone. Recover a continuous Elo-style score per (query, doc) from the sparse pairwise graph. These are the fitted relevance scores — not annotations, recovered statistically from many head-to-head outcomes.
  4. Train the production pointwise model on the fitted scores. MSE regression: the student's predicted score for each (query, doc) is regressed onto its fitted Thurstone score. The student is pointwise, fast, and inherits the teacher’s calibration.
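Step 2 needs a sparse but connected set of pairs. One simple way to get one, shown here as a sketch and not necessarily the scheme used in the pipeline above, is to take the union of two random cycles through all candidates, so every document appears in exactly four comparisons. The function name and the rejection-sampling approach are assumptions, and the sketch makes no claim about the resulting diameter.

```python
import random


def sparse_comparison_pairs(doc_ids, cycles=2, seed=0):
    """Union of edge-disjoint random cycles over all docs.

    Each cycle visits every doc once, so the graph is connected and
    every doc ends up in exactly 2 * cycles comparisons.
    """
    rng = random.Random(seed)
    n = len(doc_ids)
    if n < 4 * cycles:
        # Guard that keeps the rejection sampling below from stalling.
        raise ValueError("too few documents for this many disjoint cycles")
    pairs = set()
    while len(pairs) < cycles * n:
        order = list(doc_ids)
        rng.shuffle(order)
        cycle = {frozenset((order[i], order[(i + 1) % n])) for i in range(n)}
        if cycle.isdisjoint(pairs):  # reject cycles that repeat an existing pair
            pairs |= cycle
    return [tuple(sorted(pair)) for pair in pairs]


docs = [f"d{i}" for i in range(20)]
pairs = sparse_comparison_pairs(docs)
dense = len(docs) * (len(docs) - 1) // 2
print(f"{len(pairs)} sparse pairs instead of {dense} dense pairs")
```

For 20 candidates this yields 40 judge calls instead of 190, and the gap widens linearly-versus-quadratically as the candidate count grows.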

The pairwise reranker never ships. It exists in the training pipeline, produces millions of judgments, and disappears. What ships is a pointwise reranker that can be served at one forward pass per (query, doc).

This is the architecture behind zerank-1 and zerank-2.

The pairwise judge can be a frontier LLM directly, and frontier LLMs are the original label source, but the throughput is wrong for training-scale data. A single Claude or GPT pairwise judgment takes 10-50 ms and costs a few cents. Generating ten million pairwise labels for a reranker training run costs $100K+ in API fees and takes weeks of wall time.

The fix is a two-stage distillation. First, an ensemble of frontier LLMs labels a smaller seed set of pairs (~hundreds of thousands). Then a 4B cross-encoder is trained to mimic the ensemble’s pairwise judgments, using a binary cross-entropy loss against the ensemble's probability that A beats B. Once that pairwise SLM is trained, it judges new pairs at ~1000× the throughput of the LLM ensemble, which is what makes per-query Thurstone fitting tractable across millions of training queries.
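The loss for that first distillation stage is ordinary binary cross-entropy with a soft target. A minimal per-pair sketch, assuming the student outputs a probability and using a hypothetical helper name `soft_bce`:

```python
import math


def soft_bce(p_student: float, p_teacher: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy of the student's p(A > B) against a soft teacher target."""
    p = min(max(p_student, eps), 1.0 - eps)  # clamp to keep log() finite
    return -(p_teacher * math.log(p) + (1.0 - p_teacher) * math.log(1.0 - p))


# The ensemble says A beats B with probability 0.8 (illustrative value).
for guess in (0.2, 0.5, 0.8):
    print(f"student={guess:.1f} loss={soft_bce(guess, 0.8):.3f}")
```

The loss is smallest when the student reproduces the ensemble's probability, which is how the student inherits the teacher's confidence and not just its hard verdicts.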

The pairwise SLM is the teacher of the production pointwise model, but it is itself the student of the LLM ensemble. Two distillation stages, each one trading a small accuracy loss for a large throughput gain.

From pairwise judgments to a global score

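The bridge from pairwise outputs to a usable scalar per document is a Thurstone or Bradley-Terry MLE. Given a graph of comparison outcomes, the MLE recovers a continuous latent score per item such that the predicted probability of one item beating another depends only on the difference between their scores: a normal CDF of that difference under Thurstone, a logistic function of it under Bradley-Terry.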
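For reference, the two models written out in their standard textbook form. The symbols are introduced here for illustration: s_i is the latent score of document d_i, E is the set of compared pairs, w_ij is the judge's probability that d_i beats d_j, and the Thurstone noise scale is absorbed into the scores.

```latex
% Link between latent scores and pairwise win probability
P(d_i \succ d_j) = \Phi(s_i - s_j) \quad \text{(Thurstone)}
\qquad
P(d_i \succ d_j) = \frac{1}{1 + e^{-(s_i - s_j)}} \quad \text{(Bradley-Terry)}

% Maximum-likelihood scores from soft pairwise outcomes w_{ij}
\hat{s} = \arg\max_{s} \sum_{(i,j) \in E}
  \Big[ w_{ij} \log P(d_i \succ d_j) + (1 - w_{ij}) \log P(d_j \succ d_i) \Big]
```

Only score differences enter the likelihood, so the scores are determined up to an additive constant; a common convention is to fix their mean to zero.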

Two non-obvious properties make this work in production:

  • Sparse graphs suffice. You do not need all K(K−1)/2 comparisons among K candidates. As long as the comparison graph is connected with bounded diameter, a number of edges linear in the candidate count recovers stable scores. zELO compares every doc to four others, and the recovered scores land within 0.02 of the dense-pairwise fit.
  • The MLE corrects for strength-of-schedule. A naive win-rate (“doc A won 8 of 10 comparisons”) confounds the doc’s quality with the strength of its opponents. The Thurstone MLE explicitly factors that out, which is why it gives stable scores from sparse, irregular graphs where simple averaging would not.
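A minimal sketch of such a fit, using Bradley-Terry (the logistic variant) in place of Thurstone because it needs no normal CDF, and plain gradient ascent in place of a production solver. The function name, step count, and learning rate are assumptions; the learning rate suits small graphs like this one.

```python
import math


def fit_bradley_terry(n_items, comparisons, steps=2000, lr=0.5):
    """Gradient ascent on the Bradley-Terry log-likelihood.

    comparisons: list of (i, j, w) with w = p(item i beats item j).
    Returns zero-mean latent scores, one per item.
    """
    scores = [0.0] * n_items
    for _ in range(steps):
        grad = [0.0] * n_items
        for i, j, w in comparisons:
            predicted = 1.0 / (1.0 + math.exp(-(scores[i] - scores[j])))
            grad[i] += w - predicted  # observed minus predicted win rate
            grad[j] -= w - predicted
        scores = [s + lr * g for s, g in zip(scores, grad)]
        mean = sum(scores) / n_items  # scores are only defined up to a shift
        scores = [s - mean for s in scores]
    return scores


# The six pairs from the four-candidate example (A=0, B=1, C=2, D=3).
comparisons = [(0, 1, 0.78), (0, 2, 0.66), (0, 3, 0.84),
               (1, 2, 0.31), (1, 3, 0.48), (2, 3, 0.71)]
scores = fit_bradley_terry(4, comparisons)
order = sorted(range(4), key=lambda i: scores[i], reverse=True)
print(["ABCD"[i] for i in order])
```

On this dense example the fitted order is A, C, B, D, the same as the row-sum ranking. The two methods can diverge once the graph is sparse and opponents differ in strength, because the gradient compares each outcome against what the current scores predict for that specific opponent.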

The recovered scores become the regression targets for the pointwise student. See Thurstone and Bradley-Terry for the underlying statistics.

When pairwise is fast enough that it ships in production anyway

There is a narrow operating point where the pairwise reranker survives all the way to query time:

  • Cascade top-k re-rerank. A cascade uses a cheap pointwise model to narrow 100 candidates to 10, then a pairwise model on those 10: 45 pairwise calls, well under a millisecond on a small cross-encoder. The pairwise stage handles the hardest tie-breaks at the very top of the result list, where pointwise calibration error matters most.
  • High-stakes domains with small candidate sets. Legal document review, clinical decision support, code search inside a single repository: settings where the candidate set is small and the cost asymmetry between a wrong top-1 and one extra millisecond of latency is enormous.
  • Offline reranking jobs. Batch reranking of an evaluation set, A/B comparison of two candidate models, periodic rescoring of a curated catalog. No query-time SLA, so quadratic cost is fine.
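The cascade case can be sketched as follows. Both scorers are passed in as plain callables with hypothetical signatures, the toy scorers exist only to make the example runnable, and row-sum of wins stands in for a full Thurstone fit.

```python
from itertools import combinations


def cascade_rerank(query, candidates, pointwise_score, pairwise_prob, top_k=10):
    """Narrow with a pointwise scorer, then reorder the shortlist pairwise."""
    shortlist = sorted(
        candidates, key=lambda d: pointwise_score(query, d), reverse=True
    )[:top_k]
    wins = {doc: 0.0 for doc in shortlist}
    for a, b in combinations(shortlist, 2):  # top_k * (top_k - 1) / 2 calls
        p = pairwise_prob(query, a, b)       # p(a beats b)
        wins[a] += p
        wins[b] += 1.0 - p
    return sorted(shortlist, key=lambda d: wins[d], reverse=True)


# Toy scorers: the pointwise model is coarse (ties between neighbours),
# the pairwise model resolves every tie correctly.
calls = []

def toy_pointwise(query, doc):
    return int(doc[3:]) // 2

def toy_pairwise(query, a, b):
    calls.append((a, b))
    return 1.0 if int(a[3:]) > int(b[3:]) else 0.0

candidates = [f"doc{i}" for i in range(100)]
result = cascade_rerank("q", candidates, toy_pointwise, toy_pairwise)
print(result[:3], len(calls))
```

The pairwise stage costs 45 calls here regardless of how many candidates entered the cascade, because only the shortlist size drives the quadratic term.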

Outside these niches, pairwise is the teacher and pointwise is what ships.

The pattern holds for listwise rerankers too: more accurate label source, prohibitively expensive at query time, distilled into a pointwise student. Pairwise sits in the middle of that hierarchy: slower than pointwise, faster than full listwise, and the most economically attractive teacher of the three.

Go further

If pairwise rerankers are more accurate, why isn't there a 'pairwise' option in production reranker APIs?

Because shipping a pairwise model at query time means K(K−1)/2 cross-encoder calls per query for K candidates: 4,950 forwards on 100 candidates instead of 100. That's a 50× cost increase for ~1-2 NDCG points. The economically rational move is to keep pairwise as a teacher inside the training pipeline and ship a pointwise student that inherits its judgments via distillation.
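The arithmetic behind those numbers, as a quick check (the helper name is hypothetical):

```python
def pairwise_calls(k: int) -> int:
    """Number of unordered pairs among k candidates: k(k-1)/2."""
    return k * (k - 1) // 2


for k in (4, 10, 100):
    print(f"k={k}: {pairwise_calls(k)} pairwise calls vs {k} pointwise calls")

ratio = pairwise_calls(100) / 100  # cost multiple over pointwise at k = 100
print(f"cost ratio at k=100: {ratio}x")
```

At 100 candidates the ratio is 49.5, which rounds to the 50× quoted above; at 10 candidates it is only 4.5, which is why the pairwise stage is affordable on a shortlist.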

How do pairwise judgments turn into a single score per document?

Fit a Thurstone or Bradley-Terry model. The MLE recovers a continuous score per item from the graph of (A beats B) outcomes, the same statistical machinery as chess Elo. You don't need dense comparisons; sparse regular graphs, in which every document is compared to the same small number of others (four in zELO), recover scores within ~0.02 of the dense fit at 0.4% of the cost.

Is there a setting where the pairwise reranker actually ships in production?

Yes: top-k re-rerank stages in cascades. After a cheap pointwise model narrows 100 candidates to 10, running pairwise on those 10 is 45 comparisons, well under a millisecond. The pairwise model handles the hardest tie-breaks where pointwise calibration error matters most. Outside small-k cascades, pairwise stays in the training loop.
