If pairwise rerankers are more accurate, why isn't there a 'pairwise' option in production reranker APIs?
Because shipping a pairwise model at query time means
Also known as: pairwise scoring, pairwise judge
A reranker that scores by comparing two candidate documents head-to-head — model(query, doc_A, doc_B) → which is more relevant. More accurate than pointwise (transitivity arbitrage, calibration-free) but
A pairwise reranker takes a query plus two candidate documents and emits a single judgment: which one is more relevant. Its API is model(query, doc_A, doc_B) → p(A > B) — not a score per document, a preference per pair. To actually rerank a list, you run it across many pairs and aggregate.
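As a rough illustration of "run it across many pairs and aggregate", the sketch below ranks a list by summed win probability over all pairs. The `PairwiseJudge` signature, `rerank_by_wins`, and `toy_judge` are hypothetical names introduced here, and the word-overlap judge is only a stand-in for a real model. Summed wins is the simplest aggregate, not necessarily the one a production pipeline would use.

```python
from itertools import combinations
from typing import Callable, Sequence

# Hypothetical interface: judge(query, doc_a, doc_b) returns p(doc_a > doc_b) in [0, 1].
PairwiseJudge = Callable[[str, str, str], float]


def rerank_by_wins(query: str, docs: Sequence[str], judge: PairwiseJudge) -> list[str]:
    """Rank docs by summed pairwise win probability over every unordered pair."""
    wins = [0.0] * len(docs)
    for i, j in combinations(range(len(docs)), 2):
        p = judge(query, docs[i], docs[j])  # probability that docs[i] beats docs[j]
        wins[i] += p
        wins[j] += 1.0 - p
    order = sorted(range(len(docs)), key=lambda k: wins[k], reverse=True)
    return [docs[k] for k in order]


def toy_judge(query: str, doc_a: str, doc_b: str) -> float:
    """Stand-in judge (assumption): prefers the doc sharing more words with the query."""
    q = set(query.lower().split())
    a = len(q & set(doc_a.lower().split()))
    b = len(q & set(doc_b.lower().split()))
    if a == b:
        return 0.5
    return 0.9 if a > b else 0.1
```

Calling `rerank_by_wins(query, docs, toy_judge)` makes one judge call per unordered pair, so the number of calls grows quadratically with the list length.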
This sounds like an awkward shape for production retrieval, and it is. The reason pairwise rerankers matter is that they are the most accurate label source the field has — and most production rerankers are trained against their judgments, even when the production model itself is pointwise.
A pointwise reranker emits one number per (query, doc) pair. A pairwise reranker emits one preference per (query, doc_A, doc_B) triple. The two are interconvertible at the output level — given pointwise scores, you can derive pairwise preferences; given pairwise preferences, you can fit a Thurstone model to recover pointwise scores.
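The score-to-preference direction can be sketched in a few lines. The two link functions below are the standard Bradley-Terry (logistic) and Thurstone Case V (normal CDF, unit variance) forms. The function names are illustrative, and the text does not say which link or scale the pipeline actually uses.

```python
import math


def bradley_terry_preference(score_a: float, score_b: float) -> float:
    """p(A > B) under Bradley-Terry: logistic function of the score difference."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def thurstone_preference(score_a: float, score_b: float) -> float:
    """p(A > B) under Thurstone Case V with unit variances: Phi((a - b) / sqrt(2)).

    Phi(x) = 0.5 * (1 + erf(x / sqrt(2))), so the argument to erf is (a - b) / 2.
    """
    return 0.5 * (1.0 + math.erf((score_a - score_b) / 2.0))
```

Both functions return 0.5 for equal scores and satisfy p(A > B) + p(B > A) = 1. Only the score difference matters, which is why pairwise outputs carry no absolute calibration.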
But they are not interconvertible at the learning level. Training a pairwise judge is much easier than training a calibrated pointwise scorer.
A pairwise reranker has to learn ordering. A pointwise reranker has to learn ordering plus calibration. The first task is dramatically easier.
The honest argument for pairwise rerankers is not that they ship better — it is that the labels they produce are better than any pointwise label you could collect directly.
If you ask three frontier LLMs to score (query, doc) pairs on a 0-10 scale, they will return correlated but inconsistently anchored numbers: one judge’s “7” is another’s “5”, and the same judge drifts across a session. Average them and you get a target that is calibrated to nothing in particular.
If you ask the same three judges “which of A or B is more relevant,” agreement is sharp, and position bias can be neutralized by swap-and-re-ask. The resulting graph of preferences also contains strictly more information than the pointwise scores would have: pointwise scores can be recovered from pairwise preferences via Thurstone, but going the other way loses information about how confident the model was on each comparison.
This is the asymmetry that makes pairwise the natural label source for modern reranker training.
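Swap-and-re-ask can be sketched as follows: query the judge in both presentation orders and average the two estimates of p(A > B). The `PairwiseJudge` signature and `debiased_preference` are hypothetical names for illustration, and simple averaging is one reasonable way to combine the two answers, not necessarily the one used in any specific pipeline.

```python
from typing import Callable

# Hypothetical interface: judge(query, first_doc, second_doc) returns p(first > second).
PairwiseJudge = Callable[[str, str, str], float]


def debiased_preference(query: str, doc_a: str, doc_b: str, judge: PairwiseJudge) -> float:
    """Estimate p(A > B) by asking in both orders and averaging."""
    forward = judge(query, doc_a, doc_b)         # A shown first
    backward = 1.0 - judge(query, doc_b, doc_a)  # B shown first, flipped back to p(A > B)
    return 0.5 * (forward + backward)
```

If the judge adds a constant bonus to whichever document appears first, the bonus enters `forward` with a plus sign and `backward` with a minus sign, so it cancels in the average. The cost is two judge calls per pair instead of one.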
The reason a pairwise reranker matters in practice is not that it is the production model — it almost never is — but that it is the teacher in a distillation pipeline. Concretely:
… (q, d_A, d_B) → p(A > B).
… (query, doc) from the sparse pairwise graph. These are the fitted relevance scores — not annotations, recovered statistically from many head-to-head outcomes.
The pairwise reranker never ships. It exists in the training pipeline, produces millions of judgments, and disappears. What ships is a pointwise cross-encoder that can be served at one forward pass per (query, doc).
This is the architecture behind zerank-1 and zerank-2 — see zELO for the full pipeline.
It can be, and frontier LLMs are the original label source — but the throughput is wrong for training-scale data. A single Claude or GPT pairwise judgment is 10-50ms and a few cents. Generating ten million pairwise labels for a reranker training run is $100K+ in API cost and weeks of wall time.
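The cost figure is straightforward multiplication, sketched below. The per-judgment price of $0.01 is an assumption taken as the low end of "a few cents". The `judges_per_pair` and `swap` parameters are illustrative additions showing how an ensemble or swap-and-re-ask multiplies the number of API calls.

```python
def labeling_cost_usd(
    num_pairs: int,
    usd_per_judgment: float,
    judges_per_pair: int = 1,
    swap: bool = False,
) -> float:
    """Total API cost for labeling num_pairs pairs with LLM judges."""
    calls_per_pair = judges_per_pair * (2 if swap else 1)  # swap doubles the calls
    return num_pairs * calls_per_pair * usd_per_judgment


# Ten million pairs, one judge, assumed $0.01 per judgment: about $100,000.
single_judge = labeling_cost_usd(10_000_000, 0.01)

# Same pairs with a three-judge ensemble and swap-and-re-ask: six calls per pair.
ensemble = labeling_cost_usd(10_000_000, 0.01, judges_per_pair=3, swap=True)
```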
The fix is a two-stage distillation. First, an ensemble of frontier LLMs labels a smaller seed set of pairs (~hundreds of thousands). Then a 4B cross-encoder is trained to mimic the ensemble’s pairwise judgments — binary cross-entropy loss on the ensemble probability
The pairwise SLM is the teacher of the production pointwise model, but it is itself the student of the LLM ensemble. Two distillation stages, each one trading a small accuracy loss for a large throughput gain.
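The first distillation stage trains against soft targets: the ensemble's probability that A beats B, not a hard 0/1 label. A minimal sketch of that loss for a single pair is below, written in plain Python for clarity. The function name is hypothetical, and a real training loop would use a framework's batched, differentiable equivalent.

```python
import math


def soft_bce(student_logit: float, teacher_prob: float) -> float:
    """Binary cross-entropy between sigmoid(student_logit) and a soft target.

    Uses the numerically stable form softplus(z) - t * z, where
    softplus(z) = max(z, 0) + log(1 + exp(-|z|)).
    """
    z, t = student_logit, teacher_prob
    return max(z, 0.0) - t * z + math.log1p(math.exp(-abs(z)))
```

With a soft target the loss is minimized when the student's probability equals the teacher's, at which point the loss equals the entropy of the target and not zero. For `teacher_prob = 0.8` the minimum sits at `student_logit = log(4)`.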
The bridge from pairwise outputs to a usable scalar per document is a Thurstone or Bradley-Terry MLE. Given a graph of comparison outcomes, the MLE recovers a continuous latent score per item such that
Two non-obvious properties make this work in production:
The recovered scores become the regression targets for the pointwise student. See Elo and Thurstone for the underlying statistics.
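To make the fitting step concrete, here is a minimal sketch of a Bradley-Terry MLE using the classic minorization-maximization update. It assumes hard (winner, loser) outcomes and that every item has won at least once. The function name and input format are illustrative. The text does not specify the actual fitting procedure, which may use Thurstone, soft probabilities, or a different optimizer.

```python
import math
from collections import defaultdict


def fit_bradley_terry(outcomes, n_items, iters=200):
    """Fit Bradley-Terry log-strengths from (winner, loser) index pairs.

    Returns one score per item, centered to mean zero. The MLE is finite only
    if the comparison graph is strongly connected; this sketch checks just the
    weaker condition that every item has at least one win.
    """
    wins = [0.0] * n_items
    games = defaultdict(float)  # unordered pair -> number of comparisons
    for w, l in outcomes:
        wins[w] += 1.0
        games[(min(w, l), max(w, l))] += 1.0
    if any(x == 0.0 for x in wins):
        raise ValueError("every item needs at least one win for this sketch")

    p = [1.0] * n_items  # strengths; score = log(strength)
    for _ in range(iters):
        denom = [0.0] * n_items
        for (i, j), n in games.items():
            d = n / (p[i] + p[j])
            denom[i] += d
            denom[j] += d
        p = [wins[i] / denom[i] for i in range(n_items)]
        # Strengths are defined only up to scale, so fix the geometric mean at 1.
        log_mean = sum(math.log(x) for x in p) / n_items
        p = [x / math.exp(log_mean) for x in p]
    return [math.log(x) for x in p]
```

In the example below, items 0 and 2 are never compared directly, yet the fit still places them on a common scale through their shared opponent, item 1:

```python
outcomes = [(0, 1)] * 3 + [(1, 0)] + [(1, 2)] * 3 + [(2, 1)]
scores = fit_bradley_terry(outcomes, n_items=3)
```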
There is a narrow operating point where the pairwise reranker survives all the way to query time:
Outside these niches, pairwise is the teacher and pointwise is what ships.
The pattern holds for listwise rerankers too: more accurate label source, prohibitively expensive at query time, distilled into a pointwise student. Pairwise sits in the middle of that hierarchy — slower than pointwise, faster than full listwise, and the most economically attractive teacher of the three.
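The query-time cost gap between pointwise and exhaustive pairwise scoring is easy to quantify: pointwise needs one forward pass per candidate, while all-pairs comparison needs one per unordered pair. The helper below is a hypothetical illustration that assumes every pair is compared. A sparse sampling scheme would need fewer calls.

```python
import math


def pairwise_calls(n_candidates: int, swap: bool = False) -> int:
    """Number of judge calls to compare every unordered pair of candidates."""
    pairs = math.comb(n_candidates, 2)  # n * (n - 1) / 2
    return 2 * pairs if swap else pairs
```

For 10 candidates this gives 45 comparisons. For 100 candidates it gives 4,950, against 100 forward passes for a pointwise model.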
Fit a Thurstone or Bradley-Terry model. The MLE recovers a continuous score per item from the graph of (A beats B) outcomes — same statistical machinery as chess Elo. You don't need dense
Yes — top-k re-rerank stages in cascades. After a cheap pointwise model narrows 100 candidates to 10, running pairwise on those 10 is 45 comparisons, well under a millisecond. The pairwise model handles the hardest tie-breaks where pointwise calibration error matters most. Outside small-