Cascade Rerankers

Also known as: multi-stage reranker, tiered reranking, ranking cascade

TL;DR

A cascade reranker stacks multiple rerankers from cheap-and-fast to expensive-and-accurate, with each stage filtering candidates before passing a smaller set to the next.

A cascade reranker is a pipeline of two or more rerankers arranged from cheapest to most expensive, where each stage prunes the candidate set before the next. The economic logic: a cross-encoder is too slow on 1000 candidates per query but plenty fast on 100, and a bi-encoder rerank can cut the set from 1000 to 100 cheaply. Stack them and you get cross-encoder precision at a fraction of the cost.

[Figure: Cascade rerankers, cheap filters first, expensive picks last. 1000 first-pass candidates enter the cascade. Stage 1 (cheap): bi-encoder rerank, 0.05 ms/pair, 50 ms, 1000 in, 100 out. Stage 2 (mid): small cross-encoder, 2 ms/pair, 200 ms, 100 in, 20 out. Stage 3 (expensive): large cross-encoder or LLM, 50 ms/pair, 1.0 s, 20 in, 5 out. Output: top-5 final ranking. Latency budget: 50 ms + 200 ms + 1.0 s ≈ 1.25 s.]

The classic two-stage cascade

The default production shape is a first-pass retriever (such as BM25) returning ~500 candidates, followed by a cross-encoder reranker reducing them to ~20-50. This already counts as a cascade: the first pass plays the role of the cheap stage, and the cross-encoder is the precision stage. It’s how essentially every modern RAG pipeline ships.

Adding a third stage (a bi-encoder rerank between first-pass and cross-encoder) is worth it when first-pass surfaces 1000+ candidates and the cross-encoder budget tops out around 100. The bi-encoder rerank is fast, on the order of milliseconds per 1000 candidates, and a well-trained one preserves nearly all the relevant documents.
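The staged pruning described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: `run_cascade`, `cheap_overlap`, and `pricey_phrase` are hypothetical names, and the two toy scorers merely stand in for a real bi-encoder and cross-encoder.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer maps (query, document) to a relevance score; higher is better.
Scorer = Callable[[str, str], float]


def run_cascade(query: str, candidates: Sequence[str],
                stages: Sequence[Tuple[Scorer, int]]) -> List[str]:
    """Apply each (scorer, keep) stage in order, keeping the top `keep` docs."""
    current = list(candidates)
    for scorer, keep in stages:
        # Score only the survivors of the previous stage, then prune.
        ranked = sorted(current, key=lambda doc: scorer(query, doc), reverse=True)
        current = ranked[:keep]
    return current


# Toy stand-ins for real models (hypothetical; swap in actual rerankers).
def cheap_overlap(query: str, doc: str) -> float:
    """Cheap stage: count of shared whitespace-separated tokens."""
    return float(len(set(query.split()) & set(doc.split())))


def pricey_phrase(query: str, doc: str) -> float:
    """Precision stage: token overlap plus a bonus for an exact phrase match."""
    return cheap_overlap(query, doc) + (2.0 if query in doc else 0.0)


docs = [
    "reranker cascade design notes",
    "how to bake bread",
    "a cascade reranker prunes candidates",
    "reranker latency tips",
]
# Stage 1 keeps 3 of 4 candidates; stage 2 keeps the single best survivor.
top = run_cascade("cascade reranker", docs, [(cheap_overlap, 3), (pricey_phrase, 1)])
print(top)  # ['a cascade reranker prunes candidates']
```

The expensive scorer only ever sees the three survivors of the cheap stage, which is the whole point of the design: its cost is bounded by the upstream cut-off rather than by the size of the first-pass set.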

Why this beats one big model

You could conceptually run one giant reranker on every candidate. The problem is that cross-encoder latency scales linearly in candidate count. At 50ms per pair, 500 pairs is 25 seconds — unshippable. The cascade exploits the fact that most candidates are obvious negatives that a cheap model can filter, and only the hard cases need the expensive model’s discrimination.

Designing the cut-offs

Three forces: upstream recall (you want it high), downstream throughput (capacity ceiling), and downstream marginal accuracy (does seeing more candidates actually improve final ranking?).

Procedure: take a labeled eval set of 100-500 queries. For each candidate count k, measure recall@k of stage 1. Pick the smallest k where recall stops rising. That’s your stage-1 cut-off. Repeat for stage 2, conditional on stage 1’s output. The downstream cross-encoder typically extracts most of its accuracy from the top-50; beyond that, returns diminish.
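One way to sketch that procedure, assuming you have each query's stage-1 ranking and a set of labeled relevant documents. The function names and the `tol` threshold are assumptions made for illustration: "recall stops rising" is interpreted here as "within `tol` of the best recall observed over the tested values of k".

```python
from typing import Dict, List, Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs: Dict[str, List[str]],
                     labels: Dict[str, Set[str]], k: int) -> float:
    """Average recall@k over every labeled query."""
    return sum(recall_at_k(runs[q], labels[q], k) for q in labels) / len(labels)


def smallest_plateau_k(runs: Dict[str, List[str]], labels: Dict[str, Set[str]],
                       ks: Sequence[int], tol: float = 0.01) -> int:
    """Smallest k whose mean recall is within `tol` of the best over `ks`."""
    recalls = {k: mean_recall_at_k(runs, labels, k) for k in ks}
    best = max(recalls.values())
    return min(k for k in ks if recalls[k] >= best - tol)


# Tiny made-up eval set: stage-1 rankings and relevance labels per query.
runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d9", "d5", "d6", "d7"]}
labels = {"q1": {"d1", "d3"}, "q2": {"d5"}}

cutoff = smallest_plateau_k(runs, labels, ks=[1, 2, 3, 4])
print(cutoff)  # 3: mean recall is 0.25, 0.75, 1.0, 1.0 for k = 1..4
```

On a real eval set the curve rarely flattens exactly, so the tolerance does the work of deciding what counts as a plateau; a looser `tol` trades a little recall for a smaller, cheaper cut-off.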

Common cut-offs we see in production:

  • BM25 → 1000 → bi-encoder rerank → 100 → cross-encoder → 20 → LLM
  • Hybrid → 500 → cross-encoder → 30 → final
  • Dense → 200 → cross-encoder → 10 → final (cheapest stack)

Latency and cost shape

Examples
  • A 500-candidate cross-encoder rerank at 50ms/pair: 25s total. Unshippable.
  • Same 500 candidates filtered to 50 by a 1ms/pair bi-encoder, then cross-encoded: 0.5s + 2.5s = 3s. Acceptable.
  • 1000-candidate first-pass → 100 by bi-encoder rerank → 20 by cross-encoder → 5 by LLM-as-judge. Final stage sees only 5 candidates; LLM cost is bounded.
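  • Tail-latency note: a cascade runs N stages in sequence, so per-stage latencies add, and the chance that at least one stage lands in its slow tail grows with N. Budget for end-to-end P99, not just the mean.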
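The arithmetic in these examples is just candidates-in times per-pair cost, summed over sequential stages. A small sketch, with hypothetical function names, reproduces the first two examples. The tail estimate additionally assumes stage latencies are independent, which real systems only approximate.

```python
from typing import Sequence, Tuple


def cascade_latency_ms(stages: Sequence[Tuple[int, float]]) -> float:
    """Sum of (candidates_in * ms_per_pair) over sequential stages."""
    return sum(n_in * ms_per_pair for n_in, ms_per_pair in stages)


def p_any_stage_in_tail(n_stages: int, per_stage_quantile: float = 0.99) -> float:
    """Chance that at least one stage exceeds its own quantile latency.

    Assumes independent stages (an illustrative simplification).
    """
    return 1.0 - per_stage_quantile ** n_stages


single_stage = cascade_latency_ms([(500, 50.0)])          # cross-encoder on all 500
two_stage = cascade_latency_ms([(500, 1.0), (50, 50.0)])  # bi-encoder, then cross-encoder
print(single_stage / 1000, two_stage / 1000)  # 25.0 3.0 (seconds)

# With three independent stages, roughly 3% of requests hit some stage's P99.
print(round(p_any_stage_in_tail(3), 4))  # 0.0297
```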

Cascades and score calibration

When stages are independent models, their score scales aren’t comparable. Fusing them via reciprocal rank fusion sidesteps this because it uses only ranks; fusing via weighted sum requires score calibration on each stage. Models like zerank-2 emit calibrated probabilities specifically so cascades downstream of them compose cleanly.
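To illustrate why rank-based fusion is insensitive to score scale, here is a minimal reciprocal rank fusion sketch. The function name is hypothetical, and the smoothing constant `k = 60` is a commonly used default adopted here as an assumption, not a value taken from this article.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists using only ranks: score(d) = sum of 1 / (k + rank)."""
    fused: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)


# Two stages that disagree; their raw scores never enter the computation.
stage_a = ["d1", "d2", "d3"]
stage_b = ["d2", "d3", "d1"]
print(reciprocal_rank_fusion([stage_a, stage_b]))  # ['d2', 'd1', 'd3']
```

Because only positions are used, one stage emitting logits and another emitting probabilities makes no difference to the result. A weighted sum of raw scores has no such protection, which is where calibration becomes necessary.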

The other consequence: instruction-following rerankers are usually placed at the precision stage of the cascade, where the smaller candidate set means each instruction-conditioned forward pass is affordable. Putting an instruction-following model at stage 1 burns budget on candidates that any reranker would reject.

What to ship

For most production stacks: first-pass + a single calibrated cross-encoder. Add a bi-encoder mid-stage only when first-pass recall is wide (1000+) and the cross-encoder’s per-pair cost is high. Skip three stages if two get you there — every stage is one more recall risk and one more thing to monitor.

Go further

How many stages should a cascade have?

Two is the common shape — first-pass to ~500, cross-encoder to ~50, LLM picks final. Three stages add a cheap bi-encoder rerank (BM25 to 1000, bi-encoder to 200, cross-encoder to 30) when first-pass recall is wide. Four-plus rarely pays off — each stage adds latency and a recall risk.

Where do you set the cut-off between stages?

Tune by measuring recall@k of the upstream stage and accuracy@k of the downstream stage on your eval set. Pick the smallest upstream k where recall plateaus, then verify the downstream model has capacity to rerank that many candidates within latency budget. Typical: 1000 to 100 to 10.

Does score calibration matter in a cascade?

Yes, if you fuse raw scores across stages (e.g., a weighted sum): uncalibrated stage outputs make the fusion arbitrary. Rank-based fusion such as reciprocal rank fusion sidesteps the problem. Calibrated rerankers like zerank-2 produce probability-shaped scores that compose cleanly across stages.
