Pattern · VIII 12 min read 8 sections 10 code samples Updated May 17, 2026
This pattern is called

Reranker on the Request Path

A cross-encoder reranker's joint attention burns 300 ms or more on the user-visible request; the fix is structural — cascade, batch, hide behind LLM prefill — not model substitution.

Symptom

The reranker is doing its job. NDCG@10 rises when it’s on and falls when it’s off. The offline evals are healthy. The p95 of the user-visible request is not.

“The reranker is accurate but it’s eating our latency budget.”

The pattern, in production:

  • Request p95 sits in the 600–900 ms band. The reranker accounts for 40–60% of it. First-pass retrieval is sub-50 ms on a hot index. The LLM call, if present, is the other dominant chunk.
  • p50 fits the budget; p95 and p99 don’t. Long-tail queries with K=100 candidates run a longer reranker pass than median.
  • Reranker calls are serial or fixed-small-batch — batches of 1, 4, or 8 when they should be K.
  • A smaller cross-encoder has already been tried. Accuracy collapsed faster than latency moved.
  • The pipeline runs retrieve → rerank → prompt sequentially because that is how the tutorial wrote it. Concurrency with LLM prefill has not been considered.

The reranker is the right tool. The placement is the wrong shape for the latency budget.

Mechanism

A cross-encoder reranker reads query and candidate jointly, attending across both. That joint attention is the source of the accuracy and the source of the latency. Below a certain capacity the joint-attention signal degrades faster than the FLOPs go down, so model substitution is not the lever. The accuracy/latency frontier of cross-encoders is largely fixed once serving framework and precision are picked.

The wins live in where the work happens, not in the model itself:

  • How many documents the reranker scores.
  • Whether calls are batched or serial.
  • Whether the work runs concurrently with something the request is already doing.
  • Whether a cheap reranker can pre-filter for an expensive one.

This is the reranker-specific instance of a broader truth: a specialist on the request path is constrained by placement at least as much as by accuracy.

A representative latency budget. A search request with a 700 ms p95 target typically decomposes as: first-pass retrieval (BM25 plus embedding recall) on K=100 lands at 30–50 ms; a large cross-encoder reranker on K=100 at batch=8 lands at 300–400 ms; result assembly and network adds 40–80 ms; an LLM summary call adds 150–300 ms. The reranker is half the budget.
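The decomposition above can be checked with a few lines of arithmetic. This is a minimal sketch that sums the midpoints of the stated ranges and reports each stage's share. The stage labels and the use of midpoints are assumptions made for illustration, and per-stage percentiles do not strictly add up to the request percentile, so read the result as a rough share.

```python
# Stage latency ranges in ms, copied from the representative budget above.
# The stage names are illustrative labels, not fields from a real trace schema.
BUDGET_MS = {
    "first_pass_retrieval": (30, 50),
    "reranker": (300, 400),
    "assembly_and_network": (40, 80),
    "llm_summary": (150, 300),
}

def midpoint(lo_hi):
    lo, hi = lo_hi
    return (lo + hi) / 2

def stage_shares(budget):
    """Return ({stage: share of summed midpoints}, summed midpoints in ms)."""
    mids = {stage: midpoint(r) for stage, r in budget.items()}
    total = sum(mids.values())
    return {stage: ms / total for stage, ms in mids.items()}, total

shares, total_ms = stage_shares(BUDGET_MS)
for stage, share in shares.items():
    print(f"{stage:<22} {share:.0%}")
print(f"summed midpoints: {total_ms:.0f} ms")
```

On these midpoints the reranker comes out at a little over half of a 675 ms total, which is the "half the budget" reading.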

Half of that reranker time is amortized model-load and tokenization overhead paid per batch. At K=100 with batch=8, that is 13 round-trips. Collapsing to a single full-batch call of size 100 collapses the overhead term to one round-trip and cuts total reranker time by 50–65%. No model change. No accuracy loss. Latency budget recovered.

That is lever 3 below. Most teams have not run it.

Three places the latency hides:

  • K-overspend. K was set once, never re-checked. Production is paying 2x the cross-encoder work the eval set needs.
  • Batching overhead. Per-call cost (tokenization, transfer, kernel launch) dominates at small batch sizes. Serial calls multiply this cost.
  • Serial pipelining. Retrieve → rerank → LLM runs sequentially when reranker and LLM prefill could overlap, saving 100–200 ms with no model change.

The reranker does not need to leave the request path. It needs to stop being the path.

Diagnostic

Four tests, ordered cheap-to-expensive. Run in order; stop at the first one that fires.

Test 1 — reranker fraction of request p95 (under 1 minute, one trace dashboard query)

The first question is whether this pattern is present at all. Pull last week’s request traces and compute the reranker’s share of total request time at p95:

SELECT
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY total_ms)        AS p95_total,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY reranker_ms)     AS p95_reranker,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY reranker_ms)
    / PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY total_ms)    AS fraction
FROM request_traces
WHERE created_at > NOW() - INTERVAL '7 days';
fraction | reading
under 0.20 | reranker is not the dominant cost; look elsewhere
0.20–0.40 | suspect; the lever is here but other costs matter
0.40+ | confirmed; this pattern is the dominant cost

One number, no judgment required. Catches roughly all real instances of this pattern. A fraction under 0.20 with p95 still over budget is a neighboring pattern — see the disambiguation table.

Test 2 — K-sensitivity sweep (5–15 minutes, small Python)

The directional test: for the eval set, find the smallest K (reranker input size) at which the eventually-correct top-10 is fully covered. If K=25 covers 90% of queries as well as K=100 does, production is paying 4x more cross-encoder work than needed.

import numpy as np
from collections import Counter

np.random.seed(42)

# Inline fixture — replace with your real loaders.
# A "query" is (text, gold_top10_ids). A "candidate" is (id, text).
# We simulate rerank() with deterministic token-overlap scoring so the test
# is reproducible. Replace with your production reranker call when you run.

CORPUS = {f"d{i:03d}": f"document about topic {i % 12} with terms {i % 7} {i % 13}"
          for i in range(500)}

QUERIES = [
    {"text": f"topic {t} term {t % 7}",
     "gold_top10_ids": [f"d{(t + 12*j):03d}" for j in range(10)]}
    for t in range(40)
]

def first_pass_retrieve(query_text, k=100):
    # Stand-in for BM25 + embedding recall: rank by token-overlap.
    qt = set(query_text.split())
    scored = [(did, len(qt & set(dt.split()))) for did, dt in CORPUS.items()]
    scored.sort(key=lambda x: -x[1])
    return [did for did, _ in scored[:k]]

def rerank(query_text, candidate_ids, model="large"):
    # Mock cross-encoder: token-overlap times a model-dependent precision.
    weight = {"small": 1.0, "large": 1.5}[model]
    qt = set(query_text.split())
    scored = [(cid, weight * len(qt & set(CORPUS[cid].split())) + np.random.rand() * 0.1)
              for cid in candidate_ids]
    scored.sort(key=lambda x: -x[1])
    return [cid for cid, _ in scored]

def sufficient_k(queries, first_pass_n=100, ks=(10, 25, 50, 75, 100)):
    out = []
    for q in queries:
        fp = first_pass_retrieve(q["text"], k=first_pass_n)
        gold = set(q["gold_top10_ids"])
        for k in ks:
            top10 = set(rerank(q["text"], fp[:k], model="large")[:10])
            if len(gold & top10) >= 8:  # 80% gold coverage in top-10
                out.append(k); break
        else:
            out.append(np.inf)  # no K on the grid reached 80% coverage; do not count as covered
    return np.array(out)

ks_needed = sufficient_k(QUERIES)
print("K-sensitivity CDF:")
for k in (10, 25, 50, 75, 100):
    pct = (ks_needed <= k).mean()
    print(f"  K={k:>3}: {pct:.0%} of queries covered")
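# Illustrative output shape only. The mock fixture is not tuned to reproduce
# these percentages, so expect different numbers when you run it:
#   K= 10: 60% of queries covered
#   K= 25: 85% of queries covered
#   K= 50: 95% of queries covered
#   K= 75: 98% of queries covered
#   K=100: 100% of queries covered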

Typical readings: 75–90% of queries covered at K=25, 90–97% at K=50. Running K=100 means paying for the last 3–10% of queries on every call. Halving K halves reranker latency in the common case; the small tail of harder queries can be detected and surfaced through a fallback path — see lever 1.
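One way to turn the coverage readings into a K choice is to pick the smallest K on the grid that reaches a target coverage. The sketch below assumes a per-query array like the `ks_needed` produced by Test 2, with `np.inf` marking queries that no K on the grid covers. The function name, the 0.95 default target, and the example data are hypothetical.

```python
import numpy as np

def smallest_k_for_coverage(ks_needed, ks=(10, 25, 50, 75, 100), target=0.95):
    """Smallest K on the grid at which at least `target` of queries are covered.

    `ks_needed` holds, per query, the smallest K that sufficed (np.inf if
    none did). Returns None when no K on the grid reaches the target.
    """
    ks_needed = np.asarray(ks_needed, dtype=float)
    for k in sorted(ks):
        if (ks_needed <= k).mean() >= target:
            return k
    return None

# Hypothetical per-query results for 40 queries, for illustration only.
example = [10] * 24 + [25] * 10 + [50] * 5 + [100] * 1
k95 = smallest_k_for_coverage(example, target=0.95)
k80 = smallest_k_for_coverage(example, target=0.80)
print(f"K for 95% coverage: {k95}, K for 80% coverage: {k80}")
```

With this example data the 95% point sits at K=50, so a production K of 100 would be 2x the 95% point.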

Test 3 — K-sensitivity sweep + batched-vs-serial trace (30+ minutes, deep dive)

The decisive test. Combine the K-sensitivity from Test 2 with a batched-vs-serial latency trace, so the combined effect of the two largest levers reads off a single experiment. Self-contained fixture; swap the mock rerank_call for the production reranker endpoint when you run.

import numpy as np
import time
from collections import Counter

np.random.seed(42)

# --- Mock reranker endpoint with realistic overhead profile ---
# Per-batch overhead (tokenizer + network + kernel launch): 15 ms
# Per-candidate joint-attention cost: 2.5 ms (large), 0.6 ms (small)

def rerank_call(query_text, candidate_ids, model="large", batch_size=8):
    """Returns reranked candidate_ids and total wall-time in ms."""
    per_batch_overhead_ms = 15.0
    per_doc_ms = {"small": 0.6, "large": 2.5}[model]
    n = len(candidate_ids)
    n_batches = (n + batch_size - 1) // batch_size
    total_ms = n_batches * per_batch_overhead_ms + n * per_doc_ms
    # Simulate by sleeping a tiny fraction; we return the modeled number.
    time.sleep(total_ms / 1000.0 / 50)  # 50x speedup for test runtime
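    # Mock ordering by score. Note: hash() of a str is salted per process unless
    # PYTHONHASHSEED is set, so the order is stable within a run, not across runs.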
    qt = set(query_text.split())
    CORPUS = {cid: f"doc {hash(cid) % 12} term {hash(cid) % 7}" for cid in candidate_ids}
    scored = sorted(candidate_ids,
                    key=lambda c: -(len(qt & set(CORPUS[c].split())) + np.random.rand() * 0.1))
    return scored, total_ms

# --- K-sensitivity x batching matrix ---
QUERY = "topic 3 term 3"
CANDIDATES = [f"d{i:03d}" for i in range(100)]

print(f"{'K':>5} {'bs=1':>10} {'bs=8':>10} {'bs=K':>10}")
for k in (25, 50, 75, 100):
    cands = CANDIDATES[:k]
    _, t_serial = rerank_call(QUERY, cands, batch_size=1)
    _, t_batch8 = rerank_call(QUERY, cands, batch_size=8)
    _, t_full   = rerank_call(QUERY, cands, batch_size=k)
    print(f"{k:>5} {t_serial:>10.1f} {t_batch8:>10.1f} {t_full:>10.1f}")
# Expected output:
#     K       bs=1       bs=8       bs=K
#    25      437.5      122.5       77.5
#    50      875.0      230.0      140.0
#    75     1312.5      337.5      202.5
#   100     1750.0      445.0      265.0

Read the table both directions. Left-to-right, under this mock's overhead profile: a single full-batch call is about 1.6–1.7x faster than batch=8 calls and roughly 6x faster than batch=1. Top-to-bottom: K=50 batched is roughly half the latency of K=100 batched. Combined: (K=50, bs=K) lands near 140 ms; (K=100, bs=8), the common starting state, lands near 445 ms; (K=100, bs=1), the pathological starting state when one HTTP call fires per candidate, lands near 1750 ms.

Confirmation is when the measured p95 reranker time from Test 1 matches one of the cells. A measured 380 ms at K=100 batch=8 puts production at the (100, bs=8) cell, and the (50, bs=K) cell is a reachable 140 ms with no model change.
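Matching a measurement to a cell can be automated once K is known from config. The sketch below reuses the mock's overhead model (15 ms per batch, 2.5 ms per candidate, both assumptions taken from the fixture, not measured values) and reports which batching regime a measured reranker p95 sits closest to. The function names are hypothetical.

```python
import math

PER_BATCH_OVERHEAD_MS = 15.0   # assumed, from the mock endpoint above
PER_DOC_MS = 2.5               # assumed, "large" model in the mock above

def modeled_ms(k, batch_size):
    """Modeled reranker wall time: per-batch overhead plus per-candidate cost."""
    n_batches = math.ceil(k / batch_size)
    return n_batches * PER_BATCH_OVERHEAD_MS + k * PER_DOC_MS

def nearest_batch_regime(measured_ms, k, batch_sizes=(1, 8, None)):
    """Return (batch_size, modeled_ms) closest to the measurement at a known K.

    batch_size=None stands for one full batch (batch_size == K).
    """
    best = None
    for bs in batch_sizes:
        ms = modeled_ms(k, k if bs is None else bs)
        if best is None or abs(ms - measured_ms) < abs(best[1] - measured_ms):
            best = (bs, ms)
    return best

regime = nearest_batch_regime(380.0, k=100)
target = modeled_ms(50, 50)
print(f"closest regime at K=100: batch_size={regime[0]} ({regime[1]:.0f} ms modeled)")
print(f"projected (K=50, bs=K): {target:.0f} ms")
```

Calibrate the two constants against your own endpoint before trusting the projection; the shape of the model matters more than these particular values.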

Test 4 — reranker p95 fraction as a monitorable scalar

The three-test diagnostic is for manual investigation. Ongoing monitoring wants a single scalar to plot and alert on. The reranker’s share of request p95 is the right shape:

# Sketch: wire to your real trace store.
# reranker_fraction_p95 = p95(reranker_ms) / p95(total_ms)
import numpy as np

def reranker_fraction_p95(traces):
    total = np.array([t["total_ms"] for t in traces])
    rr    = np.array([t["reranker_ms"] for t in traces])
    return float(np.percentile(rr, 95) / np.percentile(total, 95))

Healthy: under 0.25. Alert: above 0.35. Five-alarm: above 0.45. Plot weekly; alarm on threshold crossings. A fraction that climbs without a model change is K-drift (someone increased K for “more accuracy”) or batching regression (a serving framework upgrade reset the batch policy).
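The thresholds above map directly onto an alerting function. This is a minimal sketch: the level names mirror the text, the "watch" label for the band between 0.25 and 0.35 is an assumption (the text does not name it), and the weekly series is made-up example data.

```python
def alert_level(fraction, healthy_below=0.25, alert_above=0.35, five_alarm_above=0.45):
    """Map the reranker's share of request p95 to an alert level."""
    if fraction > five_alarm_above:
        return "five-alarm"
    if fraction > alert_above:
        return "alert"
    if fraction < healthy_below:
        return "healthy"
    return "watch"  # assumed label for the unnamed band between healthy and alert

def threshold_crossings(weekly_fractions, threshold=0.35):
    """Indices of weeks where the fraction moves from at-or-below to above threshold."""
    return [
        i for i in range(1, len(weekly_fractions))
        if weekly_fractions[i - 1] <= threshold < weekly_fractions[i]
    ]

# Hypothetical weekly readings showing a slow climb, as with K-drift.
weekly = [0.22, 0.24, 0.31, 0.37, 0.41, 0.47]
levels = [alert_level(f) for f in weekly]
crossings = threshold_crossings(weekly)
print(levels)
print("alert threshold first crossed at week index:", crossings)
```

Alarming on crossings instead of on every reading above the threshold keeps a sustained regression from paging every week.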

Worked example end-to-end

A representative case. Production runs Test 1 and gets a fraction around 0.50–0.55 — the reranker is 350–400 ms of a roughly 700 ms p95. Test 2 returns coverage on the order of 85% at K=25 and 95% at K=50, with K=100 running in production. Test 3 traces the endpoint at K=100 batch=8 and lands near 380 ms; the K=50 + full-batch cell projects to 130–150 ms.

Two levers ship in one PR: halve K to 50 (lever 1), switch batch=8 to batch=K (lever 3). Re-measure: reranker p95 drops by 60–70%, into the 120–140 ms range. Request p95 drops to 450–500 ms. The PM stops mentioning it.

A later iteration adds lever 4 (hide behind LLM prefill) and recovers another 70–110 ms on long-tail queries. Request p95 lands in the 350–400 ms range. The reranker model never changed.

Levers 1 and 3 typically deliver 60–80% of the win in a single afternoon. The rest of this playbook covers the remaining 20–40%.

Treatment

Six levers, ordered by latency-cut per engineering hour. Model substitution is last because the frontier is mostly fixed.

1. Halve top-K

Halve K from 100 to 50 and confirm against the K-sensitivity table that no meaningful query population has been dropped. Latency is roughly linear in K, so halving K roughly halves reranker time, and the only cost is an eval run.

candidates = first_pass_retrieve(query, k=50)    # K halved from 100
reranked = rerank(query, candidates, model="large")[:10]

Why this works. Cross-encoder cost is linear in K. The first-pass retriever’s recall curve is sublinear: doubling K from 50 to 100 catches a small fraction of additional gold documents because the retriever’s ordering is already decent. That is a single-digit-percentage recall gain for 2x reranker latency.

Tradeoff. Queries whose gold document sat at rank 51–100 in the first pass degrade. Surface this: log queries where the top reranker score falls below a usefulness threshold and route them through a fallback (K=100 reranker or a higher-recall first pass) asynchronously, so the result improves on a retry without blocking the initial response.
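The fallback routing described in the tradeoff can be sketched as follows. Everything named here is hypothetical: the threshold value, the in-process queue standing in for a real job queue, and the caller-supplied `retrieve` and `rerank_scored` functions. The toy stand-ins exist only so the sketch runs on its own.

```python
import queue

USEFULNESS_THRESHOLD = 0.30       # hypothetical; tune on your own score distribution
fallback_queue = queue.Queue()    # stand-in for a real asynchronous job queue

def serve_with_fallback(query, retrieve, rerank_scored, k=50, fallback_k=100):
    """Serve the K=50 result immediately; queue weak results for a wider re-run.

    `retrieve(query, k)` returns candidate ids. `rerank_scored(query, ids)`
    returns (id, score) pairs sorted best-first.
    """
    scored = rerank_scored(query, retrieve(query, k))
    top_score = scored[0][1] if scored else float("-inf")
    if top_score < USEFULNESS_THRESHOLD:
        # Do not block the response: record the query for a wider pass later.
        fallback_queue.put({"query": query, "k": fallback_k, "top_score": top_score})
    return [doc_id for doc_id, _ in scored[:10]]

# Toy stand-ins so the sketch is self-contained.
def toy_retrieve(query, k):
    return [f"d{i:03d}" for i in range(k)]

def toy_rerank_scored(query, ids):
    base = 0.9 if "easy" in query else 0.2
    pairs = [(doc_id, base - 0.001 * i) for i, doc_id in enumerate(ids)]
    return sorted(pairs, key=lambda pair: -pair[1])

easy = serve_with_fallback("easy query", toy_retrieve, toy_rerank_scored)
hard = serve_with_fallback("hard query", toy_retrieve, toy_rerank_scored)
print(len(easy), len(hard), "queued for fallback:", fallback_queue.qsize())
```

Both queries get an immediate top-10; only the low-scoring one is queued for the K=100 pass.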

2. Cascade — cheap reranker, then expensive

A small reranker scores K=100 down to K=20; a large reranker scores K=20 down to the final K=10. End-to-end latency is closer to the small model’s; end-to-end accuracy is closer to the large model’s.

def cascade(query, candidates):
    small = rerank(query, candidates, model="small")[:20]
    return rerank(query, small, model="large")[:10]

Why this works. The small reranker is cheap enough to run on the full K=100, where its job is triage, not final scoring — it only needs to be accurate enough to not throw out gold documents. The large reranker then operates on a small candidate set where its accuracy is load-bearing.

Tradeoff. Validate that the cascade’s top-10 agrees with the large-only top-10 on at least 95% of queries. If it does not, the small stage is too lossy — widen the cut-off to K=30 instead of K=20. The small model now sits in the dependency chain; if its serving regresses, the cascade regresses. This lever introduces the failure mode of cascade saturation — read that pattern before deploying.
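The 95% agreement check can be sketched as a small offline comparison. Agreement here is assumed to mean that the two top-10 lists contain the same ids regardless of order; the text does not define it more precisely, so `min_overlap` is exposed as a knob. The function names and example data are hypothetical.

```python
def cascade_agreement(cascade_top10s, large_top10s, min_overlap=10):
    """Fraction of queries where the two top-10 lists share >= min_overlap ids."""
    if not large_top10s or len(cascade_top10s) != len(large_top10s):
        raise ValueError("need one non-empty top-10 list per query from each pipeline")
    agree = sum(
        len(set(casc) & set(big)) >= min_overlap
        for casc, big in zip(cascade_top10s, large_top10s)
    )
    return agree / len(large_top10s)

def cutoff_is_safe(cascade_top10s, large_top10s, required=0.95):
    """True when the cascade matches the large-only pipeline often enough."""
    return cascade_agreement(cascade_top10s, large_top10s) >= required

# Hypothetical eval: 40 queries, the small stage drops one document on one query.
large = [[f"q{q}-d{i}" for i in range(10)] for q in range(40)]
cascade = [list(ids) for ids in large]
cascade[0][9] = "q0-d99"
agreement = cascade_agreement(cascade, large)
print(f"agreement: {agreement:.3f}, safe at 20-document cut-off: {cutoff_is_safe(cascade, large)}")
```

If the check fails, rerun it with the small stage widened to 30 candidates before touching anything else.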

3. Batch — single forward pass over K candidates

A single batched forward pass on K candidates is 4–10x faster than K serial calls because per-batch overhead (tokenization, kernel launch, network round-trip) amortizes over the full batch.

# Before: N HTTP calls
scores = [rerank_one(query, c) for c in candidates]

# After: 1 HTTP call, batch=K
scores = rerank_batch(query, candidates)

Why this works. Cross-encoder kernels are GPU-bound and saturate at batch sizes well above 1. Per-batch overhead is roughly fixed at 10–20 ms. A single call runs 1 × 15 ms overhead + K × per-doc cost. Batch=8 runs ceil(K/8) × 15 ms overhead + K × per-doc cost. At K=100 that is 265 ms versus 445 ms. The overhead term is what gets cut.

Tradeoff. A reranker endpoint that does not support batching needs to be changed before tuning anything else. The endpoint change is unglamorous; the latency win is the largest in this playbook.

4. Hide latency behind LLM prefill

When the request feeds an LLM (RAG), the generation call dominates total time. Run the reranker concurrently with the LLM’s prefill on a draft prompt; resolve the final-context injection point once both return. The work still happens, but if it finishes before prefill, the user never sees it.

import asyncio

async def serve(query):
    fp = asyncio.create_task(first_pass_retrieve_async(query))
    pre = asyncio.create_task(llm_warmup(SYSTEM_PROMPT))   # prefill on system
    candidates = await fp
    rr = asyncio.create_task(rerank_async(query, candidates))
    await pre                          # prefill should finish first
    reranked = await rr                # reranker should finish near-simultaneously
    return await llm_complete(build_prompt(query, reranked))

Why this works. Modern LLM serving frameworks (vLLM, TGI, SGLang) separate prefill from decode. Prefill on a long system prompt typically runs 100–300 ms — overlappable with reranker latency. When reranker time is at or below prefill time, the reranker becomes invisible to the user.

Tradeoff. The serving layer must commit to a draft of the prompt before reranking finishes — typically a system prompt with a placeholder for retrieved context. Some frameworks support this directly via prefix caching; others require a more invasive serving change. If the LLM call is short (sub-100 ms) this lever yields nothing.
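The effect of overlapping can be seen in a toy timing experiment before any serving change is made. This sketch replaces both calls with `asyncio.sleep` stand-ins; the 140 ms and 200 ms figures are assumed durations (scaled down here so the script runs quickly), not measurements of any framework.

```python
import asyncio
import time

async def fake_stage(name, seconds):
    """Stand-in for a real network call; sleeps for the modeled duration."""
    await asyncio.sleep(seconds)
    return name

async def sequential(rerank_s, prefill_s):
    """Rerank, then prefill: wall time is roughly the sum."""
    start = time.perf_counter()
    await fake_stage("rerank", rerank_s)
    await fake_stage("prefill", prefill_s)
    return time.perf_counter() - start

async def overlapped(rerank_s, prefill_s):
    """Rerank and prefill together: wall time is roughly the longer of the two."""
    start = time.perf_counter()
    await asyncio.gather(fake_stage("rerank", rerank_s),
                         fake_stage("prefill", prefill_s))
    return time.perf_counter() - start

async def main():
    # Assumed 140 ms rerank and 200 ms prefill, scaled to half speed for the demo.
    seq = await sequential(0.07, 0.10)
    ovl = await overlapped(0.07, 0.10)
    return seq, ovl

seq_s, ovl_s = asyncio.run(main())
print(f"sequential: {seq_s * 1000:.0f} ms, overlapped: {ovl_s * 1000:.0f} ms")
```

When the reranker stand-in is the longer of the two, the overlapped time tracks the reranker instead, which is the "cannot hide" case.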

5. Quantize the reranker

INT8 inference roughly doubles throughput at under 1 NDCG@10 point loss for most cross-encoders. Quantization is a one-line affordance in serving frameworks (TEI, ONNX Runtime, vLLM).

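# Illustrative launch for HuggingFace Text Embeddings Inference (TEI).
# Check `text-embeddings-router --help` first: the accepted --dtype values depend on
# the build and may not include int8, in which case serve a pre-quantized model instead.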
#   text-embeddings-router \
#       --model-id BAAI/bge-reranker-large \
#       --dtype int8 \
#       --max-batch-tokens 16384

Why this works. Cross-encoder forward passes are dominated by matmul. INT8 matmul is roughly 2x the throughput of FP16 on modern accelerators; calibration losses are below the noise floor of most retrieval evals on most domains.

Tradeoff. Re-validate on the eval — a regression of more than 1.5 NDCG points means falling back to FP16. Some domains (medical, legal) are more quantization-sensitive than web text; the loss is empirical, not fixed. INT4 is one further notch — usually too lossy for cross-encoders without QAT.
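The re-validation step can be expressed as a gate that compares mean NDCG@10 between the FP16 and INT8 rankings. This is a sketch under stated assumptions: it uses the linear-gain form of DCG, treats one NDCG point as 0.01 on the 0 to 1 scale, and the function names and example data are hypothetical.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_ids, gold_relevance, k=10):
    """NDCG@k for one query. `gold_relevance` maps doc id -> graded relevance."""
    gains = [gold_relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_dcg = dcg_at_k(sorted(gold_relevance.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def quantization_gate(fp16_runs, int8_runs, gold, max_drop_points=1.5, k=10):
    """Return (passes, drop_in_points) comparing mean NDCG@k across queries.

    Each *_runs maps query id -> ranked doc ids; `gold` maps query id ->
    {doc id: relevance}.
    """
    def mean_ndcg(runs):
        return sum(ndcg_at_k(runs[q], gold[q], k) for q in gold) / len(gold)
    drop = (mean_ndcg(fp16_runs) - mean_ndcg(int8_runs)) * 100
    return drop <= max_drop_points, drop

# Tiny hypothetical eval: INT8 pushes one relevant document from rank 2 to rank 3.
gold = {"q1": {"a": 1.0, "b": 1.0}}
fp16 = {"q1": ["a", "b", "c"]}
int8 = {"q1": ["a", "c", "b"]}
passes, drop = quantization_gate(fp16, int8, gold)
print(f"drop: {drop:.1f} NDCG points, ship INT8: {passes}")
```

A single-query eval exaggerates the drop; on a real eval set the same gate averages over all queries.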

6. Architecture swap (last resort)

Late-interaction rerankers (ColBERT-style) and learned-sparse rerankers sit in a different latency/accuracy regime: typically 3–5x throughput at the cost of 1–2 NDCG@10 points. Per-token document embeddings are precomputed offline; runtime scoring is a cheap MaxSim aggregation.

# Architecture is a deploy-time choice, not a runtime swap.
reranked = late_interaction_rerank(query, candidates,
                                    doc_token_embeddings_cache)

Why this works. The expensive part of a cross-encoder is joint attention at query time. Late-interaction architectures move most of that work to indexing time — document token embeddings are computed offline, query token embeddings are computed once per query, and runtime scoring becomes a series of cheap dot products. The accuracy hit comes from lost cross-token attention.

Tradeoff. Worth considering only when levers 1–5 have not landed the budget. Validate against an in-domain eval, not a published benchmark — the 1–2 point delta is real and varies. Index storage costs 4–8x because per-token embeddings replace per-document ones. This is a multi-week migration, not a config flip.
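The MaxSim aggregation mentioned under this lever is small enough to show in full. The sketch below scores precomputed per-token document embeddings against query token embeddings with NumPy. The function names, the cache layout, and the two-dimensional toy vectors are assumptions for illustration; real systems use learned embeddings with far more dimensions.

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction score: for each query token take its best-matching
    document token (max dot product), then sum over query tokens."""
    sim = query_tokens @ doc_tokens.T      # shape: (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

def maxsim_rerank(query_tokens, doc_token_cache, candidate_ids):
    """Rank candidates by MaxSim against token embeddings computed offline."""
    scored = [(cid, maxsim_score(query_tokens, doc_token_cache[cid]))
              for cid in candidate_ids]
    scored.sort(key=lambda pair: -pair[1])
    return [cid for cid, _ in scored]

# Toy embeddings: two query tokens, three cached documents.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
cache = {
    "d1": np.array([[1.0, 0.0], [0.0, 1.0]]),   # matches both query tokens
    "d2": np.array([[1.0, 0.0], [1.0, 0.0]]),   # matches only the first
    "d3": np.array([[0.2, 0.2]]),               # weak match to both
}
ranking = maxsim_rerank(query, cache, ["d3", "d2", "d1"])
print(ranking)
```

No document is run through a transformer at query time; the per-candidate cost is one small matrix product, which is where the throughput gain comes from.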

What does NOT work — and every team tries first

Pick a smaller cross-encoder (bge-reranker-large → bge-reranker-base → bge-reranker-small). Accuracy drops faster than latency because below a certain capacity the joint-attention signal degrades non-linearly. The endpoint becomes a fast-but-bad reranker that no longer earns its place on the request path, then gets turned off entirely, which degrades retrieval quality without solving the latency overrun. The accuracy/latency frontier of cross-encoders is mostly fixed; movement on it is small. Movement off it, by changing where the work happens, is large.

Don’t change the model. Change where the work happens. Halve K, full-batch, hide behind prefill — in that order.

This isn’t this pattern when…

Observation | Probably… | Read next
Reranker is 20% or less of p95; budget still blown | First-pass retriever or LLM is the cost | Profile the dominant cost
Reranker is fast; offline eval still bad | Reranker accuracy ceiling or wrong base model | Embedding plateau
Frontier LLM is doing the scoring step itself | A specialist should replace the LLM here | Single-LLM overspend
Cascade in place; cheap stage’s calibration drifting | The cascade itself has saturated | Cascade saturation
Reranker fast; right doc retrieved; answer wrong | Generation, not ranking, is the failure | Right doc, wrong answer

The disambiguation rule: this pattern is about where the reranker runs, not whether it works. A wrong-model or wrong-size reranker is a different pattern. A correct reranker whose output is misused downstream is a different pattern. Same surface symptom, different mechanism.

Numbers that matter

signal | healthy | suspect | confirmed
reranker fraction of request p95 | under 0.25 | 0.25–0.40 | over 0.40
K (reranker input size) vs. K-sensitivity 95% point | within 2x | 2–4x | over 4x
reranker batch size as fraction of K | 0.5–1.0 | 0.1–0.5 | under 0.1
reranker p95 (ms) at K=50, batched | under 80 | 80–150 | over 150
RAG: reranker p95 / LLM prefill p95 | under 1.0 (overlappable) | 1.0–1.5 | over 1.5 (cannot hide)

These are starting thresholds for a typical large cross-encoder on modern accelerator hardware. Tune over time: “healthy” depends on hardware, model size, and downstream LLM prefill.

Adjacent patterns

  • Single-LLM overspend: a reranker on the request path is the specialist that replaces an LLM-based scoring step. Without a reranker at all, that is the starting move; the latency-budget concern here is the downstream version of the same architectural decision.
  • Cascade saturation: lever 2 introduces a cascade. The cascade itself has a failure mode — the cheap stage’s confidence calibration drifts and silently lets through cases it should escalate.
  • Embedding plateau: when all six levers land but offline numbers regress, the reranker’s accuracy was carrying more weight than the eval showed, and the underlying retrieval may be at its ceiling.

When K has been halved, batches are full, and prefill is hiding the reranker — and the reranker is still the dominant cost — the problem is hardware, not architecture. Buy a bigger accelerator or accept the budget. The frontier is mostly fixed.

The team writing this: ZeroEntropy trains specialized small models (zembed-1, zerank-2) for the production stacks where these patterns show up.