LLM Reranking in RAG: A Pragmatic Cost-Benefit Analysis

TL;DR: LLM reranking can improve RAG quality through cross-document reasoning, but the economics are harsh. Pointwise LLM reranking is almost never worth it—10x the cost and lower accuracy than specialized rerankers. Listwise reranking shows modest gains (0.78 vs 0.74 NDCG@10) but at 9x cost and 35x latency. The pragmatic approach: a hybrid pipeline where a specialized reranker narrows candidates and an LLM does final listwise reranking on only the top-10 results.

Introduction
Reranking quality often determines whether your RAG pipeline succeeds or fails. As LLMs become cheaper and more capable, many engineers wonder: should we replace our specialized rerankers with LLMs?
After benchmarking Gemini Flash across 17 retrieval datasets and deploying LLM-based reranking in production, here’s what we’ve learned about when it makes sense and when it doesn’t.
The Economics of Pointwise LLM Reranking
Pointwise reranking treats each query-document pair independently, scoring them one at a time. This mirrors how cross-encoders work, but with fundamentally different economics.
The math doesn’t work out. Consider a typical setup with k=75 candidates:
- Latency: 75 sequential LLM calls per query. Even with Gemini Flash at ~200ms per call, you’re looking at 15+ seconds end-to-end. Parallelization helps but introduces orchestration complexity and still bottlenecks on rate limits.
- Cost: With 500 tokens per query-document pair at $0.50/1M input tokens, you’re paying $18.75 per 1,000 queries just for inputs. Add output tokens for scores and explanations, and you’re approaching $25-30 per 1,000 queries. A specialized reranker like Cohere Rerank costs $2/1k queries—roughly 10x cheaper.
- Accuracy: Our benchmarks show Gemini Flash averaging 0.68 NDCG@10 across datasets, compared to 0.74 for purpose-built rerankers like BGE-reranker-v2. LLMs produce unstable scores that vary between runs even with temperature=0, making threshold-based filtering unreliable.
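The cost figure above follows from simple arithmetic. A minimal sketch, using the candidate count, token counts, and per-token price assumed in this section (not live pricing):

```python
def pointwise_rerank_cost(
    k: int = 75,                      # candidates scored per query
    tokens_per_pair: int = 500,       # query + document tokens per LLM call
    input_price_per_m: float = 0.50,  # dollars per 1M input tokens
) -> float:
    """Input-token cost in dollars per 1,000 queries for pointwise reranking."""
    tokens_per_query = k * tokens_per_pair
    cost_per_query = tokens_per_query / 1_000_000 * input_price_per_m
    return cost_per_query * 1_000

print(pointwise_rerank_cost())  # 18.75, matching the figure above
```

Output tokens and retries only push this number higher, which is where the $25-30 estimate comes from.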
Listwise Reranking: The Only Compelling Use Case
Listwise reranking flips the paradigm. Instead of scoring documents independently, you provide the LLM with the query and all k candidates in a single context window, asking it to produce a ranked ordering.
This approach unlocks capabilities that traditional rerankers can’t match:
Cross-document reasoning: An LLM can identify that Document A provides background context while Document B directly answers the question, even if both score similarly on surface-level relevance. Cross-encoders see one document at a time and can’t make these comparisons.
Flexible ranking criteria: You can adapt ranking logic in natural language without retraining models. “Prioritize recent sources,” “prefer academic papers over blog posts,” or “rank by completeness of answer” become prompt modifications rather than model architecture changes.
Deduplication and complementarity: An LLM can recognize when two highly-ranked documents are near-duplicates and demote one, or identify when documents complement each other and should appear together.
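In code, listwise reranking is mostly prompt assembly plus defensive parsing of the model's ordering. A sketch under assumptions: the prompt wording is illustrative, not a specific API, and real model output needs the tolerant parsing shown here because LLMs sometimes drop or repeat indices:

```python
def build_listwise_prompt(query: str, docs: list[str],
                          criteria: str = "relevance to the query") -> str:
    """Pack the query and every candidate into one prompt, asking for a ranked ordering."""
    numbered = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs))
    return (
        f"Rank the documents below by {criteria}, best first.\n\n"
        f"Query: {query}\n\nDocuments:\n{numbered}\n\n"
        "Answer with the document indices only, comma-separated."
    )

def parse_ranking(response: str, n_docs: int) -> list[int]:
    """Parse a 'best first' index list, tolerating dropped or repeated indices."""
    tokens = response.replace(" ", "").split(",")
    order = list(dict.fromkeys(
        int(t) for t in tokens if t.isdigit() and int(t) < n_docs
    ))
    # append anything the model omitted, preserving original retrieval order
    return order + [i for i in range(n_docs) if i not in set(order)]
```

The `criteria` argument is what makes the "flexible ranking criteria" point above a one-line change rather than a retraining job.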
A Hybrid Strategy That Actually Works
In production, we’ve found success with a staged approach:
1. Initial retrieval: use a fast embedding model to pull the top-200 candidates (sub-100ms, effectively free)
2. First-stage reranking: apply a specialized cross-encoder to narrow to the top-20 (5-10ms, $0.50/1k queries)
3. LLM listwise reranking: use an LLM to produce the final ordering of the top-10 (200-500ms, $5-10/1k queries depending on document length)
This gives you the best of both worlds: the cross-document reasoning of LLM reranking where it matters most (the final results the user sees), while keeping costs and latency manageable by limiting the LLM’s workload.
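The staged pipeline is a thin orchestration layer. A minimal sketch, where `retrieve`, `cross_encode`, and `llm_listwise_rerank` are placeholders for your own retriever, specialized reranker, and LLM call (none of these names come from a specific library):

```python
def hybrid_rerank(query, corpus, retrieve, cross_encode, llm_listwise_rerank):
    """Three-stage reranking: cheap retrieval, specialized narrowing, LLM final pass."""
    candidates = retrieve(query, corpus, k=200)            # fast embedding retrieval
    shortlist = cross_encode(query, candidates, top_k=20)  # specialized cross-encoder
    return llm_listwise_rerank(query, shortlist[:10])      # LLM orders only the top-10
```

Because the LLM sees at most 10 short documents, its cost and latency stay bounded regardless of corpus size.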
Several optimizations can push the LLM stage's cost down further:
- Prompt caching: If your queries share common structure, cache the instruction portion to reduce effective token counts by 30-50%
- Document summarization: Compress each document to 100-200 tokens before sending to the reranker, especially if you’re ranking on relevance rather than completeness
- Batch processing: For offline pipelines or async workflows, batch multiple queries together to amortize overhead
- Smaller fine-tuned models: Fine-tune a 7B model specifically for your domain’s reranking task. We’ve seen 8x cost reduction with comparable quality to GPT-4 on domain-specific corpora.
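The summarization bullet above can be as simple as a length cap before the reranking call. A crude sketch, assuming the common ~4 characters-per-token heuristic; a production pipeline would use a summarization model or the reranking LLM's own tokenizer instead:

```python
def compress_doc(text: str, max_tokens: int = 150, chars_per_token: int = 4) -> str:
    """Truncate a document to roughly max_tokens, breaking at a word boundary."""
    budget = max_tokens * chars_per_token
    if len(text) <= budget:
        return text
    # cut at the last full word inside the budget, then mark the truncation
    return text[:budget].rsplit(" ", 1)[0] + " …"
```

Even this blunt approach cuts input tokens proportionally, since reranking cost scales linearly with document length.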
Benchmark Results: Setting Expectations
We evaluated Gemini Flash against BGE-reranker-v2 and Cohere Rerank across 17 datasets spanning e-commerce, legal documents, technical documentation, and news articles.
Pointwise results (NDCG@10):
| Model | NDCG@10 | Median Latency | Cost per 1k queries |
|---|---|---|---|
| BGE-reranker-v2 | 0.74 | 12ms | $2 |
| Gemini Flash | 0.68 | 185ms | $27 |
Listwise results (NDCG@10, top-20 candidates; BGE-reranker-v2 is pointwise-only and shown as a baseline):
| Model | NDCG@10 | Median Latency | Cost per 1k queries |
|---|---|---|---|
| BGE-reranker-v2 | 0.74 | — | $2 |
| Gemini Flash | 0.78 | 420ms | $18 |
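For readers less familiar with the metric, NDCG@10 compares a ranking's discounted cumulative gain against the ideal ordering of the same documents. A minimal implementation of the standard formula:

```python
import math

def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """NDCG@k for one query; relevances are graded labels in ranked order."""
    def dcg(rels):
        # log2(i + 2) because rank positions are 1-indexed in the discount
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A perfect ordering scores 1.0, so the 0.74 to 0.78 gap in the table represents a real but modest improvement in how often relevant documents land near the top.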
When to Use What
Use a specialized reranker (BGE, Cohere, jina-reranker) when:
- You need low latency (<50ms) and high throughput
- Your budget is constrained and you’re processing high query volumes
- Your ranking criteria are stable and can be captured in training data
- You need consistent, calibrated scores for downstream filtering
Use LLM listwise reranking when:
- You’re in a low-QPS, high-value domain (legal research, medical literature review, compliance)
- Cross-document reasoning materially improves result quality
- You need flexible ranking criteria that change frequently
- You can afford 500ms-2s latency and $10-100 per 1,000 queries
- You’re implementing a hybrid pipeline where the LLM only reranks top-10 results
Avoid LLM pointwise reranking unless:
- You specifically need natural language explanations for each relevance score
- You’re in research/experimentation mode and cost doesn’t matter
