LLM Reranking in RAG: A Pragmatic Cost-Benefit Analysis

TL;DR: LLM reranking can improve RAG quality through cross-document reasoning, but the economics are harsh. Pointwise LLM reranking is almost never worth it—10x the cost and lower accuracy than specialized rerankers. Listwise reranking shows modest gains (0.78 vs 0.74 NDCG@10) but at 9x cost and 35x latency. The pragmatic approach: a hybrid pipeline where a specialized reranker narrows candidates and an LLM does final listwise reranking on only the top-10 results.

Introduction
Reranking quality often determines whether your RAG pipeline succeeds or fails. As LLMs become cheaper and more capable, many engineers wonder: should we replace our specialized rerankers with LLMs?
After benchmarking Gemini Flash across 17 retrieval datasets and deploying LLM-based reranking in production, here’s what we’ve learned about when it makes sense and when it doesn’t.
The Economics of Pointwise LLM Reranking
Pointwise reranking treats each query-document pair independently, scoring them one at a time. This mirrors how cross-encoders work, but with fundamentally different economics.
The math doesn’t work out. Consider a typical setup with k=75 candidates:
- Latency: 75 sequential LLM calls per query. Even with Gemini Flash at ~200ms per call, you’re looking at 15+ seconds end-to-end. Parallelization helps but introduces orchestration complexity and still bottlenecks on rate limits.
- Cost: With 500 tokens per query-document pair at $0.50/1M input tokens, you’re paying $18.75 per 1,000 queries just for inputs. Add output tokens for scores and explanations, and you’re approaching $25-30 per 1,000 queries. A specialized reranker like Cohere Rerank costs $2/1k queries—roughly 10x cheaper.
- Accuracy: Our benchmarks show Gemini Flash averaging 0.68 NDCG@10 across datasets, compared to 0.74 for purpose-built rerankers like BGE-reranker-v2. LLMs produce unstable scores that vary between runs even with temperature=0, making threshold-based filtering unreliable.
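The cost figure above follows from simple arithmetic. A minimal sketch, using the candidate count, token counts, and per-token price assumed in this section (not live pricing):

```python
def pointwise_rerank_cost(
    k: int = 75,                      # candidates scored per query
    tokens_per_pair: int = 500,       # query + document tokens per LLM call
    input_price_per_m: float = 0.50,  # dollars per 1M input tokens
) -> float:
    """Input-token cost in dollars per 1,000 queries for pointwise reranking."""
    tokens_per_query = k * tokens_per_pair
    cost_per_query = tokens_per_query / 1_000_000 * input_price_per_m
    return cost_per_query * 1_000

print(pointwise_rerank_cost())  # 18.75, matching the figure above
```

Output tokens and retries only push this number higher, which is where the $25-30 estimate comes from.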
Listwise Reranking: The Only Compelling Use Case
Listwise reranking flips the paradigm. Instead of scoring documents independently, you provide the LLM with the query and all k candidates in a single context window, asking it to produce a ranked ordering.
This approach unlocks capabilities that traditional rerankers can’t match:
Cross-document reasoning: An LLM can identify that Document A provides background context while Document B directly answers the question, even if both score similarly on surface-level relevance. Cross-encoders see one document at a time and can’t make these comparisons.
Flexible ranking criteria: You can adapt ranking logic in natural language without retraining models. “Prioritize recent sources,” “prefer academic papers over blog posts,” or “rank by completeness of answer” become prompt modifications rather than model architecture changes.
Deduplication and complementarity: An LLM can recognize when two highly-ranked documents are near-duplicates and demote one, or identify when documents complement each other and should appear together.
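In code, listwise reranking is mostly prompt assembly plus defensive parsing of the model's ordering. A sketch under assumptions: the prompt wording is illustrative, not a specific API, and real model output needs the tolerant parsing shown here because LLMs sometimes drop or repeat indices:

```python
def build_listwise_prompt(query: str, docs: list[str],
                          criteria: str = "relevance to the query") -> str:
    """Pack the query and every candidate into one prompt, asking for a ranked ordering."""
    numbered = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs))
    return (
        f"Rank the documents below by {criteria}, best first.\n\n"
        f"Query: {query}\n\nDocuments:\n{numbered}\n\n"
        "Answer with the document indices only, comma-separated."
    )

def parse_ranking(response: str, n_docs: int) -> list[int]:
    """Parse a 'best first' index list, tolerating dropped or repeated indices."""
    tokens = response.replace(" ", "").split(",")
    order = list(dict.fromkeys(
        int(t) for t in tokens if t.isdigit() and int(t) < n_docs
    ))
    # append anything the model omitted, preserving original retrieval order
    return order + [i for i in range(n_docs) if i not in set(order)]
```

The `criteria` argument is what makes the "flexible ranking criteria" point above a one-line change rather than a retraining job.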
A Hybrid Strategy That Actually Works
In production, we’ve found success with a staged approach:
1. Initial retrieval: use a fast embedding model to pull the top-200 candidates (sub-100ms, effectively free)
2. First-stage reranking: apply a specialized cross-encoder to narrow to the top-20 (5-10ms, $0.50/1k queries)
3. LLM listwise reranking: use an LLM to produce the final ordering of the top-10 (200-500ms, $5-10/1k queries depending on document length)
This gives you the best of both worlds: the cross-document reasoning of LLM reranking where it matters most (the final results the user sees), while keeping costs and latency manageable by limiting the LLM’s workload.
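The staged pipeline is a thin orchestration layer. A minimal sketch, where `retrieve`, `cross_encode`, and `llm_listwise_rerank` are placeholders for your own retriever, specialized reranker, and LLM call (none of these names come from a specific library):

```python
def hybrid_rerank(query, corpus, retrieve, cross_encode, llm_listwise_rerank):
    """Three-stage reranking: cheap retrieval, specialized narrowing, LLM final pass."""
    candidates = retrieve(query, corpus, k=200)            # fast embedding retrieval
    shortlist = cross_encode(query, candidates, top_k=20)  # specialized cross-encoder
    return llm_listwise_rerank(query, shortlist[:10])      # LLM orders only the top-10
```

Because the LLM sees at most 10 short documents, its cost and latency stay bounded regardless of corpus size.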
Several optimizations can push the LLM stage's cost down further:
- Prompt caching: If your queries share common structure, cache the instruction portion to reduce effective token counts by 30-50%
- Document summarization: Compress each document to 100-200 tokens before sending to the reranker, especially if you’re ranking on relevance rather than completeness
- Batch processing: For offline pipelines or async workflows, batch multiple queries together to amortize overhead
- Smaller fine-tuned models: Fine-tune a 7B model specifically for your domain’s reranking task. We’ve seen 8x cost reduction with comparable quality to GPT-4 on domain-specific corpora.
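The summarization bullet above can be as simple as a length cap before the reranking call. A crude sketch, assuming the common ~4 characters-per-token heuristic; a production pipeline would use a summarization model or the reranking LLM's own tokenizer instead:

```python
def compress_doc(text: str, max_tokens: int = 150, chars_per_token: int = 4) -> str:
    """Truncate a document to roughly max_tokens, breaking at a word boundary."""
    budget = max_tokens * chars_per_token
    if len(text) <= budget:
        return text
    # cut at the last full word inside the budget, then mark the truncation
    return text[:budget].rsplit(" ", 1)[0] + " …"
```

Even this blunt approach cuts input tokens proportionally, since reranking cost scales linearly with document length.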
Benchmark Results: Setting Expectations
We evaluated Gemini Flash against BGE-reranker-v2 and Cohere Rerank across 17 datasets spanning e-commerce, legal documents, technical documentation, and news articles.
Pointwise results (NDCG@10):
| Model | NDCG@10 | Median Latency | Cost per 1k queries |
|---|---|---|---|
| BGE-reranker-v2 | 0.74 | 12ms | $2 |
| Gemini Flash | 0.68 | 185ms | $27 |
Listwise results (NDCG@10, top-20 candidates; BGE-reranker-v2 is pointwise-only and shown as a baseline):
| Model | NDCG@10 | Median Latency | Cost per 1k queries |
|---|---|---|---|
| BGE-reranker-v2 | 0.74 | — | $2 |
| Gemini Flash | 0.78 | 420ms | $18 |
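For readers less familiar with the metric, NDCG@10 compares a ranking's discounted cumulative gain against the ideal ordering of the same documents. A minimal implementation of the standard formula:

```python
import math

def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """NDCG@k for one query; relevances are graded labels in ranked order."""
    def dcg(rels):
        # log2(i + 2) because rank positions are 1-indexed in the discount
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A perfect ordering scores 1.0, so the 0.74 to 0.78 gap in the table represents a real but modest improvement in how often relevant documents land near the top.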
When to Use What
Use a specialized reranker (BGE, Cohere, jina-reranker) when:
- You need low latency (<50ms) and high throughput
- Your budget is constrained and you’re processing high query volumes
- Your ranking criteria are stable and can be captured in training data
- You need consistent, calibrated scores for downstream filtering
Use LLM listwise reranking when:
- You’re in a low-QPS, high-value domain (legal research, medical literature review, compliance)
- Cross-document reasoning materially improves result quality
- You need flexible ranking criteria that change frequently
- You can afford 500ms-2s latency and $10-100 per 1,000 queries
- You’re implementing a hybrid pipeline where the LLM only reranks top-10 results
Avoid LLM pointwise reranking unless:
- You specifically need natural language explanations for each relevance score
- You’re in research/experimentation mode and cost doesn’t matter
