Add Accuracy, Not Latency
Reorders your search candidates so the actual answer beats the lookalike sitting next to it. zerank-2 tops every public reranker leaderboard at 2–3× the speed and a fraction of the cost of an LLM doing the same job.
Embeddings find related docs. Rerankers find relevant ones.
First-stage retrieval — BM25, dense embeddings, hybrids — surfaces a few hundred candidates per query. Without a reranker, your top-1 is whatever happened to be closest in vector space. Usually that's an answer-shaped distractor sitting next to an actual answer.
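The two-stage shape can be sketched as follows. This is a minimal illustration, not ZeroEntropy's API: the cosine retrieval and the stand-in scoring function are placeholders for a real vector index and a real cross-encoder.

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_then_rerank(query_vec, doc_vecs, docs, score_fn, k=100):
    # Stage 1: cheap nearest-neighbor search over the whole corpus.
    candidates = sorted(range(len(docs)),
                        key=lambda i: -cosine(query_vec, doc_vecs[i]))[:k]
    # Stage 2: an expensive scorer rescores only the candidates,
    # so the true answer can overtake the answer-shaped distractor.
    return sorted(candidates, key=lambda i: -score_fn(docs[i]))

# Toy corpus: the distractor is closest in vector space,
# but the reranker's relevance score puts the answer first.
docs = ["distractor", "answer", "off-topic"]
vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
relevance = {"distractor": 0.2, "answer": 0.9, "off-topic": 0.1}
print(retrieve_then_rerank([1.0, 0.0], vecs, docs, relevance.get, k=2))  # [1, 0]
```

The point of the pattern: the first stage bounds cost by shrinking the candidate set, and the second stage spends its accuracy budget only where it matters.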
zerank-2 leads the reranker leaderboard across 29 evaluation datasets and three independent LLM judges.
Per rerank call on 12 KB documents — 2–3× faster than Jina and Cohere at superior quality.
Compared to a frontier listwise LLM reranker like GPT-5-mini, at a fraction of the cost and higher NDCG@10.
“ZeroEntropy gave us state-of-the-art clinical accuracy across millions of medical research papers — both for simple retrieval and for our Deep Research use case via the MCP server.”
“We replaced our reranker with ZeroEntropy and saw an immediate jump in answer quality on our customer support corpus. The latency was the part we expected to lose; it actually got better.”
“Memory recall accuracy went up meaningfully across our agent benchmarks once we wired zerank-2 into the retrieval path.”
zerank-2: The World's Best Reranker
Cross-encoder reranker, calibrated, multilingual, instruction-following. The numbers below are from the zerank-2 launch evals and the latency assessment under Poisson production load.
- Parameters: 4B (flagship) · 1.7B small · 0.6B nano
- Architecture: cross-encoder · open weights on the 4B
- Context window: 32K tokens
- Languages: 100+ · near-English parity on major ones · code-switch robust
- Outputs: calibrated 0–1 relevance score + per-call confidence statistic
- Instructions: native — append context, abbreviation tables, business rules per call
- Pricing: $0.025 / 1M tokens — 50% under every other commercial reranker
- NDCG@10 (29 datasets, 3 LLM judges): 0.7625 — #1 across public reranker leaderboards
- P50 latency (12 KB docs): 149.7 ms — 2–3× faster than Cohere & Jina at higher quality
- Latency tail (Poisson load): 2.7% over 500 ms · 0.9% over 1 s · 0% over 3 s · zero failures
- vs Cohere rerank-3.5 (>500 ms tail): 2.7% vs 14.3%
- vs Jina reranker m0 (>500 ms tail): 2.7% vs 70.8%
- vs frontier listwise LLMs: 12–17× faster at higher NDCG@10 and a fraction of the cost
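Because instruction-following is native, domain context can travel with each call. A minimal sketch of building such a query string; the abbreviation table, the business rule, and the "Instructions:" phrasing are all hypothetical, shown only to illustrate attaching per-call context.

```python
# Hypothetical per-call context for an instruction-following reranker:
# an abbreviation table plus one business rule, prepended to the query.
abbreviations = {"MI": "myocardial infarction", "HTN": "hypertension"}
rule = "Prefer peer-reviewed clinical guidelines over news coverage."

table = "; ".join(f"'{k}' means {v}" for k, v in abbreviations.items())
query = f"Instructions: {table}. {rule}\nQuery: latest MI treatment protocols"
print(query)
```

The assembled string is then passed as the query in a normal rerank call, as in the quickstart below.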
zELO — train on the easier question
Ask two careful raters “is this document relevant?” and you'll get two different answers — relevance is fuzzy. Ask them “which of these two is more relevant?” and they'll agree. zELO trains exclusively on the easy question, then converts those head-to-head wins into continuous relevance scores. Same math that ranks chess players.
Pointwise scoring is noisy
Ask a rater to score (query, document) on a 0–1 scale and you get a different number every time — different raters disagree, and the same rater either drifts or discretizes hard. The label noise is the ceiling on every reranker trained against it.
Pairwise comparisons are stable
Ask the same rater “given this query, is document A or B more relevant?” and the answer barely moves across raters or across calls. The information density is much higher per judgment, and the disagreement is much lower.
Recover scores via Thurstone — like chess Elo
Many pairwise outcomes (A beats B, B beats C, A beats C, …) feed a Thurstone fit — the same statistical idea behind chess Elo. Out comes one continuous relevance score per document, calibrated against every comparison we've seen. Those scores are the SFT target.
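The conversion from head-to-head wins to continuous scores can be sketched with Elo-style updates. This is an illustrative approximation, not the zELO training pipeline: it uses a logistic (Bradley–Terry) link where a Thurstone fit proper uses a probit link, and real training operates at far larger scale.

```python
import math
import random

def fit_scores(comparisons, n_docs, k=0.05, epochs=200, seed=0):
    # Fit one latent relevance score per document from pairwise
    # outcomes, the way Elo fits one rating per chess player.
    rng = random.Random(seed)
    scores = [0.0] * n_docs
    pairs = list(comparisons)
    for _ in range(epochs):
        rng.shuffle(pairs)
        for winner, loser in pairs:
            # Model's current probability that the winner wins.
            p = 1.0 / (1.0 + math.exp(scores[loser] - scores[winner]))
            # Nudge both scores toward the observed outcome.
            scores[winner] += k * (1.0 - p)
            scores[loser] -= k * (1.0 - p)
    return scores

# Head-to-head wins over three documents: 0 beats 1, 1 beats 2, 0 beats 2.
wins = [(0, 1), (1, 2), (0, 2)] * 20
scores = fit_scores(wins, n_docs=3)
print(sorted(range(3), key=lambda i: -scores[i]))  # [0, 1, 2]
```

The recovered per-document scores, not the raw pairwise labels, are what the reranker is then trained to regress.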
LLMs as raters, calibrated and mixed (zerank-2)
Frontier LLMs (Claude, GPT, Gemini) are the raters — more consistent than humans and able to work at scale. zerank-2 extends this with a per-rater calibration: we fit a (μ, κ) Beta distribution to each model and iteratively mix them, so each rater's judgment is weighted by how reliable it has actually been.
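The reliability weighting can be illustrated with a precision-weighted average. A sketch under stated assumptions: the rater names and numbers are made up, the single-pass weighting stands in for the iterative mixing, and only the core idea survives — a rater whose fitted Beta has a larger concentration κ has been more consistent, so its vote counts for more.

```python
def mix_raters(judgments, rater_fits):
    # Precision-weighted mixture of per-rater relevance judgments.
    # rater_fits maps rater -> (mu, kappa) from a Beta fit to that
    # rater's historical scores; larger kappa = tighter distribution
    # = more reliable rater. Illustrative only.
    num = sum(rater_fits[r][1] * score for r, score in judgments.items())
    den = sum(rater_fits[r][1] for r in judgments)
    return num / den

# Hypothetical fits: one rater has been four times as consistent.
fits = {"rater_a": (0.52, 40.0), "rater_b": (0.48, 10.0)}
scores = {"rater_a": 0.9, "rater_b": 0.3}
print(round(mix_raters(scores, fits), 2))  # 0.78
```

The mixed score lands much closer to the consistent rater's judgment than a plain average (0.6) would.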
Read the work behind the model
zELO: ELO-inspired Training Method for Rerankers and Embedding Models
Pipitone, Houir Alami, Avadhanam, Kaminskyi, Khoo
We introduce a novel training methodology named zELO, which optimizes retrieval performance via the analysis that ranking tasks are statistically equivalent to a Thurstone model. Trained end-to-end from unannotated queries and documents in less than 10,000 H100-hours, zerank-1 achieves the highest retrieval scores across finance, legal, code, and STEM — outperforming closed-source proprietary rerankers on NDCG@10 and Recall.
Integrate ZeroEntropy models in minutes. Production-ready, latency-optimized, available everywhere.
# Create an API Key at https://dashboard.zeroentropy.dev
from zeroentropy import ZeroEntropy
zclient = ZeroEntropy()
response = zclient.models.rerank(
    model="zerank-2",
    query="What is Retrieval Augmented Generation?",
    documents=[
        "RAG combines retrieval with generation...",
    ],
)
for doc in response.results:
    print(doc)

Deploy in your own cloud with dedicated infrastructure. Available on AWS Marketplace and Azure.
From security to scale, ZeroEntropy is built for the demands of production-ready AI

SOC2 Type II
Audited controls for data security, availability, and confidentiality — verified annually.

HIPAA Compliant
BAA-ready infrastructure with encryption at rest and in transit for protected health data.

GDPR Compliant
Full data residency controls, right-to-deletion, and DPA agreements for EU customers.

CCPA Compliant
Consumer data rights honored with full transparency on collection, use, and deletion.
