Latency Benchmark: Cohere rerank 3.5 vs. ZeroEntropy zerank-1

Jul 22, 2025

Speed is the secret ingredient that makes great AI feel instant

What is a reranker and why you need one

A reranker is a cross-encoder neural model that takes a short list of candidate documents from a fast first-stage search (BM25, vector search, or hybrid) and rescores them with full query–document context. This second-pass step boosts precision in your top-k results, so your LLM or user sees the most relevant snippets first.

Diagram illustrating the reranking pipeline
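
The two-stage flow described above can be sketched in a few lines of Python. This is a minimal illustration, not ZeroEntropy's or Cohere's implementation: `first_stage`, `rerank`, and `toy_pair_score` are hypothetical names, the first stage is a simple term-overlap count standing in for BM25 or vector search, and the pair scorer is a Jaccard overlap standing in for a real cross-encoder.

```python
from typing import Callable, List, Tuple


def first_stage(query: str, corpus: List[str], k: int) -> List[str]:
    """Cheap candidate retrieval: rank documents by shared lowercase terms.

    Stand-in for BM25, vector search, or hybrid retrieval.
    """
    q_terms = set(query.lower().split())
    scored = [(len(q_terms & set(doc.lower().split())), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int,
) -> List[Tuple[str, float]]:
    """Second pass: score each (query, document) pair jointly, then sort."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def toy_pair_score(query: str, doc: str) -> float:
    """Hypothetical placeholder for a cross-encoder: Jaccard term overlap."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if (q | d) else 0.0


corpus = [
    "reset your password from the account settings page",
    "our password policy requires twelve characters and many other rules",
    "shipping times for international orders",
]
candidates = first_stage("reset password", corpus, k=2)
top = rerank("reset password", candidates, toy_pair_score, top_n=1)
print(top[0][0])
```

In a real pipeline, `score_fn` would call the reranker model or API. Keeping it as a parameter means the first stage and the sorting logic stay unchanged when the scorer is swapped.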

Benchmark results

Model	NDCG@10	Latency (12 KB)	Latency (150 KB)
Cohere rerank 3.5	0.7091	171.5 ± 106.8 ms	459.2 ± 87.9 ms
ZeroEntropy zerank-1	0.7683	149.7 ± 53.1 ms	314.4 ± 94.6 ms
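
The post does not describe how the latencies were collected, so the following is only a sketch of how figures in the "value ± spread" form could be produced. It assumes the spread is a sample standard deviation over repeated wall-clock timings. `measure_latency_ms` and `LatencyStats` are hypothetical names, and `call` stands for whatever function sends one rerank request.

```python
import statistics
import time
from typing import Callable, List, NamedTuple


class LatencyStats(NamedTuple):
    mean_ms: float
    std_ms: float
    runs: int


def measure_latency_ms(
    call: Callable[[], object], runs: int = 20, warmup: int = 2
) -> LatencyStats:
    """Time `call` repeatedly and summarize wall-clock latency in milliseconds."""
    for _ in range(warmup):
        call()  # warm up connections and caches; these runs are not recorded
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    # stdev needs at least two samples; report 0.0 for a single run
    std = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return LatencyStats(statistics.mean(samples), std, len(samples))


# Example with a local stand-in workload instead of a network request.
stats = measure_latency_ms(lambda: sum(range(10_000)), runs=5, warmup=1)
print(f"{stats.mean_ms:.3f} ms ± {stats.std_ms:.3f} over {stats.runs} runs")
```

For a hosted API, the measured time includes network round trips, so results depend on the client's region and connection as well as on the model.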

zerank-1 is:

~13 % lower latency than Cohere rerank 3.5 on small (12 KB) payloads (149.7 ms vs 171.5 ms)
~32 % lower latency on large (150 KB) payloads (314.4 ms vs 459.2 ms)

All while delivering the higher NDCG@10 (0.7683 vs 0.7091).
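
The relative latency figures follow directly from the mean latencies in the table. As a quick check, the reduction can be computed as (baseline − candidate) / baseline. The helper name `latency_reduction_pct` is just for illustration.

```python
def latency_reduction_pct(baseline_ms: float, candidate_ms: float) -> float:
    """Percentage by which the candidate's latency is below the baseline's."""
    return (baseline_ms - candidate_ms) / baseline_ms * 100.0


# Mean latencies taken from the benchmark table (Cohere rerank 3.5 as baseline).
small = latency_reduction_pct(171.5, 149.7)   # 12 KB payloads
large = latency_reduction_pct(459.2, 314.4)   # 150 KB payloads
print(f"12 KB: {small:.1f} % lower, 150 KB: {large:.1f} % lower")
# prints "12 KB: 12.7 % lower, 150 KB: 31.5 % lower"
```

These are comparisons of means only. The ± values in the table are large relative to the gap on small payloads, so individual requests will vary.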

Why speed matters

Whether you’re powering an enterprise search portal or a conversational voice agent, every millisecond counts, as the examples below show.

Examples

RAG apps: Users expect sub-second results. Slow reranking means cold leads and frustrated employees.
Voice AI agents: Jitter in your pipeline breaks the illusion of a human-like dialogue. Quick reranking keeps the conversation flowing.
E-commerce search bars: Users rarely look past the top ~10 results, so those results need to be very accurate, and every wasted millisecond raises the risk that they churn.

When to use a reranker

Tight LLM contexts

Surface the few most relevant documents so your prompt stays under token limits.

Precision-critical workflows

Legal search, medical Q&A or compliance use cases where every bit of relevance matters.

Cost-sensitive scale

Lower inference time means lower compute bills at 100 M+ monthly calls.
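
For the tight-context case above, one simple way to use reranker scores is to keep the highest-scoring documents that fit a token budget. This is a sketch under stated assumptions: `select_within_budget` and `estimate_tokens` are hypothetical helpers, and the token estimate is a crude one-token-per-word count that a real pipeline would replace with the target LLM's tokenizer.

```python
from typing import List, Tuple


def estimate_tokens(text: str) -> int:
    """Crude assumption: one token per whitespace-separated word."""
    return len(text.split())


def select_within_budget(
    ranked: List[Tuple[str, float]], max_tokens: int
) -> List[str]:
    """Keep the highest-scoring documents whose combined size fits the budget."""
    kept: List[str] = []
    used = 0
    for doc, _score in sorted(ranked, key=lambda pair: pair[1], reverse=True):
        cost = estimate_tokens(doc)
        if used + cost > max_tokens:
            continue  # too large; a shorter, lower-ranked document may still fit
        kept.append(doc)
        used += cost
    return kept


reranked = [
    ("refund policy summary", 0.91),
    ("full terms and conditions text with many clauses", 0.80),
    ("contact support", 0.42),
]
print(select_within_budget(reranked, max_tokens=5))
```

Skipping an oversized document instead of stopping at it is a design choice: it fills the budget more fully but can admit a lower-scoring document ahead of a higher-scoring one that did not fit.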

Try zerank-1 today

Experience sub-200 ms reranking on small payloads (149.7 ms at 12 KB in the benchmark above) with top-tier accuracy:

→ API: integrate in minutes
→ Hugging Face: pull the weights and run locally

Give your search, agent or RAG pipeline the speed boost it needs.
