- Every embedding model is a bi-encoder: it encodes queries and documents independently, which is what makes vector search fast — and what limits its accuracy.
- Every reranker is a cross-encoder: it reads the query and document together, which is what makes reranking precise — and what makes it too expensive for first-stage retrieval.
- Production search uses both. The bi-encoder retrieves candidates in milliseconds; the cross-encoder rescores them in order of actual relevance. This is the retrieve-then-rerank pattern, and it is how virtually every serious search and RAG system works today.
Two Architectures, One Pipeline
Every modern search and RAG system runs two models in sequence: an embedding model for fast retrieval, and a reranker for precise scoring. These correspond to two fundamentally different transformer architectures — bi-encoders and cross-encoders — with different strengths, different costs, and a well-understood division of labor in production.
This post covers what each architecture does, where each one breaks, and why the combination outperforms either alone.
Bi-Encoders and Cross-Encoders
A bi-encoder encodes query and document independently into vectors, then compares them via cosine similarity. This is what embedding models do — OpenAI’s text-embedding-3-large, Cohere’s embed-v4.0, voyage-4, and our own zembed-1. Because document embeddings are pre-computed, retrieval over millions of documents takes single-digit milliseconds.
A cross-encoder concatenates query and document into a single sequence and passes them through a transformer together. This is what rerankers do — Cohere Rerank, Voyage Rerank, and our own zerank-2. Because every query token attends to every document token, a cross-encoder runs one forward pass per (query, document) pair.
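The two compute patterns can be sketched in a few lines. Everything below is a toy illustration of the *interfaces*, not of retrieval quality: `encode` stands in for an embedding model (hashed bag-of-words, not learned vectors) and `score_pair` stands in for a cross-encoder forward pass (token overlap, not attention):

```python
import numpy as np

def encode(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for an embedding model: hashed bag-of-words unit vector."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

# Bi-encoder: documents are encoded once, ahead of time...
docs = ["the cat sat on the mat", "stocks rallied after earnings"]
doc_vecs = np.stack([encode(d) for d in docs])   # precomputed offline

# ...so each query costs one encode plus a matrix-vector product.
query = "how did stocks move after earnings"
bi_scores = doc_vecs @ encode(query)             # cosine similarity

# Cross-encoder: one joint forward pass per (query, document) pair.
def score_pair(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder forward pass (Jaccard overlap)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

cross_scores = [score_pair(query, d) for d in docs]
```

The asymmetry is the whole story: the bi-encoder's per-query cost is independent of how the documents were produced (they're just vectors), while the cross-encoder must re-read every document for every query.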
The architectural difference is easiest to see on negation. Given the query “companies that did not go bankrupt” and two documents — one about a company filing for Chapter 11, one about a company reporting record profits:
| | Bankruptcy doc | Profitable doc | Δ |
|---|---|---|---|
| Embedding model | 0.57 | 0.57 | 0.00 |
| Reranker | 0.30 | 0.54 | +0.24 |
The embedding model cannot tell which document matches. The reranker can.
Bi-encoders scale to billions of documents. Cross-encoders score one pair at a time. Production search uses both.
Retrieve-Then-Rerank
In practice, the two architectures are complementary. The standard production pattern chains them:
The bi-encoder retrieves the top 100 candidates from the full corpus in ~5ms. The cross-encoder rescores those 100 candidates in ~50ms. The reranked top 10 go to your application.
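The whole pattern is a few lines of code. A minimal sketch with synthetic embeddings and a placeholder scoring function standing in for the real models (in production, stage 1 would hit your vector database and stage 2 would call a reranker):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: precomputed document embeddings and one query embedding.
corpus = rng.normal(size=(100_000, 256))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query_vec = rng.normal(size=256)
query_vec /= np.linalg.norm(query_vec)

# Stage 1: bi-encoder retrieval — one dot product per document, top 100 survive.
similarities = corpus @ query_vec
candidate_ids = np.argsort(-similarities)[:100]

# Stage 2: cross-encoder rescoring — one forward pass per surviving candidate.
def cross_encoder_score(doc_id: int) -> float:
    """Placeholder for a reranker forward pass over (query, document)."""
    return float(similarities[doc_id] + rng.normal(scale=0.05))

reranked = sorted(candidate_ids, key=cross_encoder_score, reverse=True)
top_10 = reranked[:10]
```

Note where the costs land: stage 1 is a single matrix-vector product over the full corpus; stage 2 runs the expensive model only 100 times, never 100,000.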
How Much Does Reranking Actually Move the Needle?
On our evaluations, adding zerank-2 on top of zembed-1 retrieval improves NDCG@10 by 5-20% across verticals. The gains are largest exactly where you’d expect: legal, healthcare, and finance, where queries are nuanced and keyword overlap is a poor proxy for relevance.
The aggregate number understates what happens at the individual query level. The reranker corrects two specific failure modes:
False positives demoted. Documents ranked highly by the bi-encoder due to vocabulary overlap with the query, despite not actually answering it. These get pushed down or out of the top 10 entirely.
False negatives promoted. Documents that are genuinely relevant but were under-scored by the bi-encoder because their surface text diverges from the query. A document ranked 47th by the bi-encoder can jump to position 3 after reranking. Without the cross-encoder, that document would never enter the context window.
In our voyage-4 comparison, when we looked at documents where zembed-1 and voyage-4 disagree on ranking and asked three separate LLMs to judge which ordering was correct, zembed-1 was preferred by a 15–20% margin. When filtered to top-10 disagreements — the ones that actually affect your application — the gap widened to 27–33%.
First-stage retrieval quality determines the ceiling for the reranker. Reranker quality determines whether that ceiling is reached.
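For readers who want to reproduce NDCG@10 numbers like the ones above on their own data, the metric itself is short: gains for relevant documents, discounted logarithmically by rank, normalized against the ideal ordering. A minimal implementation:

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the ranking, normalized by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels of retrieved docs, in ranked order (3 = highly relevant).
before_rerank = [0, 3, 1, 0, 2, 0, 0, 1, 0, 0]
after_rerank  = [3, 2, 1, 1, 0, 0, 0, 0, 0, 0]  # matches the ideal order: NDCG@10 = 1.0
```

Because the discount is steepest at the top, moving one highly relevant document from rank 2 to rank 1 moves NDCG@10 more than any shuffle further down — which is exactly why reranking the top of the list pays off.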
Why zembed-1 and zerank-2 Are Trained Together
Most embedding models are trained on binary labels: relevant or not relevant. Most rerankers are trained independently with their own methodology. Mixing providers means the two stages may disagree on what “relevant” means.
zembed-1 was distilled directly from zerank-2, which was trained with our zELO methodology — pairwise battles between documents that produce continuous, calibrated Elo scores from 0 to 1.
When the bi-encoder’s notion of relevance diverges from the cross-encoder’s, the reranker spends its budget correcting disagreements rather than refining an already-good ranking. Shared training avoids this.
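The details of zELO aren't spelled out here, but the core move — turning pairwise battles into continuous per-document scores — is the classic Bradley-Terry model. The sketch below is an illustrative gradient-ascent fit of that model (not ZeroEntropy's actual implementation), squashing the latent strengths to (0, 1) at the end:

```python
import numpy as np

def fit_bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """Fit latent strengths from a pairwise win-count matrix.
    wins[i, j] = number of battles document i won against document j."""
    n = wins.shape[0]
    theta = np.zeros(n)                        # latent "Elo" strengths
    total = max(1.0, (wins + wins.T).sum())
    for _ in range(iters):
        # P(i beats j) under the model: sigmoid(theta_i - theta_j).
        p = 1.0 / (1.0 + np.exp(theta[None, :] - theta[:, None]))
        # Gradient of the Bradley-Terry log-likelihood w.r.t. theta.
        grad = (wins - (wins + wins.T) * p).sum(axis=1)
        theta += 0.1 * grad / total
        theta -= theta.mean()                  # fix the translation freedom
    return 1.0 / (1.0 + np.exp(-theta))        # squash to a (0, 1) score

# Doc 0 usually beats doc 1; doc 1 usually beats doc 2.
battles = np.array([[0, 9, 10],
                    [1, 0,  8],
                    [0, 2,  0]], dtype=float)
scores = fit_bradley_terry(battles)            # scores[0] > scores[1] > scores[2]
```

The point of the exercise: pairwise judgments are cheap and reliable to collect, yet the fit produces a single calibrated score per document — a much richer training signal than binary relevant/not-relevant labels.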
When You Don’t Need a Reranker
Not every application justifies the added latency:
| Skip the reranker when… | Use a reranker when… |
|---|---|
| Accuracy is not a priority (browsing, exploration) | Result quality directly affects output quality (RAG, agents) |
| You need sub-10ms total latency | You can tolerate ~50–100ms for reranking |
| You’re returning hundreds or thousands of results | You’re returning a focused top-k of 10–100 results |
| Your queries are simple keyword lookups | Your queries are natural language, nuanced, or domain-specific |
For most production RAG systems, the answer is straightforward: the cost of a bad retrieval — a hallucinated answer, a missed clause, a wrong recommendation — far exceeds the cost of 50ms of reranking.
Get Started
Try zembed-1 and zerank-2 together in your search pipeline.
```python
from zeroentropy import ZeroEntropy

ze = ZeroEntropy()

# Stage 1: Embed and retrieve
query_embedding = ze.models.embed(
    model="zembed-1",
    input_type="query",
    input="What are the tax implications of stock options?",
)

# Stage 2: Rerank the top candidates
results = ze.models.rerank(
    model="zerank-2",
    query="What are the tax implications of stock options?",
    documents=candidate_documents,  # from your vector DB
)
```

Documentation: docs.zeroentropy.dev
HuggingFace: huggingface.co/zeroentropy
Get in touch: Discord community or contact@zeroentropy.dev
