- Every embedding model is a bi-encoder: it encodes queries and documents independently, which is what makes vector search fast — and what limits its accuracy.
- Every reranker is a cross-encoder: it reads the query and document together, which is what makes reranking precise — and what makes it too expensive for first-stage retrieval.
- Production search uses both. The bi-encoder retrieves candidates in milliseconds; the cross-encoder rescores them in order of actual relevance. This is the retrieve-then-rerank pattern, and it is how virtually every serious search and RAG system works today.
Two Architectures, One Pipeline
Every modern search and RAG system runs two models in sequence: an embedding model for fast retrieval, and a reranker for precise scoring. These correspond to two fundamentally different transformer architectures — bi-encoders and cross-encoders — with different strengths, different costs, and a well-understood division of labor in production.
This post covers what each architecture does, where each one breaks, and why the combination outperforms either alone.
Bi-Encoders and Cross-Encoders
A bi-encoder encodes query and document independently into vectors, then compares them via cosine similarity. This is what embedding models do — OpenAI’s text-embedding-3-large, Cohere’s embed-v4.0, voyage-4, and our own zembed-1. Because document embeddings are pre-computed, retrieval over millions of documents takes single-digit milliseconds.
A cross-encoder concatenates query and document into a single sequence and passes them through a transformer together. This is what rerankers do — Cohere Rerank, Voyage Rerank, and our own zerank-2. Because every query token attends to every document token, a cross-encoder runs one forward pass per (query, document) pair.
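The two compute patterns can be sketched in a few lines. Everything below is a toy illustration of the *interfaces*, not of retrieval quality: `encode` stands in for an embedding model (hashed bag-of-words, not learned vectors) and `score_pair` stands in for a cross-encoder forward pass (token overlap, not attention):

```python
import numpy as np

def encode(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for an embedding model: hashed bag-of-words unit vector."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

# Bi-encoder: documents are encoded once, ahead of time...
docs = ["the cat sat on the mat", "stocks rallied after earnings"]
doc_vecs = np.stack([encode(d) for d in docs])   # precomputed offline

# ...so each query costs one encode plus a matrix-vector product.
query = "how did stocks move after earnings"
bi_scores = doc_vecs @ encode(query)             # cosine similarity

# Cross-encoder: one joint forward pass per (query, document) pair.
def score_pair(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder forward pass (Jaccard overlap)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

cross_scores = [score_pair(query, d) for d in docs]
```

The asymmetry is the whole story: the bi-encoder's per-query cost is independent of how the documents were produced (they're just vectors), while the cross-encoder must re-read every document for every query.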
The architectural difference is easiest to see on negation. Given the query “companies that did not go bankrupt” and two documents — one about a company filing for Chapter 11, one about a company reporting record profits:
| | Bankruptcy doc | Profitable doc | Δ |
|---|---|---|---|
| Embedding model | 0.57 | 0.57 | 0.00 |
| Reranker | 0.30 | 0.54 | +0.24 |
The embedding model cannot tell which document matches. The reranker can.
Bi-encoders scale to billions of documents. Cross-encoders score one pair at a time. Production search uses both.
Retrieve-Then-Rerank
In practice, the two architectures are complementary. The standard production pattern chains them:
The bi-encoder retrieves the top 100 candidates from the full corpus in ~5ms. The cross-encoder rescores those 100 candidates in ~50ms. The reranked top 10 go to your application.
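The whole pattern is a few lines of code. A minimal sketch with synthetic embeddings and a placeholder scoring function standing in for the real models (in production, stage 1 would hit your vector database and stage 2 would call a reranker):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: precomputed document embeddings and one query embedding.
corpus = rng.normal(size=(100_000, 256))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query_vec = rng.normal(size=256)
query_vec /= np.linalg.norm(query_vec)

# Stage 1: bi-encoder retrieval — one dot product per document, top 100 survive.
similarities = corpus @ query_vec
candidate_ids = np.argsort(-similarities)[:100]

# Stage 2: cross-encoder rescoring — one forward pass per surviving candidate.
def cross_encoder_score(doc_id: int) -> float:
    """Placeholder for a reranker forward pass over (query, document)."""
    return float(similarities[doc_id] + rng.normal(scale=0.05))

reranked = sorted(candidate_ids, key=cross_encoder_score, reverse=True)
top_10 = reranked[:10]
```

Note where the costs land: stage 1 is a single matrix-vector product over the full corpus; stage 2 runs the expensive model only 100 times, never 100,000.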
How Much Does Reranking Actually Move the Needle?
On our evaluations, adding zerank-2 on top of zembed-1 retrieval improves NDCG@10 by 5-20% across verticals. The gains are largest exactly where you’d expect: legal, healthcare, and finance, where queries are nuanced and keyword overlap is a poor proxy for relevance.
The aggregate number understates what happens at the individual query level. The reranker corrects two specific failure modes:
False positives demoted. Documents ranked highly by the bi-encoder due to vocabulary overlap with the query, despite not actually answering it. These get pushed down or out of the top 10 entirely.
False negatives promoted. Documents that are genuinely relevant but were under-scored by the bi-encoder because their surface text diverges from the query. A document ranked 47th by the bi-encoder can jump to position 3 after reranking. Without the cross-encoder, that document would never enter the context window.
In our voyage-4 comparison, when we looked at documents where zembed-1 and voyage-4 disagree on ranking and asked three separate LLMs to judge which ordering was correct, zembed-1 was preferred by a 15–20% margin. When filtered to top-10 disagreements — the ones that actually affect your application — the gap widened to 27–33%.
First-stage retrieval quality determines the ceiling for the reranker. Reranker quality determines whether that ceiling is reached.
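For readers who want to reproduce NDCG@10 numbers like the ones above on their own data, the metric itself is short: gains for relevant documents, discounted logarithmically by rank, normalized against the ideal ordering. A minimal implementation:

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the ranking, normalized by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels of retrieved docs, in ranked order (3 = highly relevant).
before_rerank = [0, 3, 1, 0, 2, 0, 0, 1, 0, 0]
after_rerank  = [3, 2, 1, 1, 0, 0, 0, 0, 0, 0]  # matches the ideal order: NDCG@10 = 1.0
```

Because the discount is steepest at the top, moving one highly relevant document from rank 2 to rank 1 moves NDCG@10 more than any shuffle further down — which is exactly why reranking the top of the list pays off.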
Why zembed-1 and zerank-2 Are Trained Together
Most embedding models are trained on binary labels: relevant or not relevant. Most rerankers are trained independently with their own methodology. Mixing providers means the two stages may disagree on what “relevant” means.
zembed-1 was distilled directly from zerank-2, which was trained with our zELO methodology — pairwise battles between documents that produce continuous, calibrated Elo scores from 0 to 1.
When the bi-encoder’s notion of relevance diverges from the cross-encoder’s, the reranker spends its budget correcting disagreements rather than refining an already-good ranking. Shared training avoids this.
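The details of zELO aren't spelled out here, but the core move — turning pairwise battles into continuous per-document scores — is the classic Bradley-Terry model. The sketch below is an illustrative gradient-ascent fit of that model (not ZeroEntropy's actual implementation), squashing the latent strengths to (0, 1) at the end:

```python
import numpy as np

def fit_bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """Fit latent strengths from a pairwise win-count matrix.
    wins[i, j] = number of battles document i won against document j."""
    n = wins.shape[0]
    theta = np.zeros(n)                        # latent "Elo" strengths
    total = max(1.0, (wins + wins.T).sum())
    for _ in range(iters):
        # P(i beats j) under the model: sigmoid(theta_i - theta_j).
        p = 1.0 / (1.0 + np.exp(theta[None, :] - theta[:, None]))
        # Gradient of the Bradley-Terry log-likelihood w.r.t. theta.
        grad = (wins - (wins + wins.T) * p).sum(axis=1)
        theta += 0.1 * grad / total
        theta -= theta.mean()                  # fix the translation freedom
    return 1.0 / (1.0 + np.exp(-theta))        # squash to a (0, 1) score

# Doc 0 usually beats doc 1; doc 1 usually beats doc 2.
battles = np.array([[0, 9, 10],
                    [1, 0,  8],
                    [0, 2,  0]], dtype=float)
scores = fit_bradley_terry(battles)            # scores[0] > scores[1] > scores[2]
```

The point of the exercise: pairwise judgments are cheap and reliable to collect, yet the fit produces a single calibrated score per document — a much richer training signal than binary relevant/not-relevant labels.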
When You Don’t Need a Reranker
Not every application justifies the added latency:
| Skip the reranker when… | Use a reranker when… |
|---|---|
| Accuracy is not a priority (browsing, exploration) | Result quality directly affects output quality (RAG, agents) |
| You need sub-10ms total latency | You can tolerate ~50–100ms for reranking |
| You’re returning hundreds or thousands of results | You’re returning a focused top-k of 10–100 results |
| Your queries are simple keyword lookups | Your queries are natural language, nuanced, or domain-specific |
For most production RAG systems, the answer is straightforward: the cost of a bad retrieval — a hallucinated answer, a missed clause, a wrong recommendation — far exceeds the cost of 50ms of reranking.
Get Started
Try zembed-1 and zerank-2 together in your search pipeline.
```python
from zeroentropy import ZeroEntropy

ze = ZeroEntropy()

# Stage 1: Embed and retrieve
query_embedding = ze.models.embed(
    model="zembed-1",
    input_type="query",
    input="What are the tax implications of stock options?",
)

# Stage 2: Rerank the top candidates
results = ze.models.rerank(
    model="zerank-2",
    query="What are the tax implications of stock options?",
    documents=candidate_documents,  # from your vector DB
)
```

Documentation: docs.zeroentropy.dev
HuggingFace: huggingface.co/zeroentropy
Get in touch: Discord community or contact@zeroentropy.dev
