Topic · 21 concepts

Evaluation

How to measure retrieval quality and trust the numbers.

Retrieval quality lives or dies on what you measure. The concepts below cover the metrics that drive every reranker and embedding leaderboard — NDCG@K, Recall@K, MRR — plus the public benchmarks that report them (most prominently MTEB). Beyond mechanics, the harder skill is knowing which metric to optimize for your downstream use case, which benchmarks generalize, and where leaderboard numbers diverge from production performance. Most retrieval systems get tuned against the wrong metric for years; the fix is usually one chart away.

BEIR Benchmark

BEIR is a heterogeneous benchmark of 18 retrieval datasets across domains — biomedical, news, finance, scientific QA, fact-checking — designed to test zero-shot retrieval. The standard reference for whether a retriever generalizes beyond MS MARCO.
Calibration-Discrimination Analysis

When you compare two scoring systems on the same items — index score vs reranker score, model score vs ground-truth grade — the residuals from a regression line tell you where they disagree.
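
As a rough illustration of the residual idea, the sketch below fits an ordinary least-squares line between two score lists and ranks items by how far they sit from it. The choice of plain least squares and the names `fit_line` and `residuals` are assumptions made for this example, not something the description above specifies.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residuals(xs, ys):
    """Signed distance of each item from the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# Hypothetical scores for five items from two scoring systems.
index_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
reranker_scores = [1.1, 1.9, 3.0, 6.0, 5.0]

res = residuals(index_scores, reranker_scores)
# Items with the largest absolute residual are where the two systems disagree most.
most_disputed = max(range(len(res)), key=lambda i: abs(res[i]))
```

Sorting items by absolute residual gives a short review list of the cases worth inspecting by hand.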
Citation Extraction

Citation extraction maps each claim in an LLM-generated answer back to the supporting span in the source documents. Distinct from generation — often a small specialized model — and what makes RAG outputs auditable.
Classical Test Theory

The 100-year-old psychometric toolkit that ML eval research mostly ignored. Decompose every observed score into $X = T + E$ (true score + error).
Cohen's Kappa

Cohen's $\kappa = (p_o - p_e) / (1 - p_e)$ — observed agreement minus chance agreement, normalized. The standard inter-annotator agreement metric. Raw % agreement is misleading on imbalanced classes; kappa is the honest version.
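
A minimal sketch of the formula, assuming two annotators labelled the same items in the same order. The function name `cohens_kappa` and the example labels are illustrative only.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # p_o: fraction of items where the annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: agreement expected by chance from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # both annotators used a single identical label
    return (p_o - p_e) / (1 - p_e)

# Imbalanced example: 95% of items are "irr" for both annotators,
# but they never agree on which items are "rel".
annotator_a = ["rel"] * 5 + ["irr"] * 95
annotator_b = ["irr"] * 5 + ["rel"] * 5 + ["irr"] * 90

raw_agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / 100
kappa = cohens_kappa(annotator_a, annotator_b)
```

Here raw agreement is 0.90 while kappa comes out slightly negative, which is the imbalanced-class effect described above.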
Cronbach's Alpha

$\alpha = (k/(k-1)) \cdot (1 - \sum_q \sigma^2_q / \sigma^2_X)$ — the aggregate internal-consistency reliability number that falls out of CTT.
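
The formula can be sketched as follows, assuming a score matrix where each row is one evaluated system (or subject) and each column is one of the $k$ items $q$. The layout and the name `cronbach_alpha` are assumptions for this example.

```python
from statistics import variance

def cronbach_alpha(scores):
    """scores[s][q] is the score of subject s on item q."""
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two subjects")
    # sigma^2_q: variance of each item across subjects.
    item_vars = [variance([row[q] for row in scores]) for q in range(k)]
    # sigma^2_X: variance of the per-subject total score.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Two items that always agree: internal consistency is perfect.
consistent = [[1, 1], [0, 0], [1, 1], [0, 0]]
alpha_consistent = cronbach_alpha(consistent)

# Two items that always disagree: alpha falls far below zero.
inconsistent = [[1, 0], [0, 1], [1, 0], [0, 1], [1, 1]]
alpha_inconsistent = cronbach_alpha(inconsistent)
```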
Eval Set Quality

A practical diagnostic checklist for *is this benchmark actually any good?* Layer four measurement-theory tools, among them CTT (Classical Test Theory) for per-item pathologies and Cronbach's $\alpha$ for aggregate reliability.
F1 Score

The F1 score is the harmonic mean of precision and recall — a single number that punishes lopsided performance. Standard for classification, rare in retrieval, where ranked metrics like NDCG@K are usually the better choice.
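
A small sketch of the harmonic-mean behavior, computed from raw counts. The function name and the example counts are illustrative.

```python
def f1_score(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Lopsided system: precision 0.9, recall 0.1 (arithmetic mean would be 0.5).
lopsided = f1_score(tp=9, fp=1, fn=81)

# Balanced system: precision 0.5, recall 0.5.
balanced = f1_score(tp=45, fp=45, fn=45)
```

The lopsided system scores 0.18 against 0.5 for the balanced one, even though both have the same arithmetic mean of precision and recall.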
Faithfulness

Faithfulness is whether each claim in an LLM's answer is actually supported by the retrieved context. Distinct from relevance and from accuracy.
Graded Relevance LLM Judge

An LLM-as-judge configured to emit *graded* relevance — typically a 0-3 scale (irrelevant / marginal / relevant / highly relevant) rather than a binary yes/no.
Isotonic Regression

Isotonic regression fits a non-parametric monotone function from raw scores to calibrated probabilities. More flexible than Platt scaling — handles any monotone miscalibration shape — at the cost of needing more labels and being prone to overfitting at the score-distribution tails.
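
One common way to fit such a monotone function is the pool-adjacent-violators algorithm. The sketch below is a simplified version that assumes binary labels and treats tied scores as separate points; the names `fit_isotonic` and `predict_isotonic` are hypothetical.

```python
from bisect import bisect_right

def fit_isotonic(scores, labels):
    """Pool-adjacent-violators: returns sorted scores and non-decreasing fitted values."""
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block is [sum of labels, number of points]
    for _, y in pairs:
        blocks.append([float(y), 1])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    xs = [x for x, _ in pairs]
    return xs, fitted

def predict_isotonic(xs, fitted, score):
    """Step-function lookup: value at the last training score <= `score`."""
    i = bisect_right(xs, score) - 1
    return fitted[max(i, 0)]

raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
relevant = [0, 1, 0, 0, 1, 1]
xs, fitted = fit_isotonic(raw_scores, relevant)
p = predict_isotonic(xs, fitted, 0.35)
```

With only six labels the fitted curve is a coarse staircase, which is the label-hungry behavior the description above warns about.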
LLM-as-judge

Using a frontier LLM to score outputs (relevance, faithfulness, answer quality) at scale where human raters can't keep up. Powerful for graded labels, but it introduces position bias, verbosity bias, and model bias.
MAP (Mean Average Precision)

Mean Average Precision averages precision at each rank where a relevant document appears, then averages across queries. The older sibling of NDCG — comparable for binary relevance, weaker for graded relevance.
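
A minimal sketch, assuming each query's results are given as a ranked list of binary relevance flags. The optional `n_relevant` argument is an assumption added for the case where some relevant documents were never retrieved.

```python
def average_precision(ranked_relevance, n_relevant=None):
    """Average of precision values at each rank holding a relevant document."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    # Normalize by all relevant documents if known, else by those retrieved.
    total = n_relevant if n_relevant is not None else hits
    return precision_sum / total if total else 0.0

def mean_average_precision(rankings):
    return sum(average_precision(r) for r in rankings) / len(rankings)

ap_first = average_precision([1, 0, 1, 0])   # (1/1 + 2/3) / 2
ap_second = average_precision([0, 1, 0, 0])  # (1/2) / 1
map_score = mean_average_precision([[1, 0, 1, 0], [0, 1, 0, 0]])
```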
MRR (Mean Reciprocal Rank)

Mean Reciprocal Rank is the average of 1/rank across queries, where rank is the position of the first relevant document. Heavily front-loaded — only the top result really matters.
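
A short sketch under the same assumption of ranked binary relevance flags per query. Queries with no relevant result contribute 0, which is a common convention rather than something stated above.

```python
def mean_reciprocal_rank(rankings):
    """Average of 1/rank of the first relevant result across queries."""
    total = 0.0
    for ranked_relevance in rankings:
        for rank, rel in enumerate(ranked_relevance, start=1):
            if rel:
                total += 1.0 / rank
                break  # only the first relevant document counts
    return total / len(rankings)

mrr = mean_reciprocal_rank([
    [0, 1, 0],  # first relevant at rank 2 -> 1/2
    [1, 0, 0],  # first relevant at rank 1 -> 1
    [0, 0, 0],  # nothing relevant          -> 0
])
```

Moving a relevant document from rank 1 to rank 2 halves that query's contribution, while moving it from rank 9 to rank 10 barely changes it.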
MS MARCO

MS MARCO is Microsoft's web-search dataset of ~1M Bing queries paired with passages and human relevance judgments. The standard training corpus for retrievers and rerankers, and the starting point for most modern dense retrievers.
MTEB

Massive Text Embedding Benchmark — a public benchmark covering 50+ datasets across retrieval, classification, clustering, and more. The de facto leaderboard for embedding models, despite some well-documented limitations in its retrieval portion.
NDCG@K

Normalized Discounted Cumulative Gain at K is a ranking quality metric that rewards relevant documents appearing high in the result list, with logarithmic discounting for lower positions. The standard top-of-list quality metric for rerankers.
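
A sketch of the computation, assuming graded labels are used directly as gains with a $\log_2(\text{rank}+1)$ discount. Some implementations use $2^{rel}-1$ as the gain instead, so numbers may differ between libraries; the function names here are illustrative.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k positions."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_k(ranked_gains, k, all_gains=None):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    # If the full set of judged gains is not supplied, assume the ranking contains it.
    pool = all_gains if all_gains is not None else ranked_gains
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

perfect = ndcg_at_k([3, 2, 1, 0], k=4)
swapped_tail = ndcg_at_k([3, 2, 0, 1], k=4)  # mistake low in the list
swapped_head = ndcg_at_k([0, 2, 3, 1], k=4)  # mistake at the top
```

The same swap costs far more at the head of the list than at the tail, which is the logarithmic discount at work.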
Platt Scaling

Platt scaling fits a logistic sigmoid on top of a model's raw scores to produce calibrated probabilities. Cheap, two parameters, the standard first-resort calibration method for SVMs, other classifiers, and uncalibrated rerankers.
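
The two-parameter fit can be sketched as logistic regression on the raw score. This version uses plain gradient descent on log loss for readability; Platt's original procedure uses smoothed targets and a second-order optimizer, so treat this as an approximation with hypothetical names and hyperparameters.

```python
import math

def platt_probability(score, a, b):
    """Sigmoid of a * score + b."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

def fit_platt(scores, labels, lr=0.1, steps=5000):
    """Fit slope a and offset b by gradient descent on mean log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = platt_probability(s, a, b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Hypothetical raw reranker scores with binary relevance labels (not separable).
raw = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0]
rel = [0, 0, 1, 0, 1, 0, 1, 1]
a, b = fit_platt(raw, rel)
calibrated = [platt_probability(s, a, b) for s in raw]
```

Because the sigmoid is monotone, calibration changes the probabilities but never the ranking.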
Precision@K

Precision@K is the fraction of the top-K returned documents that are relevant. It is the classical IR metric that retrieval evaluation largely moved away from in favor of NDCG, but it is still the right choice when every position in the result list carries equal weight.
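
A minimal sketch, assuming a ranked list of binary relevance flags. Dividing by `k` even when fewer than `k` results were returned is a convention chosen for this example.

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k positions that hold a relevant document."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Positions beyond the end of a short list count as non-relevant.
    return sum(1 for rel in ranked_relevance[:k] if rel) / k

p_at_3 = precision_at_k([1, 0, 1, 1, 0], k=3)  # 2 of the top 3 are relevant
p_at_5 = precision_at_k([1, 0, 1, 1, 0], k=5)  # 3 of the top 5 are relevant
```

Reordering documents within the top K leaves the value unchanged, which is the equal-weight property noted above.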
Recall@K

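Recall@K is the fraction of a query's relevant documents that appear anywhere in the top-K results, averaged across queries. When each query has exactly one relevant document, this reduces to the fraction of queries whose relevant document was retrieved. It measures whether retrieval found the right documents at all, which makes it the silent ceiling on every downstream stage.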
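
A sketch assuming each query comes with a ranked list of document IDs and a set of relevant IDs. With a single relevant document per query, each per-query value is 0 or 1, so the mean is simply the share of queries whose document was found. The function names and IDs are illustrative.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of a query's relevant documents found in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("query has no relevant documents")
    return len(relevant.intersection(ranked_ids[:k])) / len(relevant)

def mean_recall_at_k(runs, k):
    """Average recall@k over (ranked_ids, relevant_ids) pairs."""
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)

runs = [
    (["d3", "d1", "d7"], {"d1"}),              # found at rank 2
    (["d2", "d9", "d4"], {"d5"}),              # never retrieved
    (["d8", "d6", "d2"], {"d6", "d2", "d0"}),  # 2 of 3 retrieved
]
recall_at_3 = mean_recall_at_k(runs, k=3)
recall_at_1 = mean_recall_at_k(runs, k=1)
```

A reranker can only reorder what the first stage returned, so the second query stays a miss no matter how good the reranker is.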
Statistical Significance in Retrieval Evals

Retrieval evals report metrics like NDCG@10 averaged across queries — but each query is one sample, and most public benchmarks have hundreds, not thousands. A '+0.5 NDCG' difference is often noise.
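
One way to check whether a gap is noise is a paired bootstrap over queries. The sketch below is an assumption-laden example: it resamples per-query score differences with replacement and reports a percentile confidence interval. The function name, the resample count, and the synthetic scores are all hypothetical.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Mean per-query difference (b - a) with a percentile bootstrap interval."""
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample queries with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = max(int((alpha / 2) * n_resamples), 0)
    hi_idx = min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)
    return sum(diffs) / n, means[lo_idx], means[hi_idx]

# Synthetic per-query NDCG@10 for two systems over 100 queries:
# system B is +0.005 better on average, but per-query swings are large.
system_a = [0.50] * 100
system_b = [0.70 if i % 2 == 0 else 0.31 for i in range(100)]

mean_diff, ci_low, ci_high = paired_bootstrap_ci(system_a, system_b)
gap_is_significant = not (ci_low <= 0.0 <= ci_high)
```

In this synthetic case the interval straddles zero, so the average gain would not count as a real improvement at this query count.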