Evaluation
How to measure retrieval quality and trust the numbers.
Retrieval quality lives or dies on what you measure. The concepts below cover the metrics that drive every reranker and embedding leaderboard — NDCG@K, Recall@K, MRR — plus the public benchmarks that report them (most prominently MTEB). Beyond mechanics, the harder skill is knowing which metric to optimize for your downstream use case, which benchmarks generalize, and where leaderboard numbers diverge from production performance. Most retrieval systems get tuned against the wrong metric for years; the fix is usually one chart away.
- BEIR Benchmark
BEIR is a heterogeneous benchmark of 18 retrieval datasets across domains — biomedical, news, finance, scientific QA, fact-checking — designed to test zero-shot retrieval. The standard reference for whether a retriever generalizes beyond MS MARCO.
- Calibration-Discrimination Analysis
When you compare two scoring systems on the same items — index score vs reranker score, model score vs ground-truth grade — the residuals from a regression line tell you where they disagree.
- Citation Extraction
Citation extraction maps each claim in an LLM-generated answer back to the supporting span in the source documents. Distinct from generation — often a small specialized model — and what makes RAG outputs auditable.
- Classical Test Theory
The 100-year-old psychometric toolkit that ML eval research mostly ignored. Decompose every observed score into $X = T + E$ (true score + error).
- Cohen's Kappa
Cohen's $\kappa = (p_o - p_e) / (1 - p_e)$ — observed agreement minus chance agreement, normalized. The standard inter-annotator agreement metric. Raw % agreement is misleading on imbalanced classes; kappa is the honest version.
- Cronbach's Alpha
$\alpha = (k/(k-1)) \cdot (1 - \sum_q \sigma^2_q / \sigma^2_X)$ — the aggregate internal-consistency reliability number that falls out of CTT.
- Eval Set Quality
A practical diagnostic checklist for *is this benchmark actually any good?* Layer four measurement-theory tools, including Classical Test Theory (CTT) for per-item pathologies and Cronbach's $\alpha$ for aggregate reliability.
- F1 Score
The F1 score is the harmonic mean of precision and recall — a single number that punishes lopsided performance. Standard for classification, rare in retrieval, where ranked metrics like NDCG@K are usually the better choice.
- Faithfulness
Faithfulness is whether each claim in an LLM's answer is actually supported by the retrieved context. Distinct from relevance and from accuracy.
- Graded Relevance LLM Judge
An LLM-as-judge configured to emit *graded* relevance — typically a 0-3 scale (irrelevant / marginal / relevant / highly relevant) rather than a binary yes/no.
- Isotonic Regression
Isotonic regression fits a non-parametric monotone function from raw scores to calibrated probabilities. More flexible than Platt scaling — handles any monotone miscalibration shape — at the cost of needing more labels and being prone to overfitting at the score-distribution tails.
- LLM-as-judge
Using a frontier LLM to score outputs (relevance, faithfulness, answer quality) at a scale where human raters can't keep up. Powerful for graded labels, but it introduces position bias, verbosity bias, and model bias.
- MAP (Mean Average Precision)
Mean Average Precision averages precision at each rank where a relevant document appears, then averages across queries. The older sibling of NDCG — comparable for binary relevance, weaker for graded relevance.
- MRR (Mean Reciprocal Rank)
Mean Reciprocal Rank is the average of 1/rank across queries, where rank is the position of the first relevant document. Heavily front-loaded: only the first relevant result counts, and the per-query score halves when that result slips from rank 1 to rank 2.
- MS MARCO
MS MARCO is Microsoft's web-search dataset of ~1M Bing queries paired with passages and human relevance judgments. The standard training corpus for retrievers and rerankers, and the starting point for most modern dense retrievers.
- MTEB
Massive Text Embedding Benchmark — a public benchmark covering 50+ datasets across retrieval, classification, clustering, and more. The de facto leaderboard for embedding models, despite some well-documented limitations in its retrieval portion.
- NDCG@K
Normalized Discounted Cumulative Gain at K is a ranking quality metric that rewards relevant documents appearing high in the result list, with logarithmic discounting for lower positions. The standard top-of-list quality metric for rerankers.
- Platt Scaling
Platt scaling fits a logistic sigmoid on top of a model's raw scores to produce calibrated probabilities. Cheap, with only two parameters, it is the standard first-resort calibration method for SVMs, other classifiers, and uncalibrated rerankers.
- Precision@K
Precision@K is the fraction of the top-K returned documents that are relevant. It is the classical IR metric that retrieval evaluation moved away from in favor of NDCG, but it is still the right choice when every position in the result list carries equal weight.
- Recall@K
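Recall@K is the fraction of a query's relevant documents that appear in the top-K results, averaged across queries. When each query has exactly one relevant document, it reduces to the fraction of queries whose relevant document appears anywhere in the top-K. It measures whether retrieval found the right documents at all, which makes it the silent ceiling on every downstream stage.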
- Statistical Significance in Retrieval Evals
Retrieval evals report metrics like NDCG@10 averaged across queries — but each query is one sample, and most public benchmarks have hundreds, not thousands. A '+0.5 NDCG' difference is often noise.
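The ranked-list metrics defined in the entries above (Precision@K, Recall@K, MRR, MAP, NDCG@K) can be sketched per query in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the toy document IDs are hypothetical, NDCG here assumes linear gain (the raw grade) with a log2 discount, and some evaluation toolkits use the exponential gain 2^grade - 1 instead, which gives different numbers. MRR and MAP are the means of `reciprocal_rank` and `average_precision` across queries.

```python
import math
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant (k >= 1 assumed)."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of this query's relevant documents found in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is returned."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of precision@rank over the ranks where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def ndcg_at_k(ranked: List[str], grades: Dict[str, int], k: int) -> float:
    """NDCG@k with linear gain and log2 discount (an assumption, see above)."""
    def dcg(gains: List[int]) -> float:
        return sum(g / math.log2(pos + 1) for pos, g in enumerate(gains, start=1))

    actual = dcg([grades.get(doc, 0) for doc in ranked[:k]])
    ideal = dcg(sorted(grades.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Toy query: graded judgments on a 0-3 scale; unjudged docs count as 0.
    grades = {"d1": 3, "d2": 1, "d5": 2}
    relevant = {doc for doc, g in grades.items() if g > 0}
    ranked = ["d3", "d1", "d7", "d2", "d9"]

    print("P@5   ", precision_at_k(ranked, relevant, 5))
    print("R@5   ", recall_at_k(ranked, relevant, 5))
    print("RR    ", reciprocal_rank(ranked, relevant))
    print("AP    ", average_precision(ranked, relevant))
    print("NDCG@5", ndcg_at_k(ranked, grades, 5))
```

Averaging these per-query values across a query set gives the headline numbers that leaderboards report. Keeping the per-query values around, rather than only the mean, is what makes a later significance test or per-query chart possible.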
