Also known as: benchmark contamination, test-set leakage, eval contamination, train-test overlap
TL;DR
Data contamination occurs when evaluation data leaks into training data, inflating benchmark scores without improving real capability. It is detected via n-gram matching against eval sets, log-probability (membership inference) attacks, or behavioral tests.
Data contamination is the silent killer of LLM benchmarks: evaluation examples that appeared in the model's training corpus inflate its scores without reflecting real capability. The model isn't being measured on its ability to generalize; it's being measured on its ability to recall. Every major benchmark (MMLU, GSM8K, HumanEval, BBH, BIG-Bench, even retrieval evals like MS MARCO) has documented contamination in modern pretraining corpora, and the gap between "headline benchmark score" and "real capability" widens every year as more benchmarks become contaminated.
Why contamination is structural, not accidental
The contamination loop is mechanical:
A benchmark is published as a paper plus a dataset on GitHub or HuggingFace.
People write about it — leaderboards, blog posts, model cards, Stack Overflow questions, course materials.
Common Crawl scrapes those pages.
The next pretraining corpus filtered from Common Crawl (CC) includes the discussion pages.
The next model trains on those pages, seeing benchmark questions and (often) answers verbatim.
The benchmark doesn’t even need to be in the corpus directly — discussion of the benchmark is enough, and that discussion is unblockable without aggressive proper-name URL filters that hurt other content. Deduplication doesn’t help because each benchmark item typically appears once or a few times, well below the threshold for near-duplicate flagging.
Detection methods
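Three families of contamination tests (text overlap, log-probability attacks, and behavioral probes), with different strengths. The five techniques below fall into these families.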
Contamination detection techniques
N-gram exact match. Hash every 13-gram from the eval set and check each hash against the pretraining corpus. Catches verbatim leakage. The standard pre-training contamination index uses 13-grams; some labs use 7- or 8-grams for higher recall.
Substring overlap. Longest common substring per (eval example, training document). More expensive, but catches partially copied, lightly paraphrased, and partially translated leaks whose shared spans are shorter than the 13-gram window.
Membership inference / log-prob attacks. Min-k% probability (Shi et al. 2023) — examine the K% lowest-probability tokens of the eval example under the model. Contaminated examples have fewer surprisingly-low-probability tokens than uncontaminated ones because the model has memorized them.
Behavioral tests. Give the model only the prefix of a benchmark item and ask it to complete it; if it recovers the canonical answer (or the exact wording of the original) more often than chance would allow, it likely saw the item in training.
Counterfactual reformulation. Rewrite each eval item with surface changes (different numbers, different names, different phrasing) that preserve the difficulty of the task; large performance gaps between the original and reformulated items indicate memorization of the original.
The combination is more reliable than any single method. N-gram match is the default first pass because it’s cheap and produces an interpretable contamination index — “X% of MMLU items have a 13-gram match somewhere in the corpus.” Log-prob attacks and counterfactual reformulation catch what n-gram misses.
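The n-gram first pass can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the lowercase whitespace tokenization, the SHA-1 hashing, and the function names (`ngrams`, `build_eval_index`, `contamination_index`) are assumptions made for the example, and a real corpus would be streamed and sharded rather than held in a Python list.

```python
import hashlib
from typing import Iterable, Iterator


def ngrams(text: str, n: int = 13) -> Iterator[tuple[str, ...]]:
    """Yield word-level n-grams (assumed tokenization: lowercase + whitespace split)."""
    tokens = text.lower().split()
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def ngram_hash(gram: tuple[str, ...]) -> str:
    """Hash an n-gram so the index stores fixed-size keys instead of raw text."""
    return hashlib.sha1(" ".join(gram).encode("utf-8")).hexdigest()


def build_eval_index(eval_items: Iterable[str], n: int = 13) -> dict[str, set[int]]:
    """Map each n-gram hash to the indices of the eval items that contain it."""
    index: dict[str, set[int]] = {}
    for item_id, item in enumerate(eval_items):
        for gram in ngrams(item, n):
            index.setdefault(ngram_hash(gram), set()).add(item_id)
    return index


def contamination_index(eval_items: list[str], corpus_docs: Iterable[str], n: int = 13) -> float:
    """Fraction of eval items that share at least one n-gram with any corpus document.

    Items shorter than n tokens produce no n-grams and can never be flagged.
    """
    index = build_eval_index(eval_items, n)
    flagged: set[int] = set()
    for doc in corpus_docs:
        for gram in ngrams(doc, n):
            flagged |= index.get(ngram_hash(gram), set())
    return len(flagged) / len(eval_items) if eval_items else 0.0
```

Indexing the eval set and streaming the corpus past it keeps memory proportional to the benchmark, which is small, rather than to the corpus. Lowering `n` to 7 or 8 trades more false positives for higher recall.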
Min-k% probability in more detail: for a piece of text the model may or may not have seen, compute the per-token log-probability under the model. Sort tokens by log-probability ascending, take the bottom K% (typically 20%), and average their log-probabilities. The intuition: even on text the model has seen, most tokens are highly probable; what differs is that on unseen text, some tokens will be genuinely surprising (low probability), while on seen text the model has memorized even the unusual tokens.
Shi et al. 2023 (“Detecting Pretraining Data from Large Language Models”) showed this score reliably distinguishes seen vs. unseen passages from the model’s perspective, well above chance. The method is white-box-ish — you need log-probabilities, which open-weight models give you and most API models partially expose (via logprobs in completion responses).
The limitation: min-k% works on substantial passages (a paragraph or more); single-question benchmark items are too short for the statistic to be reliable. For benchmark contamination specifically, n-gram and behavioral tests are more practical.
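Given per-token log-probabilities, the min-k% statistic is short to compute. The sketch below assumes the log-probabilities have already been obtained from the model (how to get them depends on the model or API and is not shown); the function name `min_k_percent_score` is illustrative. It averages the selected values, as in Shi et al. 2023, so that passages of different lengths stay comparable.

```python
import math
from typing import Sequence


def min_k_percent_score(token_logprobs: Sequence[float], k: float = 0.2) -> float:
    """Mean log-probability of the k fraction of tokens with the lowest log-probability.

    Higher (closer to zero) scores mean fewer surprising tokens, which is
    what we would expect if the passage was seen during training.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    n_lowest = max(1, math.ceil(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n_lowest]
    return sum(lowest) / n_lowest
```

The score is only meaningful relative to a threshold or a reference distribution, for example scores from passages known to postdate the model's training cutoff. Choosing that threshold is a separate calibration step.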
Production implications
For a frontier-lab pretraining run, the contamination workflow is:
Pre-training contamination index. For every benchmark you plan to evaluate on, compute n-gram overlap of eval items against the pretraining corpus before training. Document it.
Decontamination. Either remove contaminated documents from the corpus, or accept the contamination and disclose it. Removing every benchmark mention is hard — the n-gram overlap can be in tangentially-related discussions.
Post-training audit. After the model finishes, re-run contamination checks on the trained model (log-prob attacks, behavioral tests). Compare benchmark scores on contaminated vs. uncontaminated subsets.
Held-out private evals. The only really trustworthy comparison is on data that was never on the public web — internal-only test sets, freshly authored evaluation items, or benchmarks released after the model’s pretraining cutoff.
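The post-training comparison of contaminated versus uncontaminated subsets reduces to a per-subset accuracy and the gap between them. A minimal sketch, assuming each eval item has already been graded and flagged by an earlier contamination check (the `ItemResult` record and `subset_accuracies` function are hypothetical names for this example):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ItemResult:
    correct: bool        # did the model answer this eval item correctly?
    contaminated: bool   # was the item flagged by the contamination check?


def subset_accuracies(results: list[ItemResult]) -> dict[str, Optional[float]]:
    """Accuracy on flagged and clean items, plus the gap (flagged minus clean)."""

    def accuracy(rows: list[ItemResult]) -> Optional[float]:
        return sum(r.correct for r in rows) / len(rows) if rows else None

    acc_flagged = accuracy([r for r in results if r.contaminated])
    acc_clean = accuracy([r for r in results if not r.contaminated])
    gap = None if acc_flagged is None or acc_clean is None else acc_flagged - acc_clean
    return {"contaminated": acc_flagged, "clean": acc_clean, "gap": gap}
```

A positive gap is suggestive rather than conclusive: flagged items may also differ in difficulty or topic, so the gap is best read alongside the other checks.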
For a smaller fine-tuning run on top of a base model, you have an additional contamination surface: your fine-tuning data. If your fine-tuning data overlaps with your eval set, you’ll see inflated post-training metrics that don’t survive deployment. Always train/eval split fine-tuning data with the same n-gram check.
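For the fine-tuning case, the same overlap idea runs in the other direction: drop training examples that share an n-gram with any eval item. The sketch below is illustrative only; the helper names, the whitespace tokenization, and the default `n=8` are assumptions, not a recommended setting.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of word-level n-grams (assumed tokenization: lowercase + whitespace split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def split_finetune_data(
    train_examples: list[str], eval_items: list[str], n: int = 8
) -> tuple[list[str], list[str]]:
    """Return (kept, dropped); dropped examples share an n-gram with some eval item."""
    eval_grams: set[tuple[str, ...]] = set()
    for item in eval_items:
        eval_grams |= word_ngrams(item, n)

    kept: list[str] = []
    dropped: list[str] = []
    for example in train_examples:
        if word_ngrams(example, n) & eval_grams:
            dropped.append(example)
        else:
            kept.append(example)
    return kept, dropped
```

Returning the dropped examples, rather than silently discarding them, makes it possible to inspect what was removed and to report the overlap rate.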
Refreshing benchmarks is the obvious response, and some do get refreshed: MMLU-Pro, GPQA, and FrontierMath were created partly to escape contamination of their predecessors. But every refresh has the same problem: the moment it becomes the load-bearing public benchmark, it goes on the web in discussions, leaderboards, and example outputs. The half-life resets but the dynamic is unchanged.
The structural alternatives are (a) private held-out benchmarks (Scale Eval, Vals, Anthropic’s internal evals) which require trust in the evaluator, (b) benchmarks released only with delayed answer keys, (c) benchmarks generated dynamically per query (so each model sees new items), or (d) capability-specific evals (red-teaming, coding-on-novel-tasks) that are hard to memorize. The frontier eval ecosystem in 2026 is shifting toward (a) and (d); (b) and (c) have logistical and validity problems.
What contamination explains
Contamination is the most likely explanation when:
A model jumps several points on a public benchmark without architectural or training changes that would predict it.
A model's lead over similar-sized peers on a benchmark exceeds the typical gap among those peers by more than the benchmark's noise floor.
A model performs much better on the original benchmark than on a counterfactual rewrite.
A model can complete eval items prompted only with their prefix.
When you observe these, the right next step is contamination detection — not celebration.
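The prefix-completion signal can be checked with a small harness. This is a sketch under stated assumptions: `complete` stands in for whatever function calls your model and returns its continuation (it is not a real library API), and the 50% prefix split and 0.8 similarity threshold are arbitrary illustrative choices.

```python
from difflib import SequenceMatcher
from typing import Callable


def prefix_completion_rate(
    items: list[str],
    complete: Callable[[str], str],
    prefix_fraction: float = 0.5,
    threshold: float = 0.8,
) -> float:
    """Fraction of items whose held-back suffix the model reproduces nearly verbatim."""
    hits = 0
    scored = 0
    for item in items:
        words = item.split()
        cut = int(len(words) * prefix_fraction)
        if cut == 0 or cut == len(words):
            continue  # too short to split into a prefix and a suffix
        prefix = " ".join(words[:cut])
        suffix = " ".join(words[cut:])
        # Compare only as many characters as the true suffix contains.
        continuation = complete(prefix).strip()[: len(suffix)]
        similarity = SequenceMatcher(None, suffix, continuation, autojunk=False).ratio()
        scored += 1
        if similarity >= threshold:
            hits += 1
    return hits / scored if scored else 0.0
```

To judge whether a rate is above chance, run the same harness on freshly written items the model cannot have seen and compare the two rates.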
Go further
Which contamination detection method should I actually run?
N-gram exact-match (typically 13-gram) is the standard first pass: fast, easy to ship, low false-positive rate. Augment it with the min-k% probability test (Shi et al. 2023) on longer passages, and with behavioral or counterfactual tests for paraphrased contamination that n-grams miss. For frontier-scale work, run a contamination index across every benchmark you plan to evaluate on, before training. Detection after the fact rarely changes anything.
Why does benchmark contamination keep happening despite years of awareness?
Public benchmarks are on the public web; pretraining corpora are scraped from the public web. The benchmark questions, often as hyperlinked discussions, leaderboard pages, or model-card examples, appear in Common Crawl. Even with explicit blocklists, paraphrases, partial leaks, and translated versions slip through. The only robust fix is private held-out evaluation, which has its own credibility problems.
How much does contamination actually inflate scores?
On clean cases, contamination of MMLU questions adds 5-15 percentage points; GSM8K can move 10-20 points. The 2023 contamination audits of GPT-4 and Llama-2 found MMLU contamination rates of 10-30% by item. Not every contaminated item is recovered verbatim — but the lift is large enough that benchmark deltas under 5 points across models are dominated by contamination noise, not capability differences.