Deduplication

Also known as: dedup, near-duplicate detection, MinHash dedup, document deduplication

TL;DR

Removing exact and near-duplicate documents from a training corpus. Exact dedup uses content hashing; near-dedup uses MinHash, SimHash, or embedding-based similarity.

Deduplication removes documents from a training corpus that are exact or near-duplicates of others already present. In reproducible pretraining-corpus ablations since 2021, it has been the single highest-leverage filtering step: a dedup pass typically improves downstream loss by more than any quality filter, while reducing token volume by 50-70%. Lee et al. 2021 (“Deduplicating Training Data Makes Language Models Better”) is the load-bearing reference; FineWeb, RefinedWeb, and DCLM all build on its findings.

Why duplicates are toxic to training

A model trained on a corpus where document D appears 50 times learns D 50× more strongly than a document appearing once. That has three measurable bad consequences:

Memorization spikes. The model verbatim-reproduces duplicated content at sampling time. Lee et al. showed exact reproduction rates drop ~10× after dedup at fixed compute.
Overweighted opinions. A widely-republished article — a press release, a syndicated story — biases the model’s distribution toward whatever it says. Frequently-mirrored content (often spam or low-quality SEO copy) dominates.
Wasted compute. Each duplicate is a forward+backward pass on the same gradient signal. Token budget spent on duplicates is token budget not spent on novel data.

Empirically: at a fixed budget of 1T training tokens, a model trained on the deduplicated corpus reaches lower validation loss than one trained on the non-deduplicated corpus from the same source, even though deduplication leaves fewer total documents to draw from. Duplicates aren’t just wasted compute; they actively hurt.

Three layers of dedup

The dedup hierarchy

Exact dedup. SHA-256 (or murmur3) hash of normalized document content. Catches verbatim copies. Cheap, ~O(N) time. Removes 10-30% of typical web crawls.
Near-duplicate dedup (MinHash / SimHash). Locality-sensitive hashes that bucket similar documents. Catches paraphrased reposts, template variations, partial overlaps. Removes another 30-50%.
Embedding-based semantic dedup. Vector embedding of each document; cluster or cosine-similarity threshold. Catches meaningfully-similar documents that aren’t textually similar — articles about the same event, paraphrases at the sentence level. Expensive at trillion-token scale; less standard, more experimental.

The standard FineWeb / RefinedWeb / DCLM recipe is exact + MinHash. Embedding-based dedup is used in some specialized pipelines (e.g. SemDeDup) but hasn’t become the default at frontier scale.
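The exact-dedup layer can be sketched in a few lines. This is a minimal illustration, not any specific pipeline's code: the normalization steps (Unicode NFKC, lowercasing, whitespace collapsing) are an assumption, since the text above does not say what "normalized" means, and the names `normalize` and `exact_dedup` are hypothetical.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Assumed normalization: NFKC, lowercase, collapse whitespace runs."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def exact_dedup(docs):
    """Keep the first occurrence of each distinct normalized document."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).digest()
        if digest not in seen:  # one set lookup per document, so ~O(N) overall
            seen.add(digest)
            kept.append(doc)
    return kept


docs = ["Hello  world.", "hello world.", "Something else."]
unique_docs = exact_dedup(docs)  # the second document is dropped as a copy
```

How aggressive the normalization is decides what counts as "exact": the more it strips, the more copies collapse to one hash, at the cost of merging documents that differ only in casing or spacing.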

How MinHash works (mechanically)

MinHash estimates the Jaccard similarity between two sets — here, the sets of token n-grams (typically 5-grams) of two documents. Two near-duplicate documents share most of their n-grams, so their Jaccard is high; unrelated documents share few, so theirs is low.
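As a concrete reference point, here is the exact Jaccard computation that MinHash approximates. This is a sketch that assumes whitespace-split word 5-grams; a real pipeline would use its own tokenizer, and the function names are illustrative.

```python
def shingles(text: str, n: int = 5) -> set:
    """Set of word n-grams; whitespace splitting stands in for a real tokenizer."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """|A intersect B| / |A union B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the old mill"
doc_c = "stock markets closed higher on friday after a volatile trading week"

sim_near = jaccard(shingles(doc_a), shingles(doc_b))  # 7 shared of 11 distinct 5-grams
sim_far = jaccard(shingles(doc_a), shingles(doc_c))   # no shared 5-grams
```

Note how quickly n-gram Jaccard falls: changing only the last two words of a 13-word text already brings the score down to 7/11, roughly 0.64. Short documents are therefore much more sensitive to small edits than long ones.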

The trick: instead of comparing every pair of documents (which is O(N²) and infeasible at web scale), MinHash builds a compact signature for each document. It applies many random hash functions to the n-gram set and, for each function, keeps the minimum hash value across all n-grams. For any single hash function, two documents produce the same minimum with probability equal to their Jaccard similarity, so documents with high Jaccard agree on many signature positions and documents with low Jaccard agree on few. The signature is then split into bands of several values each (LSH banding), and documents are bucketed by band; only documents that land in the same bucket for at least one band are candidate duplicates.
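The mechanics can be sketched in pure Python. Everything specific here is an assumption made for illustration: the affine hash family modulo a Mersenne prime, the 32-value signature, the 8 bands of 4 rows, and all function names. Production pipelines use far more careful engineering.

```python
import hashlib
import random
from collections import defaultdict

MERSENNE_PRIME = (1 << 61) - 1  # modulus for the assumed affine hash family


def word_ngrams(text, n=5):
    """Set of word n-grams; whitespace splitting stands in for a real tokenizer."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}


def make_hash_params(num_hashes, seed=0):
    """Random (a, b) pairs, each defining h(x) = (a * x + b) mod p."""
    rng = random.Random(seed)
    return [(rng.randrange(1, MERSENNE_PRIME), rng.randrange(MERSENNE_PRIME))
            for _ in range(num_hashes)]


def base_hash(shingle):
    """Stable 64-bit hash (the built-in hash() is salted per process)."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(shingle_set, params):
    """For each hash function, keep the minimum value over all shingles."""
    bases = [base_hash(s) for s in shingle_set]
    return [min((a * x + b) % MERSENNE_PRIME for x in bases) for a, b in params]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which two documents agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands, rows):
    """Pairs of ids whose signatures agree on every row of at least one band."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        pairs.update((x, y) for i, x in enumerate(ordered) for y in ordered[i + 1:])
    return pairs


docs = {
    "orig": "The city council voted on Tuesday to approve the new transit plan after months of debate",
    "mirror": "the city council voted on tuesday to approve  the new transit plan after months of debate",
    "edited": "The city council voted on Tuesday to approve the new transit plan after weeks of debate",
    "other": "Researchers released a benchmark for measuring code generation quality in small models",
}
params = make_hash_params(num_hashes=32)
signatures = {k: minhash_signature(word_ngrams(v), params) for k, v in docs.items()}
pairs = lsh_candidate_pairs(signatures, bands=8, rows=4)
```

Candidate pairs are only candidates. A common follow-up is to verify each pair with `estimated_jaccard` (or the exact Jaccard) against the threshold, then keep one document per connected cluster of confirmed duplicates.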

The whole thing reduces a Θ(N²) problem to roughly O(N log N), with predictable recall on duplicates above the chosen Jaccard threshold. At trillion-token scale, this is the difference between feasible and infeasible.

The standard threshold is Jaccard similarity 0.7 to 0.85 over 5-gram or 13-gram sets. Below 0.7 you start dropping documents that are merely on the same topic; above 0.85 you miss many true near-duplicates that have been lightly rewritten.

The threshold is a precision/recall trade-off. A lower threshold catches more duplicates but risks dropping legitimately distinct articles that happen to share boilerplate (legal disclaimers, syndicated headers). FineWeb targets roughly 0.75 with 5-grams; RefinedWeb uses 0.8 with longer n-grams. The choice is empirical; teams ablate with a held-out judgment set.
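In an LSH-banded MinHash setup, the threshold is not set directly. It falls out of the number of bands b and rows per band r. Assuming each signature position agrees with probability equal to the Jaccard similarity s (the standard MinHash property), two documents become candidates with probability 1 - (1 - s^r)^b. The sketch below evaluates that curve; the 14 x 8 configuration is only an illustrative choice and the function names are hypothetical.

```python
def candidate_probability(s: float, bands: int, rows: int) -> float:
    """P(two docs with Jaccard s share at least one band) = 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - s ** rows) ** bands


def approx_threshold(bands: int, rows: int) -> float:
    """Rule-of-thumb location of the steep part of the S-curve: (1/b)^(1/r)."""
    return (1.0 / bands) ** (1.0 / rows)


# Illustrative configuration: 14 bands of 8 rows (112 hash functions in total).
curve = {s: round(candidate_probability(s, bands=14, rows=8), 3)
         for s in (0.5, 0.6, 0.7, 0.8, 0.9)}
steep_point = approx_threshold(bands=14, rows=8)
```

More rows per band push the curve to the right (stricter, higher precision); more bands push it to the left (looser, higher recall). Because the curve is an S-shape rather than a step, some pairs below the nominal threshold are still flagged and some above it are missed.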

Cross-snapshot dedup is its own decision. FineWeb dedups within each snapshot independently, then concatenates. This intentionally preserves cross-snapshot duplication, on the reasoning that the same page being recrawled monthly is a weak signal of importance. Other recipes (DCLM, RedPajama-V2) dedup globally across snapshots. The gap in downstream metrics between these choices is small but real.
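The two policies differ only in the scope of the "already seen" set. The sketch below uses exact string equality as a stand-in for whatever duplicate test the pipeline really applies (in practice, MinHash clusters); the snapshot names and page contents are made up.

```python
def dedup_within_snapshots(snapshots):
    """Dedup each snapshot on its own, then concatenate; cross-snapshot repeats survive."""
    kept = []
    for name, docs in snapshots.items():
        seen = set()  # reset for every snapshot
        for doc in docs:
            if doc not in seen:
                seen.add(doc)
                kept.append((name, doc))
    return kept


def dedup_globally(snapshots):
    """Share one seen-set across all snapshots; each document survives once overall."""
    seen = set()
    kept = []
    for name, docs in snapshots.items():
        for doc in docs:
            if doc not in seen:
                seen.add(doc)
                kept.append((name, doc))
    return kept


# Hypothetical snapshots: "page A" is recrawled in both.
snapshots = {
    "snapshot-1": ["page A", "page A", "page B"],
    "snapshot-2": ["page A", "page C"],
}
per_snapshot = dedup_within_snapshots(snapshots)
global_result = dedup_globally(snapshots)
```

With per-snapshot dedup, "page A" appears once per snapshot it was crawled in, so its weight in the final corpus grows with how often it is recrawled. With global dedup it appears exactly once.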

Embedding-based semantic dedup is mostly not used at pretraining scale, at least not yet. SemDeDup and similar methods compute document embeddings, cluster them, and remove documents that sit too close to another document in the same cluster. The cost is the embedding pass over the whole corpus (trillions of tokens), plus the clustering. With cheap embedders (ModernBERT, GTE-small) it’s becoming tractable but still adds significant infrastructure.
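A toy version of the idea, assuming the embeddings already exist. This sketch deliberately skips the clustering step and runs a greedy all-against-kept comparison, which is only viable for small corpora. It is not the SemDeDup algorithm itself; the 3-dimensional vectors and the 0.95 threshold are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def semantic_dedup(embeddings, threshold=0.95):
    """Greedy pass: keep a document only if it is not too close to any kept one."""
    kept_ids = []
    for doc_id, vec in embeddings.items():
        if all(cosine(vec, embeddings[k]) < threshold for k in kept_ids):
            kept_ids.append(doc_id)
    return kept_ids


# Hypothetical embeddings: two reports on the same event, one unrelated document.
embeddings = {
    "event_report_1": [0.90, 0.10, 0.05],
    "event_report_2": [0.88, 0.12, 0.06],
    "cooking_recipe": [0.05, 0.20, 0.95],
}
kept = semantic_dedup(embeddings)  # the second event report is dropped
```

Clustering first, as SemDeDup-style methods do, restricts the pairwise comparisons to documents inside the same cluster, which is what makes the approach scale beyond toy inputs.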

What semantic dedup catches that MinHash misses: paraphrases, translated reposts, articles about the same event with different wording. The empirical question is whether removing these helps. Some ablations (Abbas et al. 2023) show modest gains; others show neutral or slight regressions because semantic dedup also removes legitimate diversity (multiple news articles about the same topic from different perspectives).

For specialized fine-tuning corpora — instruction data, retrieval training data — semantic dedup is more clearly worthwhile because the data volume is small enough to make the embedding pass cheap and the diversity stakes per example are higher.

What dedup doesn’t solve

Dedup removes redundant supervision but doesn’t address content quality, factual accuracy, or contamination. A unique low-quality document survives dedup. A unique benchmark example leaked into pretraining survives dedup — that’s the contamination problem, which requires its own detection methods (n-gram match against eval sets, log-prob attacks).
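The n-gram match mentioned above can be sketched as a set lookup. This is a minimal illustration with assumed details: whitespace tokenization, a default window of 13 tokens, and hypothetical function names. The toy example uses a shorter window only so that the strings stay readable.

```python
def ngrams(text, n):
    """Set of word n-grams; empty if the text has fewer than n tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_eval_index(eval_examples, n=13):
    """Union of all n-grams that occur in any evaluation example."""
    index = set()
    for example in eval_examples:
        index |= ngrams(example, n)
    return index


def is_contaminated(doc, eval_index, n=13):
    """True if the document shares at least one n-gram with the eval set."""
    return not ngrams(doc, n).isdisjoint(eval_index)


# Toy data with a 6-token window (assumption made for brevity).
eval_examples = ["what is the capital of france the capital of france is paris"]
index = build_eval_index(eval_examples, n=6)

leaked = "quiz dump: what is the capital of france the capital of france is paris indeed"
clean = "paris is a city in france with many museums and cafes"
flags = (is_contaminated(leaked, index, n=6), is_contaminated(clean, index, n=6))
```

This check runs against the evaluation sets, not against the rest of the corpus, which is why a benchmark example that occurs only once passes straight through dedup.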

The honest framing: dedup is necessary but not sufficient. It is, however, the cheapest single-step quality improvement available — every team should run it first, before reaching for fancier filters.

Go further

How much of a typical web crawl is actually duplicate?

Roughly 60-80% by token count. Common Crawl alone has ~50% near-duplicates within a single snapshot and another 30-40% cross-snapshot. After exact-content dedup you've cut maybe 20-30%; the bulk of the savings comes from near-duplicate detection at MinHash threshold 0.85 or so.

Why is dedup the highest-leverage filter, ahead of quality classifiers?

A duplicate document trains the model to over-memorize that specific text and over-weight whatever views it expresses. Lee et al. 2021 showed dedup alone reduces memorization by 10× and improves perplexity. Quality classifiers help further but on a deduplicated base; running a quality filter on a duplicated corpus retains the duplicates that score well, multiplying the memorization problem.

What's the difference between MinHash and SimHash exactly?

MinHash estimates Jaccard similarity over sets (typically token n-gram sets). SimHash estimates cosine similarity over weighted feature vectors. Both produce locality-sensitive hashes that bucket similar documents together. MinHash dominates pretraining-scale dedup because Jaccard over n-grams is the right metric for 'mostly-the-same text'; SimHash is more common for deduplicating shorter documents, where weighted features matter more.
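For contrast with MinHash, here is a minimal SimHash sketch. The features are assumed to be lowercase words weighted by term frequency and the fingerprint is 64 bits; both are illustrative choices, and the function names are hypothetical.

```python
import hashlib
from collections import Counter

BITS = 64  # fingerprint width, matching the 8-byte feature hash below


def simhash(text: str) -> int:
    """Weighted bitwise vote over feature hashes, collapsed to one fingerprint."""
    weights = Counter(text.lower().split())  # assumed weighting: term frequency
    totals = [0] * BITS
    for feature, weight in weights.items():
        digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(BITS):
            totals[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i, total in enumerate(totals) if total > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits; small distances indicate similar documents."""
    return bin(a ^ b).count("1")


fp_a = simhash("the cat sat on the mat")
fp_b = simhash("on the mat the cat sat")   # same words, different order
fp_c = simhash("quarterly revenue grew faster than analysts expected")
distance_reordered = hamming(fp_a, fp_b)   # 0: the word-count features are identical
distance_unrelated = hamming(fp_a, fp_c)
```

The output is a single fixed-width fingerprint compared by Hamming distance, where MinHash produces a vector of minima compared position by position. With bag-of-words features the fingerprint ignores word order, which is one reason n-gram MinHash is the usual choice for 'mostly-the-same text'.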
