Data Curation

Also known as: corpus curation, data preparation, data cleaning

TL;DR

The umbrella discipline of preparing a corpus for training: filtering, deduplication, quality scoring, language ID, and classifier-based selection.

Data curation is the engineering discipline of turning a raw corpus into something a model can profitably train on. Filtering, deduplication, language identification, quality scoring, classifier-based selection, decontamination against evaluation sets, and topical reweighting all sit under the same umbrella. In 2026 it is the single biggest determinant of model quality at fixed compute, and the place where modern pretraining teams spend the bulk of their engineering hours.

The architecture stopped being a differentiator. The corpus is what’s left.

Why curation eats the schedule

Frontier labs run near-identical decoder-only transformers with similar optimizers, similar parallelism, and similar context lengths. What separates a top-tier base model from a mediocre one is what went into pretraining. FineWeb-Edu, a curated 1.3T-token slice of Common Crawl filtered by an educational-value classifier, produces a model that beats raw Common Crawl by 4-6 points on MMLU at fixed compute. That is a larger lift than any single architectural change shipped in the same period.

The corollary is that “more data” is not free. Adding 10T tokens of low-quality web text can hurt downstream performance versus 2T curated tokens — the gradient updates get spent learning artifacts instead of regularities. Curation is therefore a quality-versus-quantity tradeoff, and the curve has a peak.

The canonical pipeline

Stages of a modern curation pipeline

Source acquisition. Common Crawl WARC dumps, GitHub mirrors, ArXiv, Stack Exchange, books, scientific journals where licensing permits.
Text extraction. HTML to text via Trafilatura or Resiliparse; PDF via Nougat or marker.
Language ID. fastText or CLD3; drop documents below the confidence threshold for the target languages.
URL and domain filtering. Blocklists for porn, malware, toxic forums, paywalled content. UT1 is the public starting point.
Deduplication. Exact-match plus near-duplicate MinHash-LSH at line or paragraph granularity. Cuts 30-60% of typical web dumps.
Heuristic filters. Length, symbol-to-word ratio, repetition rate, stop-word density. Catches boilerplate and machine-generated junk.
Classifier filtering. A fastText model or a small transformer scores documents on a learned “educational value” target. FineWeb-Edu’s classifier is the canonical recipe.
Decontamination. Strip n-gram overlaps with held-out evals so leaderboard numbers are trustworthy.
Mixing. Reweight sources to a target distribution; see data mixing.
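As an illustration of the heuristic-filter stage listed above, here is a minimal sketch of the four signals it names (length, symbol-to-word ratio, repetition rate, stop-word density). The stop-word list, the function names, and every threshold are assumptions chosen for the example, not values from any published pipeline.

```python
from collections import Counter

# Tiny illustrative stop-word list (assumption; real pipelines use fuller lists).
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "is", "that", "it", "for"}


def heuristic_signals(text: str) -> dict:
    """Compute cheap per-document signals used by heuristic filters."""
    words = text.split()
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Symbols: any character that is neither alphanumeric nor whitespace.
    symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    # Repetition: lines that are exact repeats of an earlier line.
    repeated = sum(count - 1 for count in Counter(lines).values())
    stop = sum(1 for w in words if w.lower().strip(".,;:!?\"'()") in STOP_WORDS)
    return {
        "n_words": len(words),
        "symbol_to_word": symbols / max(len(words), 1),
        "dup_line_fraction": repeated / max(len(lines), 1),
        "stop_word_fraction": stop / max(len(words), 1),
    }


def passes_heuristics(
    text: str,
    min_words: int = 50,                    # assumed threshold
    max_symbol_to_word: float = 0.1,        # assumed threshold
    max_dup_line_fraction: float = 0.3,     # assumed threshold
    min_stop_word_fraction: float = 0.05,   # assumed threshold
) -> bool:
    """Return True if the document clears every heuristic threshold."""
    s = heuristic_signals(text)
    return (
        s["n_words"] >= min_words
        and s["symbol_to_word"] <= max_symbol_to_word
        and s["dup_line_fraction"] <= max_dup_line_fraction
        and s["stop_word_fraction"] >= min_stop_word_fraction
    )
```

Navigation chrome such as a repeated "Home | About | Login" line fails on both the symbol ratio and the repeated-line fraction, while ordinary prose passes. In practice the thresholds would be tuned per language and per source.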

For near-duplicate removal, the dominant recipe is MinHash-LSH at the document or paragraph level. Each document is shingled into overlapping 5-grams, and each shingle set is hashed into a fixed-length MinHash signature (typically 128 or 256 values). The fraction of positions on which two signatures agree estimates the Jaccard similarity of the underlying shingle sets, so documents above a similarity threshold (usually set somewhere in the 0.7-0.85 range) agree on most of their signature values with high probability. Locality-sensitive hashing then buckets candidate pairs in approximately linear time instead of comparing all pairs.
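The MinHash-LSH recipe described above can be sketched with the standard library alone. This is an illustrative toy, not a production implementation: word-level 5-grams, salted BLAKE2b as the hash family, and the 16-band by 8-row split are all assumptions. With b bands of r rows, two documents with Jaccard similarity s become candidates with probability 1 - (1 - s^r)^b, so the 16 x 8 split puts the steep part of that curve near 0.7.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 128            # signature length
BANDS = 16                # assumed banding: 16 bands x 8 rows
ROWS = NUM_PERM // BANDS


def shingles(text: str, n: int = 5) -> set:
    """Overlapping word n-grams (word-level shingling is an assumption)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(seed: int, shingle: str) -> int:
    """One member of the hash family: BLAKE2b salted with the seed."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set: set) -> list:
    """For each seed, keep the minimum hash over all shingles."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(NUM_PERM)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures: dict) -> set:
    """Bucket documents by band; any shared bucket makes a candidate pair."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(BANDS):
            band = tuple(sig[b * ROWS:(b + 1) * ROWS])
            buckets[(b, band)].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def candidate_probability(s: float, bands: int = BANDS, rows: int = ROWS) -> float:
    """Probability that two documents with Jaccard similarity s collide."""
    return 1.0 - (1.0 - s ** rows) ** bands
```

Candidate pairs are usually verified afterwards, either with `estimated_jaccard` on the signatures or with an exact Jaccard computation, before one document of each pair is dropped.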

The practical wrinkle is that you dedup at multiple granularities — line-level catches navigation chrome and boilerplate, paragraph-level catches templated reviews, document-level catches mirrored sites. Cumulative cut on a Common Crawl dump is 50-70% of bytes. Layered dedup also reduces extractable-training-data rates measurably, with privacy and benchmark-leakage implications.
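A minimal sketch of the line-level pass, assuming the whole batch fits in memory: count how many documents each normalized line appears in, then drop lines that recur in more than a chosen number of documents. The function names and the `max_doc_count` cutoff are hypothetical.

```python
import hashlib
from collections import Counter


def _line_key(line: str) -> str:
    """Hash of the case-folded, whitespace-normalized line."""
    normalized = " ".join(line.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def strip_repeated_lines(docs: list, max_doc_count: int = 2) -> list:
    """Remove lines that occur in more than max_doc_count documents."""
    doc_freq = Counter()
    for doc in docs:
        # Use a set so a line repeated inside one document counts once.
        keys = {_line_key(ln) for ln in doc.splitlines() if ln.strip()}
        doc_freq.update(keys)

    cleaned = []
    for doc in docs:
        kept = [
            ln for ln in doc.splitlines()
            if ln.strip() and doc_freq[_line_key(ln)] <= max_doc_count
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

The same counting idea extends to paragraphs by splitting on blank lines instead of newlines. At web scale the counter would be sharded or replaced with an approximate structure.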

The classifier’s notion of “quality” is a value judgment encoded in the labels. FineWeb-Edu’s classifier was trained on Llama-3 judgments of educational value; Llama-3’s notion of educational value reflects its own training data. Iterate that loop and you get corpora that look more and more like the modal Llama-style response, with less long-tail variance and fewer dialectal voices.

The empirical case is strong: every recent open-weight pretraining run that does classifier filtering beats one that doesn’t. The long-term failure mode, however, is monoculture. Diversity of the underlying corpus is a partial mitigation; classifier ensembles trained on heterogeneous label sources are the better one.
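One way to picture the ensemble idea is a simple vote over several scorers. The sketch below is hypothetical throughout: the `Scorer` type, the vote rule, and the keyword-based toy scorers merely stand in for real classifiers trained on different label sources.

```python
from typing import Callable, Sequence

# Hypothetical interface: a scorer maps a document to a quality score in [0, 1].
Scorer = Callable[[str], float]


def ensemble_keep(
    doc: str,
    scorers: Sequence[Scorer],
    threshold: float = 0.5,   # assumed per-scorer acceptance threshold
    min_votes: int = 1,       # assumed number of scorers that must accept
) -> bool:
    """Keep a document if at least min_votes scorers rate it above threshold."""
    votes = sum(1 for score in scorers if score(doc) >= threshold)
    return votes >= min_votes


# Toy stand-ins for classifiers trained on heterogeneous labels.
def textbook_scorer(doc: str) -> float:
    return 1.0 if "theorem" in doc.lower() else 0.0


def howto_scorer(doc: str) -> float:
    return 1.0 if "step" in doc.lower() else 0.0
```

With `min_votes=1` the ensemble keeps anything that any one scorer values, which favors diversity; raising `min_votes` moves toward consensus and narrows the corpus again.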

Where curation underdelivers

The diminishing-returns regime hits eventually. Past a certain point, more aggressive filtering removes more signal than noise, and the corpus shrinks faster than the model improves. The right operating point depends on the model size and target compute budget: small models tolerate noisier corpora because they can’t memorize as much, while large models benefit more from aggressive filtering. Curation is the most important knob in pretraining and one of the easiest to over-tune.

Go further

Why did data curation become the dominant LLM engineering activity?

Because the modeling architecture stopped being a differentiator around 2023. Every frontier lab runs a near-identical decoder-only transformer with similar optimizers and similar parallelism. What separates a top-tier model from a mediocre one is the corpus that went in. Quality filtering of the kind used to build FineWeb-Edu produces a model that beats one trained on raw Common Crawl with the same compute and parameters.


What does a curation pipeline actually look like end-to-end?

Roughly: (1) source dump (Common Crawl WARC, GitHub, ArXiv, books); (2) text extraction and language ID; (3) URL and domain blocklists; (4) line-level and exact-document deduplication; (5) heuristic quality filters (length, symbol ratio, repetition); (6) classifier-based filtering for educational value; (7) decontamination against eval sets; (8) topical mixing weights. Each stage drops 10-90% of the input.
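To make the per-stage drop rates concrete, the per-document stages can be chained as keep-predicates with a running count. This is only a sketch: `run_pipeline` and the stage shape are hypothetical, and corpus-level stages such as deduplication do not fit a per-document predicate and would need their own step.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical stage shape: (name, predicate returning True to keep the doc).
Stage = Tuple[str, Callable[[str], bool]]


def run_pipeline(
    docs: Iterable[str], stages: List[Stage]
) -> Tuple[List[str], List[Tuple[str, int, int]]]:
    """Apply stages in order; report (name, docs_in, docs_out) per stage."""
    survivors = list(docs)
    report = []
    for name, keep in stages:
        before = len(survivors)
        survivors = [d for d in survivors if keep(d)]
        report.append((name, before, len(survivors)))
    return survivors, report


if __name__ == "__main__":
    toy_stages = [
        ("non_empty", lambda d: bool(d.strip())),
        ("min_length", lambda d: len(d.split()) >= 3),
    ]
    kept, stats = run_pipeline(
        ["", "too short", "this one is long enough", "   "], toy_stages
    )
    for name, n_in, n_out in stats:
        print(f"{name}: kept {n_out} of {n_in}")
```

Logging docs in and docs out per stage makes it easy to spot a filter that is silently removing far more than intended.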


What are the highest-leverage curation steps in practice?

Deduplication first: it removes the largest fraction of garbage and the largest fraction of memorization risk. Then classifier-based quality filtering (FineWeb-Edu's classifier, which scores educational value, is the canonical recipe). Then decontamination against the evals you actually report. The other steps matter, but these three move the needle most per engineering hour.
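Of the three, decontamination is the simplest to sketch: index the word n-grams of the eval sets, then flag any training document that shares one. The function names are hypothetical, and both the word-level tokenization and the default n of 13 are assumptions; real pipelines differ in normalization and in whether they drop the whole document or only the overlapping span.

```python
from typing import Iterable, Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lowercased word n-grams; empty if the text has fewer than n words."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_eval_index(eval_texts: Iterable[str], n: int = 13) -> Set[Tuple[str, ...]]:
    """Union of all n-grams appearing in the evaluation texts."""
    index: Set[Tuple[str, ...]] = set()
    for text in eval_texts:
        index |= word_ngrams(text, n)
    return index


def is_contaminated(doc: str, eval_index: Set[Tuple[str, ...]], n: int = 13) -> bool:
    """True if the document shares at least one n-gram with the eval index."""
    return not word_ngrams(doc, n).isdisjoint(eval_index)
```

The value of n passed to `is_contaminated` has to match the one used to build the index, otherwise nothing will ever match.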

