Pretraining Corpus

Also known as: pretraining data, training corpus, pretraining mixture

TL;DR

The trillion-token text mixture that an LLM consumes during pretraining. Composition (web, code, books, math, scientific papers) and mixing ratios are the two highest-leverage choices in pretraining.

A pretraining corpus is the multi-trillion-token mixture an LLM is trained on before any fine-tuning. It’s the single largest determinant of what the model knows, what it can do, and where it fails. Llama-3 trained on 15T tokens, DeepSeek-V3 on 14.8T, and GPT-4 on an undisclosed mixture that outside estimates put above 13T. Composition matters more than scale alone: at a fixed token budget, two models trained on different corpora can differ by 10+ points on downstream benchmarks.

Composition

Modern pretraining corpora are blends across roughly five buckets, with proportions that are themselves tuned hyperparameters.

A typical frontier-scale pretraining mixture

Filtered web text: 60-80%. Common Crawl distilled through quality filters. The bulk of token volume; FineWeb, RefinedWeb, and DCLM are the open exemplars.
Code: 5-15%. GitHub, The Stack, StackExchange. Drives both coding benchmarks and structural reasoning. Models trained on more code tend to be better at math, even on natural-language math problems.
Books and long-form prose: 5-10%. Project Gutenberg, Books3, licensed publishers. Provides long-range coherence that web text lacks.
Scientific and math: 1-5%. arXiv, PubMed, OpenWebMath, AlgebraicStack. Disproportionately load-bearing for reasoning capability per token.
Reference and curated: 1-10%. Wikipedia, textbooks, synthetic instruction data, distilled outputs. Highest quality-per-token; usually upsampled for multiple epochs.

The web fraction dominates by token count but contributes less per token than the curated tail. Frontier mixtures explicitly upsample the small high-quality buckets: Wikipedia might be repeated 3-5 times within a single pass over the full corpus, while each web page appears once.
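The effect of upsampling on the mixture can be sketched with a few lines of arithmetic. The source sizes below are hypothetical round numbers chosen for illustration, not figures from any published corpus, and `effective_mixture` is a name invented for this sketch.

```python
def effective_mixture(source_tokens, epochs):
    """Return each source's share of the tokens actually seen in training.

    source_tokens: unique tokens per source (any consistent unit).
    epochs: how many times each source is repeated (defaults to 1).
    """
    seen = {name: count * epochs.get(name, 1.0)
            for name, count in source_tokens.items()}
    total = sum(seen.values())
    return {name: count / total for name, count in seen.items()}


# Hypothetical sizes in billions of unique tokens (illustrative only).
source_tokens = {"web": 9000.0, "code": 1000.0, "wikipedia": 25.0}
epochs = {"web": 1.0, "code": 1.0, "wikipedia": 4.0}

shares = effective_mixture(source_tokens, epochs)
for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```

Under these assumed sizes, repeating Wikipedia four times raises its share of seen tokens from about 0.25% to about 1%, while the web and code shares shrink slightly to make room.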

Why mixing ratios are the load-bearing decision

A pretraining corpus is a budget. Every additional percentage point of code displaces a percentage point of something else. Empirical ablations from DeepSeek, Llama, and OLMo papers consistently show:

Doubling code share from 5% to 10%: coding benchmarks (HumanEval, MBPP) gain 5-15 points; natural-language perplexity nudges worse by 1-3%.
Adding 5% of curated math (OpenWebMath, ProofPile): math benchmarks (GSM8K, MATH) gain 10+ points at fixed compute.
Upsampling Wikipedia 4×: factuality on TriviaQA and NaturalQuestions improves; creative generation can become drier.

The honest framing: there is no neutral mixture. Every corpus encodes a bet on what the model should be good at. Frontier labs run hundreds of small-scale ablations to set the proportions; the recipe is one of the most closely guarded artifacts of an LLM project.
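The budget constraint described above can be made concrete with a small sketch. It assumes the simplest displacement rule, shrinking every other source proportionally when one share grows; real labs may instead take the displaced tokens from a specific bucket. The starting mixture and the helper name `set_share` are hypothetical.

```python
def set_share(mixture, source, new_share):
    """Set one source's share and rescale the rest so the total stays 1.

    Assumption: the displaced tokens are taken proportionally from all
    other sources. This is one possible rule, not a documented lab practice.
    """
    if not 0.0 <= new_share < 1.0:
        raise ValueError("new_share must be in [0, 1)")
    scale = (1.0 - new_share) / (1.0 - mixture[source])
    return {name: (new_share if name == source else share * scale)
            for name, share in mixture.items()}


# Hypothetical starting mixture within the ranges listed earlier.
mixture = {"web": 0.75, "code": 0.05, "books": 0.08,
           "sci_math": 0.04, "reference": 0.08}

doubled = set_share(mixture, "code", 0.10)
for name in mixture:
    print(f"{name}: {mixture[name]:.3f} -> {doubled[name]:.3f}")
```

Doubling code from 5% to 10% here pushes web from 75% down to roughly 71%, and every other bucket loses a little as well, which is the displacement the ablations are measuring.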

Open vs closed corpora

The open landscape has caught up enormously. FineWeb (15T tokens, Hugging Face 2024) replaced C4 and RefinedWeb as the de facto open web baseline. DCLM-Baseline added a competitive classifier-based filtering recipe. RedPajama reproduced the original LLaMA data recipe, and RedPajama-V2 scaled the web portion up further. The Stack v2 covers code at frontier scale.

What’s still closed: the precise mixing ratios, the licensed long-form (books, journals), the proprietary instruction and synthetic data, and the late-stage curation passes. The gap between an open 7B model and a closed 7B model in 2026 is mostly in that last 10% by tokens, plus post-training, and not in the web bulk.

A frontier corpus contains roughly two zones: the web bulk (cheap, plentiful, heavy-tailed in quality) and the curated tail (expensive, scarce, dense per token). The curated tail (textbook-quality explanations, expert-written long-form, math proofs, instruction examples) encodes far more usable supervision per token because humans have already distilled it into clear, high-information text.

A rough empirical rule: a token of textbook-quality text is worth roughly 5-20 tokens of average web text in terms of downstream loss reduction. Phi-3 (Microsoft 2024) was the high-water mark of this thesis: a 3.8B model trained on heavily filtered web data plus synthetic textbook-style data that matched or beat 7B+ models trained on less heavily filtered web text. The thesis isn’t “synthetic beats real” so much as “density of supervision beats raw token count.”
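To get a feel for what a 5-20x multiplier implies, the sketch below converts a curated token count into a range of "web-equivalent" tokens. The 100B curated-token figure is a hypothetical input, and treating the multiplier as a constant is a simplification for illustration.

```python
def web_equivalent_tokens(curated_tokens, multiplier):
    """Tokens of average web text with roughly the same training value."""
    return curated_tokens * multiplier


curated = 100e9  # hypothetical: 100B tokens of textbook-quality text
low, high = (web_equivalent_tokens(curated, m) for m in (5, 20))
print(f"{curated:.0e} curated tokens ~ {low:.0e} to {high:.0e} web tokens")
```

Under that assumption, 100B curated tokens would stand in for somewhere between 0.5T and 2T tokens of average web text.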

Curated data is also where most of the post-Chinchilla overtraining gains come from. You can’t get to 1500 tokens per parameter on raw web without hitting quality plateaus; you can on a heavily curated mixture because each token carries more signal.
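For scale, the tokens-per-parameter figure can be checked with simple arithmetic. The sketch assumes the 8B-parameter Llama-3 model (a model size not stated in this article) trained on the 15T tokens cited earlier, and compares it against the commonly cited Chinchilla rule of thumb of about 20 tokens per parameter.

```python
def tokens_per_parameter(tokens, params):
    """Training tokens seen per model parameter."""
    return tokens / params


CHINCHILLA_RATIO = 20.0  # commonly cited rule of thumb, approximate

# Assumption: the 8B Llama-3 model, trained on 15T tokens.
ratio = tokens_per_parameter(tokens=15e12, params=8e9)
overtraining_factor = ratio / CHINCHILLA_RATIO

print(f"tokens per parameter: {ratio:.0f}")
print(f"multiple of the Chinchilla ratio: {overtraining_factor:.1f}x")
```

That works out to about 1875 tokens per parameter, on the order of 90 times the compute-optimal ratio, which is the regime where token quality rather than raw volume becomes the limit.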

Mixture selection is typically a two-stage process. First, scaling-law ablations at small scale (100M-1B parameters, 10-100B tokens) sweep mixing ratios across plausible ranges. The team trains dozens of small models, evaluates each on a benchmark battery, and fits a response surface predicting each downstream metric as a function of mixture proportions. Second, the predicted-optimal mixture is run at full scale, sometimes with mid-training mixture changes (e.g., upweighting math data in the last 10% of training, the “annealing” phase Llama-3 made canonical).
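The first stage can be sketched in one dimension: sweep a single mixing ratio, fit a simple response curve, and read off the predicted optimum. This is a toy under stated assumptions, not any lab's method. The scores are generated from a made-up quadratic rather than real ablation runs, the quadratic form itself is an assumption, and real response surfaces span several mixture dimensions at once. `fit_quadratic` and `best_share` are names invented here.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x**2 via the normal equations."""
    rows = [[1.0, x, x * x] for x in xs]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    beta = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        rest = sum(A[i][j] * beta[j] for j in range(i + 1, 3))
        beta[i] = (b[i] - rest) / A[i][i]
    return beta


def best_share(coeffs, lo, hi):
    """Return the share in [lo, hi] that maximizes the fitted quadratic."""
    a, b, c = coeffs
    candidates = [lo, hi]
    if c < 0:  # concave curve: the vertex is a maximum
        candidates.append(min(max(-b / (2 * c), lo), hi))
    return max(candidates, key=lambda x: a + b * x + c * x * x)


# Stand-in for stage one: code shares swept across small proxy runs.
code_shares = [0.02, 0.05, 0.08, 0.11, 0.14, 0.17]
# Made-up benchmark averages from a toy quadratic that peaks at 10% code.
scores = [50 + 200 * x - 1000 * x * x for x in code_shares]

coeffs = fit_quadratic(code_shares, scores)
optimum = best_share(coeffs, lo=0.02, hi=0.17)
print(f"predicted-optimal code share: {optimum:.3f}")
```

The predicted optimum is restricted to the swept range on purpose: a fitted curve says little about mixtures that no proxy run actually covered.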

The brittle part: the small-scale optimum doesn’t always extrapolate. Some capabilities only emerge above certain scales, so mixture choices that look neutral at 1B can be load-bearing at 70B. The Llama-3 paper documents several adjustments to the data mix during training, made to improve performance on particular downstream tasks.

The corpus is the most expensive non-compute artifact in an LLM project. Once trained on, it’s effectively immutable: the only way to “fix” a corpus mistake is to retrain. That’s why contamination checks, deduplication, and quality filtering happen before the run, not after.
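A minimal sketch of two of those pre-run passes, exact deduplication and n-gram contamination checking, is shown below. It is a toy under stated assumptions: whitespace tokenization, a 5-gram window, and tiny made-up documents. Production pipelines typically use longer windows, fuzzy deduplication, and real tokenizers. All function names are hypothetical.

```python
import hashlib


def ngrams(text, n):
    """Set of lowercase whitespace-token n-grams in the text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def exact_dedup(docs):
    """Keep the first copy of each document, comparing normalized text."""
    seen, kept = set(), []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def is_contaminated(doc, benchmark_ngrams, n):
    """True if the document shares any n-gram with the benchmark set."""
    return not ngrams(doc, n).isdisjoint(benchmark_ngrams)


# Made-up documents and benchmark question, for illustration only.
docs = [
    "The quick brown fox jumps over the lazy dog near the river bank",
    "the quick  brown fox jumps over the lazy dog near the river bank",
    "Question: what is the capital of France? Answer: Paris is the capital",
    "Gradient descent updates parameters in the direction of steepest descent",
]
benchmark = ["What is the capital of France?"]

N = 5  # assumed window size; chosen small to suit the toy data
benchmark_ngrams = set().union(*(ngrams(q, N) for q in benchmark))

clean = [d for d in exact_dedup(docs)
         if not is_contaminated(d, benchmark_ngrams, N)]
print(len(docs), "->", len(clean), "documents kept")
```

Running both passes before training matters because neither can be undone afterward: a duplicated or benchmark-leaking document that reaches the run is baked into the weights.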

Go further

What's actually in a frontier-LLM pretraining corpus?

Roughly: 60-80% filtered web text, 5-15% code, 5-10% books and long-form prose, 1-5% math and scientific text, 1-5% Wikipedia and reference, 1-10% high-quality curated (textbooks, instruction data, synthetic). Closed labs don't release their exact ratios but tune them heavily, because small mixture changes move benchmark scores meaningfully.

Why does the mixing ratio matter so much?

Ablations consistently show that doubling code share lifts coding evals 5-15 points but can dent natural-language metrics. Up-weighting math and scientific text trades general fluency for reasoning. The corpus is a budget: every percentage point of one source is a percentage point you didn't spend on another, and the model's downstream profile reflects exactly where you spent.

Are open corpora competitive with frontier-lab corpora?

On filtering recipe and scale, FineWeb-Edu and DCLM-Baseline are within striking distance: open 7B models trained on them get within a few points of equally sized closed models. The frontier gap is mostly in the curated 5-10% (instruction data, textbooks, synthetic), the licensed long-form (books, papers), and the post-training, not the web base.
