Data Mixing

Also known as: mixture weights, data ratios, domain weighting, DoReMi

TL;DR

Data mixing covers the ratio decisions in a pretraining corpus: what fraction comes from web text vs. code vs. math vs. books vs. scientific papers. It is the second-most-important choice in pretraining, after corpus selection itself.

Data mixing is the choice of what fraction of a corpus comes from each source: web text, code, papers, books, math, multilingual text, dialogue. After corpus selection itself, mixture weights are the most consequential pretraining decision. At fixed compute, two models trained on the same source pool with different mix ratios can differ by several points on downstream benchmarks.

Why mix ratios matter

The naive view is that data is data: pile everything in and let the model sort it out. Domains actually differ in three ways the loss cares about: per-token information density (code is denser than chat logs), overlap with downstream tasks (math papers are closer to GSM8K than sports news is), and difficulty (some domains saturate faster). A uniform sample over-trains on the easy abundant sources and under-trains on the hard scarce ones.

Mix optimization finds ratios that maximize downstream performance at a given compute budget. The decision matters most for small models, least for compute-rich frontier runs that can afford to over-sample everything.
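To make the effect of mixture weights concrete, the sketch below converts a set of weights into per-domain token budgets and the number of passes over each domain those budgets imply. It reads "uniform sample" as sampling in proportion to pool size, which is an assumption. The pool sizes, the upweighted mix, and the helper names are all hypothetical placeholders, not measurements or anyone's published recipe.

```python
def token_budget(weights, total_tokens):
    """Split a training token budget across domains according to mixture weights."""
    total_w = sum(weights.values())
    return {d: total_tokens * w / total_w for d, w in weights.items()}


def epochs_per_domain(weights, available_tokens, total_tokens):
    """Passes over each domain implied by the budget (1.0 = every token seen once)."""
    budget = token_budget(weights, total_tokens)
    return {d: budget[d] / available_tokens[d] for d in weights}


# Hypothetical pool sizes in tokens (placeholders, not measurements).
available = {"web": 900e9, "code": 80e9, "math": 20e9}

# Size-proportional sampling: each domain weighted by its share of the pool.
pool = sum(available.values())
proportional = {d: n / pool for d, n in available.items()}

# A hand-picked mix that upweights the scarce domains (also hypothetical).
upweighted = {"web": 0.80, "code": 0.15, "math": 0.05}

train_tokens = 500e9
prop_epochs = epochs_per_domain(proportional, available, train_tokens)
up_epochs = epochs_per_domain(upweighted, available, train_tokens)

for domain in available:
    print(f"{domain}: proportional={prop_epochs[domain]:.2f} "
          f"upweighted={up_epochs[domain]:.2f} epochs")
```

Under proportional sampling every domain gets the same number of passes, so the scarce domains contribute few tokens in absolute terms. Upweighting them shifts tokens away from web text and pushes the scarce domains toward, or past, one full pass.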

How modern mixes are chosen

Approaches to picking mixture weights
  • Hand-tuned. GPT-3-era recipes — a domain expert picks weights based on intuition and a few small ablations. Cheap, brittle, hard to defend.
  • Single-domain ablations. Train small models on each domain alone, weight proportional to downstream affinity. Misses cross-domain interactions.
  • DoReMi (Xie et al., 2023). Train a uniform-mix reference; train a proxy with weights adjusted via min-max optimization on per-domain excess loss. Weights transfer to the full pretraining run.
  • Online weight tuning. Adjust weights during pretraining based on per-domain validation loss. Used in some Llama-3-era runs.
  • Scaling-law-based. Fit per-domain scaling laws, pick weights that maximize predicted performance on a held-out objective. Computationally heaviest but most defensible.

DoReMi is the recipe most teams have converged toward. If a model is over-fitting one domain (loss far below the reference) and under-fitting another, upweight the under-fit one until per-domain excess losses are roughly equal. The math is a robust optimization framing — minimize worst-case excess loss across domains — solved efficiently with a small proxy whose weights transfer.
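The reweighting idea can be sketched as a multiplicative (exponentiated-gradient) update on the domain weights, modeled on the update described by Xie et al. (2023) but simplified. It uses per-domain average losses rather than per-token losses, and the step size, smoothing constant, function names, and loss values are all illustrative assumptions. In an actual run the proxy's losses change at every step as it trains; here they are held fixed to show a single update.

```python
import math


def doremi_step(weights, proxy_loss, ref_loss, lr=1.0, smoothing=1e-3):
    """One exponentiated-gradient update of domain weights.

    weights, proxy_loss, ref_loss: dicts keyed by domain name.
    Excess loss is clipped at zero, so domains where the proxy already
    beats the reference are not upweighted.
    """
    excess = {d: max(proxy_loss[d] - ref_loss[d], 0.0) for d in weights}
    # Multiplicative update: larger excess loss -> larger weight.
    raw = {d: weights[d] * math.exp(lr * excess[d]) for d in weights}
    total = sum(raw.values())
    uniform = 1.0 / len(weights)
    # Renormalize, then mix with uniform so no weight collapses to zero.
    return {
        d: (1 - smoothing) * raw[d] / total + smoothing * uniform
        for d in weights
    }


def average_weights(history):
    """Average the per-step weights; this average serves as the final mixture."""
    domains = history[0].keys()
    return {d: sum(w[d] for w in history) / len(history) for d in domains}


# Hypothetical per-domain losses (placeholders, not measurements).
weights = {"web": 1 / 3, "code": 1 / 3, "math": 1 / 3}
ref = {"web": 2.9, "code": 1.4, "math": 1.8}
proxy = {"web": 2.8, "code": 1.5, "math": 2.3}

new_weights = doremi_step(weights, proxy, ref)
final = average_weights([weights, new_weights])
```

With these placeholder losses the proxy trails the reference most on math, so math receives the largest weight after the update, code a smaller increase, and web (where the proxy is already ahead) is pushed down by the renormalization.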

Public ablations from labs that report them mostly tell the same story. (1) Aggressive deduplication and quality filtering matter more than fine-grained tuning of mix ratios: a clean mix at suboptimal ratios beats a noisy mix at optimal ones. (2) A code fraction in the 5-15% range improves text reasoning benchmarks; above 25% it hurts prose generation. (3) Math-heavy sources (arXiv, math papers, math forums) at 1-3% reliably lift GSM8K and MATH. (4) Multilingual ratios matter for the languages in question but barely affect English benchmarks until non-English data exceeds 30-40% of the mix.

What none of these clearly answer is which mix is best for which task family. The mix that maximizes MMLU is not the one that maximizes HumanEval, and the gap is several points on each. Production mixes are Pareto-frontier compromises across the benchmark suite the team cares about, not single global optima.
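One way to read "Pareto-frontier compromise" is as a filter over candidate mixes: keep every mix that no other candidate beats on all benchmarks at once, then choose among the survivors by priority. A minimal sketch follows; the mix names and scores are made-up placeholders, not reported results, and the function names are hypothetical.

```python
def dominates(a, b):
    """True if scores a are at least as good as b everywhere and better somewhere."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)


def pareto_frontier(candidates):
    """Return the names of candidate mixes that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


# Placeholder benchmark scores for three hypothetical mixes.
candidates = {
    "mix_a": {"MMLU": 62.0, "HumanEval": 28.0},
    "mix_b": {"MMLU": 60.5, "HumanEval": 33.0},
    "mix_c": {"MMLU": 60.0, "HumanEval": 27.0},  # worse than mix_a on both
}

frontier = pareto_frontier(candidates)
print(frontier)
```

Here mix_c drops out because mix_a beats it on both benchmarks, while mix_a and mix_b both survive: each wins on one benchmark, so choosing between them is a product decision rather than an optimization result.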

A mix optimal for a 1B model is rarely optimal for a 70B. Two effects compound: larger models can absorb more diverse data without saturating, and scaling laws push the optimal token count up roughly linearly with parameter count, which can drive some domains past their natural ceiling.

The practical consequence is that weights optimized on a small proxy transfer only approximately: applied unchanged to a much larger run, they can leave performance on the table. The fix is to optimize on a proxy as close to the target scale as compute allows, and to validate with a brief large-scale ablation before committing. The mix-scale interaction also has a domain signature: code and math benefit from larger relative weights as scale grows, while pure web text plateaus earlier.
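The ceiling effect can be checked with simple arithmetic: hold the weights fixed, grow the token budget with parameter count, and see how many passes each domain would need. The sketch below assumes a budget of 20 training tokens per parameter and a cap of 4 passes per domain. Both numbers are assumptions for illustration, not values from this article, and the pool sizes and weights are hypothetical.

```python
TOKENS_PER_PARAM = 20  # assumed rule of thumb, not a value from this article
MAX_EPOCHS = 4.0       # assumed cap on useful repetition, also illustrative


def domain_epochs(params, weights, available_tokens,
                  tokens_per_param=TOKENS_PER_PARAM):
    """Passes over each domain when the token budget scales with model size."""
    total_tokens = params * tokens_per_param
    return {d: total_tokens * weights[d] / available_tokens[d] for d in weights}


def over_cap(epochs, max_epochs=MAX_EPOCHS):
    """Domains whose implied repetition exceeds the cap, sorted by name."""
    return sorted(d for d, e in epochs.items() if e > max_epochs)


# Hypothetical weights and pool sizes in tokens (placeholders).
weights = {"web": 0.80, "code": 0.15, "math": 0.05}
available = {"web": 5e12, "code": 300e9, "math": 10e9}

small = domain_epochs(1e9, weights, available)    # 1B-parameter run
large = domain_epochs(70e9, weights, available)   # 70B-parameter run

print("1B over cap:", over_cap(small))
print("70B over cap:", over_cap(large))
```

With these placeholders the 1B run never repeats any domain, while the 70B run would need about seven passes over the math pool at the same 5% weight. Keeping that weight at scale then means either accepting heavy repetition or sourcing more math data.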

What mix optimization can’t fix

Mix optimization doesn’t survive distribution shift. A mix tuned on a math-heavy benchmark suite produces a model strong on math and average elsewhere; if the production use is customer-support chat, the tuned weights are wrong. Align the mix-optimization objective with the deployment objective, not with leaderboard composites.

Treat mix weights as a hyperparameter on par with learning rate — important enough to optimize, not so important that mistuning catastrophically breaks the run.

Go further

How wrong can a mix get and still ship a usable model?

Pretty wrong. The original GPT-3 paper used a hand-tuned mix that nobody believes was optimal, and the model still set the field on fire. The cost of a bad mix is typically 1-3 points of downstream accuracy on any given benchmark — significant for leaderboard contention, irrelevant for whether the model is usable. The reason mix optimization is now table-stakes is that the field has saturated everywhere else.

What does DoReMi actually do?

Train a small reference model on a uniform domain mix. Then train a small proxy model with adjustable per-domain weights and a min-max objective that upweights domains where the proxy underperforms relative to the reference. The resulting weights transfer to the full pretraining run. It's a robust optimization framing: minimize the worst-case excess loss across domains.

Why does code data help text reasoning?

The dominant theory is that code data forces the model to learn precise compositional and logical structure that pure prose doesn't enforce. Empirically, models trained with 5-15% code data score higher on text-only reasoning benchmarks (GSM8K, MATH, BBH) than models trained with 0% code, even at fixed compute. Whether the mechanism is reasoning-transfer or just better tokenization coverage is unsettled.
