Topic · 18 concepts

Data

The corpora, curation, and quality decisions that make models possible.

Training data is arguably the single largest determinant of model quality, and much of what's interesting about modern AI traces back to choices made at the data layer. The concepts below cover three areas: the canonical pretraining corpora (Common Crawl, FineWeb, the closed-source frontier datasets); the curation steps that separate good data from web sludge (deduplication, quality filtering, contamination checks); and the human and programmatic labeling layers (RLHF preferences, weak supervision, synthetic data) that produce the post-training signal. If you're building a model, an embedding model, or a reranker, expect to spend more engineering hours here than anywhere else.
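To make the three curation steps concrete, here is a minimal Python sketch of exact deduplication, a heuristic quality filter, and an n-gram contamination check. It is a toy illustration, not how any particular corpus is built: the function names, the thresholds (`min_words`, `min_alpha_ratio`), and the 5-gram overlap rule are all assumptions chosen for readability. Production pipelines typically use fuzzy deduplication and learned quality classifiers instead.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def dedupe_exact(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared after normalization."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def passes_quality(doc: str, min_words: int = 5, min_alpha_ratio: float = 0.6) -> bool:
    """Toy heuristic (assumed thresholds): enough words, mostly alphabetic characters."""
    if len(doc.split()) < min_words:
        return False
    alpha = sum(ch.isalpha() for ch in doc)
    return alpha / len(doc) >= min_alpha_ratio


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word n-grams of the normalized text."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(doc: str, benchmark: list[str], n: int = 5) -> bool:
    """Flag a document that shares at least one word n-gram with any benchmark item."""
    doc_grams = word_ngrams(doc, n)
    return any(doc_grams & word_ngrams(item, n) for item in benchmark)


def curate(docs: list[str], benchmark: list[str]) -> list[str]:
    """Deduplicate, then drop low-quality and benchmark-contaminated documents."""
    return [
        doc
        for doc in dedupe_exact(docs)
        if passes_quality(doc) and not is_contaminated(doc, benchmark)
    ]
```

Calling `curate(docs, benchmark)` on a list of raw documents and a list of held-out evaluation items returns the documents that survive all three steps, in their original order. Deduplication runs first so the more expensive checks only see each document once.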
