Data
The corpora, curation, and quality decisions that make models possible.
Training data is the single largest determinant of model quality, and most of what's interesting about modern AI traces back to choices made at the data layer. The concepts below cover the canonical pretraining corpora (Common Crawl, FineWeb, the closed-source frontier datasets), the curation steps that separate good data from web sludge (deduplication, quality filtering, contamination checks), and the human/programmatic labeling layers (RLHF preferences, weak supervision, synthetic data) that produce post-training signal. If you're building a model, an embedding, or a reranker, you'll spend more engineering hours here than anywhere else.
- Common Crawl
Common Crawl is the largest open web crawl — roughly 250 billion pages across 100+ monthly snapshots, distributed as WARC files. It's the raw material almost every public LLM corpus is built from.
- Data Augmentation
Synthetic perturbation of training examples to expand a dataset's effective size — paraphrase and back-translation for text, rotation and crop for images, in-batch contrastive views for embeddings.
- Data Contamination
When evaluation data leaks into training data, inflating benchmark scores without improving real capability. Detected via n-gram match against eval sets, log-probability attacks, or membership inference.
- Data Curation
The umbrella discipline of preparing a corpus for training: filtering, deduplication, quality scoring, language ID, and classifier-based selection.
- Data Engineering at Scale
What changes when your dataset doesn't fit on one machine, doesn't fit in RAM, and takes hours per pass. The thresholds where ad-hoc Python stops working: single-file → sharded, RAM → streaming, single-node → distributed.
- Data Formats
Data formats are the contracts between memory and disk: how a structured record turns into bytes that can be read back later. The choice of format determines whether you can scan a terabyte in seconds or in hours, and whether schema evolution breaks existing readers.
- Data Labeling
Human-in-the-loop annotation of training data — crowdsourced (Mechanical Turk, Scale, Surge), expert (domain specialists), and gold-standard sets. Distinct from RLHF preferences.
- Data Mixing
The ratio decisions in a pretraining corpus — what fraction of web vs code vs math vs books vs scientific papers. Second-most-important choice in pretraining after corpus selection itself.
- Dataset Cards
Structured metadata describing a dataset's provenance, license, size, intended use, limitations, and ethical considerations. The HuggingFace dataset-card schema is the de facto standard, and every shipped dataset should have one.
- Deduplication
Removing exact and near-duplicate documents from a training corpus. Exact dedup uses content hashing; near-dedup uses MinHash, SimHash, or embedding-based similarity.
- Distributed Data Processing
Spark, Ray Data, Beam, Dask: the frameworks that turn N nodes into one logical computer. The map-reduce mental model still rules: per-partition compute is cheap, cross-partition compute requires a shuffle, and shuffle is the dragon.
- FineWeb
FineWeb is a 15-trillion-token English web corpus released by HuggingFace in 2024, distilled from 96 Common Crawl snapshots through an aggressive filtering and deduplication recipe.
- JSONL
JSONL — JSON Lines, also called NDJSON — is one JSON object per line. Brutally simple, ubiquitous in ML datasets and log shipping, friendly to streaming and append-only writes.
- Parquet
Parquet is a columnar on-disk format that has become the default way to store multi-TB datasets. Rows are grouped into row groups; each row group stores its columns separately, compressed, with per-column statistics such as min and max that let readers skip data they don't need.
- Pretraining Corpus
The trillion-token text mixture that an LLM consumes during pretraining. Composition (web, code, books, math, scientific papers) and mixing ratios are the two highest-leverage choices in pretraining.
- Streaming Datasets
When your dataset doesn't fit on local disk, you stream it from object storage as tar or Parquet shards. Sequential reads of large objects are 10-100x faster than random access on S3, so streaming formats pack many samples into each shard and read it front to back.
- Weak Supervision
Programmatic labeling: write rules and heuristics as labeling functions, then aggregate their noisy, sometimes conflicting votes into training labels that a model learns to denoise. The Snorkel paradigm.
- Web Scraping
The engineering pipeline for harvesting text data from the public web — crawlers, robots.txt, JS rendering, deduplication-as-you-go, rate limits, and politeness.
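As a concrete illustration of the Deduplication entry above, here is a minimal sketch of the two passes it describes: exact dedup by content hashing, then near-dedup by shingle overlap. It computes exact Jaccard similarity over word 3-grams, which is what MinHash approximates at scale; a real pipeline would swap in MinHash with locality-sensitive hashing rather than comparing every pair. The function names, the normalization rule, and the 0.8 threshold are illustrative assumptions, not part of any particular library or recipe.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def content_hash(text: str) -> str:
    """SHA-256 of the normalized text, used for exact-duplicate detection."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams; short texts yield a single shingle of all words."""
    words = normalize(text).split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup(docs, threshold: float = 0.8, n: int = 3):
    """Keep the first occurrence of each document.

    Pass 1 drops exact duplicates (same hash after normalization).
    Pass 2 drops near-duplicates whose shingle Jaccard similarity to an
    already-kept document is >= threshold (hypothetical default of 0.8).
    The pairwise comparison is O(n^2) and only suitable for small inputs.
    """
    seen_hashes = set()
    kept = []  # list of (document, shingle set) pairs
    for doc in docs:
        h = content_hash(doc)
        if h in seen_hashes:
            continue
        sh = shingles(doc, n)
        if any(jaccard(sh, other) >= threshold for _, other in kept):
            continue
        seen_hashes.add(h)
        kept.append((doc, sh))
    return [doc for doc, _ in kept]


if __name__ == "__main__":
    corpus = [
        "The quick brown fox jumps over the lazy dog near the river bank today",
        "the quick  brown fox jumps over the lazy dog near the river bank today",
        "The quick brown fox jumps over the lazy dog near the river bank tonight",
        "Parquet stores columns separately with statistics",
    ]
    for doc in dedup(corpus):
        print(doc)
```

The exact pass is cheap and catches verbatim copies; the near-duplicate pass is where the cost lives, which is why large corpora replace the pairwise loop with signature-based candidate lookup.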
