FineWeb

Also known as: FineWeb-Edu, HuggingFace FineWeb, FW

TL;DR

FineWeb is a 15-trillion-token English web corpus released by HuggingFace in 2024, distilled from 96 Common Crawl snapshots through an aggressive filtering and deduplication recipe.

FineWeb is HuggingFace’s 15-trillion-token English-language web corpus, released in May 2024 and distilled from 96 Common Crawl snapshots spanning 2013 to 2024. As of 2026 it is one of the largest fully open, fully reproducible pretraining web corpora, and it has displaced C4 and RefinedWeb as the default open baseline. Many open and academic models trained after mid-2024 use FineWeb or a close derivative.

Why FineWeb became the default

Before FineWeb, the open pretraining-data landscape was a mess. C4 (305GB, 2020) was tiny by frontier-scale standards and filtered naively. The Pile (825GB, 2020) was diverse but full of contamination and licensed data of unclear status. RefinedWeb (5T tokens, 2023) was a step change but only partially open: the Falcon team published the recipe but released only ~600B tokens of the dataset. RedPajama reproduced the Llama-1 data recipe but lagged on quality.

FineWeb did three things at once: published the full 15T tokens openly, documented every filtering decision with ablations, and shipped FineWeb-Edu, a quality-classifier-filtered variant whose 1.3T tokens outperform the unfiltered 15T set on knowledge and reasoning benchmarks at fixed compute. That combination of reproducibility, ablations, and a quality variant is why so many open labs standardized on it in 2024-2025.

The filtering recipe at a high level

The FineWeb pipeline is one of the most documented in open NLP — the team published HuggingFace blog posts walking through every ablation. The recipe in broad strokes:

FineWeb pipeline stages

Text extraction. Trafilatura on raw WARC files rather than Common Crawl’s pre-extracted WET text, which retains too much boilerplate and navigation content.
Language filter. FastText language ID; keep English (>0.65 confidence).
URL blocklist. UT1 list for adult content; additional manual blocklist for spam and low-quality domains.
Repetition heuristics. Reject documents with >30% repeated lines, >20% repeated 5-grams, etc. (Gopher-style filters from DeepMind).
Quality heuristics. Mean line length, stopword ratio, fraction of alphabetic characters, ellipsis density. Drops boilerplate and SEO spam.
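MinHash deduplication. Fuzzy dedup over word 5-grams, targeting documents that are at least roughly 75% similar. Each snapshot is deduplicated independently; the team’s ablations found that deduplicating across all snapshots together did not improve downstream performance. Removes ~50-70% of remaining tokens.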
PII redaction. Email addresses and IP addresses anonymized.
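As a rough illustration of the repetition-heuristics stage, here is a minimal Python sketch using the 30% repeated-line and 20% repeated-5-gram thresholds quoted above. The function names are hypothetical, and the fractions are counted over lines and word n-grams for simplicity; the real Gopher-style filters are defined more carefully, so treat this as a simplification rather than the FineWeb implementation.

```python
from collections import Counter


def repeated_line_fraction(text: str) -> float:
    """Fraction of non-empty lines that duplicate an earlier line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    # Every occurrence beyond the first counts as a duplicate.
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(lines)


def repeated_ngram_fraction(text: str, n: int = 5) -> float:
    """Fraction of word n-grams whose content occurs more than once."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


def passes_repetition_filter(
    text: str,
    max_line_frac: float = 0.30,   # threshold taken from the stage description
    max_ngram_frac: float = 0.20,  # threshold taken from the stage description
) -> bool:
    """Keep a document only if both repetition ratios stay under their caps."""
    return (
        repeated_line_fraction(text) <= max_line_frac
        and repeated_ngram_fraction(text, n=5) <= max_ngram_frac
    )
```

A page that is ten copies of the same line fails the first check immediately, while ordinary prose with no repeated lines or 5-grams passes both.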

The result is roughly 15T tokens — call it a 5-10% survival rate from the raw 96-snapshot input.
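The MinHash deduplication stage in the pipeline above can be sketched in plain Python. This is a toy version under stated assumptions: the helper names, the 128 hash functions, and the salted BLAKE2b hashing are illustrative choices, not FineWeb's configuration, and the similarity threshold is left as an explicit argument. It compares two documents directly, whereas a production pipeline would bucket signatures with locality-sensitive hashing instead of comparing every pair.

```python
import hashlib
from typing import List, Set


def word_shingles(text: str, n: int = 5) -> Set[str]:
    """Set of lowercase word n-grams (shingles) in a document."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash64(shingle: str, seed: int) -> int:
    """64-bit hash of a shingle; the seed selects one hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=salt
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(text: str, num_hashes: int = 128, n: int = 5) -> List[int]:
    """Minimum hash value per hash function over the document's shingles."""
    shingles = word_shingles(text, n)
    if not shingles:
        raise ValueError("cannot build a MinHash signature for an empty document")
    return [min(_hash64(s, seed) for s in shingles) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Share of matching signature slots, an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def is_near_duplicate(
    text_a: str, text_b: str, threshold: float, num_hashes: int = 128
) -> bool:
    """True if the estimated similarity reaches the caller's threshold."""
    sig_a = minhash_signature(text_a, num_hashes)
    sig_b = minhash_signature(text_b, num_hashes)
    return estimated_jaccard(sig_a, sig_b) >= threshold
```

The estimate gets tighter as `num_hashes` grows, which is the trade-off between signature size and how sharply the threshold separates near-duplicates from merely related pages.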

FineWeb-Edu adds a single step on top of base FineWeb: a small Snowflake-Arctic-Embed-M encoder with a linear regression head, trained on 500K Llama-3-70B-Instruct annotations of “educational value” on a 0-5 scale. The classifier scores each FineWeb document; the FineWeb-Edu subset is everything scoring ≥3 (about 1.3T tokens) or ≥2 for the larger variant.
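To make the scoring step concrete, here is a minimal sketch of a linear regression head over a document embedding followed by the threshold cut. Everything here is illustrative: the function names are hypothetical, the weights and bias stand in for trained parameters, and the embedding is assumed to come from an encoder that is not shown. Clamping to the 0-5 range is an assumption of this sketch.

```python
from typing import Iterable, List, Sequence, Tuple


def educational_score(
    embedding: Sequence[float], weights: Sequence[float], bias: float
) -> float:
    """Linear regression head: dot(weights, embedding) + bias, clamped to [0, 5]."""
    if len(embedding) != len(weights):
        raise ValueError("embedding and weights must have the same length")
    raw = sum(w * x for w, x in zip(weights, embedding)) + bias
    return min(5.0, max(0.0, raw))


def keep_document(score: float, threshold: float = 3.0) -> bool:
    """Keep documents at or above the threshold (3 for the main subset)."""
    return score >= threshold


def filter_corpus(
    docs: Iterable[Tuple[str, Sequence[float]]],
    weights: Sequence[float],
    bias: float,
    threshold: float = 3.0,
) -> List[str]:
    """Return the texts whose predicted score clears the threshold."""
    kept = []
    for text, embedding in docs:
        if keep_document(educational_score(embedding, weights, bias), threshold):
            kept.append(text)
    return kept
```

Lowering `threshold` to 2 corresponds to the larger variant described above: the same scores, a looser cut.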

The “educational value” framing is doing a lot of work. The Llama-3 prompt asks, roughly, how useful a page would be for teaching at grade-school to middle-school level; that biases the classifier toward textbook-like, well-structured, factually-rich prose and against forum posts, marketing copy, listicles, and SEO-padded content. The empirical result is ~5-7 point gains on MMLU and ARC at fixed compute.

The interesting failure mode: FineWeb-Edu is biased toward formal expository writing. A model trained only on FineWeb-Edu can be subtly worse at conversational and creative generation than one trained on full FineWeb, because the conversational corner of the distribution is heavily filtered out. Production recipes typically blend FineWeb-Edu (high-quality bulk) with full FineWeb (distributional diversity) and curated tails (code, math, books).
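The blending idea can be sketched as weighted sampling across sources. The source names and the 60/30/10 weights below are hypothetical placeholders, since mixing ratios are tuned per lab; the sketch only shows the mechanism of drawing each next document from a source chosen in proportion to its weight.

```python
import random
from collections import Counter
from itertools import cycle
from typing import Dict, Iterator, List, Tuple


def blend(
    sources: Dict[str, Iterator[str]],
    weights: Dict[str, float],
    n: int,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Draw n documents, picking the source of each in proportion to its weight."""
    rng = random.Random(seed)  # seeded so the mixture is reproducible
    names = list(sources)
    probs = [weights[name] for name in names]
    out = []
    for _ in range(n):
        name = rng.choices(names, weights=probs, k=1)[0]
        out.append((name, next(sources[name])))
    return out


# Toy infinite streams standing in for real corpora (hypothetical names).
streams = {
    "fineweb_edu": cycle(["edu doc"]),
    "fineweb": cycle(["web doc"]),
    "code_math_books": cycle(["tail doc"]),
}
mix_weights = {"fineweb_edu": 0.6, "fineweb": 0.3, "code_math_books": 0.1}

sample = blend(streams, mix_weights, n=10_000, seed=0)
shares = {name: count / len(sample) for name, count in Counter(n for n, _ in sample).items()}
```

With enough draws, `shares` lands close to the requested weights. A real data loader would also need to handle sources that run out, which this sketch sidesteps by using endless streams.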

Multilingual FineWeb-2 (released late 2024) covers 1000+ languages but the per-language token counts are wildly uneven — English remains 60%+ of total tokens, Spanish/French/German/Russian have hundreds of billions each, and the long tail is in the 1-100B range. The filtering recipe is the bottleneck: FineWeb-Edu’s classifier only works because Llama-3-70B can read English text well; building equivalent quality classifiers in Yoruba or Tagalog requires either a strong multilingual teacher LLM or a per-language annotation pipeline.

Cross-lingual quality transfer (use the English classifier on translated text) doesn’t work well — the classifier picks up surface features that don’t survive translation. The frontier labs have proprietary multilingual recipes; the open ecosystem is still catching up.

What FineWeb is and isn’t

FineWeb is the web bulk of a pretraining corpus — it is not a complete training mixture. To train a competitive model you blend FineWeb (or FineWeb-Edu) with code (The Stack, GitHub), math (OpenWebMath, AlgebraicStack), books and long-form, and curated/synthetic data. The mixing ratios remain a per-lab tuning problem; FineWeb just removes the web-quality variable from the equation.

The practical impact: in 2026, claiming “trained on a frontier-scale clean web corpus” no longer differentiates an open-weight model. Everyone has access to FineWeb. The differentiation has moved upstream (proprietary licensed long-form, curated tails) and downstream (post-training, alignment, specialization). For frontier-lab competitiveness on the web bulk specifically, the open ecosystem has caught up — and FineWeb is why.

Go further

What does FineWeb-Edu actually filter on?

A small classifier (Snowflake-Arctic-Embed-M + linear head) trained on Llama-3-70B annotations of educational value on a 0-5 scale. Keep documents scoring 3 or higher. The result is roughly 1.3T tokens of high-quality educational web — a 90% reduction from raw FineWeb that consistently improves downstream benchmarks at fixed token budget.

How does FineWeb compare to C4 and RefinedWeb?

C4 (T5, 2020) was 305GB of filtered English Common Crawl, small by 2024 standards and naively filtered. RefinedWeb (Falcon, 2023) was 5T tokens with stronger MinHash dedup. FineWeb (15T) extends RefinedWeb's recipe across more snapshots and adds quality heuristics; FineWeb-Edu adds the model-based classifier. In fixed-compute training at 1B-parameter scale, FineWeb-Edu beats RefinedWeb by 4-8 points on average across MMLU, HellaSwag, ARC, and Winogrande.

Why did the 'replace C4' moment happen in 2024 and not earlier?

Two reasons. First, MinHash deduplication at trillion-token scale only became economically tractable around 2023-2024, as CPU compute got cheap. Second, model-based quality classifiers needed strong-enough teacher LLMs (Llama-3-70B, Mixtral) to produce reliable training labels at scale. Both unlocked the recipe at roughly the same time, and HuggingFace shipped the first openly reproducible version.
