Common Crawl

Also known as: CC, CommonCrawl, CC dump, WARC archive

TL;DR

Common Crawl is the largest open web crawl — roughly 250 billion pages across 100+ monthly snapshots, distributed as WARC files. It's the raw material almost every public LLM corpus is built from.

Common Crawl is a non-profit web archive that has been crawling the public web since 2008, on a roughly monthly cadence in recent years. The cumulative archive is on the order of 250 billion pages totaling tens of petabytes of raw HTML, distributed as WARC files on AWS S3 and free for anyone to download. Almost every open LLM pretraining corpus (C4, RefinedWeb, FineWeb, RedPajama, DCLM) starts from Common Crawl. The frontier labs use it too; their public statements imply CC is the bulk of their web-text input as well.

Format and scale

Each monthly “crawl” is a snapshot, typically 3-4 billion pages and 250-400TB of WARC, gathered over a few weeks. Common Crawl publishes each snapshot in three formats, the raw WARC plus two formats derived from it:

WARC — full HTTP responses, including headers and raw HTML. The canonical input to serious pretraining pipelines.
WAT — metadata extracts (links, titles, language detection) without document bodies. Useful for graph analysis.
WET — naive plaintext extracts. Fast to consume but low quality; almost no production pipeline uses WET directly.

Snapshots overlap heavily — the same page is recrawled across many months, and CC does not deduplicate across snapshots. Cross-snapshot dedup is the first step of any serious filtering recipe and removes 60-80% of the bulk.
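As a minimal sketch of what cross-snapshot deduplication can look like, the snippet below keeps only the first copy of each page body it sees, keyed on a hash of the normalized text. The record shape (snapshot id, URL, text), the snapshot labels, and the normalization rule are assumptions made for illustration; production recipes work at far larger scale and often use fuzzier matching.

```python
import hashlib
from typing import Iterable, Iterator, Tuple

# Hypothetical record shape: (snapshot_id, url, extracted_text)
Page = Tuple[str, str, str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences hash the same."""
    return " ".join(text.lower().split())


def dedup_across_snapshots(pages: Iterable[Page]) -> Iterator[Page]:
    """Yield only the first page seen for each distinct normalized body."""
    seen = set()
    for snapshot_id, url, text in pages:
        digest = hashlib.sha1(normalize(text).encode("utf-8")).digest()
        if digest in seen:
            continue  # same content was already kept from an earlier record
        seen.add(digest)
        yield snapshot_id, url, text


# Made-up example records; the snapshot labels are placeholders.
pages = [
    ("2023-06", "https://example.com/a", "Hello   world"),
    ("2023-14", "https://example.com/a", "hello world"),  # recrawl, same content
    ("2023-14", "https://example.com/b", "Something else"),
]
kept = list(dedup_across_snapshots(pages))
```

Here the recrawled copy of the first page is dropped and two records survive. Keeping the earliest copy is one possible policy; keeping the most recent one is equally defensible.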

Why the raw archive isn’t usable as training data

Raw Common Crawl is overwhelmingly noise from a pretraining perspective: navigation chrome, ad copy, SEO spam, generated boilerplate, broken HTML, machine-translated mush, and outright pornography. Training on raw CC produces a markedly worse model than training on a fraction of the data after filtering.

The standard pipeline (broadly the same across FineWeb, RefinedWeb, and DCLM) runs roughly:

HTML to text. Boilerplate stripping (trafilatura, resiliparse). Naive text extraction loses 30-50% of content quality vs. good extractors.
Language identification. Keep target languages only (typically English at 60-90% of token budget plus selected non-English).
URL filtering. Block adult-content lists, low-quality TLDs, known-bad domains.
Quality heuristics. Repetition ratio, mean word length, fraction of stopwords, line-length distribution. Cuts 30-60% of remaining pages.
Model-based filtering. Classifier (FastText or BERT-class) trained on a high-quality reference corpus (e.g. Wikipedia + textbooks) scores each page; keep top decile or quartile.
Near-duplicate removal. MinHash or SimHash across the survivor set. Removes another 30-50%.
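Two of the steps listed above, the quality heuristics and MinHash near-duplicate detection, can be sketched with the Python standard library alone. The stopword list, thresholds, shingle size, and number of hash functions below are illustrative assumptions, not values taken from FineWeb, RefinedWeb, or DCLM.

```python
import hashlib
import re
from collections import Counter

# Tiny illustrative stopword list; real pipelines use much larger ones.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "that", "it", "for"}


def passes_heuristics(text: str) -> bool:
    """Cheap rule-based page check. All thresholds are assumptions."""
    words = re.findall(r"[a-z']+", text.lower())
    if len(words) < 5:
        return False
    mean_len = sum(map(len, words)) / len(words)
    stop_frac = sum(w in STOPWORDS for w in words) / len(words)
    # Share of the single most frequent word, used as a repetition proxy.
    top_frac = Counter(words).most_common(1)[0][1] / len(words)
    return 3 <= mean_len <= 10 and stop_frac >= 0.1 and top_frac <= 0.3


def shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams that represents the document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def _hash64(s: str, seed: int) -> int:
    """Seeded 64-bit hash; each seed stands in for one random permutation."""
    h = hashlib.blake2b(s.encode("utf-8"), digest_size=8,
                        salt=seed.to_bytes(8, "little"))
    return int.from_bytes(h.digest(), "little")


def minhash(text: str, num_perm: int = 64) -> list:
    """MinHash signature: per seed, the minimum hash over all shingles."""
    sh = shingles(text)
    return [min(_hash64(s, seed) for s in sh) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

A page would be dropped when `passes_heuristics` returns `False`, and two surviving pages would be treated as near-duplicates when `estimated_jaccard` exceeds some chosen threshold. Comparing every pair of signatures does not scale, so large pipelines typically bucket signatures with locality-sensitive hashing first; that part is omitted here.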

Survival rate end-to-end is typically 1-5% of the original byte volume. That sounds wasteful; it’s actually the recipe.
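The end-to-end figure follows from multiplying the per-stage retention rates. The sketch below does that arithmetic with the removal ranges quoted in this article. It is a back-of-the-envelope check under loud assumptions: the quoted percentages mix pages and bytes, the stages are treated as independent, and language identification and URL filtering are left out because no figures are given for them.

```python
# Fraction of input KEPT at each stage, as (low, high), derived from the
# removal ranges quoted in the text. Language ID and URL filtering are
# omitted because the text gives no numbers for them.
stages = {
    "cross-snapshot dedup (removes 60-80%)": (0.20, 0.40),
    "quality heuristics (cuts 30-60%)": (0.40, 0.70),
    "model-based filter (top decile to quartile)": (0.10, 0.25),
    "near-duplicate removal (removes 30-50%)": (0.50, 0.70),
}

low = high = 1.0
for name, (keep_low, keep_high) in stages.items():
    low *= keep_low    # most aggressive setting at every stage
    high *= keep_high  # most lenient setting at every stage

print(f"end-to-end survival: {low:.1%} to {high:.1%}")
# prints "end-to-end survival: 0.4% to 4.9%"
```

The product lands in the same low-single-digit range as the quoted survival rate, which is the point: no single stage is extreme, but four or five moderate cuts compound quickly.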

Common Crawl is intentionally a raw archive. Its purpose is to provide the substrate on which researchers and engineers build downstream artifacts. Cleaning is opinionated: every filtering choice (which languages, which quality bar, which content categories to drop) reflects a hypothesis about downstream use. CC stays neutral and lets the community ship multiple curated variants on top of the same base.

This is why FineWeb, RedPajama, and DCLM exist as separate datasets despite all consuming the same upstream snapshots: each is a different filtering recipe. The competition between them — better filters, better deduplication, better quality classifiers — has been the fastest-moving area of open pretraining-data research over the past two years. CC is the substrate; the recipes are the contribution.

Coverage is broad but heavily biased toward what CC’s crawler can actually reach. The biases that matter:

Behind-paywall content (NYT, WSJ, academic journals, most books) is largely absent. Frontier labs license these separately.
JavaScript-heavy sites were undercrawled until ~2020; CC now does some headless rendering but still misses dynamic content from many modern web apps.
Logged-in content (Discord, Slack, Facebook, most of Reddit’s value) is invisible.
Recent content has a publication-to-crawl lag of weeks to months; the most recent snapshot is rarely current.
Non-English coverage is real but uneven — well-resourced languages (French, German, Spanish, Chinese, Japanese) are deep; long-tail languages (Yoruba, Quechua, Burmese) are thin.

For RAG and search use cases, CC is also a poor stand-in for “the live web” — it’s a months-old snapshot, not a real-time crawl. Production search engines maintain their own crawl infrastructure for that reason.

Why CC matters for the open ecosystem

Without Common Crawl there is no open pretraining ecosystem at this scale. Essentially every open-weight 7B+ model (Llama, Mistral, OLMo, Qwen) is believed to have been trained on a corpus built largely from CC, although not every lab documents its data mix. Closed labs use CC too but supplement with licensed and proprietary data. The CC-derived web bulk is the great equalizer; the gap between open and closed has narrowed largely because open recipes (FineWeb-Edu, DCLM-Baseline) closed the filtering-quality gap on the same upstream substrate.

The forward-looking question is whether CC's crawling rate keeps up with the web's growing volume of LLM-generated content. That content is itself a contamination concern, since training on crawl data that contains LLM output is the textbook recipe for distributional collapse.

Go further

How much usable text does Common Crawl actually contain?

A single monthly snapshot is ~3-4 billion pages and up to ~400TB of WARC. After boilerplate stripping, language filtering, and deduplication, roughly 5-10% of the bytes survive as usable English text, on the order of 1-3 trillion tokens per snapshot. Quality filtering cuts that much further: FineWeb (15T tokens) was distilled from 96 snapshots, which averages about 156 billion tokens per snapshot, so its end-to-end survival rate is well under 1%.

What's a WARC file?

Web ARChive — the open ISO standard for storing crawled web responses. Each WARC record contains an HTTP response (status, headers, body) plus crawl metadata. Common Crawl publishes WARC, WAT (metadata extracts), and WET (plain-text extracts); most pretraining pipelines start from WARC because the WET extraction is mediocre.
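To make the record layout concrete, here is a minimal sketch that parses one uncompressed WARC response record held in memory: a version line, WARC headers, a blank line, then a block of `Content-Length` bytes containing the HTTP response. The sample record is hand-built for illustration. Real Common Crawl files are gzip-compressed and hold many records, and in practice a dedicated library such as warcio is the safer choice than a hand-rolled parser like this one.

```python
def _parse_headers(lines, sep=":"):
    """Turn 'Name: value' lines into a dict (last value wins on repeats)."""
    headers = {}
    for line in lines:
        name, _, value = line.partition(sep)
        headers[name.strip()] = value.strip()
    return headers


def parse_warc_record(raw: bytes) -> dict:
    """Parse a single uncompressed WARC 'response' record."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]  # e.g. "WARC/1.0"
    warc_headers = _parse_headers(lines[1:])
    length = int(warc_headers["Content-Length"])
    block = rest[:length]  # the captured HTTP message
    http_head, _, body = block.partition(b"\r\n\r\n")
    http_lines = http_head.decode("iso-8859-1").split("\r\n")
    return {
        "version": version,
        "warc_headers": warc_headers,
        "status_line": http_lines[0],
        "http_headers": _parse_headers(http_lines[1:]),
        "body": body,
    }


# Hand-built sample record (not real crawl data).
http = (b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: text/html\r\n"
        b"\r\n"
        b"<html><body>Hello</body></html>")
sample = (b"WARC/1.0\r\n"
          b"WARC-Type: response\r\n"
          b"WARC-Target-URI: https://example.com/\r\n"
          b"Content-Length: " + str(len(http)).encode() + b"\r\n"
          b"\r\n" + http + b"\r\n\r\n")
record = parse_warc_record(sample)
```

After this runs, `record["status_line"]` holds the HTTP status line and `record["body"]` holds the raw HTML bytes, which is the input the HTML-to-text stage of a pretraining pipeline would consume.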

Is everything on Common Crawl actually scraped legally?

Common Crawl Foundation respects robots.txt and operates a public, non-commercial archive — that's the basis under which most sites tolerate it. Whether downstream LLM training on those archives constitutes fair use is the active 2024-2026 litigation question (NYT v. OpenAI, Authors Guild, Getty); CC itself is not the defendant but is the upstream source.
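Because the crawler honors robots.txt, a site owner can check how a given robots.txt would apply to Common Crawl's user agent, CCBot. The sketch below uses Python's standard `urllib.robotparser` on an inline, made-up robots.txt; the rules and URLs are illustrative assumptions, and this only models the standard's matching rules, not the crawler's actual implementation.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt: block CCBot from /private/, allow everything else.
robots_txt = """\
User-agent: CCBot
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed_public = parser.can_fetch("CCBot", "https://example.com/articles/1")
allowed_private = parser.can_fetch("CCBot", "https://example.com/private/x")
```

With these rules, `allowed_public` is `True` and `allowed_private` is `False`. A site that wants to opt out entirely would use `Disallow: /` under the `CCBot` entry.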
