Also known as: web crawling, scraping, spidering, harvesting
TL;DR
The engineering pipeline for harvesting text data from the public web — crawlers, robots.txt, JS rendering, deduplication-as-you-go, rate limits, and politeness.
Web scraping is the engineering of harvesting text from the public web at scale. For LLM data work it splits cleanly into two regimes: general-purpose archive crawls (Common Crawl, Internet Archive) that feed broad pretraining corpora, and targeted scrapes that go after specific sites or domains where the broad crawl is insufficient. The engineering stack is largely shared; the distinction is intent and scope.
The pipeline
A complete scraping pipeline is more than a requests.get loop. The stages, in order:
Web scraping pipeline stages
Seed and frontier management. Start URLs and a queue of URLs-to-visit. Deduplicate the frontier; bound it; persist it across restarts.
Politeness gates. robots.txt parsing and enforcement; per-host rate limits; back-off on 429/503; identifiable User-Agent string with contact info.
Fetching. HTTP(S) requests with sensible timeouts, retries, redirect handling. For JS-heavy sites, headless browser (Playwright/Puppeteer) — slower but necessary.
Content extraction. Boilerplate stripping (trafilatura, resiliparse, readability-lxml) — turns HTML into clean text. Strip nav, sidebars, footers, ads, comment sections.
Inline deduplication. Hash content as it lands; reject duplicates before persisting. Saves storage and downstream pipeline cost.
Storage. WARC for archival; a columnar format such as Parquet for downstream analytics and ML pipelines.
Frontier expansion. Extract outbound links from each fetched page; filter (same-host, same-domain, depth-bounded); add to queue.
Where production scrapers fail: the politeness gates and the inline dedup. A naive parallel scraper without rate limiting will get IP-banned within hours. A scraper without inline dedup will waste 80% of its capacity refetching duplicate content from session-IDed URLs and pagination.
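The frontier and link-expansion stages can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: the Frontier class, its parameters, and the example URLs are all hypothetical, and persistence across restarts is left out.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit


class Frontier:
    """Bounded FIFO queue of (url, depth) pairs that never enqueues a URL twice."""

    def __init__(self, seeds, max_size=10_000, max_depth=3):
        self.max_size = max_size
        self.max_depth = max_depth
        # Same-host filter: only hosts that appear in the seed list are crawled.
        self.allowed_hosts = {urlsplit(u).netloc for u in seeds}
        self.seen = set()
        self.queue = deque()
        for url in seeds:
            self.add(url, depth=0)

    def add(self, url, depth):
        """Enqueue url if it passes the depth, size, host, and seen filters."""
        if depth > self.max_depth or len(self.queue) >= self.max_size:
            return False
        if urlsplit(url).netloc not in self.allowed_hosts or url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append((url, depth))
        return True

    def expand(self, page_url, depth, hrefs):
        """Resolve outbound links against page_url and enqueue the survivors."""
        added = 0
        for href in hrefs:
            absolute = urljoin(page_url, href).split("#", 1)[0]  # drop fragments
            added += self.add(absolute, depth + 1)
        return added

    def pop(self):
        return self.queue.popleft()
```

A real crawler would back the seen set and the queue with durable storage so that a restart does not refetch everything.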
The split: archive crawls vs. targeted scrapes
Common Crawl is an archive crawl — it crawls broadly with low depth per site, optimized for breadth across the public web. CC is the right tool when you want a representative sample. It’s the wrong tool when you need a specific site, recent content, JS-rendered pages, or comprehensive depth on one domain.
Targeted scrapes flip every choice: limited host list, deep crawling within those hosts, custom extractors that understand site-specific structure, frequent recrawls for freshness. Most academic/scientific corpora (arXiv, PubMed, Semantic Scholar), domain-specific legal corpora (CourtListener, EUR-Lex), and curated long-form sources (Project Gutenberg, OpenStax) come from targeted scrapes — not from CC.
robots.txt and politeness
robots.txt is a per-host file declaring which paths a crawler may visit; some sites also use the nonstandard Crawl-delay directive to request a maximum rate. It is not legally binding in most jurisdictions, but ignoring it will get you blocked technically (IP bans, Cloudflare challenges) and reputationally (named-and-shamed in admin forums). Major crawlers such as Common Crawl's CCBot and Googlebot respect its allow/disallow rules.
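As a rough illustration of enforcement, Python's standard-library urllib.robotparser can answer both the "may I fetch this?" and the "how fast?" questions. The robots.txt body and the User-Agent string below are made up for the example; a real crawler would download the file from each host (for instance with set_url and read) and cache it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Hypothetical crawler identity with a contact URL.
USER_AGENT = "example-crawler/0.1 (+https://example.org/crawler-info)"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url):
    """True if the parsed rules permit USER_AGENT to fetch url."""
    return parser.can_fetch(USER_AGENT, url)


# Seconds requested between fetches, or None when no Crawl-delay applies.
delay = parser.crawl_delay(USER_AGENT)
```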
The polite-scraping rules of thumb:
Identify your crawler with a User-Agent that includes a contact URL or email.
Default to ≤1 request per second per host. Drop to 0.1-0.5 req/s for small sites.
Honor Crawl-Delay directives in robots.txt.
Back off exponentially on 429 (rate limited) and 503 (overloaded).
Don’t crawl during the site’s local business hours if avoidable for non-time-sensitive data.
These rules are how an unfunded crawler stays accessible. Sites that block CC are typically sites that have been hammered by impolite crawlers in the past.
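The per-host rate limit and the exponential back-off can be sketched as follows. HostThrottle and backoff_delay are hypothetical helpers, and the default values (1 second between requests, a 300 second cap) are assumptions chosen for illustration rather than recommendations from any standard.

```python
import random
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Tracks the earliest time each host may be contacted again."""

    def __init__(self, min_interval=1.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock
        self.next_allowed = {}

    def wait_time(self, url):
        """Seconds to sleep before fetching url; reserves the slot as a side effect."""
        host = urlsplit(url).netloc
        now = self.clock()
        ready_at = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = ready_at + self.min_interval
        return ready_at - now


def backoff_delay(attempt, base=1.0, cap=300.0, jitter=random.random):
    """Exponential back-off with random jitter, for retries after 429/503."""
    return min(cap, base * 2 ** attempt) * jitter()
```

A fetch loop would call time.sleep(throttle.wait_time(url)) before each request and time.sleep(backoff_delay(attempt)) before each retry. The jitter spreads retries out so that parallel workers do not all return to the same host at the same instant.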
JS rendering
Many major sites in 2026 are React/Vue/Svelte SPAs that render content client-side from API responses. A naive requests.get returns an empty <div id="root"> and no content. The fix is a headless browser (Playwright or Puppeteer driving Chromium) that runs the JS, waits for content, and dumps the rendered DOM.
The cost is roughly 10-50× per page, both in CPU/memory (Chromium per worker) and in latency (waiting for network + execution + idle detection). For a million-page crawl that’s the difference between an afternoon and a week, plus up to 50× the infrastructure bill.
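To make the afternoon-versus-week comparison concrete, here is the back-of-the-envelope arithmetic. The static throughput of 50 pages per second and the 30× slowdown are assumed figures picked from inside the range above, not measurements.

```python
def crawl_hours(pages, pages_per_second, slowdown=1.0):
    """Wall-clock hours to fetch `pages` at a given aggregate throughput."""
    return pages * slowdown / pages_per_second / 3600


# Assumed: 50 pages/s across all workers for static fetching.
static_hours = crawl_hours(1_000_000, pages_per_second=50)

# Assumed: headless rendering is 30x slower on the same hardware.
rendered_hours = crawl_hours(1_000_000, pages_per_second=50, slowdown=30)

print(f"static: {static_hours:.1f} h, rendered: {rendered_hours / 24:.1f} days")
```

Under these assumptions the static crawl takes about 5.6 hours and the rendered crawl about 6.9 days.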
Mitigations: (1) try the static HTML first; many SPAs server-render the first paint enough that you don’t need JS execution. (2) Hit the underlying API directly when you can reverse-engineer the XHR pattern — the JSON response is typically smaller and faster than the rendered HTML. (3) Use lightweight rendering proxies (Splash, ScrapingBee) that pool browser instances. (4) Sample-and-classify: render 1% to detect whether JS rendering is necessary for this host, then choose static or rendered for the bulk.
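Mitigations (1) and (4) both rest on one cheap test: does the static HTML already contain enough visible text? A minimal sketch using only the standard library is below. The function names and the 200-character threshold are assumptions; a real classifier would tune the threshold per host and probably run a proper extractor instead of counting characters.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text outside <script>, <style>, and <noscript> elements."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth:
            self.chunks.append(data)


def visible_text_length(html):
    """Number of visible characters after collapsing whitespace."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join("".join(parser.chunks).split()))


def needs_rendering(html, min_chars=200):
    """Heuristic: too little static text suggests the page is client-rendered."""
    return visible_text_length(html) < min_chars
```

For sample-and-classify, one would run needs_rendering on a small sample of statically fetched pages per host and route the host to the headless pool only if most of the sample comes back nearly empty.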
Common Crawl's crawler fetches static HTML and, as far as is publicly documented, does not execute JavaScript, so it misses most client-rendered content; “is this site CC-friendly?” reduces in practice to “does this site server-render enough text?”.
Inline deduplication
Inline dedup catches duplicates as they’re fetched and rejects them before they enter storage. The check is typically a content hash (SHA-256 of normalized text) plus a URL-canonicalization check: the same text under different URL parameters is a single document.
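A minimal sketch of both checks follows. The list of parameters to strip and the normalization rules (collapse whitespace, lowercase) are assumptions for illustration; real crawlers usually maintain per-site canonicalization rules and keep the seen sets in a shared store, not in process memory.

```python
import hashlib
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query parameters that do not change page content.
TRACKING_PARAMS = {"sessionid", "sid", "phpsessid", "utm_source", "utm_medium", "utm_campaign"}


def canonicalize_url(url):
    """Lowercase scheme and host, drop fragment and tracking params, sort the query."""
    parts = urlsplit(url)
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS
    )
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", urlencode(query), "")
    )


def content_hash(text):
    """SHA-256 of whitespace-collapsed, lowercased text."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class InlineDeduper:
    """Exact-duplicate filter: one set lookup per URL and one per fetched page."""

    def __init__(self):
        self.seen_urls = set()
        self.seen_hashes = set()

    def should_fetch(self, url):
        canonical = canonicalize_url(url)
        if canonical in self.seen_urls:
            return False
        self.seen_urls.add(canonical)
        return True

    def is_new_content(self, text):
        digest = content_hash(text)
        if digest in self.seen_hashes:
            return False
        self.seen_hashes.add(digest)
        return True
```

The URL check runs before the fetch and saves bandwidth; the content check runs after extraction and saves storage.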
Inline dedup runs at fetch time and is fast (one hash table lookup per page) but only catches exact duplicates. MinHash near-duplicate detection requires comparing each document against the corpus and is too expensive to run inline at high crawl rates; it’s a post-hoc batch step.
The split in production: inline catches the cheap exact duplicates (same content republished, session-IDed pagination, mirror sites) which are 60-80% of the duplicate volume; post-hoc MinHash catches the harder near-duplicates (paraphrased reposts, syndicated articles, boilerplate-shared template pages) which are 10-30%. Skipping either step means either wasting fetch capacity (no inline) or shipping a noisier corpus (no post-hoc).
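For the post-hoc step, the core of MinHash fits in a few lines of standard-library Python. This is a toy sketch of the signature and the similarity estimate only: the shingle size, the 64 hash functions, and the use of salted BLAKE2b as the hash family are all assumptions, and a production pipeline would add LSH banding so that documents are not compared pairwise.

```python
import hashlib


def shingles(text, k=5):
    """Set of overlapping k-word windows; short texts yield a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _hash64(shingle, seed):
    """One member of the hash family: 64-bit BLAKE2b salted with the seed."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(text, num_hashes=64, k=5):
    """For each hash function, keep the minimum hash over all shingles."""
    grams = shingles(text, k)
    return [min(_hash64(g, seed) for g in grams) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two documents whose estimate exceeds a chosen threshold are treated as near-duplicates; the threshold is a tuning decision that depends on the corpus.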
What scraping doesn’t solve
Web scraping retrieves what’s on the public web. It doesn’t solve licensed long-form (books, academic journals), behind-paywall content, real-user conversations, or proprietary internal corpora. Frontier labs supplement web scrapes with licensed deals (Reddit, AP, Stack Exchange, publishers) and proprietary synthetic data. The web is the floor, not the ceiling, of pretraining input.
Go further
When do I scrape myself vs. just use Common Crawl?
Use Common Crawl when you want a broad sample of the public web for pretraining or large-scale analysis. Scrape directly when you need (a) a specific site that CC undercrawls, (b) JS-rendered content CC misses, (c) authenticated content, (d) freshness CC can't deliver, or (e) targeted niche corpora (academic, legal, medical, internal docs).
Playwright or Puppeteer for JS-heavy sites; httpx + selectolax for static HTML at high throughput; trafilatura or resiliparse for boilerplate stripping. Scrapy remains the standard framework for medium-scale targeted crawls. For massive crawls, Apache Nutch or custom infrastructure on top of asyncio. Cloudflare workers and headless-browser farms (Browserless, ScrapingBee) for sites that aggressively block.
Respect robots.txt as a hard constraint, even when not legally required in your jurisdiction. Rate-limit per host (about 1 req/s by default, slower for small sites). Identify your crawler with a User-Agent that carries contact details. Don't bypass paywalls or login walls. For anything commercial, run terms of service past a lawyer; the legal landscape (hiQ v. LinkedIn, NYT v. OpenAI) is unsettled.