Data Engineering at Scale

Q: Where exactly does the Pandas-on-a-laptop workflow break?

Around 10-50GB of input. Below that, pd.read_csv and df.to_parquet are fine. Between 50GB and a TB you can sometimes survive with chunked reads, careful dtype management, and a beefy box — but every iteration takes minutes and a single OOM crashes the whole script. Above ~1TB it stops working entirely: data doesn't fit on local disk, single-pass scans take hours, and one bad row in row 4 billion takes down a 6-hour job. That's the threshold where you have to switch to columnar storage on object stores plus a distributed engine.

Also known as: multi-terabyte data engineering, TB-scale data work

TL;DR

What changes when your dataset doesn't fit on one machine, doesn't fit in RAM, and takes hours per pass. The thresholds where ad-hoc Python stops working: single-file → sharded, RAM → streaming, single-node → distributed.

Most data work taught in tutorials and bootcamps tops out at “load a CSV into Pandas, run some transforms, save it back.” That works up to maybe 10GB. At multi-TB and beyond — the regime where Common Crawl , FineWeb , and any serious pretraining corpus live — almost every assumption in that workflow becomes wrong. This article is about the thresholds where things break and what replaces them.

The gap between someone who’s done a Kaggle dataset and someone who’s processed Common Crawl is bigger than the gap between an undergrad and a senior engineer.

The thresholds where the rules change

There are four discontinuities that matter, and each one forces a different rewrite.

The four scale transitions

Single file → sharded. Around 10GB of input, “the dataset” stops being one file and becomes a directory of shards. You stop saying pd.read_csv(path) and start saying “iterate over the parts.” Every transform must be expressible per-shard, because no single machine reads the whole thing.
RAM → streaming. Around the size of your largest box’s memory (say 256GB-1TB), you can’t df = load_everything() anymore. Transforms become iterator pipelines: read a chunk, transform, write a chunk, drop the chunk from memory. Anything that needs the full dataset in memory at once — sort, dedup, large joins — becomes a multi-pass external operation.
Single node → distributed. Around 1-10TB, even chunked single-machine processing takes too long per pass. You add a cluster and a framework — Spark, Ray, Beam, Dask — that schedules work across nodes. Now every transform must also tolerate machines failing mid-job.
Ad-hoc → reproducible. Around the point where one full pass takes more than an hour, ad-hoc scripts become untenable. A bug found 5 hours into a 6-hour job costs you the day. You move to orchestrated pipelines with idempotent steps, deterministic outputs, and shard-level checkpointing.

These are not optional refactors you do for elegance. Each one is forced on you by an operational failure: an OOM, a 12-hour rerun, a crashed coordinator, an S3 bill that arrives like a punch.

Pain points that simply don’t exist at small scale

The small-file problem

Object stores (S3, GCS, Azure Blob) charge per request and have ~100ms of round-trip latency. A directory of one million 10KB JSON files is technically the same data as one Parquet file of 10GB. Operationally they’re not even close — listing a million files takes minutes, and reading them takes hours of wall-clock dominated by request overhead, not transfer.

The fix is to write data in shards sized for the read pattern: typically 100MB-1GB per file, in a columnar format. Compaction jobs that merge small writes into large shards are standard infrastructure on every mature data lake. Parquet with row groups around 128MB is the canonical answer.

Schema drift across petabytes

When you have a small dataset, you can eyeball its schema and re-derive the parser if anything changes. At multi-TB scale, the corpus accumulates over months from multiple sources, and the schema drifts: a field that was always int64 shows up as string in last week’s shard, a nullable column appears in the middle of a year-long crawl, a renamed field breaks every downstream consumer. You don’t notice until job 47 of 50 fails six hours in.

The defenses are: pin schemas explicitly at write time (pyarrow.schema), validate at read time, and treat schema evolution as a versioned, planned operation rather than a happy accident. Iceberg and Delta Lake exist primarily to make schema evolution transactional and queryable; without them, schema drift on a multi-PB corpus eats engineering quarters.

One bad row crashes a 6-hour job

A single malformed row in row 3.7 billion will, if your pipeline raises on parse error, take down the entire job. You’ll find out 5h 50m into a 6h run. There is no useful error message because the failure is “row index that doesn’t fit on a screen, in shard 8203 of 10000.”

Per-row processing is doomed

A loop of for row in dataset: do_something(row) cannot keep up with TB-scale data. The Python interpreter overhead alone — function call, attribute access, GC — is on the order of microseconds per row, which gives you ~1M rows/sec on a single core. A 10B-row corpus is a CPU-week per pass per worker.

The mental shift is micro-batching: every transform takes a chunk of N rows (typically 10K-100K) and processes them as a vectorized operation. PyArrow batches, Pandas DataFrames, NumPy arrays, Polars frames — pick one and commit. Per-row UDFs in Spark or Beam are the same anti-pattern in distributed clothing; they kill throughput by 10-100x compared to vectorized equivalents.

Deterministic shuffling at scale

A small dataset shuffles with np.random.shuffle(arr). A multi-TB dataset cannot — it doesn’t fit in memory. You shuffle by reordering shards (cheap) and reading shards in random order (cheap), with optional per-shard reshuffling on load (cheaper than global). Truly global shuffle requires writing every output shard from samples drawn from every input shard, which is a full pass with I/O — not free.

The deterministic part matters: training a model on a corpus needs the same shuffle seed across runs for reproducibility, but the seed must select a sequence of shard indices, not a row-level permutation. Streaming dataset formats (Mosaic StreamingDataset, WebDataset) bake this in.

Distributed deduplication

Dedup at single-machine scale is a hash set — straightforward. Dedup across petabytes is a distributed-systems problem. You can’t build a single hash table over 10B documents (it doesn’t fit on one machine), so you partition by hash prefix, shuffle each document to its partition, and dedup within partitions in parallel. MinHash-LSH adds another layer: you bucket documents by signature bands, shuffle each band to its bucket, and compare within. The shuffle is the dominant cost — see the distributed processing piece on why shuffle is the dragon.

Two reasons, both load-bearing for ML.

First, reproducibility. If you train two model runs on “the same corpus” but with different shuffle orders, the gradients see slightly different curricula and the resulting models differ in subtle ways — sometimes by 0.5-1 point on benchmarks. To compare model architectures or hyperparameters, you have to fix the data order. That requires a shuffle algorithm whose only randomness is a pinned seed, not “whatever order S3 LIST returned today.”

Second, training-time efficiency. Random-access shuffling on object storage is murder — every batch becomes N random GETs of small ranges from large files, which is the slowest possible read pattern. The standard answer is two-level shuffling: a global shard-level shuffle (cheap, just reorder which shards each worker reads) plus a per-shard buffer shuffle (cheap, in-memory ring buffer of a few thousand examples). The combination approximates a global shuffle with sequential reads, and it’s deterministic given the seed. Mosaic StreamingDataset, WebDataset, and Hugging Face’s IterableDataset all use variants of this.

Pipelines, not scripts

At small scale, “the data preparation” is one script you re-run when something changes. At scale, this collapses: one full pass takes hours, costs money, and exposes 10 different ways to fail. You move to a pipeline — a DAG of idempotent steps with explicit inputs, outputs, and checkpoints.

What a real production pipeline includes

Idempotent transforms. Each step takes inputs and writes outputs deterministically; rerunning a step with the same inputs produces the same outputs. A failed step doesn’t corrupt anything; you just rerun it.
Shard-level checkpointing. Each step writes outputs as named shards (part-00482.parquet); on restart, completed shards are skipped. The unit of recovery is a shard, not the whole job.
Manifest tracking. Every output shard is recorded with its inputs, code version, and content hash. Schema and lineage are queryable, not folklore.
Resource-aware scheduling. Dagster/Airflow/Beam/Argo schedule tasks to clusters with explicit memory, CPU, and concurrency budgets. A runaway transform doesn’t take down its neighbors.
Backfills and replays. When you fix a bug in step 4 of an 8-step pipeline, you re-run from step 4 forward — not from step 1. Lineage tracking makes this safe.

This is what teams mean by “data infrastructure.” It’s not a glamorous part of an ML org, but it’s the substrate that makes everything else reproducible.

The opinionated stack

There is no neutral choice here. The honest version of “what should I use” in 2026:

Tools that earn their keep at multi-TB and up

Storage. S3 or GCS for object storage. Parquet with zstd compression, sorted by your most-frequent filter column, row groups around 128MB. Iceberg or Delta Lake on top if you need transactions, schema evolution, or time travel.
Compute. Spark for SQL-heavy ETL on petabyte data; Ray Data for ML-shaped workloads (per-row Python, GPU stages); Beam/Dataflow for streaming + batch unification on Google Cloud; Dask for “I want a NumPy/Pandas API across a cluster.” Polars for single-node multi-core when the data fits one big box. See distributed data processing .
Orchestration. Dagster for asset-centric pipelines with strong typing and lineage; Airflow when you’re stuck with it; Argo Workflows if you’re already on Kubernetes. Pipelines should be code, in a repo, with tests.
Streaming reads for training. Mosaic StreamingDataset for ML training-grade reads; WebDataset for tar-shard CV/ASR pipelines; Hugging Face IterableDataset for everything else. See streaming datasets .

The honest framing

The mental model for small-scale data engineering is “I have a dataset and I write code that operates on it.” The mental model at multi-TB and up is “I run a long-lived service that produces a dataset, and the service must tolerate every failure mode of distributed systems.” The skill set required is closer to building a search engine than to writing a Pandas tutorial.

There’s a cottage industry of tools that promise to “scale Pandas to TB” or “make Spark feel like Pandas.” They mostly underdeliver because the underlying problem — distributed systems with partial failures, expensive shuffles, schema drift, and cost-per-GB economics — isn’t a syntax problem. It’s a mental-model problem. The teams that ship at this scale are the ones that internalize the new model rather than fighting to preserve the old one.

Go further

Where exactly does the Pandas-on-a-laptop workflow break?

Around 10-50GB of input. Below that, pd.read_csv and df.to_parquet are fine. Between 50GB and a TB you can sometimes survive with chunked reads, careful dtype management, and a beefy box — but every iteration takes minutes and a single OOM crashes the whole script. Above ~1TB it stops working entirely: data doesn't fit on local disk, single-pass scans take hours, and one bad row in row 4 billion takes down a 6-hour job. That's the threshold where you have to switch to columnar storage on object stores plus a distributed engine.

Parquet Distributed data processing

Why is the small-file problem such a big deal on S3 and GCS?

Object stores charge per request and have ~50-200ms of round-trip latency per GET. A million 10KB files take a million GETs; that's both expensive (a few dollars per LIST+GET pass) and slow (hours of wall-clock just to enumerate). The same data in 1000 files of 10MB each reads in seconds. The fix is compaction jobs that periodically merge small files into larger Parquet shards — it's one of the most common pieces of infrastructure on any mature data lake.

Parquet Streaming datasets

How do you make a multi-hour job recoverable?

Checkpoint at shard granularity. Instead of one job that processes 10,000 files, run 10,000 tasks each writing one output shard with a deterministic name like part-{shard_id}.parquet. On restart, list completed outputs and skip those shards. Combined with idempotent transforms — same input always produces the same output — this means a transient S3 hiccup costs you a single shard's work, not a 6-hour job. Dagster, Airflow, and Beam all build around this pattern.

Distributed data processing Common Crawl

← All concepts

The best AI teams build with ZeroEntropy models

Book Demo View docs