Distributed Data Processing

Also known as: Spark, Ray Data, Apache Beam, distributed ETL

TL;DR

Spark, Ray Data, Beam, Dask — the frameworks that turn N nodes into one logical compute. The map-reduce mental model still rules: per-partition compute is free, cross-partition compute requires a shuffle, and shuffle is the dragon.

Distributed data processing is what you do when one machine can’t finish the job in a reasonable time, even with all the chunking and streaming tricks of single-node engineering . The frameworks — Spark, Ray Data, Apache Beam, Dask — turn N nodes into one logical compute, with one core abstraction: the partitioned dataset. The mental shift from local code is that operations split cleanly into two categories — those that run free in parallel, and those that pay for data movement.

Compute follows data. Data is sharded for parallel reads. The shuffle is the only operation that breaks both rules — and dominates the cost of any non-trivial workload.

The map-reduce mental model

Every modern distributed engine — Spark, Ray Data, Beam, Dask, even DuckDB’s distributed mode — is at heart a map-reduce engine with better ergonomics. The two primitives that matter:

The two operations

Per-partition compute (map, filter, project). Each worker reads its partition, runs the transform on its rows, writes its partition. No data moves between workers. Bandwidth scales linearly with cluster size.
Cross-partition compute (shuffle, reduce, join, group-by). Each worker must see rows from other workers’ partitions to produce its output. Data moves over the network. The shuffle is the dominant cost.

The framework’s job is to turn your high-level code (df.groupBy("url").count()) into a DAG of these primitives, schedule them, handle failures, and write the result. Your job is to write the code such that the DAG has as few shuffles as possible and each per-partition computation is reasonably balanced.

Shuffle is the dragon

Why shuffle is so expensive, mechanically:

A non-shuffle stage transfers bytes (everyone reads their partition). On a cluster of workers, each worker reads bytes from storage. Linear scaling.
A shuffle stage transfers bytes across the network: every row goes from the worker that produced it to the worker that owns its destination key. Bandwidth is now bottlenecked by the cluster’s bisection bandwidth (the slowest link between any two halves of the cluster).
On a 100-node cluster with 25 Gbps NICs, shuffling 10TB is ~30 minutes of pure network time, before any compute. Real shuffles also serialize, hash, sort, and merge — all of which add CPU and disk I/O on top.

The pragmatic version: a stage with no shuffle finishes in minutes per TB. A stage with one shuffle finishes in tens of minutes. A query with three shuffles can run for hours on the same data.

Data locality and partition independence

Two ideas underlie most of the practical guidance for distributed processing.

Compute goes to data, not the reverse

In single-node code, you load data into memory and run code over it. In distributed code, the data is already on disk somewhere — typically Parquet shards on S3 or HDFS — and you ship code to the workers that have efficient access to those shards. Spark, Ray, and Beam all schedule tasks preferentially onto workers that already have the relevant data cached locally; this is the whole reason a cluster has any point at all.

The corollary: every transform should be expressible as “read these shards, compute, write these shards.” If your transform requires data from elsewhere, that’s a shuffle, and it costs.

Partition independence is free; cross-partition is expensive

If your computation can be expressed as “do X to each row independently” (map, filter, select, withColumn), it parallelizes across partitions for free. If it can be expressed as “do X to each partition independently” (mapPartitions), same thing — even if X is stateful within a partition.

Anything that requires comparing rows from different partitions — joins, group-bys, sorts, distinct, top-k — needs a shuffle to bring the relevant rows together first. The framework handles the mechanics; you pay the cost.

What's free vs what shuffles

Free: filter(text_length > 100), withColumn("score", classifier.predict(text)), select("text", "url"), per-partition tokenization, per-partition PII redaction, per-partition language detection.
Shuffles: join(other, on="url"), groupBy("domain").agg(count()), orderBy("score").limit(1000), dropDuplicates(subset=["minhash"]), computing global percentiles, building an inverted index.

Partitioning strategies

The partition key — what determines which rows go to which worker — is the most important tuning knob in distributed processing. Get it right and your shuffles are minimal; get it wrong and every transform is a 10TB shuffle.

Partitioning patterns that earn their keep

Hash partitioning by the join key. If you’ll join documents with url_metadata on url, write both tables hash-partitioned on url. The join becomes a per-partition operation — no shuffle.
Range partitioning on a sort key. For top-k, percentile, or sorted-output workloads, partition by ranges of the sort key. Each partition handles a range; merging is cheap.
Domain partitioning for dedup. For per-domain dedup of Common Crawl , partition by domain hash. Most duplicates within a domain end up co-located; per-partition dedup catches them with no shuffle.
Time partitioning for streaming. For data that arrives over time (logs, events, crawl snapshots), partition by date. New data only writes to recent partitions; old partitions are immutable.

The flip side: bad partition keys produce skew — one partition has 90% of the data, one worker carries the whole job, the cluster idles. Skew is the second-most-common reason production distributed jobs underperform their theoretical capacity. Mitigations include salting (append a small random suffix to high-cardinality keys, then aggregate twice), per-partition sub-sampling, and explicit re-partitioning before skew-prone stages.

The major frameworks

Spark

The senior framework, dating to 2009. Optimizes SQL-shaped ETL over petabyte data. Strong query planner (Catalyst), columnar execution (Tungsten + Photon), tight Parquet integration, mature shuffle infrastructure. Weak points: per-row Python UDFs are slow (PySpark serializes to and from JVM); GPU support is bolted on; the cluster manager (YARN/Mesos/Kubernetes) is operationally heavy.

Use Spark when your workload is “SELECT, JOIN, GROUP BY, write Parquet” at petabyte scale. FineWeb’s curation pipeline , RedPajama’s, and most enterprise data warehouses run on Spark. Databricks is the managed flavor; Delta Lake is the standard storage layer.

Ray Data

Ray Data is the youngest of the four (~2021). Built on top of Ray’s actor framework, it treats tasks as first-class Python with a GPU-aware scheduler. Per-row Python is fast (no JVM round-trip), GPU stages compose naturally with CPU stages, and the actor model handles streaming inference cleanly.

Use Ray Data when your workload includes heavy Python, custom models, or GPU inference inside the data pipeline — model-based quality classification over streaming Parquet , batch embedding generation, custom tokenization with Hugging Face transformers. Anyscale is the managed flavor; the OSS Ray cluster is also widely used.

Apache Beam (Dataflow)

Beam is a unified programming model for batch and streaming. The same pipeline runs on Dataflow (Google Cloud), Flink, Spark Runner, or others — Beam abstracts the engine. Powerful for streaming-heavy workloads with windowing and late-data handling.

Use Beam when you’re already on GCP (Dataflow is the managed runner and is very good), when you need unified batch + streaming semantics, or when your team values portability across runners. Less common outside Google’s ecosystem.

Dask

Dask is “NumPy and Pandas, distributed.” If your code is pandas.DataFrame.apply(...) and you want to run it on a 1-10TB dataset across a small cluster, Dask is the lowest-effort answer. Lazy evaluation, similar API surface, and decent integration with the PyData stack.

Use Dask when the data fits a small cluster (1-10TB), the workload is array/dataframe-shaped, and you want minimal API rewrite from single-node Python. Coiled is the managed flavor. Past ~10TB, Dask’s overheads (Python-driven scheduler, fewer query optimizations) start to hurt and Spark or Ray pulls ahead.

Practical examples at scale

Common workloads and where the shuffles live

Deduplicating Common Crawl across petabytes. Hash-partition by document MinHash band. Within each partition, find documents sharing band signatures. The shuffle is once (assigning each document to its bands’ partitions); per-partition comparisons are local. FineWeb’s MinHash dedup runs roughly this way on Spark.
Computing token counts on a sharded pretraining corpus. Per-partition map to tokenize and count; per-partition aggregate; final shuffle to a single partition for the global sum. Two stages, only the last is a (cheap, small) shuffle.
Building inverted indexes (term → docs). This is map + reduce in the textbook sense: per-partition emit (term, doc_id) pairs; shuffle by term; per-term aggregate doc_ids. The shuffle is exactly the size of the term-document matrix, which is enormous — inverted-index construction is a famously shuffle-heavy workload.
Per-domain quality scoring of a web crawl. Partition by domain hash; per-partition fit a domain-specific model or apply heuristics. No shuffle if data is already domain-partitioned at write time. This is why Parquet with domain-partitioned writes pays off — every downstream per-domain analysis becomes shuffle-free.

A join takes two tables A and B and produces rows where A.k = B.k. To compute it, every row of A with key k must end up co-located with every row of B with key k. If A and B aren’t already partitioned by k, the engine has to shuffle them — hash both tables on k, send rows to the worker that owns their hash bucket, then run a per-partition local join.

Two optimizations matter:

Broadcast join. If one table is small (rule of thumb: under a few GB), don’t shuffle it — broadcast the whole table to every worker and stream the large table past it. No shuffle of the large table. Enormously faster when applicable.
Co-located join. If both tables were written hash-partitioned by k, the partitions are already aligned; the join is local. This is why every mature data lake has tables physically partitioned by their dominant join key.

Both tricks are about avoiding the shuffle. Most “tune your Spark job” advice ultimately reduces to one of these two ideas.

Modern distributed engines all use lineage-based recovery: the framework tracks which partitions of which intermediate datasets each worker produced, and re-runs the failed task on a different worker. The DAG is the recovery plan — if task 4827 of stage 3 fails, the framework re-runs the chain read_input(shard_4827) → filter → tokenize → write_output(shard_4827) on a healthy worker.

This works because stages are partition-independent: re-running task 4827 doesn’t require re-running tasks 4826 or 4828. The exception is shuffle stages, where a failure can require re-shuffling some upstream data. Spark’s adaptive query execution and Ray’s lineage-based fault tolerance both add sophistication around this — for shuffle-heavy workloads, the framework persists shuffle outputs to disk so a downstream failure doesn’t re-trigger the whole shuffle.

The practical implication: at multi-TB scale, you cannot afford to write transforms that aren’t deterministic given their inputs. Non-determinism breaks lineage recovery — the re-run produces different output than the original. The whole framework’s reliability rests on idempotent, deterministic per-partition compute.

What you internalize

Distributed data processing rewards a specific way of thinking. Three habits separate engineers who ship at petabyte scale from those who don’t:

You read code by counting shuffles. Look at a Spark or Ray pipeline; identify each groupBy, join, orderBy, distinct. That’s your performance budget. Anything else is free.
You partition for the dominant query. Write data once, with a partition key that minimizes shuffle for the most common downstream access pattern. Re-partitioning on the fly is what slow pipelines do.
You measure stage time, not job time. A 4-hour job with one 3-hour stage is fixable; a 4-hour job with a hundred 2-minute stages is harder. Drill into the slowest stage; almost always it’s a shuffle, and almost always it’s because of a bad partition key or skew.

Combined with streaming reads and columnar storage , distributed data processing is what makes Common Crawl -scale work possible. None of the open pretraining ecosystem — FineWeb, RedPajama, DCLM — exists without these frameworks doing the heavy lifting in the background.

Go further

Spark vs Ray Data vs Beam vs Dask — which one when?

Spark for SQL-shaped ETL on petabytes — joins, group-bys, aggregations, columnar transforms. Ray Data when your stages include heavy Python (per-row models, GPU inference, custom tokenization) — the scheduler is GPU-aware and tasks are first-class Python. Beam (via Dataflow on GCP) when you want unified batch + streaming and you're already in Google's ecosystem. Dask when you want a NumPy/Pandas API across a small cluster and your data fits 1-10TB. Polars (single-node, multi-core) when you can fit on one big box — usually faster than any distributed framework below ~1TB.

Data engineering at scale Parquet

Why is shuffle so expensive specifically?

Shuffle is the only operation that requires every worker to send data to every other worker over the network. A non-shuffle stage reads its partition, computes locally, and writes its partition — bandwidth is data_size. A shuffle stage reads its partition, sends each row to the worker owning that row's hash bucket, receives rows from every other worker, and writes its partition — bandwidth is data_size × N and the network becomes the bottleneck. On a 100-node cluster with 25 Gbps NICs, shuffling 10TB takes ~30 minutes of pure network time even with no compute. Joins, group-bys, sorts, and dedup all shuffle.

Deduplication Data engineering at scale

What's the difference between data parallelism and task parallelism here?

Data parallelism: same code, different partitions of the data, run in parallel. Most distributed processing is this — every worker runs the filter/map/aggregate code on its slice. Task parallelism: different code, possibly with dependencies, run as a DAG. Both Spark and Ray support both, but the distinction matters because the unit of failure recovery is different: data-parallel tasks can be retried in isolation; task-parallel DAGs require lineage tracking to know what to re-run when an upstream dependency fails. Modern engines do both, with shuffle as the boundary between data-parallel stages.

Streaming datasets Common Crawl

← All concepts

The best AI teams build with ZeroEntropy models

Book Demo View docs