Data Parallelism

Also known as: DDP, DistributedDataParallel, data-parallel training

TL;DR

Replicate the model on every GPU, shard the batch across replicas, and synchronize gradients with an allreduce after each backward pass. The simplest distributed-training pattern, and the default for any model that fits on a single device.

Data parallelism is the simplest way to train a neural network on more than one GPU. Replicate the full model on each device, split the global batch across the replicas so each one processes a different shard, run forward and backward locally, and then synchronize gradients with an allreduce so every replica steps the optimizer with the same averaged gradient. After the step, all replicas hold identical weights again, ready for the next batch.

It is the default distributed-training pattern, the easiest to reason about, and the one most production training jobs reach for first. Its single hard constraint: each replica must be able to hold the full model, its gradients, and its optimizer state. When that stops being true, you graduate to FSDP or to tensor / pipeline parallelism.

The mechanical loop

Each step on a data-parallel rank does the same work as single-GPU training plus one collective:

Pull a per-rank shard of the global batch from the input pipeline.
Forward pass through the local replica to get a per-rank loss.
Backward pass with backpropagation , accumulating per-rank gradients.
Allreduce the gradient tensors across all ranks. Every rank ends up holding the averaged gradient.
Optimizer step using that averaged gradient. Because the pre-step weights and the gradient are identical across ranks, the post-step weights are too.

The only network traffic is the gradient allreduce. There is no parameter or activation traffic, which is what makes DP both simple and bandwidth-efficient relative to the alternatives.

You could in principle have every rank send its raw gradient to every other rank and average locally — that is just an all-gather followed by a mean. Allreduce fuses the two into one collective with provably lower bandwidth. The dominant implementation is ring-allreduce: each of ranks sends and receives parameter-bytes total, which is asymptotically optimal for a homogeneous topology.

NCCL on NVLink-connected GPUs hits close to peak NVLink bandwidth on this collective. Across a cross-node fabric (InfiniBand, RoCE), it is still close to optimal but the absolute bandwidth is much lower, which is why scaling DP from 8 GPUs (one node) to 64 GPUs (eight nodes) usually shows a step-down in scaling efficiency.

What “global batch” actually means

In single-GPU training, batch size is unambiguous: the number of samples per optimizer step. With data parallelism, three numbers are floating around:

Per-GPU batch. What each replica’s forward pass sees.
Global batch. Per-GPU batch x number of replicas x gradient-accumulation steps. This is what the optimizer actually steps on.
Microbatch. When pipeline parallelism enters the picture, an additional inner split. Not relevant for pure DP.

Every published batch-size in a training paper refers to the global batch. Hyperparameter transfer rules — linear LR scaling, square-root scaling, warmup length — also key off global batch. Confusing per-GPU and global batch is the single most common reason a multi-GPU run “works but underperforms” relative to its single-GPU control.

Where DP wins

When data parallelism is the right tool

Model fits comfortably on one GPU (under ~30B params at bf16 with Adam, or proportionally smaller with sharding-free optimizer states).
You want to scale throughput, not unlock new model sizes.
Your compute is intra-node (NVLink) — gradient allreduce is cheap there.
You are fine-tuning a small or medium model and want a stable baseline before reaching for FSDP.

For these cases, PyTorch’s DistributedDataParallel and JAX’s pmap / pjit over the batch axis are the obvious defaults. They overlap gradient communication with backward computation (DDP buckets gradients and fires off the allreduce as soon as a bucket is full), so the comms cost is largely hidden behind backward FLOPs at moderate scale.

When DP is the wrong tool

The other failure mode is cross-node bandwidth. If you are spread across a fabric that is much slower than NVLink (PCIe Gen4 between nodes, or any non-RDMA Ethernet), the gradient allreduce eventually dominates step time and scaling efficiency falls below 70%. At that point sharding strategies that trade comms for memory — FSDP with hybrid-shard, ZeRO-3 — pay off even before you actually need their memory savings.

Data parallelism is not exciting, and that is the point. It is the well-understood floor that everything else in the parallelism stack composes on top of.

Go further

Why does data parallelism stop working at frontier scale?

Each rank holds a full copy of parameters, gradients, and optimizer state. For a 70B model in mixed precision with Adam, that is roughly 70B params plus 70B grads plus 140B optimizer state — about 1.4 TB at fp32 master weights. No single GPU has that. FSDP and tensor / pipeline parallelism exist to shard those copies across devices.

FSDP Mixed-precision training

What does the allreduce actually cost?

Per step, every rank ships its full gradient tensor and receives the averaged result. Volume is roughly 2 x (1 - 1/N) x parameter_bytes per rank with ring-allreduce — close to 2 x param_bytes for large N. On a 7B model that is ~28 GB of comms per step. NVLink within a node hides it; cross-node PCIe / InfiniBand often does not, which is why intra-node DP scales cleaner than cross-node.

Throughput

How does the global batch size relate to per-GPU batch?

Global batch = per-GPU batch x number of replicas x gradient-accumulation steps. The effective learning rate usually scales with global batch (linear scaling rule, sqrt scaling, or LARS / LAMB at very large scale). Pretending global batch is the same as per-GPU batch is the most common DP debugging mistake.

Batch size Learning rate

← All concepts

The best AI teams build with ZeroEntropy models

Book Demo View docs