Gradient Accumulation

Also known as: accumulation steps, micro-batch training, grad accumulation

TL;DR

Run multiple micro-batches sequentially, summing their gradients into a buffer, before applying a single optimizer step. Lets you simulate a large effective batch on memory-constrained hardware. The standard trick for hitting target batch sizes when a single batch won't fit on the GPU.

Gradient accumulation runs K micro-batches sequentially through the forward and backward pass, summing gradients into a buffer, and only steps the optimizer at the end. The result is mathematically equivalent (for plain SGD, exactly; for Adam-class optimizers, near-exactly) to training on one batch K times the micro-batch size, at roughly 1/K of that batch's peak activation memory. It is the standard workaround when the batch size a recipe wants is larger than the one the GPU can physically hold.

The mechanic

The loop is a four-line modification of the single-step training loop:

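optimizer.zero_grad()                # start from empty .grad buffers
for k in range(K):                   # K micro-batches per optimizer step
    loss_k = forward(batch_k) / K    # batch_k is the k-th micro-batch; /K keeps the sum an average
    loss_k.backward()                # adds into .grad buffers
optimizer.step()                     # one update for all K micro-batches
optimizer.zero_grad()                # clear the buffers for the next step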

The /K keeps the gradient an average over the effective batch, which is what most optimizers’ learning-rate calibration assumes. loss_k.backward() adds into the existing .grad tensors via PyTorch’s accumulator; optimizer.step() and zero_grad() fire once per K micro-batches rather than once per micro-batch. That is the entire trick.
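As a sanity check on the /K scaling, here is a minimal sketch in plain Python rather than PyTorch. It assumes a one-parameter least-squares model with a hand-derived gradient and equal-sized micro-batches; `grad_mse`, `accumulated_grad`, and the toy data are illustrative names and values, not part of the loop above.

```python
# Illustrative only: a 1-parameter model y = w * x with mean-squared-error loss.
# The gradient is derived by hand so the example needs no autograd library.

def grad_mse(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over the given examples."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, K):
    """Split the batch into K equal micro-batches and accumulate grad / K."""
    assert len(xs) % K == 0, "sketch assumes equal-sized micro-batches"
    b = len(xs) // K
    buffer = 0.0                              # plays the role of the .grad buffer
    for k in range(K):
        mx, my = xs[k * b:(k + 1) * b], ys[k * b:(k + 1) * b]
        buffer += grad_mse(w, mx, my) / K     # same "/ K" scaling as the loop
    return buffer

xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0, 0.25, -2.0]
ys = [1.0, -2.5, 4.5, 2.0, -1.0, 6.5, 0.0, -4.0]
w = 0.7

full = grad_mse(w, xs, ys)                    # one big batch of 8 examples
accum = accumulated_grad(w, xs, ys, K=4)      # 4 micro-batches of 2 examples
print(full, accum)                            # equal up to floating-point rounding
```

If the micro-batches were unequal in size, dividing each mean loss by K would no longer reproduce the full-batch mean, which is why the sketch asserts equal sizes.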

Why this matters

Gradient accumulation lets you hit a target effective batch size without buying more GPU memory. The trade is wallclock for memory: K micro-batches run one after another, so an optimizer step takes roughly K times longer than a step on a single micro-batch.

The peak memory saved is the activation memory of the K - 1 micro-batches you no longer need to hold simultaneously. Parameters, gradients, and optimizer state are unchanged. So accumulation buys headroom on activations specifically, which are the dominant memory cost for long sequences in transformer training.
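The accounting can be sketched as a few lines of arithmetic. Everything here is an assumption made for illustration: the function `peak_memory_gb`, the linear activation model, and all the numbers are placeholders, not measurements from any real model.

```python
# Hypothetical accounting sketch: every number below is a placeholder.

def peak_memory_gb(static_gb, activation_gb_per_sample, micro_batch_size):
    """Static memory (parameters + gradients + optimizer state) plus the
    activations of the one micro-batch that is live at a time."""
    return static_gb + activation_gb_per_sample * micro_batch_size

STATIC_GB = 40.0          # assumed: params + grads + optimizer state
ACT_GB_PER_SAMPLE = 2.0   # assumed: activation memory per sequence
EFFECTIVE_BATCH = 32      # assumed target batch
K = 8                     # accumulation steps

one_big_batch = peak_memory_gb(STATIC_GB, ACT_GB_PER_SAMPLE, EFFECTIVE_BATCH)
with_accumulation = peak_memory_gb(STATIC_GB, ACT_GB_PER_SAMPLE, EFFECTIVE_BATCH // K)

saved = one_big_batch - with_accumulation
print(one_big_batch, with_accumulation, saved)
```

Under these assumptions the static term is identical in both cases, and the saving equals the activations of the K - 1 micro-batches that are no longer resident.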

The interaction with parallelism

Effective batch size formula across stacks
  • Single GPU. effective batch = accumulation_steps × micro_batch_size.
  • Data parallel (N GPUs). effective batch = N × accumulation_steps × micro_batch_size. Each rank accumulates locally; the gradient allreduce fires once per optimizer step, not once per micro-batch.
  • FSDP / ZeRO-3. Same formula: sharding splits parameters and gradients across ranks, but the batch math is identical to DP.
  • Pretraining at scale. Llama-3 70B trained at ~16M-token effective batches with ~1024 GPUs in configuration, achieved via thousands of ranks × small accumulation × ~8K-token micro-sequences.
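The formulas in the list above reduce to one product. A minimal sketch, where the function name `effective_batch_size` and its parameter names are illustrative choices rather than any library's API:

```python
def effective_batch_size(micro_batch_size, accumulation_steps, data_parallel_size=1):
    """Examples consumed per optimizer step across all data-parallel ranks."""
    return data_parallel_size * accumulation_steps * micro_batch_size

# Two ways to reach the same effective batch (example values are arbitrary).
single_gpu = effective_batch_size(micro_batch_size=8, accumulation_steps=32)
data_parallel = effective_batch_size(micro_batch_size=8, accumulation_steps=4,
                                     data_parallel_size=8)
print(single_gpu, data_parallel)  # 256 256
```

The single-GPU case is just `data_parallel_size=1`, which is why the same function covers DP and sharded setups.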

The micro-batch vs. accumulation trade-off

Two knobs control the same target batch, and they trade differently:

  • Bigger micro-batch → better GPU utilization. More arithmetic intensity per kernel launch, better tensor-core fill. But each added sequence grows the activation footprint roughly linearly, which is what matters at the HBM level.
  • More accumulation steps → less per-step memory but worse GPU utilization. Each micro-step pays the kernel-launch and gradient-add overhead.

Pick the largest micro-batch that fits under ~80% memory utilization (leaving headroom for fragmentation), and use accumulation to close the gap to the target.

Modern recipes (instruction tuning, RLHF rollouts, DPO) target effective batches of 256-1024 examples, well above what naively fits. On a single 80 GB H100 with a 7B model and ~4K context, a micro-batch of 4-8 sequences is typically the ceiling; accumulating over 32-256 micro-batches reaches the target. Without gradient accumulation, the only path to that effective batch is multi-GPU, which adds considerable operational complexity to a job that otherwise fits on one box. This is why every fine-tuning library exposes an accumulation_steps flag by default.
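The 32-256 range follows directly from dividing the target batch by the micro-batch ceiling. A small sketch of that arithmetic, with `accumulation_steps_for` as a hypothetical helper name:

```python
import math

def accumulation_steps_for(target_batch, micro_batch_size, data_parallel_size=1):
    """Smallest number of accumulation steps that reaches at least target_batch."""
    per_micro_step = micro_batch_size * data_parallel_size
    return math.ceil(target_batch / per_micro_step)

# Corner cases of the single-GPU example: targets 256-1024, micro-batches 4-8.
fewest = accumulation_steps_for(target_batch=256, micro_batch_size=8)
most = accumulation_steps_for(target_batch=1024, micro_batch_size=4)
print(fewest, most)  # 32 256
```

Rounding up means the effective batch can slightly overshoot the target when the micro-batch size does not divide it evenly; whether that is acceptable depends on the recipe.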

Accumulation and mixed precision work together cleanly in 2026. The accumulation buffer lives in fp32 (matching the optimizer’s fp32 master weights), and each micro-batch’s bf16 gradient is upcast and added on the fly. The historical gotcha was fp16-specific: the loss has to be divided by K before loss scaling is applied, or the scaled gradients can overflow during the per-micro-batch backward pass. With bf16 there is no loss scaling and no such gotcha; mixed precision and accumulation simply compose, and the combined memory savings are why most fine-tuning recipes use both.
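To see why the accumulation buffer is kept in a wider format than the per-micro-batch gradients, here is a toy illustration using only the standard library. It is a sketch under loud assumptions: fp16 (via `struct`'s half-precision format) stands in for a low-precision buffer, Python's float64 stands in for the fp32 buffer, and the "gradients" are 4096 identical made-up values rather than anything from a real model.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

# Assumed toy data: many identical small per-micro-batch gradient values.
grad = to_fp16(1e-4)
num_micro_batches = 4096

narrow_total = 0.0   # buffer rounded back to fp16 after every add
wide_total = 0.0     # buffer kept in a wider format (float64 here)
for _ in range(num_micro_batches):
    narrow_total = to_fp16(narrow_total + grad)
    wide_total += grad

# The narrow buffer stops growing once each addend falls below half of its
# rounding step, so it ends up well below the wide buffer.
print(narrow_total, wide_total)
```

Real training rarely accumulates thousands of identical values, so the size of the effect here is exaggerated; the point is only that small addends vanish when the running sum is stored in few mantissa bits.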

Gradient accumulation is the kind of trick that doesn’t appear in any architecture paper but is in every training script. The math is trivial; the role is structural: it lets a pretraining or fine-tuning recipe set its target batch independently of the hardware, and every other piece of the parallelism stack assumes you have it.

Go further

Does gradient accumulation give exactly the same result as a single large batch?

For vanilla SGD, yes: the gradient is linear in the per-example loss, so summing across micro-batches is identical to one big batch K times the micro-batch size. For Adam-family optimizers it is almost yes: the second-moment estimates are computed once per optimizer step over a gradient that has already been averaged, so the only divergence comes from differences in how reductions associate in finite precision. At standard hyperparameters the difference is invisible.
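The linearity argument can be written out in one line. The notation is an assumption introduced here: K micro-batches B_1, ..., B_K of equal size b, per-example loss \ell_i(\theta), and a mean-reduced loss within each micro-batch.

```latex
\nabla_\theta \left( \frac{1}{Kb} \sum_{k=1}^{K} \sum_{i \in B_k} \ell_i(\theta) \right)
  = \frac{1}{K} \sum_{k=1}^{K} \nabla_\theta \left( \frac{1}{b} \sum_{i \in B_k} \ell_i(\theta) \right)
```

The left side is the gradient of the mean loss over the full batch of K b examples; the right side is what the accumulation loop builds up when each micro-batch's mean loss is divided by K before the backward pass. The identity relies on equal micro-batch sizes and on nothing in the model depending on batch composition.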

Gradient accumulation vs data parallelism?

They're complementary, not competing. Data parallelism splits one global batch across GPUs in space (one big batch per step, sharded). Gradient accumulation splits one GPU's contribution across micro-batches in time. Real training uses both: effective batch = data-parallel size × accumulation steps × per-device micro-batch.

What's the throughput cost?

Forward + backward + accumulate is fractionally slower than forward + backward + step because of the extra add into the gradient buffer, but the dominant cost is just running K passes vs. one. The real overhead is wallclock: a target batch of 1024 with micro-batch 32 means 32 sequential micro-steps for a single optimizer step. You traded memory for time.
