Does gradient accumulation give exactly the same result as a single large batch?
For vanilla SGD, yes — the gradient is linear in the per-example loss, so summing across
Also known as: accumulation steps, micro-batch training, grad accumulation
Run multiple micro-batches sequentially, summing their gradients into a buffer, before applying a single optimizer step. Lets you simulate a large effective batch on memory-constrained hardware. The standard trick for hitting target batch sizes when a single batch won't fit on the GPU.
Gradient accumulation runs
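The loop is a four-line modification of single-step gradient descent:
optimizer.zero_grad()
for k in range(K):
    loss_k = forward(batches[k]) / K  # scale so the summed gradient is an average
    loss_k.backward()                 # adds into .grad buffers
optimizer.step()       # one update per K micro-batches
optimizer.zero_grad()  # clear the buffers for the next accumulation cycle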
The /K keeps the gradient an average over the effective batch, which is what most optimizers’ learning-rate calibration assumes. loss.backward() adds into existing .grad tensors via PyTorch’s backpropagation accumulator; optimizer.step() and zero_grad() fire once per
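The averaging argument can be checked numerically. The following is a minimal, framework-free sketch rather than a PyTorch implementation: it assumes a hypothetical one-parameter linear model with a mean-squared-error loss, and the names `mse_grad` and `accumulated_grad` are invented for illustration. It sums the gradients of K equal-sized micro-batches, each scaled by 1/K, and compares the result with the gradient of the full batch.

```python
# Framework-free sketch: gradient accumulation for a 1-D model y ≈ w * x.
# The model, data, and helper names are illustrative assumptions.

def mse_grad(w, xs, ys):
    """Gradient of the mean squared error (1/n) * sum((w*x - y)**2) w.r.t. w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, K):
    """Sum the gradients of K equal-sized micro-batches, each scaled by 1/K."""
    assert len(xs) % K == 0, "sketch assumes equal-sized micro-batches"
    m = len(xs) // K
    grad_buffer = 0.0                      # plays the role of the .grad buffer
    for k in range(K):
        lo, hi = k * m, (k + 1) * m
        grad_buffer += mse_grad(w, xs[lo:hi], ys[lo:hi]) / K
    return grad_buffer

xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0, 0.25, -2.0]
ys = [1.0, -2.1, 3.9, 3.2, -0.8, 6.1, 0.4, -4.2]
w, lr = 0.3, 0.01

full = mse_grad(w, xs, ys)                  # one large batch of 8
accum = accumulated_grad(w, xs, ys, K=4)    # four micro-batches of 2

# A plain SGD step lands on the same weight either way (up to float rounding).
w_full = w - lr * full
w_accum = w - lr * accum
```

The two gradients agree only up to floating-point rounding, and only because the micro-batches are the same size; with unequal micro-batches, dividing by K no longer yields the full-batch average.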
Gradient accumulation lets you hit a target effective batch size without buying more GPU memory. The trade is wall-clock time for memory:
The peak memory saved is the activation memory of
Two knobs control the same target batch, and they trade differently:
Pick the largest micro-batch that fits under ~80% memory utilization (leaving headroom for fragmentation), and use accumulation to close the gap to the target.
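As a rough sketch of that sizing rule, the helper below takes a target batch and the largest micro-batch that fits, and returns the number of accumulation steps needed to close the gap. The function name `plan_accumulation` and its parameters are hypothetical. It assumes `max_micro_batch` has already been found empirically, for instance by increasing the micro-batch until memory utilization approaches the headroom limit.

```python
import math

def plan_accumulation(target_batch, max_micro_batch):
    """Return (micro_batch, accumulation_steps, effective_batch).

    max_micro_batch is assumed to be the largest micro-batch that fits in
    memory with headroom; this helper does not measure memory itself.
    """
    if target_batch <= 0 or max_micro_batch <= 0:
        raise ValueError("batch sizes must be positive")
    micro_batch = min(max_micro_batch, target_batch)
    # Round up so the effective batch is never below the target.
    steps = math.ceil(target_batch / micro_batch)
    return micro_batch, steps, micro_batch * steps

print(plan_accumulation(512, 8))   # (8, 64, 512)
print(plan_accumulation(256, 6))   # (6, 43, 258): overshoots slightly
```

When the target is not a multiple of the micro-batch, rounding up overshoots the target slightly. Whether to round up, round down, or shrink the micro-batch to an exact divisor is a judgment call.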
Modern fine-tuning recipes (instruction tuning, RLHF rollouts, DPO) target effective batches of 256-1024 examples, well above what fits in memory at once. On a single 80 GB H100 with a 7B model and ~4K context, a micro-batch of 4-8 sequences is typically the ceiling; accumulating over 32-256 micro-batches reaches the target. Without gradient accumulation, the only path to that effective batch is multi-GPU, which adds considerable operational complexity to a job that otherwise fits on one box. This is why most fine-tuning libraries expose an accumulation-steps setting, such as gradient_accumulation_steps in Hugging Face's TrainingArguments.
Cleanly, in 2026. The accumulation buffer lives in fp32 (matching the optimizer’s fp32 master weights), and each micro-batch’s bf16 gradient is upcast and added on the fly. The historical gotcha was fp16-specific: loss scaling has to divide by
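To see why the buffer is kept in fp32 rather than in the gradient's own low-precision format, here is a small simulation using only Python's standard library. It is a sketch under stated assumptions, not how any framework implements accumulation: `to_bf16` and `to_fp32` are hypothetical helpers that emulate the rounding of each format with bit manipulation, and the step count is deliberately exaggerated to make the effect visible.

```python
import struct

def to_fp32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x):
    """Emulate bfloat16: keep the top 16 bits of the float32 pattern,
    rounding to nearest-even. Intended for finite inputs only."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

micro_grad = to_bf16(1e-3)   # a small bf16 gradient value, repeated every step
steps = 4096                 # exaggerated accumulation count for illustration

wide = 0.0     # fp32 buffer: each bf16 gradient is upcast, then added
narrow = 0.0   # bf16 buffer: the running sum is rounded back to bf16 each step
for _ in range(steps):
    wide = to_fp32(wide + micro_grad)
    narrow = to_bf16(narrow + micro_grad)

exact = steps * micro_grad
print(exact, wide, narrow)
```

In this setup the fp32 buffer tracks the exact sum, while the bf16 buffer stops growing once the running total is large enough that a single small gradient falls below half of bf16's spacing at that magnitude.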
Gradient accumulation is the kind of trick that doesn’t appear in any architecture paper but is in every training script. The math is trivial; the role is structural — it lets a pretraining or fine-tuning recipe set its target batch independently of the hardware, and every other piece of the parallelism stack assumes you have it.
They're complementary, not competing. Data parallelism splits one global batch across
Forward + backward + accumulate is fractionally slower than forward + backward + step because of the extra add into the gradient buffer, but the dominant cost is just running