Batch Size

Also known as: mini-batch size, global batch size, tokens per batch

TL;DR

Batch size is the number of training examples averaged into a single gradient step. Larger batches give cleaner gradients but worse generalization; smaller batches are noisier but regularize implicitly.

Batch size is the number of training examples used to compute a single gradient before taking a step. With batch size 1 you update on every example (pure SGD, very noisy). With batch size equal to the entire dataset you update once per pass (full-batch GD, infeasible at scale). Real training sits between, and the value you pick controls both gradient quality and generalization in non-obvious ways.

[Figure: Batch size, gradient noise vs. cleanliness. Smaller B finds flatter minima; larger B converges faster. Left panel: small batch (B = 32), noisy steps, generalises. Right panel: large batch (B = 4096), smooth steps, memorises. Each panel marks a flat minimum and a sharp minimum. Lower plot: train vs. validation loss against training step for small and large B, with the generalisation gap marked.]

The basic trade-off

Larger batches give cleaner gradients. The variance of a gradient averaged over B independent examples drops by a factor of B, so doubling the batch halves the noise variance. Cleaner gradients let you take larger steps without diverging: the famous “linear scaling rule” says LR should scale roughly with batch size.
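The 1/B behaviour can be checked numerically. The sketch below is illustrative only: it assumes each per-example gradient is an independent draw from a unit-variance Gaussian, and the helper names `mean_gradient_variance` and `linearly_scaled_lr` are made up for this example.

```python
import random
import statistics


def mean_gradient_variance(batch_size, trials=5000, seed=0):
    """Empirical variance of a batch-averaged scalar 'gradient'.

    Assumption: each per-example gradient is an independent N(0, 1) draw,
    so the theoretical variance of the batch mean is 1 / batch_size.
    """
    rng = random.Random(seed)
    means = [
        sum(rng.gauss(0.0, 1.0) for _ in range(batch_size)) / batch_size
        for _ in range(trials)
    ]
    return statistics.pvariance(means)


def linearly_scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: grow the learning rate in proportion to batch size."""
    return base_lr * new_batch / base_batch


var_32 = mean_gradient_variance(32)
var_64 = mean_gradient_variance(64)
ratio = var_32 / var_64  # should land close to 2 if variance scales as 1/B

print(f"variance at B=32: {var_32:.5f}")
print(f"variance at B=64: {var_64:.5f}")
print(f"ratio: {ratio:.2f}")
print(f"LR for B=1024 from a B=256 baseline of 0.1: {linearly_scaled_lr(0.1, 256, 1024)}")
```

Real gradients are neither scalar nor perfectly independent, so treat this as a sanity check on the scaling argument rather than a model of actual training.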

Smaller batches give noisier gradients. The noise turns out to be a feature. It biases the optimizer away from sharp minima (solutions that achieve low training loss by carefully memorizing the training set) and toward flat minima that generalize better. Empirically, smaller-batch training often produces models with lower test loss, even when training loss is the same.

“Noise is regularization”

This is the sound-bite version of why you can’t just crank batch size to infinity. SGD’s stochasticity acts like an implicit regularizer. Replace it with deterministic full-batch GD and you’d lose that regularization, requiring explicit replacements (more dropout, more weight decay, more data augmentation) to recover the same test performance.

The regime where this matters most is small-data deep learning — vision tasks with hundreds of thousands of examples, fine-tuning on small SFT datasets. There, batch sizes of 32, 64, or 128 are still standard and competitive with much larger configurations.

Two things change at LLM-pretraining scale. First, the dataset is enormous — trillions of tokens, often seen less than once — so overfitting isn’t the failure mode. Second, the model is in a non-overparameterized regime relative to the data; it can’t memorize even if you wanted it to. The implicit-regularization argument breaks down.

So the regime flips: at LLM scale, batch size is an engineering knob (saturate the cluster), not a generalization knob. Modern LLMs pretrain with 1M-16M tokens per global batch across thousands of GPUs — three to four orders of magnitude larger than ResNet-era 256-image batches. The “noise as regularization” wisdom from small-data settings does not translate.

It doesn’t go infinite either. Scaling-laws work identifies critical batch sizes beyond which gradient noise is negligible: further growth costs proportionally more compute per step without meaningfully reducing the number of steps, so training gets no faster. Frontier labs train near (but not above) this critical batch.
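One common way to picture the critical batch is a simple trade-off curve between optimizer steps and total examples processed. The sketch below assumes the functional form used in empirical large-batch-training studies, steps = s_min * (1 + b_crit / B). It is not taken from this article, and the names `s_min` and `b_crit` and all numbers are hypothetical.

```python
def steps_needed(batch, s_min, b_crit):
    """Optimizer steps to reach a target loss under the assumed trade-off model."""
    return s_min * (1.0 + b_crit / batch)


def examples_needed(batch, s_min, b_crit):
    """Total examples (or tokens) processed: batch size times number of steps."""
    return batch * steps_needed(batch, s_min, b_crit)


# Hypothetical values chosen only for illustration.
S_MIN = 10_000          # steps needed in the infinite-batch limit
B_CRIT = 2_000_000      # critical batch size, in tokens

for multiple in (0.1, 1.0, 10.0):
    b = multiple * B_CRIT
    print(
        f"B = {multiple:>4} x B_crit: "
        f"steps = {steps_needed(b, S_MIN, B_CRIT):,.0f}, "
        f"tokens = {examples_needed(b, S_MIN, B_CRIT):,.0f}"
    )
```

Under this model, going from B_crit to ten times B_crit cuts the step count by less than half while multiplying the tokens processed by more than five, which is the diminishing return described above.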

What “batch size” even means in LLMs

For a transformer, “batch size” can mean several things:

  • Microbatch (per-GPU) — how many sequences fit on one GPU. Constrained by GPU memory.
  • Global batch (sequences) — total sequences across all GPUs and all gradient-accumulation steps.
  • Global batch (tokens) — global-batch-sequences times sequence length. This is the number that matters statistically.

When a Llama-3 paper or Qwen paper says “batch size 4M,” they mean 4 million tokens per gradient step, not sequences. With 8K sequence length that’s 512 sequences per step; with 32K context that’s 128. Always check the unit.
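The unit conversions above are plain arithmetic. The sketch below assumes that “4M” means 4 × 2^20 tokens, which is what makes the 512 and 128 figures come out exactly, and that “8K” and “32K” mean 8192 and 32768. The function and parameter names are made up for this example.

```python
def global_batch_tokens(microbatch_seqs, grad_accum_steps, data_parallel_ranks, seq_len):
    """Tokens contributing to one optimizer step."""
    return microbatch_seqs * grad_accum_steps * data_parallel_ranks * seq_len


def sequences_per_step(batch_tokens, seq_len):
    """Convert a token-denominated global batch into a sequence count."""
    if batch_tokens % seq_len != 0:
        raise ValueError("batch_tokens must be a multiple of seq_len")
    return batch_tokens // seq_len


FOUR_M_TOKENS = 4 * 2**20  # assumption: "4M" read as 4 * 1,048,576

print(sequences_per_step(FOUR_M_TOKENS, 8 * 1024))   # sequences per step at 8K context
print(sequences_per_step(FOUR_M_TOKENS, 32 * 1024))  # sequences per step at 32K context

# Hypothetical layout: 2 sequences per GPU, 4 accumulation steps, 64 data-parallel ranks.
print(global_batch_tokens(2, 4, 64, 8 * 1024))
```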

Practical defaults

LLM pretraining uses 1M-16M tokens per batch, scaled to fit the cluster. LLM fine-tuning typically uses 64K-1M tokens, with smaller working better for small SFT datasets. Embedding and contrastive training benefit directly from large batches because every other example in the batch becomes a negative; sizes of 8K-65K are common. Vision deep learning still mostly runs at 256-4096 images, dictated by ImageNet-era recipes.
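For the contrastive case, a rough sketch shows why batch size matters directly. It assumes an InfoNCE-style loss over a square similarity matrix in which the diagonal holds the matched pairs and every off-diagonal entry serves as a negative. The function names and the temperature default are illustrative, not taken from any particular library.

```python
import math


def in_batch_infonce(sim, temperature=1.0):
    """Mean InfoNCE-style loss for a square similarity matrix.

    sim[i][j] is the similarity between query i and candidate j.
    Entry (i, i) is the positive; the other entries in row i act as negatives.
    """
    n = len(sim)
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / n


def negatives_per_example(batch_size):
    """Each example is contrasted against every other example in the batch."""
    return batch_size - 1


# With uninformative (all-equal) similarities the loss is log(batch size).
uniform = [[0.0] * 4 for _ in range(4)]
print(in_batch_infonce(uniform))
print(negatives_per_example(8192))
```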

Go further

Why is gradient noise actually useful?

Stochastic gradient noise pushes the optimizer away from sharp minima — points where small parameter perturbations cause large loss changes. Sharp minima generalize poorly because real-world test inputs perturb the loss surface. Smaller batches produce noisier gradients, which biases SGD toward flatter minima that generalize better. This is why naively cranking up batch size can hurt validation loss even as training loss improves.

Why do LLMs use such enormous batches?

At very large model and dataset scale, the implicit-regularization argument weakens — the model has so much data and capacity that overfitting isn't the bottleneck. Compute efficiency dominates instead, and a 4M-token batch fully saturates a 1024-GPU cluster. Modern LLM pretraining often uses 4-16M tokens per batch, several orders of magnitude larger than the 32-256 examples typical of pre-2018 deep learning.

What's the difference between batch size and microbatch size?

The 'batch size' that matters statistically is the global batch — the total number of examples averaged into one gradient step. The microbatch is what physically fits on one GPU. Gradient accumulation lets you sum gradients across multiple microbatches before a step, decoupling the two: microbatch is set by GPU memory, global batch is set by training dynamics.
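A minimal sketch of why accumulation decouples the two, using a one-parameter least-squares model in plain Python. It assumes equal-sized microbatches, so averaging the per-microbatch mean gradients reproduces the global-batch gradient. All names and data are made up for illustration.

```python
def mse_gradient(w, batch):
    """Gradient of mean squared error for the model y ~ w * x over (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w, batch, microbatch_size):
    """Average per-microbatch gradients, as gradient accumulation would."""
    if len(batch) % microbatch_size != 0:
        raise ValueError("batch must split into equal-sized microbatches")
    chunks = [
        batch[i:i + microbatch_size]
        for i in range(0, len(batch), microbatch_size)
    ]
    # Summing microbatch gradients and dividing by their count gives the
    # same value as one mean gradient over the whole global batch.
    return sum(mse_gradient(w, chunk) for chunk in chunks) / len(chunks)


data = [(float(x), 3.0 * x + 1.0) for x in range(8)]  # toy global batch of 8
full = mse_gradient(0.5, data)
accum = accumulated_gradient(0.5, data, microbatch_size=2)

print(full, accum)
```

In a real framework the same effect usually comes from scaling each microbatch loss by the number of accumulation steps before the backward pass, then stepping the optimizer once at the end.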
