Batch Normalization

Also known as: BN, BatchNorm, Ioffe-Szegedy normalization

TL;DR

Batch normalization standardizes each activation across the batch dimension to zero mean and unit variance, then applies a learned affine transform. Introduced by Ioffe and Szegedy in 2015, it dominated vision for years.

Batch normalization, introduced by Sergey Ioffe and Christian Szegedy in their 2015 paper “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” normalizes each activation channel using statistics computed across the batch dimension. For a tensor of shape [batch, features], BN computes a mean and variance per feature across the batch, then standardizes:

y = γ * ( x - μ_batch ) / sqrt( σ²_batch + ε )  +  β

Here γ and β are learned per-feature scale and shift parameters, and ε is a small constant that keeps the denominator away from zero. The technique transformed deep computer vision: it allowed networks like ResNet to train at far higher learning rates than were previously stable, and was a load-bearing component in nearly every CNN from 2015 to roughly 2020.
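
The formula can be sketched in a few lines of NumPy. This is a minimal illustration of the training-mode computation only, assuming a 2-D input of shape [batch, features]; the function name batch_norm_train and the default eps value are illustrative choices, not taken from the paper.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm for x of shape [batch, features]."""
    mu = x.mean(axis=0)                    # per-feature mean across the batch
    var = x.var(axis=0)                    # per-feature (biased) variance across the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardize each feature
    return gamma * x_hat + beta            # learned per-feature scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4))
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0))  # close to 0 for every feature
print(y.std(axis=0))   # close to 1 for every feature
```

With γ = 1 and β = 0 the output is simply the standardized input; during training the network is free to learn other values and undo the normalization where that helps.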

Why it worked (or seemed to)

The original paper’s framing was internal covariate shift: as upstream layers update, the distribution of inputs to downstream layers shifts, forcing them to readapt. Normalizing back to fixed mean/variance, the argument went, removes that shift.

That story is now considered partially wrong. Santurkar et al. (2018) showed BN doesn’t actually reduce internal covariate shift much — but it does smooth the loss landscape, making gradients more predictable and allowing larger stable learning rates. The technique works; the originally-given reason is shaky.

Either way, the practical effect for vision was huge: learning rates several times higher than were previously stable (the original paper trained with 5x and 30x the baseline rate), faster convergence, and less sensitivity to weight initialization.

Why it failed for transformers

Sequence models broke batch statistics in three ways:

Variable-length sequences: padding tokens distort per-feature batch averages.
Heterogeneous content: sequences in a batch may be from very different domains, making batch-wise statistics noisy.
Inference batch size: at deployment, batch size varies (one user request, then ten, then back to one). Statistics computed over one or a few examples are unreliable, so BN has to fall back on running averages collected during training, and those may not match what the model saw at train time.
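
The padding problem in the first item can be seen on a toy batch. The sketch below uses made-up numbers and a hypothetical boolean mask marking real tokens; it only compares a naive mean with a masked mean and is not a full normalization layer.

```python
import numpy as np

# Hypothetical toy batch: 2 sequences, max length 4, 1 feature.
# The second sequence has a single real token; the rest is zero padding.
x = np.array([[[2.0], [4.0], [6.0], [8.0]],
              [[5.0], [0.0], [0.0], [0.0]]])   # shape [batch, time, features]
mask = np.array([[1, 1, 1, 1],
                 [1, 0, 0, 0]], dtype=bool)    # True where the token is real

naive_mean = x.reshape(-1, 1).mean(axis=0)     # padding counted as data
masked_mean = x[mask].mean(axis=0)             # real tokens only

print(naive_mean)   # [3.125]
print(masked_mean)  # [5.]
```

Three padded positions out of eight are enough to pull the per-feature mean from 5.0 down to 3.125, and the size of that pull changes from batch to batch with the padding pattern.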

Layer normalization sidestepped all three. It normalizes across the feature dimension within a single token, so it is independent of what other tokens are in the batch. Essentially every modern transformer whose architecture is public (BERT, GPT, T5, and Llama, among others) uses LayerNorm or its variant RMSNorm, not BatchNorm.

Where BatchNorm still lives

Domains where BN is still standard

Image classification CNNs: ResNet and EfficientNet still use BN; ConvNeXt is a notable exception that adopted LayerNorm.
Object detection / segmentation backbones: when batches are large and image-shaped.
Generative image models: though many newer architectures use other norms instead (Stable Diffusion’s UNet, for example, uses GroupNorm).
Tabular deep learning: fixed-shape inputs with consistent batch composition.

Anywhere you have fixed-shape inputs, large batches, and consistent train/inference distribution, BN remains a strong and widely used normalizer. The “BN failed” story is specific to sequence models and to small-batch settings; it is not a universal verdict.

For activations of shape [B, F] — B examples, F features:

Batch normalization — mean and variance per feature, across the batch:

μ_f = (1/B) Σ_b x[b, f]
σ²_f = (1/B) Σ_b (x[b, f] - μ_f)²

Layer normalization — mean and variance per example, across the features:

μ_b = (1/F) Σ_f x[b, f]
σ²_b = (1/F) Σ_f (x[b, f] - μ_b)²

BN’s behavior depends on the batch — which other examples are present, batch size, padding. LayerNorm’s behavior depends only on the example itself, deterministic regardless of batch composition.
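
The difference in batch dependence can be checked directly. The following is a rough sketch, assuming NumPy and omitting the learned scale and shift from both layers; the helper names batch_norm and layer_norm are illustrative. The same example is placed in two different batches and its normalized output is compared.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Statistics per feature, across the batch (axis 0).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def layer_norm(x, eps=1e-5):
    # Statistics per example, across the features (axis 1).
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
example = rng.normal(size=(1, 6))
# Same first row, different batch mates (the second batch is shifted by 5).
batch_a = np.concatenate([example, rng.normal(size=(3, 6))])
batch_b = np.concatenate([example, rng.normal(loc=5.0, size=(3, 6))])

ln_same = np.allclose(layer_norm(batch_a)[0], layer_norm(batch_b)[0])
bn_same = np.allclose(batch_norm(batch_a)[0], batch_norm(batch_b)[0])
print(ln_same)  # True: row 0 ignores its batch mates
print(bn_same)  # False: row 0 shifts with the batch statistics
```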

For transformers — heterogeneous batches, varying inference batch size — LayerNorm’s per-example invariance is decisive. For CNNs — large homogeneous batches — BN’s across-batch averaging is decisive. Different regimes; both work.

BN’s history is more interesting than its current relevance. It was the dominant normalization layer for half a decade, and its replacement by LayerNorm in the transformer era is one of the more under-appreciated architectural shifts of 2017-2020.

Go further

Why has BatchNorm been replaced by LayerNorm in transformers?

Sequence models break batch statistics. Sequences in a batch have different lengths, different content, and different padding patterns, so the per-feature mean and variance computed across the batch are noisy and inconsistent across steps. LayerNorm normalizes within a single token across the feature dimension, which is independent of batch composition.


What was the original justification — internal covariate shift?

The 2015 paper credited BN with reducing 'internal covariate shift' — the change in activation distributions during training. Subsequent work (Santurkar et al. 2018) argued the real benefit is that BN smooths the loss landscape, allowing higher learning rates. The original explanation is now considered partially wrong even though the technique works.


What about train-test mismatch?

BN computes statistics from the current batch at train time but uses running averages at test time. If the running averages drift from the true population stats — common with small or distribution-shifted batches — the model behaves differently in eval mode than in train mode. This is the classic BN production gotcha.
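
A small sketch makes the mismatch concrete. It assumes NumPy, leaves out the learned scale and shift, and uses an exponential moving average with a momentum of 0.1 on the newest batch; the class name TinyBatchNorm and these defaults are illustrative, and real frameworks differ in details such as biased versus unbiased variance.

```python
import numpy as np

class TinyBatchNorm:
    """Illustrative BN with running statistics (no learned gamma/beta)."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            # Exponential moving average of the batch statistics.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mu
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Eval mode: use the stored averages, not the current batch.
            mu, var = self.running_mean, self.running_var
        return (x - mu) / np.sqrt(var + self.eps)

rng = np.random.default_rng(0)
bn = TinyBatchNorm(num_features=3)
x = rng.normal(loc=10.0, size=(32, 3))

train_out = bn(x, training=True)    # normalized with this batch's statistics
eval_out = bn(x, training=False)    # normalized with the running averages
gap = np.abs(train_out - eval_out).max()
print(gap)  # large: after one update the running stats are still near their initial values
```

Calling the layer in training mode many more times on similar batches moves the running averages toward the batch statistics and shrinks the gap; the mismatch persists when the averages never get the chance to converge or when deployment data differs from training data.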
