Layer Normalization

Q: Why pre-norm instead of post-norm?

Pre-norm (normalize before attention/MLP) keeps the residual path clean and gradients well-behaved at depth. Post-norm (normalize after, as in the original transformer) tends to make training unstable beyond roughly a dozen layers unless learning-rate warmup is carefully tuned. Nearly every modern LLM with 50+ layers uses pre-norm.

Q: Why not batch normalization, like in vision models?

BatchNorm normalizes across the batch dimension. In language modeling, sequences in a batch have very different lengths and content, so batch statistics are noisy. LayerNorm normalizes across the feature dimension within a single token, making it independent of batch composition — which matters for stable training and for inference where batch size varies.

Also known as: layer norm, LayerNorm. Related terms: RMSNorm, pre-norm, post-norm

TL;DR

Layer normalization rescales each token's activations to zero mean and unit variance, then applies a learned affine transform. It stabilizes deep transformer training and is a key part of what lets modern LLMs scale past a hundred layers without diverging.

Layer normalization is a standard component of every transformer block. Each application takes a vector of activations, normalizes it to zero mean and unit variance, then applies a learned affine transform. It’s a small operation but a load-bearing one: without normalization, the deep stacks of attention and feed-forward layers in modern LLMs would not train.

The operation

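Given an activation vector $x \in \mathbb{R}^d$ at one position:

$$\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, \qquad \mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \qquad \sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2$$

Where $\gamma, \beta \in \mathbb{R}^d$ are learned per-feature scale and shift, and $\epsilon$ is a small constant for numerical stability. The normalization is computed per token: each token's $d$-dimensional vector is normalized independently of its neighbors. This is the key difference from batch normalization, which normalizes across the batch dimension.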
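As a minimal sketch of the operation described above, the following dependency-free Python normalizes one token's vector and then applies the learned scale and shift. The names `layer_norm`, `gamma`, `beta`, and `eps` are illustrative choices, not taken from any particular library, and the default `eps` value is an assumption.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token's activation vector, then apply scale and shift."""
    d = len(x)
    mean = sum(x) / d
    # Biased (population) variance over the d features of this single token.
    var = sum((v - mean) ** 2 for v in x) / d
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

# Two tokens with very different scales; each is normalized on its own,
# with no statistics shared across tokens or across a batch.
tokens = [[1.0, 2.0, 3.0, 4.0], [100.0, 200.0, 300.0, 400.0]]
gamma = [1.0] * 4  # identity scale (assumed initialization)
beta = [0.0] * 4   # zero shift (assumed initialization)
normed = [layer_norm(t, gamma, beta) for t in tokens]
```

Because the statistics come from the token's own features, both tokens map to nearly the same normalized vector even though their raw scales differ by a factor of 100.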

Why it stabilizes training

Deep networks have a notorious problem: as you stack layers, activations either explode or vanish unless the scale is carefully controlled. Each layer’s input distribution drifts as upstream parameters update, which makes the optimization landscape harder to navigate. Layer normalization clamps the per-token activation scale, decoupling each layer’s behavior from upstream variance. The learned scale and shift ($\gamma$ and $\beta$) give the model the flexibility to undo the normalization where useful, but the initial condition is well-behaved, which is what matters for gradient flow.

In practice this is what lets Llama-3-405B’s 126 layers train without divergence.
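The scale-control argument can be illustrated with a toy experiment. This is only a sketch under strong assumptions: a stack of random linear layers with no nonlinearity and no residual connections, weights deliberately drawn too large, and a normalization with identity scale and zero shift. All names (`linear`, `rms`, `weights`) are hypothetical helpers for this example.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """Per-token normalization with identity scale and zero shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, w):
    """Plain matrix-vector product; w is a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

random.seed(0)
d, depth = 16, 30
# Weights drawn with twice the variance-preserving standard deviation,
# so each layer roughly doubles the activation scale.
std = 2.0 / math.sqrt(d)
weights = [[[random.gauss(0.0, std) for _ in range(d)] for _ in range(d)]
           for _ in range(depth)]
x0 = [random.gauss(0.0, 1.0) for _ in range(d)]

plain, normed = x0, x0
for w in weights:
    plain = linear(plain, w)                # scale compounds layer after layer
    normed = linear(layer_norm(normed), w)  # input scale is reset before every layer

print(f"without normalization: rms = {rms(plain):.3e}")
print(f"with normalization:    rms = {rms(normed):.3e}")
```

In the unnormalized stack the activation scale compounds multiplicatively with depth, while the normalized stack sees a unit-scale input at every layer regardless of what happened upstream.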

In a pre-norm block, where normalization is applied to each sublayer's input rather than its output, the residual stream (the running sum of all layer contributions) is never normalized. Each block reads it, normalizes a copy before passing it through attention or the FFN, and adds the result back. Gradients flowing backward see an unbroken identity path through the residual stream from output to input, so the magnitude of upstream gradients stays well-conditioned regardless of depth. Post-norm interleaves a normalization between every residual addition, which compresses gradient magnitudes layer by layer; past 12 to 24 layers, the signal vanishes or explodes without learning-rate warmup tricks. Pre-norm is what lets 100-plus-layer models train with simple optimization recipes.

Pre-norm vs post-norm

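The original transformer placed normalization after the attention and feed-forward sublayers (post-norm). Modern transformers place it before (pre-norm). Schematically, with $F$ standing for the attention or feed-forward sublayer and $x_l$ for the residual stream entering block $l$, a pre-norm block is:

$$x_{l+1} = x_l + F(\mathrm{LayerNorm}(x_l))$$

vs the post-norm:

$$x_{l+1} = \mathrm{LayerNorm}(x_l + F(x_l))$$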

The pre-norm formulation keeps the residual connection path “clean”: the residual stream is untouched, and only the input to each sublayer is normalized. This means gradients flow through the residual path with no normalization in the way, which empirically lets you train much deeper models without learning-rate warmup gymnastics. Most modern LLMs use pre-norm.
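The two placements can be sketched side by side. This is a toy illustration, not a real transformer block: `toy_sublayer` is a hypothetical stand-in for attention or the MLP, and the normalization uses identity scale and zero shift.

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    # Hypothetical stand-in for attention or the MLP.
    return [math.tanh(v) for v in x]

def pre_norm_block(x, sublayer):
    # The sublayer sees a normalized copy; x itself is carried through unchanged.
    update = sublayer(layer_norm(x))
    return [xi + ui for xi, ui in zip(x, update)]

def post_norm_block(x, sublayer):
    # The residual sum itself is normalized, so x never passes through untouched.
    summed = [xi + ui for xi, ui in zip(x, sublayer(x))]
    return layer_norm(summed)

def variance(x):
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)

x = [1.0, 2.0, 3.0, 4.0]
pre, post = x, x
for _ in range(8):
    pre = pre_norm_block(pre, toy_sublayer)
    post = post_norm_block(post, toy_sublayer)

print("pre-norm stream variance: ", variance(pre))   # grows as updates accumulate
print("post-norm stream variance:", variance(post))  # held near 1 by the final norm
```

In the pre-norm version the output is always the input plus an update, which is the identity path the text refers to; in the post-norm version every block's output passes through a normalization, so no such untouched path exists.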

RMSNorm

Llama and most successors replace LayerNorm with RMSNorm (Zhang & Sennrich, 2019), which drops the mean-centering step:

$$\mathrm{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}}$$

It is strictly simpler: no mean computation and no shift parameter. Empirically it matches LayerNorm’s training stability and final quality, and the speedup of roughly 10-20% matters at LLM scale. Qwen, Mistral, Gemma, and most modern open-weight models all use RMSNorm.
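A minimal sketch of RMSNorm in plain Python, assuming a per-feature gain named `gamma` and an illustrative default for `eps` (neither name nor value comes from a specific implementation):

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """Divide by the root-mean-square of the features; no centering, no shift."""
    d = len(x)
    mean_sq = sum(v * v for v in x) / d
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * inv_rms for v, g in zip(x, gamma)]

x = [1.0, 2.0, 3.0, 4.0]
y = rms_norm(x, [1.0] * 4)

# The output has unit root-mean-square, but its mean is not forced to zero.
print("rms of output: ", math.sqrt(sum(v * v for v in y) / len(y)))
print("mean of output:", sum(y) / len(y))
```

Compared with LayerNorm, this skips one pass over the vector (the mean) and one parameter vector (the shift), which is where the simplicity described above comes from.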

Where in the block

A modern decoder block looks like:

RMSNorm → multi-head attention → add to residual stream.
RMSNorm → SwiGLU FFN → add to residual stream.

Two normalizations per block, one before attention, one before the FFN. The residual stream itself is unnormalized; each sublayer reads a normalized copy and adds its output back as an increment. This pattern is a large part of why deep LLMs converge, and why removing or relocating normalization is one of the more dangerous architectural edits you can make.
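The two-step pattern can be sketched structurally as follows. This shows only where the normalizations and residual additions sit: `toy_attention` and `toy_ffn` are hypothetical placeholders (a causal running mean and a ReLU), not real multi-head attention or SwiGLU, and the learned RMSNorm gain is omitted for brevity.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMSNorm without the learned gain, for brevity."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms for v in x]

def add(stream, updates):
    """Add per-token updates to the residual stream."""
    return [[a + b for a, b in zip(tok, upd)] for tok, upd in zip(stream, updates)]

def decoder_block(x, attention, ffn):
    """One pre-norm block; x is the residual stream as a list of token vectors."""
    # Step 1: normalize a copy, run attention across tokens, add to the stream.
    x = add(x, attention([rms_norm(tok) for tok in x]))
    # Step 2: normalize a copy, run the FFN per token, add to the stream.
    x = add(x, [ffn(rms_norm(tok)) for tok in x])
    return x

def toy_attention(tokens):
    # Placeholder: each position averages itself and all earlier positions.
    out = []
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]
        out.append([sum(col) / len(prefix) for col in zip(*prefix)])
    return out

def toy_ffn(tok):
    # Placeholder: elementwise ReLU standing in for the SwiGLU FFN.
    return [max(0.0, v) for v in tok]

tokens = [[1.0, -2.0, 3.0], [0.5, 0.5, -1.0]]
out = decoder_block(tokens, toy_attention, toy_ffn)
```

Note that `rms_norm` is only ever applied to the copy handed to a sublayer; the stream `x` is modified solely by addition.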

Go further

What's RMSNorm and why has it taken over?

RMSNorm normalizes by the root-mean-square of activations, dropping the mean-centering step of LayerNorm. It's ~10-20% faster, uses fewer parameters, and matches LayerNorm's quality empirically. Llama, Qwen, Mistral, and Gemma all use it.
