Weight Decay

Also known as: L2 regularization, L2 penalty, ridge regularization, AdamW decay

TL;DR

Weight decay is L2 regularization on model parameters: add λ ||θ||² to the loss to penalize large weights. It biases the optimizer toward simpler functions and is the dominant regularizer in modern LLM training.

Weight decay is L2 regularization on the parameter vector θ: add λ ‖θ‖² to the training loss, where ‖θ‖ is the L2 norm of the flattened parameters. The gradient picks up an extra 2λθ term, which on every step pushes weights gently toward zero. The result is a soft preference for smaller-magnitude parameters, an implicit prior toward simpler functions.

[Figure: Weight decay as L2 regularization, "Tilt the landscape toward the origin." A vanilla loss bowl with its minimum away from the origin is compared with the regularized loss L̃(θ) = L(θ) + λ ‖θ‖² at λ = 1.30, A = 1.0, B = 1.0. Parameter norm at the minimum: ‖θ★‖ = 0.658 (vanilla) versus ‖θ★‖ = 0.286 (with weight decay), pulled toward 0 by 57%.]
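The shift in the minimizer is easy to check on a toy problem. The sketch below assumes a separable quadratic bowl L(θ) = Σ aᵢ (θᵢ − cᵢ)², for which the regularized minimum has the closed form θᵢ = aᵢ cᵢ / (aᵢ + λ). The curvatures, the center c, and the function names are illustrative choices, not values taken from any real model.

```python
import math

def regularized_minimum(curvatures, center, lam):
    """Minimizer of sum_i a_i*(theta_i - c_i)**2 + lam*sum_i theta_i**2.

    Setting the gradient 2*a_i*(theta_i - c_i) + 2*lam*theta_i to zero
    gives theta_i = a_i*c_i / (a_i + lam).
    """
    return [a * c / (a + lam) for a, c in zip(curvatures, center)]

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

# Hypothetical bowl: unit curvature in both directions, minimum away from 0.
curvatures = [1.0, 1.0]
center = [0.5, 0.4]

vanilla = regularized_minimum(curvatures, center, lam=0.0)
decayed = regularized_minimum(curvatures, center, lam=1.3)

shrink = l2_norm(decayed) / l2_norm(vanilla)
print(f"vanilla norm = {l2_norm(vanilla):.3f}")
print(f"decayed norm = {l2_norm(decayed):.3f}")
print(f"norm ratio   = {shrink:.3f}")  # equals 1 / (1 + lam) when all a_i = 1
```

With unit curvature the norm shrinks by exactly 1/(1 + λ), about 0.435 at λ = 1.3, which is a pull toward the origin of roughly 57%.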

Why penalize parameter magnitude

Two intuitions, both correct:

Bayesian. A λ ‖θ‖² penalty is identical, up to a constant, to placing an isotropic Gaussian prior with mean zero on every weight and computing the MAP estimate. Larger λ = tighter prior = stronger preference for small weights.

Geometric. Most overfitting solutions in a high-capacity model have at least one direction of large weights — the model is using a sharp feature to memorize a few examples. Penalizing magnitude tilts the loss landscape so broader, smaller-magnitude solutions win. Those generalize better.

Weight decay biases toward functions whose parameters stay small, which empirically corresponds to functions that interpolate rather than memorize.
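The Bayesian reading can be made explicit with a short derivation. It assumes the training loss L(θ) is a negative log-likelihood; the prior variance σ² and the dataset symbol D are notation introduced here for illustration.

```latex
\begin{aligned}
p(\theta) &= \mathcal{N}(\theta \mid 0, \sigma^2 I)
  \;\Rightarrow\; -\log p(\theta) = \frac{1}{2\sigma^2}\,\lVert\theta\rVert^2 + \text{const}, \\
\hat{\theta}_{\text{MAP}}
  &= \arg\max_\theta \big[ \log p(D \mid \theta) + \log p(\theta) \big] \\
  &= \arg\min_\theta \Big[ \underbrace{-\log p(D \mid \theta)}_{L(\theta)}
     + \underbrace{\tfrac{1}{2\sigma^2}}_{\lambda}\,\lVert\theta\rVert^2 \Big].
\end{aligned}
```

Under these assumptions λ = 1/(2σ²), so a larger λ corresponds to a smaller prior variance.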

L2 regularization vs AdamW decay

On plain SGD, “L2 regularization” and “weight decay” are the same thing. The update for SGD with learning rate η and L2 penalty λ ‖θ‖²:

  θ ← θ − η (∇L(θ) + 2λθ) = (1 − 2ηλ) θ − η ∇L(θ)

That is the same as shrinking the weights by a factor of (1 − 2ηλ) while taking a normal gradient step. SGD with L2 is SGD with weight decay; the names were used interchangeably for years.
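A quick numerical check of that equivalence, assuming the penalty is written λ ‖θ‖² so that its gradient is 2λθ. The weights, gradient values, and function names below are made up for illustration.

```python
def sgd_l2_step(theta, grad, lr, lam):
    """One SGD step on L(theta) + lam * ||theta||^2 (penalty folded into the gradient)."""
    return [t - lr * (g + 2.0 * lam * t) for t, g in zip(theta, grad)]

def sgd_decay_step(theta, grad, lr, lam):
    """Shrink weights by (1 - 2*lr*lam) and subtract the plain gradient step."""
    return [(1.0 - 2.0 * lr * lam) * t - lr * g for t, g in zip(theta, grad)]

theta = [0.8, -1.5, 2.0]   # hypothetical weights
grad = [0.3, -0.1, 0.7]    # hypothetical loss gradient at theta
lr, lam = 0.1, 0.05

a = sgd_l2_step(theta, grad, lr, lam)
b = sgd_decay_step(theta, grad, lr, lam)
max_gap = max(abs(x - y) for x, y in zip(a, b))
print(max_gap)  # zero up to floating-point rounding
```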

Adam broke this equivalence. Adam adaptively rescales every gradient component by the square root of its running second-moment estimate, so if you fold 2λθ into the gradient, Adam rescales that too. Large weights with small gradient variance get their decay term amplified; small weights with large variance get theirs suppressed. The decay stops behaving like uniform shrinkage.

Loshchilov and Hutter’s AdamW (2019) fixed this by decoupling decay from the gradient. AdamW applies the decay directly to the weights, outside the Adam update:

  θ ← θ − η m̂ / (√v̂ + ε) − η λ θ

Here m̂ and v̂ are Adam’s bias-corrected first- and second-moment estimates of the loss gradient alone. The η λ θ term sees the raw learning rate, never goes through the moment estimates, and shrinks every weight uniformly. AdamW is the version that gives you the regularization story you actually want from an L2 penalty.
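The difference between the two schemes can be seen in a small pure-Python simulation. This is a sketch under stated assumptions, not a reference implementation: the loss gradient is a stand-in that alternates between +scale and −scale, the hyperparameters are arbitrary, and the L2 variant uses the 2λθ gradient of a λ ‖θ‖² penalty while the decoupled variant subtracts ηλθ.

```python
import math

def run(mode, grad_scales, steps=50, lr=1e-2, lam=0.1,
        beta1=0.9, beta2=0.999, eps=1e-8):
    """Train one weight per gradient scale; every weight starts at 1.0.

    The loss gradient is a stand-in: it alternates +scale, -scale.
    mode="l2"    -> Adam with 2*lam*theta folded into the gradient
    mode="adamw" -> Adam on the loss gradient, decay applied to the weights
    """
    theta = [1.0 for _ in grad_scales]
    m = [0.0 for _ in grad_scales]
    v = [0.0 for _ in grad_scales]
    for t in range(1, steps + 1):
        sign = 1.0 if t % 2 else -1.0
        for i, scale in enumerate(grad_scales):
            g = sign * scale
            if mode == "l2":
                g += 2.0 * lam * theta[i]
            m[i] = beta1 * m[i] + (1.0 - beta1) * g
            v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
            m_hat = m[i] / (1.0 - beta1 ** t)   # bias-corrected first moment
            v_hat = v[i] / (1.0 - beta2 ** t)   # bias-corrected second moment
            step = lr * m_hat / (math.sqrt(v_hat) + eps)
            decay = lr * lam * theta[i] if mode == "adamw" else 0.0
            theta[i] = theta[i] - step - decay
    return theta

scales = [0.01, 10.0]  # small-gradient weight vs large-gradient weight
adam_l2 = run("l2", scales)
adamw = run("adamw", scales)
print("Adam + L2:", adam_l2)
print("AdamW    :", adamw)
```

In this toy run, Adam with the folded-in penalty shrinks the small-gradient weight much more than the large-gradient one. AdamW leaves the two weights essentially equal, because the Adam step is invariant to gradient scale and the decay is applied uniformly.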

Typical values

Common decay coefficients
  • λ = 0.1: frontier LLM pretraining (GPT-3, Llama, Qwen). Surprisingly aggressive.
  • — standard fine-tuning default; the Hugging Face Trainer’s default.
  • — vision models, classical CNNs.
  • — when other regularizers (heavy data augmentation, early stopping) are doing the work.

The aggressive λ = 0.1 used in LLM pretraining looks alarming until you remember that AdamW’s decay is multiplied by the learning rate, and frontier learning rates are tiny (3e-4 or so). Net effective shrinkage per step is small: each step scales the weights by (1 − ηλ).
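To put a number on the per-step shrinkage, the sketch below assumes a constant learning rate of 3e-4 and a decay coefficient of 0.1. Both are illustrative values; real schedules warm up and decay the learning rate, which makes the shrinkage slower still.

```python
import math

lr = 3e-4           # illustrative peak learning rate
weight_decay = 0.1  # illustrative decay coefficient

# Decoupled decay multiplies each weight by (1 - lr * weight_decay) per step.
per_step_factor = 1.0 - lr * weight_decay
# Steps needed to halve a weight that receives no gradient signal at all.
half_life_steps = math.log(0.5) / math.log(per_step_factor)

print(f"per-step shrink factor: {per_step_factor:.6f}")
print(f"steps to halve an untouched weight: {half_life_steps:,.0f}")
```

Under these assumptions a weight loses only 0.003% of its magnitude per step, and it takes roughly 23,000 steps of decay alone to halve it.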

What weight decay does not regularize (typically)

Production training scripts almost universally exclude two parameter groups from decay:

  • Layer-norm scale and bias parameters — these have specific functional roles (rescaling activations); pulling them toward zero breaks the layer.
  • Embedding-layer biases and any explicit bias term in linear layers.

The convention is “decay weights, don’t decay biases or norms.” Forgetting to set this up is a silent training-quality bug — the model still trains, but slightly worse. Hugging Face’s get_optimizer_grouped_parameters and equivalent helpers exist precisely for this.
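A minimal sketch of the "decay weights, don't decay biases or norms" split. The helper name `split_decay_groups` and the rule it uses (exclude anything with fewer than two dimensions or a name ending in `.bias`) are assumptions for illustration, not a library API. Stand-in objects replace real tensors so the example runs without a deep-learning framework.

```python
from types import SimpleNamespace

def split_decay_groups(named_params, weight_decay):
    """Split (name, param) pairs into decayed and non-decayed groups.

    Heuristic assumed here: tensors with fewer than 2 dimensions (biases,
    norm scales and shifts) are excluded; matrices and higher are decayed.
    """
    decay, no_decay = [], []
    for name, param in named_params:
        if not getattr(param, "requires_grad", True):
            continue  # frozen parameters are not optimized at all
        if param.ndim < 2 or name.endswith(".bias"):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Hypothetical parameter names and shapes, standing in for a real model.
fake_params = [
    ("attn.q_proj.weight", SimpleNamespace(ndim=2, requires_grad=True)),
    ("attn.q_proj.bias", SimpleNamespace(ndim=1, requires_grad=True)),
    ("ln_1.weight", SimpleNamespace(ndim=1, requires_grad=True)),
    ("ln_1.bias", SimpleNamespace(ndim=1, requires_grad=True)),
]
groups = split_decay_groups(fake_params, weight_decay=0.1)
print(len(groups[0]["params"]), "decayed;", len(groups[1]["params"]), "not decayed")
```

With PyTorch, the same function could be fed `model.named_parameters()` and its return value passed to `torch.optim.AdamW(groups, lr=...)`, which accepts a per-group `weight_decay` entry.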

At pretraining scale, you might think regularization should not matter — the model sees each token roughly once, there is no fixed training set to memorize. Yet weight decay still helps measurably. Why?

Two reasons. First, the inductive bias from the prior is doing work even on a stream of unique data: among the many functions that fit the training distribution equally well, weight decay picks the one with the smallest parameter norm. That solution typically extrapolates better.

Second, weight decay interacts with the optimizer’s path across the loss surface. With λ > 0, gradient descent tends toward solutions that are flatter in parameter space: large flat minima rather than sharp narrow ones. Empirically, flat minima generalize better. There is no rigorous proof of why this matters at pretraining scale, but the empirical finding is robust across model sizes from 100M to 1T parameters.

A third pragmatic reason: weight decay caps the long-term magnitude growth of weights, which keeps numerics stable. Without any decay, a model trained for hundreds of billions of tokens drifts toward larger and larger weights, eventually overflowing in mixed-precision arithmetic.
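The capping effect can be illustrated with a deliberately crude model. The sketch below assumes gradients are pure Gaussian noise and uses a plain SGD-style update with decoupled decay. Real training gradients and Adam updates behave differently, and the simulation does not model overflow itself, only the growth of the weight norm.

```python
import math
import random

def final_norm(weight_decay, steps=5000, dim=50, lr=0.1, seed=0):
    """Norm of weights driven by pure-noise gradients, with decoupled decay.

    Update: theta <- theta - lr * noise - lr * weight_decay * theta.
    """
    rng = random.Random(seed)
    theta = [0.0] * dim
    for _ in range(steps):
        theta = [
            t - lr * rng.gauss(0.0, 1.0) - lr * weight_decay * t
            for t in theta
        ]
    return math.sqrt(sum(t * t for t in theta))

norm_free = final_norm(weight_decay=0.0)
norm_decayed = final_norm(weight_decay=0.1)
print(f"no decay:   ||theta|| = {norm_free:.1f}")
print(f"with decay: ||theta|| = {norm_decayed:.1f}")
```

Without decay the norm behaves like a random walk and keeps growing with the square root of the step count. With decay it settles at an equilibrium where the shrinkage balances the noise.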

Weight decay is the regularizer that survived contact with the scaling era. Dropout is mostly off, data augmentation barely applies to text, and early stopping is irrelevant when you train a model exactly once. What is left is the weight-decay coefficient in your AdamW config, doing quiet work on every gradient step.

Go further

What is the difference between L2 regularization and AdamW weight decay?

L2 regularization adds λ ||θ||² to the loss, so the gradient gets a 2λθ term and the optimizer rescales it like any other gradient (Adam divides by the square root of its running second-moment estimate). AdamW applies the decay directly to the weights, θ ← θ − η λ θ, bypassing the moment estimates. The two are equivalent in plain SGD but differ meaningfully in Adam, where AdamW is the version that actually generalizes.

Why is weight decay the dominant LLM regularizer?

At trillion-token scale, classic regularizers like dropout are off and data augmentation is whatever the tokenizer happens to do. Weight decay is the one regularization knob that survives — it costs nothing, it composes with everything, and it provides a meaningful prior toward smaller-magnitude parameters even when training data is effectively infinite.

What does λ ||θ||² actually do to the loss landscape?

It tilts the landscape toward the origin. This is equivalent to placing a Gaussian prior with mean zero on every parameter: the maximum-a-posteriori estimate under that prior is the maximum-likelihood estimate with an L2 regularizer added. Geometrically, the optimum shifts toward the smallest-norm solution that fits the data.
