Optimizer

Also known as: Adam, AdamW, SGD with momentum, training optimizer

TL;DR

An optimizer is the wrapper around vanilla gradient descent that decides how each parameter actually gets updated. Adam, AdamW, and SGD-with-momentum are the workhorses.

An optimizer is the algorithm that turns a gradient into a parameter update. Vanilla is the simplest — — and nobody trains modern networks with it. Real optimizers wrap the update with running statistics, momentum, and per-parameter scaling that keep convergence fast and stable across the heterogeneous parameter landscape of a deep network.

OPTIMIZERS · SAME LANDSCAPE, DIFFERENT WALKSHow the gradient becomes a step.θ★UPDATE RULESSGDno stateθ ← θ − η gfinal L = 0.0960Momentum1 buffer / paramv ← βv + g · θ ← θ − ηvfinal L = 0.0063Adam2 buffers / paramθ ← θ − η m̂ / (√v̂ + ε)final L = 0.0151A CURVING NARROW VALLEY · STEEP WALLS, SHALLOW FLOOR

The three you’ll meet

Production optimizers
  • SGD with momentum — Track an exponential moving average of past gradients (the “momentum”). Update with the smoothed gradient instead of the raw one. Standard for vision models, especially with large batches and cosine LR schedules.
  • Adam — Track both a first moment (running mean) and a second moment (running variance) of the gradient, then divide the update by the square root of the variance. Gives every parameter its own effective learning rate. Default for non-LLM deep learning.
  • AdamW — Adam with decoupled weight decay. Universal default for LLM pretraining and fine-tuning in 2026.

The trend is clear: each generation of optimizer adds more per-parameter state. SGD has zero, momentum has one running average per parameter, Adam has two, and second-order methods like Shampoo have full per-parameter matrices. More state means more memory but also more robustness to bad and ill-conditioned loss landscapes.

Why Adam dominates LLM training

An LLM has parameters with very different gradient magnitudes — embedding lookups get huge sparse gradients, LayerNorm scales get tiny ones, attention QKV projections sit between. SGD with a single global learning rate can’t serve all of them; you’d need per-layer rates, which is brittle at scale.

Adam handles this automatically. Its per-parameter variance estimate rescales each update so all parameters move on a similar timescale, regardless of raw gradient magnitude. The result is a recipe that “just works” with minimal tuning — and at the scale of LLM training, where each hyperparameter sweep costs millions of dollars, just works wins.

Given gradient at step :

m_t = β₁ · m_{t-1} + (1 - β₁) · g_t          # first moment (mean)
v_t = β₂ · v_{t-1} + (1 - β₂) · g_t²         # second moment (variance)

m̂_t = m_t / (1 - β₁ᵗ)                        # bias correction
v̂_t = v_t / (1 - β₂ᵗ)                        # bias correction

θ_t = θ_{t-1} - η · m̂_t / (sqrt(v̂_t) + ε)

Typical hyperparameters: β₁ = 0.9, β₂ = 0.95 for LLMs (0.999 for vision), ε = 1e-8. The bias correction terms compensate for the moment estimates starting at zero, which would otherwise bias the update downward in early steps.

AdamW changes one thing: weight decay is applied as a separate step, not folded into the gradient. This decouples regularization from the adaptive scaling and is empirically much better for transformer training.

The memory cost

Adam stores two extra tensors per parameter (first and second moments), each the same dtype and shape as the parameter itself. For a 70B-parameter model in float32, that’s roughly 280 GB of optimizer state — more than the model weights. This is why distributed training frameworks like ZeRO, FSDP, and DeepSpeed shard optimizer state across GPUs as their first move; without sharding, optimizer state alone won’t fit on a single node.

Newer entrants

Several optimizers have challenged AdamW: Lion (sign-of-momentum updates, lower memory), Shampoo / SOAP (second-order methods approximating per-layer gradient covariance), and Muon (a geometric variant getting traction in 2025-2026 releases for non-embedding parameters). AdamW remains the default for any production training run without a specific reason to deviate.

Go further

Why does Adam dominate LLM training?

Adam's per-parameter adaptive scaling handles the wildly different gradient magnitudes you see across embedding tables, attention weights, and LayerNorm scales — all in the same model. SGD with momentum needs careful per-layer tuning to keep up; Adam mostly works out of the box, which matters at the scale of frontier-model training.

What's the difference between Adam and AdamW?

AdamW decouples weight decay from the gradient update. In plain Adam, weight decay gets blended into the running gradient statistics, which interacts badly with adaptive scaling. AdamW applies decay directly to the parameter, separately. Every modern LLM uses AdamW, not Adam.

How much extra memory does Adam cost?

Adam stores two extra tensors per parameter — the first and second moment estimates — both the same shape as the parameters. That's 2x the parameter memory for optimizer state, plus 1x for gradients, plus 1x for parameters: roughly 4x parameter memory just to fit one training step. This is why optimizer-state sharding (ZeRO, FSDP) became essential.

ZeroEntropy
The best AI teams build with ZeroEntropy models
Follow us on
GitHubTwitterSlackLinkedInDiscord