Learning Rate

Also known as: LR, step size, η, eta

TL;DR

The learning rate is the scalar η in the gradient-descent update: how big a step to take in the direction of the negative gradient. Too high and training diverges, too low and it stalls. It is the single most important hyperparameter to get right in training.

The learning rate is the scalar that determines how far each gradient descent step moves the parameters:

θ ← θ - η ∇L(θ)

It is the single most important hyperparameter in any training run. Set it too high and the loss diverges — parameters bounce past low-loss regions and either explode to NaN or oscillate forever. Set it too low and training crawls — the loss decreases, but you waste compute taking steps that are smaller than necessary. Most of the art of training is picking the right LR schedule.
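As a minimal sketch of the update rule above, assume a toy loss L(θ) = Σ θᵢ², chosen only for illustration. The helper names gd_step and grad_quadratic are hypothetical.

```python
def gd_step(theta, grad, lr):
    """One gradient-descent update: theta <- theta - lr * grad."""
    return [t - lr * g for t, g in zip(theta, grad)]


def grad_quadratic(theta):
    """Gradient of the toy loss L(theta) = sum(t**2), which is 2 * theta."""
    return [2.0 * t for t in theta]


theta = [1.0, -2.0]
theta = gd_step(theta, grad_quadratic(theta), lr=0.1)
# Each coordinate shrinks by a factor of (1 - 2 * lr) = 0.8 on this toy loss.
print(theta)
```

On this particular loss, any η between 0 and 1 shrinks the parameters toward the minimum, and any η above 1 makes them grow every step.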

Too high vs too low

Mental picture: the loss is a long, narrow valley. A small learning rate crawls down the valley floor. A medium one makes good progress. A large one jumps to the opposite wall each step, oscillating without descending. An enormous one jumps over the valley entirely into higher loss, then again, then again, diverging.

LR-too-high looks like a loss curve that decreases for a while, then explodes upward and produces NaNs. LR-too-low looks like a smooth-but-slow descent that burns GPU hours with little to show for them.
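The valley picture can be reproduced with a small sketch. It assumes a hypothetical quadratic loss that is shallow along x (the valley floor) and 100 times steeper along y (the walls). The four learning rates are illustrative values picked for this toy problem, not recommendations.

```python
def loss(x, y):
    # Narrow valley: shallow along x (the floor), steep along y (the walls).
    return 0.5 * (x * x + 100.0 * y * y)


def run(lr, steps=100, start=(10.0, 1.0)):
    """Run plain gradient descent and return the final loss."""
    x, y = start
    for _ in range(steps):
        gx, gy = x, 100.0 * y  # gradient of the loss above
        x, y = x - lr * gx, y - lr * gy
    return loss(x, y)


for name, lr in [("small", 1e-4), ("medium", 0.015),
                 ("large", 0.02), ("enormous", 0.03)]:
    print(f"{name:>8}  lr={lr:<7}  final loss={run(lr):.3g}")
```

For this loss the steep direction is stable only while η stays below 2/100 = 0.02. At exactly 0.02 the y coordinate flips between the two walls without shrinking, so the loss cannot fall much below 50. Past that threshold every step multiplies y by a factor larger than 1 in magnitude, and the loss diverges.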

Warmup-then-decay schedules

Constant learning rates are nearly extinct in modern LLM training. The dominant pattern is warmup followed by decay:

Warmup. For the first few hundred to few thousand steps, ramp η linearly from 0 (or a tiny floor) up to its peak. This gives the optimizer (typically AdamW) time to populate its variance estimates from real gradients before they get used to scale large updates.
Peak. Hold the peak for a brief window, or skip the hold and start decaying immediately.
Decay. Smoothly reduce η toward a small floor (often 10% of peak) over the rest of training. Cosine decay is the modern default; linear decay is the runner-up; some recipes use inverse-square-root.

The shape matters less than the principle: start small, climb to a peak that’s as large as the model will tolerate without diverging, then come back down so the model can settle into a low-loss region without overshooting.
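The warmup, peak, and decay phases described above can be sketched as one function of the step count. This is a minimal sketch rather than any particular library's scheduler. The function name lr_at_step and the example values (peak 3e-4, 2,000 warmup steps, 100,000 total steps) are assumptions, and it starts decaying immediately after warmup with no hold.

```python
import math


def lr_at_step(step, peak_lr, warmup_steps, total_steps, floor_ratio=0.1):
    """Linear warmup from 0 to peak_lr, then cosine decay to floor_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    floor_lr = floor_ratio * peak_lr
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return floor_lr + (peak_lr - floor_lr) * cosine


# Hypothetical run: peak 3e-4, 2,000 warmup steps, 100,000 steps total.
for step in (0, 1_000, 2_000, 51_000, 100_000):
    print(step, lr_at_step(step, 3e-4, 2_000, 100_000))
```

The schedule is a pure function of the step, so a resumed run can recompute the right value from the step counter alone.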

Adam’s per-parameter scaling depends on running estimates of the first and second moments of the gradient. At step zero, both moments are zero. The bias-correction terms 1/(1 - β₁^t) and 1/(1 - β₂^t) partially fix this for the expected value, but they don’t fix the variance of the moment estimates: early estimates are noisy because they’re built from very few samples.

If you take large steps with these noisy moment estimates, you can corrupt parameter values badly and even reach a region of the loss where the gradient itself becomes meaningless. Warmup avoids this by keeping η tiny while the moment estimates stabilize, then ramping up once the optimizer’s internal state is trustworthy.
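To see why the first steps are risky, here is a scalar sketch of a single Adam update with bias correction. It assumes the commonly used defaults β₁ = 0.9, β₂ = 0.999, ε = 1e-8, and the function name adam_step is hypothetical. With zero-initialized moments, the first bias-corrected update has magnitude close to η whatever the gradient's scale, so one noisy gradient sample dictates a full-size step.

```python
import math


def adam_step(g, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam update; returns (update, new_m, new_v). t starts at 1."""
    m = beta1 * m + (1 - beta1) * g          # running first moment
    v = beta2 * v + (1 - beta2) * g * g      # running second moment
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
    update = lr * m_hat / (math.sqrt(v_hat) + eps)
    return update, m, v


# First step from zero-initialized moments, for very different gradient scales.
for g in (1e-3, 1.0, 50.0):
    update, _, _ = adam_step(g, m=0.0, v=0.0, t=1, lr=1e-3)
    print(f"gradient={g:<6} first update={update:.6g}")
```

Because the step size at t = 1 is set almost entirely by η, keeping η small during warmup is what limits the damage from these early, single-sample estimates.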

This is also why resuming a training run from a checkpoint without restoring optimizer state usually requires a fresh warmup — the optimizer state is what makes a high LR safe.

How LR interacts with batch size and model size

Two scaling rules matter. LR scales roughly linearly with batch size: doubling the batch halves the variance of the gradient estimate, so you can take roughly twice the step. The rule breaks at very large batches but is the right starting point. LR also scales inversely with model size: larger models need smaller learning rates, with rules of thumb like .
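The batch-size rule is simple arithmetic. A minimal sketch, assuming a reference configuration whose learning rate is already known to work (the function name and the numbers are hypothetical):

```python
def scale_lr_linear(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: LR grows in proportion to batch size."""
    return base_lr * new_batch_size / base_batch_size


# Hypothetical reference run: lr 1e-4 at batch size 256, moved to batch size 512.
print(scale_lr_linear(1e-4, 256, 512))  # twice the reference learning rate
```

Treat the result as a starting point for a sweep rather than a final value, since the rule stops holding at very large batches.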

Fine-tuning vs pretraining

Pretraining typically uses peak LRs around 1e-4 to 3e-4 for AdamW. Fine-tuning uses much smaller values — 1e-5 to 5e-5 for full fine-tuning, sometimes 1e-4 for LoRA. Pretraining starts from random weights where every parameter needs to move a lot; fine-tuning starts from a model that’s already most of the way there, and aggressive updates erase what it already learned.

Go further

Why warmup, then decay?

Warmup avoids the early-step instability caused by uninitialized optimizer state: Adam's variance estimates need a few hundred steps to stabilize, and a high LR before then can blow up the model. Decay then helps the optimizer settle into a low-loss region as training progresses. The combination is near-universal across modern LLM training recipes.

What's the difference between cosine and linear decay?

Cosine decay smoothly drops the LR following half a cosine wave from peak to a small floor. Linear decay drops it on a straight line. Cosine is the default in most LLM pretraining (Llama, Mistral, Qwen, GPT) because the slow tail at the end gives the model time to refine without overshooting. Linear is common in fine-tuning.
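The two shapes are easy to compare side by side. The sketch below assumes both schedules run from the same peak to the same floor and takes progress as the fraction of the decay phase completed (0 to 1). The function names and example values are hypothetical.

```python
import math


def cosine_decay(progress, peak_lr, floor_lr):
    """Half a cosine wave from peak_lr (progress=0) to floor_lr (progress=1)."""
    return floor_lr + 0.5 * (peak_lr - floor_lr) * (1.0 + math.cos(math.pi * progress))


def linear_decay(progress, peak_lr, floor_lr):
    """Straight line from peak_lr (progress=0) to floor_lr (progress=1)."""
    return peak_lr - (peak_lr - floor_lr) * progress


for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, cosine_decay(p, 3e-4, 3e-5), linear_decay(p, 3e-4, 3e-5))
```

With matching endpoints, cosine sits above linear in the first half, crosses it at the midpoint, and sits below it in the second half, where it flattens out toward the floor.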

How do you actually pick a peak learning rate?

Empirically, by sweeping at a smaller scale and using the loss curve to find the largest LR that doesn't diverge. Typical peak values: roughly 1e-4 to 3e-4 for AdamW-trained transformers at base size, scaling down inversely with model size for larger models. Leslie Smith's LR range test, popularized as the 'LR finder' in fastai (steadily increase the LR until the loss explodes), is the canonical heuristic.
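A sketch of such a sweep on a toy problem follows. It is not any library's implementation. The loss is a hypothetical narrow-valley quadratic, the LR is increased geometrically each step (a common variant, chosen here so the sweep covers several orders of magnitude), and the blow-up criterion (loss above 4x the best seen so far) is an arbitrary assumption.

```python
def toy_loss(x, y):
    # Hypothetical narrow valley: shallow along x, 100x steeper along y.
    return 0.5 * (x * x + 100.0 * y * y)


def lr_range_test(lr_min=1e-5, lr_max=1.0, steps=200, blowup_factor=4.0):
    """Sweep LR upward one GD step at a time; stop when the loss blows up.

    Returns (lr_at_blowup or None, history of (lr, loss) pairs).
    """
    x, y = 10.0, 1.0
    best = toy_loss(x, y)
    history = []
    ratio = (lr_max / lr_min) ** (1.0 / (steps - 1))  # geometric LR growth
    lr = lr_min
    for _ in range(steps):
        x, y = x - lr * x, y - lr * 100.0 * y  # one gradient-descent step
        current = toy_loss(x, y)
        history.append((lr, current))
        if current > blowup_factor * best:
            return lr, history
        best = min(best, current)
        lr *= ratio
    return None, history


blowup_lr, history = lr_range_test()
print("loss blew up at lr =", blowup_lr)
```

On this toy loss the steep direction becomes unstable above η = 0.02, but the sweep reports a larger value because the loss needs several steps to grow visibly. That lag is one reason to back off from the reported blow-up point rather than train at it.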
