Learning-Rate Scheduler

Also known as: LR schedule, cosine schedule, warmup, WSD, warmup-stable-decay, linear decay

TL;DR

A learning-rate scheduler is the function that changes the learning rate over training. Linear warmup followed by cosine decay is the modern default; WSD (warmup-stable-decay) is the 2024 successor. Picking the schedule is as load-bearing as picking the peak LR.

A learning-rate scheduler is the function that controls the learning rate at training step t. The modern default has three phases: linear warmup, a peak (held stable or already decaying), and a final decay toward zero. The shape of the schedule is as load-bearing as its peak value. Two recipes dominate production: cosine decay with warmup, and WSD (warmup-stable-decay).

Why the schedule matters

The optimization landscape changes during training. Early, gradients on random weights have high variance, so steps must be small. Mid-run, the trajectory is stable, so the optimizer can take the largest steps the model tolerates. Late, the model is refining the basin it has found, and small steps prevent overshooting. A constant learning rate cannot satisfy all three regimes, which is why constant-LR training is extinct in serious pretraining.

The standard recipes

Schedules in production training
  • Cosine decay with warmup. Ramp linearly to the peak over the warmup steps, then follow a half-period cosine down to a small floor (~10% of peak) over the remaining steps. The 2018-2023 default: GPT-3, Llama-2, Mistral 7B.
  • WSD (warmup-stable-decay). Warmup, hold flat, decay over the final 10%. MiniCPM (2024), Llama-3.1, DeepSeek-V2. Allows checkpoint branching.
  • Linear decay. Warmup, then a straight line to zero. The BERT-era default; still common in short fine-tunes. Sometimes called “triangular.”
  • Constant LR. Very short fine-tunes and most LoRA training, where the run is too short for decay to matter.
  • Cyclical / SGDR. Periodic warm restarts mid-training. A pre-LLM research direction, mostly historical.
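
The first and third recipes in the list can be sketched as plain functions of the step index. This is a minimal illustration, not any particular library's implementation: the function names, the `floor_frac` parameter, and the choice to start warmup at exactly zero are assumptions made for the example.

```python
import math

def cosine_with_warmup(step, total_steps, warmup_steps, peak_lr, floor_frac=0.1):
    """Linear warmup from 0 to peak_lr, then a half-period cosine down to a floor.

    floor_frac is a hypothetical parameter: the floor as a fraction of peak_lr
    (0.1 mirrors the "~10% of peak" figure in the text).
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    floor = floor_frac * peak_lr
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return floor + 0.5 * (peak_lr - floor) * (1.0 + math.cos(math.pi * progress))

def linear_decay_with_warmup(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup from 0 to peak_lr, then a straight line down to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)
```

Both functions need `total_steps` before training starts, which is the upfront commitment that the cosine recipe implies.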

Cosine vs WSD is the most consequential schedule decision in modern LLM training — cosine commits a token budget upfront; WSD lets you train arbitrarily long at peak and decide where to decay later.

Warmup details

Warmup typically spans 0.5-2% of total steps; Llama-3 used 8000 steps over a 15T-token run (~0.05%). The first-order justification, keeping the LR small while the model takes its first steps from a random init, is correct but incomplete. The deeper reason is the optimizer: Adam's per-parameter scaling depends on v_t, the EMA of squared gradients, which needs on the order of 1/(1 - β2) steps to be informative. Before that, the update is over-amplified for many parameters.
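
To make "informative" concrete, the sketch below counts how many gradient samples effectively contribute to Adam's second-moment EMA after a given number of steps. The Kish effective sample size used here is a proxy chosen for this illustration, not something the argument above prescribes, and both function names are hypothetical.

```python
def ema_time_constant(beta2):
    """Approximate number of steps over which an EMA with decay beta2 averages."""
    return 1.0 / (1.0 - beta2)

def effective_samples(n_steps, beta2):
    """Kish effective sample size of the EMA weights after n_steps updates.

    The most recent squared gradient has weight 1, the one before it beta2,
    then beta2**2, and so on. Bias correction rescales all weights equally,
    so it does not change this count.
    """
    weights = [beta2 ** k for k in range(n_steps)]
    total = sum(weights)
    return total * total / sum(w * w for w in weights)
```

After a single step the estimate rests on one sample, so it is as noisy as one squared gradient. The count grows roughly linearly with the step index until it saturates, which is the window a warmup phase is meant to cover.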

Fine-tuning vs pretraining

Fine-tuning schedules differ from pretraining schedules in every dimension. Peak LR drops 10-100x, because the model starts close to the right answer and large steps erase pretrained knowledge. Warmup shrinks or disappears. Cosine-to-zero over 1-3 epochs is the standard shape; constant LR is acceptable for LoRA because the adapter matrices are small and the run is short.
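
A cosine-to-zero fine-tuning schedule might look like the sketch below. The function name and parameters are assumptions for illustration, and any peak LR passed in is a placeholder rather than a recommended value.

```python
import math

def finetune_cosine_to_zero(step, steps_per_epoch, num_epochs, peak_lr, warmup_steps=0):
    """Optional short linear warmup, then a half-period cosine from peak_lr to 0."""
    total_steps = steps_per_epoch * num_epochs
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Hypothetical usage: 3 epochs of 100 steps each, no warmup, placeholder peak LR.
schedule = [finetune_cosine_to_zero(s, 100, 3, peak_lr=2e-5) for s in range(301)]
```

With `warmup_steps=0` the run starts at the peak immediately, which matches the "warmup shrinks or disappears" case. The constant-LR alternative for LoRA is simply `peak_lr` at every step.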

Checkpoint branching

A cosine schedule fuses decay into the entire run. Every checkpoint sits at a different effective LR, and the total step count was fixed at step zero, so extending the run past that count means retraining or hacking the schedule.

WSD breaks the coupling. During the stable phase the LR is constant, so every plateau checkpoint is in the same optimization regime — the “exploration” half of training. From one plateau checkpoint you can launch several independent decay phases — math, code, long-context — each yielding a specialized model without retraining. MiniCPM named this property; Llama-3.1 used it for its long-context variant. The cost is a slightly worse final loss than cosine at matched compute if you commit to a single decay — WSD pays a small loss penalty for the right to specialize cheaply at the end.
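
The branching property can be sketched with a WSD function whose decay start is an argument rather than a consequence of the total step count. The linear decay shape, the function name, and every parameter name here are assumptions for illustration; published WSD recipes use various decay shapes.

```python
def wsd_lr(step, warmup_steps, decay_start, decay_steps, peak_lr, final_lr=0.0):
    """Warmup-stable-decay: linear warmup, flat plateau, then decay to final_lr.

    The decay phase here is linear, which is an assumption of this sketch.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < decay_start:
        return peak_lr  # stable phase: same LR for every plateau checkpoint
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (final_lr - peak_lr) * progress

# Two hypothetical branches that share one plateau: one decays over the final
# 10% of a 10,000-step run, the other keeps training at peak and decays later.
short_run = [wsd_lr(s, 1000, 9000, 1000, 1.0) for s in range(10001)]
long_run = [wsd_lr(s, 1000, 20000, 2000, 1.0) for s in range(22001)]
```

Up to step 9000 the two schedules are identical, so a checkpoint saved anywhere on the plateau is a valid starting point for either decay phase.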

Go further

Why warmup at all?

Early-step gradients are dominated by the random initialization and have high variance, so starting at the peak LR tends to diverge. Linear warmup from 0 to the peak over the first 0.5-1% of steps lets the optimizer's second-moment estimates (Adam's v_t) stabilize before taking large steps.

Cosine vs WSD — what's the actual difference?

Cosine commits the entire token budget upfront; you can't stop early without retraining. WSD (warmup-stable-decay) holds the peak LR for the bulk of training and only decays at the end, so you can train arbitrarily long at peak, branch checkpoints for different downstream uses, and run a separate decay for each. It is the recipe of the MiniCPM / Llama-3.1 era.

How does the schedule change for fine-tuning vs pretraining?

Fine-tuning uses a much smaller peak LR (typically 10-100x smaller than pretraining) and shorter or no explicit warmup. Cosine to zero over 1-3 epochs is standard. Constant LR is sometimes used for LoRA.
