Why warmup at all?
Early-step gradients are dominated by the random initialization and have high variance; a run that starts at the peak LR often diverges. Linear warmup from 0 to the peak over the first
Also known as: LR schedule, cosine schedule, warmup, WSD, warmup-stable-decay, linear decay
A learning-rate scheduler is the function that changes the learning rate over training. Linear warmup followed by cosine decay is the modern default; WSD (warmup-stable-decay) is the 2024 successor. Picking the schedule is as load-bearing as picking the peak LR.
The optimization landscape changes during training. Early: gradients on random weights have high variance; small steps. Mid: the trajectory is stable; take the largest steps the model tolerates. Late: the model is refining the basin it has found; small steps prevent overshooting. A constant
Cosine vs WSD is the most consequential schedule decision in modern LLM training — cosine commits a token budget upfront; WSD lets you train arbitrarily long at peak and decide where to decay later.
Warmup typically spans 0.5-2% of total steps; Llama-3 used 8000 warmup steps in a 15T-token run scheduled for 1.2M steps (~0.7%). The first-order justification: keep
Fine-tuning schedules differ in every dimension. Peak LR drops 10-100x —
A cosine schedule fuses decay into the entire run. Every checkpoint sits at a different effective LR, and the total step count is fixed at step zero, so extending the run past the planned budget requires retraining or hacking the schedule.
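The coupling between the cosine schedule and the planned step count can be seen in a minimal sketch. The function name `warmup_cosine_lr`, its parameters, and the numbers used (peak LR 3e-4, 1,000 warmup steps) are illustrative assumptions, not values from any particular training run.

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """LR at `step` for linear warmup followed by cosine decay to `min_lr`.

    Hypothetical helper for illustration; all argument names are assumptions.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup budget consumed so far, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Same step, two different planned budgets: total_steps enters the formula
# at every step, so the two runs are at different LRs at step 50,000.
lr_short = warmup_cosine_lr(50_000, total_steps=100_000, warmup_steps=1_000, peak_lr=3e-4)
lr_long = warmup_cosine_lr(50_000, total_steps=200_000, warmup_steps=1_000, peak_lr=3e-4)
print(lr_short, lr_long)
```

The run planned for 200,000 steps is still at a higher LR at step 50,000 than the run planned for 100,000 steps, which is why a checkpoint from one budget cannot simply be continued under the other.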
WSD breaks the coupling. During the stable phase the LR is constant, so every plateau checkpoint is in the same optimization regime — the “exploration” half of training. From one plateau checkpoint you can launch several independent decay phases — math, code, long-context — each yielding a specialized model without retraining. MiniCPM named this property; Llama-3.1 used it for its long-context variant. The cost is a slightly worse final loss than cosine at matched compute if you commit to a single decay — WSD pays a small loss penalty for the right to specialize cheaply at the end.
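A rough sketch of a WSD schedule makes the decoupling concrete. It assumes a linear decay phase (other decay shapes are also used in practice), and `wsd_lr`, its parameters, and the step counts are hypothetical names and values chosen for illustration.

```python
def wsd_lr(step, warmup_steps, decay_start, decay_steps, peak_lr, min_lr=0.0):
    """LR at `step` for warmup-stable-decay with a linear decay phase.

    Hypothetical helper; the linear decay shape is an assumption.
    """
    if step < warmup_steps:
        # Warmup: linear ramp from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Stable phase: constant LR, independent of when decay will begin.
        return peak_lr
    # Decay phase: linear interpolation from peak_lr down to min_lr.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress

# Two branches sharing one plateau: one decays from step 80,000, the other
# keeps training at peak and decays from step 150,000.
common = dict(warmup_steps=1_000, decay_steps=10_000, peak_lr=3e-4)
early_branch = [wsd_lr(s, decay_start=80_000, **common) for s in (50_000, 85_000, 90_000)]
late_branch = [wsd_lr(s, decay_start=150_000, **common) for s in (50_000, 155_000, 160_000)]
print(early_branch, late_branch)
```

Unlike the cosine case, `decay_start` has no effect on the LR during the plateau, so the choice of where to decay can be deferred and made separately for each branch.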
Cosine commits the entire token budget upfront: a run stopped early ends at an LR that has not finished decaying, and fixing that means retraining. WSD (warmup-stable-decay) holds the peak LR for the bulk of training and only decays at the end, so you can train arbitrarily long at peak, branch checkpoints for different downstream uses, and run a separate decay for each. This is the MiniCPM / Llama-3.1 era recipe.
Fine-tuning uses a much smaller peak LR (typically