Also known as: early termination, best-checkpoint selection
TL;DR
Early stopping monitors validation loss during training and halts the run once it starts climbing, even though training loss is still falling. The checkpoint with the lowest validation loss is the one you ship. There is nothing to tune beyond patience and min-delta, no extra parameters, and no inference cost. It has been a standard regularizer in supervised deep learning for roughly thirty years.
The U-curve
The classic overfitting picture: training loss falls monotonically, validation loss falls then rises, the gap widens past a critical iteration. The minimum on the validation curve is the point where the model has fit the signal but not yet memorized the noise. Early stopping is the algorithm “stop there.”
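The U-curve can be reproduced with a toy model. The sketch below is illustrative only: the curve shapes (an exponential fit term plus a linearly growing overfitting penalty) and the function name `synthetic_losses` are assumptions chosen to mimic the picture, not measurements from a real run.

```python
import math

def synthetic_losses(num_evals=200):
    """Return (train, val) loss lists with a hypothetical U-shaped validation curve."""
    train, val = [], []
    for t in range(num_evals):
        fit = math.exp(-t / 30.0)   # error from signal not yet learned; shrinks with training
        overfit = 0.0005 * t        # assumed penalty from memorized noise; grows with training
        train.append(fit)           # training loss only sees the fit term
        val.append(fit + overfit)   # validation loss also pays the overfitting penalty
    return train, val

train, val = synthetic_losses()

# "Stop there": the evaluation index with the lowest validation loss.
best_eval = min(range(len(val)), key=val.__getitem__)
print(f"best eval index: {best_eval}, val loss there: {val[best_eval]:.4f}")
print(f"final val loss: {val[-1]:.4f}, final train loss: {train[-1]:.4f}")
```

In this toy setup the training loss keeps falling to the end while the validation minimum sits somewhere in the middle of the run, which is the point early stopping tries to find.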
The alternative is to train to convergence and lean on weight decay, dropout, and data augmentation to keep the gap small. In small-data regimes those are often not enough on their own, and early stopping is the cheap fix.
Implementation: patience and best-so-far
Two knobs:
Patience — evaluation rounds to wait without improvement before halting. Patience 3 means “stop if validation loss has not set a new minimum in 3 consecutive evals.”
Min-delta — the smallest improvement that counts. Stops you from chasing noise.
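The checkpoint you ship is the lowest-validation-loss one seen across the whole run, not the last one before halting. Hugging Face Trainer, PyTorch Lightning, and Keras all support this pattern under different names: load_best_model_at_end=True with EarlyStoppingCallback in Trainer, the EarlyStopping and ModelCheckpoint callbacks in Lightning, and EarlyStopping(restore_best_weights=True) in Keras.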
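The two knobs and the best-so-far rule can be sketched in a few lines of framework-free Python. This is a minimal illustration, not the implementation used by any of the libraries named above; the class name `EarlyStopper`, its attributes, and the dictionary standing in for model weights are all hypothetical.

```python
import copy

class EarlyStopper:
    """Minimal patience / min-delta tracker that also keeps the best checkpoint."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # evals to wait without improvement
        self.min_delta = min_delta    # smallest drop that counts as improvement
        self.best_loss = float("inf")
        self.best_state = None        # copy of the state at the best eval
        self.bad_evals = 0            # consecutive evals without improvement

    def step(self, val_loss, state):
        """Record one evaluation; return True if training should halt."""
        if val_loss < self.best_loss - self.min_delta:
            # New minimum: remember it and reset the patience counter.
            self.best_loss = val_loss
            self.best_state = copy.deepcopy(state)
            self.bad_evals = 0
        else:
            # Design choice of this sketch: drops smaller than min_delta
            # neither reset patience nor replace the saved checkpoint.
            self.bad_evals += 1
        return self.bad_evals >= self.patience

# Usage with a made-up validation curve; `state` stands in for model weights.
val_losses = [0.90, 0.70, 0.60, 0.61, 0.63, 0.66, 0.70]
stopper = EarlyStopper(patience=3, min_delta=0.001)
for i, loss in enumerate(val_losses):
    state = {"eval": i}
    if stopper.step(loss, state):
        break

print(f"halted at eval {i}; shipping checkpoint {stopper.best_state}")
```

Note that the run halts three evals after the minimum, but the state that is kept is the one from the minimum itself, which is the distinction the paragraph above draws.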
Why it is implicitly L2
For linear regression trained by gradient descent from zero initialization, stopping after T steps gives nearly the same solution as solving the L2-regularized problem at a strength λ that depends on T: fewer steps correspond to a larger λ. In that sense early stopping is L2 in disguise. In nonlinear nets the correspondence is only heuristic, but the intuition holds: fewer optimizer steps means parameters stay closer to initialization, which tends to mean a simpler function among those consistent with the data. This is why early stopping and weight decay are partial substitutes; modern recipes use both.
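The linear-regression claim can be checked numerically one curvature direction at a time. The sketch below assumes a quadratic loss, gradient descent started at zero, and the common heuristic pairing λ ≈ 1/(ηT) for learning rate η and T steps; the function names and the specific values of η, T, and the curvatures are illustrative assumptions. Along a direction with curvature s, T steps of gradient descent recover a fraction 1 − (1 − ηs)^T of the unregularized solution, while ridge regression recovers s / (s + λ).

```python
def gd_shrinkage(curvature, lr, steps):
    """Fraction of the least-squares solution reached by GD from zero init."""
    return 1.0 - (1.0 - lr * curvature) ** steps

def ridge_shrinkage(curvature, lam):
    """Fraction of the least-squares solution kept by L2 regularization."""
    return curvature / (curvature + lam)

lr, steps = 0.01, 100          # illustrative values, not a recommendation
lam = 1.0 / (lr * steps)       # assumed heuristic pairing of lambda with T

rows = []
for curvature in (0.01, 0.1, 1.0, 10.0):
    gd = gd_shrinkage(curvature, lr, steps)
    ridge = ridge_shrinkage(curvature, lam)
    rows.append((curvature, gd, ridge))
    print(f"curvature={curvature:<5} early-stopped GD={gd:.3f}  ridge={ridge:.3f}")
```

Both factors leave high-curvature directions almost untouched and suppress low-curvature ones, which is the shared mechanism. The numbers agree closely at the extremes and differ somewhat in between, so the match under this pairing is approximate rather than exact.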
When early stopping is irrelevant
You can usually skip early stopping in these cases:
LLM pretraining — one pass through a fixed token budget; validation loss does not climb.
Single-epoch training on streaming data — same reason; the model never re-sees data.
Models tiny relative to dataset — overfitting is not the failure mode.
Already heavy regularization — high weight decay, lots of augmentation; the U-curve is flat enough that the stopping point does not matter.
The first case is the consequential one. Frontier LLMs see each token roughly once across hundreds of billions to trillions of tokens, so training and validation loss both keep decreasing until compute runs out. The “best” checkpoint is simply the last one, and the U-curve never appears.
Fine-tuning is the opposite case, and exactly where early stopping shines: the task dataset is small and easy to overfit if you train too long. The classic recipe is 1-3 epochs, evaluating every few hundred steps and saving the best checkpoint. Patience can be tight (2-3 evals) because the runs are short and the curves are clean. Hugging Face’s Trainer supports this with load_best_model_at_end=True, which is off by default, paired with EarlyStoppingCallback.
For instruction-tuning of frontier models the regime sits between pretraining and classical fine-tuning: early stopping rarely fires, but a best-checkpoint save policy is cheap insurance against runs that go off the rails late.
Early stopping does not solve catastrophic forgetting — losing general capability while gaining task performance is a separate failure mode that needs regularization toward the base model (KL penalty, parameter freezing), not training-duration control.
Go further
Why is early stopping considered a regularizer?
It implicitly bounds the function class explored by the optimizer. Stopping at iteration T constrains the parameters to whatever is reachable from initialization in T steps, a smaller and simpler set than the one unlimited training can reach. For linear models this can be made precise as a correspondence with L2 regularization.
What about LLM pretraining — why does it not use early stopping?
LLM pretraining trains for one epoch on a fixed token budget chosen ahead of time. The validation loss does not actually start climbing — the training and validation losses both decrease monotonically because the model never re-sees data. There is no overfitting U-curve to detect, so there is nothing to stop early.
What is patience?
Patience is the number of evaluation rounds you wait without improvement before stopping. A patience of 3 means “stop if validation loss has not improved for 3 consecutive evals.” Higher patience is more conservative and gives noisy validation curves room to recover; lower patience saves compute but risks stopping prematurely on noise.
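The trade-off can be seen by running the same noisy curve under different patience values. The sketch below is a toy illustration: the function name `run_with_patience` and the hand-written validation losses are made up for this example, and min-delta is left out to keep the focus on patience.

```python
def run_with_patience(val_losses, patience):
    """Return (eval index where the run halts, best validation loss seen)."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, bad_evals = loss, 0      # new minimum resets the counter
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i, best             # patience exhausted: halt here
    return len(val_losses) - 1, best       # ran to the end without halting

# Made-up curve: a downward trend with small upward blips from noise.
noisy = [1.0, 0.8, 0.85, 0.7, 0.72, 0.6, 0.62, 0.65, 0.55, 0.57, 0.58, 0.6, 0.61]

for patience in (1, 2, 3):
    halted_at, best = run_with_patience(noisy, patience)
    print(f"patience={patience}: halted at eval {halted_at}, best val loss {best}")
```

On this curve each increase in patience lets the run survive one more blip and reach a lower minimum, at the cost of more evaluations before halting.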