Catastrophic Forgetting

Also known as: catastrophic interference, forgetting

TL;DR

Catastrophic forgetting occurs when fine-tuning a pre-trained model on a new task erases capabilities the base model originally had. It is the classical neural-network failure mode that dominates fine-tuning practice, and a major reason techniques such as LoRA and rehearsal (mixed-data training) are so widely used.

Catastrophic forgetting is the failure mode where fine-tuning a pre-trained model on task A causes it to lose capability on task B that it previously had. The phenomenon was named in the 1980s and has been a thorn in transfer learning ever since — neural networks have no inductive bias toward preserving knowledge they’re not currently being optimized on.

Why it happens

Standard gradient descent moves weights to minimize loss on the current data distribution. If a weight that was important for solving task B is even slightly useful for solving task A, the optimizer will adjust it to fit task A, possibly destroying its task-B function. There’s no explicit penalty for this in the vanilla loss; the only “memory” the network has of task B is whatever happens to be encoded in weights the optimizer doesn’t touch.
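The mechanism can be seen in a deliberately extreme toy case. The sketch below is an illustration only, not a model of an LLM: it assumes a one-weight linear model and two made-up regression tasks (`task_b` as the old capability, `task_a` as the fine-tuning task) that compete for that single shared weight.

```python
def mse(w, data):
    """Mean squared error of the one-weight model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def gradient_descent(w, data, lr=0.05, steps=200):
    """Plain full-batch gradient descent on the MSE of the given data only."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


# Hypothetical tasks that need different values of the same weight.
task_b = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]   # "old" capability
task_a = [(x, -1.0 * x) for x in (1.0, 2.0, 3.0)]  # new fine-tuning task

w = gradient_descent(0.0, task_b)   # "pretraining" on task B
loss_b_before = mse(w, task_b)

w = gradient_descent(w, task_a)     # "fine-tuning" on task A only
loss_a_after = mse(w, task_a)
loss_b_after = mse(w, task_b)

print(f"task B loss before fine-tuning: {loss_b_before:.6f}")
print(f"task A loss after fine-tuning:  {loss_a_after:.6f}")
print(f"task B loss after fine-tuning:  {loss_b_after:.6f}")
```

Nothing in the second training run refers to `task_b`, so the optimizer is free to move the shared weight wherever task A wants it. Real networks have many more weights and far more overlap between tasks, so forgetting is usually partial rather than total.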

For LLMs, “task B” is everything — general world knowledge, language fluency, reasoning, code, math. A model fine-tuned aggressively on legal documents can lose the ability to write Python; a model fine-tuned on classification labels can lose the ability to follow open-ended instructions.

Symptoms in fine-tuned models

Mode collapse. The model produces outputs in the format of the fine-tuning data regardless of input. Asked an open question, it returns a one-word classification.
Capability cliff. Strong on the fine-tuning task; suddenly poor on adjacent tasks the base handled fine.
Refusal failures or over-generalization. Safety fine-tuning that overfits causes the model to refuse benign requests; instruction fine-tuning that overfits causes it to comply with harmful ones.
Verbosity collapse. Outputs get shorter or longer than the base, depending on what the fine-tune data favored.

Mitigations

LoRA / PEFT. The base model is frozen; only a low-rank adapter trains. The base capability is physically preserved. This is the strongest mitigation in practice and the reason LoRA dominates production fine-tuning.
Smaller learning rate. A 10× smaller LR moves weights less, preserving more. Tradeoff: slower convergence on the new task.
Mixed-data fine-tuning (rehearsal). Mix in 5-20% of pre-training-style data during fine-tuning. The model has to stay good at the old distribution, which preserves the underlying weights.
Regularization to base. L2 penalty on (current weights - base weights), or KL penalty on output distributions. Explicitly punishes drift from the base.
Early stopping on a held-out general benchmark. Track MMLU or a similar benchmark during fine-tuning and stop before the scores collapse, even if the task loss is still improving.
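As one illustration of the "regularization to base" item, the sketch below adds both penalties to a task loss. It is a minimal sketch under stated assumptions: the function names, the coefficients `l2_coef` and `kl_coef`, and the direction of the KL term (base relative to current) are illustrative choices, not values or conventions taken from a specific library.

```python
import math


def l2_to_base(weights, base_weights):
    """Squared L2 distance between current and base weights."""
    return sum((w - b) ** 2 for w, b in zip(weights, base_weights))


def kl_divergence(p_base, p_current, eps=1e-12):
    """KL(p_base || p_current) for two discrete output distributions."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(p_base, p_current)
        if p > 0
    )


def regularized_loss(task_loss, weights, base_weights, p_base, p_current,
                     l2_coef=0.01, kl_coef=0.1):
    """Task loss plus explicit penalties for drifting away from the base model."""
    return (
        task_loss
        + l2_coef * l2_to_base(weights, base_weights)
        + kl_coef * kl_divergence(p_base, p_current)
    )


base_w = [0.2, -0.4]
base_p = [0.7, 0.3]

# No drift: both penalties vanish and the loss equals the task loss.
no_drift = regularized_loss(1.5, base_w, base_w, base_p, base_p)

# Drifted weights and outputs: the same task loss now costs more.
drifted = regularized_loss(1.5, [0.5, -0.4], base_w, base_p, [0.4, 0.6])
```

The two coefficients set how strongly drift is punished relative to the task loss. Like the rehearsal ratio, they are usually tuned empirically.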

In retrieval specifically

Custom-trained rerankers hit a milder version: a reranker fine-tuned hard on legal documents loses general retrieval quality on out-of-domain queries. The fix is the same playbook — train via LoRA on the base reranker, mix in a fraction of the original training data, validate on out-of-domain held-out sets.

Every fine-tune trades some general capability for some specific capability. The job is to make the trade favorably.

LoRA prevents forgetting for a mechanical reason: it literally doesn’t touch the base weights. The frozen weights $W_0$ retain whatever they encoded during pretraining. The trainable adapter is a low-rank perturbation $\Delta W = BA$, where $B$ and $A$ are small matrices initialized so that $\Delta W$ starts at zero (in the standard setup, $B$ is all zeros and $A$ holds small random values). At deployment, the effective weights are $W = W_0 + BA$, but $W_0$ has been preserved bit-for-bit.
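The sketch below illustrates this with plain Python lists. It is not a real LoRA implementation: the names `W0`, `A`, `B` and the tiny shapes are assumptions chosen for the example, and "training" is simulated by overwriting the adapter matrix `B`.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def effective_weights(W0, B, A):
    """Frozen base weights plus the low-rank update B @ A."""
    return add(W0, matmul(B, A))


rng = random.Random(0)
d_out, d_in, r = 4, 6, 2  # toy sizes; r is the adapter rank

W0 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
W0_snapshot = [row[:] for row in W0]

A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]  # r x d_in
B = [[0.0] * r for _ in range(d_out)]                               # d_out x r, zeros

# At initialization B @ A is zero, so the model behaves exactly like the base.
W_eff_init = effective_weights(W0, B, A)

# Simulated training step: only the adapter changes, W0 is never written to.
B = [[rng.gauss(0, 0.1) for _ in range(r)] for _ in range(d_out)]
W_eff_trained = effective_weights(W0, B, A)
```

After the simulated update, `W_eff_trained` differs from the base, but `W0` still equals `W0_snapshot`. Dropping the adapter recovers the base model exactly.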

This is fundamentally different from full fine-tuning, where every weight is a candidate for the optimizer to overwrite. With full fine-tuning, the only way to preserve a capability is for the gradient with respect to your task loss to happen to leave the relevant weights alone, which it generally won’t.

The trade-off is expressivity. LoRA can only represent perturbations of rank at most $r$ (typically 8-64). For most fine-tuning tasks this is plenty, because the gap between the base model and the task-tuned model lives in a low-dimensional subspace. For tasks that require fundamentally restructuring how the model represents information (extending to a new language, very different output format), full fine-tuning may be necessary, and the forgetting risk has to be managed by other means (rehearsal, KL regularization, early stopping).
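To give a sense of how restrictive a low rank is, the arithmetic below compares trainable parameter counts for one weight matrix. The 4096 x 4096 layer size is a hypothetical example, not a figure from any particular model. A rank-r adapter for a d_out x d_in matrix trains r * (d_in + d_out) parameters.

```python
def full_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_in * d_out


def lora_params(d_in, d_out, r):
    """Trainable parameters of a rank-r adapter: B is d_out x r, A is r x d_in."""
    return r * (d_in + d_out)


d = 4096  # hypothetical square layer
for r in (8, 16, 64):
    fraction = lora_params(d, d, r) / full_params(d, d)
    print(f"rank {r:>2}: {lora_params(d, d, r):>7} trainable params "
          f"({fraction:.2%} of the full matrix)")
```

Even at rank 64 the adapter trains only a few percent of what full fine-tuning would for this layer, which is the sense in which the update is confined to a low-dimensional subspace.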

The practical guideline: try LoRA first. If LoRA can’t reach your quality target, then full fine-tune with rehearsal. Full fine-tune without rehearsal is almost always wrong outside research contexts.

Rehearsal mixes a fraction (typically 5-20%) of the original pretraining distribution back into the fine-tuning data. The fine-tuning loss now has two components: the new task and the old distribution. The optimizer can’t make the new task loss go down by overwriting the old-distribution capability, because the old-distribution loss would go up.
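A minimal sketch of the mixing step is shown below. The function name `mix_with_rehearsal`, its parameters, and the placeholder example strings are all hypothetical; real pipelines often mix at the batch or token level rather than by example count.

```python
import random


def mix_with_rehearsal(task_examples, general_examples,
                       rehearsal_fraction=0.1, seed=0):
    """Return a shuffled dataset in which roughly `rehearsal_fraction`
    of the examples come from the general (old-distribution) pool."""
    if not 0.0 <= rehearsal_fraction < 1.0:
        raise ValueError("rehearsal_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_general / (n_task + n_general) = rehearsal_fraction for n_general.
    n_general = round(
        rehearsal_fraction * len(task_examples) / (1.0 - rehearsal_fraction)
    )
    n_general = min(n_general, len(general_examples))
    mixed = list(task_examples) + rng.sample(general_examples, n_general)
    rng.shuffle(mixed)
    return mixed


# Placeholder data: 90 task examples and a pool of 1000 general examples.
task = [f"task-{i}" for i in range(90)]
general = [f"general-{i}" for i in range(1000)]

mixed = mix_with_rehearsal(task, general, rehearsal_fraction=0.1)
n_rehearsal = sum(1 for example in mixed if example.startswith("general-"))
```

With these inputs the mixed set contains 100 examples, 10 of which are rehearsal data. Every task example is kept, so the rehearsal data is purely additive.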

The effect is that gradients pull weights toward solutions that perform well on both distributions. There may be no such solution that’s also optimal on the new task — the model has to find a compromise. The compromise is usually much better than the unmitigated forgetting that full fine-tuning produces, but it’s slower to converge on the new task and may plateau at a slightly lower task-only score.

The exact ratio matters less than people think. The literature finds 5-20% rehearsal data works for most cases; below 5% the regularization is too weak; above 20% you’re effectively just adding more pretraining and the new task converges slowly. The harder question is what to rehearse — the original pretraining data, a held-out general-instruction dataset, or a domain-specific anchor set. The right answer depends on which capabilities you most need to preserve.

The strategic framing: rehearsal isn’t a hack, it’s a regularizer. It adds a prior that “the model should still perform well on this distribution” and lets the optimizer balance that prior against the task loss. Like all regularizers, it’s a knob — and like all knobs, the right setting is empirical.

Go further

Why does fine-tuning forget?

Gradient descent on a narrow distribution moves weights toward solutions that minimize loss on that distribution — including by overwriting weights that encoded other capabilities. Without explicit pressure to preserve prior knowledge, the optimizer has no reason to leave it intact.

How does LoRA help?

LoRA freezes the base model and trains only a tiny low-rank adapter. The original weights — and everything they encode — are physically unchanged. The adapter can only steer behavior on top of, not overwrite, the base capability. This is the cleanest mitigation in practice.

What's rehearsal / replay?

Mixing a fraction (typically 5-20%) of the original pre-training distribution into the fine-tuning data. The optimizer is forced to keep performing on the original distribution, which preserves the underlying capabilities. Costs: dataset prep complexity, slightly slower task convergence.
