Bias-Variance Tradeoff

Also known as: bias variance decomposition, bias-variance

TL;DR

The bias-variance tradeoff is the classical decomposition of prediction error into three additive parts: squared bias, variance, and irreducible noise.

Bias measures systematic error, the kind that shows up as underfitting; variance measures sensitivity to the training set; noise is the part no model can predict. The textbook story says you trade bias against variance by tuning model capacity. For roughly thirty years this was the dominant mental model for picking model size. Modern deep learning has scrambled it.

The decomposition

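For squared-error loss, given a true target $y = f(x) + \varepsilon$ with noise $\varepsilon$ (mean zero, variance $\sigma^2$), and a model $\hat f_D$ trained on a random training set $D$, the expected test error at point $x$ decomposes as:

$$\mathbb{E}_{D,\varepsilon}\big[(y - \hat f_D(x))^2\big] = \big(\mathbb{E}_D[\hat f_D(x)] - f(x)\big)^2 + \mathbb{E}_D\big[\big(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)]\big)^2\big] + \sigma^2$$

The three terms on the right are squared bias, variance, and noise: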

Bias — how far the model’s average prediction (across training sets) is from the true target. A linear model predicting a quadratic function has high bias.
Variance — how much predictions jitter from one training set to another. A 9th-degree polynomial fit to 10 points has astronomical variance.
Noise — the part of the target that’s actually random given the inputs. No model beats this.
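
The two examples above can be checked numerically. The following is a minimal sketch, not a reference implementation: it assumes a quadratic target, Gaussian noise, 10 training points drawn uniformly from [-1, 1], and a single test point at 0.9 (all of these values are illustrative choices, not taken from the text). It refits a linear model and a 9th-degree polynomial on many independent training sets and estimates each term by Monte Carlo.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)

def true_f(x):
    # Assumed ground truth for illustration: a quadratic.
    return x ** 2

SIGMA = 0.1     # assumed noise standard deviation
N_TRAIN = 10    # training points per data set
N_SETS = 1000   # number of independent training sets
X0 = 0.9        # test point at which the error is decomposed

def predictions_at_x0(degree):
    """Fit a polynomial of the given degree to many training sets
    and return the prediction each fit makes at X0."""
    preds = np.empty(N_SETS)
    for i in range(N_SETS):
        x = rng.uniform(-1.0, 1.0, N_TRAIN)
        y = true_f(x) + rng.normal(0.0, SIGMA, N_TRAIN)
        coefs = P.polyfit(x, y, degree)
        preds[i] = P.polyval(X0, coefs)
    return preds

def decompose(degree):
    """Monte Carlo estimates of (bias^2, variance, noise) at X0."""
    preds = predictions_at_x0(degree)
    bias_sq = (preds.mean() - true_f(X0)) ** 2
    variance = preds.var()
    return bias_sq, variance, SIGMA ** 2

results = {degree: decompose(degree) for degree in (1, 9)}
for degree, (bias_sq, variance, noise) in results.items():
    print(f"degree {degree}: bias^2={bias_sq:.4g} "
          f"variance={variance:.4g} noise={noise:.4g}")
```

Under these assumptions the linear fit should show a large squared bias and a small variance, and the 9th-degree fit the reverse. The variance of the 9th-degree fit is so large that its estimated bias is itself noisy, so only the variance comparison should be read off for that model.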

The decomposition is mathematically true for squared error. The story built on top of it — that there’s a sweet spot in capacity where bias-squared and variance balance — is what turns out to be incomplete.
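
The identity follows from adding and subtracting the average prediction. A sketch of the derivation, using notation assumed here: $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, $\hat f_D$ for the model trained on training set $D$, $\bar f(x) = \mathbb{E}_D[\hat f_D(x)]$ for the average prediction, and test noise independent of $D$.

```latex
\begin{aligned}
\mathbb{E}_{D,\varepsilon}\big[(y - \hat f_D(x))^2\big]
  &= \mathbb{E}_{D,\varepsilon}\Big[\big(
       (f(x) - \bar f(x))
       + (\bar f(x) - \hat f_D(x))
       + \varepsilon \big)^2\Big] \\
  &= \underbrace{\big(\bar f(x) - f(x)\big)^2}_{\text{bias}^2}
   + \underbrace{\mathbb{E}_D\big[(\hat f_D(x) - \bar f(x))^2\big]}_{\text{variance}}
   + \underbrace{\sigma^2}_{\text{noise}}
\end{aligned}
```

The cross terms vanish because $\mathbb{E}[\varepsilon] = 0$ with $\varepsilon$ independent of $D$, and because $\mathbb{E}_D[\bar f(x) - \hat f_D(x)] = 0$ by the definition of $\bar f$.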

The classical picture

The textbook tradeoff plot has model capacity on the x-axis and test error on the y-axis, decomposed into:

Bias-squared decreasing with capacity (more flexible models can fit more).
Variance increasing with capacity (more flexible models fit noise too).
Total error U-shaped, with a sweet spot at moderate capacity.
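
The three curves can be traced numerically in a small setting where the classical assumptions hold. This is a hedged sketch: the target function, noise level, sample size, and the use of polynomial degree as the capacity knob are all assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(1)

def true_f(x):
    # Assumed ground truth for illustration.
    return np.sin(3.0 * x)

SIGMA = 0.3      # assumed noise standard deviation
N_TRAIN = 30     # training points per data set
N_SETS = 500     # independent training sets per degree
x_grid = np.linspace(-0.9, 0.9, 50)   # test inputs

def bias_variance(degree):
    """Average bias^2 and variance over x_grid for one polynomial degree."""
    preds = np.empty((N_SETS, x_grid.size))
    for i in range(N_SETS):
        x = rng.uniform(-1.0, 1.0, N_TRAIN)
        y = true_f(x) + rng.normal(0.0, SIGMA, N_TRAIN)
        coefs = P.polyfit(x, y, degree)
        preds[i] = P.polyval(x_grid, coefs)
    bias_sq = np.mean((preds.mean(axis=0) - true_f(x_grid)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

curve = {}
for degree in range(1, 11):
    bias_sq, variance = bias_variance(degree)
    total = bias_sq + variance + SIGMA ** 2
    curve[degree] = (bias_sq, variance, total)
    print(f"degree {degree:2d}: bias^2={bias_sq:.4f} "
          f"variance={variance:.4f} total={total:.4f}")

best_degree = min(curve, key=lambda d: curve[d][2])
print("degree with lowest estimated total error:", best_degree)
```

In this toy setting the lowest total error should land at an intermediate degree, which is the U-shape the plot describes.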

This picture motivates regularization, cross-validation, model selection by AIC/BIC, and early stopping. It’s the foundation of overfitting intuition.
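
As one illustration of how that motivation turns into practice, here is a minimal k-fold cross-validation sketch for choosing a polynomial degree. The data-generating function, noise level, fold count, and degree range are assumptions made for the example; `cv_mse` is a hypothetical helper name.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(2)

def true_f(x):
    # Assumed ground truth for illustration.
    return np.sin(3.0 * x)

N, SIGMA, K = 60, 0.2, 5   # sample size, noise std, number of folds (assumed)
x = rng.uniform(-1.0, 1.0, N)
y = true_f(x) + rng.normal(0.0, SIGMA, N)

# Shuffle the indices once and split them into K held-out folds.
folds = np.array_split(rng.permutation(N), K)

def cv_mse(degree):
    """Mean held-out squared error across the K folds."""
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(N), held_out)
        coefs = P.polyfit(x[train], y[train], degree)
        pred = P.polyval(x[held_out], coefs)
        errors.append(np.mean((pred - y[held_out]) ** 2))
    return float(np.mean(errors))

cv_errors = {degree: cv_mse(degree) for degree in range(1, 11)}
best_degree = min(cv_errors, key=cv_errors.get)
print("cross-validated degree:", best_degree)
```

Cross-validation estimates total test error directly, so it picks a capacity without ever separating bias from variance.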

Why deep learning broke the picture

Empirically, modern deep networks live in a regime the textbook didn’t anticipate: massively overparameterized models, trained on huge data sets until they interpolate the training data, that still generalize well. The U-shape becomes more like a double dip: error goes up past the classical sweet spot, then comes back down again past the interpolation threshold. This is double descent.
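
The shape can be reproduced in a toy model that is much simpler than a deep network. The sketch below is an assumption-laden stand-in: it uses random ReLU features with a minimum-norm least-squares fit (via `np.linalg.pinv`), a linear target, and 40 training points, so the interpolation threshold sits at 40 features. None of these choices come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

D_IN = 5        # input dimension (assumed)
N_TRAIN = 40    # training points; interpolation threshold is at width 40
N_TEST = 500
SIGMA = 0.5     # assumed noise standard deviation
N_TRIALS = 20   # independent repetitions per width
WIDTHS = [5, 10, 20, 30, 40, 50, 80, 160, 640]

# Assumed ground truth: a linear function with unit-norm weights.
w_true = rng.normal(size=D_IN)
w_true /= np.linalg.norm(w_true)

def sample(n):
    x = rng.normal(size=(n, D_IN))
    y = x @ w_true + rng.normal(0.0, SIGMA, n)
    return x, y

def relu_features(x, proj):
    # Fixed random projection followed by ReLU; only the readout is trained.
    return np.maximum(x @ proj, 0.0)

def heldout_mse(width):
    """Fit a minimum-norm least-squares readout and return held-out MSE."""
    x_tr, y_tr = sample(N_TRAIN)
    x_te, y_te = sample(N_TEST)
    proj = rng.normal(size=(D_IN, width)) / np.sqrt(D_IN)
    beta = np.linalg.pinv(relu_features(x_tr, proj)) @ y_tr
    return np.mean((relu_features(x_te, proj) @ beta - y_te) ** 2)

mean_error = {
    width: float(np.mean([heldout_mse(width) for _ in range(N_TRIALS)]))
    for width in WIDTHS
}
for width, err in mean_error.items():
    print(f"width {width:4d}: mean held-out MSE = {err:.3f}")
```

In this setup the held-out error should spike near width 40, where the feature matrix is square and close to singular, and fall again as the width grows well past the number of training points.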

The decomposition is still true; the assumption that variance climbs monotonically with capacity is what fails. In practice, very large models trained with SGD have an implicit bias toward low-norm or low-complexity solutions, which keeps the effective variance bounded even as parametric capacity grows.
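
One setting where this kind of implicit bias can be verified exactly is linear least squares with more parameters than samples: gradient descent started from zero converges to the minimum-norm solution among all those that fit the data. The sketch below uses full-batch gradient descent rather than SGD and a random linear problem, both simplifications assumed for illustration; it does not show the same thing for deep networks.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_params = 20, 100   # overparameterized: more unknowns than equations
A = rng.normal(size=(n_samples, n_params))
y = rng.normal(size=n_samples)

# Minimum-norm interpolating solution, computed directly.
w_min_norm = np.linalg.pinv(A) @ y

# Full-batch gradient descent on 0.5 * ||A w - y||^2, started at zero.
step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / (largest singular value)^2
w = np.zeros(n_params)
for _ in range(2000):
    w -= step * A.T @ (A @ w - y)

# Another interpolating solution: add a component from the null space of A.
z = rng.normal(size=n_params)
null_part = z - np.linalg.pinv(A) @ (A @ z)
w_other = w_min_norm + null_part

print("GD matches min-norm solution:", np.allclose(w, w_min_norm, atol=1e-8))
print("other solution also fits:    ", np.allclose(A @ w_other, y))
print("norms (GD, other):", np.linalg.norm(w), np.linalg.norm(w_other))
```

Infinitely many parameter vectors fit the training data here, and the optimizer, not the loss, determines which one is returned.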

What modern empirical work changed

Variance is no longer monotonic in model size: it can decrease past interpolation
Generalization can improve with more parameters rather than worsen
Implicit regularization (from SGD, from architecture) does work the classical view didn’t account for
The “sweet spot” model size depends on data scale in a way the textbook didn’t capture

The decomposition itself is a mathematical identity for squared error, and it remains correct. What changed is the predictive picture of how each term scales with model capacity. The textbook assumed variance keeps growing as capacity increases; in deep networks, implicit regularization from the optimizer and architecture means variance can decrease past the interpolation threshold. So the equation is fine; the storyline built on top of it needs updating. Use the equation as bookkeeping; use the storyline only in the small-model regime where its assumptions hold.

The takeaway: error decomposes into bias, variance, and noise — that’s true. The textbook claim that capacity should be chosen to balance bias and variance is incomplete in the deep-learning era, where overparameterized models routinely violate the classical curve.

Go further

What does the decomposition actually look like?

For squared-error loss, expected test error decomposes as bias squared plus variance plus irreducible noise. Bias measures how far the average prediction is from the true target; variance measures how much predictions wiggle across different training sets; noise is the part of the target that no model can predict from the inputs.

Overfitting

Why does the textbook story break for deep networks?

The decomposition is mathematically true for squared error, but the picture — that there's a sweet spot where total error is minimized at moderate capacity — assumes monotonic variance growth past interpolation. Modern empirical work shows variance can decrease again past the interpolation threshold (double descent), so the U-shaped error curve becomes a more complicated shape.

Double descent

Is the tradeoff still useful for modern practice?

As intuition: yes — too-small models still underfit, too-flexible models on too-little data still overfit. As a quantitative tool for picking model size: less useful, because the right answer at scale is usually 'much bigger than the classical picture would suggest.' Use it as a mental model for small data, not as a rule for large-scale training.

Overfitting
