Why does averaging help at all — isn't a smarter single model better?
The bias-variance decomposition is the answer. If
Also known as: model ensembling, ensembles, model averaging
Combining the predictions of multiple models — bagging, boosting, stacking — to get a single output more accurate than any individual member.
Ensemble learning is the practice of combining predictions from multiple models to produce a single output that’s more accurate — or more robust — than any individual member.
The technique is older than deep learning; Breiman’s bagging paper is from 1996, AdaBoost from 1997. But ensembles never went away: every Kaggle winner uses them, every production reranker built on frontier-LLM supervision implicitly relies on them, and the entire knowledge-distillation pipeline starts with one.
The cleanest argument for ensembling falls out of the bias-variance decomposition . For
Bias of the average equals the average of the biases — ensembling doesn’t fix systematic error. But variance collapses with
Ensembling reduces variance, not bias. Three biased models that all fail the same way average to a model with the same bias and slightly less variance — a marginal win. Three models with uncorrelated failure modes average to something genuinely better.
Production AI systems rarely deploy ensembles directly — they’re too expensive at inference. The interesting modern application is ensembles of teachers in a distillation pipeline. Three frontier LLMs vote on labels; the consensus signal trains a small student; the student is what ships.
This is where the weighting scheme starts to matter, and where uniform averaging stops being the right answer.
ZeroEntropy’s distillation pipeline weights each teacher’s vote by the uncertainty of its judgment, not by a fixed coefficient. Concretely: for each pairwise judgment, fit a Beta posterior over the teacher’s preference probability and use the posterior variance as a confidence signal. A teacher that returns a sharp
The pipeline is also iterative. After fitting an initial student, the student’s residuals re-weight the next round of ensemble queries — the teachers spend their budget on examples the student is still wrong about. Two consequences:
A teacher LLM doesn’t return a single vote — it returns a probability distribution over {A wins, B wins, tie}. Treating that as a point estimate throws away the model’s calibrated uncertainty. A Beta-Binomial conjugate model is the right shape: prior
The posterior mean is the consensus probability; the posterior variance
Uniform averaging is what falls out as a limiting case when all teachers happen to be equally confident on every example, which essentially never happens in practice.
Two classic failure modes worth knowing:
The ensemble is the easy part. The weighting is where the work is.
The bias-variance decomposition is the answer. If
Bagging trains
Cost. An ensemble of three frontier LLMs costs roughly 3x the latency and 3x the price of one. The modern resolution is distill the ensemble into a small student — let three frontier LLMs vote pairwise, fit a Thurstone or Bradley-Terry model to extract a calibrated continuous target, and train a 4B-parameter student to regress against it. The student inherits the ensemble's accuracy at single-model cost.