Ensemble Learning

Q: Why does averaging help at all — isn't a smarter single model better?

The bias-variance decomposition is the answer. If FORMULA models each have variance FORMULA and pairwise correlation FORMULA, the variance of their average is FORMULA. As long as the models aren't perfectly correlated, ensembling reduces variance toward FORMULA without touching bias. The win is largest when members make different mistakes — same accuracy, different failure modes.

Q: Bagging vs boosting vs stacking — what's the actual difference?

Bagging trains FORMULA independent models on bootstrap samples and averages — variance reduction, embarrassingly parallel (random forests are the canonical case). Boosting trains models sequentially, each fitting the residuals of the previous ensemble — bias reduction, can't parallelize across rounds (XGBoost, LightGBM). Stacking trains a meta-model on top of the base members' predictions — learns the weighting scheme rather than fixing it to uniform.

Also known as: model ensembling, ensembles, model averaging

TL;DR

Combining the predictions of multiple models — bagging, boosting, stacking — to get a single output more accurate than any individual member.

Ensemble learning is the practice of combining predictions from multiple models to produce a single output that’s more accurate — or more robust — than any individual member.

The technique is older than deep learning; Breiman’s bagging paper is from 1996, AdaBoost from 1997. But ensembles never went away: every Kaggle winner uses them, every production reranker built on frontier-LLM supervision implicitly relies on them, and the entire knowledge-distillation pipeline starts with one.

The bias-variance justification

The cleanest argument for ensembling falls out of the bias-variance decomposition . For models with individual variance and pairwise error correlation :

Bias of the average equals the average of the biases — ensembling doesn’t fix systematic error. But variance collapses with as long as members aren’t perfectly correlated. The whole game is decorrelating the members: different bootstrap samples, different feature subsets, different architectures, different random seeds, different teachers.

Ensembling reduces variance, not bias. Three biased models that all fail the same way average to a model with the same bias and slightly less variance — a marginal win. Three models with uncorrelated failure modes average to something genuinely better.

The three flavors

Bagging, boosting, stacking

Bagging (bootstrap aggregating) — train independent models on bootstrap samples of the data, average the predictions. Variance reduction, embarrassingly parallel. Random forests are the canonical case: decision trees, each on a different sample of rows and columns.
Boosting — train models sequentially, each one fitting the errors of the cumulative ensemble so far. Bias reduction, inherently sequential. AdaBoost, gradient boosting, XGBoost, LightGBM — the family that wins tabular benchmarks.
Stacking — train base members on the data, then train a meta-model on top of their predictions to learn how to combine them. The weighting scheme is data-dependent rather than fixed at uniform. Can be more powerful than simple averaging when members have different per-region accuracy.

The modern usage: ensembles as teachers

Production AI systems rarely deploy ensembles directly — they’re too expensive at inference. The interesting modern application is ensembles of teachers in a distillation pipeline. Three frontier LLMs vote on labels; the consensus signal trains a small student; the student is what ships.

This is where the weighting scheme starts to matter, and where uniform averaging stops being the right answer.

Uncertainty-weighted ensembles at ZeroEntropy

ZeroEntropy’s distillation pipeline weights each teacher’s vote by the uncertainty of its judgment, not by a fixed coefficient. Concretely: for each pairwise judgment, fit a Beta posterior over the teacher’s preference probability and use the posterior variance as a confidence signal. A teacher that returns a sharp for “doc A beats doc B” carries more weight in that example’s loss than a teacher returning a flat .

The pipeline is also iterative. After fitting an initial student, the student’s residuals re-weight the next round of ensemble queries — the teachers spend their budget on examples the student is still wrong about. Two consequences:

The ensemble allocation is information-theoretic, not uniform. Easy examples get cheap supervision; hard examples get re-queried with higher teacher weight.
The student’s training distribution shifts each round toward the residual frontier. Convergence is much faster than a static ensemble label set.

A teacher LLM doesn’t return a single vote — it returns a probability distribution over {A wins, B wins, tie}. Treating that as a point estimate throws away the model’s calibrated uncertainty. A Beta-Binomial conjugate model is the right shape: prior , posterior after observing A-wins and B-wins from repeated queries.

The posterior mean is the consensus probability; the posterior variance is how confidently we know that probability. Variance-weighted averaging across teachers is equivalent to maximum-likelihood estimation under independent Gaussian noise — the standard right answer when you have heteroscedastic estimates of a shared latent.

Uniform averaging is what falls out as a limiting case when all teachers happen to be equally confident on every example, which essentially never happens in practice.

When ensembles fail to help

Two classic failure modes worth knowing:

Highly correlated members. If all models share the same biased training data or the same architecture, and the variance term vanishes — you’ve paid for no gain. Diversity is non-negotiable.
Calibration mismatch. Averaging logits, probabilities, and ranks gives different answers when members are differently calibrated. Pre-calibrating each member (Platt scaling, isotonic regression, or a learned calibration head) before averaging is often the difference between an ensemble that helps and one that’s just slower.

The ensemble is the easy part. The weighting is where the work is.

Go further

Why does averaging help at all — isn't a smarter single model better?

The bias-variance decomposition is the answer. If models each have variance and pairwise correlation , the variance of their average is . As long as the models aren't perfectly correlated, ensembling reduces variance toward without touching bias. The win is largest when members make different mistakes — same accuracy, different failure modes.

Bias-variance tradeoff Knowledge distillation

Bagging vs boosting vs stacking — what's the actual difference?

Bagging trains independent models on bootstrap samples and averages — variance reduction, embarrassingly parallel (random forests are the canonical case). Boosting trains models sequentially, each fitting the residuals of the previous ensemble — bias reduction, can't parallelize across rounds (XGBoost, LightGBM). Stacking trains a meta-model on top of the base members' predictions — learns the weighting scheme rather than fixing it to uniform.

Weak supervision Score calibration

If ensembles work so well, why does anyone deploy single models?

Cost. An ensemble of three frontier LLMs costs roughly 3x the latency and 3x the price of one. The modern resolution is distill the ensemble into a small student — let three frontier LLMs vote pairwise, fit a Thurstone or Bradley-Terry model to extract a calibrated continuous target, and train a 4B-parameter student to regress against it. The student inherits the ensemble's accuracy at single-model cost.

Knowledge distillation Pairwise reranker

← All concepts

The best AI teams build with ZeroEntropy models

Book Demo View docs