Isotonic Regression

Also known as: pool-adjacent-violators, PAV calibration, monotone calibration

TL;DR

Isotonic regression fits a non-parametric monotone function from raw scores to calibrated probabilities. It is more flexible than Platt scaling and handles any monotone miscalibration shape, at the cost of needing more labels and being prone to overfitting at the score-distribution tails.

Isotonic regression is the non-parametric counterpart to Platt scaling for score calibration. Instead of assuming a sigmoid shape, it fits the least-squares monotone function from raw scores to empirical probabilities: a piecewise-constant step function with as many breakpoints as the data supports.

The fit is via the pool-adjacent-violators (PAV) algorithm. The result is the unique non-decreasing function that minimizes mean squared error against the labels:

f* = argmin Σᵢ (f(sᵢ) − yᵢ)²
s.t. sᵢ ≤ sⱼ ⇒ f(sᵢ) ≤ f(sⱼ)

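Sort by raw score and walk forward. Whenever an interval’s mean falls below the mean of the interval before it, merge the two into one interval with the pooled mean, then re-check the merged interval against its own predecessor. The final interval means form the calibration curve.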
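The walk described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `pav_fit` is hypothetical, labels are assumed to be 0/1, and scores are assumed to be distinct (tied scores are not pooled specially here).

```python
def pav_fit(scores, labels):
    """Return (sorted_scores, fitted_values) from one pool-adjacent-violators pass.

    Assumes distinct scores; `pav_fit` is an illustrative name, not a library API.
    """
    pairs = sorted(zip(scores, labels))
    # Each block is [sum of labels, count]; its mean is sum / count.
    blocks = []
    for _, y in pairs:
        blocks.append([float(y), 1])
        # Pool backwards while the newest block's mean is below its predecessor's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    # Expand block means back out to one fitted value per sorted point.
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [s for s, _ in pairs], fitted


sorted_scores, fitted = pav_fit([0.3, 0.1, 0.5, 0.2, 0.4], [0, 0, 1, 1, 1])
print(fitted)  # [0.0, 0.5, 0.5, 1.0, 1.0]
```

In the example, the labels in score order are 0, 1, 0, 1, 1. The 1 followed by 0 violates monotonicity, so those two points are pooled to a mean of 0.5. In practice a library implementation such as scikit-learn's `IsotonicRegression` would usually be the better choice.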

Why you’d choose it over Platt

Platt is constrained to a two-parameter sigmoid shape — fast, data-efficient, but blind to miscalibration patterns the sigmoid family can’t represent. Isotonic regression has no such constraint: any monotone function is a candidate, and the algorithm picks the one that best matches the labels.

Miscalibration patterns isotonic handles that Platt can't

A two-plateau curve where scores 0.3-0.6 all correspond to roughly 50% relevance, then 0.6-0.9 cluster around 80%.
An asymmetric S where the lower tail is sharp and the upper tail is flat — common with rerankers trained on imbalanced relevance distributions.
A staircase where the model’s score quantiles map to discrete relevance bands rather than a smooth probability.

Isotonic regression buys you flexibility at the cost of data hunger. A two-parameter sigmoid fits on a few hundred labels; a non-parametric monotone function wants thousands to be stable, especially at the tails of the score distribution.

The tail problem

The principal failure mode of isotonic regression is overfitting at the top and bottom of the score distribution. Most of your labels live in the middle of the score range; the top few percent (the highest-scored documents) and the bottom few percent (the lowest-scored documents) are sparse. PAV merges those sparse bins into whatever step its few data points dictate — and that step is high-variance across resamples.

Mitigations:

Tail clipping. Cap the lowest and highest bins to a fixed probability and isotonic-fit only the interior.
Hybrid Platt-isotonic. Isotonic in the middle, Platt’s sigmoid extrapolation in the tails.
Label smoothing. Add a Beta prior to each bin’s empirical rate so sparse bins regress toward 0.5.
More data. The cleanest fix when you can get it.
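As an illustration of the label-smoothing mitigation only, the sketch below computes a bin's rate as the posterior mean under a Beta prior. The function name `smooth_rate` and the parameters `prior_mean` and `prior_strength` are assumptions made for this example; the defaults (a prior worth two pseudo-labels centered at 0.5) are arbitrary and would need tuning.

```python
def smooth_rate(positives, count, prior_mean=0.5, prior_strength=2.0):
    """Posterior-mean rate for one bin under a Beta prior.

    prior_strength is the number of pseudo-labels the prior is worth
    (a hypothetical knob for this sketch, not a standard parameter name).
    """
    alpha = prior_mean * prior_strength
    beta = (1.0 - prior_mean) * prior_strength
    return (positives + alpha) / (count + alpha + beta)


# A sparse bin is pulled strongly toward 0.5; a dense bin barely moves.
print(smooth_rate(1, 1))     # 2/3, down from a raw rate of 1.0
print(smooth_rate(90, 100))  # 91/102, close to the raw rate of 0.9
```

One thing to watch: smoothing bins independently can break monotonicity when a sparse bin sits next to a dense one. In the example, raw rates of 0.9 then 1.0 become roughly 0.89 then 0.67, so a second monotone pass over the smoothed values may be needed.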

Choosing between Platt and isotonic

A rough decision rule:

Fewer than 1K labels — Platt. Isotonic doesn’t have enough data to outperform a two-parameter family.
1K-10K labels — fit both and pick by held-out cross-entropy on a third set.
More than 10K labels and obvious non-sigmoid shape — isotonic, with tail clipping.
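For the middle case (fit both, pick by held-out cross-entropy), a selection step might look like the sketch below. The names `cross_entropy` and `pick_calibrator` are hypothetical, and each calibrator is assumed to be a callable mapping one raw score to a probability. Probabilities are clipped away from 0 and 1 because an isotonic step can output exactly 0 or 1, which would make the log loss infinite; the `eps` value is an arbitrary choice.

```python
import math


def cross_entropy(probabilities, labels, eps=1e-6):
    """Mean binary cross-entropy, with probabilities clipped to [eps, 1 - eps]."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def pick_calibrator(candidates, held_out_scores, held_out_labels):
    """candidates: dict of name -> callable(score) -> probability.

    Returns the name with the lowest held-out loss, plus all losses.
    """
    losses = {
        name: cross_entropy([f(s) for s in held_out_scores], held_out_labels)
        for name, f in candidates.items()
    }
    return min(losses, key=losses.get), losses


# Toy stand-ins for a fitted Platt and isotonic calibrator.
candidates = {"identity": lambda s: s, "constant": lambda s: 0.5}
best, losses = pick_calibrator(candidates, [0.9, 0.1], [1, 0])
print(best)  # identity
```

The held-out scores and labels here should come from data that neither calibrator was fit on, matching the "third set" in the rule above.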

The other axis is how often you re-fit. Platt’s two parameters are nearly invariant to small label-set perturbations; isotonic’s step function moves around significantly each retraining. If you re-fit on a rolling window and want stable thresholds across windows, Platt has a steadiness advantage.

“Pool-adjacent-violators” describes the algorithm literally: walking through sorted points, whenever the current block’s mean is less than the previous block’s mean (a “violator” of monotonicity), pool the two blocks together by averaging and continue. The name dates to Brunk (1955) and shows up unchanged in modern scikit-learn docs.
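To make "pooling by averaging" concrete: the pooled value is the count-weighted mean of the two blocks, which equals the plain mean of all labels in the merged block. The symbols below ($n_A$, $n_B$ for block sizes, $\mu_A$, $\mu_B$ for block means) are notation introduced for this illustration.

```latex
\mu_{A \cup B} = \frac{n_A \, \mu_A + n_B \, \mu_B}{n_A + n_B}
\qquad \text{e.g. } n_A = 3,\ \mu_A = 0.8,\ n_B = 1,\ \mu_B = 0.4
\;\Rightarrow\; \mu_{A \cup B} = \frac{2.4 + 0.4}{4} = 0.7
```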

Once the data is sorted, the algorithm runs in O(n) time; including the sort it is O(n log n) end-to-end. Memory is O(n). For practical reranker calibration on a few thousand pairs, the fit takes milliseconds and the model is a list of (score-bin-boundary, calibrated-probability) tuples that you binary-search at inference.
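The inference-time lookup over those tuples can be sketched with the standard-library `bisect` module. This assumes each tuple's boundary is the lowest raw score covered by its step and that the list is sorted ascending; the name `make_lookup` and the example step values are hypothetical.

```python
from bisect import bisect_right


def make_lookup(steps):
    """steps: ascending list of (lower_boundary, calibrated_probability) tuples."""
    boundaries = [b for b, _ in steps]
    probabilities = [p for _, p in steps]

    def calibrate(score):
        # Index of the last step whose lower boundary is <= score.
        i = bisect_right(boundaries, score) - 1
        # Scores below the first boundary fall back to the first step.
        return probabilities[max(i, 0)]

    return calibrate


calibrate = make_lookup([(0.0, 0.1), (0.4, 0.5), (0.8, 0.9)])
print(calibrate(0.39), calibrate(0.4), calibrate(2.0))  # 0.1 0.5 0.9
```

Scores outside the fitted range are clamped to the nearest step here, which is one reasonable convention among several.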

Production caveats

Three things to know before shipping isotonic in a calibration pipeline:

Re-fit cadence. Score distributions drift; isotonic curves move with them. Re-fit on a rolling window, ideally daily for high-traffic systems.
Cross-domain transfer. An isotonic fit on domain A is almost certainly wrong for domain B — even more so than Platt. Fit per domain.
The cleaner alternative. Like Platt, isotonic regression is a post-hoc bandage on an uncalibrated training pipeline. Models trained with calibrated targets — Thurstone fits, zELO-style continuous-relevance targets — sidestep the whole issue.

Go further

How does the pool-adjacent-violators algorithm actually work?

Sort the data by raw score. Initialize each point as its own group with its own mean. Walk left-to-right; whenever a group has a smaller mean than the group before it (a monotonicity violation), merge the two and recompute the mean over all their points. Repeat until no violations remain. The merged-group means become the fitted calibration curve: a piecewise-constant monotone step function.

When does isotonic clearly beat Platt scaling?

When the empirical calibration curve has structure a sigmoid cannot bend around — kinks, plateaus, or a non-symmetric S-shape. Modern neural-network reranker scores often show this because the model's score distribution is heavy-tailed and the relevance rate at the top of the distribution doesn't follow a smooth sigmoid.

What's the tail-overfitting problem?

The calibration set has very few labels at the high end of the score distribution (top-scored docs are rare), so the fitted curve there is determined by maybe a dozen points. The fit becomes a step function that snaps to whatever the empirical rate happens to be in those few bins, and that rate is very noisy across draws. Mitigations include adding label smoothing, capping the fit's extreme bins, or falling back to Platt at the tails and isotonic in the middle.
