Platt Scaling

Also known as: Platt's method, sigmoid calibration, logistic calibration

TL;DR

Platt scaling fits a logistic sigmoid on top of a model's raw scores to produce calibrated probabilities. Cheap, two parameters, the standard first-resort calibration method for SVMs, other classifiers, and uncalibrated rerankers.

Platt scaling is the cheapest way to retrofit calibration onto an uncalibrated model. Take the model’s raw score for an input, push it through a fitted sigmoid, and use the output as a probability estimate:

p_calibrated = σ(a · s + b)

Here s is the raw score and σ(z) = 1 / (1 + e^(−z)) is the logistic sigmoid. The two parameters, a (slope) and b (intercept), are fit by maximum likelihood on a held-out labeled set, which is the same as minimizing binary cross-entropy between the calibrated probability and the labels. Originally proposed by Platt (1999) for SVMs, the method now serves as the default first-pass calibration for any model whose raw scores don’t behave like probabilities.
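To make the maximum-likelihood fit concrete, here is a minimal pure-Python sketch that finds a and b by Newton's method on the mean binary cross-entropy. The names `sigmoid` and `fit_platt`, the iteration count, and the tiny ridge term are assumptions made for illustration. In practice a library logistic regression does the same job.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_platt(scores, labels, n_iter=50, ridge=1e-9):
    """Fit (a, b) in sigmoid(a * s + b) by Newton's method.

    Hypothetical helper for illustration. Assumes both classes are present
    and overlap in score; with perfectly separated classes the optimum lies
    at infinity and this loop will not converge.
    """
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        # Gradient (ga, gb) and Hessian (haa, hab, hbb) of the mean cross-entropy.
        ga = gb = haa = hab = hbb = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(a * s + b)
            w = p * (1.0 - p)
            ga += (p - y) * s
            gb += p - y
            haa += w * s * s
            hab += w * s
            hbb += w
        ga, gb = ga / n, gb / n
        # The small ridge on the diagonal only guards against a singular Hessian.
        haa, hab, hbb = haa / n + ridge, hab / n, hbb / n + ridge
        # Solve the 2x2 system H * step = gradient by Cramer's rule.
        det = haa * hbb - hab * hab
        da = (hbb * ga - hab * gb) / det
        db = (haa * gb - hab * ga) / det
        a, b = a - da, b - db
        if abs(da) + abs(db) < 1e-10:
            break
    return a, b
```

At the optimum the gradient is zero, which means the mean calibrated probability on the fitting set equals the observed positive rate.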

Why this maps onto reranker calibration

A reranker that’s rank-correct but uncalibrated emits a score on some arbitrary scale — maybe 0-1, maybe a logit, maybe a centered z-score. The ordering is right but a score of 0.7 doesn’t reliably mean “70% probability of relevance.” Platt scaling fits a sigmoid that maps the raw scale onto empirical relevance rates from a labeled validation set, so the post-Platt score satisfies the probability calibration property.

Platt-scaling workflow for a reranker

Score every query-document pair in a held-out labeled set with the raw reranker.
Build the dataset of (raw_score, label) pairs, where label is 1 for relevant and 0 for not relevant.
Fit a one-variable logistic regression on (raw_score, label) — two parameters, takes milliseconds.
At inference, push every raw score through σ(a · s + b) before returning.
Validate by binning the post-calibration scores and checking the empirical relevance rate per bin tracks the diagonal.
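The validation step that closes the workflow above can be sketched as a small reliability table. This is an illustrative pure-Python version: the function names, the equal-width bins, and the summary metric (commonly called expected calibration error) are assumptions, not something the workflow prescribes.

```python
def reliability_table(probs, labels, n_bins=10):
    """Return one (mean_predicted, empirical_rate, count) row per non-empty bin."""
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        # Equal-width bins on [0, 1]; p == 1.0 falls into the last bin.
        i = min(int(p * n_bins), n_bins - 1)
        sums[i] += p
        hits[i] += y
        counts[i] += 1
    return [
        (sums[i] / counts[i], hits[i] / counts[i], counts[i])
        for i in range(n_bins)
        if counts[i] > 0
    ]


def expected_calibration_error(probs, labels, n_bins=10):
    """Count-weighted mean gap between predicted and empirical rate per bin."""
    rows = reliability_table(probs, labels, n_bins)
    total = sum(count for _, _, count in rows)
    return sum(count * abs(pred - rate) for pred, rate, count in rows) / total
```

A well-calibrated score has `mean_predicted` close to `empirical_rate` in every row. Rows with small counts are noisy, so read them with caution.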

Platt scaling is two parameters that turn a rank-correct score into a probability. That’s almost always enough to make a downstream threshold (“drop everything below 0.5”) work consistently across queries — which is the entire reason you wanted calibration.

What Platt assumes

The sigmoid form is exact when the raw scores under each class follow a Gaussian distribution with the same variance but different means: under that assumption, the sigmoid is the Bayes-optimal posterior. In practice the assumption is rarely exact (embedder scores aren’t Gaussian, reranker logits have heavy tails), but the two-parameter family is forgiving enough that even violated assumptions produce a usable calibration most of the time.
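The equal-variance Gaussian claim can be checked numerically. Writing out Bayes' rule with class means μ0 and μ1, shared standard deviation σ, and prior π for the positive class gives log-odds that are linear in s, with a = (μ1 − μ0) / σ² and b = (μ0² − μ1²) / (2σ²) + log(π / (1 − π)). The sketch below compares that closed form against a direct Bayes computation. The function names and the example numbers are illustrative assumptions.

```python
import math


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def bayes_posterior(s, mu_neg, mu_pos, std, prior_pos):
    """P(relevant | s) by Bayes' rule with equal-variance Gaussian classes."""
    pos = prior_pos * gaussian_pdf(s, mu_pos, std)
    neg = (1.0 - prior_pos) * gaussian_pdf(s, mu_neg, std)
    return pos / (pos + neg)


def implied_platt_params(mu_neg, mu_pos, std, prior_pos):
    """Closed-form (a, b) such that sigmoid(a * s + b) equals the posterior."""
    a = (mu_pos - mu_neg) / std**2
    b = (mu_neg**2 - mu_pos**2) / (2.0 * std**2) + math.log(
        prior_pos / (1.0 - prior_pos)
    )
    return a, b


# Illustrative numbers only: negatives centered at 0, positives at 2.
a, b = implied_platt_params(mu_neg=0.0, mu_pos=2.0, std=1.0, prior_pos=0.3)
max_gap = 0.0
for s in (-1.0, 0.5, 1.0, 3.0):
    via_sigmoid = 1.0 / (1.0 + math.exp(-(a * s + b)))
    via_bayes = bayes_posterior(s, 0.0, 2.0, 1.0, 0.3)
    max_gap = max(max_gap, abs(via_sigmoid - via_bayes))
print(max_gap)  # difference is at floating-point noise level
```

If the two classes had different variances, a quadratic term in s would survive in the log-odds and no two-parameter sigmoid would match exactly.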

Platt vs isotonic regression

The standard comparison:

Platt scaling: 2 parameters, needs ~hundreds of labels, low overfitting risk, smooth output, fast. Constrained to a sigmoid shape.
Isotonic regression: non-parametric, needs thousands of labels, higher overfitting risk near the score distribution tails, piecewise-constant output. Fits any monotone calibration curve.

If you have fewer than 1K labels, Platt. If you have more than 10K labels and a clearly non-sigmoid miscalibration shape, isotonic. In the 1K-10K range, fit both and pick by held-out cross-entropy.
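For the middle range, "pick by held-out cross-entropy" could look like the sketch below. It assumes each candidate is already fitted and exposed as a plain function from raw score to probability. The names `cross_entropy` and `pick_calibrator` are hypothetical. Clipping matters here because a piecewise-constant calibrator can output exactly 0 or 1, which would otherwise make the loss infinite.

```python
import math


def cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def pick_calibrator(calibrators, holdout_scores, holdout_labels):
    """Return (best_name, losses) for the lowest held-out cross-entropy.

    `calibrators` maps a name to an already-fitted function: score -> probability.
    The held-out set must be disjoint from the data used to fit the calibrators.
    """
    losses = {
        name: cross_entropy([f(s) for s in holdout_scores], holdout_labels)
        for name, f in calibrators.items()
    }
    best = min(losses, key=losses.get)
    return best, losses
```

The held-out set used for this comparison should be separate from the set the calibrators were fitted on, otherwise the more flexible method wins by memorizing.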

A one-variable logistic regression is the entire fit:

from sklearn.linear_model import LogisticRegression
import numpy as np

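# raw_scores: shape (N,), the reranker's uncalibrated scores
# labels: shape (N,), 0 / 1 ground-truth relevance
X = np.asarray(raw_scores, dtype=float).reshape(-1, 1)

# A large C effectively switches off sklearn's default L2 penalty, so the fit
# is the plain maximum-likelihood sigmoid rather than a shrunken one.
clf = LogisticRegression(C=1e6)
clf.fit(X, labels)

# At inference:
X_new = np.asarray(new_raw_scores, dtype=float).reshape(-1, 1)
calibrated = clf.predict_proba(X_new)[:, 1]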

That’s the entire method. clf.coef_[0, 0] is Platt’s a, and clf.intercept_[0] is b. The total fit takes milliseconds on a few thousand labels.

The thing to be careful about: never fit Platt on the same data you trained the reranker on. The reranker is overfit to its training set, so its training-set scores are over-separated and Platt will fit a too-aggressive sigmoid. Use a held-out validation set the reranker never saw.

Production caveats

Two failure modes worth knowing, plus one alternative:

Distribution shift. Platt parameters fit on yesterday’s data don’t extrapolate to today’s data if the score distribution moves. Re-fit on a rolling window — weekly is a reasonable default.
Domain transfer. Platt parameters from domain A don’t carry to domain B. Each new domain wants its own fit.
The cleaner alternative. Models trained with calibration built in (Thurstone-fit targets, zELO-style training) skip the post-hoc Platt step entirely and stay calibrated under distribution shift. Post-hoc Platt is a bandage; native calibration is the cure.
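For the distribution-shift item above, a scheduled refit can be complemented by a cheap check on recent labeled traffic. The sketch below flags a refit when the average calibrated probability drifts away from the observed relevance rate. The function names, the 0.05 tolerance, and the 200-label minimum are illustrative assumptions, not values from the source.

```python
def calibration_gap(probs, labels):
    """Mean predicted probability minus observed positive rate."""
    n = len(labels)
    return sum(probs) / n - sum(labels) / n


def needs_refit(recent_probs, recent_labels, tolerance=0.05, min_labels=200):
    """Flag a refit when recent labeled data shows a large average gap.

    With too few labels the gap estimate is too noisy to act on, so the
    check returns False until `min_labels` examples are available.
    """
    if len(recent_labels) < min_labels:
        return False
    return abs(calibration_gap(recent_probs, recent_labels)) > tolerance
```

This only detects a shift in the average. A per-bin check catches shifts that cancel out across the score range.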

Go further

How much labeled data does Platt scaling actually need?

A few hundred labeled examples is usually enough. Platt only has two parameters (slope and intercept of the sigmoid), so it's data-efficient. Below ~100 labels the fit becomes noisy; above ~10K labels you're past the point where Platt's two-parameter family can capture additional structure — switch to isotonic regression.


Why does Platt assume the sigmoid shape?

The sigmoid is the exact Bayes posterior when positives and negatives have shifted but equal-variance Gaussian score distributions. Platt (1999) motivated it empirically: the class-conditional densities of SVM outputs between the margins looked roughly exponential, and Bayes' rule on two exponentials also yields a sigmoid. Neither assumption is exactly true for most models, but the sigmoid is robust enough to work decently even when the underlying distributions are skewed, which is why it's the default.


When does Platt fail and you need something else?

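When the miscalibration is non-monotone (a score of 0.7 means more relevant than both 0.6 and 0.9; rare in practice, but it happens with overfit models), or when the empirical calibration curve has kinks that a sigmoid can't bend around. Isotonic regression handles the kinks at the cost of more data and more overfitting risk. It does not handle the non-monotone case: like Platt, it is constrained to a monotone map, so a non-monotone curve points to a ranking problem that a monotone calibrator cannot fix.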
