Thurstone Model

Also known as: Thurstone case V, law of comparative judgment

TL;DR

A statistical model from 1927 that converts pairwise comparisons into continuous quality scores. Foundational to chess Elo ratings, food preference studies, and modern reranker training via the zELO methodology.

[Figure: Pairwise → continuous, via the normal CDF. Panel 1, pairwise comparisons (who beat whom, how often): A>B 7:3, B>C 6:4, A>C 8:2, C>D 6:4, B>D 7:3, D>E 6:4, A>E 9:1. Panel 2, recovered scores from the Thurstone MLE (one continuous value per item): A 0.86, B 0.62, C 0.48, D 0.30, E 0.14. Panel 3, Thurstone CDF: P(A wins) plotted against Δ = score(A) − score(B).]

The Thurstone model (Thurstone, 1927) is the statistical bridge between pairwise comparisons and continuous scores. Given a graph of “A beat B in N comparisons, B beat C in M comparisons, A beat C…” it recovers a score for each item such that the probability A beats B can be predicted from the score difference $s_A - s_B$.

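The Case V variant (the most common) assumes the difference in scores is normally distributed: $P(A > B) = \Phi\left(\frac{s_A - s_B}{\sigma\sqrt{2}}\right)$, where $\Phi$ is the standard normal CDF and $\sigma$ is the standard deviation of each item's noise. Equivalently, using the error function: $P(A > B) = \tfrac{1}{2}\left[1 + \operatorname{erf}\left(\frac{s_A - s_B}{2\sigma}\right)\right]$.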
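As a quick numerical check of the Case V formula, the sketch below computes the win probability two ways, through the standard normal CDF and through the error function. It assumes the parametrization in which each item's noise has standard deviation σ, so the score difference has standard deviation σ√2. The function names are illustrative and do not come from any library.

```python
import math
from statistics import NormalDist


def win_prob_cdf(s_a: float, s_b: float, sigma: float = 1.0) -> float:
    """P(A beats B) via the standard normal CDF.

    Assumption: per-item noise has standard deviation sigma, so the
    difference of two noisy draws has standard deviation sigma * sqrt(2).
    """
    return NormalDist().cdf((s_a - s_b) / (sigma * math.sqrt(2.0)))


def win_prob_erf(s_a: float, s_b: float, sigma: float = 1.0) -> float:
    """Same probability written with the error function.

    Uses the identity Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
    """
    return 0.5 * (1.0 + math.erf((s_a - s_b) / (2.0 * sigma)))


# Equal scores give a coin flip; a positive gap favours A.
p_equal = win_prob_cdf(0.3, 0.3)
p_gap_cdf = win_prob_cdf(1.0, 0.2, sigma=0.5)
p_gap_erf = win_prob_erf(1.0, 0.2, sigma=0.5)
```

The two forms are interchangeable; which one to use mostly depends on whether the numerical library at hand exposes `erf` or a normal CDF.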

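The normal CDF isn’t a stylistic choice: it falls out of an additive-Gaussian-noise model. Each item carries a latent quality $X_i \sim \mathcal{N}(\mu_i, \sigma^2)$, whose mean $\mu_i$ is the score being estimated; a comparison samples both items independently and reports the winner. The difference is $X_A - X_B \sim \mathcal{N}(\mu_A - \mu_B, 2\sigma^2)$, and $P(A > B) = P(X_A - X_B > 0) = \Phi\left(\frac{\mu_A - \mu_B}{\sigma\sqrt{2}}\right)$. The shaded right tail of the difference distribution is the win-rate.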

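[Figure: Thurstone latent-variable interpretation, showing where the normal CDF comes from. Panel 1: latent quality $X_i \sim \mathcal{N}(\mu_i, \sigma^2)$ for items A and B (x-axis: latent score). Panel 2: difference $X_A - X_B \sim \mathcal{N}(\Delta\mu, 2\sigma^2)$ (x-axis: $X_A - X_B$), with the shaded area P(A wins) $= \Phi\left((\mu_A - \mu_B) / (\sigma\sqrt{2})\right)$.]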
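The latent-variable story can be checked by simulation. The sketch below draws independent Gaussian qualities for two items many times, counts how often A comes out ahead, and compares that rate with the closed form. The means, noise level, trial count and seed are arbitrary illustrative choices.

```python
import math
import random
from statistics import NormalDist


def simulated_win_rate(mu_a, mu_b, sigma, trials=200_000, seed=0):
    """Fraction of trials in which A's noisy draw exceeds B's."""
    rng = random.Random(seed)
    wins = sum(
        rng.gauss(mu_a, sigma) > rng.gauss(mu_b, sigma)
        for _ in range(trials)
    )
    return wins / trials


def closed_form_win_rate(mu_a, mu_b, sigma):
    """Phi((mu_a - mu_b) / (sigma * sqrt(2))): right tail of the difference."""
    return NormalDist().cdf((mu_a - mu_b) / (sigma * math.sqrt(2.0)))


empirical = simulated_win_rate(0.5, 0.0, 1.0)
predicted = closed_form_win_rate(0.5, 0.0, 1.0)
```

With this many trials the two numbers should agree to roughly two decimal places; the residual gap is ordinary sampling noise.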

The same idea is also: chess Elo

Chess Elo (Arpad Elo, 1960s), in the logistic form most rating systems use today, is a close variant of Thurstone: same shape, same substitution principle, different probability function (logistic vs Gaussian). Both convert pairwise win/loss outcomes into continuous skill ratings. They give nearly indistinguishable results in practice; the choice between them is more aesthetic than predictive.

When the zELO paper says “we use a Thurstone fit”, it means: from the graph of pairwise judgments (q, doc_A, doc_B), produce a single continuous relevance score per (q, d) pair, using the same statistical machinery as ranking chess players.

Where Thurstone-style fits already live
  • Chess and most competitive game ratings (Elo and Glicko are logistic relatives; TrueSkill keeps Gaussian performance noise, as Thurstone does).
  • Food preference and consumer-research panels — the original Thurstone use case.
  • Educational testing — Rasch model is a one-parameter logistic Thurstone.
  • LLM ranking arenas (LMSys Chatbot Arena) — Bradley-Terry over pairwise human votes.
  • Reranker training in zELO — Thurstone over LLM pairwise judgments.

Why this matters for retrieval

Pairwise comparisons are a much less noisy supervision signal than absolute scores. But rerankers need scores, not preferences. The Thurstone fit is the conversion: feed it a sparse graph of pairwise comparisons, get back a single continuous score per item, calibrated against every comparison observed.

For a query with 100 candidate documents, you don’t need all 4,950 pairwise comparisons. The Thurstone MLE works fine on a sparse graph: typically a k-regular graph with k=4 gives plenty of signal. That’s only 200 comparisons per query (about 4% of the dense matrix), which is why zELO scales economically across millions of (q, d) pairs.
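The comparison budget is simple arithmetic, sketched here for checking other values of N and k. A dense design needs N(N−1)/2 pairs, a k-regular design needs N·k/2, and the ratio simplifies to k/(N−1). The helper names are illustrative.

```python
def dense_pairs(n_items: int) -> int:
    """Number of unordered pairs among n_items items."""
    return n_items * (n_items - 1) // 2


def k_regular_edges(n_items: int, k: int) -> int:
    """Number of edges when every item is compared with exactly k others."""
    return n_items * k // 2


n_items, k = 100, 4
dense = dense_pairs(n_items)          # all possible pairs
sparse = k_regular_edges(n_items, k)  # comparisons actually requested
fraction = sparse / dense             # equals k / (n_items - 1)
```

Because the fraction is k/(N−1), the saving grows with the candidate pool: the same k=4 design covers a tenth as large a share of the pairs when N is ten times bigger.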

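Given a graph of pairwise outcomes, where for each edge $(i, j)$ you observed $n_{ij}$ comparisons of which $w_{ij}$ went to $i$, the MLE finds scores $s_1, \dots, s_N$ that maximize the joint likelihood:

$$\mathcal{L}(s) = \prod_{(i,j)} p_{ij}^{\,w_{ij}} \, (1 - p_{ij})^{\,n_{ij} - w_{ij}}, \qquad p_{ij} = \Phi\left(\frac{s_i - s_j}{\sigma\sqrt{2}}\right)$$

with $\sigma$ held fixed to set the scale. The negative log-likelihood is convex in $s$ up to a global shift, so a few iterations of Newton-Raphson or stochastic gradient descent converge to the unique optimum (modulo the shift, which is fixed by anchoring one item or constraining $\sum_i s_i = 0$).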
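A minimal sketch of such a fit, using plain full-batch gradient ascent on the log-likelihood rather than Newton-Raphson, and fixing the shift by re-centering the scores to sum to zero. It assumes a unit-scale link, P(i beats j) = Φ(s_i − s_j), which absorbs σ√2 into the score units. The win counts, learning rate and step count are illustrative choices, and `fit_thurstone` is a hypothetical helper, not zELO's implementation.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

# (i, j, w_ij, n_ij): item i won w_ij of the n_ij comparisons against item j.
# Items 0..4 stand for A..E; the counts are illustrative.
COMPARISONS = [
    (0, 1, 7, 10), (1, 2, 6, 10), (0, 2, 8, 10), (2, 3, 6, 10),
    (1, 3, 7, 10), (3, 4, 6, 10), (0, 4, 9, 10),
]


def fit_thurstone(n_items, comparisons, lr=0.01, steps=5000):
    """Gradient ascent on sum of w*log(p) + (n - w)*log(1 - p), p = Phi(s_i - s_j)."""
    s = [0.0] * n_items
    for _ in range(steps):
        grad = [0.0] * n_items
        for i, j, wins, total in comparisons:
            d = s[i] - s[j]
            # Clamp to keep the divisions below finite for extreme gaps.
            p = min(max(_STD_NORMAL.cdf(d), 1e-12), 1.0 - 1e-12)
            pdf = _STD_NORMAL.pdf(d)
            g = wins * pdf / p - (total - wins) * pdf / (1.0 - p)
            grad[i] += g
            grad[j] -= g
        s = [si + lr * gi for si, gi in zip(s, grad)]
        # Remove the global shift: keep the scores centred at zero.
        mean = sum(s) / n_items
        s = [si - mean for si in s]
    return s


scores = fit_thurstone(5, COMPARISONS)
p_a_beats_b = _STD_NORMAL.cdf(scores[0] - scores[1])
```

On data this small any reasonable optimizer works; a second-order method such as Newton-Raphson would mainly reduce the number of iterations.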

The connectedness condition matters: if the comparison graph splits into two disconnected components, the data cannot place them on a common scale. The components share no observations, so shifting every score in one component leaves the likelihood unchanged. A Hamiltonian-cycle-based sampling strategy guarantees connectedness with very few edges (a single cycle links all N items using only N comparisons), which is why zELO uses it.
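One way to sketch this sampling idea is to shuffle the items, join neighbours in the shuffled order into a cycle, repeat k/2 times, and take the union of the edges. Each cycle visits every item, so the union is connected by construction; the breadth-first search below confirms it. This illustrates the general technique under those assumptions and is not zELO's actual sampling code.

```python
import random
from collections import deque


def hamiltonian_cycle_edges(n_items, k, seed=0):
    """Union of k/2 random Hamiltonian cycles over items 0..n_items-1."""
    if k % 2 != 0:
        raise ValueError("k must be even: each cycle adds degree 2")
    rng = random.Random(seed)
    edges = set()
    for _ in range(k // 2):
        order = list(range(n_items))
        rng.shuffle(order)
        for pos in range(n_items):
            a, b = order[pos], order[(pos + 1) % n_items]
            # Store undirected edges once; repeats across cycles collapse,
            # so the total can fall slightly below n_items * k / 2.
            edges.add((min(a, b), max(a, b)))
    return edges


def is_connected(n_items, edges):
    """Breadth-first search from item 0 over an undirected edge set."""
    adjacency = {v: [] for v in range(n_items)}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    seen = {0}
    queue = deque([0])
    while queue:
        v = queue.popleft()
        for u in adjacency[v]:
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return len(seen) == n_items


edges = hamiltonian_cycle_edges(100, 4)
connected = is_connected(100, edges)
split = is_connected(4, {(0, 1), (2, 3)})  # two components that never meet
```

A spanning tree would connect the items with one edge fewer than a single cycle, but a cycle gives every item the same number of comparisons, which keeps the per-item evidence balanced.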

What the recovered score looks like

The output of a Thurstone fit is on the real line; the convention in zELO is to map it to [0, 1] so that 0 = least relevant and 1 = most relevant for the query. These [0, 1] scores become the regression targets when training the reranker (zerank-2). MSE loss against Thurstone-recovered targets gives a reranker calibrated against pairwise preferences without ever asking annotators for absolute scores.
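The text does not say which mapping zELO uses to reach [0, 1]. The sketch below assumes simple per-query min-max rescaling, purely for illustration, and pairs it with the mean-squared-error loss a reranker would be trained against. Both helper names and all numbers are hypothetical.

```python
def to_unit_interval(scores):
    """Min-max rescale real-valued scores to [0, 1] (assumed mapping)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All items tied: no ordering information, so use the midpoint.
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def mse(predictions, targets):
    """Mean squared error between model outputs and regression targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have equal length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Hypothetical Thurstone scores for four documents of one query.
targets = to_unit_interval([0.9, 0.3, -0.2, -1.0])
# Hypothetical reranker outputs for the same four documents.
loss = mse([0.8, 0.7, 0.5, 0.1], targets)
```

Other monotone maps, such as a sigmoid, would also land in [0, 1]; the choice affects how strongly the extremes of the list are separated in the loss.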

Go further

Thurstone (Gaussian) vs Bradley-Terry/Elo (logistic) — does the choice matter?

In practice, no. Both are link functions over the same difference-of-scores model, and they give nearly identical rankings on real data. Thurstone's Gaussian shape is more natural if you think of latent quality as additive noise; Elo's logistic is more natural if you think of preferences as Boltzmann-distributed.
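To see how close the two link functions are, the sketch below evaluates both on a grid of score differences and records the largest gap. The logistic curve is rescaled by 1.702, the constant commonly used in item response theory to match a logistic to the standard normal CDF; the grid range and spacing are arbitrary.

```python
import math
from statistics import NormalDist

LOGISTIC_SCALE = 1.702  # conventional normal-to-logistic matching constant


def gaussian_link(delta: float) -> float:
    """Thurstone-style link: standard normal CDF of the score difference."""
    return NormalDist().cdf(delta)


def logistic_link(delta: float) -> float:
    """Bradley-Terry / Elo-style link, rescaled to match the Gaussian."""
    return 1.0 / (1.0 + math.exp(-LOGISTIC_SCALE * delta))


grid = [i / 100.0 for i in range(-400, 401)]
max_gap = max(abs(gaussian_link(x) - logistic_link(x)) for x in grid)
```

After rescaling, the predicted win probabilities differ by less than one percentage point anywhere on the grid, which is small next to the sampling noise in a handful of comparisons per pair.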

How sparse can the comparison graph be before MLE breaks?

Surprisingly sparse. A k-regular graph with k=4 (each item compared to 4 others) is enough to recover stable scores at scale, provided the graph is connected. zELO uses the union of k/2 random Hamiltonian cycles: guaranteed connected, low diameter, and only a k/(N−1) fraction of the dense pairs (about 4% for N=100, about 0.4% for N=1,000).

What if the raters disagree?

Disagreement, even noisy disagreement, is fine as long as it's not adversarial. The MLE pools probabilities across many comparisons, so individual rater noise averages out. zELO additionally fits per-rater Beta calibration so unreliable raters get down-weighted.
