Why is Beta the right prior for a probability rather than, say, a truncated Gaussian?
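Because it's the conjugate prior to the Bernoulli: start with a Beta prior, observe Bernoulli outcomes, and the posterior is again a Beta whose parameters are the prior's plus the observed success and failure counts.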
Also known as: Beta(α, β), Beta prior
The Beta distribution
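The Beta distribution is a continuous distribution on the unit interval $[0, 1]$ with density
$$f(p;\alpha,\beta) = \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha,\beta)}, \qquad 0 \le p \le 1,$$
where $B(\alpha,\beta) = \Gamma(\alpha)\,\Gamma(\beta)/\Gamma(\alpha+\beta)$ is the normalizing constant and $\alpha, \beta > 0$ are the shape parameters.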
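The cleanest way to think about Beta is as a posterior. Start with a flat prior $\mathrm{Beta}(1, 1)$ on an unknown probability $p$, observe $k$ successes and $n - k$ failures, and the posterior is $\mathrm{Beta}(1 + k,\ 1 + n - k)$.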
That’s it. No integration, no sampling, no approximation: the conjugacy with the Bernoulli/Binomial gives you a closed-form update. So Beta is the distribution of a probability you’re estimating. Every other property, including the variance shrinking like $1/n$ as observations accumulate, follows from that reading.
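As a minimal sketch of the conjugate update described above: the posterior parameters are just the prior parameters plus the observed counts. The function names and the 7-of-10 example counts are illustrative assumptions, not part of any library or of the original pipeline.

```python
def beta_update(alpha: float, beta: float, successes: int, failures: int):
    """Conjugate Bernoulli/Binomial update: add counts to the shape parameters."""
    return alpha + successes, beta + failures


def beta_mean(alpha: float, beta: float) -> float:
    """Posterior mean of the unknown probability."""
    return alpha / (alpha + beta)


def beta_variance(alpha: float, beta: float) -> float:
    """Posterior variance; shrinks roughly like 1/n as counts grow."""
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1.0))


# Flat prior Beta(1, 1), then 7 successes and 3 failures (made-up data).
alpha, beta = beta_update(1.0, 1.0, successes=7, failures=3)
print(alpha, beta, beta_mean(alpha, beta), beta_variance(alpha, beta))
```

The update involves no integration or sampling; the mean and variance come straight from the two shape parameters.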
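In a pairwise preference setup, the data backbone of zELO and any Thurstone-style ranking pipeline, you ask a judge “is A more relevant than B?” and collect noisy votes. After $n$ votes, $k$ of them for A, a flat prior gives the posterior $\mathrm{Beta}(1 + k,\ 1 + n - k)$ over the probability that A beats B.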
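That posterior gives you two things at once: a point estimate of the win probability (the posterior mean) and a measure of how uncertain that estimate still is (the posterior standard deviation).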
The second is the under-appreciated part. Most pairwise pipelines collect a fixed number of comparisons per pair (“3 votes per (A, B)”). The Beta posterior lets you allocate adaptively: stop when the standard deviation drops below the precision you need, keep going on close pairs that are still ambiguous.
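The adaptive allocation described above can be sketched as a stopping rule on the posterior standard deviation. This is a simulation under stated assumptions: the judge is modeled as a Bernoulli draw with a hypothetical true win probability `true_p`, and the names `votes_until_confident`, `target_std`, and `max_votes` are illustrative, not taken from zELO.

```python
import math
import random


def beta_std(alpha: float, beta: float) -> float:
    """Standard deviation of Beta(alpha, beta)."""
    total = alpha + beta
    return math.sqrt(alpha * beta / (total ** 2 * (total + 1.0)))


def votes_until_confident(true_p: float, target_std: float,
                          max_votes: int = 500, seed: int = 0):
    """Collect simulated votes for one pair until the posterior is tight enough.

    Returns (votes_used, alpha, beta) for the final posterior.
    """
    rng = random.Random(seed)
    alpha, beta = 1.0, 1.0  # flat prior
    votes = 0
    while votes < max_votes and beta_std(alpha, beta) > target_std:
        if rng.random() < true_p:  # simulated judge prefers A
            alpha += 1.0
        else:                      # simulated judge prefers B
            beta += 1.0
        votes += 1
    return votes, alpha, beta


# A lopsided pair and a close pair, both stopped at the same precision.
for p in (0.95, 0.50):
    used, a, b = votes_until_confident(true_p=p, target_std=0.08)
    print(f"true_p={p}: stopped after {used} votes, posterior Beta({a}, {b})")
```

Because the Beta variance is largest near a win rate of 0.5, close pairs tend to need more votes than lopsided ones before they hit the same precision, which is the point of allocating adaptively rather than using a fixed count per pair.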
zELO
ZeroEntropy’s distillation pipelines use Beta posteriors to weight ensemble votes. The setup is iterative: a pool of judges (LLMs at different sizes, prompts, temperatures) each vote on the same pairs; you fit Thurstone on the votes; you then refit the judges’ reliabilities by comparing their votes to the consensus. Each judge’s reliability, the probability it agrees with the eventual ranking, is itself a Bernoulli parameter, so its posterior is Beta.
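One step of that loop, the reliability refit, can be sketched as follows. This is a hedged illustration, not ZeroEntropy's implementation: the consensus is taken as given (the Thurstone fit is not shown), and the judge names, pair labels, and the function `reliability_posteriors` are all hypothetical.

```python
def reliability_posteriors(votes_by_judge, consensus, prior=(1.0, 1.0)):
    """Return {judge: (alpha, beta)} from agreement counts with the consensus.

    votes_by_judge: {judge: {pair: winner}}
    consensus:      {pair: winner} implied by the current ranking
    """
    posteriors = {}
    for judge, votes in votes_by_judge.items():
        scored = [pair for pair in votes if pair in consensus]
        agree = sum(1 for pair in scored if votes[pair] == consensus[pair])
        disagree = len(scored) - agree
        # Each agreement is a Bernoulli success for the judge's reliability.
        posteriors[judge] = (prior[0] + agree, prior[1] + disagree)
    return posteriors


# Made-up votes on three pairs.
consensus = {("A", "B"): "A", ("B", "C"): "B", ("A", "C"): "A"}
votes_by_judge = {
    "judge_large": {("A", "B"): "A", ("B", "C"): "B", ("A", "C"): "A"},
    "judge_small": {("A", "B"): "B", ("B", "C"): "B", ("A", "C"): "C"},
}
for judge, (a, b) in reliability_posteriors(votes_by_judge, consensus).items():
    print(judge, (a, b), "mean reliability:", a / (a + b))
```

In an iterative setup the refitted reliabilities would feed back into the next ranking fit; that outer loop is omitted here.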
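Concretely, each judge $j$ carries a Beta posterior over its reliability, and its votes are weighted by the inverse of that posterior’s variance.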
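Inverse-variance weighting is the optimal linear combination for independent, unbiased estimators of a common parameter. It is the recipe behind meta-analysis, Kalman filtering, and many ensembling schemes. The proof is short: if you have independent unbiased estimates $x_i$ with variances $\sigma_i^2$, then among all weights with $\sum_i w_i = 1$ the variance $\sum_i w_i^2 \sigma_i^2$ of the combined estimate $\sum_i w_i x_i$ is minimized at $w_i \propto 1/\sigma_i^2$, which gives a combined variance of $1/\sum_i \sigma_i^{-2}$.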
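A small numeric sketch of inverse-variance weighting, assuming independent unbiased estimates of the same quantity. The function name and the two example estimates are illustrative assumptions.

```python
def inverse_variance_combine(estimates, variances):
    """Combine independent unbiased estimates with weights proportional to 1/variance.

    Returns (combined_estimate, combined_variance).
    """
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    combined = sum(w * x for w, x in zip(weights, estimates)) / total
    return combined, 1.0 / total


# Two made-up estimates of the same win rate: one tight, one noisy.
estimates = [0.70, 0.60]
variances = [0.01, 0.04]

combined, combined_var = inverse_variance_combine(estimates, variances)

# Variance of the plain average of independent estimates, for comparison.
equal_var = sum(variances) / len(variances) ** 2

print("combined estimate:", combined)
print("inverse-variance variance:", combined_var, "equal-weight variance:", equal_var)
```

The tighter estimate gets four times the weight of the noisy one, and the combined variance comes out below both the equal-weight variance and the variance of either input.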
The Bernoulli/Beta conjugate pair is one of a handful of distribution-prior pairings that give you a closed-form posterior:
The Dirichlet generalization is what you reach for when there are more than two outcomes per comparison (“which of these 5 is best?”). Same conjugacy, same variance-shrinks-with-$n$ behavior.
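For the multi-outcome case, a minimal sketch of the Dirichlet update: the concentration parameters absorb the per-option counts in the same way the Beta parameters absorb successes and failures. The function names and the vote counts are illustrative assumptions.

```python
def dirichlet_update(alphas, counts):
    """Conjugate Categorical/Multinomial update: add counts per outcome."""
    return [a + c for a, c in zip(alphas, counts)]


def dirichlet_mean(alphas):
    """Posterior mean probability of each outcome."""
    total = sum(alphas)
    return [a / total for a in alphas]


def dirichlet_variance(alphas):
    """Posterior variance of each outcome's probability."""
    total = sum(alphas)
    return [a * (total - a) / (total ** 2 * (total + 1.0)) for a in alphas]


# "Which of these 5 is best?" asked 10 times (made-up counts), flat prior.
prior = [1.0] * 5
posterior = dirichlet_update(prior, counts=[6, 2, 1, 1, 0])
print(posterior)
print(dirichlet_mean(posterior))
print(dirichlet_variance(posterior))
```

With two outcomes this reduces to the Beta case: `dirichlet_variance([a, b])` gives the Beta(a, b) variance for both components.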
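Track the variance of the Beta posterior over each pair’s win probability, and stop collecting votes for that pair once the posterior standard deviation falls below the precision you need.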
Elo and Thurstone fit point estimates of latent skill from pairwise wins. Beta gives you the full posterior over each pairwise win-rate before you ever fit Thurstone. In practice you can either (a) feed Beta-mean win-rates into Thurstone, or (b) weight each comparison by the inverse of its Beta posterior variance so confident pairs dominate the fit. ZeroEntropy's distillation pipelines do the latter.