Beta Distribution

Also known as: Beta(α, β), Beta prior

TL;DR

The Beta distribution is a continuous distribution on $[0, 1]$ and the conjugate prior to the Bernoulli/Binomial.

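The Beta distribution is a continuous distribution on the unit interval $[0, 1]$, parameterized by two positive shape parameters $\alpha$ and $\beta$:

$$f(x; \alpha, \beta) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)}$$

where $B(\alpha, \beta)$ is the Beta function (the normalizing constant). The mean is $\alpha / (\alpha + \beta)$ and the variance is $\alpha\beta / [(\alpha + \beta)^2 (\alpha + \beta + 1)]$. With $\alpha = \beta = 1$ it’s the uniform distribution; as $\alpha + \beta$ grows the mass concentrates around the mean.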
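The density, mean, and variance above can be evaluated with nothing but the standard library. The following is a minimal sketch; the function names `beta_pdf`, `beta_mean`, and `beta_var` are illustrative choices, not part of any particular library.

```python
import math


def beta_pdf(x: float, a: float, b: float) -> float:
    """Density of Beta(a, b) at x, computed in log space for stability."""
    if not 0.0 < x < 1.0:
        return 0.0  # simplification: the endpoints 0 and 1 are treated as zero density
    # log B(a, b) = lgamma(a) + lgamma(b) - lgamma(a + b)
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_norm)


def beta_mean(a: float, b: float) -> float:
    return a / (a + b)


def beta_var(a: float, b: float) -> float:
    n = a + b
    return a * b / (n * n * (n + 1))


# Beta(1, 1) is flat, so its density is the same everywhere on (0, 1).
print(beta_pdf(0.3, 1, 1), beta_pdf(0.9, 1, 1))
# A larger alpha + beta with the same mean gives a smaller variance.
print(beta_var(2, 2), beta_var(20, 20))
```

Working in log space via `math.lgamma` avoids overflow in the Beta function when the shape parameters get large.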

The “successes-plus-one, failures-plus-one” intuition

The cleanest way to think about Beta is as a posterior. Start with a flat prior $\mathrm{Beta}(1, 1)$. Observe $s$ successes and $f$ failures from a Bernoulli source. The posterior is exactly:

$$\mathrm{Beta}(1 + s, \; 1 + f)$$

That’s it. No integration, no sampling, no approximation: the conjugacy with the Bernoulli/Binomial gives you a closed-form update. So $\alpha$ acts like “successes plus one” and $\beta$ like “failures plus one”, and the distribution itself encodes both your point estimate ($\alpha / (\alpha + \beta)$, the posterior mean) and your uncertainty about it (the variance, which shrinks as you collect more data).
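The closed-form update is short enough to write out in full. This is a sketch under the assumption that counts arrive as plain integers; the class name `BetaPosterior` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaPosterior:
    alpha: float = 1.0  # defaults to the flat Beta(1, 1) prior
    beta: float = 1.0

    def update(self, successes: int, failures: int) -> "BetaPosterior":
        # Conjugate update: add the observed counts to the shape parameters.
        return BetaPosterior(self.alpha + successes, self.beta + failures)

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        n = self.alpha + self.beta
        return self.alpha * self.beta / (n * n * (n + 1))


prior = BetaPosterior(2, 2)
posterior = prior.update(successes=6, failures=1)
print(posterior)                 # BetaPosterior(alpha=8, beta=3)
print(round(posterior.mean, 3))  # 0.727
```

Because the update is pure addition, feeding observations one at a time or in a single batch yields the same posterior.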

Beta is the distribution of a probability you’re estimating. Every other property (the variance shrinking like $1/n$ in the number of observations $n$, the conjugacy, the closed-form posterior) falls out of that.

[Figure: Beta(α, β), the conjugate prior. Density plotted over x ∈ [0, 1]. Example conjugate update in closed form: Beta(2, 2) + 6 successes + 1 failure = Beta(8, 3), since α′ = α + s = 2 + 6 = 8 and β′ = β + f = 2 + 1 = 3; the mean shifts 0.500 → 0.727.]

Why it matters for pairwise comparison

In a pairwise-comparison setup, the data backbone of zELO and any Elo-style ranking pipeline, you ask a judge “is A more relevant than B?” and collect noisy votes. Starting from the flat prior, after $n$ comparisons with $s$ A-wins and $f$ B-wins, the posterior over the true win-rate $p$ is exactly $\mathrm{Beta}(1 + s, 1 + f)$.

That posterior gives you two things at once:

  • A point estimate: the posterior mean $(s + 1) / (n + 2)$, which is the well-known Laplace-smoothed win-rate.
  • A variance, which tells you when to stop collecting pairs.

The second is the under-appreciated part. Most pairwise pipelines collect a fixed number of comparisons per pair (“3 votes per (A, B)”). The Beta posterior lets you allocate adaptively: stop when the standard deviation drops below the precision you need, keep going on close pairs that are still ambiguous.
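The adaptive allocation described above can be sketched as a loop that keeps drawing votes for one pair until the posterior standard deviation drops below a target. The function names, the `target_sd` default of 0.1, and the `max_votes` cap are all assumptions made for illustration; the flat Beta(1, 1) prior is assumed as well.

```python
import math
from typing import Iterable, Tuple


def posterior_sd(wins_a: int, wins_b: int) -> float:
    """Standard deviation of Beta(1 + wins_a, 1 + wins_b) (flat prior assumed)."""
    a, b = 1 + wins_a, 1 + wins_b
    n = a + b
    return math.sqrt(a * b / (n * n * (n + 1)))


def collect_until_confident(
    votes: Iterable[bool], target_sd: float = 0.1, max_votes: int = 50
) -> Tuple[int, int]:
    """Consume votes (True = A wins) until the posterior is tight enough."""
    stream = iter(votes)
    wins_a = wins_b = 0
    while wins_a + wins_b < max_votes and posterior_sd(wins_a, wins_b) >= target_sd:
        vote = next(stream, None)
        if vote is None:  # the judge pool ran out of votes
            break
        if vote:
            wins_a += 1
        else:
            wins_b += 1
    return wins_a, wins_b


lopsided = collect_until_confident([True] * 50)       # A always wins
balanced = collect_until_confident([True, False] * 25)  # a genuinely close pair
print(sum(lopsided), sum(balanced))
```

With these inputs the lopsided pair stops after 7 votes, while the close pair needs roughly three times as many, which is the budget reallocation the fixed "3 votes per pair" scheme cannot do.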

Beta-weighted ensembling in zELO

ZeroEntropy’s distillation pipelines use Beta posteriors to weight ensemble votes. The setup is iterative: a pool of judges (LLMs at different sizes, prompts, temperatures) each vote on the same pairs; you fit Thurstone on the votes; you then refit the judges’ reliabilities by comparing their votes to the consensus. Each judge’s reliability — the probability it agrees with the eventual ranking — is itself a Bernoulli parameter, so its posterior is Beta.

Concretely, each judge $j$ has its own $\mathrm{Beta}(\alpha_j, \beta_j)$ tracking how often it agrees with the current consensus on resolved pairs. When fitting the next iteration of Thurstone, judge $j$’s vote is weighted by the inverse posterior variance of its Beta: confident-and-correct judges dominate, uncertain or wobbly judges are damped. Two consequences:

  1. The ensemble adapts to the actual reliability of each judge on this dataset, rather than relying on a global “GPT-4 is 0.85 reliable” assumption.
  2. New judges are introduced with a wide prior and earn weight as their agreement record builds — a cold-start that doesn’t require manual tuning.
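A rough sketch of the per-judge bookkeeping follows. It is not ZeroEntropy's implementation: the `JudgeReliability` class, the `normalized_weights` helper, the judge names, and the flat Beta(1, 1) cold-start prior are all hypothetical choices made to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class JudgeReliability:
    """Beta posterior over how often one judge agrees with the consensus."""
    alpha: float = 1.0  # 1 + agreements (flat prior assumed for new judges)
    beta: float = 1.0   # 1 + disagreements

    def record(self, agreed: bool) -> None:
        if agreed:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def variance(self) -> float:
        n = self.alpha + self.beta
        return self.alpha * self.beta / (n * n * (n + 1))


def normalized_weights(judges: Dict[str, JudgeReliability]) -> Dict[str, float]:
    # Inverse posterior variance, rescaled so the weights sum to 1.
    raw = {name: 1.0 / judge.variance for name, judge in judges.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}


judges = {
    "veteran": JudgeReliability(alpha=91, beta=11),      # 90 agreements, 10 misses
    "short_record": JudgeReliability(alpha=10, beta=2),  # 9 agreements, 1 miss
    "newcomer": JudgeReliability(),                      # no record yet
}
weights = normalized_weights(judges)
print(weights)
```

The veteran and the short-record judge have similar agreement rates, yet the veteran receives far more weight because its posterior is narrower. Note that this sketch takes the text literally: inverse variance alone rewards a long, consistent record rather than a high agreement rate, so a real pipeline presumably also uses the posterior mean in some way the source does not spell out.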

Inverse-variance weighting is the optimal linear combination for independent unbiased estimators of a common parameter; it’s the recipe behind meta-analysis, Kalman filtering, and many practical ensembling schemes. The proof is short: if you have independent unbiased estimators $\hat\theta_1, \dots, \hat\theta_k$ with variances $\sigma_1^2, \dots, \sigma_k^2$, the linear combination minimizing the variance of the combined estimate is $\hat\theta = \sum_i w_i \hat\theta_i$ with $w_i = \sigma_i^{-2} / \sum_j \sigma_j^{-2}$. With Beta posteriors you get those variances for free, so the weighting becomes a one-liner. The alternative, weighting by the posterior mean (reliability itself), over-trusts judges who happen to agree often by luck on a small sample. Variance-weighting penalizes their wide posteriors automatically.
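For completeness, the short proof can be written out with a Lagrange multiplier. This is the standard textbook argument, assuming the estimators are independent (or at least uncorrelated) and that the weights must sum to one so the combination stays unbiased; the symbols $\hat\theta_i$, $\sigma_i^2$, $w_i$, and $\lambda$ are notation chosen here.

```latex
\begin{aligned}
&\text{minimize } \operatorname{Var}\Big(\sum_i w_i \hat\theta_i\Big) = \sum_i w_i^2 \sigma_i^2
  \quad \text{subject to } \sum_i w_i = 1 \\[4pt]
&\mathcal{L}(w, \lambda) = \sum_i w_i^2 \sigma_i^2 - \lambda \Big(\sum_i w_i - 1\Big) \\[4pt]
&\frac{\partial \mathcal{L}}{\partial w_i} = 2 w_i \sigma_i^2 - \lambda = 0
  \;\Longrightarrow\; w_i = \frac{\lambda}{2 \sigma_i^2} \\[4pt]
&\sum_i w_i = 1 \;\Longrightarrow\; \frac{\lambda}{2} = \frac{1}{\sum_j \sigma_j^{-2}}
  \;\Longrightarrow\; w_i = \frac{\sigma_i^{-2}}{\sum_j \sigma_j^{-2}} \\[4pt]
&\operatorname{Var}(\hat\theta) = \sum_i w_i^2 \sigma_i^2 = \frac{1}{\sum_j \sigma_j^{-2}}
\end{aligned}
```

The last line shows that the combined variance is never larger than the smallest individual variance, so adding even a noisy independent estimator cannot hurt under these assumptions.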

Conjugacy: why this is more than a curiosity

The Bernoulli/Beta pairing is one of a handful of likelihood-prior pairings that give you a closed-form posterior:

Conjugate priors worth remembering
  • Bernoulli/Binomial — Beta prior, Beta posterior. The case above.
  • Poisson — Gamma prior, Gamma posterior. Used for rate estimation.
  • Normal with known variance — Normal prior on the mean, Normal posterior. The starting point for most Bayesian regression.
  • Multinomial — Dirichlet prior, Dirichlet posterior. The multi-class generalization of Beta.

The Dirichlet generalization is what you reach for when there are more than two outcomes per comparison (“which of these 5 is best?”). Same conjugacy, same variance-shrinks-with-$n$ behavior as the number of observations $n$ grows, same closed-form update. If you ever find yourself building a custom MCMC sampler for a Beta or Dirichlet posterior, stop: the answer is one line of arithmetic.
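The Dirichlet update has the same shape as the Beta one: add the observed counts to the concentration vector. The sketch below assumes a flat prior of all ones and made-up vote counts for a five-way comparison; the function names are illustrative.

```python
from typing import List, Sequence


def dirichlet_update(alpha: Sequence[float], counts: Sequence[int]) -> List[float]:
    """Conjugate update: posterior concentration = prior concentration + counts."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    return [a + c for a, c in zip(alpha, counts)]


def dirichlet_mean(alpha: Sequence[float]) -> List[float]:
    total = sum(alpha)
    return [a / total for a in alpha]


def dirichlet_variances(alpha: Sequence[float]) -> List[float]:
    """Marginal variance of each component."""
    total = sum(alpha)
    return [a * (total - a) / (total * total * (total + 1)) for a in alpha]


prior = [1.0] * 5                 # flat prior over five candidates (assumed)
votes = [4, 1, 0, 3, 2]           # hypothetical "which is best?" tallies
posterior = dirichlet_update(prior, votes)
print(posterior)                  # [5.0, 2.0, 1.0, 4.0, 3.0]
print(dirichlet_mean(posterior))
```

With two categories the marginal variance reduces to the Beta variance formula, which is a quick sanity check on the implementation.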

Go further

Why is Beta the right prior for a probability rather than, say, a truncated Gaussian?

Because it's the conjugate prior to the Bernoulli: start with $\mathrm{Beta}(\alpha, \beta)$, observe $s$ successes and $f$ failures, and the posterior is exactly $\mathrm{Beta}(\alpha + s, \beta + f)$. No integrals, no MCMC, no approximation. A truncated Gaussian gives you the right support but throws away the closed-form update that makes Beta useful in production.

How do I know when I've collected enough pairwise comparisons?

Track the variance of the Beta posterior over the win-rate $p$: $\mathrm{Var}[p] = \alpha\beta / [(\alpha + \beta)^2 (\alpha + \beta + 1)]$. It shrinks like $1/n$ in the count of votes $n$. Stop when the standard deviation falls below the resolution you need (commonly ~0.05); that's the principled answer to 'is one more LLM-judge call worth it?'
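To get a feel for the budget this implies, one can compute the worst case: for a fixed number of votes the Beta variance is largest when the votes split evenly. The sketch below assumes a flat Beta(1, 1) prior; `worst_case_sd` and `votes_needed` are hypothetical helper names.

```python
import math


def worst_case_sd(n: int) -> float:
    """Posterior sd after n votes split as evenly as possible (flat prior assumed)."""
    a = 1 + n // 2
    b = 1 + n - n // 2
    total = a + b
    return math.sqrt(a * b / (total * total * (total + 1)))


def votes_needed(target_sd: float, max_votes: int = 10_000) -> int:
    """Smallest n whose worst-case posterior sd is at or below target_sd."""
    for n in range(max_votes + 1):
        if worst_case_sd(n) <= target_sd:
            return n
    raise ValueError("target_sd not reachable within max_votes")


print(votes_needed(0.10), votes_needed(0.05))
```

Halving the target standard deviation roughly quadruples the worst-case vote count, and a resolution of 0.05 on a genuinely close pair costs on the order of a hundred votes. Lopsided pairs reach the same resolution much sooner, which is why stopping on the observed posterior beats a fixed budget.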

What's the connection to Elo and Thurstone?

Elo and Thurstone fit point estimates of latent skill from pairwise wins. Beta gives you the full posterior over each pairwise win-rate before you ever fit Thurstone. In practice you can either (a) feed Beta-mean win-rates into Thurstone, or (b) weight each comparison by the inverse of its posterior variance so confident pairs dominate the fit. ZeroEntropy's distillation pipelines do the latter.
