Bayes' Rule

Also known as: Bayes' theorem, Bayesian inference, posterior update

TL;DR

Bayes' rule is the math of updating beliefs given evidence. Posterior ∝ likelihood × prior.

Bayes’ rule is the algebra of belief updating. Given a parameter θ you want to learn and data D you have observed:

  P(θ | D) = P(D | θ) · P(θ) / P(D)

Four named pieces:

  • P(θ | D): posterior. Belief about θ after seeing the data.
  • P(D | θ): likelihood. How probable the data is under each candidate θ.
  • P(θ): prior. Belief about θ before seeing the data.
  • P(D): evidence. A normalizing constant that makes the posterior integrate to 1.

The compact reading: posterior ∝ likelihood × prior. Drop the evidence and you have everything you need to do Bayesian inference up to a constant.
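The proportionality can be made concrete on a small grid of hypotheses. The sketch below is illustrative only: the three candidate coin biases, the uniform prior, and the observed data (two heads, then one tail) are assumptions chosen for the example, and `grid_posterior` is a hypothetical helper name.

```python
def grid_posterior(prior, likelihood):
    """Return the normalized posterior over a discrete grid of hypotheses.

    prior[i]      = P(theta_i)
    likelihood[i] = P(D | theta_i)
    """
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(unnormalized)  # P(D), the normalizing constant
    return [u / evidence for u in unnormalized], evidence


# Assumed example: three candidate coin biases with a uniform prior.
thetas = [0.25, 0.50, 0.75]
prior = [1 / 3, 1 / 3, 1 / 3]

# Assumed data D: two heads followed by one tail.
likelihood = [t ** 2 * (1 - t) for t in thetas]

posterior, evidence = grid_posterior(prior, likelihood)
for t, p in zip(thetas, posterior):
    print(f"theta = {t:.2f}  posterior = {p:.3f}")
```

The evidence only rescales the products so they sum to 1; the ranking of hypotheses is already fixed by likelihood × prior.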

[Figure: Bayes' rule as a Beta-Binomial update, with actual numbers. Prior: Beta(1, 1). Round 1: 7 relevant out of 10 gives Beta(8, 4), mean = 0.667. Round 2: 3 relevant out of 5 gives Beta(11, 6), mean = 0.647. Axes: θ (relevance rate) from 0 to 1 vs. density. Update rule: Beta(α, β) + s successes in n trials → Beta(α + s, β + n − s), with P(θ | D) ∝ P(D | θ) · P(θ).]

Conjugate priors: when updates are closed-form

For specific likelihood-prior pairings, the conjugate pairs, the posterior stays in the same family as the prior, so you can update parameters by arithmetic instead of integration. The canonical pair:

Beta-Binomial

Suppose you want to estimate a coin’s bias θ. Prior: θ ~ Beta(α, β). Observe s heads in n flips (a Binomial likelihood). Posterior:

  θ | D ~ Beta(α + s, β + n − s)

You add successes to α, failures to β. That is the entire update. The prior parameters act like pseudo-counts: a uniform Beta(1, 1) prior is “I have seen 0 prior successes and 0 prior failures”; a Beta(50, 50) prior is “I have seen 49 prior successes and 49 prior failures.”
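As a minimal sketch of that arithmetic, the snippet below applies the update twice, starting from a uniform prior and feeding in two rounds of data (7 successes in 10 trials, then 3 in 5). The function names are hypothetical, chosen for illustration.

```python
def beta_update(alpha, beta, successes, trials):
    """Conjugate update: Beta(alpha, beta) prior + Binomial data -> Beta posterior."""
    return alpha + successes, beta + (trials - successes)


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


prior = (1, 1)                          # uniform prior
round1 = beta_update(*prior, 7, 10)     # 7 successes, 3 failures
round2 = beta_update(*round1, 3, 5)     # 3 successes, 2 failures

print(round1, round(beta_mean(*round1), 3))   # (8, 4) 0.667
print(round2, round(beta_mean(*round2), 3))   # (11, 6) 0.647
```

Because the update is additive, processing the two rounds in sequence gives the same posterior as pooling them into 10 successes out of 15.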

Other classics: Gaussian is conjugate to Gaussian (with known variance), Dirichlet is conjugate to multinomial, Gamma is conjugate to Poisson. These pairings are why Bayesian inference was tractable for decades before MCMC and variational methods made arbitrary posteriors practical.

A worked example: pairwise comparison

Two retrievers, A and B. You want θ = P(A beats B), the probability that A produces the better result on a random query. Each query is a Bernoulli trial: A wins or B wins.

Prior: Beta(1, 1) (uniform on [0, 1], no preconception). Run 20 paired comparisons; A wins 13. Posterior: Beta(14, 8). Posterior mean: 14/22 ≈ 0.636. Posterior 95% credible interval: roughly [0.43, 0.82].

The credible interval’s width tells you how much you should trust the lift. Twenty trials is not enough: the interval crosses 0.5, so you cannot rule out that B is actually better. The Bayesian framing makes this honest: one number out (the posterior probability that θ > 0.5) plus its uncertainty.
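The posterior summaries for this example can be reproduced with the standard library alone. The sketch below assumes a uniform Beta(1, 1) prior and 13 wins in 20 comparisons. It relies on the identity that, for integer parameters, the Beta CDF equals a Binomial tail sum; `beta_cdf` and `beta_quantile` are hypothetical helper names, and in practice a statistics library would normally do this job.

```python
from math import comb


def beta_cdf(x, a, b):
    """CDF of Beta(a, b) at x, for positive integer a and b.

    Uses the identity P(Beta(a, b) <= x) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, k) * x ** k * (1 - x) ** (n - k) for k in range(a, n + 1))


def beta_quantile(p, a, b, tol=1e-10):
    """Invert the CDF by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


wins, trials = 13, 20
a, b = 1 + wins, 1 + (trials - wins)       # posterior Beta(14, 8)

post_mean = a / (a + b)
ci_low = beta_quantile(0.025, a, b)        # equal-tailed 95% interval
ci_high = beta_quantile(0.975, a, b)
p_theta_above_half = 1 - beta_cdf(0.5, a, b)

print(f"mean = {post_mean:.3f}")
print(f"95% credible interval = [{ci_low:.2f}, {ci_high:.2f}]")
print(f"P(theta > 0.5) = {p_theta_above_half:.3f}")
```

Under these assumptions the posterior probability that θ exceeds 0.5 comes out near 0.905: suggestive, but short of the certainty a 13-to-7 scoreline might imply.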

MAP, MLE, and the regularization equivalence

Maximum a posteriori (MAP) estimation picks the mode of the posterior:

  θ_MAP = argmax_θ [ log P(D | θ) + log P(θ) ]

The first term is the log-likelihood, which is what MLE maximizes. The second is a prior penalty.

With a Gaussian prior θ ~ N(0, σ²I), the prior log-term is −‖θ‖² / (2σ²) up to an additive constant: exactly L2 weight decay. Cranking up the weight-decay coefficient = tightening the prior (smaller σ²). Setting it to zero = pure MLE.

The same argument gives Lasso (Laplace prior on θ), elastic net (a prior whose log-density combines the Laplace and Gaussian terms), and many other regularizers. Most additive penalty terms you have seen in a deep-learning loss can be read as a log-prior in disguise. For these penalties, frequentist regularization and Bayesian MAP are the same optimization under different names.
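A one-parameter sketch makes the equivalence checkable. It assumes observations x_i ~ N(θ, 1) with prior θ ~ N(0, σ²), so the MAP estimate has the closed form Σx / (n + 1/σ²), and compares it with gradient descent on a squared-error loss plus an L2 penalty λθ², where λ = 1/(2σ²). The data values, step size, and function names are all illustrative assumptions.

```python
def map_estimate_gaussian_mean(xs, sigma2):
    """Closed-form MAP for x_i ~ N(theta, 1) with prior theta ~ N(0, sigma2)."""
    return sum(xs) / (len(xs) + 1.0 / sigma2)


def l2_penalized_estimate(xs, lam, steps=2000, lr=0.01):
    """Minimize 0.5 * sum((x - theta)^2) + lam * theta^2 by gradient descent."""
    theta = 0.0
    for _ in range(steps):
        grad = -sum(x - theta for x in xs) + 2.0 * lam * theta
        theta -= lr * grad
    return theta


xs = [1.2, 0.8, 1.5, 0.9, 1.1]    # assumed observations
sigma2 = 0.5                       # assumed prior variance
lam = 1.0 / (2.0 * sigma2)         # matching weight-decay coefficient

theta_map = map_estimate_gaussian_mean(xs, sigma2)
theta_l2 = l2_penalized_estimate(xs, lam)
theta_mle = sum(xs) / len(xs)      # what you get with no penalty

print(f"MAP = {theta_map:.4f}, L2-penalized = {theta_l2:.4f}, MLE = {theta_mle:.4f}")
```

The two regularized estimates agree and both sit closer to zero than the MLE; shrinking σ² (equivalently, raising λ) pulls them further toward the prior mean.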

Where Bayes shows up in modern LLM work

Bayes in production ML
  • RLHF's KL penalty: the objective is a tempered posterior with the reference policy π_ref as prior and exp(r/β) as likelihood, where r is the reward. The fine-tuned policy is the posterior mode at temperature β (the KL coefficient).
  • Variational inference: when the true posterior is intractable, fit a tractable approximation q(θ) by minimizing KL(q(θ) ‖ P(θ | D)). The ELBO drops out as the optimization-friendly form.
  • Score calibration: turning raw judge outputs into well-calibrated probabilities is posterior estimation. Each judge call is evidence, and a prior shrinks low-sample bins toward sensible defaults.
  • Thompson sampling: sampling from the posterior over arm rewards, then acting greedily on the sample, achieves asymptotically optimal regret for Bernoulli bandits. Used in production for ad ranking and online retrieval-eval comparisons.
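Of the items above, Thompson sampling is the easiest to sketch in a few lines, because each arm's posterior is a Beta distribution updated by counting. The simulation below is a minimal illustration, not a production recipe: the three true reward rates, the Beta(1, 1) priors, the number of rounds, and the function name are all assumptions made for the example.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Bernoulli-bandit Thompson sampling with a Beta(1, 1) prior on each arm."""
    rng = random.Random(seed)
    alpha = [1] * len(true_rates)   # 1 + observed successes per arm
    beta = [1] * len(true_rates)    # 1 + observed failures per arm
    pulls = [0] * len(true_rates)

    for _ in range(rounds):
        # Draw one plausible reward rate per arm from its current posterior.
        samples = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
        # Act greedily with respect to the sampled rates.
        arm = max(range(len(samples)), key=samples.__getitem__)
        # Simulated Bernoulli reward from the (unknown to the agent) true rate.
        reward = rng.random() < true_rates[arm]
        # Conjugate update for the chosen arm only.
        if reward:
            alpha[arm] += 1
        else:
            beta[arm] += 1
        pulls[arm] += 1

    return pulls, alpha, beta


pulls, alpha, beta = thompson_sampling([0.2, 0.5, 0.8], rounds=2000)
print("pulls per arm:", pulls)
```

Exploration falls out of posterior uncertainty: an arm with few pulls has a wide posterior and is occasionally sampled high, while well-measured poor arms are sampled high less and less often.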

Bayes’ rule is technically a theorem of measure-theoretic probability: a one-line consequence of the definition of conditional probability applied twice and rearranged. The “rule” naming is historical (Bayes’ proof of a special case was published posthumously in 1763) and reflects how it is used: as a procedural rule for updating beliefs, not just an algebraic identity.

The depth comes from what you do with it. Choosing a likelihood model is a modeling decision. Choosing a prior is a (sometimes contentious) modeling decision. Computing the posterior — exactly via conjugacy, approximately via MCMC, approximately-and-fast via variational inference — is the practical art. The rule itself is two lines of arithmetic; the discipline of Bayesian inference is everything you do around it.

The one principle that ties it all together

Every learning algorithm is implicitly answering “what should I believe about θ given the data I have seen?”, and the only mathematically consistent answer is the posterior. MLE drops the prior. MAP keeps the prior but takes a point estimate. Full Bayesian inference keeps the whole distribution. The rest of ML (regularization, RLHF anchors, calibrated judges, variational methods) is choosing how much of the posterior to keep and how to compute it cheaply.

Go further

What is a conjugate prior, and why does it matter?

A prior is conjugate to a likelihood when the resulting posterior is in the same family as the prior, so you can update parameters in closed form without any integration. Beta is conjugate to Bernoulli/Binomial: a Beta(α, β) prior plus s successes in n trials gives a Beta(α + s, β + n − s) posterior. Gaussian is conjugate to Gaussian (with known variance). These pairings are what made Bayesian inference tractable before MCMC.

How is MAP estimation related to MLE and weight decay?

MAP picks θ to maximize the posterior P(θ | D). Take logs and you get log-likelihood + log-prior, which is MLE plus a penalty term. With a Gaussian prior N(0, σ²I), the penalty is ‖θ‖² / (2σ²), which is exactly L2 weight decay. Lasso = Laplace prior. Every common regularizer is a log-prior in disguise.

Where does Bayesian thinking show up in modern LLM work?

RLHF's KL penalty is a Bayesian anchor — the reference policy is the prior, the reward is the likelihood, the regularized objective is a posterior mode. Variational inference fits an approximate posterior by minimizing KL to the true one. Score calibration on retrieval judgments is posterior estimation: each judge sample is evidence, and the posterior over relevance shrinks toward a sensible prior.
