Conjugate Prior

Also known as: conjugate distribution, conjugacy, conjugate family

TL;DR

A prior distribution is conjugate to a likelihood when multiplying them produces a posterior in the same family as the prior — so Bayesian updates reduce to arithmetic on the prior's parameters instead of an integral.

A conjugate prior is a prior distribution whose mathematical form is chosen to play nicely with a particular likelihood: when you multiply them together via Bayes’ rule and renormalize, the resulting posterior is in the same family as the prior, just with updated parameters. The intractable integral that normally lurks in the denominator of Bayes’ rule never has to be computed numerically, and the posterior update reduces to arithmetic on the prior’s parameters.

That property is the entire reason the term exists. Without conjugacy, computing a posterior is an integration problem; with it, it’s a counting problem.

The shape of conjugacy

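Let $\theta$ be a parameter (say, a coin’s bias) and $x$ the observed data. The Bayesian update is:

$$p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta)$$

For an arbitrary prior, the right-hand side is a general function of $\theta$: you have to integrate to renormalize and you don’t get a named distribution back. A conjugate prior is chosen so that the multiplication preserves the functional form. If the prior is $\mathrm{Beta}(\alpha, \beta)$ and the likelihood is Bernoulli with $k$ successes in $n$ trials, the product is proportional to $\theta^{\alpha + k - 1}(1 - \theta)^{\beta + n - k - 1}$, which is again a Beta, just with updated parameters.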
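Written out for the Beta-Bernoulli case, the reason no numerical integration is needed becomes visible: the normalizing constant of the product is itself a Beta function. The following is a sketch under the assumption that the data are $k$ successes in $n$ independent trials; the symbols $k$, $n$, $x_{1:n}$ and $B(\cdot,\cdot)$ (the Beta function) are notation chosen here for illustration.

```latex
\begin{aligned}
p(\theta) &= \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)},
\qquad
p(x_{1:n} \mid \theta) = \theta^{k}(1-\theta)^{n-k}, \\
p(\theta \mid x_{1:n})
  &= \frac{\theta^{\alpha+k-1}(1-\theta)^{\beta+n-k-1}}{B(\alpha+k,\ \beta+n-k)}
  \quad\Longrightarrow\quad
  \theta \mid x_{1:n} \sim \mathrm{Beta}(\alpha+k,\ \beta+n-k).
\end{aligned}
```

The denominator is known in closed form because the numerator has the same functional shape as a Beta density, so renormalizing amounts to reading off the new parameters.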

The canonical pairs

Conjugate pairings you'll see in production
  • Beta is conjugate to Bernoulli / Binomial: prior $\mathrm{Beta}(\alpha, \beta)$ + $k$ successes in $n$ trials → posterior $\mathrm{Beta}(\alpha + k, \beta + n - k)$. Used for win-rate estimation, A/B tests, bandit arms, and pairwise-preference aggregation.
  • Dirichlet is conjugate to categorical / multinomial: prior $\mathrm{Dir}(\boldsymbol{\alpha})$ + observed counts $\mathbf{c}$ → posterior $\mathrm{Dir}(\boldsymbol{\alpha} + \mathbf{c})$. The Beta is a 2-class special case. Used for class-distribution estimation, topic models, smoothed language models.
  • Normal is conjugate to Normal (with known variance): prior $\mathcal{N}(\mu_0, \sigma_0^2)$ + $n$ samples with mean $\bar{x}$ → posterior is again a Normal with precision-weighted updated mean. Used for online mean estimation, Kalman filtering, ridge-regression equivalence.
  • Gamma is conjugate to Poisson: prior $\mathrm{Gamma}(\alpha, \beta)$ + $n$ observations summing to $S$ → posterior $\mathrm{Gamma}(\alpha + S, \beta + n)$. Used for rate-parameter inference: page-view rates, click-through rates, request frequencies.
  • Normal-inverse-Wishart is conjugate to Gaussian with unknown mean and covariance — the multivariate generalization of the previous case. Used in Bayesian regression and Gaussian process hyperparameter inference.
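The first four pairings above can be sketched as plain parameter arithmetic. This is a minimal illustration, not a library API: the function names are hypothetical, the Gamma is assumed to be in shape-rate form, and the Normal case assumes the observation noise variance is known.

```python
def update_beta(alpha, beta, successes, failures):
    """Beta prior + Bernoulli/Binomial data: add the observed counts."""
    return alpha + successes, beta + failures


def update_dirichlet(alphas, counts):
    """Dirichlet prior + categorical counts: add counts class by class."""
    return [a + c for a, c in zip(alphas, counts)]


def update_gamma(shape, rate, total, n):
    """Gamma(shape, rate) prior + n Poisson observations summing to total."""
    return shape + total, rate + n


def update_normal(mu0, var0, sample_mean, n, noise_var):
    """Normal prior on a mean + n samples with known noise variance.

    Precisions (inverse variances) add, and the posterior mean is the
    precision-weighted average of the prior mean and the sample mean.
    """
    prior_precision = 1.0 / var0
    data_precision = n / noise_var
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * mu0 + data_precision * sample_mean)
    return post_mean, post_var


beta_post = update_beta(1, 1, successes=7, failures=3)
dirichlet_post = update_dirichlet([1, 1, 1], [2, 0, 5])
gamma_post = update_gamma(2.0, 1.0, total=30, n=10)
normal_post = update_normal(mu0=0.0, var0=1.0, sample_mean=2.0, n=4, noise_var=4.0)
```

None of the four functions loops over samples or calls an integrator; each touches only the sufficient statistics of the data (counts, sums, means).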

Every one of these is in the exponential family on both sides — that is not a coincidence. The natural conjugate prior for any exponential-family likelihood is itself exponential-family, and falls out mechanically from writing the likelihood in canonical form. Outside the exponential family there is generally no closed-form conjugate.
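The "falls out mechanically" step can be sketched in general form. The notation here is an assumption made for illustration: $\eta$ is the natural parameter, $T(x)$ the sufficient statistic, $A(\eta)$ the log-normalizer, and $\chi$, $\nu$ the prior's hyperparameters.

```latex
\begin{aligned}
p(x \mid \eta) &= h(x)\, \exp\!\big(\eta^{\top} T(x) - A(\eta)\big), \\
p(\eta \mid \chi, \nu) &\propto \exp\!\big(\eta^{\top} \chi - \nu\, A(\eta)\big), \\
p(\eta \mid x_{1:n}, \chi, \nu) &\propto
  \exp\!\Big(\eta^{\top} \big(\chi + \textstyle\sum_{i=1}^{n} T(x_i)\big) - (\nu + n)\, A(\eta)\Big).
\end{aligned}
```

The posterior has the same form as the prior with $\chi \to \chi + \sum_i T(x_i)$ and $\nu \to \nu + n$: the first hyperparameter accumulates sufficient statistics and the second counts observations.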

Pseudo-count intuition

The cleanest mental model for a conjugate prior is pseudo-counts: the prior’s parameters act as if you had observed a phantom set of trials before any real data arrived.

A $\mathrm{Beta}(\alpha, \beta)$ prior on a coin’s bias is exactly “I have seen $\alpha$ prior successes and $\beta$ prior failures.” Updating with $k$ real successes in $n$ real trials just adds them to the pile: posterior $\mathrm{Beta}(\alpha + k, \beta + n - k)$. The prior strength $\alpha + \beta$ is the effective sample size of the prior.
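One way to see the effective-sample-size reading is that the posterior mean is a weighted average of the prior mean and the observed success rate, with the prior's weight set by its strength relative to the amount of data. A small sketch, with hypothetical function names and example numbers chosen only for illustration:

```python
def posterior_mean(alpha, beta, k, n):
    """Mean of Beta(alpha + k, beta + n - k)."""
    return (alpha + k) / (alpha + beta + n)


def prior_weight(alpha, beta, n):
    """Share of the posterior mean contributed by the prior mean."""
    return (alpha + beta) / (alpha + beta + n)


# Illustrative numbers: a prior of strength 10 centered on 0.5,
# followed by 9 successes in 10 real trials.
alpha, beta = 5.0, 5.0
k, n = 9, 10

w = prior_weight(alpha, beta, n)
prior_mean = alpha / (alpha + beta)
observed_rate = k / n

# The blend of prior mean and observed rate equals the posterior mean.
blend = w * prior_mean + (1 - w) * observed_rate
direct = posterior_mean(alpha, beta, k, n)
```

With a prior strength of 10 and 10 real trials, the prior and the data each get half the weight; as more trials arrive the prior's share shrinks toward zero.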

This is what makes conjugate priors tunable in a way that vaguer priors aren’t. “Prior strength of 10” is a sentence you can say out loud, defend, and adjust. “I chose this kernel and these hyperparameters for my Gaussian process prior” is a sentence that requires a paragraph of justification.

A concrete update

Two retrievers, $A$ and $B$. You want $\theta = P(A \text{ beats } B)$. Each query is a Bernoulli trial.

State                          Distribution   Pseudo-counts
Prior                          Beta(1, 1)     uniform: 0 prior wins, 0 prior losses
After 7 wins, 3 losses         Beta(8, 4)     7 wins, 3 losses
After 3 more wins, 2 losses    Beta(11, 6)    10 wins, 5 losses

Posterior mean after 15 queries: $11 / (11 + 6) = 11/17 \approx 0.647$. No integrals, no sampler, no convergence checks. Just addition.
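The same sequence can be replayed in a few lines. This sketch assumes a uniform Beta(1, 1) starting point and two batches of outcomes (7 wins and 3 losses, then 3 wins and 2 losses); the variable names are illustrative.

```python
from fractions import Fraction

# Start from a uniform Beta(1, 1) prior on P(A beats B).
alpha, beta = 1, 1

# Each batch is (wins for A, losses for A).
batches = [(7, 3), (3, 2)]

history = [(alpha, beta)]
for wins, losses in batches:
    # The whole Bayesian update: add the counts to the parameters.
    alpha += wins
    beta += losses
    history.append((alpha, beta))

# Mean of Beta(alpha, beta) is alpha / (alpha + beta), kept exact here.
posterior_mean = Fraction(alpha, alpha + beta)

print(history)                # [(1, 1), (8, 4), (11, 6)]
print(float(posterior_mean))  # about 0.647
```

Because the update is just addition, processing the 15 outcomes in one batch, in two batches, or one at a time gives the same posterior.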

That is the production payoff. When you need to update a belief 10,000 times per second, conjugacy isn’t a mathematical curiosity — it’s the only thing that runs.

When conjugacy stopped mattering, then started again

From the 18th century through the 1980s, conjugate priors were the only practical way to do Bayesian inference. Anything non-conjugate required numerical integration that was infeasible by hand and slow on early computers. Whole subfields of statistics organized themselves around which families had clean conjugate pairs.

MCMC (1990s) and variational inference (2000s–2010s) broke that constraint. Today you can fit an approximate posterior to nearly any likelihood-prior combination — Stan, PyMC, NumPyro, Pyro all do this routinely. Conjugacy stopped being a requirement.

But it stayed a preference in a few specific regimes that matter for modern ML:

Where conjugate priors are still load-bearing
  • Online updates — bandits, streaming win-rate estimation, online A/B testing. You can’t run an MCMC sampler in a request handler. A conjugate update is a few floating-point ops and fits in the latency budget.
  • Hierarchical hyperpriors inside larger samplers — even when the top-level posterior is sampled, inner conjugate updates inside Gibbs sweeps are essentially free, so the overall sampler is much faster.
  • Pairwise-preference aggregation: when you combine noisy preference signals (judge LLMs voting on a pair, say), a Beta posterior on each pair gives you both the mean and the variance, and the variance is what tells you how much to trust each comparison.
  • Score calibration on judgments — a Beta prior on the per-bin precision shrinks low-sample bins toward sensible defaults without introducing the variance of a sampler-based estimate.
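For the preference-aggregation case in the list above, the mean and variance both come straight from the Beta parameters. A minimal sketch, assuming each vote is encoded as 1 (first item preferred) or 0, with a uniform Beta(1, 1) prior as the default; the function names are hypothetical.

```python
def beta_mean_var(alpha, beta):
    """Closed-form mean and variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    var = (alpha * beta) / (total ** 2 * (total + 1))
    return mean, var


def aggregate_votes(votes, prior=(1.0, 1.0)):
    """Posterior mean and variance of P(first item preferred).

    votes: iterable of 0/1 outcomes from independent judges (an assumption).
    prior: (alpha, beta) pseudo-counts; Beta(1, 1) is uniform.
    """
    votes = list(votes)
    wins = sum(votes)
    alpha = prior[0] + wins
    beta = prior[1] + len(votes) - wins
    return beta_mean_var(alpha, beta)


few_mean, few_var = aggregate_votes([1, 1, 0])         # 3 votes
many_mean, many_var = aggregate_votes([1, 1, 0] * 10)  # 30 votes, same ratio
```

The two pairs have the same win ratio, but the pair with 30 votes has a much smaller posterior variance, which is the signal for how much to trust each comparison.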

The term was coined by Howard Raiffa and Robert Schlaifer in their 1961 monograph Applied Statistical Decision Theory. The mathematical sense of “conjugate” (two things that pair cleanly under some operation) is the same as in complex conjugate ($z$ and $\bar{z}$ pair under multiplication to give a real) or conjugate variables in Hamiltonian mechanics. A conjugate prior is the prior that “pairs cleanly” with a likelihood under Bayesian updating, in the sense that multiplication preserves the family.

The deeper structural fact is that for any exponential-family likelihood, there is a natural conjugate prior, and its parameters have a clean interpretation as prior sufficient statistics — pseudo-data with which to seed the inference. That observation underwrites the entire toolkit.

The takeaway

Conjugate priors are the special case of Bayesian inference where the math closes. They are not the right tool for every problem — but in the problems they fit, they are dramatically better than the alternatives. Closed-form means fast, exact, and interpretable; non-conjugate means correct but slow, approximate, and harder to reason about. When you reach for a Beta on a probability or a Gamma on a rate, you are choosing the family precisely because the update will fit in a single line of code.

Go further

If MCMC and variational inference handle any posterior, why still care about conjugate priors?

Three reasons. (1) Speed: a conjugate update is a few arithmetic ops, while MCMC takes thousands of samples. For online updates (bandit arms, streaming win-rate estimation, A/B tests), conjugacy is the only thing that runs fast enough. (2) Exactness: no sampling noise, no convergence diagnostics, no chain-mixing failures. (3) Interpretability: pseudo-count semantics (a $\mathrm{Beta}(\alpha, \beta)$ prior is 'I've seen $\alpha$ prior successes and $\beta$ prior failures') make the prior's effect on the posterior easy to reason about and easy to tune.

How do I find the conjugate prior for a given likelihood?

If the likelihood is in the exponential family — Bernoulli, Binomial, Poisson, Gaussian, multinomial, exponential, gamma all are — its natural conjugate prior is also in the exponential family, and it falls out of writing the likelihood in canonical form. In practice you reach for a lookup table: Beta for Bernoulli/Binomial, Dirichlet for multinomial/categorical, Gamma for Poisson and for the precision of a Gaussian with known mean, Normal for the mean of a Gaussian with known variance, Normal-inverse-Wishart for a Gaussian with unknown mean and covariance.

Does picking a conjugate prior bias my results?

It biases them exactly as much as any prior choice biases them — conjugate priors aren't special in that regard. What is special is that the shape of the conjugate family is usually rich enough to encode genuinely uninformative priors (Beta(1,1) is uniform), weakly informative priors (Beta(2,2) is gently mode-seeking), and strongly informative ones (Beta(50,50) is 'I've effectively seen 100 trials already'). If your conjugate family can't express the prior knowledge you actually have, switch to a non-conjugate prior and pay the MCMC tax.
