Maximum Likelihood Estimation

Also known as: MLE, maximum likelihood

TL;DR

MLE picks the parameters that maximize the probability of the observed data under the model — equivalently, that minimize negative log-likelihood. Cross-entropy training is MLE under a categorical model.

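Maximum likelihood is the principle that almost every modern loss function silently obeys. Given data $x_1, \dots, x_n$ and a parametric model $p(x \mid \theta)$, the MLE is the parameter value that maximizes the probability of the data under the model:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \prod_{i=1}^{n} p(x_i \mid \theta) = \arg\max_\theta \sum_{i=1}^{n} \log p(x_i \mid \theta)$$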

Switching to the log turns the product into a sum (so it does not underflow with thousands of factors) and turns “maximize” into “minimize negative log-likelihood.” That negative-log-likelihood objective is what gradient descent actually sees during training.
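The underflow point is easy to see numerically. The sketch below is a minimal illustration, assuming a synthetic sample of 5000 draws from a unit-variance Gaussian; the helper `gaussian_pdf` and the sample itself are made up for the example.

```python
import math
import random

def gaussian_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

rng = random.Random(0)
data = [rng.gauss(0.5, 1.0) for _ in range(5000)]  # synthetic sample (assumption)

# Raw likelihood: a product of 5000 factors, each below 0.4.
likelihood = 1.0
for x in data:
    likelihood *= gaussian_pdf(x, mu=0.5)

# Log-likelihood: the same quantity expressed as a sum of logs.
log_likelihood = sum(math.log(gaussian_pdf(x, mu=0.5)) for x in data)

print(likelihood)      # underflows to 0.0 in double precision
print(log_likelihood)  # a finite negative number
```

The product is useless as an optimization target once it hits zero, while the sum of logs stays well inside floating-point range.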

[Figure: Maximum likelihood: pick the parameter that makes the data most probable. Panel "Observed data · candidate model N(μ, 1)": density against x, showing the data x_i, the sample mean x̄ = 0.45, and a candidate N(μ, 1) with μ too low. Panel "Log-likelihood": L(μ) = Σ log p(x_i | μ) against μ, peaking at μ* = x̄. Panel "The same principle wearing different names": min −L(θ) gives cross-entropy (categorical) and MSE (Gaussian).]
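For a unit-variance Gaussian, the log-likelihood in μ peaks at the sample mean. A brute-force check, as a sketch only: the five observations are hypothetical (chosen so their mean is 0.45), and a grid search stands in for a real optimizer.

```python
import math

def log_likelihood(mu, data):
    """L(mu) = sum_i log N(x_i | mu, 1)."""
    return sum(-0.5 * math.log(2.0 * math.pi) - 0.5 * (x - mu) ** 2 for x in data)

data = [-0.8, 0.1, 0.3, 0.9, 1.75]  # hypothetical observations
sample_mean = sum(data) / len(data)

# Evaluate L(mu) on a fine grid of candidate means in [-3, 3].
grid = [i / 1000.0 for i in range(-3000, 3001)]
mu_star = max(grid, key=lambda mu: log_likelihood(mu, data))

print(sample_mean, mu_star)  # the grid maximizer coincides with the sample mean
```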

The unifying view

Every standard supervised loss is MLE under some assumed observation model. Three canonical mappings:

Common losses as MLE
  • Cross-entropy = MLE for a categorical model. The likelihood of label $y$ under predicted distribution $\hat{p}$ is $\hat{p}_y$; the negative log-likelihood per example is $-\log \hat{p}_y$, which is exactly $-\sum_k t_k \log \hat{p}_k$ with a one-hot target $t$.
  • MSE = MLE for a Gaussian observation model with fixed variance. If $y \sim \mathcal{N}(f_\theta(x), \sigma^2)$, the negative log-density is $(y - f_\theta(x))^2 / (2\sigma^2)$ plus a constant. Minimizing MSE is maximizing Gaussian likelihood.
  • Mean absolute error = MLE for a Laplace observation model. The Laplace distribution has heavier tails than the Gaussian, so MAE is more robust to outliers, for the same reason that the Laplace's log-density in the residual $r$ is $-|r|/b$ rather than $-r^2/(2\sigma^2)$ (each up to an additive constant).
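The three mappings above can be checked numerically. This is a minimal sketch with made-up numbers; `gaussian_nll` and `laplace_nll` are hypothetical helpers written for the example, not library functions.

```python
import math

# Cross-entropy against a one-hot target equals -log of the true-class probability.
probs = [0.7, 0.2, 0.1]    # hypothetical predicted distribution
one_hot = [0.0, 1.0, 0.0]  # true class is index 1
cross_entropy = -sum(t * math.log(p) for t, p in zip(one_hot, probs))
nll_true_class = -math.log(probs[1])

def gaussian_nll(y, pred, sigma):
    """-log N(y | pred, sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + (y - pred) ** 2 / (2.0 * sigma ** 2)

def laplace_nll(y, pred, b):
    """-log Laplace(y | pred, b)."""
    return math.log(2.0 * b) + abs(y - pred) / b

# Subtract the scaled squared / absolute error; what remains should not depend on pred.
y, sigma, b = 2.0, 1.5, 0.8
gauss_consts, laplace_consts = [], []
for pred in (0.0, 1.0, 3.5):
    gauss_consts.append(gaussian_nll(y, pred, sigma) - (y - pred) ** 2 / (2.0 * sigma ** 2))
    laplace_consts.append(laplace_nll(y, pred, b) - abs(y - pred) / b)

print(cross_entropy, nll_true_class)
print(gauss_consts)    # the same constant for every prediction
print(laplace_consts)  # likewise
```

Because the leftover terms do not depend on the prediction, they drop out of the gradient, which is why training on squared or absolute error matches training on the corresponding likelihood.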

The choice of loss is a choice of observation model. When somebody picks MSE without thinking, they have implicitly assumed Gaussian residuals.
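One way to see what the implicit assumption costs: fit a single constant to targets that contain an outlier. In this sketch (hypothetical data, grid search instead of a real optimizer) the MSE fit lands on the mean and the MAE fit lands on the median.

```python
import statistics

targets = [1.0, 1.2, 0.9, 1.1, 25.0]  # hypothetical targets with one outlier

def mse(c, ys):
    return sum((y - c) ** 2 for y in ys) / len(ys)

def mae(c, ys):
    return sum(abs(y - c) for y in ys) / len(ys)

# Candidate constant predictions 0.00, 0.01, ..., 30.00.
grid = [i / 100.0 for i in range(0, 3001)]
best_mse = min(grid, key=lambda c: mse(c, targets))
best_mae = min(grid, key=lambda c: mae(c, targets))

print(best_mse, statistics.mean(targets))    # Gaussian MLE is dragged toward the outlier
print(best_mae, statistics.median(targets))  # Laplace MLE stays with the bulk of the data
```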

Why this is the same problem as KL minimization

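The empirical distribution $\hat{p}_{\text{data}}$ puts mass $1/n$ on each observed sample. Then:

$$\frac{1}{n} \sum_{i=1}^{n} \log p(x_i \mid \theta) = \mathbb{E}_{x \sim \hat{p}_{\text{data}}}\left[\log p(x \mid \theta)\right] = -H(\hat{p}_{\text{data}}, p_\theta)$$

That is the negative of the empirical cross-entropy from data to model. Add the (constant) entropy $H(\hat{p}_{\text{data}})$ and you get the negative KL divergence $-\mathrm{KL}(\hat{p}_{\text{data}} \,\|\, p_\theta)$. So: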

Maximum likelihood = minimum forward KL between the empirical distribution and the model.
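For a discrete toy dataset the identity can be verified directly: the average negative log-likelihood equals the entropy of the empirical distribution plus the forward KL to the model. The samples and the model probabilities below are hypothetical.

```python
import math
from collections import Counter

data = ["a", "b", "a", "c", "a", "b", "a", "a"]  # hypothetical samples
model = {"a": 0.5, "b": 0.3, "c": 0.2}          # hypothetical model p_theta

n = len(data)
empirical = {k: c / n for k, c in Counter(data).items()}

# Average negative log-likelihood of the data under the model.
avg_nll = -sum(math.log(model[x]) for x in data) / n

# Entropy of the empirical distribution (does not depend on the model).
entropy = -sum(p * math.log(p) for p in empirical.values())

# Forward KL from the empirical distribution to the model.
kl = sum(p * math.log(p / model[k]) for k, p in empirical.items())

print(avg_nll, entropy + kl)  # the two numbers agree
```

Since the entropy term is fixed by the data, any change to the model that lowers the average negative log-likelihood lowers the KL by the same amount.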

This identity is why perplexity works as a language-model evaluation: perplexity is $\exp$ of the per-token cross-entropy, which under MLE training is exactly what the optimizer is pushing down.
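As a small worked example, perplexity can be computed from the probabilities a model assigned to the tokens that actually occurred. The four probabilities here are made up for illustration.

```python
import math

# Hypothetical probabilities the model assigned to each observed token.
token_probs = [0.25, 0.5, 0.125, 0.125]

# Per-token cross-entropy in nats: the average negative log-probability.
cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is the exponential of that average.
perplexity = math.exp(cross_entropy)

print(cross_entropy, perplexity)
```

A model that spreads probability uniformly over V tokens gets perplexity V, so the number reads as an effective branching factor.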

What MLE buys you asymptotically

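Under regularity conditions (smooth likelihood, identifiable parameters, true $\theta^*$ in the interior of the parameter space), MLE has three asymptotic properties that make it canonical:

  1. Consistency. $\hat{\theta}_n \to \theta^*$ in probability as $n \to \infty$.
  2. Asymptotic normality. $\sqrt{n}\,(\hat{\theta}_n - \theta^*) \xrightarrow{d} \mathcal{N}(0, I(\theta^*)^{-1})$, where $I(\theta^*)$ is the Fisher information.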
  3. Asymptotic efficiency. That variance achieves the Cramér-Rao lower bound — no unbiased estimator does asymptotically better.
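A quick simulation illustrates these properties for a Bernoulli model, where the MLE is the sample proportion and the per-observation Fisher information is 1 / (p(1 − p)). This is a sketch under assumed settings (p = 0.3, n = 200, 2000 repetitions, fixed seed), not a proof.

```python
import random
import statistics

p_true, n, trials = 0.3, 200, 2000  # assumed simulation settings
rng = random.Random(0)

estimates = []
for _ in range(trials):
    sample = [1 if rng.random() < p_true else 0 for _ in range(n)]
    estimates.append(sum(sample) / n)  # Bernoulli MLE = sample proportion

# Cramér-Rao bound for n observations: 1 / (n * I(p)).
fisher_info = 1.0 / (p_true * (1.0 - p_true))
cramer_rao = 1.0 / (n * fisher_info)

empirical_var = statistics.pvariance(estimates)

print(statistics.mean(estimates))  # close to p_true
print(empirical_var, cramer_rao)   # close to each other
```

For the Bernoulli case the MLE's variance equals the bound at every n; in most models the match only holds in the limit.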

This is the reason frequentist statistics ended up here. MLE is, in a precise sense, the best you can do with infinite data and a correctly specified model.

The catch: these are all asymptotic. For finite $n$, MLE can be biased (the canonical example is the MLE for a Gaussian variance, which uses $n$ in the denominator instead of $n - 1$ and is biased downward). And if your model is mis-specified, meaning the truth is not in your model class, MLE converges to the model that minimizes KL to the truth, which may be unsatisfying.
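The variance bias shows up clearly at small sample sizes. The simulation below is a sketch with assumed settings (n = 5 draws from a standard normal, 20000 repetitions); on average the n-denominator estimate comes out near (n − 1)/n = 0.8 of the true variance.

```python
import random

rng = random.Random(0)
n, trials = 5, 20000  # assumed simulation settings; true variance is 1.0

def variance(xs, ddof):
    """Sum of squared deviations divided by (len(xs) - ddof)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - ddof)

mle_total, unbiased_total = 0.0, 0.0
for _ in range(trials):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mle_total += variance(sample, ddof=0)       # MLE: divide by n
    unbiased_total += variance(sample, ddof=1)  # Bessel-corrected: divide by n - 1

mle_avg = mle_total / trials
unbiased_avg = unbiased_total / trials

print(mle_avg)       # about 0.8
print(unbiased_avg)  # about 1.0
```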

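A Bayesian alternative to MLE is maximum a posteriori (MAP) estimation: pick $\theta$ that maximizes the posterior $p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta)$. Taking logs:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \left[ \log p(x \mid \theta) + \log p(\theta) \right]$$

The first term is the log-likelihood (what MLE maximizes). The second is a prior penalty. With a Gaussian prior $\theta \sim \mathcal{N}(0, \sigma^2 I)$, the prior log-term is $-\|\theta\|^2 / (2\sigma^2)$ plus a constant: a quadratic penalty on the parameters.

Written as a loss to minimize, that is exactly L2 regularization: a penalty $\lambda \|\theta\|^2$ with $\lambda = 1/(2\sigma^2)$. So weight decay is MAP estimation with an isotropic Gaussian prior on the weights. Cranking up the weight-decay coefficient means tightening the prior; setting it to zero recovers pure MLE.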
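A one-parameter example makes the shrinkage concrete. The sketch assumes a unit-variance Gaussian likelihood for a mean `theta` with an L2 penalty `lam * theta**2`; the data, the helper names `penalized_nll_grad` and `fit`, and the plain gradient-descent loop are all illustrative choices.

```python
def penalized_nll_grad(theta, data, lam):
    """Gradient of 0.5 * sum((x - theta)^2) + lam * theta^2 with respect to theta."""
    return sum(theta - x for x in data) + 2.0 * lam * theta

def fit(data, lam, lr=0.01, steps=5000):
    """Minimize the penalized negative log-likelihood by gradient descent."""
    theta = 0.0
    for _ in range(steps):
        theta -= lr * penalized_nll_grad(theta, data, lam)
    return theta

def closed_form(data, lam):
    """Setting the gradient to zero gives theta = sum(x) / (n + 2 * lam)."""
    return sum(data) / (len(data) + 2.0 * lam)

data = [1.8, 2.2, 2.0, 1.9, 2.1]  # hypothetical observations, mean 2.0

for lam in (0.0, 2.5, 10.0):
    print(lam, fit(data, lam), closed_form(data, lam))
```

With `lam = 0` the fit returns the sample mean (pure MLE); larger values pull the estimate toward the prior mean of zero.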

The same logic gives you Lasso (Laplace prior), elastic net (a prior whose log-density adds a Laplace term and a Gaussian term), and most other regularizers. They are all log-priors in disguise.
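For the Lasso case the derivation is one line. Assuming an independent Laplace prior with scale $b$ on each parameter (the symbols $b$ and $\lambda$ are introduced here for illustration), the negative log-prior is an L1 penalty:

```latex
p(\theta) = \prod_j \frac{1}{2b} \exp\!\left(-\frac{|\theta_j|}{b}\right)
\quad\Longrightarrow\quad
-\log p(\theta) = \frac{1}{b} \sum_j |\theta_j| + \text{const}
= \lambda \|\theta\|_1 + \text{const},
\qquad \lambda = \frac{1}{b}.
```

A smaller scale $b$ means a tighter prior and a larger penalty coefficient, mirroring the Gaussian case.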

When to step away from MLE

MLE has known failure modes that show up in practice:

  • Unbounded likelihood. A Gaussian mixture model can drive likelihood to $+\infty$ by collapsing one component onto a single data point with vanishing variance. MLE is technically undefined; in practice you regularize, cap variances, or switch to a Bayesian formulation.
  • Heavy-tailed data. Gaussian-likelihood MLE (MSE) is dominated by outliers because squared-error penalties grow quadratically in the residual. Switch to MAE (Laplace MLE) or a robust M-estimator.
  • Severe class imbalance. Cross-entropy MLE on imbalanced classification minimizes mean per-example log-loss, which a degenerate “always predict majority” classifier can almost achieve. Re-weight the loss or switch to focal loss.
  • Mis-specified models. If your model class cannot represent the truth, MLE picks the closest member in KL — but “closest in KL” can be far in any other metric. Check held-out fit, not just training likelihood.
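The unbounded-likelihood failure in the first bullet is easy to reproduce. In this sketch (hypothetical data and a fixed two-component, equal-weight mixture), one component is centred exactly on a data point and its standard deviation is shrunk; the log-likelihood keeps climbing.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_log_likelihood(data, mu1, sigma1, mu2, sigma2, weight=0.5):
    """Log-likelihood of a two-component Gaussian mixture."""
    return sum(
        math.log(weight * gaussian_pdf(x, mu1, sigma1)
                 + (1.0 - weight) * gaussian_pdf(x, mu2, sigma2))
        for x in data
    )

data = [-1.2, -0.4, 0.3, 0.9, 2.0]  # hypothetical observations

# Component 1 sits on data[0]; component 2 stays broad and covers the rest.
log_likelihoods = []
for sigma1 in (1.0, 1e-2, 1e-4, 1e-8):
    ll = mixture_log_likelihood(data, mu1=data[0], sigma1=sigma1, mu2=0.3, sigma2=1.0)
    log_likelihoods.append(ll)
    print(sigma1, ll)  # grows without bound as sigma1 shrinks
```

A floor on the component variances, or a prior over them, removes the degenerate solution.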

These edge cases are why deep-learning practice rarely runs pure unregularized MLE. The training objective is almost always MLE plus something: weight decay, dropout, label smoothing, early stopping. Each of those add-ons can be interpreted, at least approximately, as a Bayesian prior or as a tweak to the likelihood model. The MLE skeleton is still underneath.

Go further

Why is MLE equivalent to minimizing cross-entropy?

Take the negative log of the likelihood, average over the data, and you get the empirical cross-entropy $-\frac{1}{n} \sum_{i=1}^{n} \log p(x_i \mid \theta)$. So minimizing cross-entropy is MLE: the two names describe the same optimization from information-theoretic and statistical angles. The same trick maps mean-squared error onto MLE under a Gaussian noise model: the negative log-density of a Gaussian is $(y - \mu)^2 / (2\sigma^2)$ plus a constant.

What does it mean that MLE is 'asymptotically efficient'?

Under regularity conditions, as $n \to \infty$ the MLE is consistent (it converges to the true parameter), asymptotically normal (its sampling distribution becomes Gaussian), and asymptotically efficient (its variance achieves the Cramér-Rao lower bound, so no unbiased estimator can do better). These three properties are why frequentist statistics canonized MLE as the default. They are asymptotic, though: for small $n$, MLE can be biased.

When does MLE fail or need help?

Three classic failure modes. (1) Tiny data: MLE for a Gaussian variance with $n = 1$ is zero, which is absurd; a Bayesian prior or just a small additive constant fixes it. (2) Mis-specified models: if your model class can't represent the truth, MLE picks the closest member by KL divergence, which may be far. (3) Unbounded likelihood: a Gaussian mixture's MLE has likelihood $+\infty$ if one component shrinks onto a single point. Regularization, priors, and weight decay are the standard fixes.
