MLE picks the parameters that maximize the probability of the observed data under the model — equivalently, that minimize negative log-likelihood. Cross-entropy training is MLE under a categorical model.
Maximum likelihood is the principle that almost every modern loss function silently obeys. Given data $x_1, \dots, x_n$ and a parametric model $p_\theta(x)$, the MLE is the parameter value that maximizes the probability of the data under the model:
$$\hat\theta_{\text{MLE}} = \arg\max_\theta \prod_{i=1}^{n} p_\theta(x_i)$$
Switching to the log turns the product into a sum (so it does not underflow with thousands of factors) and turns “maximize” into “minimize negative log-likelihood.” That negative-log-likelihood objective is what gradient descent actually sees during training.
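The underflow point is easy to see numerically. The sketch below uses made-up per-example likelihoods (2000 factors of 0.01, an assumption chosen purely for illustration) and compares the raw product with the sum of logs.

```python
import math

# Hypothetical per-example likelihoods: 2000 factors, each equal to 0.01.
probs = [0.01] * 2000

# The raw product is 1e-4000, far below the smallest positive double,
# so it underflows to exactly 0.0.
likelihood = math.prod(probs)

# The sum of logs stays an ordinary finite number.
log_likelihood = sum(math.log(p) for p in probs)
negative_log_likelihood = -log_likelihood

print(likelihood)               # 0.0 (underflow)
print(negative_log_likelihood)  # about 9210.34, i.e. 2000 * log(100)
```

Once the product has collapsed to zero there is nothing left to differentiate, which is why training code works with the negative log-likelihood directly.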
The unifying view
Every standard supervised loss is MLE under some assumed observation model. Three canonical mappings:
Common losses as MLE
Cross-entropy = MLE for a categorical model. The likelihood of label $y$ under predicted distribution $p_\theta(\cdot \mid x)$ is $p_\theta(y \mid x)$; the negative log-likelihood per example is $-\log p_\theta(y \mid x)$, which is exactly cross-entropy with a one-hot target.
MSE = MLE for a Gaussian observation model with fixed variance. If $y \sim \mathcal{N}(f_\theta(x), \sigma^2)$, the negative log-density is $(y - f_\theta(x))^2 / (2\sigma^2)$ plus a constant. Minimizing MSE is maximizing Gaussian likelihood.
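Mean absolute error = MLE for a Laplace observation model. The Laplace distribution has heavier tails than the Gaussian, so MAE is more robust to outliers: with scale $b$, the Laplace log-density is $-|y - f_\theta(x)| / b$ plus a constant, which falls off linearly in the residual, rather than the Gaussian's quadratic $-(y - f_\theta(x))^2 / (2\sigma^2)$.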
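A small sketch of the three mappings, under assumptions made only for illustration: a toy dataset with one outlier, a constant prediction fitted by grid search over integers, and unit scale parameters. The function names are hypothetical helpers, not any library's API.

```python
import math

def gaussian_nll(y, mu, sigma=1.0):
    # Negative log-density of N(mu, sigma^2): squared error plus a constant.
    return 0.5 * math.log(2 * math.pi * sigma**2) + (y - mu) ** 2 / (2 * sigma**2)

def laplace_nll(y, mu, b=1.0):
    # Negative log-density of Laplace(mu, b): absolute error plus a constant.
    return math.log(2 * b) + abs(y - mu) / b

def categorical_nll(probs, label):
    # Negative log-likelihood of one label under a predicted distribution.
    return -math.log(probs[label])

# Toy data with one outlier; fit a single constant prediction by grid search.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
candidates = [float(m) for m in range(0, 101)]

best_gaussian = min(candidates, key=lambda m: sum(gaussian_nll(y, m) for y in data))
best_laplace = min(candidates, key=lambda m: sum(laplace_nll(y, m) for y in data))

# Cross-entropy against a one-hot target is the categorical NLL.
probs = [0.7, 0.2, 0.1]
one_hot = [1.0, 0.0, 0.0]
cross_entropy = -sum(t * math.log(p) for t, p in zip(one_hot, probs))
nll = categorical_nll(probs, label=0)

print(best_gaussian)  # 22.0, the mean, dragged toward the outlier
print(best_laplace)   # 3.0, the median, which ignores it
```

The Gaussian fit lands on the sample mean and the Laplace fit on the sample median, which is the outlier-robustness argument in miniature.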
The choice of loss is a choice of observation model. When somebody picks MSE without thinking, they have implicitly assumed Gaussian residuals.
Why this is the same problem as KL minimization
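The empirical distribution $\hat p_{\text{data}}$ puts mass $1/n$ on each observed sample. Then the average log-likelihood is
$$\frac{1}{n}\sum_{i=1}^{n} \log p_\theta(x_i) = \mathbb{E}_{x \sim \hat p_{\text{data}}}[\log p_\theta(x)] = -H(\hat p_{\text{data}}, p_\theta),$$
the negative of the empirical cross-entropy from data to model. Add the (constant) entropy $H(\hat p_{\text{data}})$ and you get the negative KL divergence $-\mathrm{KL}(\hat p_{\text{data}} \,\|\, p_\theta)$. So: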
Maximum likelihood = minimum forward KL between the empirical distribution and the model.
This identity is why perplexity works as a language-model evaluation: perplexity is $\exp$ of the cross-entropy, which under MLE training is exactly what the optimizer is pushing down.
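The identity can be checked by hand on a tiny discrete example. The samples and the model probabilities below are invented for illustration; the point is only that average log-likelihood, cross-entropy, entropy, and KL line up as claimed.

```python
import math
from collections import Counter

# Hypothetical observed tokens and a hypothetical model over the same vocabulary.
samples = ["a", "a", "b", "c"]
model = {"a": 0.5, "b": 0.3, "c": 0.2}

n = len(samples)
# Empirical distribution: mass 1/n per sample, pooled by token.
empirical = {token: count / n for token, count in Counter(samples).items()}

avg_log_likelihood = sum(math.log(model[x]) for x in samples) / n
cross_entropy = -sum(p * math.log(model[t]) for t, p in empirical.items())
entropy = -sum(p * math.log(p) for p in empirical.values())
kl = sum(p * math.log(p / model[t]) for t, p in empirical.items())

# Perplexity is the exponential of the cross-entropy (natural log here).
perplexity = math.exp(cross_entropy)

# Average log-likelihood is minus the cross-entropy ...
print(abs(avg_log_likelihood + cross_entropy) < 1e-12)
# ... and adding the data entropy gives minus the KL divergence.
print(abs(avg_log_likelihood + entropy + kl) < 1e-12)
```

Because the entropy term does not depend on the model, pushing the average log-likelihood up and pushing the KL down are the same move.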
What MLE buys you asymptotically
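Under regularity conditions (smooth likelihood, identifiable parameters, true $\theta^*$ in the interior of the parameter space), MLE has three asymptotic properties that make it canonical:
Consistency. $\hat\theta_n \to \theta^*$ in probability as $n \to \infty$.
Asymptotic normality. $\sqrt{n}\,(\hat\theta_n - \theta^*) \xrightarrow{d} \mathcal{N}(0, I(\theta^*)^{-1})$, where $I(\theta^*)$ is the Fisher information of a single observation.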
Asymptotic efficiency. That variance achieves the Cramér-Rao lower bound — no unbiased estimator does asymptotically better.
This is the reason frequentist statistics ended up here. MLE is, in a precise sense, the best you can do with infinite data and a correctly specified model.
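A quick simulation makes the normality and efficiency claims concrete for the simplest case, a Bernoulli parameter, where the MLE is the sample mean and the inverse Fisher information is $p(1-p)$. The true parameter, sample size, trial count, and seed below are arbitrary choices for the sketch.

```python
import math
import random

rng = random.Random(0)   # fixed seed so the run is repeatable
p_true = 0.3
n = 200                  # samples per experiment
trials = 3000            # number of repeated experiments

scaled_errors = []
for _trial in range(trials):
    successes = sum(1 for _draw in range(n) if rng.random() < p_true)
    p_hat = successes / n                      # Bernoulli MLE = sample mean
    scaled_errors.append(math.sqrt(n) * (p_hat - p_true))

mean_err = sum(scaled_errors) / trials
var_err = sum((e - mean_err) ** 2 for e in scaled_errors) / trials

# Inverse Fisher information for one Bernoulli draw: p(1 - p).
inverse_fisher = p_true * (1 - p_true)

print(mean_err)                 # should sit near 0
print(var_err, inverse_fisher)  # the two should be close
```

The spread of the rescaled error matches the inverse Fisher information, which is the Cramér-Rao bound for this model.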
The catch: these are all asymptotic. For finite $n$, MLE can be biased (the canonical example is the MLE for a Gaussian variance, which uses $n$ in the denominator instead of $n - 1$ and is biased downward). And if your model is mis-specified, meaning the truth is not in your model class, MLE converges to the model that minimizes KL to the truth, which may be unsatisfying.
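The variance bias shows up clearly in simulation. This sketch assumes standard-normal data, a deliberately small sample size of 5, and a fixed seed; it averages the divide-by-$n$ and divide-by-$(n-1)$ estimators over many repeated samples.

```python
import random

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 5                    # small sample, where the bias is visible
trials = 20000
true_variance = 1.0

mle_estimates = []
unbiased_estimates = []
for _trial in range(trials):
    xs = [rng.gauss(0.0, 1.0) for _draw in range(n)]
    mean = sum(xs) / n
    sum_sq = sum((x - mean) ** 2 for x in xs)
    mle_estimates.append(sum_sq / n)             # MLE: divide by n
    unbiased_estimates.append(sum_sq / (n - 1))  # Bessel-corrected

avg_mle = sum(mle_estimates) / trials
avg_unbiased = sum(unbiased_estimates) / trials

print(avg_mle)       # expected to land near (n - 1) / n = 0.8
print(avg_unbiased)  # expected to land near the true variance 1.0
```

The MLE's expected value is $(n-1)/n$ times the true variance, so the gap closes as $n$ grows, consistent with the asymptotic guarantees.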
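A Bayesian alternative to MLE is maximum a posteriori (MAP) estimation: pick $\theta$ that maximizes the posterior $p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta)$, where $\mathcal{D}$ is the observed data. Taking logs:
$$\hat\theta_{\text{MAP}} = \arg\max_\theta \left[ \log p(\mathcal{D} \mid \theta) + \log p(\theta) \right]$$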
The first term is the log-likelihood (what MLE maximizes). The second is a prior penalty. With a Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2 I)$, the prior log-term is $-\|\theta\|^2 / (2\tau^2)$ plus a constant: a quadratic penalty on the parameters.
That is exactly L2 weight decay $\lambda \|\theta\|^2$ with $\lambda = 1/(2\tau^2)$. So weight decay is MAP estimation with an isotropic Gaussian prior on the weights. Cranking up the weight-decay coefficient means tightening the prior; setting it to zero recovers pure MLE.
The same logic gives you Lasso (Laplace prior), elastic net (a prior whose log-density combines the Laplace and Gaussian penalties), and most other regularizers. They are all log-priors in disguise.
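For a one-parameter linear model the MAP estimate has a closed form, which makes the weight-decay correspondence easy to inspect. The sketch assumes a model $y = w x + \varepsilon$ with Gaussian noise of standard deviation `noise_sigma` and a zero-mean Gaussian prior on $w$ with standard deviation `prior_tau`; the data and both helper names are hypothetical.

```python
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

def map_slope(xs, ys, noise_sigma=1.0, prior_tau=float("inf")):
    # Minimizer of sum (y - w x)^2 / (2 sigma^2) + w^2 / (2 tau^2).
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    penalty = (noise_sigma / prior_tau) ** 2   # 0 when the prior is flat
    return sxy / (sxx + penalty)

def neg_log_posterior_grad(w, xs, ys, noise_sigma=1.0, prior_tau=float("inf")):
    # Derivative of the objective above with respect to w.
    data_term = -sum(x * (y - w * x) for x, y in zip(xs, ys)) / noise_sigma**2
    prior_term = w / prior_tau**2
    return data_term + prior_term

w_mle = map_slope(xs, ys)                    # flat prior: plain least squares
w_map = map_slope(xs, ys, prior_tau=1.0)     # moderate prior
w_tight = map_slope(xs, ys, prior_tau=0.1)   # tight prior: heavy shrinkage

print(w_mle, w_map, w_tight)
print(neg_log_posterior_grad(w_map, xs, ys, prior_tau=1.0))  # about 0
```

Shrinking `prior_tau` plays the role of raising the weight-decay coefficient: the estimate is pulled toward zero, and an infinitely wide prior gives back the MLE.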
When to step away from MLE
MLE has known failure modes that show up in practice:
Unbounded likelihood. A Gaussian mixture model can drive likelihood to $+\infty$ by collapsing one component onto a single data point with vanishing variance. MLE is technically undefined; in practice you regularize, put a floor on the variances, or switch to a Bayesian formulation.
Heavy-tailed data. Gaussian-likelihood MLE (MSE) is dominated by outliers because squared-error penalties grow quadratically in the residual. Switch to MAE (Laplace MLE), whose penalty grows only linearly, or a robust M-estimator.
Severe class imbalance. Cross-entropy MLE on imbalanced classification minimizes mean per-example log-loss, which a degenerate “always predict majority” classifier can almost achieve. Re-weight the loss or switch to focal loss.
Mis-specified models. If your model class cannot represent the truth, MLE picks the closest member in KL — but “closest in KL” can be far in any other metric. Check held-out fit, not just training likelihood.
These edge cases are why deep-learning practice rarely runs pure unregularized MLE. The training objective is almost always MLE plus something: weight decay, dropout, label smoothing, early stopping. Each of those add-ons is interpretable as a Bayesian prior or as a tweak to the likelihood model. The MLE skeleton is still underneath.
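The unbounded-likelihood failure from the list above can be reproduced in a few lines. This sketch assumes a two-component, equal-weight, one-dimensional Gaussian mixture on made-up data: one component is pinned to the first data point while its standard deviation shrinks, and the other stays broad.

```python
import math

def normal_pdf(x, mean, sigma):
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def mixture_log_likelihood(data, spike_sigma):
    # Component 1: centred on the first data point, width spike_sigma.
    # Component 2: a fixed broad Gaussian that covers all the data.
    total = 0.0
    for x in data:
        density = 0.5 * normal_pdf(x, data[0], spike_sigma) \
                + 0.5 * normal_pdf(x, 2.0, 2.0)
        total += math.log(density)
    return total

data = [0.0, 1.0, 2.0, 3.0, 4.0]
sigmas = [1.0, 1e-2, 1e-4, 1e-8]
log_liks = [mixture_log_likelihood(data, s) for s in sigmas]

for s, ll in zip(sigmas, log_liks):
    print(s, ll)   # log-likelihood keeps climbing as the spike narrows
```

The broad component keeps every other point at a finite density, so nothing stops the spike's contribution from growing without limit. A variance floor or a prior on the variances removes that escape route.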
Go further
Why is MLE equivalent to minimizing cross-entropy?
Take the negative log of the likelihood, average over the data, and you get the empirical cross-entropy $H(\hat p_{\text{data}}, p_\theta) = -\frac{1}{n}\sum_{i=1}^{n} \log p_\theta(x_i)$. So minimizing cross-entropy is MLE; the two names describe the same optimization from information-theoretic and statistical angles. The same trick maps mean-squared error onto MLE under a Gaussian noise model: the negative log-density of a Gaussian is $(y - \mu)^2 / (2\sigma^2)$ plus a constant.
What does it mean that MLE is 'asymptotically efficient'?
Under regularity conditions, as $n \to \infty$ the MLE is consistent (it converges to the true parameter), asymptotically normal (its sampling distribution becomes Gaussian), and asymptotically efficient (its variance achieves the Cramér-Rao lower bound, so no unbiased estimator can do better). These three properties are why frequentist statistics canonized MLE as the default. They are asymptotic, though: for small $n$ MLE can be biased.
Three classic failure modes. (1) Tiny data: MLE for a Gaussian variance with $n = 1$ is zero, which is absurd; a Bayesian prior or just a small additive constant fixes it. (2) Mis-specified models: if your model class can't represent the truth, MLE picks the closest member by KL divergence, which may be far. (3) Unbounded likelihood: a Gaussian mixture's MLE has likelihood $+\infty$ if one component shrinks onto a single point. Regularization, priors, and weight decay are the standard fixes.