Normal Distribution

Also known as: Gaussian distribution, Gaussian, bell curve

TL;DR

The Gaussian is the bell-curve density $p(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)}$. It shows up everywhere because of the central limit theorem.

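The normal distribution (also called the Gaussian or bell curve) is parameterized by mean $\mu$ and variance $\sigma^2$, with density:

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$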

It is the most-studied probability distribution in mathematics. It dominates introductory statistics, classical signal processing, finance, and most of machine learning’s theory. Three independent reasons keep pulling it back to centre stage.

Why it is everywhere

Central limit theorem. Average $n$ independent random variables with finite variance, center and rescale by $\sqrt{n}$, and as $n \to \infty$ the distribution converges to a Gaussian, no matter what the original distribution was. So any quantity that is itself a sum of many small effects looks Gaussian: thermometer noise, body height, the pre-activation at a wide neural-net layer. The CLT is the meta-reason “noise is Gaussian” is a defensible default modeling choice across most of science.
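The convergence described above can be checked numerically. The sketch below is illustrative only: it assumes NumPy, uses Exponential(1) draws as an arbitrary skewed starting distribution, and the helper names `standardized_means` and `skewness` are hypothetical.

```python
import numpy as np

def standardized_means(n, trials=20000, seed=0):
    """Standardized averages of n iid Exponential(1) draws (mean 1, variance 1)."""
    rng = np.random.default_rng(seed)
    samples = rng.exponential(scale=1.0, size=(trials, n))
    # Center at the true mean and rescale by sqrt(n) so the variance stays at 1.
    return (samples.mean(axis=1) - 1.0) * np.sqrt(n)

def skewness(z):
    """Sample skewness; it is 0 for a symmetric law such as the Gaussian."""
    z = z - z.mean()
    return float((z**3).mean() / (z**2).mean() ** 1.5)

for n in (1, 10, 100):
    z = standardized_means(n)
    within_one_sd = float(np.mean(np.abs(z) <= 1))
    print(n, round(skewness(z), 2), round(within_one_sd, 3))
```

As `n` grows, the skewness should shrink toward 0 and the fraction of standardized averages within one unit should approach roughly 0.68, the Gaussian value.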

Maximum entropy. Among all distributions with a given mean and variance, the Gaussian has the largest entropy. So if you assume nothing about a noise source other than its variance, the maximum-entropy / minimum-information principle tells you to model it as Gaussian. This is the information-theoretic counterpart to the CLT.
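A small check of the maximum-entropy claim, using the standard closed-form differential entropies (in nats) of three distributions matched to the same standard deviation. The uniform and Laplace comparisons are chosen here purely for illustration, and the function names are hypothetical.

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy of N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma**2)

def uniform_entropy(sigma):
    """Uniform with standard deviation sigma has width sigma * sqrt(12)."""
    return math.log(sigma * math.sqrt(12))

def laplace_entropy(sigma):
    """Laplace with standard deviation sigma has scale b = sigma / sqrt(2)."""
    return 1.0 + math.log(sigma * math.sqrt(2))

for sigma in (0.5, 1.0, 2.0):
    print(sigma, gaussian_entropy(sigma), laplace_entropy(sigma), uniform_entropy(sigma))
```

At every variance the Gaussian value is the largest of the three, consistent with the statement above; this compares only three families and is not a proof.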

Closure under linear operations. Linear combinations of Gaussians are Gaussian. Sums, differences, and projections all stay in the family. Conditional distributions of joint Gaussians are Gaussian. This algebraic closure is why classical theory — Kalman filters, linear regression, PCA — has clean closed-form answers.
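The two closure properties can be written down directly as parameter updates. This is a minimal sketch assuming NumPy; `affine_transform` and `condition` are hypothetical helper names, and the formulas are the standard ones for affine maps and conditioning of a joint Gaussian.

```python
import numpy as np

def affine_transform(mu, Sigma, A, b):
    """If x ~ N(mu, Sigma), then A x + b ~ N(A mu + b, A Sigma A^T)."""
    return A @ mu + b, A @ Sigma @ A.T

def condition(mu, Sigma, idx_obs, x_obs):
    """Mean and covariance of the unobserved coordinates given the observed ones."""
    idx_obs = np.asarray(idx_obs)
    idx_rest = np.setdiff1d(np.arange(len(mu)), idx_obs)
    S_rr = Sigma[np.ix_(idx_rest, idx_rest)]
    S_ro = Sigma[np.ix_(idx_rest, idx_obs)]
    S_oo = Sigma[np.ix_(idx_obs, idx_obs)]
    # gain = S_ro @ inv(S_oo), computed with a solve (S_oo is symmetric).
    gain = np.linalg.solve(S_oo, S_ro.T).T
    mu_cond = mu[idx_rest] + gain @ (x_obs - mu[idx_obs])
    Sigma_cond = S_rr - gain @ S_ro.T
    return mu_cond, Sigma_cond

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
print(affine_transform(mu, Sigma, np.array([[1.0, 1.0]]), np.array([0.0])))
print(condition(mu, Sigma, [1], np.array([1.0])))
```

Both functions return only a mean and a covariance, which is the point: the result stays in the Gaussian family, so two numbers (or a vector and a matrix) fully describe it.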

The 68/95/99.7 rule

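For a one-dimensional Gaussian with mean $\mu$ and standard deviation $\sigma$:

$$P(|X - \mu| \le \sigma) \approx 0.68, \quad P(|X - \mu| \le 2\sigma) \approx 0.95, \quad P(|X - \mu| \le 3\sigma) \approx 0.997$$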

This is the rule of thumb that converts abstract variance into actionable intuition. If a metric’s per-sample SD is 0.04 NDCG and you observe a +0.03 difference, you are inside one standard deviation of zero, which is easily noise. A +0.08 difference is roughly $2\sigma$, which is borderline significant. Pair this with paired tests for honest reporting (see MLE-driven likelihood ratios for the formal version).
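The percentages in the rule come straight from the Gaussian CDF, which the Python standard library exposes through the error function. The sketch below uses hypothetical helper names and applies the same arithmetic to the NDCG example above.

```python
import math

def mass_within(k):
    """P(|X - mu| <= k * sigma) for a Gaussian."""
    return math.erf(k / math.sqrt(2))

def two_sided_tail(k):
    """P(|X - mu| > k * sigma) for a Gaussian."""
    return math.erfc(k / math.sqrt(2))

def z_score(diff, sd):
    """How many standard deviations an observed difference is from zero."""
    return diff / sd

for k in (1, 2, 3):
    print(k, round(mass_within(k), 4))

# The NDCG example: SD of 0.04, observed difference of +0.03.
z = z_score(0.03, 0.04)
print(z, round(two_sided_tail(z), 3))
```

A z-score of 0.75 leaves a large two-sided tail probability, which is why the +0.03 difference is treated as indistinguishable from noise.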

In high dimensions

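A multivariate Gaussian is parameterized by mean vector $\mu \in \mathbb{R}^d$ and covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$:

$$p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right)$$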

In high dimensions Gaussians develop counter-intuitive geometry. A standard isotropic Gaussian in $d$ dimensions concentrates almost all its mass in a thin shell of radius about $\sqrt{d}$, not near the origin. This “concentration of measure” is what makes the Johnson-Lindenstrauss lemma work: random Gaussian projections approximately preserve pairwise distances precisely because high-dimensional Gaussian samples have predictable norms.
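The thin-shell behaviour is easy to observe by sampling. This sketch assumes NumPy; `norm_stats` is a hypothetical helper that reports the mean and spread of sample norms for a standard isotropic Gaussian.

```python
import numpy as np

def norm_stats(d, n_samples=2000, seed=0):
    """Mean and standard deviation of ||x|| for x ~ N(0, I_d)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, d))
    norms = np.linalg.norm(x, axis=1)
    return float(norms.mean()), float(norms.std())

for d in (10, 100, 1000):
    mean_norm, spread = norm_stats(d)
    # Compare the mean norm with sqrt(d) and look at the relative spread.
    print(d, round(mean_norm, 2), round(float(np.sqrt(d)), 2), round(spread / mean_norm, 3))
```

The mean norm tracks the square root of the dimension while the spread stays roughly constant, so the relative thickness of the shell shrinks as the dimension grows.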

The same geometry underwrites random initialization for neural networks. A weight matrix sampled iid Gaussian acts like a random linear map with controlled spectral norm, which is why Xavier and Kaiming work.

In deep learning specifically

Where Gaussians appear in modern ML

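Initialization. Xavier ($\sigma^2 = 2/(n_{\text{in}} + n_{\text{out}})$) and Kaiming ($\sigma^2 = 2/n_{\text{in}}$), where $n_{\text{in}}$ and $n_{\text{out}}$ are the layer's fan-in and fan-out, draw weights iid from scaled Gaussians to preserve activation variance through depth.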
Regularization. L2 / weight decay is equivalent to a zero-mean Gaussian prior on weights (see below).
Regression losses. MSE is the negative log-likelihood of a Gaussian observation model — assuming Gaussian residuals is what makes squared error the right loss.
VAEs and diffusion. Latent priors are Gaussian; diffusion forward processes add Gaussian noise on a schedule. The reverse process learns to denoise, leveraging the closure properties of Gaussians.
Approximate posteriors. Variational inference frequently approximates intractable posteriors with Gaussian families (mean-field VI).
Embedding analysis. Distributions of embedding components are approximately Gaussian for many models — a useful working assumption for normalization and PCA-like analyses.
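The regression-loss item in the list above can be made concrete: under a Gaussian observation model with fixed noise level, the negative log-likelihood is an affine function of the mean squared error, so both are minimized by the same predictions. A minimal sketch assuming NumPy, with hypothetical function names and a fixed, known `sigma`:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error."""
    return float(np.mean((y - y_hat) ** 2))

def gaussian_nll(y, y_hat, sigma=1.0):
    """Mean negative log-likelihood of y under N(y_hat, sigma^2)."""
    resid = y - y_hat
    const = 0.5 * np.log(2 * np.pi * sigma**2)
    return float(np.mean(const + resid**2 / (2 * sigma**2)))

y = np.array([1.0, 2.0, 4.0])
y_hat = np.array([1.5, 2.0, 3.0])
sigma = 0.5
# NLL = constant + MSE / (2 * sigma^2), so the two losses rank predictions identically.
print(gaussian_nll(y, y_hat, sigma))
print(0.5 * np.log(2 * np.pi * sigma**2) + mse(y, y_hat) / (2 * sigma**2))
```

The two printed values should agree, since the second line is just the first rearranged.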

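Maximum a posteriori (MAP) estimation maximizes $\log p(D \mid w) + \log p(w)$. With a zero-mean Gaussian prior $w \sim \mathcal{N}(0, \tau^2 I)$:

$$\log p(w) = -\frac{1}{2\tau^2}\|w\|^2 + \text{const}$$

Adding this to the log-likelihood and negating gives the loss

$$L(w) = -\log p(D \mid w) + \frac{1}{2\tau^2}\|w\|^2,$$

which is the original loss plus an L2 penalty with strength $\lambda = 1/(2\tau^2)$. So weight decay with coefficient $\lambda$ is MAP estimation under a Gaussian prior with variance $\tau^2 = 1/(2\lambda)$. Cranking up $\lambda$ tightens the prior toward zero; setting it to zero recovers pure maximum likelihood.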
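For linear regression the MAP estimate has a closed form, which makes the prior-as-penalty equivalence easy to verify. The sketch below assumes NumPy, unit-variance Gaussian observation noise, and a prior standard deviation called `tau` here; the function names are hypothetical.

```python
import numpy as np

def map_linear_regression(X, y, tau):
    """MAP weights for y = X w + unit-variance Gaussian noise, prior w ~ N(0, tau^2 I)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + np.eye(d) / tau**2, X.T @ y)

def loss(w, X, y, tau):
    """Negative log-posterior up to a constant: squared-error NLL plus an L2 penalty."""
    nll = 0.5 * np.sum((X @ w - y) ** 2)
    penalty = np.sum(w**2) / (2 * tau**2)
    return float(nll + penalty)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)

for tau in (0.01, 1.0, 10.0):
    w = map_linear_regression(X, y, tau)
    # A tighter prior (smaller tau) is a stronger penalty, so the weights shrink.
    print(tau, np.round(w, 3), round(float(np.linalg.norm(w)), 3))
```

Shrinking `tau` pulls the solution toward zero, and a very large `tau` approaches the ordinary least-squares fit, matching the weight-decay reading of the prior.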

This is the cleanest example of “regularization is a prior”, a principle that generalizes to L1 (Laplace prior), elastic net (a prior whose density is the product of a Gaussian and a Laplace term), and many other penalty terms used in deep learning.

When the Gaussian assumption breaks

The CLT gives you Gaussianity only when individual contributions have finite variance and are roughly comparable in size. Several common situations break it:

Heavy tails: financial returns, network latencies, and word frequencies follow power laws or log-normal distributions. A few rare extreme events dominate. Modeling these as Gaussian understates tail risk dramatically.
Multi-modal distributions: bimodal data (e.g., scores from two mixed populations) cannot be summarized by a single mean and variance. Gaussian mixture models exist for exactly this reason.
Bounded support: a Gaussian assigns nonzero density to every real number. For probabilities, proportions, or counts, Beta / Dirichlet / Poisson distributions are more appropriate.
Adversarial inputs: under attack, your input distribution is whatever an adversary chose. The CLT assumes independent sampling; a worst-case sequence picked by an adversary is not an independent sample from any fixed distribution.
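To get a feel for how badly a Gaussian can understate tail risk, compare tail probabilities against a heavy-tailed law. The standard Cauchy is used below only as a convenient textbook example with a closed-form CDF; it is an assumption of this sketch, not one of the distributions named in the list, and `k` is measured in units of each distribution's own scale parameter.

```python
import math

def gaussian_tail(k):
    """P(|X| > k) for a standard Gaussian."""
    return math.erfc(k / math.sqrt(2))

def cauchy_tail(k):
    """P(|X| > k) for a standard Cauchy, whose tails decay like 1/k."""
    return 1.0 - (2.0 / math.pi) * math.atan(k)

for k in (3, 5, 10):
    g, c = gaussian_tail(k), cauchy_tail(k)
    print(k, g, c, c / g)
```

The ratio in the last column grows rapidly with `k`: events a Gaussian model treats as essentially impossible remain routine under a power-law tail.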

The default in ML is still “Gaussian unless I know otherwise,” and that default is usually fine. But knowing why it is the default — and where it stops being defensible — is what separates calibrated uncertainty estimates from confident-sounding nonsense.

Go further

Why does the central limit theorem make Gaussians ubiquitous?

The CLT says: if you average $n$ independent random variables with finite variance, the distribution of that average, once centered and rescaled by $\sqrt{n}$, converges to a Gaussian as $n \to \infty$, regardless of the original distribution. So any quantity that is itself a sum or average of many small independent contributions (measurement noise, network activations, gradient updates) tends to look Gaussian. The pre-activation of a wide neural-net layer is approximately Gaussian for the same reason.

What does '68/95/99.7' actually mean?

For a Gaussian with mean $\mu$ and standard deviation $\sigma$: about 68% of the mass is within $1\sigma$ of $\mu$, 95% within $2\sigma$, and 99.7% within $3\sigma$. This is the rule of thumb that turns abstract probabilities into intuition. A $3\sigma$ event happens about 3 times in 1000; a $5\sigma$ event about once in 1.7 million. Note: tails decay as $e^{-x^2/(2\sigma^2)}$, much faster than power-law distributions where extreme events are common.

Why do Xavier and Kaiming initialization use Gaussians?

Both schemes initialize weights from Gaussians scaled to keep the variance of activations approximately constant across layers. Xavier ($\sigma^2 = 2/(n_{\text{in}} + n_{\text{out}})$, with $n_{\text{in}}$ and $n_{\text{out}}$ the layer's fan-in and fan-out) targets symmetric activations like tanh; Kaiming ($\sigma^2 = 2/n_{\text{in}}$) accounts for the fact that ReLU zeros out half its input. The Gaussian itself is convenient, but the real job is variance preservation, so that signals neither vanish nor explode through depth.
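The variance-preservation argument can be seen in a toy forward pass. This sketch assumes NumPy and a stack of square, bias-free ReLU layers; `forward_second_moment` is a hypothetical helper, and the comparison variance of one over the fan-in is included only as a contrast to the Kaiming scaling.

```python
import numpy as np

def forward_second_moment(depth, width, weight_var, seed=0):
    """Push a random input through `depth` ReLU layers; return the mean squared activation."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(width)
    for _ in range(depth):
        # Weights are iid Gaussian with the requested variance.
        W = rng.standard_normal((width, width)) * np.sqrt(weight_var)
        h = np.maximum(W @ h, 0.0)
    return float(np.mean(h**2))

width, depth = 256, 20
print("Kaiming, var = 2 / fan_in:", forward_second_moment(depth, width, 2.0 / width))
print("var = 1 / fan_in:         ", forward_second_moment(depth, width, 1.0 / width))
```

With the factor of 2, the signal's scale stays within the same order of magnitude after 20 layers; without it, ReLU discards about half the second moment at each layer and the signal shrinks geometrically.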
