Entropy

Also known as: Shannon entropy, information entropy, H(X)

TL;DR

Entropy is the minimum average number of nats (or bits) needed to encode samples from $p$. It is the unit of uncertainty.

For a discrete distribution $p$ over outcomes $x$:

$$H(p) = -\sum_x p(x) \log p(x)$$

Read aloud: the expected surprisal of a sample from $p$. The unit is nats when the log is natural, bits when it is base-2. Everything else in information theory, and in any ML loss with a log in it, is a corollary.

[Figure: Entropy. Uncertainty has a number. Three distributions over 6 outcomes, each shown with its per-outcome terms $-p_i \log p_i$. Uniform (every $p_i \approx 0.17$, each term 0.299): $H = 1.792$ nats. Peaked ($p$ = 0.70, 0.10, 0.10, 0.05, 0.03, 0.02; terms 0.250, 0.230, 0.230, 0.150, 0.105, 0.078): $H = 1.043$ nats. Delta ($p$ = 1, 0, 0, 0, 0, 0; every term 0): $H = 0.000$ nats. Inset: the kernel $-x \log x$ for $x \in [0, 1]$, with maximum value $1/e \approx 0.368$ at $x = 1/e \approx 0.368$. Caption: $H(p) = \sum_i -p_i \log p_i$, the sum of per-outcome surprise.]
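The three panel values can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the helper name `entropy` is an illustrative choice, and $0 \log 0$ is taken to be 0 by convention.

```python
import math

def entropy(p, base=math.e):
    """Shannon entropy of a discrete distribution; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(1 / pi, base) for pi in p if pi > 0)

uniform = [1 / 6] * 6
peaked = [0.70, 0.10, 0.10, 0.05, 0.03, 0.02]
delta = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

for name, p in [("uniform", uniform), ("peaked", peaked), ("delta", delta)]:
    # Same quantity in two units: natural log gives nats, base-2 log gives bits.
    print(f"{name:8s} H = {entropy(p):.3f} nats = {entropy(p, 2):.3f} bits")
```

Switching the base only rescales the result by a constant factor ($1/\ln 2$ from nats to bits), which is why the choice of unit never changes which distribution is more uncertain.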

The three properties that matter

Properties of $H$
  • Non-negative. $H(p) \ge 0$, with equality iff $p$ is a delta (one outcome with probability 1). A deterministic variable carries no uncertainty.
  • Maximum at uniform. $H(p) \le \log n$ for a distribution over $n$ outcomes, with equality iff every outcome is equally likely. Uniform = maximally uncertain.
  • Additive on independent variables. $H(X, Y) = H(X) + H(Y)$ when $X \perp Y$. Independent randomness composes by addition; this is why entropies of stacked iid samples scale linearly with sequence length.

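These three are close in spirit to the axioms Shannon used to derive the formula. His own axioms were continuity in the probabilities, growth with the number of outcomes on uniform distributions, and a grouping rule: splitting a choice into successive sub-choices adds their entropies, weighted by probability. Under those axioms, $H$ is the unique answer up to a multiplicative constant (the choice of log base). The grouping rule is what does the work; vanishing on certainties, peaking on uniform, and adding across independent variables are not enough on their own, since Rényi entropies satisfy all three as well.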
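A quick numerical check of the three properties, as a sketch rather than a proof. The helper names `entropy` and `random_dist` and the fixed random seed are illustrative assumptions.

```python
import math
import random

def entropy(p):
    """Shannon entropy in nats; zero-probability outcomes contribute 0."""
    return sum(pi * math.log(1 / pi) for pi in p if pi > 0)

def random_dist(n, rng):
    """A random distribution over n outcomes (normalized uniform weights)."""
    w = [rng.random() for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

rng = random.Random(0)
px = random_dist(4, rng)
py = random_dist(3, rng)
# Independence means the joint factorizes: p(x, y) = p(x) p(y).
joint = [a * b for a in px for b in py]

nonneg = entropy(px) >= 0
below_max = entropy(px) <= math.log(len(px))
additive = math.isclose(entropy(joint), entropy(px) + entropy(py))
print(nonneg, below_max, additive)
```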

The coding interpretation

Shannon’s source coding theorem makes this concrete: the minimum expected code length for losslessly encoding draws from $p$ is $H(p)$ bits per symbol (with the log taken base 2), reached in the limit of coding long blocks of symbols. Huffman coding on single symbols gets within one bit per symbol; arithmetic coding closes the gap asymptotically. So entropy is literally the floor on compression.
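The "within one bit" claim for Huffman coding can be checked directly. The following is a minimal sketch that builds Huffman code lengths with a heap and compares the expected length to the entropy in bits. The function names and the example distribution are illustrative assumptions, and only code lengths are computed, not the codewords themselves.

```python
import heapq
import math

def entropy_bits(p):
    """Shannon entropy in bits."""
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

def huffman_lengths(p):
    """Code length in bits that Huffman coding assigns to each symbol."""
    # Heap items: (subtree probability, tie-breaker, symbols in the subtree).
    heap = [(pi, i, [i]) for i, pi in enumerate(p)]
    heapq.heapify(heap)
    lengths = [0] * len(p)
    counter = len(p)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

p = [0.70, 0.10, 0.10, 0.05, 0.03, 0.02]
lengths = huffman_lengths(p)
avg_len = sum(pi * li for pi, li in zip(p, lengths))
print(f"H = {entropy_bits(p):.3f} bits, Huffman average = {avg_len:.3f} bits")
```

The gap between the two numbers is the cost of rounding code lengths to whole bits per symbol; coding blocks of symbols, or using arithmetic coding, shrinks it.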

Entropy is the minimum description length per sample. Every “minimum description length” framework — MDL model selection, Solomonoff induction, Bayesian model averaging at the Occam-factor limit — reduces to this single quantity.

Why every log-loss in ML descends from this

Three corollaries you have already used without naming:

Entropy's children
  • Cross-entropy, $H(p, q) = -\sum_x p(x) \log q(x)$: the expected code length when you encode draws from $p$ using a code optimized for $q$. This is the loss function for classification and next-token prediction.
  • KL divergence, $D_{\mathrm{KL}}(p \,\|\, q) = H(p, q) - H(p)$: the extra bits paid for using $q$’s code on $p$’s data. Drives RLHF, distillation, variational inference.
  • Mutual information, $I(X; Y) = H(X) - H(X \mid Y)$: how much uncertainty about $X$ is removed by observing $Y$. The objective behind contrastive learning and InfoNCE.

Each of these is a one-line algebraic manipulation of $H$. If you can derive entropy from scratch you can derive all three; if you cannot, you are memorizing formulas that happen to work.
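To make the "one-line manipulation" concrete, here is a sketch of cross-entropy, KL divergence and mutual information written purely in terms of an `entropy` helper. All function names and the example distributions are illustrative assumptions. Mutual information is computed as $H(X) + H(Y) - H(X, Y)$, which equals $H(X) - H(X \mid Y)$ because $H(X \mid Y) = H(X, Y) - H(Y)$.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return sum(pi * math.log(1 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Expected surprisal of q's code on p's data; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(1 / qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL divergence: cross-entropy minus entropy."""
    return cross_entropy(p, q) - entropy(p)

def mutual_information(joint):
    """I(X; Y) for a joint table joint[x][y], as H(X) + H(Y) - H(X, Y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    flat = [v for row in joint for v in row]
    return entropy(px) + entropy(py) - entropy(flat)

p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]
correlated = [[0.4, 0.1], [0.1, 0.4]]
independent = [[0.25, 0.25], [0.25, 0.25]]

print(f"H(p, q) = {cross_entropy(p, q):.4f}, KL(p || q) = {kl(p, q):.4f}")
print(f"I correlated = {mutual_information(correlated):.4f}")
print(f"I independent = {mutual_information(independent):.4f}")
```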

Continuous variables: differential entropy

For a continuous density $p(x)$:

$$h(p) = -\int p(x) \log p(x) \, dx$$

This is differential entropy and it is subtler than the discrete case. It can be negative (a tight Gaussian has $h < 0$), and it is not invariant under change of variables. The right invariant quantity is relative entropy (KL divergence), which always behaves itself. When you see “entropy of a Gaussian” in a paper, it is almost always differential entropy, and the practitioner is implicitly relying on KL canceling out the unit-of-measure pathology.
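Both pathologies, and the way KL avoids them, show up in the one-dimensional Gaussian case, where closed forms are standard: $h = \tfrac{1}{2} \log(2 \pi e \sigma^2)$ for the differential entropy, and the usual expression for the KL divergence between two Gaussians. The sketch below uses those formulas; the function names and the example parameters are illustrative assumptions.

```python
import math

def gaussian_differential_entropy(sigma):
    """Differential entropy in nats of a 1-D Gaussian with standard deviation sigma."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def gaussian_kl(mu1, sigma1, mu2, sigma2):
    """KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) in nats."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)

# A tight Gaussian has negative differential entropy.
h_tight = gaussian_differential_entropy(0.1)

# Changing units (x -> 10 x) shifts differential entropy by log 10 ...
shift = gaussian_differential_entropy(1.0) - gaussian_differential_entropy(0.1)

# ... but leaves KL unchanged when both distributions are rescaled together.
kl_before = gaussian_kl(0.0, 1.0, 1.0, 2.0)
kl_after = gaussian_kl(0.0, 10.0, 10.0, 20.0)

print(h_tight, shift, kl_before, kl_after)
```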

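Suppose you want a distribution $p$ over classes with $\mathbb{E}_p[f(x)] = \mu$ for some feature function $f$; that is, $p$ matches a moment of the data. Among all distributions satisfying that constraint, the one with maximum entropy is exponential-family:

$$p(x) = \frac{\exp(\theta^\top f(x))}{\sum_{x'} \exp(\theta^\top f(x'))}$$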

For classification with $f(x) = e_x$ (a one-hot), this is exactly $\mathrm{softmax}(\theta)$ over the classes. So softmax isn’t just a convenient parameterization; it is the unique maximum-entropy distribution consistent with linear features. That is why pairing softmax with cross-entropy gives well-behaved gradients: you are doing constrained maximum entropy, which is dual to maximum likelihood on a log-linear model.
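One concrete reading of "well-behaved gradients" is the standard identity that the gradient of softmax cross-entropy with respect to the logits is the predicted distribution minus the one-hot target, so every component lies in $[-1, 1]$. The sketch below checks that identity against central finite differences; the function names, example logits and step size are illustrative assumptions.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_loss(z, y):
    """Negative log-probability that softmax(z) assigns to the true class y."""
    return -math.log(softmax(z)[y])

z = [2.0, -1.0, 0.5]
y = 0

# Analytic gradient with respect to the logits: softmax(z) - onehot(y).
analytic = [pi - (1.0 if i == y else 0.0) for i, pi in enumerate(softmax(z))]

# Central finite differences as an independent check.
eps = 1e-6
numeric = []
for i in range(len(z)):
    z_plus = z[:i] + [z[i] + eps] + z[i + 1:]
    z_minus = z[:i] + [z[i] - eps] + z[i + 1:]
    numeric.append((cross_entropy_loss(z_plus, y) - cross_entropy_loss(z_minus, y)) / (2 * eps))

print(analytic)
print(numeric)
```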

The one-sentence summary

Entropy is the average surprisal of a sample. Cross-entropy is the average surprisal under the wrong code. KL is the difference. Perplexity is the exponential. Mutual information is the conditional reduction. Every information-theoretic quantity in ML is one of those four manipulations of $H$, and if you have $H$ you have them all.

Go further

Why is the maximum of entropy at the uniform distribution?

Concavity of the log. By Jensen's inequality, $H(p) = \mathbb{E}_p\left[\log \frac{1}{p(x)}\right] \le \log \mathbb{E}_p\left[\frac{1}{p(x)}\right] = \log n$, with equality iff $p$ is uniform. Intuition: the uniform distribution is maximally non-committal. Every outcome is equally surprising, so the average surprisal is as large as the alphabet allows. Any concentration of mass shortens the description on average.

What is the relationship between entropy and perplexity?

Perplexity is $\exp(H)$ (or $2^H$ when $H$ is in bits). It converts a log-scale uncertainty into a linear 'effective number of choices.' A language model with cross-entropy 3.0 nats has perplexity ≈20: on average it is as uncertain as if choosing uniformly among 20 next tokens. Reporting perplexity instead of raw cross-entropy is mostly cosmetic, but the linear unit is easier to compare across vocabulary sizes.
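The conversion is a single exponentiation. A small sketch follows; the function name `perplexity` and its `base` parameter are illustrative choices.

```python
import math

def perplexity(cross_entropy, base=math.e):
    """Perplexity from a cross-entropy measured with logs in the given base."""
    return base ** cross_entropy

nats = 3.0
bits = nats / math.log(2)  # the same uncertainty expressed in bits

ppl_from_nats = perplexity(nats)
ppl_from_bits = perplexity(bits, base=2)

# A uniform choice among 20 tokens has cross-entropy log(20), hence perplexity 20.
ppl_uniform_20 = perplexity(math.log(20))

print(ppl_from_nats, ppl_from_bits, ppl_uniform_20)
```

The result does not depend on the unit, provided the base of the exponent matches the base of the log used for the cross-entropy.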

Where does the coding interpretation come from?

Shannon's source coding theorem. The minimum expected code length for symbols drawn from $p$ is $H(p)$ bits per symbol (log base 2), achievable in the limit by Huffman coding on long blocks or by arithmetic coding. So entropy literally is the floor on lossless compression. 'Minimum description length' frameworks (MDL, Solomonoff induction, even regularized MLE) all reduce to this: the best model is the one whose code for the data plus its own code is shortest.
