Also known as: Kullback-Leibler divergence, relative entropy, KL
TL;DR
KL divergence measures how far one distribution is from another, in nats. It is asymmetric, non-negative, and zero only when the two distributions are identical.
KL divergence is how machine learning measures the gap between two probability distributions. The formula:
KL(P || Q) = Σ p(x) log( p(x) / q(x) )
Read aloud: the expected log-ratio between p(x) and q(x), taken under P. It is non-negative, equals zero exactly when P = Q, and grows as the two distributions diverge. The unit is nats when the log is natural, bits when it is base-2.
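The formula can be checked numerically. The following is a minimal sketch, assuming NumPy and two small made-up distributions; the helper name `kl_divergence` and its `base` parameter are illustrative, not taken from any particular library.

```python
import numpy as np

def kl_divergence(p, q, base=np.e):
    """KL(P || Q) for two discrete distributions given as probability arrays."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing, by convention
    with np.errstate(divide="ignore"):
        # q(x) = 0 where p(x) > 0 yields inf, which is the correct value
        log_ratio = np.log(p[mask] / q[mask])
    # Dividing by log(base) converts from nats to the requested unit
    return float(np.sum(p[mask] * log_ratio) / np.log(base))

# Hypothetical example distributions over three outcomes
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

kl_nats = kl_divergence(p, q)          # natural log: nats
kl_bits = kl_divergence(p, q, base=2)  # base-2 log: bits
kl_self = kl_divergence(p, p)          # identical distributions: zero

print(kl_nats, kl_bits, kl_self)
```

The value in bits is the value in nats divided by ln 2, so the two numbers describe the same gap in different units.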
Why it is not a distance
KL is asymmetric: KL(P || Q) and KL(Q || P) are generally different numbers and have different geometric meanings. It also fails the triangle inequality. So mathematicians call it a divergence, not a metric. The asymmetry is not a bug; it is the whole point.
KL(P || Q) is finite only when the support of P lies inside the support of Q. If P puts mass somewhere Q puts zero, the divergence is infinite: q(x) = 0 makes the log blow up. This is why direction matters: forward and reverse KL care about different failure modes.
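Both the asymmetry and the support condition are easy to see on toy numbers. This is a small sketch, assuming NumPy; the helper `kl` and all four distributions are invented for illustration.

```python
import numpy as np

def kl(p, q):
    """KL(P || Q) in nats for discrete distributions (illustrative helper)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # outcomes with p(x) = 0 contribute nothing
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Asymmetry: same pair of distributions, two different numbers
a = [0.8, 0.2]
b = [0.5, 0.5]
kl_ab = kl(a, b)
kl_ba = kl(b, a)

# Support mismatch: P has no mass on the third outcome, Q does
p = [0.5, 0.5, 0.0]
q = [0.9, 0.05, 0.05]
covered = kl(p, q)    # finite: Q is positive wherever P is
uncovered = kl(q, p)  # infinite: P is zero where Q has mass

print(kl_ab, kl_ba, covered, uncovered)
```

Swapping the arguments changes which distribution is allowed to have zeros, which is the whole difference between the two directions.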
Forward vs reverse KL
This is the practical distinction that bites:
Forward KL — mode-covering. The model is penalized infinitely if it puts zero mass anywhere the data has mass, so it spreads out to cover every mode. Maximum-likelihood training is forward KL: it makes the model average over all plausible explanations.
Reverse KL — mode-seeking. The model is penalized for putting mass anywhere the data does not, so it collapses onto a single high-probability mode. RL fine-tuning with a KL penalty leans this direction.
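The two behaviours can be reproduced with a toy experiment: fit a single Gaussian Q to a two-mode target P by brute-force search, once per direction. This is only a sketch under stated assumptions: the target mixture, the integration grid, and the search ranges are arbitrary choices, and the grid search stands in for a real optimizer.

```python
import numpy as np

# Grid for numerical integration (range and resolution are arbitrary choices).
x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]

def gaussian(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Toy "data" distribution P: two well-separated modes at -3 and +3.
p = 0.5 * gaussian(x, -3.0, 0.7) + 0.5 * gaussian(x, 3.0, 0.7)

def kl_on_grid(a, b, floor=1e-300):
    """Riemann-sum approximation of KL(a || b); the floor avoids log(0)."""
    a = np.maximum(a, floor)
    b = np.maximum(b, floor)
    return float(np.sum(a * np.log(a / b)) * dx)

best_forward = (np.inf, None, None)  # (KL value, mu, sigma)
best_reverse = (np.inf, None, None)
for mu in np.linspace(-5.0, 5.0, 41):
    for sigma in np.linspace(0.3, 5.0, 48):
        q = gaussian(x, mu, sigma)
        forward = kl_on_grid(p, q)  # KL(P || Q): data first
        reverse = kl_on_grid(q, p)  # KL(Q || P): model first
        if forward < best_forward[0]:
            best_forward = (forward, mu, sigma)
        if reverse < best_reverse[0]:
            best_reverse = (reverse, mu, sigma)

print("forward KL fit: mu=%.2f sigma=%.2f" % best_forward[1:])
print("reverse KL fit: mu=%.2f sigma=%.2f" % best_reverse[1:])
```

In this setup the forward fit should land between the modes with a wide standard deviation, while the reverse fit should sit on one of the two modes with a narrow one. Which mode the reverse fit picks is arbitrary, since the target is symmetric.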
Where it shows up
KL in production ML
Cross-entropy training — minimizing cross-entropy is forward KL between data and model, up to an additive constant.
RLHF and DPO — a KL(π_new || π_ref) term keeps the fine-tuned policy from drifting too far from the base model.
Knowledge distillation — the student is trained to match the teacher’s softmax outputs via a KL term between the two distributions (or symmetric variants).
Variational inference — the ELBO is a lower bound derived from KL between an approximate posterior and the true one.
Calibration evaluation — reliability diagrams and ECE are summary statistics; a KL between predicted and empirical bin frequencies is a fuller measure.
The relationship to entropy
KL decomposes into entropy and cross-entropy:
KL(P || Q) = H(P, Q) - H(P)
Where H(P, Q) is cross-entropy and H(P) is the entropy of P. Since H(P) is fixed by the data, minimizing cross-entropy and minimizing forward KL are the same optimization problem. The information-theoretic story is that KL measures the extra code length, beyond the optimal code for P, that you pay for using a model Q instead of the truth.
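The decomposition can be verified on a small example. A minimal sketch, assuming NumPy and two made-up distributions; base-2 logs are used so the quantities read as code lengths in bits.

```python
import numpy as np

# Hypothetical distributions: P plays the role of the data, Q the model
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

entropy = -np.sum(p * np.log2(p))        # H(P): optimal average code length
cross_entropy = -np.sum(p * np.log2(q))  # H(P, Q): code length when coding with Q
kl = np.sum(p * np.log2(p / q))          # KL(P || Q), computed directly

extra_bits = cross_entropy - entropy     # should match kl up to rounding
print(entropy, cross_entropy, kl, extra_bits)
```

Because H(P) does not depend on Q, any change to Q moves cross-entropy and KL by exactly the same amount.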
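In RLHF, the policy is trained to maximize expected reward minus a KL penalty: E[r] − β · KL(π_new || π_ref). The reward r pulls the policy toward high-reward responses; the KL penalty pulls it back toward the reference (typically the supervised fine-tuned model). The hyperparameter β is the leash length.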
Why is this needed? Because reward models are imperfect — they have adversarial examples and out-of-distribution failures. Without the KL anchor, a policy will find and exploit reward-model bugs, producing high-reward gibberish. The KL term keeps the policy on the manifold of plausible language by penalizing it for putting probability mass where the reference would not.
DPO implements the same KL-regularized RL objective in closed form, without an explicit reward model — but the mathematical role of the KL term is identical. Tuning β is one of the main knobs in alignment training: too low and the policy drifts; too high and the reward signal can’t move it.
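The trade-off controlled by β can be sketched with a Monte Carlo estimate of a KL-regularized objective, commonly written as expected reward minus β times KL(π_new || π_ref). Everything here is an assumption for illustration: the function name, the sequence-level log-probabilities, and the toy numbers. Real RLHF implementations typically work per token and differ in detail.

```python
import numpy as np

def kl_regularized_objective(rewards, logp_policy, logp_ref, beta):
    """Estimate E[r] - beta * KL(pi_new || pi_ref) from responses sampled from pi_new.

    logp_policy and logp_ref are log-probabilities of the same sampled
    responses under the current policy and the reference model.
    """
    rewards = np.asarray(rewards, dtype=float)
    log_ratio = np.asarray(logp_policy, dtype=float) - np.asarray(logp_ref, dtype=float)
    # For samples drawn from pi_new, the mean log-ratio estimates KL(pi_new || pi_ref)
    kl_estimate = float(log_ratio.mean())
    objective = float(rewards.mean() - beta * kl_estimate)
    return objective, kl_estimate

# Hypothetical batch of three sampled responses
rewards = [1.0, 2.0, 3.0]
logp_policy = [-2.0, -1.5, -1.0]
logp_ref = [-2.5, -2.5, -3.0]

loose, kl_est = kl_regularized_objective(rewards, logp_policy, logp_ref, beta=0.1)
tight, _ = kl_regularized_objective(rewards, logp_policy, logp_ref, beta=1.0)
print(loose, tight, kl_est)
```

With the same batch, a larger β subtracts more for the same drift from the reference, so the policy has less incentive to chase reward far from it.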
KL divergence fits in one line but ripples through every part of modern ML. If you cannot derive forward versus reverse KL by hand, you cannot reason about why your fine-tuning run is collapsing, why your distillation is over-smoothing, or why your variational posterior is too tight.
Go further
Why is KL divergence not a true distance?
It violates two metric axioms. It is asymmetric — KL(P || Q) ≠ KL(Q || P) in general — and it does not satisfy the triangle inequality. It does satisfy non-negativity and KL(P || P) = 0, which is why it is still a useful divergence even though it is not a metric.
RLHF adds a KL penalty, KL(π_new || π_ref), to the policy objective to keep the fine-tuned model close to the base; DPO optimizes the same KL-regularized objective without computing the penalty explicitly. Knowledge distillation classically minimizes KL(teacher || student) over softmax outputs. Variational inference minimizes the KL between an approximate posterior and the true posterior.
The direction matters dramatically. Forward KL, KL(P || Q), is mode-covering: Q spreads to cover every mode of P. Reverse KL, KL(Q || P), is mode-seeking: Q collapses onto a single mode. The choice of direction determines whether your fitted distribution is broad or sharp.