Also known as: MI, Shannon mutual information, I(X;Y)
TL;DR
Mutual information is the reduction in uncertainty about X once you observe Y. It is the symmetric, information-theoretic measure of how much two variables share.
Mutual information measures how much knowing Y reduces uncertainty about X (and vice versa, since it is symmetric). The clean definition:
I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X)
Equivalently, it is the KL divergence between the joint distribution p(x, y) and the product of marginals p(x) p(y):
I(X; Y) = KL( p(x, y) || p(x) p(y) )
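As a quick check that the two definitions agree, here is a minimal sketch for discrete variables. The joint table and the helper names (`entropy`, `mutual_information`) are made up for illustration; the only assumption is that the joint distribution is known exactly.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a probability array (zero entries are skipped)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mutual_information(joint):
    """I(X; Y) in nats, computed as KL(p(x, y) || p(x) p(y)) for a discrete joint table."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal p(x), shape (nx, 1)
    py = joint.sum(axis=0, keepdims=True)  # marginal p(y), shape (1, ny)
    mask = joint > 0                       # 0 * log(0) terms contribute nothing
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

# Hypothetical joint distribution over two binary variables (rows: x, columns: y).
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

mi_kl = mutual_information(joint)

# Entropy form: H(X) - H(X | Y), using H(X | Y) = H(X, Y) - H(Y).
h_x = entropy(joint.sum(axis=1))
h_y = entropy(joint.sum(axis=0))
h_xy = entropy(joint)
mi_entropy = h_x - (h_xy - h_y)

# A product of marginals has zero mutual information.
independent = np.outer([0.5, 0.5], [0.3, 0.7])
mi_independent = mutual_information(independent)

print(mi_kl, mi_entropy, mi_independent)
```

For this table both forms give roughly 0.19 nats, and the independent table gives zero up to floating-point error.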
Zero mutual information means X and Y are independent. Larger MI means they share more structure. Unlike correlation, mutual information catches non-linear relationships, and almost everything a neural network learns is non-linear.
Why it matters for representation learning
Modern embedding training is, at heart, a mutual-information maximization. A query and its true positive document share something: a topic, an entity, an answer. Training an encoder to maximize similarity for matched pairs and minimize it for mismatched pairs produces representations where similarity carries semantic signal.
Direct estimation of I(X; Y) from samples is hard: it is a high-dimensional density-ratio estimation problem. The breakthrough that powers most modern embedders is that you can lower-bound mutual information with a tractable loss:
InfoNCE, the contrastive loss behind most production embedders, provably yields a lower bound on I(X; Y): with N candidates per query, I(X; Y) ≥ log N - L_InfoNCE. Minimizing the loss pushes the bound up, and the true MI can never sit below the bound, so the embeddings become more informative.
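To make the bound concrete, here is a minimal NumPy sketch of InfoNCE with in-batch negatives. It assumes cosine similarity as the critic and a temperature of 0.1. The function name `info_nce` and the synthetic data are illustrative, not taken from any particular library. With N candidates per query, the quantity log N - loss is the implied lower bound on I(X; Y).

```python
import numpy as np

def info_nce(queries, docs, temperature=0.1):
    """InfoNCE loss in nats with in-batch negatives.

    Row i of `docs` is the positive for row i of `queries`;
    every other row in the batch acts as a negative.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature                      # (N, N) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # shift for numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))          # cross-entropy on the matched pairs

rng = np.random.default_rng(0)
n, dim = 64, 16  # hypothetical batch size and embedding dimension
queries = rng.normal(size=(n, dim))
aligned_docs = queries + 0.1 * rng.normal(size=(n, dim))  # positives close to their queries
random_docs = rng.normal(size=(n, dim))                   # no shared structure at all

loss_aligned = info_nce(queries, aligned_docs)
loss_random = info_nce(queries, random_docs)

# Implied lower bound on I(X; Y): log N - loss.
bound_aligned = np.log(n) - loss_aligned
bound_random = np.log(n) - loss_random

print(loss_aligned, loss_random, bound_aligned, bound_random)
```

The aligned pairs give a much lower loss and therefore a higher bound. A bound at or below zero, as the random pairs tend to produce, is vacuous because mutual information is never negative.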
More dimensions, more capacity
I(X; Y) ≤ min(H(X), H(Y)). The mutual information between two representations is upper-bounded by the entropy of either one, and the entropy of a continuous-valued vector stored at finite precision scales with its dimensionality (and its bit-precision per component).
This is the information-theoretic reason embedding dimension matters. A 32-dim float32 vector can carry far less mutual information with the input than a 1024-dim one. There is a real ceiling: a representation with d components and L reliably distinguishable levels per component holds at most d ln L nats, so once noise limits the usable precision, a 32-dim representation cannot encode 50 nats of relevance information, no matter how clever your loss.
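The ceiling can be worked through with simple arithmetic. The sketch below assumes each component has a small number of reliably distinguishable levels. The value 4 is a made-up illustration, since float32 nominally offers far more levels but noise limits how many are usable. The function name `capacity_nats` is hypothetical.

```python
import math

def capacity_nats(dim, levels_per_component):
    """Upper bound on the entropy (nats) of a vector with `dim` components,
    each taking one of `levels_per_component` distinguishable values."""
    return dim * math.log(levels_per_component)

# Assumed effective precision: 4 distinguishable levels per component.
small = capacity_nats(32, 4)
large = capacity_nats(1024, 4)

print(f"32-dim:   {small:.1f} nats")
print(f"1024-dim: {large:.1f} nats")
```

Under that assumption the 32-dim vector tops out at about 44 nats, short of the 50 nats in the example, while the 1024-dim vector has 32 times the headroom.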
In practice, the relationship is sublinear — going from 768 to 1536 dimensions does not double retrieval quality. But the floor effect is real: embeddings that are too small can’t carry enough mutual information with the labels to discriminate, full stop.
Where MI shows up beyond embeddings
Mutual information in ML practice
Feature selection: choose features with high MI to the target; drop features with low MI to the target as uninformative.
Information bottleneck: train representations Z that maximize I(Z; Y) with the target while minimizing I(Z; X) with the input. Encourages compressing irrelevant detail.
Disentanglement: penalize MI between latent dimensions to force them to capture independent factors.
Self-supervised pretraining: SimCLR optimizes an InfoNCE-style objective between augmentations of the same image; BYOL and DINO drop explicit negatives but are often interpreted in the same MI-maximization spirit.
Generative-model evaluation: MI between latents and outputs diagnoses mode collapse and disentanglement.
The definition requires the density ratio of the joint over the product of marginals. In high dimensions, you typically have only samples, with no closed-form densities, and the number of samples needed for density estimation grows exponentially with dimensionality.
The classical trick is binning: discretize each variable into bins, count co-occurrences, plug into the discrete formula. That works for low dimensions but is hopeless for 1024-dim embeddings.
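A minimal sketch of the binning estimator for two scalar variables, using NumPy's `histogram2d`. The bin count, sample size, and synthetic data are assumptions chosen for illustration, and `binned_mi` is a hypothetical helper name. The dependent pair is deliberately non-linear, so its Pearson correlation is close to zero.

```python
import numpy as np

def binned_mi(x, y, bins=16):
    """Plug-in MI estimate in nats from a 2-D histogram of paired samples."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    joint = counts / counts.sum()              # empirical joint over the bin grid
    px = joint.sum(axis=1, keepdims=True)      # empirical marginal of x
    py = joint.sum(axis=0, keepdims=True)      # empirical marginal of y
    mask = joint > 0                           # skip empty cells
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=20_000)
y_dependent = x ** 2 + 0.05 * rng.normal(size=x.size)  # non-linear dependence on x
y_independent = rng.uniform(-1.0, 1.0, size=x.size)    # unrelated to x

mi_dependent = binned_mi(x, y_dependent)
mi_independent = binned_mi(x, y_independent)
pearson = float(np.corrcoef(x, y_dependent)[0, 1])

print(mi_dependent, mi_independent, pearson)
```

The plug-in estimate is biased upward by roughly (bins - 1)^2 / (2 * samples) nats, so the independent pair reads slightly above zero. The number of cells grows as bins^d with dimension d, which is why this approach cannot scale to embeddings.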
Modern practice uses variational lower bounds. MINE (Mutual Information Neural Estimation), InfoNCE, and related estimators train a critic network to approximate the density ratio implicitly, then plug its output into a bound on I(X; Y). InfoNCE is the most practical because the bound is loss-shaped: minimize the loss, maximize the lower bound, train the encoder. The trade-off is that the bound cannot exceed log N and is loose unless N (the number of negatives) is large, which is why contrastive batch sizes are typically several thousand for serious models.
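The batch-size pressure follows from simple arithmetic. Because the InfoNCE loss is non-negative, the bound log N - loss can never exceed log N. The sketch below tabulates that ceiling for a few candidate counts, treating N as one positive plus the negatives. The helper names and the 10-nat target are hypothetical.

```python
import math

def max_certifiable_mi(num_candidates):
    """Ceiling (nats) on the InfoNCE bound log(N) - loss, reached only when the loss is 0."""
    return math.log(num_candidates)

def min_candidates_for(mi_nats):
    """Smallest N whose ceiling log(N) reaches `mi_nats`."""
    return math.ceil(math.exp(mi_nats))

ceilings = {n: max_certifiable_mi(n) for n in (256, 4096, 65536)}
needed_for_10_nats = min_candidates_for(10.0)

for n, ceiling in ceilings.items():
    print(f"N = {n:>6}: bound can reach at most {ceiling:.2f} nats")
print("candidates needed to certify 10 nats:", needed_for_10_nats)
```

Each extra nat of certifiable mutual information multiplies the required number of candidates by e, which is one way to read the preference for very large contrastive batches.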
Mutual information is the substrate beneath most representation learning, even when nobody names it. Once you see that “maximize MI between query and positive” is what InfoNCE does, the standard moves in embedding training — large batches, hard negatives, high dimensionality — stop looking like folklore and start looking like tightening a bound.
Go further
How does InfoNCE bound mutual information?
InfoNCE, the loss behind most modern embedding models, is a lower bound on I(X; Y) between the anchor and its positive. The bound tightens as the number of negatives grows — which is why batch size and hard-negative mining matter so much for embedding quality.
Why does dimensionality matter for capturing mutual information?
Mutual information between two variables is bounded above by the entropy of either one, and entropy scales with the dimensionality of the representation. A 32-dim embedding can only encode about 32 × ln(L) nats of information, where L is the number of distinguishable levels per component. More dimensions buy more capacity, up to the noise floor of the data.
Is mutual information the same as correlation?
No. Correlation captures linear dependence only; two variables can have zero correlation but very high mutual information (for instance, y = x² with x distributed symmetrically around zero). MI is the right measure when the relationship is non-linear, which is most of what neural networks learn.