Information Bottleneck

Also known as: IB principle, Tishby IB, information bottleneck principle

TL;DR

The information bottleneck principle frames learning as a compression problem: find a representation T of input X that throws away every bit of X that is not informative about the target Y. Formally, maximize I(T; Y) while minimizing I(X; T).

The information bottleneck principle, due to Tishby, Pereira & Bialek (1999), reframes supervised learning as a constrained compression problem. Given input X and target Y, find a representation T that minimizes:

I(X; T) - β I(T; Y)

That is, compress X as much as possible (minimize I(X; T), the mutual information between input and representation) while preserving as much information about Y as possible (maximize I(T; Y)). The parameter β controls the compression-fidelity tradeoff: a larger β puts more weight on preserving information about Y.

A good representation is one that has thrown away every bit of input that did not help predict the output. Compression is not a side effect of learning; it IS learning.
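For small discrete variables, both mutual-information terms can be computed exactly, which makes the objective easy to inspect. The following is a minimal sketch, not a reference implementation: the function names, the toy joint distribution, and the choice of bits as the unit are all illustrative assumptions. It treats the representation as a stochastic encoder p(t|x) and assumes T depends on Y only through X.

```python
import numpy as np

def mutual_information(p_ab):
    """Mutual information (in bits) of a joint distribution given as a 2-D array."""
    p_ab = np.asarray(p_ab, dtype=float)
    p_a = p_ab.sum(axis=1, keepdims=True)   # marginal over rows, shape (n, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)   # marginal over columns, shape (1, m)
    mask = p_ab > 0                         # skip zero cells (0 * log 0 = 0)
    return float(np.sum(p_ab[mask] * np.log2(p_ab[mask] / (p_a @ p_b)[mask])))

def ib_objective(p_xy, p_t_given_x, beta):
    """Return (I(X;T), I(T;Y), I(X;T) - beta * I(T;Y)) for an encoder p(t|x)."""
    p_xy = np.asarray(p_xy, dtype=float)
    enc = np.asarray(p_t_given_x, dtype=float)  # shape (|X|, |T|), rows sum to 1
    p_x = p_xy.sum(axis=1)
    p_xt = p_x[:, None] * enc                   # p(x, t) = p(x) p(t|x)
    p_ty = enc.T @ p_xy                         # p(t, y) = sum_x p(t|x) p(x, y)
    i_xt = mutual_information(p_xt)
    i_ty = mutual_information(p_ty)
    return i_xt, i_ty, i_xt - beta * i_ty

# Hypothetical toy problem: X takes 4 equally likely values, Y = X // 2.
p_xy = np.array([[0.25, 0.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.25]])
# Encoder that merges the two x values sharing a label into one cluster.
merge = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
print(ib_objective(p_xy, merge, beta=2.0))  # I(X;T) = 1 bit, I(T;Y) = 1 bit
```

In this toy case the merged encoder spends one bit on X and keeps the full one bit that X carries about Y, so nothing label-relevant is lost.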

Why this framing is useful

The IB lens unifies a surprising amount of representation-learning theory:

Why embeddings are smaller than inputs. A 768-dim embedding compresses the 50K-vocab-token-by-context-window input into 768 numbers. The compression is the work. The reason 768 dims often suffice is that the predictive information about most tasks is much lower-dimensional than the raw input.
Why bottleneck architectures work. Autoencoders, encoder-decoder transformers, and U-Nets all force information through a narrow channel. The narrowness is the compression constraint. The reconstruction (or downstream-task) loss is the I(T; Y) term.
Why knowledge distillation works. Distillation transfers the teacher’s compressed representation (with most input-irrelevant bits already discarded) to a smaller student. The IB lens explains why distilled students sometimes generalize better than the teacher: the compression bias.
Why dropout / weight decay generalize. Dropout injects noise into T, and weight decay limits how much the weights can encode. Both increase the compression pressure. Regularization acts as a compression force.

The two terms, concretely

A representation that just copies X has high I(X; T) (perfect memorization) and as much I(T; Y) as X itself has. That’s not a useful representation. A representation that ignores X has zero I(X; T) but also zero I(T; Y). Useless. The interesting representations live in the middle: enough compression that irrelevant input dimensions are gone, enough fidelity that label-predictive information survives.
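The two extremes and the middle ground can be checked numerically on a toy problem. This is a sketch under stated assumptions: the joint distribution (four equally likely inputs, label equal to x // 2) and the three encoders are hypothetical, chosen so the numbers come out cleanly, and the helper names are illustrative.

```python
import numpy as np

def mi_bits(p_ab):
    """Mutual information (bits) of a 2-D joint distribution."""
    p_ab = np.asarray(p_ab, dtype=float)
    outer = p_ab.sum(axis=1, keepdims=True) @ p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float(np.sum(p_ab[mask] * np.log2(p_ab[mask] / outer[mask])))

def ib_terms(p_xy, enc):
    """Return (I(X;T), I(T;Y)) for an encoder matrix enc[x, t] = p(t|x)."""
    p_xy = np.asarray(p_xy, dtype=float)
    enc = np.asarray(enc, dtype=float)
    p_xt = p_xy.sum(axis=1)[:, None] * enc   # p(x, t)
    p_ty = enc.T @ p_xy                      # p(t, y)
    return mi_bits(p_xt), mi_bits(p_ty)

# Toy joint: X uniform over 4 values, Y = X // 2 (assumed for illustration).
p_xy = np.array([[0.25, 0.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.25]])

encoders = {
    "copy X":     np.eye(4),                                    # T = X
    "ignore X":   np.ones((4, 1)),                              # T is constant
    "bottleneck": np.array([[1, 0], [1, 0], [0, 1], [0, 1]]),   # T = X // 2
}

for name, enc in encoders.items():
    i_xt, i_ty = ib_terms(p_xy, enc)
    print(f"{name:10s}  I(X;T) = {i_xt:.2f} bits   I(T;Y) = {i_ty:.2f} bits")
```

Here the copy encoder pays 2 bits of I(X; T) for 1 bit of I(T; Y), the constant encoder pays nothing and gets nothing, and the bottleneck encoder keeps the same 1 bit about Y for half the cost.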

IB in retrieval and ranking

The training pipeline for embedding models and rerankers is, structurally, an information bottleneck:

Input X: a query and a document (or a token sequence).
Representation T: a fixed-size vector (embedding) or a scalar (reranker score).
Target Y: relevance — does this document satisfy this query.
Compression pressure: bounded dimensionality (768, 1024, 1536) acts as the I(X; T) ceiling. The model cannot remember everything; it must choose what to keep.
Fidelity pressure: contrastive loss (InfoNCE) lower-bounds I(T; Y). The model is rewarded for keeping relevance-predictive bits.

The matryoshka trick — training nested truncations of one embedding to all be usable — is essentially training multiple points along the IB curve simultaneously.
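The fidelity pressure and the nested-truncation idea can be sketched in a few lines of NumPy. This is an illustration, not the training code of any particular model: it assumes in-batch negatives, cosine similarity, a hypothetical temperature of 0.05, and a matryoshka-style loss formed by simply summing InfoNCE over embedding prefixes. The quantity log(N) minus the loss is the standard InfoNCE lower bound (in nats) on the mutual information between the paired views.

```python
import numpy as np

def info_nce(queries, docs, temperature=0.05):
    """InfoNCE with in-batch negatives; row i of docs is the positive for query i."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature                      # (N, N), positives on the diagonal
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def mi_lower_bound_nats(loss, batch_size):
    """InfoNCE bound: I >= log(N) - loss. It can never exceed log(N)."""
    return float(np.log(batch_size) - loss)

def matryoshka_info_nce(queries, docs, dims, temperature=0.05):
    """Sum InfoNCE over nested prefixes, one term per truncated dimension in dims."""
    return sum(info_nce(queries[:, :k], docs[:, :k], temperature) for k in dims)

# Hypothetical batch: 8 documents and slightly perturbed copies acting as queries.
rng = np.random.default_rng(0)
docs = rng.normal(size=(8, 16))
queries = docs + 0.1 * rng.normal(size=(8, 16))

loss = info_nce(queries, docs)
print("InfoNCE loss:", loss)
print("MI lower bound (nats):", mi_lower_bound_nats(loss, len(docs)))
print("Nested loss over dims 4, 8, 16:", matryoshka_info_nce(queries, docs, (4, 8, 16)))
```

Each prefix length is a different bottleneck width, so the summed loss asks every truncation to keep as much relevance-predictive information as its size allows.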

Open questions and limitations

Two reasons the IB objective is rarely optimized directly:

MI is hard to estimate from samples. Both terms in the IB objective are mutual informations, which are notoriously high-variance to estimate without strong distributional assumptions. Variational lower bounds (MINE, InfoNCE) help on I(T; Y); the I(X; T) side typically requires upper bounds via tractable noisy channels (Gaussian, dropout).
The optimal β depends on the task. Too small a β produces representations that ignore Y; too large a β produces representations that ignore the compression objective and just memorize X. In practice, β is replaced by architectural choices (bottleneck dimension, dropout rate, weight decay) that act as implicit compression constraints.

This is why IB is more useful as a framing than as a direct training loss. The principle informs architecture and regularization choices; the actual optimization happens via downstream objectives.
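When the I(X; T) side is handled explicitly, one common construction is a Gaussian noisy channel: the encoder outputs a mean and log-variance per example, and the KL divergence to a fixed standard-normal prior, averaged over the batch, estimates an upper bound on I(X; T). The sketch below assumes that setup (often called a variational information bottleneck); the function names are illustrative, and the task loss is a stand-in for the negative I(T; Y) term. The weight beta multiplies the task term here, matching the convention that a larger beta favors keeping information about Y.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) per example, in nats."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

def compression_upper_bound(mu, log_var):
    """Batch average of the KL term: a sample estimate of an upper bound on I(X; T)."""
    return float(np.mean(gaussian_kl_to_standard_normal(mu, log_var)))

def ib_surrogate_loss(task_loss, mu, log_var, beta):
    """Surrogate for I(X;T) - beta * I(T;Y): KL bound plus beta times the task loss."""
    return compression_upper_bound(mu, log_var) + beta * task_loss

# Hypothetical encoder outputs for a batch of 2 examples with a 3-dim bottleneck.
mu = np.array([[0.5, -0.2, 0.0], [1.0, 0.3, -0.4]])
log_var = np.array([[-1.0, -0.5, 0.0], [-2.0, 0.0, -0.1]])

print("Compression bound (nats):", compression_upper_bound(mu, log_var))
print("Surrogate loss:", ib_surrogate_loss(task_loss=0.7, mu=mu, log_var=log_var, beta=4.0))
```

An encoder whose output equals the prior for every input (zero mean, unit variance) has a KL of exactly zero: it transmits nothing about X, which is the "ignores X" extreme.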

Saxe et al. (2018) challenged the strongest version of Tishby’s claim: that SGD induces a two-phase fitting-then-compression dynamic visible in I(X; T) vs I(T; Y) curves. They showed that the empirical phase transition Tishby observed depends on specific choices about how MI is estimated, and that it does not robustly appear across architectures or activation functions.

What survives the critique: IB is still a useful framing for what good representations are, even if it is not the full story of how SGD gets there. The compression-as-generalization intuition holds; the specific dynamics are open.

The honest take

IB is a lens, not an algorithm. You will rarely write an IB loss directly. But “what is this representation throwing away, what is it keeping” is the right mental model for diagnosing why an embedding fails on a new domain, why a distilled student generalizes better than its teacher, or why your 256-dim embedding suddenly drops 8 NDCG points compared to 1024. The compression frame makes those failures legible.

Go further

Does the information bottleneck explain deep learning?

Tishby's strong claim — that SGD induces a two-phase compression dynamic where networks first fit then compress — was contested by Saxe et al. (2018) and the empirical evidence is mixed. What survived: IB as a useful framing for what good representations look like, even if it does not perfectly explain the optimization dynamics. The compression-as-generalization intuition is genuinely load-bearing for understanding embeddings, distillation, and matryoshka representations.


How does IB connect to contrastive learning?

InfoNCE explicitly lower-bounds I(T; Y) where T is the embedding and Y is the positive pair. Maximizing the InfoNCE objective is approximately maximizing the I(T; Y) term of the IB objective. The I(X; T) compression side is implicit — bounded model capacity acts as the compression constraint.


What does IB say about embedding dimension choice?

Smaller dimensions tighten the I(X; T) compression. Larger dimensions retain more information from X, including irrelevant information. The right dimension balances enough capacity to preserve I(T; Y) (label-predictive information) against enough compression to throw out everything else. Matryoshka representations exploit this directly by training nested embeddings of varying dimensions.

