Orthogonality Concentration

Also known as: near-orthogonality, high-dimensional orthogonality

TL;DR

In high dimensions, two random vectors are almost always nearly orthogonal: their cosine similarity concentrates sharply around 0. This is why untrained embeddings give noise and why training has to actively fight the geometry.

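In high-dimensional space, two random unit vectors are nearly orthogonal with overwhelming probability. Concretely: for two independently drawn unit vectors $u, v$ in $d$ dimensions, the probability that their cosine similarity exceeds $\varepsilon$ in absolute value is bounded by:

$$
\Pr\big[\,|\cos\theta(u, v)| > \varepsilon\,\big] \le 2\exp\!\left(-\frac{d\,\varepsilon^2}{2}\right)
$$

The exponential decay in $d$ is brutal. At $d = 1024$ and $\varepsilon = 0.1$, the bound is about $2e^{-5.12} \approx 0.012$, so less than 1.2% of random vector pairs deviate from orthogonality by more than 0.1. The bound depends on $d\,\varepsilon^2$, though, so halving ε costs more than doubling d buys: at $d = 2048$ and $\varepsilon = 0.05$ it is only $2e^{-2.56} \approx 0.155$. A cosine that small is not reliably distinguishable from noise, even at that dimension.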
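A quick simulation makes the concentration concrete. The sketch below is a minimal check, assuming the bound has the standard spherical-cap form 2·exp(−dε²/2), which reproduces the 1.2% figure at d = 1024 and ε = 0.1. The function names are illustrative, and it uses only the Python standard library.

```python
import math
import random


def random_unit_vector(d, rng):
    """Draw a uniformly random direction by normalizing a Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def tail_bound(d, eps):
    """Assumed bound on P(|cos| > eps): 2 * exp(-d * eps^2 / 2)."""
    return 2.0 * math.exp(-d * eps * eps / 2.0)


def empirical_tail(d, eps, n_pairs, seed=0):
    """Fraction of independent random pairs whose |cosine| exceeds eps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_pairs):
        u = random_unit_vector(d, rng)
        v = random_unit_vector(d, rng)
        # Both vectors have unit length, so the dot product is the cosine.
        cos = sum(a * b for a, b in zip(u, v))
        if abs(cos) > eps:
            hits += 1
    return hits / n_pairs
```

Comparing `empirical_tail(256, 0.2, 1000)` with `tail_bound(256, 0.2)` should typically show the observed fraction sitting well below the bound, since the bound is an upper limit and not a tight estimate.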

Why this is a problem (in theory)

If random embeddings have cosine ≈ 0 with overwhelming probability, then “high cosine = relevant” works only because trained embeddings have learned to deviate from random. The training signal pulls semantically similar pairs to cosine close to 1, which is far from where untrained vectors would land. The geometry of high-D space is fighting the model the entire training run.

This is the formal version of the “curse of dimensionality is also redundancy” intuition: the space is so vast that almost all of it is noise, but the meaningful signal lives in a much smaller learned region.

Why this is a feature (in practice)

The same concentration that makes random vectors useless makes learned directions efficient. A trained 2048-dim space can pack tens of thousands of near-orthogonal directions — far more than you could ever name as concepts.

This is feature superposition: LLMs and embedding models use many near-orthogonal-but-not-quite directions to encode many more features than the dimensionality would naively suggest. Sparse autoencoders are one way to disentangle that superposition back into individual features.
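A toy version of superposition can be sketched in a few lines. This is an illustration under loud assumptions, not how any particular model stores features: random unit vectors stand in for learned feature directions, the sizes (512 features in 256 dimensions, 3 active at once) are arbitrary, and all names are hypothetical.

```python
import math
import random


def random_unit_vector(d, rng):
    """Draw a uniformly random direction by normalizing a Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def superpose(directions, active):
    """Sum the directions of the active features into one d-dim vector."""
    d = len(directions[0])
    x = [0.0] * d
    for i in active:
        for j, value in enumerate(directions[i]):
            x[j] += value
    return x


def read_out(directions, x, k):
    """Return the k feature indices whose directions align best with x."""
    scores = [sum(a * b for a, b in zip(direction, x)) for direction in directions]
    ranked = sorted(range(len(directions)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


rng = random.Random(0)
d, n_features = 256, 512            # twice as many features as dimensions
directions = [random_unit_vector(d, rng) for _ in range(n_features)]
active = {3, 141, 400}              # a sparse set of "on" features
x = superpose(directions, active)
recovered = read_out(directions, x, k=len(active))
```

Each active feature scores close to 1 against `x`, while inactive features only pick up small interference terms, so a plain dot-product readout separates them. The readout degrades as more features are active at once, which is why sparsity matters.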

The classical upper bound on strictly orthogonal directions in $\mathbb{R}^d$ is exactly $d$: that’s a basis. But if you allow ε-near-orthogonality (every pair has $|\cos\theta| \le \varepsilon$), the bound jumps exponentially. A standard JL-style argument shows you can fit $\exp(c\,\varepsilon^2 d)$ such directions for some constant $c > 0$. Because that count grows exponentially in $d$, at large $d$ it is vastly more than the dimensionality. Anthropic’s superposition work (Elhage et al., 2022) demonstrated this empirically: a small autoencoder trained on sparse features learns to pack 5-10× more features than dimensions, with the exact ratio set by feature sparsity. The geometry doesn’t fight you when features are rare; it cooperates.
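The exponential count follows from a union bound. As a sketch, assume the single-pair tail bound $\Pr[|\cos\theta| > \varepsilon] \le 2e^{-d\varepsilon^2/2}$ (the standard form; the constant derived here is specific to that assumption). Draw $m$ independent random unit vectors $u_1, \dots, u_m$ and bound the chance that any pair fails:

$$
\Pr\big[\exists\, i < j : |\langle u_i, u_j \rangle| > \varepsilon\big]
\;\le\; \binom{m}{2} \cdot 2e^{-d\varepsilon^2/2}
\;<\; m^2 e^{-d\varepsilon^2/2}
$$

The right-hand side is at most 1 whenever $m \le e^{d\varepsilon^2/4}$, so the failure probability is strictly below 1 and at least one set of $m = \lfloor e^{d\varepsilon^2/4} \rfloor$ pairwise ε-near-orthogonal directions must exist. Under this assumption the constant is $c = 1/4$. As an illustrative instance, $d = 4096$ and $\varepsilon = 0.1$ give $e^{10.24} \approx 2.8 \times 10^4$ directions.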

Where orthogonality concentration shows up

Untrained / randomly-initialized embeddings: cosines cluster near 0
Cross-lingual zero-shot: different language tokens occupy near-orthogonal subspaces until aligned
Sparse autoencoder features: many more than dimension count, all near-orthogonal
Truncated SVD / PCA outputs: top-k components are exactly orthogonal by construction
Random projections (Johnson-Lindenstrauss): preserve distances because the random projection directions are themselves near-orthogonal

Practical consequences

Don’t trust raw cosine on untrained vectors. PCA outputs, raw token-level embeddings, randomly-initialized model weights — all have cosine distributions concentrated around 0. The signal is in the deviation from that.
Don’t worry about dimensionality too much for storage. Truncation often loses much less accuracy than expected because the meaningful signal lives in a low-dimensional manifold; orthogonality concentration says the rest of the space is mostly noise anyway.
Embedding models work because they fight the geometry. A trained embedding’s “good cosine of 0.85” is not a coincidence — it’s the result of training pulling related items dramatically far from where random vectors would have been.

Go further

How does feature superposition exploit this?

Because there's room for vastly more near-orthogonal directions than strictly orthogonal ones, models can pack many more learned features into a fixed dim count than naive linear-algebra intuition suggests. That's how a 4096-dim model can encode tens of thousands of distinguishable concepts.

What does this mean for embedding similarity scores?

A trained-embedding cosine of 0.85 is a strong signal — random pairs sit near 0, so anything well above the noise floor reflects real learned alignment. But that also means absolute thresholds don't transfer between models; calibration is per-model.
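One way to make per-model calibration concrete is to express a cosine in units of that model's own noise floor. The sketch below is a minimal illustration, not a prescribed method: the helper names are hypothetical and the baseline cosines for the two models are made-up numbers.

```python
import statistics


def noise_floor(unrelated_cosines):
    """Mean and sample standard deviation of cosines between unrelated pairs."""
    return statistics.mean(unrelated_cosines), statistics.stdev(unrelated_cosines)


def calibrated_score(cosine, floor):
    """How many noise-floor standard deviations a cosine sits above the mean."""
    mean, std = floor
    return (cosine - mean) / std


# Hypothetical baselines for two models; the numbers are made up for illustration.
model_a_floor = noise_floor([0.00, 0.10, -0.10, 0.05, -0.05])
model_b_floor = noise_floor([0.60, 0.70, 0.50, 0.65, 0.55])

score_a = calibrated_score(0.85, model_a_floor)
score_b = calibrated_score(0.85, model_b_floor)
```

With these made-up baselines, the same raw cosine of 0.85 lands far further above the noise floor for model A than for model B, which is the sense in which an absolute threshold does not transfer.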

Why don't compression tricks (truncation, quantization) destroy this geometry?

The signal lives on a low-dimensional manifold inside the full vector — most of the high-D space is the noise floor. [JL bounds](/concepts/johnson-lindenstrauss/) tell us pairwise distances survive aggressive projection, so the learned-vs-random gap survives too.
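The distance-preservation claim can be checked with a small random projection. This sketch uses a Gaussian random matrix as a stand-in for the compression step (it is not truncation or quantization), and the sizes and names are arbitrary assumptions chosen to keep it fast in pure Python.

```python
import math
import random


def gaussian_projection(d_in, d_out, rng):
    """Random d_out x d_in matrix with N(0, 1/d_out) entries."""
    scale = 1.0 / math.sqrt(d_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d_in)] for _ in range(d_out)]


def project(matrix, x):
    """Apply the projection matrix to a single vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


rng = random.Random(0)
d_in, d_out, n_points = 512, 128, 8
points = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(n_points)]
matrix = gaussian_projection(d_in, d_out, rng)
projected = [project(matrix, p) for p in points]

# Ratio of projected to original distance for every pair; values near 1 mean
# the geometry survived the 4x reduction in dimension.
ratios = [
    distance(projected[i], projected[j]) / distance(points[i], points[j])
    for i in range(n_points)
    for j in range(i + 1, n_points)
]
```

The ratios should cluster around 1, with a spread that shrinks as `d_out` grows. If pairwise distances survive, so does the gap between related and unrelated pairs.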
