A 1984 result that says you can reduce a high-dimensional vector to a much lower dimension via random projection while approximately preserving pairwise distances. The mathematical reason aggressive dimension truncation works for embeddings.
The Johnson-Lindenstrauss (JL) lemma is the formal statement that random projections preserve distances. Specifically: any set of N points in any dimension can be embedded into O(log N / ε²) dimensions such that all pairwise distances are preserved within a factor of (1 ± ε).
The dependence on ε is sharp; the dependence on dimension of the original space doesn’t appear at all. That’s the surprising part: you can squash a million points from 10,000 dimensions down to about 700 dimensions and still recover all pairwise distances within ~10%. The original dimension is irrelevant to how aggressively you can compress.
The recipe
The proof is constructive:
Pick a random matrix R of shape (target_dim × original_dim) with entries drawn from a normal distribution N(0, 1 / target_dim).
For each original vector v, the projected vector is R @ v.
That’s it. No training, no learning. Just multiply by a random matrix and your distances mostly survive.
In practice the random Gaussian matrix can be replaced with sparser, faster alternatives (random ±1, sparse Bernoulli, Hadamard transforms) and the lemma still holds.
The proof uses concentration of measure: the squared length of a random projection of a fixed unit vector is distributed as a scaled chi-squared, and chi-squared concentrates exponentially fast in the target dimension regardless of how high-dimensional the input is. Combine that with a union bound over the C(N,2) pairs of points and you get the O(log N / ε²) target dimension. The original dimension only enters the cost of computing the projection, not the bound itself. This is the technical reason large embedding models can be aggressively compressed: their input dimensionality is high, but the number of distinct points (corpus size) is what governs how aggressively you can shrink.
The original dimension is irrelevant. Corpus size is what governs how much you can compress.
A 2048-dim embedding has way more capacity than needed for a million-document index. By JL, you could drop to ~700 dims and lose almost nothing.
The same logic applies to binary quantization. If you replace each dimension with its sign, distances become Hamming distances scaled by some constant — and provably preserved within bounded distortion.
This is also part of why the curse of dimensionality is less terrifying than it sounds. The curse says “high-D is weird”; JL says “high-D is also redundant — you can throw a lot away”.
What it doesn’t say
JL says distances are approximately preserved. It does not say:
Cosine similarities are preserved exactly (they are approximately preserved when vectors are unit-normalized, but the constants change).
A trained projection (like an MLP head) preserves distances. Trained projections deliberately change distances to bring related items closer and push unrelated items farther — that’s the point of training. JL is the worst-case-input baseline; trained projections do better.
You can use any dimension and be fine. The bound is . For a billion documents at ε = 0.1, you need around 2000 dimensions. Below the JL bound, distortion grows.
Go further
Does this mean MRL training is unnecessary?
Sort of — JL guarantees a worst-case bound for any random projection, so a fixed linear truncation already preserves most of the structure trained embeddings encode. MRL trains the model to be aware of the truncation, which costs accuracy at full dim; pure JL-style truncation preserves full-dim accuracy.
How does this relate to the curse of dimensionality?
They're two sides of the same coin. The curse says high-D is weird and signals get diluted; JL says high-D is also redundant and you can throw most dimensions away with bounded distortion. Both are true — and both are why production retrieval can compress aggressively.
Below the bound (~O(log N / ε²) dimensions for N points), distortion grows quickly. JL also gives no guarantees for the learned projections inside neural nets — those deliberately distort distances to bring related items closer. JL is the worst-case-input baseline; trained projections do better.