Also known as: high-dimensional behavior, concentration of measure
TL;DR
In high-dimensional spaces, distance and similarity behave counterintuitively — random points become nearly equidistant, volume concentrates near the surface of any region, and naive nearest-neighbor search loses much of its discriminative power.
The curse of dimensionality is the family of phenomena where intuition built in 2D or 3D stops applying once you’re operating in 1024, 2048, or 4096 dimensions — exactly where modern embedding models live.
The model has to fight the geometry. Trained embeddings exist precisely to drag semantically similar pairs out of the curse’s natural orthogonality.
What goes wrong
Several things, all at once:
High-dimensional pathologies
Distance concentration. For points drawn independently from a broad distribution, the distance from a query to its nearest neighbor and to its farthest neighbor become nearly equal in high-D: their ratio approaches 1 as dimension grows. So “find the closest document” becomes a very weak signal, because almost everything is roughly equidistant.
Volume concentrates near the surface. A high-D ball has almost all its mass in a thin shell near the boundary, not the center. This breaks geometric intuitions like “small perturbations are close to the original”.
Random vectors are nearly orthogonal. In d dimensions, the cosine similarity of two random unit vectors has mean 0 and standard deviation 1/√d, so it is close to 0 with probability approaching 1 as d grows. At d = 1024 the typical magnitude is only about 0.03. See orthogonality concentration.
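The first and third effects above are easy to observe empirically. The following is a minimal sketch using only the Python standard library; the sample sizes, the uniform-cube data for the distance test, and the helper names are illustrative assumptions, not taken from any particular library.

```python
import math
import random


def random_unit_vector(d, rng):
    """Draw a direction uniformly from the unit sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def nearest_farthest_ratio(d, n_points, rng):
    """Nearest / farthest Euclidean distance from one random query.

    Points are uniform in the unit cube (an assumption chosen for simplicity).
    """
    query = [rng.random() for _ in range(d)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(d)]
        dists.append(math.dist(query, point))
    return min(dists) / max(dists)


def mean_abs_cosine(d, n_pairs, rng):
    """Average |cosine| between independent random unit vectors."""
    total = 0.0
    for _ in range(n_pairs):
        u = random_unit_vector(d, rng)
        v = random_unit_vector(d, rng)
        total += abs(sum(a * b for a, b in zip(u, v)))
    return total / n_pairs


rng = random.Random(0)
for d in (2, 32, 1024):
    ratio = nearest_farthest_ratio(d, 200, rng)
    cos = mean_abs_cosine(d, 200, rng)
    print(f"d={d:5d}  nearest/farthest={ratio:.3f}  mean|cos|={cos:.3f}")
```

As d grows, the printed ratio should climb toward 1 while the mean absolute cosine shrinks toward 0. Exact values depend on the seed and sample sizes.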
What it means for retrieval
Naive cosine similarities between random high-D vectors hover around 0. The meaningful similarity signal has to be learned into the geometry — contrastive training drags semantically similar pairs toward each other (cosine close to 1) against the curse’s natural pull toward orthogonality.
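How tightly random cosines "hover around 0" can be made precise. The short derivation below assumes u and v are independent and uniformly distributed on the unit sphere in d dimensions, which is an idealization of "random" vectors and not a claim about any specific model's untrained embeddings.

```latex
% By rotational symmetry we may fix v = e_1, so u \cdot v = u_1.
% The coordinates of u are identically distributed and satisfy \sum_i u_i^2 = 1, hence
\mathbb{E}[u \cdot v] = 0,
\qquad
\operatorname{Var}(u \cdot v) = \mathbb{E}[u_1^2] = \frac{1}{d},
\qquad
\operatorname{sd}(u \cdot v) = \frac{1}{\sqrt{d}}.
% Example: d = 1024 gives sd = 1/32 \approx 0.031.
```

Under that assumption, a cosine of 0.3 between two 1024-dim vectors sits roughly ten standard deviations away from what chance alone would produce, which is why even moderate learned similarities carry real signal.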
It’s also why first-pass retrieval at billion-document scale uses approximate methods (HNSW, IVF) rather than exact nearest-neighbor search: exact search costs a full scan per query, and in high-D “approximate” loses surprisingly little because the exact ranking was already a noisy signal.
Why we still get away with it
Two reasons cosine similarity remains useful in high-D despite all this:
Trained embeddings concentrate signal in a low-dimensional manifold. The full 2048-dim space is mostly noise; the meaningful structure lives on a much lower-dimensional manifold, which is why aggressive dimension truncation and quantization cost so little accuracy.
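The Johnson-Lindenstrauss lemma bounds how badly random projection can distort distances: with enough target dimensions, every pairwise distance is preserved to within a factor of (1 ± ε) with high probability. Relative ordering therefore survives even drastic dimension reduction reasonably well, except among near-ties.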
So in practice you fight the curse with: trained embeddings (signal), reranking (precise re-scoring once the candidate set is small), and dimension reduction (cheap operating points without much loss).
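The Johnson-Lindenstrauss point can be checked with a small experiment: project a handful of points through a random Gaussian matrix and see how far pairwise distances move. This is a rough sketch with the standard library only; the sizes (512 to 64 dimensions, 20 points) and the function names are arbitrary choices for illustration.

```python
import math
import random


def gaussian_projection(d, k, rng):
    """k x d matrix with i.i.d. N(0, 1/k) entries, so squared norms are preserved on average."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, vec):
    """Apply the projection matrix to one vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def distortion_range(points, matrix):
    """Min and max of (projected distance / original distance) over all pairs."""
    projected = [project(matrix, p) for p in points]
    ratios = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            original = math.dist(points[i], points[j])
            reduced = math.dist(projected[i], projected[j])
            ratios.append(reduced / original)
    return min(ratios), max(ratios)


rng = random.Random(0)
d, k, n = 512, 64, 20
points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
lo, hi = distortion_range(points, gaussian_projection(d, k, rng))
print(f"{d} -> {k} dims: distance ratios fall in [{lo:.2f}, {hi:.2f}]")
```

The ratios should cluster around 1 even though seven eighths of the dimensions are gone. Raising k tightens the interval and lowering it widens the interval.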
A randomly initialized 1024-dim embedding does suffer from full distance concentration: every pair of documents lands at roughly the same cosine. After contrastive training on millions of (query, positive, negative) triples, the geometry is no longer random. The optimization has actively pulled semantically related pairs together and pushed negatives apart, overriding the near-orthogonality that random directions would have by default.
The result is a learned manifold inside the ambient high-D space. The full 1024-dim space is mostly noise; the meaningful structure lives on a much lower-dimensional submanifold — typically estimated at 50-200 effective dimensions for a strong text embedder. The curse still applies off the manifold, but on it, the signal-to-noise ratio is much higher.
This is also why aggressive dimension reduction (Matryoshka, PCA, learned low-rank projection) costs surprisingly little accuracy — you’re projecting onto a subspace that already captures most of the manifold structure. Truncating to 256-dim often loses 1-2% NDCG; truncating to 64-dim might lose 5-10%. The full ambient dimension is buying you capacity, not signal.
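Prefix truncation, the simplest of the reductions mentioned above, can be sketched in a few lines. The sketch assumes an embedding whose leading coordinates carry most of the signal, as Matryoshka-style training is designed to produce; for a model not trained that way, PCA or a learned projection would be the safer choice. The function names are hypothetical.

```python
import math


def truncate_and_renormalize(vec, k):
    """Keep the first k coordinates and rescale to unit length.

    Renormalizing matters: cosine via dot product assumes unit vectors,
    and a truncated vector is generally shorter than the original.
    """
    head = list(vec[:k])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("truncated vector has zero norm")
    return [x / norm for x in head]


def cosine_of_unit_vectors(a, b):
    """Dot product, valid as cosine only when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


# Toy 8-dim vectors standing in for real embeddings (illustrative values only).
doc = [0.6, 0.5, 0.4, 0.3, 0.05, -0.02, 0.01, 0.03]
query = [0.5, 0.6, 0.3, 0.4, -0.04, 0.02, 0.02, -0.01]

full = cosine_of_unit_vectors(
    truncate_and_renormalize(doc, 8), truncate_and_renormalize(query, 8)
)
short = cosine_of_unit_vectors(
    truncate_and_renormalize(doc, 4), truncate_and_renormalize(query, 4)
)
print(f"cosine at 8 dims: {full:.3f}, at 4 dims: {short:.3f}")
```

In this toy case the trailing coordinates are small, so the two cosines should be close. How much a real model loses at each truncation length has to be measured on a retrieval benchmark.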
Cross-encoder rerankers are not exposed to the curse in the same way. The curse acts on distance computation in a precomputed embedding space. A bi-encoder computes embeddings independently and then takes a cosine, so it sits squarely in the high-D distance regime and pays the full cost of the curse.
A cross-encoder doesn’t compute a distance at all. It feeds the concatenated (query, document) text through a transformer and reads a learned scalar score off a classification head. There’s no embedding step, no cosine, no high-D pairwise distance. Whatever effects high dimension has on the internal hidden states get absorbed into the model’s training; the output is task-shaped, not geometry-shaped.
This is why two-stage retrieval is so robust: the first stage’s coarse bi-encoder pass pulls a noisy top-100 out of a high-D space, then the cross-encoder reranker, which never relies on a pairwise distance and so is not subject to distance concentration, re-scores each candidate jointly with the query. Each stage plays to its strengths.
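The shape of that pipeline can be sketched as follows. Everything here is a placeholder: `two_stage_search` and `word_overlap` are hypothetical names, the embeddings are hand-written toy vectors, and the word-overlap scorer merely stands in for a real cross-encoder, which would run the (query, document) text pair through a transformer.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def two_stage_search(
    query_text: str,
    query_vec: Sequence[float],
    docs: Sequence[Tuple[str, Sequence[float]]],
    score_pair: Callable[[str, str], float],
    first_k: int = 100,
    final_k: int = 10,
) -> List[str]:
    """Stage 1: cosine top-k over embeddings. Stage 2: re-score text pairs jointly."""
    candidates = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)[:first_k]
    reranked = sorted(candidates, key=lambda d: score_pair(query_text, d[0]), reverse=True)
    return [text for text, _ in reranked[:final_k]]


def word_overlap(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: counts shared lowercase words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


docs = [
    ("cats purr", [1.0, 0.0]),
    ("dogs bark", [0.9, 0.1]),
    ("stock prices", [0.0, 1.0]),
]
print(two_stage_search("dogs bark loudly", [1.0, 0.0], docs, word_overlap, first_k=2, final_k=1))
```

The design point is that the second stage only ever sees `first_k` candidates, so an expensive joint scorer stays affordable, and its ranking does not depend on the embedding geometry at all.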
Go further
If high-D is mostly noise, why do embedding models use 2048+ dimensions at all?
Capacity. Even though most of the space is noise, the model needs enough room to encode many near-orthogonal feature directions (feature superposition). The high-dim space gets compressed at index time via [truncation](/concepts/mrl-matryoshka/) and [quantization](/concepts/embedding-quantization/) once the relevance signal is captured.
First-pass dense retrieval suffers most from the curse — distance concentration makes top-k a noisy signal. A [cross-encoder reranker](/concepts/cross-encoder/) doesn't compute pairwise distances at all; it scores each candidate jointly with the query, so it sidesteps the geometric collapse entirely.
What's the formal bound on how badly random projection distorts distances?
The Johnson-Lindenstrauss lemma: any N points can be mapped into O(log N / ε²) dimensions while preserving all pairwise distances within a factor of (1 ± ε). The original dimension drops out of the bound entirely, which is one reason aggressive dimension reduction works.
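To get a feel for the constants hidden in the O(·), the sketch below evaluates one commonly quoted explicit form of the bound, k ≥ 4 ln N / (ε²/2 − ε³/3), due to Dasgupta and Gupta. Choosing this particular form is an assumption made for illustration; other statements of the lemma use different constants. `jl_min_dim` is a hypothetical helper name.

```python
import math


def jl_min_dim(n_points: int, eps: float) -> int:
    """Target dimension sufficient for (1 +/- eps) distortion of squared distances.

    Uses k >= 4 ln(N) / (eps^2/2 - eps^3/3). Note the original
    dimension does not appear anywhere in the formula.
    """
    if n_points < 2:
        raise ValueError("need at least two points")
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must be in (0, 1)")
    return math.ceil(4.0 * math.log(n_points) / (eps ** 2 / 2.0 - eps ** 3 / 3.0))


for n in (10_000, 1_000_000):
    for eps in (0.1, 0.3, 0.5):
        print(f"N={n:>9,}  eps={eps}  k>={jl_min_dim(n, eps):,}")
```

For tight ε the worst-case bound is large, well above typical embedding widths, so it is a guarantee for arbitrary point sets and not a recipe for choosing a truncation length. Learned embeddings tend to tolerate far smaller dimensions because their structure is already low-dimensional.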