Also known as: Principal Component Analysis, principal components, PCA
TL;DR
PCA rotates a dataset to align with its directions of maximum variance, then projects onto the top components. Computed via SVD of the centered data matrix.
Principal Component Analysis takes a dataset of $n$ points in $d$-dimensional space and rotates the coordinate system to align with the directions of maximum variance. Project onto the top $k$ rotated axes and you have a $k$-dimensional representation that captures more variance than any other linear projection of the same rank. The recipe is one of the oldest and most reliable tools in applied statistics, and it remains the right first move for understanding the structure of any high-dimensional dataset.
The algorithm in three lines
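Given a data matrix $X \in \mathbb{R}^{n \times d}$ (rows are samples, columns are features):
Center. Subtract the column mean: $X_c = X - \mathbf{1}\mu^\top$, where $\mu = \frac{1}{n} X^\top \mathbf{1}$.
Decompose. Compute the SVD: $X_c = U \Sigma V^\top$.
Project. The top-$k$ principal components are the first $k$ columns of $V$. The reduced representation is $Z = X_c V_k$.
That’s it. The columns of $V$ are eigenvectors of the covariance matrix $C = \frac{1}{n-1} X_c^\top X_c$; the singular values $\sigma_i$ in $\Sigma$ are the square roots of the eigenvalues $\lambda_i$, scaled: $\sigma_i = \sqrt{(n-1)\,\lambda_i}$. The fraction of total variance captured by the top $k$ components is the explained-variance ratio:
$$\mathrm{EVR}(k) = \frac{\sum_{i=1}^{k} \sigma_i^2}{\sum_{i=1}^{\min(n,d)} \sigma_i^2}$$
Plot $\mathrm{EVR}(k)$ versus $k$ (the “scree plot”), and the elbow tells you where adding more components stops paying for itself.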
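The three steps translate almost directly into NumPy. The following is a minimal sketch rather than a reference implementation; the function name `pca_svd` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def pca_svd(X, k):
    """PCA via thin SVD of the centered data matrix (rows = samples)."""
    mu = X.mean(axis=0)                                  # column means
    Xc = X - mu                                          # step 1: center
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)    # step 2: decompose
    components = Vt[:k]                                  # top-k directions, one per row
    Z = Xc @ components.T                                # step 3: project, shape (n, k)
    evr = np.cumsum(S**2) / np.sum(S**2)                 # cumulative explained-variance ratio
    return Z, components, mu, evr

rng = np.random.default_rng(0)
# Synthetic data (an assumption): 500 samples, 10 features, two dominant latent directions.
latent = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 10))
X = latent + 0.05 * rng.normal(size=(500, 10))

Z, components, mu, evr = pca_svd(X, k=2)
print(Z.shape)      # (500, 2)
print(evr[:3])      # cumulative ratio for k = 1, 2, 3
```

Plotting `evr` against the component index gives the curve whose elbow is discussed above. Keep `mu` and `components` around: new points must be centered with the same mean and projected with the same directions.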
PCA is the optimal linear projection for variance preservation. Anything that beats it is exploiting nonlinear structure (autoencoders, Matryoshka heads, t-SNE) or hand-tuned objectives.
SVD vs. eigendecomposition: pick SVD
Textbooks introduce PCA as eigendecomposition of the covariance matrix $C = \frac{1}{n-1} X_c^\top X_c$ of the centered data $X_c$. Don’t actually do that in code. Forming $C$ explicitly costs $O(nd^2)$ flops, materializes a $d \times d$ matrix that may not fit in memory for large $d$, and squares the condition number, so floating-point precision suffers when singular values span many orders of magnitude.
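SVD on the centered matrix $X_c$ directly avoids all three problems: you never form the covariance matrix, the cost is $O(nd\,\min(n,d))$ for a thin SVD (or roughly $O(ndk)$ with a truncated randomized solver), and the conditioning is preserved. Most production PCA code paths go via SVD: scikit-learn’s PCA, torch.pca_lowrank, or numpy.linalg.svd applied to the centered matrix. Note that scikit-learn’s TruncatedSVD and torch.svd_lowrank do not center the data for you, so they match PCA only on input that is already centered.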
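The conditioning argument can be checked numerically. The sketch below builds a synthetic centered matrix with known singular values (the sizes and the spectrum are assumptions chosen for illustration) and recovers them by both routes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5

# Build a centered matrix with known singular values spanning six orders of magnitude.
A = rng.normal(size=(n, d))
A -= A.mean(axis=0)                            # centered columns, so Q's columns sum to ~0
Q, _ = np.linalg.qr(A)                         # orthonormal left factor, shape (n, d)
W, _ = np.linalg.qr(rng.normal(size=(d, d)))   # orthogonal right factor
true_s = np.array([1.0, 1e-1, 1e-2, 1e-4, 1e-6])
Xc = Q @ np.diag(true_s) @ W.T

# Route 1: SVD of the centered data directly.
s_svd = np.linalg.svd(Xc, compute_uv=False)

# Route 2: eigendecomposition of Xc^T Xc (the covariance matrix up to a 1/(n-1) factor).
G = Xc.T @ Xc
evals = np.linalg.eigvalsh(G)[::-1]            # descending order
s_eig = np.sqrt(np.clip(evals, 0.0, None))

cond_X = s_svd[0] / s_svd[-1]
cond_G = np.linalg.cond(G)
rel_err_svd = abs(s_svd[-1] - true_s[-1]) / true_s[-1]
rel_err_eig = abs(s_eig[-1] - true_s[-1]) / true_s[-1]
print(f"cond(Xc) = {cond_X:.3g}, cond(Xc^T Xc) = {cond_G:.3g}")
print(f"relative error on smallest singular value: svd {rel_err_svd:.1e}, eig {rel_err_eig:.1e}")
```

The condition number of the Gram matrix should come out close to the square of the data matrix's condition number, and one would expect the eigendecomposition route to recover the smallest singular value with noticeably less precision. The exact digits depend on the platform's LAPACK build.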
A linear autoencoder with bottleneck width $k$, trained with mean-squared reconstruction loss, recovers the subspace spanned by the top-$k$ principal components. The solution is unique only up to an invertible linear map within the bottleneck (a rotation if encoder and decoder weights are tied). Up to that map, the encoder weight matrix converges to $V_k^\top$, where $V_k$ holds the top-$k$ right singular vectors of the centered data. So PCA is the closed-form solution to the simplest possible representation-learning objective. Modern nonlinear autoencoders and trained dimensionality reductions like MRL are what you get when you let the encoder become nonlinear and replace MSE with a task-specific loss. PCA is still the right baseline to compare against, because if your fancy method doesn’t beat PCA, it’s not adding anything.
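The equivalence can be checked without training anything. The sketch below treats PCA as a linear encoder/decoder pair, confirms that its reconstruction loss equals the sum of the discarded squared singular values (the Eckart-Young optimum for rank $k$), and shows that remixing the bottleneck with an invertible matrix leaves the reconstruction unchanged. The mixing matrix `A` and the random comparison encoder `R` are hypothetical, introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8)) @ rng.normal(size=(8, 8))
Xc = X - X.mean(axis=0)
k = 3

_, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Vk = Vt[:k].T                                   # top-k directions, shape (d, k)

# PCA as a linear autoencoder: codes = Xc @ Vk, reconstruction = codes @ Vk^T.
recon_pca = (Xc @ Vk) @ Vk.T
loss_pca = np.sum((Xc - recon_pca) ** 2)
loss_bound = np.sum(S[k:] ** 2)                 # best achievable rank-k squared error

# Reparameterize the bottleneck with an invertible, non-orthogonal matrix A.
Q, _ = np.linalg.qr(rng.normal(size=(k, k)))
A = Q @ np.diag([0.5, 1.0, 2.0])                # hypothetical mixing, well-conditioned
encoder = Vk @ A
decoder = np.linalg.solve(A, Vk.T)              # A^{-1} Vk^T
recon_mixed = (Xc @ encoder) @ decoder

# A random orthonormal encoder of the same width, for comparison.
R, _ = np.linalg.qr(rng.normal(size=(Xc.shape[1], k)))
loss_random = np.sum((Xc - (Xc @ R) @ R.T) ** 2)

print(np.isclose(loss_pca, loss_bound))         # PCA attains the rank-k optimum
print(np.allclose(recon_pca, recon_mixed))      # bottleneck mixing changes nothing
print(loss_random >= loss_pca)                  # any other rank-k projector does no better
```

This is why a trained linear autoencoder's weights need not look like the principal components themselves: only the spanned subspace is pinned down by the loss.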
Where PCA shows up in modern ML
PCA’s role has shifted from “the dimensionality reduction” to “the diagnostic and the baseline”:
Live PCA use cases
Embedding visualization. Project an embedding corpus onto its top-2 or top-3 PCs to spot cluster structure, drift, or duplicates. Faster than UMAP and gives you axes you can reason about.
Whitening. Divide each projected coordinate by its standard deviation, $\sigma_i / \sqrt{n-1}$ (the $i$-th singular value of the centered data, rescaled by the sample count $n$), to produce a representation with identity covariance. Used as a preprocessing step before cosine similarity when you want raw geometry independent of feature scale.
Index compression. Reduce a corpus of 2048-dim embeddings to 256 dim via PCA before feeding into HNSW. Almost always loses to Matryoshka by a few percent NDCG, but requires zero training.
Drift detection. Track the top few PCs of an embedding stream over time. A sudden shift in the principal directions or a change in the explained-variance ratio is an early warning that the upstream model or input distribution has moved.
Initialization for nonlinear methods. UMAP and t-SNE benefit from PCA preconditioning because it removes most of the noise variance before the expensive nonlinear step runs.
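Of the use cases above, whitening is the one where the scaling is easiest to get wrong, so here is a minimal sketch. The helper name `pca_whiten`, the `eps` guard, and the synthetic data are assumptions; the code divides by the per-component standard deviation (the singular value over $\sqrt{n-1}$) so that the sample covariance comes out as the identity.

```python
import numpy as np

def pca_whiten(X, k, eps=1e-12):
    """Project onto the top-k PCs and rescale each to unit variance."""
    mu = X.mean(axis=0)
    Xc = X - mu
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    std = S[:k] / np.sqrt(X.shape[0] - 1)       # per-component standard deviation
    return (Xc @ Vt[:k].T) / (std + eps)        # eps guards against zero-variance components

rng = np.random.default_rng(0)
# Synthetic features with very different scales (an assumption for illustration).
X = rng.normal(size=(1000, 6)) * np.array([10, 5, 2, 1, 0.5, 0.1])

Zw = pca_whiten(X, k=4)
cov = np.cov(Zw, rowvar=False)
print(np.round(cov, 3))                         # close to the 4x4 identity
```

Dividing by the raw singular values instead would give a covariance of $I/(n-1)$, which is harmless for cosine similarity but changes the scale of Euclidean distances.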
PCA versus learned truncation for retrieval
The clean comparison is between PCA-truncating a fixed embedding model and using a model that was trained with truncation in mind. Empirically, on standard retrieval benchmarks (BEIR, MTEB):
A 2560-dim embedding model PCA-reduced to 512 dim typically loses 2–5% on NDCG@10 versus the full-dim baseline.
The same model trained with Matryoshka objectives, evaluated at 512 dim, loses ~1%.
The gap exists because Matryoshka explicitly optimizes for the truncated representation during training, while PCA can only find the best linear projection of an already-trained model after the fact. But the gap is narrow enough that PCA is the right starting point: it’s free, it takes a few lines of numpy on the document corpus, and it beats random projection (the JL lemma baseline) by a meaningful margin.
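In practice the projection is fitted once on the document corpus and then applied unchanged to queries. The sketch below shows that workflow on synthetic vectors; the helper names (`fit_pca`, `reduce`, `top1`), the data, and the brute-force search standing in for an ANN index are all assumptions, and the agreement figure it prints says nothing about real benchmark numbers.

```python
import numpy as np

def fit_pca(corpus, k):
    """Fit on the document corpus only; return the mean and top-k directions."""
    mu = corpus.mean(axis=0)
    _, _, Vt = np.linalg.svd(corpus - mu, full_matrices=False)
    return mu, Vt[:k].T                          # Vk has shape (d, k)

def reduce(emb, mu, Vk):
    """Apply the corpus-fitted projection, then re-normalize for cosine search."""
    Z = (emb - mu) @ Vk
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

def top1(queries, docs):
    """Brute-force nearest neighbor by dot product (stand-in for an ANN index)."""
    return np.argmax(queries @ docs.T, axis=1)

def unit(A):
    return A / np.linalg.norm(A, axis=1, keepdims=True)

rng = np.random.default_rng(0)
# Synthetic stand-in for embeddings: 2000 docs, 64 dims, roughly rank-8 structure.
docs = rng.normal(size=(2000, 8)) @ rng.normal(size=(8, 64)) + 0.01 * rng.normal(size=(2000, 64))
src = rng.choice(2000, size=100, replace=False)
queries = docs[src] + 0.05 * rng.normal(size=(100, 64))   # noisy copies of known docs

full_hits = top1(unit(queries), unit(docs))
mu, Vk = fit_pca(docs, k=16)
reduced_hits = top1(reduce(queries, mu, Vk), reduce(docs, mu, Vk))
agreement = np.mean(full_hits == reduced_hits)
print(f"top-1 agreement between 64-dim and 16-dim search: {agreement:.2f}")
```

The important design choice is that `mu` and `Vk` come from the corpus alone, so the index and the query path share one fixed linear map.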
What PCA isn’t
PCA is linear. It can’t unfold a Swiss roll, separate two interleaved half-moons, or undo the geometry of nonlinear feature interactions. When your data has a manifold structure (most learned embeddings do, locally), PCA captures the best-fitting flat subspace through the data mean, which matches the manifold’s tangent space only locally. That is useful for variance preservation and useless if you wanted to recover the manifold itself. For that you reach for kernel PCA, Isomap, UMAP, or autoencoders.
PCA also assumes that high-variance directions are interesting directions. That’s usually true for natural data (signal variance tends to dominate noise variance) and false in adversarial settings: if there’s a single noisy feature with huge variance, PCA’s top component will be that noise, and you’ll throw away the signal. Standardize features before PCA when their scales differ, or use a robust variant (robust PCA, for example) when outliers contaminate the covariance estimate.
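The noisy-feature failure mode is easy to reproduce. In the sketch below (synthetic data and the helper name `top_component` are assumptions), three features share a common signal at unit scale and a fourth is pure noise at 100 times the scale. The top component is computed with and without standardization.

```python
import numpy as np

def top_component(X):
    """First principal direction of X (rows = samples)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)
X = np.column_stack([
    signal + 0.1 * rng.normal(size=n),   # features 0-2: shared signal, unit scale
    signal + 0.1 * rng.normal(size=n),
    signal + 0.1 * rng.normal(size=n),
    100.0 * rng.normal(size=n),          # feature 3: pure noise, huge variance
])

pc_raw = top_component(X)
pc_std = top_component(X / X.std(axis=0))    # standardize each feature first

print("raw data, weight on the noisy feature:         ", abs(pc_raw[3]))
print("standardized data, weight on the noisy feature:", abs(pc_std[3]))
```

On the raw data the first component should sit almost entirely on the noisy feature; after standardization it should spread across the three correlated signal features instead.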
Go further
PCA or Matryoshka for shrinking embeddings?
Matryoshka, almost always — but the gap is smaller than people assume. PCA on a held-out corpus typically loses 2–5% NDCG@10 versus full-dim, while a Matryoshka-trained model at the same target dim loses ~1%. PCA's advantage: zero training cost, works on any model. Use PCA as the no-effort baseline before deciding if learned truncation is worth the spend.
How does PCA relate to the Johnson-Lindenstrauss lemma?
JL says a random projection into $O(\log n / \varepsilon^2)$ dimensions preserves pairwise distances within a factor of $1 \pm \varepsilon$ with high probability. PCA picks the best projection for the data you have: it preserves variance optimally. PCA wins on actual distortion when the data has low-rank structure (as embeddings do); JL wins when you don't have time to compute the SVD or when the data is genuinely full-rank.
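A small experiment makes the comparison concrete. The sketch below measures the mean relative distortion of pairwise squared distances under PCA and under a Gaussian random projection of the same width. The data are synthetic with roughly rank-8 structure, and the sizes and the helper `pairwise_sq_dists` are assumptions for illustration.

```python
import numpy as np

def pairwise_sq_dists(A):
    """Squared Euclidean distances for all unordered pairs of rows."""
    sq = np.sum(A**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * A @ A.T
    return D[np.triu_indices(len(A), k=1)]

rng = np.random.default_rng(0)
n, d, k = 300, 256, 16
X = rng.normal(size=(n, 8)) @ rng.normal(size=(8, d)) + 0.01 * rng.normal(size=(n, d))
Xc = X - X.mean(axis=0)

# PCA projection onto the top-k directions.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z_pca = Xc @ Vt[:k].T

# Gaussian random projection with the usual 1/sqrt(k) scaling.
R = rng.normal(size=(d, k)) / np.sqrt(k)
Z_rand = Xc @ R

d_full = pairwise_sq_dists(Xc)
err_pca = np.mean(np.abs(pairwise_sq_dists(Z_pca) / d_full - 1))
err_rand = np.mean(np.abs(pairwise_sq_dists(Z_rand) / d_full - 1))
print(f"mean relative distortion: PCA {err_pca:.2e}, random projection {err_rand:.2e}")
```

On low-rank data like this, PCA only discards the noise directions, so its distortion should be far smaller. With genuinely full-rank data the two would be much closer.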
PCA preserves global variance, not local neighborhoods. Two clusters that are clearly separable in the original 2048-D space can collapse onto each other in the top-2 PCs if their separating direction is a low-variance one. For neighborhood-preserving 2-D plots use UMAP or t-SNE; reach for PCA when you want axes you can interpret as orthogonal directions of spread, not as a faithful 2-D map.
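The collapse described above can be reproduced on synthetic data. In this sketch (the dimensions, scales, and the separation score `dprime` are assumptions), two clusters differ only along one low-variance axis. The score is the gap between class means divided by the pooled standard deviation, computed along the true separating axis and along each of the top-2 PCs.

```python
import numpy as np

def dprime(v, labels):
    """Gap between class means in units of pooled standard deviation."""
    a, b = v[labels == 0], v[labels == 1]
    return abs(a.mean() - b.mean()) / np.sqrt(0.5 * (a.var() + b.var()))

rng = np.random.default_rng(0)
n, d = 500, 20
scales = np.array([10.0, 8.0] + [0.2] * (d - 2))   # two high-variance axes, the rest tiny
X = rng.normal(size=(2 * n, d)) * scales
labels = np.repeat([0, 1], n)
X[labels == 1, -1] += 3.0                          # clusters differ only on the last axis

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                                  # the usual 2-D PCA plot coordinates

sep_original = dprime(X[:, -1], labels)
sep_pcs = [dprime(Z[:, j], labels) for j in range(2)]
print(f"separation along the true axis: {sep_original:.1f}")
print(f"separation along PC1, PC2:      {sep_pcs[0]:.2f}, {sep_pcs[1]:.2f}")
```

The clusters are many standard deviations apart along the last axis, yet the top-2 PCs follow the two high-variance axes, so a scatter plot of `Z` would show a single blob.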