Singular Value Decomposition (SVD)

Also known as: singular value decomposition, thin SVD, truncated SVD, compact SVD, Eckart-Young

TL;DR

$A = U \Sigma V^\top$: every real matrix decomposes into a rotation, an axis-aligned stretch, and a second rotation. The single most-used matrix factorization in ML: it powers PCA, LoRA, low-rank attention, embedding quantization, SSA, and the spectral analysis of any linear map.

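Every real matrix $A \in \mathbb{R}^{m \times n}$ admits a factorization

$$A = U \Sigma V^\top$$

where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal and $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal with non-negative entries $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$. The $\sigma_i$ are the singular values, the columns of $U$ the left singular vectors, and the columns of $V$ the right singular vectors. Existence is unconditional: symmetry, full rank, and squareness are not required, which makes SVD the universal factorization for linear maps between Euclidean spaces.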
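As a quick numerical check of the definition, the sketch below factors a random rectangular matrix with numpy.linalg.svd and confirms the stated properties. The matrix size and the random seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.standard_normal((m, n))            # arbitrary rectangular matrix

# full_matrices=True returns U (m x m), s (length min(m, n)), Vt (n x n)
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Embed the singular values in an m x n "diagonal" matrix Sigma
Sigma = np.zeros((m, n))
Sigma[:n, :n] = np.diag(s)

reconstruction_error = np.linalg.norm(A - U @ Sigma @ Vt)
u_is_orthogonal = np.allclose(U.T @ U, np.eye(m))
v_is_orthogonal = np.allclose(Vt @ Vt.T, np.eye(n))
# Singular values come back non-negative and sorted in descending order
sorted_nonnegative = bool(np.all(s >= 0) and np.all(np.diff(s) <= 0))
```

Note that NumPy returns the transpose $V^\top$ directly (named Vt here) and the singular values as a 1-D array rather than as the matrix $\Sigma$.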

Geometric reading

A linear map sends the unit sphere of its domain to a hyperellipsoid in its codomain. SVD names the three operations that compose the map: $V^\top$ rotates the domain to align input axes with the principal axes of the hyperellipsoid; $\Sigma$ stretches each axis by $\sigma_i$ (zeroing out directions outside the rank); $U$ rotates the result into its final orientation. Every linear map is, up to choice of basis, a diagonal stretch.

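SVD decomposes any linear map into a rotation, an axis-aligned stretch, and a second rotation (each "rotation" is an orthogonal map and may include a reflection). The singular values are uniquely determined; the singular vectors are unique only up to sign, and up to a choice of basis within any subspace where singular values repeat. Singular values are the stretch factors; singular vectors are the axes.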
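The geometric picture can be checked numerically. The sketch below uses an arbitrary 3x2 example matrix (chosen only for illustration) to confirm that each right singular vector $v_i$ is mapped to $\sigma_i u_i$, and that the images of unit vectors have lengths between the smallest and largest singular value.

```python
import numpy as np

# Illustrative matrix mapping the plane into 3-D space
A = np.array([[3.0, 1.0],
              [1.0, 2.0],
              [0.0, 1.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Each input axis v_i (row i of Vt) lands on sigma_i * u_i (column i of U)
maps_axes = all(np.allclose(A @ Vt[i], s[i] * U[:, i]) for i in range(len(s)))

# Sample the unit circle and measure how far each point is stretched
theta = np.linspace(0.0, 2.0 * np.pi, 2000)
circle = np.stack([np.cos(theta), np.sin(theta)])    # shape (2, 2000)
lengths = np.linalg.norm(A @ circle, axis=0)

longest, shortest = lengths.max(), lengths.min()     # approx. semi-axes of the ellipse
```

The sampled extremes only approximate $\sigma_1$ and $\sigma_2$ because the circle is discretized; the singular values themselves are the exact semi-axis lengths.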

Relation to eigendecomposition

The two Gram matrices satisfy

$$A^\top A = V \Sigma^\top \Sigma V^\top, \qquad A A^\top = U \Sigma \Sigma^\top U^\top$$

so right singular vectors are eigenvectors of $A^\top A$, left singular vectors are eigenvectors of $A A^\top$, and singular values are the non-negative square roots of the shared nonzero eigenvalues. SVD therefore generalizes eigendecomposition to non-square and non-symmetric matrices and drops the requirement that eigenvalues be real or that a full eigenbasis exist.
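A minimal sketch of this relation, assuming a random test matrix with distinct singular values: the eigenvalues of $A^\top A$ are compared with the squared singular values, and its eigenvectors with the right singular vectors (which can differ by sign).

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# eigh handles symmetric matrices and returns eigenvalues in ascending order
evals, evecs = np.linalg.eigh(A.T @ A)
evals, evecs = evals[::-1], evecs[:, ::-1]           # reorder to descending

# Clip tiny negative round-off before taking square roots
sigma_from_gram = np.sqrt(np.clip(evals, 0.0, None))

# |<eigenvector_i, v_i>| should be 1: same direction up to sign
alignment = np.abs(np.sum(evecs * Vt.T, axis=0))
```

This is for illustration only. Going through $A^\top A$ squares the condition number, so it is not how the SVD should be computed in practice.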

Truncated SVD and Eckart-Young

Keeping only the top-$k$ singular triples gives the rank-$k$ approximation

$$A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^\top = U_k \Sigma_k V_k^\top$$

Eckart-Young: $A_k$ is the closest matrix of rank at most $k$ to $A$ in both the Frobenius and spectral norms. No other rank-$k$ matrix matches $A$ more closely under either norm. This is the extremal property that justifies every low-rank method in ML: the singular spectrum names the cheapest way to throw away information.
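The theorem also fixes the size of the error: the spectral-norm error of $A_k$ is $\sigma_{k+1}$ and the Frobenius-norm error is $\sqrt{\sum_{i>k} \sigma_i^2}$. The sketch below checks both on a random matrix and compares against one competing rank-$k$ matrix. The competitor (a projection onto a random subspace) is an arbitrary choice for illustration, not a proof of optimality.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
# Sum of the top-k terms sigma_i * u_i * v_i^T
A_k = (U[:, :k] * s[:k]) @ Vt[:k]

spectral_error = np.linalg.norm(A - A_k, ord=2)
frobenius_error = np.linalg.norm(A - A_k, ord="fro")

# Competing rank-k matrix: project A onto a random k-dimensional subspace
Q, _ = np.linalg.qr(rng.standard_normal((6, k)))
B = A @ Q @ Q.T
competitor_error = np.linalg.norm(A - B, ord="fro")
```

Here s[k] is $\sigma_{k+1}$ because NumPy arrays are zero-indexed.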

Where SVD anchors in ML

Most matrix-spectrum machinery in modern ML is SVD with a wrapper around it.

SVD throughout the stack
  • PCA is the SVD of the centered data matrix: right singular vectors are principal components, and $\sigma_i^2 / (N - 1)$ are the explained variances for $N$ samples.
  • A low-rank adapter writes the weight update as $\Delta W = BC$ with rank $r$ much smaller than either dimension of $W$. Eckart-Young justifies the parameterization: the best rank-$r$ approximation of any weight update is its truncated SVD.
  • Training an embedding so any prefix of its dimensions stays useful is a learned analogue of SVD truncation: the prefix behaves like the top singular directions.
  • Rotating into the singular-vector basis before quantizing concentrates variance into a few coordinates and lowers error per bit.
  • SSA decomposes a time series by SVD of its trajectory (Hankel) matrix; trend, oscillation, and noise are read off the singular spectrum.
  • Spectral norm and condition number. $\|A\|_2 = \sigma_1$ and $\kappa(A) = \sigma_1 / \sigma_{\min}$: optimization stability quantities are direct SVD readouts.
  • Low-rank approximations of attention and MLP weights expose “circuits”: concentrated directions in singular-vector space tied to interpretable features.
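Two of the items above can be made concrete in a few lines. The sketch below, on synthetic data invented for illustration, computes PCA from the SVD of a centered data matrix and cross-checks the explained variances against the eigenvalues of the sample covariance; it also reads off the spectral norm and condition number from the singular values.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 200, 5                                         # samples, features (arbitrary)
X = rng.standard_normal((N, d)) @ rng.standard_normal((d, d))   # correlated features
Xc = X - X.mean(axis=0)                               # center each feature

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt                                       # rows are principal directions
explained_variance = s**2 / (N - 1)

# Cross-check: eigenvalues of the sample covariance (np.cov divides by N - 1)
cov_evals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]

# Stability quantities read directly off the singular values
spectral_norm = s[0]
condition_number = s[0] / s[-1]
```

The division by $N - 1$ matches the unbiased sample covariance; libraries that normalize by $N$ instead will differ by that factor.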

Computing the SVD

Dense SVD uses Golub-Reinsch: bidiagonalize via Householder reflections, then iterate implicit-shifted QR until off-diagonals vanish, at $O(mn^2)$ cost for $m \ge n$. For top-$k$ only, randomized SVD (Halko-Martinsson-Tropp, 2011) sketches $A$ against a Gaussian test matrix and takes the SVD of the small sketch in $O(mnk)$, with provably tight accuracy on decaying spectra. Subspace iteration (the power method generalized) is cheapest for the leading singular pair and powers most spectral-norm estimators. Production routines (numpy.linalg.svd, scikit-learn's TruncatedSVD) avoid forming $A^\top A$, because squaring the condition number destroys precision.
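The randomized approach can be sketched in a few lines. The function below is a hypothetical, simplified helper written for this illustration (the name randomized_svd and the parameters oversample and n_power_iter are assumptions, not a library API): it sketches the range of $A$ with a Gaussian test matrix, refines it with a few steps of subspace iteration, and takes the SVD of the small projected matrix.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_power_iter=2, seed=0):
    """Approximate top-k SVD via a Gaussian range sketch (illustrative helper)."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    Omega = rng.standard_normal((n, k + oversample))  # Gaussian test matrix
    Q, _ = np.linalg.qr(A @ Omega)                    # orthonormal basis for the sketch
    for _ in range(n_power_iter):                     # subspace (power) iteration
        Z, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Z)
    B = Q.T @ A                                       # small (k + oversample) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

# Test matrix with a known, geometrically decaying spectrum
rng = np.random.default_rng(4)
m, n = 300, 200
U0, _ = np.linalg.qr(rng.standard_normal((m, n)))
V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
true_s = 2.0 ** -np.arange(n)
A = (U0 * true_s) @ V0.T

U, s, Vt = randomized_svd(A, k=5)
approx_error = np.linalg.norm(A - (U * s) @ Vt, ord=2)
```

The QR re-orthonormalization inside the loop keeps the basis numerically stable. Accuracy depends on how fast the spectrum decays, so results on a flat spectrum would be noticeably worse than on this test matrix.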

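Among factorizations $A = X D Y^\top$ with $D$ diagonal, only SVD has rank-$k$ truncation minimize $\|A - A_k\|$ in every unitarily invariant norm at once. The geometric counterpart is the polar decomposition $A = QP$ of a square matrix, with $Q$ orthogonal and $P = (A^\top A)^{1/2}$: diagonalizing the symmetric positive semi-definite stretch $P = V \Sigma V^\top$ and absorbing $V$ into $Q$ (setting $U = QV$) recovers $A = U \Sigma V^\top$. SVD diagonalizes the stretch part of a linear map after rotation has been factored out.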
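The link to the polar decomposition can be checked directly. A minimal sketch, assuming a random square matrix: the orthogonal factor $Q = U V^\top$ and the stretch $P = V \Sigma V^\top$ are built from the SVD, and their product recovers $A$.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))
U, s, Vt = np.linalg.svd(A)

Q = U @ Vt                          # orthogonal factor (rotation, possibly with reflection)
P = Vt.T @ np.diag(s) @ Vt          # symmetric positive semi-definite stretch

product_matches = np.allclose(Q @ P, A)
q_is_orthogonal = np.allclose(Q.T @ Q, np.eye(4))
p_is_symmetric = np.allclose(P, P.T)
p_min_eigenvalue = np.linalg.eigvalsh(P).min()
p_squared_is_gram = np.allclose(P @ P, A.T @ A)      # P is the square root of A^T A
```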

Go further

What does the Eckart-Young theorem say, and why does it matter for ML?

Truncating the SVD to the top-$k$ singular values gives the best rank-$k$ approximation under both the Frobenius and spectral norms: no other rank-$k$ matrix can match the original more closely. This extremal property is the theoretical guarantee underneath LoRA, MRL prefix truncation, and SSA's low-rank reconstruction. When a method works by keeping a few directions and discarding the rest, the reason it works is Eckart-Young.

What's the difference between full, thin, and truncated SVD?

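For an $m \times n$ matrix $A$, full SVD keeps all singular values and the full square $U$ and $V$. Thin SVD keeps only the first $\min(m, n)$ columns of $U$ and $V$, the natural form for any rectangular $A$; compact SVD goes one step further and also drops the zero singular values, keeping $r = \operatorname{rank}(A)$ triples. Truncated SVD keeps only the top-$k$ singular triples and is the form that matters most in practice for ML: low-rank applications use it.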
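The variants differ only in how many columns and singular values are kept, which is easiest to see from array shapes. The sketch below uses a small 4x3 matrix of rank 2, chosen for illustration; "compact" here means keeping exactly rank-many triples, and $k = 1$ is an arbitrary truncation level.

```python
import numpy as np

A = np.arange(12, dtype=float).reshape(4, 3)          # 4 x 3 matrix of rank 2

# Full SVD: square U (4 x 4) and V^T (3 x 3)
U_full, s_full, Vt_full = np.linalg.svd(A, full_matrices=True)

# Thin SVD: only min(m, n) = 3 columns of U
U_thin, s_thin, Vt_thin = np.linalg.svd(A, full_matrices=False)

# Compact SVD: keep only the r = rank(A) nonzero singular triples
r = np.linalg.matrix_rank(A)
U_c, s_c, Vt_c = U_thin[:, :r], s_thin[:r], Vt_thin[:r]

# Truncated SVD: keep the top-k triples, here k = 1
k = 1
U_k, s_k, Vt_k = U_thin[:, :k], s_thin[:k], Vt_thin[:k]
```

Full, thin, and compact forms all reconstruct $A$ exactly (up to round-off); only the truncated form is an approximation.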

How does SVD relate to eigendecomposition?

SVD generalizes eigendecomposition to arbitrary rectangular and non-symmetric matrices. The singular values of $A$ are the (non-negative) square roots of the eigenvalues of $A^\top A$ (or equivalently, for the nonzero ones, $A A^\top$), and the right singular vectors are the eigenvectors of $A^\top A$. For a symmetric positive semi-definite $A$, the SVD and eigendecomposition coincide up to sign.
