Matrix Factorization

Also known as: matrix decomposition, low-rank approximation, NMF, factor model

TL;DR

Writing a matrix as a product of smaller or more structured matrices. SVD, NMF, QR, LU, Cholesky, eigendecomposition — same general idea under different structural constraints. Underlies essentially every low-rank method in modern machine learning.

Matrix factorization writes a matrix $A$ (a 2-tensor whose rows and columns are vectors) as a product $A = BC$ (or $A = BCD$) whose factors are smaller, sparser, or more structurally constrained than $A$ itself. Every factorization reflects a choice of constraint (orthogonality, triangularity, non-negativity, diagonality, low rank), and the constraint is what makes the factors useful. Without one, $A = AI$ is a factorization and tells you nothing.

The standard family

SVD: $A = U \Sigma V^\top$ with $U$, $V$ orthogonal and $\Sigma$ diagonal non-negative. Exists for every matrix; gives the optimal rank-$k$ approximation under Frobenius and spectral norms (Eckart-Young).
Eigendecomposition: $A = P \Lambda P^{-1}$ for square diagonalizable $A$. Columns of $P$ are invariant directions; $\Lambda$ holds the scaling factors. Fails on defective matrices.
QR: $A = QR$ with $Q$ orthogonal, $R$ upper-triangular. The workhorse for least-squares and the inner loop of iterative eigenvalue solvers.
LU / Cholesky: $A = LU$ or, for symmetric positive-definite $A$, $A = LL^\top$. Once the factors are computed, their triangular structure makes $Ax = b$ solvable in $O(n^2)$ by forward and back substitution.
NMF: $A \approx WH$ with $W, H \ge 0$ entrywise. Used wherever the assumption that non-negative parts combine additively is faithful: topic models, parts-of-images decomposition.
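
The family above can be checked numerically. The following is a minimal sketch using NumPy, assuming small random test matrices (the names `A` and `S` are illustrative, not from the text); each factorization is computed and then multiplied back together to confirm it reproduces the input.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))       # generic rectangular matrix (illustrative)
S = A.T @ A + 1e-3 * np.eye(3)        # symmetric positive-definite 3x3 matrix

# SVD: A = U diag(s) Vt, singular values s are non-negative and sorted descending.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
svd_ok = np.allclose(A, U @ np.diag(s) @ Vt)

# QR: A = Q R, Q has orthonormal columns, R is upper-triangular.
Q, R = np.linalg.qr(A)
qr_ok = np.allclose(A, Q @ R) and np.allclose(R, np.triu(R))

# Cholesky: S = L L^T with L lower-triangular (needs S positive-definite).
L = np.linalg.cholesky(S)
chol_ok = np.allclose(S, L @ L.T) and np.allclose(L, np.tril(L))

# Eigendecomposition of a symmetric matrix: S = V diag(w) V^T.
w, V = np.linalg.eigh(S)
eig_ok = np.allclose(S, V @ np.diag(w) @ V.T)

print(svd_ok, qr_ok, chol_ok, eig_ok)
```

The symmetric case is used for the eigendecomposition because `np.linalg.eigh` then guarantees real eigenvalues and orthonormal eigenvectors; a general square matrix would need `np.linalg.eig` and an explicit inverse.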

Why this matters in ML

Almost every low-rank technique in modern ML is a matrix factorization in disguise. Many real-world matrices have rapidly decaying singular spectra, so for an $m \times n$ matrix a rank-$k$ approximation with $k \ll \min(m, n)$ captures most of the signal and discards much of the noise.
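
A minimal sketch of that claim, assuming a synthetic matrix built as a rank-3 signal plus small noise (the sizes and noise level are arbitrary choices for illustration). It truncates the SVD at rank 3 and checks the Eckart-Young identity: the Frobenius error equals the norm of the discarded singular values.

```python
import numpy as np

def truncated_svd(A, k):
    """Return the rank-k truncated SVD of A and the full singular spectrum."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    return A_k, s

rng = np.random.default_rng(0)
# Synthetic data: rank-3 signal plus small dense noise (illustrative assumption).
signal = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 40))
A = signal + 0.01 * rng.standard_normal((50, 40))

k = 3
A_k, s = truncated_svd(A, k)

frob_err = np.linalg.norm(A - A_k, "fro")     # actual approximation error
tail_norm = np.sqrt(np.sum(s[k:] ** 2))       # norm of discarded singular values
rel_err = frob_err / np.linalg.norm(A, "fro") # fraction of energy lost

print(np.isclose(frob_err, tail_norm), rel_err)
```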

The factorization viewpoint unifies methods that look unrelated on the surface:

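LoRA parameterizes a weight update as $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, $r \ll \min(d, k)$, training two thin factors instead of the full $d \times k$ matrix.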
PCA is eigendecomposition of the covariance, equivalent to truncated SVD of the centered data matrix.
Word2vec with negative sampling implicitly factorizes a shifted PMI matrix (Levy & Goldberg, 2014); modern dual-encoder embeddings are continuous extensions of the same recipe.
SSA runs SVD on a Hankel trajectory matrix to separate trend, oscillation, and noise.
Matryoshka embeddings train representations whose length-$m$ prefix is itself a usable vector, a discrete version of truncating an SVD at varying $k$.
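
The PCA entry in the list above can be verified directly. This is a minimal sketch, assuming synthetic data with deliberately distinct per-feature scales so the principal directions are well separated; it compares the eigendecomposition of the sample covariance with the SVD of the centered data matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 5 features with distinct scales (illustrative).
X = rng.standard_normal((200, 5)) * np.array([5.0, 4.0, 3.0, 2.0, 1.0])
Xc = X - X.mean(axis=0)                 # center each feature
n = Xc.shape[0]

# Route 1: eigendecomposition of the sample covariance matrix.
cov = Xc.T @ Xc / (n - 1)
evals, evecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
evals, evecs = evals[::-1], evecs[:, ::-1]

# Route 2: SVD of the centered data matrix.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vars = s ** 2 / (n - 1)             # squared singular values, rescaled

same_variances = np.allclose(evals, svd_vars)
# Principal directions agree up to a sign flip per component.
same_directions = np.allclose(np.abs(evecs.T @ Vt.T), np.eye(5), atol=1e-6)

print(same_variances, same_directions)
```

In practice the SVD route is usually preferred because it avoids forming the covariance matrix explicitly.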

Every embedding model can be viewed as a learned low-rank factorization of an implicit matrix. The classic result is Levy & Goldberg (2014): word2vec with negative sampling approximately factors $M = \mathrm{PMI} - \log k$ (the shifted PMI matrix, where $k$ here is the number of negative samples) into two factors whose columns are the word and context embeddings.
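
A minimal sketch of the explicit counterpart to that result: build a shifted PMI matrix from a toy co-occurrence table, clip it at zero (the shifted positive PMI variant), and factor it with a truncated SVD. The counts, the number of negative samples, and the embedding dimension are all made-up values for illustration, and splitting the singular values evenly between the two factors is one common convention rather than the only option.

```python
import numpy as np

# Toy word-by-context co-occurrence counts (hypothetical data).
counts = np.array([
    [10, 2, 0, 1],
    [3, 8, 1, 0],
    [0, 1, 9, 4],
    [1, 0, 5, 7],
], dtype=float)
neg_samples = 2                          # assumed number of negative samples

p_wc = counts / counts.sum()             # joint probabilities
p_w = p_wc.sum(axis=1, keepdims=True)    # word marginals
p_c = p_wc.sum(axis=0, keepdims=True)    # context marginals

with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))     # -inf where a pair never co-occurs

# Shift by log(neg_samples), then clip negatives (and -inf) to zero.
sppmi = np.maximum(pmi - np.log(neg_samples), 0.0)

# Truncated SVD, splitting the singular values between the two factors.
dim = 2
U, s, Vt = np.linalg.svd(sppmi)
word_vecs = U[:, :dim] * np.sqrt(s[:dim])
ctx_vecs = Vt[:dim, :].T * np.sqrt(s[:dim])

approx = word_vecs @ ctx_vecs.T          # rank-2 reconstruction of sppmi
print(np.round(approx, 2))
```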

Modern dual-encoders generalize this: the training objective shapes an implicit similarity matrix $S$ (relevance for DPR, cross-modal alignment for CLIP), and the encoders learn factor matrices whose inner products approximate $S$. The rank of the factorization is at most the embedding dimension, which is why dimensionality is a bottleneck rather than a free parameter: it bounds the rank of the matrix you're trying to reproduce.
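
The rank bottleneck can be illustrated with a minimal sketch, assuming random stand-ins for the query and document embeddings (`Q`, `D`, and the sizes are hypothetical). The score matrix they produce has rank no larger than the embedding dimension, and by Eckart-Young a full-rank target similarity matrix carries an irreducible error for any such factorization.

```python
import numpy as np

rng = np.random.default_rng(0)
n_queries, n_docs, dim = 30, 40, 8

# Stand-ins for encoder outputs: one embedding per row (hypothetical values).
Q = rng.standard_normal((n_queries, dim))
D = rng.standard_normal((n_docs, dim))

scores = Q @ D.T                              # all pairwise inner products
score_rank = np.linalg.matrix_rank(scores)    # cannot exceed dim

# A generic full-rank target cannot be matched exactly by rank-dim factors.
# The best achievable Frobenius error is the norm of the discarded spectrum.
target = rng.standard_normal((n_queries, n_docs))
sv = np.linalg.svd(target, compute_uv=False)
best_possible_err = np.sqrt(np.sum(sv[dim:] ** 2))

print(score_rank, best_possible_err)
```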

Go further

When should I use SVD vs NMF vs eigendecomposition?

SVD is the default for any rectangular matrix where you want the best low-rank approximation under Frobenius or spectral norm: it always exists and can be computed in a numerically stable way. NMF is the right choice when non-negativity is itself meaningful: topic models over word counts, parts-based decomposition of images, dose-response data. Eigendecomposition is for square matrices where invariant directions carry semantics: Markov-chain stationary distributions, attention weight matrices, graph Laplacians.
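
For the NMF case, here is a minimal sketch using Lee-Seung style multiplicative updates for the Frobenius objective. This is one simple algorithm among several, chosen for brevity; the function name `nmf`, the iteration count, and the synthetic input are assumptions for illustration. The updates only ever multiply by non-negative ratios, so the factors stay non-negative throughout.

```python
import numpy as np

def nmf(A, rank, n_iter=200, eps=1e-10, seed=0):
    """Approximate non-negative A as W @ H using multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, rank))            # non-negative random initialization
    H = rng.random((rank, n))
    for _ in range(n_iter):
        # eps guards against division by zero in the denominators.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(1)
# Synthetic non-negative matrix with an exact rank-4 non-negative structure.
A = rng.random((20, 4)) @ rng.random((4, 15))

W1, H1 = nmf(A, rank=4, n_iter=1)        # after a single update sweep
W, H = nmf(A, rank=4, n_iter=200)        # after 200 sweeps, same initialization

err_early = np.linalg.norm(A - W1 @ H1)
err_late = np.linalg.norm(A - W @ H)
print(err_early, err_late)
```

Unlike SVD, the result depends on initialization and is not unique, which is worth keeping in mind when interpreting the factors.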

Why is low-rank approximation almost always the right inductive bias?

Many real-world matrices (data matrices, learned weight matrices, transition matrices) have rapidly decaying singular spectra. The top-$k$ singular values capture most of the Frobenius energy; the long tail is noise or redundancy. LoRA exploits this for weight updates, Matryoshka representation learning (MRL) exploits it for embedding prefixes, SSA exploits it for time-series components. Any time you can pick $k \ll \min(m, n)$ for an $m \times n$ matrix and lose little, factoring is a win.
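
As a concrete instance of the LoRA case, the following minimal sketch compares parameter counts and checks that applying the two thin factors separately gives the same output as merging them into the weight matrix. The dimensions are arbitrary, and the factors are random stand-ins for trained values rather than the result of any actual fine-tuning.

```python
import numpy as np

d, k, r = 512, 512, 8                    # weight is d x k, adapter rank is r
rng = np.random.default_rng(0)

W0 = rng.standard_normal((d, k))         # frozen pretrained weight (stand-in)
B = rng.standard_normal((d, r)) * 0.01   # thin factor, as if after training
A = rng.standard_normal((r, k)) * 0.01   # thin factor, as if after training

delta_W = B @ A                          # low-rank update, rank at most r
x = rng.standard_normal(k)

y_merged = (W0 + delta_W) @ x            # update merged into the weight
y_factored = W0 @ x + B @ (A @ x)        # update applied via the thin factors

full_params = d * k                      # parameters in a full update
lora_params = r * (d + k)                # parameters in the two factors

print(full_params, lora_params, np.allclose(y_merged, y_factored))
```

With these sizes the factored update uses 8,192 parameters against 262,144 for the full matrix, a 32-fold reduction.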

What's the relationship to embeddings?

Every embedding model can be read as a learned low-rank factorization of an implicit co-occurrence or similarity matrix. Word2vec with negative sampling implicitly factorizes a shifted PMI matrix (Levy & Goldberg, 2014), and modern dual-encoders are continuous-data extensions of the same recipe: train two factor matrices so their inner product reproduces a target similarity. The factorization view is what unifies classical matrix completion and modern neural retrieval.
