Product Quantization

Also known as: PQ, product quantizer, asymmetric distance computation, ADC

TL;DR

Product quantization (PQ) compresses a vector by splitting it into M sub-vectors and quantizing each independently against a small codebook learned via K-means.

Product quantization (PQ) is the compression scheme that makes billion-vector retrieval fit in tens of gigabytes of RAM. The idea is geometric: split a high-dimensional vector into M lower-dimensional sub-vectors, then quantize each sub-vector independently against a small codebook of 256 prototypes learned per sub-space. The original vector becomes M bytes. A 768-dim float32 vector — 3072 bytes uncompressed — collapses to 96 bytes at M=96, roughly 32x compression, with recall loss measured in single-digit percent.

PRODUCT QUANTIZATIONSplit, look up, ship one byte per chunk.1 VECTOR · 768 FLOATS · 3072 BM = 96sub 18 dimssub 28 dimssub 38 dimssub 48 dimssub 58 dimssub 68 dims256 protos256 protos256 protos256 protos256 protos256 protos1428 bits178 bits2038 bits888 bits568 bits2318 bitsCOMPRESSED · 96 CODES · 96 BSTORAGE PER VECTOR3072 B96 B32× SMALLERrecall@10 ≈ 0.94 · billion-vector indexes that fit in RAMONE 768-DIMENSIONAL FLOAT VECTOR

How the compression works

Pick M, the number of sub-spaces. Each sub-vector has dimension d/M (d=768, M=96 means 8 dims per sub-vector). For each sub-space, run K-means on the training set with K=256 to find 256 prototype sub-vectors — the codebook for that sub-space. Each codebook is small: 256 * (d/M) floats.

To quantize a vector: split it into M sub-vectors, look up the closest codebook entry in each sub-space, replace the sub-vector with the integer index of that prototype (0-255, fits in one byte). The compressed vector is M bytes. The original is reconstructible (lossily) by concatenating the M chosen prototype sub-vectors.

The alternative is vector quantization (VQ): one big codebook that maps each full vector to one of K prototypes. To match PQ’s compression at M=96, you’d need K = 256^96 prototypes — impossible. PQ exploits the fact that each sub-space can be quantized independently with its own 256-entry codebook, giving an effective vocabulary of 256^M for the joint space at a memory cost of only 256 * M codebook entries. The catch is that PQ assumes sub-spaces are roughly independent — when they aren’t, residual quantization (PQ on the error vector) or optimized PQ (rotating before splitting) recover quality.

The compression ratio is set by M and the codebook size (almost always 256 → 8 bits per sub-code). M is the lever for the compression/quality trade-off.

Why search is fast over compressed vectors

The clever piece is asymmetric distance computation (ADC). At query time:

  1. Take the float query, split into M sub-vectors.
  2. For each sub-space, precompute the distances from the query sub-vector to every one of the 256 codebook entries. This is one 256-entry lookup table per sub-space, computed once per query.
  3. To compute the (approximate) distance from the query to a compressed vector with codes (c_1, c_2, …, c_M): sum table_1[c_1] + table_2[c_2] + … + table_M[c_M]. Just M table lookups and M-1 additions.

A typical d=768 brute-force exact distance is 768 multiply-adds. The ADC version with M=96 is 96 table reads and 95 adds. With cache-friendly memory layout (SIMD over the codes), a modern CPU evaluates millions of PQ distances per second. At billion-vector scale paired with — only scan the top few cells — total query-time distance computations drop into the millions, giving sub-100ms latency.

The standard configuration

Typical PQ settings
  • M = d / 8 is a common default. 768-dim → M=96 (8 dims per sub-vector). Each sub-vector codebook has 256 entries with 8 dims each.
  • 8-bit codes per sub-vector is universal. 4-bit codes exist but are usually a quality loss not worth the extra 2x compression.
  • Codebook training on 100K to a few million sample vectors is enough; running K-means on the full corpus rarely improves quality.
  • Optimized PQ (OPQ) rotates the vector with a learned orthogonal matrix before splitting, improving recall by a few percent for sub-spaces that aren’t naturally aligned with axes.
  • Residual PQ (PQR) quantizes the residual after a coarse quantizer (the IVF cell centroid is the usual coarse step). Recovers some of the quality lost in raw PQ.

What PQ trades

The other unavoidable cost: PQ distances are approximate. Recall@10 vs exact distance is typically 0.92-0.97 on standard configurations — better than enough for when followed by a rescore stage with full-precision vectors on the top candidates. Without rescoring, PQ alone is acceptable for most ANN workloads but not state-of-the-art.

Where PQ shows up

PQ is the default compression in FAISS for any billion-scale index. IVF-PQ and HNSW-PQ are the two dominant configurations: IVF-PQ wins on memory; HNSW-PQ wins on recall/latency at the cost of more graph overhead. Both depend on PQ to make the vector storage tractable. Beyond FAISS, ScaNN uses an asymmetric variant (anisotropic vector quantization) that’s a close relative; Vespa, Milvus, and other engines all ship PQ-based options for memory-constrained deployments. It’s the compression backbone of the field.

Product quantization is what moves dense retrieval from “fits in memory at millions of vectors” to “fits in memory at billions” — without it, ANN at scale is a disk-bound problem.

Go further

How does PQ differ from int8 quantization?

Int8 quantizes each scalar dimension independently to one of 256 values — 4x compression vs float32. PQ quantizes sub-vectors jointly, so a single PQ code can encode much more structure than a per-dimension scalar. Compression is much higher (32-64x typical) and recall is better at high compression rates because PQ exploits coordinate correlations. Int8 is simpler and faster; PQ is denser and slower per distance.

What does 'asymmetric distance computation' mean?

At query time, the query stays uncompressed. For each query, you precompute a small lookup table: distances from the query's sub-vectors to each of the 256 codebook entries per sub-space. Computing the distance to any compressed vector becomes M table lookups and M-1 additions — extremely fast. Asymmetric because query is float, document is quantized.

What's the recall hit, really?

Depends on M (sub-vectors), bits per code (almost always 8), and the embedding's coordinate structure. For 768-dim embeddings with M=96 (8 dims per sub-vector, 8-bit codes), recall@10 vs exact typically lands at 0.92-0.97. Adding a rescore stage with full-precision vectors on the top-100 candidates closes most of the gap with negligible latency cost.

ZeroEntropy
The best AI teams build with ZeroEntropy models
Follow us on
GitHubTwitterSlackLinkedInDiscord