Model Quantization

Also known as: weight quantization, post-training quantization, PTQ, quantization-aware training, QAT, GPTQ, AWQ

TL;DR

Compressing the weights of a trained model from fp16 / bf16 down to int8, int4, fp8, or fp4 representations to fit larger models on smaller hardware and increase inference throughput.

Model quantization compresses the weights of a trained neural network — almost always a transformer, in 2026 — from their training-time representation (fp16, bf16) into a smaller representation (int8, int4, fp8, fp4) for inference. The point is not the disk size; it’s the HBM footprint and the bandwidth needed to feed those weights to the tensor cores. A 70B model in fp16 is 140 GB and won’t fit on a single H100; the same model in int4 is 35 GB and fits comfortably on one, with roughly 3-4x faster decode throughput because the kernel is moving a quarter of the bytes through HBM.
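The footprint arithmetic above is easy to check by hand. The sketch below assumes decimal gigabytes, ignores the small overhead of scales and zero-points, and takes 80 GB as the HBM capacity of a single H100; the function and constant names are illustrative.

```python
# Bytes per weight for each format; scale and zero-point overhead is ignored.
BYTES_PER_WEIGHT = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}
H100_HBM_GB = 80  # assumed capacity of a single H100


def weight_footprint_gb(n_params: float, fmt: str) -> float:
    """Weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_WEIGHT[fmt] / 1e9


for fmt in BYTES_PER_WEIGHT:
    gb = weight_footprint_gb(70e9, fmt)
    verdict = "fits" if gb < H100_HBM_GB else "does not fit"
    print(f"70B in {fmt}: {gb:.0f} GB, {verdict} on one {H100_HBM_GB} GB GPU")
```

Fitting the weights is necessary but not sufficient: the KV cache and activations need room in HBM as well.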

This is distinct from embedding quantization, which compresses retrieval vectors for ANN search. Different math, different goals, similar name.

The bit-width spectrum

Common quantization formats in 2026

fp16 / bf16. Training-default. 2 bytes per weight. The unquantized baseline.
fp8 (E4M3, E5M2). 8-bit floats in two layouts: E4M3 spends more bits on precision, E5M2 spends more on range. 1 byte per weight. Hopper and Blackwell have native fp8 tensor cores. ~1% quality hit on most evals; preferred when the hardware supports it.
int8. 8-bit integers with per-channel scales on weights (and per-token scales on activations when those are quantized too). 1 byte per weight. Nearly free in quality; the production sweet spot when memory pressure isn’t extreme.
int4 / nf4. 4-bit integers, or NormalFloat4, a non-uniform 4-bit format whose levels are matched to the weight distribution. 0.5 bytes per weight. 2-3% quality hit with a good PTQ method; the dominant format for memory-constrained serving.
int3, int2, ternary. Research territory. Quality cliff for most pretrained models; methods like AQLM and QuIP push the frontier.

The non-obvious fact is that the quality curve is not linear in bits. Each halving costs more than the last.

The PTQ → QAT spectrum

The methods for getting from fp16 to a smaller format span a wide range of cost and quality:

Naive PTQ. Round each weight to the nearest int8 (or whatever target) using a single global scale per tensor. Works for int8 on robust models. Falls apart at int4 or on models with outlier channels.
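As a rough illustration of naive PTQ, the sketch below applies symmetric round-to-nearest quantization with one scale for the whole tensor. It uses plain Python lists rather than a tensor library, and the helper names are hypothetical.

```python
def quantize_per_tensor(weights, bits=8):
    """Symmetric round-to-nearest with a single scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8, 7 for int4
    max_abs = max(abs(w) for row in weights for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [[max(-qmax, min(qmax, round(w / scale))) for w in row] for row in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate float weights."""
    return [[v * scale for v in row] for row in q]


W = [[0.50, -1.00], [0.25, 1.27]]
q, scale = quantize_per_tensor(W)
W_hat = dequantize(q, scale)
max_err = max(abs(a - b) for ra, rb in zip(W, W_hat) for a, b in zip(ra, rb))
print(q, scale, max_err)
```

The round-trip error of any weight is bounded by half a quantization step, so everything hinges on how large that step is, which a single large weight can inflate for the whole tensor.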

Calibration-based PTQ. Run a small calibration dataset (128-1024 examples) through the model and observe activation distributions. Use that to pick per-channel or per-group scales. This is where GPTQ, AWQ, and SmoothQuant live. The calibration data doesn’t have to be labeled; it just has to be representative of the inference distribution.

Quantization-aware training (QAT). Insert “fake-quantize” nodes into the training graph (round during forward, identity during backward via the straight-through estimator) and fine-tune for hundreds to thousands of steps. The model learns weights that are robust to the eventual quantization. Recovers most of the loss at aggressive bit widths but costs a real training run.
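To make the fake-quantize idea concrete, here is a toy sketch with a single weight and a hand-written gradient instead of an autograd framework. It assumes a symmetric int4 grid with a fixed scale, and it zeroes the gradient for clipped weights, which is one common variant of the straight-through estimator; all names are illustrative.

```python
def fake_quantize(w, scale, qmax=7):
    """Forward pass: snap w onto the int4 grid but return a float."""
    q = max(-qmax, min(qmax, round(w / scale)))
    return q * scale


def ste_grad(w, scale, upstream, qmax=7):
    """Backward pass: treat rounding as identity; zero the gradient where w was clipped."""
    return upstream if abs(w / scale) <= qmax else 0.0


# Toy problem: one weight, one example, loss = (fake_quantize(w) * x - y) ** 2.
scale, lr = 0.25, 0.1
x, y = 1.0, 0.93  # the best float weight, 0.93, is not on the 0.25 grid
w = 0.0           # latent full-precision weight kept during training
for _ in range(100):
    w_q = fake_quantize(w, scale)
    dloss_dwq = 2.0 * (w_q * x - y) * x
    w -= lr * ste_grad(w, scale, dloss_dwq)

print(w, fake_quantize(w, scale))
```

The optimizer updates the latent float weight, but the loss only ever sees the quantized value. In this toy the latent weight ends up hovering near the rounding boundary between the two grid points adjacent to 0.93, since neither one matches the target exactly.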

A trained transformer’s weight distributions are not Gaussian. A small fraction of channels — often less than 1% — have weights with magnitudes 10-100x larger than the typical channel. If you pick a single global scale to cover the maximum magnitude, the bulk of the weights are quantized to a tiny number of buckets near zero — you’ve spent 250 of the 256 int8 buckets on the outliers and 6 on everything else.

The fix is per-channel (sometimes per-group, e.g., groups of 64-128 weights) scaling: each channel gets its own scale, the outliers occupy their own scale, and the typical channels get the full bucket range to themselves. This is roughly the difference between “int4 destroys the model” and “int4 costs 2% on MMLU.”
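The effect of a single outlier channel can be reproduced on synthetic data. The sketch below is an assumption-laden toy: 63 "typical" channels drawn from a narrow Gaussian, one outlier channel roughly 50x larger, symmetric int4, and error measured only on the typical channels.

```python
import random


def mean_abs_error(rows, scales, qmax):
    """Mean absolute round-trip error when row i is quantized with scales[i]."""
    total, count = 0.0, 0
    for row, s in zip(rows, scales):
        for w in row:
            q = max(-qmax, min(qmax, round(w / s)))
            total += abs(w - q * s)
            count += 1
    return total / count


random.seed(0)
typical = [[random.gauss(0.0, 0.02) for _ in range(64)] for _ in range(63)]
outlier = [random.gauss(0.0, 1.0) for _ in range(64)]  # ~50x larger magnitudes
qmax = 7  # symmetric int4

# One global scale, set by the largest weight anywhere in the tensor.
all_weights = outlier + [v for r in typical for v in r]
global_scale = max(abs(w) for w in all_weights) / qmax
err_global = mean_abs_error(typical, [global_scale] * len(typical), qmax)

# One scale per channel, set by that channel's own largest weight.
channel_scales = [max(abs(w) for w in r) / qmax for r in typical]
err_channel = mean_abs_error(typical, channel_scales, qmax)

print(f"typical-channel error: global={err_global:.5f} per-channel={err_channel:.5f}")
```

With the global scale, the step size is so large that the typical weights collapse onto the zero bucket; per-channel scales give each typical channel the full set of buckets.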

GPTQ, AWQ, and SmoothQuant are all variations on “find the outlier channels via calibration, protect them with their own scaling, and quantize the rest aggressively.” The papers differ in how they identify and protect — GPTQ via second-order error minimization, AWQ via activation magnitudes, SmoothQuant via per-channel rescaling that pushes difficulty from activations into weights — but the core recognition is the same.

The GPTQ / AWQ / SmoothQuant family

These three are the dominant PTQ methods in 2026 production serving. They are not drop-in replacements for one another: GPTQ and AWQ target weight-only int4, while SmoothQuant targets W8A8.

GPTQ. Layer-wise weight-only int4 quantization via Hessian-weighted least-squares. Each linear layer is quantized to minimize the reconstruction error of its output on the calibration set. The dominant choice for “pure weight quantization”: keep activations in fp16, quantize only the weights. Used by most of the int4-on-disk model checkpoints on Hugging Face.
AWQ (Activation-aware Weight Quantization). Identifies weight channels paired with high-magnitude activations as the “salient” channels and protects them with per-channel scaling. Slightly faster to compute than GPTQ and produces comparable quality; the kernel side (awq library) ships highly optimized int4 GEMMs.
SmoothQuant. Targets W8A8 (int8 weights and int8 activations). Migrates the quantization difficulty from activations to weights via a per-channel rescaling, then runs naive int8 PTQ on the easier distributions. Enables tensor-core int8 throughput end-to-end without QAT.

The choice depends on whether activations are also quantized (W8A8 vs W4A16), whether you can afford QAT, and what kernels are available on your target hardware.
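The SmoothQuant migration described above can be sketched in a few lines. The example assumes the per-input-channel factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) with alpha = 0.5 as a balanced choice, nonzero channel maxima, and tiny hand-made matrices; it shows only the rescaling step, not the int8 quantization that follows.

```python
def smooth_scales(X, W, alpha=0.5):
    """Per-input-channel factors s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha)."""
    scales = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(v) for v in W[j])
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales


def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Activations (tokens x in_channels); input channel 1 is the outlier channel.
X = [[0.1, 20.0, -0.2], [-0.3, -35.0, 0.1]]
# Weights (in_channels x out_features).
W = [[0.5, -0.4], [0.02, 0.03], [-0.6, 0.7]]

s = smooth_scales(X, W)
X_s = [[x / sj for x, sj in zip(row, s)] for row in X]  # divide activations by s
W_s = [[w * sj for w in row] for row, sj in zip(W, s)]  # multiply weights by s

Y, Y_s = matmul(X, W), matmul(X_s, W_s)
act_range_before = max(abs(v) for row in X for v in row)
act_range_after = max(abs(v) for row in X_s for v in row)
print(act_range_before, act_range_after)
```

The layer output is mathematically unchanged because the two rescalings cancel, while the activation range shrinks and the weight range grows to meet it.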

KV-cache quantization: a separate axis

Weights aren’t the only thing in HBM. The KV cache at long context can exceed the model size — at 128K context with a 70B model, the KV cache is tens of gigabytes per sequence. Quantizing the KV cache (typically to fp8 or int4 with per-token scaling) is an independent decision from weight quantization, with its own quality curve.

Most production stacks now offer it as a serving flag (--kv-cache-dtype fp8 in vLLM). It composes cleanly with weight quantization — you can run a W4A16 model with an fp8 KV cache and get the bandwidth savings on both axes.
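The "tens of gigabytes per sequence" figure can be sanity-checked with a back-of-the-envelope calculation. The model shape below (80 layers, 8 KV heads under grouped-query attention, head dimension 128) is an assumption for a 70B-class model, not something stated in the text, and per-token scale overhead is ignored.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value):
    """Bytes for one sequence's keys and values across all layers."""
    # The leading 2 accounts for storing both K and V.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed 70B-class shape with grouped-query attention.
cfg = dict(n_layers=80, n_kv_heads=8, head_dim=128)
seq_len = 128 * 1024

for name, nbytes in [("fp16", 2), ("fp8", 1)]:
    gb = kv_cache_bytes(seq_len=seq_len, bytes_per_value=nbytes, **cfg) / 1e9
    print(f"{name} KV cache at 128K tokens: {gb:.1f} GB per sequence")
```

Under these assumptions the fp16 cache for a single 128K-token sequence is larger than the int4 weights of the model itself, and an fp8 cache halves it.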

What you actually get

A reasonable production playbook in 2026:

W8A16 (int8 weights, fp16 activations) via PTQ. Try first. ~1% quality loss, 2x memory savings, modest throughput gain.
W4A16 via GPTQ or AWQ (or GGUF Q4_K_M for CPU / Apple-Silicon targets). When memory pressure is real. ~2-3% quality loss, 4x memory savings, large decode throughput gain in the memory-bound regime.
W8A8 via SmoothQuant. When tensor-core int8 throughput matters more than absolute quality.
fp8 end-to-end. When you’re on H100/H200/B200 and have native fp8 tensor cores. Best quality-per-bit for the hardware that supports it.
QAT. When PTQ degrades beyond tolerance. Reserve for the foundational models where you can amortize the training cost across many deployments.

Quantization is not a uniform compression — it’s a per-layer, per-channel, per-tensor decision. Production stacks that “just quantize the model” leave 1-2% of quality on the table that careful per-tensor sensitivity analysis would have recovered. Embedding tables, output projections, and the first/last transformer layers are usually more sensitive than the middle layers; mixed-precision quantization (int8 for the sensitive layers, int4 for the rest) recovers most of the missing quality at marginal memory cost.
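To see why mixed precision comes at marginal memory cost, consider the average bits per weight. The sketch below assumes 80 equally sized layers and a hypothetical choice of the first two and last two as the sensitive ones; in practice that choice would come from a per-layer sensitivity analysis.

```python
def average_bits(layer_params, layer_bits):
    """Parameter-weighted mean bits per weight across layers."""
    total = sum(layer_params)
    return sum(p * b for p, b in zip(layer_params, layer_bits)) / total


n_layers = 80
layer_params = [1.0] * n_layers  # assume equally sized layers
layer_bits = [4] * n_layers      # int4 everywhere by default
for i in (0, 1, n_layers - 2, n_layers - 1):  # hypothetical "sensitive" layers
    layer_bits[i] = 8            # keep these at int8

avg = average_bits(layer_params, layer_bits)
overhead = avg / 4 - 1
print(f"{avg:.2f} bits/weight, {overhead:.1%} more memory than uniform int4")
```

Promoting a handful of layers to int8 moves the average only slightly above 4 bits, which is why the memory penalty stays small.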

Model quantization is the workhorse optimization that made 70B-class models routinely deployable on single-GPU servers and 400B-class models deployable on 8-GPU nodes. Combined with KV-cache compression and paged serving, it is the difference between “this model is impossible to serve economically” and “this model serves at 1000 tokens/sec/GPU on commodity hardware.”

Go further

What's the difference between PTQ and QAT?

Post-training quantization (PTQ) takes a finished fp16 model and converts the weights to a smaller representation in a single pass — possibly using calibration data to pick scaling factors but without ever updating weights. Cheap, fast, the default. Quantization-aware training (QAT) inserts fake-quantize nodes into the training graph so the model learns weights that are robust to the eventual quantization. Slower (you have to fine-tune for thousands of steps) but recovers most of the quality loss at aggressive bit widths. Production playbook: try PTQ first, fall back to QAT only when PTQ degrades the eval beyond your tolerance — usually below 4 bits.

What do GPTQ, AWQ, and SmoothQuant actually do differently?

All three are PTQ variants that use a small calibration dataset (~128-1024 samples) to choose better per-channel scaling factors than naive min/max. GPTQ uses a layer-wise second-order method — minimizing the reconstruction error of each linear layer's output by solving a Hessian-weighted least-squares — and is the dominant choice for int4 weight-only quantization. AWQ (Activation-aware Weight Quantization) observes that a small fraction of weight channels matter much more than others and protects them with per-channel scaling. SmoothQuant migrates the quantization difficulty from activations to weights via a per-channel rescaling, enabling W8A8 (int8 weights and activations) without QAT. They're all attacking the same problem — outlier channels that wreck naive scaling — from different angles.

Why is sub-4-bit so much harder than int4?

Quality degradation is non-linear in bits-per-weight. Int8 is essentially free on most models — sub-1% perplexity hit. Int4 with a good PTQ method (GPTQ, AWQ) costs 1-3% on standard benchmarks and is the production sweet spot for memory-bound serving in 2026. Drop to 3 bits and the loss curve bends sharply: outlier channels that were tolerable at 4 bits become catastrophic, and you typically need QAT or mixed-precision tricks (int4 most weights, int8 for sensitive layers) to recover. Sub-3-bit is active research — methods like QuIP, AQLM, and bitnet-style ternary representations — and quality at the frontier model scale is still being validated. The intuition: you're losing signal the way a low-bit JPEG loses detail; the first halving is invisible, the third is brutal.
