Why does GPTQ work so well on transformer linear layers specifically?
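The reconstruction loss for a linear layer is exactly quadratic in the weights, so the second-order (Hessian) model behind GPTQ's compensation step is exact for that layer rather than a local approximation.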
Also known as: GPT-Q, Optimal Brain Quantization for transformers, OBQ-derived quantization
Layer-by-layer 4-bit weight quantization that minimizes layer-output reconstruction error using a Hessian computed from a small calibration set.
GPTQ (Frantar et al., 2022) is a layer-by-layer post-training quantization algorithm that pushed transformers to usable INT4 weight-only quantization for the first time. The key idea is borrowed from the 1990s pruning literature: when you change one weight, update the remaining weights to compensate rather than treating each weight independently. GPTQ formalizes that as a closed-form Hessian-weighted correction and makes it tractable for transformer-sized layers via Cholesky factorization and column blocking.
For roughly two years (mid-2023 through 2024) GPTQ was the dominant INT4 format on Hugging Face — millions of checkpoints — and the inference kernels in AutoGPTQ, ExLlama(V2), and vLLM were the first to make INT4 weight-only serving practically faster than FP16.
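The reconstruction loss for a single linear layer is the sum-of-squares error between the FP16 layer’s output and the quantized layer’s output, taken over a calibration set. With weights $W$ ($d_{out} \times d_{in}$), quantized weights $W_q$, and $K$ calibration inputs stacked as the rows of $X$ ($K \times d_{in}$):

$$
\mathcal{L}(W_q) = \frac{1}{K}\,\lVert X W^\top - X W_q^\top \rVert_F^2
$$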
Naive PTQ rounds each weight independently to the nearest INT4 level. GPTQ instead picks the quantization order and the compensation update to the still-unquantized columns so that the layer-output error stays small across the calibration set.
For each layer l, with weight matrix W (d_out x d_in) and calibration X (K x d_in):
1. Forward K calibration inputs through the model up to layer l, capture X.
2. Compute Hessian:
H = 2 * X^T X / K + lambda * I (lambda = small ridge)
H_inv = Cholesky-based inverse, processed column-block by column-block
3. Walk columns of W in some order (often left-to-right within blocks of 128):
for each column j:
a. Quantize: q_j = round_to_INT4_grid( w_j ) (per-output-channel or per-group scale)
b. Compute the per-element error: eps_j = w_j - q_j
c. Update remaining columns to compensate:
W[:, k>j] -= eps_j * H_inv[j, k>j] / H_inv[j, j]
d. Mark column j frozen.
4. Output quantized W_q. Activations remain FP16.
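The steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the reference implementation: it uses one asymmetric 4-bit grid per output channel, walks columns strictly left to right, and skips column blocking, group-wise scales, and act-order. The helper names (`row_grid`, `to_grid`, `rtn_quantize`, `gptq_quantize`, `recon_error`) and the `damp` parameter are hypothetical, chosen for this example only. The synthetic calibration data is deliberately correlated, since that is the regime where compensation pays off.

```python
import numpy as np

def row_grid(W, maxq=15):
    """Asymmetric 4-bit grid per output channel (row): scale and zero-point."""
    lo = np.minimum(W.min(axis=1), 0.0)
    hi = np.maximum(W.max(axis=1), 0.0)
    scale = (hi - lo) / maxq
    scale[scale == 0] = 1.0               # guard against all-zero rows
    zero = np.round(-lo / scale)
    return scale, zero

def to_grid(w, scale, zero, maxq=15):
    """Round to the nearest grid level and return the dequantized value."""
    q = np.clip(np.round(w / scale) + zero, 0, maxq)
    return scale * (q - zero)

def rtn_quantize(W, maxq=15):
    """Naive PTQ baseline: round every weight independently."""
    scale, zero = row_grid(W, maxq)
    return to_grid(W, scale[:, None], zero[:, None], maxq)

def gptq_quantize(W, X, damp=0.01, maxq=15):
    """Column-by-column quantization with Hessian-weighted compensation."""
    d_out, d_in = W.shape
    K = X.shape[0]
    H = 2.0 * X.T @ X / K
    H += damp * np.mean(np.diag(H)) * np.eye(d_in)    # ridge term (lambda * I)
    # Upper Cholesky factor U of H^-1 (H^-1 = U^T U). Row j of U is, up to a
    # scalar, the inverse-Hessian row for column j once columns < j are frozen.
    U = np.linalg.cholesky(np.linalg.inv(H)).T
    scale, zero = row_grid(W, maxq)
    W = np.array(W, dtype=np.float64)                 # work on a copy
    Q = np.zeros_like(W)
    for j in range(d_in):
        Q[:, j] = to_grid(W[:, j], scale, zero, maxq)           # a. quantize
        eps = W[:, j] - Q[:, j]                                 # b. error
        W[:, j + 1:] -= np.outer(eps, U[j, j + 1:] / U[j, j])   # c. compensate
    return Q

def recon_error(W, Q, X):
    """Layer-output reconstruction loss, averaged over calibration rows."""
    return np.sum((X @ W.T - X @ Q.T) ** 2) / X.shape[0]

rng = np.random.default_rng(0)
d_out, d_in, K = 32, 64, 2048
common = rng.standard_normal((K, 1))
X = rng.standard_normal((K, d_in)) + 2.0 * common     # correlated features
W = rng.standard_normal((d_out, d_in))

err_rtn = recon_error(W, rtn_quantize(W), X)
err_gptq = recon_error(W, gptq_quantize(W, X), X)
print(f"RTN  error: {err_rtn:.3f}")
print(f"GPTQ error: {err_gptq:.3f}")
```

Both quantizers use the same grid, so any difference in the printed errors comes from the compensation step alone. A production implementation would apply the updates in blocks of columns and use per-group scales, which this sketch leaves out for brevity.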
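Key implementation details: the Cholesky factorization of the inverse Hessian, the block-of-128 column updates, and the per-channel (or per-group) grid that q_j = round_to_INT4_grid(w_j) uses.
When you pin column j to its quantized value, the remaining columns must move to absorb the error.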
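Take the unconstrained reconstruction loss for one row $w$ of $W$, written in terms of the perturbation $\delta = \hat{w} - w$:

$$
\mathcal{L}(\delta) = \frac{1}{K}\,\lVert X\delta \rVert_2^2 = \frac{1}{2}\,\delta^\top H\,\delta, \qquad H = \frac{2}{K}\,X^\top X
$$

(plus the ridge term $\lambda I$ in practice).
The Optimal Brain Surgeon (OBS) trick comes from Hassibi & Stork (1993), building on the Optimal Brain Damage work of LeCun et al. (1990): if you constrain a single weight $w_j$ to take a fixed value $q_j$, so that $\delta_j = q_j - w_j = -\varepsilon_j$, the loss-minimizing perturbation has the closed form

$$
\delta = -\frac{\varepsilon_j}{[H^{-1}]_{jj}}\; H^{-1}_{:,j}
$$

That is, the perturbation to every weight is proportional to the $j$-th column of the inverse Hessian, scaled by the error at position $j$.
GPTQ uses this update with $\varepsilon_j = w_j - q_j$, the INT4 rounding error of column $j$, and with $H^{-1}$ restricted to the columns that are still unquantized. Because every row of $W$ shares the same $H$, the update is applied to all rows at once.
The contribution of the GPTQ paper is not the OBS update, which is three decades old, but the engineering: arbitrary (typically left-to-right) column ordering instead of saliency-greedy ordering, Cholesky decomposition of the inverse Hessian for numerical stability, and updates applied in blocks of 128 columns.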
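The closed-form OBS update can be checked numerically. The sketch below is an illustration under assumptions, not part of any GPTQ library: it uses a toy integer grid, random correlated inputs, and a single weight row. It confirms that the update pins weight `j` to its quantized value and that other perturbations satisfying the same constraint do not achieve a lower loss.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 512, 8
X = rng.standard_normal((K, d)) @ rng.standard_normal((d, d))  # correlated inputs
H = 2.0 * X.T @ X / K
H_inv = np.linalg.inv(H)

w = rng.standard_normal(d)
j = 3
q_j = np.round(w[j])          # toy grid for illustration: nearest integer
eps_j = w[j] - q_j

# Closed-form OBS perturbation: delta = -(eps_j / [H^-1]_jj) * H^-1[:, j]
delta = -(eps_j / H_inv[j, j]) * H_inv[:, j]

def loss(d_vec):
    """Quadratic reconstruction loss 0.5 * delta^T H delta."""
    return 0.5 * d_vec @ H @ d_vec

best = loss(delta)

# Random alternatives that keep weight j pinned to q_j.
worse = []
for _ in range(1000):
    other = delta + 0.1 * rng.standard_normal(d)
    other[j] = -eps_j         # same constraint as the OBS solution
    worse.append(loss(other))

print("weight j after update:", (w + delta)[j], "target:", q_j)
print("OBS loss:", best, "best random alternative:", min(worse))
```

The loss at the optimum also equals $\varepsilon_j^2 / (2\,[H^{-1}]_{jj})$, which is the per-column error that a saliency-greedy ordering would rank columns by.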
On LLaMA-class models, GPTQ-INT4 typically lands within about 0.5 perplexity of the FP16 baseline, with the largest gap on the smallest model (7B feels INT4 quantization more than 70B does). Concretely:
| Model | FP16 PPL | GPTQ-INT4 PPL | Δ |
|---|---|---|---|
| LLaMA-7B | 5.68 | 6.13 | +0.45 |
| LLaMA-13B | 5.09 | 5.40 | +0.31 |
| LLaMA-30B | 4.10 | 4.45 | +0.35 |
| LLaMA-65B | 3.53 | 3.84 | +0.31 |
(WikiText-2 perplexity, group size 128. Representative of the published results; exact numbers vary by reproduction.)
For most production tasks the perplexity gap translates to a 1–2% drop on downstream evals — usable, and often invisible inside generation noise. Smaller groups (32 instead of 128) cut the gap further at the cost of slightly more scale-overhead bits.
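The scale-overhead trade-off is simple arithmetic. The sketch below assumes each group stores one FP16 scale and one 4-bit zero-point; that storage layout is an assumption made for this example, and other formats will give slightly different numbers. `bits_per_weight` is a hypothetical helper name.

```python
def bits_per_weight(group_size, weight_bits=4, scale_bits=16, zero_bits=4):
    """Effective storage per weight: payload bits plus amortized group metadata."""
    return weight_bits + (scale_bits + zero_bits) / group_size

for g in (128, 32):
    print(f"group size {g:>3}: {bits_per_weight(g)} bits per weight")
# group size 128: 4.15625 bits per weight
# group size  32: 4.625 bits per weight
```

Under these assumptions, going from groups of 128 to groups of 32 costs a little under half a bit per weight.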
GPTQ takes a layer-output objective and a 1990s pruning theorem and turns them into a transformer-scale INT4 quantizer. The math is OBS; the engineering is what matters — Cholesky stability, block-of-128 column updates, per-channel grids. Without those pieces the algorithm is unworkable at transformer scale.
The GPTQ format is more than a quantization algorithm — it’s a file layout with matching INT4 GEMM kernels. The split:
- Quantizers: AutoGPTQ, gptqmodel, Optimum-GPTQ. Run once on the FP16 model + calibration set, produce a packed INT4 checkpoint. Takes minutes to hours depending on model size.
- Inference kernels: ExLlamaV2 (the canonical fast kernels), AutoGPTQ kernels, vLLM’s gptq_marlin kernel (fastest in 2025; fuses dequantization into the FP16 matmul on Hopper/Ampere). Handle the packed INT4 layout, per-channel scales, optional act-order permutation.
- Pre-quantized Hugging Face checkpoints (TheBloke/-style) shipped as GPTQ. Still the most common format on disk.
- gptq_marlin. Sub-tile-size INT4 GEMM, ~2x decode throughput vs FP16 in memory-bound regime.

The rough hierarchy in 2026:
GPTQ is still a good fit when you’re running on Hopper or Ampere, you want INT4 weights with FP16 activations, you have a calibration set and don’t mind ten minutes of quantization time, and you want kernels that have had three years of optimization tuning. It also fits when you’re loading an existing GPTQ checkpoint and don’t want to re-quantize.
For everything else — new quantizations on Blackwell, training, end-to-end FP4 inference — GPTQ has been superseded. It earned its place as the format that proved INT4 weight-only is a viable serving target, then handed the baton.
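GPTQ needs surprisingly little calibration data. The original paper uses 128 samples of 2048 tokens each, about 256k tokens total, and quality plateaus quickly past that. The Hessian is only a d_in x d_in second-moment matrix averaged over every calibration token, so a few hundred thousand tokens already estimate it well.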
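Because the Hessian is an average over tokens, it can be accumulated one calibration batch at a time without ever holding all activations in memory. The sketch below shows one way to do that with a running average; `accumulate_hessian` is a hypothetical name, the batch sizes are toy values, and the ridge term is left out.

```python
import numpy as np

def accumulate_hessian(batches, d_in):
    """Streaming estimate of H = 2 * X^T X / n_tokens over calibration batches."""
    H = np.zeros((d_in, d_in))
    n_tokens = 0
    for X in batches:                      # X has shape (tokens_in_batch, d_in)
        b = X.shape[0]
        H *= n_tokens / (n_tokens + b)     # rescale the old running average
        n_tokens += b
        H += 2.0 * X.T @ X / n_tokens      # fold in the new batch
    return H, n_tokens

rng = np.random.default_rng(0)
d_in = 32
batches = [rng.standard_normal((256, d_in)) for _ in range(16)]

H_stream, n_tokens = accumulate_hessian(batches, d_in)

# Reference: the same quantity computed from all tokens at once.
X_all = np.concatenate(batches, axis=0)
H_full = 2.0 * X_all.T @ X_all / X_all.shape[0]

print("tokens seen:", n_tokens)
print("max abs difference:", np.abs(H_stream - H_full).max())
print("paper-sized calibration set:", 128 * 2048, "tokens")
```

The matrix stays d_in x d_in regardless of how many tokens are streamed through, which is why the memory cost of calibration does not grow with the size of the calibration set.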
Between GPTQ and AWQ, AWQ is the better pick for new deployments. AWQ achieves comparable or slightly better INT4 accuracy with a simpler algorithm (no Hessian inversion, no second-order math), faster calibration, and kernels that are at least as well-optimized. GPTQ remains the format of millions of pre-quantized checkpoints on Hugging Face from the 2023–2024 era: if you're loading an existing GPTQ checkpoint into vLLM or ExLlamaV2, you get the same quality and the kernels are mature. For new quantizations from FP16, AWQ wins. For shipping NVFP4/MXFP4 on Blackwell, both GPTQ and AWQ are now legacy.