GPTQ — Hessian-Based Post-Training Quantization

Also known as: GPT-Q, Optimal Brain Quantization for transformers, OBQ-derived quantization

TL;DR

Layer-by-layer 4-bit weight quantization that minimizes layer-output reconstruction error using a Hessian computed from a small calibration set.

GPTQ (Frantar et al., 2022) is a layer-by-layer post-training quantization algorithm that pushed transformers to usable INT4 weight-only quantization for the first time. The key idea is borrowed from the 1990s pruning literature: when you change one weight, update the remaining weights to compensate rather than treating each weight independently. GPTQ formalizes that as a closed-form Hessian-weighted correction and makes it tractable for transformer-sized layers via Cholesky factorization and column blocking.

For roughly two years (mid-2023 through 2024) GPTQ was the dominant INT4 format on Hugging Face — millions of checkpoints — and the inference kernels in AutoGPTQ, ExLlama(V2), and vLLM were the first to make INT4 weight-only serving practically faster than FP16.

The setup

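The reconstruction loss for a single linear layer is the sum-of-squares error between the FP16 layer’s output and the quantized layer’s output, taken over a calibration set of activation rows $X$ (with the weights stored as $W \in \mathbb{R}^{d_\text{out} \times d_\text{in}}$, so the layer computes $X W^\top$):

$$ \mathcal{L}(W_q) = \lVert X W^\top - X W_q^\top \rVert_F^2 $$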

Naive PTQ rounds each weight independently to the nearest INT4 level. GPTQ instead picks the quantization order and the compensation update to the still-unquantized columns so that the layer-output error stays small across the calibration set.
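
To make the objective concrete, here is a minimal NumPy sketch of the naive baseline: round-to-nearest onto a symmetric INT4 grid, followed by the layer-output loss. The shapes (W as d_out x d_in, X as K x d_in), the symmetric [-8, 7] grid with one scale per output channel, and the helper names `rtn_int4` and `reconstruction_loss` are assumptions made for illustration; they are not taken from any GPTQ implementation.

```python
import numpy as np

def rtn_int4(W):
    """Round-to-nearest on a symmetric INT4 grid, one scale per output channel (row)."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 7.0   # largest |w| in each row maps to level 7
    scale = np.where(scale == 0, 1.0, scale)             # guard against all-zero rows
    levels = np.clip(np.round(W / scale), -8, 7)         # integer levels in [-8, 7]
    return levels * scale                                # dequantized weights

def reconstruction_loss(W, W_q, X):
    """Sum-of-squares error between the layer outputs X W^T and X W_q^T."""
    return float(np.sum((X @ W.T - X @ W_q.T) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16))     # K = 256 calibration rows, d_in = 16
W = rng.normal(size=(8, 16))       # d_out = 8, d_in = 16
W_q = rtn_int4(W)
loss = reconstruction_loss(W, W_q, X)
print("round-to-nearest loss:", loss)
```

This is the number GPTQ tries to push down: same grid, same calibration rows, but with each column's rounding error folded into the columns that have not been rounded yet.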

Algorithm flow

For each layer l, with weight matrix W (d_out x d_in, one column per input feature) and calibration X (K x d_in):

  1. Forward K calibration inputs through the model up to layer l, capture X.

  2. Compute Hessian:
        H = 2 * X^T X / K  +  lambda * I       (lambda = small ridge)
        H_inv = Cholesky-based inverse, processed column-block by column-block

  3. Walk columns of W in some order (often left-to-right within blocks of 128):
       for each column j:
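         a. Quantize: q_j = round_to_INT4_grid( w_j )           (each entry uses its output channel's scale)
         b. Compute the per-element error:  eps_j = w_j - q_j   (one entry per output channel)
         c. Update remaining columns to compensate (H_inv taken over the not-yet-frozen columns):
                W[:, k>j]  -=  outer( eps_j, H_inv[j, k>j] ) / H_inv[j, j]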
         d. Mark column j frozen.

  4. Output quantized W_q.  Activations remain FP16.

Key implementation details:

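  • Block of 128. Columns are processed in blocks of 128. Inside a block, compensation updates are applied right away to the block’s remaining columns; updates to columns outside the block are accumulated and applied once per block (lazy batching), so only the relevant submatrices are touched at each step.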
  • Cholesky for stability. Direct inversion would be unstable on near-singular Grams. GPTQ Cholesky-factorizes once and re-uses the factor.
  • Per-channel grids. Each output channel gets its own scale (and optional zero-point); that scale is what q_j = round_to_INT4_grid(w_j) uses for the entry belonging to that channel.
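
The algorithm flow above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the reference implementation: it skips the blocks of 128 and the Cholesky factor, refreshes the inverse explicitly after each column, fixes one symmetric INT4 scale per output channel up front, and sets the ridge to a fraction of the mean Hessian diagonal. The names `gptq_quantize`, `int4_round`, and `output_loss` are hypothetical.

```python
import numpy as np

def int4_round(w, scale):
    """Round to the nearest level of a symmetric INT4 grid (levels -8..7 times scale)."""
    return np.clip(np.round(w / scale), -8, 7) * scale

def gptq_quantize(W, X, ridge=0.01):
    """Quantize W (d_out x d_in) column by column with OBS-style compensation."""
    W = np.array(W, dtype=np.float64)                   # working copy, updated in place
    d_out, d_in = W.shape
    H = 2.0 * X.T @ X / X.shape[0]                      # step 2: Hessian from calibration rows
    H += ridge * np.mean(np.diag(H)) * np.eye(d_in)     # assumed ridge: fraction of mean diagonal
    H_inv = np.linalg.inv(H)

    scale = np.abs(W).max(axis=1) / 7.0                 # one scale per output channel
    scale = np.where(scale == 0, 1.0, scale)

    Q = np.empty_like(W)
    for j in range(d_in):
        Q[:, j] = int4_round(W[:, j], scale)            # a. quantize column j
        eps = W[:, j] - Q[:, j]                         # b. per-element error
        ratio = H_inv[j, j + 1:] / H_inv[j, j]
        W[:, j + 1:] -= np.outer(eps, ratio)            # c. compensate the later columns
        # d. freeze j: remove it from the inverse (Schur-complement downdate)
        H_inv[j + 1:, j + 1:] -= (
            np.outer(H_inv[j + 1:, j], H_inv[j, j + 1:]) / H_inv[j, j]
        )
    return Q, scale

def output_loss(W, W_q, X):
    """Sum-of-squares error between X W^T and X W_q^T."""
    return float(np.sum((X @ W.T - X @ W_q.T) ** 2))

rng = np.random.default_rng(0)
mix = rng.normal(size=(32, 32)) / np.sqrt(32) + np.eye(32)   # correlate the input features
X = rng.normal(size=(512, 32)) @ mix
W = rng.normal(size=(16, 32))

Q, scale = gptq_quantize(W, X)
rtn = int4_round(W, scale[:, None])                     # naive baseline on the same grid
print("RTN loss :", output_loss(W, rtn, X))
print("GPTQ loss:", output_loss(W, Q, X))
```

The first column is rounded exactly as the naive baseline would round it; every later column is rounded after absorbing the errors of the columns before it. How much the loss improves depends on how correlated the input features are, so treat the printed numbers as a sanity check rather than a benchmark.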

Why the compensation update works

When you pin column $j$ of $W$ to a discrete value $q_j$, every output dimension that depends on input $j$ now has an error contribution $x_j \varepsilon_j$, where $\varepsilon_j = w_j - q_j$. The other columns of $W$, whose weights are still continuous, can be perturbed slightly to cancel most of that contribution in the output projection. GPTQ writes down exactly the perturbation that minimizes layer-output MSE under the constraint that $w_j$ is fixed at $q_j$, and applies it cheaply, with one rank-one (outer-product) update per column.

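Take the unconstrained reconstruction loss as a function of the perturbation $\delta$ applied to one row of $W$: $L(\delta) = \lVert X\delta \rVert^2$. The gradient w.r.t. $\delta$ is $2X^\top X\delta$ and the Hessian is $H = 2X^\top X$, independent of $\delta$ (and of $W$) because the loss is quadratic. A constant rescaling of $H$, such as dividing by the number of calibration rows, cancels in the compensation update.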

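The Optimal Brain Surgeon (OBS) trick from Hassibi & Stork (1990s pruning literature, building on LeCun et al.'s Optimal Brain Damage): if you constrain a single weight $w_j$ to move by a prescribed amount, $\delta_j = -\varepsilon$, and minimize the loss over the rest, the closed-form update is

$$ \delta = -\frac{\varepsilon}{[H^{-1}]_{jj}} \, H^{-1}_{:,j} $$

That is, the perturbation to every weight is proportional to the $j$-th column of $H^{-1}$, scaled so that $\delta_j = -\varepsilon$. The induced loss increase is $\tfrac{1}{2}\,\varepsilon^2 / [H^{-1}]_{jj}$, the saliency of that weight.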

GPTQ uses this update with $\varepsilon = w_j - q_j$ (the quantization error) to push the error from column $j$ into the still-unquantized columns. Iterating column by column and refreshing $H^{-1}$ over the unfrozen submatrix is mathematically equivalent to OBS-quantizing one weight at a time, but vectorized to whole columns and made tractable by Cholesky factorization.
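
The closed-form update is easy to check numerically. The sketch below assumes the single-row loss $\lVert X\delta \rVert^2$ with Hessian $2X^\top X$ on toy random data; `obs_update` is a hypothetical helper name. It pins one coordinate of the perturbation, applies the OBS formula, and compares the result with the predicted loss increase and with random perturbations that satisfy the same constraint.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))          # toy calibration activations, d_in = 6
H = 2.0 * X.T @ X                      # Hessian of L(delta) = ||X delta||^2
H_inv = np.linalg.inv(H)

def loss(delta):
    """Layer-output error caused by perturbing one weight row by delta."""
    return float(np.sum((X @ delta) ** 2))

def obs_update(H_inv, j, eps):
    """Perturbation that forces delta[j] = -eps and minimizes the quadratic loss."""
    return -(eps / H_inv[j, j]) * H_inv[:, j]

j, eps = 2, 0.3
delta = obs_update(H_inv, j, eps)
predicted_increase = 0.5 * eps**2 / H_inv[j, j]   # the saliency of weight j

# Any other perturbation with the same pinned coordinate should do no better.
others = rng.normal(scale=0.05, size=(1000, 6))
others[:, j] = -eps
best_random = min(loss(d) for d in others)

print("pinned coordinate :", delta[j])
print("loss, closed form :", loss(delta))
print("loss, predicted   :", predicted_increase)
print("best random trial :", best_random)
```

Because the loss is a convex quadratic, the closed form is the exact constrained minimum, so no feasible random trial can beat it.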

The contribution of the GPTQ paper is not the OBS update (that’s three decades old) but the engineering: arbitrary (typically left-to-right) column ordering instead of saliency-greedy ordering, Cholesky decomposition of $H^{-1}$ for stability, lazy block-of-128 updates so the active submatrix fits in cache, and an empirical demonstration that this all works at transformer scale at INT4.
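
One way to see why a single Cholesky factorization can replace the per-column refresh of the inverse: if $H^{-1} = U^\top U$ with $U$ upper triangular, then row $j$ of $U$, divided by its diagonal entry, equals the normalized first row of the inverse taken over the unfrozen columns $j, j+1, \dots$. The sketch below checks that identity on a toy Hessian. The matrix sizes, the ridge value, and the helper names `refreshed_ratio` and `cholesky_ratio` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(100, d))
H = 2.0 * X.T @ X / X.shape[0] + 0.01 * np.eye(d)   # toy Hessian with a small ridge

# np.linalg.cholesky returns lower-triangular L with L @ L.T == A,
# so U = L.T is upper triangular and satisfies U.T @ U == H^{-1}.
U = np.linalg.cholesky(np.linalg.inv(H)).T

def refreshed_ratio(H, j):
    """Explicit route: invert H over the unfrozen indices j.. and normalize row 0."""
    sub_inv = np.linalg.inv(H[j:, j:])
    return sub_inv[0, :] / sub_inv[0, 0]

def cholesky_ratio(U, j):
    """Same coefficients read directly off row j of the Cholesky factor."""
    return U[j, j:] / U[j, j]

max_gap = max(
    np.max(np.abs(refreshed_ratio(H, j) - cholesky_ratio(U, j)))
    for j in range(d)
)
print("largest difference between the two routes:", max_gap)
```

The explicit route costs one matrix inversion per column; the factor is computed once and then only read, which is what makes the column walk affordable on layers with thousands of input features.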

What you actually get

On LLaMA-class models, GPTQ-INT4 typically lands within roughly 0.3–0.5 perplexity of the FP16 baseline, with the largest gap on the smallest model (7B feels INT4 quantization more than 65B does). Concretely:

  Model        FP16 PPL   GPTQ-INT4 PPL   Δ
  LLaMA-7B     5.68       6.13            +0.45
  LLaMA-13B    5.09       5.40            +0.31
  LLaMA-30B    4.10       4.45            +0.35
  LLaMA-65B    3.53       3.84            +0.31

(WikiText-2 perplexity, group size 128 — representative of the published results, exact numbers vary by reproduction.)

For most production tasks the perplexity gap translates to a 1–2% drop on downstream evals — usable, and often invisible inside generation noise. Smaller groups (32 instead of 128) cut the gap further at the cost of slightly more scale-overhead bits.
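
The scale overhead is simple arithmetic. The sketch below assumes each group stores one 16-bit scale and one 4-bit zero-point alongside its 4-bit weights; actual packed layouts vary, so treat the metadata sizes and the function name `bits_per_weight` as illustrative assumptions.

```python
def bits_per_weight(group_size, weight_bits=4, scale_bits=16, zero_bits=4):
    """Effective storage bits per weight when each group shares one scale and one zero-point."""
    return weight_bits + (scale_bits + zero_bits) / group_size

for g in (128, 64, 32):
    print(f"group size {g:>3}: {bits_per_weight(g):.4f} bits per weight")
```

Under these assumptions, going from group size 128 to 32 raises the footprint from about 4.16 to about 4.63 bits per weight, which is the price paid for the finer grids.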

GPTQ takes a layer-output objective and a 1990s pruning theorem and turns them into a transformer-scale INT4 quantizer. The math is OBS; the engineering is what matters — Cholesky stability, block-of-128 column updates, per-channel grids. Without those pieces the algorithm is unworkable at transformer scale.

Software & kernels

The GPTQ format is more than a quantization algorithm — it’s a file layout with matching INT4 GEMM kernels. The split:

  • Quantization side. AutoGPTQ, gptqmodel, Optimum-GPTQ. Run once on the FP16 model + calibration set, produce a packed INT4 checkpoint. Takes minutes to hours depending on model size.
  • Inference side. ExLlamaV2 (the canonical fast kernels), AutoGPTQ kernels, vLLM’s gptq_marlin kernel (fastest in 2025; fuses dequantization into the FP16 matmul on Ampere/Hopper). These handle the packed INT4 layout, the per-channel or per-group scales, and the optional act-order permutation.
  • Group size 128 by default. Smaller groups = better quality, more scale overhead, slightly slower kernels. Group size 32 is the high-quality variant; group size -1 (no grouping, one scale per output channel) is the legacy fast variant.

GPTQ in the production serving stack

  • Hugging Face checkpoint zoo, 2023–2024. Most open-weight INT4 checkpoints (TheBloke/-style) shipped as GPTQ. Still the most common format on disk.
  • vLLM with gptq_marlin. Sub-tile-size INT4 GEMM, ~2x decode throughput vs FP16 in memory-bound regime.
  • ExLlamaV2. The fastest GPTQ inference kernel for single-batch local generation. Powers many local LLM apps that don’t use llama.cpp / GGUF.
  • GPTQ + LoRA serving. Load GPTQ INT4 base, attach BF16 adapters at runtime — a common deployment pattern, well supported in vLLM.

GPTQ vs the rest

The rough hierarchy in 2026:

  • AWQ: the practical default for new INT4 weight-only quantizations. Simpler, faster, equal or better quality.
  • GPTQ: equally usable, dominant on disk, slightly more legacy. Still a fine choice when you already have GPTQ checkpoints or kernel-level optimizations you trust.
  • NF4: for QLoRA-shaped fine-tuning workflows. Different objective (Gaussian-fitted levels, no calibration data) but the same INT4-weight-only deployment shape.
  • NVFP4 / MXFP4: for Blackwell-class hardware. Compresses activations as well as weights, with native tensor-core support.
  • K-quants: for CPU / Apple-Silicon inference. Different ecosystem, parallel evolution.

When to still pick GPTQ

You’re running on Hopper or Ampere, you want INT4 weights with FP16 activations, you have a calibration set and don’t mind a one-off quantization run (minutes to hours, depending on model size), and you want kernels that have had three years of optimization tuning. Or you’re loading an existing GPTQ checkpoint and don’t want to re-quantize.

For everything else — new quantizations on Blackwell, training, end-to-end FP4 inference — GPTQ has been superseded. It earned its place as the format that proved INT4 weight-only is a viable serving target, then handed the baton.

Go further

Why does GPTQ work so well on transformer linear layers specifically?

The reconstruction loss for a linear layer is $\lVert X W^\top - X W_q^\top \rVert_F^2$, which has a Hessian $H = 2X^\top X$ that is positive semi-definite and structurally cooperative: it has a tight low-rank-plus-diagonal structure on transformer activations. That makes the Hessian inverse stable to compute via Cholesky and makes the OBQ closed-form update numerically well-behaved at 4 bits. On non-transformer architectures with worse-conditioned activation Grams (e.g., dense fully-connected nets without residual streams), the same algorithm degrades. GPTQ is implicitly tuned for the transformer Hessian shape.

How big should the calibration set be?

Surprisingly small. The original paper uses 128 samples of 2048 tokens each (128 × 2048 = 262,144, about 262k tokens total) and quality plateaus quickly past that. The Hessian is averaged across the calibration set; once you have enough samples to estimate the dominant eigenvectors of that matrix accurately (rank a few hundred for a 4096-dim layer), more data adds nothing. The calibration data should be roughly distribution-matched to inference traffic (a slice of C4 or RedPajama for a general LM, a slice of code for a code model), but this is a soft constraint.
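
Because the Hessian is just an average over calibration rows, it can be accumulated batch by batch without ever holding the whole calibration set in memory. The following is a minimal sketch of such a running average; `accumulate_hessian` is a hypothetical name, and the batches stand in for activations captured at the input of one layer.

```python
import numpy as np

def accumulate_hessian(batches, d_in):
    """Running average of 2 * x x^T over all calibration rows seen so far."""
    H = np.zeros((d_in, d_in))
    n = 0                                   # rows folded in so far
    for X in batches:                       # X has shape (rows, d_in)
        b = X.shape[0]
        H *= n / (n + b)                    # shrink the old average to its new weight
        n += b
        H += (2.0 / n) * (X.T @ X)          # add the new batch at its weight
    return H

rng = np.random.default_rng(0)
batches = [rng.normal(size=(rows, 12)) for rows in (64, 32, 100)]
H_stream = accumulate_hessian(batches, d_in=12)

X_all = np.concatenate(batches, axis=0)
H_full = 2.0 * X_all.T @ X_all / X_all.shape[0]
print("max difference vs. one-shot Hessian:", np.max(np.abs(H_stream - H_full)))
```

The streamed result matches the one-shot computation up to floating-point rounding, and memory use stays at one d_in x d_in matrix regardless of how many calibration tokens are fed through.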

GPTQ vs AWQ — which should I pick today?

AWQ for new deployments. AWQ achieves comparable or slightly better INT4 accuracy with a simpler algorithm (no Hessian inversion, no second-order math), faster calibration, and kernels that are at least as well-optimized. GPTQ remains the format of millions of pre-quantized checkpoints on Hugging Face from the 2023–2024 era — if you're loading an existing GPTQ checkpoint into vLLM or ExLlamaV2, you get the same quality and the kernels are mature. For new quantizations from FP16, AWQ wins. For shipping NVFP4/MXFP4 on Blackwell, both GPTQ and AWQ are now legacy.
