What does 'salient' actually mean for a weight channel?
A weight has high saliency if perturbing it produces large changes in the layer output. For a linear layer
Also known as: Activation-aware Weight Quantization, salient-weight protection
INT4 weight-only quantization that protects the salient weight channels — the ones multiplied by large activations — by absorbing a per-channel scale into the weights before rounding.
AWQ — Activation-aware Weight Quantization, Lin et al., 2023 — is the spiritual successor to GPTQ and the practical default for INT4 weight-only quantization in 2024–2026. The insight is one line: in a quantization-error sense, not all weights matter equally. A weight multiplied by a large activation contributes more to the layer output than a weight multiplied by a small activation, so quantization error on the high-activation weights dominates the layer-output error.
If you can identify those salient weight channels and protect them with a per-channel scale before rounding, you get most of GPTQ’s quality with none of its second-order math.
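The saliency claim can be checked numerically. The sketch below perturbs one input channel of a random weight matrix at a time and measures how much the layer output moves; the shapes, the perturbation size `eps`, and the choice of channel 7 as the high-activation channel are illustrative assumptions, not values from Lin et al.

```python
import numpy as np

rng = np.random.default_rng(0)
in_features, out_features = 64, 32

# Hypothetical activation vector: channel 7 is forced to be an outlier.
x = rng.normal(size=in_features)
x[7] = 50.0

W = rng.normal(size=(out_features, in_features))

def output_error_from_channel(j, eps=0.01):
    """L2 change in y = W @ x when every weight in input channel j moves by eps."""
    W_pert = W.copy()
    W_pert[:, j] += eps
    return np.linalg.norm(W_pert @ x - W @ x)

errors = np.array([output_error_from_channel(j) for j in range(in_features)])

# The change is eps * |x_j| * sqrt(out_features), so it tracks |x_j| directly.
print("most sensitive input channel:", int(errors.argmax()))
```

The same weight perturbation costs far more on the channel with the large activation, which is the sense in which that channel is salient.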
Take a linear layer
So if input dimension
Activation magnitudes across input channels (per-channel mean |x_j|):
  |
  |              o                  <- "salient" channels: high mean |x_j|, dominate error
  |
  |
  |   o      o        o     o    o
  | o    o  o    o  o    o     o  o  o
  |    o   o   o   o  o   o  o    o
  +---------------------------------------> channel j
        <-- ~99% typical -->
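As a rough illustration of how the salient channels in this picture could be located, the sketch below computes the per-channel mean magnitude on synthetic calibration activations and keeps the top 1%. The data, the outlier positions, and the 1% cutoff are assumptions made for the example; AWQ itself does not hard-select channels this way.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, in_features = 512, 400

# Synthetic calibration activations with 4 hypothetical outlier channels (1% of 400).
X = rng.normal(size=(n_tokens, in_features))
outlier_channels = np.array([3, 120, 250, 399])
X[:, outlier_channels] *= 30.0

m = np.abs(X).mean(axis=0)               # per-channel mean |x_j|
n_salient = max(1, in_features // 100)   # keep the top 1% of channels
salient = np.sort(np.argsort(m)[-n_salient:])

print("salient channels:", salient.tolist())
```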
GPTQ handles this by computing the full Hessian
AWQ’s central trick is an algebraic identity:
$$W x = \big(W\,\mathrm{diag}(s)\big)\big(\mathrm{diag}(s)^{-1}\,x\big)$$
Here $s$ is a positive per-input-channel scaling vector.
Without AWQ scaling:
x --(quantize W)--> W_q * x
                      ^
                      | large output error on salient channels
                      | (their rounding error is multiplied by large activations)
With AWQ scaling (s_j > 1 for salient channel j):
x --(* 1/s)--> x/s --(quantize W * s)--> W'_q * (x/s) ≈ W * x (smaller error)
                      ^
                      | W' = W*s has larger weights on salient channels
                      | -> same rounding step, smaller relative error at INT4
                      |
                      + x/s is computed at FP16, no precision lost there
In practice
AWQ doesn’t try to derive
For each linear layer (X is [samples, in], W is [in, out]):
1. Run ~16-128 calibration samples, capture activation X.
2. Compute per-channel mean magnitude: m_j = mean_k |X[k, j]|
3. For alpha in [0.0, 0.05, 0.10, ..., 1.0]:
     s = m^alpha (per-channel scaling vector)
     W' = diag(s) * W (scale salient input channels of the weights up)
     W'_q = quantize_INT4_per_channel(W')
     err(alpha) = || X * W - (X * diag(1 / s)) * W'_q ||_F^2
4. Pick alpha* that minimizes err. Store s* and the quantized W'_q.
5. At runtime: dequantize W'_q to FP16, fuse the diag(1 / s) with the input
   (or with the previous layer's output projection for free).
That’s the entire algorithm. No Hessian inversion, no Cholesky factorization, no per-column ordering decisions. The grid search costs about 20 forward passes per layer (one per candidate alpha), each dominated by the matmul cost, so calibration runs in seconds rather than minutes.
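The grid search can be sketched in a few lines of NumPy. This is a minimal illustration rather than the llm-awq implementation: it follows the convention of Lin et al., in which the weight rows are multiplied by s and the activations divided by s, and it assumes symmetric per-output-channel INT4 rounding, synthetic data, and three hypothetical outlier channels. The function names are made up for the example.

```python
import numpy as np

def quantize_int4_per_channel(W):
    """Symmetric round-to-nearest INT4 fake-quantization, one step size per output column."""
    step = np.abs(W).max(axis=0, keepdims=True) / 7.0
    step = np.where(step == 0, 1.0, step)              # guard all-zero columns
    return np.clip(np.round(W / step), -8, 7) * step   # dequantized values

def awq_grid_search(X, W, alphas=np.linspace(0.0, 1.0, 21)):
    """Return (err, alpha, s, W_q) minimizing the layer-output error on X."""
    m = np.abs(X).mean(axis=0)                         # per-input-channel mean |x_j|
    Y_ref = X @ W                                      # full-precision reference output
    best = None
    for alpha in alphas:
        s = np.maximum(m ** alpha, 1e-4)               # per-channel scaling vector
        W_q = quantize_int4_per_channel(W * s[:, None])    # scale input rows up, then round
        err = np.linalg.norm(Y_ref - (X / s) @ W_q) ** 2   # activations scaled down by 1/s
        if best is None or err < best[0]:
            best = (err, float(alpha), s, W_q)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 128))        # [samples, in_features]
X[:, [5, 40, 77]] *= 30.0             # hypothetical salient channels
W = rng.normal(size=(128, 96))        # [in_features, out_features]

err_awq, alpha_star, s_star, W_q = awq_grid_search(X, W)
err_naive = np.linalg.norm(X @ W - X @ quantize_int4_per_channel(W)) ** 2
print(f"alpha* = {alpha_star:.2f}, naive err = {err_naive:.1f}, AWQ err = {err_awq:.1f}")
```

Because alpha = 0 reproduces plain round-to-nearest, the searched result can never be worse than the naive baseline on the calibration set; how much better it gets depends on how pronounced the outlier channels are.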
You could in principle solve for
You could differentiate through a soft / straight-through estimator, but the loss surface is benign enough that a coarse grid search gives a within-1% answer with much less code. Lin et al. observe that the optimal
The deeper reason this works: the salient-channel structure is sharply bimodal (a small set of high-magnitude channels and a big body of typical channels), so any
The objective function is the layer-output reconstruction error on the calibration set, same as GPTQ. The difference is what’s optimized: GPTQ optimizes the quantized weights
| Method | Compensation | Calibration set | Time per layer | INT4 quality (ΔPPL on LLaMA) |
|---|---|---|---|---|
| Naive PTQ | none | none | seconds | +5 to +20 (often unusable) |
| GPTQ | Hessian-based per-column update | ~128 samples | 1–10 minutes | +0.3 to +1.0 |
| AWQ | Pre-quantization per-channel scale | 16–128 samples | seconds | +0.2 to +0.7 |
Both methods address the same fundamental issue, outlier channels that wreck naive scaling, but from different angles. GPTQ minimizes the residual error on a fixed quantization grid via a closed-form weight update. AWQ moves the problem upstream by rescaling the weights before quantization happens, so the grid becomes an easier target.
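To see why rescaling makes the grid an easier target, consider a single weight $w$ multiplied by activation $x$, quantized to $N$ bits within a group of weights $\mathbf{w}$. The derivation below follows the error analysis in Lin et al., where the weight is multiplied by $s$ and the activation divided by $s$; it is a sketch that treats the rounding error as independent of the scale.

```latex
\begin{aligned}
Q(w) &= \Delta \cdot \mathrm{Round}\!\left(\frac{w}{\Delta}\right),
\qquad \Delta = \frac{\max|\mathbf{w}|}{2^{N-1}} \\[4pt]
\mathrm{Err}\big(Q(w)\,x\big) &= \Delta \cdot \mathrm{RoundErr}\!\left(\frac{w}{\Delta}\right) \cdot x \\[4pt]
\mathrm{Err}\!\left(Q(w s)\,\frac{x}{s}\right) &= \Delta' \cdot \mathrm{RoundErr}\!\left(\frac{w s}{\Delta'}\right) \cdot x \cdot \frac{1}{s} \\[4pt]
\frac{\mathrm{Err}\big(Q(w s)\,x/s\big)}{\mathrm{Err}\big(Q(w)\,x\big)} &\approx \frac{\Delta'}{\Delta} \cdot \frac{1}{s}
\end{aligned}
```

With $s > 1$ and $\Delta' \approx \Delta$ (scaling a few channels rarely changes the group maximum), the expected error on that channel shrinks by roughly a factor of $1/s$. Pushing $s$ too high raises $\Delta'$ and hurts every other channel, which is why the scale is searched rather than maximized.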
The practical consequences:
awq_marlin and TensorRT-LLM AWQ kernels are state-of-the-art for INT4 weight-only GEMM on Hopper.
AWQ is GPTQ’s spiritual successor: better accuracy, simpler algorithm, no Hessian inversions. It is the 2024 default for INT4 weight-only quantization, and the practical baseline you should beat with NVFP4 / MXFP4 if you’re claiming a new quantization format.
awq_marlin-style kernels.
--quantization awq loads AWQ checkpoints with the same throughput characteristics as GPTQ. Same INT4 weight-only kernels, slightly different on-disk layout.
llm-awq (the reference repo): Python tooling for converting FP16 → AWQ on a calibration set. ~10 minutes to quantize a 7B model, ~1 hour for 70B.
The decision tree:
AWQ won the post-GPTQ era of INT4 weight-only quantization and now sits as the production default. The 2026 question isn’t “GPTQ or AWQ?” — it’s “AWQ or NVFP4?”, and the answer depends on your hardware.
GPTQ's Hessian-weighted compensation update is the optimal correction at each rounding step for the layer-output reconstruction objective on a fixed quantization grid. But the dominant source of layer-output error is outlier channels, and you can address those without solving the full quadratic-optimization problem. AWQ rescales the salient channels before quantization so that they round more accurately, then quantizes everything with naive per-channel rounding. The remaining error from the non-salient channels is small enough that no compensation update is needed. Empirically, AWQ matches or slightly beats GPTQ on most LLaMA-family models with 5–10x faster calibration.
Yes, in principle. The activation-aware scaling trick is orthogonal to the choice of quantization grid — you can absorb a per-channel scale into the weights and then quantize them with INT4, NF4, MXFP4, or NVFP4. In practice the production tools mostly do AWQ+INT4 because that's what the published kernels target, but research implementations (llm-awq, quantmlsys) have shown AWQ scaling improves NF4 quality and is being used as a building block inside Blackwell-targeted FP4 quantizers.
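The orthogonality between the scaling step and the grid can be illustrated by making the quantizer a plug-in. In the sketch below, `quantize_to_codebook` and `scaled_quantize` are hypothetical helper names, and the non-uniform codebook is an invented 16-level grid that is denser near zero; it is not the real NF4, MXFP4, or NVFP4 table.

```python
import numpy as np

def quantize_to_codebook(W, codebook):
    """Round each column of W, absmax-normalized to [-1, 1], to the nearest codebook level."""
    absmax = np.abs(W).max(axis=0, keepdims=True)
    absmax = np.where(absmax == 0, 1.0, absmax)        # guard all-zero columns
    idx = np.abs((W / absmax)[..., None] - codebook).argmin(axis=-1)
    return codebook[idx] * absmax                      # dequantized values

def scaled_quantize(W, s, codebook):
    """Fold the per-input-channel scale s into W, then quantize with whichever grid is given."""
    return quantize_to_codebook(W * s[:, None], codebook)

uniform16 = np.linspace(-1.0, 1.0, 16)                 # evenly spaced 16-level grid
nonuniform16 = np.sign(uniform16) * uniform16 ** 2     # hypothetical grid, denser near zero

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 8))                           # [in_features, out_features]
s = np.linspace(1.0, 2.0, 32)                          # illustrative per-channel scales

W_q_uniform = scaled_quantize(W, s, uniform16)
W_q_nonuniform = scaled_quantize(W, s, nonuniform16)
```

The scaling code is identical in both calls; only the codebook passed to the rounding step changes.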