AWQ — Activation-Aware Weight Quantization

Also known as: Activation-aware Weight Quantization, salient-weight protection

TL;DR

INT4 weight-only quantization that protects the salient weight channels — the ones multiplied by large activations — by absorbing a per-channel scale into the weights before rounding.

AWQ — Activation-aware Weight Quantization, Lin et al., 2023 — is the spiritual successor to GPTQ and the practical default for INT4 weight-only quantization in 2024–2026. The insight is one line: in a quantization-error sense, not all weights matter equally. A weight multiplied by a large activation contributes more to the layer output than a weight multiplied by a small activation, so quantization error on the high-activation weights dominates the layer-output error.

If you can identify those salient weight channels and protect them with a per-channel scale before rounding, you get most of GPTQ’s quality with none of its second-order math.

The setup

Take a linear layer $y = Wx$, where $W \in \mathbb{R}^{d_\text{out} \times d_\text{in}}$ and $x \in \mathbb{R}^{d_\text{in}}$. The output’s sensitivity to a perturbation $\delta W_{:,j}$ at column $j$ is

$$\delta y = \delta W_{:,j}\, x_j, \qquad \|\delta y\| = |x_j|\,\|\delta W_{:,j}\|.$$

So if input dimension $j$ has a large typical magnitude $|x_j|$, then quantization error on the entire $j$-th column of $W$ shows up amplified at the output. Empirically, in transformer linear layers, this dominates: a small fraction of input channels (typically less than 1%) carry magnitudes 10–100× larger than the typical channel, and those channels concentrate the error.
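A minimal NumPy sketch of this sensitivity argument, assuming the convention $y = Wx$ with input channels on the columns of $W$. The sizes, the channel indices, and the activation values are made up for illustration. The same perturbation is added to two different columns, and the change in the output scales with the activation on that column.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 8, 16
W = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)

salient, typical = 3, 5          # hypothetical channel indices
x[salient] = 50.0                # one artificially large activation
x[typical] = 0.5                 # one ordinary activation

delta = 0.01 * rng.normal(size=d_out)   # identical perturbation for both columns

def output_change(col):
    """Norm of the output change when `delta` is added to column `col` of W."""
    W_pert = W.copy()
    W_pert[:, col] += delta
    return np.linalg.norm(W_pert @ x - W @ x)

ratio = output_change(salient) / output_change(typical)
print(ratio)   # matches |x[salient]| / |x[typical]| = 100 up to float error
```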

Activation magnitudes across input channels (per-channel mean |x_j|):

  |
  |          o    <- "salient" channels: high mean |x_j|, dominate error
  |
  |
  |  o   o  o    o    o
  |o o o    o   o  o      o   o    o
  | o    o    o     o o  o   o   o
  +---------------------------------------> channel j
              <-- ~99% typical -->

GPTQ handles this by computing the full Hessian and using a closed-form update to compensate downstream columns when each column is rounded. AWQ takes a more direct approach: scale the salient channels' weights up before quantization (so the shared rounding grid is finer relative to their magnitude), then absorb the inverse scale into the activation pathway.

The math

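AWQ’s central trick is an algebraic identity:

$$Wx = \bigl(W\,\mathrm{diag}(s)\bigr)\,\bigl(\mathrm{diag}(s)^{-1}\,x\bigr)$$

Here $s \in \mathbb{R}^{d_\text{in}}$ is a vector of per-input-channel scaling factors. Multiplying $W$ by $\mathrm{diag}(s)$ enlarges the weights along the salient channels (where $s_j > 1$), and multiplying $x$ by $\mathrm{diag}(s)^{-1}$ shrinks those activations by the same factor. Before quantization the product is unchanged; it’s a mathematical identity. The gain appears after rounding: as long as scaling a few columns leaves the grid step shared across each row or group roughly unchanged, the rounding error on a salient weight stays the same in absolute terms while the activation multiplying it is $s_j$ times smaller, so that channel's contribution to the output error drops by about $1/s_j$.

Without AWQ scaling:

   x  --(quantize W)-->  W_q * x
                          ^
                          |  rounding error on salient columns is
                          |  amplified by their large activations


With AWQ scaling (s_j > 1 for salient channel j):

   x  --(/ s)-->  x/s  --(quantize W * s)-->  W'_q * (x/s)  ≈  W * x  (smaller error)
                                ^
                                |  W' = W*s has larger weights on salient channels
                                |  -> same grid step, so their relative rounding
                                |     error shrinks by about 1/s_j
                                |
                                + x/s is computed at FP16, no precision lost there

In practice $s_j$ is large for high-activation channels and ≈1 for the rest. The salient channels’ weights become larger before quantization (smaller relative rounding error); their activations become correspondingly smaller at runtime (computed in FP16, no precision lost). The non-salient channels are essentially untouched.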
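A small worked example may help here. It follows the AWQ paper's convention of quantizing $W\,\mathrm{diag}(s)$ and feeding it $x/s$, with $s_j > 1$ on the salient channel. The toy weights, the activation vector, the hand-picked scale, and the helper `quantize_int4_per_row` are all assumptions for illustration; the quantizer is plain symmetric round-to-nearest with one step size per output row, not a production kernel.

```python
import numpy as np

def quantize_int4_per_row(W):
    """Symmetric round-to-nearest INT4 with one step size per output row.
    Returns dequantized weights so errors can be compared directly."""
    step = np.abs(W).max(axis=1, keepdims=True) / 7.0
    return np.clip(np.round(W / step), -8, 7) * step

# Toy layer: 2 outputs, 3 input channels; channel 0 is the salient one.
W = np.array([[ 0.30, 7.0, -2.0],
              [-0.40, 3.5,  1.0]])
x = np.array([10.0, 1.0, 1.0])
s = np.array([4.0, 1.0, 1.0])      # hypothetical scale, chosen by hand

y_ref = W @ x
assert np.allclose(y_ref, (W * s) @ (x / s))   # identity holds before rounding

err_plain = np.linalg.norm(y_ref - quantize_int4_per_row(W) @ x)
err_scaled = np.linalg.norm(y_ref - quantize_int4_per_row(W * s) @ (x / s))
print(err_plain, err_scaled)       # about 3.16 vs 0.56
```

In this example the largest weight in each row is untouched by the scaling, so the step size stays the same and only the salient column's relative rounding error changes. With a much larger scale the salient column would start to set the row maximum, coarsening the grid for every other weight, which is why the scale has to be searched rather than simply maximized.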

Choosing the scales

AWQ doesn’t try to derive $s$ analytically. Instead it does a grid search: parameterize $s = m^{\alpha}$, where $m$ is the per-channel mean activation magnitude on a calibration set, and search $\alpha \in [0, 1]$ on a small grid (typically 20 values) to minimize the layer-output reconstruction error.

For each linear layer (W: d_out x d_in, X: samples x d_in):
  1. Run ~16-128 calibration samples, capture activation X.
  2. Compute per-channel mean magnitude: m_j = mean_k |X[k, j]|
  3. For alpha in [0.0, 0.05, 0.10, ..., 1.0]:
       s = m^alpha                         (per-channel scaling vector)
       W' = W * diag(s)                    (scale salient input channels up)
       W'_q = quantize_INT4_per_channel(W')    (one grid per output channel or group)
       err(alpha) = || X * W^T  -  (X * diag(1/s)) * W'_q^T ||_F^2
  4. Pick alpha* that minimizes err. Store s* and the quantized W'_q.
  5. At runtime: dequantize W'_q to FP16, fuse diag(1/s) with the input
                 (or with the previous layer's output projection for free).
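The steps above can be sketched in NumPy roughly as follows. This is an illustrative sketch, not the llm-awq implementation: it uses the convention of scaling weight columns by $s$ and dividing activations by $s$, a simple per-row round-to-nearest quantizer, and synthetic calibration data. The names `awq_search` and `quantize_int4_per_row` are hypothetical.

```python
import numpy as np

def quantize_int4_per_row(W):
    """Symmetric round-to-nearest INT4, one step size per output row
    (returns dequantized weights)."""
    step = np.abs(W).max(axis=1, keepdims=True) / 7.0
    step = np.where(step == 0, 1.0, step)          # guard all-zero rows
    return np.clip(np.round(W / step), -8, 7) * step

def awq_search(W, X, n_grid=20):
    """Grid-search alpha in [0, 1] for s = m**alpha.
    W: (d_out, d_in) weights, X: (n_samples, d_in) calibration activations."""
    m = np.maximum(np.abs(X).mean(axis=0), 1e-8)   # per-channel mean |x_j|
    Y_ref = X @ W.T                                # full-precision layer output
    errs, best = [], None
    for alpha in np.linspace(0.0, 1.0, n_grid + 1):
        s = m ** alpha                             # alpha = 0 gives s = 1 (plain rounding)
        W_q = quantize_int4_per_row(W * s)         # scale columns up, then round
        err = np.linalg.norm(Y_ref - (X / s) @ W_q.T) ** 2
        errs.append(err)
        if best is None or err < best[0]:
            best = (err, alpha, s, W_q)
    return best, errs

# Synthetic calibration data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 32))
X[:, :2] *= 30.0                                   # two artificially salient channels
W = rng.normal(size=(16, 32))

(best_err, best_alpha, s_star, W_q), errs = awq_search(W, X)
print(f"alpha*={best_alpha:.2f}  err={best_err:.3f}  plain rounding err={errs[0]:.3f}")
```

Because the grid includes `alpha = 0`, the selected scale can never do worse on the calibration set than unscaled rounding; how much better it does depends on how pronounced the salient channels are.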

That’s the entire algorithm. No Hessian inversion, no Cholesky factorization, no per-column ordering decisions. The grid search costs O(20) forward passes per layer, which is dominated by the matmul cost — calibration runs in seconds rather than minutes.
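The "for free" fusion in step 5 can be illustrated with a short sketch, assuming the activations are divided by $s$ and that the layer feeding the quantized one is itself linear. `W_prev`, `h`, and the scale values are hypothetical. Dividing the producer's output rows by $s$ offline yields the rescaled activations directly, so no extra multiply is needed at runtime.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_in = 6, 4
W_prev = rng.normal(size=(d_in, d_h))    # hypothetical preceding linear layer
h = rng.normal(size=d_h)                 # its input
s = np.array([4.0, 1.0, 2.0, 1.0])       # hypothetical AWQ scales for the next layer

x = W_prev @ h                           # activations entering the quantized layer
W_prev_folded = W_prev / s[:, None]      # divide output row j of W_prev by s_j, offline

# The folded layer emits x / s directly.
folded_out = W_prev_folded @ h
print(np.allclose(folded_out, x / s))    # True
```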

You could in principle solve for $s$ analytically by differentiating the reconstruction error w.r.t. $s$ and setting it to zero. The problem: the dependence of $W'_q$ on $s$ is not differentiable, because it goes through a hard rounding step. The output error is a piecewise-smooth function of $s$ with discrete jumps at the rounding boundaries.
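A one-weight sketch of that discontinuity, assuming a fixed grid step and the convention of quantizing $w \cdot s$ and dividing the activation by $s$ (the values of `w`, `step`, and the two scales are arbitrary). The effective weight seen by the original activation jumps when $w \cdot s$ crosses a rounding boundary.

```python
import numpy as np

w, step = 0.3, 1.0        # one weight and a fixed grid step (assumed values)

def effective_weight(s):
    """Weight seen by the unscaled activation: round(w*s / step) * step / s."""
    return np.round(w * s / step) * step / s

for s in (1.6, 1.7):
    print(s, effective_weight(s))
# s = 1.6: w*s = 0.48 rounds to 0, effective weight 0.0
# s = 1.7: w*s = 0.51 rounds to 1, effective weight 1/1.7, about 0.588
```

Between two jumps the integer code is constant, so the gradient of the code with respect to $s$ is zero almost everywhere and carries no information about where the next boundary lies.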

You could differentiate through a soft / straight-through estimator, but the loss surface is benign enough that a coarse grid search gives a within-1% answer with much less code. Lin et al. observe that the optimal $\alpha$ consistently falls in a narrow range across LLaMA-family layers (different layers don’t differ much), so even a 5-point grid suffices in practice.

The deeper reason this works: the salient-channel structure is sharply bimodal (a small set of high-magnitude channels and a big body of typical channels), so any $\alpha$ that distinguishes the two groups gets most of the benefit. Fine-tuning the exact $\alpha$ buys diminishing returns.

The objective function is the layer-output reconstruction error on the calibration set, same as GPTQ. The difference is what’s optimized: GPTQ optimizes the quantized weights (over a discrete INT4 lattice, with a Hessian-weighted update); AWQ optimizes the scaling factor (through a single exponent $\alpha$ searched on a 1D grid) and lets naive INT4 rounding handle the rest. Roughly: GPTQ is “try harder per weight”; AWQ is “rescale the problem so per-weight rounding stops being hard.”

How it compares to GPTQ

Method    | Compensation                       | Calibration set | Time per layer | INT4 quality (ΔPPL on LLaMA)
Naive PTQ | none                               | none            | seconds        | +5 to +20 (often unusable)
GPTQ      | Hessian-based per-column update    | ~128 samples    | 1–10 minutes   | +0.3 to +1.0
AWQ       | Pre-quantization per-channel scale | 16–128 samples  | seconds        | +0.2 to +0.7

Both methods address the same fundamental issue (outlier channels that wreck naive scaling) but from different angles. GPTQ minimizes residual error after a fixed quantization grid via a closed-form weight update. AWQ reshapes the problem upstream by rescaling the weights before quantization happens, so the grid becomes an easier target.

The practical consequences:

AWQ calibrates 5–10x faster than GPTQ. No Hessian computation, no Cholesky, no per-column iteration.
AWQ matches or beats GPTQ in published benchmarks on LLaMA-2 and LLaMA-3 family models, by 0.1–0.3 perplexity at INT4.
AWQ kernels are at least as fast. The awq_marlin and TensorRT-LLM AWQ kernels are state-of-the-art for INT4 weight-only GEMM on Hopper.
AWQ is per-channel by construction. GPTQ uses per-channel grids too, but AWQ’s scaling step is the per-channel knob — it makes per-channel a first-class concept rather than an implementation detail.

AWQ is GPTQ’s spiritual successor — better accuracy, simpler algorithm, no Hessian inversions. The 2024 default for INT4 weight-only quantization, and the practical baseline you should beat with NVFP4 / MXFP4 if you’re claiming a new quantization format.

Where AWQ ships

AWQ in production

TensorRT-LLM — the canonical NVIDIA inference stack supports AWQ INT4 weights end-to-end with awq_marlin-style kernels.
vLLM — --quantization awq loads AWQ checkpoints with the same throughput characteristics as GPTQ. Same INT4 weight-only kernels, slightly different on-disk layout.
llm-awq (the reference repo) — Python tooling for converting FP16 → AWQ on a calibration set. ~10 minutes to quantize a 7B model, ~1 hour for 70B.
LoRA + AWQ — the AWQ INT4 base composes cleanly with BF16 LoRA adapters at serving time, same pattern as GPTQ + LoRA.
Hugging Face — most new INT4 checkpoints from 2024+ ship as AWQ rather than GPTQ.

When AWQ is not enough

The decision tree:

INT4 weight-only on Hopper / Ampere? AWQ. (GPTQ is fine if you already have checkpoints.)
INT4 weights and INT8 activations on Hopper? AWQ + SmoothQuant, in that order.
FP4 weights + FP4 activations on Blackwell? NVFP4. AWQ’s scaling trick is being absorbed into the NVFP4 quantization tooling.
QLoRA-shaped 4-bit fine-tuning? NF4. AWQ targets weight-only inference, not fine-tuning workflows.
CPU / Apple Silicon inference? GGUF K-quants. AWQ has GPU-targeted kernels; GGUF has CPU-targeted kernels.

AWQ won the post-GPTQ era of INT4 weight-only quantization and now sits as the production default. The 2026 question isn’t “GPTQ or AWQ?” — it’s “AWQ or NVFP4?”, and the answer depends on your hardware.

Go further

What does 'salient' actually mean for a weight channel?

A weight has high saliency if perturbing it produces large changes in the layer output. For a linear layer $y = Wx$, the output sensitivity to weight $W_{ij}$ is proportional to $|x_j|$, the magnitude of the activation it's multiplied by. So channels paired with high-magnitude activations are salient. AWQ identifies those channels via activation-magnitude statistics on a small calibration set (the per-channel mean of $|x_j|$ across examples) and absorbs a scaling factor that effectively gives them more bits of precision before rounding to INT4. Roughly 1% of channels carry most of the saliency in a typical transformer linear layer.

Why does AWQ match GPTQ accuracy with simpler math?

GPTQ's Hessian-weighted compensation update is the closed-form optimal correction for each rounding step under the layer-output reconstruction objective at a fixed quantization grid, but the dominant source of layer-output error is outlier channels, and you can address those without solving the full quadratic-optimization problem. AWQ rescales the salient channels before quantization so that they round more accurately, then quantizes everything with naive per-channel rounding. The remaining error from the non-salient channels is small enough that no compensation update is needed. Empirically AWQ matches or slightly beats GPTQ on most LLaMA-family models with 5–10x faster calibration.

Can AWQ scaling be combined with NF4 or MX-formats?

Yes, in principle. The activation-aware scaling trick is orthogonal to the choice of quantization grid — you can absorb a per-channel scale into the weights and then quantize them with INT4, NF4, MXFP4, or NVFP4. In practice the production tools mostly do AWQ+INT4 because that's what the published kernels target, but research implementations (llm-awq, quantmlsys) have shown AWQ scaling improves NF4 quality and is being used as a building block inside Blackwell-targeted FP4 quantizers.
