Quantization-Aware Training (QAT)

Also known as: quantization-aware training, QAT, fake quantization, simulated quantization

TL;DR

Training a model with quantization simulated in the forward pass so the weights co-adapt to low-precision rounding. Recovers quality that post-training quantization loses, at the cost of a fine-tuning run. The standard recipe for sub-4-bit deployment.

Quantization-aware training is fine-tuning with deployment-time quantization simulated inside the forward pass — weights rounded to the target precision before each matmul — so the model co-adapts to rounding noise instead of meeting it cold at deployment. Unlike post-training quantization (PTQ), which quantizes a finished FP16 checkpoint in a single pass, QAT folds the quantization error into the optimization loop. The optimizer sees a quantized network on every step and learns weights intrinsically robust to the bit grid they will live on. It is the standard recipe below 4 bits and for end-to-end activation-quantized configurations.

The mechanic

The forward pass uses fake quantization: each weight tensor is rounded to the INT4 (or NF4, FP4, etc.) grid and immediately dequantized back to FP16, so the matmul sees the same weight values the deployed kernel will use. The backward pass uses the straight-through estimator (STE): the gradient of the rounding operation, which is zero almost everywhere, is replaced with the identity, so gradients flow through into a full-precision shadow copy of the weights that the optimizer updates.

forward:   w_fp32  --(round to INT4 grid, dequantize)-->  w_fake  --(matmul)-->  y
backward:  dL/dw_fp32  <--(STE: pretend round = identity)--  dL/dw_fake
optimizer: w_fp32  <-  w_fp32 - lr * dL/dw_fp32

After training, the shadow weights are discarded; only the quantized tensors ship.
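
The forward / backward / optimizer loop above can be sketched in plain Python on a toy linear model. This is a minimal illustration, not a production recipe: the symmetric per-tensor INT4 scale, the mean-squared-error loss, and all names (`fake_quantize`, `qat_step`, `make_data`) are assumptions made for the example.

```python
import random

QMIN, QMAX = -8, 7  # signed INT4 code range (assumed for this sketch)


def fake_quantize(weights):
    """Round to a symmetric per-tensor INT4 grid, then dequantize to float."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    codes = [min(QMAX, max(QMIN, round(w / scale))) for w in weights]
    return [c * scale for c in codes], codes, scale


def loss_and_grad(w_fake, data):
    """Mean squared error of y = w . x and its gradient w.r.t. w_fake."""
    grad = [0.0] * len(w_fake)
    total = 0.0
    for x, y in data:
        err = sum(w * xi for w, xi in zip(w_fake, x)) - y
        total += err * err
        for i, xi in enumerate(x):
            grad[i] += 2.0 * err * xi
    n = len(data)
    return total / n, [g / n for g in grad]


def qat_step(w_shadow, data, lr):
    """One QAT step: quantized forward, STE backward, full-precision update."""
    w_fake, _, _ = fake_quantize(w_shadow)       # forward sees quantized weights
    loss, grad_fake = loss_and_grad(w_fake, data)
    # STE: treat d(w_fake)/d(w_shadow) as 1, so grad_fake updates the shadow copy.
    return [w - lr * g for w, g in zip(w_shadow, grad_fake)], loss


def make_data(true_w, n=64, seed=0):
    """Synthetic regression data generated from a hypothetical target weight vector."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in true_w]
        data.append((x, sum(w * xi for w, xi in zip(true_w, x))))
    return data


data = make_data([0.9, -0.35, 0.2])
w_shadow = [0.1, 0.1, 0.1]
losses = []
for _ in range(300):
    w_shadow, loss = qat_step(w_shadow, data, lr=0.05)
    losses.append(loss)

# Only the integer codes and the scale would ship; w_shadow is discarded.
w_deploy, codes, scale = fake_quantize(w_shadow)
```

The loss recorded at every step is the loss of the quantized model, so the curve in `losses` tracks what the deployed weights would score, not what the shadow copy would.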

Why it works

Rounding error is no longer a perturbation applied at deployment — it’s a constraint the optimizer adapts to. Weights drift toward configurations whose post-rounding values are the values the loss prefers, clustering near the quantization bins.

PTQ finds the FP16 weights that minimize the loss, then rounds and hopes. QAT finds weights that minimize the loss after rounding. The objectives diverge once the bit grid is coarse enough that rounding meaningfully shifts the loss landscape — the regime where PTQ falls apart.
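
The two objectives can be written side by side. The notation is introduced here for illustration and is not from the original text: $\mathcal{L}$ is the training loss and $Q$ is the round-and-dequantize map onto the bit grid.

```latex
\begin{aligned}
\text{PTQ:}\quad & w^{\star} = \arg\min_{w}\, \mathcal{L}(w),
  && \text{deploy } Q(w^{\star}) \\
\text{QAT:}\quad & \hat{w} = \arg\min_{w}\, \mathcal{L}\big(Q(w)\big),
  && \text{deploy } Q(\hat{w}) \\
\text{STE:}\quad & \frac{\partial \mathcal{L}}{\partial w}
  \approx \frac{\partial \mathcal{L}}{\partial Q(w)}
  && \text{(treat } \partial Q / \partial w \text{ as the identity)}
\end{aligned}
```

When the grid is fine, $Q(w^{\star}) \approx w^{\star}$ and the two deployed models nearly coincide. The coarser the grid, the further $\mathcal{L}(Q(w^{\star}))$ can sit above $\mathcal{L}(Q(\hat{w}))$.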

Where it sits relative to PTQ

The cost-benefit tradeoff vs PTQ is steep and bit-width-dependent:

8-bit (INT8 / FP8). PTQ is essentially lossless. QAT is overkill.
4-bit (NF4, AWQ, GPTQ). PTQ loses 1–3% on standard benchmarks; QAT closes most of the gap, but the cost rarely pays unless quality is mission-critical.
3-bit and below. PTQ falls off a cliff — 10–20% drops on reasoning benchmarks. QAT is the standard recipe to recover.
Mixed schemes (W4A8, W4A4, per-channel scales). QAT with channel-level learned scales recovers far more than PTQ, because activation outliers are something the model can be trained to suppress.
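
To make the per-channel scales mentioned in the last item concrete, here is a small sketch comparing one shared INT4 scale against one scale per output channel. It shows only the granularity effect: the scales are computed from the max magnitude, not learned as they would be in QAT, and the matrix values and function names are made up for the example.

```python
QMIN, QMAX = -8, 7  # signed INT4 code range (assumed)


def quantize_dequantize(values, scale):
    """Round each value to the INT4 grid defined by scale, then dequantize."""
    return [min(QMAX, max(QMIN, round(v / scale))) * scale for v in values]


def per_tensor(matrix):
    """One scale shared by the whole matrix."""
    max_abs = max(abs(v) for row in matrix for v in row)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    return [quantize_dequantize(row, scale) for row in matrix]


def per_channel(matrix):
    """One scale per row (row = output channel in this sketch)."""
    out = []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / QMAX if max_abs > 0 else 1.0
        out.append(quantize_dequantize(row, scale))
    return out


def mean_abs_error(a, b):
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Hypothetical weights: two small-magnitude channels and one large one.
weights = [
    [0.02, -0.05, 0.07, 0.01],
    [0.03, 0.06, -0.04, -0.07],
    [3.5, -2.1, 0.7, 1.4],
]

err_tensor = mean_abs_error(weights, per_tensor(weights))
err_channel = mean_abs_error(weights, per_channel(weights))
```

With a shared scale, the large channel sets the step size and the two small channels collapse to zero; with per-channel scales each row gets a grid matched to its own range, so `err_channel` comes out lower than `err_tensor`.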

QAT in production

Llama 3.2 quantized release (1B and 3B; Meta, 2024). Official 4-bit checkpoints trained with QAT, near-lossless on math and reasoning evals where PTQ-INT4 visibly degraded.
Microscaling formats (MXFP4, NVFP4). Almost always QAT’d in practice; their block-scale structure benefits disproportionately from co-adapted weights.
Distillation + QAT. Train the QAT student against the FP16 teacher’s logits or hidden states (see knowledge distillation). Recovers what a cold cross-entropy loss leaves on the table.
SmoothQuant-style W8A8 QAT. With INT8 activations, QAT learns to suppress outliers that PTQ’s per-channel rescaling can only paper over.
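
The logit-matching loss in the "Distillation + QAT" item above can be sketched as a temperature-softened KL divergence between teacher and student. This is one common formulation, assumed here rather than taken from the original text; the temperature value, the `temperature ** 2` rescaling, and the function names are illustrative choices.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    log_p = log_softmax(teacher_logits, temperature)  # FP16 teacher
    log_q = log_softmax(student_logits, temperature)  # fake-quantized student
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


# Hypothetical logits for one token position.
teacher = [4.0, 1.5, -0.5, 0.2]
student = [3.1, 2.0, -0.2, 0.0]

loss_mismatch = distillation_loss(student, teacher)
loss_match = distillation_loss(teacher, teacher)
```

In a QAT run the student logits would come from the fake-quantized forward pass, so this loss pulls the quantized model toward the teacher's full output distribution, not only the one-hot label.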

Activations are harder than weights: they are input-dependent and outlier-heavy, with outlier values often 50–100× the typical magnitude on a single token. PTQ on activations is limited to algebraic tricks (SmoothQuant migrates outlier mass into the weights). QAT lets the model learn to suppress those outliers: simulated quantization in the forward pass penalizes outlier-producing weight configurations through the loss. PTQ has no equivalent. This is why W4A4 almost always requires QAT: pure PTQ at W4A4 typically loses 20%+ on hard benchmarks, and QAT recovers most of it.
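
The algebraic trick attributed to SmoothQuant above can be illustrated with a per-input-channel rescaling that leaves the matmul output unchanged. The sketch assumes the usual smoothing factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) with alpha = 0.5, nonzero channels, and toy values; the helper names are hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix product: (n x k) @ (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def max_abs(matrix):
    return max(abs(v) for row in matrix for v in row)


def smooth(x, w, alpha=0.5):
    """Divide activation channel j by s_j and multiply weight row j by s_j."""
    n_in = len(w)
    scales = []
    for j in range(n_in):
        act_max = max(abs(row[j]) for row in x)  # assumed nonzero
        w_max = max(abs(v) for v in w[j])        # assumed nonzero
        scales.append(act_max ** alpha / w_max ** (1.0 - alpha))
    x_s = [[row[j] / scales[j] for j in range(n_in)] for row in x]
    w_s = [[v * scales[j] for v in w[j]] for j in range(n_in)]
    return x_s, w_s, scales


# Two tokens, three input channels; channel 1 carries the outlier.
x = [[0.5, 60.0, -0.8], [-0.3, 45.0, 0.6]]
w = [[0.2, -0.1], [0.05, 0.03], [-0.3, 0.4]]

x_s, w_s, scales = smooth(x, w)
y = matmul(x, w)
y_s = matmul(x_s, w_s)
```

The outputs `y` and `y_s` agree up to floating-point error while the activation range shrinks, at the price of a larger weight range in the outlier channel. That is the sense in which the difficulty is moved rather than removed, and why a trained-in fix can go further.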

Start from an FP16 checkpoint. Wrap every nn.Linear with a FakeQuantize module that quantizes on the forward pass, applies the STE on the backward pass, and holds the full-precision shadow copy. Fine-tune for 1–10B tokens at a small learning rate on a distribution close to inference traffic. Optionally distill against an FP16 teacher. Total cost: ~0.5–5% of pretraining compute, which is cheaper than retraining but much more expensive than PTQ’s single calibration pass.

The 2026 rule of thumb: PTQ at 4 bits and above for weight-only quantization, QAT below 4 bits or for end-to-end weight-and-activation quantization. Below 3 bits, QAT plus distillation is the only recipe that consistently holds quality at production scale.

The orthogonal recipe is QLoRA: freeze a PTQ’d base, train a BF16 LoRA on top. Cheap, but the base never co-adapts. QAT is what you reach for when QLoRA’s ceiling is the bottleneck, and the two compose, since QAT’d weights still accept LoRA adapters at serve time. The GGUF CPU stack and mixed-precision training GPU stack both load QAT’d checkpoints transparently; to the serving kernels, they’re just lower-bit weights.

Go further

When is QAT worth the extra training compute?

Below 4 bits, where post-training quantization (PTQ) loses 5–15% accuracy on hard benchmarks. At 8-bit, PTQ is essentially lossless; at 4-bit, PTQ with AWQ or GPTQ is usually fine. At 2- or 3-bit, QAT is what closes the gap — and the only way to recover quality on activation-quantized W4A4 configurations.

How does the forward pass simulate quantization without losing gradients?

Fake-quantize the weights in the forward pass — round and dequantize back to full precision before each matmul — then use the straight-through estimator (STE) on the backward pass, passing gradients through the round as if it were the identity function. The model sees quantized weights at every step, but gradients still flow continuously into the full-precision shadow copy that the optimizer updates.

QAT vs LoRA + quantization (QLoRA)?

QLoRA freezes a quantized base model and trains a full-precision LoRA on top — cheap, but the base weights never co-adapt to the quantization. QAT trains the base weights themselves under simulated quantization, which is what closes the sub-4-bit quality gap. QLoRA is the right answer when you want adapters; QAT is the right answer when you want the deployed weights to be intrinsically quantization-robust.
