Why is NF4 better than uniform INT4 if both are 4-bit?
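Same bit budget, smarter level placement. Uniform INT4 spaces its 16 levels evenly across the value range; NF4 places them so that each bin carries equal probability mass under the standard normal.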
Also known as: NormalFloat-4, QLoRA NF4, Gaussian-quantized 4-bit
A 4-bit weight format with 16 levels placed at the equiquantiles of the standard normal distribution rather than uniformly. Trained-network weights are approximately Gaussian, so the levels land where the weights concentrate.
NF4 (NormalFloat 4) is the 4-bit weight format introduced by Dettmers et al. in the QLoRA paper (2023). It is the standard 4-bit storage format for QLoRA fine-tuning and one of the most widely deployed non-uniform quantization codes in production. The trick fits in one line: choose the 16 quantization levels so that each bin carries equal probability mass under the standard normal, instead of placing them uniformly across the value range.
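Weights of a trained transformer are approximately Gaussian: most sit close to zero and only a few reach the extremes. Uniform INT4 nevertheless spreads its 16 levels evenly across the whole range.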
The result is wasted resolution. Roughly half the INT4 levels sit out in the tails where almost no weight ever lands; the bulk near zero — where a quantizer’s resolution would actually matter — is covered by only 6–8 levels.
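NF4 fixes this by placing its 16 levels at the quantiles of the standard normal: $q_i = \Phi^{-1}(p_i)$, where $\Phi^{-1}$ is the inverse CDF of $N(0,1)$ and the $p_i$ are 16 evenly spaced probabilities in $(0, 1)$.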
Uniform INT4 (16 levels, evenly spaced):
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
-1 1
* * * * * * * * * * * (uniform — spends bits in the tails)
NF4 (16 levels, quantile-spaced under N(0,1)):
+---+--+-++++++++-+--+---+--------
-1 1
* * ********** * * (dense near zero, sparse at tails —
^^^^^^^^^^^ matches Gaussian density)
Where pretrained weights actually live (overlay):
/\
/ \
/ \ <- bell-curve mass
/ \
__/ \__
-1 1
The two grids carry the same number of levels — same 4 bits per weight — but NF4’s are placed where weights actually are.
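A single global scale won’t work, because weight magnitudes vary by orders of magnitude across a model. NF4 quantizes in blocks of 64 weights: each block is normalized by its own absmax to fit into [-1, 1], each normalized weight is stored as the 4-bit code of its nearest level, and reconstruction is level_lookup[code] * absmax.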
NF4 block (64 weights):
+-------------------------------------------------+----------+
| 64 x 4-bit codes = 32 bytes | FP32 |
| c c c c c c c c c c c c c c c c c c c c ... | absmax |
+-------------------------------------------------+----------+
^ ^
fixed 16-level NF4 codebook per-block scale
(shared across all blocks) (4 bytes / 64 weights
= 0.5 bits/elem)
Weight reconstruction: w_i = NF4_TABLE[code_i] * absmax
Block size 64 is a deliberate trade. Smaller blocks track local distribution shifts better but pay more scale-overhead bits. 64 is the empirical sweet spot for transformer weights.
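The block scheme described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the bitsandbytes implementation: `CODEBOOK` is a stand-in 16-level table built from midpoint quantiles of N(0,1) rather than the real NF4 table, and the function names and the nibble order used for packing (high nibble first) are hypothetical choices made for the example.

```python
import random
from statistics import NormalDist

BLOCK_SIZE = 64

# Assumption: stand-in codebook, NOT the real NF4 table.
# Midpoint quantiles of N(0,1), rescaled so the outermost levels sit at -1 and 1.
_raw = [NormalDist().inv_cdf((i + 0.5) / 16) for i in range(16)]
_peak = max(abs(v) for v in _raw)
CODEBOOK = [v / _peak for v in _raw]

def quantize_block(weights):
    """Return (packed_codes, absmax) for one block of BLOCK_SIZE floats."""
    assert len(weights) == BLOCK_SIZE
    # Per-block scale; fall back to 1.0 so an all-zero block does not divide by zero.
    absmax = max(abs(w) for w in weights) or 1.0
    codes = []
    for w in weights:
        x = w / absmax  # normalized into [-1, 1]
        codes.append(min(range(16), key=lambda c: abs(CODEBOOK[c] - x)))
    # Pack two 4-bit codes per byte (high nibble first): 64 codes -> 32 bytes.
    packed = bytes((codes[i] << 4) | codes[i + 1] for i in range(0, BLOCK_SIZE, 2))
    return packed, absmax

def dequantize_block(packed, absmax):
    """Reconstruct w_i = CODEBOOK[code_i] * absmax."""
    out = []
    for byte in packed:
        out.append(CODEBOOK[byte >> 4] * absmax)
        out.append(CODEBOOK[byte & 0x0F] * absmax)
    return out

random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]
packed, absmax = quantize_block(block)
restored = dequantize_block(packed, absmax)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(len(packed), "bytes of codes; max abs error", max_err)
```

The codebook is shared by every block; only the 32 bytes of codes and the one scale are stored per block, which is where the 0.5 bits/elem of scale overhead comes from.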
The information-theoretic argument: if your input is a continuous random variable
Equiquantile placement is not exactly Lloyd–Max-optimal under MSE. It targets a different criterion, equal usage of all 16 codes, rather than minimum squared error, so its squared error on Gaussian data is higher than Lloyd–Max's, mostly because the outermost equal-mass bins are wide. It also has a delicious property: for ideally Gaussian inputs every code is equally likely, so each 4-bit index already carries the full 4 bits of entropy, and a downstream entropy coder that treats the indices independently cannot compress them further.
The closed form is just the inverse CDF:
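One way to read that closed form, as a minimal sketch: evaluate the standard normal's inverse CDF at 16 evenly spaced probabilities and rescale to [-1, 1]. The choice of bin midpoints `(i + 0.5) / 16` and the function name `equiquantile_levels` are assumptions made for illustration. The table shipped with QLoRA differs in detail: it is built asymmetrically so that one level is exactly zero, which this symmetric grid does not have.

```python
from statistics import NormalDist

def equiquantile_levels(bits: int = 4) -> list[float]:
    """Midpoint quantiles of N(0,1), rescaled so the extremes sit at -1 and 1."""
    n = 2 ** bits
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # inv_cdf is the quantile function; (i + 0.5) / n is the centre of
    # the i-th equal-probability bin (an assumption of this sketch).
    raw = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(v) for v in raw)
    return [v / largest for v in raw]

levels = equiquantile_levels()
for code, level in enumerate(levels):
    print(f"{code:2d}  {level:+.4f}")
```

Printing the table makes the spacing visible: neighbouring levels are close together near zero and far apart toward the ends.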
Reconstruction error vs uniform INT4. Take
Per-block FP32 absmaxes carry overhead: 32 bits per 64 weights = 0.5 bits/elem on top of the 4 bits/elem payload. That’s 12.5% — enough to erase a meaningful fraction of NF4’s compression.
QLoRA’s fix is double quantization: re-quantize the absmaxes themselves.
First-level quant (NF4):
weights --[block of 64]--> 4-bit codes + FP32 absmax (0.5 bits/elem overhead)
Second-level quant (8-bit):
take 256 absmaxes --> 8-bit codes + FP32 super-scale (super-block = 256 * 64 = 16k weights)
Overhead after double-quant:
8 bits / 64 weights = 0.125 bits/elem
+ 32 bits / (256 * 64) weights =~ 0.002 bits/elem
----------------------------------------------------
=~ 0.127 bits/elem total scale overhead (recovered ~0.37 bits/elem)
The absmaxes span a narrow enough range that 8-bit quantization of them is essentially lossless. Double quantization is a small accounting trick that saves about 0.37 bits per weight at no measurable quality cost. On a 70B model that is roughly 3 GB: about 36 GB of weights at 4.127 bits/elem versus about 39 GB at 4.5 bits/elem.
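The overhead arithmetic can be checked in a few lines. This is only a sketch: the constant names and the `model_gb` helper are hypothetical, and the size estimate assumes all 70B parameters are stored in the quantized format and counts 1 GB as 10^9 bytes.

```python
BLOCK = 64         # weights per first-level block
SUPER = 256        # first-level absmaxes per super-block
PAYLOAD_BITS = 4   # bits per weight for the codes themselves

# One FP32 absmax per block of 64 weights.
plain_overhead = 32 / BLOCK
# One 8-bit absmax per block, plus one FP32 super-scale per 256 blocks.
double_overhead = 8 / BLOCK + 32 / (SUPER * BLOCK)

def model_gb(n_params: float, bits_per_weight: float) -> float:
    """Storage in GB (10^9 bytes), assuming every parameter is quantized."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 70e9
print("overhead without double quant:", plain_overhead, "bits/elem")
print("overhead with double quant:   ", round(double_overhead, 4), "bits/elem")
print("bits/elem recovered:          ", round(plain_overhead - double_overhead, 4))
print("70B without double quant (GB):", model_gb(n_params, PAYLOAD_BITS + plain_overhead))
print("70B with double quant (GB):   ", model_gb(n_params, PAYLOAD_BITS + double_overhead))
```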
NF4 is information-theoretically optimal for Gaussian weights at 4 bits in the equal-mass sense above: every code is used equally often. No clever rounding scheme, no calibration data, no Hessian-weighted updates: just the right level placement for the right distribution. That’s the whole trick.
NF4 ships in bitsandbytes, the canonical QLoRA library. The end-to-end pipeline:
The choice is hardware-shaped:
NF4 is the right default for the QLoRA-shaped problem: single-GPU 4-bit fine-tuning on hardware that doesn’t have FP4 tensor cores. For everything else, the MX-family or calibration-driven INT4 is now the better answer.
NF4 is per-block, typically with block size 64: each block of 64 weights gets its own FP32 absmax scale. That scale is 32 bits per 64 weights = 0.5 bits/elem of overhead, which would erase a meaningful fraction of NF4's compression. Double quantization re-quantizes the per-block scales themselves: group 256 of those scales into a super-block and store them as 8-bit codes with a single FP32 super-block scale. The overhead drops from 0.5 bits/elem to about 0.127 bits/elem, recovering roughly 0.37 bits per weight at no measurable quality cost.
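The second-level step can be sketched as follows. This is a deliberately simplified illustration, not QLoRA's actual scheme: it assumes a plain unsigned 8-bit uniform code over [0, max scale], whereas the QLoRA paper describes centering the scales and using an 8-bit float code. The function names are hypothetical.

```python
import random

SUPER_BLOCK = 256  # first-level scales per super-block

def quantize_scales(absmaxes):
    """Map 256 non-negative scales to 8-bit codes plus one super-scale."""
    assert len(absmaxes) == SUPER_BLOCK
    # Assumption: uniform unsigned code; fall back to 1.0 if every scale is zero.
    super_scale = max(absmaxes) or 1.0
    codes = bytes(round(a / super_scale * 255) for a in absmaxes)
    return codes, super_scale

def dequantize_scales(codes, super_scale):
    """Recover approximate scales from the 8-bit codes."""
    return [c / 255 * super_scale for c in codes]

random.seed(0)
scales = [abs(random.gauss(0.05, 0.01)) for _ in range(SUPER_BLOCK)]
codes, super_scale = quantize_scales(scales)
approx = dequantize_scales(codes, super_scale)
worst = max(abs(a - b) for a, b in zip(scales, approx))
print(len(codes), "bytes for", SUPER_BLOCK, "scales; worst error", worst)
```

Under this uniform code the worst-case error on any scale is half a step, that is `super_scale / 510`.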
NF4 is weight-only, non-uniform, and distribution-matched — best when you only need to compress weights and you trust them to be Gaussian (true for almost every pretrained transformer). MXFP4 / NVFP4 are floating-point microscaling formats with hardware support on Blackwell tensor cores; they compress activations as well as weights, support training, and are uniform-on-log-scale. Practical rule: NF4 for QLoRA-style memory-bound fine-tuning where the base model is frozen; MX/NV-FP4 for end-to-end low-precision inference and training on supported hardware.