NF4 — NormalFloat 4-Bit Quantization

Also known as: NormalFloat-4, QLoRA NF4, Gaussian-quantized 4-bit

TL;DR

A 4-bit weight format with 16 levels placed at the equiquantiles of the standard normal distribution rather than uniformly. Trained-network weights are approximately $\mathcal{N}(0, \sigma^2)$, so this placement spends bits where the mass actually lives.

NF4 — NormalFloat 4 — is the 4-bit weight format introduced by Dettmers et al. in the QLoRA paper (2023). It is the standard 4-bit storage format for QLoRA fine-tuning and one of the most widely deployed non-uniform quantization codes in production. The trick is single-line: choose the 16 quantization levels so that each bin carries equal probability mass under the standard normal, instead of placing them uniformly across the value range.

The setup

Weights of a trained transformer are approximately $\mathcal{N}(0, \sigma^2)$, a fact that is robust across architectures, sizes, and training regimes. Most of the mass sits near zero; the tails decay as $e^{-w^2 / 2\sigma^2}$. A uniform INT4 grid ignores that geometry and treats a weight far out in the tail as just as likely as one near zero, even though the latter is hundreds of times more common.

The result is wasted resolution. Roughly half the INT4 levels sit out in the tails where almost no weight ever lands; the bulk near zero — where a quantizer’s resolution would actually matter — is covered by only 6–8 levels.
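To make the wasted-resolution argument concrete, here is a small sketch that computes how much standard-normal probability mass each level of a uniform 16-level grid actually captures. The 3-sigma clipping range and the function name `uniform_bin_masses` are assumptions chosen for illustration; a real block's range depends on its absmax.

```python
from statistics import NormalDist

def uniform_bin_masses(num_levels=16, clip_sigmas=3.0):
    """Mass of N(0,1) captured by each level of a uniform grid.

    Assumption: the grid spans [-clip_sigmas, +clip_sigmas] and each level
    owns the interval halfway to its neighbours (outer levels own the tails).
    """
    nd = NormalDist()
    step = 2 * clip_sigmas / (num_levels - 1)
    levels = [-clip_sigmas + i * step for i in range(num_levels)]
    # Decision boundaries are the midpoints between adjacent levels.
    mids = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
    edges = [float("-inf")] + mids + [float("inf")]
    return [nd.cdf(hi) - nd.cdf(lo) for lo, hi in zip(edges, edges[1:])]

masses = uniform_bin_masses()
for i, m in enumerate(masses):
    print(f"level {i:2d}: {100 * m:5.2f}% of weights")
print(f"outer 8 levels combined: {100 * (sum(masses[:4]) + sum(masses[-4:])):.1f}%")
```

Under this 3-sigma assumption, the eight outermost levels together capture only about a tenth of the mass, while the eight central levels have to cover everything else.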

The level placement

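NF4 fixes this by placing its 16 levels at the quantiles of the standard normal:

$$q_i = \Phi^{-1}\!\left(\frac{i + 1/2}{16}\right), \qquad i = 0, \dots, 15$$

where $\Phi^{-1}$ is the inverse CDF of $\mathcal{N}(0, 1)$. Each pair of adjacent levels then bounds a region with $1/16$ of the total probability mass: the levels are equiquantile. The actual NF4 codebook deviates slightly from this idealized formula: it reserves one level at exactly $0$ (leaving 7 negative and 8 positive levels, so it is only approximately symmetric), and it is normalized so the codes lie in $[-1, 1]$.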
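A minimal sketch of the idealized equiquantile placement, using only the Python standard library. It assumes quantile positions $(i + 1/2)/16$, which is one common way to avoid the infinite quantiles at 0 and 1; it does not reproduce the shipped NF4 table, which additionally forces an exact zero level.

```python
from statistics import NormalDist

def equiquantile_levels(num_levels=16):
    """Idealized quantile-spaced levels under N(0,1), rescaled to [-1, 1]."""
    nd = NormalDist()
    # Assumption: sample the inverse CDF at the centre of each 1/num_levels slice.
    raw = [nd.inv_cdf((i + 0.5) / num_levels) for i in range(num_levels)]
    scale = max(abs(v) for v in raw)
    return [v / scale for v in raw]

levels = equiquantile_levels()
print([round(v, 4) for v in levels])

# Gaps between adjacent levels: small near zero, large at the ends.
gaps = [b - a for a, b in zip(levels, levels[1:])]
print("central gap:", round(gaps[7], 4), " outer gap:", round(gaps[0], 4))
```

Note that with an even number of symmetric levels there is no level at zero, which is one reason the real codebook is built slightly differently.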

Uniform INT4 (16 levels, evenly spaced):
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  -1                              1
   *  *  *  *  *  *  *  *  *  *  *   (uniform — spends bits in the tails)


NF4 (16 levels, quantile-spaced under N(0,1)):
  +---+--+-++++++++-+--+---+--------
  -1                              1
        *  * ********** *  *         (dense near zero, sparse at tails —
         ^^^^^^^^^^^                    matches Gaussian density)


Where pretrained weights actually live (overlay):
         /\
        /  \
       /    \                <- bell-curve mass
      /      \
   __/        \__
  -1            1

The two grids carry the same number of levels — same 4 bits per weight — but NF4’s are placed where weights actually are.

Per-block scaling

A single global scale won’t work, because weight magnitudes vary by orders of magnitude across a model. NF4 quantizes in blocks of 64 weights: each block is normalized by its own absmax to fit into $[-1, 1]$, the 4-bit code of the nearest level is stored, and reconstruction is level_lookup[code] * absmax.

NF4 block (64 weights):
  +-------------------------------------------------+----------+
  |  64 x 4-bit codes  =  32 bytes                  |  FP32    |
  |  c c c c c c c c c c c c c c c c c c c c ...    |  absmax  |
  +-------------------------------------------------+----------+
                    ^                                     ^
           fixed 16-level NF4 codebook              per-block scale
              (shared across all blocks)         (4 bytes / 64 weights
                                                   = 0.5 bits/elem)

  Weight reconstruction:  w_i  =  NF4_TABLE[code_i]  *  absmax

Block size 64 is a deliberate trade. Smaller blocks track local distribution shifts better but pay more scale-overhead bits. 64 is the empirical sweet spot for transformer weights.
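The block scheme above can be sketched in a few lines of plain Python. The table below is the commonly cited NF4 codebook rounded to four decimals; treat the exact digits as an assumption and check them against your library. The function names `quantize_block` and `dequantize_block` are hypothetical, not a library API.

```python
# Commonly cited NF4 codebook, rounded to 4 decimals (assumption: verify
# against your library before relying on the exact digits).
NF4_TABLE = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
             0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]

def quantize_block(weights):
    """Return (4-bit codes, absmax) for one block of weights."""
    absmax = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    codes = [
        min(range(16), key=lambda c: abs(NF4_TABLE[c] - w / absmax))
        for w in weights
    ]
    return codes, absmax

def dequantize_block(codes, absmax):
    """Reconstruct weights: w_i = NF4_TABLE[code_i] * absmax."""
    return [NF4_TABLE[c] * absmax for c in codes]

codes, absmax = quantize_block([0.0, 2.0, -2.0, 1.0])
print(codes, absmax)
print(dequantize_block(codes, absmax))
```

The largest-magnitude weight in a block always maps to code 0 or 15 and is reconstructed exactly; every other weight is rounded to the nearest table entry.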

The information-theoretic argument: if your input is a continuous random variable with density $p(x)$, and you’re allowed $N$ discrete levels, the minimum-distortion placement (under MSE) is roughly the Lloyd–Max quantizer, which iteratively refines bin boundaries and centroids. For Gaussian inputs the Lloyd–Max solution has no closed form, but a clean approximation is the equiquantile placement: pick levels $q_0 < q_1 < \dots < q_{N-1}$ such that $\int_{q_i}^{q_{i+1}} p(x)\,dx = 1/N$ for all $i$.

Equiquantile is not exactly Lloyd–Max-optimal under MSE: it minimizes a slightly different criterion (KL-style divergence to a uniform code distribution). It is nonetheless asymptotically optimal for log-loss-style reconstruction error and within a few percent of Lloyd–Max in practice. It also has a delicious property: when the input really is Gaussian, every code is equally likely, so the 4-bit indices already carry maximal entropy and a downstream entropy coder has nothing left to compress.
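For readers who want to see the Lloyd–Max iteration itself, here is a sample-based sketch: it starts from equiquantile levels and alternates nearest-level assignment with centroid updates. The sample count, iteration count, and the $(i + 1/2)/16$ starting positions are assumptions for illustration, and the result is an empirical estimate on random samples, not an exact optimum.

```python
import random
from bisect import bisect_right
from statistics import NormalDist

def mse(samples, levels):
    """Mean squared error when each sample snaps to its nearest level."""
    edges = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
    return sum((x - levels[bisect_right(edges, x)]) ** 2 for x in samples) / len(samples)

def lloyd_max(samples, levels, iterations=30):
    """Alternate between midpoint boundaries and per-bin centroids."""
    levels = sorted(levels)
    for _ in range(iterations):
        edges = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
        sums = [0.0] * len(levels)
        counts = [0] * len(levels)
        for x in samples:
            k = bisect_right(edges, x)  # index of the nearest level
            sums[k] += x
            counts[k] += 1
        # Empty bins keep their previous level.
        levels = [s / c if c else lv for s, c, lv in zip(sums, counts, levels)]
    return levels

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]
nd = NormalDist()
start = [nd.inv_cdf((i + 0.5) / 16) for i in range(16)]
refined = lloyd_max(samples, start)
print("equiquantile MSE:", mse(samples, start))
print("Lloyd-Max MSE:   ", mse(samples, refined))
```

Each Lloyd step can only lower (or keep) the empirical MSE, so the refined levels are never worse than the starting point on the same samples. How large the gap is depends on whether per-block normalization is applied first.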

The closed form is just the inverse CDF: $q_i = \Phi^{-1}\!\left(\frac{i + 1/2}{16}\right)$ for $i = 0, \dots, 15$. Symmetrize around zero, normalize so the codes lie in $[-1, 1]$, and you have NF4.

Reconstruction error vs uniform INT4. Take $w \sim \mathcal{N}(0, 1)$, quantize with each codebook, measure $\mathbb{E}[(w - \hat{w})^2]$. NF4 wins in MSE on Gaussian input, exactly the regime that matches pretrained transformer weights. End-to-end on LLaMA-65B, that translates to roughly 0.3–0.5 perplexity recovered vs uniform INT4 at the same bit count.
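The comparison described above is easy to run yourself. This sketch draws Gaussian blocks of 64, normalizes each by its absmax, and measures the average squared error under both codebooks. The NF4 table is the commonly cited one rounded to four decimals (an assumption to verify), the uniform grid is taken as 16 evenly spaced points on $[-1, 1]$, and `compare` is a hypothetical helper name. No expected numbers are quoted here; the outcome depends on block size and seed.

```python
import random

# Commonly cited NF4 codebook, rounded to 4 decimals (assumption).
NF4_TABLE = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
             0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]
# Uniform baseline: 16 evenly spaced levels on [-1, 1].
INT4_TABLE = [-1 + 2 * i / 15 for i in range(16)]

def block_mse(block, table):
    """MSE of one block after absmax normalization and nearest-level rounding."""
    absmax = max(abs(w) for w in block) or 1.0
    err = 0.0
    for w in block:
        q = min(table, key=lambda lv: abs(lv - w / absmax)) * absmax
        err += (w - q) ** 2
    return err / len(block)

def compare(num_blocks=200, block_size=64, seed=0):
    """Average per-block MSE for both codebooks on N(0,1) samples."""
    rng = random.Random(seed)
    totals = {"nf4": 0.0, "int4": 0.0}
    for _ in range(num_blocks):
        block = [rng.gauss(0.0, 1.0) for _ in range(block_size)]
        totals["nf4"] += block_mse(block, NF4_TABLE)
        totals["int4"] += block_mse(block, INT4_TABLE)
    return {name: total / num_blocks for name, total in totals.items()}

result = compare()
print(result)
```

Varying `block_size` is a quick way to see how strongly the comparison depends on the per-block absmax.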

Double quantization

Per-block FP32 absmaxes carry overhead: 32 bits per 64 weights = 0.5 bits/elem on top of the 4 bits/elem payload. That’s 12.5% — enough to erase a meaningful fraction of NF4’s compression.

QLoRA’s fix is double quantization: re-quantize the absmaxes themselves.

First-level quant (NF4):
   weights --[block of 64]--> 4-bit codes  +  FP32 absmax    (0.5 bits/elem overhead)

Second-level quant (8-bit):
   take 256 absmaxes --> 8-bit codes  +  FP32 super-scale   (super-block = 256 * 64 = 16k weights)

Overhead after double-quant:
   8 bits / 64 weights  =  0.125 bits/elem
        +  32 bits / (256 * 64) weights  =~  0.002 bits/elem
   ----------------------------------------------------
   =~ 0.127 bits/elem total scale overhead   (recovered ~0.4 bits/elem)

The absmaxes are themselves uniformly distributed enough that 8-bit quantization is essentially lossless. Double quantization is a small accounting trick that saves about 0.37 bits per weight at no measurable quality cost. For a 70B model that is roughly 3 GB of weight storage: about 36 GB with double quantization versus about 39 GB without.
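The overhead arithmetic is simple enough to check in code. This sketch reproduces the bits-per-element accounting from the diagram and converts it to weight-storage size; the helper names are hypothetical, and sizes count only quantized weights in decimal gigabytes, ignoring activations, adapters, and any layers kept in higher precision.

```python
def scale_overhead_bits(block_size=64, scale_bits=32):
    """Per-element overhead of one FP32 absmax per block."""
    return scale_bits / block_size

def double_quant_overhead_bits(block_size=64, super_block=256,
                               code_bits=8, super_scale_bits=32):
    """Per-element overhead after re-quantizing the absmaxes to 8 bits."""
    first = code_bits / block_size                       # 8-bit absmax code
    second = super_scale_bits / (super_block * block_size)  # FP32 super-scale
    return first + second

def model_size_gb(num_params, bits_per_weight):
    """Weight storage in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

plain = 4 + scale_overhead_bits()
double = 4 + double_quant_overhead_bits()
print(f"bits/weight without double quant: {plain}")
print(f"bits/weight with double quant:    {double}")
print(f"70B model: {model_size_gb(70e9, plain):.1f} GB vs {model_size_gb(70e9, double):.1f} GB")
```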

For Gaussian weights at 4 bits, NF4 is optimal in the equal-probability-mass sense, even though it is not exactly MSE-optimal. No clever rounding scheme, no calibration data, no Hessian-weighted updates, just the right level placement for the right distribution. That’s the whole trick.

Where it sits in the format zoo

NF4 vs sibling 4-bit formats

Uniform INT4. 16 levels, evenly spaced. The naive baseline. Wastes resolution on the tails. Used pre-NF4 and still common in non-quantization-aware kernels.
NF4. 16 levels, quantile-spaced under $\mathcal{N}(0, 1)$. Weight-only. No hardware support; runs on dequantize-then-FP16-matmul kernels. The QLoRA default.
MXFP4. Microscaling FP4 (E2M1 elements) with a shared 8-bit exponent per 32-element block. Floating-point, log-scale spacing. Hardware support on Blackwell. Supports training.
NVFP4. NVIDIA’s variant: the same E2M1 element format, an FP8-encoded scale per 16-element block, plus a tensor-level FP32 scale. Generally better quality than MXFP4, at a slightly higher scale overhead.
GPTQ-INT4 / AWQ-INT4. Uniform INT4 with calibration-data-driven scaling tricks, a different attack on the same problem. Composable with NF4-style level placement in some experimental kernels.
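To see what "log-scale spacing" means for the FP4 formats in the list above, this sketch decodes every E2M1 code (1 sign bit, 2 exponent bits, 1 mantissa bit). It assumes an exponent bias of 1, subnormals when the exponent field is zero, and no infinities or NaNs; `decode_e2m1` is a hypothetical helper, and the per-block scale is left out.

```python
def decode_e2m1(nibble):
    """Decode a 4-bit E2M1 code to its unscaled value.

    Assumptions: bit layout is sign | exp(2) | mantissa(1), exponent bias 1,
    exponent field 0 means subnormal, and there are no inf/NaN encodings.
    """
    sign = -1.0 if nibble & 0b1000 else 1.0
    exp = (nibble >> 1) & 0b11
    man = nibble & 0b1
    if exp == 0:
        return sign * man * 0.5                  # subnormal: 0 or 0.5
    return sign * (1 + man * 0.5) * 2.0 ** (exp - 1)

positive = [decode_e2m1(code) for code in range(8)]
print(positive)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

The gaps grow with magnitude (0.5, 0.5, 0.5, 0.5, 1, 1, 2), in contrast to NF4's quantile-derived gaps and INT4's constant gap.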

Practical use

NF4 ships in bitsandbytes, the canonical QLoRA library. The end-to-end pipeline:

Load FP16 base model.
For each linear layer’s weight matrix, partition into 64-element blocks, store NF4 codes + absmaxes.
Apply double quantization to the absmaxes.
At inference: stream NF4 codes from HBM, dequantize to FP16 in a fused kernel, multiply with the FP16 activation, accumulate in FP32. The trick that makes it tolerable on memory-bound decode is that the dequantize is fused into the matmul — no full FP16 weight tensor ever materializes in HBM.
For QLoRA fine-tuning: train LoRA adapters in BF16 on top of the frozen NF4 base. The adapters’ gradients flow only into the rank-$r$ low-rank pair, never into the NF4 weights.
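The storage and dequantize-on-the-fly steps in the pipeline above can be illustrated without any GPU library. This is a plain-Python sketch, not how bitsandbytes implements its kernels: the nibble order (low nibble first), the rounded NF4 table, and the names `pack_codes` and `dequant_dot` are all assumptions made for illustration.

```python
# Commonly cited NF4 codebook, rounded to 4 decimals (assumption).
NF4_TABLE = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
             0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]

def pack_codes(codes):
    """Pack 4-bit codes two per byte (low nibble first, by assumption)."""
    if len(codes) % 2:
        raise ValueError("need an even number of codes")
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))

def dequant_dot(packed, absmax, activations):
    """Dot product of one NF4 block with activations, dequantizing on the fly.

    No full-precision copy of the weights is ever built: each byte is
    unpacked, looked up, scaled, and consumed immediately.
    """
    acc = 0.0
    for j, byte in enumerate(packed):
        lo, hi = byte & 0x0F, byte >> 4
        acc += NF4_TABLE[lo] * absmax * activations[2 * j]
        acc += NF4_TABLE[hi] * absmax * activations[2 * j + 1]
    return acc

packed = pack_codes([7, 15, 0, 12])
print(packed.hex())
print(dequant_dot(packed, 2.0, [1.0, 1.0, 1.0, 1.0]))
print(len(pack_codes([0] * 64)), "bytes for a 64-weight block")
```

A real kernel does the same thing per tile in registers or shared memory, which is why the full FP16 weight tensor never has to exist in HBM.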

When to pick NF4 vs the MX-family

The choice is hardware-shaped:

Pick NF4 when you only need weight quantization, you don’t have FP4 tensor cores, and you trust the weight distribution to be Gaussian. QLoRA-style memory-bound fine-tuning, single-GPU inference of large models on Ampere/Hopper. The kernel cost is dequantize-then-FP16-matmul — fine on memory-bound decode, less ideal on compute-bound prefill.
Pick MXFP4 / NVFP4 when you have Blackwell-class hardware with native FP4 tensor cores, you want to compress activations and weights, or you want a format that supports training. The hardware does the matmul in 4-bit directly — no dequantize step — which dominates NF4 on prefill throughput.
Pick AWQ or GPTQ when you want to push past NF4’s quality with calibration-data-driven scaling tricks, especially on outlier-heavy models where the Gaussian assumption breaks down. GGUF K-quants take a third path: per-super-block hierarchical scaling tuned for CPU inference.

NF4 is the right default for the QLoRA-shaped problem: single-GPU 4-bit fine-tuning on hardware that doesn’t have FP4 tensor cores. For everything else, the MX-family or calibration-driven INT4 is now the better answer.

Go further

Why is NF4 better than uniform INT4 if both are 4-bit?

Same bit budget, smarter level placement. Uniform INT4 spaces its 16 levels evenly across $[-1, 1]$, which wastes much of its precision on the tails where pretrained weights almost never live. NF4 places its 16 levels at the inverse-CDF quantiles of $\mathcal{N}(0, 1)$ (dense near zero, sparse at the tails) so each bin carries roughly equal probability mass under the empirical weight distribution. The result is a code matched to the actual distribution, and on LLaMA-style models NF4 typically saves 0.3–0.5 perplexity over uniform INT4 at the same bit count.

What is double quantization and how much does it save?

NF4 is per-block, typically with block size 64 — each block of 64 weights gets its own FP32 absmax scale. That scale is 32 bits per 64 weights = 0.5 bits/elem of overhead, which would erase a meaningful fraction of NF4's compression. Double quantization re-quantizes the per-block scales themselves: group 256 of those scales into a super-block, store them in INT8 with a single FP32 super-block scale. The overhead drops from 0.5 bits/elem to about 0.127 bits/elem — recovering roughly 0.4 bits per weight at no measurable quality cost.

When should I use NF4 vs MXFP4 vs NVFP4?

NF4 is weight-only, non-uniform, and distribution-matched — best when you only need to compress weights and you trust them to be Gaussian (true for almost every pretrained transformer). MXFP4 / NVFP4 are floating-point microscaling formats with hardware support on Blackwell tensor cores; they compress activations as well as weights, support training, and are uniform-on-log-scale. Practical rule: NF4 for QLoRA-style memory-bound fine-tuning where the base model is frozen; MX/NV-FP4 for end-to-end low-precision inference and training on supported hardware.
