NVFP4

Also known as: NV-FP4, NVIDIA FP4, Blackwell FP4

TL;DR

NVIDIA's variant of microscaling FP4, introduced with Blackwell. Blocks of 16 elements (vs MXFP4's 32) with an E4M3 FP8 scale (vs MXFP4's power-of-2 E8M0), plus an optional outer FP32 scale across multiple blocks.

NVFP4 is NVIDIA’s 4-bit floating-point format, introduced with the Blackwell architecture (B100, B200, GB200). It belongs to the same microscaling family as MXFP4 (a per-block scale plus 4-bit E2M1 elements) but differs in two specifics: smaller blocks (16 elements vs 32), and a richer scale type (E4M3 FP8 vs E8M0 power-of-2). It also supports an optional outer FP32 scale across multiple blocks for two-level scaling on activations.

Bit layout

NVFP4 (16 elements with two-level scaling):

  Outer scale:  ┌──────┐
                │ FP32 │  ← per-supergroup
                └──────┘


  Inner scale:  ┌────┐  ← per 16-element block
                │E4M3│
                │ 8 b│
                └────┘


  Elements:     ┌────┐┌────┐ ... ┌────┐  ← 16 × E2M1
                │E2M1││E2M1│     │E2M1│
                └────┘└────┘     └────┘

  Per-block storage: 8 bits (inner scale) + 16 × 4 bits = 72 bits per block
  Effective bits/element: 72 / 16 = 4.50
  (Outer FP32 scale amortizes across many blocks; effectively negligible.)

The inner scale is FP8 E4M3 (1 sign, 4 exponent, 3 mantissa bits): a real floating-point number, not just a power of 2. E4M3 has 256 encodings, of which roughly half (the positive finite ones) are usable as scales, and the 3 mantissa bits subdivide each power-of-2 interval into 8 steps. This lets NVFP4 pick block scales that fall between MXFP4’s power-of-2 grid points.
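To make the scale grid concrete, here is a minimal sketch of an E4M3 decoder. It assumes the common OCP-style E4M3 encoding (exponent bias 7, no infinities, one NaN pattern per sign); the function name `decode_e4m3` is illustrative and not taken from any library.

```python
def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 byte (assumed OCP-style: bias 7, no infinities)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF      # 4 exponent bits
    mant = byte & 0x7            # 3 mantissa bits
    if exp == 0xF and mant == 0x7:
        return float("nan")      # the only non-finite pattern per sign
    if exp == 0:
        return sign * (mant / 8.0) * 2.0 ** -6          # subnormal
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)  # normal


# Bytes 0x01..0x7E are the positive, finite, nonzero encodings.
positive_scales = sorted(decode_e4m3(b) for b in range(0x01, 0x7F))

# A power-of-2 scale (E8M0) offers only 1.0 and 2.0 in this interval.
between_1_and_2 = [s for s in positive_scales if 1.0 <= s <= 2.0]

print(len(positive_scales), max(positive_scales))
print(between_1_and_2)
```

Under these assumptions, the interval from 1.0 to 2.0 contains seven intermediate scale values that a power-of-2 scale cannot express.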

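Reconstruction of element $x_{g,b,i}$ in supergroup $g$, block $b$, position $i$:

$$x_{g,b,i} = s^{\text{outer}}_{g} \cdot s^{\text{inner}}_{g,b} \cdot \mathrm{E2M1}(q_{g,b,i})$$

where:

  • $s^{\text{outer}}_{g}$ is the per-supergroup FP32 scale (only present when two-level scaling is enabled),
  • $s^{\text{inner}}_{g,b}$ is the per-16-element-block FP8 scale,
  • $\mathrm{E2M1}(\cdot)$ is the E2M1 lookup table for the 4-bit element code $q_{g,b,i}$.

When the outer FP32 scale is omitted (typical for weights), the formula collapses to the single-level form:

$$x_{b,i} = s^{\text{inner}}_{b} \cdot \mathrm{E2M1}(q_{b,i})$$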
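The reconstruction can be sketched in a few lines of Python. This is an illustration rather than a kernel: the names `decode_e2m1` and `dequantize_block` are hypothetical, scales are passed as already-decoded floats, and the code layout (low three bits select the magnitude, top bit is the sign) follows the standard E2M1 value set.

```python
# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(code: int) -> float:
    """Look up a 4-bit E2M1 code; bit 3 is the sign bit."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def dequantize_block(codes, inner_scale, outer_scale=1.0):
    """Reconstruct one 16-element block: outer * inner * E2M1(code).

    outer_scale defaults to 1.0, which gives the single-level form.
    """
    if len(codes) != 16:
        raise ValueError("an NVFP4 block holds exactly 16 element codes")
    return [outer_scale * inner_scale * decode_e2m1(c) for c in codes]


codes = [0x1, 0x7, 0xF, 0x0] + [0x2] * 12

single_level = dequantize_block(codes, inner_scale=0.25)
two_level = dequantize_block(codes, inner_scale=0.25, outer_scale=8.0)

print(single_level[:4])  # [0.125, 1.5, -1.5, 0.0]
print(two_level[:4])     # [1.0, 12.0, -12.0, 0.0]
```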

Why two levels at all: activations in a transformer can have multi-magnitude outliers across different supergroups. A single E4M3 scale per 16-element block tracks fine-grained dynamic range within that block, but if supergroup A’s typical magnitude is 0.01 and supergroup B’s is 100, the inner E4M3 scales for A and B span very different exponents — and the inner scale’s representable range (E4M3 max ~448) might saturate. The outer FP32 multiplier provides the cross-supergroup dynamic-range buffer; inner scales then operate in a narrow, well-conditioned range.

For weights, which are statically distributed and don’t have token-by-token magnitude variation, the outer scale is usually omitted. For activations on certain transformers (especially with long-context KV cache values that span large magnitudes), it’s load-bearing.
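One way to see the effect of the outer scale is to compute both scale levels for two supergroups of very different magnitude. The sketch below assumes a simple recipe that is not stated in this article: the outer scale is chosen so that the largest block in the supergroup lands exactly on the E4M3 maximum (448), and each inner scale maps its block maximum onto the E2M1 maximum (6). The function name `two_level_scales` is hypothetical, and inner scales are left unrounded.

```python
E4M3_MAX = 448.0  # largest finite E4M3 value
E2M1_MAX = 6.0    # largest E2M1 magnitude


def two_level_scales(supergroup):
    """Return (outer_scale, inner_scales) for a list of 16-element blocks.

    Assumed recipe: outer = amax / (448 * 6), inner_b = amax_b / (6 * outer).
    Inner scales are returned before rounding to E4M3.
    """
    amax = max(abs(x) for block in supergroup for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(supergroup)
    outer = amax / (E4M3_MAX * E2M1_MAX)
    inner = [max(abs(x) for x in block) / (E2M1_MAX * outer)
             for block in supergroup]
    return outer, inner


# Supergroup A has typical magnitude 0.01, supergroup B has 100.
group_a = [[0.01] * 16, [0.0025] * 16]
group_b = [[100.0] * 16, [25.0] * 16]

outer_a, inner_a = two_level_scales(group_a)
outer_b, inner_b = two_level_scales(group_b)

print("A:", outer_a, inner_a)
print("B:", outer_b, inner_b)
```

With this recipe the two supergroups end up with nearly identical inner scales (about 448 and 112 in both cases), and the four-orders-of-magnitude difference between them is carried entirely by the FP32 outer scales.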

How NVFP4 differs from MXFP4

| Property | MXFP4 | NVFP4 |
|---|---|---|
| Block size | 32 | 16 |
| Inner scale | E8M0 (8-bit pow-2) | E4M3 (8-bit FP8) |
| Outer scale | none | optional FP32 |
| Element format | E2M1 | E2M1 |
| Bits / element | 4.25 | 4.50 |
| Dequant cost | bit-shift | FP8 multiply |
| Native support | B100/B200, MI300X | B100/B200 only |

The 0.25 bit/element premium pays for: half-size blocks (better outlier containment), finer scale precision (between-power-of-2 representable values), and optional outer scaling (multi-magnitude activations).
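The storage figures follow directly from block size and scale width. The helper below is a small sketch of that arithmetic; `bits_per_element` and its parameters are illustrative names, and the 4096-element supergroup is a made-up size used only to show how quickly the outer scale amortizes.

```python
def bits_per_element(block_size, scale_bits=8, element_bits=4,
                     outer_scale_bits=0, elements_per_outer_scale=1):
    """Effective storage per element, including amortized scale bits."""
    inner = (scale_bits + block_size * element_bits) / block_size
    return inner + outer_scale_bits / elements_per_outer_scale


mxfp4 = bits_per_element(block_size=32)   # 4.25
nvfp4 = bits_per_element(block_size=16)   # 4.5

# Hypothetical supergroup of 4096 elements sharing one FP32 outer scale.
nvfp4_two_level = bits_per_element(
    block_size=16, outer_scale_bits=32, elements_per_outer_scale=4096)

print(mxfp4, nvfp4, nvfp4 - mxfp4)   # 4.25 4.5 0.25
print(nvfp4_two_level)               # 4.5078125
```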

NVFP4 trades 0.25 bits/element for finer scale precision and tighter blocks. On Blackwell the throughput advantage of native NVFP4 makes the trade obvious. On non-NVIDIA hardware, MXFP4’s larger blocks and bit-shift dequant dominate.

Hardware throughput

Blackwell’s NVFP4 tensor cores run at 2× the throughput of FP8 on the same hardware, and FP8 was already 2× BF16. So NVFP4 → FP8 → BF16 is a 4× → 2× → 1× throughput stair on B100/B200. Memory bandwidth advantages compound on top of compute: NVFP4 weights are roughly half the bytes of FP8 weights (4.5 vs 8 bits per element once block scales are counted), so memory-bound decode kernels see close to a 2× speedup from bandwidth alone, before any compute speedup kicks in.
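The bandwidth argument can be checked with back-of-the-envelope arithmetic. The sketch below assumes a hypothetical 70-billion-parameter model and counts only weight bytes (no activations or KV cache); the NVFP4 figure uses the 4.5 bits/element from the bit layout above, and FP8 per-tensor scale overhead is ignored.

```python
def weight_gigabytes(num_params, bits_per_element):
    """Weight storage in GB (10^9 bytes) for a given effective bit width."""
    return num_params * bits_per_element / 8 / 1e9


params = 70e9  # hypothetical model size, chosen only for illustration

bf16_gb = weight_gigabytes(params, 16)    # 140.0
fp8_gb = weight_gigabytes(params, 8)      # 70.0
nvfp4_gb = weight_gigabytes(params, 4.5)  # 39.375

# If decode is purely bound by reading weights, speedup tracks the byte ratio.
bandwidth_speedup_vs_fp8 = fp8_gb / nvfp4_gb

print(bf16_gb, fp8_gb, nvfp4_gb)
print(round(bandwidth_speedup_vs_fp8, 2))  # 1.78
```

Because of the per-block scale bytes, the bandwidth-only ratio against FP8 comes out slightly under 2×.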

Training in NVFP4

NVFP4 training (full forward + backward in 4-bit) is an active research direction in 2026, not yet a boringly reliable production recipe. The challenges are familiar from FP8’s adoption curve: gradient ranges span many orders of magnitude, the second-moment statistics in Adam variants don’t fit cleanly in 4-bit, and the optimizer state often has to stay in higher precision. The practical training stack on Blackwell as of 2026 is FP8 forward + FP8 backward via Transformer Engine, with weights stored in NVFP4 and dequantized in the matmul. Pure NVFP4 training is the 2027 problem.

Go further

Why does NVFP4 use a smaller block than MXFP4?

16 elements vs 32 trades scale storage for outlier tolerance. A 16-element block tracks dynamic range more tightly: half as many elements share each scale, so a single outlier coarsens the quantization grid for fewer neighbors. The cost is double the relative scale overhead: 8 bits of scale per 16 elements is 0.5 bits/elem vs MXFP4's 0.25 bits/elem. Blackwell's tensor cores absorb the extra scale-decode cost natively, so on B100/B200 the trade is a pure win. On non-Blackwell hardware you'd usually pick MXFP4 for the lower per-element overhead.
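A toy experiment illustrates the outlier-containment effect. The sketch below isolates block size only: both runs use an ideal, unrounded scale (block maximum divided by 6) rather than E8M0 or E4M3, and round each element to the nearest E2M1 value. The helper names and the input vector are made up for illustration.

```python
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def fake_quantize(values, block_size):
    """Quantize to E2M1 and dequantize, one unrounded scale per block."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / 6.0 if amax > 0 else 1.0
        for v in block:
            # Round the scaled magnitude to the nearest representable value.
            nearest = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
            out.append(scale * nearest if v >= 0 else -scale * nearest)
    return out


def total_abs_error(values, block_size):
    restored = fake_quantize(values, block_size)
    return sum(abs(a - b) for a, b in zip(values, restored))


# One outlier (6.0) followed by 31 small values.
values = [6.0] + [0.1] * 31

err32 = total_abs_error(values, block_size=32)
err16 = total_abs_error(values, block_size=16)

print(err32, err16)
```

With one 32-element block, the outlier forces all 31 small values to round to zero. With 16-element blocks, only the 15 values sharing the outlier's block are lost, and the second block is reconstructed almost exactly.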

What is the outer FP32 scale for?

Two-level scaling: an inner E4M3 scale per 16-element block, plus an optional outer FP32 scale shared across some number of blocks (a 'supergroup'). The outer scale captures multi-magnitude variation across supergroups — when one part of an activation tensor is 100× larger than another part, a single inner scale can't span both ranges. The outer FP32 multiplier handles that, with the inner scale doing fine-grained range matching within the supergroup. For weights this is rarely needed; for activations with cross-token outliers it's load-bearing.

Is NVFP4 specific to NVIDIA hardware?

The format is. The OCP MX family (MXFP4/6/8) is a multi-vendor consortium standard; NVFP4 is NVIDIA's choice for Blackwell that the consortium did not adopt. Software libraries (Transformer Engine, vLLM, TensorRT-LLM) support both, and weights can be converted between them — but the in-memory layouts differ. If you publish a model checkpoint quantized to NVFP4, AMD MI300 hardware will need to repack it to MXFP4 for native execution.
