NVIDIA's variant of microscaling FP4, introduced with Blackwell. Blocks of 16 elements (vs MXFP4's 32) with an E4M3 FP8 scale (vs MXFP4's power-of-2 E8M0), plus an optional outer FP32 scale across multiple blocks.
NVFP4 is NVIDIA’s 4-bit floating-point format, introduced with the Blackwell architecture (B100, B200, GB200). It belongs to the same microscaling family as MXFP4 — per-block scale + 4-bit E2M1 elements — but differs in two specifics: smaller blocks (16 elements vs 32), and a richer scale type (E4M3 FP8 vs E8M0 power-of-2). It also supports an optional outer FP32 scale across multiple blocks for two-level scaling on activations.
The inner scale is FP8 E4M3 (1 sign, 4 exponent, 3 mantissa): a real floating-point number, not just a power of 2. The 8-bit encoding has 256 bit patterns, of which 126 are positive, finite, nonzero values usable as scales, with 8 mantissa steps inside every power-of-2 interval. That lets NVFP4 represent values that fall between MXFP4’s power-of-2 grid points.
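To make the scale grid concrete, here is a minimal sketch that decodes every E4M3 byte and lists the positive values available as scales. It assumes the common "FN" flavor of E4M3 (exponent bias 7, no infinities, a single NaN pattern per sign); `decode_e4m3` is an illustrative helper name, not a library function.

```python
import math

def decode_e4m3(code: int) -> float:
    """Decode one FP8 E4M3 byte (assumed 'FN' variant: bias 7, no infinities)."""
    sign = -1.0 if code & 0x80 else 1.0
    exp = (code >> 3) & 0xF
    man = code & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan                              # the only NaN pattern per sign
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6        # subnormal (or zero)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

# Positive, finite, nonzero values: the candidates for a block scale.
positive_scales = sorted(v for v in map(decode_e4m3, range(0x80)) if v > 0.0)

# Scales available between 1 and 2, inclusive.
between_1_and_2 = [v for v in positive_scales if 1.0 <= v <= 2.0]
```

In the interval from 1 to 2 a power-of-2 scale offers only the two endpoints, while the E4M3 grid adds seven intermediate steps (1.125, 1.25, and so on).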
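Reconstruction of element $x_{g,b,i}$ in supergroup $g$, block $b$, position $i$:

$$x_{g,b,i} = s^{\text{outer}}_{g} \cdot s^{\text{inner}}_{g,b} \cdot \mathrm{LUT}[q_{g,b,i}]$$

where:
$s^{\text{outer}}_{g}$ is the per-supergroup FP32 scale (only present when two-level scaling is enabled),
$s^{\text{inner}}_{g,b}$ is the per-16-element-block FP8 scale,
$\mathrm{LUT}[\cdot]$ is the E2M1 lookup table, indexed by the 4-bit element code $q_{g,b,i}$.
When the outer FP32 scale is omitted (typical for weights), the formula collapses to the single-level form:

$$x_{b,i} = s^{\text{inner}}_{b} \cdot \mathrm{LUT}[q_{b,i}]$$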
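The reconstruction rule described above (outer scale times inner scale times the E2M1 table value) can be sketched in a few lines. This is an illustration under stated assumptions, not a kernel: the function names are hypothetical, scales are passed as already-decoded Python floats, and the 4-bit code is assumed to keep its sign in the top bit and its magnitude index in the low three bits.

```python
# The eight non-negative magnitudes representable in E2M1.
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_e2m1(code: int) -> float:
    """Map a 4-bit code to its value (assumed layout: bit 3 = sign)."""
    magnitude = E2M1_LUT[code & 0x7]
    return -magnitude if code & 0x8 else magnitude

def dequantize_block(codes, inner_scale, outer_scale=1.0):
    """Reconstruct one 16-element block.

    outer_scale defaults to 1.0, which gives the single-level form.
    """
    assert len(codes) == 16, "an NVFP4 block holds 16 elements"
    return [outer_scale * inner_scale * decode_e2m1(c) for c in codes]

# Two-level example: all 16 codes, inner scale 0.25, outer scale 2.0.
two_level = dequantize_block(list(range(16)), inner_scale=0.25, outer_scale=2.0)

# Single-level example: code 0x3 (1.5) with inner scale 2.0.
single_level = dequantize_block([0x3] * 16, inner_scale=2.0)
```

Leaving `outer_scale` at its default reproduces the weights-only case, so one function covers both forms.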
Why two levels at all: activations in a transformer can differ by orders of magnitude from one supergroup to the next. A single E4M3 scale per 16-element block tracks fine-grained dynamic range within that block, but if supergroup A’s typical magnitude is 0.01 and supergroup B’s is 100, the inner E4M3 scales for A and B need very different exponents, and the inner scale’s representable range (E4M3 tops out at 448 and bottoms out near 0.002) can saturate or underflow. The outer FP32 multiplier provides the cross-supergroup dynamic-range buffer; inner scales then operate in a narrow, well-conditioned range.
For weights, which are statically distributed and don’t have token-by-token magnitude variation, the outer scale is usually omitted. For activations on certain transformers (especially with long-context KV cache values that span large magnitudes), it’s load-bearing.
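A small end-to-end sketch may help show how the two scales divide the work. Everything specific here is an assumption for illustration: the outer scale is set to `amax / (6 × 448)` so that the largest block's inner scale lands at the top of the E4M3 range, the whole input is treated as one supergroup, and both scales and elements are rounded to the nearest representable value. Real implementations may choose differently.

```python
E2M1_MAGS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_MAX = 6.0
E4M3_MAX = 448.0
BLOCK = 16

def e4m3_positive_values():
    """All positive finite E4M3 values (assumed bias 7, no infinities)."""
    vals = []
    for code in range(1, 0x7F):            # skip 0x00 (zero) and 0x7F (NaN)
        exp, man = code >> 3, code & 0x7
        if exp == 0:
            vals.append((man / 8.0) * 2.0 ** -6)
        else:
            vals.append((1.0 + man / 8.0) * 2.0 ** (exp - 7))
    return vals

E4M3_VALUES = e4m3_positive_values()

def nearest(value, grid):
    return min(grid, key=lambda g: abs(g - value))

def quantize_two_level(x):
    """Return (outer_scale, [(inner_scale, codes), ...]) for one supergroup."""
    assert len(x) % BLOCK == 0
    amax = max(abs(v) for v in x)
    # Assumed rule: the largest block's inner scale maps to E4M3_MAX.
    outer = amax / (E2M1_MAX * E4M3_MAX) if amax > 0 else 1.0
    blocks = []
    for start in range(0, len(x), BLOCK):
        chunk = x[start:start + BLOCK]
        block_amax = max(abs(v) for v in chunk)
        inner = nearest(block_amax / (E2M1_MAX * outer), E4M3_VALUES)
        step = outer * inner               # never zero: the grid excludes 0
        codes = []
        for v in chunk:
            idx = E2M1_MAGS.index(nearest(abs(v) / step, E2M1_MAGS))
            codes.append(idx | (0x8 if v < 0 else 0))
        blocks.append((inner, codes))
    return outer, blocks

def dequantize_two_level(outer, blocks):
    out = []
    for inner, codes in blocks:
        for c in codes:
            mag = E2M1_MAGS[c & 0x7]
            out.append(outer * inner * (-mag if c & 0x8 else mag))
    return out

# One small-magnitude block followed by one large-magnitude block.
x = [0.01 * ((i % 7) - 3) for i in range(16)] + \
    [100.0 * ((i % 5) - 2) for i in range(16)]
outer, blocks = quantize_two_level(x)
y = dequantize_two_level(outer, blocks)
```

With this rule the large block's inner scale sits at 448 and the small block's inner scale falls well below 1, so both stay inside the E4M3 range while the FP32 outer scale carries the overall magnitude.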
How NVFP4 differs from MXFP4
| Property | MXFP4 | NVFP4 |
|---|---|---|
| Block size | 32 | 16 |
| Inner scale | E8M0 (8-bit power of 2) | E4M3 (8-bit FP8) |
| Outer scale | none | optional FP32 |
| Element format | E2M1 | E2M1 |
| Bits / element | 4.25 | 4.50 |
| Dequant cost | bit-shift | FP8 multiply |
| Native support | B100/B200, MI350X/MI355X | B100/B200 only |
The 0.25 bit/element premium pays for: half-size blocks (better outlier containment), finer scale precision (between-power-of-2 representable values), and optional outer scaling (multi-magnitude activations).
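The bits-per-element figures follow directly from the block size and scale width. The sketch below reproduces them; the helper name and the supergroup size of 64 blocks in the last line are hypothetical, chosen only to show how small the outer-scale overhead is.

```python
def bits_per_element(elem_bits, scale_bits, block_size,
                     outer_bits=0, blocks_per_supergroup=1):
    """Storage cost per element, amortizing each scale over the elements it covers."""
    inner_overhead = scale_bits / block_size
    outer_overhead = outer_bits / (block_size * blocks_per_supergroup)
    return elem_bits + inner_overhead + outer_overhead

mxfp4 = bits_per_element(4, 8, 32)      # 4 + 8/32
nvfp4 = bits_per_element(4, 8, 16)      # 4 + 8/16

# Hypothetical supergroup of 64 blocks sharing one FP32 outer scale.
nvfp4_two_level = bits_per_element(4, 8, 16, outer_bits=32,
                                   blocks_per_supergroup=64)
```

Under that assumed supergroup size the outer scale adds about 0.03 bits per element, so the 0.25-bit gap between the formats comes almost entirely from the smaller block.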
On Blackwell, the throughput advantage of native NVFP4 makes that trade an easy one. On non-NVIDIA hardware, MXFP4’s lower scale overhead and bit-shift dequant usually win out.
Hardware throughput
Blackwell’s NVFP4 tensor cores run at 2× the throughput of FP8 on the same hardware, and FP8 was already 2× BF16. So NVFP4 → FP8 → BF16 is a 4× → 2× → 1× throughput stair on B100/B200. Memory bandwidth advantages compound on top of compute: NVFP4 weights take 4.5 bits per element against FP8’s 8, a little over half the bytes, so memory-bound decode kernels see up to roughly 1.8× speedup from bandwidth alone, before any compute speedup kicks in.
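As a back-of-the-envelope check on the bandwidth argument, the sketch below compares weight footprints. It assumes a hypothetical 70-billion-parameter model, counts NVFP4 at 4.5 bits per element (block scales included), and ignores the small scale overhead that FP8 and BF16 checkpoints may carry.

```python
# Assumed storage cost per weight, in bits.
BITS_PER_ELEMENT = {"BF16": 16.0, "FP8": 8.0, "NVFP4": 4.5}

def weight_gigabytes(num_params, fmt):
    """Total weight storage in GB (10^9 bytes)."""
    return num_params * BITS_PER_ELEMENT[fmt] / 8 / 1e9

def bandwidth_speedup(baseline, fmt):
    """Upper bound on speedup for a kernel limited purely by weight traffic."""
    return BITS_PER_ELEMENT[baseline] / BITS_PER_ELEMENT[fmt]

num_params = 70e9   # hypothetical model size
sizes = {fmt: weight_gigabytes(num_params, fmt) for fmt in BITS_PER_ELEMENT}
fp8_to_nvfp4 = bandwidth_speedup("FP8", "NVFP4")
bf16_to_nvfp4 = bandwidth_speedup("BF16", "NVFP4")
```

Counting the block scales, the FP8-to-NVFP4 ratio comes out slightly under 2× (16/9), and it is an upper bound: activations, KV cache reads, and kernel overheads all dilute it in practice.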
Training in NVFP4
NVFP4 training (full forward + backward in 4-bit) is an active research direction in 2026, not yet a boringly reliable production recipe. The challenges are familiar from FP8’s adoption curve: gradient ranges span many orders of magnitude, the second-moment statistics in Adam variants don’t fit cleanly in 4-bit, and the optimizer state often has to stay in higher precision. The practical training stack on Blackwell as of 2026 is FP8 forward + FP8 backward via Transformer Engine, with weights stored in NVFP4 and dequantized in the matmul. Pure NVFP4 training is the 2027 problem.
Go further
Why does NVFP4 use a smaller block than MXFP4?
16 elements vs 32 trades scale storage for outlier tolerance. A 16-element block tracks dynamic range more tightly: half as many elements share each scale, so a single outlier inflates the scale for fewer neighbors and pushes fewer small values toward zero. The cost is double the relative scale overhead: 8 bits of scale per 16 elements is 0.5 bits/elem vs MXFP4's 0.25 bits/elem. Blackwell's tensor cores absorb the extra scale-decode cost natively, so on B100/B200 the trade is pure win. On non-Blackwell hardware you'd usually pick MXFP4 for the lower per-element overhead.
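The outlier-containment effect is easy to see on a toy vector. The sketch below is deliberately simplified: it uses an exact floating-point scale per block (neither E8M0 nor E4M3) so that block size is the only variable, and `fake_quantize` is an illustrative name for a quantize-then-dequantize round trip.

```python
E2M1_MAGS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_quantize(x, block_size):
    """Quantize to E2M1 and dequantize again, one exact float scale per block."""
    out = []
    for start in range(0, len(x), block_size):
        chunk = x[start:start + block_size]
        amax = max(abs(v) for v in chunk)
        scale = amax / 6.0 if amax > 0 else 1.0      # map block max to E2M1 max
        for v in chunk:
            mag = min(E2M1_MAGS, key=lambda m: abs(m - abs(v) / scale))
            out.append(scale * mag if v >= 0 else -scale * mag)
    return out

def sum_squared_error(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

# 32 small values (0.1 to 0.4) with a single large outlier in the first half.
x = [0.1 * (1 + i % 4) for i in range(32)]
x[3] = 50.0

q32 = fake_quantize(x, 32)   # one scale shared by all 32 elements
q16 = fake_quantize(x, 16)   # the outlier only affects the first 16
err32 = sum_squared_error(x, q32)
err16 = sum_squared_error(x, q16)
```

With one 32-element block, the outlier's scale flushes every small value to zero. With two 16-element blocks, only the outlier's own block pays that price and the second block keeps its values.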
Two-level scaling: an inner E4M3 scale per 16-element block, plus an optional outer FP32 scale shared across some number of blocks (a 'supergroup'). The outer scale captures multi-magnitude variation across supergroups — when one part of an activation tensor is 100× larger than another part, a single inner scale can't span both ranges. The outer FP32 multiplier handles that, with the inner scale doing fine-grained range matching within the supergroup. For weights this is rarely needed; for activations with cross-token outliers it's load-bearing.
NVFP4 itself is NVIDIA-specific. The OCP MX family (MXFP4/6/8) is a multi-vendor consortium standard; NVFP4 is NVIDIA's choice for Blackwell, which the consortium did not adopt. Software libraries (Transformer Engine, vLLM, TensorRT-LLM) support both, and weights can be converted between them, but the in-memory layouts differ. If you publish a model checkpoint quantized to NVFP4, AMD MI350-series hardware will need it converted to MXFP4 for native execution.