NVFP4

Also known as: NV-FP4, NVIDIA FP4, Blackwell FP4

TL;DR

NVIDIA's variant of microscaling FP4, introduced with Blackwell. Blocks of 16 elements (vs MXFP4's 32) with an E4M3 FP8 scale (vs MXFP4's power-of-2 E8M0), plus an optional outer FP32 scale across multiple blocks.

NVFP4 is NVIDIA’s 4-bit floating-point format, introduced with the Blackwell architecture (B100, B200, GB200). It belongs to the same microscaling family as MXFP4 — per-block scale + 4-bit E2M1 elements — but differs in two specifics: smaller blocks (16 elements vs 32), and a richer scale type (E4M3 FP8 vs E8M0 power-of-2). It also supports an optional outer FP32 scale across multiple blocks for two-level scaling on activations.

Bit layout

NVFP4 (16 elements with two-level scaling):

  Outer scale:  ┌──────┐
                │ FP32 │  ← per-supergroup
                └──────┘
                   │
                   ▼
  Inner scale:  ┌────┐  ← per 16-element block
                │E4M3│
                │ 8 b│
                └────┘
                   │
                   ▼
  Elements:     ┌────┐┌────┐ ... ┌────┐  ← 16 × E2M1
                │E2M1││E2M1│     │E2M1│
                └────┘└────┘     └────┘

  Per-block storage: 8 bits (inner scale) + 16 × 4 bits = 72 bits per block
  Effective bits/element: 72 / 16 = 4.50
  (Outer FP32 scale amortizes across many blocks; effectively negligible.)
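The storage arithmetic above can be checked with a few lines of Python. This is a minimal sketch: `bits_per_element` and its parameters are illustrative names, and the supergroup size of 1024 blocks is an assumed value chosen for the example, not one fixed by the format.

```python
def bits_per_element(block_size, element_bits=4, scale_bits=8,
                     outer_bits=0, outer_blocks=1):
    """Average storage per element for a block-scaled format."""
    per_block = scale_bits + block_size * element_bits
    # An outer scale is shared by `outer_blocks` blocks, so its cost is amortized.
    per_block += outer_bits / outer_blocks
    return per_block / block_size


nvfp4 = bits_per_element(block_size=16)  # (8 + 16 * 4) / 16
mxfp4 = bits_per_element(block_size=32)  # (8 + 32 * 4) / 32
# Hypothetical supergroup: one FP32 outer scale shared by 1024 blocks.
nvfp4_two_level = bits_per_element(16, outer_bits=32, outer_blocks=1024)

print(nvfp4, mxfp4, nvfp4_two_level)
```

With these assumed sizes the outer scale adds about 0.002 bits per element, which is why it is treated as negligible.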

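The inner scale is FP8 E4M3 (1 sign, 4 exponent, 3 mantissa): a real floating-point number, not just a power of 2. The 3 mantissa bits give eight scale steps per power of two instead of one, letting NVFP4 represent values that fall between MXFP4’s power-of-2 grid points. (The 8-bit code has 256 bit patterns, but sign and NaN encodings leave fewer than that usable as distinct scale magnitudes.)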
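To make the difference between the two scale grids concrete, the sketch below enumerates the positive finite E4M3 values and compares them with a power-of-2 grid. It assumes the standard OCP E4M3 encoding (bias 7, a single NaN pattern per sign, no infinities) and an E8M0 range of 2^-127 to 2^127; the helper names and the target scale of 1.4 are illustrative.

```python
def e4m3_positive_values():
    """Enumerate positive finite FP8 E4M3 values (assumed OCP encoding, bias 7)."""
    values = []
    for exp in range(16):
        for man in range(8):
            if exp == 15 and man == 7:
                continue  # this bit pattern encodes NaN
            if exp == 0:
                value = (man / 8) * 2.0 ** -6  # subnormal
            else:
                value = (1 + man / 8) * 2.0 ** (exp - 7)
            if value > 0:
                values.append(value)
    return values


E4M3_GRID = e4m3_positive_values()
E8M0_GRID = [2.0 ** k for k in range(-127, 128)]  # pure powers of two


def nearest(grid, target):
    """Pick the grid value closest to the target scale."""
    return min(grid, key=lambda s: abs(s - target))


ideal_scale = 1.4  # illustrative target
e4m3_choice = nearest(E4M3_GRID, ideal_scale)
pow2_choice = nearest(E8M0_GRID, ideal_scale)
steps_in_octave = [v for v in E4M3_GRID if 1.0 <= v < 2.0]

print(e4m3_choice, pow2_choice, steps_in_octave)
```

For a target of 1.4, the E4M3 grid lands on 1.375 while the power-of-2 grid has to fall back to 1.0.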

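Reconstruction of element $x_{g,b,i}$ in supergroup $g$, block $b$, position $i$:

$$x_{g,b,i} = s^{\mathrm{outer}}_{g} \cdot s^{\mathrm{inner}}_{g,b} \cdot \mathrm{LUT}_{\mathrm{E2M1}}(q_{g,b,i})$$

where:

$s^{\mathrm{outer}}_{g}$ is the per-supergroup FP32 scale (only present when two-level scaling is enabled),
$s^{\mathrm{inner}}_{g,b}$ is the per-16-element-block FP8 scale,
$\mathrm{LUT}_{\mathrm{E2M1}}(q_{g,b,i})$ is the E2M1 lookup table for the 4-bit element code $q_{g,b,i}$.

When the outer FP32 scale is omitted (typical for weights), the formula collapses to the single-level form:

$$x_{b,i} = s^{\mathrm{inner}}_{b} \cdot \mathrm{LUT}_{\mathrm{E2M1}}(q_{b,i})$$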
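The reconstruction rule (outer scale times inner scale times the E2M1 table entry) can be sketched in plain Python. The function names are illustrative, the scales are passed as ordinary floats rather than decoded from FP8 and FP32 bit patterns, and the 4-bit code is assumed to be laid out as sign, two exponent bits, one mantissa bit.

```python
# Magnitudes representable by E2M1, indexed by the low 3 bits of the code.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def e2m1_value(code):
    """Decode a 4-bit E2M1 code (assumed layout: sign, 2 exponent bits, 1 mantissa bit)."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0b0111]


def dequantize_block(codes, inner_scale, outer_scale=1.0):
    """Reconstruct one block; outer_scale=1.0 gives the single-level form."""
    return [outer_scale * inner_scale * e2m1_value(c) for c in codes]


codes = [0b0111, 0b1111, 0b0001]  # +6, -6, +0.5 before scaling
single_level = dequantize_block(codes, inner_scale=0.5)
two_level = dequantize_block(codes, inner_scale=0.5, outer_scale=2.0)

print(single_level, two_level)
```

A real kernel would do this inside the tensor-core matmul; the sketch only shows the arithmetic.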

Why two levels at all: activations in a transformer can differ by orders of magnitude from one supergroup to the next. A single E4M3 scale per 16-element block tracks fine-grained dynamic range within that block, but if supergroup A’s typical magnitude is 0.01 and supergroup B’s is 100, the inner E4M3 scales for A and B sit at very different exponents, and E4M3’s limited range (largest value 448, smallest subnormal 2^-9) means a scale can saturate at the top or lose precision at the bottom. The outer FP32 multiplier provides the cross-supergroup dynamic-range buffer; inner scales then operate in a narrow, well-conditioned range.
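The range argument can be illustrated with a small check of whether the ideal inner scale for a block fits in E4M3. This is a sketch under stated assumptions: the rule for choosing the outer scale (map the largest block's inner scale to the E4M3 maximum) is one plausible recipe rather than a documented requirement, and the block magnitudes are made-up values chosen so the arithmetic is exact.

```python
import math

E2M1_MAX = 6.0               # largest E2M1 magnitude
E4M3_MAX = 448.0             # largest finite E4M3 value
E4M3_MIN_NORMAL = 2.0 ** -6  # smallest normal E4M3 value


def ideal_inner_scale(block_amax, outer_scale=1.0):
    """Scale that maps the block's largest magnitude onto E2M1's maximum."""
    return block_amax / (E2M1_MAX * outer_scale)


def classify(scale):
    """Report where an ideal scale falls relative to E4M3's range."""
    if scale > E4M3_MAX and not math.isclose(scale, E4M3_MAX):
        return "saturates"
    if scale < E4M3_MIN_NORMAL:
        return "subnormal"
    return "ok"


def choose_outer_scale(group_amax):
    # Assumed recipe: place the largest block's inner scale at E4M3_MAX.
    return group_amax / (E2M1_MAX * E4M3_MAX)


block_amaxes = [1.5, 96.0, 5376.0]  # hypothetical per-block maxima
single_level = [classify(ideal_inner_scale(a)) for a in block_amaxes]
outer = choose_outer_scale(max(block_amaxes))
two_level = [classify(ideal_inner_scale(a, outer)) for a in block_amaxes]

print(single_level, outer, two_level)
```

Without the outer scale the largest block needs an inner scale of 896, which exceeds 448; with an outer scale of 2.0 all three inner scales fall inside the normal E4M3 range.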

For weights, which are statically distributed and don’t have token-by-token magnitude variation, the outer scale is usually omitted. For activations on certain transformers (especially with long-context KV cache values that span large magnitudes), it’s load-bearing.

How NVFP4 differs from MXFP4

Property	MXFP4	NVFP4
Block size	32	16
Inner scale	E8M0 (8-bit pow-2)	E4M3 (8-bit FP8)
Outer scale	none	optional FP32
Element format	E2M1	E2M1
Bits / element	4.25	4.50
Dequant cost	bit-shift	FP8 multiply
Native support	B100/B200, MI350X/MI355X	B100/B200 only

The 0.25 bit/element premium pays for: half-size blocks (better outlier containment), finer scale precision (between-power-of-2 representable values), and optional outer scaling (multi-magnitude activations).

On Blackwell, the throughput advantage of native NVFP4 makes the trade an easy one. On non-NVIDIA hardware, MXFP4’s lower scale overhead and bit-shift dequantization usually win.

Hardware throughput

Blackwell’s NVFP4 tensor cores run at 2× the throughput of FP8 on the same hardware, and FP8 was already 2× BF16. So NVFP4, FP8, and BF16 form a 4×, 2×, 1× throughput stair on B100/B200. Memory bandwidth advantages compound on top of compute: NVFP4 weights take 4.5 bits per element including block scales, versus 8 for FP8, so memory-bound decode kernels see up to roughly 1.8× speedup from bandwidth alone, before any compute speedup kicks in.
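The bandwidth side of this claim is simple arithmetic on weight bytes. The sketch below uses a hypothetical 70-billion-parameter model and assumes FP8 and BF16 carry negligible scale overhead; `weight_bytes` is an illustrative helper.

```python
def weight_bytes(n_params, bits_per_element):
    """Bytes needed to store n_params weights at a given average bit width."""
    return n_params * bits_per_element / 8


N_PARAMS = 70e9  # hypothetical model size

sizes = {
    "BF16": weight_bytes(N_PARAMS, 16),
    "FP8": weight_bytes(N_PARAMS, 8),
    "NVFP4": weight_bytes(N_PARAMS, 4.5),  # 4 bits + 0.5 bits of block scale
}
fp8_over_nvfp4 = sizes["FP8"] / sizes["NVFP4"]

for name, size in sizes.items():
    print(f"{name}: {size / 1e9:.1f} GB")
print(f"FP8 / NVFP4 byte ratio: {fp8_over_nvfp4:.2f}")
```

Because of the per-block scale, the byte ratio between FP8 and NVFP4 is 16/9 rather than a full 2.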

Training in NVFP4

NVFP4 training (full forward + backward in 4-bit) is an active research direction in 2026, not yet a boringly reliable production recipe. The challenges are familiar from FP8’s adoption curve: gradient ranges span many orders of magnitude, the second-moment statistics in Adam variants don’t fit cleanly in 4-bit, and the optimizer state often has to stay in higher precision. The practical training stack on Blackwell as of 2026 is FP8 forward + FP8 backward via Transformer Engine, with weights stored in NVFP4 and dequantized in the matmul. Pure NVFP4 training is the 2027 problem.

Go further

Why does NVFP4 use a smaller block than MXFP4?

16 elements vs 32 trades scale storage for outlier tolerance. A 16-element block tracks dynamic range more tightly: half as many elements share each scale, so a single outlier coarsens the quantization step for fewer neighbors. The cost is double the relative scale overhead: 8 bits of scale per 16 elements is 0.5 bits/elem vs MXFP4's 0.25 bits/elem. Blackwell's tensor cores absorb the extra scale-decode cost natively, so on B100/B200 the trade is easy to justify. On non-Blackwell hardware you'd usually pick MXFP4 for the lower per-element overhead.
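The outlier-containment effect can be seen in a toy round trip. To isolate block size, this sketch keeps the per-block scale as an unquantized float (amax / 6) instead of rounding it to E4M3 or E8M0, so it is a simplification of both formats; the input vector and function name are made up for illustration.

```python
import math

E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_dequantize(values, block_size):
    """Round each block to the E2M1 grid using an unquantized amax / 6 scale."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / 6.0 if amax > 0 else 1.0
        for v in block:
            # Nearest representable magnitude after dividing by the block scale.
            mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
            out.append(math.copysign(mag * scale, v))
    return out


# One large outlier followed by 31 small values.
values = [60.0] + [0.1] * 31

zeros_block32 = quantize_dequantize(values, 32).count(0.0)
zeros_block16 = quantize_dequantize(values, 16).count(0.0)

print(zeros_block32, zeros_block16)
```

With a 32-element block, the outlier flushes all 31 small values to zero; with 16-element blocks, only the 15 values sharing the outlier's block are lost.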

What is the outer FP32 scale for?

Two-level scaling: an inner E4M3 scale per 16-element block, plus an optional outer FP32 scale shared across some number of blocks (a 'supergroup'). The outer scale captures magnitude differences between supergroups: when one part of an activation tensor is orders of magnitude larger than another, the E4M3 inner scales alone may not cover both ranges without saturating or losing precision. The outer FP32 multiplier handles that, with the inner scale doing fine-grained range matching within the supergroup. For weights this is rarely needed; for activations with cross-token outliers it's load-bearing.

Is NVFP4 specific to NVIDIA hardware?

The format is. The OCP MX family (MXFP4/6/8) is a multi-vendor consortium standard; NVFP4 is NVIDIA's choice for Blackwell that the consortium did not adopt. Software libraries (Transformer Engine, vLLM, TensorRT-LLM) support both, and weights can be converted between them, but the block sizes, scale types, and in-memory layouts differ, so conversion means requantizing rather than copying bits. If you publish a model checkpoint quantized to NVFP4, AMD hardware with native MXFP4 support (the MI350 series, for example) will need it repacked to MXFP4 for native execution.
