MXFP4


Also known as: MX-FP4, Microscaling FP4, OCP MXFP4, block-FP4

TL;DR

The Open Compute Project's microscaling 4-bit float. Blocks of 32 elements share a single 8-bit power-of-2 scale (E8M0); each element is a 4-bit micro-float (E2M1). Effective storage: 4.25 bits per weight.

MXFP4 (Microscaling FP4) is the Open Compute Project’s standardized 4-bit floating-point format with per-block scaling. A block is 32 contiguous elements; each block carries a single 8-bit power-of-2 scale, and each element is a 4-bit micro-float in the E2M1 layout (1 sign, 2 exponent, 1 mantissa). The OCP MX specification was ratified in 2023 by an AMD + Intel + Microsoft + NVIDIA + ARM + Meta + Qualcomm consortium and is now the de-facto interchange format for sub-8-bit numerics across vendors.

Bit layout

The whole format fits in a single ASCII block:

MXFP4 block (32 elements):

  Scale  ─┐
          ▼
  ┌──────────┐  ┌────┐┌────┐┌────┐ ... ┌────┐  ← 32 × 4-bit elements
  │ E8M0     │  │E2M1││E2M1││E2M1│     │E2M1│
  │ 8 bits   │  │4 b ││4 b ││4 b │     │4 b │
  └──────────┘  └────┘└────┘└────┘     └────┘
       │
       └─ 8-bit unsigned exponent: scale = 2^(e − 127)

  Storage: 8 bits scale + 32 × 4 bits = 136 bits per block
  Effective bits/element: 136 / 32 = 4.25
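As a rough illustration of the 136-bit figure, the sketch below packs one block into 17 bytes. The byte order (scale byte first, then two codes per byte with the lower-indexed element in the low nibble) and the helper names `pack_block` and `unpack_block` are assumptions made for this example. Real kernels often keep the scales in a separate tensor.

```python
def pack_block(scale_byte, codes):
    """Pack one E8M0 scale byte and 32 four-bit codes into 17 bytes.

    Layout is an assumption for illustration: scale first, then two
    codes per byte with the even-indexed element in the low nibble.
    """
    if len(codes) != 32:
        raise ValueError("an MXFP4 block holds exactly 32 elements")
    packed = bytearray([scale_byte & 0xFF])
    for lo, hi in zip(codes[0::2], codes[1::2]):
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_block(packed):
    """Inverse of pack_block: returns (scale_byte, list of 32 codes)."""
    scale_byte = packed[0]
    codes = []
    for byte in packed[1:]:
        codes.append(byte & 0xF)  # even-indexed element
        codes.append(byte >> 4)   # odd-indexed element
    return scale_byte, codes


example_codes = [i % 16 for i in range(32)]
blob = pack_block(127, example_codes)
bits_per_element = len(blob) * 8 / 32  # 136 bits spread over 32 elements
```

Round-tripping `blob` through `unpack_block` returns the original scale byte and codes, and `bits_per_element` reproduces the 4.25 figure.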

The scale is stored as an unsigned 8-bit biased exponent: bit pattern $e$ decodes to $2^{e-127}$, with $e = 255$ (0xFF) reserved for NaN. Each element is reconstructed as $x_i = 2^{e-127} \cdot \mathrm{LUT}[c_i]$, where $\mathrm{LUT}$ is the lookup-table mapping from the 4-bit E2M1 code $c_i$ to an fp32 micro-float in $[-6, 6]$.

E8M0 stores $2^{e-127}$ for $e \in \{0, \dots, 254\}$ (with NaN as the 256th codepoint). Dequantization of element $i$ in a block with scale-exponent $e$ and 4-bit code $c_i$ is:

$$x_i = 2^{e-127} \cdot \mathrm{LUT}[c_i]$$

where $\mathrm{LUT}$ is the 16-entry E2M1 lookup table. Because the scale is exactly a power of two, the multiply is just adding $e - 127$ to the float exponent of the looked-up element: a 4-bit table lookup fused with an exponent add, no actual multiplier needed.
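The dequantization rule can be sketched in a few lines of plain Python. This is a minimal reference under stated assumptions, not a kernel: the function name `dequantize_block` is hypothetical, and the code layout (sign in bit 3, magnitude index in the low three bits) is assumed. `math.ldexp` stands in for the hardware exponent add.

```python
import math

# E2M1 magnitudes indexed by the low 3 bits of the code (assumed layout:
# bit 3 = sign, bits 2..1 = exponent, bit 0 = mantissa).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E8M0_BIAS = 127
E8M0_NAN = 0xFF


def dequantize_block(scale_byte, codes):
    """Decode one block given its E8M0 scale byte and its 4-bit codes."""
    if scale_byte == E8M0_NAN:
        return [math.nan] * len(codes)
    out = []
    for c in codes:
        mag = E2M1_MAGNITUDES[c & 0x7]
        sign = -1.0 if c & 0x8 else 1.0
        # ldexp(m, k) computes m * 2**k by adjusting the exponent only.
        out.append(sign * math.ldexp(mag, scale_byte - E8M0_BIAS))
    return out
```

For example, `dequantize_block(130, [0b0011])` decodes the code for 1.5 under a scale of $2^{3}$.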

The E2M1 element format (1 sign, 2 exponent, 1 mantissa) has 16 codes: $\pm\{0, 0.5, 1, 1.5, 2, 3, 4, 6\}$ (the subnormal $\pm 0.5$ included; $+0$ and $-0$ are separate codes, so there are 15 distinct values). The maximum absolute representable element is 6, so the maximum representable absolute value in the entire block is $6 \times 2^{127} \approx 1.0 \times 10^{39}$. Overflow is essentially impossible at any scale you'd actually encounter in training or inference.

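The minimum nonzero element magnitude is $0.5$ (times the block scale); anything that rounds below it becomes zero. Within a block, the dynamic range from minimum-nonzero to maximum-absolute is 12×, a little over one decade of magnitude (about 3.6 binades) per block. Across only 32 neighbouring elements that is usually enough, which is why the per-block scale doesn't have to be exotically chosen: a power of two derived from the block's absmax suffices.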
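The element values quoted above can be derived from the bit fields rather than taken on faith. The sketch below assumes the usual small-float conventions (exponent bias 1, a subnormal when the exponent field is zero, no Inf or NaN codes); `decode_e2m1` is a hypothetical helper name.

```python
def decode_e2m1(code):
    """Decode a 4-bit E2M1 code (sign, 2 exponent bits, 1 mantissa bit)."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        # Subnormal: no implicit leading 1, effective exponent 1 - bias = 0.
        mag = man * 0.5
    else:
        # Normal: implicit leading 1, exponent bias of 1.
        mag = (1.0 + man * 0.5) * 2.0 ** (exp - 1)
    return sign * mag


values = [decode_e2m1(c) for c in range(16)]
magnitudes = sorted({abs(v) for v in values})

element_max = magnitudes[-1]                      # largest magnitude
element_min_nonzero = magnitudes[1]               # smallest nonzero magnitude
dynamic_range = element_max / element_min_nonzero
block_max = element_max * 2.0 ** 127              # largest E8M0 scale is 2**127
```

Here `magnitudes` comes out as the eight values listed above, `dynamic_range` is 12, and `block_max` is roughly $1.02 \times 10^{39}$.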

Why per-block scaling beats per-channel

The pre-MX 4-bit world used a single scale per output channel of the weight matrix — so an entire row’s worth of weights (often 4096 or 11008 elements) shared one scale. That works when the row’s dynamic range is uniform; it fails when one or two elements are 10× the rest, because the scale is forced to accommodate them and every well-behaved element gets squashed into the bottom half of the 4-bit range.

MXFP4’s contribution is dynamic-range tracking at scale-per-32-elements granularity. Outliers no longer poison their channel; they’re contained to their own 32-element block, and 31 well-behaved siblings keep their full 4-bit precision. This is why MXFP4 lands within ~0.3 perplexity of FP16 on LLaMA-class models where uniform-scaled INT4 lands at ~0.8.

The cost is the 8-bit scale per block, which adds 0.25 bits of overhead per element. Total: 4.25 bits per element vs INT4’s 4.0 — a 6% storage premium for materially better accuracy.
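To make the outlier argument concrete, here is a small experiment that fake-quantizes a synthetic 4096-element row twice: once with a single shared scale for the whole row, once with one scale per 32 elements. Everything in it is an assumption chosen for illustration: the synthetic data, the scale rule (floor of log2 of the absmax, minus the E2M1 maximum exponent of 2), and nearest-value rounding with saturation at 6. It uses the same E2M1 grid in both cases so that only the scale granularity differs.

```python
import math
import random

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_group(xs):
    """Round xs to the E2M1 grid under one shared power-of-two scale."""
    absmax = max(abs(x) for x in xs)
    if absmax == 0.0:
        return [0.0] * len(xs)
    # Assumed scale rule: floor(log2(absmax)) minus the E2M1 max exponent.
    scale = math.ldexp(1.0, math.floor(math.log2(absmax)) - 2)
    out = []
    for x in xs:
        # Nearest grid point; values past 6 * scale saturate at 6 * scale.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out


def mean_abs_error(xs, group_size):
    """Mean absolute reconstruction error with one scale per group."""
    total = 0.0
    for i in range(0, len(xs), group_size):
        chunk = xs[i:i + group_size]
        total += sum(abs(a - b) for a, b in zip(chunk, quantize_group(chunk)))
    return total / len(xs)


rng = random.Random(0)
row = [rng.gauss(0.0, 0.02) for _ in range(4096)]
row[0] = 2.0  # a single outlier far above the rest of the row

per_row_err = mean_abs_error(row, len(row))  # one scale for the whole row
per_block_err = mean_abs_error(row, 32)      # one scale per 32 elements
```

With one scale per row, the outlier pushes the grid so far up that the ordinary weights round to zero. With one scale per block, only the outlier's own block pays that price, so `per_block_err` comes out well below `per_row_err`.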

Hardware support

The MX family was deliberately co-designed across hardware vendors so software ecosystems wouldn’t fragment.

Native MXFP4 support

NVIDIA Blackwell (B100, B200, GB200). Native MXFP4 tensor-core multiply-accumulate. Same compute path as NVFP4, selected by metadata.
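AMD Instinct MI350 series (MI350X, MI355X; CDNA 4). Native via the XDL (matrix accelerator) instructions. The earlier MI300X / MI325X (CDNA 3) support FP8 natively but not FP4, so MXFP4 on those parts is a storage format only.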
NVIDIA Hopper (H100, H200). No native MXFP4 multiply, but MXFP4 storage is supported — weights live in HBM as MXFP4, dequantize to FP8 or BF16 in the matmul kernel. Lower throughput than native but the memory savings still apply.
Trainium2. Partial — supported as a storage format only.

The OCP MX family beyond MXFP4

MXFP4 is one of several formats under the same MX umbrella, all sharing the 32-element-block + E8M0-scale chassis but varying the element format:

Format	Element	Bits/element	Effective bits/element
MXFP4	E2M1	4	4.25
MXFP6	E2M3 / E3M2	6	6.25
MXFP8	E4M3 / E5M2	8	8.25
MXINT8	INT8	8	8.25

MXFP6 is the under-rated middle child: it costs 2 extra bits per element over MXFP4 (6.25 vs 4.25 effective bits, roughly 47% more storage) but is about 24% smaller than MXFP8, and it recovers nearly all the FP8 quality at much lower memory pressure. MXFP8 is what Blackwell uses for FP8 training (vs the H100-style monolithic FP8 with delayed scaling).

When to choose MXFP4

For inference: when you're serving on Blackwell or on an AMD accelerator with native MXFP4 support, MXFP4 is the new default for sub-8-bit weight quantization. Native hardware support means it generally beats GPTQ-INT4 in both throughput and accuracy.

For training: MXFP4 training is research-grade as of 2026 — the recipes are not yet boringly reliable at the 70B+ scale. MXFP6 or MXFP8 is what you use for MX-format training; MXFP4 remains an inference format.

For hardware without native MX support (Hopper included): storage-only MXFP4 (dequant to FP8/BF16 in-kernel) still saves HBM but loses the throughput advantage. AWQ-INT4 remains competitive and often wins because the kernel ecosystem (vLLM, TensorRT-LLM) is more mature.

Go further

Why E8M0 for the scale and not E4M3 or E5M2?

E8M0 is a power-of-2 scale: 8 exponent bits, zero mantissa bits, no sign. Dequantization becomes a bit-shift: $x_i = 2^{e-127} \cdot \mathrm{LUT}[c_i]$ for scale exponent $e$ and element code $c_i$, which amounts to adding $e - 127$ to the exponent of the looked-up value. That's effectively free on hardware. An E4M3 scale would carry mantissa precision the application doesn't need (the per-element E2M1 already provides mantissa), and would cost a multiply per dequant. NVFP4 makes the opposite trade (finer scale precision at the cost of a real multiply), which is defensible only because Blackwell's tensor cores absorb that multiply natively.
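To see why a power-of-two scale needs no multiplier, the sketch below applies one to an fp32 value by editing the exponent field of its bit pattern directly. It is an illustration only: `scale_by_pow2_fp32` is a hypothetical name, and the function deliberately handles only normal, nonzero inputs whose result stays in the normal fp32 range.

```python
import struct


def scale_by_pow2_fp32(x, k):
    """Return x * 2**k for a normal fp32 x by adding k to its exponent field."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exp = (bits >> 23) & 0xFF
    if exp == 0 or exp == 0xFF:
        raise ValueError("zero, subnormal, inf and NaN need separate handling")
    new_exp = exp + k
    if not 0 < new_exp < 0xFF:
        raise OverflowError("result leaves the normal fp32 range")
    # Clear the 8 exponent bits, then write the adjusted exponent back.
    bits = (bits & ~(0xFF << 23)) | (new_exp << 23)
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

The sign and mantissa bits are never touched, which is the whole point: an E4M3 scale has mantissa bits of its own, so applying it would require a real mantissa multiply.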

Why 32-element blocks?

Trade-off between scale-overhead amortization and dynamic-range fidelity. Smaller blocks (16, 8) track outliers better — each block's scale is set by its own absmax — but every block costs 8 bits of scale storage, so the overhead grows. Larger blocks (64, 128) amortize scale storage but waste resolution when one outlier elevates the absmax for 127 well-behaved siblings. The OCP committee landed on 32 as the empirical knee.
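The storage side of that trade-off is simple arithmetic, sketched below for a few candidate block sizes. The helper name `effective_bits` is hypothetical. The fidelity side depends on the weight distribution and is not modelled here.

```python
def effective_bits(element_bits=4, block_size=32, scale_bits=8):
    """Bits stored per element once the shared scale is amortized."""
    return element_bits + scale_bits / block_size


for block_size in (8, 16, 32, 64, 128):
    bits = effective_bits(block_size=block_size)
    overhead = bits / 4 - 1  # premium over a bare 4-bit element
    print(f"block={block_size:>3}  bits/element={bits:.4f}  overhead={overhead:.1%}")
```

Halving the block size from 32 to 16 doubles the scale overhead from 0.25 to 0.5 bits per element, while doubling it to 64 saves only 0.125 bits.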

How does MXFP4 accuracy compare to GPTQ-INT4 or AWQ-INT4?

On most LLaMA-class models at inference, MXFP4 lands within ~0.3 perplexity of FP16 versus ~0.5-1.0 for GPTQ-INT4 and ~0.3-0.5 for AWQ-INT4. The win comes from the per-32-element scale: a uniform INT4 with one per-channel scale wastes resolution on every block whose dynamic range is far from the channel-wide absmax. MXFP4's per-block scale stays close to optimal for its block, limited only by being rounded to a power of two. The catch: MXFP4 needs hardware-native support to be performant; on hardware that has to emulate it, GPTQ/AWQ INT4 still win on throughput.
