Why E8M0 for the scale and not E4M3 or E5M2?
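E8M0 is a power-of-2 scale: 8 exponent bits, zero mantissa bits, no sign. Applying it only adds to each element's exponent, so dequantization becomes a bit-shift rather than a multiply. An E4M3 or E5M2 scale would carry mantissa bits and need a real multiplier.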
Also known as: MX-FP4, Microscaling FP4, OCP MXFP4, block-FP4
The Open Compute Project's microscaling 4-bit float. Blocks of 32 elements share a single 8-bit power-of-2 scale (E8M0); each element is a 4-bit micro-float (E2M1). Effective storage: 4.25 bits per weight.
MXFP4 (Microscaling FP4) is the Open Compute Project’s standardized 4-bit floating-point format with per-block scaling. A block is 32 contiguous elements; each block carries a single 8-bit power-of-2 scale, and each element is a 4-bit micro-float in the E2M1 layout (1 sign, 2 exponent, 1 mantissa). The OCP MX specification was ratified in 2023 by an AMD + Intel + Microsoft + NVIDIA + ARM + Meta + Qualcomm consortium and is now the de-facto interchange format for sub-8-bit numerics across vendors.
The whole format fits in a single ASCII block:
MXFP4 block (32 elements):
   Scale ─┐
          ▼
┌──────────┐ ┌────┐┌────┐┌────┐ ... ┌────┐ ← 32 × 4-bit elements
│   E8M0   │ │E2M1││E2M1││E2M1│     │E2M1│
│  8 bits  │ │4 b ││4 b ││4 b │     │4 b │
└──────────┘ └────┘└────┘└────┘     └────┘
     │
     └─ 8-bit unsigned exponent: scale = 2^(e − 127)
Storage: 8 bits scale + 32 × 4 bits = 136 bits per block
Effective bits/element: 136 / 32 = 4.25
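The scale is stored as an unsigned 8-bit biased exponent: bit pattern $e$ represents the scale $2^{e - 127}$.
E8M0 stores only this exponent, so every scale is an exact power of two, $s = 2^{e - 127}$, where $e \in \{0, \dots, 254\}$ and $e = 255$ is reserved for NaN. Representable scales therefore span $2^{-127}$ to $2^{127}$.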
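As a rough illustration of the scale encoding, the sketch below decodes an E8M0 byte using the `scale = 2^(e − 127)` rule from the diagram and applies it with `math.ldexp`, which adjusts a float's exponent directly. The function names `decode_e8m0` and `apply_scale` are hypothetical, and treating `0xFF` as NaN follows the OCP MX specification as I understand it.

```python
import math

E8M0_BIAS = 127
E8M0_NAN = 0xFF  # assumed reserved NaN pattern (per the OCP MX spec)


def decode_e8m0(e: int) -> float:
    """Return the scale 2^(e - 127) for an 8-bit E8M0 pattern."""
    if not 0 <= e <= 0xFF:
        raise ValueError("E8M0 pattern must fit in 8 bits")
    if e == E8M0_NAN:
        return math.nan
    return math.ldexp(1.0, e - E8M0_BIAS)


def apply_scale(element: float, e: int) -> float:
    """Scale an element by 2^(e - 127) as a pure exponent adjustment."""
    if e == E8M0_NAN:
        return math.nan
    # ldexp(x, k) computes x * 2^k without a mantissa multiply.
    return math.ldexp(element, e - E8M0_BIAS)


print(decode_e8m0(127), decode_e8m0(130), apply_scale(1.5, 129))
```

Because the scale has no mantissa, `apply_scale` never changes an element's significand; only the exponent moves.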
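The E2M1 element format (1 sign, 2 exponent, 1 mantissa) gives 16 encodings (15 distinct values, since $+0$ and $-0$ coincide): $\pm\{0, 0.5, 1, 1.5, 2, 3, 4, 6\}$.
The minimum nonzero magnitude is $0.5$ (the format's only subnormal) and the maximum is $6$; E2M1 has no encodings for infinity or NaN.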
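The element values can be derived from the bit fields rather than memorized. The following is a minimal sketch that assumes the sign sits in the most significant bit, followed by the 2 exponent bits and the 1 mantissa bit, with an exponent bias of 1; `decode_e2m1` is a hypothetical helper name.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 pattern laid out as [sign | exp (2) | mantissa (1)]."""
    if not 0 <= code <= 0xF:
        raise ValueError("E2M1 code must fit in 4 bits")
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:
        # Subnormal: no implicit leading 1, so the magnitude is 0 or 0.5.
        magnitude = 0.5 * man
    else:
        # Normal: implicit leading 1, exponent bias of 1.
        magnitude = (1.0 + 0.5 * man) * 2.0 ** (exp - 1)
    return sign * magnitude


E2M1_VALUES = [decode_e2m1(code) for code in range(16)]
print(E2M1_VALUES)
```

The spacing between neighbouring values doubles with each exponent step (0.5, then 1, then 2), which is what lets four bits cover a 12:1 range of nonzero magnitudes.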
Much of the pre-MX 4-bit world used a single scale per output channel of the weight matrix, so an entire row’s worth of weights (often 4096 or 11008 elements) shared one scale. That works when the row’s dynamic range is uniform; it fails when one or two elements are 10× the rest, because the scale is forced to accommodate them and every well-behaved element gets squashed into the lowest few levels of the 4-bit range.
MXFP4’s contribution is tracking dynamic range at a granularity of one scale per 32 elements. Outliers no longer poison their channel; each one is contained to its own 32-element block, where it can distort at most 31 siblings, while every other block keeps its full 4-bit precision. This is why MXFP4 lands within ~0.3 perplexity of FP16 on LLaMA-class models where uniform-scaled INT4 lands at ~0.8.
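The outlier argument can be checked numerically. The sketch below quantizes a synthetic 4096-element row containing one large outlier onto the E2M1 grid twice: once with a single scale for the whole row, once with one scale per 32 elements. The data is synthetic, the helper names are hypothetical, and the scale rule (align the block's largest power-of-2 exponent with E2M1's largest, which is 2) is one common choice that I am assuming here, not necessarily what any particular library does.

```python
import math
import random

E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # E2M1 magnitudes
E2M1_EMAX = 2  # largest magnitude is 6 = 1.5 * 2^2


def fake_quantize(values, block_size):
    """Quantize then dequantize, sharing one power-of-2 scale per block."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block)
        # Assumed scale rule: floor(log2(absmax)) - emax of the element format.
        shift = math.floor(math.log2(absmax)) - E2M1_EMAX if absmax > 0 else 0
        scale = math.ldexp(1.0, shift)
        for v in block:
            # Nearest grid point; values beyond 6 saturate to 6.
            mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
            out.append(math.copysign(mag * scale, v))
    return out


def mean_abs_error(original, restored, skip):
    """Mean absolute error over all elements except index `skip`."""
    errors = [abs(a - b) for i, (a, b) in enumerate(zip(original, restored)) if i != skip]
    return sum(errors) / len(errors)


rng = random.Random(0)
row = [rng.gauss(0.0, 1.0) for _ in range(4096)]
OUTLIER = 1000
row[OUTLIER] = 50.0  # one element far above the rest

per_row = mean_abs_error(row, fake_quantize(row, len(row)), OUTLIER)
per_block = mean_abs_error(row, fake_quantize(row, 32), OUTLIER)
print(f"one scale per row:         {per_row:.3f}")
print(f"one scale per 32 elements: {per_block:.3f}")
```

With a single row-wide scale, the outlier sets the step size for all 4095 other elements; with per-block scales it only affects the 31 elements that share its block, so the average error over the well-behaved elements drops.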
The cost is the 8-bit scale per block, which adds 0.25 bits of overhead per element. Total: 4.25 bits per element vs INT4’s 4.0 — a 6% storage premium for materially better accuracy.
The MX family was deliberately co-designed across hardware vendors so software ecosystems wouldn’t fragment.
XDL (matrix accelerator) instructions; throughput parity with FP8 on the same hardware.
MXFP4 is one of several formats under the same MX umbrella, all sharing the 32-element-block + E8M0-scale chassis but varying the element format:
| Format | Element | Bits/element | Effective bits/element |
|---|---|---|---|
| MXFP4 | E2M1 | 4 | 4.25 |
| MXFP6 | E2M3 / E3M2 | 6 | 6.25 |
| MXFP8 | E4M3 / E5M2 | 8 | 8.25 |
| MXINT8 | INT8 | 8 | 8.25 |
MXFP6 is the under-rated middle child: it gives you 2 extra bits per element over MXFP4 (6.25 vs 4.25 effective bits, about 47% more storage) and recovers nearly all the FP8 quality while using about 24% less memory than MXFP8 (6.25 vs 8.25 effective bits). MXFP8 is what Blackwell uses for FP8 training (vs the H100-style monolithic FP8 with delayed scaling).
For inference: when you’re serving on Blackwell or AMD’s CDNA 4 parts (the MI350 series; the MI300 generation has no native FP4 path), MXFP4 is the new default for sub-8-bit weight quantization. Native hardware support means it generally beats GPTQ-INT4 in both throughput and accuracy.
For training: MXFP4 training is research-grade as of 2026 — the recipes are not yet boringly reliable at the 70B+ scale. MXFP6 or MXFP8 is what you use for MX-format training; MXFP4 remains an inference format.
For hardware without native MX support (this includes pre-Blackwell NVIDIA GPUs): storage-only MXFP4 (dequant to FP8/BF16 in-kernel) still saves HBM but loses the throughput advantage. AWQ-INT4 remains competitive and often wins because the kernel ecosystem (vLLM, TensorRT-LLM) is more mature.
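To make the storage-only path concrete, here is a host-side sketch of what such a dequantization step does for one block: read the scale byte, unpack 32 four-bit codes from 16 bytes, look each one up, and shift its exponent. The packing order (even-indexed element in the low nibble) and the name `dequantize_block` are assumptions for illustration; real kernels and checkpoints may pack differently. It needs Python 3.9+ for the `list[float]` annotation.

```python
import math

# E2M1 lookup table indexed by the 4-bit code (sign bit assumed to be the MSB).
E2M1_LUT = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
            -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0)


def dequantize_block(scale_byte: int, packed: bytes) -> list[float]:
    """Expand one MXFP4 block (1 scale byte + 16 packed bytes) to 32 floats."""
    if len(packed) != 16:
        raise ValueError("expected 16 bytes holding 32 four-bit codes")
    if scale_byte == 0xFF:  # assumed E8M0 NaN pattern
        return [math.nan] * 32
    shift = scale_byte - 127  # scale = 2^(e - 127)
    out = []
    for byte in packed:
        # Assumed packing: low nibble first, then high nibble.
        for code in (byte & 0x0F, byte >> 4):
            out.append(math.ldexp(E2M1_LUT[code], shift))
    return out


block = dequantize_block(128, bytes([0x71] + [0] * 15))
print(block[:4])
```

The 17 input bytes are the 136 bits per block from the diagram; the HBM saving comes from moving those instead of 32 wider values, even when the arithmetic afterwards runs in FP8 or BF16.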
The block size of 32 is a trade-off between scale-overhead amortization and dynamic-range fidelity. Smaller blocks (16, 8) track outliers better, since each block's scale is set by its own absmax, but every block costs 8 bits of scale storage, so the per-element overhead grows (0.5 bits at 16, 1 bit at 8). Larger blocks (64, 128) amortize scale storage but waste resolution when one outlier elevates the absmax for up to 127 well-behaved siblings. The OCP committee landed on 32 as the empirical knee.
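The storage side of this trade-off is simple arithmetic: each element pays its own bits plus an equal share of the block's 8-bit scale. A small sketch, with `effective_bits` as a hypothetical helper name:

```python
def effective_bits(element_bits: int, block_size: int, scale_bits: int = 8) -> float:
    """Bits stored per element once the shared scale is amortized over the block."""
    return element_bits + scale_bits / block_size


for block_size in (8, 16, 32, 64, 128):
    bits = effective_bits(4, block_size)
    overhead = (bits - 4) / 4 * 100
    print(f"block={block_size:3d}  bits/element={bits:.4f}  overhead={overhead:.2f}%")
```

Halving the block size doubles the scale overhead, while doubling it past 32 saves progressively less; the accuracy side of the trade-off has to be measured on real weights.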
On most LLaMA-class models at inference, MXFP4 lands within ~0.3 perplexity of FP16 versus ~0.5-1.0 for GPTQ-INT4 and ~0.3-0.5 for AWQ-INT4. The win comes from the per-32-element scale — a uniform INT4 with one per-channel scale wastes resolution on every block whose dynamic range is far from the channel-wide absmax. MXFP4's per-block scaling is always near-optimal for its block. The catch: MXFP4 needs hardware-native support to be performant; on hardware that has to emulate it, GPTQ/AWQ INT4 still win on throughput.