Why two FP8 formats instead of one?
The two roles in training have very different dynamic-range requirements. Forward activations span ~3 orders of magnitude in a healthy transformer; E4M3's max ~448 and min subnormal ~
Also known as: FP8, 8-bit float, E4M3, E5M2, FP8 inference, FP8 training
Two IEEE-style 8-bit float variants: E4M3 (1 sign, 4 exponent, 3 mantissa) and E5M2 (1 sign, 5 exponent, 2 mantissa). E4M3 has higher precision and narrower range and is used for forward activations and weights. E5M2 has wider range and lower precision and is used for gradients in the backward pass.
FP8 is a family of 8-bit floating-point formats for deep learning, specified in the 2022 NVIDIA, Arm, and Intel white paper “FP8 Formats for Deep Learning.” It has two variants, each with one sign bit: E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits). Both fit in a single byte, but their roles in training are distinct.
E4M3 (forward / weights):
┌─┬─────┬─────┐
│S│ EEEE│ MMM │
└─┴─────┴─────┘
bits: 1 sign | 4 exponent | 3 mantissa
max ≈ ±448; min subnormal ≈ ±2^(-9)
E5M2 (backward / gradients):
┌─┬──────┬───┐
│S│EEEEE │MM │
└─┴──────┴───┘
bits: 1 sign | 5 exponent | 2 mantissa
max ≈ ±57344; min subnormal ≈ ±2^(-16)
E4M3 trades 1 exponent bit for 1 mantissa bit relative to E5M2. The result: E5M2 has twice as many exponent codes (5 bits vs 4), and therefore a much wider dynamic range, at the cost of half the mantissa resolution (2 bits vs 3). This is the load-bearing engineering trade in FP8 training.
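To put numbers on that trade, the following sketch computes the ratio between the largest finite value and the smallest subnormal for each format, using only the extremes quoted in the diagrams above. The helper names `dynamic_range` and `relative_step` are illustrative, not part of any library.

```python
import math

# Extremes quoted in the diagrams above: (largest finite value, smallest subnormal).
FORMATS = {
    "E4M3": (448.0, 2.0 ** -9),
    "E5M2": (57344.0, 2.0 ** -16),
}

def dynamic_range(max_value, min_subnormal):
    """Return the max/min ratio in powers of two (binades) and powers of ten (decades)."""
    ratio = max_value / min_subnormal
    return math.log2(ratio), math.log10(ratio)

def relative_step(mantissa_bits):
    """Gap between adjacent normal values, as a fraction of the enclosing power of two."""
    return 2.0 ** -mantissa_bits

for name, (hi, lo) in FORMATS.items():
    binades, decades = dynamic_range(hi, lo)
    print(f"{name}: {binades:.1f} binades, {decades:.1f} decades")

print("E4M3 relative step:", relative_step(3))  # 3 mantissa bits
print("E5M2 relative step:", relative_step(2))  # 2 mantissa bits
```

Counting subnormals, E4M3 covers a little over 5 decades and E5M2 a little over 9, while the E4M3 grid is twice as fine within each power of two.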
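With S, E, and M denoting the sign, exponent, and mantissa fields read as unsigned integers:
For E4M3 (bias = 7): a normal value (E ≠ 0) is (-1)^S × 2^(E - 7) × (1 + M/8), and a subnormal (E = 0) is (-1)^S × 2^(-6) × (M/8).
For E5M2 (bias = 15): a normal value (E ≠ 0) is (-1)^S × 2^(E - 15) × (1 + M/4), and a subnormal (E = 0) is (-1)^S × 2^(-14) × (M/4).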
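The bias values can be checked against the max and min-subnormal figures quoted earlier with a small decoder. This is a minimal sketch, assuming the usual sign / exponent / mantissa bit order from most to least significant. The function names are illustrative, and the NaN and infinity encodings are deliberately not handled.

```python
def decode_fp8(byte, exp_bits, man_bits, bias):
    """Decode one FP8 byte to a Python float.

    Sketch only: special encodings (NaN, and Inf in E5M2) are NOT detected,
    so callers should only pass codes for finite values.
    """
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    frac = man / (1 << man_bits)
    if exp == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        return sign * 2.0 ** (1 - bias) * frac
    # Normal: implicit leading 1.
    return sign * 2.0 ** (exp - bias) * (1.0 + frac)

def decode_e4m3(byte):
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7)

def decode_e5m2(byte):
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15)

# Largest finite codes and smallest subnormal codes for each format.
print(decode_e4m3(0b0_1111_110), decode_e4m3(0b0_0000_001))
print(decode_e5m2(0b0_11110_11), decode_e5m2(0b0_00000_01))
```

The four decoded values reproduce ±448 and 2^(-9) for E4M3, and ±57344 and 2^(-16) for E5M2.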
Why this split: forward activations in a healthy transformer span ~3 decades (post-LayerNorm everything sits within
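A useful mental model: E5M2 is “FP16 minus precision” (it keeps FP16’s 5 exponent bits and cuts the mantissa from 10 bits to 2), while E4M3 also gives up an exponent bit, and with it range, to keep one more mantissa bit. Both are derived from FP16 by giving up half the bits, and the choice of which bits to give up maps directly to where the format gets used.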
Because FP8’s dynamic range is narrow (especially E4M3), every tensor that goes through FP8 needs a scaling factor: a per-tensor (or per-channel, or per-token) FP32 multiplier that maps the tensor’s actual values into FP8’s representable range. The recipe:
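A minimal sketch of one common form of per-tensor scaling, assuming the scale is chosen so the tensor's largest magnitude lands on the E4M3 maximum. It simulates E4M3 rounding in pure Python by snapping to the nearest representable value. All names here are hypothetical, and the tie-breaking is simpler than the round-to-nearest-even that real hardware typically uses.

```python
E4M3_MAX = 448.0  # largest finite E4M3 magnitude

def e4m3_values():
    """All non-negative finite E4M3 values (bias 7; code 0x7F is NaN and is skipped)."""
    vals = []
    for code in range(0x7F):
        exp, man = code >> 3, code & 0x7
        if exp == 0:
            vals.append(2.0 ** -6 * (man / 8))          # subnormal
        else:
            vals.append(2.0 ** (exp - 7) * (1 + man / 8))  # normal
    return vals

GRID = e4m3_values()

def round_to_e4m3(x):
    """Snap x to the nearest E4M3 value, saturating at ±448 (ties go to the smaller value)."""
    mag = min(abs(x), E4M3_MAX)
    nearest = min(GRID, key=lambda v: abs(v - mag))
    return -nearest if x < 0 else nearest

def quantize_per_tensor(tensor):
    """Return (quantized values, scale) with scale = E4M3_MAX / amax."""
    amax = max(abs(x) for x in tensor)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return [round_to_e4m3(x * scale) for x in tensor], scale

def dequantize(quantized, scale):
    """Map quantized values back to the original range."""
    return [v / scale for v in quantized]

# Usage: small activations that would underflow unscaled survive the round trip.
acts = [0.001, -0.02, 0.5, 3.0]
q, scale = quantize_per_tensor(acts)
print(q, scale)
print(dequantize(q, scale))
```

The scale is kept in higher precision alongside the FP8 payload. Per-channel or per-token variants would compute one `amax` per slice instead of one per tensor.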
The choice of
| Hardware | E4M3 | E5M2 | Native matmul | Notes |
|---|---|---|---|---|
| H100 / H200 | ✓ | ✓ | ✓ | First widely-deployed FP8 hardware |
| B100 / B200 | ✓ | ✓ | ✓ | Plus native NVFP4 / MXFP4 |
| MI300X / MI325 | ✓ | ✓ | ✓ | FNUZ encodings rather than the OCP ones |
| Trainium2 | ✓ | partial | partial | E4M3 inference; E5M2 training in software |
| Apple M4 | ✓ | ✗ | ✗ | Inference only via Metal |
FP8 was the major hardware-software co-design unlock of 2023-2024. The H100 + Transformer Engine combination made FP8 training a one-flag-flip from BF16, and the throughput gains (~2×) compounded with sequence-length packing, FlashAttention, and other systems wins to make 100B-class model training feasible on small-by-frontier-standards clusters. NVFP4 / MXFP4 are the 2025-2026 unlock that’s still finding its training recipe.
For inference, FP8 retains its role on hardware without native FP4 support — H100, H200, older MI-series. On Blackwell, NVFP4 wins on throughput and memory; FP8 only beats it when the model is unusually quality-sensitive (some reasoning models lose accuracy at FP4 in ways they don’t at FP8).
For training, FP8 is still the practical default in 2026 — full-FP4 training recipes are research-grade and not yet reliable at the 70B+ scale. The standard Blackwell training stack stores weights in NVFP4, runs forward in FP8 E4M3, and runs backward in FP8 E5M2. Pure FP4 training is the next frontier.
FP8 has narrow dynamic range, so per-tensor scaling is mandatory. The naive recipe — recompute amax (max absolute value) per layer per step before quantizing — would require a global synchronization barrier per layer, killing throughput. Delayed scaling computes the scale at step
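The general idea of delayed scaling can be sketched as follows, assuming the scale used at a given step is derived from amax values recorded on earlier steps, so nothing has to be reduced over the current tensor before quantizing it. The class `DelayedScaler`, its `history_len` window, and the max-over-history rule are illustrative assumptions, not the API of any particular library.

```python
from collections import deque

E4M3_MAX = 448.0  # largest finite E4M3 magnitude

class DelayedScaler:
    """Hypothetical helper: this step's scale comes from amax values of earlier steps."""

    def __init__(self, fp8_max=E4M3_MAX, history_len=16, initial_scale=1.0):
        self.fp8_max = fp8_max
        self.history = deque(maxlen=history_len)  # rolling window of past amax values
        self.initial_scale = initial_scale

    def scale(self):
        """Scale for the current step, computed only from already-recorded history."""
        if not self.history:
            return self.initial_scale
        amax = max(self.history)
        return self.fp8_max / amax if amax > 0 else self.initial_scale

    def record(self, tensor):
        """Store this step's amax so that later steps can use it."""
        self.history.append(max(abs(x) for x in tensor))

# Usage: the scale is read before the current tensor's amax is known.
scaler = DelayedScaler(history_len=4)
for step, tensor in enumerate([[0.5, -2.0], [1.0, 3.5], [0.1, -0.2]]):
    s = scaler.scale()      # depends only on earlier steps
    scaler.record(tensor)   # can be recorded off the critical path
    print(step, s)
```

The trade-off in this sketch is that a sudden jump in magnitude is only reflected one step late, so values above the stale amax saturate for that step. A longer history window makes the scale more conservative.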
FP8 was the inference + training default through 2024. NVFP4 / MXFP4 displace it for weights and forward activations in 2025+ — they're 2× more memory-efficient and Blackwell can compute on them natively. FP8 retains its role for gradients (E5M2's wider range is hard to replace at 4 bits) and for older hardware (H100/H200 don't natively support MX-family). The new training stack on Blackwell is NVFP4 weights + NVFP4 forward + FP8 backward — a precision tier per phase.