FP8

Also known as: FP8, 8-bit float, E4M3, E5M2, FP8 inference, FP8 training

TL;DR

Two IEEE-style 8-bit float variants: E4M3 (1 sign, 4 exponent, 3 mantissa) and E5M2 (1 sign, 5 exponent, 2 mantissa). E4M3 has higher precision and a narrower range and is used for forward activations and weights; E5M2 has a wider range and lower precision and is used for gradients.

FP8 is a family of 8-bit floating-point formats for deep learning, standardized in the 2022 NVIDIA + Arm + Intel “FP8 Formats for Deep Learning” white paper. Two variants share the FP8 byte: E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits). Both fit in a single byte; their roles in training are distinct.

Bit layout

E4M3 (forward / weights):
  ┌─┬─────┬─────┐
  │S│ EEEE│ MMM │
  └─┴─────┴─────┘
   1   4    3
   max ≈ ±448      min subnormal ≈ ±2^(-9)

E5M2 (backward / gradients):
  ┌─┬──────┬───┐
  │S│EEEEE │MM │
  └─┴──────┴───┘
   1   5    2
   max ≈ ±57344    min subnormal ≈ ±2^(-16)

E4M3 trades 1 exponent bit for 1 mantissa bit relative to E5M2. The result: E5M2 has 2× the exponent range (5 bits vs 4) at the cost of half the mantissa precision (2 bits vs 3). This is the load-bearing engineering trade in FP8 training.

For E4M3 (bias = 7):

  • Max: 2^8 × 1.875 = 480 by the bit pattern, but the spec clamps to 448 (2^8 × 1.75) to reserve a NaN encoding.
  • Min normal: 2^(-6) ≈ 0.0156.
  • Min subnormal: 2^(-9) ≈ 0.00195.
  • Effective dynamic range: ~5 decades.

For E5M2 (bias = 15):

  • Max: 2^15 × 1.75 = 57344.
  • Min normal: 2^(-14) ≈ 6.1 × 10^(-5).
  • Min subnormal: 2^(-16) ≈ 1.5 × 10^(-5).
  • Effective dynamic range: ~10 decades.
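The range limits for both formats follow from the exponent width, mantissa width, and bias. The sketch below recomputes them; `fp8_limits` and its `top_exponent_is_special` flag are hypothetical names used only for illustration, and the flag encodes the one structural difference between the formats (E5M2 reserves the whole top exponent for inf/NaN, while E4M3 reserves only a single NaN pattern).

```python
import math

def fp8_limits(exp_bits, man_bits, bias, top_exponent_is_special):
    """Return (max, min_normal, min_subnormal) for a small float format.

    top_exponent_is_special=True:  the all-ones exponent encodes inf/NaN (E5M2).
    top_exponent_is_special=False: the all-ones exponent holds normal values and
                                   only its all-ones mantissa is NaN (E4M3).
    """
    top = (1 << exp_bits) - 1
    if top_exponent_is_special:
        max_exp = top - 1 - bias
        max_frac = ((1 << man_bits) - 1) / (1 << man_bits)
    else:
        max_exp = top - bias
        max_frac = ((1 << man_bits) - 2) / (1 << man_bits)
    max_value = 2.0 ** max_exp * (1 + max_frac)
    min_normal = 2.0 ** (1 - bias)
    min_subnormal = min_normal / (1 << man_bits)
    return max_value, min_normal, min_subnormal

e4m3 = fp8_limits(4, 3, bias=7, top_exponent_is_special=False)
e5m2 = fp8_limits(5, 2, bias=15, top_exponent_is_special=True)

for name, (mx, mn, sub) in (("E4M3", e4m3), ("E5M2", e5m2)):
    decades = math.log10(mx / sub)  # max over min subnormal, in powers of ten
    print(f"{name}: max={mx:g} min_normal={mn:g} min_subnormal={sub:g} decades={decades:.1f}")
```

Measured from the smallest subnormal to the largest finite value, this gives a little over 5 decades for E4M3 and a little under 10 for E5M2.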

Why this split: forward activations in a healthy transformer span ~3 decades (post-LayerNorm, values typically sit within a narrow band around zero). Gradients span ~10 decades because they grow with depth and shrink with the chain rule; backprop through 80 layers can spread gradient magnitudes across that whole span. E4M3 can't represent the small gradients (they underflow); E5M2 can.
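To make the underflow argument concrete, here is a minimal sketch that checks which magnitudes round to zero in each format. It assumes round-to-nearest-even and no scaling; `rounds_to_zero` is a hypothetical helper, and the sample gradient magnitudes are illustrative rather than measured.

```python
E4M3_MIN_SUBNORMAL = 2.0 ** -9
E5M2_MIN_SUBNORMAL = 2.0 ** -16

def rounds_to_zero(value, min_subnormal):
    """True if |value| rounds to 0 under round-to-nearest-even.

    Anything at or below half the smallest subnormal rounds down to zero
    (the exact halfway case ties to the even neighbour, which is 0).
    """
    return abs(value) <= min_subnormal / 2

for grad in (1e-2, 1e-3, 1e-4, 3e-5):
    lost_e4m3 = rounds_to_zero(grad, E4M3_MIN_SUBNORMAL)
    lost_e5m2 = rounds_to_zero(grad, E5M2_MIN_SUBNORMAL)
    print(f"{grad:g}: lost in E4M3={lost_e4m3}, lost in E5M2={lost_e5m2}")
```

A scaling factor shifts this window up or down but does not widen it, so a tensor that mixes large and very small magnitudes still loses its smallest values in E4M3 first.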

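A useful mental model: E5M2 is “FP16 minus precision” (it keeps FP16's 5-bit exponent and bias of 15 and cuts the mantissa from 10 bits to 2), while E4M3 is “FP16 minus range” (it gives up an exponent bit to keep one more mantissa bit than E5M2). Both are derived from FP16 by giving up half the bits, and the choice of which half maps directly to where the format gets used.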

Per-tensor scaling and the FP8 recipe

Because FP8’s dynamic range is narrow (especially E4M3), every tensor that goes through FP8 needs a scaling factor: a per-tensor (or per-channel, or per-token) FP32 multiplier that maps the tensor’s actual values into FP8’s representable range. The recipe:

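  1. Cast input x to FP8 with scale s: x_fp8 = cast_fp8(x / s), clamped to FP8’s representable range.
  2. Compute in FP8: matrix multiplication accumulates in higher precision (typically FP32 for the accumulator) but reads inputs as FP8.
  3. Scale back on output: y = s · y_fp8 (for a matmul, s is the product of the two input scales).

The choice of s is everything. Too large and most values quantize to zero; too small and the largest values clip. The standard approach is s = amax(x) / FP8_max, where amax(x) is the maximum absolute value of x and FP8_max is the format's largest finite value (448 for E4M3, 57344 for E5M2).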
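The recipe can be sketched in plain Python. This is a minimal illustration, not a library implementation: it assumes the scale divides the input (consistent with "too large and most values quantize to zero"), simulates E4M3 rounding in software, and does not handle NaN inputs. The names `round_to_e4m3`, `quantize_e4m3`, and `dequantize` are hypothetical.

```python
import math

E4M3_MAX = 448.0     # largest finite E4M3 value
E4M3_MIN_EXP = -6    # exponent of the smallest normal value (bias 7)
E4M3_MAN_BITS = 3

def round_to_e4m3(v):
    """Round a float to the nearest E4M3 value, saturating at +/-448."""
    if v == 0.0:
        return 0.0
    a = min(abs(v), E4M3_MAX)                # clamp instead of overflowing
    _, e = math.frexp(a)                     # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_EXP)           # subnormals share the minimum exponent
    step = 2.0 ** (exp - E4M3_MAN_BITS)      # spacing of representable values here
    return math.copysign(min(round(a / step) * step, E4M3_MAX), v)

def quantize_e4m3(xs):
    """Per-tensor quantization: returns (E4M3-rounded values, scale)."""
    amax = max(abs(x) for x in xs)
    s = amax / E4M3_MAX if amax > 0 else 1.0  # largest |x| maps to 448
    return [round_to_e4m3(x / s) for x in xs], s

def dequantize(q, s):
    """Scale back to the original range."""
    return [v * s for v in q]

x = [0.001, -0.02, 0.5, 3.0]
q, s = quantize_e4m3(x)
x_hat = dequantize(q, s)
for orig, rec in zip(x, x_hat):
    print(f"{orig:g} -> {rec:.6g} (relative error {abs(rec - orig) / abs(orig):.2%})")
```

With 3 mantissa bits, values that land in the normal range come back with at most 1/16 (6.25%) relative error; values that land in the subnormal range or below lose more.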

Where each variant gets used

The standard FP8 transformer recipe:
  • Weights: E4M3, with per-tensor or per-channel scales. Static — computed once at quantization time.
  • Forward activations: E4M3, with delayed-scaling per-tensor scales. The matmul accumulator stays in FP32; the input/output traffic is FP8.
  • Gradients (backward): E5M2, with delayed-scaling per-tensor scales. The wider exponent range is essential for gradients.
  • KV cache: typically E4M3 with per-token scales — recovers most of the perplexity at half the memory of BF16 KV cache.
  • Optimizer state (Adam moments): typically BF16 or FP32. FP8 first-moment is borderline; second-moment underflows. This is one of the last holdouts of higher precision in modern training.

Hardware support

Hardware        E4M3     E5M2     Native matmul  Notes
H100 / H200     yes      yes      yes            First widely-deployed FP8 hardware
B100 / B200     yes      yes      yes            Plus native NVFP4 / MXFP4
MI300X / MI325  yes      yes      yes            OCP variant
Trainium2       partial  partial                 E4M3 inference; E5M2 training in software
Apple M4                                         Inference only via Metal

FP8 was the major hardware-software co-design unlock of 2023-2024. The H100 + Transformer Engine combination made FP8 training a one-flag-flip from BF16, and the throughput gains (~2×) compounded with sequence-length packing, FlashAttention, and other systems wins to make 100B-class model training feasible on small-by-frontier-standards clusters. NVFP4 / MXFP4 are the 2025-2026 unlock that’s still finding its training recipe.

When to choose FP8 over the FP4 family

For inference, FP8 retains its role on hardware without native FP4 support — H100, H200, older MI-series. On Blackwell, NVFP4 wins on throughput and memory; FP8 only beats it when the model is unusually quality-sensitive (some reasoning models lose accuracy at FP4 in ways they don’t at FP8).

For training, FP8 is still the practical default in 2026 — full-FP4 training recipes are research-grade and not yet reliable at the 70B+ scale. The standard Blackwell training stack stores weights in NVFP4, runs forward in FP8 E4M3, and runs backward in FP8 E5M2. Pure FP4 training is the next frontier.

Go further

Why two FP8 formats instead of one?

The two roles in training have very different dynamic-range requirements. Forward activations span ~3 orders of magnitude in a healthy transformer; E4M3's max of ~448 and min subnormal of ~2^(-9) accommodate that with 3 bits of mantissa precision. Gradients during backprop span ~10 orders of magnitude (they grow with depth, shrink with the chain rule, and can underflow easily), so the wider E5M2 range (max 57344, min subnormal 2^(-16)) trumps mantissa precision. Trying to use E4M3 for both makes gradients underflow; trying to use E5M2 for both wastes precision on activations. The split is empirical and load-bearing.

What is delayed scaling and why does it matter?

FP8 has narrow dynamic range, so per-tensor scaling is mandatory. The naive recipe, recomputing amax (max absolute value) per layer per step before quantizing, would require a global synchronization barrier per layer, killing throughput. Delayed scaling computes the scale at step t from amax statistics gathered at step t-1 (or from a windowed history of recent steps). This avoids the barrier; the cost is that occasional first-occurrence outliers cause overflow, which is mitigated by taking the maximum over a window of amax history. Transformer Engine implements this transparently.
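A minimal sketch of the bookkeeping behind delayed scaling, assuming a scale defined as amax / 448 (so dividing by it maps the largest recent value to the E4M3 maximum). `DelayedScaler`, `history_len`, and `record` are hypothetical names for illustration; this is not Transformer Engine's API.

```python
from collections import deque

E4M3_MAX = 448.0

class DelayedScaler:
    """Derive the current step's scale from amax values seen in earlier steps."""

    def __init__(self, fmt_max=E4M3_MAX, history_len=16):
        self.fmt_max = fmt_max
        self.history = deque(maxlen=history_len)  # oldest entries fall off

    def scale(self):
        """Scale for the current step; never looks at the current tensor."""
        if not self.history:
            return 1.0                            # nothing observed yet
        amax = max(self.history)                  # windowed maximum
        return amax / self.fmt_max if amax > 0 else 1.0

    def record(self, tensor):
        """Store this step's amax for use by later steps."""
        self.history.append(max(abs(v) for v in tensor))

scaler = DelayedScaler(history_len=2)
steps = [[1.0, -4.0], [2.0, 0.5], [1.0, -0.25]]
used_scales = []
for tensor in steps:
    s = scaler.scale()      # known before the tensor is inspected
    used_scales.append(s)
    scaler.record(tensor)   # amax is gathered alongside the compute
print(used_scales)
```

Because the scale for a step is known before that step's tensor is inspected, quantization does not wait on an amax reduction. The trade-off is visible at the first step: the default scale of 1.0 knows nothing about the data, which is the first-occurrence overflow risk described above.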

How do MXFP4 / NVFP4 change FP8's role?

FP8 was the inference + training default through 2024. NVFP4 / MXFP4 displace it for weights and forward activations in 2025+ — they're 2× more memory-efficient and Blackwell can compute on them natively. FP8 retains its role for gradients (E5M2's wider range is hard to replace at 4 bits) and for older hardware (H100/H200 don't natively support MX-family). The new training stack on Blackwell is NVFP4 weights + NVFP4 forward + FP8 backward — a precision tier per phase.
