MFU is achieved FLOPs divided by theoretical peak FLOPs — the headline efficiency metric for whether you're actually using the GPU. Realistic targets in 2026: 40-60 percent during pretraining is good, 50 percent-plus is excellent.
Model FLOPs Utilization is the ratio of useful FLOPs your model actually executes per second to the GPU’s theoretical peak FLOPs per second. It was introduced in the PaLM paper (Chowdhery et al., 2022) and is now the standard headline efficiency metric for any large-scale training or serving workload. A team that says “we trained at 52 percent MFU on 4096 H100s” is making a precise claim — half the silicon’s compute capacity, sustained, end-to-end, no asterisks. A team that quotes raw tokens-per-second without an MFU number is hiding something.
The formula
For training,
$$\mathrm{MFU} = \frac{F}{P \cdot t \cdot G}$$
where $F$ is the analytic FLOP count of one forward-and-backward pass over the global batch, $P$ is per-GPU peak (e.g., 989 TFLOPS at FP16 for H100), $t$ is wall-clock step time, and $G$ is the GPU count. The standard analytic estimate for transformer FLOPs is $F \approx 6ND$ for a model of $N$ parameters processing $D$ tokens: 2 FLOPs per parameter for forward, 2 for activation gradients, 2 for weight gradients (Kaplan et al. 2020 derivation).
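As a minimal sketch, the training definition (model FLOPs divided by peak FLOPs times step time times GPU count) can be written as a few lines of Python. The function name is hypothetical, and the example inputs (7B parameters, a 4M-token global batch, 64 GPUs, a 5.6 s step) are illustrative values, not measurements.

```python
def training_mfu(n_params, tokens_per_step, step_time_s, n_gpus,
                 peak_flops_per_gpu=989e12):
    """MFU for one training step under the 6 * N * D approximation.

    peak_flops_per_gpu defaults to the 989 TFLOPS H100 figure quoted in the text.
    """
    # Analytic model FLOPs: 2 (forward) + 2 (activation grads) + 2 (weight grads)
    model_flops = 6.0 * n_params * tokens_per_step
    # FLOPs the cluster could have delivered at peak during the step
    available_flops = peak_flops_per_gpu * step_time_s * n_gpus
    return model_flops / available_flops


# Illustrative inputs only (assumed, not taken from any real run)
mfu = training_mfu(n_params=7e9, tokens_per_step=4_194_304,
                   step_time_s=5.6, n_gpus=64)
print(f"MFU = {mfu:.1%}")
```

The 6ND count ignores attention-score FLOPs, so for long sequences this sketch slightly understates the model FLOPs.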
For serving, the equivalent is achieved tokens/sec times FLOPs-per-token, divided by peak. Decode and prefill are reported separately because their MFU ceilings differ by an order of magnitude.
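The serving version can be sketched the same way. This assumes FLOPs-per-token is roughly 2N (forward pass only, attention FLOPs ignored); the function name and the throughput figures are hypothetical and only show how prefill and decode end up an order of magnitude apart.

```python
def serving_mfu(tokens_per_s, n_params, n_gpus, peak_flops_per_gpu=989e12):
    """Serving MFU: achieved tokens/sec times FLOPs-per-token, divided by peak."""
    flops_per_token = 2.0 * n_params  # assumption: forward pass only, ~2 FLOPs per parameter
    return tokens_per_s * flops_per_token / (peak_flops_per_gpu * n_gpus)


# Hypothetical throughputs for a 70B model on 8 GPUs, reported separately
prefill = serving_mfu(tokens_per_s=20_000, n_params=70e9, n_gpus=8)
decode = serving_mfu(tokens_per_s=2_500, n_params=70e9, n_gpus=8)
print(f"prefill MFU = {prefill:.1%}, decode MFU = {decode:.1%}")
```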
What’s a good number
Realistic 2026 targets
Pretraining, well-tuned, dense transformer: 40-55 percent MFU on H100s with bf16 and FlashAttention 2/3.
Fine-tuning with SFT or DPO: 35-50 percent if your batch is large enough to keep intensity up.
Inference prefill: 30-50 percent — high arithmetic intensity helps.
Inference decode, batch 32: 8-12 percent — capped by the bandwidth-bound ridge, not by laziness.
Naive eager-mode PyTorch, no kernel fusion, no FlashAttention: 3-8 percent, often single-digit on bigger models.
A 5x MFU gap between naive and well-engineered code is normal. That gap is most of the operating cost of a training run. A team that can’t articulate their MFU and what’s bottlenecking it is leaving a lot of money on the table.
Decode-step inference loads the full weight tensor on every decode step (memory-bound), so its arithmetic intensity scales with effective batch size $B$. The roofline ceiling on an H100 (ridge ~295 FLOPs/byte) is $\mathrm{MFU}_{\max} \approx B / 295$. At $B = 32$, that’s 11 percent, and you’ll hit maybe 8-10 percent of peak FLOPS in practice after kernel overheads, speculation, and scheduler costs. There is no kernel rewrite that breaks this ceiling; the ceiling is the architecture.
What does break it: raising intensity by other means. Quantizing weights to FP8 cuts the bytes loaded in half and doubles the ceiling for the same batch. GQA shrinks the KV-cache footprint and lets you batch more sequences at the same VRAM. Speculative decoding lets one weight load produce multiple tokens — increasing intensity multiplicatively. Each of these shows up as MFU rising 1.5-3x. The right framing for an inference team is “what raises my arithmetic intensity,” not “what makes my kernel faster.”
So decode MFU at 10 percent isn’t a failure to optimize. It’s the ceiling, and you raise the ceiling by changing the workload’s bytes-per-FLOP ratio, not by squeezing the existing one harder.
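A rough sketch of that ceiling, and of how the levers move it, follows. It assumes only weight traffic counts (KV-cache reads are ignored), 2 FLOPs per weight per token, and an H100 memory bandwidth of 3.35 TB/s, which is what the ~295 FLOPs/byte ridge implies at 989 TFLOPS. The function and parameter names are hypothetical.

```python
def decode_mfu_ceiling(batch_size, bytes_per_weight=2.0, tokens_per_weight_load=1.0,
                       peak_flops=989e12, mem_bandwidth=3.35e12):
    """Roofline upper bound on decode MFU, counting weight traffic only."""
    ridge = peak_flops / mem_bandwidth  # FLOPs per byte, about 295 on an H100
    # Each weight is read once per step and does 2 FLOPs for every token it serves
    intensity = 2.0 * batch_size * tokens_per_weight_load / bytes_per_weight
    return min(1.0, intensity / ridge)


fp16 = decode_mfu_ceiling(32)                             # baseline, 2-byte weights
fp8 = decode_mfu_ceiling(32, bytes_per_weight=1.0)        # half the bytes per weight
spec = decode_mfu_ceiling(32, tokens_per_weight_load=2.0) # 2 tokens per weight load (assumed)
print(f"fp16 {fp16:.1%}, fp8 {fp8:.1%}, speculative {spec:.1%}")
```

Every lever enters through `intensity`, which is why none of them require a faster kernel to raise the bound.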
What MFU lets you diagnose
The diagnostic value is in the gap between your number and the ceiling for your shape:
Prefill at 25 percent on a 70B fp16 H100 run: likely missing FlashAttention or kernel fusion. Compute-bound regime; you should be at 40-50.
Decode at 4 percent on batch 32: likely below the bandwidth-bound ceiling because of unmaterialized batching or KV-cache fragmentation. PagedAttention or larger batch will close the gap to ~10.
Training at 18 percent on a multi-node run: communication overhead from poorly tuned tensor parallelism or pipeline parallelism. The chip is fine; the cluster topology is hurting you.
Single-digit MFU on a small model on a big GPU: the model is too small to saturate the device. Either swap to a smaller GPU or batch up.
Why anyone should care
Compute is the binding cost of frontier ML. The ratio between a 5 percent MFU run and a 50 percent MFU run is a 10x difference in dollar cost for the same model quality, the same training data, the same cluster size. Top labs publish their MFU numbers because it’s the legible signal that their systems team is competent. For an applied team, “what’s our MFU and why isn’t it higher” is the right question to be asking every quarter.
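The 10x claim follows directly from the formula, since GPU-hours scale as 1/MFU for a fixed FLOP budget. A small sketch, assuming the 6ND estimate and a hypothetical 7B-parameter model trained on 1T tokens (illustrative numbers only):

```python
def gpu_hours(n_params, n_tokens, mfu, peak_flops_per_gpu=989e12):
    """GPU-hours needed to train at a given sustained MFU (6 * N * D estimate)."""
    total_flops = 6.0 * n_params * n_tokens
    sustained_flops_per_gpu = peak_flops_per_gpu * mfu
    return total_flops / (sustained_flops_per_gpu * 3600.0)


# Same model, same data; only MFU differs
slow = gpu_hours(7e9, 1e12, mfu=0.05)
fast = gpu_hours(7e9, 1e12, mfu=0.50)
print(f"{slow:,.0f} vs {fast:,.0f} GPU-hours, ratio {slow / fast:.0f}x")
```

Multiplying either figure by an hourly GPU price gives the dollar cost; the ratio is unchanged.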
Go further
How is MFU computed exactly?
MFU = (model FLOPs per training step) / (peak hardware FLOPs x step time x num GPUs). Model FLOPs come from a forward+backward analytic count: ~6 x N x D for a transformer of N parameters processing D tokens (2x forward, 2x activation gradient, 2x weight gradient). Step time and GPU count are measured. The headline number is dominated by how close arithmetic intensity gets to the ridge point — high MFU requires both compute-bound shapes and well-tuned kernels.
Why is decode MFU so much lower than training MFU?
Because decode is permanently memory-bound. The roofline ceiling at batch 32 on an H100 is around 32 / 295 ~ 11 percent of peak FLOPS, no matter how clever the kernel. Hitting 8-10 percent MFU during decode is actually well-optimized; the headline MFU numbers in the 40-60 percent range come from training and prefill, where intensity is high and the ceiling is the FLOPS cap rather than the bandwidth cap.
What is the difference between MFU and HFU?
MFU counts only the analytic FLOPs the model needs to do (the forward-and-backward FLOPs at exact precision). HFU (Hardware FLOPs Utilization) counts every FLOP actually executed, including activation recomputation, redundant computation in tensor parallelism, and other overheads. HFU is always at least MFU; the gap is implementation overhead. PaLM and Megatron papers report both; MFU is the model-quality-per-FLOP-of-budget number, HFU is the chip-utilization number.
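To make the gap concrete, here is a sketch under one stated assumption: full activation recomputation reruns the forward pass once, adding roughly 2ND hardware FLOPs on top of the 6ND model FLOPs, and no other overhead is counted. The function name and inputs are hypothetical.

```python
def mfu_and_hfu(n_params, tokens_per_step, step_time_s, n_gpus,
                recompute_forward=True, peak_flops_per_gpu=989e12):
    """Return (MFU, HFU) for one step, modelling only recomputation overhead."""
    budget = peak_flops_per_gpu * step_time_s * n_gpus
    model_flops = 6.0 * n_params * tokens_per_step
    # Assumption: full recomputation repeats the forward pass, ~2 * N * D extra FLOPs
    extra_flops = 2.0 * n_params * tokens_per_step if recompute_forward else 0.0
    return model_flops / budget, (model_flops + extra_flops) / budget


# Illustrative inputs only
mfu, hfu = mfu_and_hfu(7e9, 4_194_304, step_time_s=7.0, n_gpus=64)
print(f"MFU = {mfu:.1%}, HFU = {hfu:.1%}")
```

Under this assumption HFU is 4/3 of MFU: the chip is busier than the MFU figure suggests, but the extra work buys memory savings rather than training progress.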