Arithmetic Intensity

Also known as: roofline model, FLOPs per byte, operational intensity

TL;DR

Arithmetic intensity is FLOPs per byte read from memory. Combined with the hardware's compute-to-bandwidth ratio it determines whether a kernel is memory-bound (intensity below the ridge) or compute-bound (above).

Arithmetic intensity is the ratio of arithmetic operations performed to memory bytes read — FLOPs per byte. It’s the single number that decides, on any modern accelerator, whether your kernel is bottlenecked by compute throughput or by memory bandwidth. Williams, Waterman, and Patterson’s 2009 roofline model formalized it: plot achievable performance against intensity on log-log axes and you get a piecewise curve — bandwidth times intensity in the memory-bound region, then a flat ceiling at peak FLOPS once you cross the ridge point. Almost every LLM serving conversation collapses to where on the roofline a given kernel sits.

The formal picture

For a kernel that reads $Q$ bytes from main memory and performs $F$ floating-point operations, arithmetic intensity is

$$I = \frac{F}{Q} \quad \text{(FLOPs per byte)}$$

The hardware exposes peak compute throughput $P_{\text{peak}}$ (FLOPS) and peak memory bandwidth $\mathrm{BW}$ (bytes/s). Achievable performance $P$ for the kernel is bounded by

$$P \le \min\left(P_{\text{peak}},\ \mathrm{BW} \cdot I\right)$$

The crossover (the ridge) is at $I^{*} = P_{\text{peak}} / \mathrm{BW}$. Below $I^{*}$, performance is at most $\mathrm{BW} \cdot I$ regardless of how good your scheduling is. Above $I^{*}$, you can saturate the FLOPS but never exceed them. An H100’s ridge sits near 295 FLOPs/byte at FP16; an A100 around 200; consumer cards somewhat lower. The ridge is where compute capacity exactly matches bandwidth capacity.
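The roofline bound is small enough to sketch directly. The following is a minimal illustration, not a profiling tool: the function names are made up for this example, and the hardware constants are the approximate H100 figures quoted in this article (about 989 TFLOPS at FP16 and 3.35 TB/s of HBM bandwidth).

```python
def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Intensity (FLOPs/byte) at which the bandwidth roof meets the compute roof."""
    return peak_flops / bandwidth_bytes_per_s


def attainable_flops(intensity: float, peak_flops: float,
                     bandwidth_bytes_per_s: float) -> float:
    """Roofline upper bound: min(peak compute, bandwidth * intensity)."""
    return min(peak_flops, bandwidth_bytes_per_s * intensity)


# Assumed, approximate H100 figures (dense FP16 peak, HBM bandwidth).
H100_PEAK = 989e12  # FLOP/s
H100_BW = 3.35e12   # bytes/s

if __name__ == "__main__":
    print(f"ridge: {ridge_point(H100_PEAK, H100_BW):.0f} FLOPs/byte")
    for intensity in (1, 32, 295, 1000):
        bound = attainable_flops(intensity, H100_PEAK, H100_BW)
        print(f"I={intensity:>5}: at most {bound / 1e12:7.1f} TFLOPS "
              f"({bound / H100_PEAK:.1%} of peak)")
```

The bound is an upper limit only; a real kernel can land well below it for reasons the roofline does not model, such as occupancy or launch overhead.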

What pushes a kernel one way or the other

Two levers, both in the operator’s hands:

Reuse the same operand many times while it’s in fast memory. A GEMM that loads each weight once and uses it M times has intensity proportional to M. Tiling, register blocking, and SRAM-resident accumulators are all expressions of this.
Reduce the bytes read per FLOP. Quantization (fp16 to fp8 to int4) literally divides intensity’s denominator. So does grouped-query attention for attention’s K/V loads.

For autoregressive decode at batch size $b$ over a 70B-parameter model, every decode step loads the entire weight set (~140 GB at fp16) once. Per loaded weight you do roughly $b$ multiply-adds (each weight gets used once per sequence in the batch), which is $2b$ FLOPs for 2 bytes at fp16. Intensity $\approx b$ FLOPs/byte. On an H100 with ridge at 295, you’d need a batch size of about 295 just to leave the memory-bound region, which is well past the KV-cache memory ceiling for typical context lengths.
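The batch-size arithmetic can be written out as a short sketch. It counts weight traffic only (KV-cache and activation bytes are deliberately ignored), and the function names and the `bytes_per_weight` parameter are illustrative assumptions rather than part of any library.

```python
import math


def decode_weight_intensity(batch_size: int, bytes_per_weight: float = 2.0) -> float:
    """FLOPs per weight byte for one decode step (weight traffic only).

    Each weight is loaded once and used in one multiply-add (2 FLOPs)
    per sequence in the batch.
    """
    flops_per_weight = 2.0 * batch_size
    return flops_per_weight / bytes_per_weight


def min_batch_to_reach_ridge(ridge: float, bytes_per_weight: float = 2.0) -> int:
    """Smallest batch size whose weight-only intensity reaches the ridge."""
    return math.ceil(ridge * bytes_per_weight / 2.0)


if __name__ == "__main__":
    # bytes_per_weight: 2.0 for fp16, 1.0 for fp8, 0.5 for int4.
    for name, width in (("fp16", 2.0), ("fp8", 1.0), ("int4", 0.5)):
        print(name, min_batch_to_reach_ridge(295, width))
```

Under these assumptions, narrower weights lower the batch size needed to reach the ridge, which is the second lever (fewer bytes per FLOP) expressed in numbers.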

FlashAttention raises the intensity of the attention block specifically by keeping K, V tiles in SRAM and reusing them across query rows, so its tile-level intensity is much higher than naive attention. But the rest of the model (the linear projections, the FFN) is still bound by HBM weight loads, and that’s where most of the bytes are. The model-level intensity is dominated by the weight-loading term, which is why batching, alongside shrinking the bytes themselves, is the main lever at decode.

The ridge gap is also why FP8 KV-cache quantization can give up to a near-2x decode throughput win on memory-bound serving when KV-cache reads dominate the traffic, while it doesn’t change a thing for compute-bound prefill: cutting bytes by 2x doubles intensity in the regime where intensity is the binding constraint.
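How much an FP8 KV cache helps depends on how much of the decode-step traffic is KV cache rather than weights. The sketch below estimates bytes moved per decode step under loudly stated assumptions: the layer count, KV-head count, and head dimension are illustrative values for a 70B-class grouped-query model and do not come from this article, and in the memory-bound regime throughput is taken to scale inversely with bytes moved.

```python
def decode_step_bytes(batch: int, context_len: int, weight_bytes: float,
                      kv_bytes_per_elem: float, n_layers: int = 80,
                      n_kv_heads: int = 8, head_dim: int = 128) -> float:
    """Approximate bytes read per decode step: all weights plus every KV entry.

    The default model shape is an assumed, illustrative 70B-class GQA config.
    """
    # K and V each store n_kv_heads * head_dim elements per layer per token.
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes_per_elem
    return weight_bytes + batch * context_len * kv_per_token


def kv_fp8_speedup(batch: int, context_len: int, weight_bytes: float = 140e9) -> float:
    """Memory-bound speedup estimate from halving KV-cache bytes (fp16 to fp8)."""
    fp16 = decode_step_bytes(batch, context_len, weight_bytes, 2.0)
    fp8 = decode_step_bytes(batch, context_len, weight_bytes, 1.0)
    return fp16 / fp8


if __name__ == "__main__":
    for batch, ctx in ((1, 1024), (64, 4096), (256, 32768)):
        print(f"batch={batch:>3} ctx={ctx:>5}: {kv_fp8_speedup(batch, ctx):.2f}x")
```

With these assumed shapes the estimate stays near 1x when weights dominate the traffic and approaches 2x only at large batch and long context, where KV reads are most of the bytes.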

The numbers for typical LLM kernels

Order-of-magnitude intensities you should have memorized:

Prefill GEMMs at 4K context, 70B model: intensity in the high hundreds to low thousands. Compute-bound; runs at 60-80 percent of peak FLOPS.
Decode GEMMs at batch 1: intensity ~1. Memory-bound; runs at well under 1 percent of peak FLOPS (about 1/295 on an H100).
Decode GEMMs at batch 32: intensity ~32. Still memory-bound (ridge is ~295 on H100), but roughly 30x the aggregate token throughput of batch 1 at about the same per-step latency.
FlashAttention prefill at long context: intensity 50-200 depending on tile shape. Approaches compute-bound.
Layernorm, softmax, elementwise ops: intensity 1-3. Always memory-bound; only saved by kernel fusion.

Why this is the right mental model

When you hear “memory-bound,” “bandwidth-bound,” or “compute-bound” in performance discussions, they all reduce to position on the roofline. Modern hardware ratios — H100, MI300X, TPU v5 — sit between 200 and 400 FLOPs/byte at the ridge. The numbers shift with each generation but the shape is permanent: there will always be a regime where bandwidth is the binding constraint and a regime where compute is, and the ratio between them is the binding architectural fact about the chip you’re buying. MFU is the natural follow-up — what fraction of the ceiling you actually achieved.

Go further

What's the ridge point on an H100?

Peak FP16 throughput is ~989 TFLOPS; HBM bandwidth is ~3.35 TB/s. Ridge point = 989 / 3.35 ≈ 295 FLOPs per byte. Any kernel below that intensity is bandwidth-bound and cannot exceed (intensity x bandwidth) FLOPS regardless of how cleverly you schedule it. Above it, you're compute-bound and limited by tensor-core throughput. Small-batch decode-step LLM workloads sit at intensity 1-4, deep in the memory-bound region.


How do you compute arithmetic intensity for a GEMM?

For a matmul of shapes (M, K) by (K, N) producing (M, N): FLOPs = 2 x M x N x K. Bytes moved at fp16 = 2 x (M x K + K x N + M x N), counting the reads of both inputs plus the write of the output. Intensity = M x N x K divided by (M x K + K x N + M x N). For square M = N = K = 1024 that's ~341, comfortably compute-bound. For decode-step shapes, M = batch x seqlen = 1, and intensity collapses toward 1.
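The same formula as a small helper, as a sketch: the function name is illustrative, each matrix is assumed to cross the memory bus exactly once (no cache reuse or re-reads), and the 8192-wide decode shapes are example values rather than figures from this article.

```python
def gemm_intensity(m: int, n: int, k: int, bytes_per_elem: float = 2.0) -> float:
    """Arithmetic intensity (FLOPs/byte) of an (m, k) x (k, n) matmul.

    Assumes each of A, B, and the output C moves through memory exactly once.
    """
    flops = 2 * m * n * k  # one multiply and one add per (m, n, k) triple
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


if __name__ == "__main__":
    print(f"square 1024:        {gemm_intensity(1024, 1024, 1024):.1f}")
    print(f"decode, batch 1:    {gemm_intensity(1, 8192, 8192):.2f}")
    print(f"decode, batch 32:   {gemm_intensity(32, 8192, 8192):.2f}")
```

Passing `bytes_per_elem=1.0` models fp8 operands and doubles the result, since only the denominator changes.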


Is LLM serving permanently memory-bound?

Decode is. Prefill isn't: it processes hundreds to thousands of tokens in parallel and hits compute-bound territory at intensity 100-plus. At small batch sizes decode runs at intensity 1-4, and the attention portion stays low even with batching, because each sequence reads its own KV cache and those reads grow with sequence length. The only way to push the decode GEMMs toward the ridge is batch size: M increases linearly with concurrent sequences, intensity rises with it, until KV-cache memory caps further batching.

