Kernel Fusion

Also known as: op fusion, operator fusion, fused kernels

TL;DR

Combining multiple GPU operations into a single CUDA kernel call so that intermediate tensors live in registers or shared memory instead of round-tripping through HBM.

Kernel fusion combines what would otherwise be a sequence of GPU kernel launches into a single fused kernel that does all of the work without round-tripping intermediate tensors through HBM. It is the single most important class of GPU performance optimization for LLM workloads, and the principle that explains why FlashAttention is fast, why torch.compile beats eager-mode PyTorch, and why hand-rolled CUDA / Triton kernels still earn their keep in production inference engines.

The premise is mechanical: each kernel launch on a GPU loads its inputs from HBM (GPU main memory), does its compute, and writes its output back to HBM. HBM is roughly an order of magnitude slower than on-chip SRAM and registers. Modern GPUs deliver several hundred FLOPS for every byte loaded from HBM — meaning any compute pattern that produces small intermediate tensors (elementwise ops, small reductions, layer-by-layer activations) is memory-bound, not compute-bound. The compute units sit idle waiting on memory.

Fusion fixes that by keeping intermediate values on-chip across multiple “logical” operations.

What fusion looks like in practice

Take the unfused sequence for a SwiGLU-style FFN block:

gate = linear_gate(x) — matmul, write to HBM.
up = linear_up(x) — matmul, write to HBM.
act = silu(gate) — elementwise, read gate from HBM, write to HBM.
mul = act * up — elementwise, read both from HBM, write to HBM.
out = linear_down(mul) — matmul, read mul from HBM, write to HBM.

Five kernels, five round-trips. Two of those, the elementwise SiLU and the multiply, are pure memory traffic with negligible compute. The same is true of the bias add in the down-projection if it runs as its own kernel rather than inside the matmul.

A fused SwiGLU kernel does this:

Compute gate = linear_gate(x) and up = linear_up(x) into SRAM tiles.
While the tile is still in SRAM, apply silu(gate) * up.
Stream the result directly into linear_down.

One kernel, one round-trip. Same math, 1.5-2x faster, and the intermediates never touch HBM. Production transformer kernels — Triton-based fused FFNs, fused RMSNorm-residual-attention blocks, NVIDIA TransformerEngine — are all variations on this idea.
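
To make the round-trip count concrete, here is a rough HBM traffic model for the unfused and fused SwiGLU sequences. It is a sketch under stated assumptions, not a measurement: bf16 storage, no cache reuse between kernels, and an idealized fused kernel that reads x and the three weight matrices once and writes only the final output. The function names and the example shapes are hypothetical.

```python
BYTES_PER_ELEM = 2  # assumption: bf16 activations and weights


def unfused_swiglu_bytes(tokens, d_model, d_ff):
    """HBM bytes moved when each of the five steps runs as its own kernel."""
    x = tokens * d_model * BYTES_PER_ELEM   # input (and output) activation
    h = tokens * d_ff * BYTES_PER_ELEM      # one (tokens, d_ff) intermediate
    w = d_model * d_ff * BYTES_PER_ELEM     # one projection weight matrix
    gate = x + w + h      # read x and W_gate, write gate
    up = x + w + h        # read x and W_up, write up
    act = h + h           # read gate, write act
    mul = h + h + h       # read act and up, write mul
    down = h + w + x      # read mul and W_down, write out
    return gate + up + act + mul + down


def fused_swiglu_bytes(tokens, d_model, d_ff):
    """Idealized fused kernel: intermediates stay on-chip."""
    x = tokens * d_model * BYTES_PER_ELEM
    w = d_model * d_ff * BYTES_PER_ELEM
    return x + 3 * w + x  # read x once, read three weights, write out


if __name__ == "__main__":
    shape = (4096, 4096, 11008)  # hypothetical tokens, d_model, d_ff
    unfused = unfused_swiglu_bytes(*shape)
    fused = fused_swiglu_bytes(*shape)
    print(f"unfused: {unfused / 1e9:.2f} GB, fused: {fused / 1e9:.2f} GB")
```

The gap between the two numbers is entirely intermediate traffic: the eight reads and writes of (tokens, d_ff) tensors plus one redundant read of x. Real kernels rarely fuse all three matmuls as completely as this model assumes, so treat the fused figure as a lower bound.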

Why it matters in inference

Every GPU has a roofline defined by two numbers: peak FLOPS (compute ceiling) and peak HBM bandwidth (memory ceiling). For an op with arithmetic intensity I (FLOPs per byte loaded), the achievable throughput is min(peak_FLOPS, I * peak_bandwidth).

On an H100 (~990 TFLOPS bf16, ~3.35 TB/s HBM), the crossover point is peak_FLOPS / peak_bandwidth, or roughly 295 FLOPs per byte. Anything below that is memory-bound; anything above is compute-bound.
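
The roofline arithmetic can be checked in a few lines. This is a minimal sketch using the H100 figures quoted in the text; the function and constant names are illustrative, not from any library.

```python
H100_PEAK_FLOPS = 990e12  # ~990 TFLOPS bf16, as quoted above
H100_HBM_BW = 3.35e12     # ~3.35 TB/s HBM, as quoted above


def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline model: throughput is capped by compute or by memory."""
    return min(peak_flops, intensity * peak_bandwidth)


def crossover_intensity(peak_flops, peak_bandwidth):
    """Arithmetic intensity (FLOPs per byte) where the two ceilings meet."""
    return peak_flops / peak_bandwidth


if __name__ == "__main__":
    ridge = crossover_intensity(H100_PEAK_FLOPS, H100_HBM_BW)
    print(f"crossover: {ridge:.1f} FLOPs per byte")
    for intensity in (0.25, 1, 100, 1000):
        rate = attainable_flops(intensity, H100_PEAK_FLOPS, H100_HBM_BW)
        print(f"I = {intensity}: {rate / H100_PEAK_FLOPS:.2%} of peak")
```

Below the crossover the attainable rate scales linearly with intensity, which is why raising intensity through fusion translates directly into throughput for memory-bound ops.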

Plain elementwise ops have I of about 1 FLOP per byte or less. Layer norms, softmaxes, biases: all I = O(1). They are catastrophically memory-bound on their own. A standalone elementwise kernel on H100 runs at roughly 1% of peak FLOPS.

Matmuls are different: a matrix-matrix multiply of dimension N x N has I = O(N), well above the crossover for any reasonable shape. The matmul-heavy steps (attention QK, FFN linears) hit the compute ceiling. The elementwise glue between them does not.
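
The contrast between elementwise ops and matmuls can be worked out by counting FLOPs and bytes. The sketch below assumes bf16 (2 bytes per element), that every operand is read from HBM exactly once, and that the result is written once; the helper names are hypothetical.

```python
def elementwise_intensity(flops_per_elem=1, n_inputs=1, bytes_per_elem=2):
    """FLOPs per HBM byte for an elementwise op with one output tensor."""
    bytes_moved = (n_inputs + 1) * bytes_per_elem  # read inputs, write output
    return flops_per_elem / bytes_moved


def square_matmul_intensity(n, bytes_per_elem=2):
    """FLOPs per HBM byte for an (n x n) @ (n x n) matmul."""
    flops = 2 * n ** 3                        # one multiply and one add per term
    bytes_moved = 3 * n * n * bytes_per_elem  # read A and B, write C
    return flops / bytes_moved


if __name__ == "__main__":
    print("unary elementwise:", elementwise_intensity())
    print("binary elementwise:", elementwise_intensity(n_inputs=2))
    for n in (256, 1024, 4096):
        print(f"matmul n={n}:", round(square_matmul_intensity(n), 1))
```

Under these assumptions the matmul intensity simplifies to n / 3 FLOPs per byte, so it grows linearly with the matrix dimension while the elementwise figure stays fixed at a fraction of a FLOP per byte.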

Kernel fusion’s purpose is to “absorb” memory-bound elementwise ops into the surrounding compute-bound matmuls. The matmul is going to load and use those bytes anyway; doing the SiLU / RMSNorm / bias-add on the values while they are still in SRAM is essentially free. This is why every modern transformer kernel is a fused matmul-plus-everything-elementwise.

In LLM serving, where every microsecond of latency matters, fusion is the table-stakes optimization. vLLM, TensorRT-LLM, and SGLang all ship fused attention, fused FFN, and fused norm-and-residual kernels as their defaults. Naive PyTorch attention vs FlashAttention is the headline 2-4x; fused vs unfused FFN is another 1.5x; fused RMSNorm-residual-attention boundaries are another 10-20%. These compose multiplicatively.

Where fusion wins (and where it does not)

Fusion patterns that pay off

Elementwise + matmul. Fused into the matmul epilogue (bias, activation, dropout). Standard pattern.
Attention as a single kernel. FlashAttention. Avoids materializing the N x N attention matrix.
Norm + residual + projection. Common in transformer blocks; saves 1-2 HBM round-trips per layer.
Optimizer steps. Adam's per-parameter m, v, and param updates fused into one kernel (Apex FusedAdam, PyTorch's fused=True optimizers), which saves multiple HBM passes during training.
MoE routing + grouped GEMM. Fused gating + dispatch in modern MoE kernels.

Fusion has limits. You cannot fuse across two large matmuls if the intermediate is itself big — there is not enough SRAM to hold it. You cannot fuse if the second op needs the full output of the first (e.g., a softmax needs the full row sum). FlashAttention’s online softmax is exactly the trick that made attention fusable — and it took years of research to find.
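
The online softmax idea can be illustrated in plain Python. This is a simplified sketch of the running-max and running-sum rescaling only, not FlashAttention itself, which also rescales a partial output accumulator and runs tile by tile on the GPU. The function names and block size are illustrative.

```python
import math


def softmax_two_pass(row):
    """Reference softmax: needs the whole row before producing any output."""
    row_max = max(row)
    exps = [math.exp(v - row_max) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def online_softmax_stats(row, block_size):
    """Stream over the row in blocks, keeping only a running max and sum."""
    running_max = float("-inf")
    running_sum = 0.0
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        new_max = max(running_max, max(block))
        # Rescale the old sum to the new max, then add this block's terms.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(v - new_max) for v in block)
        running_max = new_max
    return running_max, running_sum


def softmax_online(row, block_size=4):
    """Softmax computed from the streamed statistics."""
    row_max, total = online_softmax_stats(row, block_size)
    return [math.exp(v - row_max) / total for v in row]


if __name__ == "__main__":
    scores = [0.5, -1.0, 3.0, 2.0, 7.5, 0.0, -4.0, 1.0, 6.0]
    reference = softmax_two_pass(scores)
    streamed = softmax_online(scores, block_size=4)
    print(max(abs(a - b) for a, b in zip(reference, streamed)))
```

Because the normalizer can be corrected after the fact, a kernel never needs the full row resident at once, which is what removes the "needs the full output of the first op" obstacle for attention.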

How it composes with graph compilation

Hand-writing fused kernels is expensive — Triton, CUDA, careful tile-size tuning. Most teams do not. The dominant production path is to lean on graph compilation (torch.compile, TensorRT-LLM, JAX jit) to fuse for you, and write hand-rolled kernels only for the few hot paths where the compiler leaves significant performance on the table. That covers ~80% of the throughput wins in practice; the remaining ~20% is what FlashAttention / vLLM / TensorRT-LLM make their living on.

Kernel fusion is the lens through which “performance engineering” becomes a coherent discipline rather than a bag of tricks. Every other optimization in the perf-eng topic — graph compilation, mixed precision, FlashAttention, fused optimizers — is, structurally, a different way to fuse more work into fewer HBM round-trips.

Go further

What's the arithmetic-intensity intuition?

Modern GPUs do roughly 200-300 FLOPS per byte loaded from HBM. Any op with arithmetic intensity below that ratio is memory-bound: the GPU's compute units sit idle waiting on memory. Fusion raises arithmetic intensity by reusing each loaded value across more computations before it is evicted from on-chip memory. A fused (load, multiply, add, store) is one HBM round-trip; the unfused version is two, one per kernel, because the multiply's result is written to HBM and read back by the add. Same total compute, twice the memory traffic, and every additional elementwise op in the chain adds another round-trip.

Why doesn't naive PyTorch fuse automatically?

Eager mode PyTorch executes each op as a separate CUDA kernel launch; that is what gives it Python-level interactivity and immediate error messages. Each kernel launch reads inputs from HBM, writes outputs to HBM, and has fixed launch overhead (~5 microseconds). torch.compile, with TorchInductor as its default backend, exists to fuse where eager mode cannot. CUDA Graphs attack the launch overhead instead: they replay a captured sequence of kernels with one submission but do not merge the kernels themselves. The trade is debug-friendliness for throughput.
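
A back-of-the-envelope cost model shows how launch overhead and HBM traffic add up for a chain of elementwise ops. It is a sketch only: it assumes the ~5 microsecond launch overhead and ~3.35 TB/s bandwidth quoted in this article, unary ops on bf16 data, and no overlap between launches and memory transfers. The names are hypothetical.

```python
LAUNCH_OVERHEAD_S = 5e-6  # assumption: ~5 microseconds per kernel launch
HBM_BANDWIDTH = 3.35e12   # assumption: ~3.35 TB/s, the H100 figure


def elementwise_chain_time(n_elems, n_ops, fused, bytes_per_elem=2):
    """Estimated seconds for a chain of n_ops unary elementwise ops."""
    bytes_per_pass = 2 * n_elems * bytes_per_elem  # read input, write output
    kernels = 1 if fused else n_ops                # fused chain is one kernel
    return kernels * (LAUNCH_OVERHEAD_S + bytes_per_pass / HBM_BANDWIDTH)


if __name__ == "__main__":
    for n_elems in (10_000, 1_000_000, 100_000_000):
        eager = elementwise_chain_time(n_elems, n_ops=3, fused=False)
        fused = elementwise_chain_time(n_elems, n_ops=3, fused=True)
        print(f"{n_elems:>11} elems: eager {eager * 1e6:8.1f} us, "
              f"fused {fused * 1e6:8.1f} us")
```

In this model small tensors are dominated by launch overhead and large ones by HBM traffic; fusing the chain removes both the extra launches and the extra passes over memory.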

What is the canonical example of fusion paying off?

FlashAttention. Naive attention runs softmax, scaling, and the QK / SV matmuls as separate kernels with the N x N matrix materialized in HBM between them. FlashAttention fuses all of it into a single tiled kernel in which the full N x N matrix is never materialized; only tiles of it ever exist, and those stay in SRAM. Same math, 2-4x faster, and attention memory grows linearly rather than quadratically with sequence length, which makes much longer contexts feasible. Every other fusion follows the same pattern with smaller payoffs.
