GPU kernel authoring is the practice of writing custom GPU code — in CUDA C++ , Triton, CUTLASS, or one of the newer DSLs — when the standard library kernels (cuBLAS, cuDNN, the operators that ship with PyTorch) are leaving meaningful performance on the table. It is the discipline that turned FlashAttention from a 2-line algorithmic equivalence into a 2-4x speedup that reshaped what was algorithmically feasible at long context. It is also, for most teams, the wrong first instinct: a profiler-first methodology and a clear-eyed cost-benefit analysis come before the editor.
When to write a kernel (and when not to)
The default answer is “don’t.” torch.compile, FlashAttention, vLLM’s PagedAttention, and the standard fused-op libraries cover most of what production ML serving needs. Writing your own kernel costs weeks of engineering time, locks you to a hardware generation (a kernel tuned for H100 will underperform on B200), and creates a maintenance burden that lasts as long as the kernel does.
You write a kernel when three conditions all hold:
The three-condition test
There is room above the floor. Profiler shows the op is at, say, 30% of HBM-bandwidth peak when the workload should be memory-bound — there’s a 3x ceiling sitting unused.
The op is hot. It’s in the top 10% of end-to-end runtime, or it’s blocking a tail-latency target. A 3x speedup on something that’s 1% of runtime is 0.3% — not worth the kernel.
The shape is stable. You’re going to run this kernel on the same shape (or a small set of shapes) for long enough to amortize autotuning. Kernels for one-off experimental shapes are negative-ROI.
When all three hold — attention kernels at long context, MoE expert dispatch, custom quantization formats, novel collectives — the payoff is real and durable. When one fails, lean on torch.compile and a library.
The DSL landscape in 2026
The trade between “control” and “productivity” has been re-cut several times.
Raw CUDA C++. Maximum control. Warp-level primitives, async-copy choreography, tensor-core mma instructions, register-level reuse. The path NVIDIA itself uses for cuBLAS / cuDNN / FlashAttention 3. Slow to write.
CUTLASS. A C++ template library for composing GEMM-shaped kernels from tiles, epilogues, and schedules. The path of choice when you need tensor-core peak on a non-standard matmul shape.
Triton. A Python-embedded DSL where you write tile-level code; the compiler handles thread mapping, coalescing, and shared-memory tiling. The dominant choice for new ML kernels in 2026 — torch.compile emits Triton, and most published “we wrote a custom kernel” papers ship Triton.
ThunderKittens, Mosaic, Mojo, etc. Newer entrants targeting the shapes Triton handles awkwardly (low-rank, irregular, async-pipelined). Smaller communities; sharper for the shapes they’re designed for.
The practical default: write it in Triton; drop to CUTLASS when you need tensor-core peak on a custom matmul; drop to raw CUDA C++ when Triton can’t express what you need (warp-specialized async pipelines, novel synchronization patterns).
The profiling-first methodology
Optimizing a kernel without a profiler is guessing. The discipline is:
Measure first. Run the workload under Nsight Compute (ncu) or Nsight Systems (nsys). Read the roofline page. The profiler tells you the binding constraint: memory bandwidth, tensor-core throughput, occupancy, branch divergence, shared-memory bank conflicts. Optimizing the wrong constraint produces no speedup.
Compute a ceiling. What’s the theoretical best this op could achieve? For a memory-bound op: bytes moved divided by HBM bandwidth. For a compute-bound op: FLOPs divided by tensor-core peak. If the current implementation is already at 80% of ceiling, stop — there’s nothing left to win without changing the algorithm.
Walk the optimization ladder. Each step has a profiler signature that confirms it landed.
For a kernel that does enough compute to plausibly engage tensor cores (matmul, attention, conv), the steps look roughly like this:
Naive. One thread per output element. Coalescing accidental. Typically 1-5% of peak. Useful as a correctness baseline.
Coalesced. Threads in a warp read consecutive memory. Profiler confirms one HBM transaction per warp load. Often 5-10x over naive. The single biggest single fix.
Tiled with shared memory. Load a block of inputs into shared memory once; have many threads reuse it. The arithmetic intensity rises from to . Profiler shows reduced HBM traffic per FLOP. Another 2-5x.
Tensor-core engaged. Replace the inner loop with wmma::mma_sync or mma.sync PTX. Operand fragments live in registers, accumulators in registers, only the tile boundary touches shared memory. This is where you cross from “fp16 GEMM at 30 TFLOPS” to “fp16 GEMM at 600 TFLOPS” on H100.
Software-pipelined. Use cp.async to issue the next tile’s HBM load while the current tile’s compute is in flight. The SM never stalls waiting on memory. Worth 1.5-2x once the kernel is otherwise well-tuned.
Autotuned. Sweep tile sizes, num warps, num pipeline stages, swizzle patterns. The best configuration depends on input shape, GPU generation, and how much L2 hot-data leaks across launches. Triton’s @triton.autotune decorator handles this; CUTLASS has cute::layout pickers; CUDA C++ usually rolls its own.
Skipping a step almost always means optimizing the wrong constraint. A kernel that hasn’t been coalesced and is “compute-bound” in the profiler is lying — the SMs are stalled on memory, but the profiler attributes the time to the cycle the math instruction issued, not to the wait that preceded it.
FlashAttention as the case study
The canonical example of why kernel authoring still matters is FlashAttention . Pre-FlashAttention, attention was implemented as three sequential kernels (QK matmul → softmax → V matmul) with the attention matrix materialized in HBM between them. At , that’s a 128 MB intermediate per head per layer, written and re-read from HBM. Attention was bandwidth-bound at roughly 30% of the roofline.
The FlashAttention insight: the attention matrix doesn’t have to exist. Tile and , compute partial dot products in SRAM, run an online softmax that maintains a running max and sum, and stream through to accumulate the output — never materializing the full . Same math, memory instead of , and arithmetic intensity rises into the compute-bound regime.
The result was 2-4x faster training and, more importantly, unbounded sequence length without quadratic memory. Long-context inference (32K, 128K, 1M tokens) is essentially impossible on real hardware without it. The algorithm hadn’t changed since 2017; reorganizing when each tensor lived in which memory was worth a multiplicative factor on the entire field.
The lesson generalizes. Most ML workloads are memory-bound, which means there’s almost always a kernel rewrite that moves a working set from HBM into SRAM and unlocks a multiplicative speedup. The bottleneck is finding it — knowing the algorithm well enough to spot the equivalent reformulation, and knowing the hardware well enough to know it’ll pay off.
Where the work actually goes
In practice, kernel authoring is 20% writing the kernel and 80% making it correct, fast, and stable. Numerical equivalence to the reference (within fp16 reorder-rounding) has to be tested on every shape. Performance has to be validated across input distributions, not just the benchmark shape. Autotuning has to be cached and invalidated correctly on driver upgrades. The kernel has to be exposed through a stable Python API that survives PyTorch version bumps.
Kernel authoring sits at the intersection of CUDA , fusion , arithmetic intensity , and mixed precision . It is rarely the right starting point for a performance problem — but for the few hot paths where it is, it remains the highest-leverage optimization in the catalogue.
Go further
When is it worth writing a custom kernel instead of using a library?
Three conditions, all of which need to hold: (1) a profiler shows the existing op is below its roofline ceiling, with measurable headroom; (2) the op is in the hot path — top 10% of total runtime, or you're chasing a tail-latency target; (3) the shape is stable enough that an autotuned, shape-specialized kernel will pay back its development cost. If any one fails, the answer is torch.compile and a library. The rare cases where all three hold — attention at long context, MoE expert dispatch, custom collectives, novel quantization formats — are where production teams still hand-write kernels.
Why has Triton become the default for new kernels?
Triton is a Python-embedded DSL where you write tile-level code — operations on blocks of values — and the compiler handles the thread-to-data mapping, coalescing, shared-memory allocation, and software pipelining. You give up some of the lowest-level control of CUDA C++ (warp specialization, async copy choreography), but you get a ~10x reduction in lines of code for a kernel that's typically within 10-20% of a hand-tuned CUDA version. For most ML workloads that's a winning trade. CUTLASS still wins for GEMM-shaped tensor-core peak; ThunderKittens targets the shapes neither cleanly handles.
What does the canonical kernel optimization journey look like?
Roughly: (1) write the naive kernel, one thread per output element, measure it. (2) Fix coalescing — make sure consecutive threads read consecutive memory. Often a 5-10x win. (3) Tile through shared memory — load a block once, reuse it across many threads. Another 2-5x. (4) Engage tensor cores via wmma or mma instructions on aligned 16x16 tiles. Big jump for matmul-shaped work. (5) Software pipelining — overlap async loads with compute so the SM never stalls on memory. (6) Autotune over tile sizes, num warps, num stages. Each step has its own Nsight metric to validate. Skipping a step usually means you're optimizing the wrong constraint.