NVIDIA's parallel-computing platform for GPUs. A C++ extension plus a runtime that exposes the GPU as a SIMT machine: thousands of threads grouped into warps, blocks, and grids, with explicit control over a tiered memory hierarchy.
CUDA is NVIDIA’s parallel-computing platform: a C++ language extension, a compiler (nvcc), a runtime, and a set of libraries (cuBLAS, cuDNN, NCCL) that together expose the GPU as a programmable parallel processor. Every PyTorch tensor operation, every Triton kernel, every JAX jit eventually compiles to CUDA’s intermediate representation (PTX) and runs as a CUDA kernel on the device. Understanding CUDA is the foundation under kernel authoring , fusion , and most performance debugging on NVIDIA hardware.
The execution model: SIMT
A GPU is not a CPU with more cores. CPUs run a small number of independent threads at full speed each; GPUs run thousands of threads at modest speed each, in lockstep groups. CUDA’s programming model formalizes this as SIMT — single-instruction, multiple-thread.
You launch a grid of blocks of threads. Threads within a block can synchronize and share fast on-chip memory; blocks are independent. Each block is scheduled onto a streaming multiprocessor (SM) and executed as warps of 32 threads. Every thread in a warp runs the same instruction at the same cycle — branching where threads disagree (warp divergence) serializes the paths.
The hierarchy at a glance
Thread. The unit of execution. Has its own registers and program counter. Indexed by threadIdx.
Warp. 32 threads executing in lockstep. Not a CUDA-language concept directly — you can’t declare warp size — but every performance question routes through it.
Block. A group of up to 1024 threads scheduled together on one SM. Threads in a block share __shared__ memory and can call __syncthreads(). Indexed by blockIdx.
Grid. The full launch — many blocks, scheduled across all SMs. Blocks cannot synchronize with each other (without device-wide barriers, which are expensive).
A typical kernel launch looks like:
__global__ void add(float* a, float* b, float* c, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) c[i] = a[i] + b[i];}add<<<gridDim, blockDim>>>(a, b, c, n);
The triple-bracket syntax is CUDA’s launch operator. Each thread computes one output element; the grid is sized to cover all n elements with some tail handling.
The memory spaces
CUDA exposes the GPU memory hierarchy as named address spaces, and the __global__, __shared__, __constant__ keywords let you place data deliberately.
Registers. Per-thread, single-cycle. The compiler allocates them; you influence count via the unroll factor and the live-variable count. Spilling to local memory (which is HBM, despite the name) is a perf cliff.
Shared memory. Per-block, programmer-managed SRAM. ~228 KB per SM on H100. Used for tiles, reductions, lookup tables. Bank conflicts are the failure mode.
Global memory. HBM. The big slow pool every kernel reads from and writes to. Coalesced accesses (consecutive threads → consecutive addresses) are mandatory for performance.
Constant memory. A small (64 KB) cached read-only space. Useful when every thread reads the same value at the same time.
Texture / surface memory. Specialized cached read paths for spatial-locality access patterns. Rarely relevant in modern ML.
HBM is accessed in 32-, 64-, or 128-byte transactions. When a warp issues a load, the memory subsystem looks at the 32 addresses requested and decides how many transactions are required to satisfy them. If the 32 addresses fall inside one 128-byte cache line — i.e., consecutive threads load consecutive 4-byte floats from a 128-byte-aligned base — the entire warp’s load is one transaction. If the addresses are scattered, you can pay up to 32 transactions for the same 32 floats.
The bandwidth difference is exactly that 32x. A coalesced kernel hits memory-bound peak; an uncoalesced kernel runs at 3% of HBM bandwidth and looks “compute-bound” in profilers because the SMs are simply waiting on memory. The fix is not a faster algorithm — it’s restructuring the access pattern (transposing data, padding rows to avoid bank conflicts, swapping threadIdx.x over the inner dimension) so that the warp’s natural stride matches the cache line.
This is why CUDA kernels are written “thread-per-output-element” with the output’s contiguous dimension on threadIdx.x. It is not a stylistic choice; it is the only access pattern that achieves bandwidth.
Occupancy and divergence
Two SIMT-specific failure modes show up constantly.
Occupancy is the ratio of active warps on an SM to the maximum it can support. It’s bounded by registers per thread, shared-memory per block, and threads per block — whichever runs out first. Low occupancy means the SM has too few warps to hide HBM latency: when one warp stalls on a load, there’s no other warp ready to run, and the SM idles. The fix is usually reducing per-thread register pressure (don’t unroll too aggressively) or shrinking shared-memory tiles.
Warp divergence happens when threads in the same warp take different branches. The hardware serializes the paths — each thread executes its branch with the others masked off — so a fully divergent if/else runs at half speed; nested divergence compounds. The fix is reorganizing data so threads in a warp share control flow: sort inputs by branch outcome, pad to uniform shapes, or hoist the branch outside the kernel.
Streams, events, and synchronization
CUDA is asynchronous by default. A kernel launch returns immediately to the host; the GPU work queues up on a stream and runs in order. Different streams can run concurrently — overlapping compute with a host-to-device memcpy, or running two independent kernels in parallel — which is how production inference engines pipeline prefill and decode.
Synchronization comes in flavors: cudaDeviceSynchronize() blocks the host until all GPU work finishes (heavy hammer); cudaStreamSynchronize(stream) blocks until one stream drains; CUDA events let you record fine-grained timing or set up cross-stream dependencies. Every “the GPU disagrees with what I just wrote” bug is a missing synchronization.
Where CUDA sits in 2026
Raw CUDA C++ is no longer the only path to a GPU kernel. The 2026 stack looks more like:
PyTorch eager mode — calls into prebuilt cuDNN / cuBLAS / NCCL kernels under the hood. 90% of users never see CUDA.
torch.compile + Inductor — emits Triton kernels that compile to PTX. Covers most fusion wins automatically.
Triton — a Python-embedded DSL where you write tile-level code and the compiler handles thread mapping, coalescing, and shared-memory allocation. The dominant choice for new ML kernels.
CUTLASS — NVIDIA’s C++ template library for GEMM-shaped kernels. The path of choice when you need tensor-core peak.
Raw CUDA C++ — still where the deepest optimization happens (FlashAttention 3, custom collectives, the kernels Triton can’t yet express).
Higher-level DSLs raise the floor; raw CUDA still defines the ceiling. Every Triton kernel emits PTX; every PTX runs on the same SIMT hardware described here. Triton hides the indexing math, but it does not hide warp divergence, occupancy, or coalescing — it just reframes them. Engineers who skip CUDA and learn only Triton hit a ceiling the first time their kernel disagrees with the profiler about why it’s slow.
CUDA is the substrate. Everything else in performance engineering — fusion, mixed precision, FlashAttention, tensor-core utilization — is, mechanically, a question about which threads, which memory, which warp, which SM. Learn the model once and the rest of the catalogue stops being a bag of tricks and starts being a coherent story about parallelism on real silicon.
Go further
What does SIMT mean and how is it different from SIMD?
SIMD (single-instruction multiple-data) executes one instruction across a fixed-width vector register — AVX-512 does 16 float32 lanes per instruction. SIMT (single-instruction multiple-thread) is NVIDIA's spin: the same instruction issues across 32 threads called a warp, but each thread has its own register file and program counter. The practical difference is that warps tolerate divergent control flow — an if/else where some threads take the branch and others don't — by serializing the divergent paths and masking inactive lanes. SIMD would simply not let you write that code. The cost is real: a fully divergent warp runs at 1/32 of peak.
Why is coalesced memory access more important than algorithmic cleverness?
When a warp issues a load, the hardware tries to coalesce the 32 thread-level addresses into the minimum number of HBM transactions — ideally a single 128-byte burst. If thread reads address , the load is one transaction. If thread reads base + i * 1024, it's 32 separate transactions and the kernel runs ~30x slower. Most naive CUDA code is fast enough algorithmically and slow because of access pattern. The senior-engineer move on a slow kernel is check coalescing first, optimize the algorithm second.
Should I still learn raw CUDA C++ in 2026 when Triton exists?
For most ML engineers, no — Triton, CUTLASS templates, and torch.compile cover 95% of the wins with a fraction of the effort. But CUDA remains the lingua franca: every higher-level DSL eventually emits PTX, every error message you read references CUDA primitives, and the few kernels that genuinely need hand-tuning (FlashAttention, MoE dispatch, custom collectives) are still written in CUDA C++. Knowing enough CUDA to read a kernel and reason about warp behavior is closer to mandatory than optional for performance work.