PyTorch Internals

Q: What's the dispatcher and why does it matter?

The dispatcher is the C++ machinery that turns a + b into the correct kernel call. When you write a + b, PyTorch computes a set of dispatch keys from the inputs (backend and layout keys such as CPU, CUDA, or SparseCPU, plus functionality keys such as Autograd and Autocast) and looks up the kernel registered for that operator under the highest-priority key. For ordinary dense tensors the dtype is not itself a dispatch key: the selected backend kernel, say the CUDA add, branches on dtype internally to reach the fp16 or int8 implementation. The dispatcher is also where autograd hooks itself in: there's a separate dispatch key for Autograd that wraps the kernel call in tape-recording machinery before delegating to the actual backend. Custom ops, vmap, functorch transforms, and meta tensors all extend the dispatcher with new keys or new registrations. If you understand the dispatch table, you can read the stack trace of a PyTorch op and work out which kernel ran.

Q: What does torch.compile actually do?

torch.compile replaces eager-mode op-by-op execution with graph-capture-then-optimize. TorchDynamo intercepts Python bytecode at function-call boundaries and traces through the function, building an FX graph of the tensor operations while specializing on shapes and Python control flow. That FX graph is handed to a backend (by default Inductor), which lowers it to Triton kernels (for GPU) or C++ kernels (for CPU), fusing pointwise ops, scheduling reductions, and inserting memory-layout transforms. The first call is slow (compilation); subsequent calls with the same shape signatures hit the cached kernel and are often 30-100% faster than eager, though the gain depends on the model and hardware. The trade is debug-friendliness: you lose the per-op stack trace and gain a graph that's harder to introspect.

Q: How does autograd's tape work in practice?

During the forward pass, every op on a tensor with requires_grad=True appends a node to the autograd graph: the node holds a reference to the inputs (or the saved subset needed for the backward) and a function that computes the gradient with respect to each input. The graph is dynamic — it's rebuilt every forward pass — which is what enables data-dependent control flow without a graph compiler. Calling .backward() traverses the graph in reverse topological order, calls each node's gradient function, and accumulates gradients into the leaf .grad attributes. Memory-wise, the saved tensors dominate: every intermediate that's needed for the backward stays alive until backward completes, which is why activations are the dominant memory cost during training and why gradient checkpointing is so useful.

Also known as: PyTorch, autograd internals, torch.compile

TL;DR

How PyTorch actually executes a forward pass: a torch.Tensor is a thin Python object wrapping a storage, view, dtype, and device; every op routes through a C++ dispatcher that picks the right backend kernel for the (dtype, device, layout) tuple.

PyTorch is the dominant ML framework not because it’s the fastest (on any given benchmark some other stack may well win) but because it’s the most legible. A torch.Tensor is a thin Python wrapper over a clearly-bounded set of C++ structures; every op routes through a dispatcher you can introspect; the autograd graph is built and walked in front of you at runtime. Once you understand how a single forward pass executes end-to-end, you can reason your way through most training bugs instead of guessing with print statements.

This article walks the stack from the Python torch.Tensor down through the dispatcher, into the kernels, through autograd, and out into torch.compile’s graph capture. It is the load-bearing mental model for everything else in the performance-engineering catalogue.

What a `torch.Tensor` actually is

A torch.Tensor is a Python object that holds essentially four things:

A Storage. A flat 1D buffer of bytes — on CPU or GPU memory — that holds the actual numeric data. Multiple tensors can share a storage (a slice or a transpose creates a new view, not a new buffer).
A view. Shape, strides, and an offset into the storage. The strides tell kernels how to walk the storage to produce the logical multi-dimensional tensor; the offset says where the view starts.
A dtype. fp32, fp16, bf16, int8, etc. Determines how the bytes are interpreted.
A device. cpu, cuda:0, mps, xla, etc.
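The four components can be inspected directly from Python. The following is a minimal sketch, assuming PyTorch 2.x (for untyped_storage()) and a CPU-only machine; the tensor names are illustrative.

```python
import torch

# A 3x4 float32 tensor on CPU, backed by one flat storage of 12 elements.
a = torch.arange(12, dtype=torch.float32).reshape(3, 4)
row = a[1]  # indexing returns a view, not a copy

# The view: shape, strides (counted in elements), and offset into the storage.
print(a.shape, a.stride(), a.storage_offset())        # torch.Size([3, 4]) (4, 1) 0
print(row.shape, row.stride(), row.storage_offset())  # torch.Size([4]) (1,) 4

# dtype and device are plain attributes.
print(a.dtype, a.device)

# Both tensors point at the same underlying buffer...
shares_storage = (
    a.untyped_storage().data_ptr() == row.untyped_storage().data_ptr()
)

# ...so a write through the view is visible through the original.
row[0] = 100.0
print(shares_storage, a[1, 0].item())
```

The second row starts four elements into the buffer, which is exactly what the offset of 4 records.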

Why this decomposition matters

a.transpose(0, 1) doesn’t copy memory. It returns a new tensor with the same storage and swapped strides. O(1).
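a.contiguous() copies when a is not already contiguous. It allocates a new storage with stride pattern matching the shape’s natural row-major order, which is O(n); on an already-contiguous tensor it returns the tensor itself.
a.view(...) requires the new shape to be expressible over the existing strides without moving data. If it isn’t, view raises an error; call .contiguous() first, or use .reshape(...), which copies only when it has to.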
a.to('cuda') allocates a new storage on the GPU and copies. The view, dtype, and shape come along.
a + b allocates a new storage for the result; the inputs are unchanged. (a += b mutates in place.)
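Each of the behaviours listed above can be checked by comparing data pointers and strides. This is a small sketch on CPU tensors; the variable names are illustrative and the pointer comparisons are just one convenient way to detect a copy.

```python
import torch

a = torch.arange(6, dtype=torch.float32).reshape(2, 3)

# transpose: same buffer, swapped strides, no copy.
t = a.transpose(0, 1)
transpose_is_view = t.data_ptr() == a.data_ptr()
print(t.stride(), t.is_contiguous())

# contiguous: a fresh row-major buffer, because t is not contiguous.
c = t.contiguous()
contiguous_copied = c.data_ptr() != a.data_ptr()

# view: t cannot be flattened without moving data, so this raises.
try:
    t.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True

# reshape falls back to a copy when a view is impossible.
flat = t.reshape(6)

# a + a allocates a result; a += 1 writes into a's existing buffer.
ptr_before = a.data_ptr()
out = a + a
a += 1
in_place_kept_buffer = a.data_ptr() == ptr_before
```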

This is the “view + storage” model that makes PyTorch’s slicing, reshaping, and broadcasting cheap: most operations that look like reshapes are zero-copy stride manipulations. It also explains a class of subtle bugs: a tensor that’s a view of a larger storage keeps the whole storage alive, even after you “slice off” a small piece and drop the original. Unexpected memory retention, and unexpectedly large files when a small view is saved, often trace to this; calling .clone() gives the slice a compact storage of its own.
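The retention effect is easy to reproduce. A minimal sketch, assuming PyTorch 2.x for untyped_storage(); the sizes are arbitrary.

```python
import torch

big = torch.zeros(1_000_000, dtype=torch.float32)  # about 4 MB of data
small = big[:10]                                   # a 10-element view
del big                                            # the name is gone...

# ...but the view still references the full storage.
retained_bytes = small.untyped_storage().nbytes()

# clone() copies the 10 elements into a storage of their own.
compact = small.clone()
compact_bytes = compact.untyped_storage().nbytes()

print(retained_bytes, compact_bytes)
```

Comparing a tensor’s own size (numel() times element_size()) against the size of its storage is a quick way to spot a small view pinning a large buffer.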

The dispatcher

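Every operation on a tensor (a + b, torch.matmul(a, b), a.relu()) routes through PyTorch’s dispatcher. The dispatcher is a C++ lookup table indexed by operator and by dispatch key. The keys active for a call are computed from the inputs and from thread-local state: the backend and layout (CPU, CUDA, dense vs sparse), the autograd mode, and a stack of “transformation” keys for things like vmap, functorch, autocast, and tracing. For ordinary dense tensors the dtype is not a dispatch key; the selected backend kernel branches on dtype internally.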

When you call a + b:

The Python-side __add__ calls into C++.
The dispatcher inspects a and b, computes the active dispatch keys.
It walks down a priority order: roughly, any active transformation keys first (vmap batching, autocast for mixed precision), then autograd (which records a node only if an input requires grad), and finally the backend key (CUDA, CPU, MPS).
At each level the registered kernel runs and either consumes the result or delegates further down.
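The backend end of this routing can be observed with a toy operator. The sketch below uses torch.library to define an op and register one kernel under the CPU key and one under the Meta key; the namespace sketch_ns, the op name myadd, and the calls list are hypothetical names invented for this example, and it assumes a PyTorch 2.x build.

```python
import torch

calls = []  # records which registered kernel the dispatcher picked

# "sketch_ns" and "myadd" are hypothetical names for this example.
lib = torch.library.Library("sketch_ns", "DEF")
lib.define("myadd(Tensor a, Tensor b) -> Tensor")

def myadd_cpu(a, b):
    calls.append("CPU")
    return a + b

def myadd_meta(a, b):
    # Meta tensors carry shape and dtype but no data.
    calls.append("Meta")
    return torch.empty_like(a)

# Register one kernel per backend dispatch key.
lib.impl("myadd", myadd_cpu, "CPU")
lib.impl("myadd", myadd_meta, "Meta")

x = torch.ones(3)
y_cpu = torch.ops.sketch_ns.myadd(x, x)
y_meta = torch.ops.sketch_ns.myadd(x.to("meta"), x.to("meta"))

print(calls, y_cpu, y_meta.device)
```

The same Python call reaches a different function depending only on where the inputs live, which is the dispatcher’s job in miniature.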

This is one of the more elegant tricks in PyTorch. Autograd is not a special phase that runs before or after your op; it’s a kernel registered against the Autograd dispatch key that wraps the call to the actual backend kernel. When an input has requires_grad=True and grad mode is enabled, the wrapper saves the inputs (or the relevant subset), records a node on the autograd graph, then dispatches again with the autograd key excluded, this time hitting the actual CUDA / CPU kernel.

This is why a custom op can be made to work under autograd, vmap, autocast, and tracing without touching those systems: you register kernels at the right keys. It’s also why training needs so much more memory than inference: every tensor the autograd wrapper saves stays in device memory until backward consumes it. And it’s why torch.no_grad() is a real perf win: the autograd wrapper skips node creation and tensor saving, so none of the tape-recording work happens. (torch.inference_mode() goes a step further and excludes the autograd keys from dispatch altogether.)
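The effect of disabling grad recording is visible from Python. The sketch below also uses torch.inference_mode(), a related context manager brought in here for comparison; treat its inclusion as an assumption about what is useful to contrast.

```python
import torch

x = torch.ones(3, requires_grad=True)

y = x * 2                      # grad mode on: a node is recorded
with torch.no_grad():
    z = x * 2                  # same backend kernel, nothing recorded
with torch.inference_mode():
    w = x * 2                  # result is an inference tensor, never tracked

print(y.grad_fn, z.grad_fn, w.is_inference())
```

Only y carries a grad_fn, so only y keeps saved tensors alive for a later backward.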

When you read a stack trace and see frames like at::native::add_kernel_cuda, that’s the bottom of the dispatcher landing at the actual implementation. Knowing the dispatch order lets you predict which kernel ran for any op.

Autograd: the dynamic tape

PyTorch’s autograd is define-by-run: the computational graph for the backward pass doesn’t exist before the forward; it’s built incrementally as the forward executes. Every op records a node in the graph: a function that computes gradients with respect to its inputs, plus references to whatever tensors it saved for that purpose.

Calling loss.backward() traverses the graph in reverse topological order, calls each node’s gradient function with the upstream gradient, and accumulates the resulting gradients into leaf tensors’ .grad attribute.
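The recorded graph and the accumulation into .grad can both be seen on a three-element example. A minimal sketch; the function being differentiated (the sum of squares) is chosen only because its gradient, 2x, is easy to check by hand.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x * x).sum()

# Each result points at the node that produced it;
# nodes link back to the nodes of their inputs.
root = y.grad_fn
parent = root.next_functions[0][0]
print(type(root).__name__, "<-", type(parent).__name__)

y.backward()
first = x.grad.clone()   # d/dx sum(x^2) = 2x

# A second forward builds a fresh graph; backward adds into .grad.
(x * x).sum().backward()
second = x.grad.clone()
print(first, second)
```

The second backward doubles the stored gradient rather than replacing it, which is why training loops zero the gradients between steps.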

Two consequences worth internalizing:

The graph is rebuilt every forward. This is what lets PyTorch handle data-dependent control flow, dynamic shapes, and Python-level branching without a separate graph compilation step. It’s also why eager-mode PyTorch can’t fuse across op boundaries — by the time op N runs, op N+1 hasn’t been seen yet.
Saved tensors dominate training memory. The forward pass holds onto every intermediate it’ll need for backward. For a transformer, that’s the attention activations, the FFN intermediates, every residual sum: often several times the model parameter footprint (4-8x is a common rule of thumb, and it grows with batch size and sequence length). This is what gradient checkpointing attacks: drop the saved tensors, recompute them in the backward.
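Gradient checkpointing can be tried on a toy function with torch.utils.checkpoint. In this sketch, block is a hypothetical stand-in for an expensive sub-network; the point is only that the checkpointed path recomputes the forward during backward and still produces the same gradients.

```python
import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    # Stand-in for a sub-network with several saved intermediates.
    return torch.tanh(x * 2.0).pow(2)

# Plain path: intermediates are saved during the forward.
x1 = torch.linspace(-1.0, 1.0, 5, requires_grad=True)
block(x1).sum().backward()

# Checkpointed path: intermediates are dropped and
# block() runs again during backward to rebuild them.
x2 = x1.detach().clone().requires_grad_(True)
checkpoint(block, x2, use_reentrant=False).sum().backward()

print(torch.allclose(x1.grad, x2.grad))
```

The trade is extra compute in the backward pass for a smaller peak memory footprint.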

Eager vs `torch.compile`

The biggest internal change of the last few years is the shift from eager-only execution to graph capture + compilation.

In eager mode, every op is a separate dispatcher call, a separate kernel launch, with full Python-level interactivity. Insert a print anywhere — it works. Set a debugger breakpoint — it works. Branch on a tensor value — it works. The cost is per-op kernel-launch overhead and zero cross-op fusion.

In torch.compile mode:

TorchDynamo intercepts Python bytecode at function-call boundaries and traces through the function. It builds an FX graph of tensor operations while specializing on shapes and Python control flow. When it hits a “graph break” (a Python construct it can’t trace), it falls back to eager and starts a new graph after.
AOTAutograd captures the backward graph alongside the forward, so both are visible to the optimizer.
Inductor (the default backend) lowers the FX graph to optimized kernels: pointwise ops fused into Triton kernels, reductions scheduled, memory-layout transforms inserted, then compiled to PTX (GPU) or C++ (CPU).
The compiled artifact is cached on the function and on the shape signature; subsequent calls with matching shapes hit the cache and skip recompilation.
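The capture step can be watched without running Inductor by passing torch.compile a custom backend, which is simply a callable that receives the FX graph and returns something runnable. The sketch below assumes a PyTorch 2.x install on a platform where TorchDynamo is supported; inspect_backend and captured are hypothetical names, and the backend performs no optimization at all.

```python
import torch
import torch.fx

captured = []  # FX graphs handed over by TorchDynamo

def inspect_backend(gm: torch.fx.GraphModule, example_inputs):
    # A backend receives the captured graph and returns a callable.
    captured.append(gm)
    return gm.forward  # run the graph as-is, no optimization

@torch.compile(backend=inspect_backend)
def f(x):
    return torch.relu(x) * 2.0 + 1.0

x = torch.randn(8)
out = f(x)         # first call: trace and hand the graph to the backend
out_again = f(x)   # same shape and dtype: cached, no second capture

print(captured[0].graph)
```

Swapping the backend back to the default sends the same graph through Inductor instead.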

The first call is slow (compilation). Subsequent calls are often 30-100% faster than eager for typical transformer training, and sometimes more for inference, though the gain depends heavily on the model, the hardware, and how many graph breaks occur. The trade is observability (you can no longer set a breakpoint inside the compiled graph without a graph break) and a class of “this only fails under torch.compile” bugs around shape specialization and dynamic control flow.

torch.fx is the underlying graph IR that everything sits on. You can capture an FX graph manually, transform it (insert quantization observers, replace ops, fuse subgraphs), and recompile — this is how most production quantization tooling and custom optimization passes are built.
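A capture, transform, recompile cycle fits in a few lines. This is a toy sketch of a rewrite pass: AddModule is a hypothetical module, and replacing torch.add with torch.mul has no practical purpose beyond showing the mechanics.

```python
import torch
import torch.fx

class AddModule(torch.nn.Module):
    def forward(self, x, y):
        return torch.add(x, y)

# Capture: symbolic tracing produces a GraphModule.
gm = torch.fx.symbolic_trace(AddModule())

# Transform: swap every torch.add call for torch.mul.
for node in gm.graph.nodes:
    if node.op == "call_function" and node.target == torch.add:
        node.target = torch.mul

gm.graph.lint()   # sanity-check the edited graph
gm.recompile()    # regenerate forward() from the graph

result = gm(torch.tensor([2.0, 3.0]), torch.tensor([4.0, 5.0]))
print(gm.code)
```

Real passes follow the same shape: walk gm.graph.nodes, match a pattern, edit or replace nodes, then recompile.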

The Python ↔ C++ ↔ CUDA boundary

PyTorch is three languages stacked:

Python is the user-facing API: torch.Tensor, nn.Module, optim.Adam, the autograd front-end. Most of what users write.
C++ is the implementation: the dispatcher, the autograd engine, the storage and view machinery, the operator schemas, the bulk of kernels for CPU. The C++ API (LibTorch) is also exposed for embedded inference.
CUDA / Triton / MPS / etc. are the backend kernels: the actual numeric work, called by the dispatcher’s leaf kernels. cuBLAS, cuDNN, NCCL, FlashAttention all sit here.

The Python ↔ C++ boundary is where most of the per-op overhead lives — typically 1-5 microseconds per dispatch. For a model with 10,000 ops per forward pass that’s 10-50 ms of pure overhead. This is why kernel fusion (and the graph compilers that produce it) matter so much: every fused op is a Python ↔ C++ transition saved.
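A rough feel for the per-op cost comes from timing many operations on a tensor too small for the arithmetic to matter. The sketch below is a crude estimate, not a benchmark: it measures the whole eager path on CPU (Python loop, dispatch, output allocation), so the number it prints is an upper bound on the boundary cost alone and will vary by machine. The function name per_op_microseconds is hypothetical.

```python
import time
import torch

def per_op_microseconds(n_ops: int = 10_000) -> float:
    x = torch.zeros(1)          # tiny tensor: compute cost is negligible
    start = time.perf_counter()
    for _ in range(n_ops):
        x = x + 1               # one eager dispatch per iteration
    elapsed = time.perf_counter() - start
    return elapsed / n_ops * 1e6

us = per_op_microseconds()
# Scale to a forward pass of 10,000 ops, as in the estimate above.
print(f"~{us:.1f} us per op, ~{us * 10_000 / 1000:.1f} ms per 10,000 ops")
```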

The mental model that pays back the most: every PyTorch op is (view of storage on device, dispatcher route, kernel call, optional autograd tape append). When something’s wrong — a memory leak, a slow op, a gradient that won’t flow — that decomposition tells you which level to investigate. Storage shared with a long-lived tensor? Dispatcher routing to an unexpected backend? Kernel that’s CPU when you expected GPU? Autograd graph holding tensors you thought were freed? Every one of those is a one-line diagnostic once you know where to look.

PyTorch’s internals are unusually approachable for a system this load-bearing: the C++ source is readable, the dispatcher is introspectable, and the FX graph is a real Python object you can print. Spending a day reading them once pays back across every fusion, mixed-precision, and distributed debugging session for the rest of a career.
