Performance Engineering
Squeezing throughput, latency, and memory out of GPUs.
Modern AI is an exercise in moving tensors through GPU memory faster than anyone else. The concepts below cover the hardware abstraction (GPU memory hierarchy — HBM/SRAM/registers, arithmetic intensity, the roofline model), the parallelism strategies that let single jobs span multiple devices (tensor / pipeline / data parallel, FSDP), the inference-time tricks that compound (PagedAttention, kernel fusion, graph compilation, mixed precision), and the metrics that tell you whether you're actually using the silicon (MFU, throughput, latency tail). This is where the engineering depth of an AI team shows.
- Arithmetic Intensity
Arithmetic intensity is FLOPs per byte read from memory. Combined with the hardware's compute-to-bandwidth ratio it determines whether a kernel is memory-bound (intensity below the ridge) or compute-bound (above).
- AWQ — Activation-Aware Weight Quantization
INT4 weight-only quantization that protects the *salient* weight channels — the ones multiplied by large activations — by absorbing a per-channel scale into the weights *before* rounding.
- CUDA Programming
NVIDIA's parallel-computing platform for GPUs. A C++ extension plus a runtime that exposes the GPU as a SIMT machine: thousands of threads grouped into warps, blocks, and grids, with explicit control over a tiered memory hierarchy.
- Data Parallelism
Replicate the model on every GPU, shard the batch across replicas, and synchronize gradients with an allreduce after each backward pass. The simplest distributed-training pattern, and the default for any model that fits on a single device.
- FP8
Two IEEE-style 8-bit float variants, E4M3 (1 sign, 4 exponent, 3 mantissa) and E5M2 (1 sign, 5 exponent, 2 mantissa). E4M3 has higher precision and narrower range, used for forward activations and weights.
- FSDP (Fully Sharded Data Parallel)
FSDP shards parameters, gradients, and optimizer state across data-parallel ranks instead of replicating them. Each rank holds only 1/N of the weights at rest and gathers full layers on the fly during forward and backward.
- GGUF and the K-Quant Family
The file format and quantization scheme that powers `llama.cpp` — the de-facto local-inference stack for LLMs on commodity hardware. GGUF embeds tokenizer, chat template, and quantized weights in a single mmap-able artifact.
- GPTQ — Hessian-Based Post-Training Quantization
Layer-by-layer 4-bit weight quantization that minimizes layer-output reconstruction error using a Hessian computed from a small calibration set.
- GPU Kernel Authoring
Writing custom GPU kernels — in CUDA C++, Triton, CUTLASS, or ThunderKittens — when the off-the-shelf library is leaving performance on the table.
- GPU Memory Hierarchy
Modern GPUs have three relevant memory levels: HBM (slow, abundant), SRAM (fast, tiny), and registers (fastest, tiniest). HBM bandwidth is roughly 1-3 TB/s; on-chip SRAM bandwidth is closer to 20 TB/s.
- Gradient Accumulation
Run multiple micro-batches sequentially, summing their gradients into a buffer, before applying a single optimizer step. Lets you simulate a large effective batch on memory-constrained hardware. The standard trick for hitting target batch sizes when a single batch won't fit on the GPU.
- Gradient Checkpointing
Trade compute for memory by recomputing forward activations during the backward pass instead of storing them. Roughly 5x memory savings on activations at a cost of ~30% slower training.
- Inference Graph Compilation
Capture a model's computation as a static graph, optimize it (operator fusion, constant folding, attention specialization, kernel selection), and emit a compiled artifact that runs without Python overhead. torch.compile, TensorRT-LLM, ONNX Runtime.
- Kernel Fusion
Combining multiple GPU operations into a single CUDA kernel call so that intermediate tensors live in registers or shared memory instead of round-tripping through HBM.
- Mixed-Precision Training
Train with bf16 or fp16 activations and weights instead of fp32, while keeping master weights and optimizer accumulations in fp32 for numerical stability.
- Model FLOPs Utilization (MFU)
MFU is achieved FLOPs divided by theoretical peak FLOPs — the headline efficiency metric for whether you're actually using the GPU. Realistic targets in 2026: 40-60 percent during pretraining is good, 50 percent-plus is excellent.
- Model Quantization
Compressing the weights of a trained model from fp16 / bf16 down to int8, int4, fp8, or fp4 representations to fit larger models on smaller hardware and increase inference throughput.
- MXFP4
The Open Compute Project's microscaling 4-bit float. Blocks of 32 elements share a single 8-bit power-of-2 scale (E8M0); each element is a 4-bit micro-float (E2M1). Effective storage: 4.25 bits per weight.
- NF4 — NormalFloat 4-Bit Quantization
A 4-bit weight format with 16 levels placed at the *equiquantiles* of the standard normal distribution rather than uniformly. Trained-network weights are approximately $\mathcal{N}(0, \sigma^2)$, so spending bits where the mass actually lives.
- NVFP4
NVIDIA's variant of microscaling FP4, introduced with Blackwell. Blocks of 16 elements (vs MXFP4's 32) with an E4M3 FP8 scale (vs MXFP4's power-of-2 E8M0), plus an optional outer FP32 scale across multiple blocks.
- PagedAttention
PagedAttention is the KV-cache memory manager behind vLLM. It treats the KV cache like an OS treats process memory — fixed-size blocks of 16 tokens, mapped through a per-sequence block table.
- Pipeline Parallelism
Pipeline parallelism splits a model's layers across GPU stages and feeds micro-batches through them in an assembly-line schedule. GPipe (2018) introduced the basic idea; 1F1B (PipeDream, 2019) reduced its memory footprint.
- PyTorch Internals
How PyTorch actually executes a forward pass: a `torch.Tensor` is a thin Python object wrapping a storage, view, dtype, and device; every op routes through a C++ dispatcher that picks the right backend kernel for the (dtype, device, layout) tuple.
- Quantization-Aware Training (QAT)
Training a model with quantization simulated in the forward pass so the weights co-adapt to low-precision rounding. Recovers quality that post-training quantization loses, at the cost of a fine-tuning run. The standard recipe for sub-4-bit deployment.
- Tensor Parallelism
Tensor parallelism splits each individual matrix multiplication across multiple GPUs — column-split or row-split with an all-reduce — so weights too large for one GPU's memory still produce one combined result.
- Foundations 48
The bedrock primitives every other topic builds on.
- Data 18
The corpora, curation, and quality decisions that make models possible.
- Language Models 32
The foundational substrate of modern AI.
- Multimodal 13
When text isn't the only signal — vision, audio, and joint embedding spaces.
- Prompting 16
How you talk to an LLM, and when you stop.
- Agents 12
When LLMs become decision-makers in a loop.
- Search & Retrieval 21
How systems find relevant documents in the first place.
- Embeddings 16
The dense-vector layer of modern retrieval.
- Rerankers 9
The second stage that puts the right answer at the top.
- Evaluation 21
How to measure retrieval quality and trust the numbers.
- Training Methodology 21
How modern retrieval models get their relevance signal.
- Production 16
From notebook to live traffic.
