PagedAttention

Also known as: paged attention, paged-attention, vLLM PagedAttention

TL;DR

PagedAttention is the KV-cache memory manager behind vLLM. It treats the KV cache like an OS treats process memory — fixed-size blocks of 16 tokens, mapped through a per-sequence block table.

PagedAttention is the memory-management layer that turns the KV cache from the dominant bottleneck of LLM serving into a near-fully-utilized resource. Introduced in Kwon et al. (2023) alongside vLLM, it borrows operating-system virtual-memory ideas wholesale: the cache lives in fixed-size physical blocks (default 16 tokens per block, per layer), each sequence holds a block table that maps logical positions to physical blocks, and blocks are allocated on demand from a global pool. Memory utilization rises from 30-40 percent to roughly 96 percent, achievable batch sizes grow 2-4x, and prefix sharing across requests becomes a refcount on a page table rather than a deep architectural change.

The fragmentation problem it solves

A naive KV-cache implementation reserves a contiguous slab per active sequence sized to the maximum allowed length — say max_seq_len * 2 * num_layers * num_kv_heads * d_head * dtype_bytes. A request with a 100-token answer in an 8K-context server holds onto an 8K-shaped slab for its entire lifetime; the unused tail is stranded. Across a batch with heterogeneous request shapes, you waste 60-70 percent of VRAM on slack the scheduler can’t reclaim. The achievable batch size is gated not by what the GPU could compute but by what fits.

The wall is severe at scale. A 70B model at 8K context with 32 in-flight sequences pre-allocates roughly 80 GB of KV before a single useful token has landed in it.
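
The 80 GB figure can be reproduced from the slab formula above. The sketch below is a back-of-the-envelope check, not a measurement: the model shape (80 layers, 8 KV heads, head dimension 128, fp16) is an assumption chosen to resemble a 70B model with grouped-query attention, and `kv_cache_bytes` is a hypothetical helper name.

```python
GIB = 1024 ** 3


def kv_cache_bytes(seq_len, num_layers, num_kv_heads, d_head, dtype_bytes, num_seqs=1):
    """Bytes reserved for K and V across all layers for `num_seqs` sequences."""
    per_token = 2 * num_layers * num_kv_heads * d_head * dtype_bytes  # 2 = K and V
    return per_token * seq_len * num_seqs


# Assumed shape (not stated in the text): 80 layers, 8 KV heads, d_head 128, fp16.
reserved = kv_cache_bytes(
    seq_len=8192, num_layers=80, num_kv_heads=8, d_head=128, dtype_bytes=2, num_seqs=32
)
reserved_gib = reserved / GIB

# A request that only ever holds 100 tokens touches a small slice of its 8K slab.
slab_utilization = 100 / 8192

print(f"{reserved_gib:.1f} GiB reserved, {slab_utilization:.1%} of one slab used")
```

Under these assumptions each token costs 320 KiB of KV, so a single 8K slab is 2.5 GiB and 32 of them fill an 80 GB card before any weights are loaded.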

The OS analogy is exact

The mapping is tighter than most “X is just like Y” pitches. Logical sequence positions correspond to virtual addresses; physical KV blocks correspond to physical memory pages; the per-sequence block table is the page table. Allocations happen lazily as the sequence grows. Refcounted shared pages give you copy-on-write semantics, which the prefix cache reuses directly. Even the typical block size — 16 tokens — is chosen the same way OS designers picked 4 KB pages: small enough that internal fragmentation is bounded, large enough that the indirection table doesn’t dominate.
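
The page-table mapping can be sketched in a few lines. This is a minimal illustration under stated assumptions, not vLLM's actual block manager: `BlockPool`, `Sequence`, `append_token`, and `locate` are hypothetical names, and the pool tracks only block ids, not the K/V tensors themselves.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per block, matching the default quoted above


class BlockPool:
    """Global pool of physical block ids (hypothetical helper)."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("KV block pool exhausted")
        return self.free.pop()


@dataclass
class Sequence:
    pool: BlockPool
    num_tokens: int = 0
    block_table: list[int] = field(default_factory=list)  # logical index -> physical id

    def append_token(self) -> None:
        # Lazy allocation: take a new physical block only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1

    def locate(self, pos: int) -> tuple[int, int]:
        """Translate a logical token position into (physical block, offset)."""
        if not 0 <= pos < self.num_tokens:
            raise IndexError(pos)
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE


pool = BlockPool(num_blocks=8)
seq = Sequence(pool)
for _ in range(17):
    seq.append_token()

# 17 tokens occupy two blocks; the physical ids need not be adjacent or ordered.
block, offset = seq.locate(16)
```

The division and modulo in `locate` play the role of splitting a virtual address into page number and page offset.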

What is genuinely different: nothing is paged out to disk (far too slow for the decode-step deadline; the vLLM paper instead preempts a sequence by swapping its blocks to host memory or recomputing them), there are no permissions or address-space isolation (one process, one tenant), and the granularity is per-layer rather than per-process. PagedAttention also has to expose the indirection to a custom attention kernel, since the stock attention kernels available at the time, including cuBLAS-based implementations and early FlashAttention 2 releases, expected contiguous K/V tensors. The vLLM PagedAttention kernel is the piece that makes the indirection invisible to the rest of the stack.
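
To see why the indirection can be hidden, here is a toy, pure-Python sketch of the read path. It is not the vLLM CUDA kernel: a real kernel streams blocks rather than gathering them into a list, and the block size of 4, the function names, and the toy K/V values are all assumptions made for readability. The point is only that fetching K/V through a block table gives the same attention output as reading a contiguous tensor.

```python
import math

BLOCK_SIZE = 4  # small for readability; an assumption, not the vLLM default


def attend(q, keys, values):
    """Single-head softmax attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def paged_attend(q, k_blocks, v_blocks, block_table, num_tokens):
    """Same computation, but every K/V row is fetched through the block table."""
    keys, values = [], []
    for pos in range(num_tokens):
        block = block_table[pos // BLOCK_SIZE]
        offset = pos % BLOCK_SIZE
        keys.append(k_blocks[block][offset])
        values.append(v_blocks[block][offset])
    return attend(q, keys, values)


# Six tokens of contiguous K/V (toy values).
keys = [[0.1 * i, 1.0 - 0.1 * i] for i in range(6)]
values = [[float(i), float(-i)] for i in range(6)]

# Scatter them into non-adjacent physical blocks 3 and 1 of a 4-block store.
block_table = [3, 1]
k_blocks = [[[0.0, 0.0]] * BLOCK_SIZE for _ in range(4)]
v_blocks = [[[0.0, 0.0]] * BLOCK_SIZE for _ in range(4)]
for pos in range(6):
    b, off = block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE
    k_blocks[b][off] = keys[pos]
    v_blocks[b][off] = values[pos]

q = [0.5, -0.25]
contiguous = attend(q, keys, values)
paged = paged_attend(q, k_blocks, v_blocks, block_table, num_tokens=6)
```

Callers of `paged_attend` never see where the blocks physically live, which is the property the custom kernel provides to the rest of the stack.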

What unlocks once memory is paged

Three things, all of them load-bearing for production serving:

Larger achievable batch sizes. With 96 percent utilization vs 35 percent, the same VRAM holds 2-4x as many concurrent sequences. Throughput scales nearly linearly with batch in the memory-bound regime, so batch headroom translates directly into tokens-per-second-per-GPU.
Prefix sharing as page-table sharing. When two sequences fork from a shared system prompt or a shared retrieved context, their block tables point to the same physical blocks until the first diverging token. This is the mechanism behind prompt caching in vLLM and SGLang. Beam search and parallel sampling fall out of the same primitive.
No more max-length tax. A request that finishes in 50 tokens releases its blocks back to the pool immediately. A long-context request grows page by page until it stops. The scheduler stops over-provisioning for the worst case.
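
The second and third points reduce to reference counting on blocks. The following is a simplified sketch under stated assumptions, not vLLM's implementation: `RefCountedPool`, `fork`, and `finish` are hypothetical names, and the copy-on-write step only swaps block ids, where a real engine would also copy the K/V contents of the shared tail block.

```python
BLOCK_SIZE = 16


class RefCountedPool:
    """Physical block ids with reference counts (hypothetical helper)."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))
        self.refs: dict[int, int] = {}

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("KV block pool exhausted")
        block = self.free.pop()
        self.refs[block] = 1
        return block

    def share(self, block: int) -> int:
        self.refs[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refs[block] -= 1
        if self.refs[block] == 0:  # last owner gone: back to the pool
            del self.refs[block]
            self.free.append(block)


class Sequence:
    def __init__(self, pool, block_table=None, num_tokens=0):
        self.pool = pool
        self.block_table = block_table or []
        self.num_tokens = num_tokens

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.pool.allocate())
        elif self.pool.refs[self.block_table[-1]] > 1:
            # Copy-on-write: the partially filled tail block is shared, so
            # drop our reference and continue in a private block.
            self.pool.release(self.block_table[-1])
            self.block_table[-1] = self.pool.allocate()
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        # Sharing a prefix costs one refcount bump per block, no copying.
        shared = [self.pool.share(b) for b in self.block_table]
        return Sequence(self.pool, shared, self.num_tokens)

    def finish(self) -> None:
        for block in self.block_table:
            self.pool.release(block)
        self.block_table, self.num_tokens = [], 0


pool = RefCountedPool(num_blocks=8)
parent = Sequence(pool)
for _ in range(20):          # one full block plus a 4-token tail
    parent.append_token()

child = parent.fork()
free_after_fork = len(pool.free)
child.append_token()         # diverges inside the shared tail block
free_after_diverge = len(pool.free)

child.finish()
parent.finish()
free_at_end = len(pool.free)
```

Forking consumes no blocks, divergence consumes exactly one, and finishing returns everything to the pool immediately.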

What follows from it

PagedAttention is the foundation that vLLM builds the rest of its stack on. Continuous batching is straightforward when memory is paged: you don't need to repack a contiguous tensor when a sequence finishes mid-batch, you just free its blocks. Multi-LoRA serving, FP8 KV-cache quantization, structured-output decoding, and draft-and-target speculative decoding all integrate without redesigning the memory layer. Most modern serving engines (SGLang, TensorRT-LLM, MLC) ship some equivalent. PagedAttention is now table stakes; the design choice that's interesting in 2026 is what additional layer (prefix radix-tree, RadixAttention, hierarchical paging across HBM and host) sits on top of it.

Go further

Why 16 tokens per block and not 4 or 256?

Block size trades fragmentation against bookkeeping. Smaller blocks waste less memory at sequence end but spend more bandwidth on the page-table indirection per attention step; larger blocks shorten the table but waste a half-block per sequence on average. 16 is the empirical sweet spot Kwon et al. landed on for typical workloads. Some serving stacks ship 32 for very long contexts, where the tail-fragmentation cost is amortized over more useful tokens.
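
The trade-off is easy to tabulate. The sketch below is illustrative arithmetic only: `block_cost` and `mean_waste` are hypothetical helpers, and averaging over uniformly distributed sequence lengths is an assumption, not a claim about real traffic.

```python
def block_cost(seq_len: int, block_size: int) -> tuple[int, int]:
    """Return (block-table entries, unused token slots in the last block)."""
    entries = -(-seq_len // block_size)  # ceiling division
    wasted = entries * block_size - seq_len
    return entries, wasted


def mean_waste(block_size: int, max_len: int = 4096) -> float:
    """Average unused slots over lengths 1..max_len (uniform lengths assumed)."""
    total = sum(block_cost(n, block_size)[1] for n in range(1, max_len + 1))
    return total / max_len


for size in (4, 16, 256):
    entries, wasted = block_cost(1000, size)
    print(f"block={size:>3}  table entries={entries:>3}  "
          f"wasted slots={wasted:>2}  mean waste={mean_waste(size)}")
```

For a 1000-token sequence, moving from 4 to 256 tokens per block shrinks the table from 250 entries to 4, while the average stranded tail grows from about 1.5 to about 127.5 token slots, which is the half-block figure mentioned above.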


Does PagedAttention slow down attention itself?

Marginally. Each attention step walks the sequence's block table at every layer, paying one lookup per KV block to find the K/V pages. vLLM's custom CUDA kernel collapses the indirection into a single coalesced load and overlaps it with compute, so the per-step overhead is in the low single-digit percent. The throughput gain from the larger achievable batch size dominates the indirection cost by 5-10x in any realistic workload.


Can blocks be shared across requests?

Yes — that is exactly how prefix caching works. Two sequences with the same opening tokens point their early entries in the block table at the same physical KV blocks; the manager refcounts pages and copy-on-writes the first block where the sequences diverge. This is what makes a shared 4K system prompt across 1000 concurrent agent requests cost the prefill of one request, not 1000.
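
Because sharing happens at block granularity, the reusable part of a prompt is its run of identical leading full blocks. The sketch below only counts those blocks; it is an assumption-laden illustration (`shared_full_blocks` is a hypothetical helper and the token ids are stand-ins), not how any particular engine indexes its cache.

```python
BLOCK_SIZE = 16


def shared_full_blocks(a: list[int], b: list[int], block_size: int = BLOCK_SIZE) -> int:
    """Count leading full blocks whose token ids are identical in both sequences."""
    count = 0
    limit = min(len(a), len(b)) // block_size
    for i in range(limit):
        start = i * block_size
        if a[start:start + block_size] != b[start:start + block_size]:
            break
        count += 1
    return count


system_prompt = list(range(4096))           # stand-in token ids for a 4K prompt
req_a = system_prompt + [9001, 9002, 9003]  # user-specific suffixes
req_b = system_prompt + [7001, 7002]

reusable_blocks = shared_full_blocks(req_a, req_b)
reusable_tokens = reusable_blocks * BLOCK_SIZE

# Prefill work for 1000 such requests, with and without sharing the prompt blocks.
num_requests, suffix_len = 1000, 3
prefill_unshared = num_requests * (len(system_prompt) + suffix_len)
prefill_shared = len(system_prompt) + num_requests * suffix_len
```

With the whole 4K prompt reusable, the shared case prefills the prompt once and then only each request's short suffix.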

