vLLM Serving

Also known as: vLLM, PagedAttention, vLLM inference

TL;DR

vLLM is the dominant open-source LLM serving framework. Its core innovations are PagedAttention for KV-cache memory management, continuous batching for throughput, and prefix caching for prompt reuse.

vLLM is a high-throughput LLM inference engine that became the open-source default for serving large language models. Originating in Kwon et al. (2023, “Efficient Memory Management for Large Language Model Serving with PagedAttention”), vLLM combines three core ideas (PagedAttention for KV-cache memory, continuous batching for compute, and prefix caching for prompt reuse) into a single serving engine that delivers 2-10x the throughput of naive HuggingFace generate() on the same hardware, and up to 24x in the original paper's high-concurrency benchmarks.

Why naive serving fails

A transformer’s KV cache grows linearly with sequence length. A 70B model with 8K context and 32 concurrent sequences needs ~80GB just for KV at full context. Naive implementations pre-allocate that worst-case memory per slot, even if every sequence is only 100 tokens long. Memory utilization drops to 30-40%, batch size is forced down, and GPUs sit idle waiting on memory rather than crunching tokens.
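The ~80GB figure can be reproduced with a short calculation. This is a sketch under an assumed model shape that the text does not state: 80 layers, 8 KV heads (grouped-query attention), head dimension 128, and fp16 values, which is roughly what a Llama-style 70B model uses. The function name is illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, bytes_per_value,
                   context_len, num_sequences):
    """Worst-case KV-cache size when every sequence reserves the full context."""
    # Factor of 2: one key tensor and one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_len * num_sequences


# Assumed shape (not stated in the text): 80 layers, 8 KV heads,
# head dimension 128, fp16 (2 bytes per value).
total = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                       bytes_per_value=2, context_len=8192, num_sequences=32)
print(f"{total / 2**30:.1f} GiB")  # 80.0 GiB
```

Under these assumptions each token costs 320 KiB of KV, so the reservation scales directly with context length times concurrency, regardless of how many tokens are actually generated.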

This is the bottleneck vLLM is built around.

PagedAttention: KV cache as virtual memory

Operating systems handle process memory by paging: they break the virtual address space into fixed-size pages (typically 4KB) and use a page table to map them to physical RAM, which avoids external fragmentation. PagedAttention applies the same idea to KV-cache memory:

Break sequence KV into fixed-size blocks (typically 16 tokens per block per layer).
Maintain a per-sequence block table mapping logical sequence positions to physical KV blocks.
Allocate blocks on-demand as a sequence grows.
Sequences sharing a prompt prefix can share blocks (the prefix-caching layer).
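The block-table mechanics above can be sketched in a few lines. This is a toy model for illustration, not vLLM's implementation: the class and method names are made up, and real engines manage GPU tensors rather than integer ids.

```python
BLOCK_SIZE = 16  # tokens per block, matching the typical value above


class BlockAllocator:
    """Hands out physical block ids from a fixed pool."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("KV cache is full")
        return self.free.pop()


class Sequence:
    """Tracks one sequence's logical-to-physical block table."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # index = logical block, value = physical block id
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, position):
        # Translate a token position into (physical block id, offset in block).
        return self.block_table[position // BLOCK_SIZE], position % BLOCK_SIZE


allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(100):
    seq.append_token()
print(len(seq.block_table))  # 7 blocks cover 100 tokens
```

The physical block ids in a table need not be contiguous or ordered, which is what lets the allocator reuse any free block for any sequence.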

Two consequences:

Almost no fragmentation. A sequence of length 100 uses 7 blocks (112 token slots), not the worst-case max-length allocation, so waste is confined to the unfilled part of the last block. Memory utilization rises from ~35% to ~96%.
Copy-on-write sharing. When two sequences fork from the same prefix (parallel sampling, beam search, prefix caching), they can share KV blocks until they diverge. This is what powers prefix caching for repeated system prompts.
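Copy-on-write sharing is commonly implemented with per-block reference counts. The sketch below assumes that approach; the names are hypothetical and the actual copy of KV data is omitted.

```python
class SharedBlocks:
    """Reference-counted physical blocks with copy-on-write."""

    def __init__(self):
        self.ref_count = {}  # physical block id -> number of sequences using it
        self.next_id = 0

    def new_block(self):
        block_id = self.next_id
        self.next_id += 1
        self.ref_count[block_id] = 1
        return block_id

    def fork(self, block_table):
        # A forked sequence reuses the parent's blocks; only counts change.
        for block_id in block_table:
            self.ref_count[block_id] += 1
        return list(block_table)

    def write_last_block(self, block_table):
        # Before writing into a shared block, give this sequence a private one.
        last = block_table[-1]
        if self.ref_count[last] > 1:
            self.ref_count[last] -= 1
            block_table[-1] = self.new_block()  # a real engine also copies the KV data
        return block_table[-1]


pool = SharedBlocks()
parent = [pool.new_block() for _ in range(3)]  # blocks 0, 1, 2
child = pool.fork(parent)                      # shares all three
pool.write_last_block(child)                   # child diverges and gets block 3
print(parent, child)  # [0, 1, 2] [0, 1, 3]
```

Only the block being written is duplicated; the two full prefix blocks stay shared for as long as both sequences are alive.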

The cost is one indirection per attention computation — the GPU has to look up the block address before reading KV. vLLM’s custom CUDA kernel hides that latency well; the throughput gain from larger batches dominates.

Continuous batching: don’t wait for slow sequences

Continuous batching is the second pillar. Naive batching waits for every sequence in the batch to finish before starting the next batch — but generation is autoregressive and sequences finish at wildly different times. A batch of 32 with one slow sequence wastes 31 slots for the duration.

Continuous batching swaps in new requests at every token step. As soon as a sequence finishes, its slot is replaced with a queued request. Combined with PagedAttention (which allows variable per-slot memory), this keeps GPU utilization near 100% even with heterogeneous request shapes.
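A small simulation makes the difference concrete. It counts decode steps for a hypothetical workload under both policies, assuming one token per active sequence per step and ignoring prefill cost and memory limits.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A finished sequence's slot is refilled from the queue at the next step."""
    queue = deque(lengths)
    running = []  # remaining tokens for each active sequence
    steps = 0
    while queue or running:
        # Fill any free slots before the step.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        # One decode step produces one token for every active sequence;
        # sequences that just emitted their last token leave the batch.
        running = [r - 1 for r in running if r > 1]
        steps += 1
    return steps


# Hypothetical workload: output lengths in tokens, in arrival order.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batching_steps(lengths, batch_size=4))      # 200
print(continuous_batching_steps(lengths, batch_size=4))  # 110
```

With static batching, each long sequence holds three idle slots for 90 steps. With continuous batching, the second long sequence starts at step 11 and overlaps almost entirely with the first.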

Prefix caching: reuse the system prompt

If 1000 requests share the same 4K-token system prompt, naive serving recomputes the prefix’s KV 1000 times. vLLM’s prefix caching uses PagedAttention’s block-sharing to compute the prefix once and reuse it across all requests sharing it.

In practice this is a 30-90% reduction in time to first token (TTFT) for high-concurrency workloads with shared prompts: agent workloads, RAG with templated system messages, structured-output pipelines. See caching strategies for the broader cache hierarchy.
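One way to find reusable prefix blocks is to hash each full block together with everything before it, so that equal hashes imply an equal prefix. The sketch below assumes that scheme; the hashing details and class names are illustrative rather than vLLM's actual code, and a real cache would map each hash to a physical block and evict under memory pressure.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per block


def block_hashes(token_ids):
    """Hash each full block chained with the hash of all preceding blocks."""
    hashes = []
    prev = b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        prev = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(prev)
    return hashes


class PrefixCache:
    def __init__(self):
        self.cached = set()  # hashes of blocks whose KV is already computed

    def admit(self, token_ids):
        """Return how many prompt tokens can skip prefill, then cache the prompt."""
        hashes = block_hashes(token_ids)
        hits = 0
        for h in hashes:
            if h not in self.cached:
                break
            hits += 1
        self.cached.update(hashes)
        return hits * BLOCK_SIZE


system_prompt = list(range(4096))  # stand-in for a 4K-token system prompt
cache = PrefixCache()
print(cache.admit(system_prompt + [9001, 9002]))  # 0: first request, cold cache
print(cache.admit(system_prompt + [9003, 9004]))  # 4096: shared prefix reused
```

Because the hash is chained, a block only matches when every block before it matches too, so a prompt that differs in its first tokens gets no hits even if later text is identical.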

What composes well on top

Examples

Speculative decoding: vLLM supports draft+target speculative decoding for a 1.5-3x latency win on memory-bound serving.
Tensor parallelism: vLLM handles TP across multi-GPU nodes natively; pipeline parallelism is also supported but with lower utilization.
LoRA serving: multi-LoRA serving lets you serve hundreds of LoRA adapters from one base model with little per-adapter overhead at moderate load.
Quantization: AWQ, GPTQ, FP8, and INT8 weight quantization, along with KV-cache quantization, all integrate with PagedAttention.
Structured output: JSON-schema and grammar-constrained decoding via outlines or xgrammar integrate as a logits processor.

When vLLM is the wrong choice

Secondary gotchas: prefix caching only helps if prefixes actually repeat (random prompts kill it), quantization can degrade quality on math-heavy or long-tail tasks, and multi-LoRA serving has per-adapter overhead at high QPS.

Operational shape

A typical vLLM deployment: one node with 4-8 GPUs running tensor-parallel, fronted by a load balancer that routes by model and by prefix hash (to maximize prefix-cache hit rate). Performance KPIs are tokens/sec/GPU for throughput and P99 request latency for the latency tail. The memory KPI is KV-cache utilization (target ~90%+).
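Prefix-hash routing can be as simple as hashing the start of the prompt and picking a replica from the result. The sketch below is one possible approach, not a vLLM feature: the function, the replica names, and the 256-character prefix window are all assumptions.

```python
import hashlib


def pick_replica(prompt, replicas, prefix_chars=256):
    """Route by a hash of the prompt's leading characters."""
    prefix = prompt[:prefix_chars].encode("utf-8")
    digest = hashlib.sha256(prefix).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


replicas = ["vllm-0", "vllm-1", "vllm-2", "vllm-3"]  # hypothetical replica names
system = "You are a support assistant for ExampleCo. " * 10
a = pick_replica(system + "Where is my order?", replicas)
b = pick_replica(system + "How do I reset my password?", replicas)
print(a == b)  # True: same first 256 characters, same replica
```

Requests that share a system prompt land on the same replica, where the prefix KV is likely already cached. Plain modulo hashing remaps most prefixes when the replica count changes, so a consistent-hashing scheme may be preferable in a pool that scales up and down.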

vLLM is the default for teams shipping LLM-driven products because it’s the engine where the broadest ecosystem of optimizations lands first.

Go further

What is PagedAttention actually doing differently?

A standard KV cache allocates contiguous memory per sequence sized to max-length, wasting memory on short sequences. PagedAttention treats the KV cache like virtual memory: it allocates fixed-size blocks (typically 16 tokens) and uses a block table, analogous to a page table, to map sequence positions to blocks. Result: ~96% memory utilization vs ~30-40% naive, enabling much larger batches.

vLLM vs TensorRT-LLM vs SGLang — when to use which?

vLLM: open-source default, best ecosystem, supports nearly every model. TensorRT-LLM: NVIDIA's offering, fastest single-stream latency on H100/H200 but locked in to the NVIDIA stack. SGLang: faster prefix caching and structured-output decoding, smaller community. For most teams, vLLM is the right starting point.

Does vLLM help for batch size 1?

Less than for high concurrency. PagedAttention shines when you batch many concurrent sequences with varying lengths; at batch size 1, the memory wins are smaller. For pure single-stream latency, TensorRT-LLM or hand-tuned kernels often win. vLLM is built for serving many concurrent users.
