Why is naive batching so bad for LLMs?
Generation lengths are unpredictable and highly variable. Static batching (collect
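Also known as: in-flight batching, iteration-level scheduling. Sometimes loosely called dynamic batching, although that term more often refers to request-level batching in pre-LLM serving systems.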
A scheduling technique, popularized by vLLM, in which requests join and leave a running batch between decode steps. It greatly improves GPU utilization for variable-length generation compared to naive static batching, and is the default in every modern LLM serving stack.
Continuous batching is the scheduling discipline behind every modern LLM serving stack: vLLM, TGI, TensorRT-LLM, SGLang. By letting requests join and leave the in-flight batch at every decode step, it typically raises throughput by roughly 5-20× over static batching, with the exact gain depending on how variable the generation lengths are.
The unit of scheduling is one token, not one request. That single shift is what unlocked production-scale LLM serving.
Pre-LLM ML serving used “static” or “dynamic” batching: collect
GPU utilization falls off a cliff. Empirically, naive batching gets 10-30% utilization on real LLM traffic.
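The utilization drop can be illustrated with a little arithmetic. The sketch below treats "utilization" as slot occupancy (the fraction of batch slots emitting a useful token at each decode step), which is only a proxy for real GPU utilization. The output lengths are hypothetical, chosen to show one long request holding a batch open.

```python
def static_batch_utilization(output_lengths):
    """Fraction of decode slot-steps that do useful work in one static batch."""
    steps = max(output_lengths)    # the batch runs until its longest request ends
    useful = sum(output_lengths)   # slot-steps that actually emit a token
    return useful / (steps * len(output_lengths))


# Hypothetical output lengths for a batch of 8 requests.
lengths = [12, 40, 35, 900, 60, 25, 150, 18]
print(f"{static_batch_utilization(lengths):.1%}")  # prints 17.2%
```

One 900-token generation keeps seven mostly idle slots reserved for the entire run, which is how a batch ends up in the 10-30% range.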
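The scheduler operates at token-level granularity instead of request-level: after each decode step it retires finished requests, admits waiting requests into the freed slots, and then runs the next step.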
The result: requests join and leave the in-flight batch every token. Fast requests finish and free up slots immediately; new requests get into compute within ~1 token of arriving. GPU utilization climbs to 70-90%.
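A toy simulation makes the join-and-leave behavior concrete. This is a minimal sketch, not any serving stack's actual scheduler: it assumes all requests are queued at step 0, ignores prefill, and uses the hypothetical name `simulate_continuous`.

```python
from collections import deque


def simulate_continuous(output_lengths, max_batch):
    """Return (decode steps, slot utilization) for a toy continuous-batching loop.

    Each entry of output_lengths is the number of tokens (>= 1) a request
    will generate. Prefill cost is ignored in this sketch.
    """
    waiting = deque(output_lengths)
    running = []   # tokens still to generate for each in-flight request
    steps = 0
    useful = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the step.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step: every running request emits one token.
        useful += len(running)
        steps += 1
        # Retire finished requests so their slots are free for the next step.
        running = [r - 1 for r in running if r > 1]
    return steps, useful / (steps * max_batch)


print(simulate_continuous([3, 1, 1, 1], max_batch=2))  # (3, 1.0)
```

With the same four requests and two slots, static batching would need 4 steps (3 for the first pair, 1 for the second); here the short requests slot in behind each other while the long one keeps running.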
Naive batching pre-allocates a contiguous KV cache region per slot, sized for the maximum generation length. That scheme breaks down under continuous batching because (a) different requests have different current lengths, (b) requests come and go mid-flight, and (c) VRAM is the bottleneck, so you can’t afford to waste it on max-length pre-allocation.
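To get a feel for the cost of max-length pre-allocation, here is a back-of-the-envelope calculation. The model shape (32 layers, 32 KV heads, head dimension 128, fp16) is an assumption standing in for a typical 7B-class model without grouped-query attention, and the 4096-token limit and 300-token request are made-up numbers.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


# Assumed 7B-class shape in fp16 (illustrative, not taken from the text above).
per_token = kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128)
max_len = 4096
reserved = per_token * max_len   # what a max-length slot pins down
used = per_token * 300           # what a 300-token request actually needs

print(per_token // 1024, "KiB per token")          # 512 KiB per token
print(reserved / 1024**3, "GiB reserved per slot")  # 2.0 GiB reserved per slot
print(f"{1 - used / reserved:.1%} of the slot wasted")  # 92.7% of the slot wasted
```

Under these assumptions each pre-allocated slot pins 2 GiB, most of which a typical short request never touches.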
vLLM’s PagedAttention (Kwon et al., 2023) treats the KV cache like virtual memory: each request’s KV is split into fixed-size pages that are allocated on demand and need not be contiguous. The attention kernel reads the paged KV through a per-request block table, at a small cost compared with the memory it saves. This enabled continuous batching at scale; the PagedAttention paper and the vLLM release were essentially the same artifact.
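The virtual-memory analogy can be sketched as a block table that maps each request's logical token positions to physical pages. The class below is an illustrative toy, not vLLM's API: `PagedKVAllocator`, its method names, and the `block_size=16` default are all assumptions made for the example.

```python
class PagedKVAllocator:
    """Toy block-table allocator for a paged KV cache (illustrative only)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.tables = {}                     # request id -> list of physical blocks
        self.lengths = {}                    # request id -> tokens stored so far

    def append_token(self, req_id):
        """Reserve room for one more token; return False if no block is free."""
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:         # first token, or current block is full
            if not self.free:
                return False
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1
        return True

    def slot(self, req_id, pos):
        """Map a logical token position to (physical block, offset in block)."""
        return self.tables[req_id][pos // self.block_size], pos % self.block_size

    def release(self, req_id):
        """Return a finished request's blocks to the free pool."""
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)
```

A request only ever holds `ceil(length / block_size)` blocks, so waste is bounded by one partially filled block per request, and `release` hands memory back the moment a request finishes.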
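Continuous batching dramatically improves both throughput and latency compared to static batching, because finished requests free their slots immediately and new requests start within about one decode step instead of waiting for the whole batch to drain.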
You still trade throughput for latency by tuning max concurrent batch size — bigger means more throughput, smaller means lower per-request latency. But the Pareto frontier sits much further out than naive batching’s frontier.
Cross-encoder reranker serving has the same shape as LLM serving — variable-length pairs (a query with
A request entering the batch has two phases. Prefill processes the entire input prompt in one parallel pass — compute-bound, fast per-token, builds the initial KV cache. Decode generates output one token at a time — memory-bound, much slower per-token, appends to the KV cache.
Continuous batching has to schedule both. The naive policy (always run prefill before decode) stalls every in-flight decode behind a long prefill, which shows up as pauses in token streaming; the opposite (always run decode first) can delay the prefill of new requests indefinitely under load. Real schedulers fold prefill into the same step as decode: vLLM’s “chunked prefill” splits a long prompt into pieces and interleaves prefill chunks with the decoding of other requests in the batch.
The result is a single fused step that mixes a few decoding requests with a slice of one prefilling request. The math gets ugly — different attention shapes per request, different memory access patterns — but the GPU stays saturated and tail latency stays bounded.
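One way to picture the fused step is a per-step token budget: decoding requests each take one token of budget, and whatever is left goes to a slice of a pending prefill. The function below is a minimal sketch of that idea under assumed names (`build_step`, `token_budget`); it is not vLLM's scheduler, and real systems add preemption, priorities, and memory checks.

```python
def build_step(decoding, prefilling, token_budget):
    """Compose one fused step under a per-step token budget.

    decoding:   list of request ids currently generating (1 token each)
    prefilling: list of [request id, prompt tokens still to prefill];
                the remaining counts are updated in place
    Returns a list of (request id, tokens scheduled this step).
    """
    # Decodes are scheduled first so in-flight requests keep streaming.
    step = [(rid, 1) for rid in decoding[:token_budget]]
    budget = token_budget - len(step)
    # Spend the leftover budget on prefill chunks.
    for entry in prefilling:
        if budget == 0:
            break
        rid, remaining = entry
        chunk = min(remaining, budget)
        if chunk > 0:
            step.append((rid, chunk))
            entry[1] -= chunk
            budget -= chunk
    return step


pending = [["p", 1000]]
print(build_step(["a", "b", "c"], pending, token_budget=256))
# [('a', 1), ('b', 1), ('c', 1), ('p', 253)]
```

The 1000-token prompt is consumed over several steps while the three decoding requests emit a token every step, so no request is stalled for the full length of the prefill.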
Prefix caching stores the KV pages for repeated prompt prefixes: a long system prompt shared across requests, a few-shot template, an agent’s stable instruction block. A request hitting the cache skips prefill on those tokens entirely; the scheduler attaches the cached pages, prefills only the uncached remainder of the prompt, and moves on to decode.
This compounds with continuous batching. Prefill is the expensive phase for long prompts; cache-hit requests skip most of it and join the decode batch almost immediately. The throughput math improves dramatically: for an agent loop with a 10K-token system prompt and 200-token completions, prefix caching reduces each call to a short prefill of the uncached suffix plus decode, and because the shared prefix pages are stored once, the batch can hold many more concurrent requests in the same VRAM budget.
The interaction is the design principle behind vLLM’s whole stack: PagedAttention enables prefix sharing (pages are reference-counted), continuous batching exploits it, and prompt-cache-aware request routing keeps cache hit rates high across replicas.
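Reference-counted page sharing can be sketched in a few lines. This toy keys each full page by the entire token prefix it completes; production systems typically use chained block hashes and eviction instead, and `PrefixCache`, `attach`, and `page_size=4` are hypothetical names and values chosen for the example.

```python
class PrefixCache:
    """Toy prefix cache: each full page is keyed by the token prefix it ends."""

    def __init__(self, page_size=4):
        self.page_size = page_size
        self.pages = {}      # prefix key -> page id
        self.refcount = {}   # page id -> number of requests using the page
        self.next_page = 0

    def attach(self, tokens):
        """Return (page ids for the prompt's full pages, tokens served from cache)."""
        page_ids, cached = [], 0
        for end in range(self.page_size, len(tokens) + 1, self.page_size):
            key = tuple(tokens[:end])
            if key in self.pages:
                cached = end                     # computed by an earlier request
            else:
                self.pages[key] = self.next_page  # new page; caller must prefill it
                self.next_page += 1
            pid = self.pages[key]
            self.refcount[pid] = self.refcount.get(pid, 0) + 1
            page_ids.append(pid)
        return page_ids, cached

    def release(self, page_ids):
        """Drop one reference per page; pages at zero become evictable."""
        for pid in page_ids:
            self.refcount[pid] -= 1


cache = PrefixCache(page_size=4)
system = [1, 2, 3, 4, 5, 6, 7, 8]                 # shared 8-token "system prompt"
pages_a, hit_a = cache.attach(system + [9, 10])
pages_b, hit_b = cache.attach(system + [11, 12, 13, 14, 15])
print(hit_a, hit_b)   # 0 8
```

The second request reuses both system-prompt pages and only needs prefill for its own suffix; the shared pages are stored once and stay alive until every request referencing them has released them.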
After every decode step (generating one token across the batch), the scheduler can drop finished requests, add new ones from the queue, and resume — all without disturbing the running KV caches. This is the unit of scheduling: one token, not one request.
vLLM's PagedAttention treats KV cache like virtual memory — non-contiguous, paged blocks per request. Without it, you'd need to pre-allocate max-length contiguous KV cache per request slot, wasting most of VRAM. With it, you allocate KV cache as needed, and continuous batching can pack many more concurrent requests into the same VRAM.