Continuous Batching

Also known as: dynamic batching, in-flight batching, iteration-level scheduling

TL;DR

The vLLM-style scheduling trick where requests join and leave a batch in-flight, dynamically. Massively improves GPU utilization for variable-length generation compared to naive static batching, and is the default in every modern LLM serving stack.

Continuous batching is the scheduling discipline behind every modern LLM serving stack: vLLM, TGI, TensorRT-LLM, SGLang. It increases throughput 5-20× over static batching by letting requests join and leave the in-flight batch at every decode step.

The unit of scheduling is one token, not one request. That single shift is what unlocked production-scale LLM serving.

[Figure: Continuous batching vs. static batching across four slots; x-axis: decode steps (tokens). Continuous batching: slots refill at every token, requests E and F join as earlier requests finish, GPU utilization ≈ 90%. Static batching: slots holding A, B, and C sit idle until D finishes and the batch clears, GPU utilization ≈ 25%.]

The problem with static batching

Pre-LLM ML serving used “static” or “dynamic” batching: collect requests, run inference on them as a batch, return all results. For ResNet image classification this is fine — every request takes the same time. For LLMs it’s terrible:

  • Generation lengths vary wildly. One request might generate 5 tokens; another in the same batch generates 500.
  • A static batch runs at the speed of the slowest request. Fast requests sit idle for the duration of the slow ones.
  • New incoming requests have to wait for the entire batch to finish before they can join.

GPU utilization falls off a cliff. Empirically, naive batching gets 10-30% utilization on real LLM traffic.
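The utilization loss can be estimated with simple arithmetic. The sketch below is illustrative only: the function name and the output lengths are hypothetical, and it counts slot-steps rather than measuring a real GPU.

```python
def static_batch_utilization(output_lengths):
    """Fraction of slot-steps doing useful work in one static batch.

    The batch runs for max(output_lengths) decode steps, but each slot is
    busy only until its own request finishes.
    """
    steps = max(output_lengths)
    useful_slot_steps = sum(output_lengths)
    return useful_slot_steps / (steps * len(output_lengths))


# Hypothetical batch: three short requests stuck behind one long one.
lengths = [5, 20, 40, 500]
util = static_batch_utilization(lengths)
print(f"useful slot-steps: {util:.1%}")
```

With these made-up lengths, 565 of the 2,000 slot-steps do useful work, roughly 28%, which falls inside the 10-30% range quoted above.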

What continuous batching does

The scheduler operates at token-level granularity instead of request-level:

  1. Each step generates one token across the active batch. Standard transformer decode.
  2. After each step, the scheduler:
    • Removes any requests that just emitted their stop token / hit max length.
    • Pulls new requests from the queue and adds them to the next step’s batch.
    • Recomputes KV cache memory budget.

The result: requests join and leave the in-flight batch every token. Fast requests finish and free up slots immediately; new requests get into compute within ~1 token of arriving. GPU utilization climbs to 70-90%.
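The loop described above can be sketched as a toy simulation. This is a minimal model under stated assumptions, not any serving stack's scheduler: `simulate_continuous` is a hypothetical name, all requests are queued at step 0, every request generates at least one token, and prefill and KV memory limits are ignored.

```python
from collections import deque


def simulate_continuous(output_lengths, max_slots):
    """Simulate iteration-level scheduling; return (steps, utilization)."""
    queue = deque(output_lengths)   # tokens each waiting request will generate
    active = []                     # remaining tokens per in-flight request
    steps = 0
    busy_slot_steps = 0
    while queue or active:
        # Admit waiting requests into any free slots before the step runs.
        while queue and len(active) < max_slots:
            active.append(queue.popleft())
        # One decode step: every active request emits exactly one token.
        steps += 1
        busy_slot_steps += len(active)
        active = [remaining - 1 for remaining in active]
        # Evict requests that just finished, freeing their slots immediately.
        active = [remaining for remaining in active if remaining > 0]
    return steps, busy_slot_steps / (steps * max_slots)


steps, util = simulate_continuous([3, 1, 1, 1, 2], max_slots=2)
print(steps, util)
```

In this model a slot freed at step t is refilled at step t+1, so idle slots appear only when the queue runs dry.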

Why it required PagedAttention

Naive batching pre-allocates one contiguous KV cache block per slot, sized for the maximum generation length. That scheme breaks under continuous batching because (a) different requests have different current lengths, (b) requests come and go mid-flight, and (c) VRAM is the bottleneck and you can’t waste it on max-length pre-allocation.
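To see how much memory max-length pre-allocation can waste, here is a back-of-the-envelope calculation. The model shape (32 layers, 32 KV heads, head dimension 128, fp16) and the sequence lengths are assumptions chosen for illustration, not figures from any specific deployment.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    # Factor of 2: one key vector and one value vector per layer per head.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


# Assumed model shape and lengths (illustrative only).
per_token = kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128)
max_len = 4096       # length the slot is pre-allocated for
actual_len = 300     # tokens the request actually ends up holding

reserved = per_token * max_len
used = per_token * actual_len
wasted_fraction = 1 - used / reserved

print(f"per token: {per_token / 2**20:.2f} MiB")
print(f"reserved per slot: {reserved / 2**30:.2f} GiB")
print(f"wasted: {wasted_fraction:.1%}")
```

Under these assumptions each slot reserves 2 GiB while the request touches about 150 MiB of it, so more than 90% of the reservation is never used.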

vLLM’s PagedAttention (Kwon et al., 2023) treats the KV cache like virtual memory: each request’s KV is split into fixed-size pages that are allocated on demand and need not be contiguous. The attention kernel reads the paged KV directly, without a significant performance penalty. This enabled continuous batching at scale; the PagedAttention paper and the vLLM release were essentially the same artifact.
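The virtual-memory analogy can be made concrete with a toy page allocator. This is a sketch of the general idea only: `PagedKVAllocator` and its methods are hypothetical names, not vLLM's API, and the real system pairs this bookkeeping with a custom attention kernel.

```python
class PagedKVAllocator:
    """Toy allocator mapping each request to a list of physical page ids."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}   # request id -> list of physical page ids
        self.lengths = {}        # request id -> tokens stored so far

    def append_token(self, req_id):
        """Reserve room for one more token, allocating a page only when needed."""
        n = self.lengths.get(req_id, 0)
        table = self.block_tables.setdefault(req_id, [])
        if n % self.page_size == 0:       # existing pages are full, or none yet
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.lengths[req_id] = n + 1
        # Physical location of the new token: (page id, offset within page).
        return table[n // self.page_size], n % self.page_size

    def free(self, req_id):
        """Return a finished request's pages to the pool."""
        self.free_pages.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)


alloc = PagedKVAllocator(num_pages=4, page_size=2)
for req in ["a", "a", "b", "a"]:          # interleaved decode steps
    alloc.append_token(req)
print(alloc.block_tables)                 # request "a" ends up on non-adjacent pages
```

Because a request holds only the pages it has filled, a finished request returns its memory to the pool at once, which is what lets the scheduler admit a new request on the next step.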

The latency-throughput tradeoff, revisited

Continuous batching dramatically improves both throughput and latency compared to static batching, because:

  • New requests don’t wait for batch boundaries — they start within one token.
  • Slot turnover means overall queue depth stays low.

You still trade throughput for latency by tuning max concurrent batch size — bigger means more throughput, smaller means lower per-request latency. But the Pareto frontier sits much further out than naive batching’s frontier.

Where the technique matters for retrieval

Reranker serving has the same shape as LLM serving: variable-length pairs (a query with candidate documents), a forward pass per call, and latency sensitivity. The same engineering pattern applies. For non-LLM components (such as search), continuous batching is unnecessary; those are short, deterministic-cost calls, and naive batching works.

A request entering the batch has two phases. Prefill processes the entire input prompt in one parallel pass — compute-bound, fast per-token, builds the initial KV cache. Decode generates output one token at a time — memory-bound, much slower per-token, appends to the KV cache.

Continuous batching has to schedule both. The naive policy (always run prefill before decode) stalls in-flight decodes, and any other queued requests, behind a long prefill; the opposite (always run decode first) makes prefill latency unbounded. Real schedulers fold prefill into the same step as decode: vLLM’s “chunked prefill” splits a long prompt into pieces and interleaves prefill chunks with decode of other requests in the batch.

The result is a single fused step that mixes a few decoding requests with a slice of one prefilling request. The math gets ugly — different attention shapes per request, different memory access patterns — but the GPU stays saturated and tail latency stays bounded.
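One way to picture the fused step is as a per-step token budget shared between decode and prefill. The sketch below assumes a decode-first policy with leftover budget handed to prefill chunks; `build_step` and its parameters are hypothetical, and production schedulers apply additional constraints such as KV memory and fairness.

```python
def build_step(decoding, prefilling, token_budget):
    """Plan one fused step under a per-step token budget.

    decoding:   ids of requests in decode (each costs 1 token this step).
    prefilling: list of [req_id, remaining_prompt_tokens], mutated in place.
    Returns a list of (req_id, num_tokens) entries for this step.
    """
    # Decode requests go first so their inter-token latency stays bounded.
    plan = [(req_id, 1) for req_id in decoding[:token_budget]]
    budget = token_budget - len(plan)
    # Spend whatever budget is left on slices of pending prompts.
    for entry in prefilling:
        if budget == 0:
            break
        req_id, remaining = entry
        if remaining == 0:
            continue
        chunk = min(remaining, budget)
        plan.append((req_id, chunk))
        entry[1] = remaining - chunk
        budget -= chunk
    return plan


decoding = ["d1", "d2", "d3"]
prefilling = [["p1", 10]]                 # a 10-token prompt still to prefill
print(build_step(decoding, prefilling, token_budget=8))
print(build_step(decoding, prefilling, token_budget=8))
```

Here the 10-token prompt is processed in two chunks of 5 across two steps, and the three decoding requests emit a token in both steps rather than waiting for the whole prefill.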

Prefix caching stores the KV pages for repeated prompt prefixes — a long system prompt shared across requests, a few-shot template, an agent’s stable instruction block. A request hitting the cache skips prefill on those tokens entirely; the scheduler attaches the cached pages and starts decode immediately.

This compounds with continuous batching. The prefill phase is the expensive one for long prompts; cache-hit requests skip it and slot into decode-only batches with millisecond latency. The throughput math improves dramatically — for an agent loop with a 10K-token system prompt and 200-token completions, prefix caching turns each call into a tiny decode-only step, and the batch can hold many more concurrent requests in the same VRAM budget.

The interaction is the design principle behind vLLM’s whole stack: PagedAttention enables prefix sharing (pages are reference-counted), continuous batching exploits it, and prompt-cache-aware request routing keeps cache hit rates high across replicas.
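A minimal sketch of reference-counted prefix sharing follows. It assumes pages are identified by the full token prefix they complete, so two prompts share a page only when everything up to that page matches; `PrefixCache` is a hypothetical class, and a real implementation would hash the prefix rather than store token tuples, and would evict unreferenced pages.

```python
class PrefixCache:
    """Toy prefix cache: full pages of prompt tokens keyed by their whole prefix."""

    def __init__(self, page_size):
        self.page_size = page_size
        self.pages = {}       # prefix (tuple of tokens) -> page id
        self.refcount = {}    # page id -> number of requests attached
        self.next_page = 0

    def attach(self, prompt_tokens):
        """Return (page ids, number of prompt tokens served from cache)."""
        page_ids, cached_tokens = [], 0
        # Only full pages are shareable; a trailing partial page is ignored here.
        full = len(prompt_tokens) - len(prompt_tokens) % self.page_size
        for end in range(self.page_size, full + 1, self.page_size):
            key = tuple(prompt_tokens[:end])
            if key in self.pages:
                cached_tokens += self.page_size      # hit: prefill skipped
            else:
                self.pages[key] = self.next_page     # miss: page must be prefilled
                self.next_page += 1
            page = self.pages[key]
            self.refcount[page] = self.refcount.get(page, 0) + 1
            page_ids.append(page)
        return page_ids, cached_tokens


cache = PrefixCache(page_size=4)
system_prompt = list(range(8))                       # stand-in token ids
print(cache.attach(system_prompt + [100, 101]))
print(cache.attach(system_prompt + [200, 201, 202, 203, 204]))
```

The second request reuses both system-prompt pages and only needs prefill for its own suffix, and the reference counts record that those two pages now have two owners.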

Go further

Why is naive batching so bad for LLMs?

Generation lengths are unpredictable and highly variable. Static batching (collect requests, run them together until all finish) means the whole batch runs at the speed of the longest request. In a batch of 32, one 99th-percentile-long request blocks 31 short ones. GPU utilization drops as fast requests finish and idle slots accumulate.

What does 'iteration-level' scheduling mean?

After every decode step (generating one token across the batch), the scheduler can drop finished requests, add new ones from the queue, and resume — all without disturbing the running KV caches. This is the unit of scheduling: one token, not one request.

What's PagedAttention and why does it matter for batching?

vLLM's PagedAttention treats KV cache like virtual memory — non-contiguous, paged blocks per request. Without it, you'd need to pre-allocate max-length contiguous KV cache per request slot, wasting most of VRAM. With it, you allocate KV cache as needed, and continuous batching can pack many more concurrent requests into the same VRAM.
