Caching Strategies

Also known as: LLM caching, response caching, prompt caching, semantic cache

TL;DR

Three layers of caching for LLM-driven systems: exact-match (request → response), prompt-prefix (KV cache reuse for shared prefixes), and semantic (similar-query reuse via embeddings). Each helps different production workloads in different ways.

Caching is the cheapest cost-reduction lever in any LLM-driven system. Three distinct layers target different parts of the cost surface — they compose, and modern production stacks use all three.

1. Exact-match (request) caching

The simplest: hash (model, prompt, params), store the response. On a cache hit, skip the LLM entirely.

Latency impact. Massive — milliseconds vs seconds.
Cost impact. 100% — no inference happens.
Hit rate. Low for chat (2-10%), higher for FAQ-style traffic (30-60%).
Failure mode. Stale data — if the underlying knowledge changes (e.g., RAG docs updated), cached responses are wrong. Use TTLs and cache invalidation.

Best for: deterministic prompts, stable underlying data, high-traffic queries.

2. Prompt-prefix caching (KV cache reuse)

Most production prompts have a long stable prefix (system prompt, role definition, tools, examples) and a short variable suffix (user message). The KV cache computed for the prefix can be reused across requests — the API surface for this is the cached-input-tokens billing line on every major provider, the serving-stack term is automatic prefix caching in vLLM / TGI.

Best for: agent loops, RAG with stable system prompts, anything with a long prefix shared across many requests. Hit rates of 60-90% on stable system prompts make this a near-universal win.

For the API/TTL specifics, billing multipliers, and minimum-prefix sizes per provider, see prompt caching .

3. Semantic caching

Cache responses keyed by the embedding of the prompt rather than its hash. Two queries with different wording but similar meaning hit the same cache entry.

Best for: high-volume customer-service traffic, FAQ-style RAG, internal knowledge-base lookups. The win is real on the right workload; the silent-wrong-answer trap is also real and has bitten production teams.

For the threshold-tuning recipe, the false-positive trap, and how to validate before shipping, see semantic cache .

How they compose

Production retrieval stack typically layers all three:

Exact-match cache first. Fastest, cheapest, lowest-risk. Catches the obvious cases.
Semantic cache second. Catches paraphrased/synonymous queries. Higher risk, do offline validation.
Prefix cache always on. Free quality-of-life improvement; no risk if the provider supports it.

Where retrieval-specific caching helps

For RAG pipelines, you can also cache the retrieval layer — the top- documents for a given query. Combined with exact-match caching of the LLM response, this gives a fast path for repeat queries even when the LLM response would have been recomputed for any reason. For high-traffic FAQ-style applications, this is often the dominant cost win.

The structural reason: prefix caching wins when many requests share a stable prompt prefix, even when the suffix varies. In a typical production system, the system prompt, role definitions, tool schemas, and few-shot examples make up the first 80-95% of the input tokens, while the user’s actual message is just the last 5-20%. Every request to that endpoint shares those bulk tokens.

Request caching, by contrast, requires the entire prompt (system + suffix) to match exactly. Two users asking the same question with different conversation histories will miss the request cache despite having identical intents. The hit rate ceiling is governed by how much variation exists in the suffix space, which in real traffic is high.

The practical consequence: prefix caching is a near-universal win whenever your provider supports it (Anthropic, OpenAI, Google all do). Request caching is workload-dependent and most useful for read-mostly endpoints (FAQ bots, classifiers, deterministic transformations). Semantic caching sits between them — same hit-rate ceiling as prefix caching for similar-intent queries, but with the silent-wrong-answer risk that prefix caching doesn’t have.

Go further

What's the difference between prompt-prefix caching and KV cache reuse?

They're the same thing, framed differently. Prompt-prefix caching (Anthropic / OpenAI billing concept) means you pay less for input tokens that appeared in a recent request. Mechanically, the provider keeps the KV cache from your previous request alive and reuses it — that's the actual cost saving. The serving stack term is 'prefix caching'; the API term is 'prompt caching'. Same trick.

KV cache Cost per token

When does semantic caching help vs hurt?

Helps when traffic has many semantically similar but textually different queries (FAQ, support tickets, repeated information requests). Hurts when subtle wording matters (legal queries, code questions where a typo changes meaning). The bug surface — returning the cached answer for a query whose phrasing makes it actually different — is real and silent.

Embedding Cosine similarity

What's a good cache hit rate target?

Depends on the layer. Exact-match: 5-20% in chat/agent traffic, 30-60% in FAQ-style traffic. Prefix cache: 60-90%+ on stable system prompts (almost free win). Semantic cache: highly variable — design carefully and measure. Don't deploy semantic caching without an offline replay test against your actual traffic.

Cost per token RAG

← All concepts

The best AI teams build with ZeroEntropy models

Book Demo View docs