Semantic Cache

Also known as: embedding cache, approximate query cache, fuzzy LLM cache

TL;DR

A semantic cache returns a cached LLM response when an incoming query is similar enough — by embedding cosine similarity — to a previous query, rather than requiring exact-string match.

The mechanism is straightforward: embed the query, do a nearest-neighbor lookup in a small index of past queries, and if cosine similarity meets or exceeds a threshold, return the cached response. The economics are compelling on repetitive workloads: a 30% cache hit rate means roughly 30% fewer LLM calls, so LLM spend drops by about 30% (less the small cost of embedding and lookup), and every hit returns in a fraction of the time a generation would take.

How it works mechanically

The cache stores tuples of (query_embedding, query_text, response, metadata). On request:

Embed the incoming query using the same embedding model used to populate the cache.
Query a small approximate-nearest-neighbor index (FAISS, Pinecone, in-memory) for the closest past query.
If cosine_similarity ≥ threshold, return the stored response. Cache hit.
Otherwise, run the LLM, store the new (query, response) in the cache, return the response.
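
The four steps above can be sketched in a few lines. This is a minimal illustration, not a production design: the linear scan stands in for the ANN index, the metadata field is omitted, and `SemanticCache`, `toy_embed`, and `toy_llm` are hypothetical names invented for the example. The toy embedder only exists so the sketch runs without an external service.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class SemanticCache:
    embed: Callable[[str], List[float]]   # assumption: your embedding call
    generate: Callable[[str], str]        # assumption: your LLM call
    threshold: float = 0.92               # illustrative value, tune per deployment
    entries: List[Tuple[List[float], str, str]] = field(default_factory=list)

    def get(self, query: str) -> Tuple[str, bool]:
        """Return (response, was_cache_hit)."""
        vec = self.embed(query)
        # Linear scan stands in for the ANN index; only sensible for small caches.
        best_sim, best_entry = -1.0, None
        for entry in self.entries:
            sim = cosine_similarity(vec, entry[0])
            if sim > best_sim:
                best_sim, best_entry = sim, entry
        if best_entry is not None and best_sim >= self.threshold:
            return best_entry[2], True
        # Cache miss: call the LLM and remember (embedding, query, response).
        response = self.generate(query)
        self.entries.append((vec, query, response))
        return response, False


# Toy stand-ins so the sketch runs on its own.
def toy_embed(text: str) -> List[float]:
    return [1.0, 0.0] if "refund" in text.lower() else [0.0, 1.0]


def toy_llm(text: str) -> str:
    return "answer to: " + text


cache = SemanticCache(embed=toy_embed, generate=toy_llm, threshold=0.9)
first = cache.get("How do I get a refund?")     # miss: calls toy_llm
second = cache.get("Refund process, please?")   # hit: reuses the first response
```

In a real deployment the scan would be replaced by whichever index you already run, and `embed` must be the same model for both population and lookup, as step one requires.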

The threshold is the main knob, and once the embedding model is fixed it is what determines answer quality.

When it pays off

Examples

Customer support: 70%+ of incoming questions are paraphrases of FAQ entries. Hit rate is 40-60%. Big win.
Common code-generation patterns (“write a Python function to read a CSV”). Hit rate 20-40%.
Common SQL or analytics questions over a fixed schema. Hit rate 30-50%.
LLM-as-judge evaluations on a fixed eval set. Hit rate 95%+ if eval set doesn’t change.
Repeated identical agent sub-tasks within a session.

For these workloads, a semantic cache cuts cost per query and average latency simultaneously, because every hit skips the LLM call entirely. For research assistants, novel content generation, or any workload where queries are inherently unique, hit rates approach zero and the cache is dead weight.
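
A quick back-of-the-envelope model makes the "dead weight" point concrete. The sketch below assumes every request pays a fixed embedding-plus-lookup overhead and only misses pay for the LLM call; the function names and the dollar figures in the usage lines are invented for illustration.

```python
def expected_cost_per_query(hit_rate, llm_cost, overhead_cost):
    """Every request pays the embed + lookup overhead; only misses pay for the LLM."""
    return overhead_cost + (1.0 - hit_rate) * llm_cost


def savings_fraction(hit_rate, llm_cost, overhead_cost):
    """Fraction of the no-cache cost saved; negative means the cache costs more than it saves."""
    return 1.0 - expected_cost_per_query(hit_rate, llm_cost, overhead_cost) / llm_cost


# Illustrative numbers only: $0.01 per LLM call, $0.0001 per embed + lookup.
repetitive = savings_fraction(hit_rate=0.30, llm_cost=0.01, overhead_cost=0.0001)
unique = savings_fraction(hit_rate=0.0, llm_cost=0.01, overhead_cost=0.0001)
```

Under this model the savings equal the hit rate minus the ratio of overhead to LLM cost, so a workload with a near-zero hit rate ends up slightly worse off than having no cache at all.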

The false-positive risk

Threshold tuning

The procedure:

Run unfiltered semantic-cache lookups on a sample of production traffic. Record (query, nearest-neighbor cached query, similarity, cached response).
Have a stronger model (or human) judge: would the cached response correctly answer the new query? Label as correct / incorrect.
Plot false-positive rate vs threshold. At threshold 0.85, FPR might be 12%; at 0.92, 4%; at 0.97, 1%.
Pick the threshold where FPR is below your tolerance. For high-stakes flows (medical, legal, billing), tolerance is 0%-1% — threshold 0.97+. For low-stakes (support FAQs), 5-10% may be fine — threshold 0.88-0.92.

The trade-off: lower threshold = higher hit rate = higher cost savings, but more wrong answers. Higher threshold = fewer hits but safer.
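
The sweep in the procedure above might look like the following sketch. It assumes the labeling step has already produced `(similarity, is_correct)` pairs, and it takes false-positive rate to mean the share of would-be cache hits at a given threshold whose cached response was judged incorrect. The function names and sample data are hypothetical.

```python
def false_positive_rate(labeled_pairs, threshold):
    """Return (fpr, n_hits) among pairs that would be served from cache at this threshold."""
    verdicts = [ok for sim, ok in labeled_pairs if sim >= threshold]
    if not verdicts:
        return 0.0, 0
    wrong = sum(1 for ok in verdicts if not ok)
    return wrong / len(verdicts), len(verdicts)


def pick_threshold(labeled_pairs, candidates, max_fpr):
    """Lowest candidate threshold whose FPR is within tolerance, or None if none qualifies."""
    for t in sorted(candidates):
        fpr, n_hits = false_positive_rate(labeled_pairs, t)
        if n_hits > 0 and fpr <= max_fpr:
            return t
    return None


# Made-up labels: (similarity to nearest cached query, judge said cached answer was correct)
pairs = [
    (0.99, True), (0.97, True), (0.95, True), (0.93, False),
    (0.91, True), (0.88, False), (0.86, False),
]
strict = pick_threshold(pairs, [0.85, 0.90, 0.94, 0.97], max_fpr=0.05)
lenient = pick_threshold(pairs, [0.85, 0.90, 0.94, 0.97], max_fpr=0.25)
```

Choosing the lowest qualifying threshold keeps the hit rate as high as the tolerance allows. With only a handful of labeled pairs the estimate is noisy, so a real sweep needs a sample large enough that each threshold bucket has a meaningful number of hits.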

A second guard: pair the semantic cache with an LLM-as-judge verification on a sample of cache hits. If the judge flags >X% as incorrect, raise the threshold.
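
One way to wire up that second guard is a small auditor that judges a random sample of hits and tracks the flagged share over a rolling window. This is a sketch under assumptions: `HitAuditor` is a hypothetical class, `judge` is whatever callable wraps your stronger model, and `max_flag_rate` plays the role of the X% tolerance, whose value is yours to choose.

```python
import random
from collections import deque


class HitAuditor:
    def __init__(self, judge, sample_rate=0.05, window=200, max_flag_rate=0.05, rng=None):
        # judge(query, cached_query, response) -> True if the cached response is acceptable
        self.judge = judge
        self.sample_rate = sample_rate
        self.max_flag_rate = max_flag_rate          # stands in for the X% tolerance
        self.verdicts = deque(maxlen=window)        # rolling window of judge verdicts
        self.rng = rng or random.Random()

    def observe(self, query, cached_query, response):
        """Call on every cache hit; only a random sample is sent to the judge."""
        if self.rng.random() >= self.sample_rate:
            return
        self.verdicts.append(bool(self.judge(query, cached_query, response)))

    def flag_rate(self):
        """Share of sampled hits in the window that the judge rejected."""
        if not self.verdicts:
            return 0.0
        return self.verdicts.count(False) / len(self.verdicts)

    def should_tighten(self):
        return len(self.verdicts) > 0 and self.flag_rate() > self.max_flag_rate
```

The auditor only reports; whether a breach pages someone or raises the threshold automatically is a separate policy decision.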

Embedding choice matters

The cache’s quality depends entirely on whether the embedding model places semantically equivalent queries close and semantically distinct queries far. A bad embedding produces 0.95 cosine similarity between unrelated queries; a good one keeps near-paraphrases at 0.95+ and distinct queries below 0.85.

In practice: use a strong general-purpose embedding (E5, BGE, OpenAI text-embedding-3, zembed) and tune threshold per-deployment. Don’t reuse a generic embedding for high-stakes flows without measuring per-domain false-positive rate.

Cache invalidation

Cached responses go stale when the underlying knowledge changes — product changes, policy updates, retrieved-document churn. Two strategies:

TTL. Every cache entry expires after N hours/days. Crude but reliable.
Source-aware invalidation. If the cached response was generated from retrieved docs, invalidate when those docs change. Requires tracking provenance.

For RAG-style workloads, source-aware is correct but engineering-heavy. TTL is the default; tune based on how fast your knowledge base changes.
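
Both strategies can live in the same store. The sketch below is a simplified illustration keyed by an opaque string rather than by embedding; `InvalidatingStore`, the `source_ids` field, and the injectable `clock` are assumptions made for the example, not part of any particular library.

```python
import time


class InvalidatingStore:
    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so expiry can be tested without sleeping
        self.entries = {}       # key -> (response, created_at, frozenset of source ids)

    def put(self, key, response, source_ids=()):
        """Store a response along with the ids of the documents it was generated from."""
        self.entries[key] = (response, self.clock(), frozenset(source_ids))

    def get(self, key):
        """Return the response, or None if it is missing or older than the TTL."""
        item = self.entries.get(key)
        if item is None:
            return None
        response, created_at, _ = item
        if self.clock() - created_at > self.ttl:
            del self.entries[key]   # lazily evict on read
            return None
        return response

    def invalidate_source(self, source_id):
        """Drop every entry whose provenance includes the changed document."""
        stale = [k for k, (_, _, srcs) in self.entries.items() if source_id in srcs]
        for k in stale:
            del self.entries[k]
        return len(stale)
```

The provenance scan here is linear in the number of entries; a larger cache would likely keep a reverse index from source id to entry keys.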

Composition with other caches

The full LLM-serving cache hierarchy (see caching strategies):

Exact-prompt cache. Hash the full prompt. Always-correct, low hit rate.
Prefix KV-cache. Reuses the system-prompt KV across requests. Server-side, high hit rate on shared prompts.
Semantic cache. Approximate query match. Application-side, hit rate depends on workload.

These compose: exact-prompt first (always-correct cheap win), then semantic (approximate, more savings). Prefix KV-cache lives at the serving layer (vLLM serving) and is orthogonal: it speeds up the requests that miss both application-side caches and still reach the model.
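
The application-side ordering can be sketched as a single function. Everything here is illustrative: `answer`, `semantic_lookup`, `semantic_store`, and `generate` are hypothetical callables you would supply, and the prefix KV-cache does not appear because it sits inside the serving layer behind `generate`.

```python
import hashlib


def prompt_key(prompt):
    """Stable hash of the full prompt for the exact-match layer."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def answer(prompt, exact_cache, semantic_lookup, semantic_store, generate):
    """Return (response, layer) where layer is 'exact', 'semantic', or 'miss'."""
    key = prompt_key(prompt)
    # Layer 1: exact-prompt cache, always correct when it hits.
    if key in exact_cache:
        return exact_cache[key], "exact"
    # Layer 2: semantic cache; semantic_lookup returns a response or None.
    cached = semantic_lookup(prompt)
    if cached is not None:
        return cached, "semantic"
    # Miss: call the model and populate both layers.
    response = generate(prompt)
    exact_cache[key] = response
    semantic_store(prompt, response)
    return response, "miss"
```

Returning the layer name alongside the response makes it straightforward to report hit rates per layer.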

Operational metrics

Monitor: cache hit rate, false-positive rate (sample-checked via LLM-as-judge), threshold drift, cost-per-query before/after cache. Drift detection on the input-query embedding distribution is the early warning when your cache stops being useful — if today’s queries embed to a different region than yesterday’s, your hit rate is about to drop.
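
A very crude drift signal, shown here only as a sketch, is the cosine distance between the mean embedding of a reference window and that of the current window. The function names are hypothetical, and a centroid comparison is a deliberate simplification: it catches a wholesale shift in topic but can miss changes that leave the mean in place.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid_drift(reference, current):
    """1 - cosine similarity of the two window centroids; near 0 means the mean has not moved."""
    return 1.0 - cosine(centroid(reference), centroid(current))


# Toy 2-d embeddings: yesterday's queries vs. a day where the topic mix has flipped.
yesterday = [[1.0, 0.0], [0.9, 0.1]]
today = [[0.0, 1.0], [0.1, 0.9]]
drift = centroid_drift(yesterday, today)
```

The alert level for this score is deployment-specific; it would need calibrating against the day-to-day variation seen while the hit rate was healthy.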

Go further

How do I pick the similarity threshold?

Lower thresholds (e.g., cosine 0.85) give more cache hits but more false positives — the cached answer doesn't actually fit the new query. Higher (0.97+) is safe but rarely hits. Tune empirically: sample 1000 cache-hit pairs, manually label whether the cached response answers the new query, and pick the threshold where false-positive rate drops below your tolerance.


When is a semantic cache worth the complexity?

When your query distribution is repetitive — customer support FAQs, common SQL/data questions, common-pattern code generation. Workloads with high query diversity (research assistants, novel prompts) get near-zero hit rates and aren't worth the cache infrastructure. Measure hit rate on a representative traffic sample before committing.


Should I cache the embedding too?

Yes, for query strings that repeat: re-embedding an identical string on every request is wasted work. Cache (query string → embedding) at the request edge. For an exact repeat, the semantic cache lookup then becomes one nearest-neighbor query with no embedding cost; a string that has not been seen before still has to be embedded once. For very-high-QPS systems, cache embeddings in Redis, or in process memory alongside a small ANN index.
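
For the in-process case, a memoizing wrapper around the embedding call is often enough. The sketch below uses the standard library's `functools.lru_cache`; `make_cached_embedder` and the `embed` callable are hypothetical names, and the cache is keyed on the exact query string, so it only helps for verbatim repeats.

```python
from functools import lru_cache


def make_cached_embedder(embed, maxsize=10_000):
    """Wrap an embedding function so identical query strings are embedded only once."""

    @lru_cache(maxsize=maxsize)
    def cached(query):
        # Store as a tuple so callers cannot mutate the cached vector by accident.
        return tuple(embed(query))

    return cached
```

Usage is a drop-in replacement: build `cached_embed = make_cached_embedder(my_embed)` once at startup and call `cached_embed(query)` wherever the raw embedding call was used. `cached_embed.cache_info()` reports hits and misses if you want to monitor it.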

