LLM Observability

Also known as: LLM monitoring, LLM ops, AI observability

TL;DR

The operational discipline of monitoring LLM-driven systems: tracing per-call inputs/outputs, eval-in-prod against held-out sets, drift detection on inputs and outputs, latency and cost percentiles.

Observability for LLM-driven systems is its own discipline. The standard service-observability playbook (latency, error rate, throughput, traces) is necessary but radically insufficient — LLM systems fail in ways traditional monitoring can’t see.

What’s distinct about LLM failures

A web service fails loudly: 500s, exceptions, timeouts — observable from outside. LLM systems fail silently:

Hallucinated facts that pass any plausibility check.
Subtle quality drift: yesterday’s outputs were great, today’s are mediocre, no error rate change.
Confident wrong answers in the long tail of queries.
Reasoning failures on edge cases that 5xx monitoring can’t catch.
Retrieval drift: yesterday the top-1 doc was relevant; today it’s silently 2nd or 3rd.

The 200-OK with a wrong answer is the canonical LLM failure. Observability has to find it.

Every LLM upgrade should produce visible movement on the eval-in-prod dashboard within 48 hours. If it doesn’t, you don’t have observability — you have logs.

The four layers

1. Trace layer: capture everything

For every LLM call, log: inputs (prompt, retrieved docs, system message), outputs (response, tokens, log-probs), parameters (model, temperature), latency, cost. Production tools (Langfuse, Helicone, LangSmith, Datadog LLM Observability) handle this; the data shape is “rich span with full text fields”.
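As a rough illustration of that data shape, here is a minimal sketch of one trace record. The class name, field names, and example values are assumptions made for this example; they do not reflect the schema of any of the tools named above.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class LLMTrace:
    """One LLM call as a rich span. Field names are illustrative, not a vendor schema."""
    model: str
    temperature: float
    system_message: str
    prompt: str
    response: str
    tokens_in: int
    tokens_out: int
    latency_ms: float
    cost_usd: float
    retrieved_docs: list = field(default_factory=list)
    logprobs: list = field(default_factory=list)
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        # Full text fields are kept as-is so the call can be replayed or judged later.
        return json.dumps(asdict(self))


# Hypothetical values, for illustration only.
trace = LLMTrace(
    model="example-model",
    temperature=0.2,
    system_message="You are a support assistant.",
    prompt="How do I reset my password?",
    response="Open Settings, then choose Reset password.",
    tokens_in=42,
    tokens_out=11,
    latency_ms=830.0,
    cost_usd=0.0004,
    retrieved_docs=["kb/password-reset"],
)
print(trace.to_json())
```

Serializing to one JSON object per call keeps the record easy to ship to whatever trace store is in use.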

Volume note: if you’re handling 10K req/s and storing full prompts and responses, the storage bill is real. Sample aggressively at high volumes; keep full traces for low-volume but high-stakes flows.

2. Eval-in-prod layer

Sample a fraction of traffic and run a parallel quality evaluation:

LLM-as-judge. A stronger model rates the response on dimensions you care about (relevance, faithfulness, completeness).
Reference comparison. For tasks with known answers, compute exact-match or fuzzy match.
Human-in-the-loop review. For high-stakes flows (medical, legal, financial), have humans spot-check around 0.1% of responses.
Self-consistency. Run the same query 3 times; flag high disagreement as suspicious.

The output: time-series quality metrics alongside latency and cost. Quality regressions show up as a graph going down, not as 5xxes.
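The self-consistency check is the simplest of these to sketch. The version below assumes short-answer tasks where normalized exact match is a meaningful notion of agreement; free-form answers would need a semantic comparison instead. The function names and the 0.6 threshold are assumptions chosen for illustration.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count as disagreement."""
    return " ".join(text.lower().split())


def agreement(responses: list[str]) -> float:
    """Fraction of responses that match the most common normalized answer."""
    counts = Counter(normalize(r) for r in responses)
    return counts.most_common(1)[0][1] / len(responses)


def is_suspicious(responses: list[str], min_agreement: float = 0.6) -> bool:
    """Flag a query when repeated runs disagree too much."""
    return agreement(responses) < min_agreement


# Three runs of the same query (responses are made up for the example).
stable = ["Paris", "paris ", "PARIS"]
unstable = ["Paris", "Lyon", "Nice"]
print(agreement(stable), is_suspicious(stable))
print(agreement(unstable), is_suspicious(unstable))
```

With three runs, a 0.6 threshold tolerates one dissenting answer and flags the case where all three differ.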

A 10K RPS stack producing tens of millions of LLM calls per day can’t judge every call — judge cost would swallow production cost. The standard pattern is stratified sampling: 0.1% of high-volume routine traffic, 1% of medium-volume, 100% of low-volume high-stakes flows. Set the per-route rate so each route delivers roughly equal judgment volume per day; that equalizes statistical power to detect regressions across routes. Always log the sampling rate per call; without it, downstream aggregation silently miscounts the population behind each metric.
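A minimal sketch of that sampling pattern follows, assuming hypothetical route names, a `sample_rate` field stored with each judged call, and a boolean pass/fail judgment. The aggregation weights each judged call by the inverse of its sampling rate, which is one common way to recover a population-level estimate.

```python
import random


def rate_for_route(daily_calls: int, target_judgments: int) -> float:
    """Pick a rate so each route yields roughly the same number of judgments per day."""
    return min(1.0, target_judgments / daily_calls)


# Hypothetical routes and rates matching the tiers described above.
SAMPLE_RATES = {"chat": 0.001, "search": 0.01, "medical": 1.0}


def should_judge(route: str, rng: random.Random) -> tuple[bool, float]:
    """Return the sampling decision together with the rate, so the rate can be logged."""
    rate = SAMPLE_RATES[route]
    return rng.random() < rate, rate


def weighted_pass_rate(judged: list[dict]) -> float:
    """Each judged call stands in for 1 / sample_rate production calls."""
    total = sum(1.0 / j["sample_rate"] for j in judged)
    passed = sum(1.0 / j["sample_rate"] for j in judged if j["passed"])
    return passed / total


judged = [
    {"route": "chat", "sample_rate": 0.001, "passed": True},
    {"route": "medical", "sample_rate": 1.0, "passed": False},
]
naive = sum(j["passed"] for j in judged) / len(judged)
print(naive, weighted_pass_rate(judged))
```

The naive average treats the two judged calls as equally representative; the weighted one reflects that the chat sample stands in for about a thousand calls, which is why the rate has to be logged with every judgment.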

3. Drift layer

Drift detection catches when input or output distributions change in ways that are likely to break things:

Input drift. Are users asking different kinds of questions than last month? KS test on query length, topic embedding distribution, language mix.
Output drift. Is the model producing different-shaped responses? Length distribution, refusal rate, common token usage.
Retrieval drift. Are the top-k documents shifting in quality? Score calibration drift.

Drift doesn’t always mean “broken” — but it usually means “investigate”.
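As one way to make the input-drift check concrete, the sketch below computes a two-sample KS statistic on query lengths in pure Python and compares it to the standard large-sample critical value. The synthetic data, function names, and the 0.05 significance level are assumptions for illustration; a library implementation such as SciPy's two-sample KS test would normally be preferable in production.

```python
import math
import random


def ks_statistic(a: list[float], b: list[float]) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past x so ties are handled consistently.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def drifted(baseline: list[float], current: list[float], alpha: float = 0.05) -> bool:
    """Large-sample approximation: reject 'same distribution' when D exceeds the critical value."""
    n, m = len(baseline), len(current)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(baseline, current) > critical


# Synthetic query lengths: last month's baseline vs. a week of longer queries.
rng = random.Random(0)
last_month = [rng.gauss(40, 10) for _ in range(500)]
this_week = [rng.gauss(60, 10) for _ in range(500)]
print(ks_statistic(last_month, this_week), drifted(last_month, this_week))
```

A flag from this check is a prompt to look at what changed in the traffic, not proof that quality has degraded.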

4. Cost and latency layer

Same as traditional ops, but with LLM-specific axes: tokens in, tokens out, cache hit rate, per-model split. P99 latency is the number that pages on-call. Cost-per-token breakdowns by feature, customer, and route drive the budget conversation.
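A small sketch of those two numbers, latency percentiles and cost broken down by route, is shown below. The model names and per-token prices are hypothetical placeholders, and nearest-rank is only one of several percentile definitions.

```python
import math
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens: (input, output).
PRICE_PER_M = {"small-model": (0.50, 1.50), "large-model": (5.00, 15.00)}


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile for p in 1..100."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def call_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Cost of one call from its token counts and the per-model price table."""
    price_in, price_out = PRICE_PER_M[model]
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000


def cost_by_route(calls: list[dict]) -> dict:
    """Sum per-call cost under each route label."""
    totals = defaultdict(float)
    for c in calls:
        totals[c["route"]] += call_cost(c["model"], c["tokens_in"], c["tokens_out"])
    return dict(totals)


latencies_ms = list(range(1, 101))  # synthetic: 1..100 ms
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95), percentile(latencies_ms, 99))

calls = [
    {"route": "chat", "model": "small-model", "tokens_in": 2000, "tokens_out": 1000},
    {"route": "search", "model": "large-model", "tokens_in": 1000, "tokens_out": 500},
]
print(cost_by_route(calls))
```

The same grouping works for feature or customer labels if those are recorded on each call.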

Practical setup

A typical production stack:

Tracing: Langfuse or LangSmith for full-trace storage; sampled traces in Datadog or Honeycomb for cross-service correlation.
Evals: Custom LLM-as-judge prompts for the dimensions that matter; periodic offline evals against a golden dataset to catch regressions across model upgrades.
Drift: Embedding-based input topic monitoring; per-stage score calibration checks for retrieval pipelines.
Dashboard: P50/P95/P99 latency, $/call, eval scores over time, drift alerts.

If a deployment changes nothing on the dashboard, the dashboard is wrong, not the deployment.

Go further

What's distinct about observing an LLM system vs a traditional service?

Traditional services fail loudly (5xx, exceptions, latency spikes). LLMs fail silently — wrong but plausible-sounding outputs, gradual quality drift, hallucinations that pass validation. Observability has to capture not just whether the call succeeded, but whether the response was correct, which requires evals running continuously in production.


What does eval-in-prod look like?

Sample 0.1-1% of production traffic. Run a parallel evaluation: LLM-as-judge scoring, comparison against a stronger model, human review for high-stakes traffic. Aggregate quality metrics over time alongside latency/cost. The goal: detect quality regressions within hours, not weeks.


Which metrics actually predict user pain?

End-to-end task success rate (was the user query actually answered correctly?). Refusal rate (is the model refusing too often?). Citation rate / faithfulness (in RAG: are answers grounded in retrieved docs?). P99 latency. Cost per session. Pure throughput and error rate are necessary but insufficient — they miss silent quality drift entirely.

