Drift Detection

Also known as: distribution drift, data drift, model drift, concept drift

TL;DR

Monitoring distributional shift in inputs, outputs, or intermediate signals of a retrieval or LLM pipeline. The discipline that catches 'the metric is silently moving' before users notice.

Drift detection monitors distributional shift in a retrieval or LLM pipeline. The inputs and outputs are sequences and embeddings rather than the easy-to-monitor scalars of traditional services, but the failure mode is the same: yesterday’s distribution is not today’s, and your downstream metrics may be silently moving in response.

What drifts

Three layers of drift in a typical retrieval/LLM stack:

Input drift. Users start asking different questions. New product launches, news cycle, seasonality, new customer cohort. The query distribution moves; everything downstream sees a shifted load.
Intermediate drift. Retrieved-document distribution shifts. Reranker score distribution shifts. Embedding-space density shifts. These are often the earliest signals — they move before user-facing metrics do.
Output drift. Response length distribution shifts. Refusal rate drifts. Hallucination rate climbs.

A pipeline monitoring only user-facing metrics (NDCG@10 on a static eval set, end-to-end answer quality) is blind to upstream drift until customer-impacting failures stack up. Drift detection is the early warning.

How to detect it

The standard tools:

KS tests / population stability index. Compare current-window distributions of a scalar metric (query length, score, latency) to a baseline window. A common rule of thumb treats PSI > 0.2 as a significant shift, but the right cutoff depends on the system.
Embedding-space drift. Compute the centroid and covariance of embedded queries (or documents) over a sliding window. Track distance to the historical centroid. Cluster shift, density shift, novel cluster emergence are all detectable.
Score distribution drift. A well-calibrated reranker’s score distribution should be stable across queries of similar type. If yesterday’s “high-confidence” scores were 0.7-0.9 and today’s are 0.4-0.6, something upstream changed — even if the order is preserved.
Reference-set replay. Periodically run a fixed query set through the pipeline and compare to a baseline output. Cheap to implement, catches regressions hard to see in aggregate metrics.
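As a rough illustration of the scalar tests above, here is a minimal sketch of PSI and the two-sample KS statistic using NumPy. The function names, the 10-bin quantile binning, the `eps` floor, and the synthetic "query length" data are all assumptions made for this example, not a prescribed implementation. In practice `scipy.stats.ks_2samp` gives the same statistic plus a p-value.

```python
import numpy as np

def psi(baseline, current, bins=10, eps=1e-6):
    """Population stability index between two 1-D samples.

    Bin edges are baseline quantiles, so each bin holds roughly equal
    baseline mass. eps (an assumed floor) keeps empty bins from giving log(0).
    """
    baseline = np.asarray(baseline, dtype=float)
    current = np.asarray(current, dtype=float)
    edges = np.quantile(baseline, np.linspace(0.0, 1.0, bins + 1))
    inner = edges[1:-1]  # values outside the baseline range fall into the end bins
    b_idx = np.searchsorted(inner, baseline, side="right")
    c_idx = np.searchsorted(inner, current, side="right")
    b_frac = np.clip(np.bincount(b_idx, minlength=bins) / baseline.size, eps, None)
    c_frac = np.clip(np.bincount(c_idx, minlength=bins) / current.size, eps, None)
    return float(np.sum((c_frac - b_frac) * np.log(c_frac / b_frac)))

def ks_statistic(baseline, current):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    baseline = np.sort(np.asarray(baseline, dtype=float))
    current = np.sort(np.asarray(current, dtype=float))
    grid = np.concatenate([baseline, current])
    cdf_b = np.searchsorted(baseline, grid, side="right") / baseline.size
    cdf_c = np.searchsorted(current, grid, side="right") / current.size
    return float(np.max(np.abs(cdf_b - cdf_c)))

# Synthetic data standing in for a logged scalar such as query length.
rng = np.random.default_rng(0)
base_lengths = rng.normal(40, 10, 5000)     # baseline window
same_lengths = rng.normal(40, 10, 5000)     # current window, no drift
shifted_lengths = rng.normal(55, 10, 5000)  # current window, mean moved 1.5 sigma

psi_same = psi(base_lengths, same_lengths)
psi_shifted = psi(base_lengths, shifted_lengths)
ks_same = ks_statistic(base_lengths, same_lengths)
ks_shifted = ks_statistic(base_lengths, shifted_lengths)
print(f"PSI same={psi_same:.4f} shifted={psi_shifted:.4f}")
print(f"KS  same={ks_same:.4f} shifted={ks_shifted:.4f}")
```

Both statistics stay near zero when the two windows share a distribution and grow sharply under the shifted sample, which is the behavior a scalar drift monitor relies on.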

The naive choice — a single rolling 24-hour window with a KS test against the previous 24 hours — has two failure modes. First, it’s blind to gradual drift: each day’s window looks similar to the last, but a month later the distribution has moved 3σ. Second, it’s deafened by intra-day cyclicality: morning enterprise traffic and evening consumer traffic look like drift to each other every single day. Neither failure mode is fixable by tweaking the test threshold; you need a different windowing.

The shape that actually works is a tiered window structure. Maintain a long-baseline reference (typically 30 days, sampled hourly to neutralize cyclicality), a medium window (7 days), and a short window (24 hours). Compute a distributional distance for each pair of windows: short-vs-medium catches sudden shifts, medium-vs-long catches gradual drift, short-vs-long flags acute regressions. Each pair gets its own threshold tuned to its noise floor.
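A sketch of the tiered comparison, under several assumptions: the metric is a scalar logged in hourly batches, the three windows are nested (the short window is the tail of the medium one, which is the tail of the long one), and KS is the distance. The names `WINDOW_HOURS` and `tiered_distances` are made up for this example. The synthetic series drifts by three standard deviations over 30 days on top of a daily cycle, which is the gradual case a day-over-day test misses.

```python
import numpy as np

WINDOW_HOURS = {"short": 24, "medium": 24 * 7, "long": 24 * 30}

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

def tiered_distances(hourly_samples):
    """hourly_samples: list of 1-D arrays, one per hour, oldest first."""
    if len(hourly_samples) < WINDOW_HOURS["long"]:
        raise ValueError("need at least 30 days of hourly samples")
    w = {name: np.concatenate(hourly_samples[-hours:])
         for name, hours in WINDOW_HOURS.items()}
    return {
        "short_vs_medium": ks_statistic(w["short"], w["medium"]),
        "medium_vs_long": ks_statistic(w["medium"], w["long"]),
        "short_vs_long": ks_statistic(w["short"], w["long"]),
    }

# Synthetic metric: daily cycle plus slow drift of 3 sigma across 30 days.
rng = np.random.default_rng(1)
hourly = []
for t in range(24 * 30):
    cycle = 0.5 * np.sin(2 * np.pi * (t % 24) / 24)
    drift = 3.0 * t / (24 * 30)
    hourly.append(rng.normal(cycle + drift, 1.0, 50))

# The naive check: last 24 hours against the previous 24 hours.
day_over_day = ks_statistic(np.concatenate(hourly[-24:]),
                            np.concatenate(hourly[-48:-24]))
dists = tiered_distances(hourly)
print(f"day-over-day KS: {day_over_day:.3f}")
for pair, value in dists.items():
    print(f"{pair}: {value:.3f}")
```

On this data the day-over-day statistic stays close to its noise floor while medium-vs-long is clearly elevated. Because the windows are nested, the overlap dampens the distances somewhat; disjoint windows are an equally reasonable design.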

For the test itself, KS is fine for univariate scalars but fragile in higher dimensions. For embeddings specifically, MMD (Maximum Mean Discrepancy) with an RBF kernel handles distributional drift in vector space better than centroid distance — centroid drift can be zero while the distribution has bifurcated into two new clusters. Energy distance is a cheaper alternative that catches similar failure modes.
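To make the centroid blind spot concrete, here is a minimal sketch of a (biased) squared-MMD estimate with an RBF kernel. The function name, the median-heuristic bandwidth, the 8-dimensional synthetic "embeddings", and the sample sizes are assumptions for illustration only. The drifted sample is split into two clusters placed symmetrically around the baseline mean, so its centroid barely moves.

```python
import numpy as np

def rbf_mmd2(x, y, gamma=None):
    """Biased estimate of squared MMD with kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    z = np.vstack([x, y])
    sq = np.sum(z ** 2, axis=1)
    # Pairwise squared distances; clip tiny negatives from floating-point error.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (z @ z.T), 0.0)
    if gamma is None:
        # Median heuristic on the pooled sample: a common default, not a tuned value.
        gamma = 1.0 / np.median(d2[d2 > 0])
    k = np.exp(-gamma * d2)
    n = x.shape[0]
    return float(k[:n, :n].mean() + k[n:, n:].mean() - 2.0 * k[:n, n:].mean())

rng = np.random.default_rng(2)
dim, n = 8, 400
baseline = rng.normal(size=(n, dim))
same = rng.normal(size=(n, dim))

# Bifurcated sample: half the points move to +3, half to -3 along the first axis.
split = rng.normal(size=(n, dim))
split[:, 0] += 3.0 * np.repeat([-1.0, 1.0], n // 2)

centroid_shift = float(np.linalg.norm(split.mean(axis=0) - baseline.mean(axis=0)))
mmd_same = rbf_mmd2(baseline, same)
mmd_split = rbf_mmd2(baseline, split)
print(f"centroid shift: {centroid_shift:.3f}")
print(f"MMD^2 same: {mmd_same:.4f}  split: {mmd_split:.4f}")
```

The centroid distance stays small relative to the 3-unit cluster offset, while the MMD estimate for the split sample is far above the same-distribution value. The pairwise kernel matrix is quadratic in sample size, so in practice this would run on a subsample of each window.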

The threshold question is brutal in practice. A PSI threshold of 0.2 catches “real” drift but also fires on every Black Friday. The honest answer is that drift thresholds need a several-week burn-in where you log everything, label retrospectively, and tune. Anyone offering you a universal threshold is selling something.
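One way the burn-in could be turned into a threshold, sketched under assumptions: every window's drift score was logged, each was labeled retrospectively as real drift or not, and the team has picked a tolerable false-alarm rate. The function `tune_threshold`, the 1% target, and the synthetic scores are hypothetical. Expected events such as a seasonal sale would be labeled benign here, which pushes the threshold up accordingly.

```python
import numpy as np

def tune_threshold(scores, is_real_drift, max_false_alarm_rate=0.01):
    """Pick the threshold as a high quantile of benign burn-in scores.

    Returns (threshold, recall on the windows labeled as real drift).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(is_real_drift, dtype=bool)
    benign = scores[~labels]
    if benign.size == 0:
        raise ValueError("burn-in needs at least one benign window")
    threshold = float(np.quantile(benign, 1.0 - max_false_alarm_rate))
    recall = float(np.mean(scores[labels] > threshold)) if labels.any() else float("nan")
    return threshold, recall

# Synthetic burn-in log: 1000 benign windows and 50 labeled as real drift.
rng = np.random.default_rng(3)
benign_scores = rng.gamma(2.0, 0.03, 1000)
drift_scores = rng.gamma(2.0, 0.03, 50) + 0.3
scores = np.concatenate([benign_scores, drift_scores])
labels = np.concatenate([np.zeros(1000, dtype=bool), np.ones(50, dtype=bool)])

threshold, recall = tune_threshold(scores, labels)
false_alarm_rate = float(np.mean(benign_scores > threshold))
print(f"threshold={threshold:.3f} recall={recall:.2f} false alarms={false_alarm_rate:.3f}")
```

Running this per window pair gives each tier a threshold matched to its own noise floor rather than a single global cutoff.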

What to monitor at each pipeline stage

Query input. Length distribution, language distribution, embedding centroid, novel-token rate, query-type classifier output histogram.
First-pass retrieval. Recall against a reference query set, score distribution histogram, top-k overlap rate vs baseline, retrieval latency tail.
Reranker. Score distribution, calibration check (predicted-relevance vs reference-relevance), top-k stability, per-query score variance.
LLM output. Response length distribution, refusal rate, hedging-language rate, tool-call distribution, parse-failure rate on structured outputs.
End-to-end. User feedback signals (thumbs, regenerate rate), conversation length, abandonment rate.
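Of the signals listed above, top-k overlap against a baseline is among the simplest to compute. The sketch below assumes each run of a fixed reference query set is stored as a mapping from query ID to a ranked list of document IDs; the function names and the toy IDs are hypothetical.

```python
def topk_overlap(baseline_ids, current_ids, k=10):
    """Fraction of the baseline top-k that is still in the current top-k."""
    base = set(baseline_ids[:k])
    cur = set(current_ids[:k])
    if not base:
        return 1.0  # nothing was retrieved at baseline, so nothing was lost
    return len(base & cur) / len(base)

def replay_overlap(baseline_runs, current_runs, k=10):
    """Mean top-k overlap over the queries present in both runs."""
    shared = sorted(baseline_runs.keys() & current_runs.keys())
    if not shared:
        raise ValueError("no shared reference queries between the two runs")
    per_query = {q: topk_overlap(baseline_runs[q], current_runs[q], k) for q in shared}
    return sum(per_query.values()) / len(per_query), per_query

# Toy reference set: ranked document IDs per query, baseline vs. today.
baseline_runs = {"q1": ["a", "b", "c", "d"], "q2": ["e", "f", "g", "h"]}
current_runs = {"q1": ["a", "b", "c", "x"], "q2": ["e", "f", "y", "z"]}

mean_overlap, per_query = replay_overlap(baseline_runs, current_runs, k=4)
print(mean_overlap, per_query)
```

Tracking the per-query values as well as the mean helps separate a broad shift from a few queries whose results changed completely.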

Why retrieval pipelines drift differently than chat

Chat-style LLM observability focuses on response quality. Retrieval pipelines have an extra surface area:

Index drift. Documents are added, removed, modified. Yesterday’s relevant doc may not exist today.
Embedding model drift. If you re-embed the corpus or upgrade the model, the embedding space shifts; old query embeddings may retrieve different docs.
Reranker drift. Same query, same docs, drifted reranker scores → different top-k.

Each is invisible to a global “answer quality” metric until it accumulates. Per-stage drift monitors catch them earlier.

What to do about it

Detected drift is an alarm, not a fix. Workflow:

Triage. Is this expected (new product launched, traffic spike from a marketing campaign) or anomalous (data ingestion broken, model swap pushed silently)? Look at the change at the stage where drift was detected and trace upstream.
Decide. Three options:
- Rebaseline. Drift is benign; the new distribution is the new normal. Update baseline windows.
- Retrain. The model needs fresher data. Generate synthetic data from the new query distribution and run it through the retraining pipeline.
- Roll back. An upstream change broke things. Revert.

The discipline is iterative: the more drift you investigate, the better your sense of which signals matter for your specific system. Drift detection is also one of the strongest arguments for calibrated reranker outputs: drift in uncalibrated scores is hard to tell apart from noise, while drift in calibrated scores is interpretable.

Go further

What's the difference between data drift and concept drift?

Data drift: the input distribution changes (users start asking different questions). Concept drift: the relationship between input and correct output changes (the world's knowledge moved on, your training data is stale). Detection is similar; remediation differs — data drift suggests the model is fine but seeing new traffic; concept drift suggests retraining.


How do you detect drift in a retrieval pipeline specifically?

Track per-stage distributions: query embedding centroid and spread, retriever score distribution per query type, reranker score distribution, and top-k overlap rate vs a reference. Run KS tests against a baseline window and watch embedding-space cluster shift over time. The score-distribution shift is often the earliest signal, moving long before user-facing metrics do.


What do you actually do when drift is detected?

First, triage: is this benign (new product launched, traffic shifted) or anomalous (corrupt data ingestion, upstream model change)? Then either accept the new distribution (rebaseline), retrain on fresher data, or roll back the upstream change. Drift detection is the alarm; the response is judgment, and that judgment improves over time as you learn your system's normal variance.

