Pattern · I · 9 min read · 8 sections · 8 code samples · Updated May 17, 2026
This pattern is called

Eval Drift

The offline metric improves while user-signal degrades because the eval set was sampled at time T from a distribution that has since moved.

Symptom

The shipped offline retrieval metric (NDCG@10, say, or win-rate against a baseline) climbs week over week. Dashboards stay green. Candidate models keep landing.

User-side signal moves in the opposite direction. Support tickets rise. NPS falls. Power-user feedback turns sharply negative on cohorts that had quietly become a meaningful share of traffic.

“Our offline metrics keep climbing every week but our users keep saying it feels worse.”

The concrete signature:

  • Aggregate NDCG@10 up week-over-week for four releases or more in a row. Each release is honestly better than the last on the dashboard.
  • Tickets cluster in one or two feature areas that did not exist when the eval was last refreshed (a model garden, an enterprise tier, agent runs, a vertical pulled in by a recent campaign).
  • Per-cohort CSAT diverges from aggregate CSAT. Aggregate is flat or slightly up. The newest, fastest-growing cohort is the one falling.
  • Internal users like the system more than external ones do. Internal queries hit the regions the eval set was built around.

Neither side is lying. The metric is computing exactly what it has always computed. The users are reporting exactly what they are seeing.

Mechanism

An offline eval set is a snapshot of the user-intent distribution at the moment it was sampled. Every retrieval metric computed against it is conditional on that snapshot. When the snapshot is months old and the product has shipped a major feature, opened to a new tier, or run an external campaign in the interim, the snapshot is no longer the live distribution — and the metric is no longer measuring the experience users are having today.

The drift is almost always silent because it is gradual and correlated with growth. The growth of the live distribution looks identical to healthy adoption. The shift in what that distribution contains looks like nothing at all on a retrieval dashboard.

Three places the drift hides:

  • Cohort drift. The new user mix searches for different things than the old user mix did. The eval set captured the old mix.
  • Vocabulary drift. The product gains a feature (“model garden”, “agent runs”, “evals”) whose vocabulary is in the corpus but absent from the eval queries.
  • Intent drift. The same surface query expresses a different underlying need. “how do I configure X” in v1 meant the dashboard; in v2 it means the API.
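Of the three, vocabulary drift is the easiest to surface mechanically: list the terms that appear in live queries but never in the eval queries. The sketch below is a minimal illustration, not part of the original diagnostic; the tokenizer, the `min_count` cutoff, the function name `unseen_vocabulary`, and the fixture queries are all assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-alphanumerics (an assumed, deliberately simple tokenizer)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if len(t) > 1]

def unseen_vocabulary(eval_queries, live_queries, min_count=2):
    """Return (term, count) pairs for live-query terms absent from the eval set."""
    eval_vocab = {t for q in eval_queries for t in tokenize(q)}
    live_counts = Counter(t for q in live_queries for t in tokenize(q))
    unseen = [(t, c) for t, c in live_counts.items()
              if t not in eval_vocab and c >= min_count]
    # Most frequent first, ties broken alphabetically.
    return sorted(unseen, key=lambda tc: (-tc[1], tc[0]))

# Hypothetical fixture for illustration only.
eval_queries = ["how do I set up retrieval", "rerank top-k", "free tier query limit"]
live_queries = ["agent run trace ID", "agent timeout policy",
                "SAML config", "SAML audit log", "rerank top-k"]

result = unseen_vocabulary(eval_queries, live_queries)
print(result)
```

A long list of frequent unseen terms points at vocabulary drift. Cohort and intent drift can leave the vocabulary unchanged, so an empty list does not clear the pattern.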

A canonical shape, with rounded numbers. A docs-search product built its eval set when traffic was 100% Tier-1 self-serve. Since then:

  • An enterprise tier shipped and now drives 15–25% of queries (compliance, SAML, audit-log, custom-roles).
  • An “agent runs” feature shipped and now accounts for 8–12% of queries (function-calling, tool-schemas, run-IDs).
  • An adjacent vertical joined and contributes 5–10% of queries (regulated-data, retention-policy, PII-handling).

Roughly 30–45% of live traffic now lives in regions the eval set never saw. The metric, computed on the original snapshot, keeps improving, honestly, on the old distribution. It is silent on the slice where the user pain is concentrated.
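The arithmetic behind that share, and its effect on a traffic-weighted metric, can be sketched as follows. The share ranges come from the three bullets above; the per-slice scores passed to `live_weighted` are hypothetical numbers chosen only to show the mechanism, and the function itself is an illustrative helper.

```python
# Share ranges (low, high) from the three bullets above, as fractions of live queries.
new_slices = {
    "enterprise tier": (0.15, 0.25),
    "agent runs": (0.08, 0.12),
    "adjacent vertical": (0.05, 0.10),
}

unseen_low = sum(low for low, _ in new_slices.values())
unseen_high = sum(high for _, high in new_slices.values())
print(f"unseen share of live traffic: {unseen_low:.0%} to {unseen_high:.0%}")

def live_weighted(metric_seen, metric_unseen, unseen_share):
    """Traffic-weighted metric across the slice the eval covers and the slice it does not."""
    return (1 - unseen_share) * metric_seen + unseen_share * metric_unseen

# Hypothetical scores: the eval-visible slice improves from 0.70 to 0.72,
# the unseen slice stays at 0.40, and the unseen share grows from 10% to 40%.
before = live_weighted(metric_seen=0.70, metric_unseen=0.40, unseen_share=0.10)
after = live_weighted(metric_seen=0.72, metric_unseen=0.40, unseen_share=0.40)
print(f"eval metric: 0.70 -> 0.72, live-weighted: {before:.3f} -> {after:.3f}")
```

Under these assumed numbers the eval metric rises while the live-weighted figure falls, which is the symptom described at the top of the page.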

The eval tracks the old intent, the old vocabulary, the old cohort. The metric is honest. The metric is also wrong about today’s product.

Diagnostic

Five tests, ordered cheap-to-expensive. Run them in order. The first one that fires is enough to confirm the pattern; the later ones localize and quantify it.

Test 1 — eval-set age (30 seconds, one SQL query)

SELECT
  MAX(created_at)               AS most_recent_eval_query,
  CURRENT_DATE - MAX(created_at) AS age_days,
  COUNT(*)                       AS n_queries
FROM eval_queries;

| age_days | reading |
|----------|---------|
| 0–60     | healthy; this pattern is unlikely to be the dominant cause |
| 60–180   | suspect; run Test 2 |
| 180+     | almost certainly firing even if other patterns are also active |

A single number, no judgment required. This test alone catches roughly half of real eval-drift cases.

Test 2 — coverage of recent complaints (5 minutes, small Python)

The first directional test. For the last 50 user complaints (support tickets, downvotes, NPS comments), count how many have an analogous query in the eval set.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Inline fixture — replace with real loaders.
eval_queries = [
    "how do I set up retrieval", "ndcg metric explanation",
    "rerank top-k", "embedding dimension", "what is hybrid search",
    "deduplication of documents", "fine tune reranker",
    "chunking strategy markdown", "free tier query limit",
    "store embeddings cheaply",
]
recent_complaints = [
    "SAML config for enterprise tier", "audit log retention setting",
    "custom roles for org admin", "function-calling tool schema",
    "agent run trace ID", "PII redaction in retrieved chunks",
    "compliance certifications SOC2", "regulated data residency EU",
    "rerank top-k", "fine tune reranker",
]

vec = TfidfVectorizer(min_df=1).fit(eval_queries + recent_complaints)
ev = vec.transform(eval_queries)
co = vec.transform(recent_complaints)
sims = cosine_similarity(co, ev).max(axis=1)

uncovered = (sims < 0.30).sum()
print(f"uncovered complaints: {uncovered}/{len(recent_complaints)}")
print(f"median nearest-eval similarity: {np.median(sims):.2f}")
# Expected output on the fixture:
#   uncovered complaints: 8/10
#   median nearest-eval similarity: 0.00

Healthy: under 10% uncovered. Suspect: 15–30%. Confirmed: over 30%. A high uncovered rate is the cleanest possible statement of the pattern — the eval set has no analogue for what the users are unhappy about.

Test 3 — cluster occupancy diff (30 minutes, medium Python)

The decisive test. Embed eval queries and a recent live-traffic sample into the same space, cluster the union, and read off the per-cluster occupancy delta. The self-contained fixture below uses TF-IDF for portability; production runs should swap in the deployed embedder (zembed-1, OpenAI, Cohere).

import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

np.random.seed(42)

# Old eval: Tier-1 self-serve docs queries.
eval_queries = [
    "how do I set up retrieval", "ndcg metric explanation",
    "rerank top-k tutorial", "what is hybrid search",
    "embedding dimension reduction", "free tier query limit",
    "store embeddings on disk", "deduplication of documents",
    "fine tune reranker on my data", "chunking strategy for markdown",
    "vector db comparison", "cosine vs dot product",
] * 8  # n=96

# Live week (n=103): ~56% old mix + ~23% enterprise + ~12% agents + ~9% vertical.
live_queries = (
    eval_queries[:58]
    + ["SAML config", "audit log retention", "custom roles admin",
       "SOC2 compliance certificate", "data residency EU",
       "VPC peering setup", "SCIM provisioning",
       "private endpoint configuration"] * 3              # 24 enterprise
    + ["function-calling tool schema", "agent run trace ID",
       "tool result truncation", "agent timeout policy"] * 3  # 12 agents
    + ["PII redaction", "HIPAA regulated data",
       "claims processing pipeline"] * 3                  # 9 vertical
)

vec = TfidfVectorizer(min_df=1).fit(eval_queries + live_queries)
all_emb = vec.transform(eval_queries + live_queries).toarray()
km = KMeans(n_clusters=12, random_state=42, n_init=10).fit(all_emb)

eval_lbl = km.predict(vec.transform(eval_queries).toarray())
live_lbl = km.predict(vec.transform(live_queries).toarray())
ev = Counter(eval_lbl); lv = Counter(live_lbl)
n_ev, n_lv = len(eval_queries), len(live_queries)

print(f"{'cid':>3} {'eval%':>7} {'live%':>7} {'ratio':>7}  example")
for cid in range(12):
    es = ev.get(cid, 0) / n_ev
    ls = lv.get(cid, 0) / n_lv
    if ls == 0 and es == 0: continue
    # A cluster the eval never saw has no finite ratio; print "big" instead.
    ratio = f"{ls / es:.1f}x" if es > 0 else "big"
    example = next((q for q, c in zip(live_queries, live_lbl) if c == cid), "")
    flag = "  DRIFT" if ls > 2*es and ls > 0.04 else ""
    print(f"{cid:>3} {es:>6.1%} {ls:>6.1%} {ratio:>7}  {example[:42]}{flag}")
# Expected output (illustrative; cluster ids and shares depend on the clustering):
#   cid   eval%   live%   ratio  example
#     0   8.3%    7.4%    0.9x   how do I set up retrieval
#     1   8.3%    6.8%    0.8x   ndcg metric explanation
#     2   0.0%   14.8%    big    SAML config                DRIFT
#     3   0.0%    7.4%    big    function-calling tool      DRIFT
#     4   0.0%    5.6%    big    PII redaction              DRIFT
#     ...

Any cluster where live > 2x eval and live > 4% of traffic is a region of intent the metric is blind to. Confirmation is when the example queries from a flagged cluster cohere into a user need that postdates the eval set.

Test 4 — per-cluster metric breakdown

Once Test 3 flags drifted clusters, the next question is whether the metric itself is hiding the regression behind the aggregate average. Re-run the release-eval per cluster:

# Pseudocode — wire to the real eval harness.
# For each cluster, score the candidate model and the baseline,
# report per-cluster NDCG@10 delta.

per_cluster = {}
for cid, queries_in_cluster in clusters.items():
    if len(queries_in_cluster) < 5: continue  # noise floor
    cand_ndcg = ndcg_at_10(candidate_model, queries_in_cluster)
    base_ndcg = ndcg_at_10(baseline_model, queries_in_cluster)
    per_cluster[cid] = {
        "n": len(queries_in_cluster),
        "candidate": cand_ndcg,
        "baseline":  base_ndcg,
        "delta":     cand_ndcg - base_ndcg,
    }

# Print the regressions, worst first.
for cid, m in sorted(per_cluster.items(), key=lambda x: x[1]["delta"]):
    if m["delta"] < -0.01:
        print(f"cluster {cid:>2} (n={m['n']:>3})  delta={m['delta']:+.3f}")

A candidate that is +0.018 on aggregate but −0.04 on the largest drifted cluster is not a release; it is a localized regression that the aggregate is laundering. Block it.
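For readers without an eval harness at hand, the per-cluster comparison can be tried end to end with a small stand-in. This is a sketch under stated assumptions: `ndcg_at_k` uses linear gain and computes the ideal ranking from the labels of the retrieved list only, the cluster names and relevance labels are hypothetical, and the n >= 5 noise floor is omitted because the fixture is tiny.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain with linear gain: rel / log2(rank + 1), rank starting at 1."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one query; `relevances` are graded labels in ranked order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def mean_ndcg(runs, k=10):
    """Average NDCG@k over the queries of one cluster."""
    return sum(ndcg_at_k(r, k) for r in runs) / len(runs)

# Hypothetical fixture: cluster id -> per-query relevance lists, in ranked order.
baseline = {
    "old-mix": [[3, 2, 0], [2, 0, 1]],
    "saml":    [[3, 0, 0], [2, 1, 0]],
}
candidate = {
    "old-mix": [[3, 2, 0], [2, 1, 0]],  # second query improves
    "saml":    [[0, 3, 0], [2, 1, 0]],  # first query regresses
}

per_cluster = {
    cid: mean_ndcg(candidate[cid]) - mean_ndcg(baseline[cid])
    for cid in baseline
}
# Worst cluster first, mirroring the report in the pseudocode above.
for cid, delta in sorted(per_cluster.items(), key=lambda x: x[1]):
    print(f"{cid:>8}  delta={delta:+.3f}")
```

In this fixture the candidate improves the old mix and regresses the SAML cluster, the shape the aggregate tends to hide.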

Test 5 — KL divergence as a monitorable scalar

The five-test diagnostic suits manual investigation. Ongoing monitoring needs a single scalar to plot and alert on. Cluster-occupancy KL divergence is the right shape — it summarizes the distance between eval and live distributions:

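import numpy as np
from scipy.special import rel_entr

# Reuses ev, lv, n_ev, n_lv (cluster occupancy counts) from Test 3.
# The 1e-9 term keeps empty clusters from causing a division by zero.
p = np.array([ev.get(c, 0) / n_ev for c in range(12)]) + 1e-9
q = np.array([lv.get(c, 0) / n_lv for c in range(12)]) + 1e-9
p, q = p / p.sum(), q / q.sum()

kl = float(rel_entr(q, p).sum())  # KL(live || eval)
print(f"KL(live || eval) = {kl:.3f}")
# Expected: ~0.4–0.6 on the fixture above, clearly out of the healthy band.
# The value depends on how the clustering splits the new queries and on the
# smoothing constant: an eval-empty cluster with live share q adds roughly
# q * ln(q / 1e-9).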

Healthy: KL < 0.15. Alert: KL > 0.3. Five-alarm: KL > 0.5. Plot weekly, alarm on threshold crossings, regenerate the eval set when the alarm fires.
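One practical caveat when turning these thresholds into an alert: KL(live || eval) is very sensitive to how empty eval clusters are smoothed. The sketch below is an illustration with hypothetical occupancy counts, not the source's monitoring code. It compares the tiny 1e-9 constant with additive pseudo-counts applied to raw counts; the function name and the pseudo-count values are assumptions.

```python
import numpy as np

def kl_live_vs_eval(eval_counts, live_counts, pseudo_count):
    """KL(live || eval) over cluster occupancy, with additive smoothing on raw counts."""
    p = np.asarray(eval_counts, dtype=float) + pseudo_count
    q = np.asarray(live_counts, dtype=float) + pseudo_count
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(q * np.log(q / p)))

# Hypothetical occupancy counts: the third cluster is absent from the eval set.
eval_counts = [50, 50, 0]
live_counts = [40, 40, 20]

for pc in (1e-9, 0.5, 1.0):
    value = kl_live_vs_eval(eval_counts, live_counts, pc)
    print(f"pseudo_count={pc:g}  KL={value:.3f}")
```

With a near-zero constant, one eval-empty cluster dominates the sum and the scalar lands far above any of the bands. Whatever smoothing is chosen, thresholds carry over from week to week only if that choice stays fixed.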

Worked example end-to-end

A typical run of the diagnostic on a docs-search product:

  • Test 1 returns age_days = 247; the eval predates two major releases. That alone justifies a refresh, but the rest of the diagnostic localizes the drift.
  • Test 2 returns 31/50 recent complaints uncovered.
  • Test 3 flags three clusters: SAML/SSO (15% of live, 0% of eval), agent runs (8% of live, 0% of eval), and a regulated-data cluster (6% of live, 0% of eval).
  • Test 4 reports aggregate NDCG@10 of +0.012 against the last release, with SAML at −0.067 and agents at −0.041.
  • Test 5 returns KL(live || eval) = 0.51.

Treatment §1 refreshes the eval set. §3 adds per-cluster gating. §4 adds a usefulness dimension to the LLM judge prompt. The aggregate NDCG@10 on the new (harder) eval drops by roughly one point and stays correlated with NPS for the next two quarters. The drop is the metric finally telling the truth.

Treatment

Order matters. Each step assumes the previous one is done.

1. Refresh the eval set from a stratified live sample

Pull a fresh sample of recent live queries, stratified by intent cluster so rare-but-important regions (regulated-data, new-tier features) are guaranteed a minimum number of examples rather than being sampled in strict proportion. Do not merge with the old set: half-refreshes leave a long tail of stale labels that mask the next drift as it forms.

import numpy as np
from collections import Counter

# Stratified resample from clustered live traffic.
# Each cluster contributes proportional-with-floor: max(p_cluster * N, floor).
# Because of the floor, the result can exceed n_target when there are many small clusters.

def stratified_resample(live, labels, n_target=1000, floor=20):
    out, ids = [], []
    counts = Counter(labels)
    for cid, ct in counts.items():
        share = ct / len(labels)
        n_take = max(int(share * n_target), floor)
        cluster_idxs = [i for i, l in enumerate(labels) if l == cid]
        # Never ask for more examples than the cluster holds.
        chosen = np.random.choice(cluster_idxs, size=min(n_take, len(cluster_idxs)), replace=False)
        out.extend([live[i] for i in chosen])
        ids.extend(chosen.tolist())
    return out, ids

The floor matters. A cluster at 0.5% of live traffic is still a real cohort that can pull NPS down, and a proportional-only sample undersamples it into noise. A floor of 20–40 examples buys statistical power on tail cohorts without flooding the eval with them.
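The effect of the floor is easiest to see on the quota arithmetic alone. A minimal sketch, using the same defaults as `stratified_resample` above (`n_target=1000`, `floor=20`); the helper name `cluster_quota` and the example shares are illustrative.

```python
def cluster_quota(share, n_target=1000, floor=20):
    """Examples requested from a cluster: proportional share, but never below the floor."""
    return max(int(share * n_target), floor)

# Example shares: a 0.5% tail cohort, a 1% cohort, and a 25% cohort.
for share in (0.005, 0.01, 0.25):
    proportional = int(share * 1000)
    print(f"share={share:.1%}  proportional-only={proportional:>3}  with floor={cluster_quota(share):>3}")
```

The 0.5% cohort goes from a handful of examples to the floor of 20, while the large cohort is unaffected.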

2. Lock a refresh cadence + trigger-based refresh

Quarterly is the default for products in active development. Slipping the cadence is how this pattern returns. Trigger-based refresh covers the fast cases:

  • A new feature ships with vocabulary the corpus has but the old eval does not.
  • A new tier or vertical onboards more than 5% of weekly volume.
  • The Test 5 KL divergence crosses the alert threshold.

The cadence catches slow drift. The triggers catch the fast drift that takes one release to ship and three months to surface.
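The cadence and the three triggers can be folded into one check that runs alongside the weekly KL plot. The following is a sketch, not a prescribed implementation: the `RefreshSignals` fields, the function name, and the 90-day reading of "quarterly" are assumptions, while the 5% share and the 0.30 KL alert level come from the text.

```python
from dataclasses import dataclass

@dataclass
class RefreshSignals:
    eval_age_days: int
    new_feature_vocab_missing: bool   # a feature shipped whose vocabulary the eval lacks
    new_cohort_weekly_share: float    # largest new tier/vertical share of weekly volume
    kl_live_vs_eval: float            # Test 5 scalar

def refresh_reasons(s, cadence_days=90, cohort_share=0.05, kl_alert=0.30):
    """Return the list of reasons to regenerate the eval set (empty list: no refresh due)."""
    reasons = []
    if s.eval_age_days >= cadence_days:
        reasons.append("cadence")
    if s.new_feature_vocab_missing:
        reasons.append("new-feature vocabulary")
    if s.new_cohort_weekly_share > cohort_share:
        reasons.append("new cohort above share threshold")
    if s.kl_live_vs_eval > kl_alert:
        reasons.append("KL above alert threshold")
    return reasons

# Hypothetical week: a young eval set, but a new tier already at 8% of volume.
week = RefreshSignals(eval_age_days=30, new_feature_vocab_missing=False,
                      new_cohort_weekly_share=0.08, kl_live_vs_eval=0.12)
print(refresh_reasons(week))
```

Returning the reasons rather than a bare boolean keeps the refresh ticket self-explanatory.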

3. Per-cluster gating in the release process

The release-eval should not report a single aggregate metric. It should report aggregate plus per-cluster, and refuse to ship any change that regresses a cluster with n >= 5 by more than one NDCG@10 point.

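# Release-eval gate. per_cluster is the dict built in Test 4.
THRESH_DROP = 0.01    # 1 NDCG point
MIN_CLUSTER_SIZE = 5  # noise floor, as in Test 4

class ReleaseBlocked(Exception):
    """Raised when any sufficiently large cluster regresses past the threshold."""

def release_gate(per_cluster):
    blockers = []
    for cid, m in per_cluster.items():
        if m["n"] >= MIN_CLUSTER_SIZE and m["delta"] < -THRESH_DROP:
            blockers.append((cid, m))
    if blockers:
        raise ReleaseBlocked(
            "Per-cluster regression: " +
            ", ".join(f"c{cid}:{m['delta']:+.3f}" for cid, m in blockers)
        )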

This single check eliminates the most common eval-drift-shaped incident: the aggregate-positive, cluster-negative release.

4. Add usefulness to the LLM-judge prompt

The deepest patch, and the one that survives future drift. The LLM-as-judge typically scores topical relevance. A doc can be topically relevant and useless — wrong access tier, wrong language, stale version, replaced by a newer doc. The judge should score topical AND usefulness and report min(topical, usefulness):

You are scoring whether RETRIEVED is a useful answer for QUERY.

Return JSON:
{
  "topical_relevance": 0-3,   // does the doc discuss the query topic?
  "usefulness":        0-3,   // would a user be HELPED by reading this doc
                              // RIGHT NOW, given recency, access tier,
                              // language, deprecation status?
  "score":             min(topical_relevance, usefulness),
  "reason":            "<one sentence>"
}

The minimum is load-bearing. A doc that is topical (3) but stale (0) reports a score of 0, not the mean (1.5). The judge is now answering the question the user is actually asking.
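On the consuming side, it may be safer to recompute the minimum than to trust the `score` field the model returns, since a judge can fill that field inconsistently. A minimal sketch, assuming the judge's reply is strict JSON with the field names from the prompt above; the function name `final_score` and the sample reply are hypothetical.

```python
import json

def final_score(judge_output):
    """Recompute min(topical_relevance, usefulness) from the judge's JSON reply."""
    payload = json.loads(judge_output)
    topical = int(payload["topical_relevance"])
    useful = int(payload["usefulness"])
    # Reject values outside the 0-3 scale defined in the prompt.
    for name, value in (("topical_relevance", topical), ("usefulness", useful)):
        if not 0 <= value <= 3:
            raise ValueError(f"{name} out of range: {value}")
    return min(topical, useful)

# Hypothetical reply where the model's own "score" disagrees with the rule.
raw = '{"topical_relevance": 3, "usefulness": 0, "score": 2, "reason": "stale version"}'
print(final_score(raw))
```

The topical-but-stale document scores 0 regardless of what the judge wrote in `score`.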

What does NOT work — and is the first thing most teams try

Picking a more recent baseline and only shipping changes that beat it. The baseline is still measured against the stale eval set, so “beating it” means nothing about the live distribution. The same drift ships, gated against itself.

The metric is honest. The metric is also wrong about today’s product. Refresh the snapshot, gate per-cluster, judge for usefulness — in that order.

This isn’t this pattern when…

| Observation | Probable pattern | Read next |
|-------------|------------------|-----------|
| Eval age is healthy, KL is low, users still complain | Right doc retrieved, generator hallucinating around it | Right doc, wrong answer |
| Eval refreshed, gate green, threshold-tuned model degraded overnight | Score-distribution shift under model swap | Threshold by feel |
| Aggregate is fine, one specific question type fails | Long-context inattention, not eval-set staleness | Context rot |
| Metric and users agree, but judge bill is the problem | Eval is honest but expensive | Eval spend overrun |
| Distilled judge metric drifts while teacher-judge stable | Student degraded on live distribution | Distillation drift |

The disambiguation rule of thumb: eval drift moves the distribution of queries; threshold drift moves the distribution of scores; distillation drift moves the student–teacher agreement. Same surface symptom (“the dashboards lie”), different mechanism underneath.

Numbers that matter

| signal | healthy | suspect | confirmed |
|--------|---------|---------|-----------|
| eval set age (days) | 0–60 | 60–180 | 180+ |
| uncovered-complaint rate | under 10% | 15–30% | over 30% |
| any cluster ratio (live/eval) | under 2x | 2–5x | over 5x |
| per-cluster NDCG@10 delta | flat to +0.01 | −0.01 to −0.02 | under −0.02 |
| KL(live ‖ eval) | under 0.15 | 0.15–0.30 | over 0.30 |

These are starting thresholds. Each team’s “healthy KL” depends on how heterogeneous the query distribution is at baseline.
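For teams that want these bands in a dashboard or CI check, a table-driven classifier is one option. The sketch below copies the cut points from the table above and makes two labeled assumptions: values on or between the two cut points read as "suspect" (which also covers the unlabeled 10–15% gap in the uncovered-complaint row), and the per-cluster NDCG@10 row is left out because its direction is reversed. The names `band`, `CUTS`, and `read_signals` are illustrative.

```python
def band(value, healthy_below, confirmed_above):
    """Map a signal to healthy / suspect / confirmed using two cut points."""
    if value < healthy_below:
        return "healthy"
    if value > confirmed_above:
        return "confirmed"
    return "suspect"  # assumption: boundaries and any gap between bands read as suspect

# Cut points copied from the table above.
CUTS = {
    "eval_age_days": (60, 180),
    "uncovered_rate": (0.10, 0.30),
    "max_cluster_ratio": (2.0, 5.0),
    "kl_live_vs_eval": (0.15, 0.30),
}

def read_signals(signals):
    """Classify each provided signal; unknown signal names raise KeyError."""
    return {name: band(value, *CUTS[name]) for name, value in signals.items()}

# Values from the worked example earlier in the text.
reading = read_signals({"eval_age_days": 247, "uncovered_rate": 31 / 50,
                        "kl_live_vs_eval": 0.51})
print(reading)
```

Keeping the cut points in one dictionary makes it straightforward to retune them per team, as the note above suggests.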

Adjacent patterns

  • Right doc, wrong answer: retrieval is correct but the LLM’s output contradicts the retrieved doc. Eval drift can mask this — once the eval is refreshed and the metric-vs-user-signal gap persists, the failure has moved to the generation layer.
  • Threshold by feel: a calibrated-relevance threshold worked under one model and fails under the next. Eval drift moves the distribution of queries; threshold drift moves the distribution of scores. Different mechanism, same surface symptom.
  • Distillation drift: structurally the same shape (a snapshot becomes stale) applied to a student-teacher relationship rather than a metric-user relationship. A drifted eval lets the canary pass — it is just measuring the wrong thing.

When the eval is refreshed, per-cluster gating is in place, the judge scores usefulness, and the user signal still disagrees with the metric, the pattern at play is one of the three above, not this one.

The team writing this: ZeroEntropy trains specialized small models (zembed-1, zerank-2) for the production stacks where these patterns show up.