Distillation Drift
A distilled judge or scorer holds its agreement with the teacher on training data and silently degrades on the production distribution as that distribution shifts.
Symptom
A frontier LLM-as-judge gets distilled into a small cross-encoder. Held-out agreement with the teacher lands in the 92-97% range. Production scoring cost drops by roughly two orders of magnitude. The release ships.
Months later, the held-out regression is still green. The model has not changed. The downstream behavior — which (query, doc) pairs land in the eval set, what gets through the no-answer gate, which candidates pass release — has not stayed green. A re-comparison on production-sampled traffic shows student–teacher disagreement at 15-30% of pairs, well above the 3-8% the distillation was validated against.
“Our distilled judge still passes the held-out eval but stopped behaving in production.”
The signature, in concrete form:
- Held-out agreement still passes. The training-time canary is computed against the training-time held-out set. It has been green for months.
- Production-sampled agreement is materially worse. Spearman on a fresh production sample sits 0.15-0.25 below the held-out number.
- Disagreements cluster in regions that postdate training. New surfaces, new tiers, new corpora — these are where student and teacher diverge.
- Or: disagreements appear synchronously with a vendor release. The student has not changed. The teacher’s revision has.
Neither side is lying. The held-out set computes exactly what it has always computed. The student produces exactly the scores it was trained to produce.
Mechanism
A distilled student learns a mapping from (query, doc) → score that approximates the teacher’s mapping on the training distribution. Every agreement metric computed against the held-out set is conditional on that distribution. Two things drift after deployment:
- The query distribution moves. New features, new cohorts, new corpora — the same forces driving eval drift. Pairs from regions the student did not see during training get noisier agreement. Pairs from regions the student did see stay sharp. The held-out average stays high because the held-out set still looks like training-time traffic.
- The teacher moves. Frontier-model vendors release updated revisions on a cadence of every two to six months. A claude-opus-4-7 call today is not the same model that produced last quarter’s training labels. The teacher’s behavior has shifted; the student’s has not.
Both shifts are silent because the load-bearing canary — held-out agreement — is computed against a training-time snapshot that does not track live traffic. A held-out set that does not track the live distribution is the same pathology as an eval set that does not track the live distribution; this is that pattern applied to the teacher–student relationship instead of the metric–user relationship.
Distillation drift is the failure mode that specialization introduces. Distilled specialists are how the single-LLM overspend treatment works in practice. Deploying specialists without monitoring them for drift saves the inference cost and recreates the silent-failure surface that frontier-LLM-on-everything was hiding. Savings get expensive when they bury a regression in a release decision.
The mechanism, in distributional terms. A distillation built against teacher revision T0 on training distribution D0 is validated by held-out Spearman in the 0.90-0.95 band. Production-sampled Spearman at ship time is within 0.02 of held-out — the gate is computing the right thing on day zero. Over the following quarter the live distribution accumulates new cohorts at the 5-20% mass level each, and the vendor ships one or two revisions. Per-cohort Spearman on the new cohorts collapses to 0.55-0.70; aggregate production-sampled Spearman falls to 0.65-0.80; held-out Spearman against T0 remains 0.90-0.95. The held-out canary is honest. The held-out canary is also wrong about today’s traffic.
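One way to write down what each canary measures, as a sketch in notation that is not in the original: let S be the student, T_t the teacher revision live at time t, D_t the live pair distribution at time t, and ρ the Spearman correlation over sampled pairs x.

```latex
\rho_{\text{held-out}} = \rho\bigl(S(x),\, T_0(x)\bigr),\quad x \sim D_0
\qquad\text{versus}\qquad
\rho_{\text{prod}}(t) = \rho\bigl(S(x),\, T_t(x)\bigr),\quad x \sim D_t
```

The left-hand quantity depends only on the frozen snapshot (S, T_0, D_0), so it cannot move after ship. The right-hand quantity moves whenever D_t or T_t does, which is why the two agree on day zero and separate afterwards.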
Diagnostic
Four tests, ordered cheap-to-expensive. Run them in order; stop at the first one that fires.
Test 1 — held-out vs production-sampled agreement spread (under one minute)
A single comparison: compute Spearman on the training-time held-out set and on a fresh production sample. If the spread is large, this pattern is firing regardless of what the held-out number says.
import numpy as np
from scipy.stats import spearmanr
np.random.seed(42)
# Inline fixture: held-out is in-distribution; production-sampled has drifted.
n = 200
# Synthetic "teacher" labels in [0, 1].
heldout_teacher = np.random.beta(2, 2, size=n)
# Student approximates teacher closely on held-out: noise ~ 0.04.
heldout_student = np.clip(heldout_teacher + np.random.normal(0, 0.04, n), 0, 1)
# Production-sampled: 60% in-dist (low noise) + 40% drifted (high noise + bias).
prod_teacher = np.random.beta(2, 2, size=n)
in_dist_mask = np.random.rand(n) < 0.6
prod_student = np.where(
    in_dist_mask,
    np.clip(prod_teacher + np.random.normal(0, 0.04, n), 0, 1),
    np.clip(prod_teacher * 0.7 + np.random.normal(0.05, 0.18, n), 0, 1),
)
rho_heldout = spearmanr(heldout_teacher, heldout_student).correlation
rho_prod = spearmanr(prod_teacher, prod_student).correlation
spread = rho_heldout - rho_prod
print(f"held-out Spearman: {rho_heldout:.3f}")
print(f"production-sampled rho: {rho_prod:.3f}")
print(f"spread (heldout - prod): {spread:+.3f}")
# Expected output (seed=42):
# held-out Spearman: 0.974
# production-sampled rho: 0.781
# spread (heldout - prod): +0.193
| spread | reading |
|---|---|
under 0.03 | healthy; this pattern is unlikely to be the dominant cause |
0.03–0.08 | suspect; run Test 2 |
over 0.08 | almost certainly firing even if other patterns are also active |
One number, no judgment required. Catches the majority of real distillation-drift cases on its own.
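The reading table can be turned into a small helper so the spread is interpreted the same way every time. This is a minimal sketch; the function name `read_spread` is hypothetical, and the boundaries are the starting thresholds from the table, not calibrated values.

```python
def read_spread(rho_heldout: float, rho_prod: float) -> str:
    """Map the held-out minus production-sampled Spearman spread to a reading."""
    spread = rho_heldout - rho_prod
    if spread < 0.03:
        return "healthy"
    if spread <= 0.08:
        return "suspect"      # run Test 2
    return "confirmed"


# The fixture numbers quoted above give a spread of about 0.19.
print(read_spread(0.974, 0.781))
```

With the fixture's quoted values the helper returns "confirmed".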
Test 2 — top-K overlap on production-sampled pairs (5–15 min Python)
Spearman is a rank correlation over the full sample. The decision-relevant comparison is which pairs the student promotes into the top of its ordering: those are the pairs that survive the gate, drive eval-set inclusion, or pass to the next stage. Spearman can sit at 0.85 while top-10 overlap sits at 0.4. Top-K is the quantity that translates into shipped decisions.
import numpy as np
from scipy.stats import spearmanr
np.random.seed(42)
# Mock student and teacher with a tunable drift parameter.
# drift=0.0 means perfect agreement; drift=0.5 means heavy disagreement
# concentrated on the upper tail (where ranking decisions are made).
def make_scores(n, drift):
    teacher = np.random.beta(2, 2, size=n)
    noise = np.random.normal(0, 0.04 + drift * 0.20, n)
    # Drifted students systematically deflate near the top, the regime
    # that matters for top-K gating.
    bias = -drift * 0.10 * (teacher > 0.7)
    student = np.clip(teacher + noise + bias, 0, 1)
    return teacher, student

def topk_overlap(teacher, student, k=10):
    # Fraction of the teacher's top-k pairs that also sit in the student's top-k.
    t_top = set(np.argsort(teacher)[-k:])
    s_top = set(np.argsort(student)[-k:])
    return len(t_top & s_top) / k

for drift in [0.00, 0.20, 0.50]:
    t, s = make_scores(n=200, drift=drift)
    rho = spearmanr(t, s).correlation
    ov = topk_overlap(t, s, k=10)
    print(f"drift={drift:.2f} spearman={rho:.3f} top10_overlap={ov:.2f}")
# Expected output:
# drift=0.00 spearman=0.985 top10_overlap=0.90
# drift=0.20 spearman=0.873 top10_overlap=0.60
# drift=0.50 spearman=0.657 top10_overlap=0.30
Healthy: Spearman ≥ 0.85 AND top-10 overlap ≥ 0.80. Suspect: either drops below threshold. Confirmed: top-10 overlap below 0.60.
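The three-way reading for Test 2 can be written as a single function. This is a sketch of the rule as stated, with a hypothetical name `read_test2`; it checks the "confirmed" condition first so that a very low top-10 overlap is not reported as merely suspect.

```python
def read_test2(rho: float, top10_overlap: float) -> str:
    """Apply the Test 2 thresholds to one (Spearman, top-10 overlap) pair."""
    if top10_overlap < 0.60:
        return "confirmed"
    if rho >= 0.85 and top10_overlap >= 0.80:
        return "healthy"
    return "suspect"


# The three rows quoted in the expected output above.
for rho, overlap in [(0.985, 0.90), (0.873, 0.60), (0.657, 0.30)]:
    print(f"spearman={rho:.3f} top10_overlap={overlap:.2f} -> {read_test2(rho, overlap)}")
```

On those three rows the readings are healthy, suspect, and confirmed, in that order.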
Test 3 — cluster-stratified breakdown of disagreements (30 min, conclusive)
The decisive test. Cluster the production sample by query embedding, compute per-cluster Spearman, and read off which clusters concentrate the disagreement. The two failure modes have different signatures:
- Query-distribution drift shows up as one or two clusters with markedly worse agreement than the rest, corresponding to cohorts that postdate the training data.
- Teacher drift shows up as a uniform drop across all clusters, including clusters that look identical to training-time clusters. The student is fine; the comparison target moved.
import numpy as np
from scipy.stats import spearmanr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
np.random.seed(42)
# Fixture: 200 (query, doc) pairs across 4 clusters with cluster-specific drift.
templates = {
    "core": ("how do I configure retrieval", 0.04),  # in-dist
    "enterprise": ("saml audit log retention compliance", 0.20),  # drifted
    "agents": ("function calling tool schema run", 0.22),  # drifted
    "verticals": ("pii redaction regulated data hipaa", 0.18),  # drifted
}
queries, cluster_ids, teachers, students = [], [], [], []
for cid, (template, drift) in enumerate(templates.values()):
    n = 50
    queries.extend([f"{template} variant {i}" for i in range(n)])
    cluster_ids.extend([cid] * n)
    t = np.random.beta(2, 2, size=n)
    s = np.clip(t + np.random.normal(0, drift, n), 0, 1)
    teachers.extend(t)
    students.extend(s)
teachers = np.array(teachers); students = np.array(students)
cluster_ids = np.array(cluster_ids)
# Re-cluster the union to mimic what you'd do without ground-truth labels.
vec = TfidfVectorizer(min_df=1).fit(queries)
emb = vec.transform(queries).toarray()
km = KMeans(n_clusters=4, random_state=42, n_init=10).fit(emb)
labels = km.predict(emb)
overall = spearmanr(teachers, students).correlation
print(f"overall Spearman: {overall:.3f}\n")
print(f"{'cid':>3} {'n':>4} {'rho':>7} {'example':<40}")
for cid in range(4):
    mask = labels == cid
    if mask.sum() < 5:
        continue
    rho = spearmanr(teachers[mask], students[mask]).correlation
    example = next(q for q, m in zip(queries, mask) if m)
    flag = " <- DRIFT" if rho < 0.70 else ""
    print(f"{cid:>3} {mask.sum():>4} {rho:>7.3f} {example[:38]}{flag}")
# Expected output (seed=42):
# overall Spearman: 0.628
#
# cid n rho example
# 0 50 0.953 how do I configure retrieval var
# 1 50 0.547 saml audit log retention compli <- DRIFT
# 2 50 0.466 function calling tool schema ru <- DRIFT
# 3 50 0.625 pii redaction regulated data hi <- DRIFT
Any cluster with Spearman under 0.70 AND more than 5% of traffic is a region the student is no longer approximating. If multiple clusters drop together and the union covers the whole sample, the teacher revision is what moved; check vendor release notes against the canary timeline.
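The two signatures can be separated mechanically once per-cluster Spearman and traffic share are known. The sketch below is one possible encoding, not the author's tooling: the function name, the dictionary inputs, and the "every sized cluster drifted" test for a uniform drop are all assumptions.

```python
def diagnose_clusters(cluster_rho, cluster_share, drift_floor=0.70, min_share=0.05):
    """Classify a per-cluster breakdown as concentrated or uniform drift.

    cluster_rho:   {cluster name: student-teacher Spearman inside that cluster}
    cluster_share: {cluster name: fraction of production traffic}
    """
    # Only clusters carrying more than min_share of traffic count.
    sized = [c for c in cluster_rho if cluster_share[c] > min_share]
    drifted = [c for c in sized if cluster_rho[c] < drift_floor]
    if not drifted:
        return "no drifted cluster", drifted
    if len(drifted) == len(sized):
        # Every sized cluster dropped: the comparison target probably moved.
        return "uniform drop: suspect teacher drift", drifted
    return "concentrated drop: suspect query-distribution drift", drifted


rho = {"core": 0.953, "enterprise": 0.547, "agents": 0.466, "verticals": 0.625}
share = {name: 0.25 for name in rho}
print(diagnose_clusters(rho, share))
```

With the per-cluster values quoted in the fixture output, the core cluster stays healthy while the other three fall under the floor, so the helper reports a concentrated drop. A uniform reading should still be confirmed against vendor release notes before retraining.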
Test 4 — Spearman + top-K-overlap as monitorable scalars
The first three tests are for manual investigation. Ongoing monitoring wants one or two scalars to plot and alarm on. The right pair is production-sampled Spearman and production-sampled top-K overlap, both computed weekly against the current teacher on a fresh stratified sample of 100-500 pairs.
# Pseudocode for the weekly canary.
def weekly_canary(distilled_judge, current_teacher, sample_n=200, k=10):
    pairs = production_sample(n=sample_n)  # stratified by cluster
    student_scores = [distilled_judge(q, d) for q, d in pairs]
    teacher_scores = [current_teacher(q, d) for q, d in pairs]
    rho = spearmanr(student_scores, teacher_scores).correlation
    student_top = set(top_k_indices(student_scores, k=k))
    teacher_top = set(top_k_indices(teacher_scores, k=k))
    topk = len(student_top & teacher_top) / k
    return {
        "spearman": rho,
        "topk_overlap": topk,
        "teacher_revision": current_teacher.revision,
    }
Healthy: Spearman ≥ 0.85, top-10 overlap ≥ 0.80. Alarm: either value below its floor in the latest weekly run, OR a single-week Spearman drop of more than 0.05, OR a single-week top-10 overlap drop of more than 0.10. Plot both, alarm on either, retrain when alarms fire.
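The alarm rule can be evaluated over the stored weekly results. The following is a minimal sketch under two assumptions: the history is a list of dictionaries shaped like the canary's return value, oldest first, and the single-week drop limits are read as 0.05 for Spearman and 0.10 for top-10 overlap. The function and constant names are hypothetical.

```python
SPEARMAN_FLOOR = 0.85
TOPK_FLOOR = 0.80
MAX_SPEARMAN_DROP = 0.05  # assumed reading of the single-week Spearman limit
MAX_TOPK_DROP = 0.10      # assumed reading of the single-week top-10 limit


def canary_alarms(history):
    """Return the alarms raised by the most recent weekly canary result."""
    alarms = []
    latest = history[-1]
    if latest["spearman"] < SPEARMAN_FLOOR:
        alarms.append("spearman below floor")
    if latest["topk_overlap"] < TOPK_FLOOR:
        alarms.append("top-k overlap below floor")
    if len(history) >= 2:
        prev = history[-2]
        if prev["spearman"] - latest["spearman"] > MAX_SPEARMAN_DROP:
            alarms.append("single-week spearman drop")
        if prev["topk_overlap"] - latest["topk_overlap"] > MAX_TOPK_DROP:
            alarms.append("single-week top-k drop")
    return alarms


history = [
    {"spearman": 0.93, "topk_overlap": 0.90},
    {"spearman": 0.87, "topk_overlap": 0.85},
]
print(canary_alarms(history))
```

In this example both values are still above their floors, but the week-over-week Spearman drop fires. That is the case the drop rule exists for: a fast decline caught before the floor is crossed.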
Worked example end-to-end
A distilled relevance judge under the diagnostic produces a recognizable trajectory. Test 1 returns held-out Spearman in the 0.90-0.95 band, production-sampled Spearman in the 0.65-0.75 band, and a spread above 0.15, past the threshold on the first test. Test 2 returns top-10 overlap in the 0.30-0.50 band, confirming that at least half of the teacher’s top 10 is missing from the student’s promoted pairs. Test 3 splits the sample into four to six clusters; one or two cohorts that postdate training sit at per-cluster Spearman of 0.45-0.60, while the legacy cohort holds at 0.85-0.95. The drop is concentrated, not uniform, which points to query-distribution drift rather than teacher drift. Test 4, applied retroactively to the last three months of weekly canary data, shows a gradual decline that began a few weeks after the new cohort launched.
The treatment sequence: pin the current teacher revision in metadata (§3), pull a stratified production sample of 3,000-10,000 pairs, re-score it with the current teacher, and fine-tune the existing student for one epoch (§2). The next canary returns to the healthy band: Spearman in 0.85-0.92, top-10 overlap above 0.80. Wire the canary into the release process as a ship-blocker (§1) and schedule a recurring stratified resample monthly. Vendor revisions and new cohort launches become canary alarms instead of release-time surprises.
Treatment
Order matters. Each step assumes the previous one is done.
1. Ship the weekly canary as a release blocker
A weekly production-sampled agreement check is the load-bearing infrastructure for any distilled specialist in production. Without it, the distilled judge is a script that was a distilled judge at the moment it shipped. The canary belongs in the same PR that ships the model and must be wired into the release gate, not just the dashboard.
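SPEARMAN_FLOOR = 0.85
TOPK_FLOOR = 0.80

class ReleaseBlocked(Exception):
    """Raised when the latest canary result is below either floor."""

def gate_release(canary):
    # Block the release if either agreement scalar is under its floor.
    if canary["spearman"] < SPEARMAN_FLOOR or canary["topk_overlap"] < TOPK_FLOOR:
        raise ReleaseBlocked(canary)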
Ship-blocker rather than dashboard: a dashboard meant to be “watched” is a dashboard that stops getting watched the week after launch. A gate that prevents a release is a gate the team cannot ignore. The cost of false alarms — rerun, investigate, retrain if confirmed — is small compared to the cost of shipping a regression that the held-out canary missed.
Tradeoff: the gate adds a 30s–2min step to the release pipeline and a small ongoing teacher-API spend (see §4). At high release cadence, run the canary on a schedule and cache the most recent result against the gate.
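For the scheduled-and-cached variant, the gate reads the most recent stored result instead of running the canary inline. The sketch below assumes the stored result carries a `computed_at` Unix timestamp and that a result older than one week counts as stale; both are illustrative choices, as are the names.

```python
import time

SPEARMAN_FLOOR = 0.85
TOPK_FLOOR = 0.80
MAX_AGE_SECONDS = 7 * 24 * 3600  # assumption: older than a week is stale


class ReleaseBlocked(Exception):
    """Raised when a release must not proceed."""


def gate_release_cached(cached, now=None):
    """Gate a release on the most recent cached canary result."""
    now = time.time() if now is None else now
    # A missing or stale result blocks the release; it must never pass silently.
    if cached is None or now - cached["computed_at"] > MAX_AGE_SECONDS:
        raise ReleaseBlocked("canary result missing or stale; rerun the canary")
    if cached["spearman"] < SPEARMAN_FLOOR or cached["topk_overlap"] < TOPK_FLOOR:
        raise ReleaseBlocked(f"canary below floor: {cached}")
    return True


fresh = {"spearman": 0.91, "topk_overlap": 0.86, "computed_at": 1_000_000}
print(gate_release_cached(fresh, now=1_000_000 + 3600))
```

Failing closed on a stale cache is the design choice that matters here. Otherwise a canary job that quietly stopped running would leave the gate permanently green.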
2. Retrain on production-sampled (query, doc, teacher_score) triples
When the canary fires, the fix has the same shape as the original distillation, but the training data is sampled from recent production rather than from the original training-time set. Three to ten thousand fresh triples typically restore agreement to the canary floor on the new distribution.
# Stratify by cluster so under-represented cohorts get a floor.
fresh = production_sample_stratified(n=5000, floor_per_cluster=200)
# Re-score with the CURRENT teacher, not the revision used at training time.
triples = [(pair, current_teacher(*pair)) for pair in fresh]
# Fine-tune the existing student; do NOT retrain from scratch.
student.fit(triples, epochs=1, learning_rate=1e-5, objective="mse_on_teacher_score")
Fine-tune rather than retrain from scratch: the old training data is still mostly valid — the in-distribution mass has not gone anywhere, it has just been augmented with new regions. A fresh-from-scratch retrain on production-sampled data undersamples the old core distribution and degrades on it. A small learning rate nudges the student into the new regions while preserving its old behavior.
Production-sampled rather than synthetic: synthetic augmentation produces pairs that look like the new region without being from it. The teacher’s behavior on synthetic pairs is correlated with — but not the same as — its behavior on real pairs. The synthetic shortcut is how a second distillation drift gets shipped on top of the first.
Tradeoff: re-scoring several thousand pairs with the teacher costs $5–50 depending on model and pair length, and takes minutes-to-hours of teacher-API time. This is the price of correctness.
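The stratified sample with a per-cluster floor used in this step is not defined in the text. One way it could work, as a sketch with hypothetical names and no claim to match the original helper: guarantee each cluster its floor first, then fill the remaining budget uniformly from what is left so the bulk of the sample still follows the traffic mix.

```python
import random
from collections import defaultdict


def sample_stratified(pairs, cluster_of, n, floor_per_cluster, seed=0):
    """Sample about n pairs with at least floor_per_cluster from each cluster.

    cluster_of maps a pair to its cluster label. If the floors alone exceed n,
    the result is larger than n; that case is left to the caller.
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for pair in pairs:
        by_cluster[cluster_of(pair)].append(pair)

    chosen, remaining = [], []
    for members in by_cluster.values():
        rng.shuffle(members)
        take = min(floor_per_cluster, len(members))
        chosen.extend(members[:take])      # guaranteed floor for this cluster
        remaining.extend(members[take:])

    # Fill the rest of the budget uniformly from the leftover pool.
    extra = max(0, n - len(chosen))
    chosen.extend(rng.sample(remaining, min(extra, len(remaining))))
    return chosen


# Toy traffic: 900 pairs from an old cohort, 100 from a new one.
traffic = [("core", i) for i in range(900)] + [("new", i) for i in range(100)]
sample = sample_stratified(traffic, cluster_of=lambda p: p[0], n=200, floor_per_cluster=50)
print(len(sample), sum(1 for p in sample if p[0] == "new"))
```

A uniform draw of 200 from this toy traffic would contain about 20 pairs from the new cohort; the floor guarantees at least 50.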
3. Pin the teacher revision in distillation metadata
The student is not approximating “the teacher” in the abstract; it is approximating a specific teacher revision. Pin that revision in the distillation artifact metadata, and treat any vendor-published model update as a retrain signal — not because the new model is necessarily better, but because the student is no longer approximating the same function.
# eval/judges.yaml
distilled_judge:
  model_path: ./judges/distilled-v3
  teacher_model: claude-opus-4-7
  teacher_revision: 2026-04-15
  training_data_snapshot: 2026_q2_judge_distillation_v3
  training_data_size: 38400
  last_canary_run: 2026-05-13
  last_canary_spearman: 0.91
  last_canary_topk_overlap: 0.86
  retrain_triggers:
    - teacher_revision_change
    - canary_spearman_below: 0.85
    - canary_topk_overlap_below: 0.80
    - quarterly_default: true
Without an explicit revision, “the teacher” is a moving target and the canary’s drop-on-vendor-release becomes attributable to “something happened” rather than “the teacher revision changed and a retrain is due”. The pin turns an unknown into a known.
Tradeoff: pinning means treating a vendor revision as a retrain event, which costs a teacher re-score plus fine-tune compute. Vendor revisions ship every two to six months; the budget is bounded.
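The retrain triggers listed in the metadata can be checked in a few lines. This is a sketch only. The metadata is inlined as a Python dict rather than loaded from the YAML file, the `last_trained` field does not exist in the file above and is added purely to express the quarterly default, and the live revision string in the second call is made up for illustration.

```python
from datetime import date


def retrain_reasons(meta, live_teacher_revision, today,
                    spearman_floor=0.85, topk_floor=0.80, max_age_days=90):
    """Return the names of the retrain triggers that currently apply."""
    reasons = []
    if live_teacher_revision != meta["teacher_revision"]:
        reasons.append("teacher_revision_change")
    if meta["last_canary_spearman"] < spearman_floor:
        reasons.append("canary_spearman_below")
    if meta["last_canary_topk_overlap"] < topk_floor:
        reasons.append("canary_topk_overlap_below")
    if (today - meta["last_trained"]).days > max_age_days:
        reasons.append("quarterly_default")
    return reasons


meta = {
    "teacher_revision": "2026-04-15",
    "last_canary_spearman": 0.91,
    "last_canary_topk_overlap": 0.86,
    "last_trained": date(2026, 4, 20),  # hypothetical field, not in judges.yaml
}
print(retrain_reasons(meta, "2026-04-15", today=date(2026, 5, 13)))
print(retrain_reasons(meta, "2026-06-30", today=date(2026, 5, 13)))
```

The first call returns an empty list. The second reports a teacher revision change even though the last canary was green, which is the point of the pin.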
4. Resist the cost-cutting urge to skip the canary
The largest hidden risk in the single-LLM overspend treatment comes from how its savings are accounted. Specialization removes the cost of running the teacher on every production call, but part of that saving has to be reinvested in running the teacher periodically for the weekly canary. The instinct will be to skip the canary, run it less often, or sample fewer pairs, to bank the full savings. That instinct is wrong.
Canary cost decomposes as: teacher API at $1–10/week for a 100-500 pair weekly sample, student cost negligible, engineering amortized into CI. A distilled judge that handles >10M scoring calls per week has reduced inference cost by roughly two orders of magnitude — call it $10K/week saved. Spending $1–10/week to retain the ability to detect silent drift is a 0.01–0.1% reinvestment. The canary pays for itself the first time it catches a regression that would otherwise have shipped.
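The reinvestment figure is simple arithmetic and worth keeping next to the budget line. The sketch below only restates the numbers given in the paragraph above; substitute real billing figures before relying on it.

```python
def reinvestment_pct(canary_cost_per_week: float, savings_per_week: float) -> float:
    """Canary spend as a percentage of the weekly inference savings."""
    return 100.0 * canary_cost_per_week / savings_per_week


weekly_savings = 10_000  # dollars per week, the text's round figure
for canary_cost in (1, 10):  # dollars per week, low and high ends of the range
    print(f"${canary_cost}/week -> {reinvestment_pct(canary_cost, weekly_savings):.2f}% of savings")
```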
Naming “do not skip the canary” as an explicit treatment is what prevents it from being cut when the next quarter’s infra budget gets reviewed. The team that pulled off the original distillation is incentivized to celebrate the cost savings, not to invest in canary discipline.
What does NOT work — and every team tries first
Adding more held-out data from the same training distribution and re-running the held-out canary. That was the canary in the first place; the held-out distribution is the problem, not the held-out sample size. A bigger held-out set computed on the wrong distribution gives a more precise measurement of the wrong thing.
The held-out canary is honest. The held-out canary is also wrong about today’s traffic. Ship the production-sampled canary as a release blocker, retrain on production-sampled triples when it fires, pin the teacher revision — in that order.
This isn’t this pattern when…
| Observation | Probable pattern | Read next |
|---|---|---|
| Held-out and production-sampled agreement both pass, judge still feels wrong | The eval set the judge feeds into has drifted | Eval drift |
| Distilled judge agrees with teacher; the LLM-as-judge prompt itself is the problem | Single frontier LLM used where a specialist would suffice | Single-LLM overspend |
| Canary green, cheap stage of a cascade letting through cases it shouldn’t | Cheap-stage calibration drifted independently of judge | Cascade saturation |
| Student-teacher Spearman fine but the threshold on the student’s score broke | Score-distribution shift, not agreement degradation | Threshold by feel |
| Judge bill is the problem, not the judge quality | Eval is honest but expensive | Eval spend overrun |
The disambiguation rule: distillation drift moves the student-teacher agreement on the live distribution; eval drift moves the query distribution under a stable judge; threshold drift moves the score distribution under a stable model; cascade saturation moves the cheap-stage calibration under a stable ranking. Same surface symptom, different mechanism.
Numbers that matter
| signal | healthy | suspect | confirmed |
|---|---|---|---|
| held-out vs production-sampled Spearman spread | under 0.03 | 0.03–0.08 | over 0.08 |
| production-sampled Spearman | ≥ 0.85 | 0.75–0.85 | under 0.75 |
| production-sampled top-10 overlap | ≥ 0.80 | 0.60–0.80 | under 0.60 |
| per-cluster Spearman (any cluster ≥ 5% of traffic) | ≥ 0.80 | 0.70–0.80 | under 0.70 |
| weekly canary cadence | weekly | biweekly | absent |
| teacher revision pinned in metadata | yes | partial | no |
These are starting thresholds. Healthy Spearman depends on the inherent noise in the teacher: an LLM-as-judge run at temperature above zero has a Spearman ceiling around 0.92 even when the teacher is compared against its own repeated scores.
Adjacent patterns
- Single-LLM overspend: the parent pattern. Distilling the judge is one of the specialization moves; distillation drift is the failure mode that move introduces. Without specialization there is no student to drift, so this pattern cannot occur.
- Cascade saturation: structurally adjacent. A distilled judge sitting as the cheap stage of a cascade has both failure modes available to it; the canary discipline that detects distillation drift is the same discipline that detects cheap-stage calibration drift, and the two pair well.
- Eval drift: structurally the same shape — a snapshot of a distribution becomes stale — applied to the metric-vs-user relationship instead of the student-vs-teacher relationship. If the eval set itself has drifted, the distillation canary will pass and the wrong thing will still be getting measured.
- Threshold by feel: a calibrated threshold on top of the distilled judge can break independently of the agreement. Spearman is rank-invariant; a threshold is absolute. The student can keep its rank correlation with the teacher and still cross a threshold differently.
When the canary ships as a release blocker, retraining runs on production-sampled triples, the teacher revision is pinned — and the judge still feels wrong — the pattern is one of the four above, not this one.
