Threshold by Feel
A score cutoff was picked because the number looked round, not because it came from a precision/recall target — it works until the upstream model changes, and then it breaks invisibly.
Symptom
A raw model score is being compared to a hard cutoff somewhere in the pipeline. A reranker output that gates whether a document enters the LLM context. A classifier that gates whether a query is routed to a specialist. A faithfulness score that gates whether an answer is shown. A retrieval similarity that gates a no-answer response.
“We picked the score threshold because the number looked round, and now it’s quietly broken.”
The cutoff is a round number. 0.5. 0.7. 0.85. It has been the round number since the first ship. Production runs quiet for months. Then something downstream of the gate breaks:
- A model is bumped. The reranker moves from v2 to v2.5, the classifier is fine-tuned on the last quarter’s data, the faithfulness model is replaced with a distilled version. Release notes report +2 points on the in-house eval. The gate rate is not mentioned.
- The gate rate jumps. The fraction of inputs filtered by the threshold climbs from single-digit percentages to roughly a quarter of traffic in a week. No alarm fires; nobody is plotting it.
- A downstream metric tanks. LLM call volume drops 10-20%, every query is suddenly routed to the same specialist, or roughly a third of answers are flagged unfaithful. The on-call hour is spent staring at the LLM, the router, the faithfulness model, and never the cutoff.
- Internal demos still look fine. Hand-typed queries sit at the easy end of the score distribution and clear any sensible threshold. The traffic that drives the gate rate is the bottom-quartile traffic the team never types.
- One cohort gets disproportionately filtered. Short queries, long queries, non-English queries, queries against a recently-added corpus segment. The cohort is the one whose scores happened to live near the old threshold.
Nothing about the threshold changed. The threshold was always wrong; the previous model’s score distribution was hiding the wrongness. The new model emits scores on a slightly different scale, and the hard cutoff lands at a different operating point.
The signal that would have caught it three weeks earlier — the gate rate itself — is rarely on a dashboard, because it is rarely treated as a metric worth its own panel.
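The cohort effect described above can be made visible by splitting the gate rate per cohort. A minimal sketch, assuming each logged request carries a cohort tag and a raw score; the cohort names and scores here are made up for illustration.

```python
from collections import defaultdict

def gate_rate_by_cohort(records, threshold):
    """records: iterable of (cohort, score). Returns {cohort: fraction filtered out}."""
    totals = defaultdict(int)
    filtered = defaultdict(int)
    for cohort, score in records:
        totals[cohort] += 1
        if score < threshold:
            filtered[cohort] += 1
    return {cohort: filtered[cohort] / totals[cohort] for cohort in totals}

# Hypothetical log sample: (cohort tag, raw reranker score).
records = [
    ("short", 0.42), ("short", 0.47), ("short", 0.55), ("short", 0.49),
    ("long", 0.71), ("long", 0.66), ("long", 0.48), ("long", 0.80),
]

rates = gate_rate_by_cohort(records, threshold=0.5)
for cohort, rate in sorted(rates.items()):
    print(f"{cohort}: {rate:.0%} filtered")
```

A cohort whose scores cluster just under the cutoff stands out immediately in this breakdown, even when the aggregate gate rate looks unremarkable.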
Mechanism
Raw scores emitted by specialized scoring models are not calibrated probabilities. Most cross-encoders emit unbounded real values that get sigmoid-squashed at the head; sigmoid-trained ones emit values in [0, 1] but with a distribution specific to the training data, not interpretable as P(relevant). A score of 0.5 does not mean “50% chance this is relevant.” It means “the value that fell out at the layer immediately before whatever output transformation was applied, on this model, trained on this data.”
A threshold picked by eye works until the conditions under which it works change. Those conditions are:
- Model swap. A new model emits a different score distribution. Same threshold, different operating point. A cross-encoder retrained with a different loss can move its mean score by 0.1-0.2 without changing ranking quality at all.
- Domain shift. Corpus or query distribution moves and the old score distribution stops covering the new live distribution. Even with a frozen model, score histograms drift as inputs drift.
- Recalibration after a fine-tune. Same architecture, different head weights, different score distribution. The intuition that “fine-tune is safe because it’s the same model” is wrong at the threshold layer.
- Quantization or distillation. Quantized or distilled versions of the same scorer typically shift the score distribution measurably even when ranking-level agreement with the teacher stays above 0.95.
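The model-swap case can be sketched with made-up numbers: a strictly increasing rescaling of the logits leaves the ranking untouched but changes how many inputs clear a fixed 0.5 cutoff. The logits and the `rescale` parameters below are illustrative assumptions, not measurements from any real reranker.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical pre-sigmoid logits for eight candidate docs (illustrative only).
logits = [-2.0, -1.2, -0.6, -0.1, 0.3, 0.9, 1.6, 2.4]

def rescale(x: float, scale: float = 1.5, bias: float = 0.8) -> float:
    """Stand-in for a retrained head: strictly increasing, so order is preserved."""
    return scale * x + bias

old_scores = [sigmoid(x) for x in logits]
new_scores = [sigmoid(rescale(x)) for x in logits]

def ranking(scores):
    """Indices of the docs, best score first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def gate_rate(scores, cutoff=0.5):
    """Fraction of inputs filtered out (score below the cutoff)."""
    return sum(s < cutoff for s in scores) / len(scores)

same_ranking = ranking(old_scores) == ranking(new_scores)
old_rate = gate_rate(old_scores)
new_rate = gate_rate(new_scores)
print(f"same ranking: {same_ranking}")
print(f"gate rate at 0.5: old={old_rate:.3f}, new={new_rate:.3f}")
```

Any ranking metric computed on these two score lists is identical; only the gate sees a difference.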
The failure is invisible because the threshold-gated count is rarely on a dashboard. Teams notice when a fifth of LLM calls being skipped becomes a user complaint, not when the gate rate climbs over a week.
A canonical pattern. A docs-search system runs a reranker that emits sigmoid scores, with a gate eyeballed from a histogram on a dev set at 0.50. Relevant-pair scores sit around 0.7 ± 0.15; irrelevant-pair scores around 0.3 ± 0.15. Gate rate runs in the 5-10% band; precision and recall above the gate both run in the low 0.9s.
The reranker is bumped. The new version reports +2 to +3 NDCG@10 on the internal eval. Ranking is genuinely better. But the new score head is sharper. Irrelevant-pair scores migrate up by 0.1-0.15. The gate at 0.50 now admits a much larger fraction of the irrelevant distribution. Gate rate drops to 2-3%. Precision above the gate falls into the 0.7-0.8 band, so roughly 20-30% of the docs that reach the LLM are irrelevant. NDCG@10 went up. Downstream answer faithfulness fell.
The opposite variant is equally common. A more conservatively trained successor pulls both distributions down. Relevant-pair scores drop into the low 0.6s with wider variance. Gate at 0.50 starts filtering relevant docs. Gate rate jumps to roughly a quarter of traffic. The LLM has nothing to ground on for a sizeable cohort of queries. NDCG@10 went up. The product feels broken.
Both scenarios are the same pattern. Ranking improved. The score distribution moved. The threshold, picked by feel, now operates at a precision/recall point nobody chose.
This pattern is not retrieval-specific. It applies to every score-emitting specialized model in the constellation: rerankers, faithfulness models, intent classifiers, distilled judges, routing heads, calibration heads, no-answer gates. Anywhere a number meets a cutoff, the cutoff is owed a derivation.
Diagnostic
Four tests, cheap-to-expensive. Run in order; stop at the first one that fires conclusively.
Test 1 — does the threshold have a derivation? (under 1 minute)
Open the config file. Grep for the threshold value. Then answer three questions:
# Look at the file. Then ask:
# 1. Is the threshold value annotated with the labeled set and date it was derived from?
# 2. Is there a script in the repo that re-derives it from labels?
# 3. Does the threshold get re-derived when the model is bumped?
| Answer | Reading |
|---|---|
| Yes to all three | Healthy. This pattern is unlikely to be the dominant cause. |
| Yes to one or two | Suspect. Run Test 2. |
| No to all three | Almost certainly firing the moment the upstream model is bumped. |
A hard-coded 0.5 with no comment, no derivation script, and no CI check tying it to a model revision answers the diagnostic already. The remaining tests confirm and quantify.
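The grep step can be roughly automated. The sketch below assumes a hypothetical naming convention (names containing GATE, THRESHOLD, or CUTOFF) and treats a trailing comment as a stand-in for "has a derivation note"; both are assumptions to adapt, and a comment alone does not prove a derivation exists.

```python
import re

# Hypothetical convention: any *GATE* / *THRESHOLD* / *CUTOFF* assignment
# with a bare float literal is treated as a threshold candidate.
THRESHOLD_ASSIGN = re.compile(
    r"^\s*(?P<name>\w*(?:GATE|THRESHOLD|CUTOFF)\w*)\s*=\s*"
    r"(?P<value>\d*\.\d+)\s*(?P<comment>#.*)?$",
    re.IGNORECASE,
)

def find_underived_thresholds(config_text: str) -> list[tuple[int, str, float]]:
    """Return (line_number, name, value) for thresholds with no trailing comment."""
    hits = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        m = THRESHOLD_ASSIGN.match(line)
        if m and not m.group("comment"):
            hits.append((lineno, m.group("name"), float(m.group("value"))))
    return hits

# Made-up config contents for illustration.
sample_config = """\
RERANK_GATE = 0.5
FAITHFULNESS_THRESHOLD = 0.6843  # derived 2026-05-13 from 2026_q2_eval_v3, P>=0.95
MAX_DOCS = 8
"""

flagged = find_underived_thresholds(sample_config)
for lineno, name, value in flagged:
    print(f"line {lineno}: {name} = {value} has no derivation note")
```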
Test 2 — score-distribution overlap on a labeled sample (10 minutes)
Plot the model’s scores on a labeled sample, separately for known-positive and known-negative pairs. The question is not “does the model work” — it is “is there a threshold that can work.”
import numpy as np
np.random.seed(42)
# Inline fixture: synthetic scores from a reranker on labeled pairs.
# Replace with your real (score, label) extraction.
n_pos, n_neg = 200, 800
pos_scores = np.clip(np.random.normal(0.71, 0.13, n_pos), 0, 1)
neg_scores = np.clip(np.random.normal(0.28, 0.17, n_neg), 0, 1)
scores = np.concatenate([pos_scores, neg_scores])
labels = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
# Bucketed overlap: at each cutoff, what fraction of each class is above it?
print(f"{'cutoff':>6} {'pos>=cut':>9} {'neg>=cut':>9} {'precision':>10}")
for cut in np.linspace(0.2, 0.8, 13):
    pos_above = (pos_scores >= cut).mean()
    neg_above = (neg_scores >= cut).mean()
    n_above_pos = (pos_scores >= cut).sum()
    n_above_neg = (neg_scores >= cut).sum()
    # max(..., 1) guards against division by zero when nothing clears the cut.
    prec = n_above_pos / max(n_above_pos + n_above_neg, 1)
    print(f"{cut:>6.2f} {pos_above:>9.2%} {neg_above:>9.2%} {prec:>10.2%}")
# Expected output (approximate; selected rows of the 13 printed):
# cutoff pos>=cut neg>=cut precision
# 0.20 99.5% 66.6% 27.2%
# 0.35 99.0% 33.6% 42.5%
# 0.50 94.0% 9.6% 71.0%
# 0.65 66.5% 1.5% 91.7%
# 0.80 20.5% 0.0% 99.4%
If positive-class and negative-class fractions-above-cut overlap heavily across the entire score range — no row where positives are above 90% AND negatives are below 10% — no threshold will work well. The scorer’s accuracy is the problem; threshold tuning is not. A clean valley — one cutoff where positives are still mostly above and negatives are mostly below — means a threshold can be derived. Move to Test 3.
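The "clean valley" check can be written down directly. A minimal sketch, assuming the 90% / 10% bounds from the paragraph above and a tiny hand-made fixture rather than real reranker output.

```python
def fraction_at_or_above(scores, cut):
    return sum(s >= cut for s in scores) / len(scores)

def separating_cutoffs(pos_scores, neg_scores, cuts, min_pos=0.90, max_neg=0.10):
    """Cutoffs where at least min_pos of positives and at most max_neg of negatives clear the cut."""
    return [
        cut for cut in cuts
        if fraction_at_or_above(pos_scores, cut) >= min_pos
        and fraction_at_or_above(neg_scores, cut) <= max_neg
    ]

# Hand-made fixture (illustrative only).
pos = [0.62, 0.66, 0.70, 0.74, 0.78, 0.81, 0.85, 0.88, 0.90, 0.93]
neg = [0.05, 0.12, 0.18, 0.22, 0.27, 0.31, 0.36, 0.40, 0.45, 0.58]
cuts = [round(0.20 + 0.05 * i, 2) for i in range(13)]  # 0.20 .. 0.80

valley = separating_cutoffs(pos, neg, cuts)
print("cutoffs with a clean valley:", valley)

# Second fixture: negatives that overlap the positives almost everywhere.
overlapping_neg = [0.55, 0.60, 0.65, 0.70, 0.72, 0.75, 0.80, 0.82, 0.86, 0.90]
no_valley = separating_cutoffs(pos, overlapping_neg, cuts)
print("heavily overlapping classes:", no_valley)
```

An empty list is the "no threshold will work well" reading; a non-empty list is the range Test 3 should search.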
Test 3 — does the precision/recall curve agree with the current threshold? (15 minutes)
The decisive test. Compute the precision-recall curve from the same labeled scores. Find the operating point the current threshold actually lands at. A team that has been operating on the assumption of 0.9 precision while the threshold sits at 0.7 precision is firing this pattern.
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score
np.random.seed(42)
# Same fixture as Test 2.
n_pos, n_neg = 200, 800
pos_scores = np.clip(np.random.normal(0.71, 0.13, n_pos), 0, 1)
neg_scores = np.clip(np.random.normal(0.28, 0.17, n_neg), 0, 1)
scores = np.concatenate([pos_scores, neg_scores])
labels = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
# thresholds is increasing; precision[i] and recall[i] describe "score >= thresholds[i]".
precision, recall, thresholds = precision_recall_curve(labels, scores)
# Where does the current "round number" threshold actually sit?
current_threshold = 0.50
idx = np.searchsorted(thresholds, current_threshold)
idx = min(idx, len(precision) - 1)
print(f"Current threshold: {current_threshold:.2f}")
print(f" precision at this cut: {precision[idx]:.3f}")
print(f" recall at this cut: {recall[idx]:.3f}")
print(f" AUC (ranking quality, threshold-free): {roc_auc_score(labels, scores):.3f}")
# Where SHOULD the threshold be for 95% precision?
target = 0.95
ok = precision >= target
if ok.any():
    # First qualifying index = smallest threshold that reaches the target precision.
    j = np.where(ok)[0][0]
    j_thresh = min(j, len(thresholds) - 1)
    print(f"\nFor precision >= {target}:")
    print(f" threshold should be: {thresholds[j_thresh]:.4f}")
    print(f" resulting recall: {recall[j]:.3f}")
# Expected output (approximate; estimated from the fixture's Gaussian parameters,
# exact digits depend on the random draw):
# Current threshold: 0.50
# precision at this cut: ~0.71
# recall at this cut: ~0.94
# AUC (ranking quality, threshold-free): ~0.98
# For precision >= 0.95:
# threshold should be: ~0.70
# resulting recall: ~0.5
A common reading: the gate was assumed to operate at ~0.95 precision because 0.5 is “the safe side.” It actually operates closer to 0.7, so roughly three in ten “above gate” docs are irrelevant. The model is fine: AUC sits in the high 0.9s. The threshold is the problem. This single test settles roughly half of all real threshold-by-feel cases.
Running the same script on a new model after a swap typically moves the derived 0.95-precision threshold by 0.1-0.2 against the previous derivation. Same labeled set, same target operating point, different derived threshold. The old one is now wrong.
Test 4 — gate-rate drift as a monitorable scalar (5 minutes to plot, ongoing)
The other three tests confirm the pattern on a snapshot. Ongoing monitoring needs a single scalar that surfaces the failure as it forms, before the downstream metric breaks. The right scalar is the gate rate itself, with a secondary scalar of the score-distribution KL between two windows.
import numpy as np
from scipy.special import rel_entr
np.random.seed(42)
# Two weekly windows of production scores (no labels needed).
# Fixture: week A under old model, week B under new model.
week_a = np.clip(np.random.normal(0.45, 0.20, 5000), 0, 1) # old
week_b = np.clip(np.random.normal(0.55, 0.16, 5000), 0, 1) # new, sharper
threshold = 0.50
gate_rate_a = (week_a < threshold).mean()
gate_rate_b = (week_b < threshold).mean()
print(f"gate rate week A: {gate_rate_a:.2%}")
print(f"gate rate week B: {gate_rate_b:.2%}")
print(f"delta: {(gate_rate_b - gate_rate_a)*100:+.1f}pp")
# KL divergence between the two score histograms.
bins = np.linspace(0, 1, 21)
p, _ = np.histogram(week_a, bins=bins, density=False)
q, _ = np.histogram(week_b, bins=bins, density=False)
p = p / p.sum() + 1e-9
q = q / q.sum() + 1e-9
kl = float(rel_entr(q, p).sum())
print(f"KL(week_b || week_a) = {kl:.3f}")
# Expected output (approximate; exact digits depend on the random draw):
# gate rate week A: ~60%
# gate rate week B: ~38%
# delta: roughly -22pp
# KL(week_b || week_a) = ~0.17
Healthy: gate-rate week-over-week change under 2pp AND KL < 0.05. Suspect: 2-5pp shift OR KL between 0.05 and 0.15. Confirmed: gate-rate change over 5pp in a week or KL > 0.15 against the prior 4-week baseline. Plot both weekly; alarm on threshold crossings. The alarm fires when the model is bumped, when the corpus drifts, when a fine-tune lands, and when a quantized or distilled variant ships: the four causes listed under Mechanism.
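The three bands can be encoded as a small function so the dashboard and the alarm agree on the reading. This is a sketch of the rule as stated above; the function name and the choice to treat a 5pp change or a KL of exactly 0.15 as "suspect" are assumptions.

```python
def classify_drift(gate_rate_delta_pp: float, kl: float) -> str:
    """Map the two Test 4 scalars onto the healthy / suspect / confirmed bands.

    gate_rate_delta_pp: week-over-week gate-rate change in percentage points (signed).
    kl: KL divergence of this week's score histogram against the baseline.
    """
    delta = abs(gate_rate_delta_pp)
    if delta > 5.0 or kl > 0.15:
        return "confirmed"
    if delta >= 2.0 or kl >= 0.05:
        return "suspect"
    return "healthy"

print(classify_drift(-22.9, 0.20))  # large gate-rate swing
print(classify_drift(3.0, 0.02))    # moderate gate-rate shift, quiet KL
print(classify_drift(0.8, 0.01))    # both scalars quiet
```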
Test 4 catches the pattern before the downstream metric breaks. The first three confirm after the fact. Test 4 is the one wired up once and left running.
Worked example end-to-end
A docs-search system shows a 10-15% drop in LLM call volume with no obvious cause. The reranker was bumped within the prior week.
Test 1: the threshold reads RERANK_GATE = 0.5 in config.py. No comment, no derivation script, no CI check. That alone is enough to justify re-deriving the threshold, but the diagnostic continues in order to locate where the new operating point landed.
Test 2: on a labeled set of 1k-5k pairs scored with the new reranker, positives center around 0.78, negatives around 0.4. A valley sits around 0.6 — a threshold exists, the 0.5 cutoff isn’t it.
Test 3: the precision-recall curve places the 0.5 cutoff at roughly P=0.74, R=0.97. The pipeline was designed for an operating point in the P~0.93, R~0.91 band — the old operating point under the old model. For a stated target of P >= 0.95, the new derived threshold lands in the 0.65-0.70 range. The old threshold isn’t filtering enough irrelevant docs.
Test 4: gate rate fell by roughly 5pp in the week following the model bump. KL between the two weeks of production scores sits around 0.2. Both metrics fired; neither was on a dashboard.
The mechanism: a sharper score head on the new reranker admits more irrelevant docs above 0.5. The LLM receives several multiples more irrelevant context per call. Faithfulness flags rise, users complain, on-call routes to the LLM team, the LLM team blames retrieval, retrieval blames the reranker, the reranker team points at the NDCG release note. The 0.5 was nobody’s problem because the 0.5 was nobody’s number.
The remediation: derive the new threshold from the labeled set at the stated operating point, calibrate the score with isotonic regression so future thresholds are expressed in probability units, commit eval/thresholds.yaml pinned to the reranker revision, wire the gate rate onto the retrieval dashboard with a >3pp weekly alarm. LLM call volume returns to prior levels within a few days. Faithfulness flag rate falls back into the low single digits. On the next reranker bump, the CI check refuses the deploy until a new threshold is committed; the derivation runs in under two minutes; the same incident does not recur.
Treatment
The discipline that makes the threshold survive future model swaps. Order matters; each step assumes the previous.
1. Derive the threshold from the operating point, not from intuition
A threshold is a precision-vs-recall trade. The trade is a product decision; the curve is a fact. Decide the operating point first, then read the threshold off the precision-recall curve at that point.
from sklearn.metrics import precision_recall_curve
# Assumes labeled (scores, labels) arrays from a representative set, as in the Test 3 fixture.
precision, recall, thresholds = precision_recall_curve(labels, scores)
# High-precision (expensive downstream): pick smallest t with P >= target_p.
# High-recall (cheap or human-in-the-loop): pick largest t with R >= target_r.
The derived threshold is rarely a round number. Stakeholders push back because a value like 0.6843 looks unprincipled compared to 0.5. The reverse is true: the round number was the unprincipled one. The operating point belongs in the config alongside the threshold so the next person knows which knob they are turning.
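The two selection rules (smallest cutoff meeting a precision target, largest cutoff meeting a recall target) can be sketched without any library. The helper names and the twelve-pair fixture are illustrative assumptions; on real data the same logic would run over the labeled eval set.

```python
def operating_points(scores, labels):
    """(threshold, precision, recall) for every distinct score used as a cutoff."""
    total_pos = sum(labels)
    points = []
    for t in sorted(set(scores)):
        kept = [y for s, y in zip(scores, labels) if s >= t]
        tp = sum(kept)
        points.append((t, tp / len(kept), tp / total_pos))
    return points

def threshold_for_precision(scores, labels, target_p):
    """Smallest cutoff whose precision meets target_p (keeps recall as high as possible)."""
    ok = [t for t, p, _ in operating_points(scores, labels) if p >= target_p]
    return min(ok) if ok else None

def threshold_for_recall(scores, labels, target_r):
    """Largest cutoff whose recall still meets target_r (keeps precision as high as possible)."""
    ok = [t for t, _, r in operating_points(scores, labels) if r >= target_r]
    return max(ok) if ok else None

# Hand-made labeled pairs (illustrative only): 1 = relevant, 0 = irrelevant.
scores = [0.15, 0.22, 0.30, 0.41, 0.48, 0.52, 0.57, 0.63, 0.71, 0.80, 0.86, 0.93]
labels = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1]

high_precision_cut = threshold_for_precision(scores, labels, target_p=0.95)
high_recall_cut = threshold_for_recall(scores, labels, target_r=0.80)
print("cutoff for P >= 0.95:", high_precision_cut)
print("cutoff for R >= 0.80:", high_recall_cut)
```

Because precision is not monotonic in the cutoff, the "smallest cutoff" rule is worth re-checking on a held-out split before it is committed.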
2. Calibrate the raw score so future thresholds mean what they say
Even at a chosen operating point, the raw score still does not mean P(relevant). Platt scaling or isotonic regression against labels yields a calibration map that makes the post-calibration score interpretable as probability. After calibration, a threshold expressed in probability units survives model swaps more cleanly, provided the calibration map is refit for each new model: the refit absorbs the distribution shift that would otherwise move the raw threshold.
from sklearn.isotonic import IsotonicRegression
iso = IsotonicRegression(out_of_bounds="clip").fit(scores_fit, labels_fit)
calibrated = iso.predict(scores_val)
# Validate on a held-out split with a reliability diagram.
After calibration, 0.7 on the calibrated score means approximately 70% probability of relevance, on this scorer, against this label set. The round number is finally allowed to be principled because it now denotes something. Calibration adds an offline step to every release; wire it into the same script that derives the threshold so neither runs without the other.
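One way to validate the calibration map is a binned reliability check: within each probability bin, compare the mean calibrated score with the observed positive rate and track the worst gap. This is a minimal sketch with equal-width bins and made-up held-out data; the bin count is an assumption, and small bins make the estimate noisy.

```python
def max_bin_error(calibrated, labels, n_bins=5):
    """Largest |mean predicted probability - observed positive rate| over non-empty bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(calibrated, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[idx].append((p, y))
    worst = 0.0
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        worst = max(worst, abs(mean_pred - observed))
    return worst

# Hypothetical held-out split where calibrated scores match observed rates.
good_probs = [0.1] * 10 + [0.5] * 10 + [0.9] * 10
good_labels = [1] + [0] * 9 + [1] * 5 + [0] * 5 + [1] * 9 + [0]

# Hypothetical overconfident scorer: says 0.9, is right 60% of the time.
bad_probs = [0.9] * 10
bad_labels = [1] * 6 + [0] * 4

good_error = max_bin_error(good_probs, good_labels)
bad_error = max_bin_error(bad_probs, bad_labels)
print(f"well calibrated: {good_error:.3f}")
print(f"overconfident:   {bad_error:.3f}")
```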
3. Add a relative gate, not just an absolute floor
A single absolute threshold under-filters easy queries (everything scores high; the gate is too permissive) and over-filters hard queries (everything scores low; the gate is too strict). A relative gate, “keep docs within ε of the top-scoring doc for this query,” corrects both.
def gate(query_scores, abs_floor=0.55, rel_gap=0.10, min_keep=1):
    if not query_scores:
        return []
    top = max(query_scores)
    cut = max(abs_floor, top - rel_gap)
    kept = [s for s in query_scores if s >= cut]
    return kept if len(kept) >= min_keep else sorted(query_scores, reverse=True)[:min_keep]
The absolute floor expresses a global precision target; the relative gap expresses within-query relative confidence. Together they handle query-difficulty variation that no single absolute cutoff can. Two knobs instead of one — derive each from data. The absolute floor comes from the precision-recall curve (Treatment §1); the relative gap comes from the within-query score-spread distribution on labeled queries, targeted at the 80th-percentile within-query gap.
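One possible reading of "the 80th-percentile within-query gap" is sketched below: for every relevant doc in a labeled query, measure how far its score sits below that query's top score, then take the 80th percentile of those gaps so that about 80% of relevant docs would survive the relative gate. That interpretation, the nearest-rank percentile, and the fixture are all assumptions.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]; avoids interpolation choices."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def derive_rel_gap(labeled_queries, q=80):
    """labeled_queries: one list per query of (score, is_relevant) pairs."""
    gaps = []
    for docs in labeled_queries:
        top = max(score for score, _ in docs)
        gaps.extend(top - score for score, relevant in docs if relevant)
    return percentile(gaps, q)

# Hand-made labeled queries (illustrative only).
labeled_queries = [
    [(0.90, 1), (0.85, 1), (0.40, 0)],
    [(0.70, 1), (0.62, 1), (0.55, 0), (0.30, 0)],
    [(0.60, 0), (0.58, 1), (0.20, 0)],
]

rel_gap = derive_rel_gap(labeled_queries)
print(f"derived rel_gap: {rel_gap:.2f}")
```

The result would replace the hard-coded `rel_gap=0.10` default in `gate`, and it should be re-derived alongside the absolute floor.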
4. Lock the threshold under version control, pinned to the model revision
Every model bump must come with a re-derived threshold. The pattern that survives team rotation and model swaps:
# eval/thresholds.yaml — committed, code-reviewed, diffed in PRs.
reranker:
  model: bge-reranker-v2.5
  revision: 8a1d4f9
  threshold_precision_95: 0.6843
  threshold_recall_90: 0.4421
  calibration: isotonic_2026_q2.pkl
  labeled_set_id: 2026_q2_eval_v3
  derived_on: 2026-05-13
  derived_by: scripts/derive_threshold.py
Plus a CI check that refuses to merge any change to the reranker revision unless the threshold file has been re-derived against it:
import os
from datetime import date

import yaml

def check_threshold_matches_model(path="eval/thresholds.yaml",
                                  env="RERANKER_REVISION"):
    with open(path) as f:
        cfg = yaml.safe_load(f)
    reranker = cfg["reranker"]
    if reranker["revision"] != os.environ[env]:
        raise RuntimeError("Reranker bumped; re-derive threshold and commit.")
    derived_on = reranker["derived_on"]
    # safe_load parses an unquoted 2026-05-13 as a date; a quoted value stays a str.
    if isinstance(derived_on, str):
        derived_on = date.fromisoformat(derived_on)
    age = (date.today() - derived_on).days
    if age > 90:
        raise RuntimeError(f"Threshold derived {age} days ago; re-derive.")
The gate is no longer a number anybody can quietly change. It is a derivation that runs against a labeled set, pinned to a model revision, re-run on every bump. The discipline is enforced by CI, not by a doc nobody reads. Model bumps go from “change one line” to “change one line, run derivation, commit threshold file, get review.” The slowdown is the cost of the discipline. It is also the reason the discipline survives team rotation.
5. Monitor the gate rate weekly with a 3pp alarm
The gate rate — fraction of inputs filtered out by the threshold — is the metric that catches this pattern early. Plot it weekly per gate. Alarm on a >3pp change week-over-week against a 4-week trailing baseline. A gate rate climbing from 5% to 22% is the same event as users noticing a fifth of LLM calls aren’t happening, except the alarm fires three weeks earlier. The alarm is cheap; the alternative is the incident.
Pair the gate-rate scalar with the score-distribution KL between the current week and the trailing 4-week baseline (Test 4). KL catches shifts that don’t cross the gate threshold but predict that they will soon.
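The weekly alarm can be sketched as a comparison of the latest week against the mean of the preceding four. Reading "4-week trailing baseline" as a simple mean of the four prior weeks is an assumption, as are the function name and the sample history.

```python
from statistics import mean

def gate_rate_alarm(weekly_rates, alarm_pp=3.0, baseline_weeks=4):
    """Compare the latest weekly gate rate with the mean of the preceding weeks.

    weekly_rates: gate rates as fractions (0.05 == 5%), oldest first.
    Returns (fired, delta_pp); fired stays False until enough history exists.
    """
    if len(weekly_rates) < baseline_weeks + 1:
        return False, 0.0
    baseline = mean(weekly_rates[-(baseline_weeks + 1):-1])
    delta_pp = (weekly_rates[-1] - baseline) * 100
    return abs(delta_pp) > alarm_pp, delta_pp

# Hypothetical histories: four quiet weeks followed by the week under test.
after_model_bump = [0.05, 0.06, 0.05, 0.06, 0.22]
quiet_week = [0.05, 0.06, 0.05, 0.06, 0.065]

fired, delta = gate_rate_alarm(after_model_bump)
print(f"after bump: fired={fired}, delta={delta:+.1f}pp")
quiet_fired, quiet_delta = gate_rate_alarm(quiet_week)
print(f"quiet week: fired={quiet_fired}, delta={quiet_delta:+.1f}pp")
```

Using the absolute change means the alarm fires for both variants of the pattern: the gate that suddenly filters too much and the gate that suddenly filters too little.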
What does NOT work — and every team tries first
Picking a “safer” round number. Trading 0.5 for 0.6 trades one undocumented threshold for another. The next model swap will move the right answer somewhere else; the new round number will be just as wrong.
Using the model’s reported AUC to justify the threshold. AUC measures ranking quality; it is threshold-free by construction. A model with AUC 0.98 can still be gated at a cutoff that lands on a poor precision/recall point. AUC summarizes ranking across all possible cutoffs; it does not say which point on the curve the system is operating at.
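A small illustration of why AUC cannot justify a cutoff: two score lists with identical ordering have identical AUC, yet the same 0.5 gate lands on different precision. The second scorer is a hypothetical affine rescaling of the first, and all numbers are made up.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def precision_at(scores, labels, cutoff):
    kept = [y for s, y in zip(scores, labels) if s >= cutoff]
    return sum(kept) / len(kept) if kept else float("nan")

labels = [0, 0, 0, 1, 0, 1, 1, 1]
scores_a = [0.10, 0.20, 0.30, 0.45, 0.55, 0.60, 0.75, 0.90]
# Hypothetical second scorer: same ordering, scores squeezed into [0.5, 1.0].
scores_b = [0.5 * s + 0.5 for s in scores_a]

auc_a, auc_b = auc(scores_a, labels), auc(scores_b, labels)
prec_a = precision_at(scores_a, labels, 0.5)
prec_b = precision_at(scores_b, labels, 0.5)
print(f"AUC: a={auc_a:.4f}, b={auc_b:.4f}")
print(f"precision at 0.5: a={prec_a:.2f}, b={prec_b:.2f}")
```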
Tuning the threshold to make the downstream metric look good. The downstream metric averages over the entire traffic mix; the threshold operates per-input. Threshold tuning by downstream-metric search lands on a value that maximizes the metric on the current distribution and silently breaks on the next. The threshold must be derived from labels at a stated operating point, not from a moving downstream signal.
The threshold is owed a derivation. Pick the operating point first, read the threshold off the precision-recall curve, calibrate the score, pin it to the model revision, alarm on the gate rate. In that order.
This isn’t this pattern when…
| You observe… | This is probably… | Read next |
|---|---|---|
| Threshold was correctly derived; metric and gate rate stable; users still complain | Eval set has drifted, not the score distribution | Eval drift |
| Threshold derived; gate green; downstream LLM contradicts retrieved doc | Retrieval is fine, generation is the failure | Right doc, wrong answer |
| Threshold-gated cheap stage was calibrated; expensive stage is still firing on everything | Cascade routing rule is wrong, not the threshold | Cascade saturation |
| Distilled scorer’s threshold worked at deploy, slowly drifting | Student-teacher agreement degrading on live distribution | Distillation drift |
| Threshold on intent classifier; one specialist is always picked | Routing-head miscalibration, same family as this pattern | (No playbook yet — treat as a special case of this one) |
The disambiguation rule of thumb: eval drift moves the distribution of queries; threshold-by-feel moves the distribution of scores; distillation drift moves the student-teacher agreement. Same surface symptom (a dashboard or downstream metric breaks unexpectedly), different mechanism underneath.
Numbers that matter
| signal | healthy | suspect | confirmed |
|---|---|---|---|
| threshold has a derivation script + pinned revision | yes to both | yes to one | no to both |
| precision at current cutoff vs. stated target | within 0.02 | within 0.05 | gap over 0.05 |
| gate-rate week-over-week change | under 2pp | 2-5pp | over 5pp |
| KL(this week vs prior 4 weeks) on raw scores | under 0.05 | 0.05-0.15 | over 0.15 |
| days since last threshold re-derivation | under 60 | 60-90 | over 90 |
| reliability-diagram max bin error (after calibration) | under 0.05 | 0.05-0.10 | over 0.10 |
These are starting values for the meta-thresholds. Tune them after two or three model-swap cycles of history; “healthy KL” depends on how stable the scorer family is across releases.
Adjacent patterns
- Eval drift: same surface symptom — “the dashboards lie” — different mechanism. Eval drift moves the query distribution; threshold-by-feel moves the score distribution. A refreshed eval combined with a drifting gate rate means both are firing.
- Cascade saturation: a cheap-then-expensive cascade depends on the cheap stage’s confidence calibration being right. Threshold-by-feel at the cheap stage is one of the most common ways a cascade silently saturates: the gate intended to send 5% of traffic to the expensive stage starts sending 30%.
- Distillation drift: a distilled scoring model whose teacher-agreement degrades on the live distribution surfaces as “the threshold doesn’t work anymore” even when the threshold was correctly derived. Re-deriving patches the symptom; it does not patch the underlying drift.
- Reranker on the request path: a reranker behind the gate sitting on the user-visible path compounds a miscalibrated threshold with latency budget — the slow stage runs for context the LLM was going to ignore.
