Classical Test Theory

Q: What does CTT actually let you say about a benchmark?

Per-item: difficulty $p_i$ and point-biserial discrimination $r_{pb}$. Aggregate: total-score variance, internal consistency (Cronbach's α), and the fraction of items at the difficulty extremes. From those, four diagnostic claims fall out: too many items at ceiling/floor, too many low-discrimination items, low internal consistency, low between-model variance. Computable in five lines of NumPy from the raw response matrix; no parametric assumptions.

Also known as: CTT, true-score theory, item analysis

TL;DR

The 100-year-old psychometric toolkit that ML eval research mostly ignored. Decompose every observed score into (true score + error).

Classical Test Theory (CTT) is the boring, robust, hundred-year-old psychometric framework that the ML eval literature mostly forgot. It answers most of the diagnostic questions you actually have about a benchmark — is this item too easy? is it noise? is the whole test internally consistent? — using nothing but the raw response matrix and a handful of summary statistics.

The true-score model

CTT starts from a single decomposition. Every observed score $X$ on an item (or a test) is the sum of a true score $T$ and a measurement error $E$:

$$X = T + E$$

with $\mathbb{E}[E] = 0$ and $\mathrm{Cov}(T, E) = 0$: the error has zero mean and is uncorrelated with the true score. That single assumption gets you the variance decomposition $\mathrm{Var}(X) = \mathrm{Var}(T) + \mathrm{Var}(E)$, and the definition of reliability:

$$\rho_{XX'} = \frac{\mathrm{Var}(T)}{\mathrm{Var}(X)}$$

This is the fraction of observed variance that is signal rather than measurement noise. Reliability of 1 means the test perfectly orders test-takers; reliability of 0 means it’s a coin flip. Most ML benchmarks live somewhere between.
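As a quick sanity check on the decomposition, here is a minimal simulation sketch. The normal distributions, the chosen variances (4 and 1), and the function name `simulate_reliability` are illustrative assumptions; CTT itself only needs the error to be mean-zero and uncorrelated with the true score.

```python
import numpy as np

def simulate_reliability(n_takers=100_000, var_true=4.0, var_error=1.0, seed=0):
    """Draw true scores T and independent errors E, then form X = T + E."""
    rng = np.random.default_rng(seed)
    T = rng.normal(0.0, np.sqrt(var_true), n_takers)   # true scores
    E = rng.normal(0.0, np.sqrt(var_error), n_takers)  # mean-zero error, independent of T
    X = T + E                                          # observed scores
    return {
        "var_X": X.var(),
        "var_T_plus_var_E": T.var() + E.var(),
        "reliability": T.var() / X.var(),              # Var(T) / Var(X)
    }

result = simulate_reliability()
print(result)
```

With these assumed variances the reliability should land close to 4 / (4 + 1) = 0.8, and `var_X` should nearly match `var_T_plus_var_E`, with the small gap coming from sampling noise in the T-E covariance.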

CTT is descriptive, not generative. It does not assume an item-response curve, does not assume unidimensionality, does not assume local independence, does not assume exchangeable test-takers. It just gives you $p_i$ and $r_{pb}$ per item from raw responses.

Item difficulty: $p_i$

For a binary-graded item $i$, item difficulty is the simplest possible statistic: the fraction of test-takers who got it right.

$$p_i = \frac{1}{N}\sum_{j=1}^{N} x_{ji}$$

where $x_{ji} \in \{0, 1\}$ is test-taker $j$'s score on item $i$ and $N$ is the number of test-takers.

The name is the opposite of the value: high $p_i$ means easy, low $p_i$ means hard. The item’s variance under a Bernoulli model is $p_i(1 - p_i)$, which is maximal at $p_i = 0.5$ and collapses to zero as $p_i \to 0$ or $p_i \to 1$.

This is the first CTT diagnostic for benchmark health. Items with $p_i = 1$ (everyone right) or $p_i = 0$ (everyone wrong) carry no information about model differences: they have zero variance to explain. A benchmark where 60% of items have $p_i$ at or near 1 is a saturated benchmark dressed up as a hard one.
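The difficulty statistic is a column mean of the response matrix. The sketch below assumes a binary matrix with one row per test-taker (model) and one column per item; the function names and the toy matrix `R` are made up for illustration.

```python
import numpy as np

def item_difficulty(responses):
    """Fraction of test-takers who got each item right (one value per column)."""
    return np.asarray(responses, dtype=float).mean(axis=0)

def item_variance(p):
    """Bernoulli variance p(1 - p); zero for items everyone passes or everyone fails."""
    return p * (1.0 - p)

# Toy response matrix: 4 test-takers (rows) x 4 items (columns).
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
])

p = item_difficulty(R)       # [1.0, 0.5, 0.5, 0.0]
var = item_variance(p)       # [0.0, 0.25, 0.25, 0.0]
dead = (p == 0) | (p == 1)   # items with no variance to explain
print(p, var, dead)
```

Here the first item is at ceiling and the last is at floor, so only the two middle items can separate test-takers.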

Item-difficulty diagnostics

$p_i \approx 0$: floor item. No model gets it; carries no signal. Likely mislabeled, ambiguous, or genuinely outside the frontier.
Low $p_i$ (well below 0.5): hard item. Discriminates strong from weak models if $r_{pb}$ is healthy.
$p_i$ near 0.5: moderate item. Maximum information region; this is where good benchmarks concentrate.
High $p_i$ (well above 0.5): easy item. Discriminates only at the lower end of the ability range.
$p_i \approx 1$: ceiling item. Everyone right; no signal between top models. Saturated.

For graded items (e.g. graded relevance on a 0-3 scale) the analog is mean grade: replace “fraction correct” with “mean score” and the same logic applies. Items where every model scores 3 or every model scores 0 are dead.

Item discrimination: $r_{pb}$

The second CTT diagnostic asks a different question: do the test-takers who get this item right tend to be the ones who do well overall? This is the point-biserial correlation, the Pearson correlation between the binary item score and the total test score:

$$r_{pb} = \frac{\bar{X}_1 - \bar{X}_0}{s_X}\sqrt{p_i(1 - p_i)}$$

where $\bar{X}_1$ is the mean total score among test-takers who got item $i$ right, $\bar{X}_0$ is the mean among those who got it wrong, and $s_X$ is the (population) standard deviation of total scores. No independence or unidimensionality assumption is baked in; it’s just a Pearson correlation between two columns of the response matrix.
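To make the equivalence with Pearson correlation concrete, the sketch below computes the point-biserial from group means and leaves the comparison against `np.corrcoef` to the reader. The function name `point_biserial` is hypothetical, and the toy response matrix is simulated (the logistic draw is only a way to generate plausible 0/1 data, not a CTT assumption). The total score here includes the item itself; excluding it is a common variant that this sketch does not implement.

```python
import numpy as np

def point_biserial(responses):
    """Point-biserial of each item against the total score (item included in the total)."""
    R = np.asarray(responses, dtype=float)
    total = R.sum(axis=1)
    s_x = total.std()                        # population SD (ddof=0)
    p = R.mean(axis=0)
    r = np.full(R.shape[1], np.nan)          # NaN where the statistic is undefined
    if s_x == 0:
        return r
    for i in range(R.shape[1]):
        if p[i] in (0.0, 1.0):               # ceiling/floor items have no variance
            continue
        right = R[:, i] == 1
        m1 = total[right].mean()             # mean total among those who got item i right
        m0 = total[~right].mean()            # mean total among those who got it wrong
        r[i] = (m1 - m0) / s_x * np.sqrt(p[i] * (1.0 - p[i]))
    return r

# Simulated toy data: 40 test-takers, 6 items (illustrative only).
rng = np.random.default_rng(0)
ability = rng.normal(size=(40, 1))
difficulty = np.linspace(-1.0, 1.0, 6)
R = (rng.random((40, 6)) < 1.0 / (1.0 + np.exp(-(ability - difficulty)))).astype(int)

r_pb = point_biserial(R)
print(np.round(r_pb, 3))
```

Each defined entry of `r_pb` should match `np.corrcoef(R[:, i], R.sum(axis=1))[0, 1]` up to floating-point error, since the group-means formula is an algebraic rewrite of Pearson's r for a binary variable.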

Heuristic thresholds, well-established in the educational measurement literature:

Point-biserial discrimination thresholds

$r_{pb} \ge 0.40$: excellent. Strong models clearly outperform weak ones on this item.
$0.30 \le r_{pb} < 0.40$: good. Healthy discrimination; keep.
$0.20 \le r_{pb} < 0.30$: marginal. Borderline; investigate before keeping.
$0 \le r_{pb} < 0.20$: noise item. Item-by-total correlation is too low to trust; cut it.
$r_{pb} < 0$: backwards item. Strong test-takers do worse on this item than weak ones. Almost always a labeling error or a mis-keyed answer.

Negative discrimination is the most diagnostic single signal in CTT. If a frontier model is less likely than a weak model to get item $i$ right, the most likely explanation is that the gold label is wrong, not that the frontier model is suddenly underperforming.

The high-low method

A poor man’s discrimination, computable in your head from a ranked spreadsheet:

$$D = p_{\text{top}} - p_{\text{bottom}}$$

That is, the fraction correct in the top 27% of test-takers minus the fraction correct in the bottom 27%. The 27% threshold is a Kelley result from 1939 that maximizes discriminating power for normally distributed total scores; in practice anywhere between 25% and 33% gives essentially the same answer.

A high-low index of $D \ge 0.4$ is excellent; $D < 0.2$ is a noise item. The high-low number is monotone with point-biserial in well-behaved cases, so use it when you want to eyeball a single suspicious item without firing up Python.
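For completeness, the same high-low number can be sketched in a few lines. The function name `high_low_index`, the rounding rule for the group size, and the toy matrix are illustrative assumptions; ties in total score at the group boundary are broken arbitrarily by sort order here.

```python
import numpy as np

def high_low_index(responses, frac=0.27):
    """Fraction correct in the top group minus fraction correct in the bottom group."""
    R = np.asarray(responses, dtype=float)
    total = R.sum(axis=1)
    k = max(1, int(round(frac * R.shape[0])))   # group size, e.g. 27% of test-takers
    order = np.argsort(total, kind="stable")    # ascending by total score
    bottom, top = order[:k], order[-k:]
    return R[top].mean(axis=0) - R[bottom].mean(axis=0)

# Toy matrix: 10 test-takers x 4 items; the last item is deliberately "backwards".
R = np.array([
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
])

D = high_low_index(R)
print(np.round(D, 3))  # first three items: 1.0; last item: about -0.667
```

With 10 test-takers the groups are the top 3 and bottom 3. The first three items are answered correctly by every top scorer and no bottom scorer, while the last item is answered only by low scorers, so its index is negative.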

Why CTT survives where parametric fits don’t

CTT’s only structural assumption is that measurement error is mean-zero and uncorrelated with true score — a much weaker condition than the parametric, independence-and-unidimensionality assumptions baked into latent-trait models. On real ML evals where items cluster by topic, test-takers are correlated through shared training data, and the item pool deforms under contamination, that gap matters.

The price: CTT can’t give you per-test-taker ability on a calibrated cross-study scale. The gain: $p_i$, $r_{pb}$, and Cronbach’s α are summary statistics of the response matrix, computable in five lines of NumPy on any benchmark, and they don’t depend on a model fit converging or its assumptions holding.

The “CTT first” pipeline. Score every model on every item. For each item, compute $p_i$ and $r_{pb}$. Drop items with $p_i \approx 0$, $p_i \approx 1$, or low $r_{pb}$. Re-score on the survivors. The shorter, sharper benchmark that falls out has higher between-model variance, larger within-family effect sizes, and tighter confidence intervals than the original, typically with a fraction of the items.
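The pipeline above can be sketched roughly as follows. The function names and the default cutoffs (`p_lo=0.05`, `p_hi=0.95`, `r_min=0.2`) are illustrative assumptions to tune per benchmark, and the response matrix is simulated with two ceiling and two floor items appended.

```python
import numpy as np

def ctt_item_stats(R):
    """Return (difficulty p_i, point-biserial r_pb) for each column of a 0/1 matrix."""
    R = np.asarray(R, dtype=float)
    total = R.sum(axis=1)
    p = R.mean(axis=0)
    r = np.full(R.shape[1], np.nan)
    if total.std() > 0:
        for i in range(R.shape[1]):
            if 0.0 < p[i] < 1.0:                      # skip zero-variance items
                r[i] = np.corrcoef(R[:, i], total)[0, 1]
    return p, r

def ctt_prune(R, p_lo=0.05, p_hi=0.95, r_min=0.2):
    """Boolean mask of items to keep. Cutoffs are assumed defaults, not canonical values."""
    p, r = ctt_item_stats(R)
    keep = (p > p_lo) & (p < p_hi) & (r >= r_min)     # NaN r compares False, so it is dropped
    return keep, p, r

# Simulated benchmark: 30 models, 8 ordinary items, 2 ceiling items, 2 floor items.
rng = np.random.default_rng(1)
ability = rng.normal(size=(30, 1))
difficulty = np.linspace(-1.5, 1.5, 8)
core = (rng.random((30, 8)) < 1.0 / (1.0 + np.exp(-2.0 * (ability - difficulty)))).astype(int)
R = np.hstack([core, np.ones((30, 2), dtype=int), np.zeros((30, 2), dtype=int)])

keep, p, r = ctt_prune(R)
trimmed = R[:, keep]                                  # re-score on the survivors
print(f"kept {keep.sum()} of {R.shape[1]} items")
```

The ceiling and floor columns are always dropped by construction; which of the ordinary items survive depends on the assumed `r_min` and on the simulated draw.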

A worked diagnostic

Imagine an $n$-item benchmark with $m$ models scored on it. You tabulate $p_i$ and $r_{pb}$ for each item:

Some items have $p_i = 1$: every model gets them. Drop them; they contribute no variance.
Some items have $p_i = 0$: no model gets them. Drop them too, or investigate for label errors.
Some items have low $r_{pb}$: they correlate weakly with total score. Probably noise; drop or fix.
Some items have $r_{pb} < 0$: likely mislabeled. Re-annotate.

After the cuts, the items that remain are doing real measurement work. Recompute the aggregate: Cronbach’s α on the trimmed test will typically rise even though the item count dropped, because the noise items were subtracting from reliability. This is the single highest-leverage move in eval curation, and CTT is what tells you which items to cut.
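A tiny hand-checkable case shows the effect on α. The sketch below uses the standard formula $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i s_i^2}{s_X^2}\right)$ for $k$ items; the function name `cronbach_alpha` and the toy matrix (three identical items plus one item uncorrelated with them) are illustrative assumptions.

```python
import numpy as np

def cronbach_alpha(R):
    """Cronbach's alpha from a (test-takers x items) score matrix."""
    R = np.asarray(R, dtype=float)
    k = R.shape[1]
    item_var = R.var(axis=0, ddof=1)           # per-item sample variances
    total_var = R.sum(axis=1).var(ddof=1)      # sample variance of total scores
    return k / (k - 1) * (1.0 - item_var.sum() / total_var)

x = np.array([1, 1, 0, 0])        # a "signal" item, repeated three times
noise = np.array([1, 0, 1, 0])    # exactly uncorrelated with x on this toy data

full = np.column_stack([x, x, x, noise])
trimmed = np.column_stack([x, x, x])

alpha_full = cronbach_alpha(full)        # 0.8
alpha_trimmed = cronbach_alpha(trimmed)  # 1.0
print(alpha_full, alpha_trimmed)
```

Dropping the uncorrelated item raises α from 0.8 to 1.0 even though the test got shorter. Real benchmarks will not be this clean, but the direction of the effect is the same.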

Connection to the inter-rater layer

CTT diagnoses the test. Cohen’s κ diagnoses the labels the test depends on. They are complementary: a test where the item-level $r_{pb}$ are healthy but judge-judge κ is low is a test on top of unreliable labels. The discrimination is real, but the ranking is between models that are noisily mis-scored. Always layer the two: κ to gate the labels, CTT to gate the items, then paired significance tests to gate the model comparisons.

THE CTT SHIT-DATASET DIAGNOSTIC

A benchmark is in trouble — and probably needs re-annotation, item-pruning, or replacement — when any of these flags fire on its response matrix:

High fraction of items with $p_i \approx 0$ or $p_i \approx 1$: the test is saturated at one end or the other; no signal at the frontier.
High fraction of items with low $r_{pb}$: too much of the test is noise; aggregate scores are dominated by which items each model got lucky on.
Low aggregate Cronbach’s α (below roughly 0.7 by the usual rule of thumb): the test does not internally agree on what it’s measuring; rankings are not stable across resamples of items.
More than a handful of items with $r_{pb} < 0$: labeling errors that flip the sign of the discrimination signal.

Run this checklist on a candidate benchmark before you trust any leaderboard built on it. See eval set quality for the full diagnostic stack.
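One way to turn the checklist into numbers is sketched below. The function name `ctt_flags`, the returned keys, and the `r_noise=0.2` cutoff are assumptions for illustration; the sketch also assumes total scores are not all identical and that at least one item is neither at ceiling nor at floor.

```python
import numpy as np

def ctt_flags(R, r_noise=0.2):
    """Summarize the four checklist quantities for a 0/1 response matrix."""
    R = np.asarray(R, dtype=float)
    n_items = R.shape[1]
    total = R.sum(axis=1)
    p = R.mean(axis=0)
    live = (p > 0) & (p < 1)                       # items not at ceiling or floor
    r = np.full(n_items, np.nan)
    for i in np.flatnonzero(live):
        r[i] = np.corrcoef(R[:, i], total)[0, 1]   # point-biserial vs. total score
    item_var = R.var(axis=0, ddof=1).sum()
    alpha = n_items / (n_items - 1) * (1.0 - item_var / total.var(ddof=1))
    return {
        "frac_extreme": float((~live).mean()),             # share of ceiling/floor items
        "frac_low_disc": float(np.mean(r[live] < r_noise)),  # share of live items below cutoff
        "n_negative": int(np.sum(r[live] < 0)),            # backwards items
        "alpha": float(alpha),
    }

# Toy matrix: 10 test-takers x 5 items; item 4 is backwards, item 5 is at ceiling.
R = np.array([
    [1, 1, 1, 0, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 1],
    [0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
])

flags = ctt_flags(R)
print(flags)
```

On this toy matrix one item in five is at an extreme and one live item has negative discrimination, which is the kind of pattern that should prompt a look at the gold labels before trusting the aggregate.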

Go further

What does CTT actually let you say about a benchmark?

Per-item: difficulty $p_i$ and point-biserial discrimination $r_{pb}$. Aggregate: total-score variance, internal consistency (Cronbach's α), and the fraction of items at the difficulty extremes. From those, four diagnostic claims fall out: too many items at ceiling/floor, too many low-discrimination items, low internal consistency, low between-model variance. Computable in five lines of NumPy from the raw response matrix; no parametric assumptions.

Related: Cronbach's alpha, Eval set quality

What does the high-low method give you that the full point-biserial doesn't?

Nothing statistically — point-biserial uses every test-taker, the high-low method uses only the top and bottom 27%. But the high-low number is trivial to compute by eye on a ranked spreadsheet, and it survives weird score distributions where Pearson assumptions wobble. In practice, report point-biserial; reach for high-low when you want a back-of-envelope read on a single suspicious item.

Related: Pearson correlation

How does CTT relate to Cronbach's α?

Cronbach's α is the aggregate reliability number that falls out of CTT's variance decomposition: it estimates the reliability ratio $\mathrm{Var}(T)/\mathrm{Var}(X)$ from the covariance structure among items. CTT gives you per-item diagnostics ($p_i$, $r_{pb}$); Cronbach's α gives you the whole-test number. Use both: CTT to find which items to drop, α to certify the post-drop test.

Related: Cronbach's alpha, Cohen's kappa
