What's the Spearman-Brown effect and why does it matter for benchmark reporting?
Cronbach's α is monotone in the number of items
Also known as: Cronbach α, internal consistency reliability, α reliability
Cronbach’s α is the standard aggregate reliability statistic for a test — the single number that summarizes “do the items in this test agree they’re measuring the same thing?” It falls out of Classical Test Theory ’s variance decomposition and is, for any test with more than a handful of items, the first scalar you should report next to a leaderboard.
For a test with
The intuition is in the ratio inside the parentheses. If items are uncorrelated,
A rearrangement makes the meaning even cleaner. Let
— α is a function of the average inter-item covariance scaled by the test length. Items that co-vary push α up; items that don’t, drag it down.
Cronbach’s α is the ratio of “covariance among items” to “total variance.” When items measure the same construct, they co-vary, and α is high. When items measure unrelated things, they don’t, and α collapses toward zero.
The literature has a stable set of heuristic bands, originally from Nunnally:
For ML benchmarks the operational threshold is
Cronbach’s α has a built-in pathology that almost no ML eval paper acknowledges: α rises with
where
The implication: two benchmarks with the same α are not equally reliable per item. A
Given an
Five lines of NumPy:
import numpy as np
def cronbach_alpha(R):
R = np.asarray(R, dtype=float)
k = R.shape[1]
item_var = R.var(axis=0, ddof=1).sum()
total_var = R.sum(axis=1).var(ddof=1)
return (k / (k - 1)) * (1 - item_var / total_var)For per-item reliability — the unit-comparable number across benchmarks of different lengths — invert Spearman-Brown:
For confidence intervals on α itself, bootstrap over test-takers (resample rows with replacement, recompute α, take the 2.5th and 97.5th percentiles).
Cronbach’s α was developed for tests that measure a single construct — a math test, a personality dimension, a vocabulary test. Its key statistical assumption is tau-equivalence: every item measures the same latent construct with the same loading, differing only in difficulty and noise. Most ML benchmarks violate this, sometimes flagrantly:
If anyone reports a single Cronbach’s α for an entire multi-domain benchmark, raise an eyebrow. A pooled MMLU α or a pooled MTEB α is mostly a length artifact. The honest version is per-subject (or per-dataset) α, plus the average and spread across subjects.
Strip α reporting down to four numbers and one comparison:
That last move — α-if-item-deleted — is the single most useful operational diagnostic Cronbach’s α gives you. It collapses the question “is this item helping?” into a one-number answer: if removing the item raises α, the item is hurting. Combined with low point-biserial
The same three operational uses as Cohen’s κ , transposed from the inter-rater axis to the inter-item axis:
A test you cannot certify with a published α is a test you cannot certify, period. The combination — CTT per-item diagnostics, Cronbach’s α at the test level, Cohen’s κ on the labels, paired significance tests on the model comparisons — is the minimum measurement-theory stack a serious benchmark earns its name with. See eval set quality for the consolidated checklist.
Cronbach's α is monotone in the number of items
Cronbach's α assumes the items are tau-equivalent — measuring a single underlying construct with the same loading. A benchmark like MMLU explicitly violates this: math items, history items, and biology items measure different abilities. Pooling them inflates
Both are reliability statistics, but they answer different questions. Cohen's κ measures inter-rater reliability — do two annotators agree on labels above chance? Cronbach's α measures internal-consistency reliability — do the items in a single test co-vary as if measuring one construct? In an ML eval pipeline you want both: κ to certify the labels, α to certify the items. A benchmark with κ = 0.8 and α = 0.4 has good labels but a noisy or multidimensional test; a benchmark with κ = 0.3 and α = 0.9 has consistent items but the consistency is on labels nobody agrees on.