Cronbach's Alpha

Also known as: Cronbach α, internal consistency reliability, α reliability

TL;DR

Cronbach’s α is the aggregate internal-consistency reliability number that falls out of classical test theory (CTT).

Cronbach’s α is the standard aggregate reliability statistic for a test: the single number that summarizes “do the items in this test agree they’re measuring the same thing?” It falls out of CTT’s variance decomposition and is, for any test with more than a handful of items, the first scalar you should report next to a leaderboard.

The formula

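For a test with $k$ items, where $\sigma_j^2$ is the variance of item $j$ across test-takers and $\sigma_X^2$ is the variance of the total score $X$:

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{j=1}^{k} \sigma_j^2}{\sigma_X^2}\right)$$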

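The intuition is in the ratio inside the parentheses. If items are uncorrelated, the total-score variance is exactly the sum of the item variances, $\sigma_X^2 = \sum_j \sigma_j^2$, so the ratio is $1$ and $\alpha = 0$. If items are perfectly correlated with equal variances $\sigma^2$, $\sigma_X^2$ blows up to $k^2\sigma^2$, the ratio shrinks to $1/k$, and $\alpha = 1$. The factor $k/(k-1)$ is a test-length correction: it rescales the largest value the parenthesized term can take, $1 - 1/k$, to exactly $1$.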

A rearrangement makes the meaning even cleaner. Let $\bar{c}$ be the average inter-item covariance and $\bar{v}$ the average item variance:

$$\alpha = \frac{k\,\bar{c}}{\bar{v} + (k-1)\,\bar{c}}$$

So α is a function of the average inter-item covariance scaled by the test length. Items that co-vary push α up; items that don’t drag it down.

Cronbach’s α is the ratio of “covariance among items” to “total variance.” When items measure the same construct, they co-vary, and α is high. When items measure unrelated things, they don’t, and α collapses toward zero.
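As a quick numerical check, the variance form and the average-covariance form of α can be compared on simulated data. This is a minimal sketch: the function names and the one-factor simulation (a shared ability plus independent item noise) are illustrative assumptions, not something the article specifies.

```python
import numpy as np

def alpha_from_variances(R):
    """alpha = k/(k-1) * (1 - sum of item variances / total-score variance)."""
    R = np.asarray(R, dtype=float)
    k = R.shape[1]
    item_var_sum = R.var(axis=0, ddof=1).sum()
    total_var = R.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

def alpha_from_covariances(R):
    """alpha = k*c_bar / (v_bar + (k-1)*c_bar), using the item covariance matrix."""
    R = np.asarray(R, dtype=float)
    k = R.shape[1]
    C = np.cov(R, rowvar=False, ddof=1)               # k x k item covariance matrix
    v_bar = np.trace(C) / k                           # average item variance
    c_bar = (C.sum() - np.trace(C)) / (k * (k - 1))   # average off-diagonal covariance
    return k * c_bar / (v_bar + (k - 1) * c_bar)

# Hypothetical data: 200 test-takers, 8 items driven by one shared ability.
rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
R = ability + rng.normal(size=(200, 8))

a_var = alpha_from_variances(R)
a_cov = alpha_from_covariances(R)
print(a_var, a_cov)
```

The two values should agree up to floating-point error, since the second form is an algebraic rearrangement of the first.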

Rule-of-thumb thresholds

The literature has a stable set of heuristic bands, originally from Nunnally:

Cronbach α interpretation
  • α ≥ 0.9: excellent. The test is internally consistent enough for high-stakes decisions; rankings are stable across item resamples.
  • 0.8 ≤ α < 0.9: good. Suitable for most research and benchmark comparisons.
  • 0.7 ≤ α < 0.8: acceptable. Usable, but rankings near the leaderboard top will be unstable.
  • 0.6 ≤ α < 0.7: questionable. Either the test is too short, the items don’t co-vary enough, or the construct is multidimensional.
  • α < 0.6: problematic. The test does not internally agree on what it’s measuring; aggregate scores are noise-dominated.

For ML benchmarks the operational threshold is . Below that, your reported leaderboard differences are not stable to which items you happened to include.

The Spearman-Brown trap

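Cronbach’s α has a built-in pathology that almost no ML eval paper acknowledges: α rises with the number of items $k$, even when each added item is mostly noise, as long as the new items co-vary with the rest about as much as the existing ones do. The Spearman-Brown prophecy formula spells out the relationship: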

$$\rho_m = \frac{m\,\rho}{1 + (m-1)\,\rho}$$

where $\rho$ is the reliability of a $k$-item test and $\rho_m$ is the reliability you’d predict by extending it $m$-fold with comparable items. A short test with low reliability, extended many-fold, predicts a “respectable” reliability that is purely a length artifact.

The implication: two benchmarks with the same α are not equally reliable per item. A long benchmark at a high α may be no more diagnostic per item than a much shorter benchmark at a lower α. Always report $k$ alongside α, and when comparing benchmarks, compare per-item reliability via the Spearman-Brown back-calculation, not raw α.
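The length effect is easy to see by evaluating the prophecy formula directly. The sketch below is illustrative only: the function name and the starting reliability of 0.30 are hypothetical choices, not figures from the article.

```python
def spearman_brown(rho, m):
    """Predicted reliability after lengthening a test m-fold with comparable items."""
    return m * rho / (1 + (m - 1) * rho)

# Hypothetical starting point: a short test with reliability 0.30.
rho_short = 0.30
for m in (1, 2, 5, 10, 50):
    print(f"{m:>3}-fold length -> predicted reliability {spearman_brown(rho_short, m):.3f}")
```

With these assumed numbers, a tenfold extension already predicts 3/3.7 ≈ 0.81, even though no individual item has improved.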

Given an $n \times k$ response matrix $R$ where $R_{ij}$ is test-taker $i$’s score on item $j$:

  1. Per-item variances $\sigma_j^2$, computed across rows (test-takers).
  2. Total-score variance $\sigma_X^2$, where $X_i = \sum_j R_{ij}$.
  3. Plug into the formula.

Five lines of NumPy:

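import numpy as np

def cronbach_alpha(R):
    # R: (n test-takers) x (k items) matrix of scores
    R = np.asarray(R, dtype=float)
    k = R.shape[1]
    # Sum of per-item sample variances (each taken across test-takers)
    item_var = R.var(axis=0, ddof=1).sum()
    # Sample variance of each test-taker's total score
    total_var = R.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var / total_var)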
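A small sanity check helps confirm the function reproduces the two limiting cases of the formula: perfectly correlated items should give α = 1 and uncorrelated items α = 0. The two toy matrices below are made-up inputs chosen so the arithmetic can be verified by hand.

```python
import numpy as np

def cronbach_alpha(R):
    R = np.asarray(R, dtype=float)
    k = R.shape[1]
    item_var = R.var(axis=0, ddof=1).sum()
    total_var = R.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var / total_var)

# Two identical items: item variances 1 + 1 = 2, total-score variance 4.
identical = np.array([[1, 1], [2, 2], [3, 3]])

# Two items with zero sample covariance: total variance equals the sum of item variances.
orthogonal = np.array([[1, 1], [-1, 1], [1, -1], [-1, -1]])

alpha_identical = cronbach_alpha(identical)
alpha_orthogonal = cronbach_alpha(orthogonal)
print(alpha_identical, alpha_orthogonal)
```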

For per-item reliability, the unit-comparable number across benchmarks of different lengths, invert Spearman-Brown: $\rho_1 = \frac{\alpha}{k - (k-1)\,\alpha}$. At a comparable α, the shorter benchmark has the higher $\rho_1$, because it reaches that α with fewer items.
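The back-calculation is a one-liner. In the sketch below the function names and the two example benchmarks (1000 items at α = 0.90, 50 items at α = 0.80) are hypothetical, chosen only to show how different the per-item numbers can be.

```python
def spearman_brown(rho, m):
    """Reliability of a test lengthened m-fold from per-unit reliability rho."""
    return m * rho / (1 + (m - 1) * rho)

def per_item_reliability(alpha, k):
    """Invert Spearman-Brown: single-item reliability implied by alpha on k items."""
    return alpha / (k - (k - 1) * alpha)

# Hypothetical benchmarks, for illustration only.
long_bench = per_item_reliability(0.90, 1000)   # 0.90 / 100.9
short_bench = per_item_reliability(0.80, 50)    # 0.80 / 10.8
print(f"long: {long_bench:.4f}  short: {short_bench:.4f}")

# Round trip: re-extending the single-item value to k items recovers alpha.
recovered = spearman_brown(long_bench, 1000)
```

Under these assumed inputs the long benchmark comes out near 0.009 per item and the short one near 0.074, so the benchmark with the lower headline α is the more reliable one per item.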

For confidence intervals on α itself, bootstrap over test-takers (resample rows with replacement, recompute α, take the 2.5th and 97.5th percentiles). The interval narrows as the number of test-takers grows: a small pool of models gives a wide CI on a moderate α, and a larger pool tightens it.
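The bootstrap described above can be sketched as follows. The helper name `bootstrap_alpha_ci`, the default of 1000 resamples, and the simulated response matrix are all assumptions for illustration; a resample whose total scores have zero variance would produce a division by zero, which this sketch does not guard against.

```python
import numpy as np

def cronbach_alpha(R):
    R = np.asarray(R, dtype=float)
    k = R.shape[1]
    item_var = R.var(axis=0, ddof=1).sum()
    total_var = R.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var / total_var)

def bootstrap_alpha_ci(R, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for alpha, resampling test-takers (rows)."""
    R = np.asarray(R, dtype=float)
    rng = np.random.default_rng(seed)
    n = R.shape[0]
    alphas = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # sample rows with replacement
        alphas[b] = cronbach_alpha(R[idx])
    tail = (1 - level) / 2 * 100              # 2.5 for a 95% interval
    lo, hi = np.percentile(alphas, [tail, 100 - tail])
    return lo, hi

# Hypothetical data: 100 test-takers, 10 items sharing one ability factor.
rng = np.random.default_rng(1)
R = rng.normal(size=(100, 1)) + rng.normal(size=(100, 10))

alpha_hat = cronbach_alpha(R)
ci_lo, ci_hi = bootstrap_alpha_ci(R)
print(f"alpha = {alpha_hat:.3f}, 95% CI = [{ci_lo:.3f}, {ci_hi:.3f}]")
```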

Where it fails for ML evals

Cronbach’s α was developed for tests that measure a single construct — a math test, a personality dimension, a vocabulary test. Its key statistical assumption is tau-equivalence: every item measures the same latent construct with the same loading, differing only in difficulty and noise. Most ML benchmarks violate this, sometimes flagrantly:

  • Multi-domain benchmarks. MTEB covers retrieval, classification, clustering, and STS, which is clearly not a single construct. A pooled α conflates “how well does this model embed text?” with “how well does it cluster?” and the resulting number is a length-inflated chimera. Compute α within each task family (per dataset, per task type), not over the whole suite.
  • Mixed-difficulty subsets. A test that mixes elementary and frontier items often shows high α from sheer length, masking the fact that no single item-pool window covers the whole ability range. Stratify by difficulty band and recompute.
  • Heterogeneous test-takers. α assumes test-takers come from a comparable population, so that variance contributions are stable. Mixing 0.6B-parameter models with 70B-parameter models in the same response matrix bloats the total-score variance $\sigma_X^2$ and inflates α regardless of item quality.
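  • Skewed score distributions. α itself needs no normality assumption, but its usual interpretation and normal-theory standard errors lean on roughly continuous, well-spread scores. With bimodal scores (models that either fully solve or fully fail an item), the attainable inter-item covariances depend on item difficulty, and the headline number becomes hard to interpret.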

If anyone reports a single Cronbach’s α for an entire multi-domain benchmark, raise an eyebrow. A pooled MMLU α or a pooled MTEB α is mostly a length artifact. The honest version is per-subject (or per-dataset) α, plus the average and spread across subjects.
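Per-subject α needs only a label for each item column. The sketch below assumes a hypothetical `groups` array naming the domain of each item and uses simulated scores with two correlated abilities; neither the helper name nor the data comes from a real benchmark.

```python
import numpy as np

def cronbach_alpha(R):
    R = np.asarray(R, dtype=float)
    k = R.shape[1]
    item_var = R.var(axis=0, ddof=1).sum()
    total_var = R.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var / total_var)

def alpha_by_group(R, groups):
    """Alpha computed separately within each item group (needs >= 2 items per group)."""
    R = np.asarray(R, dtype=float)
    groups = np.asarray(groups)
    out = {}
    for g in np.unique(groups):
        cols = R[:, groups == g]
        if cols.shape[1] >= 2:
            out[str(g)] = cronbach_alpha(cols)
    return out

# Hypothetical data: two domains whose abilities share a general factor.
rng = np.random.default_rng(0)
n = 300
general = rng.normal(size=(n, 1))
ability_a = general + 0.7 * rng.normal(size=(n, 1))
ability_b = general + 0.7 * rng.normal(size=(n, 1))
R = np.hstack([ability_a + rng.normal(size=(n, 6)),
               ability_b + rng.normal(size=(n, 6))])
groups = np.array(["a"] * 6 + ["b"] * 6)

per_domain = alpha_by_group(R, groups)
values = np.array(list(per_domain.values()))
print("per-domain:", per_domain)
print("mean:", values.mean(), "spread (std):", values.std())
print("pooled:", cronbach_alpha(R))
```

Reporting the per-domain values with their mean and spread, next to the pooled figure, lets a reader see how much of the pooled number comes from length alone.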

A practical reporting template

Strip α reporting down to four numbers and one comparison:

  1. α and $k$. Always paired. Never report α without $k$; that’s like reporting a p-value without sample size.
  2. Per-item reliability $\rho_1$. The Spearman-Brown-inverted single-item reliability. This is the apples-to-apples number across benchmarks.
  3. Per-subdomain α if the benchmark is multi-domain. With spread.
  4. Bootstrap 95% CI on α. Resample test-takers; report the 2.5th and 97.5th percentiles.
  5. α-if-item-deleted for the bottom decile. Items whose removal raises α are noise items — they are reducing internal consistency. Drop them and recompute.

That last move — α-if-item-deleted — is the single most useful operational diagnostic Cronbach’s α gives you. It collapses the question “is this item helping?” into a one-number answer: if removing the item raises α, the item is hurting. Combined with low , it’s the cleanest way to identify the items worth cutting from a benchmark.
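α-if-item-deleted amounts to recomputing α once per item with that column removed. The following is a minimal sketch on simulated data, where eight items share an ability factor and a ninth is pure noise by construction; the helper name and the data are assumptions for illustration.

```python
import numpy as np

def cronbach_alpha(R):
    R = np.asarray(R, dtype=float)
    k = R.shape[1]
    item_var = R.var(axis=0, ddof=1).sum()
    total_var = R.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var / total_var)

def alpha_if_item_deleted(R):
    """For each item j, alpha of the test with column j removed."""
    R = np.asarray(R, dtype=float)
    k = R.shape[1]
    return np.array([cronbach_alpha(np.delete(R, j, axis=1)) for j in range(k)])

# Hypothetical data: items 0-7 measure a shared ability, item 8 is pure noise.
rng = np.random.default_rng(0)
n = 500
signal_items = rng.normal(size=(n, 1)) + rng.normal(size=(n, 8))
noise_item = rng.normal(size=(n, 1))
R = np.hstack([signal_items, noise_item])

alpha_full = cronbach_alpha(R)
delta = alpha_if_item_deleted(R) - alpha_full   # positive: removing the item raises alpha
flagged = np.where(delta > 0)[0]
print("alpha:", alpha_full)
print("items whose removal raises alpha:", flagged)
```

In this simulation the noise column should be the one flagged, which is the pattern the diagnostic is meant to surface on a real response matrix.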

When you have α, what do you do with it?

The same three operational uses as Cohen’s κ, transposed from the inter-rater axis to the inter-item axis:

  1. Gate the benchmark. A low α on a candidate eval set means the items don’t agree on what they’re measuring; iterate (drop noise items, add signal items, reconsider domain pooling) before publishing rankings on it.
  2. Compare to the per-item-reliability ceiling. A frontier-quality benchmark targets (per-item reliability)— roughly equivalent to at . Below that, you’re doing length-driven inflation.
  3. Track over time. As a benchmark ages and models converge near the top, the total-score variance $\sigma_X^2$ shrinks, item co-variance dries up, and α drops: the formal signature of a saturating leaderboard. Re-measure annually; budget for a re-curation pass when α slips below your gating threshold.

A test you cannot certify with a published α is a test you cannot certify, period. The combination — per-item diagnostics, Cronbach’s α at the test level, on the labels, on the model comparisons — is the minimum measurement-theory stack a serious benchmark earns its name with. See for the consolidated checklist.

Go further

What's the Spearman-Brown effect and why does it matter for benchmark reporting?

Cronbach's α tends to rise with the number of items: adding items that co-vary with the rest about as much as the existing ones do raises $\alpha$, even if each item is mostly noise. The Spearman-Brown formula makes this explicit: a test of length $k$ with reliability $\rho$ extended to length $mk$ has reliability $\frac{m\rho}{1 + (m-1)\rho}$. So a long benchmark at a high $\alpha$ may be no more reliable per item than a much shorter benchmark at a lower $\alpha$. The actionable rule: always quote $\alpha$ and $k$ together; raw $\alpha$ numbers across benchmarks of different sizes are not comparable.

Why is α the wrong number for a multi-domain benchmark?

Cronbach's α assumes the items are tau-equivalent, that is, measuring a single underlying construct with the same loading. A benchmark like MMLU explicitly violates this: math items, history items, and biology items measure different abilities. Pooling them inflates α mechanically (a longer test of comparably correlated items has higher α), but the resulting number doesn't certify that the test measures any one thing. The fix is to compute α per subject, then report both per-subject α and the average. A single α over the entire suite is at best uninformative and at worst misleading.

How does α relate to Cohen's κ?

Both are reliability statistics, but they answer different questions. Cohen's κ measures inter-rater reliability — do two annotators agree on labels above chance? Cronbach's α measures internal-consistency reliability — do the items in a single test co-vary as if measuring one construct? In an ML eval pipeline you want both: κ to certify the labels, α to certify the items. A benchmark with κ = 0.8 and α = 0.4 has good labels but a noisy or multidimensional test; a benchmark with κ = 0.3 and α = 0.9 has consistent items but the consistency is on labels nobody agrees on.
