Also known as: standardized mean difference, Hedges' g, paired d
TL;DR
The standardized mean difference between two groups. Cohen's d expresses the gap between two group means in units of how spread-out each group is — the most-cited effect size for two-sample comparisons.
Cohen’s $d$ is the standardized mean difference between two groups: the gap between their means, expressed in units of the within-group standard deviation. A $d$ of $1$ means the means are one standard deviation apart. A $d$ of $0$ means the means coincide. It is the most-cited member of the effect size family, and the canonical answer to “how big is the difference between these two distributions?”
The formula
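For two groups with means $\bar{x}_1$ and $\bar{x}_2$ and a pooled standard deviation $s_p$:
$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}$$
where the pooled SD comes from the within-group variances $s_1^2$ and $s_2^2$ and the group sizes $n_1$ and $n_2$:
$$s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}$$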
The number is unit-free: it expresses “the gap between the means” in units of “the within-group spread.” Two distributions that are one SD apart yield $d = 1$ whether you’re measuring milliseconds, currency, or accuracy.
Cohen’s $d$ converts a difference of means into a comparison against the noise in the data. A given $d$ on a tight measurement (small within-group spread) represents the same standardized gap as that $d$ on a noisy one: in both cases group membership explains a comparable share of the variation.
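The pooled formula and the unit-free claim can be checked with a short sketch. It uses only the Python standard library; the function name `pooled_cohens_d` and the latency values are invented for illustration.

```python
import math
import statistics


def pooled_cohens_d(group1, group2):
    """Cohen's d for two independent samples, using the pooled SD."""
    n1, n2 = len(group1), len(group2)
    var1 = statistics.variance(group1)  # sample variance (n - 1 denominator)
    var2 = statistics.variance(group2)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (statistics.mean(group1) - statistics.mean(group2)) / pooled_sd


# Hypothetical latencies in milliseconds, invented for illustration.
fast_ms = [100.0, 110.0, 120.0, 130.0, 140.0]
slow_ms = [120.0, 130.0, 140.0, 150.0, 160.0]

d_ms = pooled_cohens_d(slow_ms, fast_ms)
# Same data expressed in seconds: the units cancel, so d is unchanged.
d_s = pooled_cohens_d([x / 1000 for x in slow_ms], [x / 1000 for x in fast_ms])
```

Both calls return roughly 1.26: rescaling the measurement multiplies the mean gap and the pooled SD by the same factor, which cancels in the ratio.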
Cohen’s bands
The heuristic interpretation, from Cohen’s 1988 monograph:
Cohen's d interpretation
$|d| < 0.2$: negligible. The means differ by less than a fifth of a standard deviation; distributions overlap almost entirely.
$|d| \approx 0.2$: small. Detectable with adequately powered samples; barely visible on a histogram.
$|d| \approx 0.5$: medium. Visible to the eye; the modal observation in one group is clearly different from the modal observation in the other.
$|d| \approx 0.8$: large. Distributions overlap noticeably less; for roughly normal data, a randomly drawn pair (one observation from each group) falls in the expected order about 70% of the time.
$|d|$ well above $0.8$: very large. Distributions essentially separated. Rare in real data; usually a sign you’re looking at a genuinely different population.
These are heuristics, not laws. A clinical trial can flag a tiny $d$ on mortality as practice-changing: the action is high-stakes, so even tiny effects matter. A product team might ignore a larger $d$ on weekly engagement: noisy measurement, cheap to leave alone. The Cohen bands are the field-agnostic default; the right field-specific bands depend on how noisy the measurement is and what acting on a difference actually costs.
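One way to keep the bands adjustable per field is to pass the cut points in as data. The sketch below is a hypothetical helper, not part of any library. Its default cut points of 0.2, 0.5, and 0.8 are Cohen's conventional anchors used here as band edges, which is an assumption. It has no separate "very large" band, because no threshold for it is assumed.

```python
def cohen_band(d, cutoffs=((0.2, "negligible"), (0.5, "small"), (0.8, "medium"))):
    """Label |d| by upper cut points; at or above the last cut point is 'large'."""
    magnitude = abs(d)  # the sign only says which group is higher
    for upper, label in cutoffs:
        if magnitude < upper:
            return label
    return "large"


labels = [cohen_band(d) for d in (0.1, 0.3, -0.6, 0.9)]
# A stricter, hypothetical field-specific scheme: swap in different cut points.
strict = cohen_band(0.3, cutoffs=((0.5, "negligible"), (1.0, "small"), (1.5, "medium")))
```

Here `labels` comes out as negligible, small, medium, large, while the stricter scheme relabels the same 0.3 as negligible.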
Pooled vs paired
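For two systems run on the same inputs (same query set, same patient cohort, same A/B traffic split), the right denominator is the SD of the per-pair differences, not the pooled SD:
$$d_{\text{paired}} = \frac{\bar{\Delta}}{s_\Delta}$$
where $\Delta_i = a_i - b_i$ is the per-pair difference, $\bar{\Delta}$ is its mean, and $s_\Delta$ is its standard deviation. The denominator $s_\Delta$ is much smaller than the pooled SD $s_p$ when the two systems track each other (which they typically do: item difficulty, subject-level baselines, and other shared variance terms drop out of the difference). With equal group variances, $s_\Delta = s_p\sqrt{2(1 - r)}$, where $r$ is the correlation between the paired scores, so $s_\Delta < s_p$ whenever $r > 0.5$. The paired $d$ is then larger than the unpaired one for the same data.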
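```python
import numpy as np


def paired_cohens_d(a, b):
    """Cohen's d for paired samples."""
    # Per-pair differences; a and b must be aligned item by item.
    diffs = np.asarray(a) - np.asarray(b)
    # Mean difference over the sample SD (ddof=1) of the differences.
    return diffs.mean() / diffs.std(ddof=1)
```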
Equivalently: $d_{\text{paired}} = t / \sqrt{n}$, where $t$ is the paired $t$-statistic and $n$ is the number of pairs. So if you already have a paired $t$-test, you have $d$ for free.
Don’t unpair paired data: the unpaired $d$ silently absorbs shared variance into the denominator and shrinks toward zero.
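A small simulation shows the shrinkage. Everything here is synthetic and assumed for illustration: a shared per-item difficulty with SD 1.0, independent noise with SD 0.1, and a true gap of 0.05 in favour of system A. The sketch uses only the standard library.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synthetic benchmark: each item has a shared difficulty that moves both systems.
n_items = 1000
difficulty = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
scores_a = [t + 0.05 + rng.gauss(0.0, 0.1) for t in difficulty]  # A is better by 0.05
scores_b = [t + rng.gauss(0.0, 0.1) for t in difficulty]

# Paired d: mean difference over the SD of the per-pair differences.
diffs = [a - b for a, b in zip(scores_a, scores_b)]
paired_d = statistics.mean(diffs) / statistics.stdev(diffs)

# Unpaired d: same mean gap over the pooled within-group SD (equal n here).
pooled_sd = math.sqrt((statistics.variance(scores_a) + statistics.variance(scores_b)) / 2)
unpaired_d = (statistics.mean(scores_a) - statistics.mean(scores_b)) / pooled_sd

# The paired t-statistic recovers the paired d as t / sqrt(n).
t_stat = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n_items))
```

Under these assumptions the paired value should be several times the unpaired one, because the difficulty term dominates the pooled SD but cancels in the differences. Dividing `t_stat` by `math.sqrt(n_items)` reproduces `paired_d`.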
A worked example
Two models, $A$ and $B$, scored on the same 1000-item benchmark. Per-item scores are paired (same items, different model). Let $\bar{\Delta}$ be the mean per-item difference and $s_\Delta$ the SD of the per-item differences.
$$d = \frac{\bar{\Delta}}{s_\Delta}$$
A medium-small effect: the mean gap is about a third of the per-item noise. With 1000 paired observations, the standard error on the mean is $s_\Delta / \sqrt{1000} \approx 0.03\,s_\Delta$, so the difference clears statistical significance trivially ($t = d\sqrt{n}$, roughly 10 here). But $d$ is the actionable summary: the models genuinely differ, but the gap is modest relative to the variance the test exposes.
If you only reported the $p$-value, you’d claim a real difference. If you only reported the raw mean gap, you’d have no sense of scale. Cohen’s $d$ collapses both into a single, dimension-free number.
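The arithmetic behind an example of this shape can be sketched as follows. The mean difference of 0.03 and SD of 0.10 are placeholder values chosen only to give a $d$ near 0.3; they are assumptions, not figures from the benchmark described above.

```python
import math

n_items = 1000
mean_diff = 0.03  # hypothetical mean per-item difference (placeholder value)
sd_diff = 0.10    # hypothetical SD of the per-item differences (placeholder value)

d = mean_diff / sd_diff                   # paired Cohen's d
std_error = sd_diff / math.sqrt(n_items)  # standard error of the mean difference
t_stat = mean_diff / std_error            # identical to d * sqrt(n_items)
```

With these placeholders, `d` is 0.3 and `t_stat` is about 9.5. The effect is small to medium, yet the test statistic is large because it grows with $\sqrt{n}$ while $d$ does not.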
Beyond two groups
Cohen’s $d$ is two-group. For more groups (three sibling models, six benchmark variants, ten reranker configurations) there is no clean way to summarize all pairwise $d$ values into one scalar. The right tool is one-way ANOVA’s $F$-statistic and its effect-size partner $\eta^2$ (eta-squared); see effect size for the multi-group treatment.
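The link from $d$ to $\eta^2$ is direct: for two groups of equal size, $\eta^2 = d^2 / (d^2 + 4)$. A $d$ of $0.5$ corresponds to $\eta^2 \approx 0.06$ (medium); a $d$ of $0.8$ corresponds to $\eta^2 \approx 0.14$ (large). The bands line up.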
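The correspondence between the two sets of bands can be checked numerically. This is a minimal sketch using the standard conversion $\eta^2 = d^2 / (d^2 + 4)$, which assumes two groups of equal size. The function name is hypothetical.

```python
def eta_squared_from_d(d):
    """Convert Cohen's d to eta-squared, assuming two equal-sized groups."""
    return d**2 / (d**2 + 4)


medium = round(eta_squared_from_d(0.5), 2)  # Cohen's medium d
large = round(eta_squared_from_d(0.8), 2)   # Cohen's large d
```

The rounded results are 0.06 and 0.14, the conventional medium and large benchmarks for $\eta^2$.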
The connection to classical test theory is also direct: the per-item SD that determines $d$ is closely tied to the test’s reliability. A test with high Cronbach’s α has lower noise per item, which mechanically inflates $d$ for any real underlying improvement. That’s why fixing the measurement (dropping noise items, re-annotating bad labels) shows up as a larger $d$: measurement theory and effect size on the same axis.
Go further
When do I use Cohen's d vs the F-statistic?
Cohen's d is the two-group case: you have system A and system B, you want one number for 'how different are these.' For more than two groups, d doesn't generalize cleanly; use the F-statistic from one-way ANOVA and its effect-size partner $\eta^2$. See effect size for the multi-group treatment.
If the two groups are independent (different subjects, different draws), use pooled d: the denominator is the pooled within-group SD. If the two groups share subjects (same query set scored by two systems), use paired d: the denominator is the SD of the per-pair differences. Paired d is usually much larger than pooled d for the same data, because shared variance (query difficulty, person-level baselines) drops out. Don't unpair paired data.
Mathematically: the means differ by 0.3 of a within-group standard deviation. Geometrically: pull two histograms apart by a third of their width — they still overlap heavily, but the modes are visibly offset. Whether 0.3 is 'big enough to act on' depends entirely on the cost of acting and the noise floor of the measurement — Cohen's bands are field-agnostic defaults, not laws.