Effect Size

Also known as: F-statistic, eta-squared, η², effect size measure

TL;DR

The complement to p-values. A p-value tells you whether a difference is unlikely under the null; an effect size tells you how big it is. For multi-group designs, the F-statistic and η² are the workhorses.

A p-value answers “is this difference unlikely under the null?” An effect size answers “how big is this difference?” These are not the same question, and conflating them is the single most common mistake in evaluation reporting.

[Figure: Cohen's d, the standardized mean difference. "How far apart are two distributions, in σ units?" Density plotted against x (in σ units) for two groups with means μA and μB, where d = (μA − μB) / σ_pooled, with the overlap percentage shown. Cohen interpretation bands for |d|: negligible 0.0 to 0.2, small 0.2 to 0.5, medium 0.5 to 0.8, large 0.8 to 1.2, very large 1.2 to 2.0.]

The effect-size family covers several measures, picked by study design. For two groups, the standard is Cohen's d, the standardized mean difference. For more than two groups, the workhorses are the F-statistic and η² (eta-squared) from one-way ANOVA. This page is about the multi-group case, which is the common one when comparing more than two systems on the same benchmark.

Effect size collapses “how big is the difference” into a unit-free scalar. With multi-group designs, that scalar is the share of variance explained by group identity: exactly what η² measures, and what the F-statistic tests for being non-zero.

The F-statistic

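When you have k groups (three sibling models, six benchmark methodologies, ten reranker variants), the F-statistic from one-way ANOVA is the ratio of between-group variance to within-group variance:

$$F = \frac{MS_{\text{between}}}{MS_{\text{within}}} = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}$$

where N is the total number of observations across all groups.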

High F means group identity explains a lot of variation: the groups are genuinely distinguishable against the within-group spread. Low F means the groups are similar relative to within-group noise. The F-statistic has a known sampling distribution under the null hypothesis of “no group difference,” so it doubles as a significance test, but the magnitude of F is what answers “are the groups far apart relative to noise.”
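The ratio described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a replacement for a statistics package; the function name `f_statistic` and the toy scores are assumptions made for the example.

```python
from statistics import fmean

def f_statistic(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    k = len(groups)                                   # number of groups
    n_total = sum(len(g) for g in groups)             # total observations N
    grand_mean = fmean(x for g in groups for x in g)
    group_means = [fmean(g) for g in groups]
    # Spread of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Spread of each observation around its own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Hypothetical per-item scores for three sibling models.
scores = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0],
]
print(f_statistic(scores))
```

On this toy data the between-group mean square is 3 and the within-group mean square is 1, so F comes out at 3 with (2, 6) degrees of freedom.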

A practical note: F scales with sample size. The same underlying group separation produces a bigger F with more data, just as the same gap produces a smaller p-value with more data. F is therefore not directly comparable across studies of different size, which is why η² exists.

η² — the fraction of variance explained

The effect-size partner of F is η² (eta-squared), the fraction of total variance explained by group membership:

$$\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}}$$

where SS denotes sum of squares. This is the multi-group analog of R² in regression: it answers “what fraction of the variation in scores is attributable to which group an observation belongs to?” Unlike F, η² doesn’t grow with N. It estimates a population quantity, so it’s comparable across studies.
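A small sketch can make the contrast with F concrete. It assumes toy scores and uses an artificial way of growing the sample (repeating every item ten times), so the group separation is held exactly fixed; the helper names are illustrative only.

```python
from statistics import fmean

def anova_sums(groups):
    """Return (SS_between, SS_total) for a list of groups."""
    grand = fmean(x for g in groups for x in g)
    ss_between = sum(len(g) * (fmean(g) - grand) ** 2 for g in groups)
    ss_total = sum((x - grand) ** 2 for g in groups for x in g)
    return ss_between, ss_total

def eta_squared(groups):
    ss_between, ss_total = anova_sums(groups)
    return ss_between / ss_total

def f_statistic(groups):
    ss_between, ss_total = anova_sums(groups)
    k = len(groups)
    n = sum(len(g) for g in groups)
    ss_within = ss_total - ss_between
    return (ss_between / (k - 1)) / (ss_within / (n - k))

base = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
big = [g * 10 for g in base]  # same scores repeated: 10x the items per group

print(eta_squared(base), f_statistic(base))
print(eta_squared(big), f_statistic(big))
```

Both sums of squares scale by the same factor when items are repeated, so η² stays at 0.5, while F rises from 3.0 to 43.5 because its denominator is divided by a larger N − k.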

Cohen’s heuristic bands for η²:

η² interpretation
  • η² < 0.01: negligible. Group membership barely predicts outcomes.
  • η² ≈ 0.01: small. About 1% of variance attributable to group.
  • η² ≈ 0.06: medium. 6% of variance explained.
  • η² ≥ 0.14: large. 14%+ of variance attributable to group identity.
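The bands above can be turned into a small lookup. This sketch assumes the thresholds 0.01, 0.06 and 0.14 act as the lower edges of the small, medium and large bands; the function name is hypothetical.

```python
def eta_squared_band(eta_sq):
    """Map an eta-squared value to Cohen's heuristic band (assumed cut points)."""
    if not 0.0 <= eta_sq <= 1.0:
        raise ValueError("eta-squared must lie in [0, 1]")
    if eta_sq < 0.01:
        return "negligible"
    if eta_sq < 0.06:
        return "small"
    if eta_sq < 0.14:
        return "medium"
    return "large"

for value in (0.005, 0.03, 0.06, 0.20):
    print(value, eta_squared_band(value))
```

A field with its own conventions would swap in different cut points; the structure stays the same.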

These are heuristics, not laws. A clinical trial can flag a very small η² on mortality as practice-changing: the action is high-stakes, so even tiny effects matter. A product team might ignore a larger η² on weekly engagement: noisy measurement, cheap to leave alone. The Cohen bands are the field-agnostic default; the right field-specific bands depend on how noisy the measurement is and what acting on a difference actually costs.

For repeated-measures designs (same items scored by every group), use partial η² instead of plain η²:

$$\eta_p^2 = \frac{SS_{\text{between}}}{SS_{\text{between}} + SS_{\text{error}}}$$

The denominator excludes the subject-level (or item-level) variance term, which is shared across groups and would otherwise dilute the effect. For four-or-more siblings scored on the same query set, partial η² ≈ 0.06 is medium and partial η² ≥ 0.14 is large: the same bands as plain η², but computed on the right denominator.

This matters: plain η² on paired data systematically understates the effect, because per-item difficulty is folded into the within-group variance instead of being credited as a separate variance term.
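A minimal sketch of the paired decomposition, assuming a complete items-by-systems score table (every system scored on every item) and made-up scores. It splits the total sum of squares into a system term, an item term and a residual error term, then reports plain and partial η² side by side.

```python
from statistics import fmean

def plain_and_partial_eta_squared(scores):
    """scores[i][j] is the score of system j on item i (complete table assumed)."""
    n_items, k = len(scores), len(scores[0])
    grand = fmean(x for row in scores for x in row)
    system_means = [fmean(row[j] for row in scores) for j in range(k)]
    item_means = [fmean(row) for row in scores]
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_between = n_items * sum((m - grand) ** 2 for m in system_means)
    ss_items = k * sum((m - grand) ** 2 for m in item_means)  # item difficulty
    ss_error = ss_total - ss_between - ss_items               # residual
    plain = ss_between / ss_total
    partial = ss_between / (ss_between + ss_error)
    return plain, partial

# Hypothetical scores: rows are items of very different difficulty,
# columns are three systems.
table = [
    [1.0, 2.0, 3.0],
    [4.0, 6.0, 5.0],
    [7.0, 7.0, 10.0],
]
plain, partial = plain_and_partial_eta_squared(table)
print(plain, partial)
```

Here item difficulty accounts for 54 of the 64 units of total variation, so plain η² is about 0.09 while partial η² is 0.6 on the same data.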

Why this matters at scale

A common failure mode in evaluation: with thousands of items, statistical significance is trivial. The standard error of the mean scales as 1/√N, so with thousands of items the per-system SE on a metric like accuracy shrinks to roughly a percentage point or less. Even a small gap between two models then clears the significance threshold trivially, not because the models are meaningfully different, but because the sample size makes the null hypothesis impossibly precise.
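To see the 1/√N scaling in numbers, here is a rough sketch that treats each item as an independent pass/fail trial, so the standard error of an accuracy estimate is √(p(1 − p)/N). The independence assumption and the example accuracy of 0.8 are illustrative choices, not values from this page.

```python
import math

def accuracy_standard_error(accuracy, n_items):
    """Binomial standard error of an accuracy estimate over n_items independent items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)

for n in (100, 1_000, 10_000):
    print(n, round(accuracy_standard_error(0.8, n), 4))
```

Each hundredfold increase in items cuts the standard error by a factor of ten: at an accuracy of 0.8 it falls from 0.04 with 100 items to 0.004 with 10,000.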

Discrimination across datasets

Effect size unlocks a question that significance testing alone cannot answer: which dataset is more discriminating?

Take the same k systems, run them on dataset A and dataset B, and compute η² for each. The dataset where η² is larger is the more discriminating one at the granularity that matters: “can this benchmark tell the systems apart?” If sibling models all cluster within a single SD of each other on dataset A (small η²) but spread out under dataset B (large η²), B is the better measurement.
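The comparison can be sketched as follows, assuming per-item 0/1 correctness for the same three systems on two made-up datasets. The data and names are hypothetical and the sample is far too small for a real decision; the point is only the shape of the computation.

```python
from statistics import fmean

def eta_squared(groups):
    """SS_between / SS_total for a list of per-system score lists."""
    grand = fmean(x for g in groups for x in g)
    ss_between = sum(len(g) * (fmean(g) - grand) ** 2 for g in groups)
    ss_total = sum((x - grand) ** 2 for g in groups for x in g)
    return ss_between / ss_total

# Hypothetical per-item correctness for three sibling systems.
dataset_a = [[1, 0, 1, 0], [1, 0, 1, 0], [1, 1, 1, 0]]  # systems cluster
dataset_b = [[0, 0, 0, 1], [1, 0, 1, 0], [1, 1, 1, 1]]  # systems spread out

eta_a = eta_squared(dataset_a)
eta_b = eta_squared(dataset_b)
more_discriminating = "B" if eta_b > eta_a else "A"
print(eta_a, eta_b, more_discriminating)
```

On this toy data η² is about 0.06 on dataset A and 0.4 on dataset B, so B is the one that separates the systems.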

What good and bad benchmarks look like
  • Good benchmark. Sibling models in a family produce η² ≥ 0.14: group identity explains 14%+ of per-item variance. The benchmark resolves the scaling curve.
  • Saturated benchmark. Sibling models cluster: η² is near zero. The benchmark cannot tell them apart; either it’s saturated, the items are dead, or it measures something orthogonal to scale.
  • Inverted benchmark. The ordering of group means runs against expectation — smaller models beat bigger ones. Usually a sign of contamination at the small-model end, or a benchmark that rewards a specific failure mode (refusal, brevity).

This is the formal statistical signature behind any “the test got better” claim: η² widened. Sibling systems that previously sat indistinguishable now spread out under a more-discriminating measurement. That’s not a vibes claim; it’s an effect-size claim.

Reporting discipline

The minimum honest report for a multi-system evaluation:

  1. Per-group metric mean — the headline number.
  2. Standard error.
  3. F-statistic, with degrees of freedom (k − 1, N − k).
  4. η² or partial η², the effect-size partner of F.
  5. p-value — from the F-test, for completeness.

For two-system comparisons specifically, use Cohen's d instead: it’s the natural two-group case, and it relates to F directly (F = t² for two groups, where t is the pooled two-sample t-statistic and t = d / √(1/n₁ + 1/n₂)).
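For the two-group case, a short sketch of Cohen's d with a pooled standard deviation, together with the pooled two-sample t-statistic it implies. The scores are made up and the function names are illustrative. With pooled variance, t² equals the one-way ANOVA F for the same two groups, which the example checks numerically.

```python
import math
from statistics import fmean, variance

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (fmean(a) - fmean(b)) / math.sqrt(pooled_var)

def pooled_t(a, b):
    """Two-sample t-statistic with pooled variance, written in terms of d."""
    return cohens_d(a, b) / math.sqrt(1 / len(a) + 1 / len(b))

# Hypothetical per-item scores for two systems.
system_a = [3.0, 4.0, 5.0]
system_b = [1.0, 2.0, 3.0]

d = cohens_d(system_a, system_b)
t = pooled_t(system_a, system_b)
print(d, t ** 2)
```

Here d is 2.0 and t² is 6.0, which matches the F obtained by running a one-way ANOVA on the same two groups.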

The connection to measurement reliability is direct: the within-group variance in F's denominator is closely tied to the test’s reliability. A test with high reliability has lower per-item noise, which mechanically inflates F and η² for any real underlying difference. That’s why fixing the measurement (dropping noise items, re-annotating bad labels) shows up as widened effect sizes: measurement theory and effect-size theory on the same axis. See for the integrated diagnostic.

Go further

What's the difference between effect size and statistical significance?

Significance is about whether the difference is real (unlikely under the null). Effect size is about whether the difference is big. They are independent: with enough samples, you can have a tiny p-value on a negligible effect size (statistically real, practically meaningless), and with too few you can have a non-significant p-value on a huge effect that the sample can't certify. At large N the first failure dominates: significance is trivial, so always report effect size alongside it.

Which effect size do I use for which design?

Two groups: Cohen's d. More than two groups: the F-statistic and η² from one-way ANOVA. These answer 'how much of the variance is explained by which group an observation belongs to,' which is exactly the right question when comparing more than two models. For paired data (same items across conditions), use partial η² from repeated-measures ANOVA to remove subject-level variance from the denominator.

When is a small effect size still meaningful?

When the cost of action is low and the cost of inaction is large. A 1% lift on a metric driving millions of decisions can be a real win — aggregate value dwarfs the per-instance effect. Conversely, a large effect that requires 10× the inference budget is often unshippable. Effect size is a unit-free scalar; the shipping decision needs unit-bearing context — cost, latency, business value.
