ANOVA

Also known as: analysis of variance, one-way ANOVA, repeated-measures ANOVA, F-test

TL;DR

Analysis of variance — the statistical test that asks 'do these groups differ more than within-group noise would predict?' Partitions total variation in the data into a between-group component and a within-group component, then compares them via the F-statistic.

ANOVA (analysis of variance) is the statistical test for comparing three or more groups. The question it answers is direct: does group identity explain more variation in the data than within-group noise would predict by chance? If yes, the groups are genuinely distinguishable; if no, the apparent differences in their means are within the range of random variation.

The framework was Fisher’s contribution in the 1920s, originally for agricultural experiments: does yield under a given fertilizer differ across varieties of wheat? It generalized into the workhorse for any multi-group comparison: clinical trials with three or more arms, A/B/C/D product tests, multi-system benchmarks.

The variance partition

Given $k$ groups with $n$ observations each, total variation in the data decomposes into two pieces:

  • $SS_{\text{between}}$: variation attributable to group means differing from the grand mean. Large when groups are pulled apart.
  • $SS_{\text{within}}$: variation of observations around their own group’s mean. The “noise” within each group.

The test statistic is the ratio of mean squares, the sums of squares divided by their degrees of freedom:

$$F = \frac{MS_{\text{between}}}{MS_{\text{within}}} = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}$$

where $N$ is the total number of observations. Under the null hypothesis (all group means equal), $F$ follows an F-distribution with $(k - 1, N - k)$ degrees of freedom. A large observed $F$, one in the upper tail of that distribution, rejects the null.

ANOVA decomposes variation, then asks whether the between-group share is larger than the within-group noise would predict. That single ratio — the F-statistic — collapses a multi-group comparison into one test, one p-value, one decision.
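The partition can be worked through by hand. The following is a minimal sketch in plain Python, assuming a small made-up dataset; the function name `one_way_anova` and the numbers are illustrative, not from the text.

```python
from statistics import mean

def one_way_anova(groups):
    """Partition variation; return (ss_between, ss_within, f_stat, df_between, df_within)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]

    # Between-group: each group's mean vs the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group: each observation vs its own group's mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_stat, df_between, df_within

# Made-up illustrative data: three groups of three observations.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
ss_b, ss_w, f_stat, df_b, df_w = one_way_anova(groups)
print(f"SS_between={ss_b:.1f}  SS_within={ss_w:.1f}  F({df_b}, {df_w})={f_stat:.2f}")
# SS_between=26.0  SS_within=6.0  F(2, 6)=13.00
```

In practice a library routine such as `scipy.stats.f_oneway` should return the same F along with its p-value; the hand-rolled version is only meant to make the two sums of squares visible.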

One-way vs repeated-measures

The two flavors of ANOVA that matter:

  • One-way ANOVA — independent subjects across groups. Each observation belongs to one group only. The denominator is the pooled within-group variance.
  • Repeated-measures ANOVA — every subject is observed under every condition. Same patients measured at every time point; same query set scored by every system. The subject-level variance is removed from the denominator (because it’s shared across conditions), which sharply increases sensitivity. The same underlying effect produces a much larger F.
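To see how much the denominator changes, here is a rough sketch that computes both F-statistics on the same paired data. The data are made up (four subjects with very different baselines, three conditions), and the function name is hypothetical; it assumes a complete design with one score per subject per condition.

```python
from statistics import mean

def one_way_vs_repeated_measures(scores):
    """scores[s][c] = score of subject (or item) s under condition c.

    Returns (f_one_way, f_repeated) computed on the same numbers.
    """
    n, k = len(scores), len(scores[0])
    grand = mean([x for row in scores for x in row])
    subj_means = [mean(row) for row in scores]
    cond_means = [mean([row[c] for row in scores]) for c in range(k)]

    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    # Residual after removing both the subject effect and the condition effect.
    ss_error = sum(
        (scores[s][c] - subj_means[s] - cond_means[c] + grand) ** 2
        for s in range(n) for c in range(k)
    )

    ms_cond = ss_cond / (k - 1)
    # One-way treats subject variance as noise: it stays in the denominator.
    f_one_way = ms_cond / ((ss_subj + ss_error) / (k * (n - 1)))
    # Repeated-measures removes subject variance from the denominator.
    f_repeated = ms_cond / (ss_error / ((k - 1) * (n - 1)))
    return f_one_way, f_repeated

# Made-up data: rows are subjects, columns are conditions.
scores = [[10.0, 11.0, 12.0], [20.0, 21.0, 23.0], [30.0, 32.0, 32.0], [40.0, 41.0, 42.0]]
f_one, f_rm = one_way_vs_repeated_measures(scores)
print(f"one-way F = {f_one:.2f}, repeated-measures F = {f_rm:.2f}")
# one-way F = 0.03, repeated-measures F = 26.14
```

The condition effect is identical in both calculations; only the noise term differs, because the large subject-to-subject spread is no longer counted against it.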

Assumptions

ANOVA rests on three:

  • Normality of within-group residuals. Robust to mild violations at $n \geq 30$ per group; the central limit theorem rescues you for the test statistic.
  • Homoscedasticity — equal variance across groups. Violations matter more. Welch’s ANOVA fixes this for one-way designs by adjusting degrees of freedom.
  • Independence of observations. Violations (clustered data, repeated measures, autocorrelation) bias the test seriously. Use mixed-effects models when independence is genuinely violated and the design isn’t a clean repeated-measures design.

In practice, the dominant failure mode in evaluation is the independence assumption — running one-way ANOVA on data where every item was scored by every system (paired data masquerading as independent). The fix is always repeated-measures.

What ANOVA does NOT tell you

The F-test answers “is there some difference somewhere among the groups?” — a single global yes/no. It does not tell you which groups differ, or by how much. For those, you need:

  • Pairwise comparisons — Tukey’s HSD or similar, with multiplicity correction, to identify which specific pairs differ.
  • Effect size: typically $\eta^2$ (eta-squared) or partial $\eta^2$, the share of total variance explained by group identity. Tells you how big the effect is, in a unit-free, sample-size-independent way. For the two-group special case, use Cohen’s $d$.
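As a rough illustration of the "how much" side, the sketch below computes eta-squared as the between-group share of total variation, plus a Cohen's d (pooled standard deviation) for each pair of groups. The data and function names are made up for the example, and it only describes effect sizes; it does not test which pairs differ.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def eta_squared(groups):
    """Share of total variation explained by group identity: SS_between / SS_total."""
    all_obs = [x for g in groups for x in g]
    grand = mean(all_obs)
    ss_total = sum((x - grand) ** 2 for x in all_obs)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    return ss_between / ss_total

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    df = len(a) + len(b) - 2
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / df
    return (mean(b) - mean(a)) / sqrt(pooled_var)

# Made-up illustrative data: three groups of three observations.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
eta_sq = eta_squared(groups)
pairwise_d = {(i, j): cohens_d(groups[i], groups[j])
              for i, j in combinations(range(len(groups)), 2)}
print(f"eta^2 = {eta_sq:.4f}")  # eta^2 = 0.8125
```

Deciding which pairs are significantly different still needs a multiplicity-corrected procedure such as Tukey's HSD; recent SciPy releases provide one as `scipy.stats.tukey_hsd`.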

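For two groups with $N$ total observations, $t$, $F$, and $\eta^2$ are tightly linked, and for equal group sizes $\eta^2$ maps onto Cohen’s $d$ in the large-sample limit:

$$F = t^2, \qquad \eta^2 = \frac{t^2}{t^2 + (N - 2)} \approx \frac{d^2}{d^2 + 4}$$

So a Cohen’s $d$ of $0.5$ on two groups corresponds to $\eta^2 \approx 0.06$ (medium). The bands line up: Cohen’s “medium” two-group difference is the same effect size as a “medium” multi-group $\eta^2$.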

For more than two groups ($k > 2$), the relationship breaks: there’s no single $d$ across multiple groups, only pairwise $d$ values. That’s the practical reason ANOVA exists: to summarize a many-pair design with one number.
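The two-group identities can be checked numerically. This is a small sketch on made-up data, assuming a pooled-variance t-test and equal group sizes; `eta_sq_from_d` uses the large-sample approximation $\eta^2 \approx d^2 / (d^2 + 4)$, and all names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

def two_group_stats(a, b):
    """Return (t, f_stat, d, eta_sq, df) for two independent groups."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    diff = mean(b) - mean(a)
    t = diff / sqrt(pooled_var * (1 / n_a + 1 / n_b))  # pooled-variance t
    d = diff / sqrt(pooled_var)                        # Cohen's d
    grand = mean(a + b)
    ss_between = n_a * (mean(a) - grand) ** 2 + n_b * (mean(b) - grand) ** 2
    ss_within = pooled_var * df
    f_stat = ss_between / (ss_within / df)             # df_between is 1
    eta_sq = ss_between / (ss_between + ss_within)
    return t, f_stat, d, eta_sq, df

def eta_sq_from_d(d):
    """Large-sample, equal-group-size approximation linking d to eta^2."""
    return d ** 2 / (d ** 2 + 4)

# Made-up illustrative data.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [3.0, 4.0, 5.0, 6.0, 7.0]
t, f_stat, d, eta_sq, df = two_group_stats(a, b)
print(f"t^2 = {t**2:.3f}, F = {f_stat:.3f}")                 # t^2 = 4.000, F = 4.000
print(f"eta^2 for d = 0.5: {eta_sq_from_d(0.5):.3f}")        # eta^2 for d = 0.5: 0.059
```

With only five observations per group the approximation in `eta_sq_from_d` is loose; the exact identity at any sample size is the one through $t$, $\eta^2 = t^2 / (t^2 + df)$.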

Why this matters at scale

A practical note for large-$N$ evaluation: the F-test, like every parametric test, becomes trivially significant when the number of observations $N$ is large enough. With enough items across four systems, even an $\eta^2 = 0.005$ effect (group identity explains half a percent of the variance) produces a wildly significant F-statistic. The test is doing exactly what it’s supposed to, but the answer it gives is “yes, the groups differ” when the practical reality is “barely.”

Always report ANOVA results with their effect-size partner. A significant F-test alone, at scale, is almost meaningless; an F-test with a sizeable $\eta^2$ is a meaningful difference; an F-test with a negligible $\eta^2$ is a difference that’s real but small enough to ignore. See the effect-size bands for the calibration.
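The scaling is easy to see by holding the effect size fixed and varying the sample size. The sketch below rearranges the F formula in terms of eta-squared; the sample sizes in the loop and the rounded critical value of 2.6 (roughly the 5% cutoff for 3 numerator degrees of freedom and a large denominator) are assumptions chosen for illustration.

```python
def f_from_eta_squared(eta_sq, k, n_total):
    """F-statistic implied by a given eta^2 with k groups and n_total observations."""
    df_between, df_within = k - 1, n_total - k
    return (eta_sq / df_between) / ((1 - eta_sq) / df_within)

ETA_SQ = 0.005        # half a percent of variance explained
K = 4                 # four systems
F_CRIT_APPROX = 2.6   # approximate 5% critical value of F(3, df) for large df

for n_total in (100, 1_000, 10_000, 100_000):  # illustrative sizes, not from the text
    f_stat = f_from_eta_squared(ETA_SQ, K, n_total)
    verdict = "significant" if f_stat > F_CRIT_APPROX else "not significant"
    print(f"N={n_total:>7}: F={f_stat:8.2f} ({verdict} at ~5%)")
```

The effect size never changes across the rows; F grows roughly linearly with the number of observations, which is why significance alone says little at scale.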

Go further

Why ANOVA instead of repeated t-tests?

Running pairwise t-tests across $k$ groups inflates the false-positive rate: with $k = 5$ you have 10 comparisons, and a 5% per-comparison error rate means a ~40% family-wise error rate under the null. ANOVA gives you a single global test: is there any difference across the groups? You only descend to pairwise comparisons (with multiplicity correction) if the global F-test rejects.
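The arithmetic behind that figure is short. This sketch treats the comparisons as independent, which is a simplifying assumption (pairwise tests that share groups are correlated), so the result is an approximation; the function name is illustrative.

```python
from math import comb

def family_wise_error_rate(k, alpha=0.05):
    """Number of pairwise comparisons among k groups and the chance of at
    least one false positive, assuming independent comparisons."""
    m = comb(k, 2)
    return m, 1 - (1 - alpha) ** m

m, fwer = family_wise_error_rate(5)
print(f"{m} comparisons, family-wise error rate = {fwer:.1%}")
# 10 comparisons, family-wise error rate = 40.1%
```

A Bonferroni-style correction, one common choice of multiplicity correction, would test each of the 10 pairs at 0.05 / 10 = 0.005 instead.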

What are the assumptions, and when do they bite?

Three: (1) within-group residuals are normally distributed, (2) groups have equal variance (homoscedasticity), (3) observations are independent. With balanced designs and reasonable sample sizes (≥30 per group), ANOVA is robust to normality violations. Equal-variance violations matter more — Welch's ANOVA fixes this for one-way designs. Independence violations (clustered or correlated observations) cause serious bias and require mixed-effects models instead.

How does repeated-measures ANOVA differ from one-way ANOVA?

One-way ANOVA assumes each observation comes from one group only — independent subjects. Repeated-measures ANOVA handles the case where every subject (or every item) is scored under every condition: same query set evaluated by every system, same patient measured at every time point. The subject-level variance is removed from the denominator, which sharply increases sensitivity — the same underlying effect produces a much larger F. Always use repeated-measures when the design is paired.
