Why ANOVA instead of repeated t-tests?
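Running a separate t-test on every pair of groups multiplies the chances of a false positive. With $k$ groups there are $k(k-1)/2$ pairs; for four groups (six pairs) tested at $\alpha = 0.05$ each, the chance of at least one spurious "significant" result would be $1 - 0.95^6 \approx 26\%$ if the tests were independent. ANOVA replaces that battery of tests with a single test at the stated $\alpha$.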
Also known as: analysis of variance, one-way ANOVA, repeated-measures ANOVA, F-test
Analysis of variance — the statistical test that asks 'do these groups differ more than within-group noise would predict?' Partitions total variation in the data into a between-group component and a within-group component, then compares them via the F-statistic.
ANOVA (analysis of variance) is the statistical test for comparing three or more groups. The question it answers is direct: does group identity explain more variation in the data than within-group noise would predict by chance? If yes, the groups are genuinely distinguishable; if no, the apparent differences in their means are within the range of random variation.
The framework was Fisher’s contribution in the 1920s, originally for agricultural experiments: does this fertilizer’s yield differ across varieties of wheat? It generalized into the workhorse for any multi-group comparison: clinical trials with three or more arms, A/B/C/D product tests, multi-system benchmarks.
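Given $k$ groups, with $n_i$ observations $x_{ij}$ in group $i$, group means $\bar{x}_i$, grand mean $\bar{x}$, and $N = \sum_i n_i$ observations in total, the total sum of squares splits into a between-group part and a within-group part:

$$SS_{\text{total}} = SS_{\text{between}} + SS_{\text{within}}, \qquad SS_{\text{between}} = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2, \qquad SS_{\text{within}} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2$$

The test statistic is the ratio of mean squares, that is, the sums of squares divided by their degrees of freedom:

$$F = \frac{MS_{\text{between}}}{MS_{\text{within}}} = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}$$

where $k - 1$ and $N - k$ are the between-group and within-group degrees of freedom. Under the null hypothesis that all group means are equal, $F$ follows an $F$ distribution with $(k - 1, N - k)$ degrees of freedom.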
ANOVA decomposes variation, then asks whether the between-group share is larger than the within-group noise would predict. That single ratio — the F-statistic — collapses a multi-group comparison into one test, one p-value, one decision.
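The decomposition can be sketched in a few lines of plain Python. This is a minimal illustration, not a replacement for a statistics library: the function name `one_way_anova_f` and the scores are made up for the example, and it returns only the F-statistic and its degrees of freedom (no p-value, since the standard library has no F distribution).

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of independent groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)

    # Between-group sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Illustrative (made-up) scores for three hypothetical systems.
scores = [
    [4.0, 5.0, 6.0],
    [6.0, 7.0, 8.0],
    [8.0, 9.0, 10.0],
]
f_stat, df_b, df_w = one_way_anova_f(scores)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")  # F(2, 6) = 12.00
```

Here the group means are 5, 7 and 9, so the between-group sum of squares is 24 and the within-group sum of squares is 6, giving mean squares of 12 and 1. If SciPy is available, `scipy.stats.f_oneway(*scores)` should report the same statistic along with a p-value.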
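The two flavors of ANOVA that matter: one-way ANOVA, for independent groups where each observation belongs to exactly one group, and repeated-measures ANOVA, for designs where every subject or item is measured under every condition.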
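ANOVA rests on three assumptions: normally distributed within-group residuals, equal variance across groups (homoscedasticity), and independent observations.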
In practice, the dominant failure mode in evaluation is the independence assumption — running one-way ANOVA on data where every item was scored by every system (paired data masquerading as independent). The fix is always repeated-measures.
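The cost of ignoring the pairing can be seen on a small made-up table in which every item (row) is scored by every system (column). The sketch below assumes a complete, balanced design with one score per cell; the function names and data are hypothetical. It computes the F-statistic both ways: once treating the columns as independent groups, and once removing the item-level variance from the error term.

```python
from statistics import mean

def one_way_f(table):
    """F treating each column as an independent group (ignores the pairing)."""
    n, k = len(table), len(table[0])
    grand = mean(x for row in table for x in row)
    col_means = [mean(row[j] for row in table) for j in range(k)]
    ss_between = n * sum((m - grand) ** 2 for m in col_means)
    ss_within = sum((row[j] - col_means[j]) ** 2 for row in table for j in range(k))
    return (ss_between / (k - 1)) / (ss_within / (n * k - k))

def repeated_measures_f(table):
    """F for a one-factor repeated-measures design (rows = items, columns = systems)."""
    n, k = len(table), len(table[0])
    grand = mean(x for row in table for x in row)
    col_means = [mean(row[j] for row in table) for j in range(k)]
    row_means = [mean(row) for row in table]
    ss_cond = n * sum((m - grand) ** 2 for m in col_means)
    ss_subj = k * sum((m - grand) ** 2 for m in row_means)  # item-level variance
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_error = ss_total - ss_cond - ss_subj  # what is left after removing items
    return (ss_cond / (k - 1)) / (ss_error / ((n - 1) * (k - 1)))

# Made-up scores: items differ a lot in difficulty, systems differ a little but consistently.
table = [
    [10.0, 11.0, 13.0],
    [20.0, 22.0, 23.0],
    [30.0, 31.0, 34.0],
    [40.0, 42.0, 43.0],
]
f_independent = one_way_f(table)
f_paired = repeated_measures_f(table)
print(f"one-way F = {f_independent:.2f}, repeated-measures F = {f_paired:.2f}")
```

On this table the one-way F is well below 1 because item difficulty swamps the system differences, while the repeated-measures F is above 40. The system effect is the same in both analyses; only the error term changes.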
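The F-test answers “is there some difference somewhere among the groups?”: a single global yes/no. It does not tell you which groups differ, or by how much. For those, you need post-hoc pairwise comparisons (such as Tukey’s HSD) to locate the differences, and an effect size (such as $\eta^2$) to quantify them.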
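For two groups, the three test statistics are tightly linked:

$$F = t^2, \qquad t = d \sqrt{\frac{n_1 n_2}{n_1 + n_2}}$$

Here $t$ is the pooled-variance two-sample t-statistic, $d$ is Cohen’s $d$ computed with the pooled standard deviation, and $n_1$, $n_2$ are the group sizes.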
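The two-group link can be checked numerically. The sketch below assumes the pooled-variance (Student) t-test and Cohen's d with the pooled standard deviation; the helper names and the data are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def pooled_variance(a, b):
    """Pooled sample variance of two independent groups."""
    na, nb = len(a), len(b)
    return ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)

def pooled_t(a, b):
    """Two-sample t-statistic with pooled variance."""
    return (mean(a) - mean(b)) / sqrt(pooled_variance(a, b) * (1 / len(a) + 1 / len(b)))

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    return (mean(a) - mean(b)) / sqrt(pooled_variance(a, b))

def two_group_f(a, b):
    """One-way ANOVA F-statistic for exactly two groups."""
    grand = mean(a + b)
    ss_between = len(a) * (mean(a) - grand) ** 2 + len(b) * (mean(b) - grand) ** 2
    ss_within = sum((x - mean(a)) ** 2 for x in a) + sum((x - mean(b)) ** 2 for x in b)
    return ss_between / (ss_within / (len(a) + len(b) - 2))  # df_between = 1

# Illustrative (made-up) data with equal group sizes.
a = [4.0, 5.0, 6.0, 7.0]
b = [6.0, 7.0, 8.0, 9.0]
t, d, f = pooled_t(a, b), cohens_d(a, b), two_group_f(a, b)
print(f"t = {t:.4f}, t^2 = {t * t:.4f}, F = {f:.4f}, d = {d:.4f}")
```

With these numbers $t^2$ and $F$ both come out at 4.8, and $t$ equals $d\sqrt{n/2}$ for the common group size $n = 4$. The identity $F = t^2$ does not hold for Welch's unequal-variance t-test.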
So a Cohen’s
For
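A practical note for large-sample evaluation: with enough observations, the F-test flags even negligible differences as significant.
Always report ANOVA results with their effect-size partner. A significant F-test alone, at scale, is almost meaningless; an F-test with an accompanying effect size such as $\eta^2 = SS_{\text{between}} / SS_{\text{total}}$ says how much of the variation group identity actually explains.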
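A small sketch of why the effect size matters at scale, assuming $\eta^2 = SS_{\text{between}} / SS_{\text{total}}$ as the effect-size measure. The data are made up, and copying the same observations many times is an artificial device used only to hold the effect fixed while the sample size grows.

```python
from statistics import mean

def f_and_eta_squared(groups):
    """Return (F, eta squared) for a list of independent groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
    eta_sq = ss_between / (ss_between + ss_within)  # share of variance explained
    return f_stat, eta_sq

# Made-up groups whose means differ by only 0.1 against a spread of several units.
base = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 3.1, 4.1],
    [1.2, 2.2, 3.2, 4.2],
]

results = {}
for copies in (1, 10, 1000):
    groups = [g * copies for g in base]  # same effect, more observations
    results[copies] = f_and_eta_squared(groups)
    f_stat, eta_sq = results[copies]
    print(f"n per group = {4 * copies:5d}  F = {f_stat:8.3f}  eta^2 = {eta_sq:.4f}")
```

The F-statistic grows roughly in proportion to the sample size, while $\eta^2$ stays at about 0.005 throughout: group identity explains around half a percent of the variance no matter how significant the test becomes.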
ANOVA makes three assumptions: (1) within-group residuals are normally distributed, (2) groups have equal variance (homoscedasticity), (3) observations are independent. With balanced designs and reasonable sample sizes (≥30 per group), ANOVA is robust to normality violations. Equal-variance violations matter more; Welch's ANOVA fixes this for one-way designs. Independence violations (clustered or correlated observations) cause serious bias and require a model that accounts for the dependence, such as repeated-measures ANOVA or a mixed-effects model.
One-way ANOVA assumes each observation comes from one group only — independent subjects. Repeated-measures ANOVA handles the case where every subject (or every item) is scored under every condition: same query set evaluated by every system, same patient measured at every time point. The subject-level variance is removed from the denominator, which sharply increases sensitivity — the same underlying effect produces a much larger F. Always use repeated-measures when the design is paired.