Cohen's Kappa

Also known as: Cohen's κ, inter-annotator agreement, kappa coefficient

TL;DR

Cohen's κ: observed agreement minus chance agreement, normalized by the maximum possible lift over chance. The standard inter-annotator agreement metric. Raw % agreement is misleading on imbalanced classes; kappa is the honest version.

Cohen’s kappa is the standard answer to “how much do two raters actually agree, beyond what you would expect by chance?” The formula:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is observed agreement (fraction of items both raters labeled identically) and $p_e$ is expected agreement under the null where each rater labels independently according to their own marginal distribution. The numerator is the lift over chance; the denominator is the maximum possible lift. $\kappa = 1$ is perfect agreement, $\kappa = 0$ is chance-level, $\kappa < 0$ is worse than chance.
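The definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation; the function name `cohen_kappa` and the toy labels are assumptions made for the example.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equal-length label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # p_o: fraction of items both raters labeled identically
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: chance agreement, from each rater's own marginal distribution
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (p_o - p_e) / (1 - p_e)

# Toy data (hypothetical): 8 retrieval results judged by two raters
rater_1 = ["rel", "rel", "irr", "irr", "irr", "rel", "irr", "irr"]
rater_2 = ["rel", "irr", "irr", "irr", "irr", "rel", "irr", "rel"]
print(round(cohen_kappa(rater_1, rater_2), 3))  # prints 0.467
```

Here raw agreement is 6/8 = 75%, but chance agreement is already about 53%, so kappa lands near 0.47.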

Why raw agreement is a trap

Suppose two annotators each label 1000 retrieval results as relevant or irrelevant, and both happen to lean “irrelevant” 95% of the time. They will agree on roughly $0.95^2 + 0.05^2 \approx 90.5\%$ of items purely by accident. If their actual measured agreement is 92%, that sounds great, until you compute $\kappa = (0.92 - 0.905) / (1 - 0.905) \approx 0.16$. They are barely better than two coin flips that happen to be biased the same way.

Two annotators who both always say “yes” agree 100% of the time and tell you nothing. Cohen’s kappa is the metric that catches this; raw % agreement does not.
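To see the trap empirically, the sketch below simulates two raters who label independently at random, each saying “relevant” only 5% of the time. The sample size, seed, and function name are assumptions for illustration; the sample is larger than the 1000 items in the text so that the simulated numbers are stable.

```python
import random

def raw_agreement_and_kappa(labels_a, labels_b):
    """Return (raw agreement, unweighted Cohen's kappa) for two label lists."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Chance agreement from the product of each rater's marginal frequencies
    p_e = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return p_o, (p_o - p_e) / (1 - p_e)

rng = random.Random(0)   # fixed seed, chosen arbitrarily
n_items = 100_000        # assumption: large sample to reduce simulation noise
# True means "relevant"; each rater picks it 5% of the time, independently
rater_1 = [rng.random() < 0.05 for _ in range(n_items)]
rater_2 = [rng.random() < 0.05 for _ in range(n_items)]

p_o, kappa = raw_agreement_and_kappa(rater_1, rater_2)
print(f"raw agreement = {p_o:.3f}, kappa = {kappa:.3f}")
```

Raw agreement should come out near 90% even though the raters share no information, while kappa should sit close to zero.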

The interpretation table

The Landis-Koch rules of thumb are everywhere in the literature. They are heuristics, not law:

Kappa interpretation (Landis-Koch)

$\kappa < 0$: worse than chance. Something is wrong; raters may have flipped a label convention.
$0.0$-$0.4$: poor to fair. Don’t ship a benchmark on these labels.
$0.4$-$0.6$: moderate. Borderline; usable for coarse comparisons, not for tight leaderboards.
$0.6$-$0.8$: substantial. Fine for most ML evaluation. Human-human agreement on retrieval relevance often lives here.
$0.8$-$1.0$: excellent. Treat the labels as gold.

For LLM-judge work, the implicit target is to push judge-judge kappa into the 0.7-0.8 band, matching the human-human ceiling. Below 0.6 the downstream NDCG numbers are dominated by judge noise, not model differences.

The kappa paradox

On highly imbalanced classes, kappa can crater even when agreement looks fine. If 98% of items are “irrelevant” and 2% are “relevant,” then $p_e$ is already $0.98^2 + 0.02^2 \approx 0.961$. Even 99% raw agreement gives only $\kappa \approx 0.74$, and a small dip to 96.5% raw agreement collapses kappa to about $0.11$. The metric becomes hypersensitive to a few disagreements on the rare class.
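The sensitivity can be tabulated directly from the formula. The sketch below assumes both raters share the same 98% / 2% marginal split, so that $p_e$ is fixed and only raw agreement varies; the helper name `kappa_at_agreement` is hypothetical.

```python
def kappa_at_agreement(p_o, majority_share):
    """Kappa for a binary task where both raters share the same marginals."""
    minority_share = 1.0 - majority_share
    # Chance agreement when both raters use the same class proportions
    p_e = majority_share**2 + minority_share**2
    return (p_o - p_e) / (1.0 - p_e)

sweep = {
    p_o: kappa_at_agreement(p_o, majority_share=0.98)
    for p_o in (0.99, 0.98, 0.97, 0.965)
}
for p_o, kappa in sweep.items():
    print(f"raw agreement {p_o:.3f} -> kappa {kappa:.3f}")
```

Under these assumptions, each percentage point of raw agreement is worth roughly a quarter of the kappa scale, which is why a handful of rare-class disagreements moves the number so much.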

Variants you will actually use

Kappa family

Weighted kappa: for ordinal labels (graded relevance 0-3, Likert scales). Penalizes “off by 3” more than “off by 1,” typically with quadratic weights $w_{ij} = 1 - (i - j)^2 / (k - 1)^2$ for $k$ grades. The right metric for graded-relevance judges.
Fleiss’ kappa: three or more raters on nominal categories. For two raters it reduces to Scott’s pi, which coincides with Cohen’s kappa only when both raters have identical marginal distributions.
Krippendorff’s alpha: handles any data type (nominal, ordinal, interval, ratio), variable raters per item, and missing data. The most general option; the only one that works cleanly when each item is judged by a different subset of raters.
Gwet’s AC1: designed to be stable under the kappa paradox; less hypersensitive to prevalence. Slowly gaining traction in clinical literature.
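For the multi-rater case, a minimal NumPy sketch of Fleiss' kappa is shown below. It assumes the input is an items-by-categories count table in which every item was labeled by the same number of raters; the function name and the toy table are illustrative, not taken from any particular library.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa from an (items x categories) table of rating counts."""
    counts = np.asarray(counts, dtype=float)
    num_items = counts.shape[0]
    raters_per_item = counts.sum(axis=1)
    n = raters_per_item[0]
    if n < 2 or not np.all(raters_per_item == n):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    # Per-item agreement: share of rater pairs that chose the same category
    per_item = ((counts**2).sum(axis=1) - n) / (n * (n - 1))
    p_bar = per_item.mean()
    # Chance agreement from the pooled category proportions
    category_share = counts.sum(axis=0) / (num_items * n)
    p_e = (category_share**2).sum()
    return (p_bar - p_e) / (1 - p_e)

# Toy table (hypothetical): 4 items, 3 raters, 2 categories
table = [[3, 0], [2, 1], [0, 3], [3, 0]]
print(round(fleiss_kappa(table), 3))  # prints 0.625
```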

For ZE-style work, with judges scoring retrieval results on a 0-3 scale, weighted kappa with quadratic weights is the default. You want the metric to recognize that a 2-vs-3 disagreement is a smaller error than a 0-vs-3 disagreement, because for downstream NDCG it is.
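The quadratic weighting can be made concrete by printing the agreement-weight matrix for a four-grade scale. This sketch assumes the common agreement-weight form $1 - (i - j)^2 / (k - 1)^2$; the helper name `quadratic_weights` is hypothetical.

```python
import numpy as np

def quadratic_weights(num_grades):
    """Agreement weights: 1 on the diagonal, falling quadratically with distance."""
    grades = np.arange(num_grades)
    diff = grades[:, None] - grades[None, :]
    return 1.0 - diff**2 / (num_grades - 1) ** 2

w = quadratic_weights(4)  # grades 0, 1, 2, 3
print(np.round(w, 3))
```

With four grades, a 2-vs-3 disagreement keeps a weight of 8/9 (almost full credit), while a 0-vs-3 disagreement gets a weight of 0.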

Why this matters for LLM-as-judge pipelines

The judge in an LLM-as-judge pipeline is a measurement instrument. Inter-judge kappa is the calibration check: if two runs of the same judge prompt disagree at $\kappa < 0.6$, the resulting NDCG numbers carry enough noise to swamp real model differences.

The empirical finding from graded-relevance work: chain-of-thought before the grade lifts judge-judge kappa from roughly 0.55 to 0.75 on retrieval relevance, the same jump that pushes you from “moderate” into “substantial,” and into the human-human band. That single prompting change is often the difference between a usable judge and an unusable one.

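Given a $k \times k$ confusion matrix where $n_{ij}$ is the count of items rater 1 labeled $i$ and rater 2 labeled $j$, with marginals $n_{i\cdot} = \sum_j n_{ij}$ and $n_{\cdot j} = \sum_i n_{ij}$, total $N = \sum_{i,j} n_{ij}$:

Observed agreement (weighted): $p_o = \frac{1}{N} \sum_{i,j} w_{ij} \, n_{ij}$, where $w_{ij} \in [0, 1]$, with $w_{ii} = 1$, is your weighting (linear, quadratic, or custom).
Expected agreement (weighted): $p_e = \frac{1}{N^2} \sum_{i,j} w_{ij} \, n_{i\cdot} \, n_{\cdot j}$, the same weighted sum under the independence assumption.
Kappa: $\kappa = \frac{p_o - p_e}{1 - p_e}$.

For unweighted Cohen’s kappa, set $w_{ij} = \mathbb{1}[i = j]$. For quadratic weights on a 0-3 scale, $w_{ij} = 1 - (i - j)^2 / 9$. Twenty lines of NumPy at most.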
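A minimal NumPy sketch of the recipe above follows. It assumes agreement weights (1 on the diagonal) and integer grades indexed from 0; the function name, the `scheme` argument, and the toy confusion matrix are illustrative choices, not a fixed API.

```python
import numpy as np

def weighted_kappa(confusion, scheme="quadratic"):
    """Weighted Cohen's kappa from a square confusion matrix of counts."""
    n = np.asarray(confusion, dtype=float)
    k = n.shape[0]
    if n.ndim != 2 or n.shape != (k, k) or k < 2:
        raise ValueError("confusion must be a square matrix with k >= 2")
    dist = np.abs(np.arange(k)[:, None] - np.arange(k)[None, :])
    if scheme == "unweighted":
        w = (dist == 0).astype(float)        # indicator of exact agreement
    elif scheme == "linear":
        w = 1.0 - dist / (k - 1)
    elif scheme == "quadratic":
        w = 1.0 - dist**2 / (k - 1) ** 2
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    total = n.sum()
    p_o = (w * n).sum() / total
    # Expected cell proportions if the two raters labeled independently
    expected = np.outer(n.sum(axis=1), n.sum(axis=0)) / total**2
    p_e = (w * expected).sum()
    return (p_o - p_e) / (1.0 - p_e)

# Toy 3-grade confusion matrix (hypothetical): rows = rater 1, cols = rater 2
toy = [[2, 1, 0],
       [0, 2, 1],
       [0, 0, 2]]
print(round(weighted_kappa(toy, "quadratic"), 3))  # prints 0.805
```

Passing `scheme="unweighted"` recovers plain Cohen's kappa from the same confusion matrix, which is a convenient sanity check.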

For confidence intervals, bootstrap over items (resample with replacement, recompute kappa on each resample, take the 2.5th and 97.5th percentiles). On a small sample, expect a wide 95% CI on a moderate-kappa estimate; you need 1000+ items to nail kappa to two decimal places. See statistical significance for the same logic applied to retrieval metrics.
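The percentile bootstrap described above might look like the following sketch. The synthetic labels, the 300-item sample size, the 2000 resamples, and the function names are all assumptions for illustration; it also assumes no resample degenerates to a single label, where kappa would be undefined.

```python
import numpy as np

def kappa_from_labels(a, b, num_classes):
    """Unweighted Cohen's kappa from two integer label arrays."""
    conf = np.zeros((num_classes, num_classes))
    np.add.at(conf, (a, b), 1)               # build the confusion matrix
    total = conf.sum()
    p_o = np.trace(conf) / total
    p_e = (conf.sum(axis=1) @ conf.sum(axis=0)) / total**2
    return (p_o - p_e) / (1.0 - p_e)

def bootstrap_kappa_ci(a, b, num_classes, num_resamples=2000, seed=0):
    """95% percentile-bootstrap interval for kappa, resampling items."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    stats = np.empty(num_resamples)
    for r in range(num_resamples):
        idx = rng.integers(0, n, size=n)     # sample items with replacement
        stats[r] = kappa_from_labels(a[idx], b[idx], num_classes)
    lo, hi = np.percentile(stats, [2.5, 97.5])
    return lo, hi

# Synthetic data (hypothetical): rater 2 copies rater 1 but flips 20% of labels
data_rng = np.random.default_rng(1)
labels_1 = data_rng.integers(0, 2, size=300)
flip = data_rng.random(300) < 0.2
labels_2 = np.where(flip, 1 - labels_1, labels_1)

point = kappa_from_labels(labels_1, labels_2, num_classes=2)
ci_lo, ci_hi = bootstrap_kappa_ci(labels_1, labels_2, num_classes=2)
print(f"kappa = {point:.3f}, 95% CI = [{ci_lo:.3f}, {ci_hi:.3f}]")
```

Resampling whole items (both raters' labels together) preserves the pairing that kappa depends on; resampling each rater's labels separately would destroy it.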

When you have kappa, what do you actually do with it?

Three operational uses:

Gate the eval set. If judge-judge $\kappa < 0.6$ on a calibration sample of items, the labels are not yet trustworthy. Iterate the prompt or the rubric until kappa clears the bar, then run the full benchmark.
Compare to human-human ceiling. Have humans dual-annotate a sample of items; compute human-human kappa. Your judge’s kappa against humans should approach this ceiling, not a fixed external target.
Track over time. Production judge drift is real — model updates, prompt edits, distribution shift. Re-measure kappa monthly on a frozen calibration set; alert if it drops by more than two standard errors.
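The gating and drift checks above can be expressed as two small predicates. This is a sketch under stated assumptions: the 0.6 bar mirrors the threshold discussed earlier, the two-standard-error rule mirrors the drift alert, and the function names and default values are hypothetical.

```python
def labels_trustworthy(judge_judge_kappa, min_kappa=0.6):
    """Gate: only run the full benchmark once kappa clears the bar."""
    return judge_judge_kappa >= min_kappa

def kappa_drifted(current_kappa, baseline_kappa, baseline_std_err, num_std_errs=2.0):
    """Alert when kappa on the frozen calibration set drops too far."""
    drop = baseline_kappa - current_kappa
    return drop > num_std_errs * baseline_std_err

print(labels_trustworthy(0.55))          # False: iterate on the prompt or rubric
print(kappa_drifted(0.70, 0.75, 0.02))   # True: drop of 0.05 exceeds 2 * 0.02
```

The standard error here could come from a bootstrap over the calibration items, so the alert threshold scales with how precisely the baseline was measured.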

Without kappa, you have no language to say whether your labels are good. With it, “the judge agrees with itself at $\kappa = x$ and with humans at $\kappa = y$” is a sentence you can ship. The next question, whether the items themselves carry signal once the labels are trustworthy, is the classical test theory half of the diagnostic stack: kappa gates labels, CTT gates items.

Go further

What is the 'kappa paradox' and when does it bite?

On highly imbalanced classes, kappa can be near zero even when raw agreement is 95%+. Two annotators rating 1000 documents with 98% 'irrelevant' / 2% 'relevant' will agree on most by chance alone; $p_e$ is close to $p_o$, so the numerator $p_o - p_e$ collapses. The fix is to either (a) report kappa alongside class-conditional agreement, (b) switch to a prevalence-robust statistic such as Gwet's AC1, or (c) rebalance the eval set so kappa is meaningful. Don't read a low kappa as 'judges disagree' on an imbalanced set; read it as 'kappa is the wrong summary statistic here.'

When should I use weighted kappa vs Fleiss vs Krippendorff?

Weighted kappa for ordinal labels (graded relevance 0-3, Likert scales) where 'off by 1' is less wrong than 'off by 3'. Quadratic weights are standard. Fleiss' kappa for >2 raters on nominal categories. Krippendorff's alpha when you have missing data, mixed rater counts per item, or non-nominal data types (interval, ratio). For LLM judge evaluation, weighted kappa on 0-3 grades is almost always the right call.

Why does CoT-before-grade lift kappa so much for LLM judges?

Asking the judge to write a one-paragraph rationale before emitting the grade typically lifts inter-judge kappa from ~0.55 to ~0.75 on retrieval relevance — comparable to human-human agreement. The CoT forces enumeration of which query facets the document covers, which is exactly the reasoning that separates a 2 from a 3. Without it, the judge collapses to a coarse heuristic and disagreements compound; with it, the judge's distribution over grades sharpens and aligns across runs.
