Spearman Correlation

Q: Why is Spearman the right correlation for ranking metrics?

NDCG, MRR, and MAP are all functions of rank position, not score magnitude. If you compare two retrievers' rankings of the same documents, what you want to know is whether their orderings agree. Spearman computes exactly that. Pearson on raw scores can be high while orderings disagree (calibration without ordering), or low while orderings perfectly match (orderings without calibration).

Also known as: Spearman rho, rank correlation

TL;DR

Spearman's ρ is Pearson correlation computed on ranks instead of raw values. It captures any monotone relationship — linear or curved — and is the correct correlation for ranking and retrieval evaluation, where what matters is order.

Spearman correlation is what you get when you take Pearson and replace the raw values with their ranks. Sort each variable, write down each observation’s rank position, then compute Pearson on the rank sequences. The result, denoted $\rho$ or $r_s$, is again in $[-1, 1]$.

The conversion to ranks is the entire content of the method. It buys you robustness to the scale of the underlying variable — you only care about order — and it generalizes from linear to any monotone relationship.
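The "rank, then Pearson" recipe can be sketched in a few lines of dependency-free Python. The helper names (`midranks`, `pearson`, `spearman`) and the sample data are illustrative assumptions, not part of any library; the rank helper is quadratic in the list length, which is fine for a small demonstration.

```python
from math import sqrt

def midranks(values):
    """1-based ranks; tied values share the average of the positions they occupy."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman(x, y):
    """Spearman correlation: Pearson applied to the rank sequences."""
    return pearson(midranks(x), midranks(y))

# Hypothetical data: very different scales, identical ordering.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [10, 200, 30, 5000]
print(spearman(scores, labels))  # same ordering, so rho is 1
print(pearson(scores, labels))   # raw-value Pearson is lower
```

Because only the ranks enter the computation, rescaling either variable by any order-preserving transform leaves the result unchanged.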

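When there are no ties, the same value follows from a shortcut formula:

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

where $d_i$ is the difference in ranks for observation $i$ and $n$ is the number of observations. (This shortcut formula assumes no ties; with ties, fall back to the Pearson-on-ranks definition.)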
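As a worked check of the no-tie shortcut $\rho = 1 - 6\sum d_i^2 / (n(n^2-1))$, here is a minimal sketch on made-up data. The function names are hypothetical, and the rank helper assumes all values are distinct.

```python
def ranks_no_ties(values):
    """1-based ranks for distinct values (no tie handling in this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position
    return ranks

def spearman_shortcut(x, y):
    """Shortcut formula; only valid when neither variable has ties."""
    rx, ry = ranks_no_ties(x), ranks_no_ties(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]          # two adjacent swaps relative to x
rho = spearman_shortcut(x, y)
# Rank differences are 0, -1, 1, -1, 1, so sum(d^2) = 4
# and rho = 1 - 6*4 / (5*24) = 0.8
print(rho)
```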

Monotone, not just linear

A function $f$ is (strictly) monotone increasing if $x_1 < x_2$ implies $f(x_1) < f(x_2)$; it preserves order, possibly non-linearly. Pearson sees only the linear piece; Spearman sees all of it. A perfectly deterministic $y = x^3$ has Pearson less than 1 but Spearman $\rho = 1$, because cubing is monotone and ranks are preserved exactly.
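The cubic example can be checked numerically. This is a small sketch on an assumed evenly spaced grid from -1 to 1; the exact Pearson value depends on that grid, but it stays below 1 while the rank correlation is exactly 1.

```python
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks for distinct values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

x = [k / 10 for k in range(-10, 11)]   # 21 points in [-1, 1]
y = [v ** 3 for v in x]                # deterministic, monotone, non-linear

r_raw = pearson(x, y)                  # linear fit is imperfect
r_rank = pearson(ranks(x), ranks(y))   # ranks are identical
print(f"Pearson on raw values: {r_raw:.3f}")
print(f"Spearman (Pearson on ranks): {r_rank:.3f}")
```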

This is the property that matters for ranking evaluation. A reranker’s logits and a relevance label might be related by a sigmoid, a square root, or any other monotone curve. The reranker has done its job — it has the order right — even though Pearson on raw values understates how much it knows.

Why retrieval evals lean on it

NDCG, MRR, MAP, and recall-at-k are all functions of rank position. They do not look at the magnitude of the relevance scores produced by the retriever, only at the order in which documents come out. So when you ask “do retriever A and retriever B agree?”, the right correlation is rank correlation:

Spearman in retrieval and ranking

System-vs-system agreement. Run two retrievers on the same query. Spearman on the two ranked lists tells you how similar their orderings are. Pearson on raw scores would mostly be measuring whether the two systems output similar score scales, which is irrelevant.
Predicted vs ground-truth ranking. A reranker outputs scores; humans output graded labels (0-3). Spearman on the per-document (score, label) pairs is the natural per-query agreement metric. Average across queries.
Cross-encoder distillation diagnostics. Did the bi-encoder student preserve the cross-encoder teacher’s ranking? Spearman per query, then mean. If Pearson on raw scores is high but Spearman is lower, the student matched scale but lost order — usually fatal for retrieval.
Eval-set health checks. If two reasonable retrievers have Spearman near 0 on a query, your eval set probably has noisy labels or ambiguous queries; investigate before trusting either system.
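The "Spearman per query, then mean" pattern used in several of the items above might look like the following sketch. The `runs` dictionary, query ids, and scores are invented for illustration, and the rank helper assumes distinct scores within a query.

```python
from math import sqrt
from statistics import mean

def ranks(values):
    """1-based ranks for distinct values (no tie handling in this sketch)."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

# Hypothetical scores from two retrievers over the same candidate
# documents, keyed by query id. Position i is the same document in both lists.
runs = {
    "q1": ([0.9, 0.7, 0.4, 0.1], [12.0, 9.5, 3.1, 0.2]),  # same order, different scale
    "q2": ([0.8, 0.6, 0.5, 0.2], [1.0, 4.0, 3.0, 2.0]),   # orderings disagree
}

per_query = {q: spearman(a, b) for q, (a, b) in runs.items()}
mean_rho = mean(per_query.values())
print(per_query, mean_rho)
```

Keeping the per-query values around, not just the mean, makes it easy to find the individual queries where the two systems disagree most.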

Spearman vs Kendall tau

Both are rank-based. The split:

Spearman $\rho$: Pearson on ranks. Penalizes rank disagreements quadratically: a rank distance of 5 contributes 25 to the sum of squared rank differences, while five rank distances of 1 contribute 5. So Spearman is more sensitive to a few large disagreements.
Kendall $\tau$: Counts pairs where both rankings agree on direction (concordant) minus pairs where they disagree (discordant), normalized by the total number of pairs. Linear penalty per disagreement. More robust, and more interpretable: with no ties, the fraction of pairs that agree is $(1 + \tau)/2$.

For most retrieval-eval purposes either works and they give correlated answers. Tau has slightly better statistical properties (smaller variance, exact null distribution) but Spearman is the older default and gets reported more often.
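To see the two penalties side by side, the sketch below computes both statistics for two hypothetical rankings of six items: one with a single long-distance swap and one with three adjacent swaps. Function names are illustrative, inputs are already ranks, and no ties are assumed.

```python
from itertools import combinations

def kendall_tau_a(rx, ry):
    """(concordant - discordant) / total pairs; assumes no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rx)), 2):
        direction = (rx[i] - rx[j]) * (ry[i] - ry[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    total_pairs = len(rx) * (len(rx) - 1) // 2
    return (concordant - discordant) / total_pairs

def spearman_rho(rx, ry):
    """No-tie shortcut: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

reference = [1, 2, 3, 4, 5, 6]
one_big_swap = [6, 2, 3, 4, 5, 1]       # first and last items exchanged
three_small_swaps = [2, 1, 4, 3, 6, 5]  # three adjacent exchanges

for name, ranking in [("one big swap", one_big_swap),
                      ("three small swaps", three_small_swaps)]:
    rho = spearman_rho(reference, ranking)
    tau = kendall_tau_a(reference, ranking)
    print(f"{name}: rho = {rho:.3f}, tau = {tau:.3f}")
```

The long-distance swap contributes 50 to the sum of squared differences and flips 9 of the 15 pairs; the three adjacent swaps contribute 6 and flip 3 pairs, so Spearman drops further than tau on the first ranking.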

Graded-relevance retrieval evals have lots of ties: a long tail of documents with label 0 (not relevant), a smaller cluster with label 1, and so on. Spearman with the standard tie correction is approximately right, but the resulting null distribution is no longer cleanly tabulated.

Kendall tau-b explicitly handles ties in the denominator. For datasets with heavy tie structure (most retrieval graded-relevance evals), tau-b is the cleaner statistic, and it is the variant that SciPy's kendalltau computes by default.

In practice: report Spearman on per-query metric correlations (where ties are rare because metrics are continuous), and tau-b when you correlate raw graded labels against scores (where label ties dominate).
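How tau-b adjusts the denominator for ties can be sketched with a direct pair count. This is an illustrative quadratic-time implementation with made-up scores and graded labels, not a substitute for a library routine.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """tau-b: (C - D) / sqrt(pairs untied in x * pairs untied in y)."""
    concordant = discordant = 0
    untied_x = untied_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx != 0:
            untied_x += 1
        if dy != 0:
            untied_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    return (concordant - discordant) / sqrt(untied_x * untied_y)

# Hypothetical reranker scores and graded labels (0-3) for one query.
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1]
labels = [3, 1, 1, 0, 0, 0]

tau_b = kendall_tau_b(scores, labels)
# 15 pairs in total; 4 are tied on the label, leaving 11 comparable pairs,
# all of them concordant. tau-a would divide by 15; tau-b divides by
# sqrt(15 * 11), the geometric mean of the untied-pair counts.
print(f"tau-b = {tau_b:.3f}")
```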

A note on significance

Spearman has the same kind of t-test approximation as Pearson (replace the raw values with ranks and apply the same formula), but the assumption of bivariate normality is now bivariate normality of ranks, which is essentially never satisfied because ranks are uniform, not Gaussian. For small $n$, exact distributions are tabulated; for larger $n$, the t-approximation is fine for two-tailed p-values but bootstrap CIs remain the more honest reporting choice. See statistical significance in retrieval evals for paired-test recipes that handle this correctly.

The discipline: when you want to report agreement between two ranked outputs, compute Spearman and bootstrap a 95% interval. Two lines of NumPy. The headline correlation alone is not enough.
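A percentile bootstrap for Spearman can be sketched without any dependencies, as below. The function names, the 2000-resample default, the fixed seed, and the sample data are all assumptions made for illustration. Resampling with replacement creates duplicate observations, so the ranks use midranks and the statistic is computed as Pearson on ranks, not with the no-tie shortcut.

```python
import random
from math import sqrt

def midranks(values):
    """1-based ranks; tied values share the average of their positions."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def spearman(x, y):
    """Pearson on midranks; returns None if either variable is constant."""
    rx, ry = midranks(x), midranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)

def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling (x, y) pairs together."""
    rng = random.Random(seed)
    n = len(x)
    draws = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([x[i] for i in idx], [y[i] for i in idx])
        if rho is not None:          # skip degenerate resamples
            draws.append(rho)
    draws.sort()
    lower = draws[int(alpha / 2 * len(draws))]
    upper = draws[int((1 - alpha / 2) * len(draws)) - 1]
    return lower, upper

# Hypothetical rank positions of 12 documents under two systems.
system_a = list(range(1, 13))
system_b = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11]

point = spearman(system_a, system_b)
low, high = bootstrap_ci(system_a, system_b)
print(f"rho = {point:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

With NumPy and SciPy available, the resampling loop shrinks considerably; the stdlib version here is only meant to make each step explicit.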

Go further

Spearman or Kendall tau — which one?

Both are rank-based and both capture monotone relationships, but they answer slightly different questions. Spearman is Pearson on ranks, so it weights large rank disagreements quadratically. Kendall tau is the fraction of concordant pairs minus the fraction of discordant pairs. Tau is more interpretable ('73% of pairs agree'), more robust to small numbers of disagreements, and computes in $O(n \log n)$. Use Spearman if you want a drop-in replacement for Pearson; use Kendall when you need an interpretable pair-wise statistic.

What about ties?

The no-tie shortcut formula assumes distinct ranks. With ties, you average the ranks across the tied positions ('midrank' or 'fractional ranking') and apply a tie-correction term to the denominator, which is equivalent to computing Pearson on the midranks. SciPy's spearmanr handles this automatically. For graded-relevance retrieval evals where labels are 0-3 with many ties, the correction matters; without it the reported ρ is biased upward.
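A tiny example makes the tie effect concrete. The sketch below assigns midranks to a hypothetical four-document query with binary labels, then compares the no-tie shortcut (misapplied to tied data) against Pearson on the midranks. Helper names and data are illustrative.

```python
from math import sqrt

def midranks(values):
    """1-based ranks; tied values share the average of their positions."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

scores = [0.1, 0.2, 0.3, 0.4]   # distinct scores
labels = [0, 0, 1, 1]           # two tied groups

rs, rl = midranks(scores), midranks(labels)   # rl is [1.5, 1.5, 3.5, 3.5]

# Shortcut formula misapplied to tied data: sum(d^2) = 4 * 0.25 = 1
n = len(scores)
d2 = sum((a - b) ** 2 for a, b in zip(rs, rl))
rho_shortcut = 1 - 6 * d2 / (n * (n * n - 1))

# Tie-aware value: Pearson on the midranks
rho_tie_aware = pearson(rs, rl)

print(f"shortcut: {rho_shortcut:.4f}, Pearson on midranks: {rho_tie_aware:.4f}")
```

On this data the shortcut gives 0.9 while the tie-aware value is 4/√20, slightly lower, which matches the direction of the bias described above.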
