Why Evaluation Metrics for Reranking Matter in Search Engine Performance
Introduction
When you search for “best headphones for running,” you expect relevant results at the top. But how do we measure whether a reranking algorithm is actually improving search quality?
Evaluation metrics are quantitative measures that tell us:
- Are the right results appearing in the top positions?
- How many relevant results did we find?
- How well does the ranking match user expectations?
This guide covers the most important metrics used in production search systems, with working Python code for each one.
What Is Reranking and Why Does It Happen?
Reranking is the process of reordering an initial set of search results to improve overall search quality. Even when a search engine retrieves the right pages, the initial ranking may not be ideal, so a reranking stage reorders the results to put the most accurate, useful, and high-quality content first. This is especially important in AI-powered search systems, recommendation engines, and e-commerce platforms, where showing the right item first can make all the difference.
Why Evaluation Metrics Are Needed
Without proper evaluation, there’s no way to know if reranking is actually making the results better. Evaluation metrics are used to measure how close the reordered results are to what users want. They help developers figure out whether the changes in ranking lead to improved accuracy, higher user satisfaction, and better click-through rates.
Common Metrics Used in Reranking
Some popular metrics include:
- Precision and Recall: Precision checks how many of the top results are relevant, while recall checks how many relevant results were found in total.
- Mean Reciprocal Rank (MRR): This measures how high the first correct result appears in the ranking.
- Normalized Discounted Cumulative Gain (NDCG): A more advanced metric that gives higher value to relevant results appearing at the top of the list.
These metrics give developers a clear way to compare different reranking algorithms and choose the one that delivers the best experience for users.
Precision @ K
What It Measures
Precision@K answers: “Of the top K results I returned, what percentage are actually relevant?”
This metric focuses on quality over quantity: it only cares about whether you’re showing relevant results in the top K positions.
Formula
Precision@K = (Number of Relevant Documents in Top K) / K
Python Implementation
```python
def precision_at_k(results: list[dict], k: int) -> float:
    """
    Calculate Precision@K

    Args:
        results: List of dicts with 'doc_id' and 'relevant' (bool)
        k: Number of top results to consider

    Returns:
        Precision score (0.0 to 1.0)
    """
    if k <= 0 or len(results) == 0:
        return 0.0
    top_k = results[:k]
    relevant_count = sum(1 for doc in top_k if doc['relevant'])
    return relevant_count / k
```
Real-World Example
Query: “best headphones for running”
| Rank | Document | Relevant? |
|---|---|---|
| 1 | Waterproof Sport Earbuds Review | ✅ YES |
| 2 | Best Running Headphones 2024 | ✅ YES |
| 3 | Office Headphones Comparison | ❌ NO |
| 4 | Wireless Earbuds for Athletes | ✅ YES |
| 5 | Gaming Headset Guide | ❌ NO |
Calculation:
- Relevant in top 5: 3 documents
- Precision@5 = 3/5 = 0.60 (60%)
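The Precision@5 figure above can be reproduced with the `precision_at_k` function defined earlier. A minimal sketch, with the five results from the table encoded as a list of dicts (the `doc_id` values are illustrative placeholders):

```python
results = [
    {"doc_id": "sport-earbuds", "relevant": True},    # Waterproof Sport Earbuds Review
    {"doc_id": "running-2024", "relevant": True},      # Best Running Headphones 2024
    {"doc_id": "office-compare", "relevant": False},   # Office Headphones Comparison
    {"doc_id": "athlete-earbuds", "relevant": True},   # Wireless Earbuds for Athletes
    {"doc_id": "gaming-headset", "relevant": False},   # Gaming Headset Guide
]

print(precision_at_k(results, k=5))  # 0.6
```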
Key Insights
- Precision@K ignores the order of results within the top K: swapping positions 1 and 5 does not change the score.
- It says nothing about relevant documents you failed to retrieve, which is why it is usually paired with Recall@K.
Recall @ K
What It Measures
Recall@K answers: “Of all relevant documents that exist, what percentage did I find in the top K results?”
This metric focuses on completeness: are you finding all of the relevant results?
Formula
Recall@K = (Number of Relevant Documents in Top K) / (Total Relevant Documents)
Python Implementation
```python
def recall_at_k(results: list[dict], k: int, total_relevant: int) -> float:
    """
    Calculate Recall@K

    Args:
        results: List of dicts with 'doc_id' and 'relevant' (bool)
        k: Number of top results to consider
        total_relevant: Total number of relevant docs in entire corpus

    Returns:
        Recall score (0.0 to 1.0)
    """
    if total_relevant <= 0 or k <= 0:
        return 0.0
    top_k = results[:k]
    relevant_found = sum(1 for doc in top_k if doc['relevant'])
    return relevant_found / total_relevant
```
Real-World Example
Scenario: Database contains 10 relevant documents about “running headphones”
Your search returns:
- Top 5 results: Found 3 relevant docs → Recall@5 = 3/10 = 0.30 (30%)
- Top 10 results: Found 6 relevant docs → Recall@10 = 6/10 = 0.60 (60%)
- Top 20 results: Found 8 relevant docs → Recall@20 = 8/10 = 0.80 (80%)
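These numbers can be reproduced with the `recall_at_k` function defined earlier. A minimal sketch, assuming relevant documents happen to sit at positions 1, 3, 5, 7, 8, 10, 14, and 18 (any placement that puts 3, 6, and 8 relevant hits in the top 5, 10, and 20 gives the same scores):

```python
# Illustrative result list matching the scenario above
relevant_positions = {1, 3, 5, 7, 8, 10, 14, 18}
results = [
    {"doc_id": f"doc-{i}", "relevant": i in relevant_positions}
    for i in range(1, 21)
]

print(recall_at_k(results, k=5, total_relevant=10))   # 0.3
print(recall_at_k(results, k=10, total_relevant=10))  # 0.6
print(recall_at_k(results, k=20, total_relevant=10))  # 0.8
```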
Key Insights
- Recall@K can only stay the same or increase as K grows, so always compare systems at the same cutoff.
- Computing it requires knowing the total number of relevant documents, which in practice comes from labeled relevance judgments.
F1 Score @ K
What It Measures
F1@K is the harmonic mean of Precision@K and Recall@K. It balances both metrics, giving you a single score that accounts for both quality and completeness.
Formula
F1@K = 2 × (Precision@K × Recall@K) / (Precision@K + Recall@K)
Python Implementation
```python
def f1_at_k(results: list[dict], k: int, total_relevant: int) -> float:
    """
    Calculate F1 Score@K

    Args:
        results: List of dicts with 'doc_id' and 'relevant' (bool)
        k: Number of top results to consider
        total_relevant: Total number of relevant docs in corpus

    Returns:
        F1 score (0.0 to 1.0)
    """
    precision = precision_at_k(results, k)
    recall = recall_at_k(results, k, total_relevant)
    if precision + recall == 0:
        return 0.0
    f1 = 2 * (precision * recall) / (precision + recall)
    return f1
```
Real-World Example
Given:
- Precision@5 = 0.60 (3 out of 5 results are relevant)
- Recall@5 = 0.30 (found 3 out of 10 total relevant docs)
Calculation:
F1@5 = 2 × (0.60 × 0.30) / (0.60 + 0.30)
= 2 × 0.18 / 0.90
= 0.36 / 0.90
= 0.40
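The same result falls out of the `f1_at_k` function above. A minimal sketch, assuming the five-result list from the Precision@K example and 10 relevant documents in the corpus overall:

```python
# 3 relevant results in the top 5; 10 relevant documents exist in total
results = [{"doc_id": str(i), "relevant": rel}
           for i, rel in enumerate([True, True, False, True, False])]

print(f1_at_k(results, k=5, total_relevant=10))  # 0.4
```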
Key Insights
- Because it is a harmonic mean, F1@K is pulled toward the lower of the two component scores; a system cannot score well by maximizing only precision or only recall.
- It is most useful when precision and recall matter roughly equally; if one matters more, report them separately.
Mean Reciprocal Rank (MRR)
What It Measures
MRR answers: “How high does the first relevant result appear in my ranking?”
This metric is perfect for scenarios where users need ONE good answer (e.g., question answering, navigational search).
Formula
For a single query:
RR = 1 / (rank of first relevant document)
For multiple queries:
MRR = (1/Q) × Σ(1 / rank_i)
where Q = number of queries, rank_i = position of first relevant result for query i
Python Implementation
```python
def reciprocal_rank(results: list[dict]) -> float:
    """
    Calculate Reciprocal Rank for a single query

    Args:
        results: List of dicts with 'doc_id' and 'relevant' (bool)

    Returns:
        Reciprocal rank (0.0 to 1.0)
    """
    for rank, doc in enumerate(results, start=1):
        if doc['relevant']:
            return 1.0 / rank
    return 0.0  # No relevant results found


def mean_reciprocal_rank(queries_results: list[list[dict]]) -> float:
    """
    Calculate MRR across multiple queries

    Args:
        queries_results: List of result lists, one per query

    Returns:
        Mean reciprocal rank (0.0 to 1.0)
    """
    if not queries_results:
        return 0.0
    rr_sum = sum(reciprocal_rank(results) for results in queries_results)
    return rr_sum / len(queries_results)
```
Real-World Example
Query 1: “capital of France”
- Rank 1: Paris → RR = 1/1 = 1.0 ✨
Query 2: “python tutorial”
- Rank 1: Ruby guide (irrelevant)
- Rank 2: Python docs → RR = 1/2 = 0.5
Query 3: “best pizza NYC”
- Rank 1: LA restaurants (irrelevant)
- Rank 2: Chicago pizza (irrelevant)
- Rank 3: NYC pizza guide → RR = 1/3 = 0.333
MRR = (1.0 + 0.5 + 0.333) / 3 = 0.611
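The same three queries can be fed to the `mean_reciprocal_rank` function defined above. A minimal sketch (document IDs are illustrative placeholders):

```python
queries_results = [
    # Query 1: first result is already relevant -> RR = 1.0
    [{"doc_id": "paris", "relevant": True}],
    # Query 2: first relevant result at rank 2 -> RR = 0.5
    [{"doc_id": "ruby-guide", "relevant": False},
     {"doc_id": "python-docs", "relevant": True}],
    # Query 3: first relevant result at rank 3 -> RR = 1/3
    [{"doc_id": "la-restaurants", "relevant": False},
     {"doc_id": "chicago-pizza", "relevant": False},
     {"doc_id": "nyc-pizza", "relevant": True}],
]

print(mean_reciprocal_rank(queries_results))  # ~0.611
```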
Key Insights
- Only the first relevant result counts; everything ranked below it has no effect on the score.
- This makes MRR a good fit for question answering and navigational queries, and a poor fit when users want many relevant results.
- Relevance is binary here, so MRR cannot distinguish a perfect answer from a barely acceptable one.
Normalized Discounted Cumulative Gain (NDCG)
What It Measures
NDCG is the gold standard for ranking evaluation. Unlike the previous metrics, which treat all relevant documents equally, NDCG supports graded relevance (e.g., highly relevant, somewhat relevant, not relevant) and discounts relevant documents that appear lower in the ranking.
Formula
Step 1 - Discounted Cumulative Gain (DCG):
DCG@K = Σ(rel_i / log2(i + 1)) for i = 1 to K
where:
rel_i = relevance score of document at position i
i = rank position (1, 2, 3, ...)
log2(i + 1) = discount factor (penalizes lower positions)
Step 2 - Ideal DCG (IDCG):
IDCG@K = DCG of the perfect ranking (all docs sorted by relevance)
Step 3 - Normalized DCG:
NDCG@K = DCG@K / IDCG@K
Python Implementation
```python
import math
from typing import List


def dcg_at_k(relevances: List[float], k: int) -> float:
    """
    Calculate Discounted Cumulative Gain at K

    Args:
        relevances: List of relevance scores (higher = more relevant)
        k: Number of top results to consider

    Returns:
        DCG score
    """
    dcg = 0.0
    for i, rel in enumerate(relevances[:k], start=1):
        dcg += rel / math.log2(i + 1)
    return dcg


def ndcg_at_k(relevances: List[float], k: int) -> float:
    """
    Calculate Normalized Discounted Cumulative Gain at K

    Args:
        relevances: List of relevance scores in retrieved order
        k: Number of top results to consider

    Returns:
        NDCG score (0.0 to 1.0)
    """
    # Calculate DCG for the actual ranking
    dcg = dcg_at_k(relevances, k)
    # Calculate IDCG (ideal ranking: sorted by relevance descending)
    ideal_relevances = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal_relevances, k)
    if idcg == 0:
        return 0.0
    return dcg / idcg
```
Real-World Example
Query: “best laptop for programming”
Your system returns (with graded relevance 0-3):
| Rank | Document | Relevance | Discount (1/log2(rank+1)) | Contribution |
|---|---|---|---|---|
| 1 | "Top Programming Laptops 2024" | 3 | 1/log2(2) = 1.000 | 3.000 |
| 2 | "Developer Laptop Guide" | 2 | 1/log2(3) = 0.631 | 1.262 |
| 3 | "Gaming Laptop Review" | 0 | 1/log2(4) = 0.500 | 0.000 |
| 4 | "Budget Coding Laptops" | 1 | 1/log2(5) = 0.431 | 0.431 |
| 5 | "MacBook Pro for Developers" | 2 | 1/log2(6) = 0.387 | 0.774 |
DCG@5 = 3.000 + 1.262 + 0.000 + 0.431 + 0.774 = 5.467
Ideal ranking (sorted by relevance): [3, 2, 2, 1, 0]
| Rank | Relevance | Discount | Contribution |
|---|---|---|---|
| 1 | 3 | 1.000 | 3.000 |
| 2 | 2 | 0.631 | 1.262 |
| 3 | 2 | 0.500 | 1.000 |
| 4 | 1 | 0.431 | 0.431 |
| 5 | 0 | 0.387 | 0.000 |
IDCG@5 = 3.000 + 1.262 + 1.000 + 0.431 + 0.000 = 5.693
NDCG@5 = 5.467 / 5.693 = 0.960 (96%)
This is a very good ranking! The system is close to optimal.
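These values can be checked with the `dcg_at_k` and `ndcg_at_k` functions defined above; a minimal sketch:

```python
# Graded relevance of the retrieved results, in ranked order (0-3 scale)
relevances = [3, 2, 0, 1, 2]

print(round(dcg_at_k(relevances, k=5), 3))   # 5.466 (the table sums rounded terms to 5.467)
print(round(ndcg_at_k(relevances, k=5), 3))  # 0.96
```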
Key Insights
- The logarithmic discount means mistakes near the top of the ranking cost far more than mistakes further down.
- Graded relevance lets NDCG distinguish "highly relevant" from "somewhat relevant," which binary metrics cannot.
- Because the score is normalized to the 0-1 range, it can be averaged and compared across queries.
Final Thoughts
In the world of search engines, delivering the right result at the right time is everything. Evaluation metrics provide the tools to measure and improve that ability. By tracking these metrics, developers can fine-tune their systems to make sure users always get the best possible results.
