Why Evaluation Metrics for Reranking Matter in Search Quality

Aug 12, 2025

Introduction

When you search for “best headphones for running,” you expect relevant results at the top. But how do we measure whether a reranking algorithm is actually improving search quality?

Evaluation metrics are quantitative measures that tell us:

  • Are the right results appearing in the top positions?
  • How many relevant results did we find?
  • How well does the ranking match user expectations?

This guide covers the most important metrics used in production search systems, with working Python code for each one.

What Is Reranking and Why Does It Matter?

Reranking is the process of reordering an initial set of search results to improve overall accuracy. Even when a search engine retrieves relevant pages, the initial ordering is rarely perfect. Reranking reorders the results so that the most accurate, useful, and high-quality content appears first. This is especially important in AI-powered search systems, recommendation engines, and e-commerce platforms, where showing the right item first can make all the difference.

Why Evaluation Metrics Are Needed

Without proper evaluation, there’s no way to know if reranking is actually making the results better. Evaluation metrics are used to measure how close the reordered results are to what users want. They help developers figure out whether the changes in ranking lead to improved accuracy, higher user satisfaction, and better click-through rates.

Common Metrics Used in Reranking

Some popular metrics include:

  • Precision and Recall: Precision checks how many of the top results are relevant, while recall checks how many relevant results were found in total.
  • Mean Reciprocal Rank (MRR): This measures how high the first correct result appears in the ranking.
  • Normalized Discounted Cumulative Gain (NDCG): A more advanced metric that gives higher value to relevant results appearing at the top of the list.

These metrics give developers a clear way to compare different reranking algorithms and choose the one that delivers the best experience for users.

Precision @ K

What It Measures

Precision@K answers: “Of the top K results I returned, what percentage are actually relevant?”

This metric focuses on quality over quantity - it only cares about whether you’re showing relevant results in the top K positions.

Formula

Precision@K = (Number of Relevant Documents in Top K) / K

Python Implementation

def precision_at_k(results: list[dict], k: int) -> float:
    """
    Calculate Precision@K
    Args:
        results: List of dicts with 'doc_id' and 'relevant' (bool)
        k: Number of top results to consider

    Returns:
        Precision score (0.0 to 1.0)
    """
    if k <= 0 or len(results) == 0:
        return 0.0

    top_k = results[:k]
    relevant_count = sum(1 for doc in top_k if doc['relevant'])

    return relevant_count / k

Real-World Example

Query: “best headphones for running”

| Rank | Document | Relevant? |
|------|----------|-----------|
| 1 | Waterproof Sport Earbuds Review | ✅ YES |
| 2 | Best Running Headphones 2024 | ✅ YES |
| 3 | Office Headphones Comparison | ❌ NO |
| 4 | Wireless Earbuds for Athletes | ✅ YES |
| 5 | Gaming Headset Guide | ❌ NO |

Calculation:

  • Relevant in top 5: 3 documents
  • Precision@5 = 3/5 = 0.60 (60%)
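Plugging the table above into precision_at_k reproduces this number. A quick standalone check (the function is repeated here so the snippet runs on its own; the doc_id values are illustrative):

```python
def precision_at_k(results: list[dict], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0 or len(results) == 0:
        return 0.0
    top_k = results[:k]
    return sum(1 for doc in top_k if doc['relevant']) / k

# The five results from the table above
results = [
    {'doc_id': 'sport-earbuds-review', 'relevant': True},
    {'doc_id': 'best-running-2024',    'relevant': True},
    {'doc_id': 'office-comparison',    'relevant': False},
    {'doc_id': 'athlete-earbuds',      'relevant': True},
    {'doc_id': 'gaming-headset',       'relevant': False},
]

print(precision_at_k(results, 5))  # 0.6
```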

Recall @ K

What It Measures

Recall@K answers: “Of all relevant documents that exist, what percentage did I find in the top K results?”

This metric focuses on completeness - are you finding all the relevant results?

Formula

Recall@K = (Number of Relevant Documents in Top K) / (Total Relevant Documents)

Python Implementation

def recall_at_k(results: list[dict], k: int, total_relevant: int) -> float:
    """
    Calculate Recall@K
    Args:
        results: List of dicts with 'doc_id' and 'relevant' (bool)
        k: Number of top results to consider
        total_relevant: Total number of relevant docs in entire corpus

    Returns:
        Recall score (0.0 to 1.0)
    """
    if total_relevant <= 0 or k <= 0:
        return 0.0

    top_k = results[:k]
    relevant_found = sum(1 for doc in top_k if doc['relevant'])

    return relevant_found / total_relevant

Real-World Example

Scenario: The database contains 10 relevant documents about “running headphones”

Your search returns:

  • Top 5 results: Found 3 relevant docs → Recall@5 = 3/10 = 0.30 (30%)
  • Top 10 results: Found 6 relevant docs → Recall@10 = 6/10 = 0.60 (60%)
  • Top 20 results: Found 8 relevant docs → Recall@20 = 8/10 = 0.80 (80%)
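The same progression can be reproduced in code. Below, a mock ranking of 20 documents places relevant documents so that 3, 6, and 8 of them fall in the top 5, 10, and 20 respectively (the positions are invented for illustration, and the function is repeated so the snippet is self-contained):

```python
def recall_at_k(results: list[dict], k: int, total_relevant: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if total_relevant <= 0 or k <= 0:
        return 0.0
    found = sum(1 for doc in results[:k] if doc['relevant'])
    return found / total_relevant

# Mock ranking: relevant docs sit at these (1-based) positions;
# two of the 10 relevant docs were never retrieved at all.
relevant_positions = {1, 2, 4, 7, 9, 10, 13, 17}
results = [{'doc_id': f'doc{i}', 'relevant': i in relevant_positions}
           for i in range(1, 21)]

print(recall_at_k(results, 5, total_relevant=10))   # 0.3
print(recall_at_k(results, 10, total_relevant=10))  # 0.6
print(recall_at_k(results, 20, total_relevant=10))  # 0.8
```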

F1 Score @ K

What It Measures

F1@K is the harmonic mean of Precision@K and Recall@K. It balances both metrics, giving you a single score that accounts for both quality and completeness.

Formula

F1@K = 2 × (Precision@K × Recall@K) / (Precision@K + Recall@K)

Python Implementation

def f1_at_k(results: list[dict], k: int, total_relevant: int) -> float:
    """
    Calculate F1 Score@K
    Args:
        results: List of dicts with 'doc_id' and 'relevant' (bool)
        k: Number of top results to consider
        total_relevant: Total number of relevant docs in corpus

    Returns:
        F1 score (0.0 to 1.0)
    """
    precision = precision_at_k(results, k)
    recall = recall_at_k(results, k, total_relevant)

    if precision + recall == 0:
        return 0.0

    f1 = 2 * (precision * recall) / (precision + recall)
    return f1

Real-World Example

Given:

  • Precision@5 = 0.60 (3 out of 5 results are relevant)
  • Recall@5 = 0.30 (found 3 out of 10 total relevant docs)

Calculation:

F1@5 = 2 × (0.60 × 0.30) / (0.60 + 0.30)
     = 2 × 0.18 / 0.90
     = 0.36 / 0.90
     = 0.40
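The arithmetic above can be checked directly. Since f1_at_k just combines the two earlier metrics, the harmonic mean is computed here from the precision and recall values themselves (a small helper written for this check, not part of the code above):

```python
def harmonic_f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)

print(round(harmonic_f1(0.60, 0.30), 2))  # 0.4
# The harmonic mean punishes imbalance: the same two numbers averaged
# arithmetically would give 0.45, but F1 pulls toward the weaker metric.
print(round(harmonic_f1(0.60, 0.60), 2))  # 0.6
```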

Mean Reciprocal Rank (MRR)

What It Measures

MRR answers: “How high does the first relevant result appear in my ranking?”

This metric is perfect for scenarios where users need ONE good answer (e.g., question answering, navigational search).

Formula

For a single query:

RR = 1 / (rank of first relevant document)

For multiple queries:

MRR = (1/Q) × Σ(1 / rank_i)

where Q = number of queries and rank_i = position of the first relevant result for query i.

Python Implementation

def reciprocal_rank(results: list[dict]) -> float:
    """
    Calculate Reciprocal Rank for a single query

    Args:
        results: List of dicts with 'doc_id' and 'relevant' (bool)

    Returns:
        Reciprocal rank (0.0 to 1.0)
    """
    for rank, doc in enumerate(results, start=1):
        if doc['relevant']:
            return 1.0 / rank

    return 0.0  # No relevant results found

def mean_reciprocal_rank(queries_results: list[list[dict]]) -> float:
    """
    Calculate MRR across multiple queries

    Args:
        queries_results: List of result lists, one per query

    Returns:
        Mean reciprocal rank (0.0 to 1.0)
    """
    if not queries_results:
        return 0.0

    rr_sum = sum(reciprocal_rank(results) for results in queries_results)
    return rr_sum / len(queries_results)

Real-World Example

Queries

Query 1: “capital of France”

  • Rank 1: Paris → RR = 1/1 = 1.0

Query 2: “python tutorial”

  • Rank 1: Ruby guide (irrelevant)
  • Rank 2: Python docs → RR = 1/2 = 0.5

Query 3: “best pizza NYC”

  • Rank 1: LA restaurants (irrelevant)
  • Rank 2: Chicago pizza (irrelevant)
  • Rank 3: NYC pizza guide → RR = 1/3 = 0.333

MRR = (1.0 + 0.5 + 0.333) / 3 = 0.611
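Running the three toy queries through the functions above reproduces this number (the functions are repeated so the snippet runs standalone; the doc_id labels are illustrative):

```python
def reciprocal_rank(results: list[dict]) -> float:
    """1/rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc in enumerate(results, start=1):
        if doc['relevant']:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries_results: list[list[dict]]) -> float:
    """Average reciprocal rank over a batch of queries."""
    if not queries_results:
        return 0.0
    return sum(reciprocal_rank(r) for r in queries_results) / len(queries_results)

queries = [
    [{'doc_id': 'paris', 'relevant': True}],              # RR = 1.0
    [{'doc_id': 'ruby-guide', 'relevant': False},
     {'doc_id': 'python-docs', 'relevant': True}],        # RR = 0.5
    [{'doc_id': 'la-restaurants', 'relevant': False},
     {'doc_id': 'chicago-pizza', 'relevant': False},
     {'doc_id': 'nyc-pizza-guide', 'relevant': True}],    # RR = 0.333...
]

print(round(mean_reciprocal_rank(queries), 3))  # 0.611
```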

Normalized Discounted Cumulative Gain (NDCG)

What It Measures

NDCG is the gold standard for ranking evaluation. Unlike previous metrics that treat all relevant documents equally, NDCG allows for graded relevance (e.g., highly relevant, somewhat relevant, not relevant) and heavily penalizes placing relevant docs lower in the ranking.

Formula

Step 1 - Discounted Cumulative Gain (DCG):

DCG@K = Σ(rel_i / log2(i + 1))

where:

  • rel_i = relevance score of the document at position i
  • i = rank position (1, 2, 3, ...)
  • log2(i + 1) = discount factor that penalizes lower positions

Step 2 - Ideal DCG (IDCG):

IDCG@K = DCG of the perfect ranking (all documents sorted by relevance, descending)

Step 3 - Normalized DCG:

NDCG@K = DCG@K / IDCG@K

Python Implementation

import math
from typing import List

def dcg_at_k(relevances: List[float], k: int) -> float:
    """
    Calculate Discounted Cumulative Gain at K

    Args:
        relevances: List of relevance scores (higher = more relevant)
        k: Number of top results to consider

    Returns:
        DCG score
    """
    dcg = 0.0
    for i, rel in enumerate(relevances[:k], start=1):
        dcg += rel / math.log2(i + 1)

    return dcg

def ndcg_at_k(relevances: List[float], k: int) -> float:
    """
    Calculate Normalized Discounted Cumulative Gain at K

    Args:
        relevances: List of relevance scores in retrieved order
        k: Number of top results to consider

    Returns:
        NDCG score (0.0 to 1.0)
    """
    # Calculate DCG for actual ranking
    dcg = dcg_at_k(relevances, k)

    # Calculate IDCG (ideal ranking - sorted by relevance descending)
    ideal_relevances = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal_relevances, k)

    if idcg == 0:
        return 0.0

    return dcg / idcg

Real-World Example

Query: “best laptop for programming”

Your system returns (with graded relevance 0-3):

| Rank | Document | Relevance | Discount (1/log2(rank+1)) | Contribution |
|------|----------|-----------|---------------------------|--------------|
| 1 | “Top Programming Laptops 2024” | 3 | 1/log2(2) = 1.000 | 3.000 |
| 2 | “Developer Laptop Guide” | 2 | 1/log2(3) = 0.631 | 1.262 |
| 3 | “Gaming Laptop Review” | 0 | 1/log2(4) = 0.500 | 0.000 |
| 4 | “Budget Coding Laptops” | 1 | 1/log2(5) = 0.431 | 0.431 |
| 5 | “MacBook Pro for Developers” | 2 | 1/log2(6) = 0.387 | 0.774 |

DCG@5 = 3.000 + 1.262 + 0.000 + 0.431 + 0.774 = 5.467

Ideal ranking (sorted by relevance): [3, 2, 2, 1, 0]

| Rank | Relevance | Discount | Contribution |
|------|-----------|----------|--------------|
| 1 | 3 | 1.000 | 3.000 |
| 2 | 2 | 0.631 | 1.262 |
| 3 | 2 | 0.500 | 1.000 |
| 4 | 1 | 0.431 | 0.431 |
| 5 | 0 | 0.387 | 0.000 |

IDCG@5 = 3.000 + 1.262 + 1.000 + 0.431 + 0.000 = 5.693

NDCG@5 = 5.467 / 5.693 = 0.960 (96%)

This is a very good ranking! The system is close to optimal.
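The table arithmetic can be verified with compact versions of the dcg_at_k / ndcg_at_k functions above (repeated here so the snippet runs standalone):

```python
import math

def dcg_at_k(relevances: list[float], k: int) -> float:
    """Sum of relevance scores discounted by log2 of position."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances: list[float], k: int) -> float:
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# Graded relevance of the five retrieved laptops, in retrieved order
relevances = [3, 2, 0, 1, 2]

# ≈ 5.466 (the table shows 5.467 because each row was rounded before summing)
print(round(dcg_at_k(relevances, 5), 3))
print(round(ndcg_at_k(relevances, 5), 2))  # 0.96
```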

Final Thoughts

In the world of search engines, delivering the right result at the right time is everything. Evaluation metrics provide the tools to measure and improve that ability. By tracking these metrics, developers can fine-tune their systems to make sure users always get the best possible results.
