Mean Average Precision (MAP) is a classical IR metric that averages precision at each rank where a relevant document appears, then averages across queries. It is the older sibling of NDCG: comparable for binary relevance, weaker for graded relevance.
For a single query, average precision (AP) is:

$$\mathrm{AP} = \frac{1}{R} \sum_{k=1}^{n} P(k)\,\mathrm{rel}(k)$$

where $R$ is the number of relevant documents, $n$ is the length of the ranked list, $P(k)$ is precision at rank $k$, and $\mathrm{rel}(k)$ is 1 if the doc at rank $k$ is relevant and 0 otherwise. In practice: walk down the result list; every time you hit a relevant doc, record the precision at that rank; average those numbers.
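The walk-down procedure can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `average_precision` and the optional `num_relevant` argument are assumptions made for this example, and the input is assumed to be a list of binary relevance flags in ranked order.

```python
from typing import Optional, Sequence


def average_precision(rels: Sequence[int], num_relevant: Optional[int] = None) -> float:
    """AP for one ranked list of binary relevance flags (1 = relevant)."""
    # By default the normalizer is the number of relevant docs in the list.
    # Pass num_relevant explicitly when some relevant docs were never retrieved.
    total = sum(rels) if num_relevant is None else num_relevant
    if total == 0:
        raise ValueError("AP is undefined when there are no relevant documents")

    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / total


# Relevant docs at ranks 1 and 3: (1/1 + 2/3) / 2, roughly 0.833
print(average_precision([1, 0, 1, 0, 0]))
```

Passing `num_relevant` matters when the ranked list is truncated: relevant documents that never appear still count in the denominator and pull AP down.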
Mean Average Precision is the mean of AP across all queries.
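As a rough sketch of that averaging step, the snippet below computes AP per query and then takes the mean. The dictionary layout (query id mapped to a list of binary relevance flags) and the function names are illustrative assumptions, and every query is assumed to have at least one relevant document in its list.

```python
from statistics import mean


def average_precision(rels):
    """AP over binary flags; assumes at least one flag is 1."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits


def mean_average_precision(runs):
    """Mean of per-query AP; `runs` maps query id -> ranked relevance flags."""
    return mean(average_precision(rels) for rels in runs.values())


# Hypothetical judged runs for two queries.
runs = {
    "q1": [1, 0, 1],     # AP = (1/1 + 2/3) / 2, roughly 0.833
    "q2": [0, 1, 0, 0],  # AP = (1/2) / 1 = 0.5
}
print(mean_average_precision(runs))  # roughly 0.667
```

Every query carries equal weight in the mean, regardless of how many relevant documents it has.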
What MAP rewards
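MAP rewards ranking relevant documents high: precision at early ranks is higher because fewer non-relevant documents have had a chance to dilute it. With $R$ relevant documents for the query, a relevant doc at position 1 has precision 1 and contributes $1/R$ to AP; if the first relevant doc instead appears at position 10, precision there is $1/10$ and it contributes only $1/(10R)$. So MAP, like NDCG, penalizes burying relevant documents.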
It also rewards finding all relevant documents — a query with 5 relevant docs only achieves AP = 1.0 if all 5 appear at positions 1-5.
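To make both effects concrete, here is a small sketch for a query with 5 relevant documents. The three ranked lists are made-up examples, and the helper takes the total number of relevant documents explicitly so that a missing document still counts against the score.

```python
def average_precision(rels, num_relevant):
    """AP over binary flags, normalized by the total number of relevant docs."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_relevant


R = 5  # total relevant docs for this hypothetical query

perfect = [1, 1, 1, 1, 1]                     # all five at ranks 1-5
one_late = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]     # fifth doc buried at rank 10
one_missing = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # fifth doc never retrieved

for name, rels in [("perfect", perfect), ("one_late", one_late), ("one_missing", one_missing)]:
    print(name, round(average_precision(rels, R), 3))
```

The perfect list scores 1.0, burying the fifth document at rank 10 drops AP to 0.9 (its precision there is 5/10), and missing it entirely drops AP to 0.8.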
Walk through the result list and plot (recall, precision) at every rank. With $R$ relevant documents for the query, recall ticks up by $1/R$ each time you hit a relevant doc; precision is recomputed at that rank. AP is the sum of $P(k) \cdot \frac{1}{R}$ over the relevant ranks $k$, which is exactly the right Riemann sum of the PR curve where each step has width $1/R$. So AP measures how high precision stays as recall grows. A model that has high precision at low recall but collapses near full recall scores worse than one that holds precision steady, even if both find the same docs eventually.
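The equivalence between AP and the stepwise area under the PR curve can be checked numerically. The sketch below is illustrative only: `pr_points`, `ap_as_riemann_sum`, and `ap_direct` are names invented for this example, and the ranked list is made up.

```python
def pr_points(rels, num_relevant):
    """(recall, precision) after each rank of a binary-labeled ranked list."""
    points, hits = [], 0
    for rank, rel in enumerate(rels, start=1):
        hits += rel
        points.append((hits / num_relevant, hits / rank))
    return points


def ap_as_riemann_sum(rels, num_relevant):
    """Right Riemann sum: precision times the recall gained at each rank."""
    area, prev_recall = 0.0, 0.0
    for recall, precision in pr_points(rels, num_relevant):
        # The recall step is 1/R at a relevant rank and 0 everywhere else.
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


def ap_direct(rels, num_relevant):
    """AP computed the usual way, for comparison."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_relevant


rels = [1, 0, 1, 1, 0]
print(ap_as_riemann_sum(rels, 3), ap_direct(rels, 3))  # both roughly 0.806
```

Non-relevant ranks contribute nothing to the sum because recall does not move there, which is why only the relevant ranks appear in the AP formula.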
MAP vs NDCG
The two metrics are closely related and often correlate above 0.9 across systems on the same benchmark. The substantive differences:
| | MAP | NDCG |
|---|---|---|
| Binary labels | Native | Native |
| Graded labels | Drops grade info | Uses grade as gain |
| Discount | Implicit (1/rank) | Explicit (log2) |
| Top-K cutoff | MAP@K (less common) | NDCG@K (standard) |
With binary labels, the two metrics behave similarly. With graded labels, NDCG uses the grades, while MAP has to binarize them first and loses the distinction between highly and marginally relevant documents. For modern retrieval evaluation, where graded relevance (often via LLM-as-judge) is the gold standard, NDCG is more informative. That's the main reason NDCG dominates current benchmarks while MAP appears more often in older IR papers.
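A small sketch can show what "drops grade info" means in practice. The assumptions here are labeled in the code: grades are binarized with a threshold of 1 for AP, and NDCG uses the raw grade as gain with a log2 discount (some implementations use an exponential gain such as 2^grade - 1 instead). The two rankings are made-up examples.

```python
import math


def average_precision(grades, threshold=1):
    """AP after binarizing grades (assumed rule: grade >= threshold is relevant)."""
    hits, precision_sum = 0, 0.0
    for rank, grade in enumerate(grades, start=1):
        if grade >= threshold:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def dcg(grades):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(grades, start=1))


def ndcg(grades):
    """DCG divided by the DCG of the ideal (grade-sorted) ordering."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0


highly_first = [3, 1, 0, 0]    # highly relevant doc ranked first
marginal_first = [1, 3, 0, 0]  # marginally relevant doc ranked first

print(average_precision(highly_first), average_precision(marginal_first))  # identical
print(ndcg(highly_first), round(ndcg(marginal_first), 3))  # 1.0 vs roughly 0.797
```

Both rankings look the same to AP once the grades are binarized, while NDCG prefers the one that puts the highly relevant document first.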
When to use MAP
Binary-relevance benchmarks where MAP is the established convention (TREC tasks, BEIR’s binary subset, classical IR papers). Reporting MAP keeps you comparable to prior work.
As a sanity check alongside NDCG. If MAP and NDCG strongly disagree on your system, something is off — usually graded labels are noisy or there are queries with no relevant docs distorting the average.
When you want a single-number summary of the PR curve. AP approximates the area under a query's precision-recall curve, and MAP averages that area across queries.
What to watch out for
Queries with no relevant documents are typically dropped from MAP (otherwise AP is undefined). Be explicit about how you handle them.
MAP@K vs MAP — full MAP considers all ranks; MAP@K truncates. If your downstream consumer only sees top-10, MAP@10 is more relevant than MAP over the full list.
Single-relevant-doc queries: AP collapses to the reciprocal rank of the one relevant doc, so MAP and MRR are identical on a query set where every query has exactly one relevant document.
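The three caveats above can be sketched in one helper. Everything named here is an assumption for illustration: the `empty` switch ("drop" or "zero"), the `k` argument, and the choice of min(R, K) as the AP@K normalizer, which is one common convention (others divide by R, so check what your evaluation tool does).

```python
def average_precision(rels, k=None):
    """AP (or AP@k) over binary flags; None when the query has no relevant docs."""
    total_relevant = sum(rels)
    if total_relevant == 0:
        return None  # undefined; the caller decides how to handle it
    top = rels if k is None else rels[:k]
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(top, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    # Assumed convention for AP@k: normalize by min(R, k).
    denom = total_relevant if k is None else min(total_relevant, k)
    return precision_sum / denom


def mean_average_precision(runs, k=None, empty="drop"):
    """MAP with an explicit policy for queries that have no relevant docs."""
    aps = [average_precision(rels, k) for rels in runs]
    if empty == "drop":
        aps = [ap for ap in aps if ap is not None]
    elif empty == "zero":
        aps = [0.0 if ap is None else ap for ap in aps]
    else:
        raise ValueError("empty must be 'drop' or 'zero'")
    if not aps:
        raise ValueError("no scorable queries")
    return sum(aps) / len(aps)


def reciprocal_rank(rels):
    """1 / rank of the first relevant doc, or 0.0 if there is none."""
    for rank, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


runs = [[0, 0, 1, 0], [1, 0, 1, 0, 0, 0], [0, 0, 0]]  # last query has no relevant docs
print(mean_average_precision(runs, empty="drop"))  # roughly 0.583
print(mean_average_precision(runs, empty="zero"))  # roughly 0.389
print(mean_average_precision(runs, k=2))           # 0.25
print(average_precision(runs[0]), reciprocal_rank(runs[0]))  # equal: one relevant doc
```

The gap between the "drop" and "zero" numbers on the same runs is the reason to state the policy explicitly when reporting MAP.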
Where MAP fits in a modern stack
In a graded-relevance world, NDCG@K is the primary metric, Recall@K covers first-pass quality, and MRR covers top-1 consumers. MAP is still cited and still useful for comparing against legacy systems and binary benchmarks, but rarely the headline number for new evaluations.
Go further
MAP vs NDCG — which one when?
If your relevance labels are binary, MAP and NDCG agree closely and MAP is fine. If labels are graded (highly relevant > marginally relevant > irrelevant), NDCG uses the grades and MAP throws them away. Modern retrieval evaluation has moved to graded relevance, which is why NDCG dominates current papers.
MAP is the single-number summary of the precision-recall curve. Every time recall ticks up (because you hit a relevant doc), measure precision at that rank, then average. So MAP reflects both how many relevant docs you found and how high they were ranked, without an explicit logarithmic discount.
MAP@K uses the same formula but only counts relevant docs that appear in the top K. It is useful when only the top of the list matters and you want a precision-style metric in that range. NDCG@K is usually a closer fit for the same intent, but MAP@K is still standard in some IR benchmarks.