IR: Graded Assessments are not for Mean Average Precision (MAP)!

Setting: You have an information retrieval benchmark with a document collection and queries (aka topics). You want to evaluate the quality of the document ranking produced by your algorithm.

Your benchmark contains graded assessments. That means for every query/document pair, the assessment of how relevant that document is for the query is not just binary (relevant=1 vs. non-relevant=0) but distinguishes between sort-of-relevant (1), relevant (3), and very super duper relevant (5), or alternatively spam (-2).

Quality metrics that are often used in this case are:

  • Precision@10 (with the top 10, how many documents are relevant?)
  • Recall@10 (how many of the relevant documents are in the top 10?)
  • R-Precision (knowing that R documents are relevant, how many of them are in the top R of the ranking?)
  • MAP, which stands for Mean Average Precision (for each query, look at every rank K at which a relevant document is found, compute Precision@K, and average those values over all relevant documents; MAP is the mean of that over all queries). Because the average runs over all relevant documents, relevant documents missing from the ranking also drag the score down. All four metrics are sketched in code below.

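To make the binary nature of these metrics concrete, here is a minimal Python sketch (my own illustration, not a reference implementation; for real evaluations use trec_eval or your toolkit of choice). Note that every function only ever sees a plain set of relevant documents: the graded labels are gone before the metric is computed.

```python
# `ranking` is the list of document ids returned by your system, best first;
# `relevant` is the set of documents judged relevant for the query -- the
# graded labels have already been collapsed to yes/no at this point.

def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def recall_at_k(ranking, relevant, k=10):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / len(relevant)

def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    return precision_at_k(ranking, relevant, k=len(relevant))

def average_precision(ranking, relevant):
    """Average of Precision@K over every rank K that holds a relevant document.

    The sum is divided by the total number of relevant documents, so relevant
    documents that never show up in the ranking pull the score down.
    """
    hits, precision_sum = 0, 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant)

def mean_average_precision(rankings, relevant_sets):
    """MAP: the mean of average precision over all queries."""
    return sum(
        average_precision(r, rel) for r, rel in zip(rankings, relevant_sets)
    ) / len(rankings)
```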
From my wording you should gather that none of these metrics makes any use of the graded assessments. They only incorporate binary relevance, i.e., whether a document is relevant or not.

Period. IF YOU HAVE GRADED JUDGMENTS, THESE METRICS WILL NOT USE THEM!

You can kind of bake your own by evaluating at different relevance cutoffs, e.g., “everything assessed 2 or less counts as non-relevant, everything 3 and above as relevant”, and taking the average across those cutoffs (see the sketch below). But this is not a well-established metric.
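As a sketch of this home-brewed idea, assuming the hypothetical average_precision helper from the sketch above and per-query dicts that map document ids to their graded labels:

```python
def map_over_cutoffs(rankings, graded_qrels, cutoffs=(1, 3, 5)):
    """Home-brewed: binarize the graded labels at several cutoffs, compute MAP
    for each binarization, and average the results. Not a standard metric!

    `graded_qrels` has one dict per query, mapping doc id -> graded label.
    The cutoff values are placeholders; pick ones that match your scale.
    """
    per_cutoff = []
    for cutoff in cutoffs:
        pairs = [
            (ranking, {doc for doc, grade in qrels.items() if grade >= cutoff})
            for ranking, qrels in zip(rankings, graded_qrels)
        ]
        # Skip queries with no relevant document at this cutoff to avoid
        # dividing by zero (assumes at least one query survives per cutoff).
        pairs = [(r, rel) for r, rel in pairs if rel]
        per_cutoff.append(
            sum(average_precision(r, rel) for r, rel in pairs) / len(pairs)
        )
    return sum(per_cutoff) / len(per_cutoff)
```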


The only established metric that was designed to include graded assessments is NDCG (which stands for Normalized Discounted Cumulative Gain). However, this metric also discounts brownie points for relevant documents that show up at lower ranks, using a logarithmic discount function. Also, be aware that there are multiple versions of NDCG floating around!
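Here is a rough sketch of the idea (again my own illustration, not the one canonical definition): DCG sums a gain per retrieved document, discounted by the logarithm of its rank, and NDCG normalizes by the DCG of an ideal ordering. The two gain variants below (linear vs. exponential), and the clamping of negative labels such as the spam grade, are exactly the kind of details on which the different NDCG versions disagree.

```python
import math

def dcg(ranking, graded_qrels, k=10, exponential_gain=False):
    """Discounted Cumulative Gain of the top-k documents.

    Gain is either the graded label itself ("linear" gain) or 2**label - 1
    ("exponential" gain); both variants are in common use, which is one
    reason NDCG numbers from different tools don't always match.
    """
    score = 0.0
    for rank, doc in enumerate(ranking[:k], start=1):
        grade = max(graded_qrels.get(doc, 0), 0)  # clamp spam/negative labels to 0 gain
        gain = (2 ** grade - 1) if exponential_gain else grade
        score += gain / math.log2(rank + 1)       # logarithmic rank discount
    return score

def ndcg(ranking, graded_qrels, k=10, exponential_gain=False):
    """Normalize DCG by the DCG of the ideal ranking (documents sorted by grade)."""
    ideal = sorted(graded_qrels, key=graded_qrels.get, reverse=True)
    ideal_dcg = dcg(ideal, graded_qrels, k, exponential_gain)
    return dcg(ranking, graded_qrels, k, exponential_gain) / ideal_dcg if ideal_dcg else 0.0
```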