Precision, Recall, and F1 Score in IR

Mohammad Derakhshan
2 min read · Mar 27, 2022



In the previous articles, we saw how to implement a naive version of an Information Retrieval system and how TF-IDF improves its performance. In this article, we evaluate our approach. The purpose of the evaluation is to assess two aspects: first, are the returned results actually related to the query? Second, does the system return all of the related documents? To measure this, we need a list of queries together with the documents the system should return for each of them. We call this list the Ground Truth (or gold standard).

To evaluate the results, we need to define some concepts. The first one is Precision, which is defined as below:
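Precision = (number of retrieved documents that are relevant) / (total number of retrieved documents)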

So, if we consider the query and results shown below, where the system retrieves seven documents and four of them are relevant, the Precision of the system is 4/7 ≈ 0.57.

[Figure: example query and retrieved results, from the slides of the IR course by Alfio Ferrara, University of Milan]

The maximum value of Precision is one. But something is still missing: we cannot rely on Precision alone. Imagine the IR system returns only one document and it is relevant, while the ground truth lists ten more relevant documents. In this scenario, Precision is one even though the system did not perform well.

Therefore, we need another measurement to assess the completeness of the retrieved documents. We call it Recall. Recall is defined as below:
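Recall = (number of retrieved documents that are relevant) / (total number of relevant documents in the ground truth)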

For the above example, the ground truth contains six relevant documents and four of them were retrieved, so the Recall is 4/6 ≈ 0.67.

The system works well when both of these criteria are high. A specific combination of the two, their harmonic mean, is called the F1 score. The higher the F1, the better the system's performance:
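F1 = 2 × Precision × Recall / (Precision + Recall)

For the running example, F1 = 2 × (4/7) × (4/6) / (4/7 + 4/6) = 16/26 ≈ 0.62.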

We can summarize these concepts in terms of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) as follows:
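TP: relevant documents that were retrieved
FP: non-relevant documents that were retrieved
FN: relevant documents that were not retrieved
TN: non-relevant documents that were not retrieved

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

As a minimal sketch of how these measures can be computed, here is a small Python function; the document ids d1 to d9 and the evaluate helper are hypothetical and only stand in for the retrieved results and the ground truth of a single query.

```python
def evaluate(retrieved, relevant):
    """Compute Precision, Recall, and F1 for a single query.

    retrieved: document ids returned by the IR system
    relevant:  document ids listed in the ground truth for this query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # relevant documents that were actually returned

    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Example matching the numbers above: 7 retrieved, 6 relevant, 4 in common
retrieved = ["d1", "d2", "d3", "d4", "d5", "d6", "d7"]
relevant = ["d1", "d2", "d3", "d4", "d8", "d9"]
print(evaluate(retrieved, relevant))  # -> (0.571..., 0.666..., 0.615...)
```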

This article is inspired by the topics taught by Professor Alfio Ferrara at the University of Milan.
