Recall and Precision at k for Recommender Systems
Detailed Explanation with examples
Precision and recall are classical evaluation metrics in binary classification algorithms and for document retrieval tasks. These metrics have been “Translated” to help us evaluate recommendation systems.
To understand how these metrics work, we need to first understand the workflow of recommendation systems and then how to evaluate them.
Recommendation system workflow
The diagram below explains a workflow of recommendation systems. First a training set is fed to a recommendation algorithm which produces a recommendation model that can be used to generate new predictions. To evaluate the model a held out test set is fed to the learned model where predictions are generated for each user-item pair. Predictions with known labels (true value) are then used as an input to the evaluation algorithm to produce evaluation results.
Many evaluation metrics are available for recommendation systems and each has its own pros and cons. For example RMSE can be computed by comparing the predicted rating to the true rating for each user-item pair with a known label.
In our case however we are only interested in calculating precision and recall at k. Precision and recall are binary metrics used to evaluate models with binary output. Thus we need a way to translate our numerical problem (ratings usually from 1 to 5) into a binary problem (relevant and not relevant items)
Translating to binary
To do the translation we will assume that any true rating above 3.5 corresponds to a relevant item and any true rating below 3.5 is irrelevant. A relevant item for a specific user-item pair means that this item is a good recommendation for the user in question.
3.5 is just a threshold value I chose. There are multiple ways to set this threshold value such as taking into consideration the history of ratings given by the user. for the sake of simplicity, we will stick to the 3.5 threshold.
In the context of recommendation systems we are most likely interested in recommending top-N items to the user. So it makes more sense to compute precision and recall metrics in the first N items instead of all the items. Thus the notion of precision and recall at k where
k is a user definable integer that is set by the user to match the top-N recommendations objective.
Relevant vs. Recommended
We have already seen the definition of a relevant items. In the rest of the article we will user relevant and recommended items frequently. Here is a good point to pause and grasp their exact definition.
# Relevant items are already known in the data set
Relevant item: Has a True/Actual rating >= 3.5
Irrelevant item: Has a True/Actual rating < 3.5# Recommended items are generated by recommendation algorithm
Recommended item: has a predicted rating >= 3.5
Not recommended item: Has a predicted rating < 3.5
Precision and recall at k: Definition
Precision at k is the proportion of recommended items in the top-k set that are relevant
Its interpretation is as follows. Suppose that my precision at 10 in a top-10 recommendation problem is 80%. This means that 80% of the recommendation I make are relevant to the user.
Mathematically precision@k is defined as follows:
Precision@k = (# of recommended items @k that are relevant) / (# of recommended items @k)
Recall at k is the proportion of relevant items found in the top-k recommendations
Suppose that we computed recall at 10 and found it is 40% in our top-10 recommendation system. This means that 40% of the total number of the relevant items appear in the top-k results.
Mathematically recall@k is defined as follows:
Recall@k = (# of recommended items @k that are relevant) / (total # of relevant items)
An Illustrative example
In this example we will illustrate the method to calculate
As a start we will ignore all the ratings where the actual value is not known. Values with no known true rating cannot be used.
We will sort the rest of the items by descending prediction rating. The results will be as follows:
will be as follows:
The number of relevant items are the items with actual rating greater or equal to 3.5.
Relevant items: item5, item10 and item1
total # of relevant items = 3
Recommended items @ 3:
The recommended items at 3 are
item7, item5 and item10
Recommended items @ 3: item7, item5 and item10
# of recommended items at 3 = 3
Recommended and Relevant items @ 3
It is the intersection between Recommended@3 and Relevant@3 which are
Recommended@3 INTERSECTION Relevant: item5 and item10
# of recommended items that are relevant @3= 2
Precision @ 3:
We can compute the precision which is 66.67%. Here we can interpret that only 66.67% of my recommendations are actually relevant to the user.
=(# of recommended items that are relevant @3)/(# of recommended items at 3)
Recall @ 3:
Here we can interpret that 66.67% percent of the relevant items were recommended in the top-k items
= (# of recommended items that are relevant @3)/(total # of relevant items)
In the computation of precision@k, we are dividing by the number of items recommended in the top-k recommendation. If there are no items recommended. i.e. number of recommended items at k is zero, we cannot compute precision at k since we cannot divide by zero. In that case we set precision at k to 1. This makes sense because in that case we do not have any recommended item that is not relevant.
Similarly, when computing recall@k we might face a similar situation when the total number of relevant items is zero. In that case we set recall at k to be 1. This also makes sense because we do not have any relevant item that is not identified in our top-k results.
Finally I would like to add that what I explained above is just one way to compute precision and recall at k. Other variation of these evaluation metrics are available in literature and can be more complex.
A python implementation of the metrics explained above can be found in the FAQ section of the Surprise Library. Here is a direct link. To get you started with recommender systems and Surprise you can check this article here