Popular evaluation metrics in recommender systems explained

Explaining recall@k, precision@k and average precision@k with examples from TensorFlow.

In this post, we will be discussing evaluation metrics of relevance in recommender systems and how we can use in-house methods from TensorFlow to assess the quality of our models. But let’s start with what a recommender system is and why it is useful.

Motivation for Recommender Systems

A recommender system is an algorithm trained on user tastes and tasked to recommend items of interest based on a user’s profile. If that’s vague enough, you’d be surprised to see that recommendation systems are used abundantly in our everyday interactions with apps and sites. For example, Amazon is using them to recommend products, Spotify to recommend similar music based on our listening history, YouTube to recommend related videos, Google to serve personalized advertisements and search results, Netflix to recommend movies and TasteDive to recommend a wide variety of different categories (music, movies, shows, books, games, podcasts, etc). The value of recommender systems is massive and often drives a critical amount of revenue or traffic in businesses. For example, 35% of user purchases in Amazon and 75% of show/movie selections in Netflix are driven from the output of recommendation algorithms [source: McKinsey & Company].

Usually, we assess the quality of our recommendations based on how relevant they are to our preferences and how non-intuitive yet interesting they can be. For the former, we judge relevance with metrics such as recall and precision (that we will expand on later) and for the latter with metrics like diversity, coverage, serendipity, and novelty. We will refer to the latter metrics collectively as metrics of serendipity. In this post, we will only be focusing on metrics of relevance, but if you’d like to read an excellent summary on metrics of serepindity, please have a look at this post: Recommender Systems — It’s Not All About the Accuracy. Before we expand further on measures of relevance, let’s justify briefly why both types of metrics are equally critical. When recommendations are relevant, or in other words, when they are based on our historical preferences, it is more likely to keep us happy and engaged on the services we use. For example, when someone is a big fan of RnB music, she is more likely to enjoy recommendations of similar RnB songs or artists rather than songs related to Heavy Metal music. On the other hand, if someone is recommended things that are too obvious without bringing anything new to the table, this bears the risk of creating boring, non-actionable recommendations. Imagine you bought basketball shoes from Amazon and the next thing you are recommended are similar basketball shoes. Even though this recommendation is relevant it will not lead to any action. On the other hand, if you are recommended basketball socks, or an actual basket ball, this is more likely to lead to a new action. Enough with the motivational jargon, let’s move on to what this post is really about…


What are recall@k and precision@k?

Precision and recall are evaluation metrics that are commonly used in classification settings. In the context of recommender systems, we use metrics like recall@k and precision@k, since it is more common to provide k recommendations per example. An example can be a user query in our case. I.e., if a user likes a particular set of movies (the query), the output will be a list of k movies (of potential interest to her). Sticking to our movie paradigm, a user likes 70 out of 200K total movies (which translates to this example having 70 labels), we can represent this with 1x200,000 vector where all elements would be 0 except for 70 being 1 (in the positions corresponding to the movies the user has expressed interest in). This is called multi-hot encoding.
In that respect, we can treat this problem as a multi-label classification problem, where an example is labeled as multiple different things. A similar and more familiar concept borrowed from Computer Vision is when we have an image and we try to label it with the different animals present in the image. On the other hand, multi-class classification is when we have multiple classes but only one label per example (e.g., image is classified only as cat, only as dog, as horse etc). Let’s proceed to define recall@k and precision@k.

Recall@k is defined as the percentage of true labels captured when we recommend k labels per example. Its exact definition is:

On the other hand, precision@k is defined as the percentage of predictions we get right. Its exact definition is:

Let’s borrow an example from medicine (hopefully not too graphic) to explain the different goals of recall and precision. Assume a doctor is to remove a patient’s metal, non-life threatening implants that are used to speed up the recovery of a broken leg. Recall is concerned with removing as many of these implants as possible. Precision is concerned with removing the implants with as few attempts as possible. If the doctor is sloppy, a patient could have all of her implants removed in four consecutive operations which would result in very high recall, very low precision. On the other hand, a minimally intrusive operation that lasts only half an hour but only removes one of the implants would result in very low recall, very high precision. Both of these situations are unacceptable. The best of both worlds would be to optimize for one metric, keeping the other metric above a minimum acceptable threshold. We will talk more about this later on.

Going back to the definitions of recall@k and precision@k, assume an example has three true labels and the algorithm gives 5 recommendations. If two of these recommendations correspond to true labels, then # of true labels captured = 2, # of true labels = 3 and # of predictions made = 5. Therefore, recall@k = 2/3 = 0.67, precision@k = 2/5 = 0.4. For simplicity, we will denote the # of true labels captured as TP@k, which is known as true positives@k. In the general case of batch_size examples and k recommendations per example, # predictions made = batch_size * k. Also, # of true labels will refer to the total number of labels in the batch. Both recall and precision take values between 0 and 1.

There is usually an inverse relationship between recall and precision. In addition, there’s a monotonic relationship between recall@k and k (the number of recommendations per example), while precision@k tends to lower with k. The monotonicity between recall@k and k is very easy to show. The denominator in recall@k is # of true labels which remains constant. The more recommendations we add per example, the more likely it is one of them to be a true label, which would increase the numerator TP@k. Now, when it comes to precision@k both the numerator (TP@k) and denominator (# predictions made) are affected with the addition of new recommendations. Let’s assume we increase the resulting output to one more recommendation per example. That is, we go from k to k+1. What condition should hold so that precision@(k+1) ≥ precision@k?

When do we have precision@(k+1) ≥ precision@k?

TP@(k+1)-TP@k = 1, if the k+1ᵗʰ recommendation is a true label and 0, otherwise. Since, 0 ≤ precision@k ≤ 1, precision@(k+1) ≥ precision@k if and only if the k+1ᵗʰ recommendation is a true label. In general, if we add m new recommendations per example, we can show (we will avoid the proof here but it follows the same logic to the proof above) that precision@(k+m) ≥ precision@k if any only if

which basically translates to requiring the precision of the additional m recommendations being larger than the current precision. The above equation shows also why precision@k is more likely to decrease with k. If an example has L true labels and we already predicted some of them, it is more and more difficult to uncover the remaining labels for every recommendation we add. Lastly, let’s give an example with TensorFlow to make this a bit more clear.

Here, we have just one example in the batch (we can easily generalize to batch_size examples) that has two true labels: 0 and 3. We assume we can have up to 6 labels per example, which is the number of columns in predictions. We will try different values of k (number of recommendations per example) and check the effect they are having in recall@k and precision@k.
The output of the above snippet is:

recall@1 = 0.00, precision@1 = 0.00
recall@2 = 0.00, precision@2 = 0.00
recall@3 = 0.50, precision@3 = 0.33
recall@4 = 0.50, precision@4 = 0.25
recall@5 = 1.00, precision@5 = 0.40

The predicted labels are ranked according to the score defined in the variable predictions. Label 1 comes first with the highest score (.50), label 2 second (.30), label 0 third (.10) and so on. The predicted labels in terms of descending score are: 1, 2, 0, 4, 3, 5. When k=1, recall@1 and precision@1 are 0, because the top predicted label (1) is not one of true labels (0,3). Same when k=2. When k=3, the third top predicted label is 0, which happens to be one of true labels. In this case, recall@3=1/2=0.5, while precision@3=1/3=0.33. Note that as k increases, recall@k goes up monotonically (or to be more correct, it cannot go down)! While precision@k can go up or down depending on the relevance of the added recommendations. For example, when we go from k=3 to k=4, precision@k decreases, but when we go from k=4 to k=5, it increases since the fifth top predicted label, which is label 3, is one of true labels. As a rule of thumb, recall@k goes up and precision@k goes down with k.


Striking a balance between recall@k and precision@k

A natural question that arises is how to compare different models or determine the best k based on precision and recall. As Andrew Ng puts it in his draft book Machine Learning Yearning, the more appropriate approach when considering multiple evaluation criteria is to have only one as an “optimizing”, while the remaining as “satisficing” criteria. This essentially means to pick the model with the highest optimizing metric as long as all satisficing metrics exceed a minimum or do not surpass a maximum threshold. The metric to choose as the optimizing one is totally task-dependent. For example, I would imagine that in Amazon (don’t take my word for it!) they would be more focused on higher precision than recall. In other words, they might be more interested in providing a short list of highly relevant recommendations to a user that capture well some of her needs than a longer, wider (and maybe less targeted) list. Contrast that to an online learning platform like Coursera: if there was an option for the user to be recommended courses to build a skillset (so that she can do well on interviews), recall might be more well suited as an optimizing metric. For example, if Deep Learning, Python and TensorFlow were required skills in data science job postings and I wish to build my knowledge around these topics, it wouldn’t hurt to be recommended a wider set of classes that would more holistically cover the various concepts of these topics.

As a side note, metrics need not have a purely mathematical notion. A metric could be the cost per recommendation. In the Coursera example, one metric might be the monthly membership cost. If they were to recommend classes that increase the likelihood of a user extending her monthly membership while not paying much attention to relevance, this could increase customer dissatisfaction. In case of wearable devices or smart phones, a metric can be the size of the model, or in time sensitive settings like high-frequency trading the algorithm’s response time and so on. For a more detailed explanation of satisficing and optimizing metrics, please take a look at the video below.

Satisficing and optimizing metrics

Last but not least, you can try to combine recall and precision in a single number evaluation metric. This would make the comparison across different models much more straightforward. For example, we can easily conclude that a model with a higher precision and recall is superior to one with lower recall and precision, but we cannot be sure a model is better if it does better in one metric but worse on the other. Even though you can come up with your own combined metrics of recall and precision, it is very common to use the harmonic mean of those two metrics, also known as F1 score. Its formal definition is:

F1 score takes values between 0 and 1 just like recall and precision. It becomes 0 when one of precision or recall is 0 and 1 when both precision and recall are 1. You can see how F1 score changes with respect to precision and recall in the picture below.

F1 score with respect to precision and recall.

Note that when one of two metrics (precision, recall) performs bad, F1 score performs bad as well (direction southwest to northeast in the picture above). When both metrics receive higher values, F1 score receives a higher value as well. 
In fact, you can set more weight on one of recall or precision depending on which is more important for the task at hand. For this, you can use the generalized version of F score, called Fᵦ score. The formula for Fᵦ score is:

where β is non-negative number. When β=1, we are back to our familiar F1 score! Values of β much lower than 1, give more emphasis on precision, while values of β much higher than 1 give more emphasis on recall. See, for example, the image below. On the left subplot, we have set β to a very small value (β=0.1) that almost cancels out the effect of recall (contour color does not change along the vertical axis, which represents recall). On the other hand, the right subplot shows a case where β is very large (β=4), which essentially cancels out the effect of precision. In fact, when β=0, Fᵦ@k = precision@k, while when β=∞, Fᵦ@k = recall@k.

Fᵦ score with respect to precision and recall.

For a more thorough discussion on single evaluation metrics, you can have a look at the Andrew Ng’s video below.

Single Number Evaluation Metric

Average precision@k

The metrics of precision@k and recall@k are very good in identifying whether our recommendations are relevant, but not very good at telling us how early or late in the recommendation list they appeared. In other words, they don’t tell us anything about the rank of the recommendations. For example, if we provide two lists with 20 recommendations each, where in the first one the top two recommendations are relevant, while in the second list the last two recommendations are relevant both of these lists will result in exactly the same recall@k and precision@k! Obviously, the first list is much more useful and actionable than the latter.

Average precision@k takes into account the ranking of the recommendations as well. If we were to give a definition, average precision@k is the average of all precisions in the places where a recommendation becomes relevant (or in other words, where a recommendation is a true positive). First, we will focus to one example and then extend to a batch of examples so that we do not lose sight of the forest for the tree. If we denote the average precision@ with avgpre@k, its formula is:

precision@j is the precision (for this example), when we provide the top j recommendations. relevant@j is 1 when the jᵗʰ recommendation is relevant and 0, otherwise. This means, that relevant@j = 1, if the top jᵗʰ recommendation is one of the true labels (for this example), or in other words, a true positive. You can think of relevant as an 1 x k vector with 1s in the places where the recommendation is a true positive and 0, otherwise. avgpre@k takes values from 0 (if there are no relevant recommendations) to (precision@1 + … + precision@k)/k, if all recommendations are relevant.

Because we can have a case where an example has fewer labels (L) than the maximum number of recommendations provided (k), a variant of the formula that takes care of this is:

This way, we do not penalize our method for setting a value for k larger than L. (We know that when k > L, we cannot have more than L relevant recommendations and therefore it makes sense to stop our search to the top L recommendations and compute the average precision@L instead).

Because the math mumbo jumbo can be confusing, let’s give an example that illustrates how avgpre@k is computed. Assume we have L=3 true labels for an example, and we are interested in evaluating avgpre@6 (k=6). If in these six recommendations, the second and sixth are relevant (true positives), the avgpre@6 would be computed as:

If we are to expand the above definition to a batch of B examples, avgpre@k is defined as the average of the average precisions@k in each example (I know I know… too much ‘average’ going on here):

avgpre@k(i) is defined exactly as avgpre@k that we defined for a single example. The only difference is that L in this case represents the maximum number of labels per example in this batch.

If you were to keep just one takeaway from this definition, let it be that avgpre@k takes into account not just the relevance of a recommendation list, but the relative order of the recommendations as well. If you are interested in reading more about average precision@k, you can have a look at this article: Mean Average Precision (MAP) For Recommender Systems.

Here is a TensorFlow example.

The output of the above snippet is 0.4907. In the code above, variable labels represents the true labels for the batch of examples. Row 1 represents the labels for the first example (1, 0, 4), row 2 the labels for the second example (0) and row 3 the labels for the third example (0, 2). You will notice that label 0 is weirdly repeated three times in second row, while label 2 twice in third row. This is because TensorFlow can only support tensors where the dimensions are the same across examples. Newer version of TensorFlow (1.13) can support ragged tensors, where tensor slices may have different lengths.

Fortunately, this ‘artificial’ tiling does not pollute the results. If we were to compute recall@k, TensorFlow would still compute the # of true labels as 3 + 1 + 2 = 6. We further have that B = 3 (number of rows in variables labels and predictions). L=3 since the maximum number of labels per example in this batch is 3. We also want to compute avgpre@k when k = 4 and we can have up to 6 labels per example (this is the number of columns in the predictions variable). We can see in the following diagram how exactly TensorFlow comes up with number 0.4907.

Computation diagram for evaluating avgpre@k.

In the above diagram, we depict different labels with different colors for disambiguation. The predicted labels represent the indices of the predictions variable in descending order of its values. Let’s take example 1 (row 1 of predictions): label 1 is the top recommendation (since it has the highest predictions score, 0.5), label 2 is the second top recommendation with a score of 0.3, label 0 the third top recommendation with a score of 0.1 and so on. Now, the relevant vector would be set to 1 in a position where the predicted label is a true positive and 0, otherwise. Take for instance the relevant vector for example 1. It is [1 0 1 1 0 0]. The first element is 1, because the predicted label in the first position (1) is one of the true labels. The second element is 0, because the predicted label, which is 2, is not a true label. The third element of the relevant vector is 1, because the predicted label in that position (0) is one of the true labels. (As a reminder, the true labels for example 1 are 0, 1, 4). precision@k is defined as the percentage of predictions we get right (when we make k recommendations). If you follow this logic in all of the examples, you will arrive to the number listed in TensorFlow’s output.


Conclusion

Recommender systems play a big part in our everyday lives in ways we do not even realize. Good recommendations drive our decision making, lead us to actions and even help us shape opinions. A good recommendation system needs to be both relevant and novel — personalized enough, yet insightful (to give us this ‘aha’ moment)! For this, we need to use metrics that offer an objective assessment of the recommendations’ output. In this post, we expanded on metrics that examine the relevance of recommendations. We talked about recall@k, precision@k and the need for use of a single objective metric to compare among different models. Lastly, we explained the concept of average precision@k that takes into account not just the relevance but also the relative order of a relevant recommendation and we provided TensorFlow code for all the discussed metrics.