Evaluating Recommender Systems

Part 5 of Evaluation Metrics Series

Saumyadeepta Sen
The Owl
14 min read · Jul 21, 2020


In the previous posts in the Evaluation Metrics series, we discussed only the evaluation metrics used for classification tasks. In this post, we are going to discuss the metrics used to evaluate recommender systems.

Recommender systems can be evaluated through several metrics and offline experiments. These metrics fall into several groups, each with a particular purpose, which we are going to discuss in this post.

Let's go through the most popular metrics for recommender systems. These metrics are used in different situations, and no single one can be said to be better than the others.

First, let us see the methodology for testing recommender systems offline. If we have done machine learning before, we are familiar with the concept of train/test splits. A recommender system is a machine learning system: we train it using user behaviour and then make predictions about what items a user may like. So, on paper at least, we can evaluate a recommender system like any other machine learning system. Here's how we do it:

We first split the data into a training set and a testing set. Usually the training set is much bigger than the testing set. We then train our recommender system using the training set only. This is where it learns the relationships between items and users. Once it's trained, we can ask it to make predictions about how a new user might rate items they have never seen before.
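As a minimal sketch of such a split, assuming ratings are stored as (user, item, rating) tuples and an 80/20 split (the data layout and split ratio here are illustrative assumptions, not anything from the original article):

```python
import random

def train_test_split(ratings, test_fraction=0.2, seed=42):
    """Randomly split (user, item, rating) tuples into train and test sets."""
    rng = random.Random(seed)
    shuffled = list(ratings)            # copy so we don't shuffle the caller's list
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

ratings = [("alice", "Movie A", 5.0), ("bob", "Movie B", 4.0),
           ("carol", "Movie C", 3.0), ("alice", "Movie B", 2.0),
           ("bob", "Movie C", 5.0)]
train, test = train_test_split(ratings)   # train the recommender on `train` only
```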

If we really want to get fancy, it's possible to improve on a single train/test split with a technique called k-fold cross-validation. It's the same idea as train/test, but instead of a single training set we create many randomly assigned training sets. Each individual training set, or fold, is used to train our recommender system independently, and then we measure the accuracy of the resulting systems against our test set. So we end up with a score for how accurately each fold predicts user ratings, and we can average them together. This obviously takes a lot more computing power, but the advantage is that we do not end up overfitting to a single training set. Here's how we do it:
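A hand-rolled sketch of the fold assignment, assuming the same list of rating tuples as above (libraries can do this for us, but this shows the idea):

```python
import random

def k_fold_splits(ratings, k=5, seed=42):
    """Yield (train, test) pairs, one per randomly assigned fold."""
    rng = random.Random(seed)
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]        # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

# Hypothetical usage: train one model per fold, score it on the held-out fold,
# then average the k scores (train_model and evaluate are placeholder names).
# scores = [evaluate(train_model(train), test) for train, test in k_fold_splits(ratings)]
```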

To reiterate, train/test and k-fold cross-validation are ways to measure the accuracy of our recommender system: that is, how accurately we can predict how users rated movies they have already seen and provided a rating for. By using train/test, all we can do is test our ability to predict how people rated movies they already saw. That's not the point of a recommender system. We want to recommend new things to people that they haven't seen but would find interesting, and that is fundamentally impossible to test offline. So researchers who can't just test out new algorithms on real people, on Netflix or Amazon or wherever, have to make do with approaches like this. We haven't yet talked about how to actually come up with an accuracy metric when testing our recommender system. Let's see how we do this:

Accuracy Metrics

The most straightforward metric is mean absolute error or MAE. Here is the fancy, mathematical equation for how to compute it.
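Written out, with y_i as the rating our system predicts, x_i as the rating the user actually gave, and n ratings in the test set:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - x_i\right|$$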

Let's say we have n ratings in our test set that we want to evaluate. For each rating, we can call the rating our system predicts y, and the rating the user actually gave x. We just take the absolute value of the difference between the two to measure the error for that rating prediction; it's literally just the difference between the predicted rating and the actual rating. We sum those errors across all n ratings in our test set and divide by n to get the average, or mean. So mean absolute error is exactly that: the mean of the absolute values of each error in our rating predictions. We want the lowest MAE score, not the highest.
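A minimal sketch of computing it, assuming parallel lists of predicted and actual ratings:

```python
def mae(predicted, actual):
    """Mean absolute error: average of |y - x| over all n test ratings."""
    n = len(predicted)
    return sum(abs(y - x) for y, x in zip(predicted, actual)) / n

print(mae([3.5, 4.0, 2.0], [4.0, 4.0, 1.0]))   # -> 0.5; lower is better
```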

A slightly fancier way to measure accuracy is root mean square error, or RMSE. This is a more popular metric for a few reasons, but one is that it penalizes us more when our rating prediction is way off and penalizes us less when we are reasonably close.

The difference is that instead of summing up the absolute values of each rating prediction error, we sum up the squares of the rating prediction errors. Taking the square ensures we end up with positive numbers, like absolute values do, and it also inflates the penalty for larger errors. When we're done, we take the square root to get back to a number that makes sense. Again, remember that high RMSEs are bad; we want low error scores, not high ones. Now, as we said, accuracy isn't really measuring what we want recommender systems to do. Actual people couldn't care less what our system thinks they should have rated some movie they already saw and rated. Rating predictions themselves are of limited value: we don't really care whether the system thinks we will rate a movie three stars, we care mostly about which movies our system thinks are the best ones for us to go see, and that's a very different problem.
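A matching sketch for RMSE, under the same assumptions as the MAE example:

```python
import math

def rmse(predicted, actual):
    """Root mean square error: sqrt((1/n) * sum of squared errors (y - x)^2)."""
    n = len(predicted)
    return math.sqrt(sum((y - x) ** 2 for y, x in zip(predicted, actual)) / n)

print(rmse([3.5, 4.0, 2.0], [4.0, 4.0, 1.0]))  # ~0.65; squaring penalizes the big miss more
```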

Evaluating Top-N Recommenders

Hit Rate

We generate top-N recommendations for all of the users in our test set. If one of the recommendations in a user's top-N list is something they actually rated, we consider that a hit: we managed to show the user something they found interesting enough to watch on their own already, so we'll consider that a success. We just add up all the hits across the top-N recommendations for every user in our test set and divide by the number of users, and that's our hit rate.
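A minimal sketch, assuming we already have each user's top-N list and the set of items they actually rated in the test set (both dictionaries are assumed inputs):

```python
def hit_rate(top_n_per_user, rated_in_test_per_user):
    """Count a hit whenever a user's top-N list contains something they actually
    rated in the test set, then divide total hits by the number of users."""
    hits = 0
    for user, recs in top_n_per_user.items():
        rated = rated_in_test_per_user.get(user, set())
        if any(item in rated for item in recs):
            hits += 1
    return hits / len(top_n_per_user)
```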

Hit rate itself is easy to understand, but measuring it is a little tricky. We can't use the same train/test or cross-validation approach we used for accuracy, because we're not measuring accuracy on individual ratings; we're measuring the accuracy of top-N lists for individual users. Now, we could do the obvious thing and not split things up at all, just measuring hit rate directly on top-N recommendations created by a recommender system trained on all of the data we have. But technically that's cheating: we generally don't want to evaluate a system using the data it was trained with. We could just recommend the actual top 10 movies rated by each user in the training data and achieve a hit rate of 100%.

A clever way around this is called leave-one-out cross-validation. We compute the top-N recommendations for each user in our training data and intentionally remove one of that user's rated items from the training data. We then test our recommender system's ability to recommend that left-out item in the top-N results it creates for that user in the testing phase. So we measure our ability to recommend, in each user's top-N list, an item that was left out of the training data; that's why it's called leave-one-out. The trouble is that it's a lot harder to get one specific movie right than to just get any one of the recommendations right, so hit rate with leave-one-out tends to be very small and difficult to measure unless you have a very large data set to work with. But it's a much more user-focused metric when we know our recommender system will be producing top-N lists in the real world, which most of them do.
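Under leave-one-out, each user has exactly one held-out item, so the sketch simplifies (again, the input dictionaries are assumptions for illustration):

```python
def leave_one_out_hit_rate(top_n_per_user, left_out_item_per_user):
    """Hit rate where a hit means the single item withheld from a user's
    training data shows up in that user's top-N list."""
    hits = sum(1 for user, recs in top_n_per_user.items()
               if left_out_item_per_user.get(user) in recs)
    return hits / len(top_n_per_user)
```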

Average Reciprocal Hit Rate (ARHR)

A variation on hit rate is average reciprocal hit rate, or ARHR for short. This metric is just like hit rate, but it accounts for where in the top-N list our hits appear, so we get more credit for successfully recommending an item in the top slot than in the bottom slot. Again, this is a more user-focused metric, since users tend to focus on the beginning of lists. The only difference is that instead of summing up the number of hits, we sum up the reciprocal rank of each hit.

So if we get a hit in slot three of our top-N recommendations, it only counts as 1/3, but a hit in slot one receives the full weight of 1.0.
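A sketch under the same leave-one-out setup, where each top-N list is assumed to be ordered best-first:

```python
def average_reciprocal_hit_rate(top_n_per_user, left_out_item_per_user):
    """Like leave-one-out hit rate, but a hit at rank k contributes 1/k,
    so hits near the top of the list are worth more."""
    total = 0.0
    for user, recs in top_n_per_user.items():
        left_out = left_out_item_per_user.get(user)
        if left_out in recs:
            total += 1.0 / (recs.index(left_out) + 1)   # slot 1 -> 1.0, slot 3 -> 1/3
    return total / len(top_n_per_user)
```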

Cumulative Hit Rate(CHR)

Sounds fancy, but all it means is that we throw away hits if our predicted rating is below some threshold. The idea is that we shouldn't get credit for recommending items to a user that we think they won't actually enjoy.
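A sketch where each top-N entry carries its predicted rating so low-confidence hits can be discarded; the 4.0 cutoff and the (item, predicted_rating) layout are assumptions for illustration:

```python
def cumulative_hit_rate(top_n_per_user, left_out_item_per_user, cutoff=4.0):
    """Leave-one-out hit rate that ignores hits whose predicted rating is
    below `cutoff`; each recommendation is an (item, predicted_rating) pair."""
    hits = 0
    for user, recs in top_n_per_user.items():
        left_out = left_out_item_per_user.get(user)
        if any(item == left_out and predicted >= cutoff for item, predicted in recs):
            hits += 1
    return hits / len(top_n_per_user)
```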

Rating Hit Rate (RHR)

Breaking hits down by rating value can be a good way to get an idea of the distribution of ratings for the recommended movies that actually get a hit. Ideally we want to recommend movies that users actually liked, and breaking down the distribution gives us a more detailed sense of how well we're doing. This is called rating hit rate, or RHR for short.
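One way to sketch this is to bucket leave-one-out hits by the rating the user actually gave the held-out item; here `left_out_per_user` is assumed to map each user to an (item, actual_rating) pair:

```python
from collections import defaultdict

def rating_hit_rate(top_n_per_user, left_out_per_user):
    """Hit rate broken down by rating value, giving a distribution of hits."""
    hits, totals = defaultdict(int), defaultdict(int)
    for user, recs in top_n_per_user.items():
        item, actual_rating = left_out_per_user[user]
        totals[actual_rating] += 1
        if item in recs:
            hits[actual_rating] += 1
    return {rating: hits[rating] / totals[rating] for rating in sorted(totals)}
```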

Accuracy isn't the only thing that matters with recommender systems. There are other things we can measure if they're important to us. For example, coverage: the percentage of possible recommendations that our system is able to provide. Think about a data set of movie ratings that contains ratings for several thousand movies; there are plenty of movies in existence that it has no ratings for. If we were using this data to generate recommendations on, say, IMDb, then the coverage of this recommender system would be low, because IMDb has millions of movies in its catalog, not thousands. It's worth noting that coverage can be at odds with accuracy: if we enforce a higher quality threshold on the recommendations we make, we might improve our accuracy at the expense of coverage. Finding the point at which we're better off recommending nothing at all can be delicate. Coverage is also important to watch because it gives us a sense of how quickly new items in our catalog will start to appear in recommendations.
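A minimal sketch of coverage as the share of the full catalog the system can actually recommend, assuming both inputs are collections of item IDs:

```python
def coverage(recommendable_items, catalog_items):
    """Fraction of the catalog for which the system can produce a recommendation."""
    catalog = set(catalog_items)
    return len(set(recommendable_items) & catalog) / len(catalog)
```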

Another metric is called diversity. We can think of this as a measure of how broad a variety of items our recommender system is putting in front of people. An example of low diversity would be a recommender system that just recommends the next books in a series that we’ve started reading, but doesn’t recommend books from different authors, or movies related to what we’ve read. This may seem like a subjective thing, but it is measurable. Many recommender systems start by computing some sort of similarity metric between items, so we can use these similarity scores to measure diversity. If we look at the similarity scores of every possible pair in a list of top-N recommendations, we can average them to get a measure of how similar the recommended items in the list are to each other. We can call that measure S.

Diversity is basically the opposite of average similarity, so we subtract it from 1 to get a number associated with diversity. It’s important to realize that diversity, at least in the context of recommender systems, isn’t always a good thing. We can achieve very high diversity by just recommending completely random items. But those aren’t good recommendations by any stretch of the imagination. Unusually high diversity scores mean that we just have bad recommendations more often than not. We always need to look at diversity alongside metrics that measure the quality of the recommendations as well.
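A sketch for a single top-N list, assuming we can call the recommender's own item-to-item similarity function:

```python
from itertools import combinations

def diversity(top_n, similarity):
    """1 minus the average pairwise similarity S of the recommended items."""
    pairs = list(combinations(top_n, 2))
    avg_similarity = sum(similarity(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - avg_similarity
```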

Similarly, novelty sounds like a good thing, but often it isn't. Novelty is a measure of how popular the items are that we're recommending. And again, just recommending random stuff would yield very high novelty scores, since the vast majority of items are not top sellers. Although novelty is measurable, what to do with it is in many ways subjective.

There's a concept of user trust in a recommender system. People want to see at least a few familiar items in their recommendations that make them say, "Yeah, that's a good recommendation for me. This system seems good." If we only recommend things people have never heard of, they may conclude that our system doesn't really know them, and they may engage less with our recommendations as a result. Also, popular items are usually popular for a reason: they're enjoyed by a large segment of the population, so we would expect them to be good recommendations for the large segment of the population that hasn't read or watched them yet. If we're not recommending some popular items, we should probably question whether our recommender system is really working as it should. This is an important point. We need to strike a balance between familiar, popular items and what we call serendipitous discovery of new items the user has never heard of before. The familiar items establish trust with the user, and the new ones allow the user to discover entirely new things that they might love. Novelty is important, though, because the whole point of recommender systems is to surface items in what we call "the long tail."
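One common way to put a number on novelty is the mean popularity rank of the items we recommend; the `popularity_rank` mapping (rank 1 = most popular) is an assumed input:

```python
def novelty(top_n_per_user, popularity_rank):
    """Average popularity rank of recommended items; higher means more obscure."""
    ranks = [popularity_rank[item]
             for recs in top_n_per_user.values()
             for item in recs]
    return sum(ranks) / len(ranks)
```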

Imagine a plot of the sales of every item in our catalog, sorted by sales, so the number of sales, or popularity, is on the Y axis and all the products are along the X axis. We almost always see an exponential distribution like this: most sales come from a very small number of items, but taken together, the long tail makes up a large amount of sales as well. Items in that long tail are items that cater to people with unique interests. Recommender systems can help people discover those items in the long tail that are relevant to their own unique niche interests. If we can do that successfully, then the recommendations our system makes can help new authors get discovered, can help people explore their own passions, and can make money for whoever we're building the system for as well. Everybody wins. When done right, recommender systems with good novelty scores can actually make the world a better place. But again, we need to strike a balance between novelty and trust.

Another thing we can measure is churn: how often do the recommendations for a user change? In part, churn measures how sensitive our recommender system is to new user behavior. If a user rates a new movie, does that substantially change their recommendations? If so, our churn score will be high. Maybe just showing someone the same recommendations too many times is a bad idea in itself: if a user keeps seeing the same recommendation but doesn't click on it, at some point should we just stop trying to recommend it and show the user something else instead? Sometimes a little bit of randomization in our top-N recommendations can keep them looking fresh and expose our users to more items than they would have seen otherwise. But, just like diversity and novelty, high churn is not in itself a good thing. We could maximize our churn metric by just recommending items completely at random, and of course those would not be good recommendations. All of these metrics need to be looked at together, and we need to understand the trade-offs between them.
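One rough way to sketch churn is the fraction of each user's top-N list that changed between two snapshots in time, averaged over users (both snapshot dictionaries are assumed inputs):

```python
def churn(old_top_n_per_user, new_top_n_per_user):
    """Average fraction of each user's top-N list that was replaced."""
    changes = []
    for user, old in old_top_n_per_user.items():
        new = new_top_n_per_user.get(user, [])
        changes.append(len(set(old) - set(new)) / len(old))
    return sum(changes) / len(changes)
```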

One more metric is responsiveness: how quickly does new user behavior influence our recommendations? If we rate a new movie, does it affect our recommendations immediately, or only the next day after some nightly job runs? More responsiveness would always seem to be a good thing, but in the world of business we have to decide how responsive our recommender really needs to be, since recommender systems with instantaneous responsiveness are complex, difficult to maintain, and expensive to build. We need to strike our own balance between responsiveness and simplicity.

Online A/B tests

Online A/B tests let us tune our recommender system using real customers and measure how they react to the recommendations we present. We can put recommendations from different algorithms in front of different sets of users and measure whether they actually buy, watch, or otherwise indicate interest in what we've recommended. By always testing changes to our recommender system using controlled online experiments, we can see whether they actually cause people to discover and purchase more new things than they would have otherwise. That's ultimately what matters to our business, and it's ultimately what matters to our users, too. None of the metrics we've discussed matter more than how real customers react to the recommendations we produce in the real world. We can have the most accurate rating predictions in the world, but if customers can't find new items to buy or watch from our system, it will be worthless from a practical standpoint. If we test a new algorithm and it's more complex than the one it replaced, we should discard it if it does not result in a measurable improvement in users interacting with the recommendations we present. Online tests can help us avoid introducing complexity that adds no value, and remember, complex systems are difficult to maintain.

So remember: offline metrics such as accuracy, diversity, and novelty are all indicators we can look at while developing recommender systems offline, but we should never declare victory until we've measured a real impact on real users. Systems that look good in an offline setting often fail to have any impact in an online setting, that is, in the real world. User behavior is the ultimate test of our work. There is a real effect where accuracy metrics often tell us that an algorithm is great, only to have it do horribly in an online test.

Perceived Quality

Another thing we can do is simply ask our users whether they think specific recommendations are good. Just like we can ask for explicit feedback on items with ratings, we can ask users to rate our recommendations, too. This is called measuring the "perceived quality" of recommendations, and it seems like a good idea on paper since, as we've learned, defining what makes a "good" recommendation is by no means clear. In practice, though, it's a tough thing to do. Users will probably be confused over whether we're asking them to rate the item or rate the recommendation, so we won't really know how to interpret the data. It also requires extra work from our customers with no clear payoff for them, so we're unlikely to get enough ratings on our recommendations to be useful. It's best to stick with online A/B tests and measure how our customers vote with their wallets on the quality of our recommendations.

Please refer to the article below to learn more about recommender systems:

https://neptune.ai/blog/how-to-test-recommender-system

Hit the clap button if you like the story or think that it will help others!!
