Evaluating Recommender Systems

Metrics for evaluating Recommender Systems

Published in Nerd For Tech, Jan 27, 2021

So welcome back to my new blog on Evaluating Recommendation Systems. Today we are gonna dive into various metrics that help evaluate modern recommender systems. For quite some time, I have wanted to step away from NLP and text processing and explore other fields of Machine Learning. Recommendation systems were at the top of my learning list, so I went on to buy a Udemy course named Building-recommender-systems-with-machine-learning-and-ai by Frank Kane. I have completed about a third of the course and it seems like a really good way to dive into recommender systems. This blog is mostly inspired by what I learned from that course. Let’s begin 😇.

We will start with the very basic metrics which measure the accuracy of recommender systems.

Mean absolute error (MAE)

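$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - x_i \rvert$$

This is the most straightforward evaluation metric, known as mean absolute error. The above is a fancy equation for it, but it is literally the average, over all n ratings, of the absolute difference between the rating y_i a user actually gave a movie and the rating x_i our system predicted.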
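To make that concrete, here is a minimal Python sketch of MAE. The function name and the ratings are made up for illustration (assumed to be on a 1 to 5 star scale); they are not from the course.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty rating lists of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical ratings on a 1-5 star scale (illustrative data only)
actual_ratings = [4.0, 3.0, 5.0, 1.0]
predicted_ratings = [3.5, 3.0, 4.0, 3.0]

mae = mean_absolute_error(actual_ratings, predicted_ratings)
print(f"MAE: {mae}")
```

Here the individual errors are 0.5, 0, 1 and 2 stars, so the MAE comes out to 3.5 / 4 = 0.875.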

Root mean square error (RMSE)

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - x_i)^2}$$

This is another common, and perhaps the most popular, evaluation metric. One reason is that each error is squared before averaging, so compared to MAE it penalises you less when a prediction is close to the actual rating and much more when it is far off.
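A small Python sketch shows that difference in penalty. The two prediction lists below are invented so that both have the same MAE; only the way the error is spread out differs.

```python
import math


def root_mean_square_error(actual, predicted):
    """Square root of the average squared gap between actual and predicted ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty rating lists of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical 1-5 star ratings (illustrative data only)
actual_ratings = [5.0, 5.0, 5.0, 5.0]
many_small_misses = [4.0, 4.0, 4.0, 4.0]  # every prediction is off by 1 star
one_big_miss = [5.0, 5.0, 5.0, 1.0]       # a single prediction is off by 4 stars

rmse_small = root_mean_square_error(actual_ratings, many_small_misses)
rmse_big = root_mean_square_error(actual_ratings, one_big_miss)
print(f"RMSE, many small misses: {rmse_small}")
print(f"RMSE, one big miss: {rmse_big}")
```

Both prediction lists have an MAE of exactly 1 star, but the RMSE is 1.0 for the small misses and 2.0 for the single big miss.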

Actually, we do not really evaluate any modern recommender system on accuracy alone. A recommender system could not care less how a user would have rated a certain movie 🧐. What matters is what it puts in front of users in a top-N recommendation list and how those users react to the movies they see recommended 😮. (The reason accuracy became a major evaluation benchmark is the Netflix Prize, a 1 million dollar competition that used RMSE to evaluate recommender systems. Even though the prize was awarded, the winning system was never adopted by Netflix.)

So if recommender systems should not focus only on accuracy, what should they focus on? 🤔

The primary task is Top-N recommendation, which means a recommender system's job is to produce a finite list of the best things to present to a given person.

The following metrics are used for evaluating a recommendation system based on Top-N recommendations.

Hit rate

This is a simple metric. First, you generate top-N recommendations for a user. If one of the items in that user's top-N recommendations is something they actually rated, you consider that a hit. The system managed to show the user something they found interesting enough to watch on their own already, so we’ll consider that a success.

So to calculate it, we add up all the hits in the top-N recommendations across all users and divide by the number of users.
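Here is a rough Python sketch of that calculation. It assumes one rated item per user has been held out from training and counts a hit when that item shows up in the user's top-N list; the hold-out setup, the names and the data are my assumptions for illustration.

```python
def hit_rate(top_n_by_user, held_out_by_user):
    """Fraction of users whose held-out item appears in their top-N list."""
    if not top_n_by_user:
        raise ValueError("need at least one user")
    hits = sum(
        1
        for user, top_n in top_n_by_user.items()
        if held_out_by_user[user] in top_n
    )
    return hits / len(top_n_by_user)


# Hypothetical top-3 lists and one held-out rated item per user
top_n_by_user = {
    "alice": ["A", "B", "C"],
    "bob": ["D", "E", "F"],
    "carol": ["B", "G", "H"],
    "dave": ["A", "E", "G"],
}
held_out_by_user = {"alice": "B", "bob": "Z", "carol": "H", "dave": "Q"}

rate = hit_rate(top_n_by_user, held_out_by_user)
print(f"Hit rate: {rate}")
```

Two of the four users (alice and carol) get a hit, so the hit rate is 0.5.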

Average reciprocal hit rate (ARHR)

This is a variation of hit rate, but it accounts for where in the Top-N list your hits appear. Instead of counting each hit as 1, we add up the reciprocal of its rank (1 for the first slot, 1/2 for the second, 1/3 for the third, and so on) and divide by the number of users. So we get more credit for recommending items in the top slot than in the bottom slot. This metric is more user-focused. If the user has to scroll down to see a lower item in your Top-N list, then it makes sense to penalise recommendations that appear too low in the list, as the user has to work to find them.
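A minimal Python sketch of ARHR, under the same assumption as before: one held-out rated item per user, with ranks starting at 1 for the top slot. The names and data are hypothetical.

```python
def average_reciprocal_hit_rate(top_n_by_user, held_out_by_user):
    """Sum of 1/rank over all hits, divided by the number of users."""
    if not top_n_by_user:
        raise ValueError("need at least one user")
    total = 0.0
    for user, top_n in top_n_by_user.items():
        target = held_out_by_user[user]
        if target in top_n:
            rank = top_n.index(target) + 1  # 1-based position in the list
            total += 1.0 / rank
    return total / len(top_n_by_user)


# Hypothetical top-3 lists and one held-out rated item per user
top_n_by_user = {
    "alice": ["A", "B", "C"],
    "bob": ["D", "E", "F"],
    "carol": ["B", "G", "H"],
    "dave": ["A", "E", "G"],
}
held_out_by_user = {"alice": "B", "bob": "Z", "carol": "H", "dave": "Q"}

arhr = average_reciprocal_hit_rate(top_n_by_user, held_out_by_user)
print(f"ARHR: {arhr:.4f}")
```

Alice's hit is in slot 2 (worth 1/2) and Carol's is in slot 3 (worth 1/3), so the ARHR is (1/2 + 1/3) / 4, roughly 0.208, compared with a plain hit rate of 0.5 on the same data.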

There are several other things that matter for a recommendation system. Now let's check those out.

Coverage

In simple words, it's the percentage of (user, item) pairs for which a prediction can be made, or the percentage of possible recommendations that the recommender system is able to provide. For example, think about the MovieLens data set of movie ratings. It contains ratings for several thousand movies, but there are plenty of movies in existence that it doesn’t have ratings for.

Therefore, if we used this data to recommend movies on IMDb, which lists millions of titles, the coverage would be quite low.

It’s worth noting that coverage can be at odds with accuracy. If you enforce a higher quality threshold on the recommendations you make, then you might improve your accuracy at the expense of coverage.
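The sketch below takes the (user, item) reading of coverage and shows the trade-off with a quality threshold. The `min_rating` parameter, the function name and the predicted ratings are assumptions made up for this example.

```python
def prediction_coverage(predicted, users, items, min_rating=None):
    """Fraction of (user, item) pairs with a prediction, optionally above a threshold."""
    total_pairs = len(users) * len(items)
    if total_pairs == 0:
        raise ValueError("need at least one user and one item")
    covered = 0
    for user in users:
        for item in items:
            rating = predicted.get((user, item))
            if rating is None:
                continue  # the system cannot predict this pair at all
            if min_rating is None or rating >= min_rating:
                covered += 1
    return covered / total_pairs


# Hypothetical predicted ratings; missing pairs are ones the system cannot score
users = ["alice", "bob"]
items = ["A", "B", "C", "D"]
predicted = {
    ("alice", "A"): 4.5,
    ("alice", "B"): 3.0,
    ("alice", "C"): 2.0,
    ("bob", "A"): 4.0,
    ("bob", "D"): 3.5,
}

coverage_all = prediction_coverage(predicted, users, items)
coverage_strict = prediction_coverage(predicted, users, items, min_rating=4.0)
print(f"Coverage, any prediction: {coverage_all}")
print(f"Coverage, predictions of 4+ stars only: {coverage_strict}")
```

With no threshold, 5 of the 8 pairs are covered (0.625). Keeping only predictions of 4 stars or more drops coverage to 2 of 8 (0.25).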

Diversity

Think of this metric as how broad a variety of items your recommender system is showing to users.

Suppose you watch a James Bond movie. Low diversity would be a recommender system that just recommends the next movies in the James Bond series, and never other movies that are not part of the series but still belong to the same genre.

Very high diversity is also not always good. Completely random items have high diversity but those are not very good recommendations. You also need to check diversity alongside some other metric that measures the quality of recommendations as well.
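One common way to put a number on this, sketched below in Python, is 1 minus the average similarity between every pair of items in a recommendation list. Using Jaccard similarity over genre tags is my assumption for the example, and the genre tags themselves are only illustrative.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap between two tag sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def diversity(recommended, genres_by_item):
    """1 minus the average pairwise similarity of the recommended items."""
    pairs = list(combinations(recommended, 2))
    if not pairs:
        raise ValueError("need at least two recommended items")
    average_similarity = sum(
        jaccard(genres_by_item[x], genres_by_item[y]) for x, y in pairs
    ) / len(pairs)
    return 1.0 - average_similarity


# Illustrative genre tags (assumed for this example)
genres_by_item = {
    "Goldfinger": {"action", "spy"},
    "Skyfall": {"action", "spy"},
    "Casino Royale": {"action", "spy"},
    "The Bourne Identity": {"action", "spy", "thriller"},
    "Toy Story": {"animation", "family"},
}

bond_only = diversity(["Goldfinger", "Skyfall", "Casino Royale"], genres_by_item)
mixed = diversity(["Skyfall", "The Bourne Identity", "Toy Story"], genres_by_item)
print(f"Diversity, Bond only: {bond_only}")
print(f"Diversity, mixed list: {mixed:.3f}")
```

The Bond-only list scores 0 because every pair has identical tags, while the mixed list scores 7/9, roughly 0.778. A high score on its own says nothing about whether the mixed list is a good recommendation.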

Novelty

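Novelty, in the case of recommender systems, measures how popular the recommended items are, typically as the mean popularity rank of the recommended items. The more obscure the items (that is, the larger their mean rank number), the higher the novelty.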

And again, just recommending random stuff would yield very high novelty scores since the vast majority of items are not top sellers. Although novelty is measurable, what to do with it is in many ways subjective.
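A minimal Python sketch of novelty as mean popularity rank. Ranking items by how many ratings they have received is an assumption for this example, and the counts are made up.

```python
def popularity_ranks(rating_counts):
    """Rank items by number of ratings: rank 1 is the most popular item."""
    ordered = sorted(rating_counts, key=rating_counts.get, reverse=True)
    return {item: rank for rank, item in enumerate(ordered, start=1)}


def novelty(recommended, ranks):
    """Mean popularity rank of the recommended items (higher means more obscure)."""
    if not recommended:
        raise ValueError("need at least one recommended item")
    return sum(ranks[item] for item in recommended) / len(recommended)


# Hypothetical number of ratings per item
rating_counts = {"A": 500, "B": 300, "C": 120, "D": 40, "E": 5}
ranks = popularity_ranks(rating_counts)

popular_list = novelty(["A", "B"], ranks)
obscure_list = novelty(["D", "E"], ranks)
print(f"Novelty, popular list: {popular_list}")
print(f"Novelty, obscure list: {obscure_list}")
```

The list of best sellers gets a mean rank of 1.5, while the list of rarely rated items gets 4.5, so the second list counts as more novel.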

There’s a concept of user trust in a recommender system. People want to see at least a few familiar items in their recommendations.

If we only recommend things people have never heard of, they may conclude that our system doesn’t really know them, and they may engage less with our recommendations as a result. Also, popular items are usually popular for a reason. They’re enjoyed by a large segment of the population, so you would expect them to be good recommendations for the many people who haven’t read or watched them yet.

We need to strike a balance between familiar, popular items and what we call the serendipitous discovery of new items the user has never heard of before. The familiar items establish trust with the user, and the new ones allow the user to discover entirely new things that they might love.

That’s all for this blog. Hope you liked it. If you have any suggestions, please leave a comment. 😁