Wouldn’t it be nice if evaluating recommendation system algorithms were as easy as evaluating other machine learning algorithms? If you have spent time building Recommendation Systems, you may agree with the bold statement that developing them is more art than science.

EVALUATION

Let’s dive deeper into the different ways we can evaluate our Recommendation System.

There are two important sub-processes for evaluation:

  1. Offline Evaluation
  2. Online A/B Testing

Beware, big spoiler ahead!

NONE OF THE METRICS WE WILL DISCUSS MATTER MORE THAN HOW REAL CUSTOMERS REACT TO THE RECOMMENDATIONS YOU PRODUCE IN THE REAL WORLD!

Remember that offline metrics such as accuracy, diversity, and novelty (which we will discuss below) are all indicators you can look at while developing Recommendation Systems offline, but never declare victory until you measure real impact on real users. Sometimes they don’t work at all! User behavior is the ultimate test of our work.

A real challenge is that accuracy metrics often tell you your algorithm is great, only for it to do HORRIBLY in an online A/B test. YouTube has studied this and calls it “THE SURROGATE PROBLEM”.

There is more art than science in the development of Recommendation Systems.

At the end of the day, online A/B tests are the only evaluation you can fully trust for your Recommendation System.

Then why do we even try to evaluate our recommender systems offline?

Because offline metrics are the ONLY indicators you can look at while developing your recommendation system. Relying only on online metrics means collecting user behavior for every change, which is expensive and time-consuming. Moreover, if users are continuously asked for feedback, they may become hesitant to use our platform, or stop using it altogether. Good accuracy on offline metrics followed by good online A/B scores is what you should be looking for.

We will look at the different ways we can evaluate our Recommendation System offline in the subsequent sections of this article. Online A/B tests will not be covered, as their results are completely dependent on user behavior and users' state of mind.

THE TWO DIFFERENT WAYS WE CAN TEST OFFLINE:

  1. Train-Test Split
  2. K-Fold Cross-Validation
Source: HT2014 Tutorial Evaluating Recommender Systems — Ensuring Replicability of Evaluation
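
To make the two approaches concrete, here is a minimal sketch using scikit-learn, assuming a hypothetical ratings table with one (user_id, item_id, rating) row per interaction; in practice you would fit your recommender inside each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

# Hypothetical ratings table: one (user_id, item_id, rating) row per interaction.
ratings = np.array([
    [1, 10, 4.0], [1, 11, 3.5], [2, 10, 5.0],
    [2, 12, 2.0], [3, 11, 4.5], [3, 13, 3.0],
])

# 1. Train-Test Split: hold out 20% of the interactions for evaluation.
train, test = train_test_split(ratings, test_size=0.2, random_state=42)

# 2. K-Fold Cross-Validation: every interaction takes a turn in the test fold.
kf = KFold(n_splits=3, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(ratings)):
    train_fold, test_fold = ratings[train_idx], ratings[test_idx]
    # Fit your recommender on train_fold and score it on test_fold here.
    print(f"fold {fold}: {len(train_fold)} train rows, {len(test_fold)} test rows")
```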

Accuracies in the above methods depend on historical data and try to predict what actual users have already seen. If the collected data is too old, then however high the accuracies may be, they won’t mean much: your interests a year ago will not be the same as your interests a year from now!

Offline metrics do not measure exactly what we want, and it is not recommended to rely completely on them.

DIFFERENT OFFLINE METRICS

The different offline metrics and other measures that define our Recommendation System are listed below. Don’t let the terms scare you; we will study each of them separately and even develop the respective Python code.

  1. Mean Absolute Error (MAE)
  2. Root Mean Square Error (RMSE)
  3. Hit Rate (HR)
  4. Average Reciprocal Hit Rank (ARHR)
  5. Cumulative Hit Rate (cHR)
  6. Rating Hit Rate (rHR)
  7. Coverage
  8. Diversity
  9. Novelty
  10. Churn
  11. Responsiveness

Let us study them one after another…

1. MEAN ABSOLUTE ERROR (MAE)

It is the average of the absolute differences between the actual values (ratings) and the predicted values. As you have probably heard a lot about this metric, I won’t cover it in depth. All you need to know is that the lower the MAE, the better the model.

(Image source: Dataquest)
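
As a quick reference, here is a minimal NumPy sketch of MAE on a few made-up ratings:

```python
import numpy as np

def mae(actual, predicted):
    """Mean Absolute Error: the average magnitude of the rating prediction errors."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.mean(np.abs(actual - predicted))

# Toy example with made-up actual vs. predicted ratings.
print(mae([4, 3, 5, 2], [3.5, 3.0, 4.0, 2.5]))  # 0.5
```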

2. ROOT MEAN SQUARED ERROR (RMSE)

RMSE is similar to MAE; the only difference is that the residuals (the differences between actual and predicted ratings) are squared and averaged, and the square root of that average is taken for comparison.

The advantage of using RMSE over MAE is that it penalizes large errors more heavily. (Note that RMSE is always greater than or equal to MAE.)

(Image source: Includehelp.com)
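
A matching NumPy sketch for RMSE, on the same made-up ratings as above, shows the squaring step and the fact that RMSE never drops below MAE:

```python
import numpy as np

def rmse(actual, predicted):
    """Root Mean Squared Error: residuals are squared, so large errors are penalized more."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.sqrt(np.mean((actual - predicted) ** 2))

# Same toy ratings as in the MAE sketch; RMSE (~0.612) >= MAE (0.5).
print(rmse([4, 3, 5, 2], [3.5, 3.0, 4.0, 2.5]))
```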

The above two metrics are well known in the field of data science and machine learning, which leaves me with little more to say about them. One thing to note, though, is that they are not complete in themselves for Recommendation Systems, i.e., an RMSE value of 0.8766 for an algorithm doesn’t mean anything until there is another algorithm with its own RMSE value against which we can compare.

To be honest, MAE or RMSE doesn’t matter much in the real world. What matters most is which movies you put in front of a user in the top-N recommendations and how users react to them.

In 2006, Netflix offered a $1M prize in a competition to improve its recommendation system, judged on RMSE score. In hindsight, it might have been better to use Hit Rate instead of RMSE.

3. HIT RATE

HIT RATE = (HITS IN TEST) / (NUMBER OF USERS)

Hit Rate is a better alternative to MAE or RMSE. But it is good to remember that however good the alternative may be, we are still not measuring exactly what we want, i.e., we are predicting on historical data, not future data (which is possible only through online A/B tests).

To measure Hit Rate, we first generate top-N recommendations for every user in our test data set. If the generated top-N recommendations contain something the user actually rated, that counts as one hit!

The greater the Hit Rate, the better our Recommendation System.

Note: While computing Hit Rate, priority is given to the top-N list as a whole, not to the individual user.

Hit Rate is easy to understand, but measuring it is tricky. The best way to measure it is Leave-One-Out Cross-Validation.

LEAVE ONE OUT CROSS VALIDATION: We compute the top-N recommendation list for each user in the training data and intentionally remove one of that user’s rated items from the training data. We then test our Recommender System’s ability to recommend that intentionally removed item during the testing phase (see the sketch after the challenges below).

CHALLENGES:

  • It is harder to get one specific item right than to get any one of N recommendations right, especially when there are a huge number of items to deal with.
  • Hit Rate under Leave-One-Out Cross-Validation tends to be VERY SMALL and difficult to measure unless we have a VERY LARGE DATASET.
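
Here is a minimal sketch of the hit-rate computation, assuming you have already run leave-one-out cross-validation and have, for each user, the single held-out item plus the top-N list your recommender produced (both data structures here are hypothetical):

```python
def hit_rate(left_out_items, top_n_lists):
    """Fraction of users whose held-out item shows up in their top-N list.

    left_out_items: dict mapping user_id -> the single item held out for that user
    top_n_lists:    dict mapping user_id -> list of recommended item ids (length N)
    """
    hits = sum(1 for user, item in left_out_items.items()
               if item in top_n_lists.get(user, []))
    return hits / len(left_out_items)

# Toy example: only u1's held-out item appears in their list -> hit rate of 1/3.
left_out = {"u1": "i9", "u2": "i4", "u3": "i7"}
top_n = {"u1": ["i1", "i9", "i3"], "u2": ["i5", "i6", "i8"], "u3": ["i2", "i3", "i5"]}
print(hit_rate(left_out, top_n))
```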

4. AVERAGE RECIPROCAL HIT RANK (ARHR)

It is a variation of Hit Rate

  • The difference is that we sum up the reciprocal of the rank of each hit.
  • It accounts for “where” in the top-N list our hits appear.
  • We get more credit for successfully recommending an item in the top slot than in the bottom slot.
  • As it takes rankings into account, higher is better.

Note: It is a more user-focused metric (in contrast to Hit Rate), since users tend to focus on the beginning of the list.
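
A small sketch of ARHR, using the same hypothetical held-out-item and top-N data structures as in the hit-rate example:

```python
def average_reciprocal_hit_rank(left_out_items, top_n_lists):
    """Sum 1/rank over hits, so a hit in slot 1 earns more credit than a hit in slot 10."""
    total = 0.0
    for user, item in left_out_items.items():
        recs = top_n_lists.get(user, [])
        if item in recs:
            total += 1.0 / (recs.index(item) + 1)  # ranks are 1-based
    return total / len(left_out_items)

# Toy example: only u1 hits, at rank 2 -> (1/2) / 3 users ≈ 0.167.
left_out = {"u1": "i9", "u2": "i4", "u3": "i7"}
top_n = {"u1": ["i1", "i9", "i3"], "u2": ["i5", "i6", "i8"], "u3": ["i2", "i3", "i5"]}
print(average_reciprocal_hit_rank(left_out, top_n))
```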

5. CUMULATIVE HIT RATE

  • In this variation, we throw away hits whose predicted ratings fall below some threshold.
  • The main idea is that we should not get credit for recommending items to a user that they won’t enjoy.
  • Confined to ratings above a certain threshold; higher is better. (A small sketch follows this list.)
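
A sketch of the idea, assuming the held-out items now carry the rating used for the cutoff (whether you cut on the actual or the predicted rating is a design choice):

```python
def cumulative_hit_rate(left_out, top_n_lists, rating_cutoff=4.0):
    """Hit rate that only credits hits whose rating clears the cutoff.

    left_out: dict mapping user_id -> (held_out_item_id, rating used for the cutoff)
    """
    hits = 0
    for user, (item, rating) in left_out.items():
        if rating >= rating_cutoff and item in top_n_lists.get(user, []):
            hits += 1
    return hits / len(left_out)
```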

6. RATING HIT RATE

Another way to look at Hit Rate is to break it down by predicted rating score.
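
A sketch of this breakdown, again assuming a hypothetical mapping from each user to their held-out item and the rating score you want to bucket by:

```python
from collections import defaultdict

def rating_hit_rate(left_out, top_n_lists):
    """Hit rate broken down by rating score of the held-out item.

    left_out: dict mapping user_id -> (held_out_item_id, rating score to bucket by)
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for user, (item, rating) in left_out.items():
        totals[rating] += 1
        if item in top_n_lists.get(user, []):
            hits[rating] += 1
    return {rating: hits[rating] / totals[rating] for rating in sorted(totals)}
```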

Note: Small improvements in RMSE can lead to large improvements in Hit Rate, which means RMSE matters as well. But it turns out we can also build a Recommendation System with a high Hit Rate and a poor RMSE, so the two aren’t always related.

Beyond accuracy, other factors determine the quality of our Recommender Systems. Some of them are covered below.

DIMENSIONS AND METRICS FOR EVALUATION (https://www.slideshare.net/xlvector/recommender-system-introduction-12551956)

7. COVERAGE

It is the percentage of possible recommendations, i.e., <user, item> pairs, for which the system is able to make a prediction. (A small sketch appears at the end of this section.)

  • Coverage can be at odds with accuracy.

If you enforce a higher quality threshold on the recommendations you make, you might improve accuracy at the expense of coverage. Find the balance.

  • It is also a measure of how quickly new items will start to appear in our recommendation lists (similar to churn and responsiveness, yet different).

For example, a new book can’t enter a recommendation list until someone buys it and rating patterns are generated.

  • Measures the ability of the system to recommend long-tail items
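
A minimal sketch of prediction coverage, assuming a hypothetical known_scores dict standing in for whatever prediction call your model exposes:

```python
def prediction_coverage(known_scores, users, items):
    """Fraction of possible <user, item> pairs the system can actually score.

    known_scores is a hypothetical dict {(user, item): predicted_rating}; in a real
    system you would call your model's prediction method instead.
    """
    total = len(users) * len(items)
    covered = sum(1 for u in users for i in items if (u, i) in known_scores)
    return covered / total

# Toy example: 4 of the 6 possible pairs are predictable -> coverage ≈ 0.67.
known = {("u1", "i1"): 4.2, ("u1", "i2"): 3.1, ("u2", "i1"): 4.8, ("u2", "i3"): 2.5}
print(prediction_coverage(known, ["u1", "u2"], ["i1", "i2", "i3"]))
```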

8. DIVERSITY

It is a measure of how broad a variety of items our Recommender System puts in front of people.

Diversity = 1 - Similarity

  • Diversity and similarity between recommendation pairs are opposites of each other.

One thing to be careful about is that we can achieve high diversity simply by recommending completely random things. Hence, high diversity is not always a good thing: unusually high diversity often signals bad recommendations.
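
A sketch of the Diversity = 1 - Similarity idea, assuming a hypothetical item_vectors dict of item embeddings (e.g. genre vectors or latent factors) and using cosine similarity between every pair of recommended items:

```python
import numpy as np
from itertools import combinations

def diversity(top_n_items, item_vectors):
    """1 minus the average pairwise cosine similarity of the items in one top-N list."""
    sims = []
    for a, b in combinations(top_n_items, 2):
        va, vb = item_vectors[a], item_vectors[b]
        sims.append(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))
    return 1.0 - float(np.mean(sims))

# Toy genre-style vectors for three recommended items.
vectors = {"i1": np.array([1.0, 0.0]), "i2": np.array([0.9, 0.1]), "i3": np.array([0.0, 1.0])}
print(diversity(["i1", "i2", "i3"], vectors))
```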

9. NOVELTY

Sounds like a good thing, but it isn’t always.

It takes into consideration how popular the items you are recommending are. Popular items are popular for a reason and hence are often good recommendations. But there is a challenge to this: user trust.

People need to see things they are familiar with to believe that the Recommendation System is making good recommendations. Otherwise, they may decide the recommender is bad and stop engaging with it, or worse, make fun of it on social media.

The Long Tail: In the classic long-tail chart, the x-axis corresponds to products and the y-axis to the popularity of each product. The shaded region is called “the long tail”; the leftmost products are far more popular than the rightmost ones.

(Image source: https://medium.com/daphni-chronicles)

Recommendation systems with good novelty scores can make the world a better place. But it is more important to strike a balance between novelty and trust. That is why this is a bit of an art.
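
One common way to put a number on novelty is the mean popularity rank of the items you recommend; here is a sketch, assuming a hypothetical popularity_rank dict where rank 1 is the most popular item in the catalog:

```python
def novelty(top_n_lists, popularity_rank):
    """Average popularity rank of all recommended items; higher means more obscure items."""
    ranks = [popularity_rank[item]
             for recs in top_n_lists.values()
             for item in recs]
    return sum(ranks) / len(ranks)

# Toy example: u1 gets mostly popular items, u2 gets long-tail items.
popularity = {"i1": 1, "i2": 2, "i3": 3, "i8": 80, "i9": 95}
top_n = {"u1": ["i1", "i2", "i3"], "u2": ["i8", "i9", "i3"]}
print(novelty(top_n, popularity))  # (1+2+3+80+95+3) / 6 ≈ 30.7
```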

10. CHURN

How often do recommendations change?

If a user rates a new movie, do their recommendations change substantially? If yes, your churn score is high.

Showing the same recommendations all the time is a BAD IDEA. But just as with diversity and novelty, a high churn score is NOT automatically a good thing: you could recommend items at random and still get a high churn score. Hence, churn alone is not enough.
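
A rough way to quantify churn is to compare each user's top-N list between two runs of the system and see how much of it changed; a sketch with hypothetical before/after lists:

```python
def churn(previous_top_n, current_top_n):
    """Average fraction of each user's top-N list that changed between two runs."""
    changes = []
    for user, old in previous_top_n.items():
        new = set(current_top_n.get(user, []))
        changes.append(len(set(old) - new) / len(old))
    return sum(changes) / len(changes)

# Toy example: u1's list is unchanged, u2's list is two-thirds new -> churn ≈ 0.33.
before = {"u1": ["i1", "i2", "i3"], "u2": ["i4", "i5", "i6"]}
after = {"u1": ["i1", "i2", "i3"], "u2": ["i4", "i7", "i8"]}
print(churn(before, after))
```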

Note: Metrics must be looked at together and we must understand the trade-offs between them

11. RESPONSIVENESS

It measures “how quickly new user behavior influences recommendations”.

It may seem similar to the churn score, but the key difference is that responsiveness is measured in time (how quickly changes are made), whereas churn measures how often, i.e., the number of times recommendations change in a given time interval.

High responsiveness is a good thing, but in the world of business you must decide how responsive your Recommendation System really needs to be.

Note: An instantaneously responsive framework is complex, difficult to maintain, and expensive to build. Strike a balance between responsiveness and simplicity.

WHAT TO FOCUS ON — WHICH METRIC?

Given that we have covered various metrics and dimensions for evaluating our Recommendation System, you might be wondering which metric is the best. Well, it depends. There are many factors to consider before giving priority to one metric over another. Metrics must be looked at together, and we must understand the trade-offs between them. We must also focus on the requirements and the main objective for building the Recommendation System.

And most importantly, we must never forget the big spoiler —

NONE OF THE METRICS THAT WE DISCUSSED MATTER MORE THAN HOW REAL CUSTOMERS REACT TO THE RECOMMENDATIONS YOU PRODUCE IN THE REAL WORLD!

There is more art than science in building a Recommendation System, and remember the “SURROGATE PROBLEM”. In the end, user behavior is the ultimate test of our work, so always run online A/B tests.

Python code implementing each of the above metrics can be found in the Kaggle kernel (here).
