TPS Report for Recommender Systems? Yeah, That Would be Great.

Standard Performance Metrics in Recommender Systems

Tiffany Jaya
Gab41

--

Here at Lab 41, we launched a project called Hermes to look into recommender systems as a means to answer the question, “How can we help analysts connect the dots better across a number of different sources and data types?” My colleagues have already spent a great deal of effort sharing their work in this space: introducing recommender systems in general, going over the datasets we have used to apply recommender system algorithms, and discussing non-standard performance metrics we can use to compare these algorithms. This blog post focuses on the standard performance metrics that we use to compare different recommender system algorithms.

Think of a recommender system as fulfilling an information retrieval task: retrieving items in an ordered list for a particular user. In the scientific world, the standard practice for quantifying how well a system retrieves this information is to measure accuracy and precision. Accuracy describes how close you are to the correct result, while precision describes how consistently you get the same result.

We will discuss in more detail exactly what accuracy and precision mean in the world of recommender systems. For now, understand that we will use accuracy and precision, along with other performance metrics, to quantify the prediction error of a recommender system. We will look at Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) to determine how well a recommender system performs when predicting a non-binary rating, for example, what rating a user will give a movie on a scale from 1 to 5. We will also delve into metrics for recommender systems that work only with binary ratings, such as differentiating a movie a user is likely to watch from one a user will fall asleep in. In addition, we will define recall and F-score for supplementary insight into recommendations on binary values. And what better way to learn about these performance metrics than to get your hands dirty implementing and using them in Spark! At the edge of your seat already? Well, I recommend that you read on!

Predictive Performance Metrics

There is a plethora of recommendation algorithms that you can apply to a dataset. We view recommendations as a supervised learning problem where the task is to predict users’ ratings of items, interactions between users and items, or even links between users and items. A model trained on a labeled set of user ratings or interactions can be used to make predictions (i.e. recommendations) on a separate set of inputs similar to the training set.

With as many algorithms to choose from as there are mobile dating apps, how can we be sure that we apply the right performance metrics for comparing algorithms? After all, with recommenders as with online dating, being choosy makes a difference!

The answer is…it depends on the dataset in question. If the algorithm predicts a non-binary value, like what rating a user will give a movie on a scale of 1 to 5, we can use RMSE or MAE. RMSE and MAE calculate the distance between the predicted value given by the algorithm and the actual value given by the user. Since accuracy is determined by how close the prediction is to the correct result, the lower the error, the closer the prediction is to the actual value. In other words, with RMSE or MAE, we can determine how well a recommender system algorithm reproduces a user’s ratings.

Root Mean Squared Error and Mean Absolute Error

Root Mean Squared Error (RMSE) can be described as

RMSE = sqrt( (1/n) * Σ (p_i - r_i)^2 )

where p_i is the predicted value, r_i is the actual value, and n is the number of items to be predicted.

Subtracting r_i from p_i quantifies how far off the predicted value is from the actual value. Squaring the difference, (p_i - r_i)^2, keeps the error positive regardless of whether the predicted or the actual value is higher. Summing all of the squared differences and dividing by the number of items to be predicted gives us the average squared prediction error. Since squaring the difference changes the scale, we bring the scale back down by taking the square root of the entire expression. RMSE, therefore, describes how far off, on average, the predicted values are from the actual values.

Mean Absolute Error, or MAE, can be described as

MAE = (1/n) * Σ |p_i - r_i|

The only difference between RMSE and MAE is in how each keeps the error positive: RMSE squares the residual while MAE takes its absolute value. Since it does not square the difference, MAE places less emphasis on large deviations. MAE, therefore, punishes large errors much less severely than RMSE does.
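To see the difference concretely, here is a minimal plain-Python sketch (not part of Hermes; the ratings are made up for illustration) that computes both metrics on a handful of predictions, one of which misses badly. The single large miss inflates RMSE far more than MAE.

from math import sqrt

# Hypothetical (predicted, actual) rating pairs; the last prediction misses by 4 points.
pairs = [(4.0, 4.5), (3.0, 3.0), (5.0, 4.5), (1.0, 5.0)]

n = len(pairs)
rmse = sqrt(sum((p - r) ** 2 for p, r in pairs) / n)  # ~2.03, dominated by the large miss
mae = sum(abs(p - r) for p, r in pairs) / n           # 1.25, each point of error counts equally

print(rmse, mae)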

Implementation of RMSE and MAE in Spark

If you are not familiar with the use of resilient distributed datasets (RDD) in Spark, please check out my blog post on the topic as I will be using RDDs throughout this section.

Let’s consider the case of a recommender system trying to predict what rating a user will assign to a movie. We have an RDD called y_predicted that the recommender system algorithm outputs. y_predicted is in the form [(user_id, movie_id, predicted_rating)]; it lists the predicted rating for each user and movie pair. y_actual is the RDD that holds the actual rating for each user and movie pair; its format is [(user_id, movie_id, actual_rating)]. To implement RMSE in Spark, we first have to reformat y_predicted and y_actual so that the (user_id, movie_id) pair is used as the key.

# Key each rating by its (user_id, movie_id) pair.
y_predicted_reformat = y_predicted.map(
    lambda x: ((x[0], x[1]), x[2])  # ((user_id, movie_id), predicted_rating)
)
y_actual_reformat = y_actual.map(
    lambda x: ((x[0], x[1]), x[2])  # ((user_id, movie_id), actual_rating)
)

Once you have reformatted the RDDs, you can join them so that you have both predicted_rating and actual_rating in the same RDD and can compute their squared difference.

# Join on (user_id, movie_id) and square the difference between the paired ratings.
ratings_diff_sq = y_predicted_reformat.join(y_actual_reformat) \
    .map(lambda kv: (kv[1][0] - kv[1][1]) ** 2)  # (predicted_rating - actual_rating) ** 2

You determine the average of the squared prediction errors by adding all the squared differences together with the reduce function and then dividing by the number of ratings.

from operator import add

sum_ratings_diff_sq = ratings_diff_sq.reduce(add)
num = ratings_diff_sq.count()
average_prediction_error = sum_ratings_diff_sq / float(num)

To bring the scale back down after squaring the differences, take the square root of this average.

from math import sqrt

rmse = sqrt(average_prediction_error)

Now that you know how to compute RMSE in Spark, your homework is to implement MAE. The answer can be found in Hermes’s GitHub repo, so go check that out after you are done here.

Although it is good practice to know how to implement RMSE and MAE in Spark, Spark’s MLlib has its own implementation of each that you should probably use. The only input required is an RDD containing predicted and actual rating pairs in the format [(predicted_rating, actual_rating)].

# Join on (user_id, movie_id) and keep only the (predicted_rating, actual_rating) pair.
predicted_and_actual_ratings = y_predicted_reformat \
    .join(y_actual_reformat) \
    .map(lambda kv: kv[1])  # kv is ((user_id, movie_id), (predicted_rating, actual_rating))
from pyspark.mllib.evaluation import RegressionMetrics

metrics = RegressionMetrics(predicted_and_actual_ratings)
rmse = metrics.rootMeanSquaredError
mae = metrics.meanAbsoluteError

Classification Performance Metrics

RMSE and MAE allow us to compare systems that output non-binary values. To compare recommender systems that output binary values, we can employ accuracy, precision, recall, or F-scores.

Before we dive deeper into what each of these metrics entails, we should first go over what a confusion matrix is, as this will make it easier to understand the metrics. In a binary classification problem, where an event can either occur or not occur, there are only four possible combinations of predicted and actual values.

We can either successfully predict whether or not the event occurs:

  • predict that the event is likely to occur and it does occur: true positive (TP)
  • predict that the event is not likely to occur and it does not occur: true negative (TN)

or we can fail to do so:

  • predict that the event is likely to occur but it does not occur: false positive (FP)
  • predict that the event is not likely to occur but it does occur: false negative (FN)

When your recommender has run on a dataset, you can tally up each of these cases and put them in a table, often called a confusion matrix.
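To make the tally concrete, here is a minimal plain-Python sketch (the labels are made up for illustration, not Hermes output) that counts the four cells of a confusion matrix from paired predicted and actual classifications, where 1.0 means the event occurs and 0.0 means it does not.

# Hypothetical predicted and actual labels: 1.0 = event occurs, 0.0 = event does not occur.
predicted = [1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
actual    = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]

tp = sum(1 for p, a in zip(predicted, actual) if p == 1.0 and a == 1.0)  # true positives
tn = sum(1 for p, a in zip(predicted, actual) if p == 0.0 and a == 0.0)  # true negatives
fp = sum(1 for p, a in zip(predicted, actual) if p == 1.0 and a == 0.0)  # false positives
fn = sum(1 for p, a in zip(predicted, actual) if p == 0.0 and a == 1.0)  # false negatives

print(tp, tn, fp, fn)  # 3 3 1 1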

Accuracy, Precision, Recall and F-score

Let’s look at a case similar to the one we used for RMSE and MAE, but this time try to determine whether a recommender system algorithm can assess whether or not a movie is good (assuming a clean binary distinction between good and bad movies). To measure how accurate an algorithm is, we can simply count the number of correct classifications over the total number of cases. We can express that as

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision helps us assess the likelihood that a positive prediction (the movie was good) is indeed positive. In fact, in other fields, precision also goes by the name “positive predictive value.” It is defined as the ratio of true positives to positive predictions:

Precision = TP / (TP + FP)

Precision can be easily confused with recall. Recall is the proportion of actually positive cases that are correctly classified as positive:

Recall = TP / (TP + FN)

In the binary movie quality example, recall measures the proportion of good movies that the recommender system successfully recommends.

Precision and recall tend to be inversely related: as we recommend more items, recall increases but precision decreases, and vice versa. We can combine precision and recall into a single measurement for a recommender system: the F-score. The traditional (“balanced”) F-score is defined as

F1 = 2 * (Precision * Recall) / (Precision + Recall)
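As a quick sanity check on these definitions, here is a small plain-Python sketch that computes all four metrics from a set of hypothetical confusion matrix counts (the numbers are made up for illustration).

# Hypothetical confusion matrix counts for the "good movie" example.
tp, tn, fp, fn = 30, 40, 10, 20

accuracy = (tp + tn) / float(tp + tn + fp + fn)     # (30 + 40) / 100 = 0.70
precision = tp / float(tp + fp)                     # 30 / 40 = 0.75
recall = tp / float(tp + fn)                        # 30 / 50 = 0.60
f1 = 2 * precision * recall / (precision + recall)  # ~0.67

print(accuracy, precision, recall, f1)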

Implementation of Precision, Recall, and F-score in Spark

Since Spark has its own library that computes precision, recall, and F-score, we will demonstrate how to compute these metrics in MLlib and leave the manual implementation of each as homework. The only input the library requires is an RDD containing predicted and actual classification pairs. We will use an RDD called predicted_and_actual_classifications in the format [(predicted_classification, actual_classification)]. We will also assume that the predicted and actual classifications are either 0.0 or 1.0, where 0.0 is considered a “bad” movie and 1.0 is considered a “good” movie.

from pyspark.mllib.evaluation import MulticlassMetrics

metrics = MulticlassMetrics(predicted_and_actual_classifications)
confusion_matrix = metrics.confusionMatrix().toArray()
precision = metrics.precision()
recall = metrics.recall()
f1 = metrics.fMeasure()
precision_for_bad_movies = metrics.precision(0.0)
precision_for_good_movies = metrics.precision(1.0)
recall_for_bad_movies = metrics.recall(0.0)
recall_for_good_movies = metrics.recall(1.0)
f1_for_bad_movies = metrics.fMeasure(0.0, 1.0)
f1_for_good_movies = metrics.fMeasure(1.0, 1.0)

Conclusion

Accuracy and precision allow us to train a recommender system by minimizing its prediction error and then to estimate the quality of the recommendations it makes where we have no ground truth. However, we must not lose sight of other metrics that can describe a given recommender system, as explained in Anna’s blog post on metrics for novelty, diversity, and serendipity in recommendations. I encourage you to take a look! If we focus too heavily on accuracy, precision, and recall, our recommender system model might overfit the training data at the expense of producing quality recommendations. Although we only covered a small subset of performance metrics in this post, we hope you now have a better understanding of how to use them. You can learn more about the other performance metrics implemented in Spark by reading the documentation or the source code. And if you are interested in Hermes, please check out our GitHub page. We hope to see you soon!

Originally published at www.lab41.org on March 13, 2016.
