
Evaluating Collaborative Filtering Recommender Systems

Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John T. Riedl.

Yoav Navon
3 min read · Aug 24, 2019

--

This is a long paper, so I’ll summarize only the most important aspects of it. The discussion revolves around different ways to evaluate a recommender system, and some of the relevant aspects to consider for a good evaluation.

Comparing two algorithms can be challenging because there is no consensus about which attributes matter most, or which metrics should be used to measure those attributes. Different algorithms may perform better or worse on different data sets, and the goals for which the evaluation is performed may differ.

Different systems may be built for different user tasks, like Find Good Items, Annotation in Context, and Recommend Sequence, to name a few. In this part of the paper there is a quote I found disturbing:

Most evaluations of recommender systems focus on the recommendations; however if users don’t rate items, then collaborative filtering recommender systems can’t provide recommendations.

This quote clearly overlooks implicit feedback methods, where there is no explicit rating whatsoever. However, the authors were probably using “rating” loosely, to mean any kind of interaction with the items.

Another aspect where evaluation practice varies is the choice of data sets. For a given algorithm, can the evaluation be carried out offline, or does it require live user tests? Can it be performed on simulated data? One topic that would have been interesting is whether it is possible to train a recommender system on datasets from more than one domain, for example, several movie rating datasets at the same time, and then fine-tune it on the target movie dataset.

Afterward, there is a long discussion of accuracy metrics, which are the most common kind of metric for evaluating recommender systems. One important consideration is whether a given metric measures the effectiveness of a system with respect to the user task it was designed for, be it prediction, ranking, or something else. Several metrics are discussed; Mean Absolute Error, Precision and Recall, F1, ROC area, correlation, Half-life Utility, and NDPM are the most notable. After explaining each one, the paper compares the metrics by evaluating a single algorithm on one dataset under different parameter settings. This yields correlation charts between pairs of metrics, and the authors find that the metrics group into three families.
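
To make a couple of these metrics concrete, here is a minimal sketch, not code from the paper, that computes MAE and precision/recall at k on made-up data, and then does roughly the kind of metric-vs-metric comparison behind the correlation charts. All names and numbers are mine, and the correlation part assumes SciPy is available.

```python
# A minimal sketch (not the paper's code) of two accuracy metrics and of the
# kind of metric-vs-metric comparison behind the correlation charts.
# All data below is invented for illustration.
import numpy as np
from scipy.stats import spearmanr


def mean_absolute_error(true_ratings, predicted_ratings):
    """MAE: average absolute deviation between predicted and true ratings."""
    true_ratings = np.asarray(true_ratings, dtype=float)
    predicted_ratings = np.asarray(predicted_ratings, dtype=float)
    return np.mean(np.abs(true_ratings - predicted_ratings))


def precision_recall_at_k(recommended_items, relevant_items, k):
    """Precision and recall for a single user's top-k recommendation list."""
    top_k = list(recommended_items)[:k]
    hits = len(set(top_k) & set(relevant_items))
    precision = hits / k
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall


# Toy example: ratings on a 1-5 scale and one user's top-3 list.
print(mean_absolute_error([4, 2, 5, 3], [3.5, 2.5, 4.0, 3.0]))   # 0.5
print(precision_recall_at_k(["a", "b", "c", "d"], {"a", "c", "e"}, k=3))

# Metric-vs-metric comparison: score the same algorithm under several
# parameter settings with two metrics and check how the rankings agree.
mae_scores = [0.82, 0.79, 0.75, 0.90, 0.71]   # lower is better
f1_scores = [0.41, 0.44, 0.47, 0.35, 0.50]    # higher is better
rho, _ = spearmanr(mae_scores, f1_scores)
print(rho)  # close to -1: the two metrics rank the variants the same way
```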

The correlation between metrics is interesting, but the discussion never addresses how these metrics relate to the actual success of a recommender system. It would be really interesting to evaluate an algorithm with some of these metrics and then test it with real users, measuring their overall satisfaction.

Finally, the paper argues that accuracy can’t be the only criterion for evaluating a recommender system. Coverage, Learning Rate, Novelty and Serendipity are other aspects to keep an eye on during the design and implementation of a recommender system.
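
As a rough illustration of what two of these could look like in code (my own sketch, not something defined in the paper): catalog coverage, the fraction of the catalog that ever gets recommended, and a simple popularity-based novelty proxy. The function names and data are made up.

```python
# A rough sketch (not from the paper) of two "beyond accuracy" aspects:
# catalog coverage and a simple popularity-based novelty proxy.
import math


def catalog_coverage(recommendation_lists, catalog):
    """Fraction of the catalog appearing in at least one recommendation list."""
    recommended = set()
    for items in recommendation_lists:
        recommended.update(items)
    return len(recommended & set(catalog)) / len(catalog)


def mean_novelty(recommendation_lists, item_popularity, num_users):
    """Average self-information -log2(p(item)) of recommended items:
    the rarer an item is among users, the higher its novelty score."""
    scores = []
    for items in recommendation_lists:
        for item in items:
            p = item_popularity.get(item, 1) / num_users
            scores.append(-math.log2(p))
    return sum(scores) / len(scores)


catalog = ["a", "b", "c", "d", "e"]
recs = [["a", "b"], ["a", "c"]]          # top-2 lists for two users
popularity = {"a": 90, "b": 10, "c": 5}  # how many of 100 users rated each item
print(catalog_coverage(recs, catalog))                    # 0.6 (3 of 5 items)
print(round(mean_novelty(recs, popularity, 100), 2))      # rarer items score higher
```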
