Building (and Evaluating) a Recommender System for Implicit Feedback

Juliana Daikawa
6 min read · Feb 1, 2020


Recommender Systems are everywhere. They are there when you choose what movie to watch on Netflix, what book to buy on Amazon, or which friends to add on Facebook…

With more and more content available, we need a small part of the whole selected for us, and better than just that, personalized.

Explicit Feedback vs. Implicit Feedback

It’s easy to see their applications and why Recommender Systems are so popular. Although there are a lot of articles and examples out there to learn from, most of them deal with explicit feedback, like star ratings of movies on IMDb. There it is clear what content the user likes (rated 5 stars) and what they don’t like (rated 1 star), so we could try to predict the rating a user would give to an unseen movie and recommend the ones predicted to be close to 5 stars.

In this article, we will focus on an approach for when we only have implicit feedback, such as whether a user purchased a product or not. This is implicit because we only know that the customer bought the items; we can’t tell if they liked them or which one they preferred. Other examples of implicit feedback are the number of clicks, the number of page visits, the number of times a song was played, and so on.

When dealing with implicit feedback, we can look at the number of occurrences to infer the user’s preference, but that can lead to bias towards categories bought on a daily basis.
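
As a rough sketch of what working with purchase counts might look like (the column names, the toy data, and the log dampening are illustrative assumptions, not a fixed recipe):

```python
import numpy as np
import pandas as pd

# Hypothetical purchase log: one row per (user, product) purchase event.
purchases = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2, 3],
    "product_id": ["milk", "milk", "bread", "milk", "apples", "bread"],
})

# Count how many times each user bought each product (the implicit "rating").
counts = purchases.groupby(["user_id", "product_id"]).size().unstack(fill_value=0)

# One simple way to soften the bias towards everyday items is to dampen the counts,
# e.g. with log(1 + count), so a product bought 30 times doesn't dominate everything.
dampened = np.log1p(counts)
print(dampened)
```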

Algorithms

There are three main ways to build a recommender system:

  • Content-Based

Uses descriptions of the items to build the profile of the user’s preferences.

  • Collaborative Filtering

Based on the assumption that people who agreed in the past will agree in the future, and that they will like similar kinds of items as they liked in the past.

  • Hybrid

Combines collaborative filtering, content-based filtering, and other approaches.

In this post, we’ll take a closer look at Collaborative Filtering, as well as how to evaluate our model.

Collaborative Filtering

There are two types of collaborative filtering: user-based and item-based.

In user-based filtering, we look for similarities between users. For each user, we search for other users with similar tastes and use their preferences to make recommendations. In the example illustrated above, User A is most similar to User C (both of them like strawberry and watermelon), so we assume that whatever User A likes, User C will like too, and recommend orange to User C.

Item-based filtering, on the other hand, looks at similarities between the items rather than the users. User C likes watermelon, and most people who like watermelon also like grape, so we recommend grape to User C.

Item-based CF is generally preferred over user-based CF since the latter is a lot more computationally expensive, given we usually have more users than items.
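
To make the item-based idea concrete, here is a minimal sketch in Python using cosine similarity between the item columns of a toy user-item matrix. The fruits and the 0/1 preferences are made up for illustration and only loosely follow the example above:

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy user-item matrix: 1 means the user likes the fruit, 0 means no signal.
ratings = pd.DataFrame(
    {
        "strawberry": [1, 0, 1, 1, 0],
        "watermelon": [1, 1, 1, 1, 1],
        "grape":      [0, 1, 0, 1, 1],
        "orange":     [1, 0, 0, 0, 0],
    },
    index=["User A", "User B", "User C", "User D", "User E"],
)

# Item-based CF: compare items (columns) with each other, not users with each other.
item_sim = pd.DataFrame(
    cosine_similarity(ratings.T),
    index=ratings.columns,
    columns=ratings.columns,
)

# Score the items User C hasn't interacted with by their similarity to the items
# User C already likes, then recommend the highest-scoring ones.
liked = ratings.columns[ratings.loc["User C"] == 1]
unseen = ratings.columns[ratings.loc["User C"] == 0]
scores = item_sim.loc[unseen, liked].sum(axis=1).sort_values(ascending=False)
print(scores)  # with this toy data, grape edges out orange for User C
```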

Evaluation

After building our model, we should evaluate it to check its quality. We could use standard metrics such as MSE for explicit feedback and the F1-score for implicit feedback.

However, recommender system models are quite different from what we may be used to, because order matters when presenting a list of recommendations.

Imagine you searched for something on Google and what you were looking for only showed up on the 2nd page. A metric like Recall would count that as a success, which is a bit naive.

Or worse, imagine we want to send personalized e-mails with only the top 5 most relevant items for each user. It would be really bad if we got all 5 wrong, no matter how well we do on the items that come after them.

MAP@K

For this reason, a more suitable evaluation metric is MAP@K. It stands for Mean Average Precision @ (cutoff) K.

Using this metric to evaluate a recommender algorithm implies that you are treating the recommendation like a ranking task, which most of the time makes perfect sense! We want the most likely/relevant items to be shown first.

A quick refresher on the definitions of Precision and Recall. In the context of a recommender system for movies, they would be:

Precision = out of the total movies recommended, the proportion the user actually liked.

Recall = out of the total movies the user would like, the proportion that we recommended.
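
As a quick illustrative sketch (the movie names and the liked set are made up):

```python
# Hypothetical example: what we recommended vs. what the user actually liked.
recommended = ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"]
liked       = {"Movie B", "Movie E", "Movie F", "Movie G"}

hits = [m for m in recommended if m in liked]

precision = len(hits) / len(recommended)  # 2 liked out of 5 recommended = 0.4
recall    = len(hits) / len(liked)        # 2 recommended out of 4 liked  = 0.5

print(precision, recall)
```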

These two are great metrics, but they don’t care about ordering. This is where MAP@K and MAR@K come in handy.

Imagine taking our ranked list of recommendations of size N and considering only the first item, then only the first two, then only the first three, and so on… We can calculate the precision at each of these cutoffs, as in the example below. (We could also do the same with recall, which in the end would give us MAR@K.)

We then average these precisions P(k) over the cutoffs, counting only the positions that were actually correct:

AP@N = (1/N) · Σₖ₌₁ᴺ P(k) · rel(k)

where rel(k) is 1 if the kᵗʰ item was correct and 0 otherwise.
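
Here is a minimal Python sketch of this formula, following the convention above of dividing by the cutoff N:

```python
def apk(recommended, relevant, n=3):
    """Average Precision at N: sum of P(k) * rel(k) over the first n positions, divided by n."""
    relevant = set(relevant)
    score = 0.0
    hits = 0
    for k, item in enumerate(recommended[:n], start=1):
        if item in relevant:
            hits += 1
            score += hits / k  # P(k), counted only when the k-th item is relevant (rel(k) = 1)
    return score / n


def mapk(all_recommended, all_relevant, n=3):
    """Mean of AP@N over all users."""
    return sum(apk(r, a, n) for r, a in zip(all_recommended, all_relevant)) / len(all_recommended)
```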

For each user, we calculate the AP@N. Here’s an example with N=3:

User A: [0*(0) + 0*(0) + 1*(1/3)]/3 = (1/3)*(1/3) ≈ 0.11

User B: [0*(0) + 1*(1/2) + 1*(2/3)]/3 = (1/3)*[(1/2)+(2/3)] ≈ 0.39

User C: [1*(1/1) + 1*(2/2) + 0*(2/3)]/3 = (1/3)*[(1/1)+(2/2)] ≈ 0.67

We see that even though we got 2 items correct for both User B and User C, the recommendations for User C were better because they appeared earlier.

Finally, we take the mean of these AP@N values over all users!

In this case, our model would have a MAP@3 of (0.11 + 0.39 + 0.67)/3 ≈ 0.39.
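
Using the apk/mapk sketch from the previous section with made-up item lists that reproduce the relevance patterns of Users A, B and C:

```python
# Relevance patterns from the example: A -> [0, 0, 1], B -> [0, 1, 1], C -> [1, 1, 0].
recommended = [
    ["a1", "a2", "a3"],  # User A
    ["b1", "b2", "b3"],  # User B
    ["c1", "c2", "c3"],  # User C
]
relevant = [
    ["a3"],        # only the 3rd recommendation was right
    ["b2", "b3"],  # 2nd and 3rd were right
    ["c1", "c2"],  # 1st and 2nd were right
]

print([round(apk(r, a, 3), 2) for r, a in zip(recommended, relevant)])  # [0.11, 0.39, 0.67]
print(round(mapk(recommended, relevant, 3), 2))                         # 0.39
```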

Code

We can build a simple recommender system with just a few lines of code, using Turicreate in Python. You can read more about it in the documentation.
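
As a hedged sketch of what that could look like (the file name and column names are assumptions you would adapt to your own data):

```python
import turicreate as tc

# Load the interaction data (assumed columns: user_id, product_id).
actions = tc.SFrame.read_csv("purchases.csv")

# An item-similarity model works well for implicit (bought / not bought) data,
# since it needs no explicit rating column.
model = tc.item_similarity_recommender.create(
    actions, user_id="user_id", item_id="product_id"
)

# Top-5 recommendations for the users in the training data.
recommendations = model.recommend(k=5)
print(recommendations.head())
```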

For evaluation, the ml_metrics package can be used and an easy demo can be found here.
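
A small usage sketch with the same toy lists as above (note that ml_metrics normalizes AP@K by the number of relevant items when that is smaller than K, rather than by K as in the hand calculation above, so its values can come out higher):

```python
from ml_metrics import mapk

# One list of relevant items and one list of recommendations per user.
actual    = [["a3"], ["b2", "b3"], ["c1", "c2"]]
predicted = [["a1", "a2", "a3"], ["b1", "b2", "b3"], ["c1", "c2", "c3"]]

print(mapk(actual, predicted, k=3))  # higher than the ~0.39 above due to the different denominator
```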

The Instacart Market Basket Analysis dataset from Kaggle is a nice one to play with, trying to predict which products will be in a user’s next order.

Have fun! :-)

