Building and evaluating your first recommendation engine with scikit-surprise

A small experiment to test our understanding of recommendation engines using the scientific method

Raphaël Vannson
Slalom Technology
9 min read · Mar 19, 2019


Credit: Ben Harritt

This post presents an approach to implementing your first recommendation engine and quickly checking its performance. It focuses on a use case designed to check visually what the model has learned during training and how it performed during testing. The use case was implemented in Python with pandas and scikit-surprise. I hope you will find this article helpful if you are getting started on recommendation engines yourself!

Preamble

Goal and approach

As I was getting started on recommendation engines, my primary goal was to confirm I understood how they work. The best way to do this was to implement one, test it and verify it behaved as I expected. So — after my initial research which led me into an extravaganza of collaborative filtering algorithms and Python libraries — I decided to adopt this straightforward approach to achieve my foundational goal:

  1. Create a predictable recommendation use case: design the training/testing datasets to ensure they contain groups of users with similar taste.
  2. Train the recommendation engine.
  3. Test and verify visually that the engine responds according to the expectations set during step #1.

If I could do that, then I would understand how recommendation engines work (at a high level) and it would be OK for me to move on to more advanced recommendation techniques.

Hypothesis tested

A recommendation engine is a piece of A.I. that takes a [user_id, product_id] combination as input and outputs a rating (a number). In other words: it predicts how much a specific user will like a specific product. Given this basic ground truth, I wanted to test my understanding (my “hypothesis”) that recommendation engines work like this:

  1. During training: identify groups of users with the same taste (i.e. users who rated the same products similarly).
  2. During prediction: estimate how a user from a “same-taste-group” would rate a product already rated by other users of that group.

Here is an example: say we have 10 users and 2 products (P1 and P2). Users 1 through 5 have rated P1 similarly; they have the same taste, so they belong to the same “same-taste-group”. If users 1 through 4 have also rated P2, it is reasonable to use these ratings to predict how user 5 will rate P2. Using this logic, we can imagine that users 6 through 10 could form a different “same-taste-group” based on their ratings of P1 and P2 or another set of products.

Let’s put that to the test!

Training data

Stratifying the users into distinct groups and introducing clear like/dislike patterns amongst these groups is the cornerstone of the use case. The data used in this article was created to meet these specifications (a sketch of how such a dataset could be generated follows the two lists below):

  • 1000 distinct product_id
  • 100 distinct user_id
  • Each user randomly picks 100 distinct product_id and rates them between 0.0 and 10.0 (don’t mind the decimal scale: this is a theoretical use case and it makes visualization easier)

Users are grouped into “same-taste-groups” as follows:

  • The 1st third of the users (category A, user_id between [1, 33]): rate product_id between [1, 500] high ([5.0, 10.0]) and product_id between [501, 1000] low ([0.0, 5.0]). These could be people with cats, and product_id between [1, 500] could be products for cats.
  • The 2nd third of the users (category B, user_id between [34, 66]): rate product_id between [1, 500] low ([0.0, 5.0]) and product_id between [501, 1000] high ([5.0, 10.0]). These could be people with dogs, and product_id between [501, 1000] could be products for dogs.
  • The 3rd third of the users (category C, user_id between [67, 100]): rate the products they pick anywhere in [0.0, 10.0] (uniformly distributed). There is no correlation between product_id and rating (random taste, the control group).
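Here is one way such a dataset could be generated with numpy and pandas. This is a sketch: the random seed and the exact sampling calls are my own choices, while the group boundaries, rating ranges and the str_ratings_df name come from the post.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility
rows = []

for user_id in range(1, 101):
    # Each user rates 100 distinct products picked at random out of 1000.
    for product_id in rng.choice(np.arange(1, 1001), size=100, replace=False):
        if user_id <= 33:    # category A: likes products [1, 500], dislikes [501, 1000]
            low, high = (5.0, 10.0) if product_id <= 500 else (0.0, 5.0)
        elif user_id <= 66:  # category B: dislikes products [1, 500], likes [501, 1000]
            low, high = (0.0, 5.0) if product_id <= 500 else (5.0, 10.0)
        else:                # category C: random taste (control group)
            low, high = 0.0, 10.0
        rows.append((user_id, int(product_id), rng.uniform(low, high)))

str_ratings_df = pd.DataFrame(rows, columns=["user_id", "product_id", "rating"])
```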

The table below provides a visual representation of the dataset used to train the recommendation model (the [user_id, product_id, rating] records, pivoted). The table is quite large, so the rows/columns shown are truncated: the highest product_id shown is 993 (but they go as high as 1000) and the highest user_id shown is 76 (but they go as high as 100). NaN (Not a Number) values indicate products that have not been rated by a given user.

Training dataset extract
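This pivoted view can be reproduced with a single pandas call; the sketch below assumes the ratings live in the str_ratings_df dataframe introduced in the next section.

```python
# Rows are users, columns are products; unrated products appear as NaN.
ratings_matrix = str_ratings_df.pivot(index="user_id", columns="product_id", values="rating")
print(ratings_matrix.iloc[:5, :10])  # a small extract, in the spirit of the table above
```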

[product_id, rating] combinations for user categories A and B form recognizable and opposite “checker” patterns. This is not true for user category C, where there is no pattern (by design). These patterns are key to the design of our use case: they will make it easy to check the performance of the recommendation model visually.

Ratings for users from group A
Ratings for users from group B
Ratings for users from group C

Implementation: training a recommendation model with scikit-surprise

We decided to use scikit-surprise to implement the recommendation engine for two reasons: the library focuses solely on recommendation engines, and it is well documented.

Our use case data has been saved in a pandas dataframe (str_ratings_df); the first 5 rows of this dataframe are shown below. All of it will be used to train the recommendation model (more on the absence of an obvious test dataset later on).

training dataset (str_ratings_df)

The scikit-surprise code to train the model using the training set is shown below.
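This is a minimal sketch, assuming the str_ratings_df columns shown above and the SVD algorithm (the only algorithm evaluated in this post):

```python
from surprise import SVD, Dataset, Reader

# Tell surprise how the dataframe is laid out and what the rating scale is.
reader = Reader(rating_scale=(0.0, 10.0))
data = Dataset.load_from_df(str_ratings_df[["user_id", "product_id", "rating"]], reader)

# No hold-out set: the whole dataset is used for training (as explained above).
trainset = data.build_full_trainset()

# Train an SVD-based collaborative filtering model.
model = SVD()
model.fit(trainset)
```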

Voila! Our model object is trained and we can now ask it to predict user ratings!

Performance evaluation

Asking the right questions

In this article, we limit performance evaluation to a simple visual check since we are just looking to see if we can reject our understanding of how recommendation engines work (we are not calculating performance metrics or looking for the best model). We only want to verify our model has:

  1. Learned the [product_id, rating] “checker” patterns of each user category.
  2. Made sensible rating predictions when presented with combinations of [user_id, product_id] unseen during training.

This is done in 3 steps:

  1. Find all the [user_id, product_id] combinations where product_id has not been rated by user_id but has been rated by other users in user_id’s group (see the sketch after this list).
  2. For each valid [user_id, product_id] combination, use the recommendation model to get a predicted_rating.
  3. Plot the [product_id, predicted_rating] combinations for each user category (A, B, C), verify the training pattern has indeed been learned and used to predict the ratings.
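Here is one way to implement step 1 with pandas. This is a sketch assuming the str_ratings_df dataframe from the training section; same_group is a hypothetical helper name that simply encodes the category boundaries defined earlier.

```python
# Map each user to its category (A, B or C) using the boundaries from the training data design.
def same_group(user_id):
    if user_id <= 33:
        return "A"
    if user_id <= 66:
        return "B"
    return "C"

groups = str_ratings_df["user_id"].map(same_group)

# Collect, per user, the products rated by the user's group but not by the user.
candidates = []
for group_name, group_df in str_ratings_df.groupby(groups):
    rated_by_group = set(group_df["product_id"])
    for user_id, user_df in group_df.groupby("user_id"):
        unrated_by_user = rated_by_group - set(user_df["product_id"])
        candidates += [(user_id, product_id) for product_id in unrated_by_user]
```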

To make a prediction with a scikit-surprise model, we call predict on the trained model object.
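A minimal sketch; the [user_id, product_id] pair below is hypothetical, chosen only for illustration:

```python
# Predict how user 5 might rate product 777; in practice, every pair in `candidates`
# above would be passed through model.predict.
prediction = model.predict(uid=5, iid=777)

print(prediction.est)                        # the predicted rating
print(prediction.details["was_impossible"])  # True when surprise could not compute a regular estimate
```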

Note: logically, the model should not be able to predict ratings for every possible [user_id, product_id] combination. In particular, it should not be able to predict ratings in these cases:

  • Unknown user_id or product_id (a value not included in the training data): the model does not know what this user likes or who likes this product.
  • Unknown [user_id, product_id] association: the training data did not include a rating for this product_id from any of the users in this user_id’s group.

Results

The dataframe below is a small extract of our predictions. Note the was_impossible flag: I suspect it is set to True when we ask one of the “impossible questions” described above. All of our flags were False since we carefully selected “possible” questions during our experiment.

Ratings predictions dataframe

Finally, as promised, our visual check: here are the visual representations of the [product_id, predicted_rating] mappings. It is quite obvious that the model has learned the opposite “checker” patterns from the training data for categories A and B, while it has found no pattern for user category C. Visually, our expectations are met.

Ratings predictions for users from group A
Ratings predictions for users from group B
Ratings predictions for users from group C

We can also look at the averages of the actual vs. predicted ratings per product for each user category. The presence of a clear linear relationship between avg_rating and avg_predicted_rating (with a slope close to 1) for categories A and B shows that the model learned the association between [user_id, product_id] and rating. There was no rating pattern to learn for user group C (since this group was designed to have unpredictable taste), and there is almost no association between avg_rating and avg_predicted_rating for this user group.

Average actual vs. predicted ratings predictions for users from group A
Average actual vs. predicted ratings predictions for users from group B
Average actual vs. predicted ratings predictions for users from group C
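Here is a sketch of how these averages could be computed for one user category. It assumes the predictions were collected into a hypothetical predictions_df dataframe with user_id, product_id and predicted_rating columns (a name of my own, not from the post).

```python
import pandas as pd

group_a_users = range(1, 34)  # category A boundaries from the training data section

avg_rating = (str_ratings_df[str_ratings_df["user_id"].isin(group_a_users)]
              .groupby("product_id")["rating"].mean()
              .rename("avg_rating"))

avg_predicted_rating = (predictions_df[predictions_df["user_id"].isin(group_a_users)]
                        .groupby("product_id")["predicted_rating"].mean()
                        .rename("avg_predicted_rating"))

# One point per product: a slope close to 1 means the training pattern was learned.
comparison = pd.concat([avg_rating, avg_predicted_rating], axis=1).dropna()
comparison.plot.scatter(x="avg_rating", y="avg_predicted_rating")
```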

These are the results I would expect to see if my understanding (my “hypothesis”) was correct. So while I have not been able to reject my hypothesis (and am cognizant that it could still be wrong), the outcome of my experiment makes me more confident that my foundational understanding of recommendation engines is correct.

What would I do next?

First

Now that I have a basic foundational understanding of recommendation engines, I would go back to the theory and learn more about how the most popular algorithms work in detail. I would form an opinion on which algorithm should work best under which circumstances and why.

Second

I would want to find the best way to solve this simple use case. As mentioned earlier, all we did was to visually check that “things made sense”. Additional effort is required to go from here to a situation where we have a “good model”. For this new goal, having a numeric performance metric becomes a must-have to navigate improvement candidates and progress efficiently. A possible approach to get a basic metric (to be minimized) is to:

  1. Split the raw data between training and testing datasets.
  2. Train using the training set — Duh :D.
  3. Predict ratings using the [user_id, product_id] combinations from the testing dataset.
  4. Calculate the average or sum of the absolute values of the normalized errors (ANE or SNE) between the predicted and actual ratings. By normalized, we mean divided by the maximum possible error at a particular actual rating value. For example: a) for an actual rating of 0.0 or 10.0, the maximum absolute prediction error is 10.0, so normalize the error with a factor of 1/10; b) for an actual rating of 5.0, the maximum absolute prediction error is 5.0, so normalize the error with a factor of 1/5. This ensures that errors at the edges of the rating scale count as much as those at its center.

The ANE or SNE can then be minimized to find the optimal set of hyperparameters for the recommendation model and/or to choose the best algorithm (we only evaluated SVD in this post; there are other options). Note: the normalized errors are signed values, so I would also recommend keeping an eye on their standard deviation or their density during the optimization process (is less error on average but more spread a real improvement?).
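Here is a sketch of this normalized-error computation; the function and variable names are my own, not from the post.

```python
import numpy as np

def normalized_errors(actual, predicted, rating_max=10.0):
    """Signed prediction errors, each divided by the maximum possible error
    at its actual rating value (the normalization described above)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    max_possible_error = np.maximum(actual, rating_max - actual)
    return (predicted - actual) / max_possible_error

errors = normalized_errors(actual=[0.0, 5.0, 10.0], predicted=[2.0, 7.5, 9.0])
ane = np.mean(np.abs(errors))  # average normalized error, to be minimized
sne = np.sum(np.abs(errors))   # sum of normalized errors
spread = errors.std()          # also watch the spread of the signed errors
```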
