Building and evaluating your first recommendation engine with scikit-surprise
A small experiment to test our understanding of recommendation engines using the scientific method
This post presents an approach to implement your first recommendation engine and quickly check its performance. It focuses on a use case designed to check visually what the model has learned during training and how it performed during testing. The use case was implemented in Python with `pandas` and `scikit-surprise`. I hope you will find this article helpful if you are getting started on recommendation engines yourself!
Preamble
Goal and approach
As I was getting started on recommendation engines, my primary goal was to confirm I understood how they work. The best way to do this was to implement one, test it and verify it behaved as I expected. So — after my initial research which led me into an extravaganza of collaborative filtering algorithms and Python libraries — I decided to adopt this straightforward approach to achieve my foundational goal:
- Create a predictable recommendation use case: design the training/testing datasets to ensure they contain groups of users with similar taste.
- Train the recommendation engine.
- Test and verify visually that the engine responds according to the expectations set during step #1.
If I could do that, then I would understand how recommendation engines work (at a high level) and could move on to studying more advanced recommendation techniques.
Hypothesis tested
A recommendation engine is a piece of A.I. which takes a `[user_id, product_id]` combination as an input and outputs a `rating` (a number). In other words: it predicts how much a specific user will like a specific product. Given this basic ground truth, I wanted to test my understanding (my “hypothesis”) that recommendation engines work like this:
- During training: identify groups of users with the same taste (i.e. users who rated the same products similarly).
- During prediction: estimate how a user from a “same-taste-group” would rate a product already rated by the other users of that group.
Here is an example: say we have 10 users and 2 products (`P1` and `P2`). Users 1 through 5 have rated `P1` similarly: they have the same taste, so they belong to one “same-taste-group”. If users 1 through 4 have also rated `P2`, then it is reasonable to use these ratings to predict how user 5 will rate `P2`. Using this logic, we can imagine that users 6 through 10 could form a different “same-taste-group” based on their ratings of `P1` and `P2` or another set of products.
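The neighborhood intuition above can be sketched with a toy calculation (this is only an illustration of the hypothesis with made-up numbers, not how a real recommendation algorithm computes predictions):

```python
import pandas as pd

# Hypothetical mini-dataset: users 1-5 rate P1 similarly (one "same-taste-group"),
# and users 1-4 have also rated P2.
ratings = pd.DataFrame({
    "user_id":    [1, 2, 3, 4, 5, 1, 2, 3, 4],
    "product_id": ["P1"] * 5 + ["P2"] * 4,
    "rating":     [8.0, 8.5, 9.0, 8.0, 8.5, 7.0, 7.5, 6.5, 7.0],
})

# Naive prediction for [user 5, P2]: the average of the ratings given to P2
# by the other members of user 5's "same-taste-group".
group = [1, 2, 3, 4]
p2 = ratings[(ratings.product_id == "P2") & (ratings.user_id.isin(group))]
predicted = p2.rating.mean()
print(predicted)  # 7.0
```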
Let’s put that to the test!
Training data
The stratification into distinct groups of users and the introduction of clear like/dislike patterns amongst these groups is the cornerstone of the use case. The data used in this article was meticulously created to meet these specifications:
- 1000 distinct `product_id`
- 100 distinct `user_id`
- Each user randomly picks 100 distinct `product_id` and rates them between `0.0` and `10.0` (don’t mind the decimal scale: this is a theoretical use case and it makes the results easier to visualize)
Users are grouped into “same-taste-groups” as follows:
- The 1st third of the users (category `A`, `user_id` between `[1, 33]`): rate `product_id` between `[1, 500]` high (`[5.0, 10.0]`) and rate `product_id` between `[501, 1000]` low (`[0.0, 5.0]`). These could be people with cats, and `product_id` between `[1, 500]` could be products for cats.
- The 2nd third of the users (category `B`, `user_id` between `[34, 66]`): rate `product_id` between `[1, 500]` low (`[0.0, 5.0]`) and rate `product_id` between `[501, 1000]` high (`[5.0, 10.0]`). These could be people with dogs, and `product_id` between `[501, 1000]` could be products for dogs.
- The 3rd third of the users (category `C`, `user_id` between `[67, 100]`): randomly rate all products between `[0.0, 10.0]` (uniformly distributed). There is no correlation between the `product_id` and the `rating` (random taste: this is the control group).
The table below provides a visual representation of the dataset used to train the recommendation model (pivoted `[user_id, product_id, rating]`). The table is quite large, so the rows/columns shown are truncated: the highest `product_id` shown is `993` (but they go as high as `1000`) and the highest `user_id` shown is `76` (but they go as high as `100`). `NaN` stands for Not A Number; it indicates the products which have not been rated by a user.
The `[product_id, rating]` combinations for user categories `A` and `B` form recognizable and opposite “checker” patterns. This is not true for user category `C`, where there is no pattern (by design). These patterns are key to the design of our use case: they will make it easy to check the performance of the recommendation model visually.
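For reference, the pivoted view can be produced from the raw `[user_id, product_id, rating]` dataframe with `pandas` (a minimal sketch with made-up values standing in for the full dataset):

```python
import pandas as pd

# Tiny stand-in for the ratings dataframe: user "1" rates two products,
# user "2" rates only one of them.
str_ratings_df = pd.DataFrame({
    "user_id":    ["1", "1", "2"],
    "product_id": ["1", "501", "1"],
    "rating":     [8.2, 1.3, 9.0],
})

# Pivot into the [user_id x product_id] matrix shown above;
# unrated combinations appear as NaN.
pivot = str_ratings_df.pivot(index="user_id", columns="product_id", values="rating")
print(pivot)
```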
Implementation: training a recommendation model with scikit-surprise
We have decided to use `scikit-surprise` to implement the recommendation engine for 2 reasons: this library focuses solely on recommendation engines, and it is well documented.
Our use case data has been saved in a `pandas` dataframe (`str_ratings_df`); the first 5 rows of this dataframe are shown below. All of it will be used to train the recommendation model (more on the absence of an obvious test dataset later on).
training dataset (str_ratings_df)
The `scikit-surprise` code to train the model using the training set is:
```python
from surprise import Reader, Dataset, SVD

# Put the training data in the right format
reader = Reader(rating_scale=(0.0, 10.0))
data = Dataset.load_from_df(str_ratings_df, reader)
trainset = data.build_full_trainset()

# Singular Value Decomposition (SVD) algorithm:
# gives good results according to the
# scientific / engineering community
model = SVD(n_epochs=20, n_factors=50, verbose=True)
model.fit(trainset)
```
Voila! Our `model` object is trained and we can now ask it to predict user ratings!
Performance evaluation
Asking the right questions
In this article, we limit performance evaluation to a simple visual check, since we are just looking to see if we can reject our understanding of how recommendation engines work (we are not calculating performance metrics or looking for the best model). We only want to verify that our model has:
- Learned the `[product_id, rating]` “checker” patterns of each user category.
- Made sensible rating predictions when presented with `[user_id, product_id]` combinations unseen during training.
This is done in 3 steps:
- Find all the `[user_id, product_id]` combinations where `product_id` has not been rated by `user_id` but has been rated by other users in `user_id`’s group.
- For each valid `[user_id, product_id]` combination, use the recommendation model to get a `predicted_rating`.
- Plot the `[product_id, predicted_rating]` combinations for each user category (`A`, `B`, `C`) and verify the training pattern has indeed been learned and used to predict the ratings.
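Step 1 above can be sketched with plain `pandas` (the helper names and the grouping logic are my own; the article does not show its code for this step):

```python
import pandas as pd

def user_group(user_id):
    """Map a user_id (string) to its taste category, per the dataset design."""
    uid = int(user_id)
    return "A" if uid <= 33 else ("B" if uid <= 66 else "C")

def candidate_pairs(ratings_df):
    """[user_id, product_id] combinations not rated by user_id but rated
    by at least one other user of user_id's group."""
    df = ratings_df.assign(group=ratings_df["user_id"].map(user_group))
    rated = set(zip(df["user_id"], df["product_id"]))
    pairs = []
    for _, sub in df.groupby("group"):
        for user_id in sub["user_id"].unique():
            for product_id in sub["product_id"].unique():
                if (user_id, product_id) not in rated:
                    pairs.append((user_id, product_id))
    return pairs

# Example with a tiny dataframe (two category-A users):
toy = pd.DataFrame({
    "user_id":    ["1", "1", "2"],
    "product_id": ["10", "20", "10"],
    "rating":     [9.0, 8.0, 8.5],
})
print(candidate_pairs(toy))  # [('2', '20')]
```

Each returned pair is then fed to the model (step 2) and the predictions are plotted per user category (step 3).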
To make a prediction with a `scikit-surprise` model (note that `predict` returns a `Prediction` object, not a bare number; the predicted rating itself is its `est` attribute):

```python
predicted_rating = model.predict(str(user_id), str(product_id)).est
```
Note: logically, the model should not be able to predict ratings for all possible `[user_id, product_id]` combinations. In particular, the model should not be able to predict ratings for these cases:
- Unknown `user_id` or `product_id` (a value not included in the training data): the model does not know what this user likes or who likes this product.
- Unknown `[user_id, product_id]` association: the training data did not include a rating for this `product_id` coming from any of the users in this `user_id`’s group.
Results
The dataframe below is a small extract of our predictions. Note the `was_impossible` flag: I suspect it is set to `True` if we ask one of the “impossible questions” described above. All of our flags were `False`, since we carefully selected “possible” questions during our experiment.
Finally, and as promised, our visual check: here are the visual representations of the `[product_id, predicted_rating]` mappings. It is quite obvious that the model has learned the opposite “checker” patterns from the training data for categories `A` and `B`, while it has found no pattern for user category `C`. Visually, our expectations are met.
We can also look at the averages of the actual vs predicted ratings per product for each user category. The presence of a clear linear relationship between `avg_rating` and `avg_predicted_rating` (with a slope close to 1) for categories `A` and `B` shows that the model learned the association between `product_id` and `rating` within these groups. There was no rating pattern to learn for user group `C` (since this group was designed to have unpredictable taste): there is almost no association between `avg_rating` and `avg_predicted_rating` for this user group.
These are the results I would have expected to see if my understanding (my “hypothesis”) was correct. So while I have not been able to reject my hypothesis (cognizant that it could still be wrong), the outcome of my experiment makes me more confident that my foundational understanding of recommendation engines is correct.
What would I do next?
First
Now that I have a basic foundational understanding of recommendation engines, I would go back to the theory and learn more about how the most popular algorithms work in detail. I would form an opinion on which algorithm should work best under which circumstances, and why.
Second
I would want to find the best way to solve this simple use case. As mentioned earlier, all we did was to visually check that “things made sense”. Additional effort is required to go from here to a situation where we have a “good model”. For this new goal, having a numeric performance metric becomes a must-have to navigate improvement candidates and progress efficiently. A possible approach to get a basic metric (to be minimized) is to:
- Split the raw data between training and testing datasets.
- Train using the training set — Duh :D.
- Predict ratings using the `[user_id, product_id]` combinations from the testing dataset.
- Calculate the average or sum of the absolute values of the normalized errors (ANE or SNE) between the predicted and actual ratings. By normalized, we mean divided by the maximum possible error at a particular rating value. Examples: a) for an actual rating of `0.0` or `10.0`, the maximum absolute value of the prediction error is `10.0`, so normalize the error using a factor of `1/10`; b) for an actual rating of `5.0`, the maximum absolute value of the prediction error is `5.0`, so normalize the error using a factor of `1/5`. This is to make sure that the errors at the edges count as much as those at the center of the prediction range.
The ANE or SNE can then be minimized to find the optimal set of hyperparameters for the recommendation model and/or to choose the best algorithm (we only evaluated SVD in this post; there are other options). Note: the normalized errors are signed values, so I would also recommend keeping an eye on the normalized errors’ standard deviation or their density during the optimization process (is less error on average but more spread a real improvement?).
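The normalization described above could be implemented like this (the function name is hypothetical; it simply divides each signed error by the distance from the actual rating to the farthest edge of the rating scale):

```python
import numpy as np

def normalized_errors(actual, predicted, r_min=0.0, r_max=10.0):
    """Signed prediction errors, each divided by the maximum possible
    absolute error at that actual rating (distance to the farthest
    edge of the rating scale)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    max_err = np.maximum(actual - r_min, r_max - actual)
    return (predicted - actual) / max_err

actual = [0.0, 5.0, 10.0]
predicted = [2.0, 6.0, 9.0]
ne = normalized_errors(actual, predicted)  # [0.2, 0.2, -0.1]
ane = np.abs(ne).mean()  # average of absolute normalized errors (ANE)
sne = np.abs(ne).sum()   # sum of absolute normalized errors (SNE)
```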