Recommendation Engines 101

Pierre Pfennig
data from the trenches
9 min read · Jan 8, 2018


Get Started & Master the Basics

Recommendation engines are everywhere today, whether explicitly offered to users (e.g., Amazon or Netflix, the classic examples) or working behind the scenes to choose which content to surface without giving the user a choice. And while building a simple recommendation engine can be quite straightforward, the real challenge is to build one that actually works, one where the business sees real uplift and value from its output.

Here, I’ll focus on the straightforward bit, presenting the main components of a recommendation engine project so that you can (hopefully) get the easy part out of the way and focus on the business value. Specifically, we’ll explore building a collaborative filtering recommender (both item- and user-based) and a content-based recommender.

We’ll work in SQL (Postgres flavor) and Python, using two generic datasets throughout the rest of this post:

  • A visit dataset with format: user_id x product_id x visit

where visit is a binary indicator (1 if the user visited the product, 0 otherwise). In the case of explicit feedback, this could also be a rating.

  • A product description dataset with schema: product_id x description

Getting the Data Ready

When preparing data for a recommendation engine, the first thing to do is some normalization, since you’ll need it for every recommendation score: normalization scales each score between 0 and 1, so that scores can be compared against each other to tell a good recommendation from a bad one.

As the users and products often follow a long-tail distribution, we will also cut the long tail by filtering on users and products that occur frequently enough. You can see this part below:

The following SQL query will produce a table called visit_normalized with format:

product_id x user_id x nb_visit_user x nb_visit_product x visit_user_normed x visit_product_normed
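The original query isn’t embedded in this export, but a minimal sketch could look like the following, assuming the raw data sits in a table called visit and using illustrative activity thresholds:

```sql
-- A sketch: normalize visits and cut the long tail of users and products.
-- The table name "visit" and the thresholds (5 and 25) are assumptions.
CREATE TABLE visit_normalized AS
WITH user_counts AS (
    SELECT user_id, SUM(visit) AS nb_visit_user
    FROM visit
    GROUP BY user_id
    HAVING SUM(visit) >= 5              -- keep reasonably active users only
),
product_counts AS (
    SELECT product_id, SUM(visit) AS nb_visit_product
    FROM visit
    GROUP BY product_id
    HAVING SUM(visit) >= 25             -- keep reasonably popular products only
)
SELECT
    v.product_id,
    v.user_id,
    u.nb_visit_user,
    p.nb_visit_product,
    v.visit::float / u.nb_visit_user    AS visit_user_normed,
    v.visit::float / p.nb_visit_product AS visit_product_normed
FROM visit v
JOIN user_counts u    ON v.user_id = u.user_id
JOIN product_counts p ON v.product_id = p.product_id
WHERE v.visit > 0;                      -- keep only actual visits (or positive ratings)
```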

Collaborative Filtering

There are two main types of collaborative filtering: user-based and item-based. Note that the two are entirely symmetric (or more precisely the transpose of each other). For both versions, we first need to compute a score of similarity between two users or between two products.

User-Based

Again, with user-based collaborative filtering, the key is to calculate a user similarity score. The following will produce a table user_similarity with format: user_1 x user_2 x similarity_user
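The similarity query itself isn’t embedded here; a simple sketch (a cosine-style similarity on the binary visits, which may differ from the exact formula used in the original) would be:

```sql
-- A sketch: user-user similarity built from visit_normalized.
CREATE TABLE user_similarity AS
SELECT
    a.user_id AS user_1,
    b.user_id AS user_2,
    COUNT(*)::float
      / SQRT(MAX(a.nb_visit_user) * MAX(b.nb_visit_user)) AS similarity_user
FROM visit_normalized a
JOIN visit_normalized b
  ON a.product_id = b.product_id        -- products both users have seen
 AND a.user_id <> b.user_id
GROUP BY a.user_id, b.user_id;
```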

As there are usually too many pairs of users to score, we often restrict this query (cf. the condition WHERE b.rank <= 10) by limiting ourselves to the 10 highest user similarity scores per user. The table then has the format user_1 x user_2 x similarity_user x rank, and our query becomes:
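A sketch of the restricted version, ranking neighbours with a window function (alias names chosen to mirror the WHERE b.rank <= 10 condition mentioned above):

```sql
-- Replaces the unrestricted definition above (still a sketch).
DROP TABLE IF EXISTS user_similarity;
CREATE TABLE user_similarity AS
SELECT *
FROM (
    SELECT
        sim.*,
        ROW_NUMBER() OVER (PARTITION BY user_1
                           ORDER BY similarity_user DESC) AS rank
    FROM (
        SELECT
            a.user_id AS user_1,
            b.user_id AS user_2,
            COUNT(*)::float
              / SQRT(MAX(a.nb_visit_user) * MAX(b.nb_visit_user)) AS similarity_user
        FROM visit_normalized a
        JOIN visit_normalized b
          ON a.product_id = b.product_id
         AND a.user_id <> b.user_id
        GROUP BY a.user_id, b.user_id
    ) sim
) b
WHERE b.rank <= 10;      -- keep only the 10 closest neighbours per user
```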

We can then use this user similarity to score a product, new_product, for a given user, user_j:
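The original formula image isn’t reproduced here; as a rough reconstruction (the exact weighting may differ), it boils down to a similarity-weighted sum over the neighbouring users:

score(user_j, new_product) = Σ over user of similarity_user(user_j, user) × 𝛿(user, new_product)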

where 𝛿(user, new_product) (thanks, Medium, for this beautiful math rendering) is equal to 1 if the user has seen new_product and 0 otherwise.

More generally, we could replace 𝛿(user, new_product) with:

  • The number of times user has seen the new product.
  • The time spent by user on the new product.
  • How long ago the user visited this product.

This is done using the SQL script below, where the score_cf_user table will have the following format:

user_id x product_id x score_user_product
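Since the original script isn’t embedded here, this is a sketch of what it might look like, reusing the user_similarity and visit_normalized tables built above:

```sql
-- A sketch: each neighbour "votes" for the products they have seen,
-- weighted by their similarity to the target user. The normalized visit
-- plays the role of 𝛿, as discussed in the list above.
CREATE TABLE score_cf_user AS
SELECT
    s.user_1 AS user_id,
    v.product_id,
    SUM(s.similarity_user * v.visit_user_normed) AS score_user_product
FROM user_similarity s
JOIN visit_normalized v
  ON v.user_id = s.user_2
GROUP BY s.user_1, v.product_id;
```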

Item-Based

We proceed similarly for the item-based collaborative filtering; we just need to transpose the above to produce the table product_similarity containing:

product_1 x product_2 x similarity_product

As usual, we restrict to the products that have been seen enough times (I chose 25, in this case — you can see below that this is done by the filtered_visit table).
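The query isn’t embedded in this export; here is a sketch, with the restriction applied through a filtered_visit CTE:

```sql
-- A sketch: product-product similarity, i.e. the transpose of the
-- user-based query. filtered_visit keeps only products seen at least 25 times.
CREATE TABLE product_similarity AS
WITH filtered_visit AS (
    SELECT *
    FROM visit_normalized
    WHERE nb_visit_product >= 25
)
SELECT
    a.product_id AS product_1,
    b.product_id AS product_2,
    COUNT(*)::float
      / SQRT(MAX(a.nb_visit_product) * MAX(b.nb_visit_product)) AS similarity_product
FROM filtered_visit a
JOIN filtered_visit b
  ON a.user_id = b.user_id              -- users who saw both products
 AND a.product_id <> b.product_id
GROUP BY a.product_id, b.product_id;
```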

And once again, we can then use this product similarity to score a product for a given user:
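Again the original formula isn’t reproduced; roughly, it is a similarity-weighted sum over the products the user has already seen:

score(user, new_product) = Σ over product_i of similarity_product(product_i, new_product) × 𝛿(user, product_i)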

where 𝛿(user, product_i) is equal to 1 if the user has seen product_i and 0 otherwise.

Again, let’s restrict the query to the 10 most similar products for a given product (cf. the condition WHERE final.rank <= 10). The table score_cf_product then has the format user_id x product_id x score_user_product x rank, and our query becomes:
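Here is a sketch of what that query might look like; the exact ranking details may differ from the original (the outer rank orders the recommendations per user, while WHERE final.rank <= 10 keeps the 10 most similar products per product):

```sql
-- A sketch of the item-based scoring.
CREATE TABLE score_cf_product AS
SELECT
    user_id,
    product_id,
    score_user_product,
    ROW_NUMBER() OVER (PARTITION BY user_id
                       ORDER BY score_user_product DESC) AS rank
FROM (
    SELECT
        v.user_id,
        final.product_2 AS product_id,
        SUM(final.similarity_product * v.visit_product_normed) AS score_user_product
    FROM (
        SELECT
            ps.*,
            ROW_NUMBER() OVER (PARTITION BY product_1
                               ORDER BY similarity_product DESC) AS rank
        FROM product_similarity ps
    ) final
    JOIN visit_normalized v
      ON v.product_id = final.product_1     -- products the user has already seen
    WHERE final.rank <= 10                  -- 10 most similar products per product
    GROUP BY v.user_id, final.product_2
) scored;
```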

Content-Based

With content-based recommendation engines, the overall philosophy is a bit different. Here, the goal is to use the description of the products to find products that are similar to those the user has already bought or visited. To do that, we need a product classification, and we often have to build it from a text field. So that’s what we’ll walk through here.

TFIDF Version

We can describe each product by the (non-zero) term frequency–inverse document frequency (TFIDF) value of the words of its description.

The output dataset product_enriched_words has the following schema:

product_id x word x value

It is computed as follows:
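The original recipe isn’t embedded in this export; the sketch below uses scikit-learn’s TfidfVectorizer and assumes a pandas DataFrame called products with columns product_id and description:

```python
# A sketch: compute the TFIDF matrix and keep only the non-zero
# (product, word) pairs in a long format.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf = vectorizer.fit_transform(products["description"].fillna(""))

coo = tfidf.tocoo()
words = vectorizer.get_feature_names_out()
product_enriched_words = pd.DataFrame({
    "product_id": products["product_id"].values[coo.row],
    "word": words[coo.col],
    "value": coo.data,
})
```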

By summing the product vectors seen by each user, we obtain the user profile:

This is done with the following recipe, whose output table user_profile_cb_words has the format: user_id x word x value x value_normed
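The recipe itself isn’t embedded here; a SQL sketch joining the visits with the TFIDF table could look like this:

```sql
-- A sketch: sum the TFIDF vectors of the products each user has visited,
-- then normalize per user so that the values sum to 1.
CREATE TABLE user_profile_cb_words AS
SELECT
    user_id,
    word,
    value,
    value / SUM(value) OVER (PARTITION BY user_id) AS value_normed
FROM (
    SELECT
        v.user_id,
        p.word,
        SUM(p.value) AS value
    FROM visit_normalized v
    JOIN product_enriched_words p
      ON v.product_id = p.product_id
    GROUP BY v.user_id, p.word
) profile;
```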

Finally, thanks to the user’s profile, we can score products:
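The formula isn’t reproduced here either; roughly, the score is the scalar product between the user’s profile and the product’s TFIDF vector:

score_cb(user, product) = < user_profile(user), tfidf(product) > = Σ over word of value_normed(user, word) × value(product, word)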

(where <,> is your favorite scalar product)

This will give us the table score_cb_words with format: user_id x product_id x score_cb x score_cb_normed
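Since the original query isn’t embedded, here is a sketch of the scoring (the per-user normalization by the best score is an assumption):

```sql
-- A sketch: a scalar product over words between the user profile and each
-- product, plus a per-user normalization.
CREATE TABLE score_cb_words AS
SELECT
    user_id,
    product_id,
    score_cb,
    score_cb / NULLIF(MAX(score_cb) OVER (PARTITION BY user_id), 0) AS score_cb_normed
FROM (
    SELECT
        u.user_id,
        p.product_id,
        SUM(u.value_normed * p.value) AS score_cb
    FROM user_profile_cb_words u
    JOIN product_enriched_words p
      ON u.word = p.word
    GROUP BY u.user_id, p.product_id
) scores;
```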

Non-Negative Matrix Factorization (NMF) Version

Another way to build a content-based recommendation engine is to extract topics from the description vectors. This drastically reduces the dimension of the space, expressing each product as a vector of length n_topics.

Here, we present two versions of this method:

  • A long format version for a direct SQL scoring
  • A wide format version for a batch scoring in Python

Wide Format

This code is directly taken from the sklearn documentation. We only have to choose the number of desired topics, n_topics, and n_top_words, the number of words to display for each topic.
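The original snippet isn’t embedded in this export; the sketch below is adapted from scikit-learn’s topic-extraction example and reuses the products DataFrame assumed in the TFIDF section:

```python
# A sketch: fit an NMF on the TFIDF matrix and build the wide topic table.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

n_topics = 10
n_top_words = 10

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf = vectorizer.fit_transform(products["description"].fillna(""))

nmf = NMF(n_components=n_topics, random_state=1)
topic_values = nmf.fit_transform(tfidf)        # shape: (n_products, n_topics)

# Display the top words of each topic, for inspection.
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(nmf.components_):
    top = [feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]
    print(f"Topic #{topic_idx}: " + " ".join(top))

# Wide output: product_id x description x topic_0 ... topic_9
product_enriched_topics = pd.concat(
    [products[["product_id", "description"]].reset_index(drop=True),
     pd.DataFrame(topic_values, columns=[f"topic_{i}" for i in range(n_topics)])],
    axis=1,
)
```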

The output dataset product_enriched_topics has format:

product_id x description x topic_0 x … x topic_9

As with the TFIDF version, we can now deduce a user’s profile based on the NMF topic values. This first part computes the average topic values for each user, and the output table user_profile_cb has format:

user_id x topic_0 x … x topic_9 x topic_0_normed x … x topic_9_normed:
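The original recipe isn’t shown here; a pandas sketch of the averaging step, assuming a visits DataFrame with columns user_id and product_id, might be:

```python
# A sketch: join the visits with the topic values and average them per user.
topic_cols = [f"topic_{i}" for i in range(10)]

user_profile_cb = (
    visits.merge(product_enriched_topics[["product_id"] + topic_cols], on="product_id")
          .groupby("user_id")[topic_cols]
          .mean()
)
```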

Now, we can rescale those values by the average topic values over all users:
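A sketch of that rescaling, continuing from the previous snippet (the exact normalization convention in the original may differ):

```python
# A sketch: divide each user's topic values by the average topic values
# over all users, producing the *_normed columns.
global_mean = user_profile_cb[topic_cols].mean()

for col in topic_cols:
    user_profile_cb[f"{col}_normed"] = user_profile_cb[col] / global_mean[col]

user_profile_cb = user_profile_cb.reset_index()   # bring user_id back as a column
```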

Finally, we can score new products in batch in Python with the following. The output dataset score_cb then has the format:

user_id x product_id x score_cb x score_cb_normed
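A sketch of that batch scoring, reusing the objects built in the previous snippets:

```python
# A sketch: a dot product between every user profile and every product topic
# vector (fine for a limited catalogue).
import pandas as pd

normed_cols = [f"{c}_normed" for c in topic_cols]

user_matrix = user_profile_cb[normed_cols].to_numpy()            # (n_users, n_topics)
product_matrix = product_enriched_topics[topic_cols].to_numpy()  # (n_products, n_topics)

scores = user_matrix @ product_matrix.T                          # (n_users, n_products)

score_cb = pd.DataFrame(
    scores,
    index=user_profile_cb["user_id"],
    columns=product_enriched_topics["product_id"],
).stack().reset_index()
score_cb.columns = ["user_id", "product_id", "score_cb"]

# Normalize each user's scores by their best score.
score_cb["score_cb_normed"] = (
    score_cb["score_cb"]
    / score_cb.groupby("user_id")["score_cb"].transform("max")
)
```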

Long Format

Starting from the NMF’s output product_enriched_topics, we can reshape it into a long format, product_enriched_topics_2, with schema product_id x topic x value, as follows:
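The reshape itself isn’t embedded here; in Postgres it can be done with a LATERAL unpivot, for example:

```sql
-- A sketch of the unpivot, assuming the ten topic columns topic_0 ... topic_9.
CREATE TABLE product_enriched_topics_2 AS
SELECT
    p.product_id,
    t.topic,
    t.value
FROM product_enriched_topics p
CROSS JOIN LATERAL (VALUES
    ('topic_0', p.topic_0), ('topic_1', p.topic_1),
    ('topic_2', p.topic_2), ('topic_3', p.topic_3),
    ('topic_4', p.topic_4), ('topic_5', p.topic_5),
    ('topic_6', p.topic_6), ('topic_7', p.topic_7),
    ('topic_8', p.topic_8), ('topic_9', p.topic_9)
) AS t(topic, value);
```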

As before, we can now easily compute the user profile, which gives us the table user_profile_cb_2 with format: user_id x topic x value x value_normed
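A sketch of that step (the normalization convention mirrors the wide-format version above and may differ from the original):

```sql
-- A sketch: average topic values per user, rescaled by the average value
-- of each topic over all users.
CREATE TABLE user_profile_cb_2 AS
SELECT
    user_id,
    topic,
    value,
    value / AVG(value) OVER (PARTITION BY topic) AS value_normed
FROM (
    SELECT
        v.user_id,
        p.topic,
        AVG(p.value) AS value
    FROM visit_normalized v
    JOIN product_enriched_topics_2 p
      ON v.product_id = p.product_id
    GROUP BY v.user_id, p.topic
) profile;
```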

Last but not least, we can finally score in SQL:
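A sketch of that scoring query (the output table name score_cb_2 is illustrative):

```sql
-- A sketch: a scalar product over topics between each user profile and each
-- product, which is why it can take a while on a large catalogue.
CREATE TABLE score_cb_2 AS
SELECT
    user_id,
    product_id,
    score_cb,
    score_cb / NULLIF(MAX(score_cb) OVER (PARTITION BY user_id), 0) AS score_cb_normed
FROM (
    SELECT
        u.user_id,
        p.product_id,
        SUM(u.value_normed * p.value) AS score_cb
    FROM user_profile_cb_2 u
    JOIN product_enriched_topics_2 p
      ON u.topic = p.topic
    GROUP BY u.user_id, p.product_id
) scores;
```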

Note that this may take quite some time to run… but in the meantime, read on!

Mixing These Approaches — Philosophy

We just saw how to compute several affinity scores, but you might be asking yourself: OK, which one do I take as my final result? Well, none of them, necessarily.

It’d be silly to arbitrarily choose one. Instead, we should make some “average” of these different scores to compute our final user/product affinity. How can we do that?

The idea is to find the optimal combination of these scores. I’m sure it already rings a bell… Yes, it’s a classical ML problem. We want to predict whether a user will visit/buy/like a given product, and to do that we have a set of features: the affinity scores we just computed.

There is one major difficulty, though: we only ever observe positive examples. The next section walks through how to deal with that by creating negative ones.

Creating the Learning Set

The major difficulty in predicting if a user will visit/buy/like a given product is that we only have positive examples from which to learn. Indeed, we always have the visits, buys, and likes, but in many cases, we don’t have the counterfactual events (i.e., didn’t visit, didn’t buy, didn’t like).

This means that we will have to create those using what we call “negative sampling.” Imagine that you have your visit/buy user data up until a given date T (which is probably today or yesterday). You will compute your various affinity scores using only the data up to a date T − x days, where x is a buffer you choose.

Then, you will be able to create a learning set with the data from T − x days until T. During this period, you have a set of user/product couples that are “true” (i.e., the user bought the product during the period). But you don’t have negative examples, so you will have to create user/product couples that are “false” (i.e., the user didn’t buy the product during the period). For each couple, your features will be the affinity scores you just computed, and the target will be “true” or “false”. You then try to find the combination of scores that best predicts the visits, buys, and likes.

How many “false” couples should you create? Well, it’s a good question, and there is no real answer. You should create enough of them to get an unbalanced dataset, since in reality, the visit, buy, or like events are actually very unlikely to happen. And how can you create “false” couples? There are several strategies:

  • Randomly (not a very good strategy).
  • Randomly, but only picking users that visited, bought, or liked at least one thing during the period. We don’t pick users with only negative examples during that period because otherwise it would be too easy to have a good score, and it would not be very relevant.
  • Excluding couples that happened in the past (if someone already bought something, you don’t want to penalize it); in other words, we don’t take a true example from the past as a fake example.

Now that your learning set is ready, you just have to split it into a train and a test set to create and evaluate your model. For the best possible results, you should do this split on a time basis.

Let’s say now that we have a new column, date, in our visit dataset:

user_id x product_id x visit x date

Here is the SQL code to create a learning set as described above with strategy number three:
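The original query isn’t embedded in this export; a sketch implementing strategy three could look like this (the cut-off dates and the number of sampled negatives are illustrative):

```sql
-- A sketch of strategy 3, assuming the raw "visit" table now has a date column.
CREATE TABLE learning_set AS
WITH positives AS (                       -- couples observed during the target period
    SELECT DISTINCT user_id, product_id, 1 AS target
    FROM visit
    WHERE date >= DATE '2017-12-01'       -- T - x days (illustrative)
      AND date <  DATE '2018-01-01'       -- T (illustrative)
),
past AS (                                 -- couples already seen before the period
    SELECT DISTINCT user_id, product_id
    FROM visit
    WHERE date < DATE '2017-12-01'
),
candidates AS (                           -- random couples, only for users active in the period
    SELECT u.user_id, p.product_id
    FROM (SELECT DISTINCT user_id FROM positives) u
    CROSS JOIN (SELECT DISTINCT product_id FROM visit) p
),
negatives AS (
    SELECT c.user_id, c.product_id, 0 AS target
    FROM candidates c
    LEFT JOIN positives pos USING (user_id, product_id)
    LEFT JOIN past      pst USING (user_id, product_id)
    WHERE pos.target IS NULL              -- not a positive example in the period
      AND pst.user_id IS NULL             -- and never a positive in the past either
    ORDER BY random()
    LIMIT 100000                          -- keep it unbalanced, as in real life
)
SELECT user_id, product_id, target FROM positives
UNION ALL
SELECT user_id, product_id, target FROM negatives;
```

You would then join the affinity scores computed earlier (score_cf_user, score_cf_product, score_cb…) onto these couples to get the features for the model.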

Limitations

With the approach we’ve walked through here, note that once you have trained a model based on your affinity scores, you will have to score all possible user/product couples. This can obviously become extremely expensive if you have a huge catalog of products, which is why this approach is better suited to limited catalogs of products or content. We can think about ways to tackle this issue, but it will always mean limiting the set of available products (that limitation can be drawn from marketing rules, for example).

We also see that, ultimately, the choice of algorithm is very important, because the scoring time will differ completely depending on this choice. A logistic regression could be a good choice here, since it’s a linear combination of the scores and therefore allows really fast scoring.

So there you have it! You now have the basics with which to create a simple recommendation engine. We’ll have some more posts coming up that apply this to some real-life data (sneak preview: it may or may not involve beer [!!]), so stay tuned.

