Some of my current music recommendations on Amazon. Not bad, but not perfect given the music I have been listening to on Amazon Music and/or purchasing lately on their site.

RecoTour: a tour through recommendation algorithms in python

A while ago a friend of mine asked me about approaches that would be useful when optimizing GBMs. I had been asked this question a few times in the past, so I thought I could share some code and write a post about it, just in case someone finds it useful. While re-writing the code, I thought I could give it some context and illustrate how one could optimise a GBM in the process of building a recommendation algorithm. I then finished the script and thought that I could also include some other algorithms and techniques used when building recommendation algorithms. One thing led to another, and the result is RecoTour, a (humble) tour through some recommendation algorithms in python.

Then, over one month after the last commit (you will read why), here I am… writing the post. In this post I will briefly go through the content of the repo, so you can choose which notebooks might be useful for you (if any). I say “briefly” because the repo contains a lot of code and text. Therefore, if you want details, please have a look at the corresponding notebook and let me know if you have any comments.

To start with, you will notice that I picked the dataset of Kaggle’s Ponpare competition. For that competition, we were given historical information about coupon purchases and browsing behaviour, and we needed to predict which coupons/items a customer would buy in a given period of time (for example, the coming week). The main reason for picking this dataset is that it is really rich in terms of preprocessing and feature engineering. In addition, the nature of the problem allows for a number of different representations of the data (tabular, matrix, …), so that different techniques can be used. In the future I will include some other datasets that are better suited to some of the techniques illustrated in the repo.

As I mentioned before, there is a lot of code and content, so I decided to structure the repo in “Chapters”, which are jupyter notebooks and represent the core of the repo. The python scripts that these notebooks are based on are in the py_scripts directory. The notebooks are intended to be self-contained and, in consequence, there is some repetition. The code in the notebooks is, of course, “notebook-oriented”. I have included a more modular, nicer-looking version of a possible “real world solution” in the directory final_recommendations, which is described in detail in Chapter16_final_solution_Recommendations.ipynb.

The notebooks contain plenty of explanations and references to relevant papers or packages, some of which are also included here. My intention throughout was to focus on the code and the techniques, but you will also find some math. All the code in the notebooks was run on a c5.4xlarge instance, or a p2.xlarge instance when running the deep learning algorithms.

This is what you will find in the notebooks:

These chapters include data translation (part of the dataset comes in Japanese), the train/validation/test split and user/item feature engineering. The code in these chapters would be useful for a number of ML-related problems. There are a few things to note. For example, the problem we face here has a time component, so we cannot split the data at random. While this is not uncommon, it adds some interesting considerations when performing the data split that might be useful for some data scientists out there. In addition, the interplay between user and item information allows for a large number of possibilities when doing feature engineering. It is worth mentioning that all the feature engineering in the notebooks is done by hand. Some of the process could have been shortened by using tools such as featuretools.
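To make the time-based split concrete, here is a minimal pandas sketch of the idea (the file name, the column names and the one-week windows are illustrative, not the ones used in the repo):

```python
import pandas as pd

# Hypothetical interactions table: one row per user-coupon interaction,
# with an "i_date" timestamp column (names are illustrative).
interactions = pd.read_csv("interactions.csv", parse_dates=["i_date"])

# Because the problem has a time component, we split by date rather than
# at random: the last week is held out for testing, the week before for
# validation, and everything earlier is used for training.
test_start = interactions["i_date"].max() - pd.Timedelta(days=7)
valid_start = test_start - pd.Timedelta(days=7)

train = interactions[interactions["i_date"] < valid_start]
valid = interactions[(interactions["i_date"] >= valid_start) &
                     (interactions["i_date"] < test_start)]
test = interactions[interactions["i_date"] >= test_start]
```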

This chapter introduces the metric that we will use to evaluate our performance, namely the Mean Average Precision (MAP). For completeness, I decided to include at least one other evaluation metric for recommendation algorithms. With this in mind, there is a script called using_ncdg.py in the py_scripts directory that illustrates how one would use the Normalized Discounted Cumulative Gain (NDCG) for this particular project and dataset.
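For reference, MAP@k is simple enough to write down in a few lines. The snippet below is a standard implementation, not the one used in the repo:

```python
import numpy as np

def apk(actual, predicted, k=10):
    """Average precision at k for a single user."""
    predicted = predicted[:k]
    hits, score = 0, 0.0
    for i, p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            hits += 1
            score += hits / (i + 1.0)
    return score / min(len(actual), k) if actual else 0.0

def mapk(actual_list, predicted_list, k=10):
    """Mean average precision at k over all users."""
    return np.mean([apk(a, p, k) for a, p in zip(actual_list, predicted_list)])

# Toy example: two users, their true purchases and our ranked recommendations.
print(mapk([[1, 3], [2]], [[1, 2, 3], [3, 2, 1]], k=3))
```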

This will be our baseline. All the following algorithms will be compared to the results obtained when recommending the most popular items. Note that the definition of “most popular” allows for some flexibility. With this in mind, there is a discussion in the notebooks regarding the potentially different results one would obtain using different definitions.
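As an illustration of how simple this baseline is, here is one possible (hypothetical) implementation, where “most popular” is defined as the number of distinct users who interacted with each coupon:

```python
import pandas as pd

# Hypothetical training interactions with "user_id" and "coupon_id" columns.
train = pd.read_csv("train_interactions.csv")

# One possible definition of "most popular": the number of distinct users
# who interacted with each coupon during the training period.
popularity = (train.groupby("coupon_id")["user_id"]
              .nunique()
              .sort_values(ascending=False))

# Recommend the same top-10 coupons to every user.
top_10 = popularity.head(10).index.tolist()
recommendations = {user: top_10 for user in train["user_id"].unique()}
```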

  • Chapter 7: user-item distance-based recommendations.

In this approach, users and items are projected onto the same feature space and users are recommended the items that are closest to them. Here one has complete freedom to define the feature space and even the distance metric. This is discussed in the notebook.
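A minimal sketch of the idea, assuming users and items have already been projected onto the same (made-up) feature space, and using cosine distance as one possible metric:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical arrays: users and coupons projected onto the same feature
# space (e.g. one-hot encoded categories, scaled prices, location, ...).
rng = np.random.default_rng(0)
user_feat = rng.random((1000, 20))    # n_users x n_features
item_feat = rng.random((50, 20))      # n_items x n_features

# Distance between every user and every item; cosine is one choice,
# but any metric (euclidean, hamming on one-hot blocks, ...) could be used.
dist = cosine_distances(user_feat, item_feat)

# For each user, recommend the 10 closest items.
top_k = np.argsort(dist, axis=1)[:, :10]
```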

  • Chapter 8: building the interactions/interest data set.

For most of the techniques/algorithms included in the repo we need a measure of “interest”, i.e. how interested a user is in an item. As in previous chapters, we have a lot of freedom in defining “interest”. For example, we might want to bias the measure of interest towards the most recent interactions. In that case we have to include a factor that increases the interest of a user in an item if the interaction with that item occurred recently. Also, when computing the interest, we might not want to consider all items. These and other considerations are discussed in the Chapter 8 notebook.
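As an illustration, this is one possible (made-up) way of defining interest with a recency factor, using an exponential decay on the number of days since the interaction:

```python
import numpy as np
import pandas as pd

# Hypothetical interactions with a "days_to_present" column: how many days
# before the end of the training period the interaction happened.
interactions = pd.DataFrame({
    "user_id": [1, 1, 2],
    "coupon_id": [10, 11, 10],
    "days_to_present": [2, 40, 15],
})

# One possible definition of interest: a base value of 1 per interaction,
# exponentially decayed so that recent interactions weigh more.
decay = 0.01
interactions["interest"] = np.exp(-decay * interactions["days_to_present"])

# Aggregate to one interest score per (user, coupon) pair.
interest = (interactions.groupby(["user_id", "coupon_id"])["interest"]
            .sum()
            .reset_index())
```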

  • Chapter 9: KNN item-based collaborative filtering (CF).

Our good, old (and normally very well functioning) friend. There are tons of literature explaining KNN CF and you can find some details in the notebook, so I will not go into more detail here.

However, as straightforward as this approach might sound (and it normally is), there is a caveat when addressing the “Ponpare problem”, and in general any scenario where the items to recommend have never been seen before (they are… cold). I have included in the repo a way around this issue that uses a measure of item similarity.
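The sketch below illustrates the general idea, not the repo’s exact implementation: standard item-based CF on the seen items, plus a content-based mapping from cold items to their most similar seen item (all arrays are random stand-ins):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical inputs (shapes only, not the repo's actual data).
rng = np.random.default_rng(0)
interest = rng.random((500, 300))        # users x seen items (interest matrix)
seen_item_feat = rng.random((300, 25))   # content features of seen items
new_item_feat = rng.random((40, 25))     # content features of cold items

# Standard item-based CF: similarity between seen items from the
# interactions, and a predicted score as interest x similarity.
item_sim = cosine_similarity(interest.T)          # seen x seen
scores_seen = interest.dot(item_sim)              # users x seen items

# Cold-start workaround: map every new item to its most similar seen item
# using content features, and inherit that item's scores.
content_sim = cosine_similarity(new_item_feat, seen_item_feat)  # new x seen
closest_seen = content_sim.argmax(axis=1)
scores_new = scores_seen[:, closest_seen]         # users x new items

# Recommend the 10 new items with the highest inherited score per user.
top_k = np.argsort(-scores_new, axis=1)[:, :10]
```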

In reality this might not be the best-suited technique for this particular problem, but I thought that any “tour” through recommendation algorithms that does not at least illustrate the use of CF is not complete.

This is actually the code that I initially aimed to share when my good friend asked me about optimizing GBMs.

Here, we simply recommend items based on a regression using a GBM, in particular LightGBM. LightGBM has become one of my favorite algorithms/libraries in recent times and I believe every data scientist should have it in their repertoire, along with xgboost and catboost.

The repo includes a full “tutorial” on how to optimise a GBM using hyperopt, as well as code related to model interpretability using LIME and SHAP. In previous chapters I compare hyperopt with skopt. The results are mostly identical, so choosing one or the other is a matter of preference, I guess. (There are a number of libraries emerging these days that you might find useful for hyperparameter optimization. One I started to use recently is hyperparameter_hunter.)
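To give a flavour of what the hyperopt part looks like, here is a minimal, self-contained sketch (toy data, a deliberately small search space and arbitrary parameter ranges; it is not the repo’s tutorial):

```python
import numpy as np
from hyperopt import fmin, tpe, hp, Trials
from lightgbm import LGBMRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score

# Toy data standing in for the user/item feature matrix and the interest target.
X, y = make_regression(n_samples=2000, n_features=20, random_state=0)

# Search space: a small subset of LightGBM's parameters, purely illustrative.
space = {
    "learning_rate": hp.uniform("learning_rate", 0.01, 0.3),
    "num_leaves": hp.quniform("num_leaves", 16, 256, 4),
    "min_child_samples": hp.quniform("min_child_samples", 10, 100, 10),
    "colsample_bytree": hp.uniform("colsample_bytree", 0.5, 1.0),
}

def objective(params):
    # hp.quniform returns floats, so cast the integer parameters.
    params["num_leaves"] = int(params["num_leaves"])
    params["min_child_samples"] = int(params["min_child_samples"])
    model = LGBMRegressor(n_estimators=300, **params)
    # hyperopt minimises the objective, so return the positive MSE.
    score = cross_val_score(model, X, y, cv=3,
                            scoring="neg_mean_squared_error").mean()
    return -score

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=trials)
print(best)
```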

I can only say that the approach described in this Chapter has worked well for me in the past, having gone to production a couple of times.

Like KNN CF, this is a well known, widely used algorithm with a lot of literature and examples available online. Nonetheless, I have included an explanation with some math in the notebook and a link to a good tutorial.

This technique suffers from the same “limitations” as the ones described in Chapter 9, and might not be the best solution for this particular problem. Still, this algorithm has to be included in any comprehensive list of recommendation algorithms.

The text in this notebook is partially based on this paper, in which the authors present the xlearn package: a nice package that implements a linear model (LR), factorization machines (FM), and field-aware factorization machines (we’ll come back to these later). Note that since I wrote the notebook (nearly three months ago) a lot has happened around the package, so chances are that the memory limitations and drawbacks I experienced while using it are now gone. In general terms, I recommend that anyone interested in recommendation algorithms gets familiar with both the paper and the package.
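For orientation, training an FM with xlearn looks roughly like the snippet below (file paths and parameter values are placeholders; the input files are expected in libsvm format):

```python
import xlearn as xl

# Factorization machine on libsvm-formatted files (paths are placeholders).
fm_model = xl.create_fm()
fm_model.setTrain("train.libsvm")
fm_model.setValidate("valid.libsvm")

param = {"task": "reg",      # regression on the "interest" target
         "lr": 0.2,
         "lambda": 0.002,
         "k": 8,             # number of latent factors
         "metric": "rmse"}

fm_model.fit(param, "fm_model.out")

fm_model.setTest("test.libsvm")
fm_model.predict("fm_model.out", "fm_predictions.txt")
```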

This is an extension of the FMs described in the Chapter 12 notebook. Here, latent factors are also learned relative to the field corresponding to a given feature. For example, you might have a feature called “Advertiser”, and a value of that feature might be “Adidas”. When computing the latent factors, the fact that “Adidas” is a value of the feature “Advertiser” is encoded in the data. This might sound complex, but the paper is quite accessible and things will become clearer after reading it. In addition, I have included an explanation in the notebook, with code on how to build the dataset. Altogether, I hope it helps to understand how the field information is included when preparing the data.
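As a toy illustration of the data preparation (the field and feature indices below are made up), each row in the libffm format encodes the field as well as the feature:

```python
# libffm format: each entry is <field_index>:<feature_index>:<value>.
# The field index is what distinguishes FFM input from plain FM input.

# Hypothetical mappings built over the training data.
field_idx = {"Advertiser": 0, "Category": 1}
feature_idx = {("Advertiser", "Adidas"): 0,
               ("Advertiser", "Nike"): 1,
               ("Category", "Sports"): 2}

def to_ffm_row(target, row):
    """row is a dict like {"Advertiser": "Adidas", "Category": "Sports"}."""
    entries = []
    for field, value in row.items():
        entries.append("{}:{}:1".format(field_idx[field],
                                        feature_idx[(field, value)]))
    return "{} {}".format(target, " ".join(entries))

print(to_ffm_row(1, {"Advertiser": "Adidas", "Category": "Sports"}))
# -> "1 0:0:1 1:2:1"
```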

This is the first deep learning algorithm I include. Here, a series of dense layers (deep) is combined with a linear model (wide) in a regression or classification exercise. Given the nature and size of the data, this technique is certainly not well suited to this particular problem, as I discuss in the notebook. However, I thought it would be useful to include it and see how a relatively simple deep learning approach compares to other, more “traditional” approaches.

In the past, I have tried this technique on bigger, “real-world” datasets and I can say that it was the only technique that performed better than the GBM approach described before. Given the relative simplicity of the model and how accessible it is to build DL models these days, I would recommend giving it a go if your dataset is suitable.

Note that even though I described the “deep side” of the model as a series of dense layers, it can be anything you want, such as a convolutional or recurrent neural network if necessary. I find this technique really flexible and, given adequate data, quite powerful.
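For the curious, here is a minimal PyTorch sketch of the wide and deep idea (an illustration, not the implementation used in the repo):

```python
import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    """Minimal wide & deep: a linear model over sparse "wide" features plus
    an MLP over dense/embedded "deep" features, summed into one output."""

    def __init__(self, wide_dim, deep_dim, hidden=(64, 32)):
        super().__init__()
        self.wide = nn.Linear(wide_dim, 1)
        layers, in_dim = [], deep_dim
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        layers.append(nn.Linear(in_dim, 1))
        self.deep = nn.Sequential(*layers)

    def forward(self, x_wide, x_deep):
        return self.wide(x_wide) + self.deep(x_deep)

# Toy forward pass with random tensors (dimensions are illustrative).
model = WideAndDeep(wide_dim=100, deep_dim=30)
out = model(torch.rand(8, 100), torch.rand(8, 30))
print(out.shape)  # torch.Size([8, 1])
```

Replacing `self.deep` with a convolutional or recurrent network is exactly the kind of flexibility mentioned above.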

After having used all the previous algorithms, we have to pick one as “the winner” (surprise, surprise: lightGBM) and build what could be a potential final solution. Details on the solution and the reasoning behind the process of building the recommendation algorithm can be found in the Chapter 16 notebook. A companion, modular version of that solution is included in the directory final_recommendations.

One of the reasons why it has taken me some time to write this post is that I wanted to include some content on NCF. When I first read the paper I thought that it was a natural extension of Wide and Deep. Also, given the relative simplicity of the model, I thought I could illustrate the use of different DL frameworks, namely Keras (the Keras code is mostly based on the code released with the original paper, so full credit to the authors), Gluon and Pytorch. Little by little the size of the code grew, so I decided to split the process into two parts: 1) in a separate repo, illustrate the use of the 3 DL frameworks with the well-known movielens dataset and compare the results with those in the original paper, and 2) use the model on the Ponpare dataset and compare it with the previous algorithms described in the notebooks.

So far I have only completed 1), and the results are rather interesting. The figure below shows the Hit Ratio (HR) and Normalised Discounted Cumulative Gain (NDCG) at k=10 for a multi-layer perceptron (MLP) and a generalised matrix factorisation (GMF) model as a function of the number of embeddings. The training time per number of embeddings is also shown in the figure for the MLP model.

Top: Hit Ratio (HR) and Normalised Discounted Cumulative Gain (NDCG) at k=10 for both the GMF and MLP models vs the number of embeddings. Bottom: training time for the MLP model per number of embeddings

In general terms, I find that these results deserve some more attention. I managed to reproduce the results in the paper, which is reassuring. However, leaving aside the running times, the performance of the models when using the different DL frameworks is drastically different. There is some discussion in this notebook in the repo but, at this stage, I can’t find a good reason to explain such different results.
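To give an idea of how small the models involved are, here is a minimal PyTorch sketch of the GMF component of NCF, i.e. user and item embeddings combined with an element-wise product (an illustration, not the code from either repo):

```python
import torch
import torch.nn as nn

class GMF(nn.Module):
    """Generalised matrix factorisation: element-wise product of user and
    item embeddings, followed by a single linear layer."""

    def __init__(self, n_users, n_items, n_emb=8):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, n_emb)
        self.item_emb = nn.Embedding(n_items, n_emb)
        self.out = nn.Linear(n_emb, 1)

    def forward(self, users, items):
        prod = self.user_emb(users) * self.item_emb(items)
        return torch.sigmoid(self.out(prod))

# Toy forward pass: scores for 4 (user, item) pairs.
model = GMF(n_users=1000, n_items=500, n_emb=8)
scores = model(torch.tensor([0, 1, 2, 3]), torch.tensor([10, 20, 30, 40]))
```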

Future “Work”

As I mentioned before, one of the reasons why it took me some time to sit down and write this post is that I wanted to include NCF. Since then (around 6 weeks ago) I have been traveling around the world, and I will be for some time. Therefore, while it is unlikely that I will add any content any time soon, this is what I intend to add as soon as I have the time:

  • Other datasets: I want to add datasets that are better suited for DL techniques, e.g. datasets including text and/or images.
  • RNN-based recommendations: a lot of recommendation algorithms can be framed as a sequence of steps that the users follow as they navigate websites or apps. Such a representation is well suited for RNN-based recommendations.
  • Bayesian techniques: I would like to explore the performance of a Bayesian regression compared to other methods. Also, I would like to use the opportunity to “play” with Pyro.

And for the time being that’s it. In general, I hope someone finds the content of the repos useful.

All the code referred to in this post can be found here and here. In case you have any comments, email me: jrzaurin@gmail.com