The Best-Worst Recommendation System: PySpark with more data than you can handle. An Abstract Explanation on Recommendations.

Data Ninja
Solving the Human Problem
7 min read · Sep 19, 2020

Be ready to get naughty with the data, and don’t forget to clap if you like it.

Photo by Jukan Tateisi on Unsplash

From Netflix’s movie recommendations to Amazon’s purchase suggestions to TikTok’s video feed, recommendations are one of the most important pieces of any modern machine learning system. But the burning question is:

How do you create a recommendation system that hits these three marks:

  1. Considers user profile information (age, gender, etc.);
  2. Considers users’ past behaviour; and
  3. Handles more data than fits in memory.

It sounds obvious that a recommendation system should consider (at least) these three points, but the examples and explanations you find often tackle only one of them, not all three together. For example, this Kaggle notebook (link) for Airbnb booking recommendation is a tightly put together example of point 1, but makes no mention of points 2 or 3. This excellent Kaggle notebook (link) for movie recommendation explains point 2 much better than I could, but makes no reference to points 1 or 3. And this TowardsDataScience article (link) focuses on point 3 but skips 1 and 2.

In this article I will share some code and ideas on how to implement a simple and intuitive recommendation system that takes into account user data (content-based filtering) and users’ past purchases (collaborative filtering), and that can handle more data than fits in memory (via PySpark).

In this article you will learn how the following pieces work (together with some Python code):

  1. Content-Based Filtering — Recommendations based on user information
  2. Collaborative Filtering — Recommendations based on user purchases
  3. PySpark 3.0.0 — The Python API for Apache Spark, an engine for distributed computation (which can run on top of Hadoop, among other setups).

Finally, in the git repository for this article you will find, beyond the code for the models and metrics, a neat report on how the model was produced, the metrics and exploration used to find it, and a link to download some artificial data so you can run your own models with PySpark. Here is the repo:

A Hybrid Recommendation System

By a hybrid system I mean one that uses both content-based and collaborative filtering. Since these are our building blocks, let’s dive into each of them.

Content-Based Filtering

The name content-based comes from the idea that an item (or piece of content) to be recommended has features, and those features are used to predict which items will best suit different users. We will use this idea, but instead of recommending items to users, we will recommend users to items.

The goal is to learn the different users’ characteristics (or features) and predict which items each user is most likely to purchase. The prediction is usually made by fitting a model to the data, say a logistic regression for classification (more about that later), and predicting users’ next purchases from the training data. These features are often characteristics like age, location, gender, operating system, etc.

K-means convergence gif.

To keep this article abstract, I provided some data (see the linked git repository) where the users’ characteristics are simply called feature1, feature2, ..., feature6 and their values are all categorical (ranging from A1, A2 to F1, F2, …, F10). Since all the data is categorical, the strategy will be to one-hot encode it. A quick note: in general it is a very bad idea to one-hot encode all your features; think of a feature like age or income. For your own problem, you should find ways to keep the dimensions of your feature space as small as possible. Here is an article with a few ideas on alternatives to one-hot encoding. That said, we will keep all our variables one-hot encoded.
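
For instance, a numeric feature like age is usually better bucketed into a handful of ranges than one-hot encoded value by value. Here is a minimal sketch using Spark’s Bucketizer (the column name and split points are my own illustration, not part of the article’s data):

from pyspark.ml.feature import Bucketizer

# Bucket a hypothetical numeric 'age' column into a few ranges instead of
# one-hot encoding every distinct value (split points are illustrative).
age_bucketizer = Bucketizer(
    splits=[float("-inf"), 18, 30, 45, 60, float("inf")],
    inputCol="age",
    outputCol="age_bucket",
)
# users_df = age_bucketizer.transform(users_df)  # users_df is a hypothetical DataFrame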

The idea of ‘Content Based Filtering’ is that users that have similar characteristics should have similar tastes.

PySpark Code Example for K-Means Fitting

The model we will fit to our features is a clustering algorithm called K-means. The idea is that users with similar characteristics should have similar tastes. Moreover, we choose K-means for its relatively low complexity, which leads to faster training times. I have a longer discussion of the complexity and training time in the pdf accompanying the git repository, here is the link. In any case, if you are following this article for your own use case, the same Python code discussed below applies to most of the machine learning models in Spark; see https://spark.apache.org/docs/0.9.0/mllib-guide.html for a list of available models.

If you need help installing Spark on your machine, you can check my setup here.

Before calling Spark from Python, you need to create a PySpark session (the context that will receive your Spark commands). Here is how:
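
A minimal sketch (the app name and the local[*] master are my own choices for running locally, not requirements from the repo):

from pyspark.sql import SparkSession

# Start a local Spark 3.x session; in Spark 3 the SparkSession is the usual
# entry point, and the SparkContext hangs off it.
spark = (SparkSession.builder
         .appName("hybrid-recommender")
         .master("local[*]")
         .getOrCreate())

sc = spark.sparkContext  # the underlying Spark context, if you need it directly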

With your Spark session up, it is just a matter of applying the one-hot encoder to the data and fitting the K-means model. Here’s how:
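
A sketch of that step, assuming the user data sits in a CSV with a userid column plus feature1, ..., feature6 (the file path and k=10 are illustrative choices, not values from the repo):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.clustering import KMeans

users = spark.read.csv("data/users.csv", header=True, inferSchema=True)

feature_cols = [f"feature{i}" for i in range(1, 7)]

# Index each categorical column, one-hot encode the indices,
# assemble everything into a single vector and cluster with K-means.
indexers = [StringIndexer(inputCol=c, outputCol=f"{c}_idx", handleInvalid="keep")
            for c in feature_cols]
encoder = OneHotEncoder(inputCols=[f"{c}_idx" for c in feature_cols],
                        outputCols=[f"{c}_vec" for c in feature_cols])
assembler = VectorAssembler(inputCols=[f"{c}_vec" for c in feature_cols],
                            outputCol="features")
kmeans = KMeans(k=10, seed=42, featuresCol="features", predictionCol="cluster")

pipeline = Pipeline(stages=indexers + [encoder, assembler, kmeans])
model = pipeline.fit(users)
clustered = model.transform(users)  # each user now carries a 'cluster' column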

The next step is finding the most popular items per cluster; given any user, we find their cluster and recommend the most popular items within it (a quick sketch of this step is just below). Feel free to see the details in this notebook in the reference repository. With that, we are done with the content-based piece of the system and can jump to the collaborative piece.
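
A rough sketch of the popularity step, assuming a purchases table with columns userid, itemid and purchase_count (the path and column names are illustrative):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

purchases = spark.read.csv("data/purchases.csv", header=True, inferSchema=True)

# Aggregate how often each item was bought inside each cluster.
cluster_items = (clustered.select("userid", "cluster")
                 .join(purchases, on="userid")
                 .groupBy("cluster", "itemid")
                 .agg(F.sum("purchase_count").alias("popularity")))

# Keep the top 10 items per cluster as that cluster's recommendation list.
w = Window.partitionBy("cluster").orderBy(F.desc("popularity"))
top_per_cluster = (cluster_items
                   .withColumn("rank", F.row_number().over(w))
                   .filter(F.col("rank") <= 10))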

The idea of ‘Collaborative Filtering’ is that users with similar purchase patterns should be interested in similar items.

Collaborative Filtering

The idea of collaborative filtering is that users with similar purchase patterns should be interested in similar items. We will approach this by using the number of times a user bought an item as a proxy for that user’s rating of the item, and by applying what is called the Alternating Least Squares (ALS) method to infer user preferences for items they did not purchase.

Given the historical purchases of each user, we build a ratings matrix as follows: each column represents an item and each row a user; the cells of the matrix are the ratings a user explicitly enters into the system (in our case we do not have ratings but rather numbers of purchases, which is called implicit rating). Call this matrix R. The goal of ALS is to find matrices U and I such that their product approximates the ratings matrix, i.e., R ≈ U × I. The matrix U is called the user factors, while the values in I are the item factors. The rank is the dimension of the user-factor space (the number of columns of U), which is the same as the number of rows of I. The next image illustrates this process for users and movie ratings (image from Databricks):

Alternating Least Squares (ALS) depicted for users and movie ratings. Given a ratings matrix R, the goal of ALS is to find matrices U and I such that their product approximates it, i.e., R ≈ U × I.

Once more we will use PySpark to clean our data and fit an ALS model:
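
A sketch, reusing the purchases DataFrame from the popularity step above (rank, regParam and the implicit-feedback setting are illustrative; the ids are assumed to be integers, as Spark’s ALS requires):

from pyspark.ml.recommendation import ALS

# Purchase counts act as implicit ratings, hence implicitPrefs=True;
# coldStartStrategy="drop" avoids NaN predictions for unseen users/items.
als = ALS(userCol="userid", itemCol="itemid", ratingCol="purchase_count",
          rank=10, regParam=0.1, implicitPrefs=True,
          coldStartStrategy="drop")
als_model = als.fit(purchases)

# Top 10 item recommendations for every user seen during training.
user_recs = als_model.recommendForAllUsers(10)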

With our content-based and collaborative filtering pieces fitted to the data, we can go ahead and build a system that makes use of both. Here is a quick diagram of what it looks like:

Hybrid recommendation system. A userid is supplied to both the ALS module and the K-means module. The K-means module assigns the user to a cluster based on their profile, builds the list of the most popular items among users in that cluster, filters out items the user has already bought, and returns a list of suggested items. The ALS module creates a recommendation list from the user factors computed from all users’ past purchase history. The final list is the one provided by ALS if it is not empty; otherwise it is the K-means recommendation list.
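
In code, the final fallback logic can be sketched like this (the two helper functions are hypothetical wrappers around the ALS and K-means pieces above, not functions from the repo):

def recommend(userid, n=10):
    # Prefer the collaborative (ALS) list when the user has purchase history...
    als_list = als_recommendations(userid, n)      # hypothetical ALS wrapper
    if als_list:
        return als_list
    # ...otherwise fall back to the most popular items in the user's cluster.
    return kmeans_recommendations(userid, n)       # hypothetical K-means wrapper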

I call this system the Best-Worst recommendation system not because of its structure or models, but due to the data provided and the complexity of the problem.

System Performance

I call this system the Best-Worst recommendation system not because of its structure or models, but due to the data provided and the complexity of the problem. The git repository accompanying this article provides a (realistic, albeit artificial) dataset that turns out to have a sparse feature space. The result was a model that got only ~1% of the recommendations right, but despite the poor performance on the given data, the model is a great cornerstone for Spark-based models, and it should perform much better on more personalized data (where feature1, ..., feature6 become age, income, location, gender, and so on).

Conclusion

In this article I described a simple recommendation system using PySpark that considers users’ profile information and past preferences.

One interesting aspect of this system is its simple and intuitive way of merging collaborative and content-based filtering in situations where some traditional machine learning models would take too long to fit to the data.

This article is a summary of my git repo project, where you will find a few Jupyter notebooks, Python code, data and a full pdf report with even more insights on how to put your own recommendation system together.

Please share and clap for this article if you want to see more content like this.

Let me know about your own recommender systems! Did you get results similar to mine? I am on Twitter @solvingthehuman.

More?

If you want to read:

  • How to set up your own Spark — Go here
  • Maybe you want to learn a little about Bayesian Statistics? — Go here
