Comparing matrix factorization with transformers for MovieLens recommendations using PyTorch-accelerated

Chris Hughes
Data Science at Microsoft
21 min read · Jan 4, 2022

Whether we are shopping online, listening to music, or even pondering which news article to read, recommendation systems play an important part in preventing us from being overwhelmed by choice and in exposing us to content that would otherwise be difficult to discover. Providing good recommendations is central to an excellent customer experience and is one of the most effective ways to build user investment and loyalty in a service or product; so much so that whole business models are built around providing the best possible recommendations, making these systems crucial to those companies' profitability! As such, it is no surprise that, in Microsoft CSE, implementing state-of-the-art recommendation techniques is a popular request from our customers; one such project was my first engagement!

Here, I aim to demonstrate how transformers can be used to predict ratings from sequences of past behavior, and to see how this compares with the more widely known matrix factorization approach; transformers are driving significant advancements in the NLP, vision, and time series domains, so it's about time they came to recommenders, right? The approach I explore is inspired by this paper, with some slight tweaks to the task and the architecture.


To minimize the amount of boilerplate code required, and to ensure that the focus remains clearly on the models, I use the PyTorch-accelerated library to handle the training process, which lets us scale our training loop to multiple GPUs with no code changes. If you are unfamiliar with PyTorch-accelerated and would like to learn more about it before diving into this article, please check out the introductory blog post or the docs; alternatively, the library is very simple, and a lack of familiarity with it should not impair your understanding of the content explored here! The main focus of this article is on the implementation and comparison of the different approaches; therefore, while we shall touch upon a wide range of topics and techniques, only an intuition behind these methods shall be provided, with links to additional sources for you to learn more.

So, let’s dive in!

Tl;dr: If you just want to see some working code that you can use directly, all of the code required to replicate this post is available as a GitHub gist here. While gists are used as code snippets throughout this article, this is primarily for aesthetic reasons, and these snippets may not work as intended if copied directly. For working implementations, please defer to the notebook in the gist linked above.

The packages used are:

  • pandas
  • PyTorch (torch)
  • pytorch-accelerated
  • torchmetrics
  • scikit-learn

Loading and preparing the data

For our dataset, we shall use MovieLens-1M, a collection of one million ratings from 6,000 users on 4,000 movies. This dataset was collected and is maintained by GroupLens, a research group at the University of Minnesota, and released in 2003; it has been frequently used in the Machine Learning community and is commonly presented as a benchmark in academic papers.

MovieLens consists of three files, ‘movies.dat’, ‘users.dat’, and ‘ratings.dat’, which have the following formats:

The Users, Movies, and Ratings tables contained in the MovieLens dataset. Each file is '::'-delimited: users.dat contains UserID::Gender::Age::Occupation::Zip-code, movies.dat contains MovieID::Title::Genres, and ratings.dat contains UserID::MovieID::Rating::Timestamp.

Let’s combine some of this information into a single DataFrame, to make it easier for us to work with.

Using pandas, we can print some high-level statistics about the dataset, which may be useful to us.
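For example, something along these lines, assuming the merged DataFrame above:

```python
# Rating distribution, plus the fewest ratings given by any user
# and received by any movie.
print(full_df["rating"].describe())
print(full_df.groupby("user_id")["rating"].count().min())
print(full_df.groupby("movie_id")["rating"].count().min())
```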

From this, we can see that all ratings are between 1 and 5 and every item has been rated at least once. As every user has rated at least 20 movies, we don't have to worry about how to recommend items to a user about whose preferences we know nothing — the 'cold start' problem — but this is often not the case in the real world!

Splitting into training and validation sets

Before we start modeling, we need to split this dataset into training and validation sets. Often, splitting the dataset is done by randomly sampling a selection of rows, which is a good approach in some cases. However, as we intend to train a transformer model on sequences of ratings, this approach will not work for our purposes. Simply removing a set of random rows would be a poor representation of the task that we are trying to model, as it is likely that, for some users, ratings from the middle of a sequence would end up in the validation set. To understand why this would be a problem, let's consider a small example.

Suppose that a user watches, and rates, the following sequence of movies and, after looking at the training set, we would like to predict how the user will rate Rocky during evaluation on the validation set.

Given that we know that the user enjoyed Rocky 2, as this appeared in the training set, it is likely that the user will enjoy Rocky; therefore, we would predict that the user will rate this movie highly. However, as the user watched Rocky 2 after watching Rocky, in the real world, we would not have had access to this information at the point that the user is rating Rocky and having access to subsequent information enables us to ‘cheat’ at the task in this case. This is known as data leakage, which can artificially boost performance metrics, but is not representative of what we are likely to observe in a production scenario. To avoid this, there are a few different approaches that we could take.

One approach would be to define the validation set by sampling users: randomly selecting a subset of users and ensuring that all the ratings made by these users are contained in the validation set. While this would work, the model would learn nothing about this set of users during training, so we would be treating every item in the validation set as if it had been rated by a new user of the system. As such, the only way that the model would be able to predict ratings for these users would be based on the preferences of other, similar users. Therefore, while evaluation on new users should definitely be considered as part of a wider evaluation strategy for a recommendation engine going into production, it would provide a suboptimal representation of whether the models we are going to explore can learn the task in our case.

An alternative approach would be to use a strategy known as 'leave-one-out' validation, in which we select the last chronological rating for each user, provided that they have rated more than a defined threshold of items. As this is a good representation of the task we are trying to model, this is the approach we shall use here. Let's define a function to get the last n ratings for each user:
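A sketch of such a function, assuming the full_df DataFrame from earlier with its timestamp column:

```python
def get_last_n_ratings(df, n, user_colname="user_id", timestamp_colname="timestamp"):
    # Sort chronologically, then keep the final n rows for each user.
    return (
        df.sort_values(by=timestamp_colname)
        .groupby(user_colname)
        .tail(n)
        .sort_values(by=user_colname)
    )
```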

We can now use this to define another function to mark the last n ratings per user as our validation set; representing this using the is_valid column:
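Building on get_last_n_ratings, this could look as follows; the application to full_df, and the resulting train/validation split, are included here for later use:

```python
def mark_last_n_ratings_as_validation_set(df, n):
    # Flag the last n chronological ratings of each user as validation rows.
    df["is_valid"] = False
    df.loc[get_last_n_ratings(df, n).index, "is_valid"] = True
    return df

full_df = mark_last_n_ratings_as_validation_set(full_df, n=1)
train_df = full_df[~full_df["is_valid"]]
valid_df = full_df[full_df["is_valid"]]
```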

Applying this to our DataFrame, we can see that we now have a validation set of 6,040 rows — one for each user.

Even when considering model benchmarks on the same dataset, to have a fair comparison, it is important to understand how the data has been split and to make sure that the approaches taken are consistent!

Evaluating our recommender models

Evaluating recommendation systems is a highly complex task, which is largely out of the scope of this article but has been comprehensively explored elsewhere! Here, we shall evaluate our models using three different metrics and, as there are many articles available discussing the intuition behind these metrics in more detail, we present only a brief overview here, with the corresponding formulas shown after the list:

  • Mean Absolute Error (MAE): the average of the absolute differences between a set of values and predictions.
  • Mean Squared Error (MSE): the average of the squared differences between a set of values and predictions.
  • Root Mean Squared Error (RMSE): the square root of the MSE, such that it is the same order of magnitude as the MAE.
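For a set of n true ratings y and predictions ŷ, these are:

MAE = (1/n) · Σ |yᵢ − ŷᵢ|
MSE = (1/n) · Σ (yᵢ − ŷᵢ)²
RMSE = √MSE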

Whether this is the best approach is debatable, but it is a common method of evaluating recommendation models and is frequently seen in both academic and production settings. Personally, I primarily look at MAE as the overall measure of performance due to its consistent interpretability, as opposed to RMSE, which tends to penalize larger errors more heavily. This is discussed in more detail here.

Creating a baseline model

When starting a new modeling task, it is often a good idea to create a very simple model — known as a baseline model — to perform the task in a straightforward way that requires minimal effort to implement. We can then use the metrics from this model as a comparison for all future approaches; if a complex model is getting worse results than the baseline model, this is a bad sign!

Here, an approach that we can use for this is to simply predict the average rating for every movie, irrespective of context. As the mean can be heavily affected by outliers, let’s use the median for this. We can easily calculate the median rating from our training set as follows:
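Assuming the train_df split defined earlier:

```python
predicted_rating = train_df["rating"].median()
```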

We can then use this as the prediction for every rating in the validation set and calculate our metrics:
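A simple NumPy sketch of this evaluation:

```python
import numpy as np

labels = valid_df["rating"].values
# Predict the training-set median for every validation rating.
predictions = np.full(len(labels), predicted_rating, dtype=np.float64)

mae = np.abs(labels - predictions).mean()
mse = ((labels - predictions) ** 2).mean()
rmse = np.sqrt(mse)
print(f"MAE: {mae:.3f}, MSE: {mse:.3f}, RMSE: {rmse:.3f}")
```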

These numbers are pretty meaningless on their own but will be useful as a comparison for the approaches that we are going to explore.

Matrix factorization with bias

One very popular approach toward recommendations, both in academia and industry, is matrix factorization. Let’s explore the intuition behind the approach; a more comprehensive treatment is given here. In addition to representing recommendations in a table, such as our DataFrame, an alternative view would be to represent a set of user-item ratings as a matrix. We can visualize this on a sample of our data as presented below:
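The original post visualizes this with a sample of the real data; as a purely illustrative stand-in (with made-up values, '?' marking unrated entries), such a matrix might look like this:

             Toy Story   Heat   Casino
    user 1       5         ?       3
    user 2       ?         4       1
    user 3       4         ?       ?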

As not every user will have rated every movie, we can see that some values are missing. Therefore, we can formulate our recommendation problem in the following way:

How can we fill in the blanks, such that the values are consistent with the existing ratings in the matrix?

One way that we can approach this is by considering that there are two smaller matrices that can be multiplied together to make our ratings matrix, as visualized below:

The key idea is that each row of the user features matrix U represents the preferences of an individual user, and the columns of the movie feature matrix M represent the features of each movie. At this point, you may be wondering how exactly we find these matrices. While there are analytical ways of doing this — which tend to be quite computationally intensive — as we are doing Machine Learning, there is only one answer: we initialize them with random values and use some variant of gradient descent to learn them!

This idea, that there are some latent values that represent an individual user or a movie, is analogous to the use of embeddings in computer vision or NLP — where a learned vector is used to represent the features of an image or a word.

So far, we have the formula:
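R ≈ U × M,  i.e.  rating(u, m) = Uᵤ · Mₘ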

where the product of each row of matrix U with each column of matrix M corresponds to that user's preference for that movie.

However, while this will work, we can often get better results by incorporating bias terms, as follows:
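rating(u, m) = Uᵤ · Mₘ + bᵤ + bₘ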

Here, each bias term is a vector containing a single feature corresponding to each user, or movie, respectively. The introduction of the movie bias term enables the model to learn a feature representing how that movie tends to be rated, compared to the average across all other movies. Similarly, the user bias can be learned to represent that user's tendency to give higher or lower ratings than the average. For example, a grumpy user who consistently rates movies a star lower than the average reviewer would have a negative bias term, which, when added to each prediction, adjusts it to account for this tendency.

If all of that seems a little confusing, don’t worry! It will become much clearer when we express it in code.

PyTorch implementation

Before we think about training a model, we first need to get the data into the correct format. Currently, we have a title that represents each movie, which is a string; we need to convert this to an integer format so that we can feed it into the model. While we already have an ID representing each user, let’s also create our own encoding for this. I generally find it good practice to control all the encodings related to a training process, rather than relying on predefined ID systems defined elsewhere; you will be surprised how many IDs that are supposed to be immutable and unique turn out to be otherwise in the real world!

Here, we can do this very simply by enumerating every unique value for both users and movies. We can create lookup tables for this as shown below:
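A minimal sketch; index 0 is deliberately left unused here, as we will reserve it for padding when we define the embedding layers below:

```python
user_lookup = {v: i + 1 for i, v in enumerate(full_df["user_id"].unique())}
movie_lookup = {v: i + 1 for i, v in enumerate(full_df["title"].unique())}
```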

Now that we can encode our features, as we are using PyTorch, we need to define a Dataset to wrap our DataFrame and return the user-item ratings.
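One way of writing this, assuming the lookup tables above:

```python
import torch
from torch.utils.data import Dataset

class UserItemRatingDataset(Dataset):
    def __init__(self, df, movie_lookup, user_lookup):
        self.df = df
        self.movie_lookup = movie_lookup
        self.user_lookup = user_lookup

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        row = self.df.iloc[index]
        user_id = self.user_lookup[row["user_id"]]
        movie_id = self.movie_lookup[row["title"]]
        rating = torch.tensor(row["rating"], dtype=torch.float32)
        return (user_id, movie_id), rating
```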

We can now use this to create our training and validation datasets:
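```python
train_dataset = UserItemRatingDataset(train_df, movie_lookup, user_lookup)
valid_dataset = UserItemRatingDataset(valid_df, movie_lookup, user_lookup)
```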

Next, let’s define the model.

As we can see, this is very simple to define. Note that, because an embedding layer is simply a lookup table, when we specify its size it must be large enough to accommodate every value that will be seen during training and evaluation; because of this, we size it using the number of unique items observed in the full dataset, not just the training set. We have also specified a padding embedding at index 0, which can be used for any unknown values; PyTorch handles this by setting this entry to a zero vector, which is not updated during training.

Additionally, as this is a regression task, the range that the model could predict is potentially unbounded. While the model can learn to restrict the output values to between 1 and 5, we can make this easier for the model by modifying the architecture to restrict this range prior to training. We have done this by applying the sigmoid function to the model’s output — which restricts the range to between 0 and 1 — and then scaling this to within a range that we can define.

At this point, we would usually start writing the training loop; however, as we are using PyTorch-accelerated, this will largely be taken care of for us. That said, as PyTorch-accelerated tracks only the training and validation losses by default, let's create a callback to track our metrics. More information about creating callbacks can be found here.

To calculate our metrics, we are going to use torchmetrics, whose metric implementations are compatible with distributed training, so we won't need to gather results from different processes before computing them.
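A sketch of such a callback, assuming the hook names and run_history API of the pytorch-accelerated version current at the time of writing:

```python
import math

from pytorch_accelerated.callbacks import TrainerCallback
from torchmetrics import MeanAbsoluteError, MeanSquaredError, MetricCollection

class RecommenderMetricsCallback(TrainerCallback):
    def __init__(self):
        self.metrics = MetricCollection(
            {"mse": MeanSquaredError(), "mae": MeanAbsoluteError()}
        )

    def on_training_run_start(self, trainer, **kwargs):
        # Move the metrics to the same device as the model.
        self.metrics.to(trainer.device)

    def on_eval_step_end(self, trainer, batch, batch_output, **kwargs):
        # batch is (inputs, ratings); accumulate predictions vs. targets.
        self.metrics.update(batch_output["model_outputs"], batch[1])

    def on_eval_epoch_end(self, trainer, **kwargs):
        computed = self.metrics.compute()
        trainer.run_history.update_metric("mae", computed["mae"].cpu())
        trainer.run_history.update_metric("mse", computed["mse"].cpu())
        trainer.run_history.update_metric("rmse", math.sqrt(computed["mse"]))
        self.metrics.reset()
```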

Now, all that is left to do is to train the model. PyTorch-accelerated provides a notebook_launcher function, which enables us to run multi-GPU training runs from within a notebook. To use this, all we need to do is to define a training function that instantiates our Trainer object and calls the train method.

Components such as the model and dataset can be defined anywhere in the notebook, but it is important that the trainer is only ever instantiated within a training function. If you prefer to create training scripts and run them from the command line, more information on how to do this with PyTorch-accelerated is available here.

Using MSE as our loss and AdamW as our optimizer, with a OneCycle learning rate schedule, we can define our training function as follows.
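A sketch of this training function; the hyperparameters, the rating range, and the early-stopping and best-model-saving callbacks (which explain the behavior noted later) are my own choices, and the import locations assume the pytorch-accelerated version current at the time of writing:

```python
from functools import partial

import torch
from pytorch_accelerated.callbacks import EarlyStoppingCallback, SaveBestModelCallback
from pytorch_accelerated.trainer import DEFAULT_CALLBACKS, Trainer, TrainerPlaceholderValues

# OneCycle needs step counts that are only known once training starts;
# pytorch-accelerated substitutes these placeholders at runtime.
create_sched_fn = partial(
    torch.optim.lr_scheduler.OneCycleLR,
    max_lr=0.01,
    epochs=TrainerPlaceholderValues.NUM_EPOCHS,
    steps_per_epoch=TrainerPlaceholderValues.NUM_UPDATE_STEPS_PER_EPOCH,
)

def train_mf_model():
    model = MfDotBias(
        n_factors=120,
        n_users=len(user_lookup),
        n_items=len(movie_lookup),
        ratings_range=(0.5, 5.5),
    )

    trainer = Trainer(
        model=model,
        loss_func=torch.nn.MSELoss(),
        optimizer=torch.optim.AdamW(model.parameters(), lr=0.01),
        callbacks=(
            RecommenderMetricsCallback(),
            *DEFAULT_CALLBACKS,
            SaveBestModelCallback(watch_metric="mae"),
            EarlyStoppingCallback(early_stopping_patience=2, watch_metric="mae"),
        ),
    )

    trainer.train(
        train_dataset=train_dataset,
        eval_dataset=valid_dataset,
        num_epochs=30,
        per_device_batch_size=512,
        create_scheduler_fn=create_sched_fn,
    )
```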

We can now launch training by passing this function to the ‘notebook_launcher’:
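notebook_launcher is provided by the underlying accelerate library; set num_processes to the number of GPUs available (I'll assume two here):

```python
from accelerate import notebook_launcher

notebook_launcher(train_mf_model, num_processes=2)
```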

After 21 epochs, we end up with the following metrics:

Comparing this to our baseline, we can see that there is an improvement!

Sequential recommendations using a transformer

Using matrix factorization, we are treating each rating as being independent from the ratings around it; however, incorporating information about other movies that a user recently rated could provide an additional signal that could boost performance. For example, suppose that a user is watching a trilogy of films; if they have rated the first two instalments highly, it is likely that they may do the same for the finale!

One way that we can approach this is to use a transformer network — specifically the encoder portion — to encode additional context into the learned embeddings for each movie, and then use a fully connected neural network to make the rating predictions.

The transformer helps us here primarily through its self-attention mechanism — which, along with most of what you need to know about transformers, is excellently explained here and here — which computes an attention score for each movie in the sequence. The attention score, computed for each movie with respect to every other movie, represents the relevancy of different items in the sequence to each other; if two movie embeddings are related, the attention score should be higher than for two movies that are seemingly unrelated. Of course, the model initially has no intuition of which movies are related, so the movie embeddings, along with other internal parameters within the transformer, must be learned to produce this behavior.

Let’s consider an example of how this works at a conceptual level. Suppose that a user watches, and rates, the following sequence of movies:

As films in the same franchise, the two Harry Potter movies are very relevant to each other — they have the same director, tone, and target audience — and so we would like the embeddings for these movies to produce a high attention score. In contrast, The Killer Snakes — a horror movie about a giant man-eating snake — is a very different type of film from Harry Potter and the Philosopher’s Stone, so the attention score should be negligible. However, because Harry Potter and the Chamber of Secrets also involves a giant snake, it bears a passing similarity to The Killer Snakes, and so the attention score may be slightly higher than when comparing The Killer Snakes to Harry Potter and the Philosopher’s Stone.

Of course, some of these relationships can be quite subtle, which is why transformers usually require a huge amount of data to achieve exceptional results! In our case, we are assuming that we can learn this information by looking at the sequences of movies watched by lots of different users. While it is likely that a subset of users will have watched both Harry Potter movies, for the model to pick up on a ‘contains a killer snake’ feature, we would require that we have a small subset of users who watch marathons of ‘killer snake’ movies, which may be a stretch!

Conceptually, the transformer will then use these attention scores to produce a matrix that defines the contribution of each movie to each other movie, as represented below:

This matrix can then be used to encode this context into our movie embeddings. We can think of this operation as replacing the movie embedding for Harry Potter and the Philosopher’s Stone with 0.7*(Harry Potter and the Philosopher’s Stone embedding) + 0.3*(Harry Potter and the Chamber of Secrets embedding).

Note that this is only a conceptual understanding of how a transformer works and many intricacies have been skipped over. To understand how a transformer really works, please refer to the references linked above, or the original paper.

Pre-processing the data

Now that we have a (very!) high level intuition of why including a transformer may help us with this task, the first step is to process our data so that we have a time-sorted list of movies for each user. Let’s start by grouping all the ratings by user:
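One way to do this with pandas, collecting each user's chronologically ordered titles, ratings, and validation flags into lists:

```python
grouped_df = (
    full_df.sort_values(by="timestamp")
    .groupby("user_id")
    .agg(
        movie_ids=("title", list),
        ratings=("rating", list),
        is_valid=("is_valid", list),
    )
    .reset_index()
)
```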

Now that we have grouped all the ratings for each user, let’s divide these into smaller sequences. To make the most out of the data, we would like the model to have the opportunity to predict a rating for every movie in the training set. To do this, let’s specify a sequence length s and use the previous s-1 ratings as our user history.

As the model expects each sequence to be a fixed length, we will fill empty spaces with a padding token, so that sequences can be batched and passed to the model. Let’s create a function to do this.
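A sketch of such a function; the '[PAD]' token string is my own choice:

```python
def create_sequences(values, sequence_length, pad_value="[PAD]"):
    # For each position, take the preceding (sequence_length - 1) values plus
    # the value itself, left-padding when there is not enough history.
    sequences = []
    for i in range(len(values)):
        start = max(0, i - sequence_length + 1)
        seq = list(values[start : i + 1])
        seq = [pad_value] * (sequence_length - len(seq)) + seq
        sequences.append(seq)
    return sequences
```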

To visualize how this function works, let’s apply it, with a sequence length of 3, to the first 10 movies rated by the first user. These movies are:

Applying our function, we have:

As we can see, we have 10 sequences of length 3, where the final movie in the sequence is unchanged from the original list.

Now, let’s apply this function to all the features in our DataFrame. Here, we arbitrarily choose a length of 10.

Currently, we have one row that contains all the sequences for a certain user. However, during training, we would like to create batches made up of sequences from many different users. To do this, we will have to transform the data so that each sequence has its own row, while remaining associated with the user ID. We can use the pandas ‘explode’ function for each feature, and then aggregate these DataFrames together.
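In recent pandas versions (1.3+), this can be done in a single multi-column explode, which is equivalent to exploding each feature separately and joining the results:

```python
seq_df = grouped_df.explode(
    ["movie_ids", "ratings", "is_valid"], ignore_index=True
)
```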

Now, we can see that each sequence has its own row. However, for the is_valid column, we don’t care about the whole sequence and only need the last value as this is the movie for which we will be trying to predict the rating. Let’s create a function to extract this value and apply it to these columns.
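For example:

```python
def get_last_value(sequence):
    return sequence[-1]

# Cast to bool so the column can be used directly as a filter mask later.
seq_df["is_valid"] = seq_df["is_valid"].apply(get_last_value).astype(bool)
```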

Also, to make it easy to access the rating that we are trying to predict, let’s separate this into its own column.
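Reusing the same helper:

```python
# The final rating in each sequence is the value we are trying to predict.
seq_df["label"] = seq_df["ratings"].apply(get_last_value)
```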

To prevent the model from including padding tokens when calculating attention scores, we can provide an attention mask to the transformer; the mask should be ‘True’ for a padding token and ‘False’ otherwise. Let’s calculate this for each row, as well as creating a column to show the number of padding tokens present.
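Given that padded positions hold the '[PAD]' token:

```python
pad_token = "[PAD]"

seq_df["attention_mask"] = seq_df["movie_ids"].apply(
    lambda seq: [movie == pad_token for movie in seq]
)
seq_df["num_pads"] = seq_df["attention_mask"].apply(sum)
```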

Let’s inspect the transformed data:

All looks as it should! Let’s split this into training and validation sets.
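Using the is_valid flag we extracted above:

```python
train_seq_df = seq_df[~seq_df["is_valid"]]
valid_seq_df = seq_df[seq_df["is_valid"]]
```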

Now, the data is in the correct format for our transformer model.

Training the model

As we saw previously, before we can feed this data into the model, we need to create lookup tables to encode our movies and users. However, this time, we need to include the padding token in our movie lookup.
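Rebuilding the lookups, this time with '[PAD]' explicitly mapped to index 0:

```python
movie_lookup = {
    "[PAD]": 0,
    **{v: i + 1 for i, v in enumerate(full_df["title"].unique())},
}
user_lookup = {v: i + 1 for i, v in enumerate(full_df["user_id"].unique())}
```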

Now, we are dealing with sequences of ratings, rather than individual ones, so we will need to create a new dataset to wrap our processed DataFrame:
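A sketch of such a dataset; note that the final rating in each sequence is zeroed out before being returned, as it is the label we are predicting and must not be fed to the model (this safeguard against leaking the target is my own addition):

```python
class MovieSequenceDataset(Dataset):
    def __init__(self, df, movie_lookup, user_lookup):
        self.df = df
        self.movie_lookup = movie_lookup
        self.user_lookup = user_lookup

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        row = self.df.iloc[index]
        user_id = self.user_lookup[row["user_id"]]
        movie_ids = torch.tensor([self.movie_lookup[m] for m in row["movie_ids"]])

        # Hide the target rating from the model; 0 is the rating pad value.
        ratings = list(row["ratings"])
        ratings[-1] = 0
        ratings = torch.tensor(ratings, dtype=torch.long)

        attention_mask = torch.tensor(row["attention_mask"], dtype=torch.bool)
        label = torch.tensor(row["label"], dtype=torch.float32)
        return (user_id, movie_ids, ratings, attention_mask), label
```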

We can now use this to create our training and validation datasets:
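```python
train_dataset = MovieSequenceDataset(train_seq_df, movie_lookup, user_lookup)
valid_dataset = MovieSequenceDataset(valid_seq_df, movie_lookup, user_lookup)
```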

Now, let’s define our transformer model! As a start, given that the matrix factorization model can achieve good performance using only the user and movie ids, let’s only include this information for now.

We can see that, by default, we feed our sequence of movie embeddings into a single transformer layer, before concatenating the output with the user features — here, just the user ID — and using this as the input to a fully connected network. Here, we use a simple learned positional encoding to represent the order in which the movies were rated; using a sine- and cosine-based approach provided no benefit during my experiments, but feel free to try it out if you are interested!

Once again, let’s define a training function for this model; except for the model initialization, this is identical to the one we used to train the matrix factorization model.

We can now use the notebook launcher to train this model:

After six epochs, we observe the following results:

We can see that this is a significant improvement over the matrix factorization approach!

As the model started to overfit, our callbacks stopped training early and loaded the best model weights for us!

Adding additional data

So far, we have only considered the user ID and a sequence of movie IDs to predict the rating; it seems likely that including information about the previous ratings made by the user would improve performance. Thankfully, this is easy to do, and the data is already being returned by our dataset. Let’s tweak our architecture to include this:
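A sketch of the updated class, redefining BstTransformer in place so that the existing training function picks it up; only the lines marked as new differ from the previous definition:

```python
class BstTransformer(nn.Module):
    def __init__(self, n_movies, n_users, sequence_length=10,
                 embedding_dim=120, ratings_range=(0.5, 5.5)):
        super().__init__()
        self.y_range = ratings_range
        self.movie_embedding = nn.Embedding(n_movies + 1, embedding_dim, padding_idx=0)
        self.user_embedding = nn.Embedding(n_users + 1, embedding_dim, padding_idx=0)
        self.position_embedding = nn.Embedding(sequence_length, embedding_dim)
        # New: ratings are 1-5, with index 0 reserved for padded positions.
        self.rating_embedding = nn.Embedding(6, embedding_dim, padding_idx=0)

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embedding_dim, nhead=12, dropout=0.1, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)
        self.fc = nn.Sequential(
            nn.Linear(embedding_dim * sequence_length + embedding_dim, 1024),
            nn.BatchNorm1d(1024),
            nn.ReLU(),
            nn.Linear(1024, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Linear(512, 1),
        )

    def forward(self, inputs):
        user_id, movie_ids, ratings, attention_mask = inputs
        positions = torch.arange(movie_ids.size(1), device=movie_ids.device)
        # New: sum the movie, positional, and rating embeddings at each position.
        sequence = (
            self.movie_embedding(movie_ids)
            + self.position_embedding(positions)
            + self.rating_embedding(ratings)
        )
        encoded = self.encoder(sequence, src_key_padding_mask=attention_mask)
        encoded = encoded.flatten(start_dim=1)
        combined = torch.cat([encoded, self.user_embedding(user_id)], dim=1)
        result = self.fc(combined).squeeze(1)
        return (
            torch.sigmoid(result) * (self.y_range[1] - self.y_range[0])
            + self.y_range[0]
        )
```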

We can see that, to use the ratings data, we have added an additional embedding layer. For each previously rated movie, we then add together the movie embedding, the positional encoding and the rating embedding before feeding this sequence into the transformer. Alternatively, the rating data could be concatenated to, or multiplied with, the movie embedding, but adding them together worked the best out of the approaches that I tried.

As Jupyter maintains a live state for each class definition, we don’t need to update our training function; the new class will be used when we launch training:

After eight epochs, we obtain the following results:

We can see that incorporating the ratings data has improved our results slightly!

Adding user features

In addition to the ratings data, we also have more information about the users that we could add into the model. To remind ourselves, let’s take a look at the users table:

Let’s try adding in the categorical variables representing the users’ sex, age groups, and occupation to the model, and see if we see any improvement. While occupation looks like it is already sequentially numerically encoded, we must do the same for the sex and age_group columns. We can use the ‘LabelEncoder’ class from scikit-learn to do this for us, and append the encoded columns to the DataFrame:

Now that we have all the features that we are going to use encoded, let’s join the user features to our sequences DataFrame, and update our training and validation sets.
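```python
seq_df = seq_df.merge(
    users_df[["user_id", "sex_encoded", "age_group_encoded", "occupation"]],
    on="user_id",
)
train_seq_df = seq_df[~seq_df["is_valid"]]
valid_seq_df = seq_df[seq_df["is_valid"]]
```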

Let’s update our dataset to include these features.

We can now modify our architecture to include embeddings for these features and concatenate these embeddings to the output of the transformer; then we pass this into the feed-forward network.
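A sketch of the final architecture; the sizes of the new embeddings (sex has 2 values, age_group 7, and occupation 21 in this dataset) are my own choices:

```python
class BstTransformer(nn.Module):
    def __init__(self, n_movies, n_users, sequence_length=10,
                 embedding_dim=120, ratings_range=(0.5, 5.5)):
        super().__init__()
        self.y_range = ratings_range
        self.movie_embedding = nn.Embedding(n_movies + 1, embedding_dim, padding_idx=0)
        self.user_embedding = nn.Embedding(n_users + 1, embedding_dim, padding_idx=0)
        self.position_embedding = nn.Embedding(sequence_length, embedding_dim)
        self.rating_embedding = nn.Embedding(6, embedding_dim, padding_idx=0)
        # New: embeddings for the categorical user features.
        self.sex_embedding = nn.Embedding(2, 2)
        self.age_group_embedding = nn.Embedding(7, 4)
        self.occupation_embedding = nn.Embedding(21, 8)

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embedding_dim, nhead=12, dropout=0.1, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)
        self.fc = nn.Sequential(
            # The input now also includes the 2 + 4 + 8 user-feature dimensions.
            nn.Linear(embedding_dim * sequence_length + embedding_dim + 14, 1024),
            nn.BatchNorm1d(1024),
            nn.ReLU(),
            nn.Linear(1024, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Linear(512, 1),
        )

    def forward(self, inputs):
        user_id, movie_ids, ratings, attention_mask, user_features = inputs
        positions = torch.arange(movie_ids.size(1), device=movie_ids.device)
        sequence = (
            self.movie_embedding(movie_ids)
            + self.position_embedding(positions)
            + self.rating_embedding(ratings)
        )
        encoded = self.encoder(sequence, src_key_padding_mask=attention_mask)
        encoded = encoded.flatten(start_dim=1)

        # Concatenate the user-feature embeddings alongside the user ID embedding.
        combined = torch.cat(
            [
                encoded,
                self.user_embedding(user_id),
                self.sex_embedding(user_features[:, 0]),
                self.age_group_embedding(user_features[:, 1]),
                self.occupation_embedding(user_features[:, 2]),
            ],
            dim=1,
        )
        result = self.fc(combined).squeeze(1)
        return (
            torch.sigmoid(result) * (self.y_range[1] - self.y_range[0])
            + self.y_range[0]
        )
```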

Launching training with these new features included, after seven epochs we get the following results:

Here, we can see a slight decrease in the MAE, but a small increase in the MSE and RMSE, so it looks like these features made a negligible difference to the overall performance.

Conclusion

To summarize, the results that we obtained from each model are:

In writing this article, my main objective has been to illustrate how these approaches can be used, and so I've picked the hyperparameters somewhat arbitrarily; it's likely that with some hyperparameter tweaks and different combinations of features, these metrics could be improved upon!

Hopefully this has provided a good introduction to using both matrix factorization and transformer-based approaches in PyTorch, and how PyTorch-accelerated can speed up our process when experimenting with different models!

All the code required to replicate this post is available as a GitHub gist here. While gists are used as code snippets throughout this article, this is primarily for aesthetic reasons, and these snippets may not work as intended if copied directly. For working implementations, please defer to that gist.

Chris Hughes is on LinkedIn.

