Getting Started with AI: Building a movie recommendation model from scratch!

Ada Choudhry
12 min read · Aug 29, 2023


Collaborative filtering, explaining OOP concepts, breaking down embedding vectors, and so much more….

Welcome (or welcome back) to the Getting Started with AI series, where I explore tutorials and courses to learn about AI practically through top-down approaches, AKA learning while building. I also share my personal struggles along the way, so if you're struggling with machine learning and doubting whether it is for you, you're not alone.

I love tutorials that build from scratch. In the last one, I built an image classifier using the fast.ai library, which is quite high-level and therefore easy to use. I would definitely recommend it for beginners. But it automates a lot of the processing, and once you have a good grasp of the concepts in machine learning, it makes sense to shift to lower-level approaches to gain greater understanding of, as well as control over, the architecture of your neural networks!

Here I will be using PyTorch to build my movie recommendation model. Let’s dive in!

This tutorial is based on Chapter 8 ‘Collaborative Filtering Deep Dive’ from the book Deep Learning for Coders with Fastai and PyTorch. It is available on the web for free!

Collaborative Filtering

Collaborative filtering is a technique used in recommendation systems to predict a user's preferences or interests by collecting and analyzing the preferences and behaviors of a group of users. It matches what person A currently likes to other people who like the same thing (say, person B), and then predicts what A might enjoy next based on what B watched after that shared favorite.

The fundamental idea behind collaborative filtering is that if users A and B have agreed on certain issues in the past, they are likely to agree on future issues as well.

Collaborative filtering is widely used in various recommendation systems, including movie recommendations, music playlists, product recommendations, and more. There are two main types of collaborative filtering:

  1. User-Based Collaborative Filtering: In user-based collaborative filtering, the system identifies similar users based on their past behaviors and preferences. If user A and user B have similar patterns of interactions and preferences for items, the system might recommend items that user B has liked to user A. This method relies on finding neighbors (similar users) and then suggesting items that the neighbors have shown interest in (a small sketch of this idea follows the list).
  2. Item-Based Collaborative Filtering: Item-based collaborative filtering focuses on the relationships between items rather than users. It identifies items that are similar to the ones a user has already interacted with or liked. If user A likes item X and item Y is often liked by users who liked item X, then the system might recommend item Y to user A. This method is often based on calculating item-item similarity.
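
To make "finding neighbors" concrete, here is a tiny hypothetical sketch (the ratings matrix and numbers are made up) that scores how similar two users' rating histories are:

    import numpy as np

    # Hypothetical ratings matrix: rows are users A, B, C; columns are movies; 0 = not rated.
    ratings = np.array([
        [5, 4, 0, 1],
        [4, 5, 1, 0],
        [1, 0, 5, 4],
    ])

    def cosine_sim(a, b):
        # Cosine similarity between two rating vectors.
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # User A looks much more like user B than like user C, so we would recommend
    # to A the items that B rated highly and A hasn't seen yet.
    print(cosine_sim(ratings[0], ratings[1]))  # high (close to 1)
    print(cosine_sim(ratings[0], ratings[2]))  # low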

For collaborative filtering, we more commonly refer to items, rather than products. Items could be links that people click, diagnoses that are selected for patients, and so forth.

Latent factors

Latent factors, in the context of collaborative filtering and matrix factorization, are hidden or underlying characteristics or features that represent users and items (such as movies) in a lower-dimensional space. These latent factors capture unobservable relationships between users and items that influence user preferences, item characteristics, or interactions.

In collaborative filtering, the idea is to represent users and items as vectors in this latent factor space. The goal is to learn these latent factors from observed user-item interactions (ratings, likes, etc.) so that the model can make predictions about how a user would rate or interact with an item they haven’t encountered before.

Here’s an analogy to help understand latent factors:

In a movie recommendation system, each movie can be represented by a set of characteristics (latent factors) that describe its genre, director, actors, style, and so on. Similarly, each user can be described by their preferences, like how much they enjoy action movies, comedies, dramas, and so forth. We can assign a number to each of these characteristics and collect them into a vector.

By reducing these complex characteristics into a lower-dimensional latent space, the model can capture relationships like “Users who like action movies tend to also like movies starring a particular actor.” This simplification allows the model to find patterns and make personalized recommendations even if there’s no explicit data connecting a specific user to a specific movie.

How are latent factors used in a neural network?

Initially, the model generates latent factors for both users and items randomly, and through training and optimization, the latent factors get learned over time.

But wait, that is still vague!

Once we have our sets of latent factors for users and items, we take the dot product of these vectors. If we knew, for each user, to what degree they liked each important category that a movie might fall into (genre, age, preferred directors and actors, and so forth), and we knew the same information about each movie, then a simple way to fill in the ratings table would be to multiply the two sets of numbers together and sum the result. For instance, assume these factors range between -1 and +1, with positive numbers indicating stronger matches and negative numbers weaker ones, and that the categories are science-fiction, action, and old movies. Then we could represent the movie The Last Skywalker as follows:

last_skywalker = np.array([0.98,0.9,-0.9])

When the dot product is high, the movie is a good recommendation for that user; when it is low, it is not.
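
Pairing that with a hypothetical user vector (the numbers below are purely illustrative), the dot product scores the match:

    import numpy as np

    # How much this user likes science-fiction, action, and old movies.
    user1 = np.array([0.9, 0.8, -0.6])
    last_skywalker = np.array([0.98, 0.9, -0.9])

    # A high dot product means the movie lines up well with the user's tastes.
    match = (user1 * last_skywalker).sum()
    print(match)  # about 2.1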

How are latent factors learned?

A loss function measures the difference between the model’s predicted interactions (based on latent factors) and the actual observed interactions in the training data. The goal is to minimize this loss function during training.

But wait, how does the model measure the observed interactions?

It compares the ratings users actually gave to the movies against the ratings the model predicts for them.
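
For ratings, a common choice is mean squared error; here is a tiny, made-up example of the quantity being minimized:

    import torch

    # Predicted vs. actual ratings for a small batch (made-up numbers).
    preds   = torch.tensor([4.2, 3.1, 2.5])
    actuals = torch.tensor([5.0, 3.0, 1.0])

    # Mean squared error: the value that training tries to drive down.
    loss = ((preds - actuals) ** 2).mean()
    print(loss)  # tensor(0.9667)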

Latent factors are a powerful example of unsupervised learning because they enable the model to infer meaningful features without human-defined labels. They’re learned solely from the interactions between users and items in the data, making them a valuable tool for recommendation systems and other unsupervised learning tasks.

Unsupervised learning is a machine learning approach where the model learns patterns and structures from unlabeled data. It’s about finding hidden relationships and structures within the data without any explicit guidance in the form of labeled outcomes.

To learn how this works in practice, let’s break down the whole model:

The goal is to build a recommendation system that suggests movies to users based on their preferences.

  1. Data Preparation: To train this model, we gather users' ratings for movies into a matrix where rows correspond to users and columns correspond to movies. Each cell contains a user's rating for a particular movie, and some cells may be empty (indicating missing ratings).
  2. Unsupervised Learning (Latent Factors): The model doesn't have labels telling it which movies are good or bad; instead, it learns patterns from the data itself.
  3. Latent Factors Initialization: We initialize latent factor matrices for users and movies with random values. Each row in the user matrix represents a user’s preferences across latent factors, and each row in the movie matrix represents a movie’s characteristics across latent factors.
  4. Learning Process: The model learns by iteratively adjusting the latent factor values to minimize the difference between the predicted ratings (based on latent factors) and the actual ratings in the dataset (a minimal sketch of this loop follows the list).
  5. Finding Patterns: As the model iterates, it discovers patterns in the data. It learns that certain latent factor combinations correspond to certain types of movies or user preferences.
  6. Making Recommendations: Once the model has learned the latent factors, it can predict the missing ratings and recommend movies to users. It identifies users whose latent factor vectors are similar to a given user’s latent factors and suggests movies that those similar users have rated highly.
  7. Generalization: The model generalizes from the training data to make predictions for unseen user-movie pairs. It predicts how a user would rate a movie they haven’t seen based on the latent factors’ relationships.
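
To make steps 3 to 5 concrete, here is a minimal, self-contained sketch in plain PyTorch using a tiny made-up ratings matrix (this is an illustration, not the MovieLens data or the article's final model):

    import torch

    # Tiny made-up ratings matrix: rows are users, columns are movies, 0 = unrated.
    ratings = torch.tensor([[5., 3., 0.],
                            [4., 0., 1.],
                            [0., 2., 5.]])
    mask = ratings > 0                       # only learn from observed ratings

    n_users, n_movies, n_factors = 3, 3, 2
    user_factors  = torch.randn(n_users,  n_factors, requires_grad=True)
    movie_factors = torch.randn(n_movies, n_factors, requires_grad=True)

    opt = torch.optim.SGD([user_factors, movie_factors], lr=0.05)
    for epoch in range(200):
        preds = user_factors @ movie_factors.T        # predicted rating for every pair
        loss = ((preds - ratings)[mask] ** 2).mean()  # error only on observed cells
        opt.zero_grad()
        loss.backward()
        opt.step()

    # The learned factors also fill in the cells that were missing.
    print(user_factors @ movie_factors.T)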

Embeddings

Embeddings are a fundamental concept in machine learning and deep learning that involve representing discrete categorical variables, such as words, users, or items, as continuous vectors in a lower-dimensional space. They are an important concept in NLP. The difference between embeddings and latent factors is that the latter is more specific to collaborative filtering.
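
In PyTorch, an embedding is essentially a lookup table of learnable vectors; a minimal sketch:

    import torch
    from torch import nn

    # An embedding table for 10 movies, each represented by a 3-dimensional vector.
    movie_emb = nn.Embedding(num_embeddings=10, embedding_dim=3)

    # Looking up movie IDs 0 and 7 returns their (learnable) vectors.
    ids = torch.tensor([0, 7])
    print(movie_emb(ids).shape)  # torch.Size([2, 3])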

Building our movie recommendation model!

This model is available in this Google Colab notebook!

Let’s import the necessary libraries.
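
The exact import cell isn't reproduced here, but a typical setup following the fastai book looks like this:

    # Imports used throughout the fastai collaborative-filtering chapter.
    from fastai.collab import *
    from fastai.tabular.all import *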

We have downloaded a great dataset that we can use, called MovieLens. This dataset contains tens of millions of movie rankings (a combination of a movie ID, a user ID, and a numeric rating), although we will just use a subset of 100,000 of them for our example. If you’re interested, it would be a great learning project to try to replicate this approach on the full 25 million recommendation dataset, which you can get from their website.
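
Assuming the fastai imports above, the 100,000-rating subset can be fetched with the book's untar_data call:

    # Download and extract the MovieLens 100k dataset to a local path.
    path = untar_data(URLs.ML_100k)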

Here, we are viewing the movie ratings given by the users as well as a movie title table.
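
A sketch of that step, following the book's column names for the MovieLens 100k files:

    import pandas as pd

    # u.data holds one rating per row: user ID, movie ID, rating, timestamp.
    ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                          names=['user', 'movie', 'rating', 'timestamp'])

    # u.item maps each movie ID to its title (we only need the first two columns).
    movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1',
                         usecols=(0, 1), names=['movie', 'title'], header=None)

    ratings.head(), movies.head()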

We merge these tables on their shared movie column, which adds the title column to the ratings table. We then store this data in dls using the CollabDataLoaders class from fast.ai, which automates the creation of the DataLoaders, with a batch size of 64.
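
Concretely, following the book:

    # Join ratings with titles on the shared 'movie' column,
    # then build DataLoaders with a batch size of 64.
    ratings = ratings.merge(movies)
    dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
    dls.show_batch()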

The number of users is the length of the 'user' class list in dls.classes, and the number of movies is the length of the 'title' class list. We have chosen 5 latent factors, and the factor matrices are randomly generated using torch.randn.
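
In code, following the book, that looks like:

    n_users   = len(dls.classes['user'])
    n_movies  = len(dls.classes['title'])
    n_factors = 5

    # Randomly initialized latent factors for every user and every movie.
    user_factors  = torch.randn(n_users, n_factors)
    movie_factors = torch.randn(n_movies, n_factors)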

This function is used to create learnable parameters (model parameters) with initial values drawn from a normal distribution. It is typically used when initializing weights and biases for layers in a neural network. It has four pieces, and a sketch of the whole helper follows the list:

  1. nn.Parameter: nn.Parameter is a PyTorch class that wraps a tensor as a learnable parameter. This is used to indicate that the tensor should be treated as a parameter during optimization.
  2. torch.zeros(*size): This creates a tensor of zeros with dimensions specified by the size argument. The * operator is used to unpack the dimensions from the size tuple.
  3. .normal_(0, 0.1): The normal_ method initializes the tensor's values with random numbers drawn from a normal distribution with a mean of 0 and a standard deviation of 0.1. The underscore _ at the end of normal_ indicates that the method operates in place, modifying the tensor's values.
  4. Return Value: The function returns an nn.Parameter object containing the initialized tensor with the specified dimensions.
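
Putting those four pieces together, a helper consistent with the description above (the fastai book calls it create_params) looks like this:

    import torch
    from torch import nn

    def create_params(size):
        # A learnable tensor of the given shape, initialized from a normal
        # distribution with mean 0 and standard deviation 0.1.
        return nn.Parameter(torch.zeros(*size).normal_(0, 0.1))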

This is the main part of building our model!

We define a new class called DotProductBias.
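
Here is a sketch of the full class, closely following the version in the fastai book; the rest of this section walks through it piece by piece.

    class DotProductBias(Module):
        def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
            # Learnable latent factors plus a per-user and per-movie bias.
            self.user_factors  = create_params([n_users, n_factors])
            self.user_bias     = create_params([n_users])
            self.movie_factors = create_params([n_movies, n_factors])
            self.movie_bias    = create_params([n_movies])
            self.y_range = y_range

        def forward(self, x):
            # x has shape [batch_size, 2]: user IDs in column 0, movie IDs in column 1.
            users  = self.user_factors[x[:, 0]]
            movies = self.movie_factors[x[:, 1]]
            res = (users * movies).sum(dim=1)
            res += self.user_bias[x[:, 0]] + self.movie_bias[x[:, 1]]
            # Squash the prediction into the rating range.
            return sigmoid_range(res, *self.y_range)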

The __init__ function is the method Python will call when a new object is created. So, this is where you can set up any state that needs to be initialized upon object creation. Any parameters included when the user constructs an instance of your class will be passed to the __init__ method as parameters. The first parameter to any method defined inside a class is self, so you can use this to set and get any attributes that you will need.

A new PyTorch module must inherit from Module, which is why Module appears after the class name in the declaration. It provides the basic foundation for building our class. This demonstrates the concept of inheritance, in which we add additional behavior to an existing class.

Here, user_factors is initialized as a learnable matrix with one row per user and one column per latent factor, and movie_factors is created the same way for the movies. We also create a learnable bias vector with one entry per user (of length n_users) and another with one entry per movie (of length n_movies).

A great way to understand what the bias represents is to suppose a user likes action movies from the 90s. There is still a good chance they might also like Titanic, even though it is of a different genre, because it appeals to large audiences. In this case, the bias of Titanic would be high, and it would be added on top of the dot product of the user and movie latent factors, raising the predicted rating. Conversely, if a user rarely rates any movie highly, their user bias would be low!

Bias terms are used to capture the inherent tendencies of users to rate items higher or lower than average, and the inherent popularity or quality of movies.

When our new PyTorch module is called, PyTorch will call a method in the class named forward, and will pass along any parameters that are included in the call.

self.user_factors and self.movie_factors are the learnable tables inside the model that hold the latent factors (embedding vectors) for users and movies respectively; we index into them with the IDs in the batch. The input of the model is a tensor of shape batch_size x 2, where the first column (x[:, 0]) contains the user IDs and the second column (x[:, 1]) contains the movie IDs. In this case, the shape of x would be [64, 2].

(users * movies) performs element-wise multiplication between the user and movie latent factors. This step combines the information from both factors.

.sum(dim=1) sums along dimension 1 of the resulting tensor, collapsing the latent-factor dimension so that each example in the batch is reduced to a single value. Because the multiplication was element-wise across the latent-factor vectors, summing along dimension 1 is exactly the dot product of each user vector with its movie vector, which serves as the predicted preference or rating for that user-movie pair.

res += self.user_bias[x[:,0]] + self.movie_bias[x[:,1]]

In this statement, the user and movie biases are looked up for each example and added to the dot product computed earlier.

To make sure the predicted ratings land between 0 and 5, we apply sigmoid_range. After experimentation, it was observed that using 5.5 as the upper limit (rather than 5) produces better results.

In the context of recommendation systems, this operation produces a predicted score or rating for each user-movie pair in the batch, capturing the interaction between users and movies while considering their latent features.

Now, we can finally train our model!
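
A sketch of the training cell, following the book (the weight-decay value of 0.1 is the book's choice):

    model = DotProductBias(n_users, n_movies, n_factors)
    learn = Learner(dls, model, loss_func=MSELossFlat())
    learn.fit_one_cycle(5, 5e-3, wd=0.1)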

The learning rate is 5e-3. But wait, what does the variable wd stand for?

It stands for Weight Decay and it aims to reduce overfitting. Overfitting is a concept in machine learning and statistics where a model learns to perform extremely well on the training data but performs poorly on unseen or new data. In other words, an overfit model captures noise or random fluctuations in the training data rather than the underlying true patterns or relationships that generalize to new data.

When a model overfits, it essentially "memorizes" the training data instead of "learning" from it. This can lead to poor generalization to new data, causing the model's performance to degrade significantly when applied to real-world situations.

Weight decay, or L2 regularization, consists of adding the sum of all the weights squared to your loss function. Why do that? Because when we compute the gradients, it adds a contribution to them that encourages the weights to be as small as possible. If we take a parabola y = a * (x**2), the larger the coefficient a is, the sharper and narrower the parabola will be; in the same way, larger weights let the function the model learns take on sharper, narrower shapes.

(Figure from Chapter 8 of Deep Learning for Coders with fastai and PyTorch.)

Letting the model learn large parameter values might cause it to fit all the data points in the training set with an overcomplex function that has very sharp changes, which leads to overfitting. Limiting how much the weights can grow does hinder the training of the model, but it yields a state that generalizes better.

However, computing that big sum of squared weights and adding it to the loss would be inefficient and possibly numerically unstable. Since the derivative of a**2 is 2*a, we get the same effect by adding 2*a, scaled by the parameter wd (which we choose), directly to the gradients!
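
As a hedged illustration of the two equivalent views (the variable names and stand-in loss below are made up for the example):

    import torch

    wd = 0.1
    w = torch.randn(5, requires_grad=True)
    loss = (w * torch.arange(5.)).sum()   # a stand-in loss, just for illustration

    # Option 1: add the penalty to the loss before calling backward().
    loss_with_wd = loss + wd * (w ** 2).sum()

    # Option 2, equivalent and cheaper: the derivative of w**2 is 2*w, so we get
    # the same effect by adding 2 * wd * w straight to the gradients.
    loss.backward()
    with torch.no_grad():
        w.grad += 2 * wd * w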

That’s it for building our model! But wait, don’t you want to see what your model has learned?

Results

I extracted the movies with the lowest learned biases to see which ones the model inherently ranks as bad.
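
Following the book's approach, this can be done by sorting the learned movie_bias parameter and mapping the indices back to titles:

    movie_bias = learn.model.movie_bias.squeeze()
    idxs = movie_bias.argsort()[:5]
    print([dls.classes['title'][i] for i in idxs])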

And then I did the same for the movies whose bias is high.
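
Sorting in descending order gives the other end of the scale:

    idxs = movie_bias.argsort(descending=True)[:5]
    print([dls.classes['title'][i] for i in idxs])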

That makes sense!

I hope you learned something new in this tutorial! Do share which parts of the tutorial you liked and where you struggled.

I’ll come back soon with another exciting tutorial!

Until then, keep building, keep learning!
