Collaborative Filtering based Recommender System From Scratch
Recommender systems are one of the most successful and widespread applications of machine learning in industry. The ability to suggest the most appealing products to each type of consumer in your client base is crucial not only to keep your clients engaged, but also to attract new customers. The business model and success of big companies like Netflix, Amazon, Steam, Spotify, YouTube and many others revolve around the power of their recommender systems.
A recommender system deals with the task of predicting the degree of similarity between items (movies, songs, clothing, shoes, etc.) or users in a database. Based on these similarities, the system can provide personalized item recommendations to users.
Generally speaking, recommender systems can be classified into 3 types:
- Collaborative Filtering based: These systems predict the rating or preference that a user would give an item based on past user ratings or preferences. No item metadata is required, in contrast to their content-based counterparts.
- Content-based Filtering: These systems predict the item rating or user preference based on item metadata (genre, director, description, actors, color, etc.) and on the user’s preferences and history. These systems depend heavily on item metadata.
- Hybrid systems: These systems try to leverage the best of both worlds and constitute a wide range of state-of-the-art techniques in industry. They are complex systems, usually made of multiple smaller subsystems.
Those system types have their own pros and cons. In this post we’ll implement a simple Collaborative Filtering based system.
Collaborative Filtering
As said before, Collaborative Filtering (CF) is a recommendation technique that uses similarities between users (or items) to infer how interested a user might be in an item they have not yet rated. These similarities are computed using only the existing user ratings for items, so no item description metadata is needed.
There are two general CF approaches:
- User-based, which exploits similarities between users. A rating prediction of a user for an item is computed using the ratings given to that item by similar users.
- Item-based, which exploits similarities between items. A rating prediction of a user for an item is computed using that user’s ratings of similar items.
In both cases, the first step in building the system is to compute an interactions matrix, where each row contains the ratings given by one user to all items in the database, and each column contains the ratings given by all users to one specific item.
The figure below illustrates an example interactions matrix. User ratings are in the range 1 to 5; a rating of 0 means that the item was not rated by that user.
The interactions matrix is, naturally, a sparse matrix: in practical scenarios, most users will not have rated most of the items. This may require the adoption of algorithms specifically designed to deal with sparse data.
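To make this concrete, here is a tiny, made-up example (hypothetical ratings for 4 users and 5 items, not taken from any real dataset) of what such a matrix looks like in code:

import numpy as np

# Rows are users, columns are items; 0 means "not rated yet".
toy_interactions = np.array([
    [5, 3, 0, 1, 0],
    [4, 0, 0, 1, 2],
    [0, 1, 5, 4, 0],
    [1, 0, 4, 0, 3],
])
print(toy_interactions.shape)          # (4, 5) -> 4 users, 5 items
print((toy_interactions == 0).mean())  # fraction of unrated entries (sparsity)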
Two main algorithm types are used to implement recommender systems (either item-based or user-based):
- Memory based, in which statistical techniques are applied to the entire dataset to calculate the rating predictions.
- Model based, which involve steps to reduce or compress the large, but possibly sparse, interactions matrix.
In this post, we’ll use a memory based strategy, which is simpler to explain and easier to understand. In a future post we’ll cover the implementation of a model based approach.
Imports
import os
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
import optuna
import matplotlib.pyplot as plt
import re
The MovieLens 100K Dataset
There are a number of datasets available for recommender systems research. Among them, the MovieLens dataset is probably one of the most popular. MovieLens is a non-commercial, web-based movie recommender system, created in 1997 and run by the GroupLens Research Project at the University of Minnesota. Its data has been critical for several research studies, including work on personalized recommendation and social psychology.
There are several versions of the dataset available. We’ll use the well-known MovieLens 100K, which consists of 100,000 ratings from 943 users on 1,682 movies. Some simple demographic information about the users (such as age and gender) and genre information about the movies is also available, but we will not use it.
if not os.path.isdir("ml-100k"):
    # Download the dataset
    !wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
    # Unzip it
    !unzip ml-100k.zip
    # Get rid of the .zip file
    !rm ml-100k.zip
The dataset folder contains several files, but we need only two of them:
- u.data: The full dataset with 100,000 user ratings. Each user has rated at least 20 movies.
- u.item: Contains information about the movies, but we will use only the movie ids and titles. This data is not required for understanding the CF technique, but it makes our system’s output friendlier.
Let’s begin by loading the ratings dataframe.
ratings = pd.read_csv(
    "ml-100k/u.data",
    sep="\t",  # this is tab-separated data
    names=["user_id", "movie_id", "rating", "timestamp"],  # the column names
    usecols=["user_id", "movie_id", "rating"],  # we do not need the timestamp column
    low_memory=False
)
ratings
Each entry consists of a user_id, a movie_id and a movie rating in the range 1 to 5.
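One detail worth settling now: the code further down refers to n_users and n_movies, the number of distinct users and movies in the dataset. These are presumably derived from the ratings dataframe; a minimal sketch (relying on the fact that MovieLens 100K uses contiguous ids starting at 1):

n_users = ratings.user_id.nunique()    # 943
n_movies = ratings.movie_id.nunique()  # 1682
n_users, n_movies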
Train/test split
Now we’ll create the train and test sets that we’ll use to evaluate the performance of our system. 20% of each user’s ratings will be used for testing, and the remainder will be used for training.
test_perc = 0.2
# Initialize the train and test dataframes.
train_set, test_set = pd.DataFrame(), pd.DataFrame()
# Split the ratings of each user.
for user_id in ratings.user_id.unique():
    user_df = ratings[ratings.user_id == user_id].sample(
        frac=1,
        random_state=42
    )  # select only the samples of the current user and shuffle the resulting dataframe
    n_entries = len(user_df)
    n_test = int(round(test_perc * n_entries))
    test_set = pd.concat((test_set, user_df.tail(n_test)))
    train_set = pd.concat((train_set, user_df.head(n_entries - n_test)))
train_set = train_set.sample(frac=1).reset_index(drop=True)
test_set = test_set.sample(frac=1).reset_index(drop=True)
train_set.shape, test_set.shape
We also need a function to compute the interactions matrix of a given ratings dataframe.
def build_interactions_matrix(r_mat, n_users, n_items):
    iter_m = np.zeros((n_users, n_items))
    for _, user_id, movie_id, rating in r_mat.itertuples():
        iter_m[user_id-1, movie_id-1] = rating
    return iter_m

iter_m = build_interactions_matrix(ratings, n_users, n_movies)
iter_m.shape
Memory based approach
In this post we’ll build our system using the memory based approach, in which similarities between users/items are computed from the rating data itself. The i-th row of the interactions matrix is therefore treated as the feature vector of user i, while the j-th column is treated as the feature vector of item j.
The similarity between two users is represented by some distance measure between their feature vectors. Multiple measures, such as the Pearson correlation and the vector cosine, are used for this. For example, the similarity between users u and u′ can be computed using the vector cosine as:

\text{sim}(u, u') = \cos(\mathbf{r}_u, \mathbf{r}_{u'}) = \frac{\mathbf{r}_u \cdot \mathbf{r}_{u'}}{\lVert \mathbf{r}_u \rVert \, \lVert \mathbf{r}_{u'} \rVert} = \frac{\sum_i r_{ui} \, r_{u'i}}{\sqrt{\sum_i r_{ui}^2} \, \sqrt{\sum_i r_{u'i}^2}}

where \mathbf{r}_u and \mathbf{r}_{u'} are the feature vectors of users u and u′, respectively, and r_{ui} is the rating given by user u to item i. The same procedure is applied when computing the similarity between items i and i′.
def build_similarity_matrix(interactions_matrix, kind="user", eps=1e-9):
    # takes rows as user features
    if kind == "user":
        similarity_matrix = interactions_matrix.dot(interactions_matrix.T)
    # takes columns as item features
    elif kind == "item":
        similarity_matrix = interactions_matrix.T.dot(interactions_matrix)
    norms = np.sqrt(similarity_matrix.diagonal()) + eps
    return similarity_matrix / (norms[np.newaxis, :] * norms[:, np.newaxis])
u_sim = build_similarity_matrix(iter_m, kind="user")
i_sim = build_similarity_matrix(iter_m, kind="item")
print(f"User similarity matrix shape: {u_sim.shape}\nUser similarity matrix sample:\n{u_sim[:4, :4]}")
print("-" * 97)
print(f"Item similarity matrix shape: {i_sim.shape}\nItem similarity matrix sample:\n{i_sim[:4, :4]}")
The similarity matrix is a symmetric matrix with values in the range 0 to 1. Each diagonal element holds the self-similarity of a user/item, so the diagonal elements are all equal to 1.
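As a quick, optional sanity check of these properties (symmetry, values in [0, 1] and a unit diagonal), one could run something along these lines:

# Hypothetical sanity checks on the user similarity matrix.
print(np.allclose(u_sim, u_sim.T))                    # symmetric
print(0 <= u_sim.min() and u_sim.max() <= 1 + 1e-6)   # values in [0, 1]
print(np.allclose(u_sim.diagonal(), 1.0))             # self-similarities are (numerically) 1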
Making Predictions
Now we are able to make predictions. Depending on which approach we have chosen for our system, we have two different objectives:
- If we choose the user-based approach, we’ll infer a missing rating r_ui of a user u for an item i by taking the sum of the ratings given to that item by the other users, weighted by their similarity to u and normalized by the sum of those similarities.
- If we choose the item-based approach instead, we’ll infer a missing rating r_ui of a user u for an item i by taking the sum of the ratings this user gave to the other items, weighted by the items’ similarity to i and normalized in the same way (see the formulas below).
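Written out (a sketch consistent with the code that follows, where sim(·,·) denotes the cosine similarities computed above), the two predictions are:

\hat{r}_{ui} = \frac{\sum_{u'} \text{sim}(u, u') \, r_{u'i}}{\sum_{u'} \lvert \text{sim}(u, u') \rvert} \quad \text{(user-based)} \qquad \hat{r}_{ui} = \frac{\sum_{j} \text{sim}(i, j) \, r_{uj}}{\sum_{j} \lvert \text{sim}(i, j) \rvert} \quad \text{(item-based)}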
The Recommender class
Let’s build a Recommender class that will do all the heavy lifting of computing/storing the similarity matrices and making rating predictions for us.
class Recommender:
    def __init__(
        self,
        n_users,
        n_items,
        r_mat,
        kind="user",
        eps=1e-9,
    ):
        self.n_users = n_users
        self.n_items = n_items
        self.kind = kind
        self.eps = eps
        self.iter_m = build_interactions_matrix(r_mat, self.n_users, self.n_items)
        self.sim_m = build_similarity_matrix(self.iter_m, kind=self.kind)
        self.predictions = self._predict_all()

    def _predict_all(self):
        if self.kind == "user":
            predictions = \
                self.sim_m.dot(self.iter_m) / np.abs(self.sim_m + self.eps).sum(axis=0)[:, np.newaxis]
        elif self.kind == "item":
            predictions = \
                self.iter_m.dot(self.sim_m) / np.abs(self.sim_m + self.eps).sum(axis=0)[np.newaxis, :]
        return predictions
print("User-based predictions sample:")
print(Recommender(n_users, n_movies, train_set, kind="user").predictions[:4, :4])
print("-" * 97)
print("item-based predictions sample:")
print(Recommender(n_users, n_movies, train_set, kind="item").predictions[:4, :4])
Model evaluation
Now it’s time to assess the predictor’s performance. For this we’ll use the Mean Squared Error (MSE) metric.
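For reference, over a set of N held-out ratings the MSE is simply

\text{MSE} = \frac{1}{N} \sum_{(u, i)} \left( r_{ui} - \hat{r}_{ui} \right)^2

where r_{ui} is the true rating and \hat{r}_{ui} the predicted one, so lower is better.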
def build_predictions_df(preds_m, dataframe):
    preds_v = []
    for row_id, user_id, movie_id, _ in dataframe.itertuples():
        preds_v.append(preds_m[user_id-1, movie_id-1])
    preds_df = pd.DataFrame(data={"user_id": dataframe.user_id, "movie_id": dataframe.movie_id, "rating": preds_v})
    return preds_df

def get_mse(estimator, train_set, test_set):
    train_preds = build_predictions_df(estimator.predictions, train_set)
    test_preds = build_predictions_df(estimator.predictions, test_set)
    train_mse = mean_squared_error(train_set.rating, train_preds.rating)
    test_mse = mean_squared_error(test_set.rating, test_preds.rating)
    return train_mse, test_mse
train_mse, test_mse = get_mse(
    Recommender(n_users, n_movies, train_set, kind="user"),
    train_set,
    test_set
)
print(f"User-based train MSE: {train_mse} -- User-based test MSE: {test_mse}")
print("-" * 97)
train_mse, test_mse = get_mse(
    Recommender(n_users, n_movies, train_set, kind="item"),
    train_set,
    test_set
)
print(f"Item-based train MSE: {train_mse} -- Item-based test MSE: {test_mse}")
K-nearest neighbors
At this point, the rating predictions are computed using the ratings of either all users or all items. This means that even users/items with very low similarity scores are accounted for in the prediction. A better approach is to use only a small subset of the most similar users/items when making a prediction. This technique is usually referred to as the k-nearest neighbors (KNN) algorithm.
Let’s adapt our Recommender class to use only the k most similar users/items when making a prediction.
class Recommender:
    def __init__(
        self,
        n_users,
        n_items,
        r_mat,
        k=40,  # the number of neighbors to use when computing a prediction
        kind="user",
        eps=1e-9
    ):
        self.n_users = n_users
        self.n_items = n_items
        self.kind = kind
        self.eps = eps
        self.iter_m = build_interactions_matrix(r_mat, self.n_users, self.n_items)
        self.sim_m = build_similarity_matrix(self.iter_m, kind=self.kind)
        self.k = k
        self.predictions = self._predict_all()

    def _predict_all(self):
        pred = np.empty_like(self.iter_m)
        if self.kind == "user":
            # A user has the highest similarity score with itself,
            # so we skip the first element.
            sorted_ids = np.argsort(-self.sim_m)[:, 1:self.k+1]
            for user_id, k_users in enumerate(sorted_ids):
                pred[user_id, :] = self.sim_m[user_id, k_users].dot(self.iter_m[k_users, :])
                pred[user_id, :] /= np.abs(self.sim_m[user_id, k_users] + self.eps).sum()
        elif self.kind == "item":
            # An item has the highest similarity score with itself,
            # so we skip the first element.
            sorted_ids = np.argsort(-self.sim_m)[:, 1:self.k+1]
            for item_id, k_items in enumerate(sorted_ids):
                pred[:, item_id] = self.sim_m[item_id, k_items].dot(self.iter_m[:, k_items].T)
                pred[:, item_id] /= np.abs(self.sim_m[item_id, k_items] + self.eps).sum()
        return pred
train_mse, test_mse = get_mse(
    Recommender(n_users, n_movies, train_set, kind="user"),
    train_set,
    test_set
)
print(f"User-based train MSE: {train_mse} -- User-based test MSE: {test_mse}")
print("-" * 97)
train_mse, test_mse = get_mse(
    Recommender(n_users, n_movies, train_set, kind="item"),
    train_set,
    test_set
)
print(f"Item-based train MSE: {train_mse} -- Item-based test MSE: {test_mse}")
As we can see, this method alone can greatly improve our system’s predictive power. Later in this post, we’ll tune the number of neighbors to squeeze a bit more performance out of the system.
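As a rough illustration (not part of the tuning procedure we’ll use later, and the exact numbers will depend on the split), one could eyeball the effect of k by evaluating a handful of values directly:

# Sketch: evaluate the item-based model for a few neighborhood sizes.
for k in [5, 15, 30, 50, 100]:
    _, test_mse = get_mse(
        Recommender(n_users, n_movies, train_set, kind="item", k=k),
        train_set,
        test_set
    )
    print(f"k={k:>3} -- test MSE: {test_mse:.4f}")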
Bias subtraction
Now we’ll try to deal with the rating bias associated with each user or item. The idea here is that certain users may tend to always give high (or low) ratings to all movies, so the relative difference between ratings may be more important than their absolute values.
For a user-based approach this methodology can be mathematically described as:

\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{u' \in N_k(u)} \text{sim}(u, u') \, (r_{u'i} - \bar{r}_{u'})}{\sum_{u' \in N_k(u)} \lvert \text{sim}(u, u') \rvert}

where \bar{r}_u is the average rating given by user u and N_k(u) is the set of its k nearest neighbors, or, for an item-based approach, as:

\hat{r}_{ui} = \bar{r}_i + \frac{\sum_{j \in N_k(i)} \text{sim}(i, j) \, (r_{uj} - \bar{r}_j)}{\sum_{j \in N_k(i)} \lvert \text{sim}(i, j) \rvert}

where \bar{r}_i is the average rating of item i.
Let’s modify our Recommender class once more to include this feature.
class Recommender:
    def __init__(
        self,
        n_users,
        n_items,
        r_mat,
        k=40,
        kind="user",
        bias_sub=False,
        eps=1e-9
    ):
        self.n_users = n_users
        self.n_items = n_items
        self.kind = kind
        self.iter_m = build_interactions_matrix(r_mat, self.n_users, self.n_items)
        self.sim_m = build_similarity_matrix(self.iter_m, kind=self.kind)
        self.bias_sub = bias_sub
        self.k = k
        self.eps = eps
        self.predictions = self._predict_all()

    def _predict_all(self):
        pred = np.empty_like(self.iter_m)
        if self.kind == "user":
            # Compute the bias-subtracted interactions matrix if needed
            # (as a copy, so self.iter_m is not modified in place).
            iter_m = self.iter_m
            if self.bias_sub:
                user_bias = self.iter_m.mean(axis=1)[:, np.newaxis]
                iter_m = iter_m - user_bias
            # A user has the highest similarity score with itself,
            # so we skip the first element.
            sorted_ids = np.argsort(-self.sim_m)[:, 1:self.k+1]
            for user_id, k_users in enumerate(sorted_ids):
                pred[user_id, :] = self.sim_m[user_id, k_users].dot(iter_m[k_users, :])
                pred[user_id, :] /= \
                    np.abs(self.sim_m[user_id, k_users] + self.eps).sum() + self.eps
            if self.bias_sub:
                pred += user_bias
        elif self.kind == "item":
            # Compute the bias-subtracted interactions matrix if needed
            # (as a copy, so self.iter_m is not modified in place).
            iter_m = self.iter_m
            if self.bias_sub:
                item_bias = self.iter_m.mean(axis=0)[np.newaxis, :]
                iter_m = iter_m - item_bias
            # An item has the highest similarity score with itself,
            # so we skip the first element.
            sorted_ids = np.argsort(-self.sim_m)[:, 1:self.k+1]
            for item_id, k_items in enumerate(sorted_ids):
                pred[:, item_id] = self.sim_m[item_id, k_items].dot(iter_m[:, k_items].T)
                pred[:, item_id] /= \
                    np.abs(self.sim_m[item_id, k_items] + self.eps).sum() + self.eps
            if self.bias_sub:
                pred += item_bias
        return pred.clip(0, 5)
train_mse, test_mse = get_mse(
    Recommender(n_users, n_movies, train_set, kind="user", bias_sub=True),
    train_set,
    test_set
)
print(f"User-based train MSE: {train_mse} -- User-based test MSE: {test_mse}")
print("-" * 97)
train_mse, test_mse = get_mse(
    Recommender(n_users, n_movies, train_set, kind="item", bias_sub=True),
    train_set,
    test_set
)
print(f"Item-based train MSE: {train_mse} -- Item-based test MSE: {test_mse}")
Although this methodology did not improve the results in this scenario, possibly due to the characteristics of the dataset, it can be effective on other datasets.
Tuning up
There is one question left: how do we find the right number of similar users/items we should use when predicting a rating?
The author of the blog post this post was based on used the so-called Elbow Method to find a good value of k. Although that is a fine methodology, I’m personally more inclined to use a more programmatic way to achieve this kind of goal.
If you thought of hyper-parameter search methods, like grid search or random search, then we are thinking along the same lines.
The scikit-learn package already provides standard implementations of such methods based on cross-validation scores, but they are limited to a random search (RandomizedSearchCV) or an exhaustive grid search (GridSearchCV). Although we could certainly implement our own parameter search routine, here I opted to use the Optuna package. It has a simple and powerful interface and is a must-use tool for any machine learning practitioner.
We begin by defining an objective function to be minimized. This function is responsible for choosing the parameter values to try, instantiating the model with them, and evaluating the model’s performance. Our use case is quite simple, as we have only two parameters (k and bias_sub).
def objective(trial):
    # The hyper-parameters we want to optimize. For each one we define the bounds
    # and the corresponding name.
    k = trial.suggest_int("k", 10, 200)
    bias_sub = trial.suggest_categorical("bias_sub", [False, True])
    # Instantiate the model
    model = Recommender(n_users, n_movies, train_set, kind="item", k=k, bias_sub=bias_sub)
    # Evaluate the performance
    _, test_mse = get_mse(model, train_set, test_set)
    return test_mse
Next, we define our study object, which is responsible for the hyper-parameter search itself, as well as for keeping track of the optimization history.
study = optuna.create_study(direction="minimize")
# Here the parameter search effectively begins.
study.optimize(objective, n_trials=100)
Now we can retrieve the best parameters found.
best_k = study.best_params["k"]
best_bias_sub = study.best_params["bias_sub"]
print("Best parameters found:")
print(f" - k = {best_k}")
print(f" - bias_sub = {best_bias_sub}")
We can even plot the optimization history.
optuna.visualization.matplotlib.plot_optimization_history(study);
Item recommendation
Now that we have found our parameters, we can make our system do what it is supposed to do: recommend items. There are many ways to accomplish this, but I opted for the simple route: recommending the items most similar to a given item using an item-based system.
Let’s make the last modification to our Recommender class.
class Recommender:
    def __init__(
        self,
        n_users,
        n_items,
        r_mat,
        k=40,
        kind="user",
        bias_sub=False,
        eps=1e-9
    ):
        self.n_users = n_users
        self.n_items = n_items
        self.kind = kind
        self.iter_m = build_interactions_matrix(r_mat, self.n_users, self.n_items)
        self.sim_m = build_similarity_matrix(self.iter_m, kind=self.kind)
        self.bias_sub = bias_sub
        self.k = k
        self.eps = eps
        self.predictions = self._predict_all()

    def _predict_all(self):
        pred = np.empty_like(self.iter_m)
        if self.kind == "user":
            # Compute the bias-subtracted interactions matrix if needed
            # (as a copy, so self.iter_m is not modified in place).
            iter_m = self.iter_m
            if self.bias_sub:
                user_bias = self.iter_m.mean(axis=1)[:, np.newaxis]
                iter_m = iter_m - user_bias
            # A user has the highest similarity score with itself,
            # so we skip the first element.
            sorted_ids = np.argsort(-self.sim_m)[:, 1:self.k+1]
            for user_id, k_users in enumerate(sorted_ids):
                pred[user_id, :] = self.sim_m[user_id, k_users].dot(iter_m[k_users, :])
                pred[user_id, :] /= np.abs(self.sim_m[user_id, k_users]).sum() + self.eps
            if self.bias_sub:
                pred += user_bias
        elif self.kind == "item":
            # Compute the bias-subtracted interactions matrix if needed
            # (as a copy, so self.iter_m is not modified in place).
            iter_m = self.iter_m
            if self.bias_sub:
                item_bias = self.iter_m.mean(axis=0)[np.newaxis, :]
                iter_m = iter_m - item_bias
            # An item has the highest similarity score with itself,
            # so we skip the first element.
            sorted_ids = np.argsort(-self.sim_m)[:, 1:self.k+1]
            for item_id, k_items in enumerate(sorted_ids):
                pred[:, item_id] = self.sim_m[item_id, k_items].dot(iter_m[:, k_items].T)
                pred[:, item_id] /= np.abs(self.sim_m[item_id, k_items]).sum() + self.eps
            if self.bias_sub:
                pred += item_bias
        return pred.clip(0, 5)

    def get_top_recomendations(self, item_id, n=6):
        if self.kind == "user":
            # For a user-based system, only similarities between users were computed.
            # This strategy will not be covered in this post, but one solution
            # could be to find the top-rated items of similar users.
            # I'll leave this exercise to you =]
            pass
        if self.kind == "item":
            sim_row = self.sim_m[item_id - 1, :]
            # Once again, we skip the first item for obvious reasons.
            items_idxs = np.argsort(-sim_row)[1:n+1]
            similarities = sim_row[items_idxs]
            return items_idxs + 1, similarities
We added a method that returns the n items most similar to a given item. Now we just need to build our model with the parameters found earlier.
rs_model = Recommender(
    n_users,
    n_movies,
    ratings,  # the model will be built on the full dataset now
    k=best_k,
    kind="item",
    bias_sub=best_bias_sub
)
get_mse(rs_model, train_set, test_set)
We’ll also define two helper functions: one that maps a movie title to an id, and another that maps a list of movie ids to a list of movie titles.
def title2id(mapper_df, movie_title):
    return mapper_df.loc[mapper_df.movie_title == movie_title, "movie_title"].index.values[0]

def ids2title(mapper_df, ids_list):
    titles = []
    for movie_id in ids_list:
        titles.append(mapper_df.loc[movie_id, "movie_title"])
    return titles
These functions need a dataframe with the movie ids and titles to act as a mapper, so we’ll load the u.item file from the dataset folder.
# Column names
movies_mapper_cols = [
    "movie_id",
    "movie_title",
    "release_date",
    "video_release_date",
    "IMDb_URL",
    "unknown",
    "Action",
    "Adventure",
    "Animation",
    "Childrens",
    "Comedy",
    "Crime",
    "Documentary",
    "Drama",
    "Fantasy",
    "Film_Noir",
    "Horror",
    "Musical",
    "Mystery",
    "Romance",
    "Sci_Fi",
    "Thriller",
    "War",
    "Western"
]
movies_mapper = pd.read_csv(
    "ml-100k/u.item",
    sep="|",
    encoding="latin-1",
    names=movies_mapper_cols,
    usecols=["movie_id", "movie_title"],  # we only need these columns
    index_col="movie_id"
)
# Remove the release year from the movie titles
movies_mapper["movie_title"] = movies_mapper["movie_title"].apply(
    lambda title: re.sub(r"\(\d{4}\)", "", title).strip()
)
movies_mapper
Now we can make our recommendations.
def print_recommendations(model, mapper, movie_title):
    ids_list, similarities = model.get_top_recomendations(title2id(mapper, movie_title))
    titles = ids2title(mapper, ids_list)
    for title, similarity in zip(titles, similarities):
        print(f"{similarity:.2f} -- {title}")
print_recommendations(rs_model, movies_mapper, "Toy Story")
print_recommendations(rs_model, movies_mapper, "Batman Returns")
print_recommendations(rs_model, movies_mapper, "GoldenEye")
print_recommendations(rs_model, movies_mapper, "Godfather, The")
print_recommendations(rs_model, movies_mapper, "Billy Madison")
print_recommendations(rs_model, movies_mapper, "Lion King, The")
print_recommendations(rs_model, movies_mapper, "Star Wars")
Conclusion
What can we say about those recommendations?
Some recommendations are pretty good. For example, “Batman Returns” is similar to other Batman movies, that’s for sure. “The Lion King” is similar to “Aladdin” and “Beauty and the Beast” (Disney movies). “Star Wars” is similar to other Star Wars movies.
Some of them do not make much sense, though. “Toy Story” does not seem that similar to “The Rock” or “Mission: Impossible” to me. “The Lion King” is not that similar to “Jurassic Park” either, aside from the fact that both feature animals prominently. And I certainly cannot understand the similarity between the “Godfather” movies and the “Star Wars” movies.
Personal conclusions aside, there are some well-known advantages of memory-based Collaborative Filtering systems in the literature. Some of them are:
- Explainability of the results, which is an important aspect of recommendation systems;
- Ease of creation and use: the whole methodology is based purely on manipulations of the interactions matrix;
- Content-independence of the items being recommended: no descriptions, no genre definitions, no color or size attributes. No metadata at all.
But nothing in this world is perfect, and this approach suffers from disadvantages too. Some of them are:
- Performance decreases when the data gets sparser, which occurs frequently with web-related items;
- Adding new items becomes more complicated: since the data structure’s representation as an interactions matrix relies on a specific vector space, adding a new item requires including it and re-inserting all the elements into the structure;
- Recommendations tend to favor already popular items: items from the so-called long tail might get ignored.
The notebook for this post can be found here.
Many thanks to Ethan Rosenthal, the author of the blog post on which many of the ideas and concepts presented here are based.
Other useful links:
- Wikipedia’s article on Collaborative Filtering.
- Real Python post on Collaborative Filtering.
- Data Flair Recommendations System Project in R.
- Optuna page.
You can find the complete notebook and other projects in this Github repository.