Collaborative Filtering based Recommender System From Scratch
Recommender systems are one of the most successful and widespread applications of machine learning in industry. The ability to suggest the most appealing products to each type of consumer in your client base is crucial not only to keep your clients engaged, but also to attract new customers. The business model and success of big companies like Netflix, Amazon, Steam, Spotify, YouTube and many others revolve around the power of their recommender systems.
A recommender system deals with the task of predicting the degree of similarity between items (movies, songs, clothing, shoes, etc.) or users in a database. Based on these similarities, the system can provide personalized item recommendations to users.
Generally speaking, recommender systems can be classified into 3 types:
- Collaborative Filtering based: These systems predict the rating or preference that a user would give an item based on past user ratings or preferences. No item metadata is required, in contrast to their content-based counterparts.
- Content-based Filtering: These systems predict the item rating or user preference based on item metadata (genre, director, description, actors, color, etc.) and on the user’s preferences and history. These systems depend heavily on item metadata.
- Hybrid systems: These systems try to leverage the best of both worlds and constitute a wide range of state-of-the-art techniques in industry. They are complex systems, usually made of multiple smaller subsystems.
Those system types have their own pros and cons. In this post we’ll implement a simple Collaborative Filtering based system.
Collaborative Filtering
As said before, Collaborative Filtering (CF) is a recommendation technique that uses similarities between users (or items) to infer how interested a user might be in an item they have not yet rated. These similarities are computed using only the existing user ratings for items, so no item description metadata is needed.
There are two general CF approaches:
- User-based, which exploits similarities between users. A rating prediction of a user for an item is computed using the ratings given to that item by similar users.
- Item-based, which exploits similarities between items. A rating prediction of a user for an item is computed using that user’s ratings of similar items.
In both cases, the first step in building the system is to compute an interactions matrix, where each row contains the ratings given by one user to all items in the database, and each column contains the ratings given by all users to one specific item.
The figure below illustrates an example interactions matrix. User ratings are in the range 1 to 5; a rating of 0 means that the item was not rated by that user.
The interactions matrix is, naturally, a sparse matrix: in practical scenarios, most users will not have rated most of the items. This may require the adoption of algorithms specifically designed to deal with sparse data.
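To make this concrete, here is a tiny, made-up example (hypothetical ratings for 4 users and 5 items, not taken from any real dataset) of what such a matrix looks like in code:

import numpy as np

# Rows are users, columns are items; 0 means "not rated yet".
toy_interactions = np.array([
    [5, 3, 0, 1, 0],
    [4, 0, 0, 1, 2],
    [0, 1, 5, 4, 0],
    [1, 0, 4, 0, 3],
])
print(toy_interactions.shape)          # (4, 5) -> 4 users, 5 items
print((toy_interactions == 0).mean())  # fraction of unrated entries (sparsity)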
Two main algorithm types are used to implement recommender systems (either item-based or user-based):
- Memory based, in which statistical techniques are applied to the entire dataset to calculate the rating predictions.
- Model based, which involve steps to reduce or compress the large, but possibly sparse, interactions matrix.
In this post, we’ll use a memory based strategy, which is simpler to explain and easier to understand. In a future post we’ll cover the implementation of a model based approach.
Imports
import os
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
import optuna
import matplotlib.pyplot as plt
import re
The MovieLens 100K Dataset
There are a number of datasets available for recommender systems research. Among them, the MovieLens dataset is probably one of the most popular. MovieLens is a non-commercial, web-based movie recommender system, created in 1997 and run by the GroupLens Research Project at the University of Minnesota. Its data has been critical for several research studies, including work on personalized recommendation and social psychology.
There are several versions of the dataset available. We’ll use the well-known MovieLens 100K, which consists of 100,000 ratings from 943 users on 1,682 movies. Some simple demographic information about the users (such as age and gender) and genre information about the movies is also available, but we will not use it.
if not os.path.isdir("ml-100k"):
    # Download the dataset
    !wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
    # Unzip it
    !unzip ml-100k.zip
    # Get rid of the .zip file
    !rm ml-100k.zip
The dataset folder contains several files, but we need only two of them:
- u.data: The full dataset with 100,000 user ratings. Each user has rated at least 20 movies.
- u.item: Contains information about the movies, but we will use only the movie ids and titles. This data is not required for understanding the CF technique, but it makes our system’s output friendlier.
Let’s begin by loading the ratings dataframe.
ratings = pd.read_csv(
    "ml-100k/u.data",
    sep="\t",  # this is tab-separated data
    names=["user_id", "movie_id", "rating", "timestamp"],  # the column names
    usecols=["user_id", "movie_id", "rating"],  # we do not need the timestamp column
    low_memory=False
)
ratings
Each entry consists of a user_id, a movie_id and a movie rating in the range 1 to 5.
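One detail worth settling now: the code further down refers to n_users and n_movies, the number of distinct users and movies in the dataset. These are presumably derived from the ratings dataframe; a minimal sketch (relying on the fact that MovieLens 100K uses contiguous ids starting at 1):

n_users = ratings.user_id.nunique()    # 943
n_movies = ratings.movie_id.nunique()  # 1682
n_users, n_movies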
Train/test split
Now we’ll create the train and test sets that we’ll use to evaluate the performance of our system. 20% of each user’s ratings will be used for testing, and the remainder will be used for training.
test_perc = 0.2
# Initialize the train and test dataframes.
train_set, test_set = pd.DataFrame(), pd.DataFrame()
# Split the ratings of each user.
for user_id in ratings.user_id.unique():
    user_df = ratings[ratings.user_id == user_id].sample(
        frac=1,
        random_state=42
    )  # select only the samples of the current user and shuffle the resulting dataframe
    n_entries = len(user_df)
    n_test = int(round(test_perc * n_entries))
    test_set = pd.concat((test_set, user_df.tail(n_test)))
    train_set = pd.concat((train_set, user_df.head(n_entries - n_test)))
train_set = train_set.sample(frac=1).reset_index(drop=True)
test_set = test_set.sample(frac=1).reset_index(drop=True)
train_set.shape, test_set.shape
We also need a function to compute the interactions matrix of a given ratings dataframe.
def build_interactions_matrix(r_mat, n_users, n_items):
    iter_m = np.zeros((n_users, n_items))
    for _, user_id, movie_id, rating in r_mat.itertuples():
        iter_m[user_id-1, movie_id-1] = rating
    return iter_m

iter_m = build_interactions_matrix(ratings, n_users, n_movies)
iter_m.shape
Memory based approach
In this post we’ll build our system using the memory based approach, in which similarities between users/items are computed from the rating data itself. The i-th row of the interactions matrix is therefore treated as the feature vector of user i, while the j-th column is treated as the feature vector of item j.
The similarity between two users is represented by some distance measure between their feature vectors. Multiple measures, such as the Pearson correlation and the vector cosine, are used for this. For example, the similarity between users u and u′ can be computed using the vector cosine as:

\text{sim}(u, u') = \cos(\mathbf{r}_u, \mathbf{r}_{u'}) = \frac{\mathbf{r}_u \cdot \mathbf{r}_{u'}}{\lVert \mathbf{r}_u \rVert \, \lVert \mathbf{r}_{u'} \rVert} = \frac{\sum_i r_{ui} \, r_{u'i}}{\sqrt{\sum_i r_{ui}^2} \, \sqrt{\sum_i r_{u'i}^2}}

where \mathbf{r}_u and \mathbf{r}_{u'} are the feature vectors of users u and u′, respectively, and r_{ui} is the rating given by user u to item i. The same procedure is applied when computing the similarity between items i and i′.
def build_similarity_matrix(interactions_matrix, kind="user", eps=1e-9):
    # takes rows as user features
    if kind == "user":
        similarity_matrix = interactions_matrix.dot(interactions_matrix.T)
    # takes columns as item features
    elif kind == "item":
        similarity_matrix = interactions_matrix.T.dot(interactions_matrix)
    norms = np.sqrt(similarity_matrix.diagonal()) + eps
    return similarity_matrix / (norms[np.newaxis, :] * norms[:, np.newaxis])
u_sim = build_similarity_matrix(iter_m, kind="user")
i_sim = build_similarity_matrix(iter_m, kind="item")
print(f"User similarity matrix shape: {u_sim.shape}\nUser similarity matrix sample:\n{u_sim[:4, :4]}")
print("-" * 97)
print(f"Item similarity matrix shape: {i_sim.shape}\nItem similarity matrix sample:\n{i_sim[:4, :4]}")
The similarity matrix is a symmetric matrix with values in the range 0 to 1. Each diagonal element holds the self-similarity of a user/item, so the diagonal elements are all equal to 1.
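As a quick, optional sanity check of these properties (symmetry, values in [0, 1] and a unit diagonal), one could run something along these lines:

# Hypothetical sanity checks on the user similarity matrix.
print(np.allclose(u_sim, u_sim.T))                    # symmetric
print(0 <= u_sim.min() and u_sim.max() <= 1 + 1e-6)   # values in [0, 1]
print(np.allclose(u_sim.diagonal(), 1.0))             # self-similarities are (numerically) 1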
Making Predictions
Now we are able to make predictions. Depending on which approach we have chosen for our system, we have two different objectives:
- If we choose the user-based approach, we’ll infer a missing rating r_ui of a user u for an item i by taking the sum of the ratings given to that item by the other users, weighted by their similarity to u and normalized by the sum of those similarities.
- If we choose the item-based approach instead, we’ll infer a missing rating r_ui of a user u for an item i by taking the sum of the ratings this user gave to the other items, weighted by the items’ similarity to i and normalized in the same way (see the formulas below).
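Written out (a sketch consistent with the code that follows, where sim(·,·) denotes the cosine similarities computed above), the two predictions are:

\hat{r}_{ui} = \frac{\sum_{u'} \text{sim}(u, u') \, r_{u'i}}{\sum_{u'} \lvert \text{sim}(u, u') \rvert} \quad \text{(user-based)} \qquad \hat{r}_{ui} = \frac{\sum_{j} \text{sim}(i, j) \, r_{uj}}{\sum_{j} \lvert \text{sim}(i, j) \rvert} \quad \text{(item-based)}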
The Recommender class
Let’s build a Recommender class that will do all the heavy lifting of computing/storing the similarity matrices and making rating predictions for us.
class Recommender:
    def __init__(
        self,
        n_users,
        n_items,
        r_mat,
        kind="user",
        eps=1e-9,
    ):
        self.n_users = n_users
        self.n_items = n_items
        self.kind = kind
        self.eps = eps
        self.iter_m = build_interactions_matrix(r_mat, self.n_users, self.n_items)
        self.sim_m = build_similarity_matrix(self.iter_m, kind=self.kind)
        self.predictions = self._predict_all()

    def _predict_all(self):
        if self.kind == "user":
            predictions = \
                self.sim_m.dot(self.iter_m) / np.abs(self.sim_m + self.eps).sum(axis=0)[:, np.newaxis]
        elif self.kind == "item":
            predictions = \
                self.iter_m.dot(self.sim_m) / np.abs(self.sim_m + self.eps).sum(axis=0)[np.newaxis, :]
        return predictions
print("User-based predictions sample:")
print(Recommender(n_users, n_movies, train_set, kind="user").predictions[:4, :4])
print("-" * 97)
print("item-based predictions sample:")
print(Recommender(n_users, n_movies, train_set, kind="item").predictions[:4, :4])
Model evaluation
Now it’s time to assess the predictor’s performance. For this we’ll use the Mean Squared Error (MSE) metric.
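For reference, over a set of N held-out ratings the MSE is simply

\text{MSE} = \frac{1}{N} \sum_{(u, i)} \left( r_{ui} - \hat{r}_{ui} \right)^2

where r_{ui} is the true rating and \hat{r}_{ui} the predicted one, so lower is better.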
def build_predictions_df(preds_m, dataframe):
    preds_v = []
    for row_id, user_id, movie_id, _ in dataframe.itertuples():
        preds_v.append(preds_m[user_id-1, movie_id-1])
    preds_df = pd.DataFrame(data={"user_id": dataframe.user_id, "movie_id": dataframe.movie_id, "rating": preds_v})
    return preds_df

def get_mse(estimator, train_set, test_set):
    train_preds = build_predictions_df(estimator.predictions, train_set)
    test_preds = build_predictions_df(estimator.predictions, test_set)
    train_mse = mean_squared_error(train_set.rating, train_preds.rating)
    test_mse = mean_squared_error(test_set.rating, test_preds.rating)
    return train_mse, test_mse
train_mse, test_mse = get_mse(
    Recommender(n_users, n_movies, train_set, kind="user"),
    train_set,
    test_set
)
print(f"User-based train MSE: {train_mse} -- User-based test MSE: {test_mse}")
print("-" * 97)
train_mse, test_mse = get_mse(
    Recommender(n_users, n_movies, train_set, kind="item"),
    train_set,
    test_set
)
print(f"Item-based train MSE: {train_mse} -- Item-based test MSE: {test_mse}")
K-nearest neighbors
At this point, the rating predictions are computed using the ratings of either all users or all items. This means that even users/items with very low similarity scores are accounted for in the prediction. A better approach is to use only a small subset of the most similar users/items when making a prediction. This technique is usually referred to as the k-nearest neighbors (KNN) algorithm.
Let’s adapt our Recommender class to use only the k most similar users/items when making a prediction.
class Recommender:
    def __init__(
        self,
        n_users,
        n_items,
        r_mat,
        k=40,  # the number of neighbors to use when computing a prediction
        kind="user",
        eps=1e-9
    ):
        self.n_users = n_users
        self.n_items = n_items
        self.kind = kind
        self.eps = eps
        self.iter_m = build_interactions_matrix(r_mat, self.n_users, self.n_items)
        self.sim_m = build_similarity_matrix(self.iter_m, kind=self.kind)
        self.k = k
        self.predictions = self._predict_all()

    def _predict_all(self):
        pred = np.empty_like(self.iter_m)
        if self.kind == "user":
            # A user has the highest similarity score with itself,
            # so we skip the first element.
            sorted_ids = np.argsort(-self.sim_m)[:, 1:self.k+1]
            for user_id, k_users in enumerate(sorted_ids):
                pred[user_id, :] = self.sim_m[user_id, k_users].dot(self.iter_m[k_users, :])
                pred[user_id, :] /= np.abs(self.sim_m[user_id, k_users] + self.eps).sum()
        elif self.kind == "item":
            # An item has the highest similarity score with itself,
            # so we skip the first element.
            sorted_ids = np.argsort(-self.sim_m)[:, 1:self.k+1]
            for item_id, k_items in enumerate(sorted_ids):
                pred[:, item_id] = self.sim_m[item_id, k_items].dot(self.iter_m[:, k_items].T)
                pred[:, item_id] /= np.abs(self.sim_m[item_id, k_items] + self.eps).sum()
        return pred
train_mse, test_mse = get_mse(
    Recommender(n_users, n_movies, train_set, kind="user"),
    train_set,
    test_set
)
print(f"User-based train MSE: {train_mse} -- User-based test MSE: {test_mse}")
print("-" * 97)
train_mse, test_mse = get_mse(
    Recommender(n_users, n_movies, train_set, kind="item"),
    train_set,
    test_set
)
print(f"Item-based train MSE: {train_mse} -- Item-based test MSE: {test_mse}")
As we can see, this method alone can greatly improve our system’s predictive power. Later in this post, we’ll tune the number of neighbors to squeeze a bit more performance out of the system.
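As a rough illustration (not part of the tuning procedure we’ll use later, and the exact numbers will depend on the split), one could eyeball the effect of k by evaluating a handful of values directly:

# Sketch: evaluate the item-based model for a few neighborhood sizes.
for k in [5, 15, 30, 50, 100]:
    _, test_mse = get_mse(
        Recommender(n_users, n_movies, train_set, kind="item", k=k),
        train_set,
        test_set
    )
    print(f"k={k:>3} -- test MSE: {test_mse:.4f}")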
Bias subtraction
Now we’ll try to deal with the rating bias associated with each user or item. The idea here is that certain users may tend to always give high (or low) ratings to all movies, so the relative difference between ratings may be more important than their absolute values.
For a user-based approach this methodology can be mathematically described as:

\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{u' \in N_k(u)} \text{sim}(u, u') \, (r_{u'i} - \bar{r}_{u'})}{\sum_{u' \in N_k(u)} \lvert \text{sim}(u, u') \rvert}

where \bar{r}_u is the average rating given by user u and N_k(u) is the set of its k nearest neighbors, or, for an item-based approach, as:

\hat{r}_{ui} = \bar{r}_i + \frac{\sum_{j \in N_k(i)} \text{sim}(i, j) \, (r_{uj} - \bar{r}_j)}{\sum_{j \in N_k(i)} \lvert \text{sim}(i, j) \rvert}

where \bar{r}_i is the average rating of item i.
Let’s modify our Recommender class once more to include this feature.
class Recommender:
    def __init__(
        self,
        n_users,
        n_items,
        r_mat,
        k=40,
        kind="user",
        bias_sub=False,
        eps=1e-9
    ):
        self.n_users = n_users
        self.n_items = n_items
        self.kind = kind
        self.iter_m = build_interactions_matrix(r_mat, self.n_users, self.n_items)
        self.sim_m = build_similarity_matrix(self.iter_m, kind=self.kind)
        self.bias_sub = bias_sub
        self.k = k
        self.eps = eps
        self.predictions = self._predict_all()

    def _predict_all(self):
        pred = np.empty_like(self.iter_m)
        if self.kind == "user":
            # Compute the bias-subtracted interactions matrix if needed
            # (as a copy, so self.iter_m is not modified in place).
            iter_m = self.iter_m
            if self.bias_sub:
                user_bias = self.iter_m.mean(axis=1)[:, np.newaxis]
                iter_m = iter_m - user_bias
            # A user has the highest similarity score with itself,
            # so we skip the first element.
            sorted_ids = np.argsort(-self.sim_m)[:, 1:self.k+1]
            for user_id, k_users in enumerate(sorted_ids):
                pred[user_id, :] = self.sim_m[user_id, k_users].dot(iter_m[k_users, :])
                pred[user_id, :] /= \
                    np.abs(self.sim_m[user_id, k_users] + self.eps).sum() + self.eps
            if self.bias_sub:
                pred += user_bias
        elif self.kind == "item":
            # Compute the bias-subtracted interactions matrix if needed
            # (as a copy, so self.iter_m is not modified in place).
            iter_m = self.iter_m
            if self.bias_sub:
                item_bias = self.iter_m.mean(axis=0)[np.newaxis, :]
                iter_m = iter_m - item_bias
            # An item has the highest similarity score with itself,
            # so we skip the first element.
            sorted_ids = np.argsort(-self.sim_m)[:, 1:self.k+1]
            for item_id, k_items in enumerate(sorted_ids):
                pred[:, item_id] = self.sim_m[item_id, k_items].dot(iter_m[:, k_items].T)
                pred[:, item_id] /= \
                    np.abs(self.sim_m[item_id, k_items] + self.eps).sum() + self.eps
            if self.bias_sub:
                pred += item_bias
        return pred.clip(0, 5)
train_mse, test_mse = get_mse(
    Recommender(n_users, n_movies, train_set, kind="user", bias_sub=True),
    train_set,
    test_set
)
print(f"User-based train MSE: {train_mse} -- User-based test MSE: {test_mse}")
print("-" * 97)
train_mse, test_mse = get_mse(
    Recommender(n_users, n_movies, train_set, kind="item", bias_sub=True),
    train_set,
    test_set
)
print(f"Item-based train MSE: {train_mse} -- Item-based test MSE: {test_mse}")
Although this methodology did not improve the results in this scenario, possibly due to the characteristics of the dataset, it can be effective on other datasets.
Tuning up
There is one question left: how do we find the right number of similar users/items we should use when predicting a rating?
The author of the blog post this post was based on used the so-called Elbow Method to find a good value of k. Although that is a fine methodology, I’m personally more inclined to use a more programmatic way to achieve this kind of goal.
If you thought of hyper-parameter search methods, like grid search or random search, then we are thinking along the same lines.
The scikit-learn package already provides standard implementations of such methods based on cross-validation scores, but they are limited to a random search (RandomizedSearchCV) or an exhaustive grid search (GridSearchCV). Although we could certainly implement our own parameter search routine, here I opted to use the Optuna package. It has a simple and powerful interface and is a must-use tool for any machine learning practitioner.
We begin by defining an objective function to be minimized. This function is responsible for choosing the parameter values to try, instantiating the model with them, and evaluating the model’s performance. Our use case is quite simple, as we have only two parameters (k and bias_sub).
def objective(trial):
    # The hyper-parameters we want to optimize. For each one we define the bounds
    # and the corresponding name.
    k = trial.suggest_int("k", 10, 200)
    bias_sub = trial.suggest_categorical("bias_sub", [False, True])
    # Instantiate the model
    model = Recommender(n_users, n_movies, train_set, kind="item", k=k, bias_sub=bias_sub)
    # Evaluate the performance
    _, test_mse = get_mse(model, train_set, test_set)
    return test_mse
Next, we define our study object, which is responsible for the hyper-parameter search itself, as well as for keeping track of the optimization history.
study = optuna.create_study(direction="minimize")
# Here the parameter search effectively begins.
study.optimize(objective, n_trials=100)
Now we can retrieve the best parameters found.
best_k = study.best_params["k"]
best_bias_sub = study.best_params["bias_sub"]
print("Best parameters found:")
print(f" - k = {best_k}")
print(f" - bias_sub = {best_bias_sub}")
We can even plot the optimization history.
optuna.visualization.matplotlib.plot_optimization_history(study);
Item recommendation
Now that we have found our parameters, we can make our system do what it is supposed to do: recommend items. There are many ways to accomplish this, but I opted for the simple route: recommending the items most similar to a given item using an item-based system.
Let’s make the last modification to our Recommender class.
class Recommender:
    def __init__(
        self,
        n_users,
        n_items,
        r_mat,
        k=40,
        kind="user",
        bias_sub=False,
        eps=1e-9
    ):
        self.n_users = n_users
        self.n_items = n_items
        self.kind = kind
        self.iter_m = build_interactions_matrix(r_mat, self.n_users, self.n_items)
        self.sim_m = build_similarity_matrix(self.iter_m, kind=self.kind)
        self.bias_sub = bias_sub
        self.k = k
        self.eps = eps
        self.predictions = self._predict_all()

    def _predict_all(self):
        pred = np.empty_like(self.iter_m)
        if self.kind == "user":
            # Compute the bias-subtracted interactions matrix if needed
            # (as a copy, so self.iter_m is not modified in place).
            iter_m = self.iter_m
            if self.bias_sub:
                user_bias = self.iter_m.mean(axis=1)[:, np.newaxis]
                iter_m = iter_m - user_bias
            # A user has the highest similarity score with itself,
            # so we skip the first element.
            sorted_ids = np.argsort(-self.sim_m)[:, 1:self.k+1]
            for user_id, k_users in enumerate(sorted_ids):
                pred[user_id, :] = self.sim_m[user_id, k_users].dot(iter_m[k_users, :])
                pred[user_id, :] /= np.abs(self.sim_m[user_id, k_users]).sum() + self.eps
            if self.bias_sub:
                pred += user_bias
        elif self.kind == "item":
            # Compute the bias-subtracted interactions matrix if needed
            # (as a copy, so self.iter_m is not modified in place).
            iter_m = self.iter_m
            if self.bias_sub:
                item_bias = self.iter_m.mean(axis=0)[np.newaxis, :]
                iter_m = iter_m - item_bias
            # An item has the highest similarity score with itself,
            # so we skip the first element.
            sorted_ids = np.argsort(-self.sim_m)[:, 1:self.k+1]
            for item_id, k_items in enumerate(sorted_ids):
                pred[:, item_id] = self.sim_m[item_id, k_items].dot(iter_m[:, k_items].T)
                pred[:, item_id] /= np.abs(self.sim_m[item_id, k_items]).sum() + self.eps
            if self.bias_sub:
                pred += item_bias
        return pred.clip(0, 5)

    def get_top_recomendations(self, item_id, n=6):
        if self.kind == "user":
            # For a user-based system, only similarities between users were computed.
            # This strategy will not be covered in this post, but one solution
            # could be to find the top-rated items of similar users.
            # I'll leave this exercise to you =]
            pass
        if self.kind == "item":
            sim_row = self.sim_m[item_id - 1, :]
            # Once again, we skip the first item for obvious reasons.
            items_idxs = np.argsort(-sim_row)[1:n+1]
            similarities = sim_row[items_idxs]
            return items_idxs + 1, similarities
We added a method that returns the n items most similar to a given item. Now we just need to build our model with the parameters found earlier.
rs_model = Recommender(
    n_users,
    n_movies,
    ratings,  # the model will be built on the full dataset now
    k=best_k,
    kind="item",
    bias_sub=best_bias_sub
)
get_mse(rs_model, train_set, test_set)
We’ll also define two helper functions: one that maps a movie title to an id, and another that maps a list of movie ids to a list of movie titles.
def title2id(mapper_df, movie_title):
    return mapper_df.loc[mapper_df.movie_title == movie_title, "movie_title"].index.values[0]

def ids2title(mapper_df, ids_list):
    titles = []
    for movie_id in ids_list:
        titles.append(mapper_df.loc[movie_id, "movie_title"])
    return titles
These functions need a dataframe with the movie ids and titles to act as a mapper, so we’ll load the u.item file from the dataset folder.
# Column names
movies_mapper_cols = [
    "movie_id",
    "movie_title",
    "release_date",
    "video_release_date",
    "IMDb_URL",
    "unknown",
    "Action",
    "Adventure",
    "Animation",
    "Childrens",
    "Comedy",
    "Crime",
    "Documentary",
    "Drama",
    "Fantasy",
    "Film_Noir",
    "Horror",
    "Musical",
    "Mystery",
    "Romance",
    "Sci_Fi",
    "Thriller",
    "War",
    "Western"
]
movies_mapper = pd.read_csv(
    "ml-100k/u.item",
    sep="|",
    encoding="latin-1",
    names=movies_mapper_cols,
    usecols=["movie_id", "movie_title"],  # we only need these columns
    index_col="movie_id"
)
# Remove the release year from the movie titles
movies_mapper["movie_title"] = movies_mapper["movie_title"].apply(
    lambda title: re.sub(r"\(\d{4}\)", "", title).strip()
)
movies_mapper
Now we can make our recommendations.
def print_recommendations(model, mapper, movie_title):
    ids_list, similarities = model.get_top_recomendations(title2id(mapper, movie_title))
    titles = ids2title(mapper, ids_list)
    for title, similarity in zip(titles, similarities):
        print(f"{similarity:.2f} -- {title}")
print_recommendations(rs_model, movies_mapper, "Toy Story")
print_recommendations(rs_model, movies_mapper, "Batman Returns")
print_recommendations(rs_model, movies_mapper, "GoldenEye")
print_recommendations(rs_model, movies_mapper, "Godfather, The")
print_recommendations(rs_model, movies_mapper, "Billy Madison")
print_recommendations(rs_model, movies_mapper, "Lion King, The")
print_recommendations(rs_model, movies_mapper, "Star Wars")
Conclusion
What can we say about those recommendations?
Some recommendations are pretty good. For example, “Batman Returns” is similar to other Batman movies, that’s for sure. “The Lion King” is similar to “Aladdin” and “Beauty and the Beast” (Disney movies). “Star Wars” is similar to other Star Wars movies.
Some of them do not make much sense, though. “Toy Story” does not seem that similar to “The Rock” or “Mission: Impossible” to me. “The Lion King” is not that similar to “Jurassic Park” either, aside from the fact that both feature animals prominently. And I certainly cannot understand the similarity between the “Godfather” movies and the “Star Wars” movies.
Personal conclusions aside, there are some well-known advantages of memory-based Collaborative Filtering systems in the literature. Some of them are:
- Explainability of the results, which is an important aspect of recommendation systems;
- Ease of creation and use: the whole methodology is based purely on manipulations of the interactions matrix;
- Content-independence of the items being recommended: no descriptions, no genre definitions, no color or size attributes. No metadata at all.
But nothing in this world is perfect, and this approach suffers from disadvantages too. Some of them are:
- Performance decreases when the data gets sparser, which occurs frequently with web-related items;
- Adding new items becomes more complicated: since the data structure’s representation as an interactions matrix relies on a specific vector space, adding a new item requires including it and re-inserting all the elements into the structure;
- Recommendations tend to favor already popular items: items from the so-called long tail might get ignored.
The notebook for this post can be found here.
Many thanks to Ethan Rosenthal, the author of the blog post on which many of the ideas and concepts presented here are based.
Other useful links:
- Wikipedia’s article on Collaborative Filtering.
- Real Python post on Collaborative Filtering.
- Data Flair Recommendations System Project in R.
- Optuna page.
You can find the complete notebook and other projects in this Github repository.