Decoding the Matrix — A Journey Into the Algorithms of Recommender Systems — Part 4: Matrix Factorization
The Setup
The key idea behind matrix factorization is that we want to find two matrices, W and U, such that our user-ratings matrix R can be approximated by R hat = WU. Sometimes we have far too much data to actually store an NxM user-ratings matrix, so the two matrices we're looking for will be of size NxK and KxM (with K much smaller than N and M), and we'll use them to fill in the missing entries of our user-item matrix.
More precisely, each predicted rating is the dot product of a user vector and a movie vector, an expression we'll work with throughout this entire article:

\hat{r}_{ij} = w_i^T u_j

where w_i is the i-th row of W and u_j is the j-th column of U.
We know from Singular Value Decomposition that a matrix can be decomposed into three smaller matrices: one NxK, one KxK, and one KxM.
If I multiply U by S (or S by V^T), I get back a single matrix of the same shape as U (or V^T), absorbing the diagonal matrix into one of its neighbors. So SVD tells us we can factor a matrix into 3 matrices, or just 2 if we want.
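To make that concrete, here's a minimal numpy sketch (the matrix and its shape are arbitrary toy choices, not our ratings data) showing the three factors collapsing into two:

import numpy as np

R = np.random.randn(6, 5)               # a toy 6x5 "ratings" matrix
U_svd, s, Vt = np.linalg.svd(R, full_matrices=False)

K = 3                                   # keep only the top K singular values
U_k, s_k, Vt_k = U_svd[:, :K], s[:K], Vt[:K, :]

# Absorb the diagonal S into the left factor: three matrices become two
W = U_k * s_k                           # shape (6, K)
R_hat = W @ Vt_k                        # rank-K approximation of R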
Latent Features
Let’s take a step back here and figure out why we’re doing this in the first place. Pretend each of the K elements in w_i and u_j corresponds to one of the following features: comedy, action, romance, thriller, horror.
Then w_i[1] represents how much user i prefers comedy. Similarly, u_j[1] represents how strongly movie j expresses the comedy feature. When we take the dot product of these two vectors, we're measuring how well user i's preferences correlate with movie j's features, since the dot product is proportional to the cosine similarity.
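Here's a toy illustration of that dot product; the feature values are made up for the example:

import numpy as np

# Hypothetical feature values: [comedy, action, romance, thriller, horror]
w_i = np.array([0.9, 0.1, 0.4, 0.0, 0.2])   # user i strongly prefers comedy
u_j = np.array([0.8, 0.0, 0.3, 0.1, 0.0])   # movie j is mostly a comedy

# A large dot product means user i's tastes line up with movie j's features
score = w_i.dot(u_j)                        # ~0.84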
In reality these features are latent, meaning we don't set them up front; they are found by training our algorithm. And since SVD is a dimensionality reduction technique, what we're really doing is letting the model learn the K most important features needed to recreate the user-ratings matrix.
Matrix Factorization Training (math time)
Okay, we've got our model for now. It's time to set up our optimization problem. As per usual, we'll define a loss function, set the gradient to zero, and solve. Since we're dealing with a regression problem, let's use squared loss:

J = \sum_{(i,j) \in \Omega} (r_{ij} - w_i^T u_j)^2

where Omega is the set of pairs (i, j) such that user i has rated movie j. We can't work with missing data.
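To make the sum over Omega concrete, here's a minimal sketch with toy factors and a handful of observed ratings (all values invented):

import numpy as np

K = 2
W = np.random.randn(3, K)   # toy factors for 3 users
U = np.random.randn(K, 4)   # toy factors for 4 movies

# Omega as a dict: (user i, movie j) -> observed rating r_ij
omega = {(0, 1): 4.0, (0, 3): 3.5, (2, 2): 2.0}

# Squared loss summed only over the observed pairs
J = sum((r - W[i].dot(U[:, j])) ** 2 for (i, j), r in omega.items())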
Using the loss function, we can find the derivative with respect to w_i and u_j. It's a simple chain-rule step; the only thing we have to be careful about is the summation changing. Since we're taking the derivative with respect to w_i, we only care about the ratings that depend on w_i, i.e. the set of all movies user i has rated:

\frac{\partial J}{\partial w_i} = -2 \sum_{j \in \Psi_i} (r_{ij} - w_i^T u_j) u_j = 0
where Psi_i is the set of all movies user i has rated. The next step is to try and isolate w_i, which is transposed and stuck inside a dot product.
We move things around: change the order of the dot product (which is commutative) and move u_j to the other side (it's all multiplication):

\sum_{j \in \Psi_i} (u_j^T w_i) u_j = \sum_{j \in \Psi_i} r_{ij} u_j
Since w_i doesn't depend on j, we can pull it outside the summation. What we're left with is the outer product of u_j with itself, summed over j and multiplied by w_i, equal to the sum of the ratings times u_j over all movies the user has rated:

\left( \sum_{j \in \Psi_i} u_j u_j^T \right) w_i = \sum_{j \in \Psi_i} r_{ij} u_j
Now it's in the form Ax = b, which is easy to solve with np.linalg.solve(A, b). For completeness' sake, the answer is

w_i = \left( \sum_{j \in \Psi_i} u_j u_j^T \right)^{-1} \sum_{j \in \Psi_i} r_{ij} u_j
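In code, the solve for a single user might look like this sketch; psi_i and r_i here are illustrative stand-ins, not the variables we'll use in the full implementation later:

import numpy as np

K = 2
U = np.random.randn(K, 4)                  # toy movie factors
psi_i = [0, 2, 3]                          # movies user i has rated
r_i = {0: 4.0, 2: 3.0, 3: 5.0}             # user i's ratings of those movies

A = sum(np.outer(U[:, j], U[:, j]) for j in psi_i)   # sum of outer products
b = sum(r_i[j] * U[:, j] for j in psi_i)             # ratings-weighted sum
w_i = np.linalg.solve(A, b)                # solves A @ w_i = b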
Since the loss function is completely symmetric between w_i and u_j, we also get the answer for u_j for free:

u_j = \left( \sum_{i \in \Omega_j} w_i w_i^T \right)^{-1} \sum_{i \in \Omega_j} r_{ij} w_i

where Omega_j is the set of all users who have rated movie j.
Alternating Least Squares
Now we have a two-way dependency problem: the solution for W depends on U, and the solution for U depends on W. We'll apply an alternating least squares (ALS) approach. The first step is to start with a random initialization of U and W; then, until convergence, we update each w_i and u_j one after the other.
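Here's a tiny end-to-end sketch of the alternation on made-up ratings (no biases or regularization yet). The small 1e-6 * np.eye(K) term is my addition to keep each solve well-conditioned when a user or movie has fewer ratings than K; the full implementation below gets the same effect from its regularization term:

import numpy as np

# Toy data: 5 users, 4 movies, observed ratings only
np.random.seed(0)
N, M, K = 5, 4, 2
ratings = {(0, 1): 4.0, (0, 2): 3.0, (1, 0): 5.0, (2, 3): 2.0,
           (3, 1): 4.5, (4, 0): 1.0, (4, 3): 3.5}
W = np.random.randn(N, K)
U = np.random.randn(K, M)

for epoch in range(10):
    for i in range(N):                     # fix U, solve for each w_i
        psi = [j for (u, j) in ratings if u == i]
        A = sum(np.outer(U[:, j], U[:, j]) for j in psi) + 1e-6 * np.eye(K)
        b = sum(ratings[i, j] * U[:, j] for j in psi)
        W[i] = np.linalg.solve(A, b)
    for j in range(M):                     # fix W, solve for each u_j
        omega = [i for (i, m) in ratings if m == j]
        A = sum(np.outer(W[i], W[i]) for i in omega) + 1e-6 * np.eye(K)
        b = sum(ratings[i, j] * W[i] for i in omega)
        U[:, j] = np.linalg.solve(A, b)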
Expanding the Model
Much like how we studied deviations from a user's or movie's average rating, we should do the same for our matrix factorization model. So our final expression for a recommendation score will look like

\hat{r}_{ij} = w_i^T u_j + b_i + c_j + \mu

where b_i is the user's bias, c_j is the movie's bias, and mu is the global average rating. For example, with mu = 3.5, a generous user (b_i = 0.3), a well-liked movie (c_j = 0.4), and an interaction term w_i^T u_j = 0.5, the predicted rating is 4.7.
Now since we've updated the model, we're going to have to re-derive the update rules for all the parameters (same recipe: set each partial derivative to zero and solve). Here they are straight up, because otherwise this article is going to be way too long! The update for w_i:

w_i = \left( \sum_{j \in \Psi_i} u_j u_j^T \right)^{-1} \sum_{j \in \Psi_i} (r_{ij} - b_i - c_j - \mu) u_j
The update for u_j is symmetric:

u_j = \left( \sum_{i \in \Omega_j} w_i w_i^T \right)^{-1} \sum_{i \in \Omega_j} (r_{ij} - b_i - c_j - \mu) w_i
Here's the update for the user bias:

b_i = \frac{1}{|\Psi_i|} \sum_{j \in \Psi_i} (r_{ij} - w_i^T u_j - c_j - \mu)
And by symmetry again:

c_j = \frac{1}{|\Omega_j|} \sum_{i \in \Omega_j} (r_{ij} - w_i^T u_j - b_i - \mu)
You can calculate mu directly from the data (it's just the mean of all observed ratings), so there's no update rule to derive for it.
Regularization
The last step before implementation is to add some regularization to the parameters in the loss function, to penalize large values and in turn prevent overfitting.
So now the loss looks like:

J = \sum_{(i,j) \in \Omega} (r_{ij} - \hat{r}_{ij})^2 + \lambda \left( \|W\|_F^2 + \|U\|_F^2 + \|b\|^2 + \|c\|^2 \right)

where ||A||_F denotes the Frobenius norm of a matrix. The derivation of the update rules is a little messy and well documented online, so here are just the results:
For w_i:

w_i = \left( \lambda I + \sum_{j \in \Psi_i} u_j u_j^T \right)^{-1} \sum_{j \in \Psi_i} (r_{ij} - b_i - c_j - \mu) u_j

For u_j:

u_j = \left( \lambda I + \sum_{i \in \Omega_j} w_i w_i^T \right)^{-1} \sum_{i \in \Omega_j} (r_{ij} - b_i - c_j - \mu) w_i

For b_i:

b_i = \frac{1}{|\Psi_i| + \lambda} \sum_{j \in \Psi_i} (r_{ij} - w_i^T u_j - c_j - \mu)

For c_j:

c_j = \frac{1}{|\Omega_j| + \lambda} \sum_{i \in \Omega_j} (r_{ij} - w_i^T u_j - b_i - \mu)
Matrix Factorization in Code
Necessary imports and preprocessing steps
from tqdm import tqdm
import numpy as np
from sklearn.model_selection import train_test_split
import pandas as pd
# Preprocessing steps
# Reading in the dataset
print('loading csv')
ratings_df = pd.read_csv('rating.csv')
print('mappings')
# User and movie IDs aren't guaranteed to run contiguously from 0 to N/M,
# so we remap them to contiguous indices
user_mapping = {}
new_user_id = 0
for user_id in ratings_df['userId'].unique():
    user_mapping[user_id] = new_user_id
    new_user_id += 1
movie_mapping = {}
new_movie_id = 0
for movie_id in ratings_df['movieId'].unique():
    movie_mapping[movie_id] = new_movie_id
    new_movie_id += 1
ratings_df['userId'] = ratings_df['userId'].map(user_mapping)
ratings_df['movieId'] = ratings_df['movieId'].map(movie_mapping)
# We also want a dictionary of ratings for easy access
ratings_dict = {}
for row in ratings_df.itertuples(index=False):
    user_id = row.userId
    movie_id = row.movieId
    rating = row.rating
    ratings_dict[(user_id, movie_id)] = rating
print('train and test sets')
# Splitting into train and test sets
train_df, test_df = train_test_split(ratings_df, test_size=0.2, random_state=42)
print('more mappings')
# And another dictionary for easy access to a user's rated movies in the train set
user_to_movies = {}
for row in train_df.itertuples(index=False):
    user_id = row.userId
    movie_id = row.movieId
    if user_id not in user_to_movies:
        user_to_movies[user_id] = []
    user_to_movies[user_id].append(movie_id)
# And another dictionary for easy access to a movie's users who have rated in the train set
movie_to_users = {}
for row in train_df.itertuples(index=False):
    user_id = row.userId
    movie_id = row.movieId
    if movie_id not in movie_to_users:
        movie_to_users[movie_id] = []
    movie_to_users[movie_id].append(user_id)
print('preprocessing done')
Defining global variables
# Global variables defined here
# We can set a global variable for N and M in a nice way :)
N, M = new_user_id, new_movie_id
#setting lambda (regularization penalty) here
LAMBDA = 20
#we'll also set K (latent dimensionality) here
K = 10
#the global average rating (computed on the train set only, to avoid leaking test data)
MU = train_df['rating'].mean()
#number of epochs
epochs = 25
#The matrices and vectors we'll use for training are defined here
W = np.random.randn(N,K)
U = np.random.randn(K,M)
user_bias = np.random.randn(N)
movie_bias = np.random.randn(M)
Training!
# Training
for epoch in range(epochs):
    # Update the weights and biases for each user
    for i in tqdm(range(N), desc='User', leave=False):
        # It's possible that a user has no ratings in the train set because
        # their only rating is in the test set
        try:
            psi = user_to_movies[i]
        except KeyError:
            continue
        # Need to keep track of 3 things: a matrix, a vector, and a scalar
        A = np.zeros((K, K)) + (LAMBDA * np.eye(K))
        b = np.zeros(K)
        s = 0
        for movieId_j in psi:
            A += np.outer(U[:, movieId_j], U[:, movieId_j])
            b += (ratings_dict[i, movieId_j] - user_bias[i]
                  - movie_bias[movieId_j] - MU) * U[:, movieId_j]
            s += (ratings_dict[i, movieId_j] - np.dot(W[i, :], U[:, movieId_j])
                  - movie_bias[movieId_j] - MU)
        # Update w_i (solve A w = b) and the user bias
        W[i, :] = np.linalg.solve(A, b)
        user_bias[i] = s / (len(psi) + LAMBDA)
    # Update the weights and biases for each movie
    for j in tqdm(range(M), desc='Movies', leave=False):
        # It's possible that a movie has no ratings in the train set because
        # its only rating is in the test set
        try:
            omega = movie_to_users[j]
        except KeyError:
            continue
        # Need to keep track of 3 things: a matrix, a vector, and a scalar
        A = np.zeros((K, K)) + (LAMBDA * np.eye(K))
        b = np.zeros(K)
        s = 0
        for userId_i in omega:
            A += np.outer(W[userId_i, :], W[userId_i, :])
            b += (ratings_dict[userId_i, j] - user_bias[userId_i]
                  - movie_bias[j] - MU) * W[userId_i, :]
            s += (ratings_dict[userId_i, j] - np.dot(W[userId_i, :], U[:, j])
                  - user_bias[userId_i] - MU)
        # Update u_j (solve A u = b) and the movie bias
        U[:, j] = np.linalg.solve(A, b)
        movie_bias[j] = s / (len(omega) + LAMBDA)
Function for getting ratings and MSE helper
def get_rating(userId, movieId):
    # Predicted rating: interaction term plus user bias, movie bias, and global mean
    return W[userId, :].dot(U[:, movieId]) + user_bias[userId] + movie_bias[movieId] + MU

def calculate_mse(test_df):
    mse = []
    for row in test_df.itertuples(index=False):
        predicted_rating = get_rating(row.userId, row.movieId)
        mse.append((row.rating - predicted_rating) ** 2)
    return np.mean(mse)
Moment of truth:
print(calculate_mse(test_df))
#0.55589803128
A little bit better than our collaborative filtering method! Very nice. See you guys in part 5!