Decoding the Matrix: A Journey into the Algorithms of Recommender Systems — Part 3: Collaborative Filtering

Abelardo Riojas
8 min read · Jun 22, 2023


With the power of friendship anything is possible, Neo!

Collaborative filtering is the process of assigning ratings based on the ratings of other users who are similar to you. In part 1, we saw how non-personalized recommendations assign a score to an item j by using its average rating.

Back then, our scoring function s took only the item j as input. The score is the sum, over all users who have rated item j (the set Omega_j), of r_ij, the rating user i gave item j. This is all divided by the magnitude of the set, which is just the number of ratings item j has.
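Written out, that is:

s(j) = \frac{1}{|\Omega_j|} \sum_{i \in \Omega_j} r_{ij}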

The goal of this section is to personalize the ratings so the scoring function depends not just on the item j but also on the user i. The centerpiece of all of this will be the User-Ratings matrix R, a sparse N x M matrix of all N users and their ratings across M items.

User-User Collaborative Filtering

Here’s a very, very small and simple example of a User-Ratings matrix
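To keep the picture concrete, here's one hypothetical version of it (ratings I'm making up purely for illustration; blanks mean the user hasn't rated that movie):

            Star Wars   The Matrix   Titanic   The Notebook
Alice           5            4          2           1
Bob                          5          2           1
Carol           1            1          5           4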

Clearly, Star Wars is a good recommendation for Bob, as his ratings are similar to Alice's. But how do we modify the above equation to account for that? We might use some sort of weighting to make Alice's ratings more important compared to Carol's.

That kind of shifts the problem one step down (how do we calculate these weights?) but it’s an important step of this derivation.

There's also another problem: some users might be really biased and rate items on different scales. A 3 for you might be a 1 for me, and a 5-star movie might just be an alright 4. To fix this, we define a deviation to be the difference between a rating and a user's mean rating.
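In symbols, user i's deviation on item j is:

dev(i, j) = r_{ij} - \bar{r}_i

where \bar{r}_i is user i's average rating.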

So we’re actually not interested in an item’s average rating but rather in an item’s deviation from how you normally feel.

Now our scoring function looks like the following:
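Written in LaTeX (this is exactly what the code below computes):

s(i, j) = \bar{r}_i + \frac{\sum_{i' \in \Omega_j} w_{ii'} \, \mathrm{dev}(i', j)}{\sum_{i' \in \Omega_j} |w_{ii'}|}

where \Omega_j is the set of users who have rated item j, and w_{ii'} is the weight between users i and i', which we still need to define.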

To calculate the weights we’ll use the Pearson correlation coefficient on just the data we have (remember the matrix is sparse)
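In LaTeX:

w_{ii'} = \frac{\sum_{j \in \Psi_{i,i'}} (r_{ij} - \bar{r}_i)(r_{i'j} - \bar{r}_{i'})}{\sqrt{\sum_{j \in \Psi_{i,i'}} (r_{ij} - \bar{r}_i)^2}\,\sqrt{\sum_{j \in \Psi_{i,i'}} (r_{i'j} - \bar{r}_{i'})^2}}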

Where Psi_{i,i’} is the set of items that both user i and user i’ have rated.

The Pearson correlation gives off those old high school stats vibes, yes, but when you consider that we’re dealing with deviations you realize that it’s actually identical to the simpler, sexier cosine similarity.
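To make that concrete: if x_i and x_{i'} are the two users' deviation vectors restricted to the items in Psi_{i,i'}, the weight above is just

w_{ii'} = \frac{x_i \cdot x_{i'}}{\lVert x_i \rVert \, \lVert x_{i'} \rVert}

which is the cosine similarity between the deviation vectors.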

But there is a problem if Psi_{i,i'} is either the empty set or very small. The solution is to add some logic to the weight calculation that only uses Psi_{i,i'} if its magnitude is larger than, say, 5 items.

In practice, the weights are precomputed and updated on a regular basis. And when actually computing scores s(i,j), only the top 25 or 50 users (a neighborhood, if you will) with the highest absolute correlation to you are used.

It's important to note that the "top" users should be chosen by absolute weight, since negatively correlated users can provide valuable information. What matters is the strength of the correlation, whichever direction it points.

So in retrospect, both Alice's and Carol's ratings are valuable to Bob. User-User collaborative filtering tells us that we should recommend to Bob the movies that Alice really likes, as well as the movies that Carol really doesn't like.

User-User Collaborative Filtering In Code

Imports we need, preprocessing the data, and initializing the weight matrix

import numpy as np
from sklearn.model_selection import train_test_split
from tqdm import tqdm   #progress bar for the weight loop below
import pandas as pd

#reading in the movie lens dataset
ratings_df = pd.read_csv('/content/drive/MyDrive/movie_lens/rating.csv')

#there's way too many ratings so let's filter out anyone who isn't one of
#the top 1000 raters or any movie not in the top 1000 rated
user_counts = ratings_df['userId'].value_counts().sort_values(ascending=False)
movie_counts = ratings_df['movieId'].value_counts().sort_values(ascending=False)

top_users = set(user_counts.index[:1000])
top_movies = set(movie_counts.index[:1000])

filtered_df = ratings_df[ratings_df['userId'].isin(top_users) &
                         ratings_df['movieId'].isin(top_movies)]

#need something to evaluate on so let's make a test dataframe to look at the MSE later
train_df, test_df = train_test_split(filtered_df, test_size=0.2, random_state=42)

#calculating the average rating of each user
user_avg_rating = train_df.groupby('userId')['rating'].mean()

#creating a pivot table for easy access to ratings
#could also make two dictionaries for easy access to a movie's ratings and a user's ratings
pivot_table = train_df.pivot(index='movieId', columns='userId', values='rating')

# Initialize the weight matrix
num_users = len(top_users)
weight_matrix = np.zeros((num_users, num_users))

Calculating the weights for each user.

for i, userId_1 in tqdm(enumerate(user_avg_rating.index)):
    #what we'll need from each user
    user_1_movies = pivot_table[userId_1].dropna().index
    user_1_avg_rating = user_avg_rating[userId_1]

    for j, userId_2 in enumerate(user_avg_rating.index):

        if userId_1 == userId_2:
            continue

        #finding the common movies using set intersection
        user_2_movies = pivot_table[userId_2].dropna().index
        common_movies = np.intersect1d(user_1_movies, user_2_movies)

        #logic if the set is too small
        if len(common_movies) < 5:
            continue

        #the average rating for user 2
        user_2_avg_rating = user_avg_rating[userId_2]

        #getting the ratings for the common movies from the pivot table
        user_1_ratings = pivot_table.loc[common_movies, userId_1]
        user_2_ratings = pivot_table.loc[common_movies, userId_2]

        #calculating the Pearson correlation coefficient
        numerator = np.sum((user_1_ratings - user_1_avg_rating) *
                           (user_2_ratings - user_2_avg_rating))
        denominator = (np.sqrt(np.sum((user_1_ratings - user_1_avg_rating) ** 2)) *
                       np.sqrt(np.sum((user_2_ratings - user_2_avg_rating) ** 2)))

        #updating the weight matrix (skipping pairs with zero variance to avoid dividing by zero)
        if denominator != 0:
            weight_matrix[i, j] = numerator / denominator

Scoring function following the model we described above

def scoring_function(userId, movieId, k=25):
    #finding the user index in the weight matrix
    user_index = np.where(user_avg_rating.index == userId)[0][0]

    #finding the top k users by absolute weight
    neighbors = np.argsort(np.abs(weight_matrix[user_index, :]))[::-1][:k]

    #finding the rating each neighbor gave the movie
    neighbor_ratings = pivot_table.loc[movieId, user_avg_rating.index[neighbors]]

    #finding the average rating of each neighbor
    neighbor_avg_ratings = user_avg_rating[user_avg_rating.index[neighbors]]

    #calculating the numerator and denominator for the scoring calculation
    numerator = np.sum(weight_matrix[user_index, neighbors] *
                       (neighbor_ratings - neighbor_avg_ratings))
    denominator = np.sum(np.abs(weight_matrix[user_index, neighbors]))

    #calculating the final score (falling back to the user's average if there are no usable neighbors)
    user_avg = user_avg_rating[userId]
    score = (user_avg + numerator / denominator) if denominator != 0 else user_avg

    return score
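Before evaluating on the whole test set, it can help to sanity-check a single prediction. Here I just grab whatever row happens to be first in the test set rather than a hand-picked ID:

#comparing one prediction to the held-out rating
sample = test_df.iloc[0]
predicted = scoring_function(sample['userId'], sample['movieId'])
print(f"predicted: {predicted:.2f}, actual: {sample['rating']}")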

MSE function and evaluation

def calculate_mse(test_df, scoring_function):
    mse = 0

    for _, row in test_df.iterrows():
        userId = row['userId']
        movieId = row['movieId']
        rating = row['rating']

        #calculating the score using the scoring function
        score = scoring_function(userId, movieId)

        #calculating the squared difference between the score and the actual rating
        squared_diff = (score - rating) ** 2

        #accumulating the squared difference to calculate the MSE
        mse += squared_diff

    #calculating the average MSE
    mse /= len(test_df)

    return mse

#calculate the MSE on the test set using the scoring function
mse = calculate_mse(test_df, scoring_function)

print("Mean Squared Error (MSE):", mse)

#Output: Mean Squared Error (MSE): 0.6302073618239379

A mean squared error of about 0.63 works out to being roughly 0.8 stars off on average (the square root of the MSE). Not too shabby!

Item-Item Collaborative Filtering

It's a lot like user-user, but we're looking at things column-wise instead of row-wise. Two items are similar if their column vectors are similar.

Here's how we calculate item-item correlation; it's a lot like user-user.

Here’s the weight calculation:
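In LaTeX:

w_{jj'} = \frac{\sum_{i \in \Omega_{jj'}} (r_{ij} - \bar{r}_j)(r_{ij'} - \bar{r}_{j'})}{\sqrt{\sum_{i \in \Omega_{jj'}} (r_{ij} - \bar{r}_j)^2}\,\sqrt{\sum_{i \in \Omega_{jj'}} (r_{ij'} - \bar{r}_{j'})^2}}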

Where Omega_{jj'} is the set of all users who rated both item j and item j', and r_j bar is the average rating for item j.

Scoring looks a little different but it’s pretty much the same principles.
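Written out (again matching the code below):

s(i, j) = \bar{r}_j + \frac{\sum_{j' \in \Psi_i} w_{jj'} \, (r_{ij'} - \bar{r}_{j'})}{\sum_{j' \in \Psi_i} |w_{jj'}|}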

Where Psi_i is the set of all items user i has rated.

The logic here is: if user i likes item j' more than other users do, and the items are correlated, then there's a good chance they'll like item j too.

Some final notes, then code

Item-Item is usually faster than User-User because in practice you will have way more users than items. Therefore the weight matrix will usually be smaller.

You still want to use a top 25 or top 50 neighborhood of items when calculating scores.

Item-Item Collaborative Filtering In Code

Same as before

#ITEM-ITEM CF

#same beginning as before
import numpy as np
from sklearn.model_selection import train_test_split
from math import sqrt
from tqdm import tqdm
import pandas as pd

ratings_df = pd.read_csv('/content/drive/MyDrive/movie_lens/rating.csv')

#filtering the dataframe based on the top users and top movies
user_counts = ratings_df['userId'].value_counts().sort_values(ascending=False)
movie_counts = ratings_df['movieId'].value_counts().sort_values(ascending=False)
top_users = set(user_counts.index[:1000])
top_movies = set(movie_counts.index[:1000])

filtered_df = ratings_df[ratings_df['userId'].isin(top_users) &
                         ratings_df['movieId'].isin(top_movies)]

train_df, test_df = train_test_split(filtered_df, test_size=0.2, random_state=42)

#now we want to calculate the average rating of each item
movie_avg_rating = train_df.groupby('movieId')['rating'].mean()

#same pivot table from before except i switched the rows and columns
pivot_table = train_df.pivot(index='userId', columns='movieId', values='rating')


#initializing the weight matrix
num_items = len(top_movies)
item_item_weight_matrix = np.zeros((num_items, num_items))

Finding the weights for each movie

for i, movieId_i in tqdm(enumerate(movie_avg_rating.index)):
    #what we'll need from each movie
    movie_i_users = pivot_table[movieId_i].dropna().index
    movie_i_avg_rating = movie_avg_rating[movieId_i]

    for j, movieId_j in enumerate(movie_avg_rating.index):

        if movieId_i == movieId_j:
            continue

        #finding the common users using set intersection
        movie_j_users = pivot_table[movieId_j].dropna().index
        common_users = np.intersect1d(movie_i_users, movie_j_users)

        #skipping pairs with too few common users
        if len(common_users) < 5:
            continue

        #the average rating for movie j
        movie_j_avg_rating = movie_avg_rating[movieId_j]

        #getting the ratings for the common users from the pivot table
        movie_i_ratings = pivot_table.loc[common_users, movieId_i]
        movie_j_ratings = pivot_table.loc[common_users, movieId_j]

        #calculating the Pearson correlation coefficient
        numerator = np.sum((movie_i_ratings - movie_i_avg_rating) *
                           (movie_j_ratings - movie_j_avg_rating))
        denominator = (np.sqrt(np.sum((movie_i_ratings - movie_i_avg_rating) ** 2)) *
                       np.sqrt(np.sum((movie_j_ratings - movie_j_avg_rating) ** 2)))

        #updating the weight matrix (skipping pairs with zero variance to avoid dividing by zero)
        if denominator != 0:
            item_item_weight_matrix[i, j] = numerator / denominator

Scoring function following the model we made above

def item_item_scoring_function(userId, movieId, k=25):
    #finding the item index in the weight matrix
    item_index = np.where(movie_avg_rating.index == movieId)[0][0]

    #finding the top k items by absolute weight
    neighbors = np.argsort(np.abs(item_item_weight_matrix[item_index, :]))[::-1][:k]

    #finding the rating the user gave each neighbor movie
    neighbor_ratings = pivot_table.loc[userId, movie_avg_rating.index[neighbors]]

    #finding the average rating of each neighbor
    neighbor_avg_ratings = movie_avg_rating[movie_avg_rating.index[neighbors]]

    #calculating the numerator and denominator for the scoring calculation
    numerator = np.sum(item_item_weight_matrix[item_index, neighbors] *
                       (neighbor_ratings - neighbor_avg_ratings))
    denominator = np.sum(np.abs(item_item_weight_matrix[item_index, neighbors]))

    #calculating the final score (falling back to the item's average if there are no usable neighbors)
    item_avg = movie_avg_rating[movieId]
    score = (item_avg + numerator / denominator) if denominator != 0 else item_avg

    return score

MSE evaluation

#reusing calculate_mse from the user-user section
mse = calculate_mse(test_df, item_item_scoring_function)

print("Mean Squared Error (MSE):", mse)

#Output is the same as user-user: 0.6302073618239379

Same output as User-User, which wasn't bad, and in a real system (with far more users than items) the item-item weight matrix would be the smaller one. Woo hoo! This ends part 3 😎

Catch me in part 4!
