Collaborative Filtering in Recommender Systems: An Overview

Evelyn
6 min read · Nov 4, 2023

A review of how collaborative filtering uses data from other users or items to make personalized recommendations

Why use a recommender system?

A recommender system helps users quickly find the most relevant options when searching for items on e-commerce sites such as Amazon or for content on entertainment platforms such as YouTube or Netflix. A good recommender system can significantly affect the success of these platforms by improving customer satisfaction and engagement.

Imagine trying to decide which movie to watch without being shown a list of recommended movies or shows. It would be a much less pleasant experience, and some users might opt not to watch anything at all for lack of ideas. Often, users don’t know exactly what they are looking for until it pops up as a recommended item. On top of that, great product recommendations can encourage impulse buying, which in turn increases sales and revenue.

Collaborative Filtering

A recommender system can be either personalized or non-personalized. A non-personalized system can be simpler, but a personalized system tends to work better because it caters to the needs of each individual user. Collaborative filtering is a common approach to personalized recommendation that draws on interaction data, such as ratings, from other similar users. Since it works by predicting user ratings, it can be framed as a regression task. There are two general types of collaborative filtering:

  1. User to user
  2. Item to item

User to user collaborative filtering operates under the assumption that users who gave similar ratings to certain items are likely to have similar preferences for other items as well. This method therefore relies mainly on finding similarity between users. In some cases, however, user preferences might be too abstract to be broken down. This is where item to item collaborative filtering comes in handy: similarity between items is used instead of similarity between users. In this article, we’ll focus on user to user collaborative filtering.

Workflow of collaborative filtering:

User to user collaborative filtering
  1. The process starts by converting the rating data into a utility matrix in which users are the rows and items are the columns.
  2. Next comes the neighborhood collaborative filtering model: a similarity function computes the similarity between users, and the output is a similarity matrix.
  3. A fixed number (K) of the most similar users (also known as neighbors) is selected, and the rating prediction is obtained by doing regression on these neighbors’ rating data.
  4. The items are then sorted by predicted rating, from highest to lowest, and the top items are recommended to the user.
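
To make the four steps concrete, here is a minimal sketch in plain Python with no dependencies. The toy ratings, the helper names `cosine` and `recommend`, and the use of raw (not mean-centered) cosine similarity with a simple weighted average are all assumptions made for illustration; the functions later in this article use baselines and mean-centered ratings instead.

```python
from math import sqrt

# Hypothetical toy ratings as (user, item, rating) triples.
ratings = [
    ("u1", "A", 5), ("u1", "B", 4),
    ("u2", "A", 5), ("u2", "B", 4), ("u2", "C", 5),
    ("u3", "A", 1), ("u3", "B", 2), ("u3", "C", 1),
]

# Step 1: utility matrix as a nested dict (users are rows, items are columns).
utility = {}
for user, item, rating in ratings:
    utility.setdefault(user, {})[item] = rating


def cosine(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def recommend(target, k=1, n_items=1):
    # Step 2: similarity between the target user and every other user.
    sims = {u: cosine(utility[target], row)
            for u, row in utility.items() if u != target}

    # Step 3: keep the K most similar users and predict each unseen item
    # as the similarity-weighted average of the neighbors' ratings.
    neighbors = sorted(sims, key=sims.get, reverse=True)[:k]
    unseen = {i for u in neighbors for i in utility[u]} - set(utility[target])
    predictions = {}
    for item in unseen:
        rated = [u for u in neighbors if item in utility[u]]
        weight = sum(sims[u] for u in rated)
        if weight > 0:
            predictions[item] = sum(sims[u] * utility[u][item] for u in rated) / weight

    # Step 4: sort by predicted rating and keep the top items.
    return sorted(predictions.items(), key=lambda p: p[1], reverse=True)[:n_items]


print(recommend("u1", k=1))
```

With `k=1` the only neighbor is `u2`, whose ratings match `u1` exactly on the shared items, so item `C` is recommended with a predicted rating of about 5. With `k=2`, the low rating from `u3` pulls the prediction down to roughly 3.1.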

Code

First, let’s define a function to compute a baseline rating, which is the global average rating plus the user bias and the item bias.

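import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def baseline_prediction(data, userid, movieid):
    """Calculate the baseline prediction for a user and a movie."""

    # global mean over all observed ratings
    global_mean = data.stack().dropna().mean()

    # mean rating given by this user
    user_mean = data.loc[userid, :].mean()

    # mean rating received by this movie
    item_mean = data.loc[:, movieid].mean()

    # user bias: how far the user's mean sits above or below the global mean
    user_bias = user_mean - global_mean

    # item bias: how far the movie's mean sits above or below the global mean
    item_bias = item_mean - global_mean

    # baseline = global mean + user bias + item bias
    baseline_ui = global_mean + user_bias + item_bias

    return baseline_ui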
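
As a worked example of the baseline arithmetic, the sketch below uses a small hypothetical rating table in plain Python. It assumes each bias is defined as the deviation of the user’s (or item’s) mean from the global mean; the names `ratings` and `mean` exist only for this illustration.

```python
# Hypothetical toy ratings; a missing key means the user has not rated the item.
ratings = {
    "alice": {"m1": 5, "m2": 4},
    "bob":   {"m1": 3, "m2": 1, "m3": 2},
    "carol": {"m1": 4, "m3": 2},
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Global mean over every observed rating: 21 / 7 = 3.0
global_mean = mean(r for row in ratings.values() for r in row.values())

# Alice rates generously: her mean is 4.5, so her bias is +1.5
user_bias = mean(ratings["alice"].values()) - global_mean

# Item m3 is rated poorly: its mean is 2.0, so its bias is -1.0
item_bias = mean(row["m3"] for row in ratings.values() if "m3" in row) - global_mean

# Baseline for (alice, m3): 3.0 + 1.5 - 1.0 = 3.5
baseline = global_mean + user_bias + item_bias
print(baseline)
```

The generous user pushes the baseline up and the poorly rated item pulls it down, so the baseline for Alice on `m3` lands at 3.5 before any neighbor information is used.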

Next, we’ll add another function to find the neighbors based on a similarity score calculated using cosine similarity. We’ll use mean-centered ratings as the input, with missing values filled with 0.

# calculate the mean rating across all users for each movie
# (axis=0 averages down each column, and each column is a movie)
movie_mean = data.mean(axis=0)
user_removed_mean_rating = (data - movie_mean).fillna(0)


def find_neighbor(user_removed_mean_rating, userid, k=5):
    """Find the k users most similar to userid."""
    # Generate the similarity score
    n_users = len(user_removed_mean_rating.index)
    similarity_score = np.zeros(n_users)

    # get the target user's rating vector
    user_target = user_removed_mean_rating.loc[userid].values.reshape(1, -1)

    # Iterate over all users
    for i, neighbor in enumerate(user_removed_mean_rating.index):
        # Extract the neighbor's rating vector
        user_neighbor = user_removed_mean_rating.loc[neighbor].values.reshape(1, -1)

        # Calculate the similarity (we use cosine similarity);
        # the result is a 1x1 array, so take its single element
        sim_i = cosine_similarity(user_target, user_neighbor)[0, 0]

        # Store the score
        similarity_score[i] = sim_i

    # Indices sorted in descending order of similarity_score
    sorted_idx = np.argsort(similarity_score)[::-1]

    # Similarity scores sorted in descending order
    similarity_score = np.sort(similarity_score)[::-1]

    # Take the k closest neighbors; position 0 is skipped because the
    # most similar user is expected to be the target user themself
    closest_neighbor = user_removed_mean_rating.index[sorted_idx[1:k + 1]].tolist()

    # Slice the matching similarity scores
    neighbor_similarity = list(similarity_score[1:k + 1])

    return {
        'closest_neighbor': closest_neighbor,
        'closest_neighbor_similarity': neighbor_similarity
    }
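
To see why the ratings are mean-centered before computing cosine similarity, consider the small sketch below. It is plain Python, and it centers each vector on its own mean, which is an assumption made for this illustration; the helper names `cosine` and `center` and the two example users are hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def center(v):
    """Subtract the vector's own mean from every entry."""
    m = sum(v) / len(v)
    return [x - m for x in v]


# Two hypothetical users with opposite tastes on the same three movies.
fan = [5, 4, 1]
critic = [1, 2, 5]

raw = cosine(fan, critic)                        # positive, about 0.5
centered = cosine(center(fan), center(critic))   # close to -1

print(round(raw, 3), round(centered, 3))
```

Because all raw ratings are positive, raw cosine similarity stays positive even for users who disagree. After centering, ratings above and below the mean take opposite signs, so opposite tastes show up as a negative similarity.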

Now that we have a function to obtain the neighbors, it can be utilized to predict item ratings based on the rating data from these neighbors.

def predict_item_rating(userid, movieid, data, neighbor_data, k,
                        max_rating=5, min_rating=1):
    """Predict the rating that userid would give to movieid."""

    # baseline for the target user and item (u, i)
    baseline = baseline_prediction(data=data, userid=userid, movieid=movieid)

    # running sums for the weighted average
    sim_rating_total = 0
    similarity_sum = 0

    # loop over the k neighbors
    for i in range(k):
        neighbor_id = neighbor_data['closest_neighbor'][i]
        similarity = neighbor_data['closest_neighbor_similarity'][i]

        # retrieve the neighbor's rating
        neighbour_rating = data.loc[neighbor_id, movieid]

        # skip neighbors who have not rated this item
        if np.isnan(neighbour_rating):
            continue

        # baseline for the neighbor and item (j, i); kept in its own
        # variable so the target user's baseline is not overwritten
        neighbor_baseline = baseline_prediction(data=data, userid=neighbor_id,
                                                movieid=movieid)

        # subtract the neighbor's baseline from the rating
        adjusted_rating = neighbour_rating - neighbor_baseline

        # accumulate similarity * adjusted rating
        sim_rating_total += similarity * adjusted_rating

        # accumulate the similarity weights
        similarity_sum += similarity

    # fall back to the baseline when there is nothing to average;
    # NumPy floats do not raise ZeroDivisionError, so check explicitly
    if similarity_sum == 0:
        user_item_predicted_rating = baseline
    else:
        user_item_predicted_rating = baseline + (sim_rating_total / similarity_sum)

    # keep the prediction within the rating boundaries
    if user_item_predicted_rating > max_rating:
        user_item_predicted_rating = max_rating

    elif user_item_predicted_rating < min_rating:
        user_item_predicted_rating = min_rating

    return user_item_predicted_rating
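
The arithmetic behind this prediction can be checked by hand with a few made-up numbers. The sketch below is a simplified, dependency-free restatement under stated assumptions: the function name `predict` and the tuples of (similarity, rating, neighbor baseline) are hypothetical inputs, not values from the dataset.

```python
def predict(target_baseline, neighbors, min_rating=1, max_rating=5):
    """Baseline plus the similarity-weighted average of neighbor deviations.

    neighbors: list of (similarity, rating, neighbor_baseline) tuples.
    """
    weighted = sum(s * (r - b) for s, r, b in neighbors)
    total = sum(s for s, _, _ in neighbors)

    # With no usable neighbors, fall back to the baseline alone.
    if total == 0:
        prediction = target_baseline
    else:
        prediction = target_baseline + weighted / total

    # Clip the result to the valid rating range.
    return min(max(prediction, min_rating), max_rating)


# One neighbor rated the item 1.0 above their baseline (similarity 0.8),
# another rated it 1.0 below theirs (similarity 0.4).
example = predict(3.5, [(0.8, 5, 4.0), (0.4, 2, 3.0)])
print(round(example, 3))
```

Here the weighted deviation is (0.8 × 1.0 + 0.4 × (−1.0)) / 1.2 ≈ 0.333, so the prediction is about 3.83: the more similar neighbor has the larger say.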

The last function we need sorts the predicted ratings and recommends the items with the highest predicted ratings.

def recommend_items(data, userid, n_neighbor, n_items,
                    recommend_seen=False):
    """Generate recommendations for the given userid."""

    # find neighbors (uses the user_removed_mean_rating computed earlier)
    neighbor_data = find_neighbor(user_removed_mean_rating=user_removed_mean_rating,
                                  userid=userid, k=n_neighbor)

    # create empty dataframe to store prediction result
    prediction_df = pd.DataFrame()
    # create list to store prediction result
    predicted_ratings = []

    # by default, predict only the items the user has not rated yet
    mask = np.isnan(data.loc[userid])
    item_to_predict = data.columns[mask]

    if recommend_seen:
        item_to_predict = data.columns

    # loop over the movies to predict
    for movie in item_to_predict:
        # predict rating, using the same number of neighbors found above
        preds = predict_item_rating(userid=userid, movieid=movie,
                                    data=data,
                                    neighbor_data=neighbor_data, k=n_neighbor)

        # append
        predicted_ratings.append(preds)

    # assign movieId (must match the items that were actually predicted)
    prediction_df['movieId'] = item_to_predict

    # assign prediction result
    prediction_df['predicted_ratings'] = predicted_ratings

    # sort by predicted rating and keep the top n_items
    prediction_df = (prediction_df
                     .sort_values('predicted_ratings', ascending=False)
                     .head(n_items))

    return prediction_df

Implementation Example

The data used is the Amazon movie ratings dataset available from Kaggle. Let’s take a look at what the utility matrix looks like:

To recommend the top 5 movies to the first user on the list based on the 5 most similar users, we can use the recommend_items function defined above, as shown here:

user_1_recommendation = recommend_items(data=data, userid='A3R5OBKS7OM2IR',
                                        n_neighbor=5, n_items=5,
                                        recommend_seen=False)

Result:

This means that we should recommend movies 3, 90, 26, 154, and 144 to user_id ‘A3R5OBKS7OM2IR’, with the prediction that this user will give a rating of 5.0 to each of them.

Conclusion & Recommendations

Collaborative filtering is a method to build a recommender system that utilizes data from other similar users or items to predict how users will rate items that they have not purchased or viewed yet. This rating prediction will then be used to generate a list of possible top items to be recommended. This is an effective strategy to increase user engagement and sales as the recommendations are highly personalized to fit the preference of each individual user.

The example implementation in this article aims to provide a conceptual understanding of how collaborative filtering works. It can be optimized further by tuning hyperparameters such as the similarity function or the number of neighbors considered.

Reference:

https://neptune.ai/blog/recommender-systems-metrics

The code used in this article is also available at GitHub.
