Collaborative Filtering in Recommender Systems: An Overview

6 min read · Nov 4, 2023

A review of how collaborative filtering uses data from other users or items to make personalized recommendations

Why use a recommender system?

A recommender system helps users quickly find the most relevant options when searching for items on e-commerce sites such as Amazon or for content on entertainment platforms such as YouTube or Netflix. A good recommender system can significantly affect the success of these platforms by improving customer satisfaction and engagement.

Imagine trying to decide which movie to watch without being shown a list of recommended movies or shows: it would be a much less pleasant experience. Some users might even opt not to watch anything at all for lack of ideas. In fact, users sometimes don’t know exactly what they are looking for until it pops up as a recommended item. On top of that, great product recommendations can encourage impulse buying, which in turn increases sales and revenue.

Collaborative Filtering

Recommender systems can be either personalized or non-personalized. A non-personalized system can be simpler, but a personalized system tends to work better because it caters to the needs of each individual user. Collaborative filtering is a common method for personalized recommendation that filters information, such as interaction data, from other similar users. Since it works by predicting user ratings, it can be treated as a regression task. There are two general types of collaborative filtering:

User to user
Item to item

User to user collaborative filtering operates under the assumption that users who gave similar ratings to a certain item are likely to have similar preferences for other items as well. This method therefore relies mainly on finding similarity between users. In some cases, however, user preferences might be too abstract to be broken down. This is where item to item collaborative filtering comes in handy: similarity between items is used instead of similarity between users. In this article, we’ll focus on user to user collaborative filtering.

Workflow of collaborative filtering:

The process starts by converting the rating data into a utility matrix in which the users are the rows and the items are the columns.
The next step is the neighborhood-based collaborative filtering model: a similarity function computes the similarity between users, and the output is a similarity matrix.
A certain number (K) of the most similar users (also known as neighbors) is selected, and the rating prediction is obtained by regression on these neighbors’ rating data.
The items are then sorted by predicted rating, and the top items are recommended to the user.
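
The first three steps can be sketched on a toy dataset as follows. This is a minimal illustration, not the article's implementation: the `ratings` table, its column names, and the choice to fill missing ratings with 0 before computing cosine similarity are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format ratings: one row per (user, item, rating)
ratings = pd.DataFrame({
    "user":   ["u1", "u1", "u2", "u2", "u2", "u3", "u3"],
    "item":   ["A",  "B",  "A",  "B",  "C",  "A",  "C"],
    "rating": [5,    4,    5,    4,    2,    1,    5],
})

# Step 1: utility matrix, users as rows and items as columns (NaN = unrated)
utility = ratings.pivot(index="user", columns="item", values="rating")

# Step 2: user-to-user cosine similarity matrix
# (assumption: unrated entries are treated as 0 for this toy example)
filled = utility.fillna(0).to_numpy(dtype=float)
unit = filled / np.linalg.norm(filled, axis=1, keepdims=True)
sim = pd.DataFrame(unit @ unit.T, index=utility.index, columns=utility.index)

# Step 3: the K most similar users to u1, excluding u1 itself
K = 1
neighbors = sim["u1"].drop("u1").nlargest(K)
print(neighbors)
```

Here u2 ends up as the nearest neighbor of u1 because the two users agree on items A and B, while u3 rated A very differently.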

Code

First, let’s define a function to compute a baseline rating, which is the global average rating with the addition of user and item bias.

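import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def baseline_prediction(data, userid, movieid):
    """Calculate the baseline prediction for a user and a movie."""

    # global mean over all observed ratings (NaN entries are skipped)
    global_mean = data.stack().dropna().mean()

    # mean rating given by this user
    user_mean = data.loc[userid, :].mean()

    # mean rating received by this movie
    item_mean = data.loc[:, movieid].mean()

    # user bias: how far this user's ratings sit above the global mean
    user_bias = user_mean - global_mean

    # item bias: how far this movie's ratings sit above the global mean
    item_bias = item_mean - global_mean

    # baseline = global mean + user bias + item bias
    baseline_ui = global_mean + user_bias + item_bias

    return baseline_ui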
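
Written as a formula, the baseline described above is usually defined as follows. The notation is an assumption introduced here for illustration: \mu is the global mean rating, \bar{r}_u is the mean rating given by user u, and \bar{r}_i is the mean rating received by item i.

```latex
b_{ui} = \mu + b_u + b_i,
\qquad b_u = \bar{r}_u - \mu,
\qquad b_i = \bar{r}_i - \mu
```

For example, with \mu = 3.5, \bar{r}_u = 4.0 and \bar{r}_i = 3.0, the biases are b_u = 0.5 and b_i = -0.5, so the baseline is b_{ui} = 3.5 + 0.5 - 0.5 = 3.5. A user who rates generously pushes the baseline up, and a poorly rated item pulls it down.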

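Next, we’ll add another function to find the neighbors based on a similarity score calculated with cosine similarity. We’ll use mean-centered ratings as the input: each movie’s mean rating is subtracted from its ratings, and missing ratings are filled with 0.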

# mean rating of each movie across all users (axis=0 gives column means)
movie_mean = data.mean(axis=0)
# subtract the movie mean from every rating; unrated entries become 0
user_removed_mean_rating = (data - movie_mean).fillna(0)


def find_neighbor(user_removed_mean_rating, userid, k=5):
    """Find the k users most similar to userid by cosine similarity."""

    # rating vector of the target user, shaped (1, n_items)
    user_target = user_removed_mean_rating.loc[[userid]].values

    # cosine similarity between the target user and every user at once
    similarity_score = cosine_similarity(user_target,
                                         user_removed_mean_rating.values)[0]
    similarity = pd.Series(similarity_score,
                           index=user_removed_mean_rating.index)

    # drop the target user explicitly (instead of assuming it sorts first),
    # then keep the k most similar users in descending order
    top_k = similarity.drop(userid).sort_values(ascending=False).head(k)

    # return the neighbor ids and their similarity scores
    return {
        'closest_neighbor': top_k.index.tolist(),
        'closest_neighbor_similarity': top_k.tolist()
    }
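
To see why the ratings are mean-centered before the cosine similarity is computed, consider the small sketch below. The rating values and the `cosine` helper are hypothetical and only serve the illustration; the centering follows the same per-movie (column) mean used above.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy ratings: rows are users, columns are two movies (hypothetical values)
ratings = np.array([
    [5.0, 1.0],   # user 0 prefers movie 0
    [4.0, 2.0],   # user 1 prefers movie 0 too
    [1.0, 5.0],   # user 2 prefers movie 1
])

# Subtract each movie's mean rating (column mean)
centered = ratings - ratings.mean(axis=0)

# Users 0 and 2 have opposite tastes
raw_sim = cosine(ratings[0], ratings[2])
centered_sim = cosine(centered[0], centered[2])
print(raw_sim, centered_sim)
```

On raw ratings, users 0 and 2 still get a positive similarity because all ratings are positive numbers. After centering, their similarity becomes negative, which reflects their opposite preferences, while users 0 and 1 remain highly similar.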

Now that we have a function to obtain the neighbors, it can be utilized to predict item ratings based on the rating data from these neighbors.

def predict_item_rating(userid, movieid, data, neighbor_data, k,
                        max_rating=5, min_rating=1):
    """Predict the rating that userid would give to movieid."""

    # baseline for the target (user, item) pair
    baseline = baseline_prediction(data=data, userid=userid, movieid=movieid)

    # running sums for the similarity-weighted average
    sim_rating_total = 0
    similarity_sum = 0

    # loop over at most k neighbors and their similarity scores
    neighbors = neighbor_data['closest_neighbor'][:k]
    similarities = neighbor_data['closest_neighbor_similarity'][:k]
    for neighbor, similarity in zip(neighbors, similarities):
        # retrieve the neighbor's rating for this movie
        neighbour_rating = data.loc[neighbor, movieid]

        # skip neighbors who have not rated this movie
        if np.isnan(neighbour_rating):
            continue

        # baseline for the (neighbor, item) pair; kept in its own variable
        # so that the target user's baseline is not overwritten
        neighbor_baseline = baseline_prediction(data=data, userid=neighbor,
                                                movieid=movieid)

        # subtract the neighbor's baseline from the rating
        adjusted_rating = neighbour_rating - neighbor_baseline

        # accumulate similarity * adjusted rating, and the similarity itself
        sim_rating_total += similarity * adjusted_rating
        similarity_sum += similarity

    # fall back to the baseline when there is nothing to average
    # (explicit check, since NumPy floats do not raise ZeroDivisionError)
    if similarity_sum == 0:
        user_item_predicted_rating = baseline
    else:
        user_item_predicted_rating = baseline + (sim_rating_total / similarity_sum)

    # clip the prediction to the valid rating range
    if user_item_predicted_rating > max_rating:
        user_item_predicted_rating = max_rating

    elif user_item_predicted_rating < min_rating:
        user_item_predicted_rating = min_rating

    return user_item_predicted_rating
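
The prediction computed by this function can be summarized in one formula. The symbols are assumptions introduced here for readability: b_{ui} is the baseline for user u and item i, s_{uj} is the similarity between user u and neighbor j, r_{ji} is neighbor j's rating of item i, and N_k(u; i) is the set of the k nearest neighbors of u who have rated i.

```latex
\hat{r}_{ui} = b_{ui} +
  \frac{\sum_{j \in N_k(u;\,i)} s_{uj}\,\bigl(r_{ji} - b_{ji}\bigr)}
       {\sum_{j \in N_k(u;\,i)} s_{uj}}
```

The result is then clipped to the range between the minimum and maximum rating, and it falls back to b_{ui} when the denominator is zero. As a small worked example, take b_{ui} = 3.5 and two neighbors with (s, r, b) = (0.8, 5, 4) and (0.2, 2, 3). The numerator is 0.8 \cdot 1 + 0.2 \cdot (-1) = 0.6, the denominator is 1.0, and the prediction is 3.5 + 0.6 = 4.1.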

The last function we’ll need sorts the predicted ratings and recommends the items with the highest predicted ratings.

def recommend_items(data, userid, n_neighbor, n_items,
                    recommend_seen=False):
    """Generate the top n_items recommendations for the given userid."""

    # find neighbors (uses the mean-centered matrix
    # user_removed_mean_rating defined earlier at module level)
    neighbor_data = find_neighbor(user_removed_mean_rating=user_removed_mean_rating,
                                  userid=userid, k=n_neighbor)

    # by default, predict only for movies the user has not rated yet
    mask = np.isnan(data.loc[userid])
    item_to_predict = data.columns[mask]

    # optionally include movies the user has already rated
    if recommend_seen:
        item_to_predict = data.columns

    # predict a rating for every candidate movie
    predicted_ratings = []
    for movie in item_to_predict:
        preds = predict_item_rating(userid=userid, movieid=movie,
                                    data=data,
                                    neighbor_data=neighbor_data, k=n_neighbor)
        predicted_ratings.append(preds)

    # collect movie ids and predictions (item_to_predict rather than the
    # unseen mask, so the lengths match when recommend_seen=True)
    prediction_df = pd.DataFrame({
        'movieId': item_to_predict,
        'predicted_ratings': predicted_ratings
    })

    # sort by predicted rating and keep the top n_items
    prediction_df = (prediction_df
                     .sort_values('predicted_ratings', ascending=False)
                     .head(n_items))

    return prediction_df

Implementation Example

The data used is the Amazon movie ratings dataset available on Kaggle. Let’s take a look at what the utility matrix looks like:

To recommend the top 5 movies to the first user on the list based on the 5 most similar users, we can use the recommend_items function defined above:

user_1_recommendation = recommend_items(data=data, userid='A3R5OBKS7OM2IR', n_neighbor=5, n_items=5,
                                        recommend_seen=False)

Result:

This means that we should recommend movies 3, 90, 26, 154, and 144 to user_id ‘A3R5OBKS7OM2IR’, with the prediction that this user will give a rating of 5.0 to each of them.

Conclusion & Recommendations

Collaborative filtering is a method for building a recommender system that uses data from other similar users or items to predict how users will rate items they have not yet purchased or viewed. These predicted ratings are then used to generate a list of top items to recommend. This can be an effective strategy for increasing user engagement and sales, since the recommendations are personalized to each individual user’s preferences.

The example implementation in this article aims to provide a conceptual understanding of how collaborative filtering works. It can be optimized further by tuning hyperparameters such as the similarity function or the number of neighbors considered.
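
As one illustration of changing the similarity function, the sketch below ranks neighbors by Pearson correlation instead of cosine similarity. It is only a sketch under stated assumptions: the toy `data` matrix and the `pearson_neighbors` helper are hypothetical and are not part of the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical utility matrix: users as rows, movies as columns, NaN = unrated
data = pd.DataFrame(
    {
        "m1": [5, 4, 1, np.nan],
        "m2": [4, 5, 2, 1],
        "m3": [1, 2, 5, 4],
        "m4": [np.nan, 1, 4, 5],
    },
    index=["u1", "u2", "u3", "u4"],
)

def pearson_neighbors(data, userid, k=5):
    """Return the k users with the highest Pearson correlation to userid."""
    # DataFrame.corr correlates columns, so transpose to correlate users;
    # each pair is compared only on the movies both users have rated
    user_corr = data.T.corr(method="pearson")
    # drop the user itself and any pair without enough shared ratings
    return user_corr[userid].drop(userid).dropna().nlargest(k)

top = pearson_neighbors(data, "u1", k=1)
print(top)
```

The number of neighbors `k` can be varied in the same way, and the resulting predictions compared on held-out ratings to choose the setting that works best for the data.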

References:

https://neptune.ai/blog/recommender-systems-metrics

What Is Collaborative Filtering: A Simple Introduction (builtin.com)

The code used in this article is also available at GitHub.