Item-Based Collaborative Filtering

Sumeet Agrawal
5 min readOct 13, 2021

--

Item-based collaborative filtering is the recommendation system to use the similarity between items using the ratings by users.

The fundamental assumption for this method is that a user gives similar ratings to similar movies. Consider the following table for movie ratings.

In this example, the rating for Movie_1 by User_1 is empty. Let’s predict this rating using item-based collaborative filtering.

  • Step 1: Find the most similar (the nearest) movies to the movie for which you want to predict the rating.

There are multiple ways to find the nearest movies. Here, I use the cosine similarity. In using the cosine similarity, replace the missing value for 0. Look at the graph below. The x-axis represents ratings by User_0 and the y-axis ratings by User_1. Then, we can find points for each movie in the space. For example, Movie_3 corresponds to the point (5,2) in the space.

The cosine similarity uses cos(θ) to measure the distance between two vectors. As θ increases, cos(θ) decreases (cos(θ) = 1 when θ = 0 and cos(θ) = 0 when θ = 90). Therefore, as the value of θ is smaller, the two vectors are considered closer (the similarity gets greater). Since θ1 is the smallest and θ3 is the largest, Movie_3 is the closest to Movie_1, and Movie_2 is the farthest.

What is noteworthy here is that the similarities between movies are determined by ALL USERS. For example, the similarity between Movie_1 and Movie_3 is calculated using the ratings by both of User_0 and User_1.

  • Step 2: Calculate the weighted average of the ratings for the most similar movies by the user.

A user gives similar ratings to similar movies. Therefore, when we predict a rating for a movie by a user, it is reasonable to use the average of ratings for similar movies by the user. Set the number of the nearest neighbors as 2. Then we use Movie_3 and Movie_0 to predict the rating for Movie_1 by User_1.

The rating for Movie_3 by User_1 is 2, and the rating for Movie_0 by User_1 is 3. If Movie_3 and Movie_0 are similar to Movie_1 at the same distance, we can estimate the rating for Movie_1 by User_1 as 2.5. However, if Movie_3 is considered closer to Movie_1, the weight for Movie_3 should be greater than that for Movie_0. Therefore, the predicted rating for Movie_1 will be closer to the rating for Movie_3 as the picture below indicates. Using the cosine similarity as the weight, the predicted rating is 2.374.

Making a Movie Recommender

For better understanding, the dataset with 10 movies and 10 users is used here. (the numbers are randomly selected.)

df
  • 10 movies and 10 users
  • 0 represents missing values.
  • The percentage of users rating movies is 50% (only 50 ratings are given). In real movie datasets, this percentage becomes less than 10%.

I assume that a user did not watch a movie if the user did not rate the movie.

Calculate the Nearest Neighbors

The NearestNeighbors() in the sklearn.neighbors the library can be used to calculate the distance between movies using the cosine similarity and find the nearest neighbors for each movie.

from sklearn.neighbors import NearestNeighborsknn = NearestNeighbors(metric='cosine', algorithm='brute')
knn.fit(df.values)
distances, indices = knn.kneighbors(df.values, n_neighbors=3)

The number of the nearest neighbors (n_neighbors)is set to be 3. Since this includes the movie itself, generally it finds two nearest movies except for the movie itself.

indices[Out] array([[0, 7, 5],
[1, 3, 7],
[2, 1, 6],
....
[9, 0, 7]])

indices shows the indices of the nearest neighbors for each movie. Each row corresponds to the row in the df. The first element in a row is the most similar (nearest) movie. Generally, it is the movie itself. The second element is the second nearest, and the third is the third nearest. For example, in the first row [0,7,5], the nearest movie to movie_0 is itself, the second nearest movie is movie_7, and the third is movie_5

distances[Out] array([[0.00000000e+00, 3.19586183e-01, 4.03404722e-01],        [4.44089210e-16, 3.68421053e-01, 3.95436458e-01],        [0.00000000e+00, 5.20766162e-01, 5.24329288e-01],
....
[1.11022302e-16, 4.22649731e-01, 4.81455027e-01]])

distances shows the distance between movies. A smaller number means the movie is closer. Each number in this array corresponds to the number in the indices array.

Example: Predict a Rating for a Movie by a User

Predicting a rating for a movie by a user is equivalent to calculating the weighted average of ratings for similar movies by the user.

For practice, predict the rating for movie_0 by user_7. First, find the nearest neighbors for movie_0 using NearestNeighbors().

The most similar movies to movie_0 are movie_7 and movie_5, and the distances from movie_0 are 0.3196 and 0.4034, respectively.

The formula to calculate the predicted rating is as follows:

R(m, u) = {∑ ⱼ S(m, j)R(j, u)}/ ∑ ⱼ S(m, j)

  • R(m, u): the rating for movie m by user u
  • S(m, j): the similarity between movie m and movie j
  • j ∈ J where J is the set of the similar movies to movie m

This formula simply implies that the predicted rating for movie m by user u is the weighted average of ratings for similar movies by user u. The weight for each rating, (S(m, k)/∑ ⱼ S(m, j)), becomes greater when movie m and movie k are closer. The denominator of this term makes the sum of all the weights equal to 1.

Let’s predict the rating for movie_0 by user_7, R(0,7):

R(0,7)=[S(0,5)∗R(5,7)+S(0,7)∗R(7,7)]/[S(0,5)+S(0,7)]

Since the distances between movie_0 and movie_5 and between movie_0 and movie_7 are 0.4034 and 0.3196, the similarities are

  • S(0,5) = (1–0.4034)
  • S(0,7) = (1–0.3196).

Because R(5,7) = 2 and R(7,7) = 3, the predicted R(0,7) is 2.5328.

--

--