Recommendation Systems : User-based Collaborative Filtering using N Nearest Neighbors

Ashay Pathak
SFU Professional Computer Science
9 min read · Feb 25, 2019

Ashay Pathak, Chatana Mandava, Ritesh Patel

Introduction

Collaborative Filtering is a technique widely used in recommendation systems and is a rapidly advancing research area. The two most commonly used methods are memory-based and model-based.

In this post, we will focus only on User-Based Collaborative Filtering (UB-CF), which is a memory-based method. The main idea behind UB-CF is that people with similar characteristics share similar tastes. For example, suppose I want to recommend a movie to our friend Bob. If Bob and I have seen many movies together and rated them almost identically, it makes sense to assume that we will continue to like similar movies in the future, and to use this similarity to recommend movies.

Let’s try to implement UB-CF and generate a list of movies that our friend Bob, a.k.a. the active user, might be interested in watching. The motivation behind writing this post is to dive deep into the algorithm and understand how UB-CF actually works. Most of the content of this post is inspired by a course on Coursera.

User-User Collaborative Filtering

The method identifies users that are similar to the queried user and estimates the desired rating as a weighted average of the ratings of these similar users.

Collaborative Filtering: source here

We will build the recommendations on the MovieLens dataset. The programming language used is Python, and the data analysis work is mostly done using the pandas library. The IDE used is Jupyter Notebook.

So before starting, here is the list of libraries used:

  1. pandas
  2. NumPy
  3. scikit-learn (sklearn)

So let’s move forward and understand the concepts behind recommendations. I have attached some code snippets and outputs in the blog for better understanding. The whole .ipynb file is attached at the end of the blog.

Score function

It’s easy to come up with a function for non-personalized collaborative filtering (i.e. we don’t consider the active user’s likes, dislikes, and past ratings) that takes user u and item i as input parameters and outputs a score quantifying how strongly user u likes/prefers item i.

This is usually done using the ratings of other people similar to the user; all of this is discussed in detail later. For now, the formula I have used is:

s(u,i) = \bar{r}_u + \frac{\sum_{v} w_{u,v} \, (r_{v,i} - \bar{r}_v)}{\sum_{v} w_{u,v}}

Here s(u,i) is the predicted score, u is the active user, i is the item, v ranges over the similar users, r_{v,i} is the rating user v gave to item i, \bar{r}_v is user v’s average rating, and w_{u,v} is the similarity weight between users u and v.

In this case, the score is the active user’s average rating plus the weighted average of how far each similar user’s rating of the item deviates from that user’s own average. The weight w_{u,v} expresses how similar user v is to user u, and therefore how much v should contribute to the prediction for u. The weight ranges between 0 and 1, where 0 is low and 1 is high. Everything looks perfect, so why did we subtract the average rating from each user’s ratings, and why did we use a weighted average instead of a simple mean?

The problem lies in the types of users we are handling. It starts with the fact that people often rate on very different scales. I may be a positive, optimistic user who rates a movie I liked as 4 out of 5, while another user who is less optimistic or has higher standards may rate his favorite movie as 2 out of 5. His 2 is my 4. We can make the algorithm better by normalizing users’ ratings: we compute the score s(u,i) as the average rating the user gives to items plus some deviation, where the deviation expresses how much better or worse this item is than average.

I have used cosine similarity to calculate the weight in the above formula. I have also used the notion of a neighborhood, which is discussed later in this blog.

To normalize the data in the above manner, some data analysis is required in pandas. You can get the whole code at the end; in the blog I will focus on the important concepts.

import pandas as pd

# Load the MovieLens data
movies = pd.read_csv("movies.csv", encoding="Latin1")
Ratings = pd.read_csv("ratings.csv")
Tags = pd.read_csv("tags.csv", encoding="Latin1")

# Average rating per user
Mean = Ratings.groupby(by="userId", as_index=False)['rating'].mean()

# Merge the per-user average back in; 'adg_rating' is the mean-centered rating
Rating_avg = pd.merge(Ratings, Mean, on='userId')
Rating_avg['adg_rating'] = Rating_avg['rating_x'] - Rating_avg['rating_y']
Rating_avg.head()
Normalized ratings

So now we are done calculating the normalized rating for each user. This data will be used to calculate the final score for the user later.

From here on, we will focus on some important concepts related to recommendation systems.

Cosine Similarity

For the above formula, we need to find users who think like the queried user. It sounds interesting to find users with similar likes and dislikes, but how do we measure that similarity?

To answer this, we will use cosine similarity to see how similar the users are. It is usually calculated over the ratings that both users have given in the past.
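
For reference, the cosine similarity between the rating vectors of two users u and v is:

\cos(u,v) = \frac{\sum_{i} r_{u,i} \, r_{v,i}}{\sqrt{\sum_{i} r_{u,i}^2} \, \sqrt{\sum_{i} r_{v,i}^2}}

Since we compute it on mean-centered ratings, it behaves much like the Pearson correlation between the two users’ tastes.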

In our example, I have used the cosine_similarity function from sklearn to calculate the similarity. But before that, we have to perform some pre-processing and clean the data.

from sklearn.metrics.pairwise import cosine_similarity

# User x movie matrix of the mean-centered (adjusted) ratings
final = pd.pivot_table(Rating_avg, values='adg_rating',
                       index='userId', columns='movieId')
pivot table

This matrix contains a lot of NaN values, since no user has seen all the movies; that is why this type of matrix is called a sparse matrix. Methods like matrix factorization are used to deal with this sparsity, but we will not focus on them in this blog. The next step, and an important one, is to replace these NaN values.

There are two methods commonly used for this:

  1. Use the user average over the row.
  2. Use the movie average over the column.

I have used both methods and you can find them in the code at the end. For the explanation I will use the movie-average method, and sketch the user-average variant right after it.

# Replacing NaN by Movie Average
final_movie = final.fillna(final.mean(axis=0))
Replacing NaN values by movie average
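
For completeness, here is a minimal sketch of the first option, filling each row with that user’s own average; I call the result final_user following the naming used later, though the notebook’s exact implementation may differ:

# Replacing NaN by User Average (row-wise mean; row.mean() skips NaN)
final_user = final.apply(lambda row: row.fillna(row.mean()), axis=1)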

Now, the next step is to calculate the similarity between the users.

import numpy as np

# User-user similarity on the matrix where NaN was replaced by the item (movie) average
cosine = cosine_similarity(final_movie)
np.fill_diagonal(cosine, 0)  # a user should not count as its own neighbor
similarity_with_movie = pd.DataFrame(cosine, index=final_movie.index)
similarity_with_movie.columns = final_movie.index
similarity_with_movie.head()
User User cosine similarity

Let’s check for ourselves whether what we calculated really makes sense! The helper get_user_similar_movies (sketched below) returns the movies that two given users have both rated.
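
A minimal sketch of such a helper, assuming the Rating_avg and movies frames built above (the rating_x_x / rating_x_y column names come from pandas suffixing when Rating_avg is merged with itself; the notebook’s exact implementation may differ):

def get_user_similar_movies(user1, user2):
    # Inner-join the two users' ratings on movieId, then attach movie titles
    common = Rating_avg[Rating_avg.userId == user1].merge(
        Rating_avg[Rating_avg.userId == user2], on="movieId", how="inner"
    )
    return common.merge(movies, on="movieId")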

a = get_user_similar_movies(370, 86309)
a = a.loc[:, ['rating_x_x', 'rating_x_y', 'title']]
a.head()
movies similar

From the above image we can see that the similarity we generated holds up, since the two given users (370 and 86309) have almost the same ratings and likings.

So we are done calculating the similarities between the users, but I am still not satisfied. I will discuss the reason in the next step.

Neighborhood for User (K)

From the above, we can see that we have calculated similarities for all pairs of users. But being a Big Data student, the complexity of the problem always drives me. Recommendation systems work with huge data, so it becomes very important to maintain and capture only the important and necessary highlights from the data.

To explain this using our movie recommendation example, the matrix we obtained above is 862 × 862, as there are 862 unique users in the data. This number is still small compared to the data a real system would be working on. Think about Amazon: it has millions of users in its database, so while calculating the score for any item it would not be a good solution to look at all the other users every time. To overcome this, we introduce the notion of a neighborhood: only the set of K most similar users for a particular user.

Now let’s take further steps to implement the idea. In our example I have taken the value of K as 30, so each user gets its 30 nearest neighbors.

I have used my custom function find_n_neighbours, which takes the similarity matrix and the value of n as input and returns the nearest n neighbors for each user. You can find the code in the notebook given at the end of the blog; a minimal sketch is shown below.
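
A sketch of what find_n_neighbours could look like, assuming the similarity_with_movie DataFrame from above (the notebook’s exact implementation may differ):

def find_n_neighbours(df, n):
    # For each user (row of the similarity matrix), return the ids
    # of the n most similar users as columns top1 ... topn
    return df.apply(
        lambda row: pd.Series(
            row.sort_values(ascending=False).iloc[:n].index,
            index=['top{}'.format(i) for i in range(1, n + 1)]
        ),
        axis=1
    )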

# top 30 neighbours for each user
sim_user_30_m = find_n_neighbours(similarity_with_movie,30)
sim_user_30_m.head()
top 30 neighbors for users

So if you think about it, we have seriously reduced the number of unnecessary computations. Now we are ready to calculate the score for an item.

Generating the final score S(u,i)

Wow! We are done with the process. I know we went through some real technical stuff, but the time we spent was worth it: we have reached the last step.

Here we will try to predict the score for a movie the given user has not seen, using the custom function User_item_score.
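
A minimal sketch of how User_item_score could implement the score formula above, assuming the Mean, Rating_avg, similarity_with_movie, and sim_user_30_m objects built earlier (the notebook’s exact implementation may differ):

def User_item_score(user, item):
    # Neighbors of `user` (top-30 most similar) who have rated `item`
    neighbours = sim_user_30_m.loc[user].values
    rated = Rating_avg[(Rating_avg.movieId == item) &
                       (Rating_avg.userId.isin(neighbours))]

    # Weighted average of the neighbors' mean-centered ratings,
    # added back to the active user's own average rating
    weights = similarity_with_movie.loc[user, rated.userId.values].values
    deviations = rated['adg_rating'].values
    user_avg = Mean.loc[Mean.userId == user, 'rating'].values[0]

    if weights.sum() == 0:
        return user_avg  # no informative neighbors: fall back to the user's average
    return user_avg + (weights * deviations).sum() / weights.sum()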

score = User_item_score(320,7371)
print(score)
predicted score

So our system predicted the score to be 4.25, which is really good. I think user (320) could like the movie with id (7371).

Now let’s give a final touch to our system. Most of us have used Netflix or Hotstar: when we open the app, it shows the items we may like.

I am always fascinated by these recommendations, as they turn out to be movies I like. This all happens through the recommendation system itself. We will now do the same and try to predict the top 5 movies the given user may like.

The logic is not that tough. We already generated the score for one item; similarly, we can generate the score for the other items for the same user.

But, again keeping Big Data in consideration, does it make sense to generate a score for all the other items?

I think NOOOOOO!!!

Let’s consider user U and user K. Suppose K is not in the neighborhood of U. Is it likely that user U would like a movie that K has seen and rated? Mostly no.

I think you got the trick. Yes, we are only interested in calculating the scores for items that the user’s neighbors have seen.

We are really successful. We reduced the computation from about 2,500 movies down to N, where N is the set of movies the neighborhood has rated, and N is far smaller than 2,500.

So finally, let’s recommend the movies.

User_item_score1 is my custom function which uses the above discussion to calculate the predictions; a minimal sketch follows.
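
A sketch, assuming the objects from the previous steps and the User_item_score sketch above (again, the notebook version may differ):

def User_item_score1(user, top_n=5):
    # Candidate items: movies the user's neighbors have rated
    # but the user has not seen yet
    neighbours = sim_user_30_m.loc[user].values
    seen = set(Rating_avg[Rating_avg.userId == user].movieId)
    candidates = set(Rating_avg[Rating_avg.userId.isin(neighbours)].movieId) - seen

    # Score every candidate and keep the top_n highest-scoring titles
    scores = {item: User_item_score(user, item) for item in candidates}
    best = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return movies[movies.movieId.isin(best)].title.values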

user = int(input("Enter the user id to whom you want to recommend : "))
predicted_movies = User_item_score1(user)
print(" ")
print("The Recommendations for User Id : ", user)
print(" ")
for i in predicted_movies:
    print(i)

At last!!!! We did it. We have our own recommendation system.

The representation of the code here might not be very easy to read, so please go to my GitHub repository to access the full code.

The required datasets can be obtained from here.

I hope you liked the blog. For any queries, you can mail me. Stay tuned for more blogs, since this is one of my areas of interest and I will definitely post some new stuff.

Thanks..!!

