USER-USER Collaborative filtering Recommender System in Python

Ankur Tomar
4 min read · Aug 25, 2017

INTRODUCTION

In the previous article, we looked at the content-based recommender system, which takes the user’s input and returns the items that match it most closely. Although it uses some context about the user, it is still largely a non-personalized recommender system. In this article, we will start building a system that uses the profile of a given user and bases its recommendations entirely on that user’s preferences and likes.

Collaborative filtering

Let’s start with what I would do for non-personalized collaborative filtering, where the goal is to predict the rating of an item i for a user u. Simply, I would take the average rating of item i: add up all the ratings given to item i and divide by the number of users U who rated it.
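As a minimal sketch of this baseline, the snippet below averages the ratings of one item over the users who rated it. The toy table and the helper name item_mean_rating are assumptions made for illustration; only the column names follow the ratings.csv file used later in the article.

```python
import pandas as pd

# Toy ratings table (hypothetical data); column names mirror ratings.csv
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 20, 10],
    "rating":  [4.0, 3.0, 5.0, 1.0, 3.0],
})

def item_mean_rating(ratings, movie_id):
    """Non-personalized prediction: the average of all ratings given to one item."""
    item_ratings = ratings.loc[ratings["movieId"] == movie_id, "rating"]
    return float(item_ratings.mean())

pred_10 = item_mean_rating(ratings, 10)  # (4 + 5 + 3) / 3
pred_20 = item_mean_rating(ratings, 20)  # (3 + 1) / 2
print(pred_10, pred_20)
```

Every user gets the same prediction for a given movie here, which is exactly why this baseline is non-personalized.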

Let’s move forward from this intuition, bring in the behaviour of other users, and give more weight to the ratings of those users who are like me. But how do we check how similar another user is to me?

To answer this, we will use Karl Pearson’s correlation to measure how similar two users are. It is usually calculated over the items that both users have rated in the past. But there is a problem with this approach: when the number of common ratings is small, the similarity value is unreliable. Two users may have only 2 ratings in common and still have a correlation that is very high, even very close to 1.
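The small-overlap problem is easy to reproduce. The sketch below computes Pearson’s correlation over co-rated items only; the function name pearson_on_common and the two toy users are assumptions for illustration, not part of the article’s code.

```python
import numpy as np
import pandas as pd

def pearson_on_common(ratings_u, ratings_v):
    """Pearson correlation computed only over items both users rated.

    ratings_u, ratings_v: pandas Series of ratings indexed by movieId.
    """
    common = ratings_u.index.intersection(ratings_v.index)
    if len(common) < 2:
        return float("nan")  # undefined with fewer than two shared items
    u = ratings_u.loc[common].to_numpy(dtype=float)
    v = ratings_v.loc[common].to_numpy(dtype=float)
    u_c = u - u.mean()
    v_c = v - v.mean()
    denom = np.sqrt((u_c ** 2).sum() * (v_c ** 2).sum())
    if denom == 0:
        return float("nan")  # one user gave identical ratings to all shared items
    return float((u_c * v_c).sum() / denom)

# Two users with quite different histories but only two shared movies (10 and 20)
user_a = pd.Series({10: 4.0, 20: 5.0, 30: 1.0, 40: 2.0, 50: 5.0})
user_b = pd.Series({10: 2.0, 20: 3.0, 60: 5.0, 70: 1.0, 80: 4.0})

sim = pearson_on_common(user_a, user_b)
print(sim)
```

With exactly two shared items, the correlation can only be +1 or -1 whenever it is defined, so the two users above look perfectly similar on the strength of two ratings.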

To correct for this, we weight the similarity. One way to do this is to calculate the numerator over the common ratings only but calculate the denominator over all the ratings of both users. This makes it resemble the cosine similarity function that we discussed in the last article.
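A minimal sketch of this weighting, assuming each user’s ratings are centred on that user’s own mean (as the code later in the article does). The name damped_similarity and the toy users are hypothetical.

```python
import numpy as np
import pandas as pd

def damped_similarity(ratings_u, ratings_v):
    """Numerator over co-rated items only, denominator over all ratings of both users.

    ratings_u, ratings_v: pandas Series of ratings indexed by movieId.
    """
    u_c = ratings_u - ratings_u.mean()  # mean-centred ratings of user u
    v_c = ratings_v - ratings_v.mean()  # mean-centred ratings of user v
    common = u_c.index.intersection(v_c.index)
    numerator = (u_c.loc[common] * v_c.loc[common]).sum()
    denominator = np.sqrt((u_c ** 2).sum()) * np.sqrt((v_c ** 2).sum())
    if denominator == 0:
        return 0.0  # a user who rates everything identically carries no signal
    return float(numerator / denominator)

# Same situation as before: long histories, only movies 10 and 20 in common
user_a = pd.Series({10: 4.0, 20: 5.0, 30: 1.0, 40: 2.0, 50: 5.0})
user_b = pd.Series({10: 2.0, 20: 3.0, 60: 5.0, 70: 1.0, 80: 4.0})

sim_damped = damped_similarity(user_a, user_b)
print(sim_damped)
```

Because the norms in the denominator grow with every rating a user has given while the numerator only grows with shared items, a pair with little overlap ends up with a similarity close to 0 instead of close to 1.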

Another problem with this approach to predicting the rating of a new item is that not all users are equally optimistic when rating an item. One user may rate an average movie as 2 while another user rates an average movie as 4, so the first user’s 2 is equivalent to the second user’s 4. To account for this inconsistency, we calculate the mean of each user’s ratings and then subtract this mean from each of the ratings provided by that user. This tells us how far above or below their own average a user rated the movie. After incorporating this, the final rating formula looks like this:
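Reconstructed from the prediction code further down, and written in notation that is an assumption here: p(u,i) is the predicted rating of item i for user u, r̄ is a user’s mean rating, w(u,v) is the similarity between users u and v, and N is the set of selected neighbours who rated item i.

```latex
p_{u,i} = \bar{r}_u + \frac{\sum_{v \in N} w_{uv}\,\left(r_{v,i} - \bar{r}_v\right)}{\sum_{v \in N} \left|w_{uv}\right|}
```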

And the Pearson’s correlation looks like this:
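Reconstructed from the similarity code further down, with assumed notation: I_u and I_v are the sets of items rated by users u and v, and r̄ is a user’s mean rating. The numerator runs over the co-rated items only, while each norm in the denominator runs over all of that user’s ratings, which is the weighted variant described above.

```latex
w_{uv} = \frac{\sum_{i \in I_u \cap I_v} \left(r_{u,i} - \bar{r}_u\right)\left(r_{v,i} - \bar{r}_v\right)}{\sqrt{\sum_{i \in I_u} \left(r_{u,i} - \bar{r}_u\right)^2}\;\sqrt{\sum_{i \in I_v} \left(r_{v,i} - \bar{r}_v\right)^2}}
```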

With the above understanding, let’s get to the coding part.

The formatting of the code below might not be very easy to read, so please go to my GitHub repository to access all the code for the Recommender Systems in this series.

Again, let’s start by firing up the required libraries and importing the datasets.

import pandas as pd
import numpy as np
import math

# Raw strings keep the backslashes in the Windows paths from being read as escape sequences
Ratings = pd.read_csv(r"C:\Users\Ankur.Tomar\Desktop\Courses\Recommender\User-User\uu-assignment\data\ratings.csv")
Movies = pd.read_csv(r"C:\Users\Ankur.Tomar\Desktop\Courses\Recommender\User-User\uu-assignment\data\movies.csv")
Tags = pd.read_csv(r"C:\Users\Ankur.Tomar\Desktop\Courses\Recommender\User-User\uu-assignment\data\tags.csv")

Calculating each user’s mean rating and subtracting it from each of that user’s ratings to get the adjusted rating.

# Mean rating per user
Mean = Ratings.groupby(['userId'], as_index=False, sort=False)['rating'].mean().rename(columns={'rating': 'rating_mean'})[['userId', 'rating_mean']]
Ratings = pd.merge(Ratings, Mean, on='userId', how='left', sort=False)
# How far above or below the user's own average each rating is
Ratings['rating_adjusted'] = Ratings['rating'] - Ratings['rating_mean']
Ratings

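Finding the top 30 most similar user profiles for a user. The code below does this for user 320 and, for each movie, searches only among the users who rated that movie.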

distinct_users = np.unique(Ratings['userId'])

user_data_append = pd.DataFrame()

# Target user: profile, mean rating, and norm of the mean-centred rating vector
user1_data = Ratings[Ratings['userId'] == 320]
user1_mean = user1_data['rating'].mean()
user1_data = user1_data.rename(columns={'rating_adjusted': 'rating_adjusted1'})
user1_data = user1_data.rename(columns={'userId': 'userId1'})
user1_val = np.sqrt(np.sum(np.square(user1_data['rating_adjusted1']), axis=0))

distinct_movie = np.unique(Ratings['movieId'])

i = 1

# Only one movie is processed here; widen the slice to cover more movies
for movie in distinct_movie[604:605]:

    item_user = Ratings[Ratings['movieId'] == movie]
    distinct_users1 = np.unique(item_user['userId'])

    # Start from an empty frame for each movie so neighbours do not carry over
    user_data_all = pd.DataFrame()

    j = 1

    for user2 in distinct_users1:

        if j % 200 == 0:
            print(j, "out of", len(distinct_users1), i, "out of", len(distinct_movie))

        user2_data = Ratings[Ratings['userId'] == user2]
        user2_data = user2_data.rename(columns={'rating_adjusted': 'rating_adjusted2'})
        user2_data = user2_data.rename(columns={'userId': 'userId2'})
        user2_val = np.sqrt(np.sum(np.square(user2_data['rating_adjusted2']), axis=0))

        # Numerator uses co-rated movies only; the norms use every rating of each user
        user_data = pd.merge(user1_data, user2_data[['rating_adjusted2', 'movieId', 'userId2']], on='movieId', how='inner', sort=False)
        user_data['vector_product'] = (user_data['rating_adjusted1'] * user_data['rating_adjusted2'])

        user_data = user_data.groupby(['userId1', 'userId2'], as_index=False, sort=False).sum()

        user_data['dot'] = user_data['vector_product'] / (user1_val * user2_val)

        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        user_data_all = pd.concat([user_data_all, user_data], ignore_index=True)

        j = j + 1

    # A similarity of 1 is the target user matched with themselves, so drop it
    user_data_all = user_data_all[user_data_all['dot'] < 1]
    user_data_all = user_data_all.sort_values(['dot'], ascending=False)
    user_data_all = user_data_all.head(30)
    user_data_all['movieId'] = movie
    user_data_append = pd.concat([user_data_append, user_data_all], ignore_index=True)
    i = i + 1

Calculating the predicted rating for each item, ignoring the item if it has fewer than 2 similar neighbours.

User_dot_adj_rating_all = pd.DataFrame()

distinct_movies = np.unique(Ratings['movieId'])

j = 1
for movie in distinct_movies:

    # Neighbours found for this movie, joined to all of their ratings
    user_data_append_movie = user_data_append[user_data_append['movieId'] == movie]
    User_dot_adj_rating = pd.merge(Ratings, user_data_append_movie[['dot', 'userId2', 'userId1']], how='inner', left_on='userId', right_on='userId2', sort=False)

    if j % 200 == 0:
        print(j, "out of", len(distinct_movies))

    # .copy() so the column assignments below do not act on a view of the merged frame
    User_dot_adj_rating1 = User_dot_adj_rating[User_dot_adj_rating['movieId'] == movie].copy()

    # Predict only when at least 2 neighbours rated the movie
    if len(np.unique(User_dot_adj_rating1['userId'])) >= 2:

        User_dot_adj_rating1['wt_rating'] = User_dot_adj_rating1['dot'] * User_dot_adj_rating1['rating_adjusted']

        User_dot_adj_rating1['dot_abs'] = User_dot_adj_rating1['dot'].abs()
        User_dot_adj_rating1 = User_dot_adj_rating1.groupby(['userId1'], as_index=False, sort=False)[['wt_rating', 'dot_abs']].sum()
        # Weighted average of the neighbours' deviations, shifted by the target user's mean
        User_dot_adj_rating1['Rating'] = (User_dot_adj_rating1['wt_rating'] / User_dot_adj_rating1['dot_abs']) + user1_mean
        User_dot_adj_rating1['movieId'] = movie
        User_dot_adj_rating1 = User_dot_adj_rating1.drop(['wt_rating', 'dot_abs'], axis=1)

        User_dot_adj_rating_all = pd.concat([User_dot_adj_rating_all, User_dot_adj_rating1], ignore_index=True)

    j = j + 1

# DataFrame.sort was removed from pandas; sort_values is its replacement
User_dot_adj_rating_all = User_dot_adj_rating_all.sort_values(['Rating'], ascending=False)
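To make the prediction step easier to check by hand, here is a small sketch on made-up numbers. The helper name predict_rating and the two neighbours are assumptions for illustration; the column names dot and rating_adjusted follow the code above.

```python
import pandas as pd

def predict_rating(target_mean, neighbours, min_neighbours=2):
    """Mean-centred weighted average of neighbour ratings for one movie.

    neighbours: DataFrame with columns 'dot' (similarity to the target user)
    and 'rating_adjusted' (the neighbour's rating minus the neighbour's own mean).
    Returns None when there are too few neighbours to make a prediction.
    """
    if len(neighbours) < min_neighbours:
        return None
    weighted_sum = (neighbours["dot"] * neighbours["rating_adjusted"]).sum()
    weight_total = neighbours["dot"].abs().sum()
    if weight_total == 0:
        return None
    return float(target_mean + weighted_sum / weight_total)

# Two hypothetical neighbours: one rated the movie 1 above their own average,
# the other 1 below, and the first neighbour is twice as similar to the target user
neighbours = pd.DataFrame({
    "dot": [0.5, 0.25],
    "rating_adjusted": [1.0, -1.0],
})

prediction = predict_rating(3.5, neighbours)  # 3.5 + (0.5 - 0.25) / 0.75
too_few = predict_rating(3.5, neighbours.head(1))
print(prediction, too_few)
```

The more similar neighbour pulls the prediction above the target user’s mean of 3.5, and a single neighbour is not enough to produce a prediction at all.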

With this, we are done building a recommender system based on user similarities. For a more in-depth understanding, please go through the University of Minnesota’s Recommender Systems specialisation courses on Coursera.

In the next article, we will look at another form of collaborative filtering: the ITEM-ITEM collaborative filtering based recommender system.

Please go to my GitHub repository to access all the code for the Recommender Systems in this series.

Thanks!

Ankur Tomar

Data Science | Machine Learning | EXL Services | MS Business Analytics @ University of Minnesota