ITEM-ITEM Collaborative filtering Recommender System in Python

Ankur Tomar
4 min read · Oct 22, 2017


INTRODUCTION

In the previous article, we learned about one method of collaborative filtering, user-based collaborative filtering, which analyses users’ behaviour and predicts what a user will like based on their similarity to other users. But this method suffers from two major problems:

1. Data sparsity: when the number of items is large, the items a user has rated shrink to a tiny percentage of the total, which makes the correlation coefficient less reliable.

2. User profiles change quickly, and the entire model has to be recomputed each time, which is expensive in both time and computation.

To address these issues, we will use ITEM-ITEM collaborative filtering.

ITEM-ITEM collaborative filtering

ITEM-ITEM collaborative filtering looks for items that are similar to the items the user has already rated and recommends the most similar ones. But what do we mean by item-item similarity? We don’t mean that two items share an attribute, the way a fountain pen and a Pilot pen are similar because both are pens. Instead, similarity here means how alike people treat the two items in terms of likes and dislikes.

This method is more stable than user-based collaborative filtering because the average item has many more ratings than the average user, so an individual rating has less impact.

To predict a rating for a target item i, we look at the set of items the target user has rated, compute how similar each of them is to i, and select the k most similar items. The similarity between two items is calculated from the ratings of the users who have rated both items, using cosine similarity on the mean-adjusted ratings.
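As a sketch, the similarity computed by the code later in this article can be written as follows. The notation is my own assumption: U_i is the set of users who rated item i, r_{u,i} is user u’s rating of item i, and \bar{r}_i is the mean rating of item i.

```latex
\[
\mathrm{sim}(i, j) =
\frac{\sum_{u \in U_i \cap U_j} (r_{u,i} - \bar{r}_i)\,(r_{u,j} - \bar{r}_j)}
     {\sqrt{\sum_{u \in U_i} (r_{u,i} - \bar{r}_i)^2}\;
      \sqrt{\sum_{u \in U_j} (r_{u,j} - \bar{r}_j)^2}}
\]
```

Note that in the code the numerator runs over users who rated both items, while each norm in the denominator runs over all ratings of that item.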

Once we have the similarities between items, the prediction is computed as a weighted average of the target user’s ratings on these similar items. The formula is very similar to the one used in user-based collaborative filtering, except that the weights are similarities between items instead of between users, and we use the current user’s ratings for other items instead of other users’ ratings for the current item.
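The weighted average can be sketched in plain Python. The function name `predict_rating` and the toy numbers are illustrative assumptions. The formula mirrors the code later in this article: the item’s mean rating plus the similarity-weighted sum of the user’s adjusted ratings, divided by the sum of absolute similarities.

```python
def predict_rating(item_mean, neighbours):
    """Predict a rating for one target item (illustrative sketch).

    item_mean  -- mean rating of the target item
    neighbours -- list of (similarity, adjusted_rating) pairs, one per similar
                  item the user has rated; adjusted_rating is the user's
                  rating minus that item's mean rating
    """
    weighted_sum = sum(sim * adj for sim, adj in neighbours)
    weight_total = sum(abs(sim) for sim, _ in neighbours)
    if weight_total == 0:
        # No usable neighbours: fall back to the item's mean rating
        return item_mean
    return item_mean + weighted_sum / weight_total


# Toy example with made-up numbers: two neighbours of the target item
neighbours = [(0.5, 1.0), (0.25, -1.0)]
prediction = predict_rating(3.5, neighbours)
print(round(prediction, 3))
```

Dividing by the sum of absolute similarities keeps the adjustment within the range of the user’s adjusted ratings even when some similarities are negative.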

With that let’s start with the coding part.

The code below may be hard to read in this format, so please go to my GitHub repository to access all the code for this Recommender Systems series.

Again, let’s start by firing up the required libraries and importing the datasets.

import pandas as pd
import numpy as np
import math

# Raw strings keep the backslashes in the Windows paths from being read as escape sequences
Ratings = pd.read_csv(r"C:\Users\Ankur.Tomar\Desktop\Courses\Recommender\Item-Item\ii-assignment\data\ratings.csv")
Movies = pd.read_csv(r"C:\Users\Ankur.Tomar\Desktop\Courses\Recommender\Item-Item\ii-assignment\data\movies.csv")
Tags = pd.read_csv(r"C:\Users\Ankur.Tomar\Desktop\Courses\Recommender\Item-Item\ii-assignment\data\tags.csv")

Next, calculate the mean rating of each movie and subtract it from every rating of that movie to get the adjusted rating.

# Mean rating per movie
Mean = Ratings.groupby('movieId', as_index=False, sort=False)['rating'].mean().rename(columns={'rating': 'rating_mean'})
Ratings = pd.merge(Ratings, Mean, on='movieId', how='left', sort=False)
# Adjusted rating = rating minus the mean rating of that movie
Ratings['rating_adjusted'] = Ratings['rating'] - Ratings['rating_mean']
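As a small worked example of this adjustment, here is the same computation in plain Python on a made-up set of ratings. The rows and variable names are illustrative assumptions, not the actual dataset.

```python
from collections import defaultdict

# Made-up (userId, movieId, rating) rows
rows = [(1, 10, 4.0), (2, 10, 2.0), (1, 20, 5.0), (3, 20, 3.0), (2, 20, 4.0)]

# Mean rating per movie
totals = defaultdict(float)
counts = defaultdict(int)
for _, movie_id, rating in rows:
    totals[movie_id] += rating
    counts[movie_id] += 1
movie_mean = {m: totals[m] / counts[m] for m in totals}

# Adjusted rating = rating minus the mean rating of that movie
adjusted = [(u, m, r - movie_mean[m]) for u, m, r in rows]

print(movie_mean)
print(adjusted)
```

After the adjustment, a positive value means the user rated the movie above its average and a negative value means below.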

Next, calculate the similarity between each movie the user has not rated and the movies the user has rated, and select the 20 most similar movies. Please note that for testing purposes I have calculated the similarity values for only one user (userId 320). Add one more loop to calculate them for all users.

# Candidate movies: every movie rated by users other than the target user (userId 320)
user_data = Ratings[Ratings['userId'] != 320]
distinct_movies = np.unique(user_data['movieId'])

# Movies the target user has rated (the same for every candidate, so computed once)
user_data1 = Ratings[Ratings['userId'] == 320]
distinct_movies1 = np.unique(user_data1['movieId'])

top_neighbours = []  # one DataFrame of neighbours per candidate movie

for i, movie in enumerate(distinct_movies, start=1):

    if i % 10 == 0:
        print(i, "out of", len(distinct_movies))

    movie_data = Ratings[Ratings['movieId'] == movie]
    movie_data = movie_data[['userId', 'movieId', 'rating_adjusted']].drop_duplicates()
    movie_data = movie_data.rename(columns={'rating_adjusted': 'rating_adjusted1', 'movieId': 'movieId1'})
    # Norm of the candidate movie's adjusted-rating vector
    movie1_val = np.sqrt(np.sum(np.square(movie_data['rating_adjusted1'])))

    pair_similarities = []
    for movie1 in distinct_movies1:

        movie_data1 = Ratings[Ratings['movieId'] == movie1]
        movie_data1 = movie_data1[['userId', 'movieId', 'rating_adjusted']].drop_duplicates()
        movie_data1 = movie_data1.rename(columns={'rating_adjusted': 'rating_adjusted2', 'movieId': 'movieId2'})
        # Norm of the rated movie's adjusted-rating vector
        movie2_val = np.sqrt(np.sum(np.square(movie_data1['rating_adjusted2'])))

        # Keep only users who rated both movies
        movie_data_merge = pd.merge(movie_data, movie_data1[['userId', 'movieId2', 'rating_adjusted2']], on='userId', how='inner', sort=False)
        movie_data_merge['vector_product'] = movie_data_merge['rating_adjusted1'] * movie_data_merge['rating_adjusted2']
        movie_data_merge = movie_data_merge.groupby(['movieId1', 'movieId2'], as_index=False, sort=False).sum()

        # Cosine similarity: dot product divided by the product of the norms
        movie_data_merge['dot'] = movie_data_merge['vector_product'] / (movie1_val * movie2_val)
        pair_similarities.append(movie_data_merge)

    movie_data_all = pd.concat(pair_similarities, ignore_index=True)
    # Drop similarities of 1 or more (for example, a movie compared with itself)
    movie_data_all = movie_data_all[movie_data_all['dot'] < 1]
    # Keep the 20 most similar rated movies
    movie_data_all = movie_data_all.sort_values('dot', ascending=False).head(20)
    top_neighbours.append(movie_data_all)

movie_data_all_append = pd.concat(top_neighbours, ignore_index=True)
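To see the logic of the loop above without pandas, the pairwise step can be sketched with dictionaries. The helper names `adjusted_cosine` and `top_k_similar` and the toy data are assumptions made for illustration. As in the loop, the dot product uses only users who rated both movies, while each norm uses all of a movie’s adjusted ratings. This sketch leaves out the loop’s `dot < 1` filter.

```python
import heapq
import math


def adjusted_cosine(ratings_a, ratings_b):
    """Similarity of two movies given {userId: adjusted_rating} dicts."""
    common = ratings_a.keys() & ratings_b.keys()
    dot = sum(ratings_a[u] * ratings_b[u] for u in common)
    norm_a = math.sqrt(sum(v * v for v in ratings_a.values()))
    norm_b = math.sqrt(sum(v * v for v in ratings_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # undefined similarity is treated as 0 in this sketch
    return dot / (norm_a * norm_b)


def top_k_similar(target, rated, k=20):
    """Return up to k (similarity, movieId) pairs, most similar first.

    target -- {userId: adjusted_rating} for the candidate movie
    rated  -- {movieId: {userId: adjusted_rating}} for movies the user rated
    """
    scored = ((adjusted_cosine(target, vec), movie_id) for movie_id, vec in rated.items())
    return heapq.nlargest(k, scored)


# Toy data: adjusted ratings keyed by userId
target = {1: 1.0, 2: -1.0}
rated = {
    10: {1: 1.0, 2: -1.0},  # same pattern as the target
    20: {1: -1.0, 2: 1.0},  # opposite pattern
    30: {3: 0.5, 4: -0.5},  # no users in common with the target
}
result = top_k_similar(target, rated, k=2)
print(result)
```

In this toy data, movie 10 ranks first, and movie 30, which shares no raters with the target, gets a similarity of 0 and ranks above the negatively correlated movie 20.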

Finally, calculate the predicted rating for the movies:

predictions = []

# For testing, only one candidate movie is scored here; loop over distinct_movies to score all of them
for movie in distinct_movies[313:314]:
    # Neighbours selected for this movie in the previous step
    movie_nbr = movie_data_all_append[movie_data_all_append['movieId1'] == movie]
    # Mean rating of the candidate movie, added back after the weighted average
    mean = Ratings.loc[Ratings['movieId'] == movie, 'rating'].mean()
    # Attach the target user's adjusted ratings to the neighbouring movies
    movie_nbr_dot = pd.merge(user_data1, movie_nbr[['dot', 'movieId2', 'movieId1']], how='inner', left_on='movieId', right_on='movieId2', sort=False)
    movie_nbr_dot['wt_rating'] = movie_nbr_dot['dot'] * movie_nbr_dot['rating_adjusted']
    movie_nbr_dot['dot_abs'] = movie_nbr_dot['dot'].abs()
    movie_nbr_dot = movie_nbr_dot.groupby(['movieId1'], as_index=False, sort=False).sum()[['movieId1', 'wt_rating', 'dot_abs']]
    # Weighted average of adjusted ratings plus the movie's mean rating
    movie_nbr_dot['Rating'] = (movie_nbr_dot['wt_rating'] / movie_nbr_dot['dot_abs']) + mean
    predictions.append(movie_nbr_dot)

movie_rating_all = pd.concat(predictions, ignore_index=True)
movie_rating_all = movie_rating_all.sort_values('Rating', ascending=False)

With this, we have built a recommender system based on item similarities, which addresses the shortcomings of the user-based recommender system.

For a more in-depth understanding, please go through the University of Minnesota’s Recommender Systems specialisation on Coursera.

Thanks!


Ankur Tomar

Data Science | Machine Learning | EXL Services | MS Business Analytics @ University of Minnesota