Recommendation System using collaborative filtering in Python

Saket Garodia
Published in Analytics Vidhya · 6 min read · Jan 2, 2020

This blog illustrates a Collaborative-Filtering based recommender system in Python.

Before starting with the implementation of a Collaborative-Filtering based recommender system in Python, I recommend giving a short 4-minute read to this blog, which defines a recommender system and its types in layman's terms.

https://medium.com/@saketgarodia/the-world-of-recommender-systems-e4ea504341ac?source=friends_link&sk=508a980d8391daa93530a32e9c927a87

Through this blog, I will show how to implement a Collaborative-Filtering based recommender system in Python on Kaggle’s MovieLens 100k dataset.

Let us start implementing it.

Problem formulation

To build a recommender system that recommends movies using Collaborative-Filtering techniques, i.e. by leveraging the ratings of other, similar users.

Implementation

First, let us import all the necessary libraries that we will be using to make a collaborative-filtering based recommender system. Let us also import the necessary data files.

We will use the Surprise package, which has built-in models like SVD, KNN, etc. for collaborative filtering.

# importing necessary libraries
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from surprise import Reader, Dataset, KNNBasic
from surprise.model_selection import cross_validate
from surprise import SVD

# loading the ratings file
r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv('u.data', sep='\t', names=r_cols, encoding='latin-1')
ratings.head()

# loading the movies file
i_cols = ['movie_id', 'title', 'release date', 'video release date', 'IMDb URL',
          'unknown', 'Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy',
          'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
          'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
movies = pd.read_csv('u.item', sep='|', names=i_cols, encoding='latin-1')
movies.head()

# loading the users file
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('u.user', sep='|', names=u_cols, encoding='latin-1')
users.head()

So, we have 1682 unique movies and 100,000 total ratings for these unique movies by 943 users.
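These counts are easy to verify directly from the ratings dataframe:

# Sanity check: number of ratings, unique movies and unique users
print(len(ratings))                   # 100000
print(ratings['movie_id'].nunique())  # 1682
print(ratings['user_id'].nunique())   # 943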

Now, we need to split our ‘ratings’ data frame into two parts — part 1 to train the algorithm to predict ratings and part 2 to test whether the rating predicted is close to what was expected. This will help in evaluating our models.

We will take y as 'user_id' just to ensure that the split is stratified, so that every user_id appears in the training set and our algorithm has some ratings from each user to learn from.

# Assign X as the original ratings dataframe and y as the user_id column of ratings
X = ratings.copy()
y = ratings['user_id']

# Split into training and test datasets, stratified along user_id
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)

We have 75k ratings in the training set and 25k in the test set to evaluate our models.
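If you want to double-check the split sizes, a one-line sanity check:

# Confirm the 75/25 split of the 100k ratings
print(X_train.shape[0], X_test.shape[0])  # 75000 25000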

# pivot the training ratings into a user x movie matrix
df_ratings = X_train.pivot(index='user_id', columns='movie_id', values='rating')

Now, our df_ratings dataframe is indexed by user_ids, with movie_ids as columns and ratings as values. Most of the values are NaN, since each user watches and rates only a few movies. It's a sparse dataframe.

Here’s how our sparse rating data frame looks:
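Since the original screenshot of the matrix isn't reproduced here, a quick way to peek at it and at its sparsity (the ~95% figure below is what I'd expect for this dataset, not a number from the original post):

# Peek at the sparse user-movie matrix and measure its sparsity
print(df_ratings.iloc[:5, :5])              # first few users and movies, mostly NaN
sparsity = df_ratings.isnull().values.mean()
print(f'{sparsity:.1%} of entries are missing')  # roughly 95% NaN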

Now, we will use 2 different methods for collaborative filtering. In the first method, we will use the weighted average of the ratings, and we will implement the second method using model-based approaches like KNN (K-nearest neighbors) and SVD (Singular Value Decomposition). We will talk about KNN and SVD later.

In the 1st method, we will use the weighted average of the ratings, using cosine similarity as the weights. The users who are more similar to the input_user will have a higher weight in our rating computation for the input_user.
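In other words, the predicted rating for (user, movie) is sum(sim(user, v) * rating(v, movie)) / sum(sim(user, v)), taken over all users v who rated the movie. A toy example with made-up numbers:

import numpy as np

# Toy example: three users rated the movie 4, 2 and 5,
# with cosine similarities 0.9, 0.1 and 0.5 to the input user
ratings_scores = np.array([4, 2, 5])
cosine_scores = np.array([0.9, 0.1, 0.5])
predicted = np.dot(ratings_scores, cosine_scores) / cosine_scores.sum()
print(predicted)  # (3.6 + 0.2 + 2.5) / 1.5 = 4.2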

Let's first replace the NULL values with 0s, since cosine_similarity doesn't work with NA values, and then proceed to build the recommender function using the weighted average of ratings.

Method 1: Weighted Average approach

df_ratings_dummy = df_ratings.copy().fillna(0)
df_ratings_dummy.head()

# cosine similarity of the ratings
similarity_matrix = cosine_similarity(df_ratings_dummy, df_ratings_dummy)
similarity_matrix_df = pd.DataFrame(similarity_matrix, index=df_ratings.index, columns=df_ratings.index)

# function to calculate a rating as the weighted mean of cosine similarities
def calculate_ratings(id_movie, id_user):
    if id_movie in df_ratings:
        # similarity of id_user with every other user
        cosine_scores = similarity_matrix_df[id_user]
        # ratings of every other user for the movie id_movie
        ratings_scores = df_ratings[id_movie]
        # we won't consider users who haven't rated id_movie, so drop the
        # similarity scores and ratings corresponding to np.nan
        index_not_rated = ratings_scores[ratings_scores.isnull()].index
        ratings_scores = ratings_scores.dropna()
        cosine_scores = cosine_scores.drop(index_not_rated)
        # calculate the rating as the weighted mean of the ratings of the users
        # who have rated the movie, with cosine scores as the weights
        ratings_movie = np.dot(ratings_scores, cosine_scores) / cosine_scores.sum()
    else:
        return 2.5
    return ratings_movie

Now that we have written a function to calculate the rating given a user and a movie, let's see how it performs on a test set.

calculate_ratings(3, 150)  # predicts the rating for user_id 150 and movie_id 3

2.9926409218795715

Let's build a function score_on_test_set that evaluates our model on the test set using the root mean squared error (RMSE).

# evaluates the model on the test set
def score_on_test_set():
    user_movie_pairs = zip(X_test['movie_id'], X_test['user_id'])
    predicted_ratings = np.array([calculate_ratings(movie, user) for (movie, user) in user_movie_pairs])
    true_ratings = np.array(X_test['rating'])
    score = np.sqrt(mean_squared_error(true_ratings, predicted_ratings))
    return score

test_set_score = score_on_test_set()
print(test_set_score)

The root mean squared error on the test set is 1.0172.

The test set's root mean squared error is about 1.01, which is quite good for such a simple approach. This means our algorithm did really well at predicting ratings for unseen user-movie pairs using a weighted average of ratings. Let's now use the model-based approaches and see how far we can improve the root mean squared error.

Method 2: Model-based approaches

In the model-based approach, we will use 2 models: KNN and SVD. The Surprise package has built-in implementations of different models for building recommender systems, and we will use those.

In the KNN-based approach, the prediction for the input_user is made by finding the set of users most similar to them and then averaging those users' ratings. KNN (k-nearest neighbors) is a well-known neighborhood-based algorithm.
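For reference, here is how the neighborhood behavior of Surprise's KNNBasic can be configured. The settings below (user-based cosine similarity, at most 40 neighbors) are shown for illustration, not taken from the original post, which uses the defaults:

from surprise import KNNBasic

# Illustrative configuration: user-based cosine similarity, at most 40 neighbours
sim_options = {'name': 'cosine', 'user_based': True}
knn = KNNBasic(k=40, sim_options=sim_options)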

In the SVD (Singular Value Decomposition) method, the sparse user-movie (ratings) matrix is compressed into dense factors by applying matrix factorization techniques. If M is a user × movie matrix, SVD decomposes it into 3 parts: M = UZVᵀ, where U is the user-concept matrix, Z is a diagonal matrix holding the weights (strengths) of the different concepts, and Vᵀ is the concept-movie matrix. A 'concept' can be understood intuitively as a group of similar movies, e.g. the 'suspense thriller' genre can be a concept.

Once SVD decomposes the original matrix into these three factors, the reconstructed dense matrix is used directly to predict the rating for a (user, movie) pair via the concepts to which the input_movie belongs.
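As a minimal illustration of the decomposition itself, here is an exact SVD on a toy dense matrix with numpy. (Note that Surprise's SVD model actually learns the factors by gradient descent on the observed ratings rather than computing an exact SVD, so this is just to build intuition.)

import numpy as np

# Toy 4-user x 3-movie rating matrix
M = np.array([[5, 4, 1],
              [4, 5, 2],
              [1, 2, 5],
              [2, 1, 4]], dtype=float)

U, Z, Vt = np.linalg.svd(M, full_matrices=False)
# U: user-concept matrix, Z: concept strengths, Vt: concept-movie matrix
M_reconstructed = U @ np.diag(Z) @ Vt
print(np.allclose(M, M_reconstructed))  # True: M = U Z V^T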

# installing the surprise library
!pip install surprise

# Define a Reader object
# The Reader object helps in parsing the file or dataframe containing ratings
ratings = ratings.drop(columns='timestamp')
reader = Reader()

# dataset creation
data = Dataset.load_from_df(ratings, reader)

# model
knn = KNNBasic()

# Evaluating the performance in terms of RMSE and MAE
cross_validate(knn, data, measures=['RMSE', 'MAE'], cv=3)

We can see that the root mean squared error in the case of KNN has reduced further to 0.98, compared to 1.01 for the weighted mean approach. KNN is clearly performing better than the weighted mean approach at predicting movie ratings.

Now, let's see how SVD performs.

# Define the SVD algorithm object
svd = SVD()

# Evaluate the performance in terms of RMSE
cross_validate(svd, data, measures=['RMSE'], cv=3)

The error has reduced even further, to an RMSE of 0.948, the best result among the three approaches we used.

# train on the full dataset
trainset = data.build_full_trainset()
svd.fit(trainset)

# look up user 1's actual ratings to compare with the prediction below
ratings[ratings['user_id'] == 1]
svd.predict(1, 110)

The SVD model's prediction for user_id 1 and movie_id 110 is 2.14, while the actual rating was 2, which is impressively close.
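As a final touch, here is one way (my own sketch, not from the original post) to turn the fitted SVD model into actual movie recommendations, by ranking a user's unseen movies by predicted rating:

def top_n_recommendations(user_id, n=10):
    # movies the user has already rated, to be excluded from recommendations
    seen = set(ratings.loc[ratings['user_id'] == user_id, 'movie_id'])
    unseen = [m for m in ratings['movie_id'].unique() if m not in seen]
    # predict a rating for every unseen movie and keep the n highest
    preds = [(m, svd.predict(user_id, m).est) for m in unseen]
    preds.sort(key=lambda x: x[1], reverse=True)
    return preds[:n]

top_n_recommendations(1)  # ten (movie_id, predicted_rating) pairs for user_id 1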

To learn about the Content-based and Metadata-based approaches, go through my following blogs:

  1. Content-based Recommender Systems: https://medium.com/@saketgarodia/content-based-recommender-systems-in-python-2b330e01eb80?
  2. Metadata-based Recommender Systems: https://medium.com/@saketgarodia/metadata-based-recommender-systems-in-python-c6aae213b25c

Thank you.


Senior Data Scientist at 84.51(Kroger), AI/Data Science, Psychology, economics, books; Linkedin — https://www.linkedin.com/in/saket-garodia/