The easy guide to building a Python collaborative filtering recommendation system in 2017

One of the best-known and most powerful recommendation approaches is collaborative filtering (CF). CF predicts how a user will rate an item based on the history of ratings that the user gave and the history of ratings given to that item.

We will use a library called Surprise to build a simple collaborative filtering recommendation system on the MovieLens 100K Dataset.

Preparing the data

Download the dataset, which includes 100,000 ratings from 943 users on 1,682 movies, and place it in your current working directory.

Note: Surprise has this dataset built in, but we download it from scratch so you can easily adapt the tutorial to your own dataset.

We use zipfile to extract the contents of the archive into a folder named ml-100k in the current directory. You can skip this step by unzipping the archive with your preferred method.

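import zipfile

# Use a separate name for the archive so the zipfile module is not shadowed
with zipfile.ZipFile('ml-100k.zip', 'r') as archive:
    # Extracts into ./ml-100k; the archive is closed automatically
    archive.extractall()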

In the unzipped folder we are interested in the u.data file, which contains all the user-item ratings. Each line represents one rating given by a user to an item, along with the time the rating was made. The format of each line is userID itemID rating timestamp, with the fields separated by a tab character \t. You can format your own data in the same manner, with or without a timestamp, and it will be ready to use in Surprise. For illustration, the first 5 entries of our dataset are as follows:

['196\t242\t3\t881250949\n', 
'186\t302\t3\t891717742\n',
'22\t377\t1\t878887116\n',
'244\t51\t2\t880606923\n',
'166\t346\t1\t886397596\n']
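To make the line format concrete, here is a minimal sketch that splits such lines into their four fields using only the standard library. The helper name parse_rating_line is made up for this illustration and is not part of Surprise; Surprise does its own parsing through the Reader.

```python
# The five sample lines shown above, copied verbatim
sample_lines = [
    '196\t242\t3\t881250949\n',
    '186\t302\t3\t891717742\n',
    '22\t377\t1\t878887116\n',
    '244\t51\t2\t880606923\n',
    '166\t346\t1\t886397596\n',
]


def parse_rating_line(line):
    """Split one u.data line into (user_id, item_id, rating, timestamp)."""
    user_id, item_id, rating, timestamp = line.rstrip('\n').split('\t')
    # IDs are kept as strings; the rating becomes a float, the timestamp an int
    return user_id, item_id, float(rating), int(timestamp)


parsed = [parse_rating_line(line) for line in sample_lines]
for row in parsed:
    print(row)
```

Keeping the IDs as strings mirrors how they arrive when read from a text file.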

Loading and splitting data

For Surprise to be able to read our data, we create a Reader and define its format. In this case each line is laid out as user item rating timestamp and the fields are separated by a tab \t. After we define the format, we load our data into a Dataset object:

from surprise import Reader, Dataset
# Define the format
reader = Reader(line_format='user item rating timestamp', sep='\t')
# Load the data from the file using the reader format
data = Dataset.load_from_file('./ml-100k/u.data', reader=reader)

Surprise provides a convenient way to do cross validation by dividing the data set into folds right from the beginning. With cross validation, training is done on all folds except one and scoring is done on the remaining fold. For example, suppose we have a data set of 1000 ratings and divide it into 5 folds of 200 ratings each. We then train our model 5 times, each time on 4 of the folds, and report the results on the fold that was held out. The final score of our model is the average of the 5 fold results. The data is split into folds using the split function as follows:

# Split data into 5 folds
data.split(n_folds=5)
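To illustrate the idea of folds independently of Surprise, here is a rough sketch of a 5-fold split in plain Python. It is not how Surprise implements split internally; the function k_fold_splits and the stand-in ratings are assumptions made for this example.

```python
import random


def k_fold_splits(ratings, n_folds=5, seed=0):
    """Yield (train, test) pairs; every rating is held out exactly once."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled ratings into n_folds groups of near-equal size
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [r for i, fold in enumerate(folds) if i != held_out for r in fold]
        yield train, test


ratings = list(range(1000))  # stand-ins for 1000 rating records
fold_sizes = [(len(train), len(test)) for train, test in k_fold_splits(ratings)]
print(fold_sizes)
```

With 1000 ratings and 5 folds, each round trains on 800 ratings and scores on the remaining 200.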

Optimization

Training works as in other machine learning approaches: an algorithm tries to optimize its predictions to match the actual results as closely as possible. In the context of collaborative filtering, our algorithm tries to predict the rating for a given user-movie combination and compares that prediction to the actual rating. The difference between the actual and the predicted rating is measured using classical error measures such as root mean squared error (RMSE) and mean absolute error (MAE).
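As a quick illustration of the two error measures, the sketch below computes RMSE and MAE by hand. The ratings and predictions are made up for the example and do not come from the MovieLens data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: mean of the absolute differences."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


actual = [4.0, 3.0, 5.0, 1.0]     # made-up true ratings
predicted = [3.5, 3.0, 4.0, 2.0]  # made-up predicted ratings

print(rmse(actual, predicted))  # 0.75
print(mae(actual, predicted))   # 0.625
```

RMSE penalizes large errors more heavily than MAE because the differences are squared before averaging.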

Surprise offers a wide choice of algorithms and a wide choice of parameters to tune for each algorithm. Well-known ones include SVD, NMF, and KNN. For our example we will use the SVD algorithm, and for the sake of simplicity we will use it as is, without changing any of its parameters.
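For intuition, SVD in Surprise is a matrix factorization model: it estimates a rating as the global mean plus a user bias, an item bias, and the dot product of learned user and item factor vectors. The sketch below shows only that prediction rule, not the training. The helper predict_rating is hypothetical and all the numbers are made up for illustration.

```python
def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors):
    """Estimate a rating as mean + user bias + item bias + factor dot product."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    return global_mean + user_bias + item_bias + interaction


# Made-up values; in practice these are learned from the training ratings
estimate = predict_rating(
    global_mean=3.5,
    user_bias=0.2,
    item_bias=-0.1,
    user_factors=[0.1, 0.2],
    item_factors=[0.3, 0.4],
)
print(round(estimate, 2))
```

During training the algorithm adjusts the biases and factor vectors so that these estimates get closer to the known ratings.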

Surprise also supports the RMSE and MAE measurements so we will use those to measure the performance of our algorithm.

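from surprise import SVD, evaluate
# Use SVD with its default parameters
algo = SVD()
# Run cross validation over the folds and report RMSE and MAE
evaluate(algo, data, measures=['RMSE', 'MAE'])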

And the results of the algorithm are as follows

------------
Fold 1
RMSE: 0.9406
MAE: 0.7408
------------
Fold 2
RMSE: 0.9338
MAE: 0.7378
------------
Fold 3
RMSE: 0.9399
MAE: 0.7397
------------
Fold 4
RMSE: 0.9419
MAE: 0.7426
------------
Fold 5
RMSE: 0.9395
MAE: 0.7417
------------
------------
Mean RMSE: 0.9392
Mean MAE : 0.7405
------------
------------

Predicting

Finally, we are interested in the predicted rating for a given user-movie combination, so that we can tell whether the user will like the movie or not.

First we train on the whole data set, without splitting it, to get the best results possible:

# Retrieve the trainset.
trainset = data.build_full_trainset()
algo.train(trainset)

Afterwards, to predict a rating we pass the user ID, the item ID, and the actual rating. The actual rating is optional and is only used for comparison in the output.

# Raw IDs are strings because they were read from a file
userid = str(196)
itemid = str(302)
actual_rating = 4
print(algo.predict(userid, itemid, actual_rating))

From the result we can see that the estimated rating is 4.18, compared to the actual rating of 4:

user: 196        item: 302        r_ui = 4.00   est = 4.18 {u'was_impossible': False}
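One simple way to turn an estimate like this into a "will like" decision is to compare it against a cutoff. The sketch below assumes a cutoff of 4.0 on the 1 to 5 rating scale; both the cutoff and the helper will_like are assumptions for illustration, not something Surprise provides.

```python
LIKE_THRESHOLD = 4.0  # assumed cutoff on the 1-5 scale; tune for your own data


def will_like(estimated_rating, threshold=LIKE_THRESHOLD):
    """Return True when the estimated rating reaches the threshold."""
    return estimated_rating >= threshold


print(will_like(4.18))  # the estimate from the output above
print(will_like(3.2))   # a made-up lower estimate
```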

The full code is found in this gist so you can easily copy/paste it