My Journey to Building a Book Recommendation System
Recommendation systems have kept my mind occupied for quite a while, and given my love of reading, exploring the Book Crossing dataset was very engaging.
Online recommendation systems are the in thing for many e-commerce websites. Broadly, a recommendation system recommends the products best suited to each customer’s tastes and traits. For more details on recommendation systems, read my introductory post on Recommendation Systems and a few illustrations using Python.
My journey to building a Book Recommendation System began when I came across the Book Crossing dataset. This dataset was compiled by Cai-Nicolas Ziegler in 2004, and it comprises three tables for users, books and ratings. Explicit ratings are expressed on a scale of 1–10 (higher values denoting higher appreciation), and an implicit rating is expressed by 0.
Before building any machine learning model, it is vital to understand what the data is and what we are trying to achieve. Data exploration reveals hidden trends and insights, and data preprocessing makes the data ready for use by ML algorithms.
So, let’s begin. . .
First, we load the dataset and check the shapes of the books, users and ratings tables.
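A minimal sketch of the loading step, assuming pandas and a semicolon-separated file. The in-memory sample, the `bookTitle` column name and the `sep=";"` setting are assumptions for illustration; with the real files you would pass a file path instead of the `StringIO` buffer.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the books CSV (hypothetical rows, not real data).
sample_csv = io.StringIO(
    '"ISBN";"bookTitle";"yearOfPublication"\n'
    '"0000000001";"Sample Book A";"2002"\n'
    '"0000000002";"Sample Book B";"2001"\n'
)

# Reading ISBN as a string keeps leading zeros intact.
books = pd.read_csv(sample_csv, sep=";", dtype={"ISBN": str})

print(books.shape)  # (number of rows, number of columns)
```

The same call would be repeated for the users and ratings tables.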
Books Dataset
We explore these datasets one by one, beginning with the books dataset. The image URL columns are not required for analysis, so they can be dropped.
We now check the data type of each column and correct missing and inconsistent entries. I am also adjusting the column width so that the full text of each column is displayed.
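The two steps above could look roughly like this. The `imageUrlS`/`imageUrlM`/`imageUrlL` and `bookTitle` column names are assumptions made for this sketch, and the rows are made up.

```python
import pandas as pd

# Hypothetical miniature of the books table.
books = pd.DataFrame({
    "ISBN": ["0000000001", "0000000002"],
    "bookTitle": ["Sample Book A", "Sample Book B"],
    "yearOfPublication": [2002, "2001"],  # mixed int/str, as in the raw data
    "imageUrlS": ["http://example.com/a_s.jpg", "http://example.com/b_s.jpg"],
    "imageUrlM": ["http://example.com/a_m.jpg", "http://example.com/b_m.jpg"],
    "imageUrlL": ["http://example.com/a_l.jpg", "http://example.com/b_l.jpg"],
})

# Drop every image URL column in one go.
image_cols = [c for c in books.columns if c.startswith("imageUrl")]
books = books.drop(columns=image_cols)

# Show full cell contents instead of truncating long strings.
pd.set_option("display.max_colwidth", None)

print(books.dtypes)
```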
yearOfPublication
Now we check the unique values for this attribute.
There are some incorrect entries in yearOfPublication. It looks like the publisher names ‘DK Publishing Inc’ and ‘Gallimard’ have been loaded as yearOfPublication because of errors in the csv file. Also, some of the values are strings, while the same years appear as numbers elsewhere. We will correct these rows and set the data type of yearOfPublication to int.
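One way to locate such rows is sketched below, using `pd.to_numeric` with `errors="coerce"`, which turns anything that is not a number into NaN. This is only an illustration on made-up rows; the affected rows in the real data still need to be corrected by hand before casting the column to int.

```python
import pandas as pd

# Hypothetical rows: two valid years (one stored as a string) and two
# publisher names that ended up in the wrong column.
books = pd.DataFrame({
    "ISBN": ["a", "b", "c", "d"],
    "yearOfPublication": [2002, "1999", "DK Publishing Inc", "Gallimard"],
})

# Non-numeric entries become NaN; numeric strings become numbers.
year_num = pd.to_numeric(books["yearOfPublication"], errors="coerce")

# Rows that need manual correction.
bad_rows = books[year_num.isna()]
print(bad_rows)

books["yearOfPublication"] = year_num
```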
yearOfPublication is now of type int, with values ranging from 0–2050. As this dataset was built in 2004, I assume that all years after 2006 are invalid, keeping a margin of two years in case the dataset was updated later. I convert all invalid entries (including 0) to NaN and then replace them with the mean of the remaining years.
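A small sketch of this rule on made-up values: mark 0 and anything after 2006 as NaN, fill with the rounded mean of the valid years, and cast back to int.

```python
import pandas as pd

# Hypothetical years, including the invalid values 0, 2030 and 2050.
years = pd.Series([0, 1995, 2001, 2030, 2050, 2004], name="yearOfPublication")

# Invalid: zero, or later than 2006.
invalid = (years == 0) | (years > 2006)

# mask() replaces the flagged entries with NaN (needs a float column).
years = years.astype(float).mask(invalid)

# Fill with the mean of the remaining valid years, then restore the int type.
mean_year = round(years.mean())
years = years.fillna(mean_year).astype(int)

print(years.tolist())
```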
publisher
Coming to the ‘publisher’ column, I have handled two NaN values by replacing them with ‘other’, as the publisher name could not be inferred after some investigation (see the embedded Jupyter notebook).
Users Dataset
Now we explore the users dataset, first by checking its shape, first few rows and data types.
Age
Upon checking the unique values, userID looks correct. However, the Age column has NaN values and some very high values. In my view, ages below 5 and above 90 do not make much sense, so these are replaced with NaN. All NaNs are then replaced with the mean Age, and the data type is set to int.
I am not doing any processing of the Location column here. However, if you wish, you can split it further into city, state and country and apply text processing to it.
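The Age cleanup can be sketched the same way, here on a made-up users table. `between(5, 90)` is inclusive at both ends, so ages 5 and 90 are kept.

```python
import numpy as np
import pandas as pd

# Hypothetical users with a missing age and two implausible ages.
users = pd.DataFrame({
    "userID": [1, 2, 3, 4, 5, 6],
    "Age": [np.nan, 3, 25, 35, 120, 30],
})

age = users["Age"]

# Keep ages in [5, 90]; everything else (including existing NaN) is NaN.
age = age.where(age.between(5, 90))

# Fill with the rounded mean of the valid ages and cast to int.
users["Age"] = age.fillna(round(age.mean())).astype(int)

print(users["Age"].tolist())
```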
Ratings Dataset
We check the ratings dataset for its shape and first few rows. It reveals that our user-book ratings matrix will be very sparse, as the number of actual ratings is far smaller than the size of the ratings matrix (number of users × number of books).
The ratings dataset should only contain userIDs and ISBNs that exist in the respective tables, viz. users and books, so we check this next.
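A minimal sketch of this consistency check, using `isin` on made-up tables. The name `ratings_new` is just a label chosen for the filtered table.

```python
import pandas as pd

# Hypothetical miniature tables.
books = pd.DataFrame({"ISBN": ["A", "B", "C"]})
users = pd.DataFrame({"userID": [1, 2, 3]})
ratings = pd.DataFrame({
    "userID": [1, 1, 2, 3, 4],
    "ISBN": ["A", "Z", "B", "C", "A"],  # "Z" is not in books, user 4 not in users
    "bookRating": [8, 0, 5, 10, 7],
})

# Keep only ratings whose book and user both exist in the other tables.
known_book = ratings["ISBN"].isin(books["ISBN"])
known_user = ratings["userID"].isin(users["userID"])
ratings_new = ratings[known_book & known_user]

print(ratings.shape, ratings_new.shape)
```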
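It is evident that users have rated some books that are not part of the original books dataset. The sparsity of the dataset is the fraction of user-book pairs that have no rating, that is, one minus the number of ratings divided by (number of users × number of books).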
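As a worked example with made-up numbers: 6 ratings spread over 4 users and 5 books leave 14 of the 20 possible user-book pairs empty, so the sparsity is 70%.

```python
import pandas as pd

# Hypothetical sizes: 4 users, 5 books, 6 ratings.
n_users, n_books = 4, 5
ratings_new = pd.DataFrame({
    "userID": [1, 1, 2, 3, 3, 4],
    "ISBN": ["A", "B", "A", "C", "D", "E"],
    "bookRating": [8, 0, 5, 10, 7, 9],
})

# Sparsity = share of user-book pairs with no rating at all.
sparsity = 1.0 - len(ratings_new) / (n_users * n_books)

print(f"Sparsity: {sparsity * 100:.2f} %")
```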
The explicit ratings (1–10) and the implicit ratings (0) now have to be separated. We will use only explicit ratings for building our book recommendation system. Similarly, users are separated into those who rated explicitly and those whose implicit behavior was recorded.
A countplot of bookRating indicates that higher ratings are more common amongst users, and that 8 is the most frequently given rating.
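A sketch of this split on made-up data. The variable names are chosen for this example; note that a user can appear in both groups if they have both kinds of ratings.

```python
import pandas as pd

# Hypothetical users and ratings.
users = pd.DataFrame({"userID": [1, 2, 3, 4]})
ratings_new = pd.DataFrame({
    "userID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "C"],
    "bookRating": [8, 0, 0, 9],
})

# 0 marks an implicit rating; 1-10 are explicit.
ratings_explicit = ratings_new[ratings_new["bookRating"] != 0]
ratings_implicit = ratings_new[ratings_new["bookRating"] == 0]

# Users with at least one explicit / implicit rating.
users_exp_ratings = users[users["userID"].isin(ratings_explicit["userID"])]
users_imp_ratings = users[users["userID"].isin(ratings_implicit["userID"])]

print(ratings_explicit.shape, ratings_implicit.shape)
```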
Simple Popularity-based Recommendation System
At this point, a simple popularity-based recommendation system can be built from the count of user ratings for each book. It is evident that books authored by J.K. Rowling are quite popular.
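A minimal sketch of a count-based ranking, assuming the explicit ratings and the books table from the earlier steps. The `bookTitle` and `ratingCount` names and the sample rows are assumptions for this example.

```python
import pandas as pd

# Hypothetical explicit ratings and book titles.
ratings_explicit = pd.DataFrame({
    "ISBN": ["A", "A", "A", "B", "B", "C"],
    "bookRating": [9, 8, 10, 7, 6, 10],
})
books = pd.DataFrame({
    "ISBN": ["A", "B", "C"],
    "bookTitle": ["Sample Book A", "Sample Book B", "Sample Book C"],
})

# Number of explicit ratings per book.
rating_counts = (
    ratings_explicit.groupby("ISBN")["bookRating"].count().rename("ratingCount")
)

# Take the most-rated books and attach their titles.
top = rating_counts.sort_values(ascending=False).head(2).reset_index()
top_books = top.merge(books, on="ISBN", how="left")

print(top_books)
```

Ranking by count rewards books that many people rated, regardless of whether the ratings were high; ranking by mean rating with a minimum count is a common variation.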
Collaborative Filtering-based Recommendation System
To stay within the computing power of my machine and to reduce the dataset size, I am considering only users who have rated at least 100 books and books which have at least 100 ratings.
The next key step in building CF-based recommendation systems is to generate the user-item ratings matrix from the ratings table.
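The filtering described above can be sketched as follows. `filter_active` is a hypothetical helper written for this example, and the demo uses a threshold of 2 instead of 100 so that the tiny sample is not emptied out.

```python
import pandas as pd

def filter_active(ratings, min_user_ratings, min_book_ratings):
    """Keep ratings from active users, then keep sufficiently rated books."""
    # Users with enough ratings.
    user_counts = ratings["userID"].value_counts()
    keep_users = user_counts[user_counts >= min_user_ratings].index
    ratings = ratings[ratings["userID"].isin(keep_users)]

    # Books with enough ratings among the remaining users.
    book_counts = ratings["ISBN"].value_counts()
    keep_books = book_counts[book_counts >= min_book_ratings].index
    return ratings[ratings["ISBN"].isin(keep_books)]

# Hypothetical explicit ratings.
ratings_explicit = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2, 3],
    "ISBN": ["A", "B", "C", "A", "B", "A"],
    "bookRating": [8, 7, 9, 6, 10, 5],
})

filtered = filter_active(ratings_explicit, min_user_ratings=2, min_book_ratings=2)
print(filtered)
```

Because the book counts are computed after the user filter, the order of the two steps affects the result.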
Notice that most of the values in the ratings matrix are NaNs, indicating the absence of ratings and hence the sparsity of the data. Also, note that only explicit ratings have been considered here. As most machine learning algorithms cannot handle NaNs, we replace them with 0, which now indicates the absence of a rating.
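A sketch of building the user-item matrix with `DataFrame.pivot` on made-up ratings, then replacing the NaNs with 0. It assumes each (userID, ISBN) pair appears at most once, which `pivot` requires.

```python
import pandas as pd

# Hypothetical explicit ratings (each user-book pair appears once).
ratings_explicit = pd.DataFrame({
    "userID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "A", "C"],
    "bookRating": [8, 6, 9, 7],
})

# Rows are users, columns are books, cells are ratings (NaN where absent).
ratings_matrix = ratings_explicit.pivot(
    index="userID", columns="ISBN", values="bookRating"
)
n_missing = int(ratings_matrix.isna().sum().sum())

# 0 now stands for "no rating".
ratings_matrix = ratings_matrix.fillna(0).astype(int)

print(ratings_matrix)
```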
User-based CF
I will be reusing the functions from my post CF based Recommendation Systems Exemplified. The function findksimilarusers takes a userID and the ratings matrix and returns the similarities and indices of the k most similar users. (Read my previous stories to understand the concepts and formulae behind user-based and item-based CF.)
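To make the idea concrete, here is a plain NumPy sketch of a k-nearest-users lookup with cosine similarity. `find_k_similar_users` is a hypothetical stand-in written for this example, not the findksimilarusers function from the earlier post, and it returns user IDs rather than positional indices.

```python
import numpy as np
import pandas as pd

def find_k_similar_users(user_id, ratings, k=2):
    """Return (similarities, user_ids) of the k users closest to user_id."""
    matrix = ratings.to_numpy(dtype=float)
    target = ratings.loc[user_id].to_numpy(dtype=float)

    # Cosine similarity = dot product divided by the product of the norms.
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(target)
    # A zero norm (user with no ratings) yields similarity 0, not a division error.
    sims = np.divide(
        matrix @ target, norms, out=np.zeros(len(matrix)), where=norms > 0
    )

    # Exclude the user themselves, then take the k highest similarities.
    sims = pd.Series(sims, index=ratings.index).drop(user_id)
    top = sims.sort_values(ascending=False).head(k)
    return top.to_numpy(), top.index.to_numpy()

# Hypothetical zero-filled ratings matrix (rows: users, columns: books).
ratings_matrix = pd.DataFrame(
    [[5, 3, 0], [5, 3, 0], [0, 0, 4], [4, 0, 0]],
    index=[1, 2, 3, 4],
    columns=["A", "B", "C"],
)

similarities, neighbor_ids = find_k_similar_users(1, ratings_matrix, k=2)
print(neighbor_ids, similarities)
```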
The function predict_userbased predicts the rating for a specified user-item combination using the user-based approach.
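One common way to form such a prediction is a mean-centered weighted average over the neighbours: start from the user's own mean rating and add the similarity-weighted deviations of the neighbours' ratings from their own means. The sketch below follows that formula under my own assumptions (means over rated books only, neighbours who have not rated the book are skipped); `predict_user_based` is a hypothetical name and may differ in detail from predict_userbased.

```python
import pandas as pd

def predict_user_based(user_id, isbn, ratings, similarities, neighbor_ids):
    """Mean-centered weighted average of the neighbours' ratings for one book."""
    def mean_rated(uid):
        # Average over rated books only (0 means "no rating").
        row = ratings.loc[uid]
        rated = row[row > 0]
        return rated.mean() if len(rated) else 0.0

    numerator, denominator = 0.0, 0.0
    for sim, nid in zip(similarities, neighbor_ids):
        rating = ratings.loc[nid, isbn]
        if rating == 0:
            continue  # this neighbour has not rated the book
        numerator += sim * (rating - mean_rated(nid))
        denominator += abs(sim)

    base = mean_rated(user_id)
    if denominator == 0:
        return base  # no usable neighbour: fall back to the user's mean
    return base + numerator / denominator

# Hypothetical zero-filled ratings matrix and precomputed neighbours of user 1.
ratings_matrix = pd.DataFrame(
    [[8, 6, 0], [9, 7, 10], [5, 3, 0]],
    index=[1, 2, 3],
    columns=["A", "B", "C"],
)
prediction = predict_user_based(1, "C", ratings_matrix, [0.9, 0.8], [2, 3])
print(round(prediction, 2))
```

A production version would typically also clip the result to the 1–10 rating scale.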
The function recommendItem uses the above functions to recommend books with either the user-based or the item-based approach (depending on the selected approach and metric combination). A book is recommended if its predicted rating is greater than or equal to 6 and the user has not rated it already. You can select the similarity metric (cosine/correlation) while calling this function.
And Voila!!! Check the top 10 book recommendations for user 4385 based on the user-based CF approach.
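The selection rule itself (predicted rating of at least 6, book not yet rated) can be sketched independently of how the predictions were produced. `recommend` is a hypothetical helper for this example, not the recommendItem function.

```python
import pandas as pd

def recommend(predicted, already_rated, threshold=6, top_n=10):
    """Pick unrated books whose predicted rating reaches the threshold."""
    # Align the user's existing ratings to the predictions; missing means unrated.
    rated = already_rated.reindex(predicted.index).fillna(0)
    candidates = predicted[(predicted >= threshold) & (rated == 0)]
    return candidates.sort_values(ascending=False).head(top_n)

# Hypothetical predicted ratings and the user's existing ratings (0 = unrated).
predicted = pd.Series({"A": 9.1, "B": 5.0, "C": 7.5, "D": 6.0})
already_rated = pd.Series({"A": 8, "B": 0, "C": 0, "D": 0})

recommendations = recommend(predicted, already_rated)
print(recommendations)
```

Here book A is excluded because it was already rated, and book B because its prediction falls below 6.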
Item-based CF
Similar functions have been written for item-based CF to find the k most similar books and predict the user’s rating for each book. The same function recommendItem can be used to recommend books based on the item-based approach and the selected metric. Again, a book is recommended if its predicted rating is greater than or equal to 6 and the user has not rated it already.
Wow!!! Check the top 10 book recommendations for user 4385 based on the item-based CF approach. These are significantly different from those suggested by the user-based approach.
This post does not cover cross validation, train-test splits or the evaluation of recommendation systems, and these areas are worth exploring. The Jupyter notebook for this code is embedded below.
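For comparison with the user-based case, here is a sketch of an item-based prediction: compute cosine similarity between book columns, then take the similarity-weighted average of the user's own ratings on the k most similar books they have rated. `predict_item_based` is a hypothetical function under these assumptions and is not taken from the original notebook.

```python
import numpy as np
import pandas as pd

def predict_item_based(user_id, isbn, ratings, k=2):
    """Weighted average of the user's ratings on the k books most similar to isbn."""
    item_vectors = ratings.T.to_numpy(dtype=float)  # one row per book
    target = ratings[isbn].to_numpy(dtype=float)

    # Cosine similarity between the target book and every book.
    norms = np.linalg.norm(item_vectors, axis=1) * np.linalg.norm(target)
    sims = np.divide(
        item_vectors @ target, norms, out=np.zeros(len(item_vectors)), where=norms > 0
    )
    sims = pd.Series(sims, index=ratings.columns).drop(isbn)

    # Only books this user has actually rated can contribute.
    user_row = ratings.loc[user_id].drop(isbn)
    rated = user_row[user_row > 0].index
    neighbors = sims.loc[rated].sort_values(ascending=False).head(k)

    weight = neighbors.abs().sum()
    if weight == 0:
        return 0.0  # nothing to base a prediction on
    return float((neighbors * user_row.loc[neighbors.index]).sum() / weight)

# Hypothetical zero-filled ratings matrix (rows: users, columns: books).
ratings_matrix = pd.DataFrame(
    [[8, 8, 0], [6, 6, 9], [0, 0, 9]],
    index=[1, 2, 3],
    columns=["A", "B", "C"],
)
prediction = predict_item_based(1, "C", ratings_matrix, k=2)
print(prediction)
```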
Thanks for reading! I hope you liked this article. Please share your views in comments section below. Meanwhile, I will go and check some book recommendations for myself.
References: