How to build a Book Recommendation System

SenapatiRajesh
10 min read · Feb 14, 2023


Source: https://s26162.pcdn.co/wp-content/uploads/2020/02/book-3480216_1920.jpg

What is a book recommendation system?

A book recommendation system is like a personalized librarian or book expert who helps you find new books to read.

Imagine you’re at a library and you’re looking for a new book to read. A recommendation system would ask you about the types of books you like, the authors you enjoy, and the genres you prefer. Then, it would use that information to suggest new books that you might like, based on what other people with similar tastes have liked in the past.

This is essentially what a book recommendation system does, but it’s all done automatically and online. By analyzing your reading preferences and behavior, the system is able to recommend new books that are tailored to your individual tastes and interests. This can help you discover new books you might not have otherwise considered, and make it easier to find books that you’re sure to enjoy.

Use cases?

Several companies have reported a significant increase in sales after adopting book recommendation systems. Here are a few examples:

Source:https://www.wikihow.com/images/thumb/b/bb/Start-Reading-eBooks-Step-3.jpeg/v4-460px-Start-Reading-eBooks-Step-3.jpeg
  1. Amazon: Amazon’s personalized recommendations are reported to drive 35% of its total sales, making it one of the most successful e-commerce companies in terms of using recommendation systems. Amazon’s recommendation system takes into account a user’s purchase history, search history, and ratings and reviews to provide personalized recommendations.
  2. Goodreads: Goodreads is a social network for book lovers that uses a combination of content-based and collaborative filtering to recommend books to its users. By providing users with relevant recommendations, Goodreads has been able to increase engagement and drive sales for both traditional publishers and independent authors.
  3. Barnes & Noble: Barnes & Noble, one of the largest book retailers in the United States, has also seen a significant increase in sales as a result of adopting book recommendation systems. The company’s recommendation system takes into account a user’s purchase history, ratings and reviews, and the preferences of other users to provide personalized recommendations.
  4. Scribd: Scribd is an online reading subscription service that uses collaborative filtering to recommend books to its users. By providing users with relevant recommendations, Scribd has been able to increase engagement and drive sales for publishers and authors.

These are just a few examples of companies that have benefited from book recommendation systems. Their use continues to grow as companies seek to give customers the most relevant and personalized recommendations possible.

So let’s build the recommendation system. The approach below will be followed, building the system in several different ways. Please keep reading till the very end.

Let’s start.

Approach?

  1. Data importing, cleaning, and making it ready for building the recommendation system
  2. A little bit of EDA
  3. Building the recommendation system using:

a. Popularity w.r.t. user ratings

b. Popularity w.r.t. places

c. Popularity w.r.t. the author and publisher of a given book

d. Popularity w.r.t. year

e. Collaborative filtering (user-item filtering)

f. Content-based filtering

Execution:

  1. Data importing, cleaning, and making it ready for building the recommendation system
#Importing required libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

#Reading data
df_ratings = pd.read_csv(r"C:\Users\RajeshS\Real projects\Projects\Recommendation\Book\Ratings_df.csv", encoding='latin-1')
df_books = pd.read_csv(r"C:\Users\RajeshS\Real projects\Projects\Recommendation\Book\Books_df.csv", encoding='latin-1')
df_user = pd.read_csv(r"C:\Users\RajeshS\Real projects\Projects\Recommendation\Book\users_df.csv", encoding='latin-1')

The dataset was taken from Kaggle (the Book-Crossing dataset).

Skip this part if you have basic knowledge of the libraries used in the above code:

The read_csv calls load each CSV file into a pandas DataFrame (df_ratings, df_books, df_user). The encoding parameter is set to ‘latin-1’, which specifies the character encoding of the files. This is needed because the files contain characters that are not part of the ASCII character set.

The DictVectorizer is a utility class for converting categorical data stored in dictionaries into numerical feature vectors suitable for scikit-learn algorithms. It automatically converts categorical variables into binary features using one-hot encoding, so each unique value of each categorical variable becomes a separate binary feature in the resulting matrix.
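A minimal illustration with toy data (not from the book dataset used here):

from sklearn.feature_extraction import DictVectorizer

#Toy data: one dict per sample; string values are one-hot encoded,
#numeric values pass through unchanged
data = [{'genre': 'fantasy', 'pages': 300},
        {'genre': 'mystery', 'pages': 250}]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(data)
print(vec.get_feature_names_out())  # ['genre=fantasy' 'genre=mystery' 'pages']
print(X)
# [[  1.   0. 300.]
#  [  0.   1. 250.]]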

The cosine_similarity function calculates the cosine similarity between two sets of vectors, which is a measure of the similarity between two non-zero vectors of an inner product space, defined as the cosine of the angle between them. For n input vectors the result is an n x n matrix containing the cosine similarity between each pair, where each value lies between -1 and 1: 1 indicates that the vectors point in the same direction, -1 that they point in opposite directions, and values close to 0 that they are nearly unrelated.
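For instance, with three toy vectors (an illustration, not data from this project):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

A = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 0]])
#For n input rows the result is an n x n matrix
print(cosine_similarity(A))
# [[1. 1. 0.]
#  [1. 1. 0.]
#  [0. 0. 1.]]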

TF-IDF stands for Term Frequency-Inverse Document Frequency, and is a numerical statistic used to reflect how important a word is to a document in a collection of documents. The basic idea is to scale the frequency of a word in a document by the inverse of its frequency across all documents, so that words that are common across many documents are given less weight and words that are unique to a particular document are given more weight.
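A quick sketch of the idea on toy documents (illustration only):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the wizard school", "the wizard duel", "the cooking school"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
#'the' appears in every document, so within each document it gets the lowest
#weight; 'duel' and 'cooking' each appear in only one document, so they get
#the highest weight in theirs
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))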

# Checking first 5 rows of each file
df_books.head()
df_ratings.head()
df_user.head()

# Checking the shape of each file
print('books', df_books.shape)
print('ratings', df_ratings.shape)
print('users', df_user.shape)
#output
books (271360, 8)
ratings (1149780, 3)
users (278858, 3)
# Calculating missing value percentage in the books dataset
missing_values_percentage = (df_books.isnull().sum() / df_books.shape[0]) * 100
missing_values_percentage.sort_values(ascending=False)
#output
Image-URL-L 0.001106
Publisher 0.000737
Book-Author 0.000369
ISBN 0.000000
Book-Title 0.000000
Year-Of-Publication 0.000000
Image-URL-S 0.000000
Image-URL-M 0.000000

In this part,

  1. I checked each dataset’s shape, dtypes, and description
  2. Checked column-wise missing values and their percentages
  3. In the Year-Of-Publication column, I found 0s, the strings ‘DK Publishing Inc’ and ‘Gallimard’, and years stored as strings, which were corrected by replacement
  4. The zero years in the Year-Of-Publication column come from a wrong data format. I corrected them using
#There must be some issue with the format. Let's correct it by converting to numeric
#(note: from here on the code refers to the DataFrames as books and users)
books['Year-Of-Publication'] = pd.to_numeric(books['Year-Of-Publication'], errors='coerce')

The code is converting the ‘Year-Of-Publication’ column of a Pandas DataFrame called ‘books’ to numeric values using the pd.to_numeric function. The errors=’coerce’ argument tells the function to replace any non-numeric values with NaN (Not a Number) values.

This line of code is useful when working with data where the ‘Year-Of-Publication’ column may contain values that are not numbers, but need to be converted to numeric values for further analysis. The pd.to_numeric function makes it easy to convert a column to numeric values and handle any non-numeric values appropriately.
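For example, on a toy series mirroring the bad values mentioned above:

import pandas as pd

s = pd.Series(['2002', '1999', 'DK Publishing Inc', '0'])
print(pd.to_numeric(s, errors='coerce'))
# 0    2002.0
# 1    1999.0
# 2       NaN
# 3       0.0
# dtype: float64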

5. Then I took care of outliers in the data using the code below:

#Converting years above 2021 and the 0 years to NaN
books.loc[(books['Year-Of-Publication'] > 2021) | (books['Year-Of-Publication'] == 0), 'Year-Of-Publication'] = np.nan
#Filling NaN values with the median
books['Year-Of-Publication'].fillna(round(books['Year-Of-Publication'].median()), inplace=True)
#The years 1378 and 1376 were changed to 2000 (ISBNs are strings, so compare against strings)
books.loc[books['Year-Of-Publication'] == 1378, :]
books.loc[books['ISBN'] == '9643112136', 'Year-Of-Publication'] = 2000
books.loc[books['Year-Of-Publication'] == 1376, :]
books.loc[books['ISBN'] == '964442011X', 'Year-Of-Publication'] = 2000

6. Created a country column from the Location column

#Extracting the country (the text after the last comma) from Location
users['country'] = users.Location.str.extract(r'\,+\s?(\w*\s?\w*)\"*$')
#Let's drop the Location column
users.drop('Location', axis=1, inplace=True)
users['country'] = users['country'].astype('str')
#Let's replace the wrongly entered countries ('austria' is a real country, so it is left alone)
users['country'].replace(['','01776','02458','19104','23232','30064','85021','87510','alachua','america','autralia','cananda','geermany','italia','united kindgonm','united sates','united staes','united state','united states','us'],
                         ['other','usa','usa','usa','usa','usa','usa','usa','usa','usa','australia','canada','germany','italy','united kingdom','usa','usa','usa','usa','usa'], inplace=True)

The code adds a new column to the DataFrame called “country”, which is extracted from the “Location” column using the str.extract method of pandas, which pulls a regex capture group out of each string.

The pattern matches a comma (\,+), optionally followed by whitespace (\s?), then captures one or two words (\w*\s?\w*), allowing an optional trailing double quote (\"*) before the end of the string ($). Because the pattern is anchored at the end of the string, the capture grabs the text after the last comma. So, for example, if the “Location” column of a row contains the string “Boston, MA, USA”, the code will extract the string “USA” and store it in the “country” column of that row. We then drop Location, convert the new column to string, and replace all the wrongly entered countries.
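To see the pattern in action on location strings like those in the dataset (sample values, illustration only):

import pandas as pd

loc = pd.Series(['nyc, new york, usa',
                 'stockton, california, usa',
                 'moscow, yukon territory, russia'])
#The capture group grabs the text after the last comma
print(loc.str.extract(r'\,+\s?(\w*\s?\w*)\"*$'))
#         0
# 0     usa
# 1     usa
# 2  russia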

7. The Age column has more outliers; these were corrected using the code below

users.loc[(users.Age > 100) | (users.Age < 5), 'Age'] = np.nan
users['Age'] = users['Age'].fillna(users.groupby('country')['Age'].transform('median'))
#Fill remaining null values with the mean
users['Age'] = users['Age'].fillna(users['Age'].mean())

The code first marks implausible ages (over 100 or under 5) as missing, then calls fillna on the “Age” column: each missing value is filled with the median age of users from the same country, calculated using the groupby and transform methods, and any nulls that remain are filled with the overall mean age.
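A small sketch of how the per-country median fill behaves (toy values):

import numpy as np
import pandas as pd

d = pd.DataFrame({'country': ['usa', 'usa', 'usa', 'canada', 'canada'],
                  'Age': [30, np.nan, 40, 25, np.nan]})
#Each NaN is replaced by the median age of its own country
d['Age'] = d['Age'].fillna(d.groupby('country')['Age'].transform('median'))
print(d)
#   country   Age
# 0     usa  30.0
# 1     usa  35.0
# 2     usa  40.0
# 3  canada  25.0
# 4  canada  25.0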

8. The rest of the preprocessing can be seen in the full code, as explaining all of it would make this blog too lengthy.

Building the recommendation system:

Here I am going to explain only one approach; you can find the remaining approaches via the link given below.

Collaborative Filtering (User-Item Filtering) approach:

#Count how many ratings each title has received
df = pd.DataFrame(Final_dataset['Book-Title'].value_counts())
df['Total-Ratings'] = df['Book-Title']
df['Book-Title'] = df.index
df.reset_index(level=0, inplace=True)
df = df.drop('index', axis=1)
#Attach the count to every rating row and drop columns we don't need here
df = Final_dataset.merge(df, left_on='Book-Title', right_on='Book-Title', how='left')
df = df.drop(['Year-Of-Publication', 'Publisher', 'Age', 'country'], axis=1)

We are creating a new pandas DataFrame based on the Final_dataset DataFrame. The code starts by counting the number of occurrences of each book title in Final_dataset, using the value_counts() method on the Book-Title column, and storing the result in a new DataFrame df.

The code then adds a Total-Ratings column, which is a copy of those counts, and writes the index (the titles themselves) back into the Book-Title column. The reset_index() method restores a default range index, and the leftover index column is removed using the drop() method.

Finally, the code merges df back onto the Final_dataset DataFrame using the merge() method. The merge is done on the Book-Title column of both DataFrames, via the left_on and right_on parameters. After the merge, the code drops the Year-Of-Publication, Publisher, Age, and country columns, since they are not needed for this approach.
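As an aside, the same Total-Ratings column can be built more directly with groupby and transform; this is a sketch of an equivalent alternative, not the code used in this project:

#Equivalent: count the rows per title and attach the count to each row
df = Final_dataset.copy()
df['Total-Ratings'] = df.groupby('Book-Title')['Book-Title'].transform('count')
df = df.drop(['Year-Of-Publication', 'Publisher', 'Age', 'country'], axis=1)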

popularity_threshold = 50
popular_book = df[df['Total-Ratings'] >= popularity_threshold]
popular_book = popular_book.reset_index(drop=True)

This code filters the DataFrame df on the Total-Ratings column to obtain a new DataFrame popular_book containing only the books with at least popularity_threshold ratings. The value of popularity_threshold is set to 50.

The code does this by using boolean indexing, where df[‘Total-Ratings’] >= popularity_threshold returns a boolean array indicating which rows meet the condition. The resulting DataFrame popular_book contains only the rows where this boolean array is True.

Finally, the code resets the index of the popular_book DataFrame using the reset_index() method, and the drop parameter is set to True to drop the original index. This results in a DataFrame with a default range index.
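On a toy frame (hypothetical counts) the filter behaves like this:

import pandas as pd

toy = pd.DataFrame({'Book-Title': ['A', 'B', 'C'],
                    'Total-Ratings': [120, 7, 50]})
#Keep only rows whose boolean mask is True, then renumber the index
print(toy[toy['Total-Ratings'] >= 50].reset_index(drop=True))
#   Book-Title  Total-Ratings
# 0          A            120
# 1          C             50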

testdf = pd.DataFrame()
testdf['ISBN'] = popular_book['ISBN']
testdf['Book-Rating'] = popular_book['Book-Rating']
testdf['User-ID'] = popular_book['User-ID']
testdf = testdf[['User-ID','Book-Rating']].groupby(testdf['ISBN'])

Here we are creating a new DataFrame testdf from the popular_book DataFrame, and selecting only the columns ISBN, Book-Rating, and User-ID.

Then, the code groups the testdf DataFrame by the ISBN column, using the groupby() method. The groupby() method returns a DataFrameGroupBy object, which allows grouping the data based on the values in the ISBN column. The resulting object groups the rows of the testdf DataFrame based on the ISBN column, and the grouped data can be aggregated in various ways, such as calculating the mean, sum, or count for each group.

Note that the code is also selecting only the columns User-ID and Book-Rating from the grouped DataFrame using the square brackets ([]) indexing notation.
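A toy version of the same grouping (the ISBNs, user IDs, and ratings are made up for illustration):

import pandas as pd

t = pd.DataFrame({'ISBN': ['0345339681', '0439064872', '0439064872'],
                  'User-ID': [11676, 11676, 278418],
                  'Book-Rating': [8, 9, 10]})
#Group the two selected columns by an external key (the ISBN column)
g = t[['User-ID', 'Book-Rating']].groupby(t['ISBN'])
print(g.get_group('0439064872'))
#    User-ID  Book-Rating
# 1    11676            9
# 2   278418           10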

listOfDictonaries = []
indexMap = {}
reverseIndexMap = {}
ptr = 0

for groupKey in testdf.groups.keys():  # each groupKey is an ISBN
    tempDict = {}
    groupDF = testdf.get_group(groupKey)
    for i in range(0, len(groupDF)):
        # map User-ID -> Book-Rating for this book
        tempDict[groupDF.iloc[i, 0]] = groupDF.iloc[i, 1]
    indexMap[ptr] = groupKey
    reverseIndexMap[groupKey] = ptr
    ptr = ptr + 1
    listOfDictonaries.append(tempDict)

Here we iterate over the groups in the testdf DataFrameGroupBy object and build a list of dictionaries, where each dictionary holds the ratings given by a set of users to one particular book.

The indexMap dictionary is mapping the index of each book in the list of dictionaries to its ISBN, and the reverseIndexMap dictionary is mapping the ISBN of each book to its index in the list of dictionaries.

For each group in the testdf DataFrameGroupBy object, represented by the groupKey, the code uses the get_group() method to retrieve the DataFrame that contains all the rows of the group. Then, it creates a temporary dictionary tempDict to store the user IDs and ratings of the books in the group.

The code then adds the tempDict to the listOfDictonaries list and increments the ptr variable, which is used as the index of each book in the listOfDictonaries list. Finally, the indexMap and reverseIndexMap dictionaries are updated with the ISBN and index of the book, respectively.
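Continuing the toy grouping from above, the loop would leave the following structures (groupby sorts the group keys, so the smaller ISBN gets index 0):

#listOfDictonaries maps each book (by position) to its {User-ID: rating} dict
#listOfDictonaries = [{11676: 8},                # '0345339681'
#                     {11676: 9, 278418: 10}]    # '0439064872'
#indexMap        = {0: '0345339681', 1: '0439064872'}
#reverseIndexMap = {'0345339681': 0, '0439064872': 1}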

dictVectorizer = DictVectorizer(sparse=True)
vector = dictVectorizer.fit_transform(listOfDictonaries)
pairwiseSimilarity = cosine_similarity(vector)

The DictVectorizer is a utility class provided by scikit-learn that can be used to convert a list of dictionaries into a numerical feature matrix, where each dictionary is represented as a row in the matrix.

In this code, dictVectorizer is an instance of the DictVectorizer class, and it is initialized with sparse=True, which means that the feature matrix created from the list of dictionaries will be in sparse format, and only the non-zero values will be stored.

The code then calls the fit_transform method on the dictVectorizer instance, passing in the listOfDictonaries list, to create the numerical feature matrix vector.

Finally, the code computes the pairwise cosine similarity between all the rows in the feature matrix vector using the cosine_similarity function from scikit-learn. The result is stored in the pairwiseSimilarity variable.

The cosine similarity is a measure of the similarity between two non-zero vectors of an inner product space: the cosine of the angle between them, defined as the dot product of the vectors divided by the product of their magnitudes. In this case the vectors represent books, and the cosine similarity between two books measures how similar their rating patterns are: a value of 1 means the same users gave them proportionally identical ratings, and values near 0 mean they have almost no rating overlap. (Since ratings are non-negative, the similarities here fall in the range [0, 1] rather than [-1, 1].)

k = list(Final_dataset['Book-Title'])
m = list(Final_dataset['ISBN'])
bookName = 'Harry Potter and the Chamber of Secrets (Book 2)'
#Look up the ISBN that corresponds to the title and get recommendations for it
collaborative = getTopRecommandations(m[k.index(bookName)])

#Output
Input Book:
Harry Potter and the Chamber of Secrets (Book 2)

RECOMMENDATIONS:

Harry Potter and the Sorcerer's Stone (Book 1)
Harry Potter and the Goblet of Fire (Book 4)
Harry Potter and the Order of the Phoenix (Book 5)
Harry Potter and the Prisoner of Azkaban (Book 3)
The Two Towers (The Lord of the Rings, Part 2)

Finally, using the above code, we select a book from the dataset and then, using that book’s ISBN, generate a list of top recommendations with the collaborative filtering algorithm.
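The getTopRecommandations helper itself lives in the full code. A minimal sketch of what it could look like, built on the indexMap, reverseIndexMap, and pairwiseSimilarity objects created above (the k argument, the de-duplication logic, and the title lookup via the books DataFrame are my assumptions, not necessarily the original implementation):

def getTopRecommandations(isbn, k=5):
    #Row of the similarity matrix corresponding to this book
    row = reverseIndexMap[isbn]
    print("Input Book:")
    print(books.loc[books['ISBN'] == isbn, 'Book-Title'].values[0])
    print("\nRECOMMENDATIONS:\n")
    #Walk through all books from most to least similar
    seen = set()
    for idx in np.argsort(-pairwiseSimilarity[row]):
        rec_isbn = indexMap[idx]
        if rec_isbn == isbn:
            continue  # skip the input book itself
        title = books.loc[books['ISBN'] == rec_isbn, 'Book-Title']
        if title.empty or title.values[0] in seen:
            continue  # skip unknown ISBNs and duplicate titles
        seen.add(title.values[0])
        print(title.values[0])
        if len(seen) == k:
            break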

Thank you for reading till the end! Please upvote if you liked the explanation!

You can get the full code in the link below.
