My kind of BOOKs

Published in Web Mining [IS688, Spring 2021], Mar 9, 2021

A step-by-step guide for creating a book recommendation system using cosine similarity in Python.

Reading is an inevitable part of human life, and books play an important role in it. Books allow readers to learn, imagine, and feel numerous emotions without moving their feet. The way books can transform one’s life is very intriguing to me.

Online recommendation systems already help people choose what suits them; for example, a good movie recommender helps viewers find movies of their choice. But there has not been an equally good recommendation system for readers who love books.

How long have you spent thinking about which book to read next? Or how many times have you wondered whether to choose fiction or non-fiction, humor or romance?

With my inclination towards exploring books, I decided to create a book recommender system. Countless books have become popular over time. As human beings, we can’t read them all, but we can at least decide which books we would prefer to read. In this post, we will analyze the data and figure out the relevant books for a reader.

The following topics will be covered in this blog:

Data cleaning and preparation
Analysis of the book data
Creation of two recommender systems: Content-based recommender & Collaborative Filtering.

So let us begin.

Python Libraries and dependencies

For this project, we need to install and import the following libraries:

# Import libraries and packages
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from surprise import SVD, Reader, Dataset
from surprise.model_selection import cross_validate
from surprise.model_selection import KFold
import requests
from PIL import Image
from io import BytesIO

Pandas and NumPy are used for data preprocessing and basic linear algebra. Seaborn and Matplotlib help create plots and bar charts for the dataset. The Image and BytesIO packages, together with requests, let us download each book's image_url and display the cover as output.

For model creation and feature extraction, scikit-learn is used. TfidfVectorizer creates the matrix of features, while cosine_similarity computes the similarity score between the feature vectors.

The Surprise library, a Python scikit (installed as scikit-surprise), helps us build and analyze recommender systems. It provides many prediction algorithms, such as SVD, as well as similarity measures.

We are going to use the SVD algorithm, which is equivalent to probabilistic matrix factorization when baselines are not used. It allows us to discover the latent features underlying the interactions between users and items.
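For reference, the prediction rule that the Surprise documentation gives for its SVD algorithm can be written as below. The symbols are the usual ones and are not defined elsewhere in this post: \mu is the global mean rating, b_u and b_i are the user and item biases, and p_u and q_i are the latent factor vectors of user u and item i.

```latex
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{T} p_u
```

Dropping the bias terms leaves only the dot product q_i^T p_u, which is the plain matrix factorization form.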

1. Data Collection

To create a good book recommender system, I will be using the well-known Kaggle dataset GoodBooks-10k. This dataset contains ratings for ten thousand popular books from Goodreads.com. The books are rated on a scale of one to five. The dataset contains five CSV files, but I will be using four of them: books.csv, book_tags.csv, tags.csv, and ratings.csv. There are various other datasets available on the internet that you can use to create this recommender system.

It is vital to understand the data before creating any machine learning model. Data exploration reveals hidden trends and insights, and data preprocessing makes the data ready for a machine learning algorithm.

After loading all the datasets, check the shape of the books, ratings, tags, and book_tags datasets. We will discuss the individual datasets later in the article.

Check for missing values in the datasets. There are missing values in the books dataset, which need to be removed for better analysis.
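The checks described above can be sketched roughly as follows. This is only an illustration on a tiny made-up table: the values are invented, and with the real data you would load books.csv with pd.read_csv instead (the file path is an assumption about your setup).

```python
import numpy as np
import pandas as pd

# Toy stand-in for books.csv (values are invented for illustration).
# With the real data: books = pd.read_csv("books.csv")  # path is an assumption
books = pd.DataFrame({
    "book_id": [1, 2, 3, 4],
    "original_title": ["The Hobbit", None, "Emma", "Dune"],
    "original_publication_year": [1937.0, 1997.0, np.nan, 1965.0],
})

# Shape is (number of rows, number of columns).
print(books.shape)

# Count the missing values in every column.
missing = books.isnull().sum()
print(missing)

# Drop every row that has at least one missing value.
books_clean = books.dropna().reset_index(drop=True)
print(books_clean.shape)
```

The same three calls (shape, isnull().sum(), dropna()) apply unchanged to the ratings, tags, and book_tags tables.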

Let’s explore each dataset one by one.

1. Books

The books dataset contains all the information about the books, such as book_id (a unique id for every book), title, authors, etc. goodreads_book_id and best_book_id generally point to the most popular edition of a given book. I have used image_url to fetch each cover image as output. There are a few columns that we won't need, such as ISBN, work_id, ratings_1, ratings_2, ratings_3, ratings_4, ratings_5, small_image_url, etc., so we simply drop them.
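Dropping the unused columns might look like the sketch below. The toy table and the exact column spellings (for example the lowercase isbn) are assumptions; errors="ignore" makes the call safe if some of the listed columns are absent.

```python
import pandas as pd

# Toy stand-in for the books table (values are invented for illustration).
books = pd.DataFrame({
    "book_id": [1, 2],
    "title": ["Book A", "Book B"],
    "isbn": ["111", "222"],
    "work_id": [10, 20],
    "ratings_1": [5, 7],
    "small_image_url": ["a_small.jpg", "b_small.jpg"],
    "image_url": ["a.jpg", "b.jpg"],
})

# Columns that are not needed for the recommender (spellings are assumed).
unused = ["isbn", "work_id", "ratings_1", "ratings_2", "ratings_3",
          "ratings_4", "ratings_5", "small_image_url"]

# errors="ignore" skips any listed column that is not in the table.
books = books.drop(columns=unused, errors="ignore")
print(list(books.columns))
```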

2. Ratings

The ratings dataset has book_id, user_id, and rating. Each of these is important for our further analysis. The minimum number of ratings is 8 per book and 2 per user.
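Figures such as the minimum number of ratings per book and per user can be obtained with a groupby. The sketch below uses a few invented rows, so its numbers will not match the real dataset.

```python
import pandas as pd

# Toy stand-in for ratings.csv (values are invented for illustration).
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2, 3],
    "book_id": [10, 20, 10, 20, 30, 10],
    "rating":  [5, 4, 3, 5, 4, 2],
})

# Number of ratings received by each book and given by each user.
ratings_per_book = ratings.groupby("book_id")["rating"].count()
ratings_per_user = ratings.groupby("user_id")["rating"].count()

print("min ratings per book:", ratings_per_book.min())
print("min ratings per user:", ratings_per_user.min())
```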

3. Tags & Book_tags

Tags are a very important attribute of this recommender system, and we will use them in our later analysis. Tags are a kind of metadata that represents the genre or kind of book. For example, fiction, non-fiction, and books from amazon all come under tags. tag_id is the unique id of each tag. We will merge these two datasets with the books dataset.
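One possible way to do this merge is sketched below on toy tables. It assumes that book_tags links goodreads_book_id to tag_id and that the goal is a single space-separated tags string per book; the name of the helper table tags_per_book is made up for this example.

```python
import pandas as pd

# Toy stand-ins for books.csv, tags.csv and book_tags.csv (invented values).
books = pd.DataFrame({"goodreads_book_id": [100, 200],
                      "title": ["Book A", "Book B"]})
tags = pd.DataFrame({"tag_id": [1, 2, 3],
                     "tag_name": ["fiction", "fantasy", "humor"]})
book_tags = pd.DataFrame({"goodreads_book_id": [100, 100, 200],
                          "tag_id": [1, 2, 3]})

# Attach the readable tag_name to every (book, tag) pair.
tagged = book_tags.merge(tags, on="tag_id", how="inner")

# Collapse all tag names of a book into one space-separated string.
tags_per_book = (tagged.groupby("goodreads_book_id")["tag_name"]
                 .apply(" ".join)
                 .reset_index(name="tags"))

# Add the combined tags as a new column of the books table.
books = books.merge(tags_per_book, on="goodreads_book_id", how="left")
print(books)
```

A single text column per book is convenient later on, because a TF-IDF vectorizer expects one string per document.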

2. Exploratory Data Analysis and Findings

a) Let us look at the unique publication years of the books. It is surprising to see a book from the year 119, but yes, it exists. The Twelve Caesars has an average rating of 4.05, which is above the benchmark. This is how we can dig up information about various books and surprise ourselves!

b) Publication year

From the plot, we can see that most of the books are from the year 2000, which implies that the dataset is dominated by recent books.

c) Distribution of ratings of books

We can see that most books in the dataset are highly rated, with ratings mostly between 3 and 5. From this, we made two hypotheses:
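A quick numeric check of such a distribution could look like the sketch below. The ten ratings are invented, so the resulting share is only an illustration and says nothing about the real dataset.

```python
import pandas as pd

# Toy ratings on the 1 to 5 scale (invented values).
ratings = pd.Series([5, 4, 4, 3, 5, 2, 4, 1, 5, 4], name="rating")

# How many times each rating value occurs, ordered from 1 to 5.
distribution = ratings.value_counts().sort_index()
print(distribution)

# Share of ratings that fall between 3 and 5 (both ends included).
share_3_to_5 = ratings.between(3, 5).mean()
print(share_3_to_5)
```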

Readers usually vote only for the books they like.
Readers are biased towards positive reviews of books.

d) Which tags are most popular?

Other than fiction, various other tags such as kindle, favorites, and to-read are popular among readers.

We can do more analysis on the dataset, such as the most rated books, the correlation between features, etc.

After viewing a few analyses of the datasets, let's move on to recommender systems. Before that, I would like to briefly describe the query entities of interest that I will be working with, and why.

Three query entities to generate a list of top 10 books!

Tag_name and Title
User ratings

Tag_name and Title entities will come under content-based recommendation systems and user ratings will come under collaborative-based recommendation systems.

3. Recommender Systems

To achieve our goal, we are going to reproduce the two most popular recommender systems: Content-based filtering and Collaborative Filtering.

“A recommender system, or a recommendation system (sometimes replacing ‘system’ with a synonym such as a platform or engine), is a subclass of information filtering system that seeks to predict the “rating” or “preference” a user would give to an item.”

3.1 Content-Based Filtering

“Content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback.”

The content-based recommender suggests similar items based on a particular item; it is also known as an item-based recommender. This system uses metadata such as tags, the description of a book, the author name, etc. to make these recommendations.

How does it work?

We will be using user-provided tags for book suggestions. For example, if a reader previously liked books tagged fiction, humor, and magic, then the recommender will suggest books with similar tags. So tags are the content that the system works with: given a book_title, it returns books with similar tags.

For example, if a user gives the book_title “The Fault in Our Stars”, which falls under the romantic and teenage love tags, the system will recommend books that are romantic, about teenage love, or both.

I have chosen Title and Tag_name as entities of interest because both of these cover vast aspects of any book.

The system will find similarities between pairs of books and then generate a list of recommendations from the most similar books. To find these similarities, we use the tag_names of the books and compute a TF-IDF (Term Frequency-Inverse Document Frequency) vector for each document (each list of tag_names and titles).

TF-IDF comes from the field of Natural Language Processing. It is a measure for evaluating the importance of words in a document. The importance of a word increases proportionally to the number of times it appears in the document, but is offset by the number of documents that contain the word, so words that appear everywhere count for less. This gives a matrix where each column represents a word in the tags/titles vocabulary and each row represents a book.

Using this matrix, we calculate the cosine similarity between two vectors, where each vector contains the keywords (tags) that define a book. The cosine similarity gives the similarity score between the feature vectors.
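To make the shape of this matrix concrete, here is a small sketch with three invented tag strings standing in for books. It uses scikit-learn's TfidfVectorizer with default settings, which may differ from the settings used on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Each string is the tag list of one book (invented for illustration).
tag_docs = [
    "fantasy adventure classics",
    "fantasy magic adventure",
    "romance contemporary",
]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(tag_docs)

# One row per book, one column per distinct word in the vocabulary.
print(matrix.shape)
print(sorted(tfidf.vocabulary_))

# In the first book, the rarer tag gets a larger weight than the common one.
first_row = matrix.toarray()[0]
print(first_row[tfidf.vocabulary_["classics"]],
      first_row[tfidf.vocabulary_["fantasy"]])
```

Here "classics" appears in only one book while "fantasy" appears in two, so the inverse document frequency part pushes the weight of "classics" higher.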

Cosine similarity function:
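For two vectors A and B with n components each, the standard definition is:

```latex
\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
             = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
```

Here A \cdot B is the dot product and \lVert A \rVert is the Euclidean length of A.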

By applying this function, and because TF-IDF weights are never negative, the similarity will be a number between 0 and 1 that tells us how similar the two vectors are to each other.

“Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is defined to equal the cosine of the angle between them, which is also the same as the inner product of the same vectors normalized to both have length 1.”

I have used cosine similarity because even if two similar documents are far apart by Euclidean distance because of their size (say, the word ‘fiction’ appears 20 times in one document and 10 times in another), they can still have a small angle between them. The smaller the angle, the higher the similarity!
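The point about document size can be checked with a tiny NumPy sketch. The two count vectors are invented: the second document uses the same tags as the first, just half as often.

```python
import numpy as np

# Counts of the tags ("fiction", "magic") in two documents (invented values).
doc_a = np.array([20.0, 4.0])
doc_b = np.array([10.0, 2.0])

# Euclidean distance grows with the difference in document size.
euclidean = np.linalg.norm(doc_a - doc_b)

# Cosine similarity only looks at the direction of the vectors.
cosine = doc_a @ doc_b / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))

print(euclidean, cosine)
```

The two vectors point in exactly the same direction, so their cosine similarity is as high as it can be even though the Euclidean distance between them is large.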

Code: Function for recommending books using tag_names

Step 1: Implement a function to find similar books given a title as input

# Function to get the index of the book given its title.
# Assumes books has a default 0..n-1 index, so labels match row positions.
def get_book_id(book_title):
    index = books.index[books['original_title'] == book_title].to_list()
    if index:
        return index[0]
    else:
        return None

# Function to get the title of a book given its id.
def get_book_title(book_id):
    title = books.iloc[book_id]['original_title']
    return title

# Function that takes the book title and shows the most similar books.
def get_similar_books(title, n=10):
    # Get the book id
    book_id = get_book_id(title)
    if book_id is None:
        print("Book not found.")
    else:
        # Get the pairwise similarity scores of all books with that book
        book_similarities = list(enumerate(similarities[book_id]))
        # Sort the books based on the similarity scores
        book_similarities = sorted(book_similarities, key=lambda x: x[1], reverse=True)
        # Keep the n most similar books, skipping the book itself at position 0
        most_similar_books = book_similarities[1:1+n]
        book_indices = [i[0] for i in most_similar_books]
        # Top n book recommendations
        rec = books[['title', 'image_url']].iloc[book_indices]
        print("For this book we will recommend you:\n")
        # Download and display the cover of each recommended book
        for i in rec['image_url']:
            response = requests.get(i)
            img = Image.open(BytesIO(response.content))
            plt.figure()
            plt.imshow(img)
            plt.show()

Step 2: Extract the features into a TF-IDF matrix

tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(books['tags'])

Step 3: Calculate the cosine similarity scores

similarities = cosine_similarity(tfidf_matrix, tfidf_matrix)

Step 4: Run the function

get_similar_books("The Hobbit")

Understanding the code: tfidf_matrix holds the features extracted from the book tags. The similarities variable is a NumPy array that stores the cosine similarity scores, from which we fetch the indices of the books most similar to the input book.

get_similar_books() takes book_title as input and suggests books with similar tag names.

“The Hobbit” is one of my favorite books. I would like to see the recommendations for it.
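To try the same idea without the dataset or the cover images, the whole pipeline can be sketched on four invented books. The names toy_books, sims, and top_similar are made up for this example and are not part of the code in this post.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Four invented books with their tag strings.
toy_books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "tags": ["fantasy adventure dragons",
             "fantasy adventure magic",
             "romance contemporary",
             "fantasy magic"],
})

# TF-IDF features, then the full book-by-book cosine similarity matrix.
sims = cosine_similarity(TfidfVectorizer().fit_transform(toy_books["tags"]))

def top_similar(title, n=2):
    """Return the titles of the n books most similar to the given title."""
    idx = toy_books.index[toy_books["title"] == title][0]
    # Sort all books from most to least similar, then skip the book itself.
    order = [i for i in np.argsort(-sims[idx]) if i != idx][:n]
    return toy_books["title"].iloc[order].tolist()

print(top_similar("Book A"))
```

Returning plain titles instead of plotting images keeps the sketch easy to test; the ranking logic is the same sort by similarity score.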

These results are great!!

3.2 Collaborative-based Filtering

“Collaborative-based filtering filters information by using the interactions and data collected by the system from other users. It’s based on the idea that people who agreed in their evaluation of certain items are likely to agree again in the future.”

We will use user ratings to recommend books to the user. For example, suppose you and your friend have quite similar likes and dislikes in books. Whenever you are in doubt, you can ask your friend to recommend a book to read. That is what this recommender will help us do.

An example of collaborative filtering based on a rating system

Code: Function for recommending books using user-ratings

Step 1: Implement the function to recommend books using user_id

def recommend_books(user_id):
    # Get the ids of all the books that the user has already rated.
    rated_books = df_ratings.loc[df_ratings['user_id'] == user_id, 'book_id']
    # Start from all titles (df_titles is assumed to be indexed by book_id
    # and to hold a 'title' column).
    user = df_titles.copy()
    user = user.reset_index()
    # Keep only the books that the user has not yet rated.
    user = user[~user['book_id'].isin(rated_books)].copy()
    # Predict the user's rating for each remaining book with the fitted SVD model.
    user['estimate_score'] = user['book_id'].apply(lambda x: svd.predict(user_id, x).est)
    # Sort the books by the estimated score and return the top 10.
    user = user.drop('book_id', axis=1)
    user = user.sort_values('estimate_score', ascending=False)
    return user.head(10)

Step 2: Load the dataset and run the SVD model. Use K-fold cross-validation to validate the model.

reader = Reader()
data = Dataset.load_from_df(df_ratings[['user_id', 'book_id', 'rating']], reader)
svd = SVD()
# Run 5-fold cross-validation and then print the results
kf = KFold(n_splits=5)
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=kf, verbose=True)
# Fit the model on the full dataset before making predictions
trainset = data.build_full_trainset()
svd.fit(trainset)

Step 3: Give the desired user_id to the function

user_id = 127
recommend_books(user_id)

Output:

Voilà! User 127 now has these book recommendations. Isn’t this cool?

4. Limitations & Conclusion

A few conclusions I could draw from these two recommender systems are:

1. By analyzing users’ current choices and previous history, recommender systems benefit industries a lot by providing useful and relevant information.
2. Limitations of the content-based recommendation system:

a) New users won’t have enough information to build a profile.

b) On the other hand, these systems are capable of suggesting books that are related to other books without user profiles.

3. Limitations of the collaborative-based recommendation system:

a) These recommendation systems find it difficult to handle new entries, because they were not present in the dataset when the model was trained. That is why the system cannot recommend them.

In the end, recommender systems are a very interesting way to give suggestions in any field, such as movies, videos, articles, etc. The two recommendation systems gave quite good results and the main goal was achieved.

Areas like evaluating the recommender system and the train-test split have not been covered, but they are worth exploring. I hope you liked the blog. While you try this, I will grab my newly suggested book and read it!