Finding Relationships and Similarities Within Books: The Foundation for a Book Recommendation Tool

Prince Okpoziakpo
INST414: Data Science Techniques
16 min read · May 17, 2023

By: Ben Griffith, Daniel Gonzalez, and Prince Okpoziakpo

Introduction

Books are an essential part of human society. They have the power to educate, enlighten, and entertain us. They allow us to explore new worlds, experience new perspectives, and learn from the experiences of others. Reading books can help us develop empathy, critical thinking, and creativity, making them an invaluable resource for personal growth. However, selecting a good book can be difficult: as of 2022, there were over 120 million books in the world. People clearly need help finding books they will actually enjoy; the very existence of the librarian is a testament to this. That is why our group set out to use machine learning techniques and a range of similarity measures to lay the groundwork for a book recommendation tool that serves readers at every stage. The methods used throughout our analyses can then be applied in future technologies designed to recommend books to readers based on their preferences.

Who can benefit from our work?

Our project caters to a diverse audience, including publishers, writers, business owners, and everyday readers. Publishers and business owners can use our work to make data-driven decisions. Writers can treat it as a resource for honing their craft, gaining exposure, and connecting with readers. Everyday readers and book enthusiasts are the project's primary audience. In summary, the project serves as a platform for publishers to promote their books, for writers to showcase their work, for business owners to engage a targeted audience, and for everyday readers to immerse themselves in a community-driven reading experience. For this analysis, we will focus on the benefits for the average reader.

Data Gathering

Our data collection process began with gathering a list of International Standard Book Numbers (ISBNs) so that we could make application programming interface (API) requests. We started here because we wanted to use a book-focused API to build our own dataset around the different types of analysis we planned to attempt. To that end, we combined a Kaggle dataset with the Google Books API. The Kaggle dataset consisted of book data originally pulled from the Goodreads API. We initially intended to use the Goodreads API ourselves, but after contacting Goodreads by email, we learned that they no longer issue API read or write keys to new developers. This was unfortunate, because the Goodreads API would have provided many of the features we needed to populate our dataset for analyses of both numeric and non-numeric values. Due to this roadblock, we started with the Kaggle dataset, which holds 11,128 unique books as rows and 11 features as columns, in a file of about 1.48 MB. While this data was clean, it was not very useful on its own, so we subset it for use with the Google Books API.

Sample of the Kaggle Dataset containing Goodreads Data

After subsetting the Kaggle dataset into two Pandas data frames, one for the ISBN values and another for the ISBN-13 values, we iterated through them to pull book data via the Google Books API. We initially tried to gather every feature we needed this way, but after tabulating the results, we noticed that not every column was usable. For example, the genres column would return only one genre per book, two if we were lucky, and none if we were not. In addition, only about one-eighth of the books came back with a description. With these two issues alone, similarity comparisons such as Jaccard similarity and sentiment analysis would not have been very meaningful. During this retrieval, our dataset was also reduced to 3,287 books, because not every ISBN and ISBN-13 from the Kaggle dataset matched a record in the Google Books database. However, for each of the 3,287 books that were retrieved, the JSON returned by the Google Books API included an identification value in addition to the ISBN or ISBN-13 we had already provided. We then created a subset data frame of these new identification values to use in a final round of data gathering.
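As a rough illustration of this step, a lookup by ISBN against the public Google Books volumes endpoint could look like the sketch below; the data frame and column names (isbn_df, "isbn13") are stand-ins for whatever our real subset frames were called.

```python
import requests

def fetch_book_by_isbn(isbn: str):
    """Look up a single ISBN against the public Google Books volumes endpoint."""
    url = "https://www.googleapis.com/books/v1/volumes"
    resp = requests.get(url, params={"q": f"isbn:{isbn}"})
    data = resp.json()
    if data.get("totalItems", 0) == 0:
        return None  # not every Kaggle ISBN matches a Google Books record
    # Each match carries a Google Books volume id alongside the usual metadata.
    return data["items"][0]

# Iterate over the subset frame of ISBN values (column name is illustrative).
records = [fetch_book_by_isbn(i) for i in isbn_df["isbn13"]]
matched = [r for r in records if r is not None]
```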

At this point, we had run out of conventional options for gathering data within the scope of our project. With our subset data frame of identification values, we turned to web scraping to create our final dataset, using Selenium as our web driver. Upon inspection, the Google Books index provided web pages for the books we had already retrieved through the Google Books API, and many more. We noticed that we could pass each identification value into the Google Books index URL, effectively treating the page as an endpoint for the scraper. We then wrote a function to iterate through the subset of identification values: it opens the Google Books page for each book, searches through the HTML, and pulls whatever features are available.

Sample of Web-Scraping Code
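Since the code itself was shared as an image, here is a minimal sketch of what such a scraper might look like. The URL pattern is the public Google Books one, but the CSS selector is a placeholder; the real page structure would have to be inspected, and it can change without notice.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

def scrape_volume(driver, volume_id: str) -> dict:
    """Open the Google Books page for one volume id and pull what is available."""
    driver.get(f"https://books.google.com/books?id={volume_id}")
    features = {"google_id": volume_id}
    try:
        # Placeholder selector -- the real one depends on the page's HTML.
        features["description"] = driver.find_element(
            By.CSS_SELECTOR, "div.description").text.strip()
    except NoSuchElementException:
        features["description"] = None  # keep going when a field is missing
    return features

# id_df["google_id"] stands in for our subset frame of identification values.
driver = webdriver.Chrome()
rows = [scrape_volume(driver, vid) for vid in id_df["google_id"]]
driver.quit()
```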

This finally gave us rows and features filled with meaningful data. For example, it allowed us to pull each book's genre and subject, as well as descriptions for almost all of the books, significantly reducing the null values in the dataset overall. At this point, we returned to the Google Books API to gather rating and rating-count values for each book: we requested book data using the subset of identification numbers and built a data frame of identification numbers, average ratings, and rating counts. Lastly, we merged this new data frame with the one created from web scraping, joining on the identification values with a left merge, as sketched below. This resulted in a dataset with 3,287 books as rows and 22 columns containing all the book features. The final dataset's file size was roughly 10.8 MB, and a sample is shown below.
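The merge itself was a standard Pandas left join; the frame and column names here are illustrative.

```python
import pandas as pd

# Keep every scraped book, attaching ratings where the API returned them.
final_df = scraped_df.merge(
    ratings_df[["google_id", "average_rating", "ratings_count"]],
    on="google_id",
    how="left",
)
```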

Combined Kaggle and Google Books API Dataset

Data Cleaning

Before we could use our data for any analysis, we needed to clean it. The gathering process included some cleaning in itself: for example, the genre data found during web scraping was stripped of whitespace and stored in the genres column as a list, to allow for easier analysis later on. After these steps, we wrote a function, along with snippets of code found throughout our repository, to prep the data as needed for each analysis. For example, our “preprocess_text” function takes in text, first tokenizes it, then removes all stop words, and finally lemmatizes the remaining words. Lemmatization normalizes the words by reducing them to their base forms, allowing for a fairer comparison in each analysis. Depending on the task at hand, a snippet of code tailored to the specific analysis was used instead.

Sample Data Cleaning Code
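The actual cleaning code lives in our repository; the sketch below shows the tokenize / stop-word / lemmatize pipeline described above, using NLTK's default English resources.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads for the NLTK resources used below.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess_text(text: str) -> list:
    """Tokenize, drop stop words and punctuation, then lemmatize."""
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]
    return [LEMMATIZER.lemmatize(t) for t in tokens]

preprocess_text("The dragons were flying over the ancient cities.")
# -> ['dragon', 'flying', 'ancient', 'city']
```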

Data Analysis

The features bookID, title, authors, average_rating, ISBN, ISBN-13, ratings_count, and publisher are all useful for training machine learning models for book recommendations. These features can be used to identify similar books, find more information about books, determine the popularity of books, and find other books from the same publisher. Additionally, training a classifier based on the words present in the description of a book can be used to identify books that are similar to a book that the user has already read. This can be helpful for recommending books to users who are looking for something new to read. In addition to the features listed above, other features that could be used to train machine learning models for book recommendation include the genre, publication date, number of pages, language, reviews, and price of the book. By using a variety of features, machine learning models can be trained to recommend books that are likely to be enjoyed by the user.

Euclidean Similarity Analysis

One method we explored as a potential way to recommend books was Euclidean distance. The idea is to encode each book and its relevant features as a numerical vector, then calculate the distance between the books' vectors; the smaller the distance, the more similar the books. Early on, we conducted this analysis on the original dataset, even though it was missing much of its data in columns such as genre, ratings, and authors. For this first analysis, we removed all rows with NaN values in the description, page_count, or ratings_count columns by finding the index of each such row and dropping it. We then noticed that no book rated 5 had more than about 10 ratings, so we dropped all books with an average rating of 5. Next, we removed the stop words from the descriptions so they would not affect the analysis. With the data ready, we used pandas' get_dummies() function to convert the books' values into binary indicator columns, then used sklearn's DistanceMetric class to compute Euclidean distances between the books. For this test, we looked for the five most similar books to the highest-rated book in our dataset. Unfortunately, the most similar book returned a Euclidean distance of over 3,000, meaning it was not very similar to the target book at all. It was clear we could not perform a successful analysis until our books' features were more fully populated.
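A minimal sketch of this first pass, assuming books_df is the working data frame and the selected columns are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import DistanceMetric  # sklearn.neighbors in older releases

# One-hot encode the categorical features into binary indicator columns.
encoded = pd.get_dummies(books_df[["authors", "publisher", "language_code"]])

# Pairwise Euclidean distances between every pair of books.
dist = DistanceMetric.get_metric("euclidean")
distances = dist.pairwise(encoded.values.astype(float))

# Five nearest neighbors of the highest-rated book.
target = int(books_df["average_rating"].to_numpy().argmax())
nearest = np.argsort(distances[target])[1:6]  # skip position 0, the book itself
print(books_df.iloc[nearest][["title", "authors"]])
```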

Fortunately, our web scraping yielded much better data about the books, including genre, subject, and description. With this improved data, we conducted a new Euclidean distance analysis to see whether it could still be an option for recommendations. This time, we measured similarity using the following features: title, page count, publishing date, publisher, language, author, genres, subjects, and description. Once we had these features selected in a data frame, we used pandas' dropna() function to drop all rows with NaN values. We again used get_dummies() to convert the features to numerical values, but this time we also clustered the points to see whether books recommended as similar fell into the same clusters. For the clustering, we used TruncatedSVD to reduce each book to a two-dimensional point, used the elbow method to determine that there should be 5 clusters, and plotted the result, as seen below.

Elbow Method for k-Means Clustering
Color-Coded Cluster Scatter Plot
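A sketch of the clustering step, assuming encoded is a dummy-encoded feature matrix like the one in the distance sketch above:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

# Reduce each book's one-hot vector to a 2-D point for plotting.
svd = TruncatedSVD(n_components=2, random_state=42)
points = svd.fit_transform(encoded.values.astype(float))

# Elbow method: inertia for k = 1..10; look for where the curve bends.
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(points).inertia_
            for k in ks]
plt.plot(ks, inertias, marker="o")
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()

# The elbow suggested k = 5; fit the final model and color-code the clusters.
labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(points)
plt.scatter(points[:, 0], points[:, 1], c=labels)
plt.show()
```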

We then computed Euclidean distances again and returned the five most similar books to a new target book. The results were better this time, but still poor: the most similar book recorded a distance of over 1,800, and it was not in the same cluster as the target book. The biggest similarities we could find among these six books were that all were written in English, all were printed in paperback, and five of the six shared a publisher. Based on these results, using Euclidean distance to produce recommendations from this data may be counterproductive, as it does not truly recommend books based on content similarities. The method could still serve as a simple recommendation system, though, perhaps paired with a ranking of how similar each book is to the target so that the user can judge the recommendations themselves. An example is shown below.

Similarity Score for Some Sample Books

Decision Tree Analysis

We also trained a Decision Tree classifier on the descriptions of books. Decision trees can be trained to classify books based on the words that are present in the book description by recursively splitting the data into two or more subsets based on the most important words. The most important words are determined by their information gain, which is a measure of how much information each word provides about the category of the book. The process is repeated until each subset contains only books of the same category. This approach has several advantages. First, it is a simple and efficient way to classify books, with a low barrier to understanding. Second, it can be used to classify books into a wide variety of categories. Third, it can be used to classify books even if the book descriptions are incomplete or inaccurate. However, this approach also has some disadvantages. First, it can be difficult to choose the most important words to split the data on. Second, the decision tree can be sensitive to noise in the data, which can come in the form of stop words and special characters. Third, the decision tree can be difficult to interpret.
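For reference, information gain has a standard definition in terms of entropy; this is the textbook formulation rather than anything specific to our code. For a set of books $S$ and a candidate split word $w$:

```latex
H(S) = -\sum_{c} p_c \log_2 p_c
\qquad
\mathrm{IG}(S, w) = H(S) - \sum_{v \in \{\text{present},\, \text{absent}\}} \frac{|S_v|}{|S|}\, H(S_v)
```

Here $p_c$ is the fraction of books in $S$ belonging to category $c$, and $S_v$ is the subset of books where $w$ is present or absent. Splitting on the word with the highest gain yields the purest child subsets.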

The steps we took to gain insights into our dataset with a Decision Tree classifier were as follows: data preprocessing, feature extraction, model training and testing, and model evaluation. The first step in any data analysis project is to preprocess the data. We converted the description of each book into a vector using the bag-of-words approach, where each document is represented as a vector of word counts. The vocabulary set, the union of all words across the documents, had a cardinality of 21,927. The second step was extracting the target column, 'genres'. After extracting it, we converted it into a multi-label target represented as a matrix, since each book can be assigned multiple genres. The third step was training the Decision Tree classifier: we trained it on 80% of the data (the training set), tested it on the remaining 20% (the test set), and evaluated the model's performance.
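A condensed sketch of these steps with scikit-learn, assuming books_df holds the preprocessed descriptions and genres stored as lists (the column names are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.tree import DecisionTreeClassifier

# Bag-of-words features: one count column per vocabulary word (~21,927 here).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(books_df["description"])

# Multi-label target: each book can carry several genres at once.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(books_df["genres"])

# 80/20 train/test split, then fit the tree and report exact-match accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("exact-match accuracy:", clf.score(X_test, y_test))
```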

After completing these steps, we used our cleaned dataset and trained model to gain some insights into the data. The first was the prevalence of specific genres: Fiction, Fantasy, Romance, Novel, and Mystery were the five most popular genres in our dataset, as the word cloud visualization shows. This tells us, at least proportionally, which genres the population of readers is most interested in. As for insights from the Decision Tree classifier itself, we discovered subsets of genres defined by the presence of specific keywords in the books' descriptions. The diagram below shows how the classifier partitioned the dataset to maximize information gain and reduce entropy. The visualization also revealed high variance in the dataset: our books are diverse, with largely dissimilar content. We would therefore need to build user profiles in order to detect when a user shows tendencies correlated with the subsets defined by the classifier.

Decision Tree Classifier

Text-Focused Similarity Measures

In this portion of our analysis, we wanted to investigate the relationships between books based on their genres and descriptions. To do this, we computed Jaccard similarity over each book's genres and applied sentiment analysis to each book's description. Together, these similarity and sentiment values let us provide some valuable insights.

Both analyses were done separately, but before applying either one we needed to prep the strings of text in each column. The cleaning process included removing duplicate books based on their titles and dropping any row with null values in either the genre or description column, depending on which analysis was being run. This ensured that empty strings or lists were not considered when making comparisons. Next, for each analysis, we used techniques from the NLTK library: tokenizing the text, removing common stop words, and lemmatizing the words for normalization. This preprocessing step was crucial to the accuracy and quality of the results.

Moving on, the Jaccard similarity analysis examined the relationships between books based on their genres. The Jaccard similarity coefficient quantifies the overlap between two books' genre sets: the size of the intersection (common genres) divided by the size of the union (combined unique genres). This allowed us to identify books with similar genre characteristics, providing insight into potential relatedness and enabling recommendations based on genre preferences.
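The computation itself is only a few lines; a sketch over genre sets:

```python
def jaccard_similarity(genres_a: set, genres_b: set) -> float:
    """|A ∩ B| / |A ∪ B|: 1.0 for identical genre sets, 0.0 for disjoint ones."""
    union = genres_a | genres_b
    if not union:
        return 0.0  # guard against two books with no listed genres
    return len(genres_a & genres_b) / len(union)

jaccard_similarity({"Fiction", "Fantasy", "Adventure"},
                   {"Fiction", "Fantasy", "Romance"})
# -> 2 shared / 4 unique = 0.5
```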

The sentiment analysis, on the other hand, examined the emotional tone expressed in the book descriptions. By applying sentiment analysis techniques, we assessed the sentiment polarity of each description, capturing the overall positive, neutral, or negative tone of the text: the most positive descriptions return a polarity close to 1.00, neutral ones close to 0.00, and negative ones close to -1.00. To obtain these values, we used the TextBlob Python library on the preprocessed text. In the future, this could give us a deeper understanding of readers' preferences, letting us tailor book recommendations to emotional preferences as well.
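Obtaining a polarity score with TextBlob is nearly a one-liner; a small sketch:

```python
from textblob import TextBlob

def description_polarity(description: str) -> float:
    """Polarity in [-1.0, 1.0]: below zero is negative, near zero is neutral."""
    return TextBlob(description).sentiment.polarity

description_polarity("A heartwarming, joyful story of friendship.")   # > 0
description_polarity("A bleak and tragic tale of loss and despair.")  # < 0
```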

To demonstrate these analyses, we tested a few applications. For the Jaccard similarity analysis, we examined which books were most similar to one another: for example, we chose a book and printed the ten books most similar to it based on genre.

Similarity Score Sample for ‘The Flying Sorcerers’

Then, for the sentiment analysis, we printed the ten books most negative in sentiment.

Sample of Books with Negative Sentiment

We also printed the ten books most positive in sentiment.

Challenges and Limitations

The data-gathering process faced several challenges and limitations, impacting the quality and comprehensiveness of the dataset. Initially, the outdated, effectively discontinued Goodreads API posed difficulties, requiring a creative approach to find alternatives. This led us to the Kaggle dataset and to the Google Books API as substitute routes to book data. However, the Google Books API presented issues with incomplete data, such as limited genre information and a significant number of books lacking descriptions. Mismatched ISBNs further reduced the dataset's size and potentially introduced bias.

To overcome the limitations of the Google Books API, we turned to web scraping with Selenium. Although it provided more meaningful data, web scraping introduced complexity and the risk of errors from website structure changes. These challenges highlight the need for careful interpretation of the dataset, acknowledging possible errors or biases introduced during data collection and cleaning.

During the analysis, the limited availability of scraped data hindered the project’s success. The missing genre and description data, crucial for comparing book similarity and recommending reads, restricted the analysis. Acquiring a fuller dataset earlier would have allowed for a more thorough analysis and the development of a more structured recommendation system. Furthermore, the abundance of book attributes complicated the decision-making process, as some had to be prioritized over others. This complexity extended to the decision tree, increasing the difficulty of training the model due to the large number of branches resulting from the diverse attributes.

Conclusion

In this project, we set out to build the foundation for a book recommendation tool using various machine-learning techniques. While we encountered challenges and limitations in terms of data availability and completeness, we were able to gain insights into the potential of different analyses for book recommendations.

Despite the limitations, our analyses demonstrated the potential of using machine learning techniques to recommend books based on various features such as descriptions, genres, and similarity measures. The decision tree analysis showed promise in classifying books based on their descriptions, while Euclidean distance and Jaccard similarity analyses provided insights into book similarities. However, it is important to note that the accuracy and effectiveness of the recommendations heavily rely on the quality and completeness of the data.

Overall, this project serves as a valuable stepping stone toward developing a more sophisticated book recommendation tool. Further work can be done to address the limitations we encountered, such as gathering more comprehensive data, improving feature availability, and refining the machine learning models. By addressing these challenges, we can strive to create a robust and accurate book recommendation tool that caters to the diverse needs of readers, publishers, writers, and business owners.

Appendix: Work Distribution

The work in this project was distributed among the team members as follows:

Ben Griffith contributed to the initial exploration of the Google Books API, to background research on previous work, and to the analysis of our project's output. Ben was responsible for the Euclidean similarity analysis, which consisted of cleaning data, calculating distances, clustering the data, and determining key findings from the results. His background research included compiling a detailed list of existing book recommendation services for the group to review, which helped decide our project's path. Finally, Ben contributed to sections of the final report such as the description of his analysis, answers for our stakeholders, and our limitations and conclusion.

Daniel Gonzalez gathered the data necessary to complete the various analyses for this project. Daniel obtained the initial Kaggle dataset, made requests to the Google Books API, and subset the data in order to use web scraping methods, ultimately forming the final dataset with all of the features needed for the analyses. In addition to gathering the data, Daniel also completed the Jaccard similarity tests and the sentiment analysis comparisons.

Prince Okpoziakpo provided the Google Books API data during the preliminary phases of the project and determined the approach to creating the foundation of Machine Learning techniques that would be useful for creating a book recommendation tool. He also contributed to the data preprocessing for, and training of, the Decision Tree classifier, which is one of the algorithms we explored in the report.

The collaboration and synergy among the team members were crucial in successfully completing the project. While these descriptions provide an overview of each team member’s contributions, it’s important to note that there was active participation and frequent discussions among all team members throughout the project lifecycle.

To view our code, visit our Git repository: Link.
