James van Doorn
INST414: Data Science Techniques
Oct 27, 2023


Exploring Similarity Queries in Web-Based Data

Non-obvious Insight

The non-obvious insight I want to extract from my data is which books are most similar to Charlotte’s Web, The Lord of the Rings, and Oh, the Places You’ll Go! These insights can suggest which books or authors a fan of these titles might read next. Identifying similar books can also inform marketing strategies, stocking decisions at bookstores such as Barnes and Noble, and even the development of new literary works.

Data Source and Features

The dataset I’ve chosen is a collection of book data, downloaded from Kaggle as a CSV (7k Books (kaggle.com)). It contains information on titles, subtitles, authors, categories, and book descriptions. For our similarity calculations, I’ll be using the book descriptions. The chosen similarity metric is cosine similarity, a popular method for text comparison.
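As a quick illustration of the metric itself, cosine similarity measures the cosine of the angle between two term vectors: 1.0 means the vectors point in the same direction, 0.0 means they share no terms. A minimal sketch on toy term-count vectors (the vectors here are made up for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors:
    1.0 = same direction, 0.0 = orthogonal (no shared terms)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy term-count vectors standing in for two short book descriptions
a = np.array([1, 2, 0, 1])
b = np.array([1, 1, 0, 0])
print(round(cosine_similarity(a, b), 3))  # → 0.866
```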

Top 10 Similar Books for Each Book

Charlotte’s Web

1. the non-designer’s web book

2. agile web development with rails

3. the best american travel writing 2006

4. last man standing

5. php and mysql for dummies

6. charlotte’s web signature edition

7. charlotte’s web (full color)

8. i am charlotte simmons

9. charlotte’s web: wilbur finds a friend

10. css cookbook

The Lord of the Rings

1. the lord of the rings sketchbook

2. the lord of the rings complete visual companion

3. the book of five rings: english edition (五輪書)

4. the return of the king

5. the tolkien reader

6. j.r.r. tolkien

7. the tolkien companion

8. bored of the rings

9. the hobbit / the lord of the rings

10. the history of the lord of the rings

Oh, the Places You’ll Go!

1. he’s just not that into you (the newly expanded edition)

2. cliffsnotes on euripides’ medea & electra

3. unlimited power

4. new york city’s best dive bars

5. behind closed doors

6. ten days to self-esteem

7. the great good place

8. simply beautiful beaded jewelry

9. the purpose of your life

10. the power of infinite love & gratitude

Software:

For this analysis, I made use of Python’s pandas module and scikit-learn. These tools provided a solid foundation for data manipulation, text analysis, and similarity calculations.

Pandas allowed me to load, clean, and preprocess the dataset. With its DataFrame structure, I could easily work with the data, applying transformations and manipulations as needed. In this analysis, I also leveraged scikit-learn’s TfidfVectorizer and linear_kernel to calculate the cosine similarity between book descriptions.
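The pipeline above can be sketched as follows. This is a minimal version on a tiny invented DataFrame standing in for the Kaggle CSV; the column names (“title”, “description”) and the example rows are assumptions, not the actual dataset. Because TfidfVectorizer L2-normalizes its rows by default, linear_kernel on the TF-IDF matrix is exactly cosine similarity:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical mini-dataset; real analysis loads the 7k Books CSV instead.
books = pd.DataFrame({
    "title": ["charlotte's web", "the trumpet of the swan", "css cookbook"],
    "description": [
        "a pig named wilbur and a spider named charlotte on a farm",
        "a swan named louis learns to play the trumpet",
        "recipes for styling web pages with css",
    ],
})

# Turn each description into a TF-IDF weighted term vector
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(books["description"])

# Dot products of L2-normalized TF-IDF rows = cosine similarities
sim = linear_kernel(matrix, matrix)

# Rank the other books by similarity to the first title
query = 0
for i in sim[query].argsort()[::-1]:
    if i != query:
        print(books.loc[i, "title"], round(sim[query][i], 3))
```

In the full analysis, the same ranking is taken for each of the three query titles and truncated to the top 10.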

Data Cleanup:

While I didn’t encounter any bugs, this dataset required some cleaning. The data cleanup involved a few steps:

  • Lowercasing Titles: All book titles were converted to lowercase to ensure uniformity and prevent case-related discrepancies in text comparisons.
  • Handling Missing Descriptions: Null or missing descriptions were filled with empty strings, allowing us to include all available information in the analysis.
  • Special Characters: Data preprocessing addressed issues with special characters in titles, ensuring that they didn’t interfere with string operations.
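The three cleanup steps above can be sketched like this. The column names and example rows are assumptions about the CSV layout, and the regex is one plausible way to strip stray characters, not necessarily the exact one used:

```python
import pandas as pd

# Hypothetical rows; the second title ends in a zero-width space,
# and the second description is missing.
df = pd.DataFrame({
    "title": ["Charlotte’s Web", "CSS Cookbook\u200b"],
    "description": ["a pig and a spider", None],
})

# 1. Lowercase titles to prevent case-related mismatches
df["title"] = df["title"].str.lower()

# 2. Fill missing descriptions with empty strings so no row is dropped
df["description"] = df["description"].fillna("")

# 3. Strip special characters that would interfere with string operations,
#    keeping word characters, spaces, and common title punctuation
df["title"] = df["title"].str.replace(r"[^\w\s'’\-:,.!()/&]", "", regex=True)

print(df)
```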

Findings:

(Figure: the combined average rating for the top 10 books similar to each book)

Limitations and Biases:

One limitation of this analysis is that I’ve focused solely on book descriptions for similarity, potentially missing nuances in other features. Additionally, the dataset’s quality and completeness can affect results. This dataset contains only around 7,000 books, meaning many books were left out, and we don’t know whether the included books are representative of all published books in terms of genre, author, or other categories. The algorithm might also introduce biases based on the content of descriptions and titles. For example, some books in the dataset are nearly identical, such as Charlotte’s Web and Charlotte’s Web Signature Edition, and a user would likely not find such an entry useful since it is essentially the same book. More preprocessing, and potentially natural language processing techniques, could be used to eliminate such issues. Lastly, cosine similarity, used here as the similarity metric, relies on the “bag-of-words” model, which treats text as an unordered set of words. It likely fails to capture the meaning or context of words, which can lead to books that aren’t actually similar being included in the output.

GitHub Link: inst414_work/assignment3.ipynb at main · jvand0/inst414_work (github.com)
