Similar books in the Seattle Public Library

Published in INST414: Data Science Techniques · 5 min read · Apr 7, 2024

The question I set out to answer is: how can we find books that are similar to a given book? This matters to a librarian who is trying to help a student find reading material similar to something they already enjoyed. Answering it requires information about a large number of books, such as author, theme, age, and more. The data I was able to find comes from the Seattle Public Library system and contains mostly book titles and the subjects each book falls under.

To collect this data I used the Open Data Seattle API:

import requests

url = 'https://data.seattle.gov/resource/6vkj-f5xf.json'

# Row count copied from the Seattle Open Data site
max_books = 100005006
current_book = 0

all_books = []

while current_book < max_books:
    # Each request is assumed to return 1,000 rows, so $offset advances by 1,000 per page
    page_url = f'{url}?$offset={current_book}'
    print(page_url)
    response = requests.get(page_url)
    library = response.json()

    for book in library:
        all_books.append(book)

    current_book += 1000

When I first ran this code, I had copied the max_books value from the Seattle website, thinking it was a hundred thousand rows and not realizing it was a hundred million. After the code had run for close to 30 minutes, I discovered my error and stopped the process. By then it had pulled over one million rows, which I exported to a CSV to use for analysis.
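In hindsight, a loop that stops on its own would have avoided the 30-minute run. The following is only a sketch of that idea, not the code I actually ran: `fetch_all`, `socrata_page`, and the `max_rows` cap are hypothetical names, and it assumes the endpoint accepts the `$limit` and `$offset` query parameters and returns an empty list once the offset runs past the last row.

```python
from typing import Callable, Dict, List

Page = List[Dict[str, str]]


def fetch_all(fetch_page: Callable[[int, int], Page],
              page_size: int = 1000,
              max_rows: int = 100_000) -> Page:
    """Collect rows page by page until max_rows is reached or the data runs out."""
    rows: Page = []
    offset = 0
    while len(rows) < max_rows:
        page = fetch_page(offset, page_size)
        if not page:  # an empty page means there is nothing left to read
            break
        rows.extend(page)
        offset += page_size
    return rows[:max_rows]


def socrata_page(offset: int, limit: int) -> Page:
    """Fetch one page from the endpoint used above (needs network access)."""
    import requests  # imported here so fetch_all can be exercised offline

    url = 'https://data.seattle.gov/resource/6vkj-f5xf.json'
    response = requests.get(url,
                            params={'$limit': limit, '$offset': offset},
                            timeout=30)
    response.raise_for_status()
    return response.json()
```

Calling `fetch_all(socrata_page, max_rows=100_000)` would then stop after a hundred thousand rows no matter how large the catalog is.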

Next I needed to convert the data, which looked like a typical library catalog entry, into something more useful for analysis. To keep the dataset manageable I kept only the first 45 thousand rows and renamed the id column to something more readable.

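# 'Unnamed: 0' is the unnamed index column picked up from the exported CSV
books = books.rename(columns={'Unnamed: 0': 'book_id'})
# Keep only the first 45,000 rows
books = books[0:45000]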

As my final preparation step I created a dictionary of dictionaries mapping each book id to its subjects, along with a second dictionary mapping each book id to its title.

book_name_map = {}      # book_id -> title
book_subjects_map = {}  # book_id -> {subject: count}

def createBookSubjectMap(row):
    # Subjects are stored as a single comma-separated string
    subjects = row.subjects.split(", ")

    this_books_subjects = {}
    for s in subjects:
        this_books_subjects[s] = this_books_subjects.get(s, 0) + 1

    book_subjects_map[row.book_id] = this_books_subjects
    book_name_map[row.book_id] = row.title

    return row

book_subjects.apply(createBookSubjectMap, axis=1)

Now that my data was cleaned I was ready to begin the analysis. I chose Euclidean distance because every feature is on the same scale: a book either has a given subject (1) or does not (0). I then transformed the dictionary into a dataframe with one row per book and one column per subject:

index = list(book_subjects_map.keys())
rows = [book_subjects_map[k] for k in index]

# Subjects a book does not have come out as NaN, so fill them with 0
subject_df = pd.DataFrame(rows, index=index).fillna(0)

Some of the subjects are quite specific and apply to only a handful of books, which eased my concern that subjects would be too common to group similar items by.
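The distance calculation itself can be sketched directly on the dictionary of dictionaries built earlier. This is a minimal illustration, not the exact code from my notebook: `euclidean`, `most_similar`, and `toy_map` are hypothetical names, and the toy subjects are made up rather than taken from the catalog.

```python
import math
from typing import Dict, List, Tuple

SubjectCounts = Dict[str, int]


def euclidean(a: SubjectCounts, b: SubjectCounts) -> float:
    """Euclidean distance between two sparse subject vectors (missing subject = 0)."""
    subjects = set(a) | set(b)
    return math.sqrt(sum((a.get(s, 0) - b.get(s, 0)) ** 2 for s in subjects))


def most_similar(target_id: int,
                 subject_map: Dict[int, SubjectCounts],
                 top_n: int = 10) -> List[Tuple[int, float]]:
    """Return the top_n (book_id, distance) pairs closest to target_id, nearest first."""
    target = subject_map[target_id]
    distances = [(book_id, euclidean(target, subjects))
                 for book_id, subjects in subject_map.items()]
    return sorted(distances, key=lambda pair: pair[1])[:top_n]


# Hypothetical toy data in the same shape as book_subjects_map.
toy_map = {
    1: {'Wizards': 1, 'Magic': 1, 'Schools': 1},
    2: {'Wizards': 1, 'Magic': 1, 'Schools': 1, 'England': 1},
    3: {'Cooking': 1, 'Pizza': 1},
}
print(most_similar(1, toy_map))
```

The target book always comes back first with a distance of 0, which is a quick sanity check that the lookup hit the right row.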

Finally I needed to pick three books to use as search targets. I chose the following:

Harry Potter and the Deathly Hallows by J.K. Rowling
Among the Brave by Margaret Haddix
Pizza by Vincenzo Buonassisi

I needed the book ids for these books, so I searched the dataframe for each one, swapping in a different search string each time:

book_subjects.loc[book_subjects['title'].str.startswith('Pizza')]

The first similarity results were for Harry Potter. I was expecting the other books in the series to appear, but only one other Harry Potter book came up, which was a little strange to me.

book_id  title (truncated)                    distance
21264    Harry Potter and the deathly hallow  0.0
8922     Harry Potter and the Chamber of Sec  2.0
14468    Greenwitch / Susan Cooper.           2.0
21709    The Perilous Gard / Elizabeth Marie  2.0
22168    Krabat & the sorcerer's mill / Otfr  2.0
4        Shrines / Purity Ring.               2.23606797749979
7        Not all animals are blue : a big bo  2.23606797749979
22       Belarus / Patricia Levy and Michael  2.23606797749979
25       Bihasang pagsasabong : isinalin ng   2.23606797749979
36       History of the thirteen / Honoré de  2.23606797749979
38       Trinidad and Tobago / Sean Sheehan   2.23606797749979

Up next, Among the Brave is part of another series that I read during my childhood. Again, the other books in the series were absent from the results, but overall this seems to be the most accurate of the three similarity analyses. The first few suggested books are all in the same dystopian genre, where freedom fighters go up against an oppressive government.

The last book I picked was a random book about pizza, which I knew nothing about going into the analysis.

One other book involved pizza and was quite similar to the target, but nothing else the algorithm considered similar had anything to do with pizza, especially the Toyota Corolla repair manual that was included for some reason.

In theory, these results should be the books most similar to each book I fed into the algorithm, giving the librarian an answer to what the library user requested.

Limitations

There are a few major limitations to this analysis. First, I started with about 1% of the total library catalog and then dramatically reduced that even further. Neither reduction was a true random sample; each was simply the first rows returned.
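One way around the "first rows returned" problem would be to draw a reproducible random sample from the rows already downloaded. The helper below is a hypothetical sketch using only the standard library, not something I did in this analysis; `sample_rows` and the seed value are made-up names and numbers.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar('T')


def sample_rows(rows: Sequence[T], k: int, seed: int = 414) -> List[T]:
    """Draw up to k rows uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return rng.sample(rows, min(k, len(rows)))
```

With the data already in a dataframe, pandas offers the same idea through `DataFrame.sample(n=..., random_state=...)`.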

A few titles appeared in the similarity list for all three books I inputted, even though they were in vastly different genres. For example, the book “Not all Animals are Blue” appeared in each result even though it is a children's book aimed at teaching kids how to differentiate between shapes and colors, which has nothing to do with any of the books I chose.
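One possible explanation, which I have not checked against the real catalog entries, is that Euclidean distance on 0/1 data only counts the subjects two books disagree on. A book with very few subjects disagrees with everything on very few columns, so it can look "close" to every target. The sketch below uses invented subject lists and a hypothetical `binary_euclidean` helper to show the effect.

```python
import math
from typing import Set


def binary_euclidean(a: Set[str], b: Set[str]) -> float:
    """Euclidean distance between two 0/1 subject vectors given as sets."""
    # Every subject held by exactly one of the two books adds 1 to the squared distance.
    return math.sqrt(len(a ^ b))


# Hypothetical subject lists, invented for illustration only.
target = {'Wizards', 'Magic', 'Schools', 'Fantasy fiction'}
related = {'Wizards', 'Magic', 'Fantasy fiction', 'England', 'Orphans',
           'Friendship', 'Good and evil', 'Juvenile fiction'}
unrelated = {'Colors'}

print(binary_euclidean(target, related))    # shares 3 subjects, differs on 6
print(binary_euclidean(target, unrelated))  # shares none, differs on 5
```

In this toy case the unrelated one-subject book ends up closer to the target than the related book with many subjects, which matches the pattern of the same sparsely tagged titles showing up in all three result lists.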

The final limitation is that I only included the subjects and no other data such as the author or information about books in a series.

Github Link: https://github.com/not-senate/module3_assignment

Raw data file (too big for github): https://drive.google.com/drive/folders/1E1AF7o9o95veXz6kBOVWcAmLzp-eW9Ap?usp=sharing

Written by Jdavitz