Scraping Goodreads.com: Data Munging (Part I)

Connor Higgins
Dec 9, 2019 · 6 min read

Introduction

Web scraping is the process of collecting data from a website, such as prices from Newegg.com, Amazon, etc. It is a genuinely useful skill for any stats or computer science student, professional, or hobbyist. One simple use is to extend the “reach” of your projects: no easily available public dataset for your topic? Then you can build one yourself with a scraper (also known as a spider)!

This project will be a multi-parter:

  • Data Collection: Build a scraper to collect and organize data from reading lists on Goodreads.com
  • Data Organization: Use the scraper to build a relational database
  • Visualization and Analysis: Run detailed analysis and statistical testing

Feel free to use any code from this project; it will be available on GitHub.

Hope you enjoy reading, because we are looking at Goodreads.com!

Goodreads

For those unaware, Goodreads.com is a site where readers can rank and review the books they read. It is an excellent source for book recommendations, and also for tracking the books you’ve read. The central idea behind Goodreads is that readers care more about recommendations from fellow readers than about “official” bestseller lists, which is not a bad idea.

My idea for this project, on the other hand, began as a vague plan to analyze the books that I enjoy reading.

So how do we move beyond a “vague plan”? All the information I could want can be found on the main site; it is simply a matter of getting it into a dataset. That is exactly what web scraping does, and while it sounds complicated, the process is quite simple!

Preview of the Final Product

Here is a sneak peek of the resulting dataset for this page. We will be building a function that can produce a dataset from the URL of any public list on Goodreads (check Listopia on the site for others).

Excellent Space Opera

Quick Primer

Tools

*rvest is designed to work with magrittr, though magrittr is not strictly necessary. It lets you pipe functions with “%>%”.

For example:

# 'a' gets sent as the first argument to the next function by %>%
library(magrittr)
a <- c("2", "5", "3")  # 'a' here is just an example character vector
a %>% as.numeric() %>% max()
## [1] 5

Web Scraping Sample

Generally speaking, we only have a few steps to follow in this project. If you are scraping a different site, things may be different: you might run into a page that is not plain HTML, or a site that prohibits web scrapers (which happens).

Take a look at this small part of the list we are working with. We’ll select the titles of the books.

To select the title we need a “path” that we will give to rvest’s html_nodes() function. You can get that path using SelectorGadget. Usage is simple:

  • Click the element(s) you want. SelectorGadget will make a guess at the path, and the elements included in the guess get highlighted (in yellow).
  • Click any highlighted elements you don’t want; they turn red, and SelectorGadget uses them to refine the guess.
  • Continue until only the elements you want are highlighted.

Lastly we use rvest to get the titles off this page. We will need the path SelectorGadget is now giving us (bottom right in this image).

Copy and paste the path and use it in the html_nodes() function below.

# read_html() retrieves the html of a page in its entirety, hardly suited for use yet.
library(rvest)
library(magrittr)
list_Main <- read_html("https://www.goodreads.com/list/show/1127")
# html_nodes(): takes the path we found and retrieves the matching elements.
# html_text(): extracts the text from what html_nodes() found.
Book.titles <- list_Main %>% html_nodes('.bookTitle span') %>% html_text()
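If everything worked, Book.titles is now a plain character vector of titles. A quick sanity check (purely illustrative) could be:

head(Book.titles)
# should print the first few titles on the page, starting with "Old Man's War"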

These are the general steps. Since webpages are put together by humans (so sometimes the “pattern” we are looking for won’t hold) and are constantly updated, you may have to fiddle with the selectors depending on the site.

Goodreads pages are HTML, which makes them relatively easy to scrape with rvest. XML pages can be scraped with the “xml2” package.

Data Collection

Breaking Down the Steps

Sometimes this is simple, sometimes less so. In my case I want to look at lists of books I enjoy reading.

The end goal is to build a dataset by scraping a public reading list on Goodreads.

Here is the list we will use in this example: Excellent Space Operas (Goodreads.com)

We can get the following information right from this page:

  • Title
  • Author
  • URLs to each book’s page
  • Goodreads book ID (this we can get from the URL with some regular expressions; see the short example after this list)
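
As a quick illustration of that last point, here is the kind of gsub() call the scraper uses later on. The two paths below are made up, but they follow the two shapes Goodreads book links tend to take (ID followed by a period or by a hyphen), and the second gsub() is my guess at the “occasional exception” mentioned in the code comments:

urls <- c("/book/show/12345.Some_Title", "/book/show/67890-another-title")  # hypothetical example paths
ids  <- gsub(".*/book/show/\\s*|-.*", "", urls)  # strip the path prefix and any "-slug" suffix
ids  <- gsub("\\..*", "", ids)                   # also strip a trailing ".Title" form
as.numeric(ids)
## [1] 12345 67890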

Also notice this little thing at the bottom?

We will scrape this part, to figure out how many pages we need to “click through” to get the whole list. In this case we will have to visit 5 pages.
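A minimal sketch of that step, assuming the pagination links sit under a “.pagination” element (that selector is an assumption here and is worth confirming with SelectorGadget):

library(rvest)
library(magrittr)
List_Main <- read_html("https://www.goodreads.com/list/show/1127")
# Text of the pagination links, e.g. "1" "2" ... "5" "next »"
pages <- List_Main %>% html_nodes(".pagination a") %>% html_text()
# The second-to-last link holds the number of the final page
as.numeric(pages[length(pages) - 1])
## [1] 5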

By looking at the hyperlinks from here we can visit each work’s individual pages, and get even more information such as page lengths. For example, Old Man’s War (first work in this list).

In sum, here’s the structure of the web scraper:

  • Scrape the first page of the list
  • Use this to figure out the number of pages we need to visit
  • Initialize a dataframe (“data”), which holds information on each work found
  • First “for” loop, over 1:(number of webpages in the list): “Scrape the List”
  • Collect the available information (title, ID, author, hyperlinks)
  • Add all of these to the dataframe we initialized (“data”)
  • Second “for” loop, over 1:(number of works found): “Scrape Individual Book Pages”
  • Use the hyperlinks from the last step to visit each work’s individual page
  • Collect additional information (page lengths, genre tags, summaries, total reviews, average rating, total ratings, proportion of reviewers that wrote text reviews)

Scrape the List

Below is the code used to scrape the list itself. In other words, everything up to the second “for” loop in the outline above.

The full code can be found here.

Feel free to glance through it and see how this works.

library(rvest); library(magrittr)
library(scales)  # percent(), used in the progress message below

goodReads.webscrape <- function(listUrl){
  # Takes the url of a reading list and scrapes it. This collects urls, Goodreads ID's, titles, and authors.
  # The urls are used by the second part of the scraper to visit each book's page and collect additional info.
  # Read the first page, get the number of webpages that make up the list.
  List_Main <- read_html(listUrl)
  # Pagination links at the bottom of the list (".pagination a" is an assumed selector; check it with SelectorGadget)
  pages <- List_Main %>% html_nodes(".pagination a") %>% html_text()
  listIndex <- as.numeric(pages[length(pages) - 1])
  # Exception for short lists of less than 100 books (no pagination links)
  if(length(listIndex) == 0){
    listIndex <- 1
  }
  guide <- list()
  data <- data.frame()
  # That first "for" loop.
  for(i in 1:listIndex){
    # Each page of the list is reached by appending ?page=i to the list url
    ReadList <- read_html(paste0(listUrl, "?page=", i))
    title   <- ReadList %>% html_nodes('.bookTitle span') %>% html_text()
    authors <- ReadList %>% html_nodes("td") %>% html_nodes('.authorName span') %>% html_text()
    hyper   <- ReadList %>% html_nodes("td") %>% html_nodes('a.bookTitle') %>% html_attr("href")
    # Extract the Goodreads book ID from the url (using regular expressions).
    # gsub(): replaces anything matching the regular expression with ""
    goodreadsID <- gsub(".*/book/show/\\s*|-.*", "", hyper)
    # Second gsub() is used to catch an occasional exception
    # (reconstructed here as stripping the ".Title" form of the url)
    goodreadsID <- gsub("\\..*", "", goodreadsID)
    goodreadsID <- goodreadsID %>% as.numeric()
    a <- cbind(goodreadsID, title, authors, hyper)
    data <- rbind(data, a)
    complete <- i/listIndex
    # Handy progress message to let the user keep track of the script.
    cat("Completion (Step 1): ", format(percent(complete), digits = 4, justify = "left"), "\n",
        "Works Found: ", nrow(data), "\n")
  }
  cat("Step 1 Complete! Found ", nrow(data), " separate works in this list.", "\n")
  return(data)
}
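
For completeness, here is a rough, hypothetical sketch of what the second loop (“Scrape Individual Book Pages”) might look like. This is not the project’s code (that lives in the linked repository), and the selector for the page count is an assumption about the book-page markup that should be checked with SelectorGadget:

# A minimal sketch of the second loop, collecting just page counts
scrape_book_pages <- function(data){
  data$pages <- NA
  for(j in 1:nrow(data)){
    # "hyper" holds relative links like "/book/show/...", so prepend the site root
    book <- read_html(paste0("https://www.goodreads.com", data$hyper[j]))
    # Assumed selector for the page-count element, whose text looks like "320 pages"
    pageText <- book %>% html_nodes("span[itemprop='numberOfPages']") %>% html_text()
    if(length(pageText) > 0){
      data$pages[j] <- as.numeric(gsub("[^0-9]", "", pageText))
    }
  }
  return(data)
}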

Conclusion

With that I now have a web scraping function that can be used on any Goodreads list, given its URL (though it is not perfect). But there is an added bonus.

Remember that we also captured each work’s individual ID number. You may recognize that we can use this as a “primary key”, which is a key component of relational databases.

The function lets us scrape individual reading lists and organize them into tabular data (think spreadsheets, dataframes in R, etc.). The ID codes then let us relate those lists to one another in a relational database.
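
As a small, hypothetical illustration of that idea (the second list URL is left as a placeholder), two scraped lists can be joined on the shared ID:

space_opera <- goodReads.webscrape("https://www.goodreads.com/list/show/1127")
other_list  <- goodReads.webscrape("https://www.goodreads.com/list/show/...")  # any other public list
# Works appearing on both lists, matched on the Goodreads ID (the "primary key")
merge(space_opera, other_list["goodreadsID"], by = "goodreadsID")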

This is why I made this a multi-part project. We’ve created a data collection and data munging function, but this is only the beginning. We can then use the resulting database for analysis and visualization (which will be covered in the next posts)!

GitHub

Connor Higgins

Current graduate student at Northeastern University, pursuing a career in data science. Also an avid reader of speculative fiction!