Webscraper Overview (Goodreads Part II)

Connor Higgins

Dec 9, 2019

This post provides a short overview of the use and capabilities of a webscraper I’ve written in R. It can be used to create datasets for analyzing genres of interest on Goodreads.

To use the scraping function, first you will need to download and source the script.

All that is needed is the url of a reading list of your choice and a stable internet connection. Towards the end of the script itself (below, or on Github) is a short set of commented-out examples you can look at. For a quick demo, I suggest “excellentSpaceOpera”: it is a short list, and the function will return a result in just a few minutes.
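
For instance, a minimal run might look like this (the file name goodreads_webscraper.R is an assumption; use whatever name you saved the script under):

source('goodreads_webscraper.R')   # loads goodReads.webscrape() and its helper functions
# Short demo list; returns a dataframe in a few minutes
excellentSpaceOpera<-goodReads.webscrape('https://www.goodreads.com/list/show/1127')
head(excellentSpaceOpera$title)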

Argument

  • The url of a Goodreads reading list of your choice

Output

A dataframe with the following columns:

  • goodreadsID (unique, numeric): Can be used to merge retrieved data, either with merge() in R or a JOIN statement in SQL (see the sketch after this list)
  • title
  • authors
  • hyper: relative hyperlink to the work, without the “goodreads.com” domain prefix
  • pageCounts: Total number of pages
  • proportion_text_reviews (numeric): The proportion of raters who also contributed a written review (rather than just a 1 to 5 star rating)
  • average_rating: Average rating on Goodreads.com
  • genreVoted: The genre “tag” that received the most votes from readers
  • hasAward (TRUE/FALSE): Whether the work is listed as having received, or been nominated for, an award
  • total_ratings: Total number of ratings received on Goodreads
  • book.descriptions: Plaintext synopsis
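
As a quick illustration of that merge, here is a minimal sketch (myShelf and its columns are invented for the example; the scraped dataframe comes from the demo call above):

myShelf<-data.frame(goodreadsID=c(51964,234225),myRating=c(5,4))
merged<-merge(excellentSpaceOpera,myShelf,by='goodreadsID')
# Roughly equivalent SQL: SELECT * FROM scraped JOIN my_shelf USING (goodreadsID);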

Example (Book descriptions not shown here)

| goodreadsID | title | authors | hyper | pageCounts | proportion_text_reviews | average_rating | genreVoted | hasAward | total_ratings |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 51964 | Old Man’s War (Old Man’s War, #1) | John Scalzi | /book/show/51964.Old_Man_s_War | 351 | 5.71e-05 | 4.23 | Science Fiction | FALSE | 122567 |
| 234225 | Dune (Dune Chronicles, #1) | Frank Herbert | /book/show/234225.Dune | 604 | 2.47e-05 | 4.20 | Science Fiction | FALSE | 567930 |
| 17214 | Starship Troopers | Robert A. Heinlein | /book/show/17214.Starship_Troopers | 335 | 2.53e-05 | 4.00 | Science Fiction | FALSE | 158267 |
| 8855321 | Leviathan Wakes (The Expanse, #1) | James S.A. Corey | /book/show/8855321-leviathan-wakes | 561 | 7.81e-05 | 4.22 | Science Fiction | FALSE | 102368 |
| 45252 | Pandora’s Star | Peter F. Hamilton | /book/show/45252.Pandora_s_Star | 768 | 2.87e-05 | 4.24 | Science Fiction | FALSE | 34799 |
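
And since the script already loads dplyr, a quick genre-level summary of a scraped list might look like this (a sketch, using the demo dataframe from earlier):

library(dplyr)
excellentSpaceOpera %>%
  group_by(genreVoted) %>%
  summarise(works = n(),
            mean_rating = mean(average_rating, na.rm = TRUE)) %>%
  arrange(desc(works))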

Postscript

Additionally, the webscraper in its entirety can be found below or on Github.

library(rvest)
library(dplyr)
library(scales)
library(magrittr)
library(ggplot2)
# This Rscript can roughly be divided into the following sections:
# 1) Small supporting functions that are used to collect specific sections from a webpage. Used by the webscraper.
# 2) Webscraper with the following components (for loops):
# 2a) Scrape the reading list to compose a list of books (and their urls). Example: 'https://www.goodreads.com/list/show/1127'
# 2b) Use the urls to visit the pages of each book on the list, collecting detailed information.
# 3) Finally, call the webscraper function on all the reading lists scraped for this project. I've taken the liberty
# of commenting out all but the smallest list so running a demo doesn't take hours.
# Section 1: Supporting functions
find.description<-function(book.url){
  # Plaintext synopsis; max() keeps the fuller of the truncated/expanded
  # description spans (the '#description span' selector is an assumption)
  desc<-html_nodes(book.url,'#description span')%>%html_text()%>%max()
  return(desc)
}
find.page.length<-function(book.url){
  # Page count from the '#details' box, e.g. "351 pages"
  book_page<-html_nodes(book.url,'#details span+ span')[1]
  pageCount<-html_text(book_page)
  if(length(book_page)==0){
    pageCounts<-NA
  } else {
    pageCounts<-as.numeric(gsub(" pages","",pageCount))
  }
  return(as.numeric(pageCounts))
}
find.genre.tag<-function(book.url){
  # The first (most-voted) genre link in the book page's genre list
  genre<-html_nodes(book.url,'.elementList:nth-child(1) .left .bookPageGenreLink')
  if(length(genre)==0){
    genreVoted<-NA
  } else {
    # Strip surrounding whitespace/newlines from the link text
    genreVoted<-gsub(" *\n *","",html_text(genre))
  }
  return(genreVoted)
}
check.for.awards<-function(book.url){
  # The awards row of the infobox is only present if the work won or was nominated
  award<-html_nodes(book.url,'.clear+ .clearFloats .infoBoxRowItem')
  if(length(award)==0){
    hasAward<-FALSE
  } else {
    hasAward<-TRUE
  }
  return(hasAward)
}
find.total.reviews<-function(book.url){
  # Total number of star ratings ('.votes' node)
  if(length(html_nodes(book.url,'.votes'))!=0){
    ratings<-html_text(html_nodes(book.url,'.votes'))
    ratings<-gsub("[[:space:]]", "", ratings)
    ratings<-as.numeric(ratings)
  } else {
    ratings<-NA
  }
  return(ratings)
}
find.text.reviews<-function(book.url){
  # Number of written reviews ('.count' node); take the first number in the text
  if(length(html_nodes(book.url,'.count'))!=0){
    reviews<-html_text(html_nodes(book.url,'.count'))
    text_reviews<-as.numeric(regmatches(reviews, regexpr("[[:digit:]]+",reviews)))
  } else {
    text_reviews<-NA
  }
  return(text_reviews)
}
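# Example: the helpers can also be called standalone on a single book page
# (values shown are from the example table above; live results may differ):
# book<-read_html('https://www.goodreads.com/book/show/51964.Old_Man_s_War')
# find.page.length(book)   # 351
# check.for.awards(book)   # FALSE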
#Section 2: The main webscraping function
goodReads.webscrape<-function(listUrl){
  # Section 2a: Takes the url of a reading list and scrapes it. This collects urls, goodReads IDs, titles, and authors.
  # The urls are used by the second part of the scraper to visit each book's page, and collect additional info.
  List_Main<-read_html(listUrl)
  # Page numbers from the pagination bar at the foot of the list; the second-to-last
  # entry is the last page number (the '.pagination a' selector is an assumption)
  pages<-html_nodes(List_Main,'.pagination a')%>%html_text()
  listIndex<-as.numeric(pages[length(pages)-1])
  # Exception for short lists of fewer than 100 books (no pagination bar)
  if(length(listIndex)==0||is.na(listIndex)){
    listIndex<-1
  }
  guide<-list()
  data<-data.frame()
  for(i in 1:listIndex){
    ReadList<-read_html(paste(listUrl,"?page=",i,sep=''))
    # Titles, authors, and hyperlinks from browsing the given list.
    title<-ReadList%>%html_nodes("td")%>%html_nodes('.bookTitle span')%>%html_text()
    authors<-ReadList%>%html_nodes("td")%>%html_nodes('.authorName span')%>%html_text()
    hyper<-ReadList%>%html_nodes("td")%>%html_nodes('a.bookTitle')%>%html_attr("href")
    # Extract the Goodreads book ID from the url, to use as a key variable. The second
    # gsub() catches the occasional dash-form url, e.g. '/book/show/8855321-leviathan-wakes'
    goodreadsID<-gsub(".*/book/show/\\s*|\\..*", "", hyper)
    goodreadsID<-gsub("-.*", "", goodreadsID)
    goodreadsID<-goodreadsID%>%as.numeric()
    a<-cbind(goodreadsID,title,authors,hyper)
    data<-rbind(data,a)
    complete<-i/listIndex
    cat("Completion (Step 1): ",format(percent(complete),digits=4,justify="left"),"\n","Works Found: ",nrow(data),"\n")
  }
cat("Step 1 Complete! Found ",nrow(data)," separate works in this list.", "\n")
# Section 2b: Scrape the page of each individual book in the list. This gets additional information: page lengths,
# the genre tag (with the most votes by readers), text summaries, etc.
# This step takes time! (Technically I could speed this up with the doParallel package and run it on multiple cores,
# however I would not be able to display program progress updates. I decided I'd rather be able to keep close tabs
# on how the program is doing.)
  badurls<-0
  # Pre-allocate one slot per book for each field collected in this step
  total_ratings<-rep(NA,nrow(data))
  book.descriptions<-rep(NA,nrow(data))
  pageCounts<-rep(NA,nrow(data))
  genreVoted<-rep(NA,nrow(data))
  hasAward<-rep(NA,nrow(data))
  average_rating<-rep(NA,nrow(data))
  text_reviews<-rep(NA,nrow(data))
  proportion_text_reviews<-rep(NA,nrow(data))
  for(i in 1:nrow(data)){
    # The url of a book in the target list
    url<-paste('https://www.goodreads.com',data$hyper[i],sep='')
    go<-tryCatch(read_html(url),
                 error=function(c) 'stop')
    # So it doesn't crash if it is a bad url
    if(!identical(go,'stop')){
      # The html page of a book in the target list (reuse the page fetched above
      # rather than requesting it a second time)
      goodReads<-go
      # Number of total ratings, and the proportion that are also text reviews
      text_reviews[i]<-find.text.reviews(goodReads)
      total_ratings[i]<-find.total.reviews(goodReads)
      proportion_text_reviews[i]<-(text_reviews[i]/total_ratings[i])
      # Get the summaries of each book
      book.descriptions[i]<-find.description(goodReads)
      # Average rating
      if(length(html_nodes(goodReads,'.average'))!=0){
        average_rating[i]<-as.numeric(html_text(html_nodes(goodReads,'.average')))
      } else {
        average_rating[i]<-NA
      }
      hasAward[i]<-check.for.awards(goodReads)
      genreVoted[i]<-find.genre.tag(goodReads)
      pageCounts[i]<-find.page.length(goodReads)
    } else {
      badurls<-badurls+1
    }
    complete<-(i/nrow(data))*100
    cat(paste("Completion: ",format(complete,digits=4,justify="left"),"%"," Title: ",substr(data$title[i],1,45),sep=''),'\n',
        " Last Captured: ",format(average_rating[i],width=3,justify="centre",nsmall=2),format(pageCounts[i],width=6,justify="centre"),format(hasAward[i],width=5,justify="centre"),format(genreVoted[i],width=15,justify="centre"),format(percent(proportion_text_reviews[i]),width=5,justify="right"),'\n')
  }
  newData<-cbind.data.frame(pageCounts,proportion_text_reviews,average_rating,genreVoted,hasAward,total_ratings,book.descriptions)
  Finished<-cbind(data,newData)
  return(Finished)
}
#---------------------------------------------------------------------
#----- Section 3: Actually webscraping some lists ----------
#---------------------------------------------------------------------
# Usage is quite simple, with only one argument.
# Argument: url of a list on goodreads.com
# Output: returns a dataframe with information scraped from each work on the list.
# Comments: The scraper isn't perfect, and appears to have issues scraping certain works. I have a number of exceptions
# to handle this, but it does still encounter issues.
# Sometimes this is expected:
# Audiobooks don't have page lengths
# Lesser known works sometimes do not have a genre tag, since no one voted on a tag.
# Also be careful feeding very large lists into the function! I'd say ~2000 books is ideal, anything further and you
# may run into issues (and can expect to wait some time for it to finish).
#bestEpicFantasy<-goodReads.webscrape('https://www.goodreads.com/list/show/50.The_Best_Epic_Fantasy')
excellentSpaceOpera<-goodReads.webscrape('https://www.goodreads.com/list/show/1127')
#bestScienceFiction<-goodReads.webscrape('https://www.goodreads.com/list/show/19341')
#apocalyptic<-goodReads.webscrape('https://www.goodreads.com/list/show/47')
#travel<-goodReads.webscrape('https://www.goodreads.com/list/show/633.Favourite_Travel_Books')
#bestScience<-goodReads.webscrape('https://www.goodreads.com/list/show/692.Best_Science_Books_Non_Fiction_Only')
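# Optional follow-up sketch (not part of the scraper itself): save a scraped list
# to disk with base R so it doesn't have to be re-scraped later.
# write.csv(excellentSpaceOpera,'excellentSpaceOpera.csv',row.names=FALSE)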

Connor Higgins

Current graduate student at Northeastern University, pursuing a career in data science. Also an avid reader of speculative fiction!