Goodreads — Web Scraping

Akriti Sood
7 min read · Jun 30, 2022


A mug of coffee and my favorite novel: an ideal day for me. But deciding which novel to read is a tedious job, so I always turn to Goodreads for recommendations and reviews.

Goodreads is one of the world’s largest communities for reviewing and recommending books. As a voracious reader, it is one of my favorite platforms.

This article will show you how to scrape multiple pages from the website. I have chosen the “All Time Favorite Romance Novels” list URL to retrieve information, but you can choose any other URL.

Introduction:

The internet holds a massive amount of data, and web scraping is a way to access it.

Web Scraping: the extraction of data from websites. The information is collected and then exported into a format useful to the user.

There are various ways to scrape data:

  1. Auto Scraper
  2. Selenium
  3. Beautiful Soup

In this article, I am using Beautiful Soup to extract the information.

Prerequisites:

I am using Anaconda to scrape the data. The prerequisites are:

  1. Python 3
  2. BeautifulSoup
  3. Requests
  4. Pandas

Steps used to scrape the data:

Step 1: Install and load the packages in your notebook.

# install the packages (run these in a terminal, or prefix with ! in a notebook cell)
pip install requests
pip install beautifulsoup4
pip install pandas

# load the packages
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

Step 2: Get the URL of the website you want to extract data from.

url = 'https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=1'
response = requests.get(url)
print(response.text[:500])

Not all websites allow scraping of their data. Some websites use different methods to block scraping.

As you can see from the HTML code below, the data can be scraped from this site.

<!DOCTYPE html>
<html class="desktop withSiteHeaderTopFullImage
">
<head>
<title>All Time Favorite Romance Novels (4933 books)</title>

<meta content='4,912 books based on 12098 votes: Pride and Prejudice by Jane Austen, Fifty Shades of Grey by E.L. James, Beautiful Disaster by Jamie McGuire, Twilight b...' name='description'>
<meta content='telephone=no' name='format-detection'>
<link href='https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels' rel='canonical'>



<sc

If you get the response below, the data can’t be scraped by this method, or the website is blocking you from scraping it.

<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr><center>openresty</center>
</body>
</html>
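If you do get a 403, one optional workaround (my own addition, not part of the original walkthrough) is to check the status code explicitly and send a browser-like User-Agent header with the request:

import requests

url = 'https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=1'
# A browser-like User-Agent sometimes helps when a site blocks the default
# requests client; the exact string below is just an illustrative value.
headers = {'User-Agent': 'Mozilla/5.0'}

response = requests.get(url, headers=headers)
if response.status_code == 200:
    print("Page fetched successfully, safe to parse")
else:
    print("Request failed with status code", response.status_code)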

Step 3: First we scrape the elements for a single book; the rest of the data will be extracted the same way.

  1. Right click on the data you want to extract and click on Inspect.
  2. A window with HTML code will appear on the right side of the webpage.
  3. Find the container in which the data we want to extract is stored.

Example: the data we extract from Goodreads consists of the book title, author, ratings, votes and average score. The code below finds the container that holds all of this data for each book.

response = requests.get(url)
html = response.content
html_soup = bs(html, "html.parser")
book_containers = html_soup.find_all('tr', itemtype="http://schema.org/Book")
print(type(book_containers))
print(len(book_containers))
Result:
<class 'bs4.element.ResultSet'>
100

In the above code, we find all the elements with the <tr> tag and itemtype="http://schema.org/Book" and store them in book_containers. This tells us that each webpage lists 100 books.

4. Display the container for the first book.

first_book = book_containers[0]
first_book

This will show the HTML code for the first book, starting at <tr> and ending at </tr>. We are going to extract the information from the HTML code below.

<tr itemscope="" itemtype="http://schema.org/Book">
<td class="number" valign="top">1</td>
<td valign="top" width="5%">
<div class="u-anchorTarget" id="1885"></div>
<div class="js-tooltipTrigger tooltipTrigger" data-resource-id="1885" data-resource-type="Book">
<a href="/book/show/1885.Pride_and_Prejudice" title="Pride and Prejudice">
<img alt="Pride and Prejudice" class="bookCover" itemprop="image" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1320399351i/1885._SY75_.jpg"/>
</a> </div>
</td>
<td valign="top" width="100%">
<a class="bookTitle" href="/book/show/1885.Pride_and_Prejudice" itemprop="url">
<span aria-level="4" itemprop="name" role="heading">Pride and Prejudice</span>
</a> <br/>
<span class="by">by</span>
<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<div class="authorName__container">
<a class="authorName" href="https://www.goodreads.com/author/show/1265.Jane_Austen" itemprop="url"><span itemprop="name">Jane Austen</span></a>
</div>
</span>
<br/>
<div>
<span class="greyText smallText uitext">
<span class="minirating"><span class="stars staticStars notranslate"><span class="staticStar p10" size="12x12"></span><span class="staticStar p10" size="12x12"></span><span class="staticStar p10" size="12x12"></span><span class="staticStar p10" size="12x12"></span><span class="staticStar p3" size="12x12"></span></span> 4.28 avg rating — 3,634,657 ratings</span>
</span>
</div>
<div style="margin-top: 5px">
<span class="smallText uitext">
<a href="#" onclick="Lightbox.showBoxByID('score_explanation', 300); return false;">score: 233,173</a>,
<span class="greyText">and</span>
<a href="#" id="loading_link_382185" onclick="new Ajax.Request('/list/list_book/872724', {asynchronous:true, evalScripts:true, onFailure:function(request){Element.hide('loading_anim_382185');$('loading_link_382185').innerHTML = '&lt;span class=&quot;error&quot;&gt;ERROR&lt;/span&gt;try again';$('loading_link_382185').show();;Element.hide('loading_anim_382185');}, onLoading:function(request){;Element.show('loading_anim_382185');Element.hide('loading_link_382185')}, onSuccess:function(request){Element.hide('loading_anim_382185');Element.show('loading_link_382185');}, parameters:'authenticity_token=' + encodeURIComponent('5UQzxjQUZF39hLL1LuS6shxxDuzj2ILih3aJcmjuCXBt2yKGq2UoXlpFpFzzk4yBdM/W7/O8reRM1JC2LL4A5w==')}); return false;">2,358 people voted</a><img alt="Loading trans" class="loading" id="loading_anim_382185" src="https://s.gr-assets.com/assets/loading-trans-ced157046184c3bc7c180ffbfc6825a4.gif" style="display:none"/>


</span>
</div>
</td>
<td width="130px">
<div class="wtrButtonContainer wtrSignedOut" id="1_book_1885">
<div class="wtrUp wtrLeft">
<form accept-charset="UTF-8" action="/shelf/add_to_shelf" method="post"><input name="utf8" type="hidden" value="✓"/><input name="authenticity_token" type="hidden" value="DU7jJxYlP0GDettaXIEseOm+7t3v2jRR+6vmZppvLSSF0fJniVRzQiS7zfOB9hpLgQA23v++G1cwCf+i3j8ksw=="/>
<input id="book_id" name="book_id" type="hidden" value="1885"/>
<input id="name" name="name" type="hidden" value="to-read"/>
<input id="unique_id" name="unique_id" type="hidden" value="1_book_1885"/>
<input id="wtr_new" name="wtr_new" type="hidden" value="true"/>
<input id="from_choice" name="from_choice" type="hidden" value="false"/>
<input id="from_home_module" name="from_home_module" type="hidden" value="false"/>
<input class="wtrLeftUpRef" id="ref" name="ref" type="hidden" value=""/>
<input class="wtrExisting" id="existing_review" name="existing_review" type="hidden" value="false"/>
<input id="page_url" name="page_url" type="hidden"/>
<button class="wtrToRead" type="submit">
<span class="progressTrigger">Want to Read</span>
<span class="progressIndicator">saving…</span>
</button>
</form>
</div>
<div class="wtrRight wtrUp">
<form accept-charset="UTF-8" action="/shelf/add_to_shelf" class="hiddenShelfForm" method="post"><input name="utf8" type="hidden" value="✓"/><input name="authenticity_token" type="hidden" value="mudzdxpwN0dYB/zLhNf/+bhhkHH9sfQ0NqKtZxtLlhISeGI3hQF7RP/G6mJZoMnK0N9Icu3V2zL9ALSjXxufhQ=="/>
<input id="unique_id" name="unique_id" type="hidden" value="1_book_1885"/>
<input id="book_id" name="book_id" type="hidden" value="1885"/>
<input id="a" name="a" type="hidden"/>
<input id="name" name="name" type="hidden"/>
<input id="from_choice" name="from_choice" type="hidden" value="false"/>
<input id="from_home_module" name="from_home_module" type="hidden" value="false"/>
<input id="page_url" name="page_url" type="hidden"/>
</form>
<button class="wtrShelfButton"></button>
<div class="wtrShelfMenu">
<ul class="wtrExclusiveShelves">
<li><button class="wtrExclusiveShelf" name="name" type="submit" value="to-read">
<span class="progressTrigger">Want to Read</span>
<img alt="saving…" class="progressIndicator" src="https://s.gr-assets.com/assets/loading-trans-ced157046184c3bc7c180ffbfc6825a4.gif"/>
</button>
</li>
<li><button class="wtrExclusiveShelf" name="name" type="submit" value="currently-reading">
<span class="progressTrigger">Currently Reading</span>
<img alt="saving…" class="progressIndicator" src="https://s.gr-assets.com/assets/loading-trans-ced157046184c3bc7c180ffbfc6825a4.gif"/>
</button>
</li>
<li><button class="wtrExclusiveShelf" name="name" type="submit" value="read">
<span class="progressTrigger">Read</span>
<img alt="saving…" class="progressIndicator" src="https://s.gr-assets.com/assets/loading-trans-ced157046184c3bc7c180ffbfc6825a4.gif"/>
</button>
</li>
</ul>
</div>
</div>
<div class="ratingStars wtrRating">
<div class="starsErrorTooltip hidden">
Error rating book. Refresh and try again.
</div>
<div class="myRating uitext greyText">Rate this book</div>
<div class="clearRating uitext">Clear rating</div>
<div class="stars" data-rating="0" data-resource-id="1885" data-submit-url="/review/rate/1885?stars_click=true&amp;wtr_button_id=1_book_1885" data-user-id="0"><a class="star off" href="#" ref="" title="did not like it">1 of 5 stars</a><a class="star off" href="#" ref="" title="it was ok">2 of 5 stars</a><a class="star off" href="#" ref="" title="liked it">3 of 5 stars</a><a class="star off" href="#" ref="" title="really liked it">4 of 5 stars</a><a class="star off" href="#" ref="" title="it was amazing">5 of 5 stars</a></div>
</div>
</div>
</td>
</tr>

5. Extract the book title. Find the book title in the HTML code above.

<a href="/book/show/1885.Pride_and_Prejudice" title="Pride and Prejudice">

We need to extract “Pride and Prejudice”

name = first_book.find('a', class_="bookTitle")
name

This returns the <a> tag in which the book title is stored. But we only want the name of the book, not the surrounding code.

<a class="bookTitle" href="/book/show/1885.Pride_and_Prejudice" itemprop="url">
<span aria-level="4" itemprop="name" role="heading">Pride and Prejudice</span>
</a>

So, we need to use .text.strip() to extract just the book title.

name = first_book.find('a', class_="bookTitle").text.strip()
name
Result:
'Pride and Prejudice'

6. Similarly, we will extract the rest of the information for the first book.

authors = first_book.find('a', class_="authorName").text.strip()
authors
Result:
'Jane Austen'

Extracting the ratings for the first book.

scoring = first_book.find('span', class_="greyText smallText uitext").text.strip().split()
scoring
Result:
['4.28', 'avg', 'rating', '—', '3,634,657', 'ratings']

As you can see, the average rating and the number of ratings are stored in a list, but we need them individually, so we pick them out of the list by index.

avg_scores = scoring[0]
rates = scoring[4]
print("average score:", avg_scores)
print("ratings:", rates)
Result:
average score: 4.28
ratings: 3,634,657

The last things we need to extract are the score and the number of votes.

voted = first_book.find('span', class_="smallText uitext").text.strip().split()
scores = voted[1]
print("scores:", scores)
vote = voted[3]
print("votes:", vote)
Result:
scores: 233,173
votes: 2,358

Step 4: We have successfully extracted the data for the first book, but we need the information about all the books across multiple web pages, not just one.

The list has 50 pages in total, so instead of specifying the URL each time, we are going to generate the URLs in a loop. You can specify any number of pages.

page = 1
while page != 51:
    url = f"https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page={page}"
    print(url)
    page = page + 1
Result:
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=1
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=2
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=3
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=4
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=5
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=6
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=7
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=8
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=9
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=10
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=11
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=12
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=13
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=14
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=15
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=16
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=17
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=18
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=19
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=20
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=21
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=22
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=23
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=24
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=25
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=26
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=27
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=28
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=29
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=30
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=31
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=32
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=33
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=34
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=35
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=36
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=37
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=38
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=39
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=40
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=41
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=42
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=43
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=44
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=45
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=46
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=47
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=48
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=49
https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page=50

Step 5: We follow the same steps as we did for first_book, but instead of printing the values directly, we store them in lists.

page = 1
names = []
ratings = []
avgscores = []
author = []
score = []
votes = []
while page != 51:
    url = f"https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page={page}"
    response = requests.get(url)
    html = response.content
    soup = bs(html, "html.parser")
    book_containers = soup.find_all('tr', itemtype="http://schema.org/Book")
    for container in book_containers:
        if container.find('td', width='100%') is not None:
            name = container.find('a', class_="bookTitle").text.strip()
            names.append(name)
            authors = container.find('a', class_="authorName").text.strip()
            author.append(authors)
            scoring = container.find('span', class_="greyText smallText uitext").text.strip().split()
            ascores = scoring[0]
            avgscores.append(ascores)
            rates = scoring[4]
            ratings.append(rates)
            voted = container.find('span', class_="smallText uitext").text.strip().split()
            scores = voted[1]
            score.append(scores)
            vote = voted[3]
            votes.append(vote)
    page = page + 1

This code will extract all the information up to page 50. Since we are extracting a large amount of information, it will take a few minutes to run.
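If you want to be gentler on the site, you can also pause briefly between pages. This is an optional add-on to the loop above, not part of the original code; a minimal sketch using time.sleep:

import time

page = 1
while page != 51:
    url = f"https://www.goodreads.com/list/show/12362.All_Time_Favorite_Romance_Novels?page={page}"
    response = requests.get(url)
    # ... parse the page and append to the lists exactly as in the loop above ...
    time.sleep(1)  # wait one second before requesting the next page
    page = page + 1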

Once the information has been extracted, we can check whether it is correct.

names
Result:
['Pride and Prejudice',
'Fifty Shades of Grey (Fifty Shades, #1)',
'Beautiful Disaster (Beautiful, #1)',
'Twilight (The Twilight Saga, #1)',
'Perfect Chemistry (Perfect Chemistry, #1)',
'The Notebook (The Notebook, #1)',
...]

Step 6: We need to convert the above lists into a data frame, so that it’s convenient to use and read.

df = pd.DataFrame({'book title': names,
                   'ratings': ratings,
                   'avg_score': avgscores,
                   'author': author,
                   'score': score,
                   'votes': votes
                   })
df
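A quick sanity check worth doing here (my own suggestion, not in the original code): pd.DataFrame requires all six lists to have the same length, and the row count should come out to roughly 100 books per page times the number of pages scraped.

# All six lists must have the same length for pd.DataFrame to accept them.
print([len(x) for x in (names, ratings, avgscores, author, score, votes)])

# Shape and first rows of the resulting data frame.
print(df.shape)   # (number of rows, 6), roughly 100 books x 50 pages
print(df.head())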

Step 7: We have successfully retrieved the data. You can also convert the data frame into a CSV file for further use.

import os
os.makedirs('directory_name', exist_ok=True)
df.to_csv('directory_name/books.csv')

Replace “directory_name” with the name of the directory where you want to save your file; books.csv above is just an example file name.
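To confirm the file was written correctly, you can read it back with pandas. The path below assumes the example file name used above; replace it with your own.

# Read the saved CSV back to verify it.
check = pd.read_csv('directory_name/books.csv')
print(check.shape)
print(check.head())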

This may look difficult and messy at first, but you will enjoy it once you get the hang of it.

Conclusion:

Web scraping makes it easy to retrieve large datasets from webpages. You can try different websites and attempt to extract their information, but do not forget that not all sites give permission to scrape their data. Some of the easiest sites to scrape are Flipkart, Amazon, IMDb, etc. Try to extract data from these sites on your own, and soon you will be an expert.


Akriti Sood

Data Scientist with a strong mathematics background, involved in various data science projects, with a deep passion for and expertise in deep learning and neural networks.