Scraping 100+ Free Data Science Books with Python
And using data science to decide which books to read.
Data is information, and having enough information is crucial for making the right decisions. How does one procure data easily from the web? The answer is web scraping.
Web scraping is an essential skill for data scientists to procure the data they need easily. Machine Learning algorithms and experiments require enough data to learn and generalize well on a specific problem. So often enough, data scientists need to get more data to improve models and experiments.
Some popular use cases of web scraping include business intelligence, price monitoring, gauging customer satisfaction through sentiment analysis, and more.
Scraping 100+ data science books
I’ll be using the Beautiful Soup library, a popular library for web scraping, along with the usual data science libraries to transform and visualize the data and gain insights from it.
Why scrape this article? The goal is to decide which book to read from the huge list of 100 books, based on the overall rating and the total number of ratings.
As always, here’s where you can find the code for this article:
If you want to apply your data science knowledge to a real-world problem and win cash prizes of up to $3000 💵, sign up for free now! It’s a good learning experience, and you have nothing to lose from participating.
If you’re a beginner and don’t know how to get started, read our recent article — Using Data Science to Predict Viral Tweets — to guide you step-by-step to build a simple model for this competition.
Now let’s start scraping.
Observing the HTML of the books
When you want to scrape something from the internet, you always start by observing what you want to scrape.
In the article, here is how the books are presented.
And this is what it looks like in HTML.
From the inspect tool, we see all the books are contained within a single id. Each book is within a section tag, which has the information we need in specific tags:
<div class="star-ratings"> — Goodreads rating and number of ratings
<div class="book-cats"> — book category
<h2> — book title
<div class="meta-auth"> — author name, year
<p> — book description
<a class="btn" ...> — book link and Amazon review link
Now that we have an idea of which classes and tags to target, we can start coding!
As always, we start by importing the libraries we need.
urllib.request — used to open our website and return the HTML data.
bs4 — the Beautiful Soup library, the star of the show, which helps us extract the right data from the HTML.
wordcloud — creates word cloud plots for our text data analysis.
re — Python’s regular expression library.
After we have our libraries, we can start creating our beautiful soup object.
Getting data with bs4
Using the urlopen function and passing in the URL, we get our HTML data. After that, we create a Beautiful Soup object using the lxml parser. You can also use other parsers, as long as they work in your particular case. Read here for the differences between parsers.
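The two steps above can be sketched as follows. Since the actual article URL isn’t reproduced here, a small inline HTML snippet stands in for the fetched page, and the built-in html.parser stands in for lxml so the sketch runs anywhere:

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

# In the real script you would fetch the live page, e.g.:
#   html = urlopen("https://example.com/100-free-data-science-books").read()
#   soup = BeautifulSoup(html, "lxml")
# Here an inline snippet stands in for the fetched page so the sketch runs offline.
html = "<html><head><title>100+ Free Data Science Books</title></head></html>"
soup = BeautifulSoup(html, "html.parser")  # html.parser behaves the same for this snippet

print(soup.title.get_text())  # → 100+ Free Data Science Books
```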
Title of our website
Getting HTML of a single book
Using find_all on the section tag and taking the first occurrence (the first book), you can see it is exactly what it looks like on the inspect page.
Notice I set the attributes (attrs) as empty because there was an ad in one of the section tags, which looks like this → <section class="ad-block">. Since none of the other section tags have a class name, doing so prevents the find_all function from picking up the ad.
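One way to express that filter is shown below. This is a sketch against a toy HTML snippet mirroring the structure described above, not the article’s exact markup:

```python
from bs4 import BeautifulSoup

# Toy HTML: book sections have no class attribute, the ad section does
html = """
<section><h2>Book One</h2></section>
<section class="ad-block">An advertisement</section>
<section><h2>Book Two</h2></section>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only <section> tags without a class attribute, skipping the ad
books = [s for s in soup.find_all("section") if not s.get("class")]
print(len(books))  # → 2
```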
Searching with bs4
Note that this isn’t a comprehensive tour of bs4; I’m only doing basic web scraping, so I only touched on some of the searching functionality in bs4.
Here are the methods I used:
soup.find() — first occurrence of a class/tag
soup.find_all() — all occurrences of a class/tag
soup.find().find() — searching within a class/tag
.get_text() — returns the text of an HTML tag
.prettify() — pretty-prints the HTML
You can find more functions for searching on the Beautiful Soup documentation.
Getting all the information we need
We already observed which tags hold the necessary information, which is:
- book rating
- the total amount of ratings
- book category
- book title
- book description
- book link
- Amazon review link
So let’s write the code needed to get each piece of information.
Most of the information was easy to obtain using get_text(), but some required extra extraction in Python to get the exact info we want.
For total_ratings, the information looks like this → (342 Ratings), but we only want the number. Using re, we can search with the regex pattern \d+, which means “match a digit (\d) one or more times (+)”. This returns a match object, so we can call its group method to get our number, 342.
For year, the information gives us the author and year separated by a comma, like this → Stuart Russell, 1995. To separate this into author and year, we can split the text on the comma using split(', ') and use Python’s tuple unpacking to assign the results to author and year. However, this is a naive method; later on, there will be cases we didn’t consider that break it.
For the links, we only want the link itself, so get('href') easily does that for us.
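The three extractions above can be sketched like this, with the raw strings hard-coded as stand-ins for what get_text() would return:

```python
import re

# Stand-ins for the raw text scraped from the tags
raw_ratings = "(342 Ratings)"
raw_meta = "Stuart Russell, 1995"

# \d+ matches one or more digits; .group() pulls the matched text out
total_ratings = int(re.search(r"\d+", raw_ratings).group())

# Naive split plus tuple unpacking (breaks on the edge cases handled later)
author, year = raw_meta.split(", ")

print(total_ratings)   # → 342
print(author, year)    # → Stuart Russell 1995
```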
As it is with coding, when we run our methods on more data, we encounter unexpected problems we didn’t account for. To resolve them, we need to revise our code.
Dealing with missing components
The problems, in this case, are missing components from our book information. The issues are:
- books without year & multiple authors
- books without rating
- books without review links
- books without description
Note there are different ways you can deal with these issues. Below is my way of dealing with them.
books without year & multiple authors
For the first issue, here are three books that will cause problems with our initial code.
As you can see, if we split by the comma only, we won’t be able to get the year for books with multiple authors, and in the case of book 35, we would get “Syracuse University”, which is definitely not a year.
To resolve this, we can first check whether the text contains any digits, and only then perform the split. We still need to account for multiple authors, so after splitting, the year will be the last element of the list, which we can grab with [-1] in Python. If we don’t find a digit, we simply set the year to None.
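A minimal sketch of that logic (the example strings are illustrative, not the article’s exact data):

```python
import re

def split_author_year(meta):
    """Return (author, year); year is None when the text holds no digits."""
    if re.search(r"\d", meta):                    # only split when a year is present
        parts = meta.split(", ")
        return ", ".join(parts[:-1]), parts[-1]   # the year is the last element
    return meta, None                             # no digits → no year

print(split_author_year("Stuart Russell, 1995"))
# → ('Stuart Russell', '1995')
print(split_author_year("Trevor Hastie, Robert Tibshirani, 2009"))
# → ('Trevor Hastie, Robert Tibshirani', '2009')
print(split_author_year("Anonymous, Syracuse University"))
# → ('Anonymous, Syracuse University', None)
```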
books without rating
Moving on to books without a rating, we chose book 23, which has no rating.
You can observe that calling find for that particular book shows there is no information within the star-ratings div. Since bs4’s find already returns None when there is nothing to match, we can simply add a condition: if the search doesn’t give None, we use the same code as before; otherwise, rating and total_ratings are set to None. Checking those values for this book, we see that they are indeed None.
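A sketch of that None-safe extraction, again against toy markup rather than the real page:

```python
from bs4 import BeautifulSoup
import re

html_with = '<section><div class="star-ratings">4.4 (342 Ratings)</div></section>'
html_without = "<section><h2>Book with no rating</h2></section>"

def get_ratings(book):
    tag = book.find("div", class_="star-ratings")
    if tag is None:                        # find returns None when the tag is absent
        return None, None
    nums = re.findall(r"[\d.]+", tag.get_text())
    return float(nums[0]), int(nums[1])    # rating, total number of ratings

book = BeautifulSoup(html_with, "html.parser").find("section")
print(get_ratings(book))   # → (4.4, 342)
book = BeautifulSoup(html_without, "html.parser").find("section")
print(get_ratings(book))   # → (None, None)
```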
Books without review link
For books with both a book link and a review link, find_all('a') will return two links. Book 8 doesn’t have a review link, so its result has a length of only one. Since all books have a book link, we only have to check whether the length is 2. If it is, we get the review_link; else, we set it to None. Calling this for book 8, which has no review link, you can see it returns None.
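The length check can be sketched like this, with hypothetical example.com links standing in for the real ones:

```python
from bs4 import BeautifulSoup

html_two = ('<section><a class="btn" href="https://example.com/book8">Book</a>'
            '<a class="btn" href="https://example.com/review8">Review</a></section>')
html_one = '<section><a class="btn" href="https://example.com/book8">Book</a></section>'

def get_links(book):
    links = book.find_all("a")
    book_link = links[0].get("href")                 # every book has a book link
    # Only grab a review link when a second <a> tag exists
    review_link = links[1].get("href") if len(links) == 2 else None
    return book_link, review_link

book = BeautifulSoup(html_two, "html.parser").find("section")
print(get_links(book))  # → ('https://example.com/book8', 'https://example.com/review8')
book = BeautifulSoup(html_one, "html.parser").find("section")
print(get_links(book))  # → ('https://example.com/book8', None)
```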
Books without description
For books without a description, the find function already solves it for us, since it returns None when the tag doesn’t exist.
After dealing with all those issues, we can start storing our data and build a pandas data frame.
Storing and building our data frame
First, we create a list object for each piece of information, so we can append to these lists later on.
Get book info function
To get the information from each book, I created a function that contains the code for extracting each field and appends the results to their respective lists.
Note this was a quick and dirty way to get my data. There are better ways to structure the code and make it cleaner, but it works, and that’s what's important for now.
Using our function, we can iterate each book within books, which is the bs object that contains all the book information.
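A condensed sketch of the function and loop, covering only a few fields and using toy markup in place of the real books object:

```python
from bs4 import BeautifulSoup
import re

titles, categories, ratings = [], [], []   # one list per field

def get_book_info(book):
    """Append one book's fields to the lists (condensed sketch)."""
    titles.append(book.find("h2").get_text())
    cat = book.find("div", class_="book-cats")
    categories.append(cat.get_text() if cat else None)
    stars = book.find("div", class_="star-ratings")
    ratings.append(float(re.search(r"[\d.]+", stars.get_text()).group()) if stars else None)

html = """
<section><h2>Book A</h2><div class="book-cats">Python</div>
<div class="star-ratings">4.4 (342 Ratings)</div></section>
<section><h2>Book B</h2></section>
"""
books = BeautifulSoup(html, "html.parser").find_all("section")
for book in books:           # iterate every book section
    get_book_info(book)

print(titles)     # → ['Book A', 'Book B']
print(ratings)    # → [4.4, None]
```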
Building our Pandas data frame
With all our information in lists, we can build our Pandas data frame!
Calling info on our data frame, we can see that we only have 97 books, which means either the article’s title overstates the count or our scraping missed a few, but no worries. We also see the data types are all object, which we’ll have to fix later.
But first, let’s clean the data.
Remember the values we set to None earlier in our code; they are the missing values in our data frame.
What can we do to replace these missing values? Here’s what I thought of:
book_cat — check the book, and impute it ourselves manually since it’s only one book
year — leave it empty for now
rating — replace with
total_ratings — replace with
review_link — replace with
If you want to go a step further, you can create a script that iterates over the books, queries each title on sites like Amazon or Goodreads, and then grabs the missing information. I won’t do it here, but you’re welcome to try it out!
If you want to brush up your data cleaning skills, check out our Data Cleaning using Python article.
Let’s start with the book with the missing category!
We can get the specific row where book_cat is null. We can also bring up all the categories and figure out which one suits this particular book.
Since it’s under Artificial Intelligence, I chose to replace it with that.
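That imputation can be done with a single .loc assignment; here is a sketch on a two-row toy frame mirroring our columns:

```python
import pandas as pd

# Toy frame with one missing category, mirroring our situation
df = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "book_cat": ["Learning Python", None],
})

# Locate the null row and impute the category manually
df.loc[df["book_cat"].isnull(), "book_cat"] = "Artificial Intelligence"
print(df["book_cat"].tolist())  # → ['Learning Python', 'Artificial Intelligence']
```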
After replacing the missing values in the other columns, we only have the year column with missing values, which I’ll leave empty.
Next up, we transform the data type of some of our columns.
Columns to convert
Pandas has a useful function, convert_dtypes(), which converts columns to the best possible dtypes. It’s not a complete fix in our case, since all our data types are objects, but it will at least convert all our columns to strings.
For year, we convert it to Int64, which supports NA values. We do the same for total_ratings.
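A sketch of the conversion on a toy frame (pd.to_numeric handles the string-to-number step before the cast to the nullable Int64 dtype):

```python
import pandas as pd

# Toy frame mirroring our columns: everything is an object after scraping
df = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "year": ["1995", None],
})

# Nullable Int64 keeps <NA> for the missing year
df["year"] = pd.to_numeric(df["year"]).astype("Int64")
df = df.convert_dtypes()   # remaining object columns become pandas strings

print(df.dtypes.astype(str).tolist())  # → ['string', 'Int64']
```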
Now our data is ready, and it’s time to visualize it!
Exploratory Data Analysis
Let’s visualize our data and see if we can find anything interesting from these 100 books.
For text data, I decided to build a plot word cloud function.
Word cloud of book titles
By joining all the texts in the title column, we can calculate how many individual words there are.
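The counting step amounts to joining the column and taking a set of its words. A rough sketch, using three sample titles rather than the full column:

```python
# Join the titles into one text blob and count the distinct words
titles = [
    "Automate the Boring Stuff with Python",
    "An Introduction to Statistical Learning",
    "Pattern Recognition and Machine Learning",
]
text = " ".join(titles)
unique_words = set(text.lower().split())
print(len(unique_words))  # → 15
```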
With the text ready, we can plot our word cloud.
We see that the words Python and Data are predominant, with Machine Learning and Learning coming in close behind. This makes sense, since Python is the most popular language for data science.
Word cloud of book descriptions
There are over 990 individual words in the book descriptions.
Let’s see what we get when we plot the word cloud.
This column had “None” imputed, so let’s add it to our list of stop words.
We see the word data is highly prevalent, along with the words book and programming. The words Python and introduction are also close behind. This suggests the books we scraped are mostly introductory programming books related to data and written for Python.
Our books had many categories; which one was the most common?
From our plot, we see Data Mining and Machine Learning were the most common categories, with Learning Languages coming in second.
We can also plot the year count and find out what year the books are most commonly released.
From our plot, we see the year 2015 was the most common in our list of 100 books, with exactly 18 books.
Book rating and total ratings
Since the rating and total_ratings columns had a high amount of missing data, 40 out of 97 rows (almost 50%), we can expect the data to be quite skewed.
Calling describe on the columns, we see 4.6 is the highest book rating, and the maximum number of ratings is 1,659.
Plotting a histogram of the columns:
Considering we have a lot of missing values and very little data overall, the distributions are, unsurprisingly, quite skewed.
This is more evident when we plot a boxplot.
For total_ratings, notice how three points are extreme outliers.
We can also plot both these columns together with a strip plot.
We see that there are three books with a high rating and total number of ratings.
Which book to read?
Let’s say you stumble upon these 100 books, and you have no idea what to read; why not look at the rating and total ratings to help you decide? Just like movie ratings, we can usually trust the consensus on whether the book is of good quality.
By scraping the data and visualizing it, we can gain insights that help us make decisions, which is essentially what data science is all about.
Which are the three books with high total ratings and ratings from our plot?
Based on the last plot, we saw three data points with total ratings above 1500 and ratings above 4.0 (4.2, to be exact).
Voila! The three books are:
- Automate the Boring Stuff with Python
- An Introduction to Statistical Learning
- Pattern Recognition and Machine Learning
What are the top 10 books in total ratings?
We can sort our data frame based on columns, so let’s see what the top 10 books are in total ratings.
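The sorting itself is a one-liner with sort_values; a sketch on a toy frame (taking the top 2 instead of the top 10 for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "total_ratings": [1659, 120, 987],
})

# Sort descending by total_ratings and take the top entries
top = df.sort_values("total_ratings", ascending=False).head(2)
print(top["title"].tolist())  # → ['Book A', 'Book C']
```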
Aside from the top three books earlier, we see some more popular data science books, namely Natural Language Processing with Python, Python for Everybody, and Artificial Intelligence: A Modern Approach.
What are the top 10 books in terms of total rating and rating?
How about sorting by both columns? We see that some new books popped up on the list. However, some of them have very few ratings. For example, the book Elementary Differential Equations has only 5 ratings, and it’s hard to say whether we can trust that the book is good.
Can you trust the results?
To be fair, our data set is tiny. With only 100 books, plus around 40 missing values in the rating and total_ratings columns, our results will be biased. To make a truly data-driven decision, we should increase our sample size and fill in the missing data with more scraping.
If you want to take this even further, you can also calculate the weighted rating of the books the same way IMDB ranks their top films.
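The IMDB-style formula blends a book’s own rating with the global mean, so a 5.0 from a handful of readers can’t outrank a well-attested 4.4. A sketch, with the threshold m and global mean C chosen arbitrarily for illustration:

```python
def weighted_rating(R, v, m, C):
    """IMDB-style weighted rating: shrinks low-vote ratings toward the mean.
    R: the book's average rating, v: its number of ratings,
    m: minimum ratings required to be listed, C: mean rating across all books."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# A book rated 5.0 by only 5 readers vs. one rated 4.4 by 1659 readers
print(round(weighted_rating(5.0, 5, 100, 4.0), 2))     # → 4.05
print(round(weighted_rating(4.4, 1659, 100, 4.0), 2))  # → 4.38
```

Note how the heavily-rated book comes out ahead despite the lower raw rating.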
Interactive data table.
If you’re in Google Colab, you can run this command to get an interactive table and do more exploration, like sorting each column, filtering, etc.
You now have 100 books in a data frame.
You can also export it to a CSV file like this.
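A sketch of the export step on a toy frame (the filename and example.com links are placeholders):

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "book_link": ["https://example.com/a.pdf", "https://example.com/b.pdf"],
})

# Export without the index column
df.to_csv("data_science_books.csv", index=False)

# Reading it back confirms the round trip
print(pd.read_csv("data_science_books.csv")["title"].tolist())  # → ['Book A', 'Book B']
```

From there, a loop with urllib.request.urlretrieve over the book_link column could download each file (network access required).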
A cool thing you can do with this CSV file is iterate over the book_link column and download all 100 books to your computer.
Thanks for reading!
That’s all for this article, and I hope you got a glimpse of the power of web scraping!
Here are some resources and tutorials on web scraping:
If you like these kinds of articles, be sure to follow the bitgrit Data Science Publication for more!
Liked what you read? Here are some articles you may enjoy: