Scraping 100+ Free Data Science Books with Python

And using data science to decide which books to read.

Benedict Neo
Photo by Patrick Tomasso on Unsplash

Data is information, and having enough information is crucial for making the right decisions. How does one procure data easily from the web? The answer is web scraping.

Web scraping is an essential skill for data scientists who need to gather data easily. Machine learning algorithms and experiments require enough data to learn and generalize well on a specific problem, so data scientists often need more data to improve their models and experiments.

Some popular use cases of web scraping include business intelligence, regulating prices, measuring customer satisfaction with sentiment analysis, and more.

It’s clear web scraping is a powerful tool. However, it’s also recommended to adhere to the best practices for web scraping. Here’s a great article on avoiding being blocked when web scraping.

Scraping 100+ data science books

In this article, I will be scraping the article “100+ Free Data Science Books” from the website, which contains many useful resources for learning data science.

I’ll be using Beautiful Soup, a popular library for web scraping, along with data science libraries to transform and visualize the data and gain insights from it.

Why scrape this article? The goal is to decide which book to read from the huge list of 100+ books based on overall rating and total number of ratings.

As always, here’s where you can find the code for this article:

Before we dive into web scraping, bitgrit's latest competition, Viral Tweets Prediction Challenge, is ending soon on July 6, 2021!

If you want to apply your data science knowledge to a real-world problem and win cash prizes up to $3000 💵 , sign up for free now! It's a good learning experience, and you have nothing to lose from participating.

If you're a beginner and don't know how to get started, read our recent article — Using Data Science to Predict Viral Tweets — to guide you step-by-step to build a simple model for this competition.

Now let’s start scraping.

Observing the HTML of the books

When you want to scrape something from the internet, you always start by observing what you want to scrape.

In the article, here is how the books are presented.

image of the book from the website

And this is what it looks like in HTML.

screenshot of HTML of the website

From the inspect tool, we see that all the books sit inside the element with the id BooksWrapper.

Each book is within a section tag, which holds the information we need in specific tags:

  • <div class="star-ratings">: Goodreads rating and number of ratings
  • <div class="book-cats">: book category
  • <h2>: book title
  • <div class="meta-auth">: author name, year
  • <p>: book description
  • <a class="btn" ...>: book link and Amazon review link

Now that we have an idea of which classes and tags to target, we can start coding!

Importing libraries

As always, we start by importing the libraries we need.

  • urllib.request: used to open our website and return the HTML data
  • bs4: the Beautiful Soup library, the star of the show, which helps us extract the right data from the HTML
  • wordcloud: creates word cloud plots for our text data analysis
  • re: Python's regular expression library

Once we have our libraries, we can start creating our Beautiful Soup object.

Getting data with bs4

Using the urlopen function and passing in the URL, we get our HTML data. After that, we create a BeautifulSoup object using the lxml parser. You can also use other parsers as long as they work in your particular case. Read here for the difference between parsers.
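As a rough sketch of this step: the article's URL isn't reproduced here, so the `fetch_html` helper below (a hypothetical name) accepts any URL and is defined but not called, and the parsing is shown on a small made-up HTML string. It also assumes the built-in "html.parser" instead of lxml so that it runs without extra dependencies.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its raw HTML as bytes."""
    with urlopen(url) as response:
        return response.read()


# Hypothetical stand-in for the downloaded page.
sample_html = """
<html>
  <head><title>100+ Free Data Science Books</title></head>
  <body><div id="BooksWrapper"></div></body>
</html>
"""

# "html.parser" ships with Python; pass "lxml" instead if it is installed.
soup = BeautifulSoup(sample_html, "html.parser")
page_title = soup.title.get_text()
print(page_title)
```

On a real page you would pass the result of `fetch_html(url)` to `BeautifulSoup` instead of the sample string.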

Title of our website

Getting HTML of a single book

Calling find_all on the section tag and taking the first result (the first book), you can see it matches exactly what we saw in the inspect tool.

Notice I set the attributes (attrs) as empty because there was an ad in one of the section tags, which looks like this → <section class="ad-block">. Since none of the other section tags have a class name, doing so prevents my find_all call from picking up the ad.
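The exact call isn't shown here, so the following is only one explicit way to get the same effect, not necessarily the author's: pass find_all a function that accepts section tags with no class attribute. The sample HTML and the `is_book_section` name are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two books and one ad block.
sample_html = """
<div id="BooksWrapper">
  <section><h2>Book One</h2></section>
  <section class="ad-block"><p>Sponsored</p></section>
  <section><h2>Book Two</h2></section>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")


def is_book_section(tag):
    """True for <section> tags that carry no class attribute."""
    return tag.name == "section" and not tag.has_attr("class")


books = soup.find_all(is_book_section)
titles = [book.h2.get_text() for book in books]
print(titles)
```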

Searching with bs4

Note that this isn’t a comprehensive tour of bs4. I’m only doing basic web scraping, so I only touch on some of its search functionality.

Here are the methods I used:

  • soup.find() — first occurrence of class/tag
  • soup.find_all() — all occurrences of class/tag
  • soup.find().find() — searching within a class/tag
  • .get_text() — returns the text of the HTML tag
  • .prettify() — pretty output of HTML

You can find more functions for searching on the Beautiful Soup documentation.
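To make the methods above concrete, here is a small sketch on a made-up HTML fragment (the tag contents are assumptions, only the class name book-cats comes from the page structure described earlier).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like one book entry.
sample_html = """
<section>
  <div class="book-cats">Artificial Intelligence</div>
  <h2>Example Book</h2>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</section>
"""
soup = BeautifulSoup(sample_html, "html.parser")

first_p = soup.find("p")                          # first matching tag
all_p = soup.find_all("p")                        # every matching tag
category = soup.find("div", class_="book-cats")   # filter by class name
nested_title = soup.find("section").find("h2")    # search inside a tag

print(first_p.get_text())       # text content of the tag
print(len(all_p))
print(category.get_text())
print(nested_title.prettify())  # indented HTML of the tag
```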

Getting all the information we need

We already observed which tags hold the information we need, namely:

  • book rating
  • the total number of ratings
  • book category
  • book title
  • author name
  • book description
  • book link
  • Amazon review link

So let’s write the code needed to get each piece of information.

Most of the information was easy to obtain using find() and get_text(), but some required more processing in Python to get the exact info we want.

  • For total_ratings, the information looks like this → (342 Ratings), but we only want the number. Using re, we can search with the pattern \d+, which means “match one or more digits”. The search returns a match object, so we call its group function to get our number, 342.
  • For author and year, the text gives us the author followed by the year, like this → Stuart Russell, 1995. To separate the two, we can split the text on the comma using split(', ') and use Python’s tuple unpacking to assign the result to author and year. However, this is a naive method, because later on there will be cases we didn’t consider that break it.
  • For the links, we only want the link itself, so get('href') does the job.
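The three extraction steps above can be sketched as follows. The HTML fragment and the example.com link are made up; the class names meta-auth and btn come from the page structure described earlier. Note that re.search returns a match object (or None), and group() pulls out the matched text as a string.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment shaped like one book entry.
sample_html = """
<section>
  <div class="meta-auth">Stuart Russell, 1995</div>
  <a class="btn" href="https://example.com/book">View Free Book</a>
</section>
"""
book = BeautifulSoup(sample_html, "html.parser").find("section")

# Total ratings: keep only the digits from "(342 Ratings)".
ratings_text = "(342 Ratings)"
match = re.search(r"\d+", ratings_text)
total_ratings = match.group()  # the matched digits, still a string

# Author and year: naive split on ", " with tuple unpacking.
meta = book.find("div", class_="meta-auth").get_text()
author, year = meta.split(", ")

# Link: read the href attribute of the anchor tag.
book_link = book.find("a", class_="btn").get("href")

print(total_ratings, author, year, book_link)
```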

As it is with coding, when we use our methods on more data, we will encounter unexpected problems that we did not account for. And to resolve that, we need to revise our code.

Dealing with missing components

The problems, in this case, are missing components from our book information. The issues are:

  • books without a year or with multiple authors
  • books without rating
  • books without review links
  • books without description

Note there are different ways you can deal with these issues. Below is my way of dealing with them.

Books without a year or with multiple authors

For the first issue, here are three books that will cause problems with our initial code.

As you can see, if we split by comma only, we won’t be able to get the year for books with multiple authors, and for the case of book35, we will be getting “Syracuse University” which is definitely not a year.

To resolve this, we first check whether the text contains any digits, and only perform the split if it does. We still need to account for multiple authors, but after splitting, the year will always be the last element in the list, so we can grab it using [-1] in Python.

If we don’t get a digit, we simply set the year as None.
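A minimal sketch of this logic, under a few assumptions: the function name `split_author_year` is hypothetical, the author names in the usage lines are placeholders, and when no year is found the whole text is kept as the author (the source doesn't say how the author is handled in that case).

```python
import re


def split_author_year(text):
    """Split 'Author[, Author...], Year' into (author, year)."""
    if re.search(r"\d", text):
        parts = text.split(", ")
        year = parts[-1]                 # the year is the last element
        author = ", ".join(parts[:-1])   # everything before it
    else:
        # No digits at all: assume there is no year in the text.
        author, year = text, None
    return author, year


print(split_author_year("Stuart Russell, 1995"))
print(split_author_year("Jane Doe, John Roe, 2015"))
print(split_author_year("Jane Doe, Syracuse University"))
```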

Books without rating

Moving on to books without a rating, we chose book23 which has no rating.

You can observe that calling find on that particular book shows there is no information within the div tag.

Since bs4’s find returns None when it can’t find what we’re looking for, we can add a condition that only runs the code we had before when the search doesn’t return None.

Printing rating and total_ratings, we see that both are None now.
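The inner markup of the star-ratings div isn't shown here, so the sketch below assumes its text looks like "4.2 (342 Ratings)" and pulls both values out with one regular expression. The `get_rating` name and the sample HTML are hypothetical.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: one rated book and one with an empty ratings div.
sample_html = """
<section id="rated"><div class="star-ratings">4.2 (342 Ratings)</div></section>
<section id="unrated"><div class="star-ratings"></div></section>
"""
soup = BeautifulSoup(sample_html, "html.parser")


def get_rating(book):
    """Return (rating, total_ratings), or (None, None) when absent."""
    div = book.find("div", class_="star-ratings")
    text = div.get_text() if div is not None else ""
    match = re.search(r"(\d+(?:\.\d+)?)\s*\((\d+) Ratings\)", text)
    if match is None:
        return None, None
    return match.group(1), match.group(2)


print(get_rating(soup.find(id="rated")))
print(get_rating(soup.find(id="unrated")))
```

This version covers both a missing div and a div that exists but is empty.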

Books without review link

For books with both a book link and a review link, find_all('a') will return two links. Book8 doesn’t have a review link, so the length of its result is only one.

Since all books have a book link, we only have to check whether the length is 2. If it is, we get review_link. Else, we set it to None.

For book8 without a review link, you can see it returns None now.
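A small sketch of the length check, assuming the book link always comes first (the source doesn't state the order). The `get_links` name and the example.com URLs are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one book with both links, one with only a book link.
sample_html = """
<section id="both">
  <a class="btn" href="https://example.com/book">Book</a>
  <a class="btn" href="https://example.com/reviews">Reviews</a>
</section>
<section id="book-only">
  <a class="btn" href="https://example.com/other-book">Book</a>
</section>
"""
soup = BeautifulSoup(sample_html, "html.parser")


def get_links(book):
    """Return (book_link, review_link); review_link may be None."""
    links = book.find_all("a")
    book_link = links[0].get("href")  # assumed to be listed first
    review_link = links[1].get("href") if len(links) == 2 else None
    return book_link, review_link


print(get_links(soup.find(id="both")))
print(get_links(soup.find(id="book-only")))
```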

Books without description

For books without description, the find function already solves it for us since it returns None if it doesn’t exist.

After dealing with all those issues, we can start storing our data and build a pandas data frame.

Storing and building our data frame

First, we create a list for each piece of information so that we can append values to it later on.

Get book info function

To get the information from each book, I created a function that holds the code for extracting each piece of information and appends the results to their respective lists.

Note this was a quick and dirty way to get my data. There are better ways to structure the code and make it cleaner, but it works, and that’s what's important for now.

Using our function, we can iterate over each book within books, which is the bs object that contains all the book information.
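A trimmed-down sketch of this pattern, limited to three fields to keep it short. The sample HTML is made up and the real function handles more fields; the point is only the "extract, guard against None, append" shape.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second book has no description.
sample_html = """
<div id="BooksWrapper">
  <section>
    <h2>Example Book</h2>
    <p>A short description.</p>
    <a class="btn" href="https://example.com/book">Book</a>
  </section>
  <section>
    <h2>Another Book</h2>
    <a class="btn" href="https://example.com/another">Book</a>
  </section>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
books = soup.find_all("section")

# One list per piece of information.
titles, descriptions, book_links = [], [], []


def get_book_info(book):
    """Extract fields from one book and append them to the lists."""
    title = book.find("h2")
    description = book.find("p")
    link = book.find("a")
    titles.append(title.get_text() if title is not None else None)
    descriptions.append(description.get_text() if description is not None else None)
    book_links.append(link.get("href") if link is not None else None)


for book in books:
    get_book_info(book)

print(titles, descriptions, book_links)
```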

Building our Pandas data frame

With all our information in lists, we can build our Pandas data frame!
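For illustration, building a data frame from equal-length lists might look like this. The values are placeholders, not scraped data.

```python
import pandas as pd

# Placeholder lists standing in for the scraped values.
titles = ["Example Book", "Another Book"]
authors = ["Jane Doe", "John Roe"]
years = ["2015", None]
ratings = ["4.2", None]

# Each list becomes one column; the dict keys become the column names.
df = pd.DataFrame(
    {"title": titles, "author": authors, "year": years, "rating": ratings}
)
df.info()
```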

Calling info on our data frame,

We can see that we only have 97 books, which means either the title is wrong or our scraping had some issue, but no worries. We also see that every column has the object data type, which we’ll have to fix later.

But first, let’s clean the data.

Data Cleaning

Remember we set values as None previously in our code; they are the missing values in our data frame.

What can we do to replace these missing values? Here’s what I thought of:

  • book_cat — check the book, and impute it ourselves manually since it’s only one book
  • year — leave it empty for now
  • rating — replace with 0.0
  • total_ratings — replace with 0
  • description & review_link — replace with "None"

If you want to go a step further, you can create a script that iterates over the books, queries each title on sites like Amazon or Goodreads, and grabs the information you’re missing. I won’t do it here, but you’re welcome to try that out!

If you want to brush up on your data cleaning skills, check out our Data Cleaning using Python article.

Let’s start with the book with the missing category!

We can get the specific row where book_cat is null. We can also bring up the categories and figure out which one suits that particular book.

Since it’s under Artificial Intelligence, I chose to replace it with that.

After replacing the missing values in the other columns, we only have the year column with missing values, which I’ll leave empty.
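The replacements listed above could be sketched like this, using a tiny made-up data frame. It is built with dtype=object to mimic the scraped columns, which is an assumption about how the values were stored.

```python
import pandas as pd

# Placeholder rows; dtype=object mimics the all-object scraped frame.
df = pd.DataFrame(
    {
        "title": ["Book A", "Book B"],
        "book_cat": [None, "Data Mining and Machine Learning"],
        "year": ["2015", None],
        "rating": ["4.2", None],
        "total_ratings": ["342", None],
        "description": ["A description.", None],
        "review_link": [None, "https://example.com/reviews"],
    },
    dtype=object,
)

# Fill the single missing category by hand.
df.loc[df["book_cat"].isna(), "book_cat"] = "Artificial Intelligence"

# Per-column replacements; year is deliberately left out.
df = df.fillna(
    {"rating": 0.0, "total_ratings": 0, "description": "None", "review_link": "None"}
)
print(df.isna().sum())
```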

Data Transformation

Next up, we transform the data type of some of our columns.

Columns to convert

year → integer (nullable)
rating → float
total_ratings → integer

Pandas has a useful function, convert_dtypes(), which converts columns to the best possible dtypes. It doesn’t help much in our case because every column is of type object, so it simply converts all our columns to strings.

For year, we then convert the column to Int64, pandas’ nullable integer type, which supports NA values. We convert rating and total_ratings in the same way, to float and integer respectively.
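As an alternative sketch of the same conversions, the snippet below uses pd.to_numeric followed by astype rather than convert_dtypes(), since it copes with columns that mix strings and numbers. This is a swapped-in technique, not the article's exact code, and the data is made up.

```python
import pandas as pd

# Placeholder rows shaped like the cleaned frame (strings plus imputed zeros).
df = pd.DataFrame(
    {
        "title": ["Book A", "Book B"],
        "year": ["2015", None],
        "rating": ["4.2", 0.0],
        "total_ratings": ["342", 0],
    },
    dtype=object,
)

# Parse to numbers first, then cast; Int64 (capital I) keeps missing years as <NA>.
df["year"] = pd.to_numeric(df["year"], errors="coerce").astype("Int64")
df["rating"] = pd.to_numeric(df["rating"], errors="coerce").astype("float64")
df["total_ratings"] = pd.to_numeric(df["total_ratings"], errors="coerce").astype("Int64")

print(df.dtypes)
```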

Now our data is ready, and it’s time to visualize it!

Exploratory Data Analysis

Let’s visualize our data and see if we can find anything interesting from these 100 books.

For text data, I decided to write a function that plots a word cloud.

Word cloud of book titles

By joining all the texts in the title column, we can calculate how many individual words there are.
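A quick sketch of the joining and counting step, using three titles that appear later in this article. With a data frame you would join df["title"] instead of a plain list; lower-casing before counting unique words is an assumption.

```python
# Three titles from the book list, standing in for the title column.
titles = [
    "Automate the Boring Stuff with Python",
    "An Introduction to Statistical Learning",
    "Pattern Recognition and Machine Learning",
]

# Join every title into one long string, then split it into words.
text = " ".join(titles)
words = text.split()
unique_words = set(word.lower() for word in words)

print(len(words), len(unique_words))
```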

With the text ready, we can plot our word cloud.

book titles word cloud

We see that the words Python and Data are predominant, with Machine Learning and Learning coming in close. This makes sense since Python is the most popular language for Data Science.

Word cloud of book descriptions

There are over 990 individual words in the book descriptions.

Let’s see what we get when we plot the word cloud.

This column had “None” imputed, so let’s add it to our list of stop words.
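To show why the imputed "None" needs to be a stop word, here is a sketch that uses collections.Counter as a stand-in for the word cloud (the wordcloud library itself is not used here). The descriptions and the tiny stop-word set are made up.

```python
from collections import Counter

# Placeholder descriptions; "None" marks the imputed missing values.
descriptions = [
    "An introduction to data analysis",
    "None",
    "Data structures in Python",
    "None",
]

# Hypothetical stop-word set, including the imputed placeholder.
stop_words = {"an", "to", "in", "none"}

words = " ".join(descriptions).lower().split()
counts = Counter(word for word in words if word not in stop_words)

print(counts.most_common(3))
```

Without "none" in the stop words, it would tie with "data" as the most frequent word and dominate the plot.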

book description word cloud

We see the word data is highly prevalent, along with the words book and programming. The words Python and introduction are also close behind. This suggests the books we scraped are mostly introductory programming books related to data, written around the Python language.

Book category

Our books had many categories; which one was the most common?
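One way to get the counts behind such a plot is value_counts, sketched here on a few placeholder rows.

```python
import pandas as pd

# Placeholder categories standing in for the book_cat column.
df = pd.DataFrame(
    {
        "book_cat": [
            "Data Mining and Machine Learning",
            "Learning Languages",
            "Data Mining and Machine Learning",
            "Artificial Intelligence",
        ]
    }
)

# Counts per category, sorted from most to least common.
category_counts = df["book_cat"].value_counts()
print(category_counts)
```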

From our plot, we see Data Mining and Machine learning was the most common, with Learning Languages coming in second.

Book year

We can also plot the year count and find out what year the books are most commonly released.

From our plot, we see that 2015 was the most common year in our list of 100 books, with exactly 18 books.

Book rating and total ratings

Since the rating and total_ratings columns have a large share of missing data, 40 out of 97 rows (about 41%), we can expect the data to be quite skewed.

Calling describe on the columns, we see that 4.6 is the highest rating and the maximum number of ratings is 1659.

Plotting a histogram of the columns:

Considering we have quite a lot of missing data and the fact we have very little data, we see that the distribution is pretty skewed.

This is more evident when we plot a boxplot.

For total_ratings, notice how three points are extreme outliers.

We can also plot both these columns together with a strip plot.

We see that there are three books with a high rating and total number of ratings.

Which book to read?

Let’s say you stumble upon these 100 books, and you have no idea what to read; why not look at the rating and total ratings to help you decide? Just like movie ratings, we can usually trust the consensus on whether the book is of good quality.

By scraping the data and visualizing it, we can gain insights that help us make decisions, which is essentially what Data Science is.

Which are the three books with high total ratings and ratings from our plot?

Based on the last plot, we saw three data points with total ratings above 1500 and ratings above 4.0 (or 4.2 to be more exact).
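The filter described above can be sketched with a boolean mask. The rows below are placeholders, not the scraped data.

```python
import pandas as pd

# Placeholder rows with numeric rating columns.
df = pd.DataFrame(
    {
        "title": ["Book A", "Book B", "Book C"],
        "rating": [4.3, 3.9, 4.5],
        "total_ratings": [1600, 1700, 12],
    }
)

# Keep rows that pass both thresholds; note the parentheses around each condition.
popular = df[(df["total_ratings"] > 1500) & (df["rating"] > 4.0)]
print(popular)
```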

Voila! The three books are:

  • Automate the Boring Stuff with Python
  • An Introduction to Statistical Learning
  • Pattern Recognition and Machine Learning

What are the top 10 books in total ratings?

We can sort our data frame based on columns, so let’s see what the top 10 books are in total ratings.

Aside from the top three books earlier, we see some more popular data science books, namely NLP with Python, Python for Everybody, and Artificial Intelligence: A Modern Approach.

What are the top 10 books in terms of rating and total ratings?

How about sorting by rating and total_ratings?
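Both sorts can be sketched with sort_values, here on placeholder rows. Passing a list of columns sorts by the first and breaks ties with the second.

```python
import pandas as pd

# Placeholder rows with numeric rating columns.
df = pd.DataFrame(
    {
        "title": ["Book A", "Book B", "Book C", "Book D"],
        "rating": [4.3, 3.9, 4.5, 4.5],
        "total_ratings": [1600, 1700, 5, 40],
    }
)

# Top 10 by number of ratings.
top_by_count = df.sort_values("total_ratings", ascending=False).head(10)

# Top 10 by rating, with ties broken by number of ratings.
top_by_rating = df.sort_values(["rating", "total_ratings"], ascending=False).head(10)

print(top_by_count["title"].tolist())
print(top_by_rating["title"].tolist())
```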

We see that some new books popped up on the list. However, some of the books have few ratings. For example, the book Elementary Differential Equations has only 5 ratings, and it’s hard to say whether we can trust the book is good or not.

Can you trust the results?

To be honest, our data set is tiny. With only about 100 books, and around 40 of them missing both a rating and total ratings, our results will be biased. To make a sound data-driven decision, we should increase our sample size and fill in the missing data with more scraping.

If you want to take this even further, you can also calculate the weighted rating of the books the same way IMDB ranks their top films.
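As a sketch of that idea, the commonly cited IMDb-style formula is WR = (v / (v + m)) * R + (m / (v + m)) * C, where R is the book's rating, v its number of ratings, m a minimum-ratings threshold, and C the mean rating across all books. The value of m and the rows below are assumptions for illustration.

```python
import pandas as pd

# Placeholder rows with numeric rating columns.
df = pd.DataFrame(
    {
        "title": ["Book A", "Book B", "Book C"],
        "rating": [4.3, 4.6, 3.5],
        "total_ratings": [1600, 5, 300],
    }
)

m = 100                  # hypothetical minimum number of ratings
C = df["rating"].mean()  # mean rating across all books
v = df["total_ratings"]
R = df["rating"]

# Books with few ratings are pulled toward the overall mean C.
df["weighted_rating"] = (v / (v + m)) * R + (m / (v + m)) * C
ranked = df.sort_values("weighted_rating", ascending=False)
print(ranked[["title", "weighted_rating"]])
```

With this weighting, a book rated 4.6 by only 5 readers ranks below a book rated 4.3 by 1600 readers.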

Interactive data table

If you’re in Google Colab, you can run this command to get an interactive table and do more exploration like sorting each column, filtering, etc.

You now have 100 books in a data frame.

You can decide to export it into a CSV file like this.
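A minimal sketch of the export, writing to the system's temporary folder so it can run anywhere. The file name books.csv and the single row are placeholders.

```python
import os
import tempfile

import pandas as pd

# Placeholder frame standing in for the full book table.
df = pd.DataFrame({"title": ["Book A"], "rating": [4.2]})

# Hypothetical file name; index=False leaves out the row numbers.
csv_path = os.path.join(tempfile.gettempdir(), "books.csv")
df.to_csv(csv_path, index=False)

# Read it back to check the round trip.
restored = pd.read_csv(csv_path)
print(restored)
```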

A cool thing you can do with this CSV file is iterate over the book_link column and download all the books to your computer.

Thanks for reading!

That’s all for this article, and I hope you got a glimpse of the power of web scraping!

