Web Scraping with Python using BeautifulSoup
How to parse and extract data from HTML documents in simple steps
One of the main concerns when starting a new project is how to obtain the data we’ll be working with. Companies like Airbnb and Twitter, for instance, simplify this task by providing APIs, so we can compile information in an organized way. On other occasions, we can download a structured dataset that is already cleaned and ready to use, as in some Kaggle competitions. More often than not, however, we’ll need to explore the web ourselves to find and extract the data we want.
That’s when web scraping comes in handy. The idea is to extract information from a website and convert it into a format suitable for analysis. While there are several tools available for this purpose, in this article we’ll use BeautifulSoup, a Python library designed for easily pulling data out of HTML and XML files.
Here, we’ll visit a Wikipedia page that contains several lists of best-selling books and extract the second table, which covers books with between 50 million and 100 million copies sold.
We only need two packages to handle the HTML file. We’ll also be using pandas to create a data frame from the extracted data:
`requests` - allows us to send HTTP requests and download the HTML code from the webpage;
`beautifulsoup4` - used to pull data out of the raw HTML file;
`pandas` - a Python library for data manipulation; we'll use it to create our data frame.
Extracting the HTML file
To extract the raw HTML file, we simply pass the website URL into the `requests.get()` function.
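In code, this step might look like the sketch below. The URL is an assumption on my part, since the article links to the Wikipedia page directly:

```python
import requests

# Assumed URL of the Wikipedia page with the best-selling books lists
url = "https://en.wikipedia.org/wiki/List_of_best-selling_books"

response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
html_text = response.text    # the raw HTML as one long string
```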
We now have an unstructured block of text containing the HTML code downloaded from the URL we passed.
Let’s take a look:
As you can see, the HTML code that requests delivers is quite messy for analysis. That’s where BeautifulSoup can help.
Creating a BeautifulSoup object
Now we can start working with BeautifulSoup. Let’s generate a BeautifulSoup object called `soup`, passing in the `html_text` string created above.
Next, we can use the method `prettify()` to display the object in a structured, indented format. Notice how the formatted output is much easier to read and work with than the raw `html_text` we first generated.
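A minimal sketch of these two steps, using a short inline HTML string as a stand-in for the downloaded `html_text`:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the HTML downloaded with requests
html_text = "<html><body><p>Best-selling books</p></body></html>"

# Parse the raw string into a BeautifulSoup object
soup = BeautifulSoup(html_text, "html.parser")

# prettify() returns the markup re-indented, roughly one tag per line
print(soup.prettify())
```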
Inspecting the Wikipedia page
On the Wikipedia page, let’s inspect the elements of the web page with the browser’s developer tools. (On Windows, press Ctrl + Shift + I; on a Mac, press Cmd + Opt + I.)
Notice that all the tables have a class of `wikitable sortable`. We can take advantage of that to select all the tables in the HTML file.
Extracting the table
We save the tables in a variable called `wiki_tables`, using the method `find_all()` to search for all HTML `table` tags with a class of `wikitable sortable`.
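Sketched with a toy document, where two small tables stand in for Wikipedia's:

```python
from bs4 import BeautifulSoup

# Toy stand-in: two tables carrying the same class as Wikipedia's
html_text = """
<table class="wikitable sortable"><tr><td>first</td></tr></table>
<table class="wikitable sortable"><tr><td>second</td></tr></table>
"""
soup = BeautifulSoup(html_text, "html.parser")

# class_ has a trailing underscore so it doesn't clash with the keyword
wiki_tables = soup.find_all("table", class_="wikitable sortable")
print(len(wiki_tables))  # one entry per matching table
```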
As we want the second table on the page (between 50 million and 100 million copies), let’s narrow our search down to the second `wiki_tables` element. Let’s also extract each row (`tr`) in that table.
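Continuing with a self-contained toy example, picking the second table and its rows might look like this:

```python
from bs4 import BeautifulSoup

# Toy stand-in: the second table has two rows
html_text = """
<table class="wikitable sortable"><tr><td>a</td></tr></table>
<table class="wikitable sortable"><tr><td>b</td></tr><tr><td>c</td></tr></table>
"""
soup = BeautifulSoup(html_text, "html.parser")
wiki_tables = soup.find_all("table", class_="wikitable sortable")

second_table = wiki_tables[1]       # index 1 is the second table
rows = second_table.find_all("tr")  # every <tr> row in that table
print(len(rows))
```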
Now, we’ll create an empty list called `table_list` and append the contents of each table cell (`td`) to it.
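A sketch of that loop, using a made-up one-row table in place of the Wikipedia markup:

```python
from bs4 import BeautifulSoup

# Made-up table standing in for the Wikipedia markup
html_text = """
<table class="wikitable sortable">
  <tr><th>Book</th><th>Author(s)</th></tr>
  <tr><td>Example Title</td><td>Example Author</td></tr>
</table>
"""
soup = BeautifulSoup(html_text, "html.parser")
rows = soup.find("table").find_all("tr")

table_list = []
for row in rows:
    # Header rows use <th>, data rows use <td>; grab both
    cells = row.find_all(["th", "td"])
    table_list.append([cell.get_text(strip=True) for cell in cells])

print(table_list)
```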
We have successfully extracted the second table from the website into a list, and we’re all set to start analyzing the data.
Creating a pandas DataFrame
Finally, we can simply convert the list into a pandas DataFrame to visualize the data we extracted from Wikipedia.
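A minimal sketch, assuming the first entry of `table_list` holds the header row (the column names and values here are my guess at the table's layout, for illustration only):

```python
import pandas as pd

# Example rows standing in for the scraped table_list
table_list = [
    ["Book", "Author(s)", "Approximate sales"],
    ["Example Title", "Example Author", "55 million[1]"],
]

# The first row becomes the column names, the rest become the data
df = pd.DataFrame(table_list[1:], columns=table_list[0])
print(df.shape)
```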
That’s it! With a few steps and a few lines of code, we now have a data frame extracted from an HTML table, ready for analysis. There are still some adjustments that could be made, such as removing the square-bracket references in the approximate sales column, but the web scraping itself is done!
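One way to make that cleanup, sketched with a made-up column, is to strip Wikipedia-style footnote markers like `[1]` with a regular expression:

```python
import pandas as pd

# Made-up sales figures carrying Wikipedia-style footnote markers
df = pd.DataFrame({"Approximate sales": ["100 million[7]", "85 million[8][9]"]})

# Remove every "[number]" reference from the column
df["Approximate sales"] = df["Approximate sales"].str.replace(
    r"\[\d+\]", "", regex=True
)
print(df["Approximate sales"].tolist())
```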
For the full code, please refer to the notebook.