Web-scraping tables in Python using Beautiful Soup

Thiago Santos Figueira · Published in Geek Culture · 6 min read · May 27, 2021
Photo by Piotr Miazga on Unsplash

We do not always have access to a neat, organized dataset available in the .csv format; sometimes, the data we need is only available on the web, and we have to be capable of collecting it. Luckily for us, Python has a solution in the form of the Beautiful Soup package.

We should start by making the library available in our environment.

pip install beautifulsoup4
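
This tutorial also relies on requests and pandas, plus the html5lib parser for the second dataset; if they are not already in your environment, they can be installed the same way:

pip install requests pandas html5lib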

From the documentation, we learn that:

Beautiful Soup is a Python library for pulling data out of HTML and XML files.

Today, we will look at datasets that are formatted as tables in HTML. Before we move on, I would like to give you a brief reminder of the core structure of these tables.

Table structure in HTML

I understand not everyone is familiar with HTML; if nothing else, the image below is a good reminder of the basic structure of HTML tables.

Observe that each table row (TR) has one or more table data (TD). This means that we can iterate over each row, then extract each column data.
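
As a minimal sketch of that idea (using a tiny, made-up table rather than the real page), the iteration looks like this in Beautiful Soup:

from bs4 import BeautifulSoup

# A tiny, made-up table with the TR/TD structure described above
html = """
<table>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>First row</td><td>1</td></tr>
  <tr><td>Second row</td><td>2</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
for row in soup.find_all('tr'):    # iterate over each table row (TR)
    cells = row.find_all('td')     # extract each table data cell (TD)
    if cells:                      # the header row only has TH cells, so skip it
        print([cell.text for cell in cells])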

Now, let us have a look at the data we will scrape. I chose two datasets to demonstrate different approaches using the Beautiful Soup library. The first one is the Manaus neighborhood list; the second is the Toronto neighborhood list (a part of it).

1. Manaus neighborhood list

To give you some context, Manaus is a city in the state of Amazonas, Brazil. The image below shows one of its postcard sights: the Teatro Amazonas (Amazon Theatre).

Photo by Rivail Júnior on Unsplash

The screenshot below shows the first few rows of our first dataset. It is available on this Wikipedia page.

There is a total of 63 neighborhoods in Manaus.

The column names are in Portuguese, the official language of Brazil. In English, they correspond to the neighborhood name, its zone, area, population, density, and number of homes.

Notice that neighborhoods are organized into zones (South, North, East, South-Center, etc.). Some are larger than others in total area and in demographic density.

Let us begin the data collection!

# Importing the required libraries
import requests
import pandas as pd
from bs4 import BeautifulSoup

After importing the necessary libraries, we have to download the actual HTML of the site.

# Downloading contents of the web page
url = "https://pt.wikipedia.org/wiki/Lista_de_bairros_de_Manaus"
data = requests.get(url).text

Then we create a BeautifulSoup object.

# Creating BeautifulSoup object
soup = BeautifulSoup(data, 'html.parser')

We now have the HTML of the page, so we need to find the table we want. We could simply retrieve the first table available, but the page may contain more than one, which is common on Wikipedia pages. For this reason, we have to look at all the tables and find the correct one. We cannot advance blindly, though; let us have a look at the structure of the HTML.

Indeed, there is more than one table. In the image above, the highlighted table is the one we want to collect. Unfortunately, the tables do not have a title, but they do have a class attribute. We can use this information to pick the correct table.

# Verifying tables and their classes
print('Classes of each table:')
for table in soup.find_all('table'):
    print(table.get('class'))

OUTPUT:
Classes of each table:
['box-Desatualizado', 'plainlinks', 'metadata', 'ambox', 'ambox-content']
['wikitable', 'sortable']
['nowraplinks', 'collapsible', 'collapsed', 'navbox-inner']

Our piece of code tells us we want the second table (i.e., the one with the classes 'wikitable' and 'sortable').

# Creating list with all tables
tables = soup.find_all('table')

# Looking for the table with the classes 'wikitable' and 'sortable'
table = soup.find('table', class_='wikitable sortable')

Notice that we pass both classes as a single space-separated string rather than separating them with commas (an equivalent CSS-selector form is shown below). Once we have the correct table, we can extract its data to create our very own dataframe.
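
For completeness, this is what that selection looks like with a CSS selector; the selector table.wikitable.sortable also requires the element to carry both classes:

# Equivalent selection using a CSS selector
table = soup.select_one('table.wikitable.sortable')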

# Defining the dataframe
df = pd.DataFrame(columns=['Neighborhood', 'Zone', 'Area', 'Population', 'Density', 'Homes_count'])

# Collecting data
for row in table.tbody.find_all('tr'):
    # Find all data for each column
    columns = row.find_all('td')

    if columns != []:
        neighborhood = columns[0].text.strip()
        zone = columns[1].text.strip()
        area = columns[2].span.contents[0].strip('&0.')
        population = columns[3].span.contents[0].strip('&0.')
        density = columns[4].span.contents[0].strip('&0.')
        homes_count = columns[5].span.contents[0].strip('&0.')

        df = df.append({'Neighborhood': neighborhood, 'Zone': zone, 'Area': area, 'Population': population, 'Density': density, 'Homes_count': homes_count}, ignore_index=True)

Notice that we first create an empty DataFrame, but we give it its column names. Then we find all rows; for each row, we collect all of its data. Once we have the data, we can use indexes to reference each available column. We must look at the HTML structure to use the correct references in the extraction process. In this example, some columns have the HTML tag span and need additional stripping of strange characters. Let us see what our DataFrame returns.

df.head()
Output of the head call
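
A quick compatibility note: DataFrame.append was removed in pandas 2.0, so on recent versions the loop above raises an AttributeError. A minimal adaptation, keeping exactly the same extraction logic, is to collect plain dictionaries and build the DataFrame once at the end:

# Collecting the data into a list of dictionaries (works on pandas 2.x)
rows = []
for row in table.tbody.find_all('tr'):
    columns = row.find_all('td')
    if columns != []:
        rows.append({
            'Neighborhood': columns[0].text.strip(),
            'Zone': columns[1].text.strip(),
            'Area': columns[2].span.contents[0].strip('&0.'),
            'Population': columns[3].span.contents[0].strip('&0.'),
            'Density': columns[4].span.contents[0].strip('&0.'),
            'Homes_count': columns[5].span.contents[0].strip('&0.'),
        })

df = pd.DataFrame(rows)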

Incredible! We are looking at the data we extracted from the Wikipedia page. Here is a pro-tip: Pandas has a method for extracting HTML tables without much effort.

Pro-tip

The method read_html returns a list of DataFrames built from the HTML tables that satisfy our attribute specifications. In this case, we are looking for a table that includes the classes wikitable and sortable. The thousands parameter specifies the thousands separator used when parsing numbers.

df_pandas = pd.read_html(url, attrs={'class': 'wikitable sortable'}, flavor='bs4', thousands='.')

Let us have a look at the dataframe.

df_pandas[0].head()
Pandas extracted this Dataframe for us

2. Toronto Neighborhood List

Like before, let us have a look at the data first.

Unlike the first dataset, this one is not organized in neat rows and columns. Instead, all the data for a given postal code is grouped together in a single cell. Let us look briefly at the HTML structure of the page.

Notice two things here. First, some cells are empty and display the message 'Not assigned'. Second, each cell contains a paragraph (tag p) and a span (tag span). Let us begin our collection process.

# Importing libraries
import requests
import pandas as pd
from bs4 import BeautifulSoup

After importing the necessary libraries, we download the HTML data.

# Downloading contents of the web page
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
data = requests.get(url).text

We create the BeautifulSoup object.

# Create BeautifulSoup object
soup = BeautifulSoup(data, 'html5lib')
# Get table
table = soup.find('table')

Notice that, in this case, we can find the table directly because there is only one table on the page.
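
If you want to verify that assumption before relying on it, a one-line check does the job:

# Should print 1 if the page really contains a single table
print(len(soup.find_all('table')))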

contents = []

# Going through every cell (tag td) of the table
for row in table.find_all('td'):
    cell = {}
    if row.span.text == 'Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        contents.append(cell)

In this dataset, we go through each cell (tag td). We skip the cells marked 'Not assigned' and, from the rest, extract the contents of the paragraph and the span. Finally, we add the cell to the list of contents.
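
The chained string operations on the span text are easier to follow on a concrete value. Here is a short walkthrough using a hypothetical cell (the exact text on the live page may differ):

# Hypothetical span text, for illustration only
text = 'North York(Parkwoods / Victoria Village)'

borough = text.split('(')[0]          # 'North York'
neighborhood = (text.split('(')[1]    # 'Parkwoods / Victoria Village)'
                .strip(')')           # 'Parkwoods / Victoria Village'
                .replace(' /', ',')   # 'Parkwoods, Victoria Village'
                .replace(')', ' ')    # no closing parenthesis left, so no change
                .strip(' '))          # trim any leading/trailing spaces

print(borough, '->', neighborhood)    # North York -> Parkwoods, Victoria Village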

# Creating the dataframe
df = pd.DataFrame(contents)

# Shortening some values in the Borough column to more readable names
df['Borough'] = df['Borough'].replace({
    'Downtown TorontoStn A PO Boxes25 The Esplanade': 'Downtown Toronto Stn A',
    'East TorontoBusiness reply mail Processing Centre969 Eastern': 'East Toronto Business',
    'EtobicokeNorthwest': 'Etobicoke Northwest',
    'East YorkEast Toronto': 'East York/East Toronto',
    'MississaugaCanada Post Gateway Processing Centre': 'Mississauga'})

We create the dataset by passing the contents list to the Pandas DataFrame constructor. We also shorten some of the values in the Borough column to more readable names.

# Visualizing dataframe
df.head()

Success! We extracted the dataset, as we desired.

Thank you for reading! You can find the code for these projects in the following repository: https://github.com/TSantosFigueira/Coursera_Capstone

Photo by Marco Bianchetti on Unsplash
