Crawl Table Data with Just a Few Lines of Code

Chi Nguyen
Published in The Startup
4 min read · Nov 9, 2020

At first, “data crawling” struck me as a difficult task that only programming experts could carry out, but after a few hours of researching BeautifulSoup, I am now able to do some basic web scraping even though my technical skills are limited.

Scraping data is a necessary skill, especially if you work in a field related to data and analytics. For any kind of analysis, you first need to collect enough data to perform your task. Normally, you would search the Internet for CSV files or for pages that expose APIs. Sometimes, however, the data you need is just one part of a web page. In that case, you need scraping techniques to retrieve the data from the targeted site and convert it into a usable format.

In this article, I will introduce you to a simple way to extract table data from Wikipedia with just Python and BeautifulSoup.

Prerequisites

  • Basic knowledge of Python
  • Basic understanding of HTML tags

Background

I am doing market research in Hanoi to analyze the possibility of starting a business. First, I want to learn more about the demographics of each district in Hanoi to get an overview of the city. I decided to crawl this information from https://en.wikipedia.org/wiki/Hanoi and convert it into a pandas DataFrame for more convenient analysis.

Figure 1: Table data — Information of Hanoi’s districts

Install and import necessary packages

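If you do not have these packages yet, they can be installed with pip first (assuming a standard Python environment; lxml is the parser BeautifulSoup will use below):

pip install beautifulsoup4 requests pandas lxml

Then import them: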
from bs4 import BeautifulSoup
import requests
import pandas as pd

Extract data

Figure 2: Inspect the web page
  • To get the data from the web page, we will use the get() method of the requests library:
url = "https://en.wikipedia.org/wiki/Hanoi"
html_content = requests.get(url).text
# Parse HTML code for the entire site
df_soup = BeautifulSoup(html_content, "lxml")
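As a side note (not in the original snippet), it can be safer to check the HTTP status before parsing, so a failed request does not silently produce an empty page:

# Defensive variant: fail early on HTTP errors or slow responses
response = requests.get(url, timeout=10)
response.raise_for_status()
html_content = response.text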
  • If you inspect the HTML source, you can see that all the table contents (including Subdivisions of Hanoi, our target table) sit under the class wikitable. Our table is the third table on this page.
a = df_soup.find_all("table", class_="wikitable")
print("Number of tables on site:", len(a))
# Select the table we want: Subdivisions of Hanoi is the third one (index 2)
a1 = a[2]
Figure 3: Define table class
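As an aside, BeautifulSoup's select() method accepts CSS selectors, so the same lookup can also be written more compactly (a stylistic variant, not part of the original walkthrough):

# Equivalent lookup with a CSS selector
a1 = df_soup.select("table.wikitable")[2]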
  • To extract the table rows, we have to find all the “tr” tags. Use find_all to get them:
table_rows = a1.find_all("tr")
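Optionally, you can sanity-check how many rows were found before going further:

print("Number of rows:", len(table_rows))  # includes the header rows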
  • Next, get the headings of the table. As the picture below shows, all headings are under “th” tags.
Figure 4: Define headings
head = table_rows[1]  # the row at index 1 contains the column headings
headings = []
for item in head.find_all("th"):
    item = item.text.rstrip("\n")
    headings.append(item)
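The same extraction can be condensed into a list comprehension if you prefer (purely a stylistic variant):

# One-line equivalent of the loop above
headings = [th.text.rstrip("\n") for th in table_rows[1].find_all("th")]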
  • Similarly, the body cells sit under “td” tags. We again use find_all to get them:
Figure 5: Define body
demographics = []
for tr in table_rows:
    district_demo = tr.find_all("td")
    row_demo = [td.text for td in district_demo]
    demographics.append(row_demo)
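Note that rows without “td” cells (such as the heading rows) end up as empty lists in demographics, which is why the DataFrame is sliced with .loc[3:14] below. If you wanted to drop them up front instead, a small filter would do it, though that would shift the row indices used in the slice:

# Optional variant (not used below): keep only rows that actually contain <td> cells
demographics = [row for row in demographics if row]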
  • Finally, we combine the headings and the body rows and turn them into a pandas DataFrame:
demographics2 = pd.DataFrame(data=demographics, columns=headings)
demographics3 = demographics2.loc[3:14].rename(columns={
    "Provincial Cities/Districts[30]": "Provincial Cities/Districts",
    "Wards[30]": "Wards",
    "Area (km2)[30]": "Area (km2)",
    "Population (2017)[30]": "Population (2017)"
}).reset_index(drop=True)
demographics3['Population (2017)'] = demographics3['Population (2017)'].str.replace("\n", "").str.strip()
demographics3
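For comparison, pandas can also parse HTML tables directly with read_html, which returns one DataFrame per matching table. This is a different route from the manual parsing above, and the table index assumes the page layout has not changed:

# Possible shortcut via pandas (requires lxml); index 2 assumes the same page layout
tables = pd.read_html(url, attrs={"class": "wikitable"})
subdivisions = tables[2]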
Figure 6: Final result

Conclusion

Web scraping is a must-have skill in the era of data science, and it saves both time and money. Although the above is just a very simple way of extracting data, I hope it benefits you in one way or another.

Chi Nguyen
MSc in Statistics. Sharing my learning tips on the journey of becoming a better data analyst. LinkedIn: https://www.linkedin.com/in/chinguyenphamhai/