Crawl Table Data with Just a Few Lines of Code

Chi Nguyen
Published in The Startup
4 min read · Nov 9, 2020

At first, “data crawling” struck me as a difficult task that only programming experts could carry out, but after a few hours of researching BeautifulSoup, I am now able to do some basic web scraping even though my technical skills are limited.

Scraping data is a necessary skill, especially if you work in a field related to data and analytics. For any kind of analysis, you first need to collect enough data to perform your task. Normally, you would search the Internet for CSV files or for pages that expose APIs. Sometimes, however, the data you need is just one part of a web page. In that case, you need scraping techniques to retrieve the data from the targeted site and convert it into a usable format.

In this article, I will introduce you to a simple way to extract table data from Wikipedia with just Python and BeautifulSoup.

Prerequisites

  • Basic knowledge of Python
  • Basic understanding of HTML tags

Background

I am doing market research in Hanoi to analyze the possibility of starting a business. First, I want to learn more about the demographics of each district in Hanoi to get an overview of the city. I decided to crawl this information from https://en.wikipedia.org/wiki/Hanoi and convert it into a pandas DataFrame for more convenient analysis.

Figure 1: Table data — Information of Hanoi’s districts

Install and import necessary packages

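If you do not have these packages yet, they can be installed with pip first (assuming a standard Python environment; lxml is the parser BeautifulSoup will use below):

pip install beautifulsoup4 requests pandas lxml

Then import them: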
from bs4 import BeautifulSoup
import requests
import pandas as pd

Extract data

Figure 2: Inspect the web page
  • To get the data from the web page, we will use the get() method of the requests library:
url = "https://en.wikipedia.org/wiki/Hanoi"
html_content = requests.get(url).text
# Parse HTML code for the entire site
df_soup = BeautifulSoup(html_content, "lxml")
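As a side note (not in the original snippet), it can be safer to check the HTTP status before parsing, so a failed request does not silently produce an empty page:

# Defensive variant: fail early on HTTP errors or slow responses
response = requests.get(url, timeout=10)
response.raise_for_status()
html_content = response.text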
  • If you inspect the HTML source, you can see that all the table contents (including Subdivisions of Hanoi, our target table) sit under the class wikitable. Our table is the third table on this page.
a = df_soup.find_all("table", class_="wikitable")
print("Number of tables on site:", len(a))
# Select the table we want: Subdivisions of Hanoi is the third one (index 2)
a1 = a[2]
Figure 3: Define table class
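As an aside, BeautifulSoup's select() method accepts CSS selectors, so the same lookup can also be written more compactly (a stylistic variant, not part of the original walkthrough):

# Equivalent lookup with a CSS selector
a1 = df_soup.select("table.wikitable")[2]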
  • To extract the table rows, we have to find all the “tr” tags. Use find_all to get them:
table_rows = a1.find_all("tr")
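Optionally, you can sanity-check how many rows were found before going further:

print("Number of rows:", len(table_rows))  # includes the header rows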
  • Next, get the headings of the table. As the picture below shows, all headings are under “th” tags.
Figure 4: Define headings
head = table_rows[1]  # the row at index 1 contains the column headings
headings = []
for item in head.find_all("th"):
    item = item.text.rstrip("\n")
    headings.append(item)
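The same extraction can be condensed into a list comprehension if you prefer (purely a stylistic variant):

# One-line equivalent of the loop above
headings = [th.text.rstrip("\n") for th in table_rows[1].find_all("th")]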
  • Similarly, the body cells sit under “td” tags. We again use find_all to get them:
Figure 5: Define body
demographics = []
for tr in table_rows:
    district_demo = tr.find_all("td")
    row_demo = [td.text for td in district_demo]
    demographics.append(row_demo)
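Note that rows without “td” cells (such as the heading rows) end up as empty lists in demographics, which is why the DataFrame is sliced with .loc[3:14] below. If you wanted to drop them up front instead, a small filter would do it, though that would shift the row indices used in the slice:

# Optional variant (not used below): keep only rows that actually contain <td> cells
demographics = [row for row in demographics if row]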
  • Finally, we combine the headings and the body rows and turn them into a pandas DataFrame:
demographics2 = pd.DataFrame(data=demographics, columns=headings)
demographics3 = demographics2.loc[3:14].rename(columns={
    "Provincial Cities/Districts[30]": "Provincial Cities/Districts",
    "Wards[30]": "Wards",
    "Area (km2)[30]": "Area (km2)",
    "Population (2017)[30]": "Population (2017)"
}).reset_index(drop=True)
demographics3['Population (2017)'] = demographics3['Population (2017)'].str.replace("\n", "").str.strip()
demographics3
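For comparison, pandas can also parse HTML tables directly with read_html, which returns one DataFrame per matching table. This is a different route from the manual parsing above, and the table index assumes the page layout has not changed:

# Possible shortcut via pandas (requires lxml); index 2 assumes the same page layout
tables = pd.read_html(url, attrs={"class": "wikitable"})
subdivisions = tables[2]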
Figure 6: Final result

Conclusion

Web scraping is a must-have skill in the era of data science, and it saves both time and money. Although the above is just a very simple way of extracting data, I hope it benefits you in one way or another.

Chi Nguyen
MSc in Statistics. Sharing my learning tips on the journey of becoming a better data analyst. LinkedIn: https://www.linkedin.com/in/chinguyenphamhai/