Web Scraping Wikipedia Tables with Python

Cathrin Rahn
Aug 11, 2023

A simple tutorial to follow

We’ve all been there: You want to start a new data science project and are looking for data. For some things, you might find an API to get the information you need, but how much would you give to be able to access one of the largest free collections of knowledge in the world?

Wikipedia currently has about 6.7 million entries in English alone, on almost every topic imaginable, and with the help of web scraping you can easily access it. In this article, I will focus on web scraping tables in Wikipedia, as they are a good example of how to access content from web pages.

Wikipedia Logo
Harleen Quinzellová, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons

What is web scraping anyway?

What better place to answer that question than Wikipedia:

‘Web scraping is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.’

Wikipedia

So web scraping is a method of collecting unstructured data from web pages on the Internet and transforming it into a structured format. This can be done manually or automatically.

Is web scraping legal?

Unfortunately, web scraping is not always allowed. You can usually find out whether a site permits scraping in its robots.txt file. So, if you want to scrape content from Wikipedia, check https://en.wikipedia.org/robots.txt to find out what the current rules are. In addition, you can usually find the relevant information in the site's terms of use.
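If you would rather check this in code than by reading the file, Python's standard library ships urllib.robotparser, which reads a robots.txt file and tells you whether a given user agent may fetch a given URL. Here is a minimal sketch (the '*' user agent is just a placeholder for "any bot"):

from urllib import robotparser

#read Wikipedia's robots.txt
rp = robotparser.RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()

#ask whether a specific page may be fetched; '*' stands for any user agent
print(rp.can_fetch("*", "https://en.wikipedia.org/wiki/Paris"))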

How does web scraping work and what tools do you need?

When you run a scraping script, a request is sent to the URL specified in the code. In response, the server sends back the page's HTML and allows access to it. The code then parses the HTML (or XML), finds the data and extracts it.
Python is the most popular programming language for web scraping. Popular libraries within Python include the following (a minimal sketch of how they fit together follows the list):

1. Requests: The Requests library allows you to send HTTP requests to web pages and retrieve the HTML content.
2. Beautiful Soup: Beautiful Soup is a library that can parse HTML and XML documents to extract the desired data.
3. Pandas: The Pandas library helps you with data manipulation and analysis after you have extracted the data from the web pages.
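Here is that minimal sketch of how the three libraries work together. The URL is just an example, and the last line is optional: pandas.read_html() can parse <table> elements straight from the HTML, but it needs an additional parser such as lxml installed.

import requests
from bs4 import BeautifulSoup
import pandas as pd

#1. Requests fetches the raw HTML of the page
html = requests.get("https://en.wikipedia.org/wiki/Paris").text

#2. Beautiful Soup turns the HTML into a navigable tree
soup = BeautifulSoup(html, "html.parser")
print(soup.title.text)

#3. Pandas takes over for analysis; it can even read HTML tables directly (requires e.g. lxml)
tables = pd.read_html(html)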

How exactly does web scraping work, using Wikipedia tables as an example?

Imagine you have a list of cities and now you need to collect additional data such as the number of inhabitants, the geographical location or the associated state. You can find all this information in the summary table on the Wikipedia page for a city — in this case Paris. But how do you access the data?

Header of the Paris Wikipedia page

1. Import the required libraries: Start by importing the necessary Python libraries, such as Requests and Beautiful Soup.

from bs4 import BeautifulSoup
import requests as r

2. Send a GET request to the Wikipedia page: Use the Requests library to send an HTTP request to the Wikipedia page for Paris, which contains the table you want, and retrieve the HTML content.

url = "https://en.wikipedia.org/wiki/Paris"
response = r.get(url)
wiki_page_text = response.text

3. Check the response code: Check the status code of the response to rule out possible errors. A status code of 200 means that everything is OK! For more information on status codes and their meaning, see the MDN web docs on HTTP response status codes.

response.status_code
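If you prefer the script to fail loudly instead of quietly continuing with a bad response, you can turn this check into a guard. This is just a defensive variant of the line above:

#stop early if the request did not succeed
if response.status_code != 200:
    raise RuntimeError(f"Request failed with status code {response.status_code}")

#alternatively, let Requests raise an HTTPError for 4xx/5xx responses
response.raise_for_status()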

4. Examine the HTML structure of the web page: Open the Inspect tool in your browser (e.g. Google Chrome) and familiarise yourself with the HTML structure of the page. If you hover over an element on the website, in this case the table on the Wikipedia page, you will see the corresponding location in the HTML code.

Tables in Wikipedia are typically structured according to the following scheme:

The entire table is defined in the <table> tag.
The heading is in the <thead> tag.
The data is in the <tbody> tag.
Each row in the table is defined in a <tr> tag.
Each column heading (‘key’) is defined in a <th> tag.
Each data cell (‘value’) in the table is defined in a <td> tag.

You can find out more about HTML in the MDN web docs.
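To make this scheme concrete, here is a tiny, made-up table that follows the same structure, together with a few lines of Python that walk through it. The values are placeholders, not real data:

from bs4 import BeautifulSoup

mini_table = """
<table>
  <tbody>
    <tr><th>Country</th><td>France</td></tr>
    <tr><th>Population</th><td>2,100,000</td></tr>
  </tbody>
</table>
"""

mini_soup = BeautifulSoup(mini_table, "html.parser")
#each <tr> is one row; the <th> holds the key and the <td> holds the value
for row in mini_soup.find_all("tr"):
    print(row.th.text, "->", row.td.text)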

HTML structure of the Paris Wikipedia page

As you can see in the screenshot, each row in this table is embedded in a <tr> tag, with each cell representing a separate <th> or <td> element.

5. Parse the HTML content: With BeautifulSoup you can now find and extract individual elements within the HTML structure. To find the desired content, you can use various methods such as .find() or .select() and navigate along the HTML structure of the page.

#create a BeautifulSoup object
soup = BeautifulSoup(wiki_page_text, 'html.parser')
#find the table we are looking for
paris_table = soup.find('table',{'class':'infobox ib-settlement vcard'})
paris_table

5.1. Use .find(): Using the .find() method, you can easily display all content that is tagged with a particular HTML tag.

To do this, first create an empty list in which to collect the content you want. Then use the .find_all() method to find all <tr> tags. Now iterate over each row, find the corresponding <th> (key) or <td> (value) tags and extract the text. Finally, append everything to the list you created at the beginning.

#EXTRACTING ALL THE KEYS
keys_list = []
#find all `tr` tags
table_data = paris_table.find_all('tr')
#iterate over each table row to extract the <th>-tag
for i in table_data:
    key = i.find_all('th')
    keys = [ele.text.strip() for ele in key]
    keys_list.append(keys)

#EXTRACTING ALL THE VALUES
values_list = []
#find all `tr` tags
table_data = paris_table.find_all('tr')
#iterate over each table row to extract the <td>-tag
for i in table_data:
    value = i.find_all('td')
    values = [ele.text.strip() for ele in value]
    values_list.append(values)

keys_list, values_list
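Because both loops walk over the same rows in the same order, you can pair the two lists up afterwards. Here is a rough sketch that only keeps rows with exactly one header cell and exactly one data cell (section headings and multi-cell rows are skipped):

#pair keys and values row by row, skipping rows without exactly one <th> and one <td>
paris_info = {k[0]: v[0] for k, v in zip(keys_list, values_list) if len(k) == 1 and len(v) == 1}
paris_info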

5.2. Use .select(): A more direct approach is to use the .select() method: you can simply pass a CSS selector as a parameter and the matching elements are extracted automatically.

keys_list = []
#the CSS selector walks down the HTML structure level by level: table -> tbody -> tr -> th
for s in soup.select('table.infobox tbody tr th'):
    keys_list.append(s.get_text())
keys_list

5.3. Combine .select() and .find():

To target specific information, it is effective to combine different BeautifulSoup methods. Here is a practical example: If you want to extract only the country from the information, first navigate through the HTML body of the page using .select(), identify the row with the heading ‘Country’ and add the corresponding text of the cell to the list.

BeautifulSoup provides a number of methods related to .find() that can be used to find and extract items. As shown in the code snippet below, these methods can be used to also handle dependencies between items. Since we know that the column headers in the table are tagged with the <th> tag, we can use .find_next_sibling() to extract information from the related data cells (<td>).

countries_list = []
for s in soup.select('table.infobox tbody tr th'):
    if s.text == 'Country':
        countries_list.append(s.find_next_sibling('td').get_text())
countries_list

If you want to collect this information for many cities, as in our case, you can add an outer loop that iterates over a list of cities. This could look like this:

#create a list with city names
cities = ['Lisbon', 'Berlin', 'Paris', 'Rome', 'London', 'Vienna', 'Athens', 'Copenhagen', 'Barcelona', 'Munich', 'Warsaw', 'Prague', 'Marseille']
#create an empty list to store the results
city_countries = []

#iterate over each city in the cities list
for city in cities:
    url = f'https://en.wikipedia.org/wiki/{city}'
    response = r.get(url)
    #check the response's status code
    if response.status_code == 200:
        #create the soup object
        soup = BeautifulSoup(response.content, "html.parser")
        #navigate through the HTML code
        for s in soup.select('table.infobox tbody tr th'):
            #search for the element with the text 'Country'
            if s.text == 'Country':
                #look for the next element with a <td> tag and get its text
                country = s.find_next_sibling('td').get_text()
                #append the city name and country to the list
                city_countries.append((city, country))
                break

#print the list of city-country pairs
for city, country in city_countries:
    print(f'{city}: {country}')

6. Analyse, clean and process the data: Finally, you can use Pandas to process the data according to the needs of your project.
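For example, if you collected the city-country pairs from the previous step, a DataFrame is only one line away (the column names are just a suggestion):

import pandas as pd

#turn the list of (city, country) tuples into a DataFrame for further analysis
df = pd.DataFrame(city_countries, columns=['city', 'country'])
df.head()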

And that’s it! Have fun trying it out!
