How To Scrape Wikipedia Tables Straight To a Pandas DataFrame (in 3 Lines of Code)
Wikipedia is an invaluable source of information. It is perhaps not always the most accurate, given its openly editable nature, but it holds a lot of useful information and is very convenient. Especially the lists. I have personally used Wikipedia lists many times in the data pipelines for my machine learning projects (most recently for this company names/tickers annotator).
So, in this small write-up, I’ll cover how you can easily scrape list tables from Wikipedia and convert them into pandas DataFrames. For the purpose of this tutorial, we will scrape the list of S&P 500 companies from here.
If you’ve never done this before, you would probably think of using modules like requests and BeautifulSoup to fetch and parse the webpage before creating a pandas DataFrame. And there’s nothing wrong with that approach per se. It’s just that the read_html() function does all of that in one line 😬
Enough talking, let’s get into it!
Scraping HTML Table Straight To Pandas DataFrame
Let’s import pandas and fetch the HTML tables from the S&P500 Wikipedia page:
```python
import pandas as pd

tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
```
And that’s it! You already have a list of DataFrames, tables, containing the two tables present on that webpage. By default, the function fetches all tables from the webpage and returns a list of DataFrames.
Let’s print out the head of the first DataFrame to verify if it’s the real thing.
```python
print(tables[0].head())
```
It is!
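If you want to poke at read_html() without hitting the network, you can also feed it an inline HTML string wrapped in StringIO (the file-like form that newer pandas versions recommend). A minimal sketch with made-up table contents:

```python
from io import StringIO

import pandas as pd

# A tiny page with two tables, loosely mimicking the Wikipedia page's layout
# (the ids and row values here are made up for illustration)
html = """
<table id="constituents">
  <tr><th>Symbol</th><th>CIK</th></tr>
  <tr><td>MMM</td><td>0000066740</td></tr>
</table>
<table id="changes">
  <tr><th>Date</th><th>Added</th></tr>
  <tr><td>2024-01-02</td><td>XYZ</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> it finds
tables = pd.read_html(StringIO(html))
print(len(tables))  # 2
```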
Three Useful read_html() Parameters You Should Know About
Explicit Typecasting with converters
We do have the DataFrame, but it’s not perfect. By default, columns containing only numerals are cast to numeric types, for example, the CIK column. But not every numeric-looking column should become a numeric type; our CIK column, for instance, holds identifiers with leading zeros:
CIK = 0000066740
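To see the problem in isolation, here is a small offline sketch (inline HTML with a made-up value) showing the default cast dropping the leading zeros:

```python
from io import StringIO

import pandas as pd

html = "<table><tr><th>CIK</th></tr><tr><td>0000066740</td></tr></table>"

# With no converters, the numeric-looking column is cast to an integer dtype,
# so the leading zeros disappear
df = pd.read_html(StringIO(html))[0]
print(df["CIK"].iloc[0])  # 66740
```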
Besides that, sometimes you may want to ensure dtype integrity. For these requirements, we can explicitly typecast with the converters parameter:
```python
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies", converters={'CIK': str})
print(tables[0].head())
```
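The same offline sketch with converters applied shows the zeros surviving, since the converter runs on each cell before any type inference (the column name and value are again made up):

```python
from io import StringIO

import pandas as pd

html = "<table><tr><th>CIK</th></tr><tr><td>0000066740</td></tr></table>"

# converters maps column name -> callable applied to every cell in that column,
# taking precedence over the default numeric cast
df = pd.read_html(StringIO(html), converters={"CIK": str})[0]
print(df["CIK"].iloc[0])  # '0000066740'
```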
Filtering tables with match
read_html() has a match parameter that takes a string or a regular expression. The default value is '.+' (match any non-empty string), which returns all tables. Let’s see how this works by filtering out the second table on the page.
Only the first table contains the ‘Headquarters Location’ column, so let’s use that to read it:
```python
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies", parse_dates=['Date first added'], match='Headquarters Location')
print(len(tables))
### OUTPUT
# 1
```
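Here is the same idea offline: two inline tables (with made-up rows), where match keeps only the one whose text matches the pattern:

```python
from io import StringIO

import pandas as pd

html = """
<table>
  <tr><th>Symbol</th><th>Headquarters Location</th></tr>
  <tr><td>MMM</td><td>Saint Paul, Minnesota</td></tr>
</table>
<table>
  <tr><th>Date</th><th>Added</th></tr>
  <tr><td>2024-01-02</td><td>XYZ</td></tr>
</table>
"""

# Only tables containing text that matches the string/regex are returned
tables = pd.read_html(StringIO(html), match="Headquarters Location")
print(len(tables))  # 1
```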
Selecting Tables with attrs
Another way of selecting tables from a webpage is the more conventional scraping approach of using HTML tag attributes like class or id. The attrs parameter takes a dictionary of any valid HTML tag attributes for filtering tables. Let’s use it to selectively fetch the first table on the webpage. First, we’ll need to find a tag attribute that differentiates it from the second table:
The class attribute is shared by the two tables, so we’ll have to use the id attribute.
```python
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies", attrs={'id': "constituents"}, converters={'CIK': str})
print(len(tables))
### OUTPUT
# 1
```
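And an offline sketch of attrs, filtering on the table’s id attribute (the ids and rows below are made up to mirror the page’s structure):

```python
from io import StringIO

import pandas as pd

html = """
<table id="constituents">
  <tr><th>Symbol</th></tr>
  <tr><td>MMM</td></tr>
</table>
<table id="changes">
  <tr><th>Date</th></tr>
  <tr><td>2024-01-02</td></tr>
</table>
"""

# attrs filters on the <table> element's own HTML attributes
tables = pd.read_html(StringIO(html), attrs={"id": "constituents"})
print(len(tables))  # 1
```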
Conclusion
The pandas read_html() function is a quick and easy way to scrape data from HTML tables. I hope this article helps you save time while scraping tables from web pages. I recommend checking out the documentation for the read_html() function to learn about other handy parameters, like parse_dates, and to see what else you can do with it.
Thanks for reading!