How To Scrape Wikipedia Tables Straight To a Pandas DataFrame (in 3 Lines of Code)
Wikipedia is an invaluable source of information. It is perhaps not always the most accurate, given its openly editable nature, but it holds a lot of useful information and is very convenient. Especially the lists. I have personally used Wikipedia lists many times in the data pipelines for my machine learning projects (most recently for this company names/tickers annotator).
So, in this small write-up, I’ll cover how you can easily scrape list tables from Wikipedia and convert them into pandas DataFrames. For the purpose of this tutorial, we will scrape the list of S&P 500 companies from here.
If you’ve never done this before, you would probably think of using modules like requests and BeautifulSoup to fetch and parse the webpage before creating a pandas DataFrame. And there’s nothing wrong with that approach per se. It’s just that the read_html() function does all of that in one line 😬
Enough talking, let’s get into it!
Scraping HTML Table Straight To Pandas DataFrame
Let’s import pandas and fetch the HTML tables from the S&P500 Wikipedia page:
```python
import pandas as pd

tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
```
And that’s it! You already have a list of DataFrames, tables, containing the two tables present on that webpage. By default, the function fetches all tables from the webpage and returns a list of DataFrames.
Let’s print out the head of the first DataFrame to verify if it’s the real thing.
```python
print(tables[0].head())
```
It is!
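If you want to poke at read_html() without hitting the network, you can also feed it an inline HTML string wrapped in StringIO (the file-like form that newer pandas versions recommend). A minimal sketch with made-up table contents:

```python
from io import StringIO

import pandas as pd

# A tiny page with two tables, loosely mimicking the Wikipedia page's layout
# (the ids and row values here are made up for illustration)
html = """
<table id="constituents">
  <tr><th>Symbol</th><th>CIK</th></tr>
  <tr><td>MMM</td><td>0000066740</td></tr>
</table>
<table id="changes">
  <tr><th>Date</th><th>Added</th></tr>
  <tr><td>2024-01-02</td><td>XYZ</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> it finds
tables = pd.read_html(StringIO(html))
print(len(tables))  # 2
```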
Three Useful read_html() Parameters You Should Know About
Explicit Typecasting with converters
We do have the DataFrame, but it’s not perfect. By default, columns containing only numerals are cast to numeric types, for example, the CIK column. But not every numeric-looking column should become a numeric type; our CIK column, for instance, holds identifiers with leading zeros:
CIK = 0000066740
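To see the problem in isolation, here is a small offline sketch (inline HTML with a made-up value) showing the default cast dropping the leading zeros:

```python
from io import StringIO

import pandas as pd

html = "<table><tr><th>CIK</th></tr><tr><td>0000066740</td></tr></table>"

# With no converters, the numeric-looking column is cast to an integer dtype,
# so the leading zeros disappear
df = pd.read_html(StringIO(html))[0]
print(df["CIK"].iloc[0])  # 66740
```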
Besides that, sometimes you may want to ensure dtype integrity. For these requirements, we can explicitly typecast with the converters parameter:
```python
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies", converters={'CIK': str})
print(tables[0].head())
```
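The same offline sketch with converters applied shows the zeros surviving, since the converter runs on each cell before any type inference (the column name and value are again made up):

```python
from io import StringIO

import pandas as pd

html = "<table><tr><th>CIK</th></tr><tr><td>0000066740</td></tr></table>"

# converters maps column name -> callable applied to every cell in that column,
# taking precedence over the default numeric cast
df = pd.read_html(StringIO(html), converters={"CIK": str})[0]
print(df["CIK"].iloc[0])  # '0000066740'
```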
Filtering tables with match
read_html() has a match parameter that takes a string or a regular expression. The default value is '.+' (match any non-empty string), which returns all tables. Let’s see how this works by filtering out the second table on the page.
Only the first table contains the ‘Headquarters Location’ column, so let’s use that to read it:
```python
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies", parse_dates=['Date first added'], match='Headquarters Location')
print(len(tables))
### OUTPUT
# 1
```
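Here is the same idea offline: two inline tables (with made-up rows), where match keeps only the one whose text matches the pattern:

```python
from io import StringIO

import pandas as pd

html = """
<table>
  <tr><th>Symbol</th><th>Headquarters Location</th></tr>
  <tr><td>MMM</td><td>Saint Paul, Minnesota</td></tr>
</table>
<table>
  <tr><th>Date</th><th>Added</th></tr>
  <tr><td>2024-01-02</td><td>XYZ</td></tr>
</table>
"""

# Only tables containing text that matches the string/regex are returned
tables = pd.read_html(StringIO(html), match="Headquarters Location")
print(len(tables))  # 1
```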
Selecting Tables with attrs
Another way of selecting tables from a webpage is the more conventional scraping approach of using HTML tag attributes like class or id. The attrs parameter takes a dictionary of any valid HTML tag attributes for filtering tables. Let’s use it to selectively fetch the first table on the webpage. First, we’ll need to find a tag attribute that differentiates it from the second table:
The class attribute is shared by the two tables, so we’ll have to use the id attribute.
```python
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies", attrs={'id': "constituents"}, converters={'CIK': str})
print(len(tables))
### OUTPUT
# 1
```
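And an offline sketch of attrs, filtering on the table’s id attribute (the ids and rows below are made up to mirror the page’s structure):

```python
from io import StringIO

import pandas as pd

html = """
<table id="constituents">
  <tr><th>Symbol</th></tr>
  <tr><td>MMM</td></tr>
</table>
<table id="changes">
  <tr><th>Date</th></tr>
  <tr><td>2024-01-02</td></tr>
</table>
"""

# attrs filters on the <table> element's own HTML attributes
tables = pd.read_html(StringIO(html), attrs={"id": "constituents"})
print(len(tables))  # 1
```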
Conclusion
The pandas read_html() function is a quick and easy way to scrape data from HTML tables. I hope this article helps you save time while scraping tables from web pages. I recommend checking out the documentation for the read_html() function to learn about other handy parameters, like parse_dates, and to see what else you can do with it.
Thanks for reading!