Scraping Tabular Data With Beautiful Soup

Josh Mancuso
Published in Future Vision
Apr 25, 2019 · 5 min read


For my recent project at Lambda School, I analyzed the effect of different types of weather on NFL QB passing performance. I found a website with weather data for NFL games, where the data for each game is presented in a nice, clean table.

Great, but how do we get this data into a csv file so we can read it into a pandas dataframe? We could highlight the data with our mouse and manually copy and paste it into a spreadsheet, but with over 10,000 games of weather data available, that could take quite a while. Instead, we can import the Beautiful Soup library in Python and let our computer do the work for us.

First, we import the pandas, Beautiful Soup, and requests libraries, and we pass the game data url to requests.get(). This grabs the raw html for that webpage, and we store the response in a variable we’ll name result. Next, we convert result.text into a Beautiful Soup object, which is much easier to navigate when isolating the elements of the webpage that contain the data we want to scrape. Here’s a quick peek at the code and the resulting soup object:
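
A minimal sketch of that setup (the url below is a placeholder; swap in the address of the game page you want to scrape):

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Placeholder url -- replace with the actual game page address
game_url = 'https://www.example.com/nfl-game-weather'

# Grab the raw html for the page
result = requests.get(game_url)

# Convert the html text into a soup object we can navigate
soup = BeautifulSoup(result.text, 'html.parser')
print(soup)
```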

The soup object is still pretty messy looking: all of the html is just one long string. However, with a quick scroll, we can see that the data we want to scrape is contained in html tables. We can use the powerful find_all method on our soup object and pass in the 'table' tag name, which grabs every table element on the page.
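
In code, that looks something like:

```python
# Collect every <table> element on the page into a list
tables = soup.find_all('table')
print(len(tables))
```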

The resulting tables object can be accessed by indexing. The matchup (which teams played the game) is contained in the fourth table, so we index tables[3] and call .string.strip() to extract the text from between the html <tags> and trim the whitespace around it.
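
A sketch of that step; the index, and whether .string returns the text directly, are assumptions based on the page layout described above:

```python
# The fourth table holds the matchup; .string pulls the text out of the
# tag (it only works when the tag wraps a single piece of text), and
# .strip() trims the surrounding whitespace
matchup = tables[3].string.strip()
print(matchup)
```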

Next, we look at the fifth table, which contains the date, city, and stadium information. We grab that info with similar indexing and some basic Python string manipulation, like this:
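
Something along these lines; the exact splitting is illustrative, since it depends on how the page’s markup separates the pieces:

```python
# The fifth table holds the date, city and stadium
details = tables[4].get_text().strip()

# Break the block of text apart and discard any empty pieces
parts = [line.strip() for line in details.split('\n') if line.strip()]
print(parts)
```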

Finally, the sixth table is where the actual weather data lives. When we inspect tables[5], we see that the column labels for the weather data are all contained in html table headers, tagged <th>. Great: we can grab them, iterate over them, strip away the tags, and store them in a list, like this:
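
A sketch, assuming the same table layout:

```python
# Column labels live in the <th> table headers of the sixth table
headers = tables[5].find_all('th')
columns = [th.get_text().strip() for th in headers]
print(columns)
```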

Excellent. The final bit of information we need is the actual weather data that fills the cells of the table. That data is also in the sixth table, and each cell is tagged as table data, or <td>. There are 10 columns of weather data and 6 rows; however, the last 2 rows are just MIN and MAX values for each column (which we can easily recompute with pandas later if needed), so we’ll ignore those for now and only grab the first 4 rows. That means we need the first 40 cells (10 columns x 4 rows).
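
Grabbing those cells is one line:

```python
# All the weather values are in <td> cells; keep only the first 40
# (10 columns x 4 rows), skipping the MIN and MAX summary rows
cells = tables[5].find_all('td')[:40]
print(cells[:10])  # peek at the first row's worth of cells
```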

Still a bit messy, but we’re almost home. The last step is to iterate over the 40 items in our cell data, 10 at a time so we take each row as a chunk, perform the string manipulation, and append the results to a list, like so:
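
A sketch of that loop:

```python
game_data = []
# Step through the cells one row (10 cells) at a time
for i in range(0, 40, 10):
    row = []
    for cell in cells[i:i + 10]:
        try:
            row.append(cell.string.strip())
        except AttributeError:
            # Blank cells have no string and raise an error;
            # fill them in with a default value instead
            row.append('--')
    game_data.append(row)
```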

The try/except clause is for blank cells, which would otherwise throw an error; we fill them in with a default value of '--'.

Finally, we combine our game_data (the list of weather data, row by row) with our columns into a pandas dataframe and inspect the finished product:
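
That last step is short:

```python
# Combine the rows and column labels into a dataframe
df = pd.DataFrame(game_data, columns=columns)
df.head()
```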

Nice, we’ve got all the information we wanted in a clean pandas dataframe. To get the weather data for the rest of the season, or for however many years back we want, we can use the same process to scrape the urls for all the other games into a list. All the Python code we used to scrape this single game’s weather data can be wrapped in a function that accepts a game_url as a parameter. Once you have the urls for the other games, just iterate over that list, pass each url into the function, and get back a dataframe like the one above for that game. Append these dataframes to a list, and once the iteration is complete, quickly and efficiently concatenate them into one master dataframe.
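
Put together, the whole pipeline might look like this; game_urls is assumed to be the list of game page addresses you scraped separately:

```python
def scrape_game_weather(game_url):
    """Scrape one game page and return its weather table as a dataframe."""
    result = requests.get(game_url)
    soup = BeautifulSoup(result.text, 'html.parser')
    tables = soup.find_all('table')

    # Column labels from the <th> headers, cell values from the <td> cells
    columns = [th.get_text().strip() for th in tables[5].find_all('th')]
    cells = tables[5].find_all('td')[:40]

    game_data = []
    for i in range(0, 40, 10):
        row = []
        for cell in cells[i:i + 10]:
            try:
                row.append(cell.string.strip())
            except AttributeError:
                row.append('--')  # blank cell
        game_data.append(row)

    return pd.DataFrame(game_data, columns=columns)

# Scrape every game and stack the results into one master dataframe
frames = [scrape_game_weather(url) for url in game_urls]
master_df = pd.concat(frames, ignore_index=True)
```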

If you are new to web scraping, I recommend trying your hand at Beautiful Soup on a webpage of your choosing. And if you want to take a look at the full notebook for this blog post, you can check it out here.
