Quick Tip: The easiest way to grab data out of a web page in Python
Let’s say you are searching the web for some raw data you need for a project and you stumble across a webpage like this:
You found exactly what you need: an up-to-date page with the data for your project!
But the bad news is that the data lives inside a web page and there’s no API that you can use to grab the raw data. So now you have to waste 30 minutes throwing together a crappy script to scrape the data. It’s not hard, but it’s a waste of time that you could spend on something useful. And somehow 30 minutes always ends up being 2 hours.
For me, this kind of thing happens all the time.
Luckily, there’s a super simple answer. The Pandas library has a built-in function called read_html() that scrapes tabular data from HTML pages:
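As a minimal sketch of that call, the snippet below feeds read_html() a small inline HTML string instead of a live URL so it runs without a network connection. The table contents and the `html` variable are hypothetical placeholders, not data from the page described above, and read_html() also needs an HTML parser such as lxml installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for a real page. In practice you would pass
# the page URL (or the downloaded HTML text) to read_html() instead.
html = """
<table>
  <tr><th>Date</th><th>Calls</th></tr>
  <tr><td>2017-01-01</td><td>12</td></tr>
  <tr><td>2017-01-02</td><td>7</td></tr>
</table>
"""

# read_html() returns a list with one DataFrame per table it finds.
tables = pd.read_html(StringIO(html))
first_table = tables[0]
print(first_table)
```

Because the result is always a list, you index into it (here `tables[0]`) even when the page has only one table.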
It’s that simple! Pandas will find any significant HTML tables on the page and return them as a list of DataFrame objects, one per table.
To upgrade our program from toy to real, let’s tell Pandas that row 0 of the table has column headers and ask it to convert text-based dates into datetime objects:
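Here is a rough sketch of those two options, again using a made-up inline table rather than the real page. The column names ("Call Date", "Call Type", "Street") and every value are assumptions chosen for illustration. The `header` and `parse_dates` arguments are real read_html() parameters.

```python
from io import StringIO

import pandas as pd

# Hypothetical table whose first row holds the column names as plain cells.
html = """
<table>
  <tr><td>Call Date</td><td>Call Type</td><td>Street</td></tr>
  <tr><td>2017-06-01 10:15:00</td><td>Medical</td><td>Main St</td></tr>
  <tr><td>2017-06-01 11:40:00</td><td>Traffic Accident</td><td>Oak Ave</td></tr>
</table>
"""

# header=0 says row 0 contains the column headers.
# parse_dates lists the columns to convert from text to datetimes.
calls_df = pd.read_html(
    StringIO(html),
    header=0,
    parse_dates=["Call Date"],
)[0]

print(calls_df.dtypes)
```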
Which gives you this beautiful output:
And now that the data lives in a DataFrame, the world is yours. Wish the data was available as JSON records? That’s just one more line of code!
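A small sketch of that one-liner follows. To keep it self-contained, it builds a tiny DataFrame by hand with hypothetical column names and values instead of scraping a page. The to_json() call is the part that matters.

```python
import json

import pandas as pd

# Hypothetical data standing in for the scraped table.
calls_df = pd.DataFrame(
    {
        "Call Date": pd.to_datetime(
            ["2017-06-01 10:15:00", "2017-06-01 11:40:00"]
        ),
        "Call Type": ["Medical", "Traffic Accident"],
        "Street": ["Main St", "Oak Ave"],
    }
)

# orient="records" emits one JSON object per row;
# date_format="iso" writes datetimes as ISO 8601 strings.
records_json = calls_df.to_json(orient="records", date_format="iso")
print(records_json)

# Parse it back to confirm it is ordinary JSON.
records = json.loads(records_json)
```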
If you run that, you’ll get this beautiful JSON output (even with proper ISO 8601 date formatting!):
You can even save the data right to a CSV or XLS file:
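The sketch below shows one way to do the CSV export, using the same kind of hand-built, hypothetical DataFrame. The Excel line is left commented out because to_excel() depends on an extra writer package (for example openpyxl for .xlsx files) that may not be installed.

```python
import pandas as pd

# Hypothetical data standing in for the scraped table.
calls_df = pd.DataFrame(
    {
        "Call Date": pd.to_datetime(
            ["2017-06-01 10:15:00", "2017-06-01 11:40:00"]
        ),
        "Call Type": ["Medical", "Traffic Accident"],
        "Street": ["Main St", "Oak Ave"],
    }
)

# index=False leaves out the DataFrame's row numbers.
calls_df.to_csv("calls.csv", index=False)

# Excel export works the same way if a writer engine is available:
# calls_df.to_excel("calls.xlsx", index=False)

# Read the file back to check what was written.
round_trip = pd.read_csv("calls.csv", parse_dates=["Call Date"])
print(round_trip)
```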
Run that and double-click on calls.csv to open it up in your spreadsheet app:
And of course Pandas makes it simple to filter, sort or process the data further:
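For instance, a filter followed by a sort might look like the sketch below. The "Call Type" filter value and all the rows are hypothetical. Boolean indexing and sort_values() are standard Pandas operations.

```python
import pandas as pd

# Hypothetical data standing in for the scraped table.
calls_df = pd.DataFrame(
    {
        "Call Date": pd.to_datetime(
            [
                "2017-06-01 10:15:00",
                "2017-06-01 11:40:00",
                "2017-06-01 12:05:00",
            ]
        ),
        "Call Type": ["Medical", "Traffic Accident", "Medical"],
        "Street": ["Main St", "Oak Ave", "Pine Rd"],
    }
)

# Keep only the rows whose call type is "Medical".
medical = calls_df[calls_df["Call Type"] == "Medical"]

# Sort the filtered rows so the most recent call comes first.
newest_first = medical.sort_values("Call Date", ascending=False)
print(newest_first)
```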
None of this is rocket science or anything, but I use it so often that I thought it was worth sharing. Have fun!
Thanks for reading! If you are interested in machine learning (or just want to understand what it is), check out my Machine Learning is Fun! series too.