How To: Scrape the Internet for Data using Python
This is part 1 of 3 in a series on working with data about the 2017 running back draft class.
You know you’re a geek when a problem comes up and you immediately think of coding the solution.
I’m a huge sports fan, go #WhoDatNation, and one of the best parts of sports is the endless amount of data that comes with it. Every play in every game in every sport generates more data, which means more for someone to analyze.
Just yesterday during the Vikings and Packers game, the commentators mentioned the 2017 NFL Draft, and how it produced one of the best draft classes we’ve ever seen. So I spent some time looking online to get the full list of players and see how they’ve done since then. Turns out, there isn’t a simplified list — so let’s make one!
In this How To, we’ll learn how to query the internet using Python and the requests library. We’ll then convert the data into a DataFrame, so that we can use Pandas and other libraries in Python to further manipulate and analyze the data.
What is web scraping? It is, as it sounds, scraping the web for information. Now, this doesn’t mean we go through the pages and copy out the information ourselves; it means we use some kind of technology to make the retrieval process easier.
We’ll start by picking what information we want to look at, which in our case is some kind of list of the 2017 draft picks. Doing a quick search shows us that Wikipedia’s page on the 2017 NFL Draft has a nice, easy-to-read table, so we’ll use that.
Next, let’s import all of the necessary libraries.
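If you’re following along, the imports might look something like this (assuming you’ve installed requests, beautifulsoup4, pandas, and lxml, e.g. via pip):

```python
import requests                # for sending HTTP requests
from bs4 import BeautifulSoup  # for parsing and searching HTML
import pandas as pd            # for DataFrames and data manipulation
```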
The requests library in Python allows you to send requests to a URI/URL. These requests are typically POST (adding), PUT (updating), GET (retrieving), and DELETE (removing). These are actions that you can take when interacting with a typical API, and the requests library makes it easy for you to interact with said API. For our purposes, we only need to send a GET request to Wikipedia’s webpage.
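A sketch of that request (the URL is Wikipedia’s page for the 2017 NFL Draft):

```python
# Send a GET request for the raw HTML of the Wikipedia page
r = requests.get("https://en.wikipedia.org/wiki/2017_NFL_draft")
```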
This line uses the `get` method of the requests library and returns a Response object, in our case `<Response [200]>`. The `200` is an HTTP status code; 200 OK indicates a successful request!
The content of the response is stored in `r.content` and is the raw HTML of the Wikipedia page we queried. We’ll use this content with BeautifulSoup, a library that makes the HTML readable and lets us search it for elements (such as the table we are looking for).
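That step might look like this (the name `root` is just what we’ll call the parsed document):

```python
# Parse the raw HTML into a searchable BeautifulSoup object
root = BeautifulSoup(r.content, "html.parser")
```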
Here, `root` holds the parsed HTML, and is an object we can search through using the `.find` method of BeautifulSoup. Let’s first find the exact table in the HTML.
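A sketch of that search; I’m assuming the draft table uses Wikipedia’s usual wikitable sortable class, so inspect the page source to confirm:

```python
# Find the draft table by its CSS class (an assumption; check the
# page source for the actual class on the table you want)
table = root.find("table", {"class": "wikitable sortable"})
```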
We’ve found the table, but how are we going to be able to use it going forward? That’s where Pandas comes in. Pandas is a library that provides different types of data structures for data manipulation and analysis. When dealing with data in Python, it is a great resource for its flexibility and features. A DataFrame in Pandas is a 2D array, represented as a table. We use these tables to access data easily.
Back to the problem at hand. Now that we’ve found the table, we can use BeautifulSoup’s `find` method to look through the HTML for the right table (by specifying which class we are looking for). We can then take the table’s HTML and create a Pandas DataFrame with it for further manipulation.
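Something like the following should do it; note that `pd.read_html` returns a list of DataFrames, so we grab the first one:

```python
# Convert the table's HTML into a DataFrame
# (read_html returns a list, so take the first entry)
df = pd.read_html(str(table))[0]
```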
Congrats! You were able to scrape the web for data and store it in an easy-to-use data structure that we can perform data analysis with. We’re not quite done yet, however. The next step is preprocessing the data. We’ll clean up our DataFrame by removing the `Unnamed: 0` column and renaming the remaining columns.
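A sketch of that cleanup; the original header names in the rename are assumptions on my part, so adjust them to whatever your DataFrame actually contains:

```python
# Drop the empty "Unnamed: 0" column left over from read_html
df = df.drop(columns=["Unnamed: 0"])

# Rename columns to friendlier labels (the keys here are
# hypothetical; match them to your actual column headers)
df = df.rename(columns={"Rnd.": "Round", "Pick No.": "Pick", "Pos.": "Position"})
```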
Preprocessing data is an essential step for data analysis because it gives us “tidy data” that only contains values we care about. The data should also be easy to use, so that later steps (like data visualization or machine learning) are easier.
Now that we’ve cleaned up the data, we can start looking at what we are interested in: the running backs in the class (indicated by the “RB” position). Luckily for us, Pandas makes it easy to filter our data (and there are many ways to do so). We’ll create a new DataFrame to store all of the running backs.
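For example, a boolean mask on the Position column (using the names from the rename above) picks out the running backs:

```python
# Keep only the rows where the player's position is RB
rbs = df[df["Position"] == "RB"]
print(rbs.head())
```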
With that, we’ve finished this tutorial! We were able to use the requests and BeautifulSoup libraries to query the Wikipedia webpage and get the HTML of our table. Then we used Pandas to turn that HTML into a DataFrame, which we were able to clean up and filter to get our list of running backs!
In the next part of the series, we’ll use what we learned about making requests to grab some more data and learn how to mess around with DataFrames. Then we’ll learn to visualize our data and look at how the 2017 running backs are performing in the NFL right now, compared to the rest of the gang.
The complete code from this tutorial is below:
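A consolidated sketch, under the same assumptions as above about the table’s class and column names:

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# 1. Query Wikipedia's page on the 2017 NFL Draft
r = requests.get("https://en.wikipedia.org/wiki/2017_NFL_draft")

# 2. Parse the HTML and find the draft table by its (assumed) class
root = BeautifulSoup(r.content, "html.parser")
table = root.find("table", {"class": "wikitable sortable"})

# 3. Turn the table's HTML into a DataFrame
df = pd.read_html(str(table))[0]

# 4. Clean up: drop the empty column and rename (names are illustrative)
df = df.drop(columns=["Unnamed: 0"])
df = df.rename(columns={"Rnd.": "Round", "Pick No.": "Pick", "Pos.": "Position"})

# 5. Filter down to the running backs
rbs = df[df["Position"] == "RB"]
print(rbs)
```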
As always, I’d love to hear what you thought about this post! Send any questions or comments to amanjaiman@outlook.com.