How To: Manipulate DataFrames in Pandas
This is part 2 of 3 in the series working with the 2017 running backs. Click here to view part 1.
To be able to learn new things from data sources, you need to make sure that the data is clean and easy to work with. This is the most important (and also a lot of people’s least favorite) step in working in Data Science.
If you remember the last How To, we’re trying to see how good the 2017 running backs really are, since we were told that draft class is among the best in the league. Now, let’s clean the data so we can see if it’s true!
In this How To, we’re going to be using our 2017 running back data again, and seeing if that class of backs really is dominating the NFL. We’ll start by collecting a little bit more data, cleaning it so it is usable, and then start analyzing it.
Creating Our “Clean Data”
By the end of the last post, you should’ve been able to create a DataFrame in Pandas that stores the running backs drafted in 2017. At first glance is seems like the data is fine to work with, but it turns out, there is a little bit of preprocessing we should do to make it easier to work with.
Cleaning the Data
Taking a look at the data, we can see that some of the players have a † next to their name. This indicates if the player was in the Pro Bowl or not, but for our purposes, we can get rid of it from the name (plus, the Pro Bowl isn’t always a great measure of how good a running back really is). Another thing to notice is that some round numbers have a *. This tells us if the player was a compensatory selection or not, but we only care about the player, not why/how they were drafted.
To clean those parts of the data, we’ll have to iterate through the rows of the DataFrame and clean each entry. There are many ways to iterate through a DataFrame, and each has it’s benefits, but for simplicity, we’ll be using the iterrows method.
You can check out the cool performance differences of the ways to iterate in a great post on Towards Data Science
When we call .iterrows()
on our DataFrame, it generates a pair of (index, Series) that we can iterate through. A Series in Pandas is similar to a list (it is a wrapper around the NumPy array). In our case, the Series is each row, and we can access different columns of the row using the column name.
One thing to note about iterrows is that the row that is returned is a copy of the original. This means if you modify it, the original DataFrame is unchanged. Since we need to actually change our DataFrame, we can use .loc
, which is a way to locate a value by a (row,col) pair. Then, once we’ve gotten the actual value, we can change it. We’ll use this to change a player’s name with a † to without, and the same for the round number with a *.
Let’s see all of this in action now.
Adding Name Columns
The last thing we’ll do to our DataFrame is split the name into two columns: first and last. This is so that when we look for players in our DataFrame, we can access them by their last name (the other data set we’ll be working with doesn’t have player’s first names).
To do so, first take a look at the names of the running backs. All of the names are in the form “first last”… except for T. J. Logan, who has two spaces in his name. We’ll have to change his name to match the format of the others, but don’t fear! We’ve already seen how .loc
works, so we can use it again here.
Next, we’ll use str.split()
to split the first and last names. When you call this on a column of the DataFrame, it will split the string by the given separator. If we set it’s expand
argument to True
, it will automatically convert the splitted string into columns of a DataFrame.
We can set these two columns to two new columns in our rb_df
.
Collecting 2018 and 2019 Stats
We’ve got our running backs from 2017, but we need some stats to see how they’ve done. Luckily for us, Football Outsiders has their data in some nice tables for us to use. Here is their 2018 and 2019 data.
The next part is going to be what we already learned in the last How To. We’ll use requests to get these tables for the 2018 and 2019 web pages, and store them as DataFrames. I’ve provided the code to do so, but it will be very similar to how we got the initial data.
Note: I also looked at the 2017 web page, and because of how the table was formatted, it posed a lot of challenges for creating a usable DataFrame (columns had wrong datatypes, unnecessary rows, etc.). While doable, we can stick to just 2018 and 2019, which is more than enough data. If you want to explore the 2017 data to see how the running backs did immediately after the draft, go for it!
With these DataFrames, we’re going to do two things: change the type of a few of the columns, and add a couple of new ones.
Fixing Column Types
One thing you should notice when you look at the data is that most of the values are numeric; however there are three columns that aren’t: DVOA, VOA, and Suc Rate. The percent sign in those columns means that the column datatype is a string. We can verify this easily.
rb2018_df['DVOA'].dtype # dtype('O')
rb2018_df['VOA'].dtype # dtype('O')
rb2018_df['Suc Rate'].dtype # dtype('O')
The dtype('O')
means Object in Pandas, which in this case is a string. We need it to be a float or int so that we can use it for visualizing with graphs.
To do this, we can take the column and apply str.replace('%', '')
to get rid of the percent sign. Then, once there are only float values left, we can use .astype(float)
to convert the entire column from a string column to a float column.
Adding Columns to Our DataFrames
First, let’s indicate if a running back belongs to the 2017 draft class. To do this, we will look to see if the running back’s last name is in our last
column of the rb_df
. We can grab the column as a Series just by accessing it: rb_df['last']
and then convert it to a list with .tolist()
.
Next step: back to iterrows! We’ll go through each row of the 2018 (and 2019) stats DataFrames and check if the last name of the player is in our list. Pandas allows adding a column from a list, so we can keep track of this in a list. If the player is in the 2017 list, we can append True
and otherwise False
. Lastly, simply create a new column by naming it and setting it to the list.
The next two columns we’re going to add are Yards Per Attempt (YPA) and Effective Yards Per Attempt (EYPA). These two values are based on two values we already have in our DataFrame: YPA = Yards/Runs and EYPA = EYds/Runs. This is something that Pandas handles really well.
Because we want to do the same operation for each row of the columns, we can simply perform the operation on the columns, and Pandas will carry it out entry by entry. It looks something like this:
It’s that easy! With that, we’re done with all the DataFrame manipulation!
We were able to learn how to edit data in Pandas, by iterating through the DataFrame, using .loc
and column operations. Now our data is ready for the last step: analysis.
In the final tutorial of this series, we’ll be learning how to visualize our data so that we can look at visuals to draw conclusions, rather than sift through the numerous columns of data we already have. This is the last step in being able to determine how good that 2017 running back class really was.
The complete code from this tutorial is below:
As always, I’d love to hear what you thought about this post! Send any questions or comments to amanjaiman@outlook.com.