Cleaning Flat Files and Calculating Basic NBA Stats with Pandas

Published in

Hardwood Convergence

9 min readJul 21, 2019

The Plan

Now that we have a small amount of data, let’s start putting it to good use. If you’re just joining us, feel free to go back through the first few tutorials or download our repository on github. Here’s what we’ll be doing today:

Load the flat file dataset into pandas
Explain the columns- skip this if you know basic basketball stats
Calculate A Few Normalized Stats

That’s enough of an outline, let’s get rolling.

Loading Flat Files with Pandas

Let’s start a new jupyter notebook in the same web_scraping folder. We’ll name it harden_stats.ipynb and import pandas and os. Now we just need to load in Harden’s game logs that we created last tutorial. That’s easy enough, just change the working directory to the game_logs directory and use the pandas load_csv() function to load in the dataframe.

A couple quick notes. To see your current directory in python, you can use:

os.getcwd()

To see what files are available in that directory, call the listdir function on that directory. It looks something like this:

os.listdir(os.getcwd())

Now that we’re in the right directory, we can load the dataframe and make sure it looks correct by checking out the first few rows:

All looks good so far, so let’s check out the data types to understand what kind of data we’re working with. We’ll use the dtypes attribute of the dataframe object:

We can see that python is inferring the data types based upon the data it received. That makes our life easier as we don’t have to cast data as a specific data type before writing our code! Now to get Harden’s points per game, we can just write:

df[df.game_season != '']['pts'].mean().round(1)

A Brief Intermission to Introduce These Stats

We now have our dataframe in python. If you’ve read this far, I’m assuming you’re probably somewhat familiar with basketball. If not, but you’re interested, let’s quickly run through these columns:

game_season: game number this season for this particular player
date_game: date the game was played
age: player’s age in years and days
team_id: abbreviation of the player’s team
game_location: NaN= home game for player, @= road game for player
opp_id: abbreviation of the opponent’s team
game_result: W= win, L= loss. (Margin of Victory)
gs: 1 = player started game, 0 = did not start
mp: amount of time played in the particular game
fg: number of shots from the field made
fga: number of shots from the field attempted
fg_pct: fg/fga
fg3: number of three point shots made
fg3a: number of three point shots attempted
fg3_pct: fg3/fg3a
ft: number of free throws made
fta: number of free throws attempted
ft_pct: ft/fta
orb: number of offensive rebounds
drb: number of defensive rebounds
trb: orb + drb
ast: number of times the player assisted a teammate on a basket
stl: number of times the player stole the ball from the opponent
blk: number of times the player blocked an opponent’s shot
tov: number of times the player turned over the ball. 
pf: number of personal fouls called on the player in the game
game_score: an attempt to determine the player’s total value to a game. Typically 10 would be an average player score and 40+ would be an excellent game.
plus_minus: difference between points scored by the teams when the player was on the court. A positive plus minus means the team scored more than the opponent while the player was playing.

Making All Columns Useful

We’ve downloaded our data and have it in a mostly usable format, but we can tell there are fields that could be valuable that we can’t easily manipulate. Let’s focus on four new fields to start:

We’ll change minutes played into a float value.
Lets create an indicator for a game played to save us from filtering out blank game_season fields.
Make a home game binary field to group games.
Create a win-loss field.

Transforming Minutes to Float

There are a few ways we can do this, but let write a function that manipulates the string and returns a float value. Here’s a function we can use:

def str_min_to_float(x):
    if pd.isna(x)==True:
        return 0
    else:
        mins = int(x.split(':')[0])
        secs = int(x.split(':')[1])/60.0
        val = mins + secs
        return round(val, 1)

The function takes the minute string as x and checks to see if there is a value there. If the variable doesn’t have a value, then the minutes played is set to 0. Otherwise, we split the time string based on the colon. The left part of that split becomes an integer that represents the minutes played. The right part of the split becomes an integer of seconds played that is divided by 60.0 to return the fraction of a minute played. We add these values together and round them to one decimal spot to get our minutes played float.

Now that we have our function, let’s use the apply() method to run it on the entire dataframe. We’ll save this as a new mp column. Here’s the code:

df['mp'] = df['mp'].apply(str_min_to_float_to_float)

Games Played Indicator

Here, we can just look at the game_season column and check if has a non-blank value. If so, we want our game indicator to return a 1, else a 0. For this, we can use a list comprehension and set it to the value of a new column:

df['g'] = [1 if pd.isna(df.game_season[gm])==False else 0 for gm in range(len(df))]

Now, if we take the sum of that column (df.g.sum()) we see that Harden played in 78 games last season.

Home Game Binary Indicator

I find it easier if we have a binary (1,0) indicator for a home game instead of a blank vs @ sign. Let’s use another list comprehension like above to create this field. After we create this field, we’ll drop the game_location column so we don’t have duplicate data:

df['home_game'] = [1 if pd.isna(df.game_location[gm])==True else 0 for gm in range(len(df))]
df.drop(['game_location'], inplace=True, axis=1)

We’ve gone over the list comprehensions a few time, so nothing new there. Let’s quickly explain the drop() method. Drop allows us to remove columns or rows from a dataframe. The first argument is a list of columns we want to drop (this can also just be a single object, but it’s easier to just always use a list). The inplace argument allows us to drop the column without setting the df object to the df with the dropped column…if that’s unclear it means:

We can do this:
    df.drop(['column_name'], inplace=True)
Instead of this:
    df = df.drop(['column_name'], inplace=False)

Finally, the axis=1, means we’re dropping along the columns.

Win-Loss Indicator

We know that the game result field must start with a W or an L since ties aren’t allowed in basketball. Let’s do another list comprehension that checks whether the first value in that field indicates a win or a loss and change it to a binary value:

df['win'] = [1 if df['game_result'][gm][:1] == 'W' else 0 for gm in range(len(df))]
sum(df.win)

The sum should print out to 53 wins. The Rockets went 53–29 last season, so that checks out.

Calculating A Few Normalized Stats

What do we mean when we say “normalized stats?” In this sense, we mean transforming the statistic so that we can compare evenly despite the circumstances around the stat being different. We’ll use the term normalized much more rigidly when we start machine learning. Anyway, let’s look at an example in our data.

On November 23, Harden scored 33 points in a game against the Detroit Pistons. A week later, on November 30th, he scored only 23 points in a game against the San Antonio Spurs. In which game did Harden score more efficiently? At first look, we would say the November 23rd game since he had 10 more points than the game on the 30th. But when we check out the other stats, we see that Harden played almost 41 minutes, whereas he only played 27.5 minutes on November 30th. Let’s normalize the points stat by how many points Harden scored per-36 minutes to get one simple view to answer our question.

Since we want flexibility in standardizing our stats by time (we could do per minute, per 36, per 48, etc.), let’s write a function that affords us that flexibility. We’ll use the following function:

def stat_per_time(stat, minutes, per= 36):
    if minutes == 0:
        return 0
    else:
        val_min = stat/minutes
        return round(val_min * per, 1)

The function begins with the minutes variable. If minutes is 0, then we’re just going to return 0 since we don’t want a div/0 error. Otherwise, the function is going to calculate the stat per minute played and then it is going to return that value multiple by our per amount, which is defaulted to 36.

Now, we need to learn how to apply this function. We cannot use the regular apply method because we’re applying this to multiple columns. This is a great spot to introduce lambda. Lambda allows us to create anonymous functions in python. We’ll use it here to create a pts_per_36 column with the following code:

df['pts_per_36'] = df.apply(lambda row: stat_per_time(stat = row['pts'], minutes = row['mp']), axis=1)

Let’s break this down into pieces to make it clear. You can see that instead of calling apply on a specific column, we’re instead calling it on the entire dataframe. Since we have to pass the function multiple columns, we’ll create an anonymous function with lamba and the newly created row variable. We’ll apply our stat_per_time function with the stat being each row variable’s ‘pts’ column and the minutes being each row variable’s ‘mp’ column. We’ll then pass this down the 1 axis (columns) in the dataframe. So what we’re doing is really calling our stat_per_time function on every row in the dataframe and feeding it that row’s pts and mp columns.

This provides us flexibility, because if we wanted to look at another per 36 stat, such as rebounds per 36, we could just do the following:

df['trb_per_36'] = df.apply(lambda row: stat_per_time(stat = row['trb'], minutes = row['mp']), axis=1)

We can also do turnovers per 12 minutes by adjusting the the stat and the per arguments:

df['tov_per_12'] = df.apply(lambda row: stat_per_time(stat = row['tov'], minutes = row['mp'], per= 12), axis=1)

So we achieved our goal of creating a flexible function to calculate multiple standardized stats. Now let’s go back to our question of which game Harden was a better scorer. Let’s pull just the columns we need from our dataframe and check it out:

df[(df.date_game=='2018-11-23')|(df.date_game =='2018-11-30')][['opp_id', 'pts', 'pts_per_36']]

Harden’s pts per 36 for our comparison games

Now we see that Harden scored 33 points against Detroit, which was 29 points per 36 minutes played. Against San Antonio a week later, Harden only scored 23 points, but when adjusted for time, he was scoring at a pace of 30 points per 36 minutes played. Does this mean Harden played better against San Antonio? Of course not, this is just one measure. But it is the type of thinking that will help us make fairer comparisons in the future.

Wrapping Up

We covered a lot more ground in this tutorial. We learned how to load a flat file into pandas, manipulate columns, engineer new features, and apply functions to dataframes in two different ways!

Don’t forget to save your dataframe as a csv file because we’ll be using it again in the next tutorial. We’ll focus next time on grouping statistics and we’ll start some basic plotting.

Thanks for checking this out and hope you found it valuable. If you learned something from this tutorial, please tell a friend and smash that clap button!👏👏👏