Showtime: Pulling Multiple Seasons of Player Stats with NBA_Py

Dan Watson
Hardwood Convergence
12 min read · Aug 7, 2019

The last two posts discussed how to pull an individual player’s stats with nba_py and then gave a quick overview of the entire package. Today is going to be a longer post as we start pulling larger amounts of data through the package. Either follow along with the code, or you can pull it from our GitHub.

The Plan

  • Introduce the datetime package to manipulate time data
  • Discuss using the random and time modules to limit API calls
  • Explain the multiple options in pulling game logs

Introducing Datetime

Let’s get started by creating a new workbook and importing our packages:

Importing our new packages: datetime, random, and time
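In code, that looks something like this (pulling in nba_py’s team and game modules now is my assumption, since we use them later in the post):

import pandas as pd
import datetime as dt
import random
from time import sleep
from nba_py import team, game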

Not too much new stuff here, but you see that we added a few new packages: datetime, random, and time. These are all part of Python’s standard library, so there’s nothing extra to install. Since you’ll end up working with dates and times frequently while programming, let’s go into a bit of depth with datetime. First, you’ll see that we imported it as dt. This is pretty standard in the Python world, just as we import pandas as pd and numpy as np.

Getting a date from datetime is pretty easy. To get today’s date, we just type the following:

Getting today’s date with datetime
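For example (the output shown reflects this post’s publication date):

dt.date.today()
# datetime.date(2019, 8, 7)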

You can see that we’re calling the date class from the datetime package and then calling its today() function. It returns a date object that gives us the year, the month, and the day number. If we want to parse out just the month, we can just add .month to the end:

Getting today’s month from datetime

You probably noticed that when you call a specific part of a date from datetime, it returns an integer! This makes it easy to pass the value into other functions or perform calculations. We can also perform calculations in datetime. Let’s set a variable to an arbitrary date, October 4, 2012, and then add a day:

Setting our start_date variable to October 4, 2012
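start_date = dt.date(2012, 10, 4)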

Let’s see what happens if we just try to add 1 to our date:

Attempting to add a day with an integer throws an error
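start_date + 1
# TypeError: unsupported operand type(s) for +: 'datetime.date' and 'int'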

You can see that we get a TypeError with this code because we’re trying to add an integer to our datetime.date object. We could get around this by adding 1 to start_date.day and feeding that back as a new date object, but that would be messy, and what would happen if we tried to add a day to October 31st? Just adding an integer to the day field would give us October 32nd, which is out of range for the month and would throw an error! Luckily, the datetime package has this figured out with the timedelta class:

Using timedelta to increment time
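start_date + dt.timedelta(days=1)
# datetime.date(2012, 10, 5)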

You can see that we can easily increment time by a day with timedelta. It also takes care of the month and year for us when we add enough days to carry us into 2013:

Timedelta keeps track of years, months, and days easily!
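start_date + dt.timedelta(days=100)
# datetime.date(2013, 1, 12)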

One last thing with datetime: you have options in how you format dates. You can use the same date object along with the strftime() function to return date strings in different formats:

Using strftime to format a date output string
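start_date.strftime('%m/%d/%Y')       # '10/04/2012'
start_date.strftime('%B %d, %Y')      # 'October 04, 2012'
start_date.strftime('%A, %B %d, %Y')  # 'Thursday, October 04, 2012'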

Using Random and Sleep to Rate Limit API Calls

We want data fast, so why are we rate limiting our API calls? A few reasons. First, many sites expressly limit you to a set number of calls per hour or per day, or charge fees if you go over a certain call limit. Second, for sites that don’t have set limits, it’s just good practice to limit your calls and not overburden their servers. Third, and selfishly, it can keep you from getting blocked by those sites for pinging their servers hundreds of times per minute. However you want to justify it, limiting your calls to a third-party API is just good practice.

There are a few ways to do this, but I like using the time and random packages. Here’s a simple example:

Using sleep and random to limit printing of an integer
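A minimal version of that loop:

for i in range(10):
    sleep(random.uniform(.05, 1.8))
    print(i)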

As you can see, we’re just looping through the numbers from 0 to 9 and printing them in our console. But then we have this bit of code before the print statement:

sleep(random.uniform(.05,1.8))

It’s simple, but let’s discuss it to be clear. First, we’re selecting a random number from a uniform distribution, meaning any number in the range is equally likely to be selected. Our range in this example is .05 to 1.8. We then pass this number into our sleep function, which tells Python to wait that many seconds before continuing to the next line of code. When we add this code before our print function, the resulting output is staggered: sometimes you’re waiting for the next number to print, and sometimes you get a burst of three numbers in a row. It just depends on which random number was selected from our uniform distribution!

We can use this same type of logic with our calls via the nba_py package. Between our calls we can add a random break so that we don’t send thousands of calls to nba.com every minute.

Pulling Data From NBA_py

We’re finally getting to the code that we’ve been talking about since the first post on this publication. Now that we’re here, we have a few options on how we want to structure the pull. I’m going to start with what I view as the best structure to do this, but practically I don’t think it’s too likely to be successful.

Option 1- Logical, but unlikely to succeed due to rate limits.

In this option we’re going to pull the game ids for each season for which we want to pull data and put them in a list. We’ll then iterate through that list of game ids and create a single dataframe for each table that we want to save.

Pros: This option is the easiest way to get all the game ids necessary to pull the stats we’re after, it automatically removes pre-season and post-season games from the dataset, and we don’t need to pass start dates or end dates for each season.

Cons: This is not a great method for incremental data pulls, so we’ll need to write separate code for daily updates during the season. Also, if your connection gets interrupted, you need to start the code from the beginning, which isn’t ideal.

Process: We’ll start by storing a list of team_ids for the teams in the league and a list of seasons for which we want to pull data:

Creating a list of team_ids and seasons
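A sketch of what this looks like, assuming we hardcode the NBA’s 30 sequential team ids:

# The 30 NBA team ids are the sequential integers 1610612737-1610612766
team_ids = list(range(1610612737, 1610612767))
seasons = ['2012-13', '2013-14', '2014-15', '2015-16',
           '2016-17', '2017-18', '2018-19']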

Next we’re going to use the following code to get a list of game ids for every game played in the seasons we’re interested in:

Creating a list of game ids
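Something like the following, assuming nba_py’s TeamGameLogs endpoint:

gids = []
for team_id in team_ids:
    for season in seasons:
        sleep(random.uniform(.05, 1.8))
        logs = team.TeamGameLogs(team_id, season=season).info()
        gids += list(logs['Game_ID'])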

This code may be pretty clear, but I’ll explain to make sure we’re all on the same page. We initialize a blank list of game ids, then loop through every team and every season, pausing for a random .05 to 1.8 seconds on each iteration. After that, we pull the team game logs for the current team and season, take the ‘Game_ID’ column as a list, and add it to the gids list.

By the way, we haven’t gone over this syntax before, but “+=” adds to a variable in place. So if x = 1, x += 1 would then set x equal to 2. Since adding two lists concatenates them, we’re just appending the new game ids to our list.

Now that we have our gids, I’d suggest saving them in a csv so that we don’t have to redo that pull. Also note in the following code that we used set() on our game ids. A set is an unordered collection of unique values. Since we pulled the game ids for each team for each season, we would have pulled each game id twice: once for the home team and once for the away team. Using set() removes those duplicates.

Saving our game_ids.
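For example (the filename is my choice):

gids = list(set(gids))  # drop the duplicate home/away copy of each game id
df_gids = pd.DataFrame({'game_id': gids})
df_gids.to_csv('game_ids.csv', index=False)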

Now we can start pulling the data. I’m going to be pulling the following tables:

  • player game logs
  • player game logs advanced
  • player game logs four factors
  • player game logs misc
  • player game logs scoring
  • player game logs usage
  • team game logs
  • team game logs advanced
  • team game logs four factors
  • team game logs misc
  • team game logs scoring
  • team game logs usage
  • game logs other
  • game log summaries

Here’s the first part of the pull:

First part of pulling game logs
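Here’s a sketch of that loop, abbreviated to a few of the fourteen tables. The dataframe names and the specific nba_py boxscore endpoints are my assumptions, and DataFrame.append was the pandas idiom of the era:

# One dataframe per table; the four factors, misc, scoring, and usage
# tables follow the exact same pattern and are omitted here for space
df_player_gl = pd.DataFrame()
df_team_gl = pd.DataFrame()
df_player_gl_adv = pd.DataFrame()
df_team_gl_adv = pd.DataFrame()
df_gl_other = pd.DataFrame()
df_gl_summary = pd.DataFrame()

for i, gid in enumerate(df_gids['game_id']):
    sleep(random.uniform(.05, 1.8))
    box = game.Boxscore(gid)
    df_player_gl = df_player_gl.append(box.player_stats())
    df_team_gl = df_team_gl.append(box.team_stats())
    adv = game.BoxscoreAdvanced(gid)
    df_player_gl_adv = df_player_gl_adv.append(adv.player_stats())
    df_team_gl_adv = df_team_gl_adv.append(adv.team_stats())
    summary = game.BoxscoreSummary(gid)
    df_gl_other = df_gl_other.append(summary.other_stats())
    df_gl_summary = df_gl_summary.append(summary.game_summary())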

There are definitely more pythonic ways of writing this type of pull, but all we’re doing is pulling each table for each game_id in our gids list. I also added a sleep between each pull to rate limit ourselves a little bit. This code just covers the initial pull for each dataframe, which is then followed by:

Second part of our pull
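Continuing inside the same loop (the length of the longer pause is a guess):

    # still inside the loop above
    if i % 300 == 0:
        print(i)                       # progress check
        sleep(random.uniform(10, 30))  # longer pause every 300 pulls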

This code just helps us track our progress. Every 300 pulls, we print the index value to confirm that we’re still pulling data, add a longer sleep, and then continue the process. Finally, when all the data is downloaded, we save each dataframe as a csv:

Saving our dataframes as csvs
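For example (filenames are my choice):

df_player_gl.to_csv('player_game_logs.csv', index=False)
df_team_gl.to_csv('team_game_logs.csv', index=False)
df_player_gl_adv.to_csv('player_game_logs_adv.csv', index=False)
df_team_gl_adv.to_csv('team_game_logs_adv.csv', index=False)
df_gl_other.to_csv('game_logs_other.csv', index=False)
df_gl_summary.to_csv('game_log_summaries.csv', index=False)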

Again, if you can run this process and get it to work, that’s great, but I suspect you’ll get rate limited and blocked by nba.com if you run this many pulls against their system. I don’t know exactly how many pulls will trigger a block, but you could experiment with switching VPNs/IPs during the pull. I’ve had mixed results with VPNs, but maybe that’s something you’re more familiar with.

Option 2- Don’t use this option for larger pulls.

In this option, for which I’m only going to write out partial code, you’ll iterate through each day of the NBA season, pull all the game ids for that date, and then pull all the game logs.

Pros: This type of code allows you to pull for specific dates and can easily be adapted for daily pulls throughout the NBA season. Also, if your pull gets interrupted, you’ll have your data already saved for those dates and you can pick up from where you left off.

Cons: By iterating through the days, you’ll need to make almost 1,200 server calls just to get the game ids for the 2012–13 through 2018–19 seasons (roughly 176 days times 7 seasons), compared to 210 with Option 1 (30 teams times 7 seasons). Further, this code makes it more difficult to distinguish between regular season and preseason games, so you’ll have to add start dates and end dates to your pulls.

Process: We’ll start with one more quick datetime lesson, this time on strptime(), which parses datetime objects from strings. In the code below, we create a list holding a single start date and a single end date. We call strptime(), pass in the string, and give it the expected format. To get the number of days between the dates, we can subtract their datetime objects: there are 176 days between October 16, 2018 and April 10, 2019. This is our num_days variable. If we add num_days back to our start date, we get the end date back.

Using strptime() to get datetimes from strings
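A sketch of that code:

dates = ['10/16/2018', '04/10/2019']
start = dt.datetime.strptime(dates[0], '%m/%d/%Y')
end = dt.datetime.strptime(dates[1], '%m/%d/%Y')
num_days = (end - start).days  # 176
start + dt.timedelta(days=num_days)
# datetime.datetime(2019, 4, 10, 0, 0)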

Now that we know how to get a datetime object from a string, we can initialize lists of start and end dates for the NBA seasons and get the number of days we have to pull for each year:

Setting our season start and end dates and calculating the length
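A sketch with two seasons filled in (look up the remaining seasons’ start and end dates on nba.com and extend the lists):

# 2012-13 and 2018-19 shown; add the remaining seasons' dates
season_start_dates = ['10/30/2012', '10/16/2018']
season_end_dates = ['04/17/2013', '04/10/2019']

season_lengths = []
for s, e in zip(season_start_dates, season_end_dates):
    delta = (dt.datetime.strptime(e, '%m/%d/%Y')
             - dt.datetime.strptime(s, '%m/%d/%Y'))
    season_lengths.append(delta.days)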

Since I don’t really endorse this method for our purpose, I’m only providing the pseudocode for how you would pull this data:

Pseudocode for pulling data per option 2
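A rough sketch of what that could look like, assuming nba_py’s Scoreboard endpoint for looking up games by date:

from nba_py import Scoreboard

for start, length in zip(season_start_dates, season_lengths):
    day = dt.datetime.strptime(start, '%m/%d/%Y')
    for _ in range(length + 1):
        sleep(random.uniform(.05, 1.8))
        board = Scoreboard(month=day.month, day=day.day, year=day.year)
        day_gids = list(board.game_header()['GAME_ID'])
        # ...pull and save each game log table for day_gids, as in option 1
        day += dt.timedelta(days=1)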

Basically, all you would do is zip the start date and season length variables and iterate through each day: add that number of days to the start date and look up the game ids for that date. Once you have the game ids for the day, you’d pull and save the tables you want. Again, this is very rough pseudocode, and I really don’t think this is a good option for our purpose.

Option 3

Here I’m going to pull the game logs. It’s not the most efficient method, but it should stay under nba.com’s rate limit, and if the connection gets interrupted, we can continue from wherever it left off.

Process: I’m going to start with the dataframe of game_ids that we pulled and saved previously. Since this is the first time we’re running this, I’m adding a column called “downloaded” to the dataframe and initializing its value to 0:

Loading our gids dataframe
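Something like this (the filename matches what we saved earlier):

# dtype keeps the leading zeros in the game ids
df_gids = pd.read_csv('game_ids.csv', dtype={'game_id': str})

# Run once, save, then comment out so the progress column isn't reset:
# df_gids['downloaded'] = 0
# df_gids.to_csv('game_ids.csv', index=False)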

You’ll see in the workbook that this code is commented out. I ran it once, immediately saved the dataframe, and then commented it out so that I don’t accidentally run it again. We’re using the downloaded column to track our progress: if our connection gets reset, I’ll just load the most recent version of the dataframe and limit the pulls to rows where downloaded is 0.

Also, you’ll notice that we pass a dtype argument when we read the csv. This is because the game ids start with ‘00’. Pandas would otherwise read them as integers and strip the leading zeros. Specifying the data type as a string keeps the leading zeros and allows our code to run correctly.

Next, we’re going to follow a method similar to our first option. We set up a for loop to iterate through the df_gids index. Every 100 games, we print the index to track our progress. We also add a sleep of 3–45 seconds to slow down our call rate. Then we set our gid variable to the current row’s game id and pull each table as we did before.

The slower method that will eventually work
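A sketch of that loop, abbreviated to a few tables (the nba_py boxscore endpoints are the same assumptions as in option 1):

for i in df_gids[df_gids['downloaded'] == 0].index:
    if i % 100 == 0:
        print(i)  # progress check
    sleep(random.uniform(3, 45))
    gid = df_gids.loc[i, 'game_id']
    box = game.Boxscore(gid)
    df_player_gl = box.player_stats()
    df_team_gl = box.team_stats()
    adv = game.BoxscoreAdvanced(gid)
    df_player_gl_adv = adv.player_stats()
    df_team_gl_adv = adv.team_stats()
    # ...the summary, four factors, misc, scoring, and usage tables
    # follow the same pattern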

The following code remains in that same for loop:

Saving our dataframes
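Still inside the loop, the saving step looks something like this. The folder names are my assumptions based on the directory structure described in the next paragraph, and it requires `import os` at the top of the workbook:

    # still inside the loop; assumes `import os` at the top of the workbook
    os.chdir('base_data/player_game_logs')
    df_player_gl.to_csv('gl_player_{}.csv'.format(gid), index=False)
    os.chdir('../../')
    os.chdir('base_data/team_game_logs')
    df_team_gl.to_csv('gl_team_{}.csv'.format(gid), index=False)
    os.chdir('../../')
    # ...repeat for the remaining tables, then mark this game id as done
    df_gids.loc[i, 'downloaded'] = 1
    df_gids.to_csv('game_ids.csv', index=False)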

I know there are more pythonic ways to write this, but I like being explicit when I’m saving data and moving between directories. Prior to running this code, I made a base_data folder in the main medium_tutorials folder and then made folders for each type of game log we’re saving. After all the game logs for a particular game id are pulled, we are going to move into the appropriate folder and save the dataframe as a csv. You can see that we named each dataframe in the following format:

gl_{game log type}_{}.csv

The open brackets in our code tell Python that we’re going to pass a variable into the string, which we then supply with the .format() method. We add the gid variable so that each csv is appropriately labeled. Finally, once everything is saved, we set the downloaded column for that game id to 1 to track our progress.

Once it starts running, you can check out your data folders and see the progress!

Data downloading!

This process may take a while and you may have to restart multiple times, but you’ll eventually get through it. After the initial pull, the incremental updates will be quick and painless!

Wrapping up

Thanks for sticking with me through that long post! This is really the most important post so far, as without the data, we can’t perform any analytics! Now that we have it, though, we have to store it somewhere we can process it. We’ll do that next when we learn how to create a PostgreSQL database to store our game logs.

Thanks again for checking out the post! Please leave any questions or suggestions you have below. If you learned anything from this post, please smash that clap button and share with a friend!

