Move Over Scraping: Pulling NBA Data with NBA_py

Published in Hardwood Convergence · 6 min read · Jul 26, 2019

*WARNING: DO NOT FOLLOW THIS CODE*

The nba_py package used in this tutorial no longer works. Follow this post for the concepts, but use the nba_api package instead, which is working (updated 4/7/21). A brief tutorial on that package is here.

The Plan

We’ve been using the same set of James Harden game logs from the time we learned to scrape Basketball-Reference with BeautifulSoup until the last post, when we started plotting statistics with Matplotlib. It’s been great, but our skills are improving and it’s time to start working with more data. We’ll start that process today. Here’s our plan:

Install and set up nba_py
Learn how to pull team and player information
Walk through a quick example of pulling stats

Installing and Setting Up NBA_py

NBA_py is a Python package that connects to the stats.nba.com API. This gives us access to tons of NBA data by making simple Python calls rather than scraping websites. Installing nba_py is easy enough: open your terminal, activate the virtual environment, and enter:

pip install nba_py

Now that it’s installed, let’s start using it. I’ve created a new intro_nba_py folder that contains our new notebook for learning this package, which is available here on our GitHub. We’ll fire up our Jupyter notebook, import our packages, and make our first call:

Importing nba_py and making our first call

First, I don’t know if there is a convention for naming nba_py on import. When I first started using it, I didn’t know which names the package would bring in, so I felt it was safest to import it as nba. That’s my process, though it seems unnecessary. I also imported each of the modules in the package so we’ll have access to them when the time comes. As always, you do you. Next, run that test line of code: if you get back a Scoreboard object, you are good to go. If the code runs forever and doesn’t return anything (which is probably the case), then we have to make a quick fix.

I’m not an expert on the actual issue, but it seems like a certain association doesn’t want people to be able to pull data from their API via scripts, so they require certain headers in the API requests. While I don’t understand how or why this all works, I do know the fix. Here’s what you need to do to get in the game:

Open the folder where you installed your virtual environment (medium_tutorials if you downloaded from our GitHub) and hit Cmd+Shift+. to show hidden folders. Follow this path: .venv > lib > python3.7 > site-packages > nba_py and open the __init__.py file in a text editor (I recommend Sublime). It looks something like this:

We’re going to delete the HEADERS code on lines 24–27 and replace it with the following code:

HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Dnt': '1',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en',
    'origin': 'http://stats.nba.com',
}

Save the file, refresh your imports, and voila, the code works! Now we can learn how to use this library.
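If you’re curious what those headers actually do, here is a minimal sketch using only Python’s standard library. It builds a request object with the same header values and prints them, without ever sending anything over the network. The endpoint URL and the build_request helper are my own assumptions for illustration, not something taken from nba_py itself.

```python
from urllib.request import Request

# Same header values as the replacement HEADERS dict in this post.
HEADERS = {
    "user-agent": ("Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"),
    "Dnt": "1",
    "Accept-Encoding": "gzip, deflate, sdch",
    "Accept-Language": "en",
    "origin": "http://stats.nba.com",
}

# Assumed endpoint URL, used only to build the request object below.
EXAMPLE_URL = "http://stats.nba.com/stats/scoreboard"


def build_request(url, headers=HEADERS):
    """Hypothetical helper: attach browser-like headers without sending."""
    return Request(url, headers=dict(headers))


req = build_request(EXAMPLE_URL)

# urllib stores each header name with only its first letter capitalized.
for name, value in req.header_items():
    print(f"{name}: {value}")
```

The idea is simply that the request now looks like it came from a browser rather than a script, which seems to be what the API checks for.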

Accessing Team and Player Information

We’re going to spend the remainder of this post (and probably the next few posts) getting acquainted with nba_py. Definitely follow along with these scripts so you’ll start getting used to finding the data you want. Also, here is nba_py’s documentation, which is tremendously helpful.

Let’s start by pulling a list of teams. If you’ve read through the docs, you’ve seen that within the team module there is a TeamList() function. Let’s call it:

We see that it outputs a TeamList object, which is great, but not helpful since we want the actual teams. If you check out the documentation under team.TeamList(), you’ll see there is a method called info() that is listed. Let’s add that and see what happens:

You’ll see that info() returns a pandas DataFrame of the call, which here contains the current 30 NBA teams as well as references to some teams that no longer exist… R.I.P. Providence Steamrollers. Note that each row pairs a TEAM_ID with its ABBREVIATION, so this will become our lookup table when we need to connect team IDs to their abbreviations.
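To make the lookup-table idea concrete, here is a rough sketch in plain Python so it runs without the API. The three rows are typed in by hand as stand-ins for what the team list returns, and the names abbr_by_id, id_by_abbr, and team_abbr are hypothetical helpers of mine, not part of nba_py.

```python
# Hand-typed sample rows standing in for the team list DataFrame;
# only the two columns discussed in this post are kept.
team_rows = [
    {"TEAM_ID": 1610612745, "ABBREVIATION": "HOU"},
    {"TEAM_ID": 1610612747, "ABBREVIATION": "LAL"},
    {"TEAM_ID": 1610612738, "ABBREVIATION": "BOS"},
]

# Build the lookup in both directions.
abbr_by_id = {row["TEAM_ID"]: row["ABBREVIATION"] for row in team_rows}
id_by_abbr = {abbr: team_id for team_id, abbr in abbr_by_id.items()}


def team_abbr(team_id):
    """Return the abbreviation for a team id, or None if it is unknown."""
    return abbr_by_id.get(team_id)


print(team_abbr(1610612745))
print(id_by_abbr["BOS"])
```

With a real DataFrame you would probably build the same dictionary from the TEAM_ID and ABBREVIATION columns, but the mapping idea is identical.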

On the player side, we have a similar function. Here’s our initial call:

Pulling a list of players for the 2018–19 season

You can see that we had to pass a few arguments in our PlayerList call. The league_id of ‘00’ indicates that we’re only looking for NBA players. I’m not sure what other leagues it can pull or what the codes for those are, but I’m sure we’ll figure out whether we can get Summer League and G League stats somewhere on our journey. You can also see that we can specify the season for which we want to pull players and include a flag to return only current players.

Pulling Game Logs

We’re obviously barely scratching the surface of what we can do with this package, but I wanted to briefly pass along a process for pulling stats so you can start using this on your own. Let’s quickly recreate the James Harden game logs dataframe that we scraped earlier (this will be the last time we focus on Harden in a long while, it’s just for comparison!). Here’s all the code:

Pulling Harden’s 2018–19 game logs in nba_py

Running through this code, you can see that we first pulled the player dataframe and saved it as nba_players. We then searched nba_players for the row with the name “James Harden” and saved his PERSON_ID as harden_id. Finally, we passed harden_id into PlayerGameLogs, indicated that we wanted the 2018–19 season, and created a dataframe. You’ll notice that the returned dataframe is only 78 rows long, not 82, which means it only includes game logs for games in which Harden was active. We can now calculate Harden’s average points without subsetting our data:

# Assumes `from nba_py import player` was run in the import cell earlier.
nba_players = player.PlayerList(league_id='00', season='2018-19', only_current=1).info()
harden_id = nba_players[nba_players.DISPLAY_FIRST_LAST == 'James Harden']['PERSON_ID'].iloc[0]  # scalar id, not a Series
harden_gl = player.PlayerGameLogs(harden_id, season='2018-19').info()
harden_gl.PTS.mean().round(1)

It took us just four lines of code to get Harden’s average points per game for 2018–19. Looking back at our scraping tutorial, we needed around 17 lines of thought-out code to accomplish the same task!
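To see why the missing rows matter, here is a small sketch in plain Python with made-up numbers (these are NOT Harden’s real stats). It assumes a scraped-style log keeps a row for every team game, with no points value for games the player sat out, while an API-style log only has rows for games he played.

```python
from statistics import mean

# Toy scraped-style log: one row per team game, None when the player sat out.
scraped_log = [
    {"game": 1, "PTS": 30},
    {"game": 2, "PTS": None},  # player inactive
    {"game": 3, "PTS": 40},
    {"game": 4, "PTS": 36},
]

# Toy API-style log: only the games the player appeared in.
api_log = [
    {"game": 1, "PTS": 30},
    {"game": 3, "PTS": 40},
    {"game": 4, "PTS": 36},
]

# The scraped log has to be filtered before averaging...
scraped_ppg = round(
    mean([row["PTS"] for row in scraped_log if row["PTS"] is not None]), 1
)

# ...while the API-style log can be averaged directly.
api_ppg = round(mean([row["PTS"] for row in api_log]), 1)

print(scraped_ppg, api_ppg)
```

Both approaches land on the same average; the API-style log just skips the filtering step.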

Wrapping Up

We just learned a much easier method of downloading NBA data with nba_py. Next, we’ll start working our way through the different modules to showcase what data is available. After that, we’ll write some scripts to capture larger chunks of data that we can feed into a database and analyze. As always, the code is available on our GitHub, so check it out, change it around, and get a good feel for how this all works!

Hope you all enjoyed this post and found it useful. If you learned anything new, please share this with a friend and smash that clap button! 👏👏👏