A quick look into visualizing NBA shot data

Nam Nguyen
Dec 19, 2022


One obvious but fascinating trend that has emerged over most of the past decade is the rise of the 3-point shot. Steph Curry and the Golden State Warriors have notably demonstrated how potent offenses can be with skilled 3-point shooters, and how the spacing and gravity of these players also open up better opportunities at the rim.

I was curious to see how shot selection varies across the league now that analytics have more influence on the systems teams run in today’s game. Thanks to the NBA API, we have access to heaps of data collected continuously over the course of an NBA season. Using Python and the nba_api package, I’m going to extract shot data from the 2021–2022 NBA regular season and share some insights. Note: this analysis isn’t meant to reveal novel or game-breaking insights that aren’t already obvious to most basketball fans; it serves more as practice and as an introduction to the life cycle of data: extraction, manipulation/transformation, storage, and visualization.

Getting the data

To use nba_api you will need Python 3.7+ along with numpy and requests, plus pandas to work with dataframes. The package itself installs with pip install nba_api. I’m using a Jupyter notebook to gather and visualize the data.

from nba_api.stats.endpoints import shotchartdetail, leaguedashplayerstats, teamdashboardbyshootingsplits
from nba_api.stats.static import players, teams
import json
import requests
from requests import get
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import string
import time

To get the shot data, I’m going to use the shotchartdetail endpoint. This endpoint requires a player id and their respective team id. As an example, I can find shot data for Denver Nuggets player Nikola Jokic with the code below.

player_dict = players.get_players() # list of dicts, one per player
team_dict = teams.get_teams() # list of dicts, one per team

def get_player_id(fullName): # get player id with player name
    player_id = [player for player in player_dict if player['full_name'] == fullName][0]['id']
    return player_id

def get_team_id(teamName): # to get team id using full team name
    team_id = [team for team in team_dict if team['full_name'] == teamName][0]['id']
    return team_id

def get_team_id2(abbreviation): # to get team id using team abbreviation
    team_id = [team for team in team_dict if team['abbreviation'] == abbreviation][0]['id']
    return team_id

player_json = shotchartdetail.ShotChartDetail(
    team_id = get_team_id2('DEN'),
    player_id = get_player_id('Nikola Jokic'),
    context_measure_simple = 'FGA',
    season_nullable = '2021-22',
    season_type_all_star = 'Regular Season'
)

get_shotjson = json.loads(player_json.get_json())
realdata = get_shotjson['resultSets'][0]
columns = realdata['headers']
player_rows = realdata['rowSet']
nikola_jokicdf = pd.DataFrame(player_rows)
nikola_jokicdf.columns = columns
nikola_jokicdf
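
The lookup helpers above index [0] on a filtered list, so a misspelled or unknown name raises a bare IndexError. Here is a minimal sketch of a lookup that fails with a clearer message instead. It runs against a small hand-written list shaped like the output of players.get_players(); the two entries, their ids, and the helper name find_id are illustrative assumptions, not part of nba_api.

```python
def find_id(records, key, value):
    """Return the 'id' of the single record whose `key` equals `value` (case-insensitive)."""
    wanted = value.casefold()
    matches = [r for r in records if str(r[key]).casefold() == wanted]
    if not matches:
        raise LookupError(f"no record with {key}={value!r}")
    if len(matches) > 1:
        raise LookupError(f"{len(matches)} records share {key}={value!r}")
    return matches[0]["id"]


# Hand-written stand-in for players.get_players(); the ids are placeholders.
sample_players = [
    {"id": 1, "full_name": "Nikola Jokic"},
    {"id": 2, "full_name": "Scottie Barnes"},
]

jokic_id = find_id(sample_players, "full_name", "nikola jokic")
```

With the real lists this would be called as find_id(player_dict, 'full_name', 'Nikola Jokic') or find_id(team_dict, 'abbreviation', 'DEN').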

I’m going to need to find the list of players who have played in the 2021–2022 season. To do this, I use the leaguedashplayerstats endpoint.

ps_json = leaguedashplayerstats.LeagueDashPlayerStats(
    per_mode_detailed = 'PerGame',
    season_type_all_star = 'Regular Season',
    measure_type_detailed_defense = 'Base',
    season = '2021-22'
)

ps_load = json.loads(ps_json.get_json())
ps_data = ps_load['resultSets'][0]
ps_rows = ps_data['rowSet']
ps_headers = ps_data['headers']

psdf = pd.DataFrame(ps_rows)
psdf.columns = ps_headers
psdf
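
The headers/rowSet unpacking is repeated for every endpoint in this post, so it could be factored into one helper. This is a minimal sketch, assuming the payload has the shape used above: a 'resultSets' list whose entries carry 'headers' and 'rowSet'. The function name result_set_to_df and the sample payload are made up for illustration.

```python
import json

import pandas as pd


def result_set_to_df(raw_json, index=0):
    """Build a DataFrame from one entry of the 'resultSets' list in a stats response."""
    result_set = json.loads(raw_json)["resultSets"][index]
    # Passing columns= here keeps the headers even when rowSet is empty.
    return pd.DataFrame(result_set["rowSet"], columns=result_set["headers"])


# Made-up payload with the same shape as the responses parsed above.
sample_json = json.dumps({
    "resultSets": [
        {
            "headers": ["PLAYER_ID", "PLAYER_NAME", "FGA"],
            "rowSet": [[1, "Player A", 12.5], [2, "Player B", 0.0]],
        }
    ]
})

sample_df = result_set_to_df(sample_json)
attempted = sample_df[sample_df["FGA"] > 0]  # same kind of filter the post applies to psdf
```

With a live endpoint this would be result_set_to_df(ps_json.get_json()). nba_api endpoint objects also provide get_data_frames(), which returns one DataFrame per result set.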

I now have a dataframe with all players from the ‘21–22 season. Some players may have records but have not played a single minute or attempted a field goal, so I filter by players that have greater than 0 field goal attempts.

psdf = psdf[psdf['FGA'] > 0] 

With this dataset of players, I can now iterate through each row and find shot data for each player. (Note: this data is a snapshot of players and the teams they were on at the end of the season, so players traded during the season may be missing shots from their time with earlier teams.)

def get_shotlocations(id_,teamid):
    player_json = shotchartdetail.ShotChartDetail(
        team_id = teamid,
        player_id = id_,
        context_measure_simple = 'FGA',
        season_nullable = '2021-22',
        season_type_all_star = 'Regular Season'
    )
    loaddata = json.loads(player_json.get_json())
    realdata = loaddata['resultSets'][0]
    player_rows = realdata['rowSet']
    return player_rows

I created a function, get_shotlocations, that uses the shotchartdetail endpoint to fetch shot data. It accepts a player id and a team id and returns a list of rows, one per shot taken by that player. Iterating through my dataframe of all players, I collect their shot data and put it into a single dataframe, as shown in the code below.

shotrows = []
for id_,abbrev,name in psdf[['PLAYER_ID','TEAM_ABBREVIATION','PLAYER_NAME']].itertuples(index=False):
    print(name) # to see progress of query, not needed
    # get_shotlocations expects a team id, so convert the abbreviation first
    playerrows = get_shotlocations(id_, get_team_id2(abbrev))
    shotrows.extend(playerrows)
    time.sleep(1) # implemented wait time as I was getting json-related errors during run, unsure if actual solution

totaldf = pd.DataFrame(shotrows)
totaldf.columns = columns
totaldf.to_csv('allshots.csv', index=False)
totaldf
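
Since the fixed one-second pause in the loop is described as a guess, one alternative is to retry a failed request a few times with a growing delay. The sketch below is generic and runs offline: call_with_retries and flaky_fetch are hypothetical names, and the simulated failure stands in for whatever error the real request raised.

```python
import time


def call_with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting base_delay * 2**n between failed tries."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:  # broad on purpose: the exact error type is not known here
            if attempt == attempts - 1:
                raise  # out of retries, surface the last error
            sleep(base_delay * 2 ** attempt)


# Stand-in for a flaky request: fails twice, then returns a list of rows.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ValueError("simulated bad response")
    return [["shot row"]]


waits = []  # record the delays instead of actually sleeping
fetched = call_with_retries(flaky_fetch, attempts=3, base_delay=1.0, sleep=waits.append)
```

In the loop above, the get_shotlocations call could be wrapped in a lambda and passed as fetch, leaving the default sleep in place.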

With the dataset collected, I save it as a CSV file so I can load it again without rerunning the extraction, which took several minutes to complete. I can now analyze it and build some visualizations with matplotlib.
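
Reading the file back is a one-liner, with one caveat: pandas infers numeric types, so identifier columns that look like numbers lose any leading zeros unless they are read as strings. Here is a small sketch using an in-memory CSV with made-up values. The column names come from the shot data, but whether GAME_ID in your saved file actually carries leading zeros is an assumption to check against the file itself.

```python
import io

import pandas as pd

# Made-up two-row CSV standing in for allshots.csv.
csv_text = (
    "GAME_ID,PLAYER_NAME,LOC_X,LOC_Y\n"
    "0000000123,Player A,-12,45\n"
    "0000000123,Player B,220,10\n"
)

inferred = pd.read_csv(io.StringIO(csv_text))
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"GAME_ID": str})

print(inferred["GAME_ID"].iloc[0])  # parsed as an integer, leading zeros dropped
print(as_text["GAME_ID"].iloc[0])   # kept exactly as written in the file
```

For the real file this would be pd.read_csv('allshots.csv', dtype={'GAME_ID': str}).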

Following Naveen’s tutorial (https://towardsdatascience.com/make-a-simple-nba-shot-chart-with-python-e5d70db45d0d), we can use our dataset to visualize a team’s shots or a player’s shots. As an example, I’ll create a shot chart heat map for 2021–2022 Rookie of the Year Scottie Barnes:

scottiedf = totaldf.loc[totaldf.PLAYER_NAME == 'Scottie Barnes']

def create_court(ax, color):
    # three-point line: straight corner segments plus the arc
    ax.plot([-220, -220], [0, 140], linewidth=2, color=color)
    ax.plot([220, 220], [0, 140], linewidth=2, color=color)
    ax.add_artist(mpl.patches.Arc((0, 140), 440, 315, theta1=0, theta2=180, facecolor='none', edgecolor=color, lw=2))
    # lane and key
    ax.plot([-80, -80], [0, 190], linewidth=2, color=color)
    ax.plot([80, 80], [0, 190], linewidth=2, color=color)
    ax.plot([-60, -60], [0, 190], linewidth=2, color=color)
    ax.plot([60, 60], [0, 190], linewidth=2, color=color)
    ax.plot([-80, 80], [190, 190], linewidth=2, color=color)
    ax.add_artist(mpl.patches.Circle((0, 190), 60, facecolor='none', edgecolor=color, lw=2))
    # rim and backboard
    ax.add_artist(mpl.patches.Circle((0, 60), 15, facecolor='none', edgecolor=color, lw=2))
    ax.plot([-30, 30], [40, 40], linewidth=2, color=color)
    # half-court limits, no ticks
    ax.set_xlim(-250, 250)
    ax.set_ylim(0, 470)
    ax.set_xticks([])
    ax.set_yticks([])
    return ax

colors = {'2PT Field Goal':'tab:red', '3PT Field Goal':'tab:green'} # unused by the heat map below
mpl.rcParams['font.family'] = 'Avenir' # matplotlib falls back to its default font if Avenir is missing
mpl.rcParams['font.size'] = 18
mpl.rcParams['axes.linewidth'] = 2

fig = plt.figure(figsize=(4, 3.76))
ax = fig.add_axes([0, 0, 1, 1])
ax = create_court(ax, 'black')
# put specific shot info below!
ax.hexbin(scottiedf['LOC_X'],scottiedf['LOC_Y']+60, gridsize=(30,30), extent=(-300, 300, 0, 940), cmap='Blues',bins ='log')

plt.show()

Scottie Barnes 2021–2022 FGA heatmap

Et voilà, a beautiful visualization of the extracted shot chart data. Shoutout to Naveen.

I wanted to see the proportion of 3-point field goal attempts relative to all field goal attempts for teams across the league, as well as the proportion of shots at the rim (categorized as ‘Restricted Area’ in the dataset). For this, I’m going to use a different endpoint (teamdashboardbyshootingsplits) to make things easier, although it could also be achieved by filtering the dataset I collected earlier.

# Creating list of teams (in abbreviated form)
team_list = [team['abbreviation'] for team in team_dict]
team_list = sorted(team_list)

# Calculating team 3PA + Rim attempts (Restricted area FGA)
rows = []
for team in team_list:
    ss_json = teamdashboardbyshootingsplits.TeamDashboardByShootingSplits(
        team_id = get_team_id2(team),
        measure_type_detailed_defense = 'Base',
        season = '2021-22',
        per_mode_detailed = 'PerGame'
    )

    ss_load = json.loads(ss_json.get_json())
    ss_data = ss_load['resultSets'][3] # result set with one row per shot area (GROUP_VALUE)
    ss_rows = ss_data['rowSet']
    ssheaders = ss_data['headers']

    ssdf = pd.DataFrame(ss_rows)
    ssdf.columns = ssheaders
    a = ['Restricted Area',
         'Left Corner 3',
         'Right Corner 3',
         'Above the Break 3'
    ]
    lay3rate = ssdf[ssdf.GROUP_VALUE.isin(a)].FGA.sum()/ssdf.FGA.sum() # threes + rim
    threes_rate = ssdf[ssdf.GROUP_VALUE.isin(a[1:4])].FGA.sum()/ssdf.FGA.sum() # threes only
    data = [team, lay3rate, threes_rate]
    rows.append(data)
    print(f'{team}: {lay3rate}, {threes_rate}')
    time.sleep(1)

teamshotdf = pd.DataFrame(rows)
teamshotdf.columns = ['Team','3PA + Restricted FGA Rate', '3PA Rate']
teamshotdf.to_csv('3PA_Restricted_rate.csv',index=False)
# sort_values returns a new dataframe, so assign it back before displaying
teamshotdf = teamshotdf.sort_values('3PA + Restricted FGA Rate', ascending=False)
teamshotdf

Even after the Daryl Morey era, the Houston Rockets continue to take a staggering number of 3-point shots and shots at the rim: 80% of their attempts come from those areas, the highest share in the league.
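
The same two rates can also be derived from the shot-level dataset collected earlier, by grouping on team and checking each shot’s SHOT_ZONE_BASIC. This is a minimal sketch on made-up rows: the team names are fictional, ‘Mid-Range’ stands in for any other zone label, and zone_rates is a hypothetical helper name.

```python
import pandas as pd

THREE_ZONES = ["Left Corner 3", "Right Corner 3", "Above the Break 3"]
RIM_ZONE = "Restricted Area"


def zone_rates(shots):
    """Per-team share of attempts from three-point zones, and from threes plus the rim."""
    is_three = shots["SHOT_ZONE_BASIC"].isin(THREE_ZONES)
    is_rim = shots["SHOT_ZONE_BASIC"] == RIM_ZONE
    flags = pd.DataFrame({
        "TEAM_NAME": shots["TEAM_NAME"],
        "3PA Rate": is_three.astype(float),
        "3PA + Restricted FGA Rate": (is_three | is_rim).astype(float),
    })
    # The mean of a 0/1 column is the fraction of rows flagged 1.
    return flags.groupby("TEAM_NAME").mean()


# Made-up shots for two fictional teams.
toy = pd.DataFrame({
    "TEAM_NAME": ["Team A"] * 4 + ["Team B"] * 4,
    "SHOT_ZONE_BASIC": [
        "Restricted Area", "Above the Break 3", "Left Corner 3", "Mid-Range",
        "Mid-Range", "Mid-Range", "Restricted Area", "Right Corner 3",
    ],
})

rates = zone_rates(toy)
```

Applied to the real data this would be zone_rates(totaldf). The result may differ slightly from the endpoint-based table, because totaldf only covers each player’s end-of-season team.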

Storing the data in a SQL Server database

This part of the project serves as practice for getting familiar with SQL databases and how data would be stored and used in a practical setting. I created a database to store the shot data I collected earlier and, potentially, shot data from past seasons or new shot data from the current season that could be entered into the database automatically. This query creates the main shots table containing the data I extracted with Python.

CREATE TABLE [dbo].[shots] (
[ShotID] bigint PRIMARY KEY,
[GAME_ID] bigint,
[GAME_EVENT_ID] int,
[PLAYER_ID] bigint,
[PLAYER_NAME] varchar(50),
[TEAM_ID] bigint,
[TEAM_NAME] varchar(50),
[PERIOD] tinyint,
[MINUTES_REMAINING] tinyint,
[SECONDS_REMAINING] tinyint,
[EVENT_TYPE] varchar(50),
[ACTION_TYPE] varchar(50),
[SHOT_TYPE] varchar(50),
[SHOT_ZONE_BASIC] varchar(50),
[SHOT_ZONE_AREA] varchar(50),
[SHOT_ZONE_RANGE] varchar(50),
[SHOT_DISTANCE] tinyint,
[LOC_X] smallint,
[LOC_Y] smallint,
[SHOT_ATTEMPTED_FLAG] bit,
[SHOT_MADE_FLAG] bit,
[GAME_DATE] bigint,
[HTM] varchar(3),
[VTM] varchar(3)
)

With my table now created, I can normalize the database by creating separate tables with the following relationships shown by the entity relationship diagram below:

I created separate tables for teams, players, and all other relevant shot data that would now be related via their respective primary and foreign keys. I used the following SQL queries to create these new tables and drop the columns from the original table:

-- CREATE shots_data table via SELECT INTO from shots table
SELECT ShotID,
GAME_ID,
GAME_EVENT_ID,
PERIOD,
MINUTES_REMAINING,
SECONDS_REMAINING,
EVENT_TYPE,
ACTION_TYPE,
SHOT_TYPE,
SHOT_ZONE_BASIC,
SHOT_ZONE_AREA,
SHOT_ZONE_RANGE,
SHOT_DISTANCE,
LOC_X,
LOC_Y,
SHOT_MADE_FLAG,
GAME_DATE,
HTM,
VTM
INTO shots_data
FROM shots;

-- CREATE players table
SELECT DISTINCT PLAYER_ID, PLAYER_NAME, TEAM_ID INTO players from shots;

-- CREATE teams table
SELECT DISTINCT TEAM_ID, TEAM_NAME INTO teams from shots;

-- DROP COLUMNS from shots table
-- (PLAYER_ID is kept: it is the foreign key used to join shots to players)

ALTER TABLE shots
DROP COLUMN GAME_ID,
GAME_EVENT_ID,
PLAYER_NAME,
TEAM_ID,
TEAM_NAME,
PERIOD,
MINUTES_REMAINING,
SECONDS_REMAINING,
EVENT_TYPE,
ACTION_TYPE,
SHOT_TYPE,
SHOT_ZONE_BASIC,
SHOT_ZONE_AREA,
SHOT_ZONE_RANGE,
SHOT_DISTANCE,
LOC_X,
LOC_Y,
SHOT_MADE_FLAG,
GAME_DATE,
HTM,
VTM;

And now, if I wanted to query all the shot data for a certain team, say, the Toronto Raptors, I could do something like this:

-- QUERY TO JOIN tables to get Toronto Raptors shot data
SELECT p.PLAYER_ID,
p.PLAYER_NAME,
t.TEAM_ID,
t.TEAM_NAME,
s.ShotID,
sd.GAME_ID,
sd.GAME_EVENT_ID,
sd.PERIOD,
sd.MINUTES_REMAINING,
sd.SECONDS_REMAINING,
sd.EVENT_TYPE,
sd.ACTION_TYPE,
sd.SHOT_TYPE,
sd.SHOT_ZONE_BASIC,
sd.SHOT_ZONE_AREA,
sd.SHOT_ZONE_RANGE,
sd.SHOT_DISTANCE,
sd.LOC_X,
sd.LOC_Y,
sd.SHOT_MADE_FLAG,
sd.GAME_DATE,
sd.HTM,
sd.VTM
FROM players p
JOIN teams t
ON p.TEAM_ID = t.TEAM_ID
JOIN shots s
ON p.PLAYER_ID = s.PLAYER_ID
JOIN shots_data sd
ON s.ShotID = sd.ShotID
WHERE t.TEAM_NAME = 'Toronto Raptors';
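
To check that the normalized tables still join back together, the layout can be tried on a toy table first. The sketch below uses Python’s built-in sqlite3 module as a stand-in for SQL Server, so SELECT ... INTO becomes CREATE TABLE ... AS SELECT. It uses a trimmed column list and made-up rows, and it assumes the shots table keeps ShotID and PLAYER_ID, which the join above needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shots_raw (
        ShotID INTEGER PRIMARY KEY, PLAYER_ID INTEGER, PLAYER_NAME TEXT,
        TEAM_ID INTEGER, TEAM_NAME TEXT, SHOT_ZONE_BASIC TEXT, SHOT_MADE_FLAG INTEGER
    );
    INSERT INTO shots_raw VALUES
        (1, 10, 'Player A', 100, 'Team A', 'Restricted Area', 1),
        (2, 10, 'Player A', 100, 'Team A', 'Mid-Range', 0),
        (3, 20, 'Player B', 200, 'Team B', 'Left Corner 3', 1);

    -- Split the wide table: shots keeps only the keys, details move out.
    CREATE TABLE teams AS SELECT DISTINCT TEAM_ID, TEAM_NAME FROM shots_raw;
    CREATE TABLE players AS SELECT DISTINCT PLAYER_ID, PLAYER_NAME, TEAM_ID FROM shots_raw;
    CREATE TABLE shots_data AS SELECT ShotID, SHOT_ZONE_BASIC, SHOT_MADE_FLAG FROM shots_raw;
    CREATE TABLE shots AS SELECT ShotID, PLAYER_ID FROM shots_raw;
""")

# Same join shape as the Toronto Raptors query, with the team name as a parameter.
team_a_shots = conn.execute("""
    SELECT p.PLAYER_NAME, t.TEAM_NAME, s.ShotID, sd.SHOT_ZONE_BASIC, sd.SHOT_MADE_FLAG
    FROM players p
    JOIN teams t ON p.TEAM_ID = t.TEAM_ID
    JOIN shots s ON p.PLAYER_ID = s.PLAYER_ID
    JOIN shots_data sd ON s.ShotID = sd.ShotID
    WHERE t.TEAM_NAME = ?
    ORDER BY s.ShotID
""", ("Team A",)).fetchall()
conn.close()
```

Passing the team name as a query parameter rather than formatting it into the SQL string keeps the query reusable for any team.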

More visualization tools — Power BI

Lastly, I can also use Power BI to visualize this data. Power BI is great because it doesn’t require any knowledge of Python or matplotlib to create visuals. One can build them from just CSV files or by connecting to a SQL Server database.

Here, I created a Power BI report that visualizes the collected shot data. You can get a better look at the report here: https://app.powerbi.com/view?r=eyJrIjoiNDNkYTA4NmUtZDg2Ny00NTU3LThiNTgtNGRjODk5ZTlhZTA3IiwidCI6IjgyYTYyZDNhLWFmN2EtNDQyNi05Mzk3LWU0MDQ5MTk3M2U4OCJ9&pageName=ReportSection3a7086fe24fcf8e36894

This report highlights one very interesting insight: the Houston Rockets have the highest proportion of 3-point and restricted-area (at the rim) attempts at 80%, and the Phoenix Suns have the lowest at 58%. You can see the contrast in midrange shot density between the two teams: Phoenix (pictured right) takes a lot of midrange shots and Houston (left) far fewer. Ironically or not, the Phoenix Suns had the best record in the league at 64–18 and the Houston Rockets had the worst at 20–62.

It just goes to show that while 3-point shots and shots at the rim yield more points per possession on average, that doesn’t translate directly into winning basketball games.
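
The points-per-attempt idea behind that claim can be computed directly from the shot rows: each make is worth 2 or 3 points depending on SHOT_TYPE, and the average over all attempts in a zone gives that zone’s expected value per shot. This is a rough sketch on made-up rows. The numbers are illustrative, not league data, and free throws and turnovers are ignored, so it measures points per shot rather than per possession. The helper name points_per_attempt is hypothetical.

```python
from collections import defaultdict


def points_per_attempt(shots):
    """Average points scored per attempt for each SHOT_ZONE_BASIC value."""
    points = defaultdict(int)
    attempts = defaultdict(int)
    for shot in shots:
        value = 3 if shot["SHOT_TYPE"] == "3PT Field Goal" else 2
        zone = shot["SHOT_ZONE_BASIC"]
        attempts[zone] += 1
        points[zone] += value * shot["SHOT_MADE_FLAG"]
    return {zone: points[zone] / attempts[zone] for zone in attempts}


# Made-up attempts: 2 of 4 mid-range shots made, 2 of 5 corner threes made.
toy_shots = (
    [{"SHOT_TYPE": "2PT Field Goal", "SHOT_ZONE_BASIC": "Mid-Range", "SHOT_MADE_FLAG": made}
     for made in (1, 1, 0, 0)]
    + [{"SHOT_TYPE": "3PT Field Goal", "SHOT_ZONE_BASIC": "Left Corner 3", "SHOT_MADE_FLAG": made}
       for made in (1, 1, 0, 0, 0)]
)

ppa = points_per_attempt(toy_shots)
```

In this toy case the 40% three-point zone is worth 1.2 points per attempt against 1.0 for the 50% mid-range zone. That is the arithmetic behind the trend, even though, as the two teams above show, it does not decide games on its own.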

Thanks for reading! If you have any questions, critiques, or recommendations, let me know. Any feedback is much appreciated as I’m still learning many things.

Project Jupyter notebook:

https://github.com/NammySosa/NBA-Shot-Data-2021-2022/blob/main/NBA%202021-2022%20Shots%20Data.ipynb
