Predicting the Next NBA All-Star

Prince Okpoziakpo
INST414: Data Science Techniques
Feb 15, 2023

Is an All-Star nomination a good metric for measuring players’ success in the NBA?

The National Basketball Association (NBA) hosts an event called NBA All-Star Weekend, where, out of ~600 players, 30 of presumably the best basketball players in the world participate in the All-Star Game. With the exclusivity of this event in mind (only about 5% of players participate), it’s easy to assume that an NBA All-Star nomination is a “good indicator” of how well a player is performing in a season. But is it? This article explores that assumption and the factors that affect a player’s likelihood of being nominated to play in the All-Star Game. The insights from this article have been prepared with two audiences in mind: the Fans, meaning NBA fans who want to know if their favorite player(s) will be an All-Star, and those interested in placing bets on said players; and the Organizations, meaning NBA personnel keen on investing in valuable players.

All the code and data used for this analysis can be found here: predicting-the-next-nba-all-star-repo.

Data Description, Sources, and Collection Methods

Player statistics from the 2021/2022 regular season were used. This data was gathered from basketball-reference.com, an enormously popular resource for a plethora of basketball statistics, well known for its detail and accuracy. The website contains data on the NBA from as early as the 1949–1950 season, as well as data on season leaders, MVP winners, and All-Star nominees. basketball-reference.com makes its data accessible via Excel and CSV files that can be downloaded directly from several of its pages.

For each player, the website documents over a dozen statistical categories, primarily numerical variables, such as: Points Per Game (PTS), Field Goals Made (FGM), Field Goals Attempted (FGA), Field Goal Percentage (FG%), 3-Pointers Made (3PM), 3-Pointers Attempted (3PA), Free-Throws Made (FTM), and Free-Throws Attempted (FTA). Some categorical variables were included as well, such as: Team and All-Star. Each of these categories, and others, was used during the data analysis process to identify factors that impact a player’s chances of becoming an All-Star.

Exploratory Data Analysis

The data was loaded and analyzed using the pandas library.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load the NBA player stats for the 2021/2022 NBA Season
nba21_22 = pd.read_csv("data/2020-21-NBA-Player-Stats.csv")

# Extract a subset of statistical categories for analysis
stats = ["Player", "Age", "Tm", "Pos", "MP", "PTS", "FG", "FGA", "3P", "3PA", "AST", "TRB", "BLK", "STL", "FT", "FTA"]
nba21_22 = nba21_22.loc[:, stats]

Descriptive Analysis

The first step in gathering insights into the dataset is measuring the spread (or ‘variability’) of each category using the minimum value, the maximum value, and the mean, along with the number of records.

summary = nba21_22.describe() # summarize the data
summary = summary.apply(lambda a: round(a, 2)) # round numeric values
summary = summary.loc[["count", "min", "max", "mean"]] # extract subset of summary

# Render a table showing the summary
fig, ax = plt.subplots(figsize=(15, 15))
ax.table(
    cellText=summary.values.tolist(),
    rowLabels=summary.index.tolist(),
    colLabels=summary.columns.tolist(),
    cellLoc='right',
    loc=0,
    fontsize=55.0
)
ax.axis('off')

The table shows that the dataset contains 812 player records for the 2021/22 season (players who changed teams mid-season appear more than once). Furthermore, there is a huge gap between the minimum and maximum values, and the mean varies widely from category to category. For example, the minimum Points Per Game for an NBA player is 0.0, whereas the maximum is 32.0. This indicates high variability in the data.
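
The variability described above can be made more explicit with the standard deviation and the range of each category. The following is a minimal sketch, assuming a small made-up DataFrame in place of the real table; the values are illustrative and are not actual player statistics.

```python
import pandas as pd

# Toy stand-in for the per-game table; values are made up for illustration.
toy = pd.DataFrame({
    "PTS": [0.0, 4.5, 9.8, 15.2, 30.0],
    "MP": [2.0, 11.0, 20.0, 29.0, 38.0],
})

# Per-column spread: standard deviation and range (max - min)
spread = pd.DataFrame({
    "std": toy.std(),
    "range": toy.max() - toy.min(),
}).round(2)
print(spread)
```

Applied to the real data, the same two lines would be run on the numeric columns of nba21_22 instead of the toy frame.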

Data Cleaning

The dataset from basketball-reference had formatting issues, so the data had to be cleaned prior to analysis.

According to the dataset, there are 13 different positions. This is because in the “modern era” of the NBA, teams are opting for a model of play called “position-less” basketball. In this model, players play multiple positions throughout the season. For simplicity’s sake, this 13-item list was narrowed down to match the official positions of the NBA, and the positions that were available at the All-Star Game: Point-Guard (PG), Shooting-Guard (SG), Small-Forward (SF), Power-Forward (PF), and Center (C).

# Iterate over the player records, and set their position to be
# their "primary" position
for index in nba21_22.index:
    pos = nba21_22.at[index, 'Pos']
    match pos:
        case 'SG-PG':
            nba21_22.at[index, 'Pos'] = 'SG'
        case 'PG-SG':
            nba21_22.at[index, 'Pos'] = 'PG'
        case 'SG-SF':
            nba21_22.at[index, 'Pos'] = 'SG'
        case 'SF-SG':
            nba21_22.at[index, 'Pos'] = 'SF'
        case 'PF-SF':
            nba21_22.at[index, 'Pos'] = 'PF'
        case 'C-PF':
            nba21_22.at[index, 'Pos'] = 'C'
        case 'PF-C':
            nba21_22.at[index, 'Pos'] = 'PF'
        case 'SG-PG-SF':
            nba21_22.at[index, 'Pos'] = 'SG'
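
Every case above maps a hyphenated label to the first position it lists, so the same cleaning step can also be expressed without a loop. The following is a minimal sketch on a made-up column, assuming all multi-position labels follow the hyphenated pattern shown above; it also runs on Python versions older than 3.10, where the match statement is unavailable.

```python
import pandas as pd

# Toy stand-in for the 'Pos' column; labels follow the hyphenated style above.
toy = pd.DataFrame({"Pos": ["PG", "SG-PG", "C-PF", "SG-PG-SF", "PF"]})

# Keep only the first (primary) position listed before any hyphen
toy["Pos"] = toy["Pos"].str.split("-").str[0]
print(toy["Pos"].tolist())
```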

Another issue was that some players appeared more than once in the dataset. This isn’t unexpected, as the NBA Trade Deadline is just before the All-Star Game, and it isn’t uncommon for multiple players to switch teams mid-season. For simplicity, the duplicate records were kept in the dataset.
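
The analysis here keeps the repeated records, but it can still be useful to see how many rows they account for, and what one alternative would look like. The following is a sketch on made-up data. It assumes that a traded player has one combined row labeled "TOT" in the Tm column alongside the per-team rows; that label, the player names, and the numbers are all assumptions for illustration.

```python
import pandas as pd

# Toy stand-in: one traded player with a combined row plus per-team rows.
# Names and numbers are made up; "TOT" as the combined-row label is an assumption.
toy = pd.DataFrame({
    "Player": ["Player A", "Player A", "Player A", "Player B"],
    "Tm": ["TOT", "AAA", "BBB", "CCC"],
    "PTS": [12.0, 10.0, 14.0, 8.0],
})

# Flag every row that belongs to a player who appears more than once
is_repeat = toy.duplicated(subset="Player", keep=False)
print(int(is_repeat.sum()))

# One possible rule: keep the combined row for traded players,
# and the only row for everyone else
one_row_each = toy[~is_repeat | (toy["Tm"] == "TOT")].reset_index(drop=True)
print(one_row_each)
```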

Lastly, the dataset didn’t have All-Stars labeled, so players had to be labeled manually. The 2021–2022 NBA All-Star list was collected from the sports news outlet The Athletic.

all_star_21_22_list = """
• Stephen Curry,
• DeMar DeRozan, Bulls
• LeBron James, Lakers
• Giannis Antetokounmpo, Bucks
• Nikola Jokić, Nuggets
• Chris Paul, Suns
• Fred VanVleet, Raptors
• Luka Dončić, Mavericks
• Donovan Mitchell, Jazz
• Darius Garland, Cavaliers
• Jimmy Butler, Heat
• Jarrett Allen, Cavaliers
• Trae Young, Hawks
• Ja Morant, Grizzlies
• Jayson Tatum, Celtics
• Andrew Wiggins, Warriors
• Joel Embiid, 76ers
• LaMelo Ball, Hornets
• Dejounte Murray, Spurs
• Zach LaVine, Bulls
• Devin Booker, Suns
• Khris Middleton, Bucks
• Rudy Gobert, Jazz
• Karl-Anthony Towns, Timberwolves
"""

# Split the string into a list. Items are "{Player}, {Team}"
all_star_21_22_list = all_star_21_22_list.split('\n')[1:-1]

# Iterate over the Player/Team list
for i in range(len(all_star_21_22_list)):
    # Extract the player's name (drop the leading "• " and the team)
    player = all_star_21_22_list[i].split(',')[0][2:]
    all_star_21_22_list[i] = player  # Update the list

# Create the All-Star column, sized to the table rather than hard-coded
nba21_22["All_Star"] = np.zeros(shape=(len(nba21_22), ), dtype=int)
# Iterate over the player stats list
for i in nba21_22.index:
    name = nba21_22.at[i, "Player"]  # get the player's name
    if name in all_star_21_22_list:  # check if the player's an All-Star
        nba21_22.at[i, "All_Star"] = 1  # label the player as an All-Star
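
The labeling loop above can also be written as a single vectorized membership test, which makes it easy to check whether any name on the list failed to match a row (for example, because of a spelling or accent difference). The following is a minimal sketch with placeholder names; the variable names toy, all_star_names, and unmatched are hypothetical.

```python
import pandas as pd

# Toy stand-ins; the names are placeholders, not real players.
toy = pd.DataFrame({"Player": ["Player A", "Player B", "Player C"]})
all_star_names = ["Player B", "Player Z"]

# 1 if the player's name is in the All-Star list, otherwise 0
toy["All_Star"] = toy["Player"].isin(all_star_names).astype(int)

# Names from the list with no matching row in the table
unmatched = sorted(set(all_star_names) - set(toy["Player"]))
print(toy)
print(unmatched)
```

A non-empty unmatched list would suggest that some All-Stars were silently left unlabeled.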

Figures and Findings

The first insight was gathered by comparing the statistical performance of the average All-Star to that of the rest of the league.

primary_stats = ['PTS', 'AST', 'TRB', 'BLK', 'STL', 'MP']
labels = [
    'Points Per Game',
    'Assists Per Game',
    'Total Rebounds Per Game',
    'Blocks Per Game',
    'Steals Per Game',
    'Minutes Played Per Game'
]

# Assumed setup: select the All-Star records and create a 2x3 grid of plots
all_star_21_22 = nba21_22[nba21_22["All_Star"] == 1]
fig, axes = plt.subplots(2, 3)

i = 0
for row in range(2):
    for col in range(3):
        s = primary_stats[i]
        # Plot the league-wide distribution
        axes[row][col].hist(nba21_22.loc[:, s])
        axes[row][col].set_xlabel(labels[i])

        # Mark the All-Star mean with a red vertical line
        _mean = all_star_21_22.loc[:, s].mean()
        y_min, y_max = axes[row][col].get_ylim()
        axes[row][col].vlines([_mean], y_min, y_max, colors='r')
        i += 1
Comparison of Average All-Star to non-All-Stars

The figure above shows a comparison between the average All-Star and the rest of the league in each of the 6 primary statistical categories. As the data shows, in every category the average All-Star falls somewhere between the 60th and 90th percentile of the league. This shows that All-Star players are clearly high-performing.
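
The percentile position of the average All-Star can be computed directly instead of being read off a histogram. The following is a minimal sketch on made-up numbers, assuming a labeled table with an All_Star column like the one built earlier; the percentile is taken here as the share of all rows strictly below the All-Star mean, which is one of several common definitions.

```python
import pandas as pd

# Toy stand-in for the labeled table; values are made up.
toy = pd.DataFrame({
    "PTS": [2.0, 5.0, 8.0, 11.0, 14.0, 17.0, 20.0, 23.0, 26.0, 29.0],
    "All_Star": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})

# Mean of the All-Star rows for the chosen category
all_star_mean = toy.loc[toy["All_Star"] == 1, "PTS"].mean()

# Percentile rank: share of all rows strictly below that mean
percentile = (toy["PTS"] < all_star_mean).mean() * 100
print(all_star_mean, percentile)
```

Repeating the calculation for each of the six categories would give the per-category percentiles that the figure summarizes.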

The second insight is that the category that stands out the most is Minutes Played Per Game. On average, All-Stars play more minutes than 90% of the league. The impact of this statistic is clear, as players need more time on the court to generate the numbers that would let them be considered All-Stars in a season. But does more time on the court guarantee that a player will generate All-Star-caliber performances?

Visualizing the Correlation Between Minutes Played and Player Performance

The figure above shows that, for each category, there is a strong positive correlation between the number of minutes played and player performance.
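
The strength of that relationship can be summarized with a correlation coefficient per category. The following is a minimal sketch on made-up numbers, using the Pearson correlation that pandas computes by default; the values are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for minutes played and two performance categories; values are made up.
toy = pd.DataFrame({
    "MP": [5.0, 12.0, 18.0, 25.0, 31.0, 36.0],
    "PTS": [1.5, 4.0, 7.5, 11.0, 18.0, 25.0],
    "AST": [0.3, 1.0, 1.8, 2.5, 4.0, 6.5],
})

# Pearson correlation of each category with minutes played
corr_with_minutes = toy.corr()["MP"].drop("MP").round(2)
print(corr_with_minutes)
```

Values close to 1 would correspond to the strong positive correlation described above; a high correlation alone does not show that more minutes cause better numbers.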

Limitations

Throughout the analysis, there were three limitations that may have impacted the accuracy of the resulting conclusion: (1) Sample size, (2) Available data types, and (3) Methods of Categorization.

Firstly, the entirety of this analysis was based on the 2021/2022 NBA season. There have been more than 70 seasons, and this single-season sample is not enough to determine the variables that factor into an All-Star nomination.

Secondly, statistical performance is important for an All-Star nomination, but the popularity of a player also plays a significant role in their chances of nomination. Player sentiment is crucial for determining which players are actually fan favorites and more likely to get voted into an All-Star Game. This isn’t encapsulated in their game-to-game statistics alone, but is captured in the textual data generated on social media, blogs, newsrooms, and other sources of natural language data.

Finally, the metrics used, such as the mean, deviation, minimum, and maximum, are very basic statistics. In other words, while these statistics are important for preliminary analysis, they fail to capture the details of a player’s performance during the season that affect their chances of being nominated as an All-Star. In the future, more advanced tools and methods (e.g., clustering using Machine Learning) could be used to get more accurate results about player performance.

Conclusion

In conclusion, player performance is affected by the amount of time players spend on the court. In order to generate the same numbers as the average All-Star, players have to play over 35 minutes per game, which is more than what 90% of the league plays. Furthermore, as players are more likely to be nominated as an All-Star if they average better than 60–80% of the league in at least two categories, players should focus on playing a single position, e.g., a “pure” Point-Guard or “pure” Shooting-Guard. This way, players can generate more value in the categories that come naturally from playing those positions, such as a Point-Guard averaging more points and assists while playing as a “pure” Point-Guard.
