Predicting NBA All-Stars Using Performance Metrics and Supervised Learning

Madison John
Mar 10, 2020 · 11 min read

This article summarizes the process and findings of a study using supervised learning algorithms to predict which players will be selected to play in the National Basketball Association's annual All-Star Game.

00 BACKGROUND

Here is a brief description of the topics I will be discussing in this article:

NBA

  • The National Basketball Association is a professional men’s basketball league that has 30 member teams, 29 in the United States and 1 in Canada.

All-Star Game

  • The All-Star Game is an annual exhibition game that showcases the league’s best stars.
  • The aim of this project is to predict those “best stars” as accurately as possible using only performance metrics.
  • The catch is that, in reality, players are selected via a combination of fan, media, player, and coach voting.

Supervised Learning

  • Supervised Learning is a form of machine learning in which algorithms are tasked with mapping input values to output values.
  • For this study, the input values are the various player statistics, and the output values are All-Star status (True or False).

01 DATA SET

Data Source

  • The input variables data set was downloaded from Kaggle and was collected by Omri Goldstein.
  • The output variables data was manually collected directly from Basketball-Reference and merged with the Kaggle data set.

Features & Observations

  • After dropping blank rows and columns, the Kaggle data set contained 23,805 observations (rows) and 50 variables (columns).
  • The number of players in the NBA has generally increased since the inaugural 1950 season, ranging from a low of 92 players in 1961 to a high of 492 in 2014.
  • The number of All-Star players per year has changed little over the same period, with a minimum of 19 and a maximum of 27.

02 MODEL PREPARATION

Data Cleaning — Dropping Data

  • Blank Rows & Columns: As mentioned previously, blank rows and columns were dropped from the data set. These appeared to serve as visual separators, so removing them does not impact any data analysis.
  • Irrelevant Years: Data from the 1950 and 1999 seasons were dropped; the first All-Star Game was not held until 1951, and no All-Star Game was played in 1999 because of the player lockout.
  • Miscellaneous Characters: Non-standard characters (e.g., asterisks) were removed from player names. These characters initially prevented accurate merging of the input and output variables into a single data set.
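Below is a minimal pandas sketch of these cleaning steps. The file name (Seasons_Stats.csv) and column names (Year, Player) are assumptions based on the Kaggle data set.

```python
import pandas as pd

stats = pd.read_csv("Seasons_Stats.csv")  # assumed file name for the Kaggle data

# Drop fully blank rows and columns that only served as visual separators.
stats = stats.dropna(axis=0, how="all").dropna(axis=1, how="all")

# Drop irrelevant years: 1950 (no All-Star Game until 1951) and 1999 (lockout).
stats = stats[~stats["Year"].isin([1950, 1999])]

# Strip non-standard characters (e.g., asterisks) so player names match the
# manually collected All-Star labels when the two data sets are merged.
stats["Player"] = stats["Player"].str.replace("*", "", regex=False)
```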

Data Cleaning — Missing Data

  • 20 of the 50 columns have ~10% to ~15% NULL values. These columns coincide with player statistics that were not collected or calculated in the earlier years of the league’s operation.
  • Example 1: Rebound breakdown statistics (DRB, DRB%, ORB, ORB%) became available starting with the 1974 season.
  • Example 2: The 3-point shot was adopted by the NBA starting with the 1980 season so no related statistics (3P, 3P%, 3PA) are available before then.
  • These columns will not be dropped from the data set as they are advanced metrics that could factor into the prediction, especially in more modern years.
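A quick check along these lines (a sketch, reusing the stats DataFrame from the cleaning step) surfaces the affected columns:

```python
# Share of NULL values per column; the ~20 columns in the 10%-15% range
# correspond to statistics that were not tracked in the league's early years.
null_share = stats.isnull().mean().sort_values(ascending=False)
print(null_share[null_share > 0.10])
```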

Data Exploration

Let’s take a look at some metric groups to get a sense of how All-Stars and non-All-Stars tend to perform.

Basic “Count” Metrics

  • Not surprisingly, All-Stars on average score more points (PTS), collect more total rebounds (TRB), and record more assists (AST), steals (STL), and blocks (BLK).

Shooting Percentages

  • The various shooting percentages (number of made shots divided by the number of attempted shots) do not appear to be factors in All-Star prediction.
  • There is great overlap in all the plots below.

Win Shares

  • Win shares¹ (WS) is defined as the estimated number of wins contributed by a player.
  • On average, All-Star players accumulate greater win shares than non-All-Stars.

Player Ratings

  • The last group collects various advanced ratings for each player.
  • Box Plus/Minus² (BPM) is an estimate of the points per 100 possessions contributed by a player above an average NBA player.
  • Value Over Replacement Player² (VORP) is similar to BPM except the comparison point is a replacement player rather than an average NBA player. A replacement player is one with a BPM of -2.0.
  • Player Efficiency Rating³ (PER) is a per-minute summation of a player’s positive and negative accomplishments.
  • As expected, All-Stars tend to have higher ratings than non-All-Stars, though, as in previous metric groups, there is much overlap.

Feature Engineering

Normalization

  • Each continuous variable was normalized against the maximum value of that variable for each year.
  • Variables with negative values (win shares, player ratings) were excluded from normalization.

By normalizing the continuous variables, their ranges become directly comparable, running from a minimum of 0 up to a maximum of 1.
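A minimal sketch of this per-year normalization, assuming a Year column and using a few of the count metrics as examples:

```python
# Normalize each continuous metric against that year's maximum value,
# so every normalized column tops out at 1. The column list is illustrative;
# variables that can be negative (WS, BPM, VORP, ...) are left out, as noted above.
count_cols = ["PTS", "TRB", "AST", "STL", "BLK"]
for col in count_cols:
    stats[col + "_norm"] = stats[col] / stats.groupby("Year")[col].transform("max")
```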

Season Score

  • The season score feature (SsnScr) was generated by calculating the mean of all the normalized metrics.
  • Additionally, SsnScr itself was normalized against the maximum value for each year, producing SsnScr_norm.

Defining a feature such as season score allows for a reduction in the number of features, encapsulating the information from many variables into as few as possible. In the case of the NBA data, 41 variables were reduced to 2 (SsnScr and SsnScr_norm).
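Continuing the sketch above, the season score features could be built roughly as follows:

```python
# Season score: the mean of all per-year-normalized metrics for a player-season,
# itself re-normalized against each year's maximum.
norm_cols = [c for c in stats.columns if c.endswith("_norm")]
stats["SsnScr"] = stats[norm_cols].mean(axis=1)
stats["SsnScr_norm"] = stats["SsnScr"] / stats.groupby("Year")["SsnScr"].transform("max")
```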

The plot below shows the results of this process: All-Stars on average have greater SsnScr and SsnScr_norm values than non-All-Stars.

Position Group

  • Historically, there have been 5 standard position designations for NBA players: Center (C), Power Forward (PF), Small Forward (SF), Shooting Guard (SG), Point Guard (PG)
  • The data set, however, has many combinatory positions such as SG-SF, PF-C, PG-SG, and several more.
  • To simplify, and since All-Star players are designated either as front court or back court players, a new feature PosGrp was added to sort players into BackCourt and FrontCourt groups.
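A sketch of the mapping, using the first listed position for combination designations (the exact rule is my assumption; the Pos column name comes from the Kaggle data set):

```python
# Centers and forwards map to FrontCourt; guards map to BackCourt.
FRONT_COURT = {"C", "PF", "SF"}

def position_group(pos: str) -> str:
    primary = str(pos).split("-")[0]  # e.g., "SG-SF" -> "SG"
    return "FrontCourt" if primary in FRONT_COURT else "BackCourt"

stats["PosGrp"] = stats["Pos"].apply(position_group)
```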

Feature Selection

The 12 variables (from a total of 50!) selected as features for classification were as follows:

Engineered Variables

  • SsnScr
  • SsnScr_norm
  • PosGrp

Non-normalized Variables

  • PER — Player Efficiency Rating
  • WS — Win Shares
  • OWS — Offensive Win Shares
  • DWS — Defensive Win Shares
  • WS/48 — Win Shares per 48 Minutes
  • BPM — Box Plus/Minus
  • OBPM — Offensive BPM
  • DBPM — Defensive BPM
  • VORP — Value Over Replacement Player

03 CLASSIFICATION

Original Data Set

Cross-Validation

  • Using StratifiedKFold⁵ with shuffle=True and 10 splits, the cross-validation results show that scores across folds are fairly consistent.
  • Bernoulli Naive Bayes⁶ (bnb_unb) appears to be the worst algorithm, with scores of 0.84–0.86.
  • Random forest⁷ (rfc_unb), gradient boosting⁸ (gbc_unb), and k-Nearest-Neighbor⁹ (knn_unb) all have much higher scores of 0.96–0.97.
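A sketch of this cross-validation setup with scikit-learn, assuming the 12 selected features and an AllStar label column (hyperparameters shown are library defaults, not necessarily those used in the study):

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier

features = ["SsnScr", "SsnScr_norm", "PosGrp", "PER", "WS", "OWS", "DWS",
            "WS/48", "BPM", "OBPM", "DBPM", "VORP"]
X = pd.get_dummies(stats[features], columns=["PosGrp"])  # one-hot encode PosGrp
X = X.fillna(0)  # simplification for early-year seasons missing advanced stats
y = stats["AllStar"]  # assumed label column (True/False)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
models = {
    "bnb_unb": BernoulliNB(),
    "rfc_unb": RandomForestClassifier(random_state=42),
    "gbc_unb": GradientBoostingClassifier(random_state=42),
    "knn_unb": KNeighborsClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: {scores.min():.2f} to {scores.max():.2f}")
```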

Confusion Matrix¹⁰

  • False negative: For every algorithm except Naive Bayes, roughly a third to nearly half of the actual All-Stars were incorrectly predicted to be non-All-Stars.
  • False positive: For the Naive Bayes algorithm, 16% of all predicted All-Stars were, in fact, not All-Star players.
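Reusing X and y from the sketch above, a confusion matrix for any one classifier can be inspected as follows (the train/test split proportions are my assumption):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
rfc = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Rows are actual [False, True]; columns are predicted [False, True].
print(confusion_matrix(y_test, rfc.predict(X_test)))
```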

Class Imbalance

  • The high rate of false negatives above illustrates a common problem with classification tasks. Data sets are often not balanced, that is, the distribution of observations among the classes is unequal.
  • Class imbalance can result in machine learning algorithms favoring the majority class, especially in cases with severe bias.
  • The NBA All-Star prediction task is one such case of extreme bias: ~95% of the observations in the data set have an output of False for All-Star.

Balanced Data Set: Under-sampling Majority Class

  • One approach to dealing with class imbalance is to down-sample the majority class until its observation count is equal (or closer) to that of the minority class.
  • The more skewed the data set, however, the more of the majority class is excluded, which can limit the effectiveness of the model.
  • In order to get a 1:1 ratio, the number of non-All-Star observations was reduced from 19,255 to 1,030, excluding ~95% of that class's data.
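This is a one-liner with imbalanced-learn's RandomUnderSampler (my choice of tool; a manual pandas sample of the majority class works just as well):

```python
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop non-All-Star rows until the two classes are balanced 1:1.
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=42)
X_under, y_under = rus.fit_resample(X, y)
```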

Cross-Validation

  • There is a 6%-8% worst-case overfit across all algorithms.

Confusion Matrix

  • The false negative rates have fallen relative to the unbalanced data set.
  • At the same time, the false positive rates have increased.

Balanced Data Set: SMOTE-Balancing

  • Another approach to class balancing is SMOTE¹¹, or Synthetic Minority Oversampling Technique.
  • As illustrated¹² below, SMOTE works by drawing lines between real observations and their n-nearest neighbors. Along these lines, new samples are randomly generated.
  • The plots below show the original unbalanced data set (left) and the SMOTE-balanced version (right). Visually, it appears that the spaces between All-Star data points are “filled in”.
  • Note that while the plots below are two-dimensional, all twelve of the features previously listed in this article were fed into the SMOTE balancing process.
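A minimal sketch using imbalanced-learn's SMOTE implementation, applied to the full 12-feature matrix from the earlier sketch:

```python
from imblearn.over_sampling import SMOTE

# Generate synthetic All-Star observations along lines between real All-Stars
# and their nearest neighbors until the classes are balanced.
smote = SMOTE(k_neighbors=5, random_state=42)
X_smote, y_smote = smote.fit_resample(X, y)
```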

Cross-Validation

  • The cross-validation results now resemble the scores observed using the original, unbalanced data set. At worst, there may be a 1%-2% overfit across folds.

Confusion Matrix

  • More importantly, the false negative rates have been significantly reduced.
  • The false positive rates have been reduced for the random forest, gradient boosting, and nearest neighbor algorithms relative to the under-sampled data set, though not to the levels shown with the unbalanced data.
  • Though Naive Bayes suffers from a much greater false positive rate, SMOTE is an acceptable balancing method for the other three algorithms.

04 EVALUATION

Feature Importance

Unsurprisingly, SsnScr and SsnScr_norm are the most influential features¹³ for random forest classification, followed by WS, PER, and VORP in some order for each classification run.
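For reference, a sketch of how these importances are read off a fitted random forest (reusing rfc from the confusion-matrix sketch above):

```python
import pandas as pd

# Impurity-based feature importances, largest first.
importances = pd.Series(rfc.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(5))
```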

Evaluating on New Data

Prediction Results (2018, 2019)

  • Though the SMOTE-balanced model performed well during training with scores > 0.97 across 10 folds, it predicted poorly on new data.
  • Specifically, executing prediction on data from the 2018 and 2019 seasons resulted in ~40%–50% false positives.
  • One possible explanation is that the model has no effective means of limiting the number of players selected per year, so a current player further down the rankings but with similar numbers to a past All-Star may be positively (but falsely) predicted.

New Features

Adding the new features below had little to no impact on the false positive rate for new data, though the cross-validation and confusion matrix results remained similar to the previous classification runs without them.

  • AllStarCount: a count of the number of All-Stars per year.
  • Percentile: a ranking based on SsnScr_norm, split by PosGrp
  • Top24: a Boolean value based on SsnScr_norm indicating if a player was in the top 24 of their PosGrp
  • SsnScr_norm_x_PER/WS/VORP: interaction variables between SsnScr_norm and PER, WS, and VORP respectively.
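Sketches of how these features could be constructed; the exact formulas are my assumptions inferred from the descriptions above (e.g., I assume Percentile and Top24 are computed per year within each PosGrp):

```python
# Count of All-Stars per year (computed from the training labels).
stats["AllStarCount"] = stats.groupby("Year")["AllStar"].transform("sum")

# Percentile rank of SsnScr_norm within each year and position group.
stats["Percentile"] = stats.groupby(["Year", "PosGrp"])["SsnScr_norm"].rank(pct=True)

# Whether a player is among the top 24 of their position group for that year.
rank_desc = stats.groupby(["Year", "PosGrp"])["SsnScr_norm"].rank(
    ascending=False, method="first")
stats["Top24"] = rank_desc <= 24

# Interaction terms between SsnScr_norm and PER, WS, and VORP.
for col in ["PER", "WS", "VORP"]:
    stats[f"SsnScr_norm_x_{col}"] = stats["SsnScr_norm"] * stats[col]
```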

Updated Feature Importance

Top24 is consistently the most important feature followed by two of the interaction variables SsnScr_norm_x_PER and SsnScr_norm_x_WS.

Data Overlap

The plot below shows the 2018 and 2019 measurements for the top 3 most important features. Note the overlap between All-Stars and non-All-Stars.

The plot below shows the predicted All-Stars for 2018 and 2019 as ordered by SsnScr_norm_x_PER and color-split by actual All-Star status, with blue bars indicating false positives.

Note that while the top-5 ranked players in each PosGrp-Year plot are consistently selected, beyond that, a player in the top 20 of their positional group is not guaranteed a roster spot on the All-Star team.

Conclusions

Though the prediction accuracy is not what I hoped, the model in its current state is still quite useful as a means to narrow down the player pool to the “best of the best” with regards to performance on the basketball court.

Some example use cases:

  • performance-based reference for coaches seeking to reward performance over popularity or for neutral fans attempting to be fair-minded
  • fueling commentator or fan discussions regarding who deserves an All-Star nod, who was snubbed, or who is having a better season
  • identifying flaws in the voting system (the NBA recently adjusted its voting formula in response to the growing influence of social media campaigns)

05 FUTURE WORK

Model Improvement

This has been an interesting project to say the least, and the unexpectedly poor prediction accuracy (after the promising cross-validation results!) only motivates me to improve upon my work.

Moving beyond player performance metrics, I would like to investigate adding variables to measure player popularity and visibility, e.g.:

  • home city population
  • social media mentions
  • media coverage

Multi-Class Classification

The focus of this study was to predict whether a player would be selected as an All-Star, a simple Yes or No, True or False problem.

After improving the model’s prediction accuracy, I would like to rework the model to not only be able to predict whether a player is selected as an All-Star, but also to predict whether they would be a starter or bench player.

This would result in a classifier consisting of the following classes:

  • All-Star Starter
  • All-Star Bench
  • Non All-Star

Naive Bayes Investigation

Even with class balancing, the Naive Bayes classifier had the lowest cross-validation scores and highest rates of false positives and false negatives.

Whether this is due to the variable independence assumption or some other cause is worth investigating to understand if Naive Bayes could be useful for application to the NBA All-Star prediction problem.

06 REFERENCES

