Predicting the 2021–2022 All-NBA Rookie Team Using Machine Learning

Feb 4, 2022

*Analysis was done using data through the first week of December 2021. This project was done for my BADM 453 class (Business Intelligence).

In this study, I attempted to predict the 2021–2022 All-NBA Rookie Team using machine learning.

Exploratory and Predictive Objectives:

The independent variables are each player’s numerical statistics. The outcome variable is binary, which makes this a classification problem: 1 if the 2021–2022 rookie makes the All-NBA Rookie Team and 0 if not. Both the exploratory and predictive objectives use the same outcome.

Sample, Data, and Variables:

My data source is basketball-reference.com, one of the leading sources of NBA statistics alongside NBA.com and ESPN. Every source has the same basic statistics, such as points, assists, and rebounds per game. They differ in their advanced statistics, since each site has its own proprietary metrics: NBA.com has Offensive and Defensive Rating, while Basketball Reference has Offensive and Defensive Box Plus/Minus. I chose Basketball Reference because it lets you download the data as a CSV file.

Background on Basketball Reference Advanced Stats

All definitions below are quoted from basketball-reference.com.

BPM = “Box Plus/Minus, Version 2.0 (BPM) is a basketball box score-based metric that estimates a basketball player’s contribution to the team when that player is on the court. It is based only on the information in the traditional basketball box score — no play-by-play data or non-traditional box score data (like dunks or deflections) are included.

To give a sense of the scale:

+10.0 is an all-time season (think peak Jordan or LeBron)
+8.0 is an MVP season (think peak Dirk or peak Shaq)
+6.0 is an all-NBA season
+4.0 is in all-star consideration
+2.0 is a good starter
+0.0 is a decent starter or solid 6th man
-2.0 is a bench player (this is also defined as “replacement level”)
Below -2.0 are many end-of-bench players”

VORP = “Value over Replacement Player (VORP) converts the BPM rate into an estimate of each player’s overall contribution to the team, measured vs. what a theoretical “replacement player” would provide, where the “replacement player” is defined as a player on minimum salary or not a normal member of a team’s rotation. A long and comprehensive discussion on defining this level for the NBA was had at Tom Tango’s blog, and is worth a read. (Tom Tango is a baseball sabermetrics expert, and one of the originators of the replacement level framework and the Wins Above Replacement methodology common now in baseball.)”

WS/48 = “Win Shares Per 48 Minutes (available since the 1951–52 season in the NBA); an estimate of the number of wins contributed by the player per 48 minutes (league average is approximately 0.100). Please see the article Calculating Win Shares for more information.”

Usg% = “Usage Percentage (available since the 1977–78 season in the NBA); the formula is 100 * ((FGA + 0.44 * FTA + TOV) * (Tm MP / 5)) / (MP * (Tm FGA + 0.44 * Tm FTA + Tm TOV)). Usage percentage is an estimate of the percentage of team plays used by a player while he was on the floor.”

TS% = “True Shooting Percentage; the formula is PTS / (2 * TSA). True shooting percentage is a measure of shooting efficiency that takes into account field goals, 3-point field goals, and free throws.”
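To make the two formulas above concrete, here is a minimal Python sketch. The function names and the sample stat line are hypothetical. The TSA definition (FGA + 0.44 * FTA) is not part of the quote above; it comes from the same Basketball Reference glossary.

```python
def true_shooting_attempts(fga: float, fta: float) -> float:
    """TSA = FGA + 0.44 * FTA (glossary definition, not quoted above)."""
    return fga + 0.44 * fta


def true_shooting_pct(pts: float, fga: float, fta: float) -> float:
    """TS% = PTS / (2 * TSA); returns 0.0 for a player with no attempts."""
    tsa = true_shooting_attempts(fga, fta)
    return pts / (2 * tsa) if tsa else 0.0


def usage_pct(fga, fta, tov, mp, tm_fga, tm_fta, tm_tov, tm_mp) -> float:
    """Usg% = 100 * ((FGA + 0.44*FTA + TOV) * (Tm MP / 5))
              / (MP * (Tm FGA + 0.44*Tm FTA + Tm TOV))."""
    player_plays = fga + 0.44 * fta + tov
    team_plays = tm_fga + 0.44 * tm_fta + tm_tov
    if mp == 0 or team_plays == 0:
        return 0.0  # avoid dividing by zero for players who never played
    return 100 * (player_plays * (tm_mp / 5)) / (mp * team_plays)


# Hypothetical season totals, for illustration only.
ts = true_shooting_pct(pts=1000, fga=800, fta=250)
usg = usage_pct(fga=800, fta=250, tov=150, mp=2000,
                tm_fga=7000, tm_fta=1800, tm_tov=1100, tm_mp=19800)
print(f"TS% = {ts:.3f}, Usg% = {usg:.1f}")
```

Both stats are ratios, so they describe efficiency and role rather than volume, which matters later when comparing them against cumulative stats like total points.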

The above are the most notable ones. Here is the link to the rest: https://www.basketball-reference.com/about/glossary.html

Data pre-processing:

For my data pre-processing, I started by downloading the rookie stats for each season from 2009–2010 through 2020–2021. I then loaded each tab into a Jupyter Notebook, renamed the columns by year, and dropped any rows unnecessary for my analysis. Next, I used Beautiful Soup, a very powerful web scraping library, because the rookie tables did not include the advanced stats I needed for the model. I scraped all player stats for each year, both basic and advanced. I then used the pandas merge function to match each rookie’s name with their advanced stats, which populated a new dataframe. The merged dataframe contained some duplicates, so I deleted them along with any columns that had no data. For any missing numbers, I imputed zero. The last step before feeding the data into the model was to convert it to numeric types so the model would not return an error.
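The merge and cleanup steps described above can be sketched in pandas roughly as follows. This is an illustration under assumptions, not the original notebook: the two small tables, the column names, and the choice to join on both `Player` and `Season` are hypothetical stand-ins for the downloaded rookie files and the scraped tables.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded rookie table.
rookies = pd.DataFrame({
    "Player": ["Rookie A", "Rookie B", "Rookie C"],
    "Season": ["2020-21", "2020-21", "2020-21"],
    "PTS": ["1100", "250", "640"],  # scraped or CSV values may arrive as strings
})

# Hypothetical stand-in for the scraped league-wide advanced stats.
advanced = pd.DataFrame({
    "Player": ["Rookie A", "Rookie A", "Rookie B", "Veteran X"],
    "Season": ["2020-21", "2020-21", "2020-21", "2020-21"],
    "BPM": ["1.5", "1.5", "-3.2", "4.0"],
    "VORP": ["1.2", "1.2", "", "3.1"],
})

# Match each rookie with their advanced stats; a left join keeps every rookie.
merged = rookies.merge(advanced, on=["Player", "Season"], how="left")

# Remove duplicate player-season rows and columns that contain no data at all.
merged = merged.drop_duplicates(subset=["Player", "Season"])
merged = merged.dropna(axis="columns", how="all")

# Convert stat columns to numbers, then impute zero for anything missing.
stat_cols = ["PTS", "BPM", "VORP"]
merged[stat_cols] = merged[stat_cols].apply(pd.to_numeric, errors="coerce").fillna(0)

print(merged)
```

Joining on season as well as name is a design choice in this sketch: it avoids matching a rookie to a different player with the same name in another year.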

Descriptive statistics

From the first graph, All-NBA rookies tend to be a lot younger than rookies who do not make the All-NBA Rookie Team. I think this is because, unlike in other professional sports leagues, the best prospects have only one year of college experience, and teams are willing to invest in them by playing them a lot of minutes. They get those minutes because the teams that draft them tend to be underperforming. Older prospects tend to get fewer minutes and are viewed as having less room to develop. As a result, their ceiling is a role player, so they play less and make the All-NBA Rookie Team less often. As the violin plot shows, the ages of All-NBA Rookie Team members are more tightly distributed than the ages of those who do not make the team.

In terms of the minutes distribution, the majority of rookies play very few minutes; only a very slim minority play more than 2,000 minutes.

The last three graphs show the relationship between field goal statistics, minutes played, and whether or not a rookie made the All-NBA Rookie Team. The first shows that below roughly 250 field goals, the chances of making the team are slim, and that they increase greatly above that mark. I also like this graph because it shows that as field goals increase, the distribution of field goal percentage becomes much tighter. I like the other two because they show that minutes played are a precursor to both field goal attempts and field goals made.

Below is the correlation table showing how the independent variables correlate with the outcome variable (All-NBA Rookie Team). I could not fit the entire table in the writeup, but it makes clear that there is a lot of multicollinearity, which I take into account when choosing the model. The multicollinearity makes perfect sense, because many of the stats are computed from other stats. Field goal percentage, for example, is computed from field goals and field goal attempts. Likewise, the more minutes you play, the more points, rebounds, and assists you accumulate. Most importantly, let’s look at which stats correlate most with the outcome variable in terms of r². The three most important are FG, FT, and PPG.
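A correlation table like the one described can be produced and ranked with pandas. The sketch below uses a tiny made-up dataset and hypothetical column names (`AllRookie` as the outcome), so the numbers it prints say nothing about the real data. The 0.9 cutoff for flagging collinear pairs is also an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up rookie seasons, for illustration only.
df = pd.DataFrame({
    "FG": [450, 380, 120, 60, 300, 90],
    "FT": [200, 150, 40, 20, 110, 35],
    "MP": [2300, 2100, 900, 400, 1800, 700],
    "FG%": [0.45, 0.47, 0.41, 0.52, 0.43, 0.39],
    "AllRookie": [1, 1, 0, 0, 1, 0],
})

corr = df.corr()  # Pearson correlation between every pair of columns

# Rank features by squared correlation with the outcome variable.
r_with_outcome = corr["AllRookie"].drop("AllRookie")
ranked = (r_with_outcome ** 2).sort_values(ascending=False)
print(ranked)

# Flag feature pairs that are strongly correlated with each other.
features = corr.drop(index="AllRookie", columns="AllRookie")
cols = list(features.columns)
high_pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(features.loc[a, b]) > 0.9
]
print(high_pairs)
```

Listing the strongly correlated feature pairs is one simple way to see multicollinearity directly instead of reading it off a large table.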

In another study, I looked at the correlation between a team’s market cap and its rookies’ performance as measured by Box Plus/Minus and Value Over Replacement Player. I did not use market cap as a feature in this study, but it is interesting to see whether rookies tend to perform worse in bigger markets such as Chicago and New York than in smaller markets such as New Orleans and Memphis.

Although the R² was below .05, there is a very small negative correlation between a team’s market value and all-encompassing rookie stats such as Box Plus/Minus and Value Over Replacement Player. In other words, the more valuable the franchise, the worse its rookies tend to perform.

Select Appropriate Methods for your Analysis:

Since my outcome was binary, I used classification models for my predictions. As mentioned above, the data has a lot of multicollinearity, so I went with non-parametric and tree-based models so that multicollinearity would not skew my predictions and results. The models I used are as follows:

KNN Classification
Random Forest Classification
XGBoost Classification
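As a rough sketch of how these three model types can be fit side by side, the example below uses scikit-learn and, if it is installed, xgboost. Everything in it is an assumption for illustration: the synthetic two-feature data, the hyperparameters, and the scaling step in front of KNN. If xgboost is not available, the sketch falls back to scikit-learn's `GradientBoostingClassifier`, which is a different boosted-tree implementation and not XGBoost itself.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

try:
    from xgboost import XGBClassifier
    boosted = XGBClassifier(n_estimators=100, max_depth=3)
except ImportError:
    # Stand-in when xgboost is not installed; not the same algorithm.
    boosted = GradientBoostingClassifier(random_state=0)

# Synthetic "rookie seasons": minutes played and total points (made up).
rng = np.random.default_rng(0)
n = 300
minutes = rng.uniform(0, 2500, n)
points = minutes * rng.uniform(0.2, 0.6, n)
X_train = np.column_stack([minutes, points])
# Label the top 10% of scorers as 1 to mimic a rare positive class.
y_train = (points > np.quantile(points, 0.9)).astype(int)

models = {
    # KNN is distance-based, so features are standardized first.
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Boosted trees": boosted,
}

# Two hypothetical current-season rookies: one high-volume, one low-volume.
X_new = np.array([[2200.0, 1200.0], [300.0, 90.0]])

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_new)
    print(name, predictions[name])
```

Keeping the models in one dictionary makes it easy to fit them on the same training seasons and compare which players each one selects.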

Validate Your Models:

Since this was a classification problem, I used a confusion matrix to determine each model’s accuracy and identify the best performer. Among the baseline models with no adjustments, XGBoost performed best, with an impressive accuracy of 97% on the training data. The other models were also extremely accurate, with an accuracy of 95%. To clarify, the training data are the seasons 2009–2010 through 2020–2021, which have known outcomes, whereas the testing data is the current season, 2021–2022.
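For reference, this is how accuracy falls out of a binary confusion matrix. The sketch is plain Python with made-up labels and hypothetical function names. It also computes recall, which is not a metric reported in this study, because the positive class is small and accuracy alone can look high even when some All-NBA Rookie Team members are missed.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count true/false negatives and positives for 0/1 labels."""
    counts = Counter(zip(y_true, y_pred))
    return {
        "tn": counts[(0, 0)],
        "fp": counts[(0, 1)],
        "fn": counts[(1, 0)],
        "tp": counts[(1, 1)],
    }


def accuracy(c):
    """Share of all predictions that were correct."""
    total = sum(c.values())
    return (c["tp"] + c["tn"]) / total if total else 0.0


def recall(c):
    """Share of actual positives that the model found."""
    positives = c["tp"] + c["fn"]
    return c["tp"] / positives if positives else 0.0


# Made-up example: 10 rookies, 2 of whom actually made the team.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

cm = confusion_counts(y_true, y_pred)
print(cm, accuracy(cm), recall(cm))
```

In this toy case the model is right on 9 of 10 players while finding only one of the two actual team members, which is why it can help to read the full matrix and not just the accuracy figure.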

Perform Analysis and Report Findings:

KNN Results:

Random Forest Results:

XGBoost Results:

Although the accuracies were somewhat different, the players chosen were nearly identical. The one exception was Jalen Green, who was left out by the KNN model but included by the other two. To better understand feature importance, let’s take a look at the SHAP value chart.

As in the correlation table, PTS and FG were the most important features in determining the outcome variable. The chart shows that making the All-NBA Rookie Team depends heavily on offense, specifically scoring, rather than on defense, assists, or rebounding. The more prominent features were cumulative statistics rather than efficiency statistics. In other words, rookies who play and score a lot, no matter the cost, tend to make the All-NBA Rookie Team. The model does not consider the context in which players accumulate points and minutes, and many of these players are currently on losing teams.

This exemplifies the idea of putting up empty stats on a losing team. For example, Michael Carter-Williams won Rookie of the Year in 2013–2014, yet by 2021 he was widely considered one of the worst players in the NBA. He got so much playing time as a rookie because the team around him was atrocious, hence empty stats on a losing team. The model also does not consider effort and other basketball qualities that box score statistics do not capture. It punishes someone like Ayo Dosunmu, who puts in a ton of effort, is always in the right spot at the right time, and has a very high basketball IQ. Models like this cannot account for basketball IQ and other intangibles because there is no exact statistic for them.

My findings can be applied to sports betting as well as sports journalism. Sports bettors can use my All-NBA Rookie Team predictions to find value in the NBA Rookie of the Year moneylines. My final take is to combine the predictions above with news and each player’s environment to find value in the Las Vegas moneyline odds. My two favorite value picks are Franz Wagner at +5000 and Josh Giddey at +1500. My case for Franz Wagner is that he finished fourth in the NBA’s rookie ladder this week and is playing for a really bad team. On really bad teams, rookies can be given a lot of responsibility to score. The same goes for Josh Giddey, who is also on a really bad team with a lot of responsibility. I think both have a chance to pass Mobley, Barnes, and Cunningham and make huge jumps over the remainder of the season. Giddey and Wagner also have significantly higher payouts.
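To put numbers on "finding value", American moneyline odds can be converted into the break-even win probability they imply. The sketch below uses the standard conversion and the two odds quoted above. The function names are hypothetical, and the calculation ignores the bookmaker's margin.

```python
def implied_probability(american_odds: int) -> float:
    """Break-even win probability implied by American moneyline odds."""
    if american_odds > 0:
        return 100 / (american_odds + 100)
    return -american_odds / (-american_odds + 100)


def expected_profit(american_odds: int, win_prob: float, stake: float = 100.0) -> float:
    """Average profit per bet, given your own estimate of the win probability."""
    if american_odds > 0:
        payout = stake * american_odds / 100
    else:
        payout = stake * 100 / -american_odds
    return win_prob * payout - (1 - win_prob) * stake


for player, odds in [("Franz Wagner", 5000), ("Josh Giddey", 1500)]:
    print(f"{player}: {odds:+d} implies {implied_probability(odds):.2%}")
```

A bet has positive expected value only if your own estimate of the player's chances is higher than the implied probability, so at +1500 you would need to believe Giddey wins more than 6.25% of the time.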


Written by Ship Analytics