Modeling The NBA Leap — Part II
For the first part of this blog, please click here.
Jumping right back in: we left off with a cleaned data set made up of our qualified NBA players’ statistics from their first three seasons. That gave us around 1,300 rows of players and 100+ columns of basketball stats. As a reminder, our target variable is whether a player is selected to an All-NBA team in seasons four through six. Let’s start looking deeper into the data.
Exploratory Data Analysis
First let’s check on what our class distribution looks like:
We only have two classes: ‘All-NBA’ or ‘Not All-NBA.’ As you can see in the bar plot on the left, we have a major class imbalance. This makes sense, as making an All-NBA team in seasons four through six is a major achievement, and our target players are highlighted by the likes of Michael Jordan, Kobe Bryant, LeBron James, etc.
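For readers following along in code, here is a minimal sketch of how a class distribution like this could be checked with pandas. The tiny data frame and the column name `all_nba` are assumptions made up for illustration; they are not the project's actual data or schema.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned data set. The column name
# "all_nba" is an assumption (1 = made an All-NBA team in seasons 4-6).
players = pd.DataFrame({
    "player": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
    "all_nba": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
})

# Raw counts per class, then the same counts as a share of all rows.
class_counts = players["all_nba"].value_counts().sort_index()
class_share = players["all_nba"].value_counts(normalize=True).sort_index()

print(class_counts)
print(class_share.round(2))
```

Looking at the shares alongside the raw counts makes the size of the imbalance easier to judge before any modeling starts.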
Because most of our data is continuous, we can start looking at it in scatter plots, coloring our target players differently to see whether they are statistically different from their peers. Let’s check out some basic cumulative statistics in the pairplot below.
Our pairplot shows cumulative points, assists, rebounds, games started and minutes played across a player’s first three seasons combined, plotted against each other. The light green dots represent non-All-NBA players, while the darker blue dots are All-NBA players. As we move toward the upper limits of both the x and y axes, more and more of those blue All-NBA data points appear. This tells us that, even in some of the most basic statistics, All-NBA players begin to separate themselves from the rest of the league.
Looking further, we now get into some of the more advanced statistics. If you remember, these statistics were not tracked until the 1970s. These types of statistics really begin to differentiate All-NBA players from their peers. The 3D scatter plot on the left plots VORP (value over replacement player), PER (player efficiency rating) and WS (win shares) against one another, and again, as with the basic statistics, we can clearly see how the players we are targeting differ from the rest.
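A 3D scatter of this kind could be drawn with matplotlib roughly as follows. The library choice is an assumption, and the values are synthetic numbers generated only so the sketch runs; the shift applied to the All-NBA group is made up and does not reflect real VORP, PER or WS figures.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic values, NOT real statistics. The first 10 points play the role
# of All-NBA players and are shifted upward purely for illustration.
rng = np.random.default_rng(0)
n = 200
is_all_nba = np.arange(n) < 10
vorp = rng.normal(1.0, 1.0, n) + 3.0 * is_all_nba
per = rng.normal(13.0, 3.0, n) + 8.0 * is_all_nba
ws = rng.normal(8.0, 4.0, n) + 15.0 * is_all_nba

fig = plt.figure(figsize=(6, 5))
ax = fig.add_subplot(projection="3d")

# Plot each class separately so each gets its own color and legend entry.
ax.scatter(vorp[~is_all_nba], per[~is_all_nba], ws[~is_all_nba],
           color="lightgreen", alpha=0.6, label="Not All-NBA")
ax.scatter(vorp[is_all_nba], per[is_all_nba], ws[is_all_nba],
           color="darkblue", label="All-NBA")

ax.set_xlabel("VORP")
ax.set_ylabel("PER")
ax.set_zlabel("WS")
ax.legend()
```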
With these statistics combined, we can start to paint a picture of what makes an All-NBA player. Can we sharpen that picture further? Let’s dive in to see whether year-over-year growth can help us predict All-NBA players.
As examples, I have plotted one advanced statistic (VORP) and one more basic statistic (PPG, points per game). Ideally, we would want to see the majority of the blue All-NBA players in the top right corners of both plots, as the x-axis represents the difference between seasons 1 and 2, and the y-axis the difference between seasons 2 and 3. We do see some separation in both plots, but not as much as when we just look at the statistical totals in the earlier scatter plots. This could suggest that All-NBA caliber players come into the league ready to contribute immediately and might not need to show large improvements year over year.
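The year-over-year differences behind plots like these can be computed with simple column subtraction. The sketch below assumes one row per player with one PPG column per season; the column names and numbers are hypothetical, not the project's.

```python
import pandas as pd

# Hypothetical per-season values; column names such as "ppg_s1" are
# assumptions, and the numbers are made up for illustration.
seasons = pd.DataFrame({
    "player": ["A", "B", "C"],
    "ppg_s1": [10.0, 20.5, 8.0],
    "ppg_s2": [14.0, 21.0, 7.5],
    "ppg_s3": [18.5, 24.0, 9.0],
})

# x-axis of the growth plot: change from season 1 to season 2.
seasons["ppg_diff_1_2"] = seasons["ppg_s2"] - seasons["ppg_s1"]
# y-axis of the growth plot: change from season 2 to season 3.
seasons["ppg_diff_2_3"] = seasons["ppg_s3"] - seasons["ppg_s2"]

print(seasons[["player", "ppg_diff_1_2", "ppg_diff_2_3"]])
```

A player who improves in both intervals lands in the top right of the plot, while a player who arrives already productive, like the hypothetical player B here, can sit near the origin on the x-axis despite high totals.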
I did not anticipate writing an entire blog post on just the EDA for this project, but I bet you can guess what my favorite part of any data science project is. Visuals can help identify which features drive a target, and they also bring data to life for those who don’t like to get down into the weeds.
A full notebook of code for these visuals and many, many others can be found here!