Modeling The NBA Leap — Part 1
What is the NBA Leap and how do we define it? I think Zach Lowe, NBA Analyst for ESPN, puts it best:
“A player’s identity typically begins to crystallize in his third or fourth NBA season. Young players have learned the ropes, and veterans have departed or aged, vacating heavy-duty roles that need filling. Everyone involved — players, agents, executives — looks to see what emerges as a player nears the expiration of his rookie contract.”
More or less, the NBA Leap is when a young NBA player transitions from a productive teammate into a bona fide NBA star.
As crucial as these star players are to a team’s success, it’s just as important for front offices to identify future stars early so they can plan the rest of the roster’s finances around them. With this in mind, and to smooth over the contractual differences between first- and second-round draft picks and the changes to the NBA’s collective bargaining agreement over the years, our business question is:
Based on an NBA player’s statistics and accolades from their first three seasons, can we predict whether they will make ‘The NBA Leap’ and become an All-NBA player in seasons four through six?
Our metric of focus for this experiment will be precision. We’ll be aiming to reduce false positives, so that when our model predicts a player will ‘make the leap,’ they actually do. This will help front offices avoid handing ‘max or near-max contracts’ to players who are not likely to develop into top-15 players in the league.
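To make the metric concrete: precision is the number of true positives divided by the number of all positive predictions (true positives plus false positives). Below is a minimal sketch in plain Python. The `precision` helper and the eight labels are made up for illustration and are not results from this project.

```python
def precision(y_true, y_pred):
    """Share of players predicted to 'make the leap' (1) who actually did (1)."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    predicted_pos = true_pos + false_pos
    # With no positive predictions precision is undefined; this sketch returns 0.0.
    return true_pos / predicted_pos if predicted_pos else 0.0


# Hypothetical labels for eight players: 1 = made the leap, 0 = did not.
actual = [1, 0, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 1, 0, 0, 0, 0]

# Three players were predicted to leap and two of them did, so this prints 2/3.
print(precision(actual, predicted))
```

Note that the seventh player made the leap but was not predicted to. That miss is a false negative, and it does not lower precision at all. This is the trade-off we accept by focusing on precision: we may overlook some future stars in exchange for being right more often when we do flag one.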
Being able to identify young talent in the NBA is essential for front offices. The first step is the draft: each year around 60 players are drafted into the league, and more are signed through free agency. Depending on where a player is picked, rookie contracts can run from 2–4 years, with more guaranteed years and salary going to those drafted higher.
Life would be a whole lot easier for management if you could base players’ second contracts on their draft position, but there’s a reason why this article on the biggest NBA busts exists. A first-round draft pick offered a rookie scale extension can take up 25% of a team’s salary cap. That’s 1 player out of a team of 15 taking up a quarter or more of the team’s cap, and there are typically kickers that can increase these numbers if the player makes All-NBA during their rookie deal. Celtics star Jayson Tatum just missed out on an additional $32 million by not being selected to this year’s All-NBA team.
Being selected to an All-NBA team typically means you were one of the 15 best players in the league that year. All-NBA awards have been a major part of the NBA since the league’s inauguration back in 1946.
Data Collection & Compiling
Now that we have identified our business question, its importance, and our scoring metric, we can begin the data collection process. I sourced most of my data from two sites: Stathead.com and ESPN.com. I used Selenium WebDriver to scrape seasonal and advanced statistics for every NBA player dating back to 1946 from Stathead.com (this requires a premium subscription). This provided me with two tables of over 22,000 rows each of NBA statistical data. My notebook containing my web-scraping script can be found here.
From ESPN, I collected All-NBA awards, All-Star selections, MVP awards, Rookie of the Year awards, and other major accolades that could help identify star NBA players. These serve as useful categorical features alongside the continuous features collected from Stathead.com.
The most time-consuming part of this experiment was cleaning and formatting my dataset in Python. The main steps were:
- Merging the two data sources on a unique identifier (player name).
- Subsetting the dataset to only include players whose rookie seasons were after 1977, the year when the NBA began to record more advanced statistics such as VORP (value over replacement player), TS% (true shooting percentage), and PER (player efficiency rating).
- Further subsetting the dataset to only include players who had played at least 6 seasons in the NBA, as our target variable covers seasons 4 through 6.
- Aggregating the data so that each row represented a player’s first 3 seasons in the league, while the target represented seasons 4 through 6.
- Creating columns for each season’s separate statistics (season 1 ppg, season 2 ppg, etc.).
This left us with roughly 1,300 rows of data and 100 features to sink our teeth into before we dive into feature engineering.
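The cleaning steps above can be sketched roughly as follows. This is an illustrative sketch in plain Python, not the code from my notebook: the field names (`player`, `year`, `ppg`), the `build_rows` and `make_seasons` helpers, and the sample players are all hypothetical. It also assumes one record per player per season, and it reads "rookie season after 1977" as a first season of 1978 or later.

```python
from collections import defaultdict


def build_rows(seasons, all_nba, min_rookie_year=1978, min_seasons=6):
    """Collapse per-season records into one row per player.

    seasons: list of dicts like {"player": str, "year": int, "ppg": float}
    all_nba: set of (player, year) pairs marking All-NBA selections
    """
    by_player = defaultdict(list)
    for rec in seasons:
        by_player[rec["player"]].append(rec)

    rows = []
    for player, recs in by_player.items():
        recs.sort(key=lambda r: r["year"])
        # Drop players who debuted too early or did not last six seasons.
        if recs[0]["year"] < min_rookie_year or len(recs) < min_seasons:
            continue
        row = {"player": player}
        # Features: one column per stat per season for seasons 1 through 3.
        for i, rec in enumerate(recs[:3], start=1):
            row[f"season_{i}_ppg"] = rec["ppg"]
        # Target: 1 if the player made All-NBA in any of seasons 4 through 6.
        row["made_leap"] = int(
            any((player, r["year"]) in all_nba for r in recs[3:6])
        )
        rows.append(row)
    return rows


def make_seasons(player, first_year, ppgs):
    """Build hypothetical consecutive-season records for one player."""
    return [
        {"player": player, "year": first_year + i, "ppg": ppg}
        for i, ppg in enumerate(ppgs)
    ]


seasons = (
    make_seasons("Player A", 1990, [12.0, 15.5, 18.1, 21.0, 24.3, 25.0])
    + make_seasons("Player B", 1990, [8.0, 9.1, 7.5, 6.0])  # only 4 seasons
    + make_seasons("Player C", 1972, [10.0] * 7)  # rookie year too early
    + make_seasons("Player D", 2000, [20.0, 22.0, 19.0, 15.0, 14.0, 13.0])
)
# Player A is All-NBA in season 5; Player D only in season 2.
all_nba = {("Player A", 1994), ("Player D", 2001)}

rows = build_rows(seasons, all_nba)
for row in rows:
    print(row)
```

Only Player A and Player D survive the filters, and only Player A gets a positive target, since Player D’s selection falls inside the feature window rather than seasons 4 through 6. One caveat with keying on player name: two different players who share a name would be merged into one record, so a real pipeline may need an extra field to tell them apart.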
A more in-depth demonstration of my data compiling techniques can be found here.
Next Steps — Part 2
With our dataset cleaned, formatted, and aggregated, we can move into the next phase of our experiment. In my next post, I’ll go over my exploratory data analysis, feature engineering, modeling techniques, and results.
In the meantime, if you’re interested in jumping ahead, you can find my full project repository here.
Please reach out with any questions or suggestions!