On December 15, 2020, after consecutive NBA MVP wins, Giannis Antetokounmpo signed the largest contract in NBA history, worth $228 million over five years. This nine-figure contract came after multiple years of success in Milwaukee for the “Greek Freak,” and it may lead to a future championship. Large deals like this have become a fixture in basketball, and they show no signs of stopping. However, as in all professional sports, some contracts work out better than others. When these deals fail, teams are left paying large sums to players who should be on the bench. Professional teams have an idea of a player’s worth, but to sign him, managers inflate the numbers to make sure the signature goes on the paper and the star plays in their team’s jersey. I believe teams looking to build the best roster must manage their money well, and a player’s statistics should be the basis of his salary. In this project, I use Python’s machine learning capabilities to create salary projections from individual player statistics. The projections evaluate which players were fairly paid, overpaid, and underpaid in the 2019–2020 season.
NBA Salary Cap — History and Explanation
A salary cap in professional sports limits individual teams’ spending to maintain “competitive balance across the league” (Miller, 2018). Before the cap’s inaugural season in 1984, teams could spend as much as they wished to sign players. By implementing the cap, the league has tried to prevent championships from being bought by the most successful teams. Under the current Collective Bargaining Agreement (CBA), players receive “between 49 to 51 percent” of the basketball-related income, or BRI.
The BRI is generated from revenue across the league through tickets, broadcast rights, sponsorships, and many other sources. Due to the large increase in league interest and broadcasting, the cap for the 2020–2021 season will be $109,140,000 (Salary Cap Rumors | Hoops Rumors, 2017). The cap determines the maximum team payroll and forces teams to be strategic about how they distribute pay. If a team exceeds the cap, it is penalized.
To avoid these penalties, teams must either make sacrifices by cutting or trading players or plan ahead for the long term. As many know, a long-term plan can be thrown off by an injury or a disgruntled player. From Derrick Rose’s injury history derailing his large deal to Kyrie Irving demanding a trade away from LeBron James, things happen and force change. As such, teams need to find players who can fill the holes or slow the leaking, and they must do so within cap constraints. Through advancements in sports analytics, teams can find “discount” players using machine learning and statistics.
Data Harvesting and Cleaning
Python’s machine learning capabilities made it the most appropriate language for this project. Pandas, NumPy, the csv module, Matplotlib, and scikit-learn were imported to shape and process the data efficiently. Specifically, scikit-learn’s library of models provided clear outlines for designing the system. To feed this system, data was collected from NBA Stats, Hoops Hype, and a dataset developed by Chris Davis.
NBAStats.net provides access to the seasonal statistics for every player since 1985. The purpose of this data was to match a player’s statistics to his salary. Theoretically, a player’s pay should be directly derived from his play. Two cleaning actions were applied to this data. The first was to identify each player’s primary position and drop the non-primary ones. The downloaded CSV file had over 26 different positions, and I believed that position might affect pay, so I used Excel to correct the issue. The second was to replace NA values with 0s to avoid errors in the model’s training. The statistics data serve as the independent variables in the projections.
The salary data came from Hoops Hype and a CSV file made by Chris Davis, imported from data.world. The only change to this data was converting the salaries from strings with a “$” to numeric values. Combining the salary datasets provided the yearly pay of every NBA player since 1985. Using the Pandas merge function, I built a complete year-by-year data frame with both salaries and player statistics. Once the data frame was complete, I wanted to evaluate trends in the market to get a logical understanding of the data.
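The cleaning and merging steps described here can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`player`, `season`, `pos`, `pts`, `salary`), the sample rows, and the comma-formatted salary strings are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical stand-ins for the statistics and salary CSV files.
stats = pd.DataFrame({
    "player": ["A. Guard", "B. Center", "C. Forward"],
    "season": [2019, 2019, 2019],
    "pos": ["PG", "C", "PF"],
    "pts": [21.4, None, 8.9],  # None mimics a missing (NA) value
})
salaries = pd.DataFrame({
    "player": ["A. Guard", "B. Center", "C. Forward"],
    "season": [2019, 2019, 2019],
    "salary": ["$12,500,000", "$2,000,000", "$898,310"],
})

# Replace missing statistics with 0 so model training does not fail on NA.
stats = stats.fillna(0)

# Strip "$" (and, by assumption, thousands separators), then convert to numbers.
salaries["salary"] = (
    salaries["salary"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

# Join statistics and pay on player and season to get one row per player-year.
merged = stats.merge(salaries, on=["player", "season"], how="inner")
print(merged)
```

An inner join keeps only player-seasons present in both sources; whether the original merge used an inner or another join type is not stated in the article.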
Exploratory Data Analysis
An exploratory data analysis examines a dataset’s main characteristics using graphs and charts. For the full dataset used in this project, I wanted to understand the distribution of positions and the growth of salaries in the NBA.
The distribution of positions offered insight into the players in the dataset and a starting point for exploring the relationship between salary and position. In the data, several players had two primary positions, such as forward/guard or forward/center; they are not considered in the rankings here. From the graph above, the largest position group is shooting guards, followed by small forwards. The smallest position group is point guards, followed by centers. With this distribution in hand, I was curious to see which of these positions had the highest median salary.
Power forwards have the highest median salary at $2.06 million a year, while point guards have the lowest. While you see guards, such as Ben Simmons, sign contracts that pay over $35 million a year, they are not the typical player at their position. Outstanding players deserve the pay they earn. The median pay for backcourt players is substantially less than for frontcourt players, a difference of roughly $404,000. The position counts show that there are more frontcourt players in the league than backcourt players, and I read the pay gap through the law of supply and demand: the larger the number of players available, the lower their median salary will be. After considering these findings, it’s essential to understand the growth of salaries throughout the years.
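The positional comparison above boils down to a group count and a group median. Here is a small sketch of that calculation; the salaries (in millions of dollars) are made-up placeholders, and the split of PG/SG into backcourt and SF/PF/C into frontcourt is my assumption about how the groups were formed.

```python
import pandas as pd

# Placeholder data: salaries in millions of dollars, not real figures.
df = pd.DataFrame({
    "pos": ["PG", "PG", "SG", "SG", "PF", "PF", "C", "C"],
    "salary": [1.0, 2.0, 1.5, 2.5, 2.0, 3.0, 1.0, 3.4],
})

# How many player-seasons fall in each position group.
counts = df["pos"].value_counts()

# Median salary per position, highest first.
median_by_pos = df.groupby("pos")["salary"].median().sort_values(ascending=False)

# Assumed grouping: guards are backcourt, forwards and centers are frontcourt.
court = {"PG": "backcourt", "SG": "backcourt",
         "SF": "frontcourt", "PF": "frontcourt", "C": "frontcourt"}
median_by_court = df.groupby(df["pos"].map(court))["salary"].median()
gap = median_by_court["frontcourt"] - median_by_court["backcourt"]

print(counts)
print(median_by_pos)
print(f"Frontcourt minus backcourt median: {gap:.2f}M")
```

The median is used rather than the mean so that a handful of very large contracts does not dominate the comparison.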
The data set contains data from approximately 1980, and since then, NBA player median salaries have grown more than 500%. The increase has come from growing interest in the NBA. While there is evidence that the NBA’s viewership has declined, the league has shown an appeal to a “larger overall audience than the NFL, especially with younger fans” (Raphael, 2019). In the 2018–2019 season, the NBA set a record for sold-out games for the fifth year in a row, with 760. This trend suggests that the league will continue to grow and that salaries will grow with it. This exploration of the NBA’s salary growth and positional salaries yielded interesting findings, especially the evidence of supply and demand at work in the league. With the exploratory data analysis concluded, the evaluation models were built.
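A growth figure like the one quoted above can be computed from the median salary of the first and last seasons in the data. The sketch below uses invented salaries for two seasons only, so the number it prints is illustrative and not the article's result; the column names are assumptions.

```python
import pandas as pd

# Made-up placeholder salaries for two seasons.
df = pd.DataFrame({
    "season": [1985, 1985, 1985, 2019, 2019, 2019],
    "salary": [200_000, 300_000, 900_000, 1_000_000, 2_000_000, 30_000_000],
})

# Median salary per season; groupby returns the seasons in ascending order.
median_by_season = df.groupby("season")["salary"].median()

# Percentage growth from the earliest season to the latest one.
first = median_by_season.iloc[0]
last = median_by_season.iloc[-1]
growth_pct = (last - first) / first * 100

print(f"Median salary growth: {growth_pct:.1f}%")
```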
Building the Model
The purpose of this experiment is to evaluate past salaries using a machine learning algorithm. To accomplish this, the models should only use the statistics most highly correlated with salary: first determine those relationships, then train the models to create projections from only those values. The two models used in this project were logistic and linear regression, chosen on the theory of a direct relationship between performance and pay. The first step, however, was determining the most highly correlated values.
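A minimal sketch of that screening step, assuming the 0.4 cut-off on the correlation with salary that this article settles on. The data is synthetic and the column names (`pts`, `min`, `fouls`) are placeholders; the real data frame has many more statistics.

```python
import numpy as np
import pandas as pd

# Synthetic data: salary depends on points, minutes track points, fouls are noise.
rng = np.random.default_rng(0)
n = 200
pts = rng.normal(10, 5, n)
df = pd.DataFrame({
    "pts": pts,
    "min": pts * 2 + rng.normal(0, 3, n),
    "fouls": rng.normal(2, 1, n),
    "salary": pts * 1e5 + rng.normal(0, 2e5, n),
})

THRESHOLD = 0.4  # correlation cut-off used in the article

# Pearson correlation of every statistic with salary.
corr_with_salary = df.corr()["salary"].drop("salary")

# Keep only the statistics whose correlation clears the threshold.
selected = corr_with_salary[corr_with_salary > THRESHOLD].index.tolist()

print(corr_with_salary.round(2))
print("Selected features:", selected)
```

The full `df.corr()` matrix is also what a heatmap would be drawn from; the plotting call is left out here to keep the sketch short.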
According to this heatmap, some variables had stronger correlations with salary than others. Based on past experiments, such as my NBA Playoff predictions, I sought correlations above 0.4. This is a weaker threshold than I would prefer, but the highest correlation, for points per game, is only 0.52. I therefore expected the categories that clear this threshold to yield the best available projections; they are shown in the heatmap below:
After filtering to only these variables, the data frame is 12,749 rows by 17 columns. Following standard data analytics practice, the data was split into training and testing sets. Linear and logistic regression models then produced each player’s projected fair-value salary for the 2019–2020 season.
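The split-and-fit step can be sketched as below for the linear model. Everything specific here is an assumption: the synthetic features, the 80/20 split, and the random seeds are not taken from the article. Only the linear regression is shown, because scikit-learn's `LogisticRegression` is a classifier that treats each distinct target value as a class label, and the article does not say how salaries were passed to it.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the filtered data frame of selected statistics.
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    "pts": rng.uniform(0, 30, n),
    "min": rng.uniform(5, 38, n),
})
y = 250_000 * X["pts"] + 50_000 * X["min"] + rng.normal(0, 500_000, n)

# Hold out 20% of the rows for testing (the ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training rows, then project salaries for the held-out rows.
model = LinearRegression().fit(X_train, y_train)
projected = model.predict(X_test)

# R^2 on unseen rows gives a rough sense of how well the fit generalizes.
r2 = model.score(X_test, y_test)
print(f"Test R^2: {r2:.3f}")
```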
Last season, the NBA had 353 players under contract. As you can see in the table above, several players are undervalued in their current contracts, and several are overvalued. At the beginning of the project, I created three optimization questions: which players are fairly paid, which players are underpaid, and which players are overpaid? First, the fairly paid.
I set the parameters for fairly paid as a valuation % between -5% and 5%, where valuation % measures the percentage by which a player is overpaid or underpaid. According to the logistic regression model, several players were fairly paid last year. Damian Lillard and James Harden were the two standouts on this list, as they have the group’s largest contracts. However, in the linear regression model, the closest to fairly paid were Reggie Bullock, Mikal Bridges, and Justin Holiday.
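The valuation % and the three labels can be illustrated with a short sketch. The exact formula is not spelled out in the article, so the one used here, (actual - projected) / projected × 100 with positive values meaning overpaid, is an assumption, as are the player names and salary figures.

```python
import pandas as pd

# Hypothetical players with actual and model-projected salaries.
df = pd.DataFrame({
    "player": ["Player A", "Player B", "Player C"],
    "actual": [10_000_000, 5_000_000, 2_000_000],
    "projected": [9_800_000, 2_000_000, 4_000_000],
})

# Assumed definition: how far actual pay sits above (+) or below (-) the projection.
df["valuation_pct"] = (df["actual"] - df["projected"]) / df["projected"] * 100


def label(pct, band=5.0):
    """Classify a valuation % using the article's +/-5% fair-pay band."""
    if pct > band:
        return "overpaid"
    if pct < -band:
        return "underpaid"
    return "fairly paid"


df["label"] = df["valuation_pct"].apply(label)

# Sorting descending puts the most overpaid first; ascending gives the most underpaid.
most_overpaid = df.sort_values("valuation_pct", ascending=False)
print(most_overpaid[["player", "valuation_pct", "label"]])
```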
Next, I sorted the valuation % to view the most overpaid players last year. According to the linear model, the top three overpaid players were Matthew Dellavedova, Evan Turner, and Nicolas Batum. Respectively, their contracts were 1135.37%, 773.43%, and 763.43% above their projected performance-based salaries. The logistic regression views three stars as the most overpaid: Steph Curry, Russell Westbrook, and Blake Griffin. In hindsight, the model’s valuation does not take injuries into account. This does not invalidate the entire set of projections, but Curry played only five games last year, which explains his low valuation. Likewise, Blake Griffin played only 18. However, Westbrook played 57 games and was given a projected salary of $1.1 million instead of his actual $40.2 million. Although the lack of injury consideration causes some issues, the model does provide a fair evaluation of “stars” who are more name than play.
Finally, by reversing the sort used for the overpaid data frame, I found the most underpaid players under contract last year. According to the logistic regression model, the three most undervalued players are Chris Chiozza, Antonius Cleveland, and Yuta Watanabe. All three were essentially paid the league minimum at $79,568 and were valued at $1 million in yearly wages. However, the linear model produced either a mistake or an eye-opening projection: three players, Cristiano Felicio, Joe Chealey, and Josh Gray, were projected to have negative salaries. These findings could mean that they should not have been under contract. Outside of these three, the most undervalued player was Drew Eubanks. He’s worth 97.51% more than his current yearly salary. He should be a hot commodity on the trade or free-agent market for any team looking for the most bang for its buck at center.
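One possible explanation for the negative projections, offered as an illustration and not as a diagnosis of the original model: a linear regression is unbounded, so a fitted line with a negative intercept will project below zero for players with very low statistics. The toy numbers below are invented to make that visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on the line salary = 500,000 * pts - 2,000,000.
pts = np.array([[5.0], [10.0], [20.0], [30.0]])
salary = np.array([500_000, 3_000_000, 8_000_000, 13_000_000], dtype=float)

model = LinearRegression().fit(pts, salary)

# A deep-bench player averaging 2 points falls below the range seen in training.
bench_player = np.array([[2.0]])
projection = model.predict(bench_player)[0]

print(f"Intercept: {model.intercept_:,.0f}")
print(f"Projected salary at 2 points per game: {projection:,.0f}")
```

If this is what happened, clipping projections at the league minimum, or modeling the logarithm of salary, would be two common ways to keep outputs positive.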
The purpose of this project was to generate valuable projections of NBA players’ fair value. After gathering, cleaning, and exploring the data, linear and logistic regression models were built to produce these valuations using Python and scikit-learn. The models answered three questions: who is paid fairly, who is overpaid, and who is underpaid? Answering these questions in a professional front office is how teams can build their rosters efficiently and stay within the NBA’s cap constraints. By doing so, NBA teams ensure they get the most bang for their buck.
Please feel free to leave comments on this article. I’d love to hear your thoughts and ideas!
Miller, K. (2018, August 7). How NBA Free Agency, Salary Cap Work. Bleacher Report. https://bleacherreport.com/articles/2787871-how-nba-free-agency-salary-cap-work
Salary Cap Rumors | Hoops Rumors. (2017). Hoopsrumors.com. https://www.hoopsrumors.com/salary-cap
Raphael, A. (2019, September 4). Opinion | The NBA’s popularity is rising above the NFL. The Breeze. https://www.breezejmu.org/sports/opinion-the-nbas-popularity-is-rising-above-the-nfl/article_4719537e-cf5e-11e9-a5d8-07af5ad436c6.html