Predicting the NBA MVP with Machine Learning

Building a machine learning model to predict the NBA MVP and analyze the most impactful variables.

Gabriel Pastorello
Towards Data Science


(Photo by Diane Picchiottino on Unsplash)

Every season, there is a huge discussion about the NBA’s Most Valuable Player, the biggest individual award a basketball player can receive. And it is hard to explain the criteria for this award to someone who is not acquainted with the sport.

Something that already confuses some people is that the MVP is not an award for the best player, but for the most valuable player in the regular season. Then the question arises: what does it mean to be more valuable?

It means the player with the biggest positive impact on his team. That means the team’s performance is also a variable that influences this individual award, since the team needs a good record to corroborate this impact. For that, the team needs a good supporting roster to help the MVP, since he can’t win every game by himself. However, the roster can’t be too good, because then the team wouldn’t need the MVP to reach a good seed.

You can already notice how confusing and subjective this award is.

With this in mind, I decided to apply Machine Learning techniques to look for patterns in the logic of the MVP choice, verifying which statistics matter most, and whether it would be possible to create a model that, at the end of the 2022–23 regular season, could predict the MVP before the official result is released.

All the code and data used are available on GitHub.

This text is a translation of the original article written in Portuguese, available here.

Understanding the problem

Before anything, it is important to understand how the award’s voting system works, in order to approach the data in the best way possible.

We want the answer to the question: Who is going to be the NBA MVP?

So, one might think this is a classification problem, where the options are MVP or not-MVP. However, this approach would run into several problems, since the number of not-MVP players is much larger than the number of MVP players, making training and evaluation of the model difficult.

Therefore, we should approach this as a regression problem. But which variable should be the target? Let’s understand how the voting works to decide.

Currently, voting for the MVP award is done by media members not affiliated with teams or players, where each voter ranks five players: first place (10 points), second place (7 points), third place (5 points), fourth place (3 points) and fifth place (1 point).

The player with the most points is chosen as the MVP. However, total points is not what we will use in the regression, since the number of voters can vary between years, which changes the total points possible. To eliminate this problem, the variable we will try to predict is the MVP Share.

The MVP Share is nothing more than the points obtained by a player divided by the maximum total points possible that year, so it ranges from 0 to 1.

Formula for MVP Share (Image by Author)

Thus, the player with the highest MVP Share will be defined as MVP by the model.
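In code, the calculation is a one-liner; here is a minimal sketch (the vote counts below are illustrative, not real ballot data):

```python
# MVP Share: a player's award points divided by the maximum points
# possible that year (number of voters x 10, i.e. a unanimous 1st place).
def mvp_share(player_points: int, num_voters: int) -> float:
    """Return the MVP Share, a value between 0 and 1."""
    max_points = num_voters * 10
    return player_points / max_points

# Example: 100 voters, a player receives 875 points.
print(mvp_share(875, 100))  # 0.875
```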

Data

The individual statistics used were:

  • Average statistics per game
  • Total statistics
  • Advanced statistics

Also used were:

  • Placement (seed) and team winning percentage
  • Voting Result for MVP (MVP Share)

In total, this study had 71 variables for each player.

The answer to the previous question, “What does it mean to be more valuable?”, has yet another degree of variability: it tends to change over time. Using data from very old seasons ends up hurting predictions for the most recent ones, because what was thought and discussed about the MVP award is no longer exactly the same.

As the objective will be to predict the MVP for the 2022–23 season, data from the 2006–07 season onwards (through 2021–22) will be used for this study, totaling 16 seasons.

The data cited above were collected for all players who played at least 1 minute in the season, and were taken from Basketball Reference.

Data Treatment

Even framed as a regression problem, the number of players with a non-zero MVP Share is still small, normally 10 to 14 per season, out of approximately 450 players. This characterizes an imbalanced dataset.

Therefore, in order to achieve good model performance, we need to make some decisions. One possibility would be to select only those players who are in the well-known MVP Race, an unofficial 10-player ranking updated during the season. Or, only those with a non-zero MVP Share (for training only, since in a real case we wouldn’t have this information).

However, this would in a way do part of the model’s work for it, pre-selecting the best players and the most likely winners. Since the idea of this study is for the model to pick the MVP from among all possible players, without any kind of external help, that is not what we will do.

So how do we narrow down the number of players, filtering them while making sure that no one who might win the award is removed? At this stage, we see the importance of domain knowledge to steer the project in the best possible way.

Imagine this: a player averaging only 10 points per game is not going to win MVP, right? So that player can be removed from the base without any problem.

MVP stats from the last 16 seasons (Image by Author)

Or, a player who only played 30 out of 82 games, or who didn’t make the playoffs, and so on. Thus, we can establish some minimum criteria for players to be considered.

With that in mind, I conducted a survey of the lowest averages and statistics by an MVP in history to establish these minimum criteria (StatMuse made this process a lot easier):

  • Karl Malone was MVP in 98–99 with 49 games
  • Wes Unseld was MVP in 68–69 with 13.8 PPG and 10.9 FGA
  • Steve Nash was MVP in 04–05 with 3.3 REB
  • Moses Malone was MVP in 82–83 with 1.3 AST
  • Bob Cousy was MVP in 56–57 with 37.8% FG%
  • Giannis Antetokounmpo was MVP in 19–20 with 30.4 min
  • Kareem Abdul-Jabbar (1975–76) was the only MVP whose team missed the playoffs
  • Dave Cowens was MVP in 72–73 with a PER of 18.1
  • No MVP has ever been traded midway through the season in which he won the award

Using these values as a baseline, we can filter out the vast majority of players, leaving only 20 to 30 per season. This is achieved with almost universal premises, without giving up impartiality.
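These minimum criteria translate directly into a boolean filter. A minimal sketch with pandas, using illustrative column names and made-up rows (the real dataset on GitHub may name columns differently):

```python
import pandas as pd

# Toy stand-in for the real player table; column names are hypothetical.
players = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "G":   [70, 40, 65],        # games played
    "PTS": [28.1, 25.0, 9.5],   # points per game
    "TRB": [8.0, 5.0, 3.0],     # rebounds per game
    "AST": [6.0, 2.0, 1.0],     # assists per game
    "FG%": [0.52, 0.45, 0.40],  # field-goal percentage
    "MP":  [35.0, 33.0, 20.0],  # minutes per game
    "PER": [27.0, 20.0, 12.0],  # player efficiency rating
})

# Thresholds taken from the historical minimums listed above.
mask = (
    (players["G"] >= 49)
    & (players["PTS"] >= 13.8)
    & (players["TRB"] >= 3.3)
    & (players["AST"] >= 1.3)
    & (players["FG%"] >= 0.378)
    & (players["MP"] >= 30.4)
    & (players["PER"] >= 18.1)
)
candidates = players[mask]
print(candidates["Player"].tolist())  # only player "A" survives the filter
```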

These filters will make it much easier to evaluate the performance of the model, both in training and in test.

Now, with our database ready, we can start creating the model.

Modeling

Initially, several regression models were used:

  • Support Vector Machines (SVM)
  • Elastic Net
  • Random Forest
  • AdaBoost
  • Gradient Boosting
  • Light Gradient Boosting Machine (LGBM)
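As a sketch, the candidate regressors can be instantiated as follows; the first five come from scikit-learn, while LGBMRegressor lives in the separate `lightgbm` package (defaults shown here, not the tuned hyperparameters used in the study):

```python
from sklearn.svm import SVR
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import (
    RandomForestRegressor,
    AdaBoostRegressor,
    GradientBoostingRegressor,
)
# from lightgbm import LGBMRegressor  # requires `pip install lightgbm`

models = {
    "SVM": SVR(),
    "Elastic Net": ElasticNet(),
    "Random Forest": RandomForestRegressor(random_state=0),
    "AdaBoost": AdaBoostRegressor(random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    # "LGBM": LGBMRegressor(random_state=0),
}
print(sorted(models))
```

Each model can then be fit on the same feature matrix and compared on the same metrics.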

The performance of each one of them was evaluated using the root mean squared error (RMSE) and the coefficient of determination (R²).

Formulas for RMSE (left) and R² (right) (Image by Author)

RMSE is the square root of the mean of the squared errors between actual values and predictions. It is widely used because it punishes large errors (by squaring) while remaining in the same unit as the variable of interest (by taking the root). The lower its value, the better.

However, being in the same units as the target can make it difficult to define what a good or acceptable RMSE value would be. Therefore, we will also use R² as a complementary metric.

R² represents the proportion of variance that was explained by the model, typically ranging from 0 to 1. That is, the closer to 1, the greater the variability of the data that can be explained by the model.
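Both metrics are available in scikit-learn. A quick sketch with made-up predictions (the MVP Share values here are illustrative):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([0.90, 0.60, 0.30, 0.05, 0.00])  # actual MVP Shares
y_pred = np.array([0.85, 0.55, 0.35, 0.10, 0.02])  # model predictions

# RMSE: square root of the mean squared error, in MVP Share units.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
# R²: proportion of the variance in y_true explained by the predictions.
r2 = r2_score(y_true, y_pred)
print(f"RMSE = {rmse:.4f}, R2 = {r2:.4f}")
```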

Results

At first, to better assess the capabilities of each model, the 2021–22 season was set aside for testing, while the remaining 15 were used as training.

With this, the RMSE and R² values of each model were obtained, and the optimized hyperparameters were defined for the rest of the study.
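The evaluation scheme above can be sketched as a hold-one-season-out split: train on all seasons except the most recent, predict MVP Share for the held-out season, and take the player with the highest prediction as the model’s MVP. The dataset and feature names below are toy stand-ins for the real data:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Toy stand-in for the real dataset: one row per filtered player.
df = pd.DataFrame({
    "season": np.repeat([2020, 2021, 2022], 25),
    "per": rng.normal(22, 4, 75),       # player efficiency rating
    "seed": rng.integers(1, 11, 75),    # team placement
})
# Synthetic target loosely imitating the MVP Share pattern.
df["mvp_share"] = np.clip((df["per"] - 18) / 20 - df["seed"] * 0.01, 0, 1)

test_season = 2022
train = df[df["season"] != test_season]
test = df[df["season"] == test_season]
features = ["per", "seed"]

model = RandomForestRegressor(random_state=0)
model.fit(train[features], train["mvp_share"])
pred = model.predict(test[features])

rmse = np.sqrt(mean_squared_error(test["mvp_share"], pred))
# The predicted MVP is the player with the highest predicted share.
predicted_mvp = test.index[pred.argmax()]
print(f"RMSE={rmse:.3f}, predicted MVP row: {predicted_mvp}")
```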

At the end of this process, the following results were obtained:

RMSE and R² values obtained among the different models for the 2021–22 season (Image by Author)

SVM, Random Forest and LGBM, in that order, obtained the lowest RMSE and highest R² values. AdaBoost and Gradient Boosting also had adequate results, while Elastic Net had the worst metrics among the models used.

Now, looking at the final ranking, a very positive point: 5 of the 6 models picked Nikola Jokić as MVP:

Top 3 of the MVP dispute among the different models for the 2021–22 season (Image by Author)

Gradient Boosting was the only one that ranked Giannis Antetokounmpo as MVP. AdaBoost placed Luka Dončić in 3rd (he was 5th).

Other than that, all the others got the Top 3 players right. However, the models placed Giannis 2nd and Joel Embiid 3rd in the dispute, when the real ranking was the opposite.

This can be explained by the fact that, despite Giannis having an excellent statistical season with a good team performance (the Bucks finished with the same record as the 76ers, ahead on the tiebreaker criteria), he suffered from what is known as voter fatigue.

Giannis has won MVP twice recently, and both the public and the voters tend to prefer, even unconsciously, an emerging contender (like Embiid) over a player who has already won the award (especially more than once).

LeBron has already suffered from this, and anticipating what will happen next season: it would take a historic performance from Jokić, the winner of the last 2 awards, to win the MVP for the third time in a row.

Variables Analysis

One way to visualize which variables mattered most in the models’ predictions is with SHAP values (SHapley Additive exPlanations). SHAP is a game-theory-based technique that explains reasonably well how much each variable affects the model’s predictions.

Let’s take the SVM model as an example, which obtained the best metrics:

SHAP chart related to the SVM model (Image by Author)

This chart has a lot of information, so let’s understand it:

On the Y axis we have the 20 variables with the greatest impact on the SVM model. Each point represents a player, and for each of these variables, there is a color gradient from red to blue, where blue means a low value and red a high value for the variable in question.

On the X axis are the obtained SHAP values. The more to the right, the greater the positive impact of that variable for the variable of interest (MVP Share). The more to the left, the greater the negative impact of the variable for the MVP Share.

For example, for the Seed variable: high values of Seed (bad placement) negatively affect MVP Share, while low values (good placement) affect it positively, which makes sense.

With this chart, we can draw some important conclusions:

  • The three most impactful individual variables are: PER (Player Efficiency Rating), WS/48 (Win Shares per 48 minutes) and BPM (Box Plus-Minus)
  • The two collective variables, Seed and PCT (percentage of wins) are among the 6 most impactful variables (2nd and 6th, respectively)
  • 7 of the 11 most important variables are advanced statistics
  • Of the three most popular statistics (points, rebounds, and assists per game), only points per game appears on the list, at 16th place.

It is interesting to observe the predominance of advanced over common stats, as they separate the good performance of players much better, regardless of the position they play.

For example, a center tends to have a much higher rebound average than a point guard, who in turn tends to have a much higher assist average. Therefore, the model cannot distinguish an MVP player as well with these variables alone.

Win Shares, for example, a statistic that seeks to measure each player’s credit for the team’s wins, proves much more effective in this sense (how this statistic is calculated can be found here).

PER, the Player Efficiency Rating, is the most impactful variable in the model. It is a number that seeks to measure a player’s per-minute productivity, and it is one of the best statistics available today. We can see that the model greatly valued high-PER players in the red dots on the right. More details on the calculation can be found here.

Another interesting point is the relationship found by the model in TOV (turnovers). Intuitively, we imagine that a player with a high number of turnovers per game would tend to have a lower chance of being MVP, right?

However, as an MVP usually has the ball in his hand a lot, his number of turnovers per game tends to be considerably greater than zero, a behavior that was captured by the model, contrary to common sense.

Additional Results

With the parameters already optimized on the most recent season and the most important statistics analyzed, the purpose of this study was partially complete, waiting only for the end of the 2022–23 NBA regular season (in April 2023) so that the models could predict the MVP.

But why not explore them a bit more and see how they perform on older seasons?

Results obtained for all seasons since 2006–07 (Image by Author)

In terms of determining the MVP, the models perform well: all of them correctly picked 12 of the last 16 MVPs, with the exception of Random Forest, which got only 10 of 16 right.

The drop in the RMSE and R² metrics is also significant, one cause being the natural variation of the award’s criteria over time, as mentioned previously.

Average RMSE and R² values obtained among the different models for the last 16 seasons (Image by Author)

In only 3 seasons did most models miss the MVP (fewer than 3 of the 6 got it right): 2016–17, 2010–11 and 2007–08.

Perhaps the most surprising of those is the 2016–17 season, when Russell Westbrook became the first player since Oscar Robertson in 1961–62 to average a triple-double, leading OKC to the playoffs after Kevin Durant’s departure. Whether the award should have gone to him is still debated, but the models chose LeBron James, Kawhi Leonard and James Harden, with only one picking Westbrook as MVP. One of the factors may be the team’s low seed (10th in the league), which, as we have seen, is one of the most impactful variables on the result.

The same happened with Derrick Rose in 2010–11, but this time the four models that got it wrong all converged on LeBron James to win the award.

Kobe Bryant’s 2007–08 was the only season in which no model got the MVP right: they all picked LeBron again.

In my opinion, that says a lot more about LeBron than about the other players. It shows that by the criterion of most valuable player, he is always in contention, and probably should have won the award more times.

(Photo by Howard Bouchevereau on Unsplash)

Conclusions

What was supposed to be just a model creation for predicting the next MVP has become a very interesting study on the very definition and history of the award. I hope it was useful and has helped to improve the understanding of the award and of the Machine Learning tools used.

Making predictions about sports is always a difficult task. However, the results were very satisfactory, and the models proved to be effective for predicting the MVP, with emphasis on the SVM, with better performance in the most recent season, and Gradient Boosting, with better performance in the older seasons.

It is worth noting that there is always room for improvement. For example, more variables could be added, such as seasons of NBA experience, draft pick number, and physical information (height, weight, wingspan). A marking with the number of MVPs already won, for example, could help the issue of voter fatigue observed.

Again, all the code and bases used are available on GitHub.

In April 2023 we will have model updates with entirely new data.

Will it get it right? We’ll see…

(Photo by JC Gellidon on Unsplash)

Edit 1 (Apr/23): With the regular season over, we can finally predict the 2022–23 MVP before the official announcement!

Here are the results:

Top 3 of the MVP dispute among the different models for the 2022–23 season (Image by Author)

Surprisingly, all 6 models predict Nikola Jokić to win the MVP for the third time in a row!

However, I doubt that will happen, as the narrative is very strongly in Joel Embiid’s favor.

In my opinion, we are witnessing another case of voter fatigue. As I said earlier in the text, Jokić would need a historic season to win this MVP, mainly because the public is “tired” of seeing him win MVP without success in the Playoffs. I guess that almost averaging a triple-double wasn’t enough.

To summarize: unfortunately, I believe the models will get it wrong! But in my opinion, if the MVP award had criteria that were more objective and consistent over the years, Jokić would be the winner. Let’s wait for the final result.

Edit 2 (May/23): Indeed, that’s what happened. Joel Embiid was named MVP!

This MVP race has divided NBA fans (again), with strong opinions on both sides. This result only shows how difficult sports predictions are, highlighting their main challenge: how to quantify the non-trivial aspects of sports.

I’m always available on my channels (LinkedIn and GitHub).

Thanks for your attention!👏

Gabriel Speranza Pastorello
