Predicting the Outcome of NBA Games Using Machine Learning

James Duan
Published in Nerd For Tech · 8 min read · May 7, 2021

Topic: This blog is an extension of "Exploratory Data Analysis of Home Team Advantage in the NBA 2004–2020," which can be found here. The EDA found that assists, three-point field goal percentage, and field goal percentage had the most impact on why NBA home teams were winning nearly 60% of the time. Using these conclusions as the basis for further analysis, we will attempt to use game statistics to predict the outcome of NBA games through a variety of models.

Dataset: I used the same dataset as the EDA, which contains statistics for NBA games from 2004 to 2020. These statistics include points (PTS), assists (AST), rebounds (REB), field goal percentage (FG), and three-point field goal percentage (FG3), separated into home and away teams. Null data and outliers were removed, leaving us with over 23,000 games worth of data. Additional columns were appended for the home-minus-away differentials in points (PTS_dif), assists (AST_dif), rebounds (REB_dif), field goal percentage (FG_dif), and three-point field goal percentage (FG3_dif). Link to dataset can be found here or here. Refer to the EDA for more details on the data processing. Link to code can be found on my GitHub.

Examining Relationships Between Variables

Before doing any modeling, I first examined the relationships between my independent variables and my dependent variable. All of the predictor variables have at least a moderately positive correlation with PTS_dif; FG_dif has the strongest, with a correlation coefficient (r) of 0.769. Another good sign is that each variable is approximately normally distributed, apart from PTS_dif, which is a little more irregular. One point of note is that there appears to be some collinearity between AST_dif and FG_dif.
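This check is straightforward with pandas. Here is a minimal sketch of it on synthetic stand-in data (the generated coefficients and the resulting correlations are illustrative and won't match the real dataset):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the games dataset: one row per game with the
# differential columns described above (coefficients are hypothetical).
rng = np.random.default_rng(0)
n = 1000
fg_dif = rng.normal(0, 5, n)
ast_dif = 0.6 * fg_dif + rng.normal(0, 3, n)   # deliberately collinear with FG_dif
fg3_dif = rng.normal(0, 6, n)
reb_dif = rng.normal(0, 5, n)
pts_dif = 2.0 * fg_dif + 0.5 * fg3_dif + 0.3 * reb_dif + rng.normal(0, 6, n)

games = pd.DataFrame({
    "AST_dif": ast_dif, "REB_dif": reb_dif,
    "FG_dif": fg_dif, "FG3_dif": fg3_dif, "PTS_dif": pts_dif,
})

# Pearson correlation of each predictor with the target
print(games.corr()["PTS_dif"].drop("PTS_dif").round(3))
```

On the real data, a pairplot or correlation heatmap over the same frame makes the AST_dif/FG_dif collinearity visible at a glance.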

Model 1: Multiple Regression

Based on the variables and data that we have, I will first build a multiple regression model to predict the PTS_dif for an NBA game given the AST, REB, FG, and FG3 differentials. Because I concluded in the EDA that REB_dif did not make a huge difference in winning, I will first construct a regression model using only AST_dif, FG_dif, and FG3_dif. The data is split into 70% training and 30% testing to measure model accuracy. This model has an R-squared score of 0.65 with a mean absolute error (MAE) of 6.21 (note that we use mean absolute error instead of root mean square error as a measure of accuracy because it is more consistent for larger data sets and less sensitive to outliers). In other words, the predictor variables explain only about 65% of the variance of PTS_dif, and on average the model misses the prediction by 6.21 points. Adding REB_dif as another independent variable increases the R-squared to 0.71 and lowers the MAE to 5.65, so with the multiple regression model, REB_dif actually improves the accuracy in predicting PTS_dif. The model that includes all the predictors explains about 71% of the variance of PTS_dif while missing the prediction by 5.65 points on average. The residual plot for this model looks good, with no signs of heteroscedasticity, and the distribution of the errors appears approximately normal.
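The fit-and-evaluate loop described above can be sketched with scikit-learn. This uses synthetic stand-in data with hypothetical coefficients, so the R-squared and MAE printed here are illustrative, not the 0.71 and 5.65 from the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ~23,000-game dataset (coefficients are made up)
rng = np.random.default_rng(42)
n = 23000
X = pd.DataFrame({
    "AST_dif": rng.normal(0, 4, n),
    "REB_dif": rng.normal(0, 5, n),
    "FG_dif": rng.normal(0, 5, n),
    "FG3_dif": rng.normal(0, 6, n),
})
y = (2.0 * X["FG_dif"] + 0.6 * X["FG3_dif"] + 0.5 * X["REB_dif"]
     + 0.3 * X["AST_dif"] + rng.normal(0, 6, n))

# 70/30 train/test split, as in the post
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print(f"R^2: {r2_score(y_test, pred):.2f}")
print(f"MAE: {mean_absolute_error(y_test, pred):.2f}")
```

Dropping REB_dif is just a matter of fitting on `X_train[["AST_dif", "FG_dif", "FG3_dif"]]` and comparing the two scores.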

Overall, the multiple regression model does an average job of predicting the PTS differential of an NBA game. Based on the R-squared, it seems the model is underfitting and missing other variables that could account for the variation in PTS_dif, such as turnovers or fouls. However, since the NBA does not provide these statistics in an accessible manner, it is very difficult to incorporate them into my model.

Model 1B: Multiple Regression Continued

Because I am unable to add more predictor variables, I attempted to improve the model by only considering games that were not close (absolute PTS_dif greater than 4). Doing so improves the R-squared to 0.75 at the cost of increasing the MAE to 5.83, which makes sense because games with a larger PTS differential also produce larger errors. When I use this model to predict PTS_dif for games that were close (PTS_dif within 4), the results are extremely poor: an R-squared of 0.09 with an MAE of 2.35 (the MAE is small because the range of PTS_dif is much smaller for games within 4 points). Based on these observations, this multiple regression model is not very good at predicting the PTS differential of close NBA games, and only marginally better for games that are not close. Going forward, it might be more practical to predict the outcome (win/loss) of a game instead of the PTS_dif.
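The blowout/close split is a pair of boolean masks. A sketch of the train-on-blowouts, score-on-close-games experiment, again on synthetic stand-in data (so only the MAE is reported here; the specific values in the post come from the real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in data with hypothetical coefficients
rng = np.random.default_rng(3)
n = 5000
X = pd.DataFrame({"FG_dif": rng.normal(0, 5, n), "FG3_dif": rng.normal(0, 6, n)})
y = 2.0 * X["FG_dif"] + 0.6 * X["FG3_dif"] + rng.normal(0, 6, n)

# Games decided by more than 4 points ("not close")
not_close = y.abs() > 4

# Fit only on the blowouts...
model = LinearRegression().fit(X[not_close], y[not_close])

# ...then evaluate on the close games it never saw
pred_close = model.predict(X[~not_close])
print(f"MAE on close games: {mean_absolute_error(y[~not_close], pred_close):.2f}")
```

Note that R-squared collapses on the close-game subset because the target's variance is tiny there, which is exactly why the small MAE in the post is misleading on its own.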

Model 2: Logistic Regression to Predict Outcome

For the logistic regression model, I began by establishing a baseline accuracy of 0.596 by simply guessing the most frequent outcome (home win). I used HOME_TEAM_WINS as my target variable, which takes a value of 1 for a home win and 0 for a home loss. Since all four predictors proved useful in the multiple regression, I will use AST_dif, REB_dif, FG_dif, and FG3_dif as independent variables. Just like with Model 1, the data is split into 70% training and 30% testing to measure model accuracy. Applying the logistic regression model to the test data gives an accuracy score of 0.836, which is very good compared to the baseline accuracy of 0.596. The recall and F1 scores on the classification report reveal that this model is significantly more accurate at predicting home wins than home losses. This trend is backed up by the confusion matrix, which has proportionally more false positives than false negatives. The model is biased towards home wins, which makes sense because the EDA revealed that the home team won nearly 60% of their games.
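The baseline-then-classifier workflow looks like this in scikit-learn. The data here is a synthetic stand-in with a positive intercept playing the role of home-court advantage, so the printed baseline and accuracy are illustrative rather than the 0.596 and 0.836 from the real games:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in; HOME_TEAM_WINS is 1 for a home win, 0 for a home loss
rng = np.random.default_rng(42)
n = 23000
X = pd.DataFrame({
    "AST_dif": rng.normal(0, 4, n),
    "REB_dif": rng.normal(0, 5, n),
    "FG_dif": rng.normal(0, 5, n),
    "FG3_dif": rng.normal(0, 6, n),
})
# Home-court edge baked in via the positive intercept (hypothetical weights)
logit = (0.4 + 0.35 * X["FG_dif"] + 0.1 * X["FG3_dif"]
         + 0.06 * X["REB_dif"] + 0.04 * X["AST_dif"])
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Baseline: always predict the most frequent class (home win)
baseline = max(y_train.mean(), 1 - y_train.mean())
print(f"Baseline accuracy: {baseline:.3f}")

clf = LogisticRegression().fit(X_train, y_train)
pred = clf.predict(X_test)
print(f"Model accuracy: {accuracy_score(y_test, pred):.3f}")
print(confusion_matrix(y_test, pred))  # rows: true loss/win, cols: predicted
```

`sklearn.metrics.classification_report(y_test, pred)` gives the per-class precision, recall, and F1 discussed above.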

Overall, the logistic regression model, despite the bias, does a really good job of predicting the outcome of NBA games. However, before further exploring this model, I want to consider some other classifiers and see if there is another model worth using.

Model 3: Classifiers

Before using other classifiers to try to predict the outcome of the games, I standardized the data to ensure more consistency across the models. The standardized data was then split into 70% training and 30% testing data. The models I used included a generic bagging classifier, a random forest classifier, an adaptive boosting (AdaBoost) classifier, and finally a voting ensemble classifier. The accuracy scores for the models are shown below.
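A sketch of this comparison with scikit-learn's ensemble module, on the same kind of synthetic stand-in data as before (default hyperparameters throughout; the real runs may have been tuned differently):

```python
import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 4 differential features, hypothetical weights
rng = np.random.default_rng(0)
n = 4000
X = rng.normal(0, 5, (n, 4))
logit = 0.4 + X @ np.array([0.04, 0.06, 0.35, 0.1])
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardize so all models see features on the same scale
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "bagging": BaggingClassifier(random_state=1),
    "random_forest": RandomForestClassifier(random_state=1),
    "adaboost": AdaBoostClassifier(random_state=1),
}
# Soft-voting ensemble over the tree models plus a logistic regression
models["voting"] = VotingClassifier(
    estimators=[("lr", LogisticRegression())]
               + [(name, m) for name, m in models.items()],
    voting="soft")

for name, model in models.items():
    print(f"{name}: {model.fit(X_train, y_train).score(X_test, y_test):.3f}")

# Feature-importance comparison between the two boosted/bagged tree models
print("RF importances:      ", models["random_forest"].feature_importances_.round(3))
print("AdaBoost importances:", models["adaboost"].feature_importances_.round(3))
```

The `feature_importances_` arrays are what support the observation below about how the random forest and AdaBoost weight the features differently.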

Using the same variables, none of the other classifiers tested exceeded the accuracy of the logistic regression. Something interesting of note: both tree-based models ranked FG_dif as the most important feature, with FG3_dif second, but AdaBoost gave more weight to REB_dif while the random forest gave more importance to AST_dif. AdaBoost, the more accurate of the two, assigned smaller weights to FG_dif and AST_dif and higher weights to FG3_dif and REB_dif. I noted this because it could potentially inform improvements to the logistic regression model.

Final Model: Logistic Regression Continued

Because none of the other classifiers were able to match the accuracy of Model 2, I decided to further explore the potential of the logistic regression model (sometimes the simplest model achieves the best results). One problem previously identified with the logistic regression model was the large proportion of false positives. To resolve this issue, and hopefully improve the accuracy of the model, I normalized and resampled the data. The original training data had about a 4:6 ratio of home losses to home wins, which is the main reason for the false positives and the built-in bias of the model. To fix this, I used both random over- and under-sampling (i.e., randomly removing home wins and randomly duplicating home losses) until I reached a 9:10 ratio of home losses to home wins. This should reduce the bias and false positives while still preserving the inherent 'home court advantage' that I discussed in the EDA. Compared to the original model, the confusion matrix shows a significant reduction in false positives. However, there was not a significant change in the overall accuracy of the model, aside from an improvement in the recall score for Home_Loss at the cost of Home_Win.
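One way to do this combined over/under-sampling is with `sklearn.utils.resample` (imbalanced-learn's `RandomOverSampler`/`RandomUnderSampler` are an equivalent alternative). This sketch uses a hypothetical training frame with the same 4:6 loss:win ratio described above:

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Hypothetical training frame: roughly 4:6 home losses to home wins
rng = np.random.default_rng(5)
train = pd.DataFrame({
    "FG_dif": rng.normal(0, 5, 10000),
    "HOME_TEAM_WINS": (rng.random(10000) < 0.6).astype(int),
})
wins = train[train["HOME_TEAM_WINS"] == 1]
losses = train[train["HOME_TEAM_WINS"] == 0]

# Target a 9:10 loss:win ratio: undersample wins, oversample losses
target_wins = int(len(train) * 10 / 19)
target_losses = len(train) - target_wins
wins_down = resample(wins, replace=False, n_samples=target_wins, random_state=1)
losses_up = resample(losses, replace=True, n_samples=target_losses, random_state=1)

balanced = pd.concat([wins_down, losses_up])
print(balanced["HOME_TEAM_WINS"].value_counts(normalize=True).round(3))
```

Refitting the logistic regression on `balanced` (instead of the raw training split) is what shifts recall from Home_Win toward Home_Loss.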

I also attempted to use grid search for hyperparameter optimization, searching over C, penalty, different solvers, and max iterations with 3-fold cross-validation, for a total of 2,700 fits. Surprisingly, there was not really a difference in the performance of the 'optimal' model: its accuracy matched the base model at 0.836, and as the classification reports reveal, there is barely any difference between the two models. The major difference remains in the resampled-data model, where the recall score for home losses was much higher.
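A sketch of this tuning with `GridSearchCV`, on synthetic stand-in data and with a smaller grid than the 900 candidates implied by the post's 2,700 fits (the dimensions searched are the same):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with a single strong feature (hypothetical)
rng = np.random.default_rng(2)
X = rng.normal(0, 5, (2000, 4))
y = (rng.random(2000) < 1 / (1 + np.exp(-(0.4 + 0.35 * X[:, 2])))).astype(int)

# Same tuning dimensions as the post: C, penalty, solver, max_iter
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l1", "l2"],          # both supported by the solvers below
    "solver": ["liblinear", "saga"],
    "max_iter": [100, 500],
}
search = GridSearchCV(LogisticRegression(), param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

When the tuned score matches the default model's, as it did here, the defaults were already near-optimal for this problem, and only the data (via resampling) moves the needle.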

Base Model

Normalized & Resampled Model

Grid Search CV Model

Conclusion

It is hard to conclude which logistic regression model was the best since the accuracy was very consistent throughout. Despite trying different models, normalizing and resampling the data, and fine-tuning the parameters, I was unable to really improve the accuracy of the model (in terms of predicting the outcome of NBA games). This indicates to me that the major limitation in my model is the lack of additional predictor variables. Including another variable such as turnover differentials, foul differentials, or steal differentials could significantly improve the performance of the classification models. Again, this was difficult to do because of the limitations of my data set. In future models, I would try to gather more variables and further explore the voting ensemble classifier as another option, since it showed really good potential. Overall, with how the model is at this point in time, I am not comfortable using it as an accurate predictor of NBA games. I would want to get the accuracy to at least 90% before considering using this model. Other models have been able to achieve an accuracy score between 88–94%, so that is the benchmark.

Important Note

This model should not be confused with models that seek to predict the outcome of a game before it is played. Those models rely on statistical ratios of each team to predict the outcome without knowing any statistics of the game itself. My model is more of a classification task: we are simply attempting to classify a win or loss based on the final game statistics. That is why we are aiming for a much higher accuracy score.
