Create an NBA Win-Loss Model w/ 68% Precision

The Research Lab
Feb 11, 2023


[Image: basketball art in the style of oil painting (DALL·E)]

With the NBA season past the Christmas break and the trade deadline behind us, things are about to really heat up. The All-Star Game is coming, and we get to see whether LeBron, now the all-time scoring leader, and the Lakers can make the play-in tournament. With all the fun just getting started, what better time to build your own binary classification win-loss model?

Some of the model’s features were inspired by Chris Baker and Stephen Shea’s book Basketball Analytics, which I highly recommend if you want to learn more about the history of basketball analytics and methods for analyzing the game. The dataset was curated using the NBA Stats API, which gives access to the data from NBA.com and provides a multitude of methods for obtaining box scores, lineups, play-by-play, and many other datasets.

The scripts for my model are available on The Research Lab’s GitHub page. Feel free to explore the repositories there and, if you like what you see, follow us on GitHub, since I will keep pushing my projects there.

The Model Intro

Now that the background is out of the way, I’m sure you’re eager to learn how to replicate, or even improve on, my results. My goal is to walk you through the modeling path I chose, some decisions I made along the way, and some ideas for improving the model going forward. Since this is one of my first articles, I’m assuming a broad audience, so for the hardcore stats and data science people I leave you my Jupyter Notebook scripts for further digging.

The Model Algorithm

The algorithm I went with for this binary classification model was logistic regression. I am aware that this is not the most complex model I could have used. During my experimentation phase I also tried win-loss modeling with an XGBoost model and an artificial neural network with 3 hidden layers, 256 neurons in total, trained for 100 epochs. Admittedly, that experimentation phase was rapid, but all the models produced similar performance. As a side note, I don’t have a favorite model; the best model is simply the one that fits my use case. Do with that information what you will.
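To make the setup concrete, here is a minimal sketch of fitting a logistic regression for home wins with scikit-learn. It is not the script from the repo: the data is synthetic, the two feature columns and the coefficients that generate the labels are invented for illustration, and the chronological split (`shuffle=False`) is my assumption about how you would want to hold out later games.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for real game features (NOT the article's dataset).
# Hypothetical columns: home-minus-away win pctg, home-minus-away days rest.
n_games = 1000
X = rng.normal(size=(n_games, 2))

# Invented relationship used only to generate labels: 1 = home win, 0 = home loss.
logits = 0.5 + 1.0 * X[:, 0] + 0.2 * X[:, 1]
y = (rng.random(n_games) < 1 / (1 + np.exp(-logits))).astype(int)

# shuffle=False keeps the rows in order, so the last 25% act as "later" games.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False
)

model = LogisticRegression()
model.fit(X_train, y_train)

home_win_prob = model.predict_proba(X_test)[:, 1]  # P(home win) for each game
predicted_home_win = model.predict(X_test)         # 1 where P(home win) > 0.5
```

Swapping in a different classifier only changes the `model = ...` line, which makes it cheap to compare candidates on the same split.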

The Model Predictor Variables & Target Variable

The logistic regression model uses a set of intuitive features to predict home team wins. The predictor variables (i.e., features) are as follows:

  • Days Rest — number of days between games
  • Total Win Pctg — a team’s win percentage based on home and away games
  • Home Win Pctg — a team’s win percentage based on home games alone
  • Away Win Pctg — a team’s win percentage based on away games alone
  • Offensive Efficiency — a measure to understand how well a team gets that ball in the hole 🙂 (check out this book Basketball Analytics)
  • Scoring Margin — How many points does a team lose or win by
  • Rolling Measures — converted some of the above metrics using rolling averages

Each row in the feature dataset holds these predictor variables for both the home and away teams. Besides converting some of the measures to rolling measures, the only other transformations I performed were standard scaling and shifting the rows to prevent data leakage, i.e., “given the current game I’m trying to predict, I can only use information from previous games, not the current game or future games.” Now, I’m sure the data scientists and statisticians would like me to expand on things like feature importance or multicollinearity among features, and this is where I will just smile and refer you to the scripts in the basketball analytics repo. From there, you are free to use methods such as Shapley additive explanations, Gini importance, variance inflation factors, PCA feature transformations, or whatever else you like!
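The shift-then-roll idea is easy to get subtly wrong, so here is a small pandas sketch of it. The column names (`team`, `game_date`, `scoring_margin`), the 3-game window, and the toy values are all invented for illustration and are not taken from the repo. Computing the scaling statistics on the training rows only is also my assumption about how to keep the scaling step leak-free.

```python
import pandas as pd

# Toy per-team game log with invented values.
games = pd.DataFrame({
    "team": ["A"] * 5 + ["B"] * 5,
    "game_date": pd.to_datetime(
        ["2023-01-01", "2023-01-03", "2023-01-04", "2023-01-07", "2023-01-08"] * 2
    ),
    "scoring_margin": [10, -4, 6, 2, -8, -3, 5, 1, -7, 9],
})
games = games.sort_values(["team", "game_date"]).reset_index(drop=True)

# shift(1) removes the current game from its own feature, so each row only
# sees games that were already played. The first game of a team gets NaN.
games["margin_roll3"] = games.groupby("team")["scoring_margin"].transform(
    lambda s: s.shift(1).rolling(window=3, min_periods=1).mean()
)

# Days since the same team's previous game.
games["days_rest"] = games.groupby("team")["game_date"].diff().dt.days

# Standard scaling using statistics from the training rows only.
train_mask = games["game_date"] < pd.Timestamp("2023-01-07")
mu = games.loc[train_mask, "margin_roll3"].mean()
sigma = games.loc[train_mask, "margin_roll3"].std(ddof=0)
games["margin_roll3_scaled"] = (games["margin_roll3"] - mu) / sigma

print(games[["team", "game_date", "scoring_margin", "margin_roll3", "days_rest"]])
```

Without the `shift(1)`, the rolling average for a game would include that game's own margin, which is exactly the leakage described above.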

The Model Training & Performance Evaluation

Okay, we have finally come to the fun part, where I can explain the somewhat clickbaity article title. This model was trained on data from the 2020–2021 and 2021–2022 NBA seasons and validated on game data from the current season so far (as of Dec 31, 2022). If you review the sample script I provided, you will notice accuracy and F1 scores around 61%, which is in line with the baseline prediction. In other words, if you predicted a home team win for every game, you would be correct around 61% of the time. This is an analytic representation of “home team advantage” in terms of wins, so before running to Vegas you must understand that this information is already baked into the odds. In the current 2022–2023 season, home teams have won 338 of 548 games (about 61.7%), which checks out with the baseline win percentage.
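The baseline arithmetic is short enough to check by hand. The two counts below are the ones quoted above; the variable names are just illustrative.

```python
home_wins = 338      # home-team wins so far in 2022-2023 (figure from the article)
games_played = 548   # games played so far (figure from the article)

# Accuracy of the "always pick the home team" strategy.
baseline_accuracy = home_wins / games_played
print(f"Always-home baseline: {baseline_accuracy:.1%}")  # prints 61.7%
```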

So what does the model’s 68% precision mean, and why might it be valuable? A precision of 68% on home-win predictions means that when the model predicts a home team win, the home team actually wins 68% of the time. Always predicting a home win is right only about 61% of the time, so the model has a slight edge of roughly 7 percentage points over the baseline. I should add that I am not condoning using this model to gamble; this article is strictly for educational purposes 😁.
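As a sketch of what precision measures, the snippet below computes it from scratch on a handful of invented labels (they are not the model's real predictions). It also shows why the baseline is a fair comparison: if you predict a home win every time, precision is simply the home win rate.

```python
def precision(y_true, y_pred, positive=1):
    """Share of positive predictions that turned out to be positive."""
    true_pos = sum(
        1 for t, p in zip(y_true, y_pred) if p == positive and t == positive
    )
    pred_pos = sum(1 for p in y_pred if p == positive)
    return true_pos / pred_pos if pred_pos else float("nan")

# Toy labels, invented for illustration: 1 = home win, 0 = home loss.
actual    = [1, 0, 1, 1, 0, 1, 0, 1]
predicted = [1, 1, 1, 0, 0, 1, 0, 1]

# 4 of the 5 home-win calls were correct.
model_precision = precision(actual, predicted)

# Predicting "home win" for every game gives the home win rate, 5 of 8.
always_home_precision = precision(actual, [1] * len(actual))

print(model_precision, always_home_precision)
```

Note that precision says nothing about the games where the model predicted a home loss, so it should be read alongside accuracy or recall rather than instead of them.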

The Model’s Next Steps

If this light overview has piqued your interest and you would like to know where this development could go, this last section is for you. In reality, the model described above is somewhat naive, as there are more factors that theoretically drive home team wins. Two factors that could improve the model’s performance are “Strength of Lineup” and “Clutchness”. Strength of Lineup covers situations such as a team’s star player tearing their ACL. When that happens, previous offensive efficiency metrics can no longer be assumed valid, because the main contributor to team-level success can no longer influence game outcomes. Strength of Lineup could also capture team cohesion and morale, on the assumption that unhappy teams and players don’t play as well as happy ones. Clutchness is more intuitive from its name. The NBA’s definition of clutch time is “any game time when there are five or fewer minutes remaining in the game and the scoring margin is within 5 points”. If a team’s scoring margin is within -5 to +5, it may get into a lot of high-pressure 4th quarter situations, and the question is whether it can maintain its efficiency during these moments or whether its performance declines.
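The clutch definition quoted above translates almost directly into a filter you could apply to play-by-play rows before computing a clutch efficiency feature. This is only a sketch: the function and parameter names are hypothetical, and it ignores overtime, which you would need to handle separately.

```python
def is_clutch(seconds_remaining_in_game: float, home_score: int, away_score: int) -> bool:
    """True when five or fewer minutes remain and the margin is within 5 points."""
    late_in_game = seconds_remaining_in_game <= 5 * 60
    close_score = abs(home_score - away_score) <= 5
    return late_in_game and close_score

print(is_clutch(240, 100, 97))  # 4:00 left, 3-point game
print(is_clutch(240, 100, 90))  # 4:00 left, 10-point game
print(is_clutch(600, 100, 99))  # 10:00 left, 1-point game
```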

Outside of additional features, other steps worth taking are backtesting, error analysis, and model/data drift monitoring. With historical odds data we can run a what-if analysis to understand the Vegas betting perspective, i.e., if I had followed the model’s predictions and bet on each game, what would the results have been (profit/loss)? Again, this is 100% for educational purposes, as even having the historical odds does not guarantee what future odds will look like. Execution also plays a big part, since odds shift continuously until a game starts. Error analysis involves looking at games where the model’s prediction failed and trying to find patterns in the errors. For example, if the model doesn’t account for injuries, its mispredictions may cluster in games where the lineup deviated from the normal starting 5. I find this is a good opportunity to use clustering analysis on additional columns to streamline the error analysis process. Lastly, monitoring model/data drift means understanding the assumptions baked into the modeling process and checking whether those assumptions change over time. For example, if this model had been trained on data from the 90s, changes in rules, player skill level, or other factors would likely reduce its predictive power on NBA games in 2022.
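The what-if backtest described above can be sketched in a few lines. Everything here is an assumption for illustration: the odds and outcomes are invented, the odds are taken to be American moneyline odds on the home team, the stake is a flat 1 unit, and a bet is placed only when the model predicts a home win.

```python
def profit_on_win(american_odds: int, stake: float = 1.0) -> float:
    """Profit (excluding the returned stake) when a bet at American odds wins."""
    if american_odds == 0:
        raise ValueError("American odds cannot be 0")
    if american_odds > 0:
        return stake * american_odds / 100   # e.g. +120 pays 1.20 per unit staked
    return stake * 100 / abs(american_odds)  # e.g. -200 pays 0.50 per unit staked

def backtest(bets, stake: float = 1.0) -> float:
    """bets: iterable of (predicted_home_win, home_won, home_odds) tuples."""
    total = 0.0
    for predicted_home_win, home_won, home_odds in bets:
        if not predicted_home_win:
            continue  # no bet when the model does not predict a home win
        total += profit_on_win(home_odds, stake) if home_won else -stake
    return total

# Invented history: (model predicted home win?, home team won?, home moneyline)
history = [
    (True, True, -200),   # bet wins
    (True, False, -150),  # bet loses the stake
    (True, True, 120),    # bet wins
    (False, True, -110),  # no bet placed
]

total_profit = backtest(history)
print(f"Profit over {len(history)} games: {total_profit:+.2f} units")
```

A fuller backtest would also track the number of bets placed and the return per unit staked, since a precision edge over the baseline does not by itself imply a profit once the odds are priced in.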

Conclusion

Many veteran sports bettors will tell you that predicting outcomes and developing a betting system is not easy, and I would like to reemphasize their sentiment: this is not for the faint of heart. I hope this article piqued your interest and can serve as a nice introduction for those who, like me, want to dig deeper. I may decide that follow-up articles are necessary, so I encourage those who are interested to stay tuned. And remember, kids, I am not responsible if you lose your Christmas money!
