Machine Learning Sports Betting on the NBA Season (Before the Bubble)
This is a supervised learning model that uses a large statistical dataset to predict NBA games and place hypothetical bets on them. The stats were gathered from basketball-reference using the sportsreference API. The full notebook can be found here.
The NBA is arguably the most rapidly changing American sport, having shifted toward prioritizing 3-point shooting and playing at a much faster pace than before. For this reason, data from older seasons starts to hurt the model’s performance. The key insight that increased its accuracy was to use a rolling average of the previous 3 games for each statistical category and to train the model on only the previous 2 seasons.
Let’s get right to the good stuff. This is how the model performed:
After losing $1400 in the first few weeks, the model rebounded quickly to its highest point of $1776 only a few weeks later, and then started to fluctuate at a more reasonable pace. I’m happy with the results, having profited $1400, but the model is more volatile than I’d prefer. There are a few tweaks we can make in the future that might help minimize this risk.
Firstly, I found my odds data from this site. While most matchups pass the eye test for the better team being favored, I haven’t vetted the odds against other Vegas data.
Secondly, the model bets on every single game instead of identifying potentially favorable bets. I foresee this as a change that could increase its profitability the most because we’re sometimes risking $100 to make ~$2 for heavy favorites. For example:
And these are some of the dumb bets that we lost. For reference, the first row would have been a $3 profit, but we lost $100.
But we did recoup those losses with upsets. Here are the model’s biggest upset picks! I’d like to thank the Memphis Grizzlies for the $1100.
Here’s how each team performed for us:
Let’s start from the beginning. Our machine learning model can only be as good as the dataset available to us, which is one of the reasons I love working with sports data so much. This domain has some of the best record keeping, dating back further than in most industries.
Approach to the Problem
Our goal is to predict NBA games and place informed bets on them. There are 2 facets to the data that we need: (1) historical statistical data for the games and (2) Vegas odds data.
For the first component, basketball-reference is the authoritative resource, alongside the official NBA stats page. I used the sportsreference API to take every stat from every box score for as many past seasons as I like and put that data into a pandas dataframe. There are over 70 potential statistical features to utilize for prediction.
I found the odds data at sportsbookreviewsonline. I’m not certain exactly how accurate these odds are, but they pass every sanity check I’ve made: the clearly better team is always favored. Here’s another simple check. The table below shows the top 10 positive correlations to a home win:
The home and away odds are on opposite sides of the correlation spectrum and line up with the proper team winning so our data pass another eye test.
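The correlation check described above can be sketched in a few lines of pandas. The column names (`home_odds`, `away_odds`, `home_win`) and the six toy games are assumptions made up for illustration, not the notebook’s actual schema or data.

```python
import pandas as pd

# Hypothetical toy data: American moneyline odds and the game result.
# A negative moneyline marks the favorite.
games = pd.DataFrame({
    "home_odds": [-300, 150, -120, 250, -500, 110],
    "away_odds": [240, -170, 100, -300, 400, -130],
    "home_win":  [1, 0, 1, 0, 1, 0],
})

# Pearson correlation of every column with the target, strongest first.
corr_with_home_win = (
    games.corr()["home_win"]
    .drop("home_win")
    .sort_values(ascending=False)
)
print(corr_with_home_win)
```

In this toy sample the home team wins whenever it is favored, so the away odds land at the positive end and the home odds at the negative end, which is the pattern the sanity check looks for.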
The next step is to decide which statistics (aka features) we’ll include in our prediction model. One point of concern is the high level of multicollinearity among the features, meaning they are strongly interconnected rather than standalone independent variables. This means the model won’t have much explanatory power to tell us which feature matters most for winning, but we’ll still be able to make powerful predictions. Below is the heatmap of the correlations among the features.
All the darker blue and white spots represent highly correlated features. We’re going to mitigate the multicollinearity in two ways: (1) use Principal Component Analysis, an unsupervised learning technique, to condense our features into a smaller set of uncorrelated components, and (2) create a pipeline of several types of models so that we can compare their performance as we make changes upstream. Both techniques are explained in more detail later in the project.
We have a very important step to take before finalizing our features. Since our model will be making predictions, we need to transform the data to match what we would have access to at the time of prediction. Since we are working with historical data, we could wrongly use data from Jan 2 2018 to help inform our model on a prediction about Jan 1 2018. This would artificially inflate the model’s accuracy in testing. However, once we deploy the model in the real world, it can only use data available at the time to make a bet on a game.
Match Features vs External Features
This paper by Rory P. Bunker and Fadi Thabtah in 2019 was my biggest help in framing the solution. They’re focusing on soccer, but this same concept applies:
External features are known prior to the upcoming match to be played. For example, we know the distance that both teams have traveled and we know both teams’ recent form leading into the upcoming match. Match-related features however, are not known until the match has been played. Thus, we only know an average of these features for a certain number of past matches for these teams. For example, we would know the average passes made per match by both teams prior to the match, but do not know the actual passes made in the upcoming match until after it has been played. This means that only past average statistics for these features can be used to predict an upcoming match. Therefore, match-related features should undergo a separate averaging process before being re-merged with the external features.
So our match-related features are the box score statistics, but I also included a couple external features that were exempted from the transformation: the total win-loss record for each team at that time, and the Vegas odds. I considered using location, but I used the stats as home/away splits so I think home court advantage is well represented already.
Applying the Transformation
Now let’s put this idea into practice. Currently, each row of data represents a box score of a completed game with all the home and away features, as well as the result.
As discussed, we don’t have access to that information when predicting, so we’ll transform each statistical feature to become the rolling average of the previous X games for each team. The first X games of each season are dropped since there isn’t enough information. Another note: the rolling average is taken from the previous X home or away games for that team. This accounts for teams’ home vs away performance but could lose some accuracy when a team is playing its first home game after many road games, and vice versa.
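A minimal sketch of that idea, assuming a hypothetical schema with `date`, `home_team`, `away_team` and stat columns named `home_<stat>` / `away_<stat>`. The function name `rolling_features` and the toy season are illustrative only. The `shift(1)` is what keeps the current game out of its own average.

```python
import pandas as pd

def rolling_features(season: pd.DataFrame, num_games: int, stats: list) -> pd.DataFrame:
    """Replace each box-score stat with the mean of that team's previous
    `num_games` home (or away) games, excluding the current game."""
    out = season.sort_values("date").reset_index(drop=True)
    rolled = []
    for side in ("home", "away"):
        for stat in stats:
            col = f"{side}_{stat}"
            # shift(1) drops the current game, so only earlier games are averaged.
            out[col] = out.groupby(f"{side}_team")[col].transform(
                lambda s: s.shift(1).rolling(num_games).mean()
            )
            rolled.append(col)
    # Games without enough history have NaN averages and are dropped.
    return out.dropna(subset=rolled).reset_index(drop=True)

# Toy season (made up): team A hosts team B four times.
games = pd.DataFrame({
    "date": pd.to_datetime(["2019-11-01", "2019-11-03", "2019-11-05", "2019-11-07"]),
    "home_team": ["A", "A", "A", "A"],
    "away_team": ["B", "B", "B", "B"],
    "home_pts": [100, 110, 120, 130],
    "away_pts": [90, 95, 100, 105],
    "home_win": [1, 1, 0, 1],
})
result = rolling_features(games, num_games=2, stats=["pts"])
print(result)
```

External features such as the result column are left untouched because they are not listed in `stats`. Changing `num_games` is the knob to turn when comparing window sizes.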
Since we need to decide on what number to use for the previous X matches, we’ll need to write a function that transforms our data so that we can change the value of X and see how it performs downstream when we test the model’s performance. This concept is crucial to every aspect of a machine learning project and is referred to as a pipeline. When all decisions and assumptions that are made are not hard-coded, but are more like knobs that can be turned, we can twist several of them until our peak model performance is reached.
This particular function to transform this dataset was perhaps the most technically challenging task up to that date for me. But now, when I revisit the code, it seems so obvious and simple. That about sums up the entire experience of learning to code. Here it is:
This function takes 2 arguments: (1) a pandas dataframe that in this case represents all the box scores from one NBA season and (2) the num_games to use for the rolling average. It iterates through home and away columns separately, then for each stat finds the corresponding team’s last num_games performance, averages them, and replaces the current value with that average. It also drops any games from the beginnings of seasons that don’t have enough data. Therefore, when game day arrives in real life, the data the model uses to make our prediction on that game is identical to that which it was trained on. However, because we’re dropping the first few games for each team when we don’t yet have enough information, we currently have to wait until a few weeks into a season to start making picks, another potential weakness of the model. Here’s a snippet of what a row looks like afterward:
Scaling the Features
Now that our features are transformed, I’m going to use a scaler from scikit-learn, one of the most versatile and popular data science libraries for Python. Scaling puts every column on a comparable numeric range, so that features measured in large units don’t dominate those measured in small units, and, depending on the scaler, it can dampen the effect of outliers. I think our odds data is the only feature here that might be prone to outliers, and I’ve identified a few, but we’ll scale all the features anyway, because there’s not much downside.
Now it’s time to address the multicollinearity of our features: the fact that the stats are highly interdependent. For example, points scored, 3-pointers made, and 3-point accuracy are far from independent of each other. A team with a high point total will probably have more 3-pointers made and higher accuracy. Therefore, we have a lot of redundant information. It probably doesn’t matter much in this case since we have plenty of memory and computing power to handle this dataset, but imagine if, for example, we were working with a billion rows of Twitter user data. We’d need to be as practical as possible with our memory.
Thankfully, there’s an unsupervised learning algorithm called Principal Component Analysis (PCA) that will easily reduce dimensionality for us. We can run the algorithm and plot the cumulative explained variance against the number of components kept. The explained variance is the share of the total variation in the original features that those components retain, which serves as a rough proxy for how much potentially useful information we keep.
As seen above, at around 30 components we start to have some redundancy, but I kept 40 just to be safe. In a single line of code we can apply this transformation:
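As a rough sketch of the scaling step, here is scikit-learn’s `StandardScaler` applied to synthetic data. The choice of `StandardScaler` is an assumption (the post doesn’t name the scaler used), and the random arrays stand in for the real feature matrix.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for the training seasons and the season being predicted.
X_train = rng.normal(loc=100, scale=10, size=(200, 5))
X_test = rng.normal(loc=100, scale=10, size=(50, 5))

scaler = StandardScaler()
# Learn each column's mean and standard deviation from the training data only...
X_train_scaled = scaler.fit_transform(X_train)
# ...then reuse those same statistics on games the model has not seen.
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the training data alone follows the same principle as the rolling averages: nothing from the games being predicted leaks into the preprocessing.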
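The explained-variance curve can be computed along these lines. The data here is synthetic: 70 correlated columns generated from 10 underlying signals, chosen only to mimic a redundant feature set, and the 95% threshold is an arbitrary example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 70 observed features that are noisy mixtures of just 10 hidden signals.
latent = rng.normal(size=(500, 10))
mixing = rng.normal(size=(10, 70))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 70))

pca = PCA().fit(X)
# Running total of the variance share captured as components are added.
cumulative = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components that retains at least 95% of the variance.
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_for_95)
```

Plotting `cumulative` against the component count gives the kind of curve used to decide where extra components stop adding much.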
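With scikit-learn, applying the reduction might look like the following sketch. The 70-column random matrices are placeholders for the scaled training and test features; only the 40-component setting comes from the text.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Placeholders for the scaled feature matrices (70 columns each).
X_train_scaled = rng.normal(size=(300, 70))
X_test_scaled = rng.normal(size=(80, 70))

pca = PCA(n_components=40)
X_train_pca = pca.fit_transform(X_train_scaled)  # fit on training data only
X_test_pca = pca.transform(X_test_scaled)        # project unseen games the same way
print(X_train_pca.shape, X_test_pca.shape)
```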
Now that our data is in the exact shape and manner we need for predicting games, let’s get to it! We’ve created a pipeline so that we can easily create a variety of models with a couple lines of code and compare their predictive accuracy.
Now we have to think about what it means for our model to perform well. If the model bets on heavy favorites all the time because they are most likely to win, we are essentially competing 1v1 against Vegas oddsmakers, and that’s never worked for anyone. However, we do have access to the odds before tip-off, so they are included as one of the features to help mitigate this. It’s not a perfect solution though, because odds aren’t an objective reflection of each team’s probability of winning. Vegas uses odds to try to split the betting population in half, and makes money off the rake. In other words, they’re like a poker table at a casino: they provide the environment and financial backing to play, but make their money off the service fee built into every bet. If our model only optimizes for picking games correctly, it could fall victim to losing big on upsets and never recouping those losses with our own upset picks.
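To make the rake concrete, here is a small sketch that converts American moneyline odds into the break-even win probability they imply. The helper name `implied_probability` and the -150 / +130 line are made up for illustration.

```python
def implied_probability(moneyline: int) -> float:
    """Break-even win probability implied by American moneyline odds."""
    if moneyline < 0:
        # Favorite: risk |moneyline| to win 100.
        return -moneyline / (-moneyline + 100)
    # Underdog: risk 100 to win `moneyline`.
    return 100 / (moneyline + 100)

home_line, away_line = -150, 130  # hypothetical odds for one game
p_home = implied_probability(home_line)
p_away = implied_probability(away_line)

# The two probabilities sum to more than 1; the excess is the bookmaker's margin.
overround = p_home + p_away - 1
print(round(p_home, 4), round(p_away, 4), round(overround, 4))
```

Because the implied probabilities add up to more than 100%, the raw odds overstate each side’s chances slightly, which is one reason they can’t be read as true win probabilities.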
Anyways, I could run circles in my head all day (and have) about the best way to define what an accurate model would be, but I decided to create a baseline first and iterate from there. We still will focus on true/false for a correct prediction. With the inclusion of odds and our thoughtful transformation of the statistical data we should at least have a good place to start making picks.
Creating and Comparing Models
When I got this far into a machine learning project for the first time, I didn’t anticipate that all the hard work had already been done. These models have already been coded out by teams of geniuses, and all we have to do is feed them the proper data. Obviously that is an oversimplification, but the sentiment remains true. As this project becomes more nuanced, I’ll start to use more advanced ways of tweaking these models’ settings, called hyperparameters, but the defaults are already perfectly functional for our baseline.
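A sketch of what comparing several off-the-shelf models can look like in scikit-learn. The data comes from `make_classification` rather than real games, and apart from Naive Bayes the particular models listed are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 40 PCA components and the home-win label.
X, y = make_classification(n_samples=600, n_features=40, n_informative=10,
                           random_state=0)
X_train, X_test = X[:450], X[450:]
y_train, y_test = y[:450], y[450:]

models = {
    "naive_bayes": GaussianNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "k_nearest_neighbors": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Every model shares the same fit/score interface, so one loop covers them all.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out rows

for name, accuracy in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {accuracy:.3f}")
```

With real game data the held-out rows would be the later season, not a slice of shuffled samples, so that the split respects time.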
Here are the results of the accuracy of each model:
We had decent performance! 66% accuracy in picking games seems like something we could use profitably. Naive Bayes is definitely the leanest, simplest model, so it’s encouraging how well it performed, and that makes it an easy choice as the baseline to iterate from. There are millions of words printed about the pros and cons of each model, their best use cases, and so on, so I’ll leave that for more technical publications.
Let’s calculate the money we (hopefully) won! Remember, the model was only trained on previous seasons, and only used data it would have had access to if I had really implemented this model at the beginning of the season.
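The bookkeeping can be sketched as follows, assuming a flat $100 stake on American moneyline odds. The `bet_profit` helper and the four bets are invented for illustration and are not the model’s actual picks.

```python
def bet_profit(moneyline: int, won: bool, stake: float = 100.0) -> float:
    """Profit (or loss) on a single moneyline bet with a flat stake."""
    if not won:
        return -stake
    if moneyline < 0:
        # Favorite: must risk |moneyline| to win 100.
        return stake * 100 / -moneyline
    # Underdog: a 100 stake wins `moneyline`.
    return stake * moneyline / 100

# (moneyline of the team we picked, whether the pick won) -- made-up examples
bets = [(-5000, True), (-3000, False), (450, True), (-110, False)]

bankroll = 0.0
history = []
for moneyline, won in bets:
    bankroll += bet_profit(moneyline, won)
    history.append(round(bankroll, 2))

print(history)  # [2.0, -98.0, 352.0, 252.0]
```

The first two bets show the asymmetry of backing heavy favorites: a win adds $2, while a single loss costs the full $100.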
After losing $1400 in the first few weeks, the model rebounded quickly to its highest point of $1776 only a few weeks later, and then started to fluctuate at a more reasonable pace. I’m happy with the results, having profited $1400, but the model is more volatile than I’d prefer.
I think this is a great starting point to iterate from, as we have several areas we already know where to improve, but the model’s performance is definitely acceptable. My goal is to implement these improvements, test many different variations, and have a betting model ready for tip-off of the 2020–21 season!
These are the top priorities:
- This model has yet to include any player data. A category that includes a rolling roster of top performers and whether they are out with injury or not could prove useful.
- More explanatory power in order to give a confidence rating on single game predictions. Currently the model relies on long term use with a bet on every game.
- Opting out of dumb bets. I could train the model to decide whether a bet is worth the risk, instead of spreading that risk wide. The confidence rating could be compared against the odds.
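One possible way to compare a confidence rating against the odds is an expected-value filter, sketched below. This is an assumption about how the idea could work, not something the current model does. The `expected_profit` helper, the probabilities, and the odds are all hypothetical, and the win probability would have to come from a well-calibrated model.

```python
def expected_profit(model_prob: float, moneyline: int, stake: float = 100.0) -> float:
    """Average profit per bet if the model's win probability were accurate."""
    if moneyline < 0:
        win_amount = stake * 100 / -moneyline
    else:
        win_amount = stake * moneyline / 100
    return model_prob * win_amount - (1 - model_prob) * stake

# (label, model's win probability for our pick, moneyline) -- made-up values
candidates = [
    ("heavy favorite", 0.95, -5000),
    ("live underdog", 0.30, 400),
]

# Only place bets whose expected profit is positive.
worth_betting = [label for label, prob, line in candidates
                 if expected_profit(prob, line) > 0]
print(worth_betting)  # ['live underdog']
```

Even at 95% confidence, the heavy favorite is a losing proposition at those odds, while the underdog is worth a bet despite being likely to lose.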
Rory P. Bunker, Fadi Thabtah, A machine learning framework for sport result prediction, Applied Computing and Informatics, Volume 15, Issue 1, 2019, Pages 27–33, ISSN 2210–8327, https://doi.org/10.1016/j.aci.2017.09.005. (http://www.sciencedirect.com/science/article/pii/S2210832717301485)