Using Machine Learning to Predict NBA Winners Against the Spread

Jordan
11 min read · Aug 3, 2021


Being an avid NBA fan and a casual sports bettor, I wanted to see if I could use my data science skills to create a model that could be profitable betting NBA games against the spread.

The sports betting industry has been surging in the US over the past few years. As of this writing, it is legal in 26 states, a number likely to grow over the next few years. This is hardly surprising, as sports betting holds broad appeal for its massive fan base. The allure for many is showing off superior knowledge of the game, and the potential to make money while doing so adds excitement and entertainment.

So how does sports betting work? There are many ways to bet on a game. You can bet the money line, point totals, or spread, as well as a variety of other prop bets. For this project, I decided to focus on the spread. The spread is the number of points that the bookie adds to or subtracts from a team’s score to level the playing field and thus incentivize betting on both sides.

To understand how the spread works, it is easiest to look at an example. Let’s suppose the Lakers are playing the Timberwolves. Most people would probably pick the Lakers (in 2020–21) to win, so the bookie might set a spread of Lakers -7 or Timberwolves +7. Now, if you pick Lakers -7, it’s not enough for them to just win; they need to win by more than 7 points. If you pick the Timberwolves +7, they need to lose by fewer than 7 points (or win outright). If the line is properly set, the bookie will receive roughly the same amount of money on both sides, so that the losers pay the winners, minimizing the bookie’s risk.

So then how does the bookie make money? The answer is juice: the small fee (in the form of reduced odds) you pay when you place the bet. When you bet against the spread, the odds are often -110, meaning you risk $110 to win $100. You might think that if you place two bets, win one and lose one, you should be back to even. However, at these odds you will be down $10: $100 + (-$110) = -$10. Thus if we have a 50% success rate betting ATS (against the spread), our expected value is -$5 per bet. Not good. So the natural question is: what percentage of bets do we need to win in order to break even?

If w is the fraction of bets you win, then your expected value per bet (risking $110 to win $100) is given by this equation:

EV = 100w − 110(1 − w)

To see what percentage we need in order to break even, we can set EV equal to 0 and solve the equation:

0 = 100w − 110(1 − w)
210w = 110
w = 110/210 ≈ 0.5238

Thus if we can win more than 52.38% of all bets made, we should be able to make a profit.
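The arithmetic can be sanity-checked with a few lines of Python:

```python
def expected_value(win_rate, risk=110, win=100):
    """Expected profit per bet at -110 odds: risk $110 to win $100."""
    return win * win_rate - risk * (1 - win_rate)

# Break-even rate comes from solving 100w - 110(1 - w) = 0
break_even = 110 / (100 + 110)

print(round(expected_value(0.50), 2))  # -5.0 at a 50% win rate
print(round(break_even, 4))            # 0.5238
```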

So, just how good are bookies at setting their lines? Is it possible to build a model that lets me bet against the spread and be right more than 52.38% of the time? This is the question I sought to answer. To accomplish this task, I will follow these steps:

  1. Gather Data
  2. Clean Data
  3. Feature Engineering
  4. Create Models
  5. Evaluate Results
  6. Implement a Betting Strategy

All of the code is available on my GitHub if you are interested.

I. Gathering Data

As with many data science projects, the hardest and most time-consuming part is often the data acquisition.

For this, I needed two types of data:

  1. Historical Boxscore Data
  2. Historical Betting Lines

I used nba_api, a Python wrapper for the API on NBA.com, to gather box scores from 2000 to 2020. This took several days, as I had to insert frequent sleep statements to avoid timeout errors.
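The throttling pattern looked roughly like this; fetch_fn stands in for whichever nba_api endpoint call is being made, and the delay and retry counts are illustrative:

```python
import time

def fetch_all(game_ids, fetch_fn, delay=0.6, retries=3):
    """Pull box scores one game at a time, sleeping between requests
    so the stats.nba.com API doesn't time us out."""
    results = {}
    for gid in game_ids:
        for attempt in range(retries):
            try:
                results[gid] = fetch_fn(gid)
                break
            except Exception:
                time.sleep(delay * (attempt + 1))  # back off before retrying
        time.sleep(delay)  # throttle between games
    return results
```

With roughly 1,200+ games per season over 20 seasons and a sleep on every request, it is easy to see how a full scrape runs for days.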

I gathered three types of box scores: basic, advanced, and scoring box scores for the 1st half.

  • Basic box scores contained stats like points, rebounds, steals, assists.
  • Advanced box scores contained more advanced metrics like off/def rating, pace, eFG% and TS%.
  • Scoring box scores contained more nuanced stats about how the teams scored, like fast break points, points off turnovers, and points from mid range shots.

Historical betting lines were a little harder to find; however, I was able to use Selenium to scrape data from www.sportsbookreview.com going back to the 2006–07 season.

Boxscore Data:

Spread Data:

II. Cleaning Data

For the box score data, I wrote a function that:

  1. Changes W/L to 1/0
  2. Renames franchise to their most recent abbreviation
  3. Converts GAME_DATE to datetime object
  4. Creates a binary HOME_GAME feature.
  5. Removes rows where advanced stats were not collected
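A condensed sketch of such a cleaning function might look like this. Column names follow nba_api conventions; the HOME_GAME rule relies on away matchups containing '@', and the abbreviation map is an illustrative subset:

```python
import pandas as pd

# Franchises whose abbreviations changed during 2000-2020 (illustrative subset)
ABBR_FIXES = {'NJN': 'BKN', 'SEA': 'OKC', 'NOH': 'NOP', 'VAN': 'MEM'}

def clean_boxscores(df):
    df = df.copy()
    df['WL'] = df['WL'].map({'W': 1, 'L': 0})                  # W/L -> 1/0
    df['TEAM_ABBREVIATION'] = df['TEAM_ABBREVIATION'].replace(ABBR_FIXES)
    df['GAME_DATE'] = pd.to_datetime(df['GAME_DATE'])          # string -> datetime
    # Away matchups look like 'LAL @ SEA'; home ones like 'SEA vs. LAL'
    df['HOME_GAME'] = (~df['MATCHUP'].str.contains('@')).astype(int)
    df = df.dropna(subset=['OFF_RATING'])  # drop games missing advanced stats
    return df
```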

For the betting spread data I

  1. Renamed the teams to the abbreviations the boxscore data used
  2. Expanded spreads into multiple columns and created a column that took the mode of the four betting lines (as they were from four different books).
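Taking the consensus line across the four books can be done with a row-wise mode in pandas; the book_1 through book_4 column names are hypothetical:

```python
import pandas as pd

lines = pd.DataFrame({
    'book_1': [-7.0, 3.5],
    'book_2': [-7.0, 3.0],
    'book_3': [-6.5, 3.0],
    'book_4': [-7.0, 3.0],
})

# mode(axis=1) returns the most common value per row; column 0 holds
# the first mode in case of ties
lines['consensus_spread'] = lines[['book_1', 'book_2', 'book_3', 'book_4']].mode(axis=1)[0]
```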

You can view the full functions on my GitHub.

Finally, I merged these two datasets on GAME_DATE and team (as no team plays twice on the same day).

Here are the results of my merged data:

This cleaning and preprocessing step took a lot longer than described above, but I summarized and left out some steps for brevity, as data wrangling is not the most interesting (albeit necessary) step.

III. Feature Engineering

We of course cannot simply train a model on the dataset above: each row contains information about the event we are trying to predict. For the model to be useful, I can only use information known prior to the game. To do this, I decided to use an exponentially weighted moving average (EWMA). An EWMA gives more weight to recent games, which I thought might better capture a team’s current performance.

The EWMA can be calculated recursively with this formula:

EWA_t = α · x_t + (1 − α) · EWA_{t−1}

The most important term here is alpha (α). An alpha of 0.1 means that the next EWA takes 10% of the current observation x_t and 90% of the previous EWA.

I played around with different alphas and ended on 0.1, meaning I gave the most recent game a weight of 10%.
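As a sketch, these features can be computed per team with pandas’ ewm; adjust=False matches the recursive formula, and the shift by one game ensures each row only sees stats from prior games. Column names are illustrative:

```python
import pandas as pd

def add_ewma_features(df, stat_cols, alpha=0.1):
    """Add exponentially weighted averages of each team's *previous* games,
    so no information leaks from the game being predicted."""
    df = df.sort_values('GAME_DATE').copy()
    for col in stat_cols:
        df[f'{col}_EWMA'] = (
            df.groupby('TEAM_ABBREVIATION')[col]
              .transform(lambda s: s.ewm(alpha=alpha, adjust=False).mean().shift(1))
        )
    return df
```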

Lastly, to reduce the number of features, I computed the differences between the teams’ respective stats. For example, I subtracted Team B’s FG2M from Team A’s FG2M.

In addition to the box score stats, I also added replicated Elo ratings for each team prior to the game, the number of rest days, the spread and money line, and the teams’ win and cover percentages.

Here is a snippet of what my data looked like at this point:

Metric

I originally intended to use accuracy, (TP + TN) / (TP + FP + TN + FN), because the classes were balanced and accuracy is an easy-to-understand metric that would let me see if my model could exceed 52.38%. However, shortly after creating, tuning, and testing my models, I realized that no model could consistently beat 52.38%.

Thus, I shifted to F1 Score. F1 Score is the harmonic mean of precision and recall.

Precision = True Positives / (True Positives + False Positives), or “of all the predicted positives (covers), what percentage were correctly identified?”

Recall = True Positives / (True Positives + False Negatives), or “of all the actual positives, what percentage were correctly identified?”

I chose F1 because I knew it would be extremely difficult to bet every game, so I would need to assign a confidence threshold at which my model would make bets. This threshold would try to balance precision and recall, reducing false positives and false negatives, the instances where we lose money.
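For reference, scikit-learn computes all three metrics directly; the labels below are made up (1 = covered the spread, 0 = did not):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]  # hypothetical actual cover results
y_pred = [1, 0, 0, 1, 1, 1]  # hypothetical model predictions

precision = precision_score(y_true, y_pred)  # of predicted covers, share correct
recall = recall_score(y_true, y_pred)        # of actual covers, share caught
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
```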

IV. Creating Models

I will train the following models:

  1. Logistic Regression
  2. SGD Classifier (modified huber loss)
  3. Linear SVC
  4. Random Forest Classifier
  5. LGBM Classifier
  6. K-Nearest Neighbors
  7. Stacking Classifier using models 1–6

I will train my models on the 2006–2015 seasons and use the 2016–2020 seasons to test them. This is roughly a 66/34 split.

V. Evaluating Results

After optimizing the hyperparameters for each model, I was able to achieve the following results:

While my LGBM model performed best on my test set, no model was able to exceed the 52.38% threshold required to be profitable. Because of this, I decided to try one last thing. Perhaps there is just too much unpredictability in sports to be betting every game, and I need to be more selective about which games to bet on.

I decided to test this out using scikit-learn’s predict_proba method and seeing if only making bets above a given probability threshold could be profitable. I wrote a function, find_optimal_threshold, to test all probability thresholds at 0.01% intervals and see which one gave the highest profit on my test set.
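A sketch of what find_optimal_threshold might look like, assuming -110 odds (+$100 for a win, -$110 for a loss, per unit bet):

```python
import numpy as np

def find_optimal_threshold(probs, outcomes, win=100, loss=110):
    """Scan confidence thresholds at 0.01% intervals and return the one that
    maximizes profit when betting only games above the threshold."""
    best_threshold, best_profit = None, float('-inf')
    for threshold in np.arange(0.50, 0.70, 0.0001):
        mask = probs >= threshold               # games confident enough to bet
        profit = np.where(outcomes[mask] == 1, win, -loss).sum()
        if profit > best_profit:
            best_threshold, best_profit = threshold, profit
    return best_threshold, best_profit
```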

For my LGBM classifier, this threshold was 52%. If my model only bet on games where it was at least 52% confident, it would have bet 2,250 games (about 20%) over 5 seasons with a win percentage of 55.42%. This proved to me that there was some threshold at which my model could be profitable. The problem was finding that threshold. I knew I couldn’t just use 52%, since I used my model’s performance over the test seasons to get that number; in a real scenario I wouldn’t know the optimal threshold before the season began.

I decided to use a threshold of 52.38% because that was the winning percentage I knew I needed to be profitable. Intuitively this makes sense, because if my model is less than 52.38% confident of a decision, the expected value of that bet would be negative.

When using this threshold, I found that the Stacked Classifier was able to generate the most profit in my test set.

Finally, to make testing my model more realistic, I tested each season individually, retraining the model on all the seasons before it.

VI. Betting Strategies

What makes a good betting strategy? Of course we want to maximize our winnings, but most bettors also want to minimize variance so that they don’t suffer huge swings that could potentially bankrupt them.

Two common measures of betting strategies are ROI (return on investment) and yield.

ROI is Profit / Initial Investment (Starting Bankroll)
Yield is Profit / Total Amount Risked
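For example, $28 of profit on a $100 starting bankroll, earned over 300 bets of $3 each, is a 28% ROI but only about a 3.1% yield:

```python
def roi(profit, starting_bankroll):
    """Return on investment: profit relative to the starting bankroll."""
    return profit / starting_bankroll

def bet_yield(profit, total_risked):
    """Yield: profit relative to the total amount wagered across all bets."""
    return profit / total_risked

print(roi(28, 100))                  # 0.28 -> 28% ROI
print(round(bet_yield(28, 900), 4))  # 0.0311 -> ~3.11% yield
```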

Using the stacked model, I will test out three different betting strategies:

  1. Bet a constant amount on each bet (constant_strategy)
  2. Bet a constant percentage of your bankroll (percentage_strategy)
  3. Vary bet size based on model confidence (threshold_strategy)

For each of these strategies I will start with a bankroll of $100 and simulate betting the games in a season using the given strategy.
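All three strategies can be simulated with the same loop, swapping out only the staking rule. This is a simplified sketch at -110 odds, and the stake sizes in threshold_strategy are hypothetical:

```python
def simulate(bets, stake_fn, bankroll=100.0, payout=100 / 110):
    """bets: (model_confidence, won) tuples in chronological order.
    stake_fn decides the wager from the current bankroll and confidence.
    A win at -110 odds pays 100/110 of the stake."""
    for confidence, won in bets:
        stake = min(stake_fn(bankroll, confidence), bankroll)  # can't overbet
        bankroll += stake * payout if won else -stake
    return bankroll

def constant_strategy(bankroll, confidence):
    return 3.0                       # flat $3 per bet

def percentage_strategy(bankroll, confidence):
    return 0.03 * bankroll           # 3% of current bankroll

def threshold_strategy(bankroll, confidence):
    return 2.0 if confidence < 0.55 else 4.0  # size up with confidence
```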

Constant Bet Size ($3)

With this method we are profitable overall, though we take losses in the 2017 season. On average, we earn an impressive 28.04% ROI per season.

Percentage of Bankroll (3% of bankroll)

This is a common system where our bet size is proportional to our bankroll. On the plus side, this system mitigates our risk because our bet size decreases proportionally to our bankroll and thus we will never go bankrupt. The downside is that it becomes harder to make significant gains if we get into a hole, as our bet size is smaller and we have to win more times to make the money back.

Varying Bet Size by Model Confidence

In this final strategy, I vary the bet amounts based on the model’s confidence (predicted probability), shown in the table below.

Bet Size Thresholds

This was by far the most profitable betting strategy, with an average return of 40.8% per season. Here are graphs tracking the bankroll over each of the seasons.

It’s interesting to look at the flow of wins and losses over a whole season. In every season the bankroll at some point drops below its starting value, sometimes nearly going bankrupt. If we were actually placing these bets, it might be psychologically tough to convince ourselves that the model works.

For example, in the 2016–17 season our bankroll falls as low as $10.09, from which we go on a run up to $229 and then end at $198. Knowing when to keep betting and when to stop is one of the hardest parts of gambling, and that remains true even with a model to rely on.

Results

In the end, my model was able to make a significant profit, averaging about 40.8% per year over the past 5 seasons. While this beats the S&P 500 (which averages about 10% per year), it also comes with a much higher degree of risk. For example, in one of the five seasons we lost 30% of our initial investment. Losses of that size are far rarer when investing in a total market index fund. Furthermore, anyone who invests knows the saying “past performance is no guarantee of future results.” While my model and betting strategy performed well over these 5 years, who knows what next year could look like.

That said, as someone who does enjoy sports betting (as a hobby, not profession) I am looking forward to testing it out and tracking my results for the 2021–22 season.

Next Steps

There are many ways I can build upon the model I have created.

  1. Handling injuries, trades, and other real time news
    Currently my model is blind to any real time news (injuries, DNPs, trades) that can affect the betting lines and a team’s strength. When I actually use my model to place bets next season, I will either need to find a way to incorporate this information into my model or manually decide some news makes it a no bet situation.
  2. Using tracking data
    The NBA collects incredibly detailed data on players and teams using computer vision, such as shot type, number of drives, passes, touches, and distance covered. These could be valuable new features for the model. The problem is that this collection started in 2017–18, so there are only 4 seasons of data available. I may play around with the little data we have to see what results I can get, but more likely I will need to wait for more data to produce something robust.
  3. Betting Trends and Public Opinion
    Lastly, there is information bettors use that isn’t directly related to the teams’ stats or strengths. Betting trends, such as the home team winning ATS in Game 1 of the Finals in 16 of the last 19 seasons, would be incredibly useful if we could figure out how to include them as features in our model. Incorporating public opinion and the money on each side could also be valuable. For example, if 80% of the money is coming in on one side but only 30% of the bettors are on that side, it could indicate that the sharps are on that side.

I look forward to continuing to work on these ideas in preparation for the 2021–22 season, and will continue to document my process. Thanks for reading.
