How I Achieved a 56.4% Accuracy and 7.6% ROI Using Machine Learning to Bet on NFL Totals (Over/Under)
How I beat the bookmakers using Machine Learning.
Background
I previously had success using weather data and machine learning (ML) to predict the 2023 Kentucky Derby’s winning time and MLB’s 2023 Totals (Over/Under). After my fantasy football teams were shown to be pathetic (week 3), I decided to see whether I could use weather data to accurately predict the Over/Under of NFL games.
Data
Historical data from 1966 through the end of the 2022 season (2/12/2023) was pulled from here: https://www.kaggle.com/datasets/tobycrabtree/nfl-scores-and-betting-data. I kept the games from 2000 onwards and used them to try to predict over (1) or under (0) based on the following predictors:
- Favorite team’s spread
- Over/under line
- Temperature (F), defaulted to 70 if in a stadium
- Wind speed (mph), defaulted to 0 if in a stadium
- Humidity, defaulted to 40 if in a stadium
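As a rough sketch of the preparation described above, the snippet below keeps games from 2000 onwards and applies the stadium defaults. The column names (`season`, `spread_favorite`, `over_under_line`, `temperature`, `wind_mph`, `humidity`, `in_stadium`, `total_points`) and the toy rows are assumptions for illustration and may not match the Kaggle file. How pushes (a total exactly equal to the line) were handled is not stated, so this sketch simply drops them.

```r
# Toy rows standing in for the Kaggle data; all column names are hypothetical.
games <- data.frame(
  season          = c(1999, 2005, 2010, 2022),
  spread_favorite = c(-3.0, -7.0, -2.5, -1.5),
  over_under_line = c(41.0, 44.5, 47.0, 51.0),
  temperature     = c(55, NA, 31, NA),
  wind_mph        = c(8, NA, 14, NA),
  humidity        = c(60, NA, 72, NA),
  in_stadium      = c(FALSE, TRUE, FALSE, TRUE),
  total_points    = c(38, 51, 40, 51)
)

# Keep the 2000 season onwards.
games <- games[games$season >= 2000, ]

# Stadium defaults from the list above: 70 F, 0 mph wind, 40 humidity.
games$temperature[games$in_stadium] <- 70
games$wind_mph[games$in_stadium]    <- 0
games$humidity[games$in_stadium]    <- 40

# Assumption: drop pushes, then code Over as 1 and Under as 0.
games <- games[games$total_points != games$over_under_line, ]
games$ou <- as.integer(games$total_points > games$over_under_line)

# Outcome plus the five predictors, ready for a formula like ou ~ .
model_data <- games[, c("ou", "spread_favorite", "over_under_line",
                        "temperature", "wind_mph", "humidity")]
```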
Weeks 3 to 14
Training and Testing
The data was split into 67% training and 33% testing. A gradient boosting machine (GBM) was fit with the following settings:
# Create gbm model #
library(gbm)

gbm_mod <- gbm(
  formula = ou ~ .,              # ou: 1 = Over, 0 = Under
  distribution = "bernoulli",
  data = train_data,
  shrinkage = 0.001,
  interaction.depth = 2,
  n.minobsinnode = 10,           # Default
  bag.fraction = 0.5,            # Default
  n.trees = 500,                 # Determined using 5-fold cross-validation
  n.cores = NULL,                # Will use all cores by default
  verbose = FALSE
)
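The 67%/33% split mentioned before the model call is not shown. One way it could be done in base R is sketched below; the seed, the random row-level split, and the toy `model_data` frame are assumptions, not necessarily the procedure actually used.

```r
set.seed(42)  # arbitrary seed, only so this sketch is reproducible

# Toy stand-in for the prepared data; `ou` is the 0/1 outcome.
model_data <- data.frame(
  ou              = rbinom(1000, 1, 0.5),
  spread_favorite = -runif(1000, 1, 14),
  over_under_line = runif(1000, 35, 55)
)

# Draw 67% of the row indices for training; the rest become the test set.
n          <- nrow(model_data)
train_idx  <- sample(seq_len(n), size = floor(0.67 * n))
train_data <- model_data[train_idx, ]
test_data  <- model_data[-train_idx, ]
```

A frame like `train_data` is what the `gbm()` call refers to through its `data` argument.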
We will use the median predicted probability as the cut-off between an Over and an Under prediction. Historically, Overs and Unders each hit about 50% of the time, which is expected because an even split guarantees the bookmakers a profit via the vig. The median predicted probability on the test set was 0.510, so any prediction below 0.510 is classified as an Under and any prediction above 0.510 is classified as an Over.
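The median cut-off can be sketched in a few lines. The probabilities below are made up for illustration; in practice they would come from something like `predict(gbm_mod, newdata = test_data, n.trees = 500, type = "response")`. What to do with a prediction exactly equal to the cut-off is not specified, so leaving it unclassified is an assumption.

```r
# Hypothetical predicted Over probabilities on a test set.
pred_prob <- c(0.482, 0.495, 0.503, 0.510, 0.517, 0.521, 0.530)

# The median prediction is the cut-off between Under and Over.
cutoff <- median(pred_prob)

# 1 = Over, 0 = Under; a prediction exactly at the cut-off is left as NA
# (assumption: no bet in that case).
pred_class <- ifelse(pred_prob > cutoff, 1L,
                     ifelse(pred_prob < cutoff, 0L, NA_integer_))
```

Using the median rather than 0.5 forces the predictions to split evenly between Over and Under, mirroring the roughly 50/50 historical outcomes.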
We’ll break the accuracy of the predictions on the test set down by the standard deviation of the prediction’s value:
Our test data indicates that following the model’s predictions with probabilities of 0.482 to 0.495 and 0.50 to 0.51 should lead to an accuracy of 391/721 = 54.2%. Interestingly, the model appears to have found an advantage in betting the Under in specific weather conditions that the bookmakers may not have picked up on. To be profitable at sports betting you need a win percentage (accuracy) above 52.4%, the break-even rate at standard -110 odds, which means using the model above would make us profitable. This is impressive because only ~3% of sports bettors are actually profitable.
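The 52.4% figure follows from standard -110 pricing, where 110 is risked to win 100. The article does not state the odds used, so -110 is an assumption here, and `breakeven_rate` and `roi_per_bet` are hypothetical helper names. A small sketch of the arithmetic:

```r
# Assumption: every bet is priced at American odds of -110.
american_odds <- -110

# Profit per unit staked on a winning bet at negative American odds.
win_payout <- 100 / abs(american_odds)        # about 0.909

# Break-even win rate p solves: p * win_payout - (1 - p) = 0
breakeven_rate <- 1 / (1 + win_payout)        # about 0.524

# Expected ROI per unit staked at a given accuracy.
roi_per_bet <- function(accuracy, payout = win_payout) {
  accuracy * payout - (1 - accuracy)
}

test_accuracy <- 391 / 721                    # about 0.542
round(breakeven_rate, 3)
round(test_accuracy, 3)
round(roi_per_bet(test_accuracy), 3)
```

Under this assumption, any accuracy above the break-even rate gives a positive expected ROI, and the gap between the two is what drives the size of the return.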
So how did we do?
Weeks 3 to 14 went great and showed that weather data and machine learning (ML) can be used to accurately predict NFL Totals. However, one thing that really bothered me was the three losing weeks in a row starting in week 12. You’ll notice that the last column, “Method”, shows “No Seasonality”. I found it interesting that my model started to perform poorly in week 12, which is Thanksgiving weekend. My theory is that the Over/Under outcome depends on the time of year (month). From week 15 to the end of the season, I included the month in my predictions.
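One way the month could be derived as an extra predictor is sketched below. The `game_date` column name, the date format, and the choice to keep the month as an integer are assumptions; the article does not say how the month was encoded.

```r
# Toy rows; `game_date` is a hypothetical column name.
games <- data.frame(
  game_date       = c("9/24/2023", "11/23/2023", "12/17/2023", "1/7/2024"),
  over_under_line = c(44.5, 46.0, 40.5, 38.0)
)

# Parse the date and pull out the calendar month (1 to 12).
parsed_date <- as.Date(games$game_date, format = "%m/%d/%Y")
games$month <- as.integer(format(parsed_date, "%m"))
```

With the month column present in the training data, a formula such as `ou ~ .` would pick it up as an additional predictor.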
Week 15 Onward
So how did my model change when I included the month?
Including the month increases our accuracy on the test data from 54.2% to 59.8%. So how did we do in the last few weeks of the 2023 NFL season with the month included?
Including the month in our predictions substantially improved our results, from 54.6% accuracy and a 4.2% ROI to 62.2% accuracy and an 18.5% ROI.
Overall Results
Going forward, I will use the model that includes the month to make my predictions, and I hope to achieve a long-term accuracy close to 60%.