Machine Learning Model for NFL Betting (Model 5.0)

The Factory of Sadness
May 12, 2023


Sports betting has become a popular and lucrative industry in recent years, with many people looking to use data and technology to gain an edge in their betting strategies. Machine learning, a powerful subset of artificial intelligence, has emerged as a popular tool for building predictive models that can help sports bettors make more informed decisions.

In this article, we will explore the basics of creating a sports betting model using machine learning, focusing specifically on the National Football League (NFL). We will discuss the data sources and preprocessing steps needed to build a robust model, as well as feature engineering and model selection techniques. By the end of this article, you will have a foundational understanding of how machine learning can be used to build a successful sports betting model, and you will be equipped with the tools and knowledge needed to get started on your own model-building journey.

In order to conduct a thorough analysis of NFL betting data, we developed four primary models.

  1. Simple Classification Model: A logistic regression model was used where predicting variables were limited.
  2. Simple Regression Model: An XGBoost regression model was used where predicting variables were limited.
  3. Complex Classification Model: A logistic regression model was used with numerous predicting variables.
  4. Complex Regression Model: An XGBoost regression model was used with numerous predicting variables.

The code for the models and supporting details can be found at the links below. This article hits the highlights of the analysis, while the kaggle site has the results from each block of code.

#1 and #2 — Kaggle Simple Model v5

#3 and #4 — Kaggle Complex Model v5

The two main techniques used for the models were logistic regression for classification models and XGBoost for regression models.

See below for specifics of what was predicted within each quadrant. Classification models are used to answer binary questions, such as whether a team will cover the spread. Logic was incorporated into the models so that a push on the spread is treated the same as a loss. Regression models, on the other hand, are built to predict a specific amount: here, how much the home team and the away team will score, from which you can deduce who will cover.
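To make the push-counts-as-a-loss logic concrete, below is a minimal sketch of how the spread-cover label could be built; the column names and the sign convention on PointSpread are assumptions for illustration, not necessarily the exact fields in the models.

#sketch: build the home-team spread-cover label, treating a push the same as a loss
#column names and the PointSpread sign convention (negative when the home team is favored) are assumptions
import pandas as pd

def add_home_cover_label(df):
    margin = df['HomeTeamScore'] - df['AwayTeamScore']      #home margin of victory
    covered = margin + df['PointSpread'] > 0                 #strictly greater than zero, so a push fails
    df['HomeTeamCover'] = (~covered).astype(int)             #0 = covered, 1 = push or loss
    return df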

Results on Analysis

Accuracy is the proportion of correctly classified instances (true positives and true negatives) out of the total number of instances. In simple terms, it is the percentage of correct predictions made by the model.

Mean absolute error (MAE) measures the average absolute difference between the predicted values and the true values. A lower MAE indicates that the model is making more accurate predictions.

Below are the simplified results of the analysis on the training set of data.

The complex model yielded better results; however, those results are predicated on the accuracy of some of the predicting variables. The team stats data set is the one that requires the most judgment. For example, one of the stats in the initial data pull for team stats is rushing touchdowns. Rushing touchdowns was stripped out of the final data set used in the model: even though we aren't forecasting rushing touchdowns directly, it is so highly correlated (depending on the team) with total score that forecasting it accurately as a predicting variable would be nearly as hard as forecasting the total score itself.

A case could be made to keep rushing touchdowns in the final data set: maybe a team is run heavy and the opposing team has a poor run defense, so even if the prediction isn't perfect it would help guide the overall score prediction. Rushing yards, rushing yards per attempt, opponent rushing yards and opponent rushing yards per attempt were left in, based on judgment, to capture mismatches between rushing offense and defense.

To build further off of the challenges with using team stats, a question becomes: how do you forecast team stats outside of the training set? Do you use a three-year average, a five-month average, or a forecasting model within the overall forecasting model, driven by certain predicting variables, to predict outside of the test period? How often do you update? How this is handled drives the maintenance needed to create the most accurate forecast each week.
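One low-maintenance option is to carry a trailing rolling average of each team stat forward as the forecast for upcoming weeks. A minimal sketch, where the column names ('Team', 'Week' and the stat itself) are assumptions for illustration:

#sketch: forecast a team stat outside the training window with a trailing rolling mean
import pandas as pd

def add_trailing_average(df, stat, window=16):
    df = df.sort_values(['Team', 'Week'])
    #shift(1) so each game only uses information available before kickoff
    df[stat + '_forecast'] = (df.groupby('Team')[stat]
                                .transform(lambda s: s.shift(1).rolling(window, min_periods=1).mean()))
    return df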

The results of the simple model for classification were slightly better than 50/50. That may be all that is needed for an edge, and the model requires little upkeep. If some of the predicting variables that require little judgment to forecast (i.e., referee name, stadium details) were added to the simple model, it might increase the edge even further. Another technique to squeeze out a bit more accuracy would be to use a different classification model than logistic regression (see the Lazy Predict section at the end for more details). Somewhere in between the simple and complex models may lie a better blend of accuracy and upkeep. I would hesitate to use the simple model for regression, based on the accuracy of results on the train/test set.

The below shows the most important predicting variables for the respective models. See further below for more details on the impact of predicting variables on each model.

Predicting Variables

The initial step in constructing a machine learning model for sports betting is acquiring the requisite data sources. However, this can be a complicated process, as the data is often not prepared in a manner that facilitates effective integration into the model. Three years of historical data was used to create the models.

Sources of data for the Complex Model were as follows:

  • Fantasy data — data on stadium details, team stats and betting data. (https://fantasydata.com/api/api-documentation/nfl#/odds)
  • NFL Penalties — who ref’d the game (https://www.nflpenalties.com/)
  • FiveThirtyEight — details on team rankings (https://github.com/fivethirtyeight/data/tree/master/nfl-elo)
  • Google Trend data — shows search stats by week (https://trends.google.com/home)

Of the sources listed above, only the betting data and Google Trends data were used in the Simple Model.

Fantasy data

The API into the fantasy data set provided a plethora of great data. See below for samples of the data used (see kaggle links for full details).
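For context, the pull boils down to an authenticated GET request flattened into a dataframe. The sketch below is illustrative only: the endpoint URL and the auth header name are placeholders rather than the documented fantasydata routes, so refer to the API documentation linked above.

#illustrative only: the URL and header below are placeholders, not the real fantasydata endpoints
import requests
import pandas as pd

API_KEY = 'your-api-key'
url = 'https://api.example.com/nfl/odds/2022'                      #placeholder endpoint
resp = requests.get(url, headers={'Subscription-Key': API_KEY})    #placeholder auth header
resp.raise_for_status()
odds_df = pd.DataFrame(resp.json())                                #flatten the JSON payload
print(odds_df.head())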

Stadium Details

The below provides a snapshot of the final product after transformation. PlayingSurface was classified as 1 for artificial, 2 for dome and 3 for grass. Type was 1 for dome, 2 for outdoor or 3 for retractable dome.
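A minimal sketch of that encoding, assuming the raw columns arrive as text labels (the exact label spellings are assumptions):

#encode the stadium text labels as the integers described above
surface_map = {'Artificial': 1, 'Dome': 2, 'Grass': 3}
type_map = {'Dome': 1, 'Outdoor': 2, 'RetractableDome': 3}    #label spellings are assumptions
df['PlayingSurface'] = df['PlayingSurface'].map(surface_map)
df['Type'] = df['Type'].map(type_map)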

Team stats

As mentioned, this is, in my opinion, the hardest data set to deal with. There was so much data in the initial pull. See below for a sample. The site has a great dictionary to help understand the stats better (https://fantasydata.com/api/data-dictionary/nfl). Judgment was used in deciding which stats to keep for the final model.

Betting details

This was my favorite data set; it showed consensus odds for games and results. Code was layered on top of the data set to determine which bets won and which bets lost. OverUnder is the odds on the total score of the game, while TotalScore is the actual total score of the game.
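As an illustration of that layer of code, a minimal sketch that grades the totals bet from these two columns, using the same 0/1 convention described later in the article (0 = over, 1 = under or push):

#grade the totals bet from the consensus line and the actual result
df['TotalLessOverUnder'] = df['TotalScore'] - df['OverUnder']            #positive means the game went over
df['OverResult'] = (df['TotalScore'] <= df['OverUnder']).astype(int)     #0 = over, 1 = under or push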

NFL Penalties

Referee Details

I couldn't find a data set to pull this in via an API, so I created this one myself by scraping the NFL Penalties website. In order to use this variable, the code written requires staying up to date on the list of head referees in the league. I don't think there is a high attrition rate for NFL referees, however I did notice a few dropping off. The below was used to assign a number to each referee name, and the referees were then assigned to each game.

referee_map = {'Adrian-Hill': 1, 'Alex-Kemp': 2, 'Bill-Vinovich': 3, 'Brad-Allen': 4, 'Brad-Rogers': 5, 'Carl-Cheffers': 6, 'Clay-Martin': 7, 'Clete-Blakeman': 8, 'Craig-Wrolstad': 9, 'Jerome-Boger': 10, 'John-Hussey': 11, 'Ron-Torbert': 12, 'Scott-Novak': 13, 'Bill-Vinovich': 14, 'Shawn-Hochuli': 15, 'Shawn-Smith': 16, 'Tra-Blake': 17, 'Land-Clark': 18, 'Tony-Corrente': 19}
#note: 'Bill-Vinovich' appears twice (3 and 14) in the original lookup; in a Python dict the second entry overrides the first
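Applying that lookup to the game records is then a one-line map; the 'Referee' column name is an assumption for illustration.

#map the scraped referee names onto each game; a new referee missing from the lookup shows up as NaN
df['RefereeId'] = df['Referee'].map(referee_map)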

FiveThirtyEight

Team Details

FiveThirtyEight provides various data and analytics related to NFL teams, games, and betting odds. This includes data on team rankings, win probabilities, and point spreads. The data went back to 1920; I only needed the last three years, so I stripped a lot of it out. There was also a variable for whether a game was played on a neutral site.

  • elo1_pre and elo2_pre represent the pre-game Elo rating of the two teams playing the game.
  • qbelo1_pre and qbelo2_pre are an extension of the Elo system that incorporates quarterback adjustments, representing the pre-game QB-adjusted Elo ratings of the two teams.

In the Elo system, each player or team starts with a rating, which is a numerical value that represents their skill level. When two players or teams compete against each other, the outcome of the match is used to adjust their ratings. If a player or team with a lower rating defeats a player or team with a higher rating, their rating will increase more than if they had defeated a player or team with a similar rating. Conversely, if a player or team with a higher rating defeats a player or team with a lower rating, their rating will only increase slightly, if at all.
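Concretely, the standard Elo update works as sketched below; this is the generic formula, not FiveThirtyEight's exact K-factor or quarterback adjustments.

#generic Elo update: expected score from the rating gap, then nudge ratings by K times the surprise
def elo_update(rating_a, rating_b, score_a, k=20):
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))   #win probability implied by the ratings
    change = k * (score_a - expected_a)                          #score_a is 1 for a win, 0.5 for a tie, 0 for a loss
    return rating_a + change, rating_b - change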

Google Trend Data

Trend data

Google Trends data is a tool that provides insights into the popularity of search terms on Google. The numbers represent search interest relative to the highest point on the chart for the given region and time: a value of 100 is the peak popularity for the term, and a value of 0 means there was not enough data for the term.

The hope with this variable is that it would capture the popularity of a team, which may skew betting odds somehow.
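One way to pull this programmatically is the unofficial pytrends library; a minimal sketch, with the search term and five-year window as assumptions:

#pull weekly Google Trends interest (scaled 0-100) for a single search term
from pytrends.request import TrendReq

pytrends = TrendReq(hl='en-US', tz=360)
pytrends.build_payload(['Cleveland Browns'], timeframe='today 5-y', geo='US')
interest = pytrends.interest_over_time()
print(interest.head())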

Other Predicting Variables

Weather

Early models used weather from the National Oceanic and Atmospheric Administration (NOAA) site; however, the complexity of making sure the weather station used lined up with the geographical location of each stadium, along with the complexity of handling domes, ultimately led to weather variables not being used.

Time Based Predicting Variables

Rather than using weather data, time-based variables were added in hopes of capturing weather-type factors. Time-based factors were also used to capture quick turnaround times for Thursday games, late-night games, etc. The time-based features did show up among the features used, so I think they are important variables. They are also easy to factor in as predicting variables outside of the training data.

#adding time features from the game's DateTime
df['hour'] = df['DateTime'].dt.hour
df['dayofweek'] = df['DateTime'].dt.dayofweek
df['quarter'] = df['DateTime'].dt.quarter
df['month'] = df['DateTime'].dt.month
df['year'] = df['DateTime'].dt.year
df['dayofyear'] = df['DateTime'].dt.dayofyear
df['dayofmonth'] = df['DateTime'].dt.day
#dt.weekofyear is deprecated in newer pandas; isocalendar().week is the replacement
df['weekofyear'] = df['DateTime'].dt.isocalendar().week.astype(int)

Whether or not betting was legal in a state

I went through state by state and team by team to determine whether it was legal to bet in the state. Ultimately I did not use this variable: I wasn't comfortable that I had the information right by year, and it also made the simple model less accurate, so I threw it out.

#https://www.actionnetwork.com/news/legal-sports-betting-united-states-projections
#0 = betting legal, 1 = not legal
import pandas as pd

# create the dataframe of home/away betting legality by team id
legal = pd.DataFrame({'Team': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '19', '20', '21', '22', '23', '24', '25', '26', '28', '29', '30', '31', '32', '33', '34', '35'],
                      'HomeLegal': [0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0],
                      'AwayLegal': [0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0]})

legal['Team'] = legal['Team'].astype(int)
legal['HomeLegal'] = legal['HomeLegal'].astype(int)
legal['AwayLegal'] = legal['AwayLegal'].astype(int)

Final Data Used in Models

Below is the final data used in the complex model.

Only the Google Trends data, time-based variables and betting data were used in the simple model.

Visuals and Analysis on Data

So many interesting visuals came out of the data. I picked out my favorites below, but there are plenty more on the kaggle site.

OverUnder to TotalScore

The first is simply a comparison of the OverUnder odds to the actual TotalScore of each game over the last three years. I think this visual shows the complexity of predicting the total score of a game.

Average OverUnder to TotalScore by Week

The below takes the average TotalScore by week and compares it to the average OverUnder by week. No real trend noted.

TotalScore less OverUnder

The below chart looks at the Browns to determine the variability in TotalScore less OverUnder odds. A similar analysis was done for each team.

The below is just a data dump of TotalScore less OverUnder to look for trends. A positive result indicates TotalScore is greater than OverUnder.

See below visual for the makeup of a box and whiskers plot.

The below box and whiskers plot on total score would indicate scoring is trending down from 2020.

Whether a team was over on the OverUnder bet

A 0 represents the team went over. A 1 represents under or push. The percentage on the far right represents over divided by total.

The below is sorted by greatest to least on teams that were over the OverUnder.

Whether a team covered the spread

A 0 means the team covered. A 1 is a push or a loss. The percentage total is covered/total.

Sorted by teams that covered the most. Tampa Bay at 45% would indicate Tom Brady is not always easy $.

Google Trend Data

Below represents the average interest by week over the last 5 years. There are some flaws in the approach to this data pull; for example, it didn't account for team names changing. No surprises in the top 3, but I was extremely surprised by the bottom 3.

Deeper Dive on Model Accuracy

Accuracy is the number of true positives plus true negatives divided by the total number of observations in the dataset.

Precision is the ratio of true positives (TP) to the total number of positive predictions made by the model, which includes both true positives and false positives (FP). In other words, precision measures the accuracy of positive predictions made by the model.

precision = TP / (TP + FP)

Recall is the ratio of true positives (TP) to the total number of actual positive instances, which includes both true positives and false negatives (FN). In other words, recall measures how many of the actual positives the model correctly identifies.

recall = TP / (TP + FN)

F1 Score is a harmonic mean of precision and recall, which provides a single score that balances both metrics.

F1 score = 2 * (precision * recall) / (precision + recall)
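All four metrics are available directly from scikit-learn once a model has produced predictions; a minimal sketch, assuming a fitted classifier named model and a held-out test split:

#classification metrics on the held-out test set
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = model.predict(X_test)
print('accuracy :', accuracy_score(y_test, y_pred))
print('precision:', precision_score(y_test, y_pred))
print('recall   :', recall_score(y_test, y_pred))
print('f1 score :', f1_score(y_test, y_pred))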

Simple Model Logistic Regression

Home Team Prediction Logistic

Metrics on model.

Comprehensive list of most important variables to the model.
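For reference, the importance ranking for a logistic regression can be read from the magnitude of its fitted coefficients; a minimal sketch, assuming the features have been scaled so the coefficients are comparable across variables:

#rank predicting variables by the absolute size of their logistic regression coefficients
import pandas as pd
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
importance = pd.Series(logreg.coef_[0], index=X_train.columns).sort_values(key=abs, ascending=False)
print(importance.head(20))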

Away Team Prediction Logistic

Metrics on model.

Comprehensive list of most important variables to the model.

OverUnder Prediction Logistic

Metrics on model.

Comprehensive list of most important variables to the model.

Complex Model Logistic Regression

Home Team Prediction Logistic

Metrics on model.

Comprehensive list of most important variables to the model (there are more on kaggle, list goes to 141).

Away Team Prediction Logistic

Metrics on model.

Comprehensive list of most important variables to the model (there are more on kaggle, list goes to 141).

OverUnder Prediction Logistic

Metrics on model.

Comprehensive list of most important variables to the model.

Simple Model XGboost Regression

Mean squared error (MSE) measures the average of the squared differences between the actual target values and the predicted values. The formula for MSE is:

MSE = (1/n) * Σ(yᵢ - ŷᵢ)²

where yᵢ is the actual target value, ŷᵢ is the predicted value, and n is the number of observations.

MSE assigns higher weights to larger errors, making it more sensitive to outliers. A lower MSE value indicates a better fit of the model to the data.

Mean absolute error (MAE) measures the average of the absolute differences between the actual target values and the predicted values. The formula for MAE is:

MAE = (1/n) * Σ|yᵢ - ŷᵢ|

where yᵢ is the actual target value, ŷᵢ is the predicted value, and n is the number of observations.

MAE assigns equal weights to all errors, making it less sensitive to outliers. A lower MAE value indicates a better fit of the model to the data.

R-squared (R²) measures the proportion of the variance in the target variable that is explained by the regression model. R² ranges from 0 to 1, where 0 indicates that the model explains none of the variance in the target variable and 1 indicates that the model explains all of the variance. The formula for R² is:

R² = 1 - (SSᵣₑₛ / SSₜₒₜ)

where SSᵣₑₛ is the sum of the squared residuals (i.e., the squared differences between the actual target values and the predicted values) and SSₜₒₜ is the total sum of squares (i.e., the sum of the squared differences between the actual target values and the mean target value).

A higher R² value indicates a better fit of the model to the data. However, R² should not be used as the sole metric for evaluating a model’s performance, as it does not take into account the complexity of the model or the potential for overfitting. Other metrics, such as MSE and MAE, should be used in conjunction with R² to provide a more comprehensive evaluation of a regression model’s performance.
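All three are one-liners in scikit-learn; a minimal sketch, assuming a fitted regressor named model and a held-out test split:

#regression metrics on the held-out test set
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_pred = model.predict(X_test)
print('MSE:', mean_squared_error(y_test, y_pred))
print('MAE:', mean_absolute_error(y_test, y_pred))
print('R2 :', r2_score(y_test, y_pred))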

Total Score Prediction

Metrics on model.

F Score on predicting variables

Complex Model XGboost Regression

Metrics on model.

F Scores on predicting variables.
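The F score chart comes from XGBoost's built-in importance plot, where the F score counts how often each variable is used to split a tree. A minimal sketch, assuming X_train and y_train hold the complex-model features and the total score target; the hyperparameters shown are illustrative only:

#fit an XGBoost regressor and plot feature importance (F score = number of times a variable is split on)
import xgboost as xgb
import matplotlib.pyplot as plt

reg = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05)   #illustrative hyperparameters
reg.fit(X_train, y_train)
xgb.plot_importance(reg, max_num_features=20)
plt.show()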

Discussion on Lazy Predict

Scattered throughout the kaggle code is a module called lazy predict. Lazy Predict is a Python library that provides a quick and easy way to test and compare multiple machine learning models using a single line of code. By default, lazy predict uses the default hyperparameters for each model. See below for the output on the complex predicting variables for the regression.
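A minimal sketch of how that comparison runs; LazyRegressor trains its default battery of models in a single fit call:

#run lazy predict's default battery of regressors and compare them on the test split
from lazypredict.Supervised import LazyRegressor

reg = LazyRegressor(verbose=0, ignore_warnings=True)
models, predictions = reg.fit(X_train, X_test, y_train, y_test)
print(models)          #one row per model, ranked by R-squared, with RMSE and fit time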

I learned of a few new models from this module. Since the above uses default hyperparameters, I took the LassoCV, OrthogonalMatchingPursuitCV (OMP) and XGBoost models and used code to find the optimal hyperparameters. Hyperparameters are distinct from the predicting variables of the model, which are learned from data during the training process; examples of hyperparameters include the learning rate, regularization strength, number of hidden layers in a neural network, batch size, and number of iterations or epochs for training. I believe you can adjust the lazy predict module to find optimal hyperparameters for each model, but my 2018 MacBook Pro with 8GB of RAM would probably explode going through that many iterations. Instead, I just found optimal hyperparameters for these three models. See results below for the three models.

The above would indicate to use the OMP model, however I went with XGboost based on my comfort level with XGboost.
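The tuning step mentioned above can be done with scikit-learn's grid search; a minimal sketch for the XGBoost regressor, where the parameter grid itself is an assumption rather than the grid actually used:

#small grid search over a few XGBoost hyperparameters
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

param_grid = {
    'n_estimators': [200, 500],
    'max_depth': [3, 5],
    'learning_rate': [0.01, 0.05, 0.1],
}
search = GridSearchCV(xgb.XGBRegressor(), param_grid, cv=5, scoring='neg_mean_absolute_error')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)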

There is also a lazy predict module for classification. See below for results on the simple model.

The results from the lazy predict module for classification indicate it may be worthwhile to explore SVC further for better results. For now, I stuck with LogisticRegression.

Train/Test Split Logic

Developing an effective model requires careful consideration of how to split the data into training and testing sets. In our case, we obtained three years of game data for both regular season and playoffs, resulting in a dataset of 802 games.

A test size of 0.2 was chosen to allocate roughly 160 games for testing, ensuring that sufficient data was reserved for training the model while still leaving enough held-out games to evaluate it. This balance between training and testing data is critical: if too little data is held back for testing, overfitting becomes harder to detect and the estimate of how well the model generalizes to new data becomes unreliable.

#split the data into training and testing
from sklearn.model_selection import train_test_split

#drop identifiers, post-game results and target-related columns from the feature set
X = merged_df.drop(['ScoreId', 'Day', 'DateTime', 'Status', 'AwayTeamName', 'HomeTeamName', 'HomeTeamScore', 'AwayTeamScore', 'TotalScore', 'HomeRotationNumber', 'AwayRotationNumber', 'PregameOdds', 'GameOddId', 'Sportsbook', 'Created', 'Updated', 'DrawMoneyLine', 'SportsbookId', 'dayofyear', 'dayofmonth', 'HomeTeamCover', 'AwayTeamCover', 'BetOutcome'], axis=1)
y = merged_df['HomeTeamCover']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

It is important to note that the process of splitting the data into training and testing sets can be iterative and may require trial and error to determine the most effective approach. Nonetheless, by thoughtfully splitting our data, we can build a reliable and robust model for predicting the outcomes of NFL games.

We also set a random_state, which fixes the seed for the shuffle that train_test_split performs by default. Typically with a time series forecast you take the first ~80% of the data as the training set and the last ~20% as the testing set, making sure the split is done sequentially. To ensure that our model was not biased towards any particular weather pattern, and to account for the inherent unpredictability of NFL games, we used the shuffled split with a fixed random_state as part of our data splitting approach.
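For comparison, the conventional sequential split for time series work would look like the sketch below, holding out the most recent ~20% of games rather than a shuffled sample.

#alternative: a chronological split that holds out the most recent games instead of shuffling
merged_df = merged_df.sort_values('DateTime')
cutoff = int(len(merged_df) * 0.8)
train_df, test_df = merged_df.iloc[:cutoff], merged_df.iloc[cutoff:]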
