Modeling Football: Combining ML models and Monte Carlo simulations

Ian Dorward
DraftKings Engineering
Feb 7, 2024

"Play clock to 5. Pass is… intercepted at the goal line by Malcolm Butler. Unreal…” Super Bowl XLIX

"Final play of the game in regulation… It is caught by Dyson. Can he get in? No, he cannot. Mike Jones made the tackle. And the Rams have won the Superbowl…" Super Bowl XXXIV

"Third down and 5. Pressure from Thomas off the edge… Eli Manning stays on his feet, airs it out down the field… and it is caught by Tyree inside the 25 and a timeout taken. Oh my god…" Super Bowl XLII

Every sport has iconic moments that fans recognize regardless of whether they watched them live or have only seen them replayed thousands of times since. Football is no different, and the three calls above capture some of its most dramatic moments on the biggest possible stage. In this article, we will look in detail at how we simulate football games on a play-by-play basis to determine the event probabilities that power the DraftKings betting markets.

Football is simultaneously the most important sport to DraftKings and its customers and arguably one of the most challenging sports to model. The complexity and nuance that make it such an enthralling spectacle for spectators and bettors alike also make it a fascinating challenge for the Sports Data Science team to tackle.

In a recent article, we showed how you could build a simple Monte Carlo simulation model for tennis. Here, we shall look at how the simulation approach to modeling is ideally suited to the more complex sport of football and how we incorporate machine learning models to drive the key decision points within matches.

Play-by-play approach

In our football engine, we simulate the game on a play-by-play basis. This involves determining the type of play and then, depending on that type, determining the various factors that describe the outcome of the play. This may be a simple one-stage process, such as whether a field goal attempt was successful or unsuccessful, or a multi-stage process, such as the one we use to determine the outcome of a passing play.

Once we have simulated the play, we update the simulated features and the simulation records and check whether that leads to the end of the match. If it does not, we simulate the next play and keep going, updating the features and records after each play, until the game ends.

In the tennis simulation we built, the simulated features included the score in the current game and the match, which player is serving, and whether we are in a tiebreak. In other words, they were the game state features that we updated throughout each simulation and then used as features in the ML models that drove the decision points. In the football engine, it is very similar: the important simulated features all describe the state of the game at any given moment in the simulation. They include the current down, the field position, the yards required for a first down, and the score, as well as player-specific features like rushing yards for running backs, receiving yards for receivers, etc.

After every play, we update these simulated features based on the simulated outcome of the previous play. We will also update the simulation records for that particular simulation, adding to the appropriate record details of the last play.
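To illustrate this structure, here is a minimal sketch of a game state and the play-by-play loop, assuming simplified features and hypothetical simulate_play and game_over callables; it is not the engine's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class GameState:
    quarter: int = 1
    clock_seconds: int = 900          # seconds remaining in the quarter
    down: int = 1
    yards_to_first: int = 10
    yard_line: int = 75               # yards from the opponent's end zone
    score_diff: int = 0               # possession team score minus opponent score
    player_stats: dict = field(default_factory=dict)  # e.g. per-player rushing/receiving yards

def simulate_game(simulate_play, game_over, state: GameState):
    """Run one Monte Carlo simulation of a game, one play at a time."""
    records = []                                   # the simulation records for this run
    while not game_over(state):
        play_record, state = simulate_play(state)  # sample one play and update the simulated features
        records.append(play_record)                # update the simulation records
    return state, records
```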

Now that we understand the high-level simulation flow of a football game, how do we go about simulating each play?

Simulating a play

For a standard non-fourth-down play, the first key decision point in the engine is the action classifier model. This is a team-specific binary classifier that predicts whether the play will be a pass play or a run play, taking inputs such as the quarter, the current field position, the yards required for a first down, the down, and the score difference, amongst others.

We use logistic regression models at the decision points in many of our sports engines. The logit function is beneficial in sports modeling as it can map probability values between 0 and 1 to real numbers between -∞ and ∞.

This transformation allows us to apply additions or subtractions to our models based on the features while ensuring that we constrain the model's output to viable probabilities between 0 and 1. We can perform whatever adjustments we desire based on the input features in logit space before transforming the result back to probability space.

As a simple example, let's say that in a 3rd and 3 situation, a team has an 80% chance of rushing the ball, and that in a 3rd and 1 situation the shorter distance makes them 25% more likely to run. We cannot simply add 25% to the 80%, as the result would fall outside the bounds of viable probabilities; linear shifts like this do not work well in probability space. Instead, we transform the probability into logit space, apply the shift there, and convert the result back to probability space. The adjusted probability trends toward 100%, but because the logit function diverges to -∞ and ∞ as probabilities approach 0 and 1, we can be confident that we will never end up with a non-viable probability.
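To make this concrete, here is a minimal sketch of the transform, assuming the 0.25 shift is applied directly in logit space; the numbers are purely illustrative rather than the engine's actual adjustments.

```python
import math

def logit(p: float) -> float:
    """Map a probability in (0, 1) to the real line."""
    return math.log(p / (1 - p))

def inv_logit(x: float) -> float:
    """Map a real number back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-x))

p_rush = 0.80                 # rush probability on 3rd and 3
shift = 0.25                  # illustrative adjustment applied in logit space for 3rd and 1
adjusted = inv_logit(logit(p_rush) + shift)
print(round(adjusted, 3))     # 0.837, nudged toward 1 but still a valid probability
```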

To determine the type of play, we take a single sample from the probabilities generated by the classifier. As explained in the previous article on simulations, we use a random number to sample from the probability distribution.
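As a rough illustration, and not the engine's actual implementation, sampling a discrete outcome with a single uniform random number can look like the sketch below; the probabilities shown are made up.

```python
import random

def sample_outcome(probabilities: dict):
    """Draw one outcome from a discrete distribution using a single uniform random number."""
    r = random.random()                    # uniform draw in [0, 1)
    cumulative = 0.0
    for outcome, p in probabilities.items():
        cumulative += p
        if r < cumulative:
            return outcome
    return outcome                         # guard against floating-point rounding at the boundary

play_type = sample_outcome({"pass": 0.62, "run": 0.38})   # illustrative probabilities
```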

Based on the outcome of this single sample, we branch off to either the rushing yards model or the yards attempted model. Since the passing flow is the more complex of the two, let us assume the play was a pass. Here, we sample from a multi-class classifier model to get a value for the yards attempted for this specific play in this particular simulation. Again, we use the benefits of the logit function here. However, because the many possible outcomes require a set of probabilities, we use an extension of logistic regression: multinomial logistic regression models. These models learn a coefficient for each feature per outcome class and return the set of probabilities across the outcomes. The logistic element of the models ensures that the sum of these probabilities is 1. We train these models on historical data to generate the optimal coefficients that give us the best possible outcomes for our engines to use.
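As a sketch of the idea (not the trained models themselves), a multinomial logistic regression turns one linear score per outcome class into a set of probabilities that sums to 1; the feature values and randomly generated coefficients below are stand-ins for the real, trained ones.

```python
import numpy as np

def multinomial_probabilities(features: np.ndarray, coefficients: np.ndarray) -> np.ndarray:
    """Convert per-class linear scores into a set of probabilities that sums to 1."""
    scores = coefficients @ features       # one linear combination of the features per outcome class
    scores -= scores.max()                 # for numerical stability before exponentiating
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Hypothetical inputs: three scaled features and four yards-attempted buckets.
# In practice, the coefficients come from training on historical play-by-play data.
features = np.array([1.0, 0.3, 0.55])
coefficients = np.random.default_rng(0).normal(size=(4, 3))
probs = multinomial_probabilities(features, coefficients)   # probability of each bucket
```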

The features of this model are very similar to those of the action classifier — again, these are team-specific models tailored to capture the nuances of individual teams against specific opponents. Let us assume that we sample a value of 5 yards for the yards attempted.

Continuing the flow, we have now simulated a situation where the QB will attempt a 5-yard pass. The next step is to determine the outcome of the pass: it can be completed, incomplete, or intercepted, or the QB can be sacked before making the pass. Again, this is another multi-class classifier that we sample from. Let us assume the pass was caught. As with the yards attempted model, we now have the yards after catch model. At this stage, the features that drive this model are mainly distance-related (how many yards for a first down, how long was the attempted pass, and how many yards are required for a touchdown, amongst others), although we can also add player-specific features, derived through a similar process to that described in the Kalman filter article. Earlier, we were focused on when in the game and where on the field we were to determine the type of play, but now we are more focused on the outcome of the play. Our sample from this model is 4 yards.

The final decision point is a simple binary classifier: whether the ball was fumbled. We only hit this particular decision point on a completed pass or a sack; if the ball was intercepted or the pass was incomplete, we do not need to make this decision. Based on the sample, the ball was not fumbled.

So, based on the samples of the various models, we have a passing play where the QB attempts a 5-yard pass, it is caught, and there are 4 yards gained after the catch with no fumble. This is precisely how the engine would simulate 'The Tackle,' the final play of Super Bowl XXXIV recounted in the commentary at the start of the article.
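Putting the decision points together, one pass play chains these samples roughly as in the sketch below, reusing the sample_outcome helper from earlier; the models dictionary and its predict_proba interface are hypothetical stand-ins for the engine's actual model objects.

```python
def simulate_pass_play(state, models):
    """Chain the pass-play decision points for one simulated play."""
    record = {"play_type": "pass"}
    record["yards_attempted"] = sample_outcome(models["yards_attempted"].predict_proba(state))
    outcome = sample_outcome(models["pass_outcome"].predict_proba(state))
    record["outcome"] = outcome            # completed, incomplete, interception, or sack
    if outcome == "completed":
        record["yards_after_catch"] = sample_outcome(models["yards_after_catch"].predict_proba(state))
    if outcome in ("completed", "sack"):
        record["fumble"] = sample_outcome(models["fumble"].predict_proba(state))   # binary decision
    return record
```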

Breaking down each match into individual plays and each play into its composite parts enables us to build models that more accurately reflect what is happening on the field. It is also an approach that lends itself to accounting for the correlations between outcomes that are crucial for our market-leading SGP product. It is evident that the number of attempted passes is correlated with the number of completed passes, the number of completed passes is correlated with the number of passing yards, the number of passing yards is correlated with the number of rushing yards, etc.

Building Complexity with Penalties

This approach allows us to simulate each play at a granular level. However, any football fan will be able to see that it is missing a lot of other critical decisions that must be made during a game.

However, this framework allows us to slot in any other models that we require to start capturing some of this added complexity. For example, let us look at how we would properly extend this to model penalties.

We can add a model at the start of the flow that can be used to predict whether there will be a pre-snap penalty. This multi-class classifier will allow us to sample whether there is no pre-snap penalty, a penalty against the offense, or a penalty against the defense. Without a pre-snap penalty, we will continue the flow precisely as planned. However, if there is a penalty, we will end the play, update the simulated features with the penalty by changing the field position 5 yards based on the offending team, update the simulation records with the information about the penalty, and then simulate the next play from there.

Similarly, we can add a model after the action classifier to simulate whether there was a post-snap penalty, and we can add an additional possible result to the pass outcome model to reflect whether there was pass interference on the play. We can also add a model to predict the yards of the penalty.
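As a sketch of how the pre-snap penalty model could slot in front of the existing flow, following the 5-yard adjustment described above; simulate_standard_play is a hypothetical stand-in for the pass/run flow we have already built.

```python
def simulate_play_with_penalties(state, models):
    """Check for a pre-snap penalty before running the standard play flow."""
    pre_snap = sample_outcome(models["pre_snap_penalty"].predict_proba(state))
    if pre_snap == "offense_penalty":
        return {"play_type": "pre_snap_penalty", "yard_change": -5}   # offense moved back 5 yards
    if pre_snap == "defense_penalty":
        return {"play_type": "pre_snap_penalty", "yard_change": +5}   # offense moved forward 5 yards
    return simulate_standard_play(state, models)                      # no penalty: continue as before
```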

Non-Standard Plays

So far, we have only really looked at standard plays. However, we can easily extend the logic to incorporate other types of play. For example, the decision-making process on 4th down differs significantly from the first three downs, so we would not want to use the same models.

Instead, we can build an additional layer on top of the existing flow to help us capture this in the simulation. The first key decision to be made when simulating a 4th-down play is whether to go for it or not. More and more in the modern game, teams are making aggressive decisions in these situations; rather than punting the ball away or taking the points, they are going for it with either a pass or a rush play. We can build a model to sample from as to whether they will go for it or not. More so than almost any other model in the simulation, this needs to be a team-specific model to capture the different tendencies of different coaches in different situations. Unlike the majority of 4th down models in the public domain, we are not trying to predict whether a team should go for it; whether a team should go for it is irrelevant to us. Instead, we care about whether the team will go for it, a small but important difference.

If we sample that the team will go for it, we move into our existing play flow as before. However, if we sample that the team will not go for it, we move into a brand-new simulation flow that predicts whether they punt or attempt a field goal and then determines the outcome in the same way as the previous flow.
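As a rough sketch under the same hypothetical model interface as before, the 4th-down layer might sit on top of the existing flow like this; the punt and field goal models shown are placeholders for that separate flow.

```python
def simulate_fourth_down(state, models):
    """Decide whether the team goes for it, then route to the appropriate flow."""
    if sample_outcome(models["go_for_it"].predict_proba(state)) == "go":
        return simulate_standard_play(state, models)       # reuse the existing pass/run flow
    if sample_outcome(models["punt_or_field_goal"].predict_proba(state)) == "field_goal":
        made = sample_outcome(models["field_goal_made"].predict_proba(state))
        return {"play_type": "field_goal", "made": made}
    net_yards = sample_outcome(models["punt_distance"].predict_proba(state))
    return {"play_type": "punt", "net_yards": net_yards}
```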

Time Run-Off

You can see how we can steadily build out the simulation flow to consider every aspect of the game and how individual plays can unfold. There is one critical aspect still missing before we have a fully working simulation, though: the passing of time.

The previous tennis simulation we built had no concept of time: as a point-based sport, the game progressed by scoring points rather than a game clock ticking down. However, football obviously has a game clock that ticks down during play and stops depending on the outcome of plays. As a result, we need to add a model to our simulation flow that predicts, at the end of each play, how many seconds to run off the game clock as a result of that play. As one might imagine, the features in this model are linked to the score and match situation: the score difference between the possession team and their opponents, the period and game clock, and the timeouts remaining, amongst others.

Once we have a sample from this model, we progress the clock by the number of seconds sampled, allowing us to progress through the match.
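A simplified sketch of this step, reusing the GameState and sample_outcome helpers from earlier; the seconds_elapsed model name and the quarter handling are assumptions made for illustration.

```python
def advance_clock(state: GameState, models) -> GameState:
    """Sample the seconds elapsed on the last play and run the game clock down."""
    seconds = sample_outcome(models["seconds_elapsed"].predict_proba(state))
    state.clock_seconds = max(0, state.clock_seconds - seconds)
    if state.clock_seconds == 0 and state.quarter < 4:
        state.quarter += 1
        state.clock_seconds = 900       # reset to 15 minutes for the next quarter
    return state
```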

What Next?

On the face of it, there is nothing overly complex about this approach to modeling a football game: we model each event you would see when watching a game, in the same order, and break plays down logically. However, this approach allows us to build models that take advantage of the properties of the logit function that make it so well suited to sports modeling, trained on features generated from the best-in-class data assets that the sports data engineering team has curated. It also allows us to layer in complexity where required to price up the hundreds of markets we provide to our customers every week of the season.

Now that you understand how we incorporate machine learning models into our simulation engines, we will dig into how these engines and models form the foundation of our in-house same-game parlay offering in upcoming articles. We will also detail how we approach the task of judging whether changes to our models and engines deliver the required improvements to move them to production.

Want to learn more about DraftKings’ global Engineering team and culture? Check out our Engineer Spotlights and current openings!
