Building a Basic, In-Game Win Probability Model for the NFL
The goal of an in-game win probability is to estimate the probability that a particular team will win a game based upon the game conditions (score, time remaining, etc.) at a particular point of time in the game. For example, a team that leads by 21 points with less than one minute remaining in the game would safely be assumed to have a win probability approaching 1. How would that probability differ if the team leads by a single point? Or trails by two points? Or leads 28–3 with 8 minutes and 31 seconds remaining in the 3rd Quarter of the Super Bowl (see the plot below)?
In this post I’ll show the development of a basic, in-game win probability model for the NFL in R. We’ll start with historical play-by-play data scraped using the wonderful nflscrapR R package. This package is not currently available via CRAN. Instead, the package is installed directly from github. Other libraries used in this exercise are also shown in the R code snippet below.
devtools::install_github(repo = "maksimhorowitz/nflscrapR")
Note that I’m still adapting to using tidyverse principles so I’m not always using the package consistently. For a solid introduction to tidy principles, check out the excellent “R for Data Science” book by Grolemund and Wickham.
Setting-up the scraping process is pretty trivial, but the scraping itself can be a bit time-consuming. So as to not have to repeat the scraping, I use the saveRDS function to save the play-by-play data to a file for use in this project and others. Repeat the code below for the 2009 through 2015 seasons (nflscrapR cannot scrape data from before the 2009 season). You should end up with eight pbp data frames numbered 1 to 8.
pbp1 = season_play_by_play(2016)
Then use bind_rows to combine the the pbp data frame into a single data frame.
pbp = bind_rows(pbp1,pbp2,pbp3,pbp4,pbp5,pbp6,pbp7,pbp8)
The resulting, combined data frame should have 362,447 observations (rows) and 77 variables (columns). Each observation is a single play and the variables provide information about each play. To build our model, we’ll need each row to also include a variable that indicates which team ends up winning the game in which each play took place. Fortunately, the nflscrapR package provides the ability to scrape game result data. From this data we can derive the needed variable.
games2016 = season_games(Season = 2016)
Repeat for the 2009 to 2015 seasons and then combine the game results data. I save the combined data to file to avoid having to repeat the scraping.
games = bind_rows(games2016, games2015, games2014, games2013, games2012, games2011, games2010, games2009)saveRDS(games, "games_data.rds")
The full_join function is used combine the game results with the play-by-play data using the GameID variable as the key for the join.
pbp_final = full_join(games, pbp_raw, by = "GameID")
I then created a new, binary variable to record whether or not the team in possession of the ball is ultimately the game winner. This variable will be the response variable in our models. The first line of code creates a variable that stores the name of the team that won the game in which each play took place. The second line creates an indicator variable with a value of “Yes” if the team in possession ultimately wins the game and “No” if not. I also convert the quarter, down, and “poswins” variables to factors. Note that I got a little lazy with my code and did the factor conversions in a non-Tidyverse manner.
pbp_final = pbp_final %>% mutate(winner = ifelse(homescore > awayscore, home, away))pbp_final = pbp_final %>% mutate(poswins = ifelse(winner == posteam, "Yes","No"))pbp_final$qtr = as.factor(pbp_final$qtr)
pbp_final$down = as.factor(pbp_final$down)
pbp_final$poswins = as.factor(pbp_final$poswins)
In the next step we remove “No Play” plays and plays that did not occur during regulation. A subset of variables of interest is then created using the select function.
pbp_reduced = pbp_final %>% filter(PlayType != "No Play" & qtr != 5 & down != "NA" & poswins != "NA") %>% select(GameID, Date, posteam, HomeTeam, AwayTeam, winner, qtr, down, ydstogo, TimeSecs, yrdline100, ScoreDiff, poswins)
We’re now ready to build our prediction model. Before we do so, let’s split the dataset into training and testing sets with the sample.split function from the caTools package. Setting the seed ensures that the split we create is reproducible. We’ll use the testing set to evaluate the performance of the model that we create using the training set.
split = sample.split(pbp_reduced$poswins, SplitRatio = 0.8)
train = pbp_reduced %>% filter(split == TRUE)
test = pbp_reduced %>% filter(split == FALSE)
A wide variety of models exist for binary classification problems (such as we face here). One of the simplest is logistic regression. Coefficients, shown in the logit function below as betas, are estimated. From these, the probability, P, of a binary event occurring can then be estimated.
In our model, the predictor variables (shown as X’s in the logit function) are qtr (quarter), down, ydstogo (yards to go), TimeSecs (time remaining in the game in seconds), yrdline100 (distance from the opponents goal line in yards), and ScoreDiff (difference in score calculated as the score for the team in possession minus the opponent’s score). The response variable is poswins which is a binary variable that indicates if the team in possession ultimately wins the game.
We use R’s glm function with family = “binomial” to build the logistic regression model on the training set. The summary function then provides the details of the model results.
model1 = glm(poswins ~ qtr + down + ydstogo + TimeSecs + yrdline100 + ScoreDiff, train, family = "binomial")summary(model1)
The model can then be used to estimate win probabilities for each play in the training dataset. The estimated probabilities are predicted for the team in possession. For clarity, the third line of code in the snippet below ensures that the probabilities are always stated for the home team (even if that team is not in possession).
pred1 = predict(model1, train, type = "response")train = cbind(train,pred1)train = mutate(train, pred1h = ifelse(posteam == HomeTeam, pred1, 1-pred1))
The plot below shows the evolution of the estimated win probability for the Denver Broncos (home team) in their September 8th, 2016 game versus the Carolina Panthers. The Broncos won 21–20 as the Panthers missed a potentially game-winning field goal with nine seconds remaining.
ggplot(filter(train, GameID == "2016090800"),aes(x=TimeSecs,y=pred1h)) + geom_line(size=2, colour="orange") + scale_x_reverse() + ylim(c(0,1)) + theme_minimal() + xlab("Time Remaining (seconds)") + ylab("Home Win Probability")
Earlier, we split our dataset into training and testing sets with the idea that we would evaluate our model quality on the testing set. To do so, we need to convert the estimated win probabilities from the logistic regression model to a definitive prediction of “Win” or “Loss”. How do we do this? We establish a threshold above which we assume that a “Win” a predicted and below which we assume that a “Loss” is predicted. For example, if the estimated win probability at a particular moment in a game is 0.95, we might be able to safely assume that the team will win the game. What if the estimated probability is 0.75? Or 0.65? Or 0.55? Where do we draw the line? Should we simply accept 0.5 as the threshold? We’ll address this issue and opportunities to enhance our model in the next post.