I Can (Sort of) Predict An NFL Pass Situation Now

No big deal.

Stephanie Miller
Analytics Vidhya
7 min read · Oct 27, 2019


I have an intense obsession with football. At first it was the Super Bowl Commercials that drew me in, but after watching the football breaks between Commercial Bowls, I found I got super hyped whenever fights happened, devastated when important players got injured, and constantly amused at the hundreds of zoom-ins of the sideline watching their team. It was a testosterone-fueled drama, and I was there. for. it.

Sometimes we get super zooms that show an existential crisis happening in real time.

As my love grew, I began noticing and appreciating the various statistics shared by announcers during game time. If you watch football long enough, you begin to pick up on the natural correlations between stats and their significance in plays that are happening, or situations that are unfolding. Quantifying that back into a data set, however, really required me to break events down into measurable units, a fascinating and tedious process. I know it sounds like a drag, but truly, that’s my favorite part of the Data Science process (besides cleaning data, which is weirdly relaxing). The more I learn how to break down the problem of Not Enough Data into ways of making data, the more I learn not just about the sport and data analysis, but about how Natural Philosophy can become a great ally for the data scientist. Logic and the ability to break questions down into tangible metrics are almost like a puzzle game, but one filled with wonder. I digress. Back to Gridiron Football.

About The Data:
Source: http://www.nflsavant.com
I used three of the datasets from the API: the 2017 Regular Season, the 2018 Regular Season, and the 2019 Regular Season (in progress).
Not included: pre- and post-season games, as well as championship games.

Link to Analysis Notebook: https://colab.research.google.com/drive/1PJ5lWIDvoFudqjYCMHHp_NgLujF-tNz8

While I have so, so many questions I want to answer, I had to start somewhere, so I went for one of the most common plays in the NFL: the Pass. My hypothesis: using only pre-snap data features, an algorithm can predict whether or not a QB will Pass the ball (versus Rushing it) better than the majority baseline (that is to say, the actual percentage of plays in the data set that are Passes versus all other play types).

Majority Baseline: 58.7% (Passes happen 58.7% of the time)
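For the curious, that baseline is a one-liner once the play-by-play is loaded. This is a sketch with column names I’m assuming from the NFLSavant CSV, so adjust to your own pull:

```python
import pandas as pd

# Assumed NFLSavant play-by-play layout: a PlayType column with values
# like "PASS" and "RUSH" (column names are assumptions on my part)
df = pd.read_csv("pbp-2018.csv")
plays = df[df["PlayType"].isin(["PASS", "RUSH"])]

# Majority baseline = the share of plays that are passes
print((plays["PlayType"] == "PASS").mean())
```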

Passes Per Minute

The initial exploration showed that passes are more likely to happen at the beginning of a quarter, before leveling out and then dropping towards the end of the quarter. I also checked the seconds on the clock, but they didn’t seem to affect the play option. I don’t believe this means that seconds on the clock have no predictive power; I think I’m missing a key way of quantifying their importance to a player.

Passes per Quarter

Quarterbacks definitely prefer to throw the ball more in the second and fourth quarters, which makes sense if you follow the game. The end of the first half has a lot of morale impact for a team. If they’re behind, keeping the game within a winnable margin becomes the top priority, and play calling may shift to a riskier stance, knowing they have a break to rest and regroup, and two more quarters to make up for any mistakes.

Passes per Yardline

Rushes dominate near the Offense’s own endzone, whereas teams tend to Pass more frequently (much more frequently) when facing the goal line. I also took time to check density, noting that between 35 and 50 yards from the goal line was a hot spot for Passes as well. This shows the difference between the risk a QB carries in his own endzone and the situation when the Offense is knocking on the Defense’s door. Formations likely have a lot to do with endzone success as well, so I’m interested to see how they show their importance in the prediction model.

Based on my exploration, and the relevance I saw in those visualizations, among many, many others, I chose to focus on Time and Downs for my first model. I was surprised to see that Seconds had almost no predictive power, but I felt it necessary to keep checking it throughout my models just to make sure I wasn’t missing secret information. I wasn’t, but after I finished and started compiling my data to share, I realized that I could have captured the signal in Seconds by engineering features for high-pressure scenarios, such as the two-minute warning threshold, or how close the game was in terms of points. Basically, NFL data engineering is a gateway data project that sucks you in and may be the ultimate still-in-progress portfolio piece, but I’m okay with that. I have a project to keep me warm during the ice cold months of the off-season.
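If I ever circle back to those clock features, here’s a minimal sketch of what that engineering might look like. The column names (Quarter, Minute, Second) and the countdown convention are assumptions about the NFLSavant layout, so treat this as illustrative:

```python
import pandas as pd

df = pd.read_csv("pbp-2018.csv")

# Assumption: Minute/Second count down within the quarter, so this is
# time remaining in the current quarter, in seconds
df["SecsLeftInQuarter"] = df["Minute"] * 60 + df["Second"]

# Q1 and Q3 still have a full 900-second quarter left in their half
df["SecsLeftInHalf"] = df["SecsLeftInQuarter"] + 900 * df["Quarter"].isin([1, 3])

# High-pressure flag: inside the two-minute warning of either half
df["TwoMinuteDrill"] = (df["SecsLeftInHalf"] <= 120).astype(int)

# A score-differential feature would need a running score, which (as far
# as I know) the raw play-by-play doesn't ship -- future work for me
```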

Results

Chewing intensifies….

Rather than walk you through the severe emotional turbulence of me iterating through models and googling all the things to make a model that performed better than the majority baseline, I figured I’d suppress my trauma and just share some pretty visuals that give some interesting feedback on my data.

Was it a Formation? If Yes, Check the Down. If No, Check Yards To Go to get a First Down.

Fast First Model/Shallow Decision Tree

  • Decision Tree: 69.45%

Immediately, the Decision Tree model made good on the information, giving an astonishing 69.45% accuracy, nearly 11 percentage points higher than the majority baseline. I’ve heard it said over and over again that a Shallow Decision Tree is the best baseline classification model a data scientist could ask for, and they were right. By starting with a shallow tree with a max depth of 2 (as seen above), I immediately had a way of beating the majority baseline, and an easy way to check my model for leakage. The question was, how was it predicting specific values?
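For reference, the shallow tree takes only a few lines of scikit-learn. This is a minimal sketch rather than my exact notebook code, and the feature and target columns are my assumptions about the NFLSavant layout:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("pbp-2018.csv")
df = df[df["PlayType"].isin(["PASS", "RUSH"])]

# Pre-snap features only; one-hot encode the categorical Formation
X = pd.get_dummies(df[["Down", "ToGo", "Quarter", "Formation"]],
                   columns=["Formation"])
y = (df["PlayType"] == "PASS").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# The shallow (max_depth=2) baseline tree
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X_train, y_train)
print(f"Test accuracy: {accuracy_score(y_test, tree.predict(X_test)):.4f}")
```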

Here’s a comparison of how it judged between a play that was a pass and a play that was a rush, using shap force plots (shaply? Shapley? shap-lee). Both graphs show an instance where the model read the information and made an accurate prediction.

Was an actual Pass, and the model correctly predicted it was a Pass. Blue shows which values took away from the model’s ability to make the prediction. Basically, Formation was hot garbage as a feature here.
Was a Rush, and the model correctly predicted it was a Rush. Interestingly, Formation actually had high importance here in indicating whether or not the ball would be rushed. The way the team lined up, it seems, gave great predictive power for when the team would Rush, and helped the model discern the difference.
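For anyone who wants those force plots, here’s roughly the shap incantation, reusing the fitted tree from the sketch above. Return shapes vary between shap versions, so consider this a sketch:

```python
import shap

# TreeExplainer handles sklearn trees and XGBoost alike
explainer = shap.TreeExplainer(tree)
shap_values = explainer.shap_values(X_test)

# Force plot for a single play; for a binary sklearn tree, older shap
# versions return one array per class, so index [1] is the "Pass" class
shap.initjs()
shap.force_plot(explainer.expected_value[1], shap_values[1][0],
                X_test.iloc[0])
```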

Logistic Regression

  • Using ROC AUC as the evaluation metric
  • Majority Baseline: 58.7%
  • Shallow Decision Tree: 69.45%
  • Test Data: 66.56%
  • Validation Data: 67.51%

This shows that Downs, Quarters, and ToGo (yards the Offense had left to get a first down) had the highest positive impact on our prediction capabilities, whereas, somewhat surprisingly, Formation had the least to do with whether or not the ball was Passed or Rushed. Again, I feel compelled to question my own quantification methods for harnessing the potential of Formation here; I think the answer lies in quantifying position importance per formation. That’s my guess. That, or formations’ ability to hide trick plays. OR it could just be that the decision to Pass is made within the first few seconds post-snap.
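The Logistic Regression itself was a straightforward scikit-learn fit; here’s a sketch under the same column assumptions, reusing the earlier split:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Reuses X_train/X_test/y_train/y_test from the tree sketch above.
# Scaling keeps the solver happy with ToGo's wide value range.
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)

# Score with ROC AUC on predicted pass probabilities
proba = logreg.predict_proba(X_test)[:, 1]
print(f"Test ROC AUC: {roc_auc_score(y_test, proba):.4f}")
```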

Decision Tree, Max Depth 6

  • Majority Baseline: 58.7%
  • Shallow Decision Tree (depth of 2): 69.45%
  • Test Data: 70.02%
  • Validation Data: 70.08%

So Logistic Regression wasn’t gonna work, clearly, but that baseline model was FIRE, so I decided to tweak the hyperparameters and see what we could get. I followed the methodology of pushing the model until it started losing predictive power, and then scaling it back. At a max depth of 7, I got a test score of 69.82%, but when I moved it down to 6, my score rose to just over 70%.

As magical as Aaron Donald lifting a grown man off the ground and dangling him in the air.
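The depth sweep is easy to reproduce. Here’s a sketch of the loop, again reusing the earlier split and scoring with ROC AUC to match the tables above (the metric choice is my assumption):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# Push max_depth until the test score starts dropping, then back off
for depth in range(2, 9):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    auc = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])
    print(f"max_depth={depth}: test ROC AUC {auc:.4f}")
```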

XGBoost Classifier

  • Majority Baseline: 58.7%
  • Shallow Decision Tree (depth of 2): 69.45%
  • Test Accuracy: 70.28%
  • Validation Accuracy: 70.16%

Ultimately, XGBoost was my best model, with a Validation score juuust a smidge better than the Decision Tree’s. It turns out gradient boosting was essential for getting the most out of the limited columns I gave the algorithm to train and test with. All in all, a modest 70.16% ROC AUC on Validation Data was a pleasant surprise, but the more I look at this draft, the more features I feel I can engineer to improve upon it.
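And the winner, via xgboost’s scikit-learn wrapper. The hyperparameters below are illustrative placeholders rather than the values I actually landed on:

```python
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

# Reuses the earlier split; n_estimators/max_depth/learning_rate are
# placeholder values, not the tuned ones
model = XGBClassifier(
    n_estimators=200,
    max_depth=4,
    learning_rate=0.1,
    eval_metric="auc",
)
model.fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"ROC AUC: {auc:.4f}")
```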

Conclusion:

My hypothesis was correct! I can, in fact, predict the Pass with more accuracy than the majority baseline. What’s that, Todd Gurley? I’m what?

Oh Todd, you sly minx, how you do go on ❤

What’s Next?

The better I do on this model, the better my chances of getting Sean McVay to notice me and hire me as a data scientist. My next step will be watching the Validation score as the season goes on and more data pours in week to week from the NFLSavant API. I get to see how well my model predicts on new data, so I’ll be watching and adjusting features as I go!

