Can I predict if a football team will get a first down?
On Kaggle I found a dataset with play-by-play data from the 2015 NFL season. After looking at the data I came up with the following questions:
Some questions
- Can I predict if a team will reach a new first down in the next play given quarter, time left in quarter, current down, yards to go and position on field?
- Can I give a probability for reaching the first down?
- Is there a difference between the teams on offense?
The data
The original data contains 66 columns. One of them is FirstDown which I want to predict.
I chose the following columns for prediction:
- qtr: In which quarter did the play occur?
- down: Which down was played?
- TimeSecs: How many seconds were left in the quarter?
- yrdline100: How many yards to the opponent’s endzone?
- ydstogo: How many yards to go for a new first down?
And later
- posteam: Which team is in possession of the ball?
Cleaning the data
There were many rows with NaN as the value for FirstDown. These were plays I wasn’t interested in, such as kickoffs, PATs, timeouts or the end of a quarter, so I could safely drop these rows.
The data was also somewhat imbalanced: two out of three plays did not result in a new first down. So I decided to upsample the data so that both classes are evenly distributed.
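The two cleaning steps can be sketched roughly as follows. This is a minimal illustration, not the original code: the small DataFrame is made-up data that only mimics the column names listed above, and upsampling by sampling the minority class with replacement is one common way to balance classes.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the play-by-play table (all values are made up).
plays = pd.DataFrame({
    "qtr":        [1, 1, 1, 2, 2, 3, 4, 4],
    "down":       [np.nan, 1, 2, 3, 1, 2, 4, np.nan],
    "TimeSecs":   [900, 880, 840, 400, 350, 600, 20, 0],
    "yrdline100": [35, 80, 74, 50, 46, 30, 15, 15],
    "ydstogo":    [0, 10, 4, 4, 10, 7, 10, 0],
    "FirstDown":  [np.nan, 0, 0, 1, 0, 0, 1, np.nan],
})

features = ["qtr", "down", "TimeSecs", "yrdline100", "ydstogo"]
target = "FirstDown"

# Drop rows without a label (kickoffs, timeouts, end of quarter, ...).
clean = plays.dropna(subset=[target])[features + [target]]

# Split into majority and minority class.
counts = clean[target].value_counts()
majority_label = counts.idxmax()
majority = clean[clean[target] == majority_label]
minority = clean[clean[target] != majority_label]

# Upsample the minority class with replacement until both classes are equal in size,
# then shuffle the rows.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)

print(balanced[target].value_counts())
```

One caveat with this approach: if the upsampling happens before a train/test split, copies of the same play can end up on both sides of the split, so it is usually safer to upsample only the training part.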
Different models
I tried several different classification models and settled on scikit-learn’s RandomForestClassifier.
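A comparison like this could look roughly like the sketch below. Which models were actually tried is not stated here, so the three candidates are assumptions, and the training data is randomly generated just to make the example runnable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic plays: columns are qtr, down, TimeSecs, yrdline100, ydstogo.
rng = np.random.default_rng(0)
n = 400
X = np.column_stack([
    rng.integers(1, 5, n),     # qtr
    rng.integers(1, 5, n),     # down
    rng.integers(0, 901, n),   # TimeSecs
    rng.integers(1, 100, n),   # yrdline100
    rng.integers(1, 21, n),    # ydstogo
])
# Made-up label: short distances are converted more often.
y = (rng.random(n) < 1.0 / (1.0 + X[:, 4] / 4.0)).astype(int)

# Candidate models (an assumption, chosen only for illustration).
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```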
Predictions
Now let’s do some simple predictions. Let’s create some cases:
- The first play of a game: Own 20-yard line, 1st and 10.
- 20 seconds to play in the game, ball on the opponent’s 15-yard line (in the redzone), 4th and 10.
So what are the predictions?
array([0., 0.])
So in both cases no first down is predicted. But what about probabilities?
Probabilities
I’m using the method “predict_proba()” to get the predicted probabilities:
array([0.3712671, 0.45 ])
The second case has a slightly higher chance of reaching the first down.
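Both probabilities are below 0.5, which is why predict() returned 0 for both cases. The sketch below shows how the two cases could be encoded and scored. The model here is trained on random synthetic data, so its numbers will differ from the ones above, and the feature order and the encoding of the two cases are assumptions based on the column descriptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

features = ["qtr", "down", "TimeSecs", "yrdline100", "ydstogo"]

# Synthetic training data (made up, only to have a fitted model to query).
rng = np.random.default_rng(1)
n = 500
train = pd.DataFrame({
    "qtr": rng.integers(1, 5, n),
    "down": rng.integers(1, 5, n),
    "TimeSecs": rng.integers(0, 901, n),
    "yrdline100": rng.integers(1, 100, n),
    "ydstogo": rng.integers(1, 21, n),
})
y = (rng.random(n) < 1.0 / (1.0 + train["ydstogo"].to_numpy() / 4.0)).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(train[features], y)

# Case 1: first play of the game, own 20-yard line (80 yards to go), 1st and 10.
# Case 2: 20 seconds left, opponent's 15-yard line, 4th and 10.
cases = pd.DataFrame(
    [[1, 1, 900, 80, 10],
     [4, 4, 20, 15, 10]],
    columns=features,
)

labels = model.predict(cases)                  # hard 0/1 decisions
positive = list(model.classes_).index(1)       # column of the "first down" class
proba = model.predict_proba(cases)[:, positive]

print(labels)
print(proba)
```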
How does the probability change with respect to the position on the field?
We now take a look at the following situation: 3rd and 4 at mid-quarter, for each yard line on the field.
For each of these situations we predict the probability of getting a new first down.
Near your own endzone it’s hard to get a first down. At midfield it’s much easier. But it gets harder again as you approach your opponent’s endzone.
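A sweep over the field like this can be built by holding everything fixed except yrdline100. The following is a sketch under assumptions: the quarter (2) and the value used for mid-quarter (450 seconds) are my choices for illustration, and the model is again fitted on synthetic data, so the resulting curve will not reproduce the shape described above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

features = ["qtr", "down", "TimeSecs", "yrdline100", "ydstogo"]

# Synthetic training data (made up, only to have a fitted model to query).
rng = np.random.default_rng(2)
n = 2000
train = pd.DataFrame({
    "qtr": rng.integers(1, 5, n),
    "down": rng.integers(1, 5, n),
    "TimeSecs": rng.integers(0, 901, n),
    "yrdline100": rng.integers(1, 100, n),
    "ydstogo": rng.integers(1, 21, n),
})
y = (rng.random(n) < 1.0 / (1.0 + train["ydstogo"].to_numpy() / 4.0)).astype(int)
model = RandomForestClassifier(n_estimators=100, random_state=2).fit(train[features], y)

# 3rd and 4, mid-quarter, once for every yard line from 1 to 99.
grid = pd.DataFrame({
    "qtr": 2,                          # assumed quarter
    "down": 3,
    "TimeSecs": 450,                   # assumed value for "mid-quarter"
    "yrdline100": np.arange(1, 100),   # yards to the opponent's endzone
    "ydstogo": 4,
})

positive = list(model.classes_).index(1)
grid["p_first_down"] = model.predict_proba(grid[features])[:, positive]

print(grid[["yrdline100", "p_first_down"]].head())
```

Plotting p_first_down against yrdline100 then gives one curve for the chosen down and distance.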
What about different teams?
I repeated the analysis above for each of the NFL teams, which gives 32 different models. I decided against a single model that includes the team as a feature because I would have to one-hot encode the column “posteam” for training and again for every prediction. So prediction is easier with 32 single-team models.
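Training one model per team can be sketched with a groupby over posteam. The helper name first_down_probability is hypothetical, the data is synthetic, and only four teams are generated to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

features = ["qtr", "down", "TimeSecs", "yrdline100", "ydstogo"]
teams = ["DEN", "CAR", "TEN", "CLE"]

# Synthetic plays with a team column (all values are made up).
rng = np.random.default_rng(3)
n = 800
plays = pd.DataFrame({
    "posteam": rng.choice(teams, n),
    "qtr": rng.integers(1, 5, n),
    "down": rng.integers(1, 5, n),
    "TimeSecs": rng.integers(0, 901, n),
    "yrdline100": rng.integers(1, 100, n),
    "ydstogo": rng.integers(1, 21, n),
})
plays["FirstDown"] = (
    rng.random(n) < 1.0 / (1.0 + plays["ydstogo"].to_numpy() / 4.0)
).astype(int)

# One random forest per team on offense, keyed by team abbreviation.
models = {}
for team, group in plays.groupby("posteam"):
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    models[team] = clf.fit(group[features], group["FirstDown"])


def first_down_probability(team, situation):
    """Probability of a first down for one situation, using that team's model."""
    model = models[team]
    positive = list(model.classes_).index(1)
    return model.predict_proba(situation[features])[0, positive]


# 3rd and 4 at midfield, scored by every team's model.
situation = pd.DataFrame([[2, 3, 450, 50, 4]], columns=features)
probs = {team: first_down_probability(team, situation) for team in models}
print(probs)
```

The trade-off is that each model sees only one team’s plays, so every forest is trained on a fraction of the data.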
Super Bowl teams vs. 3–13 teams
The Denver Broncos (12–4) beat the Carolina Panthers (15–1) in Super Bowl 50. So let’s compare both to two 3–13 teams: Tennessee and Cleveland.
As you can see, both Super Bowl teams have higher probabilities than the two 3–13 teams.
Getting first downs helps winning championships!
Wrap up
So with a relatively short program you can dig into the math behind football. (Disclaimer: As I was born and raised in Germany, I have never thrown a football at any level. I’m the classic armchair quarterback ;-) )
Next steps
- You can aggregate the play-by-play data into drive data. That way you can attach the result of a drive to every play in it and try to predict the result of a drive depending on field position etc.
- Only look at the last two minutes of a game or a half. Is there any QB you want to play for your team? (Is it Aaron Rodgers?)
- Is there a trend from season to season? There is an R package, nflscrapR, that you can use to download more data.
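As a starting point for the first idea, the result of a drive can be attached to each of its plays with a grouped transform. This is only a sketch: the column names GameID, Drive and Touchdown are assumptions about the dataset, the rows are made up, and "the drive ended in a touchdown" is just one possible definition of a drive result.

```python
import pandas as pd

# Made-up plays from two games; GameID, Drive and Touchdown are assumed column names.
plays = pd.DataFrame({
    "GameID":     [1, 1, 1, 1, 1, 2, 2, 2],
    "Drive":      [1, 1, 1, 2, 2, 1, 1, 1],
    "yrdline100": [80, 75, 60, 70, 68, 50, 30, 5],
    "Touchdown":  [0, 0, 0, 0, 0, 0, 0, 1],
})

# A drive counts as a touchdown drive if any of its plays scored a touchdown.
# transform("max") broadcasts that per-drive value back onto every play.
plays["drive_td"] = plays.groupby(["GameID", "Drive"])["Touchdown"].transform("max")

print(plays)
```

The new column drive_td could then serve as the prediction target in place of FirstDown.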
You can find the code for this analysis on GitHub: https://github.com/dfv-ms/udacity_nd025_term2_project_1