Will my team get a first down right now?

dfv-ms
Aug 4, 2019

Can I predict if a football team will get a first down?

On Kaggle I found a dataset with play-by-play data from the 2015 NFL season. After looking at the data I came up with the following questions:

Some questions

  • Can I predict whether a team will gain a new first down on the next play, given the quarter, time left in the quarter, current down, yards to go and position on the field?
  • Can I give a probability of reaching the first down?
  • Is there a difference between the teams on offense?

The data

The original data contains 66 columns. One of them is FirstDown, which is the column I want to predict.

I chose the following columns for prediction:

  • qtr: In which quarter did the play occur?
  • down: Which down was played?
  • TimeSecs: How many seconds were left in the quarter?
  • yrdline100: How many yards to the opponent’s endzone?
  • ydstogo: How many yards to go for a new first down?

And later

  • posteam: Which team is in possession of the ball?

Cleaning the data

Many rows had NaN as the value for FirstDown. These were plays I wasn’t interested in anyway, such as kickoffs, PATs, timeouts or the end of a quarter, so I could safely drop them.
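
A minimal sketch of this filtering step with pandas. The small DataFrame below is made-up stand-in data, not rows from the Kaggle file; only the column names are taken from the list above.

```python
import numpy as np
import pandas as pd

# Made-up stand-in rows; the real data comes from the Kaggle CSV.
plays = pd.DataFrame({
    "qtr":        [1, 1, 1, 2, 4],
    "down":       [np.nan, 1, 2, np.nan, 3],
    "TimeSecs":   [900, 895, 860, 450, 120],
    "yrdline100": [35, 80, 74, 15, 42],
    "ydstogo":    [0, 10, 4, 0, 7],
    "FirstDown":  [np.nan, 0, 1, np.nan, 0],  # NaN stands for kickoffs, PATs, timeouts ...
})

features = ["qtr", "down", "TimeSecs", "yrdline100", "ydstogo"]

# Keep only plays where FirstDown is known, then split into inputs and target.
clean = plays.dropna(subset=["FirstDown"])
X = clean[features]
y = clean["FirstDown"].astype(int)

print(len(plays), "rows before,", len(clean), "rows after")
```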

The data was also imbalanced: two out of three plays did not result in a new first down. So I decided to upsample the minority class so that both classes are evenly distributed.
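
One way to do this kind of upsampling is with `sklearn.utils.resample`. This is a sketch on made-up data and not necessarily the exact approach used in the project; the variable names are illustrative.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up stand-in data with roughly the 2:1 imbalance described above.
clean = pd.DataFrame({
    "ydstogo":   [10, 7, 3, 10, 1, 8],
    "FirstDown": [0, 0, 1, 0, 1, 0],
})

majority = clean[clean["FirstDown"] == 0]
minority = clean[clean["FirstDown"] == 1]

# Draw minority rows with replacement until both classes have the same size.
minority_upsampled = resample(
    minority,
    replace=True,
    n_samples=len(majority),
    random_state=42,
)

balanced = pd.concat([majority, minority_upsampled])
print(balanced["FirstDown"].value_counts())
```

A caveat worth keeping in mind: if the upsampling happens before the train/test split, copies of the same play can land in both sets, which makes test scores look better than they are.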

Different models

I tried several different classification models and settled on scikit-learn’s RandomForestClassifier.
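
A comparison like this can be sketched with cross-validation. The post does not say which other models were tried, so the candidates below are my assumption, and the data is synthetic so that the snippet runs on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the five feature columns (NOT the NFL data).
rng = np.random.default_rng(0)
n = 400
X = np.column_stack([
    rng.integers(1, 5, n),    # qtr
    rng.integers(1, 5, n),    # down
    rng.integers(0, 901, n),  # TimeSecs
    rng.integers(1, 100, n),  # yrdline100
    rng.integers(1, 21, n),   # ydstogo
])
# Invented rule: short distances are converted more often.
y = (rng.random(n) < 1 / (1 + X[:, 4] / 4)).astype(int)

# Candidate models; this selection is an assumption for illustration.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean accuracy over 5 folds for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```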

Predictions

Now let’s do some simple predictions. Let’s create some cases:

  • The first play of a game: Own 20-yard line, 1st and 10.
  • 20 seconds left in the game, at the opponent’s 15-yard line in the red zone, 4th and 10.

So what are the predictions?

array([0., 0.])

So in both cases no first down is predicted. But what about probabilities?

Probabilities

I’m using the method “predict_proba()” to get the predicted probabilities:

array([0.3712671, 0.45     ])

The second case has a slightly higher chance of reaching the first down.
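
Here is a sketch of how the two cases can be encoded and scored. The model is trained on synthetic data so that the snippet is self-contained, so its numbers will not match the arrays above. I’m assuming that the own 20-yard line corresponds to yrdline100 = 80 and that TimeSecs = 900 marks the start of a quarter.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

features = ["qtr", "down", "TimeSecs", "yrdline100", "ydstogo"]

# Synthetic training data so the sketch runs on its own (NOT the NFL data).
rng = np.random.default_rng(0)
n = 500
train = pd.DataFrame({
    "qtr":        rng.integers(1, 5, n),
    "down":       rng.integers(1, 5, n),
    "TimeSecs":   rng.integers(0, 901, n),
    "yrdline100": rng.integers(1, 100, n),
    "ydstogo":    rng.integers(1, 21, n),
})
# Invented rule: short distances are converted more often.
train["FirstDown"] = (1 / (1 + train["ydstogo"] / 4) > rng.random(n)).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(train[features], train["FirstDown"])

cases = pd.DataFrame(
    [
        [1, 1, 900, 80, 10],  # first play of the game: own 20, 1st and 10
        [4, 4, 20, 15, 10],   # 20 seconds left, opponent's 15, 4th and 10
    ],
    columns=features,
)

predictions = model.predict(cases)                # hard 0/1 labels
probabilities = model.predict_proba(cases)[:, 1]  # column 1 = P(FirstDown == 1)

print(predictions)
print(probabilities)
```

The hard prediction is 0 whenever the probability for class 1 is below 0.5, which is why two cases with different probabilities can still get the same label.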

How does the probability change with respect to the position on the field?

Now let’s take a look at the following situation: 3rd and 4 at mid-quarter, for each yard line on the field.

For each of these situations we predict the probability of getting a new first down.

Near your own endzone it’s hard to get a first down. At midfield it’s a lot easier. But it gets harder again as you approach your opponent’s endzone.
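
The sweep over the field can be sketched as follows. The quarter (2) and the value used for "mid-quarter" (TimeSecs = 450) are my assumptions, and the model is again trained on synthetic data, so the resulting curve only illustrates the mechanics and not the shape described above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

features = ["qtr", "down", "TimeSecs", "yrdline100", "ydstogo"]

# Synthetic training data so the sketch runs on its own (NOT the NFL data).
rng = np.random.default_rng(1)
n = 500
train = pd.DataFrame({
    "qtr":        rng.integers(1, 5, n),
    "down":       rng.integers(1, 5, n),
    "TimeSecs":   rng.integers(0, 901, n),
    "yrdline100": rng.integers(1, 100, n),
    "ydstogo":    rng.integers(1, 21, n),
})
train["FirstDown"] = (1 / (1 + train["ydstogo"] / 4) > rng.random(n)).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(train[features], train["FirstDown"])

# One row per yard line: 3rd and 4, assumed 2nd quarter with 450 seconds left.
yardlines = np.arange(1, 100)
grid = pd.DataFrame({
    "qtr": 2,
    "down": 3,
    "TimeSecs": 450,
    "yrdline100": yardlines,
    "ydstogo": 4,
})[features]

# Probability of a first down at each distance from the opponent's endzone.
curve = pd.Series(
    model.predict_proba(grid)[:, 1],
    index=yardlines,
    name="p_first_down",
)
print(curve.head())
```

Plotting `curve` (for example with `curve.plot()` if matplotlib is installed) gives the probability as a function of field position.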

What about different teams?

I did the same analysis as above for each of the NFL teams, which gives me 32 different models. I decided against a single model that includes the team because I would have to one-hot encode the column “posteam” for training and again for every prediction. So prediction is easier with 32 single-team models.
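
The one-model-per-team idea can be sketched with a groupby and a dictionary of classifiers. The three team codes and all play data below are synthetic placeholders, so the probabilities say nothing about the real teams.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

features = ["qtr", "down", "TimeSecs", "yrdline100", "ydstogo"]
teams = ["DEN", "CAR", "CLE"]  # placeholder subset of team codes

# Synthetic plays with a posteam column (NOT the NFL data).
rng = np.random.default_rng(2)
n = 600
plays = pd.DataFrame({
    "posteam":    rng.choice(teams, n),
    "qtr":        rng.integers(1, 5, n),
    "down":       rng.integers(1, 5, n),
    "TimeSecs":   rng.integers(0, 901, n),
    "yrdline100": rng.integers(1, 100, n),
    "ydstogo":    rng.integers(1, 21, n),
})
plays["FirstDown"] = (1 / (1 + plays["ydstogo"] / 4) > rng.random(n)).astype(int)

# Fit one classifier per team on that team's offensive plays only.
models = {}
for team, team_plays in plays.groupby("posteam"):
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(team_plays[features], team_plays["FirstDown"])
    models[team] = clf

# Score the same situation (3rd and 4 at midfield) with every team's model.
situation = pd.DataFrame([[2, 3, 450, 50, 4]], columns=features)
team_probs = {
    team: clf.predict_proba(situation)[0, 1]
    for team, clf in models.items()
}
print(team_probs)
```

Prediction then only needs a dictionary lookup by team code, with no encoding of “posteam” at all.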

Super Bowl teams vs. 3–13 teams

The Denver Broncos (12–4) beat the Carolina Panthers (15–1) in Super Bowl 50. So let’s compare both to two 3–13 teams: Tennessee and Cleveland.

Comparison of Super Bowl teams with losing teams

Both Super Bowl teams have higher probabilities than the two 3–13 teams.

Getting first downs helps win championships!

Wrap up

So with a relatively short program you can dig into the math behind football. (Disclaimer: As I was born and raised in Germany, I haven’t thrown a football at any level. I’m the classic armchair quarterback ;-) )

Next steps

  • You can aggregate the play-by-play data into drive data. Then you can attach the result of a drive to every play in it and try to predict the result of a drive depending on field position etc.
  • Only look at the last two minutes of a game or a half. Is there any QB you want to play for your team? (Is it Aaron Rodgers?)
  • Is there a development over time from season to season? There is an R package, nflscrapR, that you can use to download more data.
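
For the first idea, attaching a drive result to every play could look roughly like this. The column names game_id, drive and touchdown are hypothetical and the rows are made up; the real dataset may name and encode these differently.

```python
import pandas as pd

# Made-up plays; game_id, drive and touchdown are hypothetical column names.
plays = pd.DataFrame({
    "game_id":   [1, 1, 1, 1, 1, 2, 2],
    "drive":     [1, 1, 1, 2, 2, 1, 1],
    "down":      [1, 2, 3, 1, 2, 1, 2],
    "touchdown": [0, 0, 1, 0, 0, 0, 0],
})

# A drive ended in a touchdown if any of its plays did;
# transform broadcasts that result back to every play of the drive.
plays["drive_td"] = (
    plays.groupby(["game_id", "drive"])["touchdown"].transform("max")
)
print(plays)
```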

You can find the code for this analysis on GitHub: https://github.com/dfv-ms/udacity_nd025_term2_project_1
