Predictive Analysis On NFL Plays

Frank Jin
6 min read · Jan 26, 2020


American football is an exciting yet complex sport. With 22 players on the field and countless situational factors at work on every snap, it is difficult to quantify the value of specific actions within a play. Traditional metrics such as ‘yards per carry’ or ‘total rushing yards’ can be misleading; inspired by a Kaggle competition, I will develop a model in Python to predict how many yards a team will gain on a play when a ball carrier takes the handoff.

Before diving into the detailed analysis, we first need the following libraries:

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as py
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

# Data handling
import pandas as pd
import numpy as np
import datetime as dt
import math
from string import punctuation

# Modeling and evaluation
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc
from sklearn import metrics
import xgboost as xgb
import keras

Exploratory Data Analysis

Many analysts dive straight into cleaning the data and building models as soon as they get the data. However, it is usually worth performing Exploratory Data Analysis (EDA) first: EDA helps analysts detect mistakes, check assumptions, and choose appropriate algorithms, and it also lets them explore the relationships among variables, gauge their rough strength, and, most importantly, refine the features that will be used in the later steps.

In my EDA, I first assessed the relationships among variables through a correlation matrix. However, most variables turned out not to be strongly correlated, and the plot provided only limited useful information, so in the next step I also explored the distributions of some features.

Correlation matrix among variables
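A minimal sketch of producing such a heatmap, assuming the Kaggle training file is loaded into a DataFrame called train (the file path and the variable name are assumptions, not my exact code):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")  # assumed path to the Kaggle training data

# Correlate only the numeric columns and draw a heatmap
corr = train.select_dtypes(include="number").corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0, square=True)
plt.title("Correlation matrix among numeric variables")
plt.tight_layout()
plt.show()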

One interesting fact from the graph is that on most plays, teams do not gain many yards when the ball carrier takes the handoff: the gained yards are concentrated between -5 and 10. Therefore, my predictions should also be concentrated in this range; if most of my predicted yards were above 20, I would know that something is wrong with my model.

Distribution of some variables
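For the yardage distribution in particular, a rough sketch of the plot, continuing with the train DataFrame loaded above and the Yards column from the Kaggle data, might be:

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.hist(train["Yards"], bins=range(-20, 50), edgecolor="black")
plt.axvspan(-5, 10, color="orange", alpha=0.2, label="most common range")
plt.xlabel("Yards gained on the play")
plt.ylabel("Number of plays")
plt.legend()
plt.show()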

As a new football fan, I still don’t fully understand some of the strategic moves or formations, so I plotted the players’ positions on a blank football field and created some more targeted graphs to help me understand the formations. The following graph shows the formation on one play. As we can see, two defensive players stand slightly apart from their teammates, while on the other side the rusher and another offensive player stand very close to each other. Moreover, the offensive players are, overall, clustered more tightly than the defensive players. Certain formations and the distances between players may therefore have important effects in the model I am going to build.

Player positions
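A sketch of how one play’s formation can be drawn, assuming the tracking columns X, Y, Team, NflId, NflIdRusher, and PlayId from the Kaggle data (an illustration rather than the exact code behind the graph above):

import matplotlib.pyplot as plt

play = train[train["PlayId"] == train["PlayId"].iloc[0]]  # pick one play

fig, ax = plt.subplots(figsize=(12, 6))
for team, color in [("home", "royalblue"), ("away", "firebrick")]:
    side = play[play["Team"] == team]
    ax.scatter(side["X"], side["Y"], c=color, s=80, label=team)

# Highlight the ball carrier
rusher = play[play["NflId"] == play["NflIdRusher"]]
ax.scatter(rusher["X"], rusher["Y"], c="gold", s=200, marker="*", label="rusher")

ax.set_xlim(0, 120)   # field length in yards, including end zones
ax.set_ylim(0, 53.3)  # field width in yards
ax.legend()
plt.show()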

Finally, since I spent four years at the Ohio State University, I know we have a very competitive football team, so I suspected that a player’s college might be related to the yards gained on a play. One criterion for evaluating the strength of a college football program is the number of its graduates who joined the NFL, so I made a word cloud to identify the strongest programs. In the cloud, Ohio State University and two other schools are the most prominent, which suggests that these colleges are more competitive overall.

Word cloud
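A sketch of building the word cloud, assuming the wordcloud package and the PlayerCollegeName column from the Kaggle data:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Weight each college by how many player rows list it
college_counts = train["PlayerCollegeName"].value_counts().to_dict()

wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(college_counts)

plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()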

Feature Selection

Since the original dataset provided by Kaggle contains a large number of variables, my first task was to combine some of them and create new ones. For example, the dataset contains each player’s height and weight, which can be replaced by the Body Mass Index (BMI) calculated from the two. Additionally, as suggested by the EDA and by many professional football analysts, the average distance to the team’s centroid and the average distance between each player and the quarterback are two important predictive variables, so I defined and added these features to the dataset.

Partial code for redefining variables
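To make that concrete, here is a minimal sketch of the three engineered features; the helper names are mine, and the column names (PlayerHeight, PlayerWeight, X, Y, Position, Team, PlayId) follow the Kaggle data:

import numpy as np

def height_to_inches(h):
    # PlayerHeight is stored as a string such as "6-2" (feet-inches)
    feet, inches = h.split("-")
    return int(feet) * 12 + int(inches)

train["HeightIn"] = train["PlayerHeight"].apply(height_to_inches)
# BMI from weight in pounds and height in inches
train["BMI"] = 703 * train["PlayerWeight"] / train["HeightIn"] ** 2

def add_distance_features(play):
    play = play.copy()
    # Distance of each player to his own team's centroid
    cx = play.groupby("Team")["X"].transform("mean")
    cy = play.groupby("Team")["Y"].transform("mean")
    play["DistToCentroid"] = np.hypot(play["X"] - cx, play["Y"] - cy)
    # Distance of each player to the quarterback, if one is on the field
    qb = play[play["Position"] == "QB"]
    if len(qb) > 0:
        play["DistToQB"] = np.hypot(play["X"] - qb["X"].iloc[0],
                                    play["Y"] - qb["Y"].iloc[0])
    return play

train = train.groupby("PlayId", group_keys=False).apply(add_distance_features)

Averaging DistToCentroid and DistToQB within each play then gives the per-play features described above.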

Data Cleaning

The last step before building the model is cleaning the data. In the previous step, I already redefined some variables so that equivalent values share the same form. For example, under the turf variable, “grass” and “natural grass” refer to the same kind of turf, so I relabeled both as “natural”, which reduces noise when building the model. Another tricky part of the data is that some team abbreviations are inconsistent: the Arizona Cardinals should be ARZ rather than the ARI recorded in the dataset, the Baltimore Ravens should be BLT instead of BAL, and so on. In this step, I further processed the data to suit the machine learning algorithms. I removed the categorical features because scaling the data breaks if they are kept.
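A sketch of these cleaning steps, assuming the Turf and team-abbreviation columns from the Kaggle data; the mappings shown are only the examples mentioned above, not the full lists:

from sklearn.preprocessing import StandardScaler

# Collapse equivalent turf labels into one value
train["Turf"] = train["Turf"].replace({"Grass": "natural", "Natural Grass": "natural",
                                       "natural grass": "natural"})

# Align the inconsistent team abbreviations (ARZ vs ARI, BLT vs BAL, and so on)
abbr_map = {"ARI": "ARZ", "BAL": "BLT"}
for col in ["PossessionTeam", "HomeTeamAbbr", "VisitorTeamAbbr"]:
    train[col] = train[col].replace(abbr_map)

# Keep only numeric features before scaling; StandardScaler cannot handle strings
features = train.select_dtypes(include="number").drop(columns=["Yards"])
scaler = StandardScaler()
X = scaler.fit_transform(features)
y = train["Yards"]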

Building Models

Instead of XGBoost, I used LightGBM to build my model. Like XGBoost, LightGBM is a tree-based gradient boosting algorithm; it was developed by Microsoft and offers faster training and higher efficiency, and some data scientists consider it an improved successor to XGBoost. As I mentioned before, once the data is clean, building a model becomes very easy. By tuning the model’s parameters, it becomes more and more accurate, and I used cross-validation during training so that each parameter setting is evaluated reliably rather than overfit to a single split.
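A minimal sketch of this step, under the assumption that each observed yardage value is treated as its own class so the model can return a probability per yardage; the hyperparameters are illustrative, not the tuned values:

import lightgbm as lgb
from sklearn.model_selection import cross_val_score

model = lgb.LGBMClassifier(
    objective="multiclass",
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=63,
)

# 5-fold cross-validation gives a more honest read on each parameter setting
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("CV accuracy:", scores.mean())

model.fit(X, y)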

Predicting Play Results

Finally, let’s predict the yards! Since the original data from Kaggle is already split into a training part and a validation part, I also needed to process the validation data the same way I processed the training data (which is why I defined so many functions in the previous steps). One thing to notice is that the result is not a single number per play. Normally a model returns the one predicted value with the highest probability, but in this scenario I want the full distribution of probabilities; in other words, I want the probability of each of the 199 possible outcomes, ranging from -99 yards to 99 yards. As a result, the final output is a table with 199 columns, each holding the probability associated with one of those outcomes.

Part of the results
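Roughly, the class probabilities can be mapped onto the 199 yardage columns like this (a sketch; X_val stands for the validation features processed the same way as the training data, and the Yards-99 through Yards99 column names follow the Kaggle submission format):

import numpy as np
import pandas as pd

proba = model.predict_proba(X_val)  # one column per yardage class seen in training

yard_range = np.arange(-99, 100)    # the 199 possible outcomes
out = pd.DataFrame(0.0, index=range(len(X_val)),
                   columns=[f"Yards{v}" for v in yard_range])

# Drop each predicted class probability into its matching yardage column
for j, yards in enumerate(model.classes_):
    out[f"Yards{int(yards)}"] = proba[:, j]

# The competition scores the cumulative distribution, so the per-yard
# probabilities are accumulated from left to right
out = out.cumsum(axis=1).clip(upper=1.0)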


Frank Jin

M.Sc. Quantitative Management (MQM): Business Analytics, Duke University; Master of Accounting, Ohio State University