NFL Play Prediction Using Computer Vision

Published in

Geek Culture

8 min readMay 12, 2021

Players lining up at the goal line to prepare for the next play. (Souce: Flickr by SteelCityHobbies)

National Football League (NFL) is the biggest sport industry in the United States. Due to its highly competitive nature, the NFL attracts a massive audience around the globe. In particular, during the Super Bowl, more than 100 million viewers watch the game and the cost of a 30-second commercial is around 5 million dollars.

To win the Super Bowl, players and coaches must commit a year-long effort. Nowadays, advanced equipment and advanced data analysis make it NFL teams easier to find useful information from accessible data to help players improvise on their game.

In this project, we conducted a simple binary classification problem to test the potential of implementing computer vision in the NFL. More specifically, given an input, the model predicts the type of play: run or pass.

We used two different kinds of input, unstructured and structured data. Each input consists of 4 frames of players positioned on the field and in-game statistics. Statistical data include scores, remaining time, number of downs, and the location of the ball in the field when the frames were captured.

Methods

We proposed two different types of models to predict run or pass: 3D CNN and Pre-trained Resnet3D. Those two models were trained with 4 frames of in-game video and statistical data, while a baseline model was trained using only the statistical data.

Baseline Model

In-game statistical data can be informative to the audience. It is somewhat possible to predict whether the offense will run or pass by just looking at the in-game statistics. For instance, when the play is at third down and there are more than 10 yards to go, it is most likely for the offense to pass since passing has a higher chance to advance 10 yards than running. Furthermore, if the ball is placed at the 1-yard line at the opponent's side, the offense will most likely run because it is a safer choice to gain shorter yards.

Just like how we expect the type of play by inferring to stats, we believe the model can also learn the game of football with the statistical data.

We composed a simple artificial neural network as our baseline model. The architecture of the baseline model is shown below.

3D Convolution Neural Network

Image contains unique information of the game of football that in-game statistics lack. Image contains information on the positions of players on the field, which is essential when predicting a play type.

In the game of football, the defense guesses the type of play by looking at the formation of the opponent players. Meaning, there is a purpose in how each offensive player is lined up at the scrimmage in a certain position.

Without any information on the player’s position, it would be hard to predict the play accurately.

To present an analogy, using only the in-game statistics is like predicting the play by listening to the radio, and using both the images and in-game statistics is like watching the actual game.

Clearly, people can predict better when they are actually watching the game. Thus, we decided to use a 3D CNN to utilize both image and in-game statistics.

In 3D CNN, the model further extracts features from both the temporal and spatial dimensions. More than one spatial-temporal image is arranged in a contagious manner to build three-dimensional space to make three-dimensional space and this arrangement enables the model to extract motion information within the different images/frames of an input.

In our project, we can utilize the frame difference approach to predict the play type. In the game of football, before the ball is snapped, wide receivers and running backs move around to suddenly change the formation to throw off the opponent with a different play. Thus, we decided to capture frames of players moving right before the ball is snapped. A total of 4 frames per play were retrieved and each frame captures a different temporal offensive formation. This allows the model to learn the relationship between the motion schemes and the play type

There are two stages where inputs are being introduced. First, 3D convolution neural network intakes image data (N,3,4,256,256) to process. Then, after images are flattened into one array, in-game structured data (N,9) is introduced at the fully connected step. The model is further trained to learn, using both image and in-game statistics.

Intuitively, when watching the game of football, the audience can predict the play by just looking at the formation of players. For example, if there is no running back around the quarterback, the play will most likely turn out to be a pass. On top of that, if the model can infer to in-game statistics, it will have better control over the game. We do think the model can learn various patterns of formations well enough to accurately predict the play.

Dataset

We obtained our data from NFL-22, which is an official NFL film site. We only focused on the Kansas City Chiefs, since they were the defending champion in the 2019–2020 season >:D

Each team has 16 regular-season games and 4 pre-season games. In a game, the team usually has 50 to 60 pass and run plays. For each play, we captured 4 frames of the players on the field in the following manner: 3 seconds, 2 seconds, 1 second before the ball is snapped and right at the moment the ball is snapped. Along with the frames, we recorded the remaining time, score, and the number of downs at the moment the play was captured.

Data Preprocessing

Once 4 frames were captured for each play, we manually classified the play with the correct label. Then, we saved images into the corresponding folders: run or pass.

Frames were cropped and resized. We tried to crop as little as possible to preserve players' locations. There were some plays that did not contain 4 frames because the play started as the film began to record. In such cases, we duplicated the previous frame to fill up the missing frames.

The image inside the red border reflects the preprocessed image.

Both images and in-game statistical data were subtracted by the mean across every individual feature in the data. This enforces every feature to be centered around the origin.

Different stadiums have different camera views. Thus, each film is shot at a different angle. However, we did not align or transform images to construct one uniform view, because the model will be less generalized. We want the model to be flexible to learn general features on different camera views. We believe different angles will produce various information about the team’s formation before the play begins.

After collecting 1000 samples, we noticed the dataset is highly imbalanced. The dataset consists of about 67 percent of pass play and 33 percent of run play. Having an unbalanced dataset could lead to a major problem when running the model. It could decrease the model’s performance when the quantity of the pass is substantially larger than the run because the model could possibly learn to always predict a pass play no matter how players are positioned on the field.

In order to balance out the dataset, we decided to flip run images right to left, doubling the amount of run play. Such image augmentation is valid in our case because by flipping the image horizontally, we are just changing the side of the field while everything else about the game is preserved. After augmenting the data, we have captured a total of 1355 plays (5420 frames). More specifically, 689 pass plays and 666 run plays were collected

Result

Clearly, 3D CNN performed superior to other models, scoring the highest in F1, recall, and precision in both play types. This indicates that the model does relatively well on predicting both types of play and there is no bias in its decision making.

We have sampled out images that were being correctly classified vs incorrectly classified to inspect what type of formations threw off the model during the prediction.

Left) Conv3D inaccurately predicted images. Right) accurately predicted images

We will indicate incorrectly labeled images as false images and correctly labeled as true images. If we compare false images to true images, we can clearly see that false images contain more noise, such as shadows on the field, crowds, and players showing up on the side, and a patch of green grass missing on the field. Having crowds in the background can definitively confuse the model on which group of people to focus.

Moreover, even with our human eyes, it is sometimes hard to locate players who are covered in the shadows. The model could, perhaps, exclude parts of the team and only consider parts of the formation. On the other hand, true images show nothing but players lined up at the scrimmage. We could hardly see any obstacles that could possibly hinder locating players. From this analysis, we were able to see the importance of the quality of data and the existence of the variability in the data.

Limitation

The main reason for the relatively low accuracy of the model is due to a small dataset. Even though we managed to obtain more than 1300 samples, it is not sufficient enough to train the model to pick up the complexity that involves in the game of football in order to predict the play type. Also, within the data itself, we are able to see some varieties. For example, the Kansas City Chiefs has two uniforms: red for home and white for away. Different colors can confuse the model when deciding which side is the offense with limited training data. Furthermore, many games were played outdoors so the data can vary with different weather conditions. For instance, some games included a significant amount of rain, snow, and shadows on the field, which makes it harder to locate players. All of these varieties can add complexity to the problem and impede the process of learning.

Conclusion

In this project, we conducted a binary classification problem to predict the type of play, run or pass. Overall, we showed that the 3D Convolutional Neural network can extract spatial and temporal information efficiently.The result proves our hypothesis that the model performs better when trained with both images and in-game statistics. The Conv3D model achieved 70 percent in the test set with a relatively small dataset. This conveys that there is a promising potential in implementing deep learning in NFL once trained with sufficient data.

List your favorite NFL team in the comment! Happy Touchdown everyone :D