NFL Offensive Statistics Similarity From 2019–2022

Dominic Graziano
INST414: Data Science Techniques
Dec 2, 2023

Data and Collection

This project seeks to find the most similar NFL players based on their offensive statistics over the 2019–2022 seasons. The insights could help NFL front offices find similar players to pursue in free agency or trades, and help sports analysts and fans produce content for their audiences. I found this dataset on Kaggle, and it can be found here. The dataset includes a row for each game in which a player recorded an offensive stat during the 2019–2022 seasons, with the main positions being QB, WR, TE, and RB. All of my code is in a Jupyter Notebook, and the libraries I used were Pandas and Scikit-Learn’s cosine similarity.

Data Cleaning

To get the data into the format I wanted, with a single row holding all of the aggregated stats for each player, I used pandas to group by the player’s name. I then dropped a couple of columns that were irrelevant, along with those that didn’t hold numerical values. The columns that remained were the stats to compare, and the analysis could begin.
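The grouping step above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (`player`, `position`, `pass_yards`, and so on) and the values are made up, and summing is only one assumed way to aggregate the per-game rows.

```python
import pandas as pd

# Hypothetical per-game rows; these column names and values are assumptions,
# not the actual schema of the Kaggle dataset.
games = pd.DataFrame({
    "player": ["Player A", "Player A", "Player B", "Player B", "Player C"],
    "position": ["QB", "QB", "RB", "RB", "WR"],
    "pass_yards": [250, 300, 0, 0, 0],
    "rush_yards": [40, 60, 90, 110, 5],
    "rec_yards": [0, 0, 20, 30, 120],
})

# Keep only the numeric stat columns; non-numeric ones such as "position" are dropped.
numeric_cols = games.select_dtypes(include="number").columns.tolist()

# One row per player, with each stat summed across that player's games.
player_stats = games.groupby("player")[numeric_cols].sum()
print(player_stats)
```

Summing keeps totals, so players with more games will have larger numbers; averaging per game would be an alternative choice.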

Analysis

After cleaning the data, I hand-picked a few players to find similarities for: Lamar Jackson, Tyreek Hill, Gus Edwards, Mark Andrews, and Christian McCaffrey. I made sure there was some diversity in the positions they played so that their similar players would not overlap too much. I then used Scikit-Learn’s cosine similarity to create a similarity matrix in which each row holds one player’s similarity scores against every other player.
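As a rough sketch of building that matrix, the snippet below runs scikit-learn's `cosine_similarity` on a small made-up table of aggregated stats. The player labels, column names, and values are placeholders, not data from the article.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical aggregated stats, one row per player (values are made up).
player_stats = pd.DataFrame(
    {
        "pass_yards": [550, 0, 0, 500],
        "rush_yards": [100, 200, 5, 10],
        "rec_yards": [0, 50, 120, 0],
    },
    index=["Player A", "Player B", "Player C", "Player D"],
)

# Pairwise cosine similarity between every pair of player rows.
sim = cosine_similarity(player_stats.values)

# Label rows and columns with player names so scores can be looked up by name.
sim_df = pd.DataFrame(sim, index=player_stats.index, columns=player_stats.index)
print(sim_df.round(3))
```

Cosine similarity compares the direction of two stat vectors rather than their size, so a player with a similar mix of stats but lower totals can still score close to 1.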

Next, I looped through my list of target players, found the top 10 most similar players for each one in the similarity matrix, and stored them in a dictionary keyed by the target player. The output for one of my targeted players looks like this:

For Lamar Jackson these similar players make sense, as most of them have dual-threat ability or a tendency to run with the ball from the quarterback position.

The generated similarities for Tyreek Hill made me skeptical, with someone like Sterling Shepard coming out as the most similar to one of the best receivers in the league over that period. It should also be noted that two tight ends appeared among his similar players.

In the case of Christian McCaffrey, I generally agree with the similar players, as many of them showcase both rushing and receiving ability from the running back position. To me it makes sense that Alvin Kamara is the most similar to McCaffrey over that span of time.
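The loop that produced these lists can be sketched along the following lines. It is a minimal version under assumptions: the similarity matrix is a small hand-written placeholder, the player labels are hypothetical, and `top_n` is set to 2 only to keep the example short (the analysis used 10).

```python
import pandas as pd

# Hypothetical similarity matrix with made-up scores, standing in for the real one.
names = ["P1", "P2", "P3", "P4"]
sim_df = pd.DataFrame(
    [
        [1.0, 0.9, 0.2, 0.7],
        [0.9, 1.0, 0.3, 0.6],
        [0.2, 0.3, 1.0, 0.1],
        [0.7, 0.6, 0.1, 1.0],
    ],
    index=names,
    columns=names,
)

targets = ["P1", "P3"]  # hand-picked target players
top_n = 2               # the analysis used 10

similar = {}
for target in targets:
    # Drop the player's own entry, which would otherwise always rank first.
    scores = sim_df.loc[target].drop(target)
    # Keep the names of the highest-scoring players, best match first.
    similar[target] = scores.nlargest(top_n).index.tolist()

print(similar)
```

Dropping the target's own row entry matters because every player has a similarity of 1 with themselves.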

Limitations

A limitation of this analysis is that it aggregates data over a small number of seasons, so it doesn’t take into account the number of games a player missed due to injury or other circumstances. Some of the similarities also don’t make sense to me and should be investigated further, such as Sterling Shepard being the most similar player to Tyreek Hill. I would also look into running the similarity metrics separately for each position rather than on the combined dataset of all positions.
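One way the per-position idea might look is sketched below. It assumes a `position` column is kept alongside the aggregated stats, which the analysis above did not do, and all names and values are made up.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical aggregated stats with a position label kept for filtering.
player_stats = pd.DataFrame(
    {
        "position": ["QB", "QB", "RB", "RB", "RB"],
        "pass_yards": [4000, 3500, 10, 5, 20],
        "rush_yards": [900, 150, 1200, 800, 950],
        "rec_yards": [5, 10, 700, 100, 650],
    },
    index=["QB1", "QB2", "RB1", "RB2", "RB3"],
)

per_position = {}
for position, group in player_stats.groupby("position"):
    # Compare players only against others at the same position.
    features = group.drop(columns="position")
    sim = cosine_similarity(features.values)
    per_position[position] = pd.DataFrame(sim, index=group.index, columns=group.index)

print(per_position["RB"].round(3))
```

With this split, a tight end could no longer appear in a wide receiver's list, at the cost of missing cross-position comparisons.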

GitHub Link
