Final Project: Exploratory Analysis of NFL Fantasy Football Data: Unveiling Trends, Insights and Creating Positional Tiers through Clustering

Published in INST414: Data Science Techniques, May 10, 2024

Introduction

I have been an avid player in fantasy football leagues for as long as I can remember. However, for the last few years I have struggled to perform well. I decided that this assignment would be a great opportunity to perform a deep exploratory analysis of fantasy football data in the hope that it will help me win a championship next season. For those unfamiliar with fantasy football, it is a game you play with friends and/or family over the course of the National Football League (NFL) season. Everyone drafts players from NFL teams, and each week of the season their fantasy team faces off against another team in the league. Each player earns points each week based on their stat line from that week's game. The fantasy team with the higher total score gets a win for the week and the other team gets a loss. At the end of the season, the best teams make the playoffs and compete for the championship.

In fantasy football, the draft held before the season is the most pivotal event of the year. The person who drafts best and finds the most undervalued players is, in most cases, the person who wins the league. The question I asked in the original module one assignment was:

“How do average draft positions (ADP) correlate with fantasy points scored by players and is there a clear pattern?”

In this extension of the module one assignment, I am adding a second question:

“What are the top five tiers of players at each position?”

By identifying the top five tiers of players at each position over the last few seasons, we can find players who sit in a higher tier than the players drafted around them, which marks them as undervalued at their draft position.

The dataset I am using comes from Kaggle: https://www.kaggle.com/datasets/robertcurrie/nfl-adp-and-fantasy-pts-fantasy-pros-2020-2022

The dataset contains data from the 2020, 2021 and 2022 NFL seasons. For this extension of the first assignment, I am going to use the data from all three seasons rather than only 2022. 2022 was the only season in which I had a losing record in all three of my fantasy leagues, but it still makes sense to use more data.

Stakeholder and Decision Context

The stakeholders are fantasy football enthusiasts, analysts and team managers, especially those gearing up for draft season. Basically, the stakeholders are people like me who are looking to improve their drafting ability for fantasy football. Answering these questions will provide insights into different draft strategies, player evaluations and roster compositions. The answers directly inform decisions about player selection based on past season success, position and current NFL team. Other topics that should be considered are the draft order and NFL team dynamics, which ultimately affect the success or failure of fantasy football teams throughout the season. For example, another question to consider is whether players on NFL teams with a winning record generally perform better than players on teams with a losing record. As we analyze the data, stakeholders will want to see that I have considered as many of the factors that can affect player performance as possible.

Data Description and Relevance

As mentioned before, the dataset is from Kaggle and is titled “NFL ADP and Fantasy Pts — Fantasy Pros 2020–2022.” It contains player name, position, team, average draft position (ADP), and fantasy points scored across multiple seasons. This dataset is well suited to uncovering correlations between ADP and fantasy performance because it provides a lot of data for both. In Python, we can view, clean and organize the data, enabling us to find patterns and trends that can significantly influence drafting decisions and team outcomes.

Data Collection and Exploration

To begin the analysis, I used the pandas library in Python to import and manipulate the dataset. Below is the code used to load the dataset and perform initial operations:
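A minimal sketch of this kind of loading step looks like the following. The column names and sample rows are placeholders I am assuming for illustration, not the actual contents of the Kaggle files, and the snippet reads small in-memory samples so it runs on its own. With the real download you would pass the CSV file paths to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Placeholder stand-ins for the two CSV files (ADP data and weekly points).
# Column names and rows are assumptions, not the real Kaggle contents.
ADP_CSV = io.StringIO(
    "Rank,Player,Team,POS,AVG\n"
    "1,Player A,AAA,RB1,1.5\n"
    "2,Player B,BBB,WR1,2.8\n"
)
WEEKLY_CSV = io.StringIO(
    "Player,Pos,Team,1,2,AVG,TTL\n"
    "Player A,RB,AAA,12.0,8.0,10.0,20.0\n"
    "Player B,WR,BBB,20.0,16.0,18.0,36.0\n"
)

adp = pd.read_csv(ADP_CSV)
weekly = pd.read_csv(WEEKLY_CSV)

# Quick first look: size, column types, and the first few rows of each table.
for name, df in [("adp", adp), ("weekly", weekly)]:
    print(name, df.shape)
    print(df.dtypes)
    print(df.head())
```

Printing the shape, dtypes and head of each table is usually enough to spot empty columns or oddly formatted fields before any analysis starts.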

In the above code, I loaded the two datasets for the data from the 2022 NFL season. Here is the output from the print statements:

As you can see, some data cleaning needed to be done before I could start my analysis.

Data Cleanup and Preprocessing

During the exploration, I encountered a few common data issues, but the most significant one was in the ADP data, where three columns had no values. These columns are useless for the analysis, so I removed them from the data frame in Python. Also, in the ADP data, I added a column that holds the position label for each player without the accompanying positional rank number (for example, RB rather than RB1). This will help me later during the data analysis. These changes do not affect the actual dataset saved on my machine. Here is the code:
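A rough sketch of these two cleaning steps, using a toy table, might look like this. The column names (`POS`, `AVG`, `Empty1` and so on) and the new `Position` column are assumptions made for the example rather than the names in the real file.

```python
import numpy as np
import pandas as pd

# Toy ADP table; all column names here are assumptions for illustration.
adp = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "POS": ["RB1", "WR1", "RB2"],      # position label plus positional rank
    "AVG": [1.5, 2.8, 4.1],            # average draft position
    "Empty1": [np.nan] * 3,
    "Empty2": [np.nan] * 3,
    "Empty3": [np.nan] * 3,
})

# Work on a cleaned copy so the original data (and the file on disk) is untouched.
# dropna(axis=1, how="all") drops only the columns where every value is missing.
adp_clean = adp.dropna(axis=1, how="all").copy()

# Strip the trailing rank digits: "RB1" -> "RB", "WR12" -> "WR".
adp_clean["Position"] = adp_clean["POS"].str.replace(r"\d+$", "", regex=True)

print(adp_clean)
```

Using `how="all"` matters here: the default would also drop columns that have only a few missing values, which is not what we want.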

Here is the output of the print lines after the data clean:

In the extension of the module assignment, I also did some additional cleaning of the weekly dataset. I removed the week-by-week columns, since we are more focused on each player's average and total points over the full season. To see this, you can look at the code in the GitHub repository.
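As an illustration of that step, here is a small sketch that drops the per-week columns and keeps the season average and total. It assumes the week columns are labelled with the week number ("1", "2", ...) and that the summary columns are called `AVG` and `TTL`, which may not match the real file.

```python
import pandas as pd

# Toy weekly table; the column layout is an assumption for illustration.
weekly = pd.DataFrame({
    "Player": ["Player A", "Player B"],
    "Pos": ["RB", "WR"],
    "Team": ["AAA", "BBB"],
    "1": [12.0, 20.0],     # week 1 points
    "2": [8.0, 16.0],      # week 2 points
    "AVG": [10.0, 18.0],   # average points per week
    "TTL": [20.0, 36.0],   # total points for the season
})

# Treat any column whose label is purely digits as a week column and drop it.
week_cols = [col for col in weekly.columns if str(col).isdigit()]
season = weekly.drop(columns=week_cols)

print(season)
```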

Exploratory Data Analysis and Insights

For the primary analysis, I split the data frames into smaller data frames by position. Here is the code that I created to do so:
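One way to sketch this split is with a pandas `groupby`, building one data frame per position. The toy rows and the column names `Pos` and `TTL` are assumptions for the example.

```python
import pandas as pd

# Toy season-level table; names and numbers are placeholders.
season = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D"],
    "Pos": ["QB", "RB", "WR", "RB"],
    "TTL": [350.0, 210.0, 260.0, 295.0],
})

# One data frame per position, keyed by the position label and
# sorted so the highest-scoring player comes first.
by_position = {
    pos: group.sort_values("TTL", ascending=False).reset_index(drop=True)
    for pos, group in season.groupby("Pos")
}

for pos, frame in by_position.items():
    print(pos)
    print(frame)
```

Keeping the pieces in a dictionary means a position can be looked up by name, for example `by_position["RB"]`, instead of managing a separate variable for each one.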

The printed output is pretty long, so if you want to see it, there is a link to the GitHub repository at the bottom of this post. From looking at the data, it became very clear that a player’s team performance had a lot to do with where they finished in the rankings. For example, Jonathan Taylor was the first running back by ADP. However, he did not finish in the top five in the weekly dataset. The team he plays on, the Colts, had a bad season. We see the same trend with the number one wide receiver, Cooper Kupp, the number three tight end, Kyle Pitts, the number five quarterback, Kyler Murray, and many more. The data suggests that players on good teams perform better than players on bad ones. Therefore, stakeholders should decide which teams they think will be good and try to limit their player selections to those teams.

For the extension of this assignment, I added a few things. First, I am doing all of the analysis, both the original assignment and the extension, on three years of data (2020, 2021 and 2022) instead of just 2022.
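To make the comparison between draft rank and finishing rank concrete, here is a small sketch on made-up numbers. The players, point totals, and column names (`AVG` for ADP, `TTL` for season total) are all assumptions for the example and do not come from the dataset.

```python
import pandas as pd

# Toy ADP and season totals for four running backs (made-up values).
adp = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D"],
    "Position": ["RB", "RB", "RB", "RB"],
    "AVG": [1.5, 6.0, 12.3, 20.4],   # average draft position (lower = earlier)
})
season = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D"],
    "TTL": [150.0, 310.0, 280.0, 90.0],
})

merged = adp.merge(season, on="Player", how="inner")

# Rank within each position: 1 = drafted first / scored the most.
merged["adp_rank"] = merged.groupby("Position")["AVG"].rank(method="min")
merged["finish_rank"] = merged.groupby("Position")["TTL"].rank(
    method="min", ascending=False
)

# Positive = finished better than drafted, negative = finished worse.
merged["rank_change"] = merged["adp_rank"] - merged["finish_rank"]

# Pearson correlation of the two rank columns, which equals the Spearman
# rank correlation when there are no ties.
rank_corr = merged["adp_rank"].corr(merged["finish_rank"])

print(merged[["Player", "adp_rank", "finish_rank", "rank_change"]])
print("rank correlation:", round(rank_corr, 2))
```

In this toy table, the first player drafted finishes third at the position, so their `rank_change` is negative, which is the same kind of pattern described above for a highly drafted player on a struggling team.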

Next, I added clusters to create tiers of players for each position. In the code, I wrote a function that takes one parameter, the name of a position. The function creates 10 clusters using k-means clustering, and these clusters represent the tiers of players at that position. Then, I order the clusters by the average number of points scored by the players in each cluster. Finally, I print out the top five clusters for that position. Below is a screenshot of the function:
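The tiering idea can be sketched as follows. This is not the function from the repository: it is a simplified version that takes a list of (name, average points) pairs rather than a position name, and it uses a small hand-written one-dimensional k-means (Lloyd's algorithm with a deterministic starting point) so that it runs without extra libraries. The helper names `kmeans_1d` and `top_tiers` and the sample players are hypothetical.

```python
from statistics import mean

def kmeans_1d(values, k, max_iter=100):
    """Cluster numbers into k groups with plain k-means (Lloyd's algorithm)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted values.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centroids.append(mean(members) if members else centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

def top_tiers(players, k=10, top_n=5):
    """Return the top_n clusters, best first, as lists of (name, avg_points)."""
    labels, _ = kmeans_1d([pts for _, pts in players], k)
    clusters = {}
    for (name, pts), lab in zip(players, labels):
        clusters.setdefault(lab, []).append((name, pts))
    # Order the clusters by the average points of their members, highest first.
    ranked = sorted(
        clusters.values(),
        key=lambda grp: mean(p for _, p in grp),
        reverse=True,
    )
    return ranked[:top_n]

# Made-up quarterbacks with average points per game.
qbs = [("Player A", 24.1), ("Player B", 23.5), ("Player C", 18.2),
       ("Player D", 17.9), ("Player E", 17.5), ("Player F", 11.0),
       ("Player G", 10.4)]

tiers = top_tiers(qbs, k=3, top_n=2)
for number, tier in enumerate(tiers, start=1):
    print(f"Tier {number}:", tier)
```

The defaults mirror the 10 clusters and top five tiers described above, while the toy call uses smaller numbers because it only has seven players. Clustering on a single value (average points) keeps the tiers easy to interpret, since each tier is simply a band of similar scorers.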

Since the cluster results produce a lot of printed output, you can find them by running the code in the repository. To sum it up, the surprising result is that the position whose tiers have the consistently highest averages is QB. This is an important discovery because many people are under the impression that QB is less important than other positions in fantasy. Therefore, if you can get one of the QBs in the top tiers, you are more likely to have a successful season. Additionally, RB and WR are the next two most important positions according to the clusters. It is important to get some of the better players at those positions as well, because it takes more than one player to win a championship.

Limitations and Considerations

Despite the insights gained on drafting players based on their team, there are a few limitations to acknowledge. First, the dataset spans only three seasons, which limits how well the findings generalize to longer time frames. Additionally, the original version of this analysis used only one of the three seasons provided. Next, external factors such as injuries, team dynamics, and coaching strategies are not captured by the statistics in the data. Finally, while the dataset provides a comprehensive overview, it lacks certain player attributes and game context, which limits the depth of the analysis.

Conclusion

In conclusion, the exploratory analysis of the NFL fantasy football dataset offers valuable insights into the relationship between the fantasy points a player scores and the strength of their NFL team. The data suggests that stakeholders should try to draft players from successful NFL franchises. The clusters show that the position with the most value is QB, followed by RB and WR. The difference between RB and WR is minimal, but the tier data indicates that having a top QB gives you a much better chance to win. By drafting players from good teams and targeting a QB in one of the top tiers, fantasy football team managers can improve their drafting strategies, optimize their player selections, and ultimately elevate their team's performance.

GitHub Repository

Here is the link to the GitHub repository:

https://github.com/DrossTheBoss/INST414-FinalProject