Fantasy Football

Published in

Web Mining [IS688, Spring 2021]

23 min read · May 14, 2021

By: Jeremy Langenderfer, Wismy Seide, Nayana Kumari, Robert Rose

Whether you have never heard of Fantasy Football or have played it for years, you may think that it only became popular recently. That is not the case: Fantasy Football dates back to 1962. Bill “Wink” Winkenbach, then a minority investor in the Oakland Raiders, is credited with inventing the concept alongside public relations executive Bill Tunnell and sportswriter Gordon “Scotty” Stirling. The trio came up with the idea during a Raiders road trip while staying at a New York City hotel. The concept went public in 1969, when Andy Mousalimas, a bar owner and original member of the founders’ league, the GOPPPL (Greater Oakland Professional Pigskin Prognosticators League), launched a public league for customers at his Kings X Sports Bar in Oakland, California. Mousalimas is considered a major figure in fantasy football’s early growth. The popularity of Fantasy sports has climbed steadily since: trends show that the number of players has increased by over one million each year. To give an idea of just how much Fantasy Football has grown, an estimated 59.3 million people participated in 2017.

The Problem

Fantasy Football’s popularity keeps growing: an estimated 59.3 million people played in 2017, and trends show that number rising by over one million each year. Our project therefore centers on Fantasy Football. Each season, the common problem users face is which NFL players to draft and in what order. The Fantasy Football draft is a very strategic process in which users compete against others in their own league and try to assemble the most productive team possible. Users typically research how well players performed the year before and use this data to make informed decisions. Most Fantasy Football leagues require participants to pay a fee to join. The prize for picking the right team, you might ask? Some participants can win several hundred or up to one thousand dollars!

Based on the feedback provided, our group discussed what problem we are attempting to solve. Ideally, we want to give the Fantasy Football player the best chance of a successful season. There are thirty-two NFL teams, and each has a top player at every position, any of whom may be available for a Fantasy Football participant to draft. Below, we categorize players based on their historical statistics from previous seasons, which gives the participant valuable information when choosing players for their team. Historical data is one of the best resources available for predicting how well a player will perform in future seasons.

When a Fantasy Football player participates in the Fantasy Draft, they try to build their team by filling each position. We feel that giving them pertinent information on the top players’ performance, categorized by position, will enable them to build the best Fantasy team possible.

Data Collection

During our initial research into this project, we planned to use the API documented at https://api.nfl.com/docs/league/models/statistics/index.html. However, we found that access to this API would cost a significant amount of money. Through further group discussion and research, we found another resource that provides the data Fantasy Football players need to make informed decisions during draft season. This resource, which is listed under “References,” can be found at https://www.pro-football-reference.com/.

The data that we found useful from this resource were the seasonal and weekly statistics for each NFL player, which can be used to calculate how well a player performs in Fantasy Football leagues. We can also review past seasons, which is useful for judging how consistently an NFL player performs. Another feature that may prove useful is the ability to analyze how well certain NFL players perform against specific teams. As anyone familiar with professional sports knows, it is not uncommon for a player to struggle against certain opponents. This could help a Fantasy Football player decide which players to start during the season: if an NFL player is having a successful season but struggles against a certain opponent, the Fantasy Football player may want to substitute another player at that position to score more Fantasy points. From this resource, we collected all of this statistical data in a .csv file or Excel workbook.

Our goal is to explore the available data and provide analyses that will help a Fantasy Football participant draft the best NFL players for their team, that is, the team that produces the most Fantasy points. The first exploration of the dataset therefore isolates the 2020 season’s NFL players by position, along with their ranking in the 2020 Fantasy Football season.

Data Analysis

For the first analysis, we used the dataset from https://www.pro-football-reference.com/, which covers the 2010 through 2020 seasons. To give an idea of what can be gleaned from this dataset, we focus on 2020 alone, but the Python code can be adjusted to cover whichever year or years a Fantasy player wants to study. Our goal is to illustrate what is possible when using Python to analyze data and then interpret that data during Fantasy Football draft season.

Before analyzing the data, it will be necessary to import the Pandas library in Python and ExcelWriter and ExcelFile, both from the Pandas library. These imports will be necessary for producing the final product in an Excel file for easier interpretation. Below is an example of how the imported libraries will appear. Now that the libraries are imported, the next step will be to read the dataset for this project. In this example, the dataset was stored locally within the same file path as this Python project. Below is the Python code for reading in this dataset.

After reading the dataset, it is important to make sure the dataset is read properly. We can do this by returning the column headings. When returning the column headings, we also observed that some columns weren’t necessary to analyze this data. So, in reviewing the dataset and columns within the dataset, we removed the following columns from the dataset (‘Team’, ‘Games.1’, ‘Scoring’, ‘Scoring.1’, ‘Scoring.2’, ‘Fantasy.1’). The below example will illustrate the Python code necessary to perform these steps.

The example below will illustrate all of the columns that existed before using the drop() function and the columns that now remain after the drop() function was executed.
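As a rough sketch of the column check and the drop() step described above, the snippet below uses a tiny made-up DataFrame in place of the real 2020 sheet. The dropped column names come from the article; the columns that remain (`Player`, `FantPos`, `Games`, `Fantasy`) and all of the values are assumptions for illustration only.

```python
import pandas as pd

# Tiny stand-in for the 2020 sheet; every value here is made up.
df = pd.DataFrame({
    "Player": ["Player A", "Player B"],
    "Team": ["AAA", "BBB"],
    "FantPos": ["RB", "QB"],
    "Games": [16, 15],
    "Games.1": [16, 15],
    "Scoring": [0, 0],
    "Scoring.1": [0, 1],
    "Scoring.2": [0, 0],
    "Fantasy": [300.0, 280.0],
    "Fantasy.1": [350.0, 280.0],
})

# Column headings before the drop
print(list(df.columns))

# Columns the article lists as unnecessary for this analysis
unneeded = ["Team", "Games.1", "Scoring", "Scoring.1", "Scoring.2", "Fantasy.1"]
trimmed = df.drop(columns=unneeded)

# Column headings after the drop
print(list(trimmed.columns))
```

Because drop() returns a new DataFrame, the original `df` keeps all of its columns, which makes it easy to compare the two listings.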

Now that the unnecessary columns have been removed, we can start isolating this dataset to return the top thirty-two NFL players for each respective position. In this example, we are looking to find the top thirty-two NFL players for the positions of Quarterback (QB), Running Back (RB), Tight End (TE), and the top fifty NFL players for Wide Receiver (WR). It’s important to first look at how these players are categorized within the dataset. Upon review, we can see that the NFL player’s position is listed under the column “FantPos,” which stands for Fantasy Position. This is illustrated below.

Now that we have this information, the objective is to pull the top players by position from the “FantPos” column. The below Python code demonstrates retrieving the respective NFL position along with how many rows to display per each position. The final results are then exported to a new Excel file labeled for each respective position.
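One way this per-position selection might look is sketched below. The helper name `top_players`, the `FantPt` points column, and the sample rows are hypothetical; only the `FantPos` column and the row limits (32 for QB, RB, and TE, 50 for WR) come from the article.

```python
import pandas as pd

# Made-up rows standing in for the 2020 sheet.
df = pd.DataFrame({
    "Player": ["RB One", "RB Two", "RB Three", "QB One", "QB Two", "WR One", "TE One"],
    "FantPos": ["RB", "RB", "RB", "QB", "QB", "WR", "TE"],
    "FantPt": [250.0, 310.0, 190.0, 340.0, 380.0, 260.0, 210.0],
})

# Rows to keep per position, as described in the article.
rows_per_position = {"QB": 32, "RB": 32, "TE": 32, "WR": 50}


def top_players(frame, position, n, points_col="FantPt"):
    """Return the n highest-scoring players at one fantasy position."""
    at_position = frame[frame["FantPos"] == position]
    ranked = at_position.sort_values(points_col, ascending=False)
    return ranked.head(n).reset_index(drop=True)


top_by_position = {
    pos: top_players(df, pos, n) for pos, n in rows_per_position.items()
}
print(top_by_position["RB"])

# To export one workbook per position (needs an Excel writer such as openpyxl):
# for pos, table in top_by_position.items():
#     table.to_excel(f"top_{pos}.xlsx", index=False)
```

Keeping the limits in a dictionary means a different league format only requires changing the numbers, not the selection logic.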

For the sake of saving space, only portions of the Excel results for each position are displayed below. The first result is the Running Backs (RB), followed by Quarterbacks (QB), Wide Receivers (WR), and finally, Tight Ends (TE).

Using the above information, a Fantasy Football player during the draft season can use this information and decide the best possible player for each respective position. Now, it’s important to keep in mind that nothing is ever a guarantee, but this information does prove useful when selecting NFL players. As you can see, it is very beneficial to see how NFL players performed during last season, as this is an indicator of how well a player is likely to perform during the next season. If a Fantasy Player wanted to expand on this information, they could use the above Python and pull in additional years and determine how well an NFL player performs on average. This would likely be a safer analysis, but the point here was to illustrate how isolating the NFL dataset to view each position would be helpful in a Fantasy draft, as the common problem from experience is which player/position to draft first.

Team Analysis

Based on our data, we were able to compile a team analysis. We first dropped the unnecessary columns and then renamed the remaining columns to a consistent format.

Then we split the players into separate data frames, one per position.

Why is this team analysis critical? It sets aside the player’s popularity and looks strictly at the numbers. With the fantasy points captured per position, we can compute an average per team.

From our team analysis, we see that the Dallas Cowboys average the most fantasy points per team. This is crucial when drafting players for fantasy. Most players’ stats are tightly bunched, so when two players look equal, we can look at the output of their teams. If I am choosing between a Dallas player and a Washington player with similar stats, the team analysis points me to the Dallas player, because the Cowboys’ fantasy points are usually higher.
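A minimal sketch of the per-team averaging follows, assuming columns named `Team`, `FantPos`, and `FantPt`. The four rows are invented to show the mechanics and do not reproduce the article's actual team results.

```python
import pandas as pd

# Invented rows; the real analysis uses the full season data.
df = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Team": ["DAL", "DAL", "WAS", "WAS"],
    "FantPos": ["QB", "WR", "QB", "WR"],
    "FantPt": [300.0, 200.0, 220.0, 180.0],
})

# Average fantasy points per team, highest first
team_avg = (
    df.groupby("Team", as_index=False)["FantPt"]
    .mean()
    .sort_values("FantPt", ascending=False)
    .reset_index(drop=True)
)
print(team_avg)

# The same average broken out by position (rows: teams, columns: positions)
team_pos_avg = df.pivot_table(
    index="Team", columns="FantPos", values="FantPt", aggfunc="mean"
)
print(team_pos_avg)
```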

Correlation

We wanted to know the correlation between usage and fantasy points. Football has its rare unicorns: one QB throws 15 times yet racks up 300 yards and three touchdowns, while another throws 40 times and has two touchdowns. Does usage increase fantasy points? Our data analysis can answer this. Once again, the names of the players do not matter. We often focus on the celebrities of the game, household names such as Rob Gronkowski, Aaron Rodgers, or Russell Wilson, who are famous on and off the field. Seeing those names in a fantasy draft can entice someone to draft the familiar name. What they should focus on is ‘usage.’

We take our data and plot it for correlation.

Correlation between Fantasy points and Usage

The plot shows that increased usage and fantasy points are correlated. Players with little to no usage do not generate many fantasy points, and few players have more than 15 usages per game. I believe usage should be a factor when making draft picks. As a rule of thumb, 0 to 5 usages per game is small, 5 to 10 is medium, 10 to 15 is high, and anything above 15 is considerable. So it is not about popularity; it is about usage.
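The correlation and the usage tiers above can be sketched as follows. The article does not say how usage is computed, so `UsagePerGame` is a hypothetical precomputed column, and the numbers are made up. The bin edges follow the tiers in the text, with each upper edge treated as inclusive (an assumption).

```python
import pandas as pd

# Made-up sample; UsagePerGame is a hypothetical precomputed column.
df = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E"],
    "UsagePerGame": [2.0, 6.0, 11.0, 14.0, 18.0],
    "FantPt": [20.0, 75.0, 140.0, 150.0, 260.0],
})

# Pearson correlation between usage and fantasy points
r = df["UsagePerGame"].corr(df["FantPt"])
print(f"correlation: {r:.2f}")

# Usage tiers from the text: 0-5 small, 5-10 medium, 10-15 high, above 15 considerable
bins = [0, 5, 10, 15, float("inf")]
labels = ["small", "medium", "high", "considerable"]
df["UsageTier"] = pd.cut(
    df["UsagePerGame"], bins=bins, labels=labels, include_lowest=True
)
print(df[["Player", "UsagePerGame", "UsageTier"]])
```

A coefficient close to 1 on real data would support drafting by usage rather than by name recognition; a scatter plot remains useful for spotting the outliers a single number hides.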

Age Correlation

Is age a factor in our data? The older the player, the more prone they are to injuries. Injuries lead to less production and, of course, to missed games. I believe age and games played are essential. A simple scatter plot of our data lets us see whether this is true:

It appears that when players reach the age of 35, their games played begin to dwindle. This analysis is another factor we can deduce from the data. We can conclude that when players reach age 35 and older, they are more prone to miss games. The age factor is essential when drafting famous players.
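The same question can be checked numerically. This sketch assumes columns named `Age` and `G` (games played) and uses invented rows; the age-35 threshold is the one discussed above.

```python
import pandas as pd

# Invented rows; G is games played in the season.
df = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E", "F"],
    "Age": [24, 24, 29, 29, 35, 36],
    "G": [16, 15, 16, 14, 11, 9],
})

# Average games played at each age
games_by_age = df.groupby("Age")["G"].mean()
print(games_by_age)

# Compare players under 35 with players 35 and older
df["Is35Plus"] = df["Age"] >= 35
avg_games = df.groupby("Is35Plus")["G"].mean()
print(avg_games)

# With matplotlib installed, the scatter plot itself would be:
# df.plot.scatter(x="Age", y="G")
```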

VOR (Value Over Replacement) Analysis

We will now analyze the players based on their Value over Replacement (VOR). Simply put, this is a measurement of a player’s value as compared to a replacement player at a specific point in the season.

We will take the example of Travis Kelce to illustrate VOR. Kelce consistently scored 20 points a game this season, while the other players were not even close to his score. Suppose Kelce is injured for one week. Our next best option (assuming we don’t have a TE on the bench) would be to go to the waivers and grab the next best TE, which could be difficult.

Say the replacement we pick up, Logan Thomas, gets us 6 points. If Thomas gets us 6 points and Kelce consistently gets us 20, Kelce’s VOR for that week is 14 (20 - 6). In other words, had Kelce started that week, we could have expected a solid 20 points. Because he was out, we had to start the next man up, Logan Thomas, who got us 6 points. In essence, our team lost 14 points because Kelce was injured.
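The arithmetic in this example is a single subtraction. The function name below is made up for illustration.

```python
def value_over_replacement(player_points, replacement_points):
    """Points a starter adds beyond the next-best available option."""
    return player_points - replacement_points


# Kelce's expected 20 points against Logan Thomas's 6
kelce_vor = value_over_replacement(20, 6)
print(kelce_vor)  # 14
```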

We will use the concept of “next man up” to measure value for each player: how many more points a player scores than the replacement who would take their place. As we move along, we will see RBs, WRs, and TEs at the top. QBs will not show up until around rank 25, because top QBs aren’t as valuable as top WRs, RBs, and TEs in a fantasy league.

These are estimated fantasy football values, not actuals, but the estimate is representative enough to help assess VOR.

We will use Python and the pandas library for this analysis.

Libraries:

import pandas as pd

Data Source:

We are using the same dataset as the above analysis. The only modification at the source is collapsing two headers into one.

Before Collapsing:

Dataset After collapsing columns:
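The article collapses the two header rows at the source. As an alternative sketch, the same flattening can be done in pandas when a sheet is read with a two-level header. The tiny frame below is invented, and joining the levels with a space is an assumption chosen to match names such as “Receiving Rec”.

```python
import pandas as pd

# Invented two-level header: an empty top level for the identifying columns.
columns = pd.MultiIndex.from_tuples([
    ("", "Player"),
    ("", "FantPos"),
    ("Receiving", "Rec"),
    ("Fantasy", "FantPt"),
])
raw = pd.DataFrame([["Player A", "TE", 100, 200.0]], columns=columns)

# Join the non-empty parts of each header pair into a single name
flat = raw.copy()
flat.columns = [" ".join(part for part in col if part) for col in raw.columns]
print(list(flat.columns))
```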

After the header column is collapsed, let’s take a look at the data.

We will set the pandas configuration for showing all the columns.

# set pandas configuration to display all columns
pd.set_option("display.max_columns", None)

# extract dataset from excel file
data = pd.read_excel('Football_Fantasy_Data_v2.1.xls', sheet_name='2020')
df = pd.DataFrame(data)

# display data properties
print(df.shape)
print(df.head(5))

Our analysis will look into the below columns to build the statistics.

“Fantasy FantPt”: we will pick the top 10 players based on average fantasy points.

Top 10 Players in terms of average fantasy points

“Receiving Rec”: the distribution of players sorted by average receptions (Rec).

Players in Descending order in terms of Rec value

We will be including the columns which help derive the VOR value. Below are the columns.

# filter out unnecessary columns
df = pd.DataFrame(data, columns=['Player', 'Team', 'FantPos', 'Receiving Rec', 'Fantasy FantPt'])
print(df.head())

We need to clean the data for NaN.

# drop rows with NA values
df = df.dropna()

We will have to format the below columns as integers simply because non-number formats cannot be used for this analysis.

# format these columns as integers
df['Rec'] = df['Receiving Rec'].astype(int)
df['FantPt'] = df['Fantasy FantPt'].astype(int)

Standard Fantasy Points alone are not enough because they do not credit receptions, so we create a PPR (points per reception) Fantasy point column that adds one point per catch. This column will be used in the aggregation later.

# create a column for full PPR
df = df.assign(PPRFantPt=lambda x: x.FantPt + x.Rec)

The resultant dataset now has PPR Fantasy points calculated.

Now, we, of course, cannot use the entire dataset simply because the result may also include players having lower Fantasy points. To avoid this, we will put some thresholds based on what we see in the data. We will call it position cutoff.

position_cutoff = {
    'RB': 25, 
    'QB': 13,
    'WR': 25,
    'TE': 13
}

Now we have a dictionary with positions as keys and cutoff points as values. In a 12-team league with 2 RB, 2 WR, 1 TE, and 1 QB starting spots, 24 RBs are in starting lineups at any point. The next best RB we would have to pick up if one of our starting RBs is out is therefore RB #25. (You can probably see how this value model isn’t a perfect representation of reality: the #25 RB likely isn’t available to us and may be on an opponent’s bench.) It’s important to understand that VOR is an estimate, and a player’s value may change based on our individual lineup and league.

Going back to our steps, we now split our data upon position, sort each position DataFrame in descending order, and find the cutoff player at each position. Once we find the #25 RB, #13 QB, #25 WR, and #13 TE, we find that player’s fantasy output for the year and append that to our ‘replacement_values’ dictionary.

This dictionary contains our replacement values for each position (an estimate of the number of points we could expect to receive given one of our starting players at that position was out).

We then get our replacement_values dict as a DataFrame and in the right position to merge, calculate a column called PPR_Value, and sort the table by this column in descending order.

replacement_values = dict()

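for position, cutoff in position_cutoff.items():
    # keep only this position, best PPR scorer first
    pos_df = df.loc[df['FantPos'] == position]
    pos_df = pos_df.sort_values(by='PPRFantPt', ascending=False)
    # iloc is zero-based: index `cutoff` is the player ranked cutoff + 1
    # (use `cutoff - 1` to select exactly the #25 RB, #13 QB, and so on)
    replacement_player = pos_df.iloc[cutoff, :]
    # stores standard FantPt; use PPRFantPt for a like-for-like PPR baseline
    replacement_values[replacement_player.FantPos] = replacement_player.FantPt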

We create a dataframe from the dictionary we just created for operational simplicity.

Then we merge the replacement-value dataframe with the player dataframe.

# make a dataframe out of the dictionary above
replacement_values = pd.DataFrame(replacement_values, index=range(0, 1)).transpose().reset_index()

replacement_values.columns = ['FantPos', 'Replacement']

ppr_vor_df = df.merge(replacement_values).assign(PPR_Value=lambda x: x.PPRFantPt - x.Replacement).sort_values(
    by='PPR_Value', ascending=False)
print(ppr_vor_df.head(10).reset_index(drop=True))

Let's check the Result now.

We can see here that Alvin Kamara was the most valuable fantasy player all season. That makes him our fantasy MVP and a likely first-round pick next year. Darren Waller at #7 is as expected too. He has not been as solid as Kamara, but he sometimes surprises us with a 12–15 catch game that could win us the week, given the options available at TE.

Another interesting thing we can do is group by team and find the teams that provided the most fantasy value this season.

ppr_team_vor = ppr_vor_df[:100].groupby('Team', as_index=False)['PPR_Value'].sum().sort_values(by='PPR_Value', ascending=False)[:10]

print(ppr_team_vor.head(10))

For the top two teams by fantasy value, most of the contribution came from their top-valued players:

KAN — Travis Kelce (Ranked number 3 in terms of VOR)

GNB — Davante Adams (Ranked number 2 in terms of VOR)

This analysis is valuable because it measures each player against their likely replacement. It sets aside raw past point totals and reputation and takes a pragmatic view. With injuries so frequent in the game, it’s imperative to see how a player compares to their potential replacements. When choosing team members for fantasy football, these statistics give insight into both the starters and the bench players who could replace them.

Along with individual players ranking based on VOR, we can also use the team-wise ranking, which will help us understand the contribution each player is making to their team’s success/failure and each team’s potential.

College Pedigree

There are many stats and data points to consider when drafting a competitive fantasy football team; you’ve seen the players’ stats, their experience, even the team they play for. But what if you’re looking at players with little or none of these? Or maybe you are torn between two players who otherwise seem even. This may be the point to consider another angle for predicting their effectiveness on the field: their college background.

This can be hard data to work with for people who do not pay attention to college football. Even the most casual fan probably knows who some of the top schools are, but how well does name recognition predict a player's effectiveness on the field?

The Tools

For starters, we’re going to use the following libraries.

The Data

For this analysis, I created a VLOOKUP in excel to take the relevant colleges from the player personnel table and insert them into the statistics table. This way, I did not have to load in two documents and combine dataframes.

The first thing I noticed is that a good amount of college data is missing from our personnel table. So, to factor in only players with colleges, I dropped all rows whose College value was NaN.

Next, we create a new dataframe with only the data we want.

As you can see, this creates a data frame with only those columns from the original (Year, College, FantPt, etc.). Next, because we want to look at colleges AND fantasy output, let’s drop the null values for the ‘FantPt’ and ‘PPR’ columns. Now, since we only want the best players, let’s look at only the top 10 at their position. We hope to gather from this idea what schools put out the most productive and best players at their respective fantasy positions (QB, WR, RB, TE).

It’s also important to deduplicate the players. This way, you don’t skew the data with the one player who comes out of their school and dominates the league.
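The cleaning, top-N filtering, and deduplication steps above might be sketched like this. The rows are made up, `TOP_N` is set to 2 so the toy data shows the effect (the article uses 10), and keeping each player's best season when deduplicating is an assumption.

```python
import pandas as pd

# Made-up rows; None marks a missing college or missing fantasy points.
stats = pd.DataFrame({
    "Year": [2019, 2019, 2019, 2020, 2020, 2020],
    "Player": ["A", "B", "C", "A", "B", "D"],
    "College": ["Alabama", "Stanford", None, "Alabama", "Stanford", "Clemson"],
    "FantPos": ["RB", "RB", "RB", "RB", "RB", "RB"],
    "FantPt": [300.0, 250.0, 240.0, 280.0, None, 200.0],
})

TOP_N = 2  # the article uses the top 10 at each position

# Drop rows with no college or no fantasy points
cleaned = stats.dropna(subset=["College", "FantPt"])

# Keep the TOP_N scorers per year and position
ranked = cleaned.sort_values("FantPt", ascending=False)
top = ranked.groupby(["Year", "FantPos"]).head(TOP_N)

# Count each player once, keeping the best season (assumption)
unique_top = top.drop_duplicates(subset="Player").reset_index(drop=True)
print(unique_top)
```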

From here, I want to create some columns to help me figure this out.

This basically computes, for each college, its total Fantasy Points, its average, and its count of players in the top 10. You’ll see I compute the top 10 player count twice; this is because I chose to scale the data. From here, I dropped all the duplicate colleges from the data frame.
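A compact way to get one row per college with these totals is a named aggregation, sketched below on invented data. This differs from the approach described above (adding columns and then dropping duplicate colleges) but should yield the same per-college figures. The column names are hypothetical, and min-max scaling of the count is only a guess at what “scale the data” means here.

```python
import pandas as pd

# Invented top-player rows after cleaning and deduplication.
top_players = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E"],
    "College": ["Alabama", "Alabama", "Stanford", "Stanford", "Clemson"],
    "FantPt": [300.0, 200.0, 280.0, 220.0, 150.0],
})

# One row per college: total points, average points, number of top players
colleges = top_players.groupby("College", as_index=False).agg(
    TotalFantPt=("FantPt", "sum"),
    AvgFantPt=("FantPt", "mean"),
    TopCount=("Player", "count"),
)

# Min-max scale the count to the 0-1 range (assumed scaling method)
count_min = colleges["TopCount"].min()
count_span = colleges["TopCount"].max() - count_min
colleges["TopCountScaled"] = (colleges["TopCount"] - count_min) / count_span
print(colleges)
```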

I wanted to plot our average fantasy points with the top ten player counts to see what we were working with. Just by looking at this, I believe we can identify two, maybe three clusters.

But to be sure, I employed the elbow method. The elbow method is basically a method to iteratively help us determine the appropriate number of clusters.

I know that there won’t be more than 10 clusters, so we’ll limit our testing range to 10. Inertia always falls as clusters are added; the point where the drop slows and the graph goes relatively flat is our elbow.

According to this, our K appears to be 3, so we will group our colleges into three clusters. From here, we let SKLearn predict which schools go into which clusters, assign the result to our variable “predicted”, and then append a column to the data frame that shows each college’s cluster.
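The elbow search and the final clustering can be sketched with scikit-learn's `KMeans` as follows. The nine colleges and their numbers are invented, the feature columns are hypothetical names, and the min-max scaling is an assumption. The test range stops at 6 because the toy data has only nine rows; the article tests up to 10.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Invented per-college features: average fantasy points and top-10 player count.
colleges = pd.DataFrame({
    "College": [f"College {i}" for i in range(1, 10)],
    "AvgFantPt": [60.0, 65.0, 70.0, 240.0, 250.0, 260.0, 230.0, 245.0, 255.0],
    "TopCount": [1, 2, 1, 1, 2, 1, 9, 10, 11],
})

# Scale both features to 0-1 so neither dominates the distance calculation
features = colleges[["AvgFantPt", "TopCount"]].to_numpy(dtype=float)
scaled = (features - features.min(axis=0)) / (
    features.max(axis=0) - features.min(axis=0)
)

# Elbow method: record the inertia for each candidate k
inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    inertias.append(model.inertia_)
print(inertias)

# Cluster with the chosen k and append the labels to the data frame
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
predicted = kmeans.fit_predict(scaled)
colleges["Cluster"] = predicted
print(colleges)
```

Cluster labels are arbitrary integers, so the mapping from label to plot color has to be checked by eye on each run of a new dataset.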

Now that we have our clusters, we assign each cluster a particular color and plot.

As you can see, we have three clearly defined clusters.

Our red cluster is comprised of the following:

Interpreting the Clusters

Looking at the graph, we can conclude a few things.

1. The green cluster is filled with colleges that either don’t put out too many players or don’t have high-performing fantasy players. This is a cluster we might want to avoid if we’re putting together a quality team.
2. The black cluster has high-performing players, but not a whole lot of them. This means maybe you get the unicorn player from some college you’ve never heard of. This might be worth a glance, but maybe not.
3. When drafting your next team, look closely at players that belong in the red cluster. This cluster is the one that has a good amount of top players AND a high average fantasy output. This is a cluster that seems to routinely output high-performing players, and a lot of them.

3.a. The subsequent dataframe shows that Alabama and Stanford are the two top colleges for high-quality professional output. At the same time, Wisconsin, Clemson, Georgia, and the rest all compete for high marks in this area.

Bonus

You can apply the same process for specific positions as well; for example, let’s look at running backs.

Using the elbow method, we found k = 3. Filtering our initial data frames for Running Backs only, we get the following results:

Curious about wide receivers? This position has a deeper depth chart, meaning more top-quality talent is available. Let’s expand to the top 20 players by year and apply the same method again. Here, the elbow method gave us k = 4. Note: some of the colors got flipped around in this plot; red now represents the top schools.

Since drafting can be tough and players can be scarce, let’s look at how this might also change if we expand to the top 20 at each position. As you can see, Alabama still has the top spot, but Oklahoma climbs significantly.

Conclusion

In conclusion, we have provided numerous analyses of Fantasy Football. This information is useful to users starting a draft league, getting better at drafting players, or making bets on teams. Our analysis takes NFL data and turns it into an asset for anyone who wants to leverage this information.

We have shown analyses for players and teams, covering both history and the latest performance. We have explored correlations, Value over Replacement, and more. As more NFL data comes in, we can update our analysis of previous seasons and provide a tool for the current season. Players have breakout seasons, so it is essential to capture this information. We have also factored in the possibility of injury and how to mitigate that risk when drafting. We believe the above analysis would be a great tool to help Fantasy Football enthusiasts research and draft teams, predict game outcomes, and more.

Limitations

One of the biggest limitations is what we call the X-factors. An X-factor can be a player coming back from a year of injury, or a change in team personnel that allows a player to thrive in a certain offense. Our data will never provide 100% accurate predictions because some players get better, some regress, and some stop playing altogether. We must continue to have different types of analysis for different situations. The weather also affects how a QB plays, but the weather is not predictable. We can only hypothesize, for example, that QBs who live in northern states play better in the cold than QBs from southern states.

Clustering of colleges should only be used as a supplement to your overall analysis. Rookie drafting is usually a risky move regardless of pedigree, but it does help to know what school programs have the best track record of success.

For further study, our analysis could be extended into a recommender system. Such a system would require more data points and some way of handling X-factors, but our reports are a good starting point in that direction.

References

Barrabi, T. (2020, June 4). Who invented fantasy football? Fox Business. https://www.foxbusiness.com/sports/who-invented-fantasy-football.

The Lucrative and Growing Fantasy Football Industry. Sports Management Degree Hub. (2021, April 26). https://www.sportsmanagementdegreehub.com/fantasy-football-industry/.

NFL Developer Portal. (n.d.). https://api.nfl.com/docs/league/models/statistics/index.html.

Porter, J. W. (n.d.). Predictive Analytics for Fantasy Football: Predicting Player Performance Across the NFL. University of New Hampshire Scholars’ Repository. https://scholars.unh.edu/honors/406/.

Pro Football Statistics and History. (n.d.). https://www.pro-football-reference.com/.

Sports Data API Solutions: NFL API , NBA Data, MLB Data API. SportsDataIO. (n.d.). https://www.sportsdata.io/.

Willingham, A. J. (2020, December 5). Fantasy football is a billion-dollar pastime. Covid-19 is wreaking havoc with it. CNN. https://www.cnn.com/2020/12/05/us/fantasy-football-coronavirus-challenges-trnd/index.html.

Kasam, A. (2020, December 28). Fantasy Football Data Analysis with Python. Towards Data Science. https://towardsdatascience.com/fantasy-football-data-analysis-with-python-b3c017d0d3b5.

Nukish Philosophy. (2020). Guide to Setting Up Python for Fantasy Football Analysis. Reddit. https://www.reddit.com/r/fantasyfootball/comments/en8xte/guide_to_setting_up_python_for_fantasy_football/.

Codebasics. (2019). Machine Learning Tutorial Python - 13: K Means Clustering. YouTube. https://www.youtube.com/watch?v=EItlUEPCIzM.

Appendix

Jeremy Langenderfer:
- Assisted with research on the topic of Fantasy Football.
- Assisted with research for datasets available for Fantasy Football.
- For each group project, I provided research/writing to each respective category assigned as part of my responsibilities.
- Project Presentation — compiled each group member's slides for final submission, provided research and narration for the slide topics assigned to me.
- Final Project — provided research and writing for my assigned topic areas, wrote the “Introduction”, “Data Collection”, “Data Analysis” sections.

Wismy Seide:
- Discussed ideas related to Fantasy Football.
- Discussed various topics and ways to utilize the dataset.
- Performed team analysis.
- Performed correlation analysis with usage and points.
- Performed correlation analysis with age.
- Contributed to all reports.
- Assisted with limitations and conclusion.

Nayana Kumari:
- Final Project Medium post:
  - Data Preparation.
  - Data Cleansing.
  - VOR Analysis (Code + Explanation) for players.
  - Added limitations and conclusion and further analysis potentials.
  - Compiled each group member’s work and created the final Medium post.
- Project Presentation and reports — provided research and content for the slide topics I was responsible for.
- Contributed to research on Fantasy Football data and project objectives.

Robert Rose:
- Worked with the team to identify and refine the topic.
- Collected data from sources: Pro Football Reference and SportsDataIO.
- Cleaned initial dataset.
- Combined source data in excel using VLOOKUP to join the data.
- Clustering analysis.