Supervised Learning with Fantasy Football

Elliott Bauer
INST414: Data Science Techniques
4 min read · May 2, 2024

For my Module 6 assignment, I thought it would be interesting to examine Fantasy Football data. My plan was to create a linear regression model that could accurately predict a player’s fantasy points for the following year. To narrow the scope, I decided to focus only on running backs. This position interests me the most because of its fantasy value, as well as my experience playing in high school.

I acquired my dataset from Kaggle. It contains Fantasy Football statistics in Points per Reception (PPR) format, excluding kickers and defense/special teams, from 2017 to 2022. I separated 2017–2021 and 2022 into two data frames, treating the 2022 data as my future data. This way, I could check how accurate my model was by comparing its predictions to the actual statistics from 2022.

Questions this data could answer include “Who are the most under-the-radar fantasy players to look out for in upcoming seasons?” and “Where in the draft should I aim to pick this player?” The stakeholders are people who play Fantasy Football, whether beginners or seasoned veterans. My ground truth labels were developed using an algorithm that predicts the number of fantasy points scored based on the averages of a player’s previous seasons. My data set contained the following fields:

  • Rk: Rank of the player for that fantasy season
  • Player: Player name
  • Tm: Team name, expressed as a three-letter abbreviation
  • FantPos: Position played
  • Age
  • G: Games played
  • GS: Games started
  • Cmp: Completions
  • Att: Attempts
  • Yds: Yards
  • TD: Touchdowns
  • Int: Interceptions thrown
  • RushAtt: Rush attempts
  • RushYds: Rush yards
  • RushTD: Rush touchdowns
  • Tgt: Targets
  • Rec: Receptions
  • RecYds: Receiving yards
  • RecTD: Receiving Touchdowns
  • Fmb: Fumbles
  • FL: Fumbles lost
  • PPR: Points scored
  • PlayerID: Unique identifier for each player
  • PosRk: Rank at their position
  • Year
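A minimal sketch of how the running-back filter and the 2017–2021 / 2022 split could look in pandas. The rows below are made-up placeholders that only reuse the field names from the list above, and the function name `split_running_backs` and its choice of columns are assumptions for illustration, not the exact code behind this post.

```python
import pandas as pd

# Made-up rows that reuse the field names listed above; the values are
# placeholders, not real entries from the Kaggle dataset.
df = pd.DataFrame({
    "PlayerID": ["a1", "a1", "b2", "b2", "c3", "c3"],
    "FantPos":  ["RB", "RB", "RB", "RB", "WR", "WR"],
    "Year":     [2021, 2022, 2020, 2022, 2021, 2022],
    "G":        [16, 17, 14, 12, 17, 16],
    "PPR":      [210.5, 190.0, 95.2, 120.8, 250.1, 230.4],
})

def split_running_backs(frame, holdout_year=2022,
                        columns=("PlayerID", "Year", "G", "PPR")):
    """Keep running backs and the chosen columns, then split off the holdout year."""
    rbs = frame.loc[frame["FantPos"] == "RB", list(columns)]
    history = rbs[rbs["Year"] < holdout_year].reset_index(drop=True)
    holdout = rbs[rbs["Year"] == holdout_year].reset_index(drop=True)
    return history, holdout

history, holdout = split_running_backs(df)

# Average PPR per player over the earlier seasons (one possible input feature).
prior_avg = history.groupby("PlayerID")["PPR"].mean()
```

Here `history` stands in for the 2017–2021 data frame and `holdout` for the 2022 one, while `prior_avg` is one way to express "the averages of a player’s previous seasons."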

This data is relevant to my research question because it provides several seasons of historical fantasy football statistics to learn from. I decided to use a regression model because a line of best fit is a good way to highlight outliers: players who perform better or worse than expected in a given season. My model predicts a numerical value, namely a player’s season total in fantasy points. Below is an image of a basic graph that demonstrates my model. The x-axis shows the actual scores of running backs in the 2022 NFL season, while the y-axis shows the model’s predictions, based on the statistics from 2017 to 2021.
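The predicted-versus-actual comparison behind the graph could be reproduced along these lines. This is only a sketch under stated assumptions: it fits a single-feature line with NumPy's `polyfit`, using a player's average PPR over earlier seasons as the input, and every number in it is made up. The real model may have used different features or a different library.

```python
import numpy as np

# Made-up training pairs: a player's average PPR over earlier seasons (x)
# and the PPR they scored in the season that followed (y).
prior_avg_train = np.array([50.0, 120.0, 200.0, 260.0, 310.0])
next_season_train = np.array([60.0, 110.0, 190.0, 240.0, 300.0])

# Ordinary least-squares line of best fit: y = slope * x + intercept.
slope, intercept = np.polyfit(prior_avg_train, next_season_train, deg=1)

def predict_ppr(prior_avg):
    """Predict next-season PPR from a player's prior-season average."""
    return slope * np.asarray(prior_avg, dtype=float) + intercept

# Made-up holdout season: inputs for three players and what they actually scored.
prior_avg_2022 = np.array([80.0, 150.0, 290.0])
actual_2022 = np.array([160.0, 140.0, 210.0])

predicted_2022 = predict_ppr(prior_avg_2022)

# Mean absolute error: the average gap between prediction and reality, in points.
mae = float(np.mean(np.abs(actual_2022 - predicted_2022)))
```

Plotting `actual_2022` on the x-axis against `predicted_2022` on the y-axis gives the same kind of scatter as the graph described here, and `mae` summarizes how far off the predictions are on average.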

As you can see, several points on this graph do not fall close to the line. I have added numbers above the points to make it easier to identify which predictions are accurate and which are not. Point number 12 is the first outlier that stands out. This player was projected to score around 300 fantasy points but actually finished in the low 200s. On the opposite side, number 14 performed significantly better than projected: his actual PPR total was over 150, while he was projected for around 75. Other outliers include numbers 0, 5, and 22. I also noticed that the model appears to be much more accurate toward the bottom of the graph, where there are far fewer discrepancies. This could be because these players were projected to be backups and therefore did not play as much as a typical running back might.

This brings me to the limitations of my research. The model does not account for injuries, which could explain why so many of these players fell short of their projections. For example, player 0 was projected for around 100 points but appears to have scored close to zero on the season. This could have been the result of an injury or possibly a suspension. Another limitation is that this analysis looks strictly at running backs. If I expanded it to other positions, the results could look different, and I would like to examine how quarterbacks, wide receivers, and tight ends perform with this model.

I cleaned up my data by reducing the table down to only the columns that I was planning on using.
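Outliers like these could also be flagged numerically instead of by eye. The sketch below marks any point whose residual (actual minus predicted) exceeds a cutoff. The 50-point threshold and the function name `flag_outliers` are assumptions, the values for points 0, 12, and 14 are rough approximations of what is described above rather than exact figures, and point 99 is a hypothetical well-predicted player added for contrast.

```python
def flag_outliers(points, threshold=50.0):
    """Return {label: residual} for points where |actual - predicted| > threshold."""
    flagged = {}
    for label, (actual, predicted) in points.items():
        residual = actual - predicted  # negative means the player underperformed
        if abs(residual) > threshold:
            flagged[label] = residual
    return flagged

# label: (actual PPR, predicted PPR). Approximate values, see the note above.
points = {
    0: (5.0, 100.0),     # projected ~100, scored close to zero
    12: (210.0, 300.0),  # projected ~300, finished in the low 200s
    14: (155.0, 75.0),   # projected ~75, scored over 150
    99: (60.0, 55.0),    # hypothetical player close to the line
}

outliers = flag_outliers(points)
```

A negative residual means the player fell short of the projection and a positive one means he beat it, so this separates the point 12 kind of miss from the point 14 kind.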
