Predicting NFL Players’ Fantasy Performance

Eli Dross
INST414: Data Science Techniques
5 min read · Apr 27, 2024

Introduction:

In data science, regression is a fundamental technique for predicting numeric outcomes, with applications across many domains. In this post, I build a linear regression model to predict NFL player fantasy performance. Using the methods from module six, I aim to answer a specific question, provide insights, and support informed decisions based on the model’s results.

Identifying Questions, Stakeholders and Decision:

The question I set out to answer is: How can we predict the total fantasy points scored by NFL players based on their position, team, and average performance across the season?

For this question, the stakeholders are fantasy football enthusiasts, particularly those who participate in fantasy football leagues or manage fantasy football teams.

The answer to this question will be able to inform fantasy football managers’ decisions when drafting players for their teams, making trades, or setting weekly lineups. It will help them identify undervalued players and optimize their team’s performance.

Description of the Data:

The dataset contains NFL players’ weekly fantasy points, their positions, their teams, their average performance over the season, and the total points each player accumulated over the season. The ground-truth labels, i.e., the total fantasy points, are based on actual performance in NFL games, aggregated over multiple weeks. The data comes from Kaggle, specifically the NFL ADP and Fantasy Points dataset page, which covers the 2020, 2021 and 2022 NFL seasons. For this assignment, just like assignments one and three, I focus on the 2022 data, since that was the season I had a losing record in all three of my fantasy leagues. This data is relevant because it supports assessing player performance, identifying trends, and making data-driven decisions in fantasy football management.

Data Collection:

The data was obtained from a Kaggle dataset named “NFL ADP and Fantasy Pts (Fantasy Pros) 2020–2022.” The specific CSV file used is “weekly_data_points.csv” located within the “ppr/2022” folder. I used Python libraries such as pandas for data loading and preprocessing.

Classification vs Regression Model Selection:

I have decided to use a regression model because the target variable, total fantasy points (TTL), is a continuous numeric variable representing the sum of fantasy points scored by a player over multiple weeks. A regression model makes more sense for numerical data and a classification model makes more sense for categorical data.

Features in the model:

The features I used for the supervised model are:

  • Position number (pos_num)
  • Team number (team_num)
  • Average performance (AVG)
  • Performance in Week 1 (Week 1)
  • Performance in Week 18 (Week 18)
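To make the setup concrete, here is a minimal sketch of ordinary least squares over these five features, written with only the Python standard library. It is not the code from the repository, which may well rely on a library such as scikit-learn. The helper names (`solve_linear_system`, `fit_ols`, `predict`), the toy rows, and the coefficients used to generate the toy targets are all invented for illustration.

```python
# Minimal ordinary-least-squares sketch (standard library only).
# Assumed feature order: pos_num, team_num, AVG, Week 1, Week 18.
# Every number below is an invented toy value, not a row from the Kaggle data.

def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so row operations apply to both sides.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Use the row with the largest pivot to keep the arithmetic stable.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(rows, targets):
    """Return [intercept, coef_1, ..., coef_k] minimizing squared error."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept
    k = len(design[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict(beta, row):
    """Apply the fitted intercept and coefficients to one feature row."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))

# Toy feature rows: (pos_num, team_num, AVG, Week 1, Week 18)
rows = [
    (1, 3, 20.1, 25.0, 18.2),
    (2, 7, 15.4, 9.8, 21.0),
    (3, 12, 12.9, 14.1, 0.0),
    (4, 5, 8.3, 3.2, 11.5),
    (1, 20, 18.7, 30.4, 16.9),
    (2, 15, 10.2, 12.0, 7.7),
    (3, 9, 14.8, 6.5, 19.3),
    (4, 28, 6.1, 8.8, 2.4),
]

# Toy targets built from a made-up linear rule so the fit can be checked.
true_beta = [3.0, 0.5, -0.2, 17.0, 0.1, 0.3]
targets = [predict(true_beta, r) for r in rows]

beta = fit_ols(rows, targets)
print([round(b, 3) for b in beta])
```

Because the toy targets follow an exact linear rule, the fitted coefficients recover that rule; real season totals will not line up this cleanly, which is where the error metrics come in.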

I analyzed the NFL fantasy data with a linear regression model that predicts each player’s total fantasy points (TTL) for the season. The model was trained on the features listed above: position, team, average performance, and Week 1 and Week 18 performance. Here’s a breakdown of the analysis and its implications:

Analysis:

  1. Model Performance Metrics:
  • Mean Squared Error (MSE): 952.56
  • Mean Absolute Error (MAE): 22.35

These metrics summarize how accurately the model predicts total fantasy points. The MSE is the average squared difference between predicted and actual values, so large misses are penalized more heavily than small ones. The MAE is the average absolute difference, which is measured in fantasy points and is therefore easier to interpret.

  2. Interpretation of Errors:

  • The MSE of 952.56 is in squared points; taking its square root gives a root mean squared error (RMSE) of about 30.9 points. That is noticeably higher than the MAE, which is consistent with a few large misses pulling the squared error up.
  • The MAE of 22.35 means that, on average, the model’s predictions are off by about 22.35 points over the season. Most predictions land within a reasonable range of the actual totals, but the error is still moderate.

  3. Model Utility and Insights:

  • Despite the errors indicated by MSE and MAE, the model provides valuable insights into predicting total fantasy points based on player features. It enables stakeholders, such as fantasy football enthusiasts and team managers, to make informed decisions when drafting players, setting lineups, or making trades.
  • By leveraging features such as player position, team affiliation, and historical performance, the model assists stakeholders in optimizing their fantasy football strategies and maximizing their team’s performance.
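For readers who want to see how the two error metrics above are computed, here is a small sketch using only the standard library. The functions are hand-written for illustration (scikit-learn offers equivalents in `sklearn.metrics`), and the `actual` and `predicted` lists are invented numbers, not output from my model.

```python
import math

def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Invented season totals for four hypothetical players.
actual = [250.0, 180.0, 95.0, 310.0]
predicted = [240.0, 200.0, 100.0, 285.0]

mse = mean_squared_error(actual, predicted)   # 287.5
mae = mean_absolute_error(actual, predicted)  # 15.0
rmse = math.sqrt(mse)                         # same units as MAE (points)
print(mse, mae, round(rmse, 2))

# The same square-root conversion applied to the MSE reported in this post.
reported_rmse = math.sqrt(952.56)
print(round(reported_rmse, 1))
```

Converting MSE to RMSE puts it on the same scale as MAE, which makes the two numbers directly comparable.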

Plot:

After finishing the code that ran the linear regression, I decided to create a plot to help visualize the strength of the model. Here is the code I wrote to create the plot:

Here is the plot:

The dotted line represents the actual total fantasy points, while each dot represents the model’s prediction for one player. Looking at the plot, most of the predictions fall close to the actual points, though a few outliers exist.

Data Cleaning:

To make the data usable for linear regression, I replaced every “-” symbol and every “BYE” week entry in the dataset with 0, and I filled all empty cells with 0 as well. I also created numerical values for the teams and positions so that they could be used as features in the regression. Finally, I shortened the data to include only the first 400 rows.
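The cleaning steps above can be sketched roughly as follows. This is a standard-library illustration rather than the pandas code I actually ran, and the column names (`Player`, `Pos`, `Team`, the week numbers, `AVG`, `TTL`) and the three sample rows are assumptions made up for the example, not copied from weekly_data_points.csv.

```python
import csv
import io

# Invented sample in an ASSUMED layout; real column names may differ.
SAMPLE = """Player,Pos,Team,1,2,3,AVG,TTL
Player A,QB,KC,25.3,BYE,30.1,27.7,55.4
Player B,RB,SF,-,12.4,,12.4,12.4
Player C,WR,KC,8.0,9.5,-,8.75,17.5
"""

def clean_value(raw):
    """Treat '-', 'BYE' and empty cells as 0, otherwise parse a float."""
    text = (raw or "").strip()
    if text in {"", "-", "BYE"}:
        return 0.0
    return float(text)

def encode(values):
    """Map each distinct label to an integer in order of first appearance."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return codes

# Read the rows and keep only the first 400, as described above.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))[:400]

# Numeric stand-ins for the categorical columns.
pos_codes = encode(r["Pos"] for r in rows)
team_codes = encode(r["Team"] for r in rows)

numeric_cols = [c for c in rows[0] if c not in ("Player", "Pos", "Team")]
cleaned = []
for r in rows:
    record = {c: clean_value(r[c]) for c in numeric_cols}
    record["pos_num"] = pos_codes[r["Pos"]]
    record["team_num"] = team_codes[r["Team"]]
    cleaned.append(record)

print(cleaned[1])
```

One thing to keep in mind with this approach: integer codes give teams and positions an arbitrary order that a linear model will treat as meaningful, so one-hot encoding is a common alternative.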

Limitations and Biases:

Despite the insights gained from predicting total points with the weekly points data, there are a few limitations to acknowledge. First, the dataset spans only three seasons, which limits how far the findings can be generalized across longer time frames, and in this analysis I used only one of those three seasons. Next, external factors such as injuries, team dynamics, and coaching strategies are not captured by the statistics in the data. Finally, while the dataset provides a comprehensive overview, it lacks certain player attributes and game context, which limits the depth of the analysis.

Conclusion:

In conclusion, the analysis of the NFL fantasy dataset using a linear regression model provides valuable insights into predicting total fantasy points based on player features. The model does have a few outliers, but overall it predicts most players very close to the actual number of total points. Despite some level of prediction errors, the model offers actionable insights for fantasy football stakeholders, empowering them to make informed decisions and optimize their team’s performance.

GitHub Repository:

Here is the link to the GitHub repository containing the code used for this assignment: https://github.com/DrossTheBoss/INST414-Module6
