Crunching the Numbers: A Data-Driven Approach to Predicting the Super Bowl Winner

Sophie Ryan
7 min read · Jul 17, 2023


How to clean up datasets and use logistic regression in Python to make conclusions about America’s #1 sport

Image source: https://www.cnn.com/2023/01/21/sport/nfl-playoffs-preview-spt-intl/index.html

The Super Bowl, the grand finale of the National Football League (NFL) season, is more than just a game; it’s a cultural phenomenon that brings together sports, entertainment, and community spirit. This annual event, watched by millions worldwide, pits the champions of the NFL’s two conferences against each other in a winner-takes-all showdown for the coveted Vince Lombardi Trophy. The Super Bowl is a spectacle of athleticism, strategy, and resilience, where legends are made and history is written.

Predicting the Super Bowl winner is an intriguing exercise that combines statistical analysis with an understanding of the game’s dynamics. Historical season scores offer valuable insights, revealing patterns in team performance, player form, and tactical strengths. Analysts scrutinize these data, factoring in variables like home-field advantage, injury reports, and even weather conditions. However, the unpredictable nature of the sport means that, despite the most thorough analyses, the Super Bowl often serves up surprises, reminding us that in football, as in life, nothing is guaranteed until the final whistle.

This article’s goal is to provide a simple analysis of the NFL for those of us who might not be football know-it-alls or analytical geniuses. First, I’ll share a few data sources; then, after cleaning the data, we’ll use logistic regression to predict the winner of this year’s Super Bowl.

Data Sources

Here are some potential online resources where you can find NFL game results data for our machine learning project:

1. [Pro Football Reference](https://www.pro-football-reference.com/): This website provides comprehensive football statistics and history.

2. [NFL Red Zone Stats](https://www.fantasypros.com/nfl/red-zone-stats/te.php): This site offers detailed NFL red-zone statistics.

3. [NFL.com API](https://www.nfl.com/apis): The official NFL API provides access to live NFL game data.

We’re going to use data from this link: https://drive.google.com/file/d/1WDUd2hQ0YG7SUvbLaAdjSB4sdMhh9yCS/view?usp=sharing

Data Cleaning

The first step is to clean up and explore the data, in particular, to remove null values and select features we need for our prediction.

#import pandas and load the data
import pandas as pd

df_input = pd.read_csv("/filepath")

A lot of our formatting will be organizing and aggregating columns, so it’s important to know what we’re working with:

#check column values
#check the number of games and teams in the data
#check the number of columns
print(df_input.columns.values)
print(df_input.game_id.unique().size, df_input.vis_team.unique().size)
print(len(df_input.axes[1]))

#find columns that contain "ERROR" values
for column_name in df_input.columns:
    if 'ERROR - abbrev_team' in df_input[column_name].values.astype(str):
        print(column_name)
#build a dataframe of the rows that contain "ERROR" in any team column so we can remove them
df_error = df_input[df_input['team'].str.contains('ERROR - abbrev_team') |
                    df_input['Team_abbrev'].str.contains('ERROR - abbrev_team') |
                    df_input['Opponent_abbrev'].str.contains('ERROR - abbrev_team') |
                    df_input['vis_team'].str.contains('ERROR - abbrev_team') |
                    df_input['home_team'].str.contains('ERROR - abbrev_team') |
                    df_input['Vegas_Favorite'].str.contains('ERROR - abbrev_team')]
#we will remove these lines by checking their unique game id
games_tobe_removed = df_error.game_id.unique()

#actually remove the "ERROR" rows; ~ denotes "not"
#check the new number of unique games (originally 255 unique game_ids)
df_input = df_input[~df_input['game_id'].isin(games_tobe_removed)]
print(df_input.game_id.unique().size)

#check for null values; they only appear in columns we don't use here, so there is no need to drop them
for column_name in df_input.columns:
    if df_input[column_name].isnull().any():
        print(column_name)
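
If null values did show up in columns you planned to use, one way to remove those rows would be pandas’ dropna with a subset. This is just a quick sketch; the column names here are placeholders for whichever features you decide to keep:

#hypothetical example: drop rows with nulls in the columns you actually rely on
df_input = df_input.dropna(subset=['vis_score', 'home_score', 'pass_rating'])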

Now that the data is cleaned, we will organize it, starting with each team’s total score, then the mean score, the mean pass rating, and finally the win rate. (Note that the prediction itself doesn’t strictly need these metrics; they just help organize our dataset. Skip ahead to Predictions if you want to take a different route or are using a different dataset!)

Data Manipulation

#to do this easily, we create a new ID called team_game_id
df_input['team_game_id'] = df_input.apply(lambda x: x['team'] + '_' + x['game_id'], axis=1)
#vis_score and home_score need to be analyzed, duplicates dropped since only one row of the same values is necessary
df_scores = df_input[['team_game_id', 'vis_team', 'home_team', 'vis_score', 'home_score', 'game_date', "pass_rating"]]
df_scores = df_scores.drop_duplicates()
# extract vis_score for vis_team, home_score for home_team
# using a similar method, isolate teams only and create a new column
# then just keep team_game_id, game_date and score
df_scores['score'] = df_scores.apply(lambda x: x['vis_score'] if x['vis_team'] in x['team_game_id'] else x['home_score'], axis=1)
df_scores['team'] = df_scores.apply(lambda x: x['vis_team'] if x['vis_team'] in x['team_game_id'] else x['home_team'], axis=1)
df_scores = df_scores.drop(columns=['vis_team', 'home_team', 'vis_score', 'home_score'])
df_scores.reset_index(inplace=True)
#now we have to aggregate/add up the scores corresponding to each team - sorted highest to lowest
agg_scores = df_scores[['team', 'score']].groupby(by=['team']).sum().sort_values(by=['score'], ascending=False)
agg_scores
#another for mean score
mean_scores = df_scores[['team', 'score']].groupby(by=['team']).mean().sort_values(by=['score'], ascending=False)
#identify players with pass ratings and sort from highest to lowest
#use dataframe with already sorted game + team ids to find mean pass ratings for each team
#a lot of our prediction will rely on how good a team's passes are and cumulative points scored by the team
passes = df_scores[df_scores["pass_rating"] > 0].sort_values(by=["pass_rating"], ascending=False)
passes = passes[['team', 'pass_rating']].groupby(by=['team']).mean().sort_values(by=['pass_rating'], ascending=False)
#one final important column addition before we start interpreting the data: win rate
#the following calculates how many wins or losses a team has
def win_or_loss(row):
    if row['vis_team'] in row['team_game_id']:
        score_delta = int(row['vis_score']) - int(row['home_score'])
    else:
        score_delta = int(row['home_score']) - int(row['vis_score'])
    if score_delta > 0:
        return 1
    # ties and losses both count as 0
    return 0

df_games = df_input[['team_game_id', 'game_id', 'vis_team', 'home_team', 'vis_score', 'home_score']].drop_duplicates()
df_games['win_loss'] = df_games.apply(lambda x: win_or_loss(x), axis=1)
#find win_rate
win_rate = {}
for team in df_games.vis_team.unique():
    df_tmp = df_games[df_games['team_game_id'].str.contains(team)]
    win_rate[team] = (df_tmp['win_loss'].sum() / df_tmp['win_loss'].count()) * 100
winrate = pd.Series(win_rate)
#combine passes, score, win rate data frame
#currently sorted by pass rating
passes["tot_score"] = agg_scores
passes["mean_score"] = mean_scores
passes["win_rate"] = winrate
passes
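
As a quick sanity check, and to get a feel for how these aggregates could feed a model later on, you can rank the combined table by win rate. Here is a small sketch that assumes the `passes` DataFrame built above (the `team_features` name is just for illustration):

#rank teams by win rate to sanity-check the aggregation
print(passes.sort_values(by='win_rate', ascending=False).head(10))
#the aggregated columns could later serve as model features
team_features = passes[['pass_rating', 'tot_score', 'mean_score', 'win_rate']].copy()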

Predictions

Predicting the winner of the Super Bowl using machine learning can be approached as a classification problem, where the goal is to classify which of the two teams will win. There are multiple algorithms that could be suitable for this task:

1. **Logistic Regression**: This is a simple and efficient algorithm for binary classification problems. It can serve as a good starting point.

2. **Decision Trees**: Decision trees can handle both categorical and numerical data, making them a good fit for this problem. They are also easy to interpret, which can be helpful for understanding which features are most important in the prediction.

3. **Random Forest**: This is an ensemble method that uses multiple decision trees to make a prediction. It is generally more robust and accurate than a single decision tree.

4. **Gradient Boosting**: Like random forests, gradient boosting is an ensemble method. It builds multiple weak learners (usually decision trees) in a sequential manner, with each new tree trying to correct the mistakes of the previous ones.

5. **Support Vector Machines (SVM)**: SVMs can be effective in high-dimensional spaces and are versatile because different kernel functions can be specified for the decision function.

6. **Neural Networks**: If you have a large amount of data and computational resources, a neural network could potentially achieve high accuracy. However, they can be more complex to set up and interpret.

The best choice depends on the specific characteristics of your data. In this article, for demonstration purposes, we will use the simplest approach: logistic regression.
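
If you do want to compare a few of the candidates above before committing, the sketch below uses scikit-learn’s cross-validation on a synthetic dataset; swap in your own feature matrix and labels in place of the generated ones:

#compare a few candidate classifiers with 5-fold cross-validation
#make_classification generates stand-in data; replace X_demo, y_demo with your own features and labels
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=42)
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}
for name, clf in models.items():
    scores = cross_val_score(clf, X_demo, y_demo, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")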

Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable — in this case, the Super Bowl winner. It’s a go-to method for binary classification problems (problems with two class values).

In the context of predicting the Super Bowl winner, we could use logistic regression to estimate the probability that a given team will win the Super Bowl based on various features derived from the game results of the current season.
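
Under the hood, the model computes a weighted sum of the input features and passes it through the logistic (sigmoid) function to produce a probability between 0 and 1. Here is a tiny illustration; the weights and feature values below are made up purely for demonstration:

import numpy as np

def sigmoid(z):
    #the logistic function squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

#hypothetical learned weights for two features plus an intercept
intercept, w_points, w_yards = -3.0, 0.08, 0.004
avg_points, total_yards = 28.5, 380.0
win_probability = sigmoid(intercept + w_points * avg_points + w_yards * total_yards)
print(f"Estimated win probability: {win_probability:.2f}")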

Here is an outline of the steps for logistic regression:

1. **Feature Selection**: You first decide on the features (also known as variables or attributes) that you will include in your model. These might be things like a team’s average points per game, total yards gained, total yards allowed, turnover margin, and so on.

2. **Model Training**: You then use the logistic regression algorithm to train your model using your selected features. The algorithm will use the historical data to learn how each feature affects the probability of a team winning the Super Bowl.

3. **Prediction**: Once the model is trained, you can use it to predict the probability of a team winning the Super Bowl. For example, if you input the features for Team A and Team B, the model might output a 70% probability of Team A winning and a 30% probability of Team B winning.

4. **Interpretation**: The output of logistic regression is a probability that the given input point belongs to a certain class. In this case, the two classes are “will win the Super Bowl” and “will not win the Super Bowl”. The resulting probabilities are then mapped to a discrete class based on a threshold, commonly 0.5. So, if the predicted probability is greater than 0.5, the model predicts that the team will win the Super Bowl.

Remember, logistic regression provides a probability result, so the output should be interpreted as such. It’s a way to understand the factors that contribute to a team’s chances of winning, rather than a definitive prediction of the outcome.

Below is a basic example of how you might set up a logistic regression model in Python using the `sklearn` library. Please note that this is a simplified example and doesn’t include all the steps you might need in a real-world scenario (like data cleaning, feature engineering, and model validation). The cleaned data above should let you plug in your own features and play around with the model.

#import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Let’s say your DataFrame includes columns for each team’s average points per game (`avg_points`), total yards gained (`total_yards`), and a binary column `superbowl_winner` indicating whether or not the team won the Super Bowl that year (1 for yes, 0 for no). You’d split this data into a feature matrix `X` and a target vector `y`:

#note that "data" is a template DataFrame
X = data[['avg_points', 'total_yards']]
y = data['superbowl_winner']
#split the data into a training set and a test set:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
#train model
model = LogisticRegression()
model.fit(X_train, y_train)
#make predictions on test set and calculate the accuracy of the model
predictions = model.predict(X_test)
print("Model accuracy: ", accuracy_score(y_test, predictions))

This will give you a basic logistic regression model for predicting Super Bowl winners based on average points per game and total yards gained. In a real-world scenario, you’d likely want to use more features and spend more time on each step of this process, especially on feature engineering and model validation.
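
To tie this back to the interpretation step above, you can also inspect the predicted probabilities and learned coefficients directly rather than just the hard 0/1 labels. A small follow-up, assuming the `model`, `X`, and `X_test` from the snippet above:

#probability estimates for each class: column 0 = "will not win", column 1 = "will win"
probabilities = model.predict_proba(X_test)
print(probabilities[:5])
#applying the default 0.5 threshold by hand mirrors the class predictions
manual_predictions = (probabilities[:, 1] > 0.5).astype(int)
print(manual_predictions[:5])
#the learned coefficients hint at how each feature nudges the win probability
print(dict(zip(X.columns, model.coef_[0])))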

Thanks for reading!!


Sophie Ryan

Figure skater, data science enthusiast, and traveler