Predict Super Bowl 2023 results using Random Forest

Sem Leontev
5 min readFeb 10, 2023

--

The NFL Super Bowl is only a few days away and I am actually feeling excited about it this year. As a curios person I cannot wait for a few more days to see the winner so I thought it’d be interesting to try to predict the winner beforehand. I used Python and RandomForest Classifier to predict the winner, and then use Regressor to make a guess on the score. Here is my way to predict the Super Bowl outcome.

the Data

First things come first. So I’ve started with the data. After some research and brief statistics overview I realized, it makes not much sense to look at the data other than the last season. Too many variables are changing over time. New players, new transfers, new strategies and injures. All this make game so unpredictable and interesting to watch. So, I used only 2022 season’s statistics to predict the Super Bowl outcome. Of course it left me with only 17 games for every team. Not much of a data if not say worse. But let’s see how far we can go with it.

The data I used includes every single game of the 2022 season. So, in total there are 532 rows of valuable information. I’ve used the next metrics for the analysis.

I’ve looked at 32 NFL teams over the last season and take core so here is the scatterplots for the team metrics for every game of the season.

The scatterplots show the correlation between any pair of metrics within the data frame.

Briefly looking at the data I cannot notice any linear correlations between team’s performance and scores they expected to gain during the game. I would assume that rushing yards are more significant, than passing yards and number of turnovers tend to correlate with a passing game.

The next thing to look at before moving with the prediction is the variance of the core metric. What is the score range I need to shoot for?

The total score is is distributed around 44 pts per game

Distribution of the total pts per game

And the Score for winners and loser is around 28 and 17 pts respectively

Distribution of the pts for winners and losers
The histogram of the scores gained for winners and losers

Target Teams

As I’m lucky enough to see the Super Bowl starts in a couple of days, I already know, which teams are going to compete for the trophy. Having such a great help, the prediction should be as easy as choosing between A and B. So, let’s look closer on the teams.

Kansas City Chiefs, who won the Super Bowl twice in their history. The last time in 2019. Not so far time ago.

Philadelphia Eagles just won once in their history, which was in 2018. Ok, looks like this won’t bring us close to choose the game favorite. Let’s look at the teams’ performance during the past season.

It seems like both teams are similarly strong. The only thing I can notice, is that Phila tends to gain more Rushing Yard per game, as Chiefs are better in passing. Ok, let’s see how far we can get with this data.

RandomForest Classifier to predict the winner

At first Let’s split data into Train and test datasets. That means even less data to train the model, but at least we are able to verify if we can predict something or not.

Split the data into train and test datasets

When choosing predictors to build my model I have to ignore such factors as Home/Away game and Game number in the season as it’s the same for both teams.

Based on the data RF can make a prediction with 73.6% accuracy rate
The precision score is 0.731

Here are we go! The model we’ve built can predict the winner in 106 out of 145 games amd the loser in 98 out of 132 games. Which gives us the precision score of 0.731. Absolutely not bad for such a limited data.

And here is the moment of Truth. Who will win?

Eagles!

Ta-dam. Chiefs can loose the game with the , which means a win for Eagles with 73.6% accuracy.

RandomForest Regressor to predict the score

Moving forward we may want to know the potential outcome of the game. And here we‘ll use RandomForest Regressor to build the model for nominal data.

Ok, seems like predicting the score is a more complicated task even for advance ML models. Here we’ve got only 62.16% accuracy rate. Ok, let’s see where it’d bring us.

Using the core metrics for every team we’ve got the prediction of..

RandomForest Regressor to predict the score

24–31. Eagles wins!

Sumary

There we have it! We now have a forecast that takes into consideration seasonal stats, teams performance and predict the winner of the Super Bowl game. Of course, the probability of the final score is pretty much less than 100% and I cannot say I’d bet for it. But this outcome seems to more likely to happen than any other. This model can be developed further, adding more factors into analysis, like: Quarterback’s stats, Injures report, Weather conditions and so on. So, later on we will try to bring on a more accurate prediction.

Links

--

--

Sem Leontev

Data Analyst | ML, AI, Python, R, SQL, Statistics, SPSS, GCP, AWS