Trying to Predict NHL Game Outcomes with ML and Why It’s Difficult

Published in CodeX · 7 min read · Jun 6, 2022

Here I collect data from 17 NHL seasons beginning with the 2005–2006 season and ending with the 2021–2022 season. I extract the data directly from nhl.com using web scraping techniques (although there are a ton of better sites to get game data from). I clean the data and extract features from each team’s individual performance. I calculate a rolling average/summary of multiple statistics for a few different timeframes and compare those to their opponent’s same stats for each game.

As a baseline for my results, the home team won 8249 out of 16843 games during the time period my data is from. So if someone were to predict that the home team wins every game in the dataset, they would have an accuracy of 48.98%. My best model reaches an accuracy of 57.39%.
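The baseline arithmetic can be checked in a few lines. This is only a sketch using the two counts quoted above. Because the home team won slightly fewer than half of the games, the opposite constant guess (always predicting a home loss) would score a little higher, so it is arguably the fairer naive baseline to compare a model against.

```python
# Counts taken from the article text.
home_wins = 8249
total_games = 16843

# Accuracy of always predicting a home win.
always_home = home_wins / total_games

# Accuracy of always predicting a home loss (the complementary guess).
always_away = (total_games - home_wins) / total_games

# The stronger naive baseline is whichever constant guess is right more often.
majority_baseline = max(always_home, always_away)

print(f"always home: {always_home:.2%}")
print(f"always away: {always_away:.2%}")
print(f"majority baseline: {majority_baseline:.2%}")
```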

Previous Findings

In the literature, an accuracy of 61.54% was achieved by Gianni Pischedda from soccerlogic.com using ClusteR, a software package developed for sports analysis. Those results can be read about here.

In a similar project, GitHub user kn-kn was able to predict NHL game outcomes with anywhere from 53% to 58% accuracy using decision trees and random forests with varying combinations of data attributes.

In my analysis I use three common ML models: logistic regression, random forest classification, and linear SVM. The best results thus far have been achieved using logistic regression, with an accuracy of 57.39%.

Cleaning Data and Making Features

Below is the DataFrame I was working with after reading the data into Python and doing an initial bit of data cleaning:

In the DataFrame, each row represents one team's performance in one game, so, as you can see by inspection, each game appears as two separate observations: one for each team. The only way to identify who was the home team and who was the away team was by using the ‘identifier’ column. If the identifier is ‘vs’ then the team in the ‘team’ column is the home team. If ‘@’ then the team in the ‘team’ column is the visiting team. In order to get the data into the shape I wanted, I decided to split the DataFrame up by individual teams, which required that I make 33 separate DataFrames to calculate statistics on. Though inefficient, it was mostly a matter of copy-pasting code I had already written a bunch of times. In retrospect I might have written a function to avoid the copy-pasting, but I had no idea the extent of what I was doing at the time.
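A minimal sketch of what such a helper function might look like, using pandas. The ‘team’ and ‘identifier’ column names come from the description above; the other columns, the sample rows, and the `split_by_team` name are made up for illustration. It assumes ‘vs’ marks a home game and ‘@’ marks an away game.

```python
import pandas as pd

# Tiny made-up sample. Only 'team' and 'identifier' are named in the article;
# 'date', 'opponent', and 'goals' are assumed columns for illustration.
games = pd.DataFrame({
    "date": pd.to_datetime(["2021-10-12", "2021-10-12", "2021-10-14", "2021-10-14"]),
    "team": ["TBL", "PIT", "PIT", "FLA"],
    "identifier": ["vs", "@", "@", "vs"],
    "opponent": ["PIT", "TBL", "FLA", "PIT"],
    "goals": [2, 6, 4, 5],
})

# Assumption: 'vs' marks the home team, '@' marks the visiting team.
games["is_home"] = games["identifier"].eq("vs")


def split_by_team(df):
    """Return a dict mapping each team name to its own date-sorted DataFrame."""
    return {
        team: team_df.sort_values("date").reset_index(drop=True)
        for team, team_df in df.groupby("team")
    }


team_frames = split_by_team(games)
print(sorted(team_frames))
```

One `groupby` call replaces the per-team copies, and the resulting dictionary can be looped over when computing per-team statistics.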

For each team’s DataFrame, I calculated various rolling statistics based on their performance in past games. These statistics would be the basis of what I train my machine learning algorithms on, as it would be ludicrous to predict the outcome of each game based on that game’s own statistics. My models would have 100% accuracy in that case (ha). Dry humor aside, I calculated rolling averages for the following statistics:

points
goals
goals against
power play %
penalty kill %
shots on goal
shots against
faceoff win %
save %

I calculate the average of all of these statistics over three different periods of time: a short window, medium window, and a long window of time. For the machine learning models I decided to play around with how long to make each window of time. Below are the different combinations I use for my models:

After having done this for each team, each DataFrame was concatenated into one larger DataFrame.
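The rolling-average step could be sketched roughly as follows. This is not the original code: the `add_rolling_means` helper, the column names, and the use of `shift(1)` to keep the current game out of its own average are assumptions, as is `min_periods=1` for the first few games of a series.

```python
import pandas as pd

WINDOWS = (15, 25, 40)  # one of the window combinations mentioned in the article


def add_rolling_means(team_df, stat_cols, windows=WINDOWS):
    """Add rolling means over past games only, for one team's games."""
    out = team_df.sort_values("date").copy()
    for col in stat_cols:
        # shift(1) drops the current game so each row only sees earlier games.
        past = out[col].shift(1)
        for w in windows:
            out[f"{col}_avg_{w}"] = past.rolling(window=w, min_periods=1).mean()
    return out


# Hypothetical usage with a tiny window so the effect is easy to see.
demo = pd.DataFrame({
    "date": pd.to_datetime(["2021-10-01", "2021-10-03", "2021-10-05", "2021-10-07"]),
    "team": ["PIT"] * 4,
    "goals": [1, 3, 5, 7],
})
rolled = add_rolling_means(demo, ["goals"], windows=(2,))

# After processing every team the same way, the frames can be stacked again.
combined = pd.concat([rolled], ignore_index=True)
print(combined[["date", "goals", "goals_avg_2"]])
```

The first game of a team's series has no history, so its rolling value is missing; rows like that would need to be dropped or filled before training.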

But at this point you might be saying something along the lines of “Colton, after doing this each game still has two rows that represent it. Why are you torturing us like this?!”

Why yes, that is quite a good point.

To fix this, I split the DataFrame up again into home team and away team DataFrames. These two are then joined together based on the date and opponent. Finally, we have a DataFrame that

a) has the data we want

and

b) has one row representing each game
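The home/away join described above might look something like this in pandas. The data and column names are invented for the example; the idea is that a home row pairs with the away row played on the same date whose team is the home row's opponent.

```python
import pandas as pd

# Made-up long-format data: two rows per game, one per team.
long_df = pd.DataFrame({
    "date": pd.to_datetime(["2021-10-12", "2021-10-12", "2021-10-14", "2021-10-14"]),
    "team": ["TBL", "PIT", "FLA", "PIT"],
    "opponent": ["PIT", "TBL", "PIT", "FLA"],
    "identifier": ["vs", "@", "vs", "@"],
    "goals_avg": [3.1, 2.8, 3.4, 2.9],  # stand-in for a rolling statistic
})

# Assumption: 'vs' rows are home games and '@' rows are away games.
home = long_df[long_df["identifier"] == "vs"].drop(columns="identifier")
away = long_df[long_df["identifier"] == "@"].drop(columns="identifier")

# Match each home row with the away row for the same date and matchup.
games = home.merge(
    away,
    left_on=["date", "opponent"],
    right_on=["date", "team"],
    suffixes=("_home", "_away"),
    validate="one_to_one",  # raise if a game would match more than once
)
print(games[["date", "team_home", "team_away", "goals_avg_home", "goals_avg_away"]])
```

The `validate="one_to_one"` argument is a cheap safety net: if a game ever ends up duplicated, the merge fails loudly instead of silently multiplying rows.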

Before applying ML algorithms to the data, I framed the project as predicting whether or not the home team won each game. To do this, I needed features comparing each rolling statistic of the home team to that of the away team. After standardizing the values and finding the differences between home and away team statistics, I was set up to start throwing algorithms around.
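One way to build such a comparison feature is sketched below. The details are assumptions rather than the original method: each statistic is z-scored using the mean and standard deviation of the home and away values pooled together, and the feature is the home z-score minus the away z-score. The `standardized_difference` helper and the column names are hypothetical.

```python
import pandas as pd


def standardized_difference(games, stat):
    """Z-score a stat over home and away values together, then return home minus away."""
    pooled = pd.concat([games[f"{stat}_home"], games[f"{stat}_away"]])
    mean, std = pooled.mean(), pooled.std(ddof=0)
    home_z = (games[f"{stat}_home"] - mean) / std
    away_z = (games[f"{stat}_away"] - mean) / std
    return home_z - away_z


# Made-up one-row-per-game data with a hypothetical target column.
games = pd.DataFrame({
    "goals_avg_home": [3.0, 2.0, 4.0],
    "goals_avg_away": [2.0, 3.0, 4.0],
    "home_win": [1, 0, 1],
})
games["goals_avg_diff"] = standardized_difference(games, "goals_avg")
print(games)
```

A positive value means the home team has been stronger on that statistic recently, a negative value means the away team has, and zero means they are level.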

Applying Different ML Algorithms

Before applying different algorithms I initially set aside 30% of the data to be used as test data. The other 70% would be data used to train our models.

I was able to randomly select games to use as test data because the time-series information was already accounted for in each row of the DataFrame: every row’s features are rolling statistics computed from earlier games, so the rows can be shuffled without losing that context.
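A random 70/30 split like this is commonly done with scikit-learn's `train_test_split`. The sketch below uses random stand-in arrays in place of the real features and labels, and the `random_state` value is an arbitrary choice for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))     # stand-in for the home-minus-away features
y = rng.integers(0, 2, size=200)  # stand-in for home win (1) / home loss (0)

# Hold out 30% of the games, chosen at random, as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, shuffle=True
)
print(X_train.shape, X_test.shape)
```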

Logistic Regression

Logistic regression is a good starting algorithm when deciding between two things. In this case we are going to be predicting whether the home team wins (1) or loses (0).

The model was trained on the training set, and the results from evaluating it on the test set are below.

A sample confusion matrix is shown below along with the accuracy, precision, and recall for the statistics that were calculated off of the varying window time lengths discussed above.

From looking at the confusion matrix, the logistic regression algorithm is more effective at predicting home losses (~60% accuracy, by eyeball) than home victories (~53% accuracy). Judging by precision, the (15, 25, 40) time windows work best for the rolling statistics in our dataset. However, the (10, 18, 30) time windows have the best overall performance across the metrics.
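For reference, here is a rough sketch of fitting a logistic regression and reading those numbers off a confusion matrix with scikit-learn. The data is synthetic (one weakly informative feature plus noise), so the scores it prints say nothing about the NHL results; the point is how the per-class figures relate to the matrix.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the NHL dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + rng.normal(scale=2.0, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are the true class, columns the predicted class (0 = loss, 1 = win).
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)

# Share of each true class that the model got right.
loss_recall = tn / (tn + fp)  # true home losses predicted as losses
win_recall = tp / (tp + fn)   # true home wins predicted as wins

print(cm)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"home-loss recall={loss_recall:.3f} home-win recall={win_recall:.3f}")
```

Comparing `loss_recall` with `win_recall` is the numeric version of eyeballing which row of the confusion matrix the model handles better.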

What’s great about the logistic regression model is that it’s simple and easy to train, unlike more complex ML models. Let’s take a look at a couple others.

Random Forest Classifier

Shown below are the results of a random forest classifier trained on the training set. The max tree depth was set to 5 and the number of estimators was set to 200. For better results using random forests, I might consider using a grid search to look for optimal parameters in the future. But for now, I am not too worried. Future improvement will be more dependent on the quality of data, in my opinion.

The confusion matrix for the random forest classifier has a different look than the one generated by logistic regression. Again, our model performs fairly well when predicting home losses. However we only have slightly above 50% accuracy in predicting home team victories. This continues the trend of models performing worse at predicting home wins than home losses.

As with the previous model, this one has the highest precision when using the rolling statistics based on the (15, 25, 40) time windows.
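A random forest with the settings mentioned above (200 trees, maximum depth 5) could be set up as in this sketch. It uses synthetic stand-in data again, and the `random_state` values are arbitrary choices for reproducibility.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the NHL dataset.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + rng.normal(scale=2.0, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Hyperparameters as stated in the article: 200 estimators, max depth 5.
forest = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
forest.fit(X_train, y_train)
forest_pred = forest.predict(X_test)

forest_cm = confusion_matrix(y_test, forest_pred, labels=[0, 1])
forest_accuracy = accuracy_score(y_test, forest_pred)
print(forest_cm)
print(f"accuracy={forest_accuracy:.3f}")
```

If a parameter search is added later, scikit-learn's `GridSearchCV` can wrap the same estimator and try a grid of `max_depth` and `n_estimators` values with cross-validation.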

Linear SVM

The final algorithm I decided to test out was a Linear SVM classifier. Shown below are the results from the test set.

Funnily enough, we get exactly the same confusion matrix plot as we did for the logistic regression. This model performs well on the true home losses and poorly on the true home wins. The highest precision was achieved using the (20, 40, 60) time windows, just barely edging out the (15, 25, 40) windows that won for the other two algorithms.
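Logistic regression and a linear SVM both fit a linear decision boundary, so it is not too surprising when their predictions largely coincide. The sketch below, on synthetic stand-in data, trains both with scikit-learn and measures how often they agree on the test set; the `max_iter` values are arbitrary choices for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in data, not the NHL dataset.
rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + rng.normal(scale=2.0, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

svm = LinearSVC(max_iter=10000)
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_test)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
logreg_pred = logreg.predict(X_test)

# Fraction of test games on which the two linear models make the same call.
agreement = float(np.mean(svm_pred == logreg_pred))
print(f"agreement between LinearSVC and LogisticRegression: {agreement:.1%}")
```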

Conclusions

I tested out some basic ML algorithms on nearly 17,000 NHL games played between 2005 and 2022. I achieved a peak accuracy of 57.39% using a logistic regression model with a rolling statistic timeframe of 15 games, 25 games, and 40 games to compare stats between two opposing teams.

The models tend to perform better at predicting true home losses than true home wins. Further exploration of the data and the feature engineering will be needed to figure out exactly why this is the case.

Overall, I’ve found that predicting outcomes between NHL teams is fairly difficult. Hockey is a traditionally low-scoring game, and a team’s performance cannot be predicted solely from its opponents and statistics from previous games. These models fail to take into account things like injuries, player personal histories, suspensions, team ranks, and many other factors. In short, making predictions for NHL games is hard.

Everything I use here is available at the GitHub repository here.

Originally published at https://cbarger.com on June 6, 2022.