
Identifying the Best Predictors of NHL Game Outcomes Using Random Forest

Dozens of different metrics are discussed by NHL statisticians and fans to better understand the game. Here, we develop a Random Forest model in R to pinpoint which stats are the most useful for predicting game outcomes.

Christian Lee · Mar 24, 2021 · 8 min read

Outline

  1. Defining the question
  2. Data scraping
  3. Data cleaning and preparation
  4. Random Forest implementation
  5. Assessing the model
  6. Conclusion
  7. Code and references

What are the best predictors of winning or losing an NHL game?

nhl.com/stats contains upwards of 20 different statistical categories, each containing ~20 columns and thousands of rows. While there is some overlap between categories, there remains a surplus of variables, certainly too many to analyze one by one. We also do not want to consider variables in isolation because metrics capture different and/or complementary information.

Here, we will implement Random Forest to identify the most important features (metrics like shots on goal, faceoff wins, etc.) and learn a model that we can use to make predictions on new data. We will not be delving into the algorithm in great detail here, but this article is a good starting point. For now, just know that Random Forest builds hundreds of decision trees based on samples of features and observations, and then uses the average or majority vote to make final predictions. To help with the intuition, below is a single decision tree:

rpart.plot decision tree

X5v5.S..Sv is the shooting + save %, Net.PP is the net power play % and PK is the penalty kill %. The first split is based on X5v5.S..Sv, such that a team with a value < 98 has a 0.16 win probability, accounting for 41% of the total samples. On the other extreme, if a team has a value ≥ 102, there is a 0.85 win probability with 40% coverage. A team with an X5v5.S..Sv between 98 and 102 and a Net.PP < 16 has a 0.30 win probability and 10% coverage.
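For reference, a tree like the one above can be fit with the rpart package. Below is a minimal sketch, assuming a data frame dat whose win factor and predictor column names match the cleaned data described later in this article.

library(rpart)
library(rpart.plot)

# Fit a single classification tree on three predictors (column names are
# assumed to match the cleaned data frame built later in this article)
tree <- rpart(win ~ X5v5.S..Sv + Net.PP + PK, data = dat, method = "class")

# Each node of the plot shows the predicted class, the win probability and
# the percentage of observations reaching that node
rpart.plot(tree)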

Gathering the dataset

This is the longest and most difficult part of the process because nhl.com/stats only allows us to download 100 records at a time. Manually downloading all of the records would be overly time-consuming and error-prone. Fortunately, I have covered data scraping in R before in detail and the code used in this article is available at the end.
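To give a flavor of the approach (a simplified sketch, not the exact scraper from that earlier article): the report tables on nhl.com/stats are served 100 rows at a time by a JSON endpoint, so the records can be paged through in a loop. The endpoint URL and its start/limit/cayenneExp parameters below are assumptions based on the requests the page makes, and they may change without notice.

library(httr)
library(jsonlite)

# Assumed endpoint behind the "Summary" report; the site returns at most
# 100 rows per request, so we page through with start/limit
base_url <- "https://api.nhle.com/stats/rest/en/team/summary"

fetch_page <- function(start) {
  resp <- GET(base_url, query = list(
    isAggregate = "false",
    isGame = "true",  # per-game rows rather than season totals
    start = start,
    limit = 100,
    cayenneExp = "seasonId>=20172018 and seasonId<=20192020"
  ))
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))$data
}

# Keep requesting pages until one comes back empty
pages <- list()
start <- 0
repeat {
  page <- fetch_page(start)
  if (NROW(page) == 0) break
  pages[[length(pages) + 1]] <- page
  start <- start + 100
}
summary_df <- do.call(rbind, pages)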

To gather game data from all teams, I selected “Teams” > “By Game” and deselected “Sum Results”.

Screenshot from nhl.com/stats

I decided to focus on these five reports: Summary, Faceoff Percentages, Miscellaneous, Penalties and Shot Attempt Percentages. There are several more available but these five seemed to provide enough coverage of interesting metrics. I also decided to include records from the 2017–2018 to 2019–2020 seasons to create a large dataset.

As a result, we were left with a 7415 x 105 data frame where each row corresponded to a specific game, home or away, and each column corresponded to a metric/feature.
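Stitching the five reports into that single data frame is a matter of joining on whatever uniquely identifies a team-game. A minimal sketch, assuming each report was saved to a CSV and shares Team and Game.Date columns (the file names and both column names are assumptions):

library(dplyr)
library(readr)

# Hypothetical file names for the five downloaded reports
files <- c("summary.csv", "faceoff_percentages.csv", "miscellaneous.csv",
           "penalties.csv", "shot_attempt_percentages.csv")
reports <- lapply(files, read_csv)

# Each team-game appears once per report, so an inner join on team and
# game date lines the columns up
dat <- Reduce(function(x, y) inner_join(x, y, by = c("Team", "Game.Date")),
              reports)
dim(dat)  # roughly 7415 x 105 before cleaning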

Cleaning the data and adding new features

After preliminary filtering of missing data, duplicate columns, etc., the data frame was reduced to 6319 x 38. Below is a screenshot of the first three rows:

Some of these stats are interesting but should not be used in their current form. For example, X5v5.Sv stands for 5-on-5 save percentage. A value of 100.0, as in row two, means the Oilers did not allow a single goal during even strength play. Since this type of statistic contains information about goals during the game itself, this could lead to inflated prediction accuracies and obvious conclusions; hockey is won by outscoring your opponent, so teams are always attempting to maximize shooting and save percentages. Nothing new there.

Instead, I created additional columns corresponding to values from previous matchup(s) and previous game(s), where matchups are games against the same opponent. This allowed us to test whether knowing how a team performed previously provided any predictive value for the upcoming game. I also created two additional columns, OZDZ.FO_diff and Shot_diff, to represent the difference between offensive and defensive zone faceoffs and the difference in total shots on goal, respectively.
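Here is a minimal sketch of how such lagged features can be built with dplyr; the points, Opponent, OZ.FO, DZ.FO, SF and SA column names are assumptions, and only a one-game lookback is shown:

library(dplyr)

dat <- dat %>%
  arrange(Game.Date) %>%
  # Previous game for the same team, regardless of opponent
  group_by(Team) %>%
  mutate(points_previous_game = lag(points)) %>%
  # Previous meeting between this specific pair of teams
  group_by(Team, Opponent) %>%
  mutate(points_previous_matchup = lag(points)) %>%
  ungroup() %>%
  # Derived differentials used in the model
  mutate(OZDZ.FO_diff = OZ.FO - DZ.FO,  # offensive minus defensive zone faceoffs
         Shot_diff    = SF - SA)        # shots for minus shots against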

Random Forest implementation

I began by imputing NAs and then splitting the data into a training and a testing set. I used the out-of-bag (OOB) error to tune the mtry parameter, which specifies the number of features randomly sampled at each split. As mentioned earlier, sampling features at each node is one way that Random Forest reduces overfitting and increases generalizability. If everything goes correctly, the OOB accuracy and test set accuracy should be similar.
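A minimal sketch of that setup, assuming dat is the cleaned data frame with a factor column win; na.roughfix, from the randomForest package, fills numeric NAs with column medians and factor NAs with the most frequent level, and the 80/20 split proportion is an assumption:

library(randomForest)

set.seed(42)  # reproducible split
dat$win <- as.factor(dat$win)

# Median/mode imputation of missing values
dat <- na.roughfix(dat)

# 80/20 train/test split
idx   <- sample(nrow(dat), 0.8 * nrow(dat))
train <- dat[idx, ]
test  <- dat[-idx, ]

With train in hand, the tuning loop looks like this: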

# Try mtry values from 1 to 20 and record the OOB error of each forest
oob_values = vector(length=20)
for(i in 1:20) {
  print(i)  # progress indicator
  temp = randomForest(win ~ ., data = train,
                      mtry = i, ntree = 500,
                      na.action = na.roughfix)
  # OOB error rate after the final (500th) tree
  oob_values[i] = temp$err.rate[nrow(temp$err.rate), 1]
}

An mtry of 17 proved to give the smallest OOB error, so that is what we will use in the final model. Another parameter that can be tuned is the number of trees, but for now we will just generate 1000:

mod = randomForest(win ~ .,
                   data = train,
                   ntree = 1000,
                   proximity = TRUE,    # store proximities for later plots
                   importance = TRUE,   # store permutation importance
                   na.action = na.roughfix,
                   mtry = which.min(oob_values))  # the mtry with the lowest OOB error

Assessing the model

When using Random Forest, OOB error is a metric used to assess model accuracy. OOB comes from Bootstrap Aggregation (Bagging), a process in which observations are randomly sampled with replacement to build each tree. As a result, every tree in the Random Forest has a set of left-out samples that can be used for internal testing. Below are the results:

The OOB estimate of the error rate was 30.36%: 31.10% for losses and 29.74% for wins (panel A). In other words, our predictions were correct 69.64% of the time, and marginally better for wins than losses. When we ran the model on the test set (panel B), we achieved similar results with an accuracy of 71.03%.
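A minimal sketch of that comparison: printing the fitted model reports the OOB error and confusion matrix, and predicting on the held-out set gives the test accuracy.

print(mod)  # OOB error rate and confusion matrix (panel A)

# Accuracy on the held-out test set (panel B)
pred <- predict(mod, newdata = test)
conf <- table(observed = test$win, predicted = pred)
conf
sum(diag(conf)) / sum(conf)  # overall test accuracy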

For hockey, ~70% is high accuracy. I suspect this is due to specific stats, like power play % and penalty kill %, that contain some information about goals or the lack thereof. Let’s take a look.

Feature importance

In both panels, we have the various hockey statistics on the y-axis and a measure of variable importance on the x-axis. Mean decrease in accuracy refers to the loss in accuracy after permuting/shuffling the values of a given feature. Mean decrease in Gini measures how much the splits made on a given feature increase node purity, summed over all trees. In our case, a perfectly pure node would contain only wins or only losses following a split. For both measures, a higher number equates to more importance. Topping our lists are OZDZ.FO_diff, Net.PP, Net.PK, BkS, etc.
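Both panels come straight from the importance measures stored by randomForest, which is why we set importance = TRUE above:

# Plots mean decrease in accuracy and mean decrease in Gini side by side
varImpPlot(mod, main = "Feature importance")

# The raw numbers, sorted by permutation importance
imp <- importance(mod)
head(imp[order(-imp[, "MeanDecreaseAccuracy"]), ])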

While these figures do indicate importance, they do not incorporate directionality. For that, we turn to SHAP plots, which show the relationship of features with predictions:

In the SHAP plot, each point is an observation from the training data. The color corresponds to the original feature value, whether relatively high (red) or low (blue). For Net.PP, the red dots sit mostly on the right/positive side of the x-axis while the blue dots show the opposite trend; therefore, Net.PP is positively correlated with wins. In more detail, Net.PP is a team's power play goals minus short handed goals conceded, as a percentage of total power play opportunities. A percentage close to 100 means the power play is highly productive, whereas a negative percentage means the power play is a liability. The same logic and correlation apply to Net.PK. This would suggest that NHL teams should spend more time honing their special teams and less time on other areas (lowly ranked features).
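SHAP values are not built into randomForest; one way to approximate them in R is the fastshap package (an assumption about tooling rather than the article's original code). The factor level "W" for a win is also an assumption:

library(fastshap)

# Wrapper returning P(win) for new observations; "W" is the assumed
# factor level for a win
pfun <- function(object, newdata) {
  predict(object, newdata = newdata, type = "prob")[, "W"]
}

# Monte Carlo SHAP estimates over the training features; nsim trades
# accuracy for speed
X <- subset(train, select = -win)
shap <- explain(mod, X = X, pred_wrapper = pfun, nsim = 10)

Packages such as shapviz can then render a beeswarm-style summary plot like the one described above.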

Another top feature is OZDZ.FO_diff. A positive value means a team had more offensive zone faceoffs than defensive zone faceoffs. Intuitively, a higher OZDZ.FO_diff should correspond to more wins; however, we actually observed the opposite trend: teams that won typically had more defensive than offensive zone faceoffs. This is captured by the SHAP plot and the decision tree below.

Perhaps players and teams do well starting from their own zone because it provides an opportunity for a fast breakout. Or, maybe a team tends to have a negative OZDZ.FO_diff if they play a more defensive game. This could be the scenario if a team is winning a game and the coach wants to deploy a more defensive style instead of maintaining high offense. There are many explanations so this may even require a separate and dedicated post.

The statistics that we created, e.g. points_previous_X_games and points_previous_X_matchups, rank towards the middle-bottom of the pack. This suggests that knowing how a team performed previously, in terms of points, offers little predictive value. This is particularly true when only looking back at a single game or matchup. Creating other variables, like goal differentials from previous games, could tell a different story and will be included in upcoming analyses.

Conclusion

We were able to scrape and clean data, create new features and implement Random Forest. We achieved ~70% prediction accuracy and determined that OZDZ.FO_diff, Net.PP and Net.PK were among the most important features.

In future work, I will rerun this after removing any features containing goal information. That will include removing PP, PK, etc. Additionally, I think it would be worthwhile to incorporate more counts alongside percentages, since the latter do not capture magnitude. For example, a SAT% of 60 could result from a number of combinations: 60 vs 40 SAT (difference of 20), 42 vs 28 (difference of 14), etc. In some cases, how much a statistic varies between opposing teams may contain valuable information.

Finally, I also want to run another analysis that only considers prior information so I can make complete predictions on game outcomes. This will mean incorporating more features like points_previous_X_games, points_previous_X_matchups, etc.

Code

There are three scripts for this article that can be found here.

References

  1. Medium article on Random Forest
  2. Regression trees from UC Business Analytics
  3. SHAP values
  4. YouTube video from StatQuest
  5. Code from StatQuest
  6. NHL stats


Christian Lee

Medical student. Computational biologist. Sport stats enthusiast.