Are Shot Attempts and Shots on Goal Alone Meaningful Predictors of NHL Game Outcomes? Not Really.

One thing is certain when watching a hockey game: there will always be fans yelling at their TV, demanding that players shoot the puck. Do they know something the professionals don’t? Let’s take a look.

Christian Lee
Published in Hockey Stats
6 min read · Dec 24, 2020

What we will cover

  1. The relationship between shot attempts (SA)/Corsi and shots on goal (SOG)
  2. Their association with wins
  3. Statistical evaluation using logistic regression

Data

We will take a look at the 2018–2019 regular season home game data scraped from nhl.com/stats. We won’t cover that process here, but follow this link to learn more. We are strictly using home games to eliminate redundant data from the same game, leaving us with 1271 records.

The relationship between Corsi and SOG

Corsi/SA counts all shot attempts (shots on goal, missed shots, and blocked shots) during even-strength play. SOG counts only the shots that actually reach the net (goals and saves), at any point during the game. Therefore, one of the major differences is that SOG includes power play and short-handed minutes, which typically provide high pressure and scoring opportunities that can quickly change the direction of a game.

Below, we will consider CF% (Corsi For %) and SOG%, which are simply a team's SA and SOG divided by the total SA and SOG recorded by both teams in that game, respectively. Therefore, a number above 50% indicates that a given team had more SA or SOG than their opponent.
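
As a minimal sketch of that definition, the percentage can be computed from per-game counts as shown here. The helper `pct_for` and the counts are hypothetical and only illustrate the arithmetic; the names `cfp` and `sp` mirror the variables used in the models further down.

```r
# Share of a stat (SA or SOG) that belongs to the home team, as a percentage.
# `home` and `away` are hypothetical per-game counts, not the article's data.
pct_for <- function(home, away) {
  100 * home / (home + away)
}

# Illustrative game: home team 60 SA and 32 SOG, away team 40 SA and 28 SOG
cfp <- pct_for(home = 60, away = 40)  # CF%  = 60
sp  <- pct_for(home = 32, away = 28)  # SOG% = 53.3 (rounded)
```

In this made-up game the home team out-attempts its opponent 60 to 40, so its CF% is above 50, and the same reading applies to SOG%.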

For the most part, the smoothed curve is a straight line that closely follows the red line of slope 1, indicating that as one variable increases, so does the other by the same step. As expected, SOG% and CF% are highly correlated.

The slope of the blue line indicates that an increase in CF% is matched by a slightly smaller increase in SOG%. The bottom left quadrant shows that when a team has a small CF%, they tend to have a small, but slightly higher, SOG%. We see the opposite trend in the top right quadrant, where teams have a high CF% and a slightly lower SOG%. This just means that CF% can be a bit more extreme than SOG%.
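
One way to check a pattern like this numerically is to regress SOG% on CF% and compare the fitted slope with 1. The sketch below uses simulated data in which the true slope is set to 0.8 purely for illustration; it is not the article's data, and the seed, sample size, and noise levels are arbitrary assumptions.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated stand-in for per-game percentages (NOT the article's data)
cfp <- rnorm(200, mean = 50, sd = 7)
sp  <- 50 + 0.8 * (cfp - 50) + rnorm(200, sd = 3)

fit   <- lm(sp ~ cfp)
slope <- unname(coef(fit)["cfp"])  # a slope below 1 means SOG% moves less than CF%
r     <- cor(cfp, sp)              # strength of the linear association
```

A fitted slope below 1 together with a high correlation corresponds to the situation described above: the two measures move together, but CF% takes the more extreme values.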

How do Corsi and SOG relate to wins?

Since both SA and SOG are considered proxies of offensive pressure, we would expect the team with more SA and SOG to win the majority of games. Whenever I watch games, I closely follow the shot count because it feels like it mirrors the flow of the game. However, the violin plot reveals some surprising results. For example, when the home team loses, they actually tend to have slightly more shot attempts and shots on goal than their opponents. Equivalently, when the away team wins, they tend to have slightly fewer shot attempts and shots on goal. In some extreme cases, a team had ~70% of SA or SOG yet still lost. The opposite also holds.

For SOG%, the median is very similar between wins and losses, and hovers just above 50%. Therefore, the home team tends to have marginally more SOG than the away team, whether they win or lose. For CF%, the mean is significantly higher for losses than for wins (p = 0.008, Welch's t-test), indicating that the home team generally has a higher fraction of shot attempts when they lose than when they win.
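
For readers who want to run this kind of comparison themselves, here is a minimal sketch of a Welch test in R. The data frame is simulated (the group means, spread, and seed are assumptions chosen for illustration), so the resulting p-value says nothing about the real season.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated stand-in data: 0 = home loss, 1 = home win (NOT the article's data)
df <- data.frame(
  wins = factor(rep(c(0, 1), each = 100)),
  cfp  = c(rnorm(100, mean = 51, sd = 7), rnorm(100, mean = 49, sd = 7))
)

# t.test() defaults to var.equal = FALSE, which is Welch's unequal-variance test
res <- t.test(cfp ~ wins, data = df)
res$method   # "Welch Two Sample t-test"
res$p.value
```

Welch's version does not assume that wins and losses have the same variance in CF%, which makes it a reasonable default when the two groups may differ in spread.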

For the most part, the distributions are closely centered around 50%, suggesting that both SA and SOG, when considered alone, do not appear to be overly informative of game outcomes. So, they fail the eyeball test, but now, let’s take a statistical approach.

Statistical analysis using logistic regression in R

After splitting our data into training and testing sets and running the logit model we get the following results:
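
The split itself is not shown in this article. As a rough sketch, a random 80/20 split could look like the following; the proportion, the seed, and the simulated data frame are all assumptions rather than the author's actual procedure.

```r
set.seed(123)  # arbitrary seed so the split is reproducible

# Simulated stand-in for the 1271 home-game records (NOT the article's data)
df <- data.frame(
  wins = factor(rbinom(1271, size = 1, prob = 0.5)),
  cfp  = rnorm(1271, mean = 50, sd = 7)
)

# Assumed ~80/20 split; the article does not state the proportion it used
train_idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
df_train  <- df[train_idx, ]
df_test   <- df[-train_idx, ]

c(train = nrow(df_train), test = nrow(df_test))  # 1016 and 255
```

An 80/20 split of 1271 records yields 255 test games, which matches the size of the test set in the confusion matrix later on, although that alone does not confirm how the original split was made.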

summary(glm(wins ~ cfp, data = df_train, family = "binomial"))$coefficients
#               Estimate  Std. Error   z value     Pr(>|z|)
#(Intercept)  1.51263090 0.409942669  3.689860 0.0002243778
#cfp         -0.02664262 0.007954432 -3.349405 0.0008098529

The intercept has a coefficient of 1.51 and is statistically significant. We interpret this to mean that, in a hypothetical situation in which Corsi For % (cfp) is zero, the log odds of winning are 1.51 and the odds are exp(1.51) = 4.53. Therefore, in this hypothetical situation, having all shots against your team and none for your team actually corresponds to higher odds of winning the game (at least in the 2018–2019 season). Keep in mind that a CF% of zero lies far outside the observed data, so this is an extrapolation rather than something we actually measured.

We also see that Corsi For % has a negative coefficient supported by a significant p-value. More specifically, for every one-percentage-point increase in Corsi For %, we see a 0.027 decrease in the log odds of winning the game. Equivalently, the odds are multiplied by exp(-0.027) ≈ 0.97, a decrease of roughly 2.6%. Therefore, as Corsi For % increases, the odds of winning an NHL game fall. This is counter-intuitive, but it corroborates what we observed previously in the violin plot, where losses had a higher median Corsi For % than wins.
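
To make the coefficient interpretation concrete, here is a small sketch that plugs the two estimates from the output above into the logistic formula. The variable names are my own; only the two coefficient values come from the fitted model.

```r
b0 <- 1.51263090   # intercept estimate from the output above
b1 <- -0.02664262  # cfp estimate from the output above

odds_ratio <- exp(b1)               # ~0.974: odds multiplier per +1 point of CF%
p_at_50    <- plogis(b0 + b1 * 50)  # ~0.545: predicted win probability at CF% = 50
cf_cutoff  <- -b0 / b1              # ~56.8: CF% where the predicted probability is 0.5
```

Read this way, the model predicts a home win whenever CF% is below roughly 56.8 and a loss above it, which is why it predicts wins far more often than losses.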

summary(glm(wins ~ sp, data = df_train, family = "binomial"))$coefficients
#               Estimate  Std. Error    z value  Pr(>|z|)
#(Intercept) 0.020386223 0.387003007 0.05267717 0.9579891
#sp          0.002697823 0.007497509 0.35982932 0.7189748

Here, SOG% has a small, positive coefficient, but it is not significantly different from zero. Again, this is reflected by the violin plot, which shows similar distributions between wins and losses. Therefore, Corsi For % is more informative for this analysis and we can drop SOG%.

Is our simple CF%-based model useful for making predictions?

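library(caret)  # provides confusionMatrix()

# Fit on the training set, then predict win probabilities for the test set
mod = glm(wins ~ cfp, data = df_train, family = "binomial")
df_pred = data.frame(prob = predict(mod, df_test, type = "response"))
# Classify a predicted probability >= 0.5 as a win (1), otherwise a loss (0)
df_pred$prediction = factor(ifelse(df_pred$prob < .5, 0, 1), levels = c(0, 1))
confusionMatrix(df_pred$prediction, df_test$wins)
#           Reference
# Prediction   0   1
#          0  41  39
#          1  72 103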

Assuming a probability ≥ 0.5 is a win and < 0.5 is a loss, our model correctly predicted 144/255 (56.5%) of game outcomes from the test set. Considering there is a 50% chance of randomly guessing the right outcome, using Corsi For % provided only a small improvement. Because the coefficient is negative, this is consistent with the idea that more shot attempts alone (during even strength) are, if anything, a predictor of losing rather than winning an NHL game. This could and should be further substantiated by incorporating more records from past seasons.
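
A coin flip is not the only possible baseline. As a sketch, the accuracy can also be compared with the "no-information rate", meaning the accuracy of always predicting the most common outcome in the test set. The counts below are copied from the confusion matrix above; the binomial test mirrors the kind of comparison that caret reports as "P-Value [Acc > NIR]", and the variable names are my own.

```r
# Confusion matrix from the test set (rows = prediction, columns = reference)
cm <- matrix(c(41, 72, 39, 103), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy <- sum(diag(cm)) / sum(cm)      # 144 / 255 = 0.565 (rounded)
nir      <- max(colSums(cm)) / sum(cm)   # 142 / 255 = 0.557 (rounded)

# One-sided test of whether the accuracy exceeds the no-information rate
p_nir <- binom.test(sum(diag(cm)), sum(cm), p = nir,
                    alternative = "greater")$p.value
```

The home team won 142 of the 255 test games, so always guessing "home win" would already be right 55.7% of the time, barely below the model's 56.5%. That gap is small enough to reinforce the point that CF% alone carries little predictive value.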

Conclusion

Prior to this analysis, I expected SOG% to be more informative than Corsi For % since the former incorporates power play minutes and is perhaps a slightly better measure of scoring attempts. However, we observed that SOG% is, on average, similar between wins and losses, therefore providing little predictive value.

I also expected a positive correlation with wins; instead, we measured a significant negative association for Corsi For % and no relationship for SOG%. There are many possible explanations for these observations, but a hole in our analysis is that we’ve considered SA and SOG in isolation when there are in fact many other variables at play. Perhaps interactions with additional variables could greatly improve model performance. To be discovered…

Edit 06/2021: Another great point, raised in the comments, is that teams that are winning games often shift towards a more defensive style of play. As a result, the winning team may have had a higher fraction of shots in the beginning, but that percentage decreased as they began to focus on holding their lead. Likewise, the opposing team may have adopted a more aggressive style and started to get more pucks on net.

Code Availability

The R code used for this analysis can be found here.
