Using Empirical Bayes Estimation to Define the Best NHL Shooter
A statistical analysis of the NHL players in the 2019–20 regular season
When we start to analyze the rate of success is not uncommon to cross with cases like 1/1 and 0/1, i.e., a hundred percent of success or zero percent. With NHL regular season nearing the end, I asked myself, who was the best goal scorer this season. I know David Pastrnak has 48 goals at the time I’m writing this, and when I finish he probably will have scored 2 more, as will Matthews and Ovechkin. But, these guys usually have their line (first line) and play huge minutes in the game. For sure more than any teammate. This makes sense, better players always play more, be in hockey, basketball, baseball, etc. With that question in mind, I started looking scientific theory to based my judgment, at least to where I know science has no bias, don’t cheer for the Bruins nor the Leafs or the Caps.
Amid my quest, I came across this article about Empirical Bayes Estimation by David Robinson¹. Who describes how to use it to improve the expectations on each player batting average in the MLB.
Mathematical Background
The method consists in finding a probability density function (PDF) who fits the real data, i.e. given the values x₁, x₂, …, xₙ, there is a function f with the unknown parameters θ which is capable to give with the largest likelihood the values x₁, x₂, …, xₙ.
Because θ is assumed unknown, we also write f as a function of θ. Now since f(x₁, x₂, …, xₙ|θ) represents the likelihood that the values x₁, x₂, …, xₙ will be observed when θ is the true value of the parameter, it would seem that a reasonable estimate of θ would be that value yielding the largest likelihood of the observed values.²
Once we going to use a beta distribution. So θ~θ(α,β). So the idea is very simple, with going to find the estimators α and β with the data that we have so far in this season. In the end, we expect to obtain an ordered list of players with the better-normalized shooting percentage. This new list of values is also called posterior density function and can be defined as, f(θ|x₁, x₂, …, xₙ), i.e. given the parameter θ, a function f which has a maximum likelihood to yield x₁, x₂, …, xₙ.
So here is the list of who is currently the best scorers in the league, the thing to notice is the rate of success or how many attempts were made (shots) to score a goal. Well, the statistics tell it is, but in the back of our minds we know, it isn’t. To adjust this, we will use the Empirical Bayesian shrinkage towards a Beta prior, the name looks fancy but you’ll see that’s quite easy. First we collect the stats for all the players active in the league this season.
First, we can get the average shooting percentage in this season with the following and filter with a threshold of 100 shots taken, although this is kinda strange, it is necessary for the method to be able to compute our α and β. Another methods can use the whole data to fit the PDF, but they are more complex, we are sticking we simplicity.
Now, using the fitdistplus package in R. We going to fit a density function that can represent our data, this time we will use a beta distribution.
fitpdf <- fitdistrplus::fitdist(career$average, “beta”, method = “mle”)
Then we just have α and β.
And the nice thing about the fitdistplus is that if you plot the object. And it gives you some other plots like Q-Q, P-P and CDF.
Focusing on the first plot I want to point two things, once we do not have a large amount of data, our PDF does not fit perfectly. But right now it is our best option. Second is that we’re going to try to “push” the blue dashed line towards the red one, i.e. we will “shrink” the values in direction of the mean of the PDF.
To do this we just have to update the overall prior, mean of the PDF, with each player data. We already have the players data, the mean of the PDF is a bit more complicated and I’m not going to prove here, you can see it here. We just accept that is the expected value of the PDF, defined as E[θ|X₁, X₂, …, Xₙ] = (α /α+β). The new estimate shooting percentage will be given by:
SH% = (Goals + α) / ( Shots + α + β)
And with the following code, we can update and extract the best shooters in the NHL on the 2019–20 regular season.
career_eb <- career %>%
mutate(eb_average = (G + fitpdf$estimate[1]) / (SHOTS + fitpdf$estimate[1] + fitpdf$estimate[2]))
Results
Applying the above method we can reorder the ranking shooters and see what happened with average. A thing to notice is that none of those of the list above is the chart. Those players were “shrunk” to the average of the PDF and are nowhere near the top results.
Now we can see what the “shrinkage” has done to the real data. The dashed line is the mean of the PDF, and the continuous line is the x = y. As we can see, players with more shots are close to the continuous line, and the average has not changed.
This picture makes clear that a few numbers of events do not represent the average of the occurrences. What I mean is, that if you are lucky a few numbers of times or unlucky does not mean that you will continue like that. You can score 5 goals in 10 shots, have 5 hits in 7 at-bats, etc. But all events tend to move towards the mean.
Closing Remarks
The Empirical Bayes Estimation is an approximation to a more precise method, but has been used quite regularly because of the simplicity and the accuracy of the results. In the scientific medium it has been applied, in the engineering, medical and biologic areas. But this method can be used to adjust any case were the events follow the binomial distribution.
Using the fitdistrplus package in R the coding is reduced to a few lines and, and can be done quite easily. The most difficult part is the data wrangling.
As we saw, even a simple method can be a good way to obtain a nice estimation, and it’s not necessary to use more complex and difficult to implement methods.
References
- http://varianceexplained.org/r/empirical_bayes_baseball/, acessed in 03–05–2020.
- Introduction to probability and statistics for engineers and scientists. Sheldon M Ross, Academic Press, 2014.
Acknowledgments
I want to thank to the guys at quanthockey.com for the data.