NHL — Using Python and Advanced Analytics to Evaluate Player Ratings

Ethan Kennemer
10 min read · May 18, 2023


Basic Overview:

Over the years of playing EA Sports NHL, I frequently questioned how player ratings were determined. According to each player's biography, the overall rating, which typically ranges from 70 to 100, is determined by combining numerous characteristics specific to that individual. For instance, measurements such as speed, acceleration, shot power, shot accuracy, stick handling, stamina, aggressiveness, and intensity may be used. Yet, in practice, how can you gauge aggressiveness, intensity, or even stick-handling prowess precisely enough to compare one player's scores to another's? There is not much public information about the EA Sports rating system, and it made me wonder whether I could create my own rating system (SPR) using only reliable player metrics.

Data Sets:

The final data set used for this study combines traditional and advanced player measures. Traditional statistics focus on factors like goals, assists, total points, plus-minus, penalty minutes, and time on ice. Advanced player analytics are more concerned with player behavior and puck possession. The traditional data was obtained from https://moneypuck.com/stats.htm, while I scraped the advanced metrics with a Python scraping library. The resulting data set covers approximately 2,000 players between 2013 and 2023. I would like to emphasize that I only considered advanced statistics recorded at even strength. The data was prepared by removing any missing information before being loaded into Pandas for analysis. This rating system covers both skaters and goalies, despite their roles and contributions being vastly different.
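As a minimal sketch of that preparation step, assuming hypothetical file and column names (traditional_stats.csv, advanced_stats.csv, player, season) rather than the actual sources:

```python
import pandas as pd

# Hypothetical file names: one CSV of traditional stats and one CSV
# of scraped even-strength advanced metrics.
trad = pd.read_csv("traditional_stats.csv")
adv = pd.read_csv("advanced_stats.csv")

# Combine the two sources on player and season, then drop any rows
# with missing information before analysis.
df = trad.merge(adv, on=["player", "season"], how="inner").dropna()
```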

Model Concept:

I am aware of other player rating methods, and my proposal builds on those ingenious grading schemes. In essence, one takes separate factors, gives each one a weight, and adds them up; the model is just the sum of the weighted parameters. The most difficult task is identifying which parameters to use and what weight to give each, and the choice of weights is essentially subjective. Although this does not sound like a horrible concept, there is a lot of uncertainty surrounding existing ratings. Are they truly based on actual measurements, or are they miraculously determined by "the eye test"? Are the ratings an average across several seasons, or do they depend only on the prior season? Does the algorithm change annually? There is simply too much mystery for me to take that path. As a result, I developed my own weights, which appear to work adequately.

Adding up all of the chosen factors is an excellent place to begin, but it does not guarantee that the outcome stays bounded. For instance, when will a player reach the limit of 100? The ratings from the prior season may become obsolete if someone exceeds the designated 100-point cap in the following season. To solve this boundedness issue, we can use a Sigmoid function:

P = 1 / (1 + e^(−z)), where z = β₀ + Σᵢ wᵢxᵢ for i = 1 to N

The probability P is a function of the input variables x, where z is the weighted sum of a constant β₀ and the input parameters xᵢ with weights wᵢ, for all values from i = 1 to N. Scaled by 100 into the rating space, this produces an S-curve:

The S-curve serves as a probability curve: it is asymptotically near 0 and 100 and crosses the y-axis at 50. The Sigmoid function converts log-odds space to probability space, which is why it is typically used in machine-learning logistic regression applications to perform binary classification of a dependent variable given an array of independent input parameters.

Even though we already have fixed weights and are not trying to classify anything, we can still use the traditional Sigmoid function by passing the weighted combination of parameters through it. As the graph shows, the curve flattens as values approach +∞ and −∞. Since the sum of each player's weighted parameters should be greater than zero, the ratings will, in essence, range from 50 to 99.9.
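As a minimal sketch, this is what the scaled Sigmoid looks like in Python; the function name is mine, not part of any library:

```python
import numpy as np

def sigmoid_rating(z):
    # Map the weighted sum z onto the 0-100 rating space.
    # For z > 0 the output stays between 50 and ~99.9.
    return 100.0 / (1.0 + np.exp(-z))

print(sigmoid_rating(0.0))  # 50.0, where the curve crosses the y-axis
print(sigmoid_rating(3.0))  # ~95.3
print(sigmoid_rating(7.0))  # ~99.9, nearing the asymptote
```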

Terminology for Advanced Hockey Statistics:

I assume readers are familiar with terms like Points, Average Time on Ice, Blocked Shots, Giveaways, Takeaways, Hits, Penalties Drawn, Penalties Taken, Faceoffs Won, and Faceoffs Lost. However, I will explain the purpose of the higher-level metrics that may be unfamiliar.

CF%: The term "Corsi For" (CF) refers to the number of shot attempts a player's team produced while that player was on the ice, as opposed to "Corsi Against" (CA), the number of shot attempts the opposing team produced while that player was on the ice. Simply put, CF% = CF / (CF + CA).

CFQoC: Quality of Competition, measured as the average CF% of a player's opposition. The higher the number, the stronger the competition the player faced. Its meaning is somewhat disputed, but I prefer it when used in conjunction with OZS.

CFQoT: Quality of Teammates, the average CF% of a player's linemates. It shows how much a player contributed to the overall play relative to his teammates, and it is usually a reliable indicator of whether a player improved the players around him.

OZS: Offensive Zone Starts — how frequently a specific player began their shift in the offensive zone.

PDO: Its name, suggested by hockey blogger Vic Ferrari, is not an abbreviation. It is simply a stand-in for luck: the team's shooting percentage plus the team's save percentage while a player is on the ice.

xGF%: The proportion of expected goals for relative to expected goals against, i.e., xGF / (xGF + xGA). An expected goal (xG) is the probability that a specific shot will result in a goal, so it renders a verdict on the quality of the shots. A value of 0.5 for xG indicates that the shot should result in a goal 50% of the time. Players who score highly here are involved in opportunities of the highest caliber.

DPS: Defensive Point Shares, one of the three Point Shares measures reported at https://moneypuck.com/stats.htm. Applying the concepts of marginal goals for and marginal goals against, it assigns a share of credit for each point a player obtains. Defensive forwards and D-men tend to have more DPS than the majority of offensive forwards, since DPS is more difficult to acquire than OPS.
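Given the definitions above, the on-ice metrics reduce to simple column arithmetic. A sketch, assuming hypothetical column names (CF, CA, xGF, xGA, onIceShootingPct, onIceSavePct) in the merged data frame:

```python
# CF%: share of shot attempts taken by the player's team.
df["CF%"] = df["CF"] / (df["CF"] + df["CA"])

# xGF%: share of expected goals credited to the player's team.
df["xGF%"] = df["xGF"] / (df["xGF"] + df["xGA"])

# PDO: on-ice shooting percentage plus on-ice save percentage.
df["PDO"] = df["onIceShootingPct"] + df["onIceSavePct"]
```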

Algorithm:

Before stacking the various parameters into a single score (GSS_adj) and sending it through the Sigmoid function, I had to choose a method and a set of weights.
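In outline, the stacking looks something like the following sketch, reconstructed from the weight breakdown in the notes below; the parameter names and the assumption that each input is pre-normalized are mine, and the real implementation is more involved:

```python
import numpy as np

def spr_rating(points, defn, toi, misc):
    # Stack the normalized inputs into GSS_adj using the approximate
    # split from the notes below: productivity ~50%, responsibility ~35%,
    # stamina ~8%, miscellaneous for the remainder.
    gss_adj = 0.50 * points + 0.35 * defn + 0.08 * toi + 0.07 * misc
    # Send the stacked score through the Sigmoid for a 50-99.9 rating.
    return 100.0 / (1.0 + np.exp(-gss_adj))
```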

It would be laborious to attempt to fully describe the concepts underlying the algorithm in this post because they are rather complicated. Nevertheless, I will touch on a few points:

  1. As a metric for success, I use points, and I do not distinguish between goals, first assists, and second assists. I believe that productivity is any action that advances a goal, regardless of where you were on the ice or what you did; the circumstances in which a goal is scored are crucial and vary too much to weight them differently.
  2. The weights break down as follows: the majority of the weight (50%) in the stacked "GSS_adj" comes from points (productivity), "defn" accounts for roughly 35% (responsibility), "toi" accounts for 8% (stamina), and the other variables make up the remainder (miscellaneous).
  3. We only take into account players who have played at least 10 games in a season.
  4. Players are rated based on statistics projected over a full 82-game season, as in the sketch below.
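A sketch of the projection in note 4, with the cutoff from note 3; the function and its threshold handling are illustrative assumptions:

```python
def project_to_82(stat_total, games_played, min_games=10):
    # Exclude players below the 10-game threshold (note 3).
    if games_played < min_games:
        return None
    # Scale the per-game rate to a full 82-game season (note 4).
    return stat_total / games_played * 82.0

print(project_to_82(30, 48))  # 30 points in 48 games -> ~51 over 82
```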

Player Ratings:

The overall ratings (SPR) below are an average of each player's individual yearly ratings over the last three seasons (2020, 2021, and 2022). For a player who has played in all three years, the yearly ratings are weighted 50%, 30%, and 20% for the prior year, two years ago, and three years ago, respectively.

For players who have only participated in two of the previous three seasons, the weights are 60% for the most recent year and 40% for the older year. There is no weighting for players with only one season. Given the graph in the "Model Concept" section, several of the names that appear here are not surprising.
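The blending reduces to a small lookup; a sketch, assuming ratings are passed most recent season first:

```python
def overall_spr(yearly_ratings):
    # yearly_ratings: individual ratings, most recent season first.
    weights = {3: [0.50, 0.30, 0.20], 2: [0.60, 0.40], 1: [1.00]}
    w = weights[len(yearly_ratings)]
    return sum(r * wi for r, wi in zip(yearly_ratings, w))

print(overall_spr([92.0, 88.0, 85.0]))  # 50/30/20 blend -> 89.4
print(overall_spr([92.0, 88.0]))        # 60/40 blend -> 90.4
```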

EA Sports Comparison:

Although EA Sports ratings are not always accurate, they are what inspired me to create my own system. Therefore, it makes sense to compare my ratings to theirs. I would like to emphasize that I am not attempting to duplicate EA Sports, but we need a baseline to contrast with. The scatter plot below displays EA Sports player ratings as a function of SPR for each player rated higher than 73. An R²-value of 0.76 indicates that 76% of the variability in EA Sports ratings is explained by SPR, and there is a robust, statistically significant connection (r = 0.87, P-value ≈ 0). The best-fit line has an intercept of 32.5, a slope of 0.60, and an RMSE of 2.6. The residual spread stays roughly constant whether values are small or large (homoscedasticity). In general, SPR tends to run below EA Sports under ~82 and above it past that point. Despite this, the correlation remains strong.
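The fit statistics above can be reproduced with scipy; a sketch assuming the paired ratings live in hypothetical SPR and EAS columns of the data frame:

```python
import numpy as np
from scipy.stats import linregress

# Paired ratings for players with an EA Sports rating above 73.
mask = df["EAS"] > 73
spr = df.loc[mask, "SPR"].to_numpy()
eas = df.loc[mask, "EAS"].to_numpy()

fit = linregress(spr, eas)
print(f"slope={fit.slope:.2f} intercept={fit.intercept:.1f} "
      f"r={fit.rvalue:.2f} R2={fit.rvalue**2:.2f}")

# RMSE of the best-fit line.
pred = fit.intercept + fit.slope * spr
print(np.sqrt(np.mean((eas - pred) ** 2)))
```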

Another approach to the comparison is to examine the relationship between the distributions of the two rating populations. Distribution plots and KDE curves for SPR and EAS are displayed below. Both distributions have a comparable bimodal shape, except that the SPR distribution is wider. We can test whether the difference between the two means is statistically significant, which lets us examine whether the SPR and EA Sports ratings are indicative of the same distribution. The significance level for the P-value was set to 0.05 (5%) in this case:

The lower the P-value, the less likely it is to observe a result like the one seen if the null hypothesis were true. Neither distribution fits the definition of a normal distribution.

Fortunately, the Central Limit Theorem (CLT) allows us to draw random samples from each distribution. According to the CLT, if enough samples are drawn, the selection is random, and each sample is independent, the resulting distribution of sample means will be approximately normal and centered on the population mean, so a hypothesis test can be applied to it. In the example below, 500 random samples are drawn from each distribution and their means are computed; to ensure normality, this process is repeated ~527 times.
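A sketch of that resampling step, reusing the spr and eas arrays from the regression sketch; sampling with replacement is my assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_means(ratings, n_samples=500, n_trials=527):
    # Draw 500 ratings at random per trial and record the mean;
    # by the CLT, ~527 repetitions give an approximately normal
    # distribution of sample means.
    return np.array([rng.choice(ratings, size=n_samples, replace=True).mean()
                     for _ in range(n_trials)])

spr_means = sample_means(spr)
eas_means = sample_means(eas)
```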

Now that we have two normal distributions, we can conduct a hypothesis test. We use Welch's T-test since the variances of the two distributions are not equal. The null hypothesis states that there is no difference between the mean SPR and EA Sports ratings. The test returned a P-value ≈ 0 (well below 0.05) and a Cohen's d > 2, so we must reject the null hypothesis: there is a sizable, statistically significant difference between the sample SPR and EA Sports rating means. To determine whether two random samples may be thought of as coming from the same distribution, the Kolmogorov-Smirnov test was also carried out. Its P-value was likewise effectively zero, indicating that the two distributions are distinct.
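Both tests are available in scipy; a sketch using the sample-mean arrays from above, with Cohen's d computed from a pooled standard deviation:

```python
import numpy as np
from scipy.stats import ttest_ind, ks_2samp

# Welch's T-test: equal_var=False because the variances differ.
t_stat, p_val = ttest_ind(spr_means, eas_means, equal_var=False)

# Cohen's d as the effect size between the two sample-mean distributions.
pooled_sd = np.sqrt((spr_means.std(ddof=1)**2 + eas_means.std(ddof=1)**2) / 2)
cohens_d = abs(spr_means.mean() - eas_means.mean()) / pooled_sd

# Kolmogorov-Smirnov test on the underlying rating distributions.
ks_stat, ks_p = ks_2samp(spr, eas)
print(p_val, cohens_d, ks_p)
```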

In general, the distributions are considered highly different from one another, even though we found a substantial correlation between the SPR and EA Sports scores. This is excellent: if they were drawn from the same distribution, I could just as easily use the EA Sports ratings, which is not the goal of this project.

Issues with Model:

This was my first time programming an NHL data analysis in Python, and the model in place was not perfect. One flaw is that "defensive defensemen" were undervalued compared to EA Sports. Although their total contribution to the team is significant and they generally exhibit strong "responsibility" traits, their lack of scoring points puts them behind offensive and two-way defensemen in the SPR system; one improvement would be to give these players more credit in their overall rating.

The 82-game projection is another issue. It is difficult to demonstrate what a player would have accomplished if he played only 48 games before suffering an injury and missing the remainder of the year. To evaluate players on a common footing regardless of how often they appeared, I decided to project to 82 games, and the same per-game-rate approach applies to whichever metric is being projected. Considering the length of the season and the fact that players may fatigue or regress over time, who is to say a given player would have maintained his scoring, hitting, and takeaway pace? Modeling how a player's statistics vary over time could provide a foundation for this, and I will take it into consideration in later editions.

Final Thoughts:

Forecasting a rating for each player is great, but one must keep in mind that this is merely a model; it offers only a glimpse of reality. The elements listed throughout this article are, in my opinion, crucial components of a hockey player, though of course others may not agree. As we have seen, the Sigmoid-based SPR approach makes it possible to design such a model. Therefore, I would not be shocked if EA Sports takes a similar route soon.


Ethan Kennemer

St. Louis / University of North Alabama / Rapsodo / Twitter: EthanKennemer