Can NHL player performance metrics help predict salary?

Oct 4, 2018

Every summer, a few players (some considered high-end talent) are still without a contract as the new season approaches. There is often a stalemate between owners/General Managers (GMs) and a given player, and neither side wants to budge: the player wants more, but the organisation wants to pay less. Like the NFL and NBA, the NHL operates in a cap-dominated market, in which each team is allowed to spend only a certain amount of money. For each owner/GM it is like fitting a jigsaw puzzle together: balancing the books while also keeping the team competitive. “The cap” was introduced for the 2005–2006 season at $39 million (US dollars are used throughout) and has steadily* increased since then; today (2018–2019) it is set at $79.5 million. Player salaries have also increased overall during this time: high-end players set the benchmark for top-end salaries, whilst the rest of the pack gladly follow behind, picking up an inflated pay-cheque. Hence, it is in the interest of all players that the high-end players set the mark.

I have often wondered “what sets the market value?” Is it hockey performance alone that sets the bar? In this project, I have devised a tool to help owners and GMs analyse the market and ascertain a player's value given his previous performance. This is of interest not only to owners, GMs, and players, but also to other parties such as sponsors, agents, and of course the fans. Whilst I would have liked to explore all types of players, goalies have been omitted, as their metrics are very different from those of skaters and would be a project in itself. As there are more skaters than goalies, there is more data, so it makes sense to start with them.

Player metrics and data

Traditionally, hockey players have been judged on basic statistics, including Goals, Assists, Points (Goals + Assists), and Plus-Minus (the difference between goals scored for and against whilst the player is on the ice). However, since 2008 various parties have made available a more sophisticated set of metrics that look in more detail at how players perform in terms of possession. These include Corsi (shot attempts for vs against whilst a player is on the ice), Rel Corsi (how the team's Corsi with a player on the ice compares with when he is off it), CF QoC (the level of competition a player faces), TOI metrics (Time on Ice parameters), ZS (the zone in which a player is most likely to start his shift), and many more. I decided to use a combination of traditional and advanced metrics. I collected these data from two sites (www.corsica-hockey.com and www.hockey-reference.com). The former's data are available as downloadable csv files, whilst I scraped the latter using Python's BeautifulSoup library. Every player who has played in the NHL from 2008 to the present was collected. I also needed each player's salary for each year he has played, which I obtained from www.capfriendly.com, again using BeautifulSoup. A fair bit of data cleaning was needed due to missing entries, but overall I had close to 2000 NHL players spanning 11 years and just over 9000 rows of usable data. It should be noted that only data from play at “even-strength” were considered. This is a fairer basis for measurement than situations where one team is, for example, on the Power Play and has a distinct advantage.
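As a rough illustration of the scraping and cleaning step, here is a minimal sketch that parses a stats table with BeautifulSoup and drops rows with missing entries. The HTML snippet, the table id "stats", the column names, and the helper parse_stats_table are all hypothetical stand-ins; the real pages are laid out differently, and this is not the exact code used for the project.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a scraped stats page (real pages differ).
SAMPLE_HTML = """
<table id="stats">
  <thead><tr><th>Player</th><th>Age</th><th>G</th><th>A</th></tr></thead>
  <tbody>
    <tr><td>Player A</td><td>24</td><td>30</td><td>41</td></tr>
    <tr><td>Player B</td><td>31</td><td>12</td><td></td></tr>
  </tbody>
</table>
"""


def parse_stats_table(html, table_id="stats"):
    """Return one dict per table row, with None for missing numeric entries."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]
    rows = []
    for tr in table.find("tbody").find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        record = {}
        for name, value in zip(headers, cells):
            if name == "Player":
                record[name] = value
            else:
                # Empty cells become None so they can be filtered out later.
                record[name] = int(value) if value else None
        rows.append(record)
    return rows


rows = parse_stats_table(SAMPLE_HTML)

# Drop rows with missing entries, mirroring the cleaning described above.
clean_rows = [r for r in rows if all(v is not None for v in r.values())]

# Points are simply Goals + Assists.
for r in clean_rows:
    r["PTS"] = r["G"] + r["A"]
```

In practice the HTML would come from an HTTP request to each player or season page rather than from an inline string.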

Fig 1. Salary cap and maximum salary evolution

It is recognized that there are two different markets for skaters: one for forwards and one for defenders. Figure 1 shows how the salary cap has increased year by year, along with the highest salary paid (all players). (*) The year 2013 was an anomaly due to cap changes after the 2012–2013 lockout.

Data Exploration

The first thing to consider is salary itself, which is shown as a histogram in Figure 2. The majority of salaries are less than one million dollars per season. This is mainly explained by young players who come into the league under an “Entry Level Contract” (ELC), a non-negotiable contract that pays the minimum amount of money per season. Such a contract lasts three years and applies to all players aged 24 or younger when they come into the league. The contract kicks in once a player holds a roster spot for more than 9 games. Salaries are fairly evenly distributed between 2 and 4 million dollars, after which there is a gradual drop-off, with very few players making the “big bucks”.

Fig 2. Histogram of NHL player salaries from 2008–2018

Figure 3 shows salary as a function of age, and it can clearly be seen that most players' ELCs are complete by the age of 22, when they sign a new and often larger contract. Figure 3 also shows that most players are between 24 and 28 years old and that top salaries are typically awarded during those “prime years”. Note that player contracts are typically about 3 years in length, although high-end talent typically signs for 5 years or more, with more money awarded at the beginning of the contract and less as the player ages. After the age of 30, players tend to sign shorter contracts for less money. A number of players were active beyond 40, but they typically commanded less than 6 million dollars a year.

Fig 3. Jointplot showing the relationship between age and salary

Whilst a hockey player has many merits, the one most people are interested in is points. I take a slightly different stance and consider Total Point Shares (TPS), which are shares awarded to players depending on team performance, using a factor known as “Marginal Goals for and against” (https://www.hockey-reference.com/about/point_shares.html). Essentially, the expected number of point shares should be close to the total number of points a player will have during a given season, but the measure also gives negative numbers for players who have been detrimental to team performance. TPS is the sum of the offensive and defensive individual point share values (OPS + DPS). Figure 4 shows the relationship between TPS and salary, and it is quite clear that the more TPS a player has, the more likely he is to be compensated for it. Productivity by putting the puck in the net is what drives success. An interesting feature is that some players are not paid very much but still have large TPS. These are players who are still serving their ELC but entered the league at a slightly older age (though still under 25). The dynamic of the game has changed over the last 10 years: a player was more likely to enter the league at a later age 5–10 years ago than in today's faster-paced game. Player position is also shown here, and interestingly, TPS numbers for defencemen (D) are similar to those of forwards (LW, RW, and C), even though there are fewer defencemen than forwards (the forward-to-defence ratio on the ice is 3:2). It shows, if anything, that the game is controlled at the “blue-line” by quarterback-style defencemen.

Fig 4. Spread of players' Total Point Shares as a function of salary

Player role and its importance

As a GM or owner, I would want to know what qualities to look for in a player if he were to serve a particular purpose. Every player has a natural ability to do some things well, whilst few can do multiple things well. Hockey, like any team sport, works like chess, using the different capabilities of each piece in order to win the game. I wanted to try and identify those capabilities from the statistics I had collected. Figure 5 shows the results of a K-means algorithm used to cluster players based on their specific role. This was done by comparing the percentage of times a player would start his shift in the offensive zone (Offensive Zone Starts, OZS) with the quality of the players he was facing (CFQoC). For forwards, I created three clusters: one for defensive-minded players, one for “two-way” players, and one for purely offensive-minded players. For defencemen, I created two clusters: one for defensive-styled players and one for more offensive-type players. I chose the number of clusters to represent the typical roles players would play during a game. As the distributions of these samples are Gaussian with a mean around 50%, the clusters are fairly evenly distributed around this mean value. The two-way forwards represent 53% of the total distribution, which makes sense given that the game is played through neutral ice and that game stoppages and restarts tend to favour a neutral starting position for each team.

Fig 5. K-means algorithm where players are clustered according to their role on the ice
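The clustering step described above can be sketched roughly as follows with scikit-learn. The data here are synthetic stand-ins rather than real NHL numbers, and the standardisation step and the rule for naming clusters by their mean OZS are my own assumptions for illustration, not necessarily what was done to produce Figure 5.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT real NHL numbers): both columns are
# percentages drawn from a Gaussian centred near 50%.
n_forwards = 600
ozs = rng.normal(50.0, 8.0, n_forwards)      # offensive zone start %
cf_qoc = rng.normal(50.0, 1.5, n_forwards)   # quality of competition faced
X = np.column_stack([ozs, cf_qoc])

# Assumption: put both features on the same scale so that neither
# dominates the Euclidean distance used by K-means.
X_scaled = StandardScaler().fit_transform(X)

# Three clusters for forwards, as in the text; defencemen would use two.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Assumption: name the clusters by their mean OZS, lowest to highest.
order = np.argsort([ozs[labels == k].mean() for k in range(3)])
role_names = {
    int(order[0]): "defensive",
    int(order[1]): "two-way",
    int(order[2]): "offensive",
}
roles = np.array([role_names[int(k)] for k in labels])

for name in ("defensive", "two-way", "offensive"):
    share = (roles == name).mean()
    print(f"{name:>9}: {share:.0%} of forwards")
```

Because K-means needs the number of clusters up front, fixing it at three (or two for defencemen) encodes the hockey knowledge about typical roles rather than letting the data choose.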

Whilst TPS and salary have a clear correlation, I wanted to find other parameters that might also influence salary. I used scikit-learn's Random-Forest classifier to place players into three separate normalized salary bins for each of the roles devised by the K-means algorithm. The model was run 30 times (using bootstrapping and 1000 trees per model run), each time splitting the data into random training and testing samples. Without optimizing the hyperparameters, I have listed the top 15 features for each role (see Figure 6).

Fig 6. Main features, and test scores for different player roles as determined by a Random-Forest classification model when labeling player salaries
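The classification step described above can be sketched roughly as follows. The data are synthetic, the feature names are only borrowed from the text, and the tercile binning and the 70/30 split are assumptions, since the exact bin edges and split used for Figure 6 are not stated. The run and tree counts are reduced so the sketch finishes quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic stand-in data: names are borrowed from the text, values are random.
feature_names = ["TOI", "Age", "TPS", "CF", "GWY"]
n_players = 500
X = rng.normal(size=(n_players, len(feature_names)))

# Fake normalized salary that depends mostly on TOI and TPS, plus noise.
salary = 0.6 * X[:, 0] + 0.3 * X[:, 2] + 0.3 * rng.normal(size=n_players)

# Assumption: three bins with equal numbers of players (tercile cut points).
cuts = np.quantile(salary, [1 / 3, 2 / 3])
y = np.digitize(salary, cuts)  # 0 = low, 1 = mid, 2 = high

n_runs, n_trees = 5, 200  # the text used 30 runs of 1000 trees
importances, test_scores = [], []
for run in range(n_runs):
    # A fresh random train/test split for every run.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=run, stratify=y
    )
    model = RandomForestClassifier(
        n_estimators=n_trees, bootstrap=True, random_state=run
    )
    model.fit(X_train, y_train)
    importances.append(model.feature_importances_)
    test_scores.append(model.score(X_test, y_test))

# Average importances over runs and rank features from most to least important.
mean_importance = np.mean(importances, axis=0)
ranking = [feature_names[i] for i in np.argsort(mean_importance)[::-1]]
print("Features by mean importance:", ranking)
print(f"Mean test accuracy: {np.mean(test_scores):.2f}")
```

Averaging the importances over several random splits makes the ranking less sensitive to which players happen to land in the training set.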

It can be seen that in all five roles, parameters related to TOI (Time On Ice) are the most significant, but age, draft position, and stats related to production (points, assists, goals, TPS, OPS, expected goals for (xGF) and against (xGA), Corsi for (CF) and against (CA), shots (S), and shots attempted (SA)) regularly occur near the top. Blocked shots is an important feature for defensive-styled defencemen and does not appear in any other category. Funnily enough, the number of times a player gives the puck away (GWY) was a major contributor for two-way forwards, but this simply reflects the fact that these players are more likely to handle the puck in the neutral zone under defensive pressure and turn it over. The average prediction scores for each role ranged between 64% and 73% and compared well with the training-set scores.

Overall, there was certainly justification for separating players into different roles, given the different dominant features found for each role.

Modeling salary using multiple linear regression

Whilst I had already performed some level of prediction using classification, I wanted to devise a continuous measure relating player performance to salary. Because of the limited sample sizes in the separate role categories, an independent fit to each role proved challenging, so two samples were devised instead: one for forwards and one for defencemen. Furthermore, I only considered players in the last year of a given contract. This gives a better representation of player behaviour, as I found a stronger correlation between various metrics and salary in those years. This is probably because players know very well that they have to perform in order to sign a new and perhaps better contract. I then investigated collinearity between variables using visualization and by calculating Variance Inflation Factors (VIF), removing parameters with highly correlated values (and obviously leaving one of each group for the fit). A multiple regression fit was made using Statsmodels, combining multiple variables that would hopefully explain as much of the variation in salary as possible, with the data from each of the two sample sets split into training (70%) and test (30%) sets. The model was run 1000 times for each sample set and the average results were considered. A typical result for forwards who were rewarded with larger contracts is summarized in Figure 7, where it can be seen that 66% (after averaging) of the variation can be explained by the input parameters. The fit for defencemen performed slightly worse, with R² = 59%. The RMSE on the test sets was about 0.12 (forwards) and 0.15 (defence). I also considered players who had their salary reduced; using the same input parameters, the model R² scores were 0.42 (RMSE = 0.08) and 0.43 (RMSE = 0.12) for forwards and defence, respectively.
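To make the collinearity check concrete: the variance inflation factor of feature j is 1 / (1 − R²ⱼ), where R²ⱼ comes from regressing feature j on all the other features. Below is a minimal NumPy sketch of that calculation and of a simple pruning loop. The threshold of 5, the helper names, and the synthetic data are assumptions for illustration; the project's own pruning was done by inspection rather than by this exact rule.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the rest."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = np.empty(n_cols)
    for j in range(n_cols):
        target = X[:, j]
        # Design matrix: intercept plus every column except j.
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs


def drop_collinear(X, names, threshold=5.0):
    """Drop the feature with the largest VIF until all VIFs are below threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        vifs = variance_inflation_factors(X)
        worst = int(np.argmax(vifs))
        if vifs[worst] < threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names


# Synthetic stand-in data: points are almost exactly goals + assists,
# so those three columns are strongly collinear.
rng = np.random.default_rng(2)
goals = rng.normal(size=300)
assists = rng.normal(size=300)
points = goals + assists + 0.05 * rng.normal(size=300)
age = rng.normal(size=300)
X = np.column_stack([goals, assists, points, age])

X_kept, kept = drop_collinear(X, ["G", "A", "PTS", "Age"])
print("Features kept:", kept)
```

Dropping one feature at a time and recomputing matters, because removing a single redundant column can bring the VIFs of its neighbours back down to acceptable values.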

Fig 7. Regression model result of forwards' predicted normalized salary vs test-set values. The histogram shows the distribution of the residuals around the 1:1 line.
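The repeated train/test procedure described above can be sketched as follows. To keep the example dependency-light it uses NumPy's least-squares solver rather than Statsmodels, and the data, coefficients, and helper names are synthetic or hypothetical; only the 70/30 split and the 1000 repeats come from the text.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients for y ~ intercept + X."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply fitted coefficients (intercept first) to new rows."""
    return coef[0] + X @ coef[1:]


def repeated_holdout(X, y, n_runs=1000, test_fraction=0.3, seed=0):
    """Average test-set R^2 and RMSE over repeated random train/test splits."""
    rng = np.random.default_rng(seed)
    n_test = int(round(test_fraction * len(y)))
    r2_scores, rmses = [], []
    for _ in range(n_runs):
        idx = rng.permutation(len(y))
        test, train = idx[:n_test], idx[n_test:]
        coef = fit_ols(X[train], y[train])
        resid = y[test] - predict(coef, X[test])
        ss_res = np.sum(resid ** 2)
        ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
        r2_scores.append(1.0 - ss_res / ss_tot)
        rmses.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(r2_scores)), float(np.mean(rmses))


# Synthetic stand-in data: three performance features and a noisy
# "normalized salary" that is linear in them.
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 3))
y = 0.3 + X @ np.array([0.10, 0.05, 0.02]) + 0.08 * rng.normal(size=400)

mean_r2, mean_rmse = repeated_holdout(X, y)
print(f"mean R^2 = {mean_r2:.2f}, mean RMSE = {mean_rmse:.2f}")
```

Averaging over many random splits gives a more stable estimate of out-of-sample R² and RMSE than any single 70/30 split would.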

It is unfortunate that the model cannot explain more than approximately 60% of the variation in salary. However, the model serves as a fair estimator of salaries for the average player, or someone commanding less than 60% of the maximum salary. The unexplained 30–40% thus comes from other sources of variation that performance stats (or at least the ones used here) cannot explain alone. The market is inevitably driven by other revenue-related factors, such as merchandise sales, ticket prices, and promotions, as well as the “like-ability” factor, especially for top-end players whom fans flock to see. Perhaps including merchandise sales, like jersey sales, would help this model? As the model underestimates the upper end of the market (and those with a salary reduction), it might also imply that many players are not performing to their capabilities. Perhaps this is due to undisclosed injuries, or perhaps there are personal issues that we are not aware of. Such “human character” factors are almost impossible to account for. Finally, the assumption of a linear fit is possibly not valid, and a non-linear technique or another ML algorithm might be more applicable. Further testing would be needed.

Conclusions

This project has investigated NHL player value using hockey performance metrics. Close to 2000 hockey players were investigated over a period of 11 seasons from 2008 to 2018. Key statistical parameters were identified for specific types of player role using K-means clustering and Random-Forest classification. Moreover, a multiple linear regression was fit to forwards and to defencemen for players in contract years, with normalized salary estimated in each case. For players who earned a pay-rise, the fits had an RMSE and R² of 0.12 and 66% for forwards and 0.15 and 59% for defencemen, respectively. For players whose salary was reduced, the model fared worse with the same input parameters: the RMSE and R² were 0.08 and 42% for forwards and 0.12 and 43% for defencemen, respectively.

Whilst this model can only account for about 60% of the market value, it would help a GM make decisions about the value of a player who has more of a depth role on a team, and that is invaluable for team success. Putting a quality team together is complex. Take the NHL draft, for example: the first round is always a gift, as you know you are going to acquire a quality player. However, the real work comes in the later rounds, as teams scramble to find those few quality picks that have extreme potential, and that's where scouting, player performance statistics, and potential value matter.