Hockey Analytics w/ Python

Elijah Cavan
Published in Top Level Sports
4 min read · Jun 24, 2021

With the Stanley Cup Final approaching, I’ve decided to share some work I’ve been doing with tracking data from NHL games. This is the next frontier of analytics: every team sports league wants to understand how to evaluate players based on their positioning on the field. (Baseball is the exception, since most of the game is a one-on-one matchup between batter and pitcher, which makes evaluation easier.) This is especially important in sports like hockey and soccer, where it is hard to evaluate players beyond traditional stats like goals, plus-minus, assists, etc.

I acquired data from a company called Sportslogiq, which gave me the ability to study a regular season game between the Penguins and Capitals.

This is an example of what the data looks like:

Hockey Tracking Data (Image by Author)

As you can see, I get the game and period times, the action taken by a player (pass, faceoff, shot, etc.), the players currently on the ice, the zone (offensive, neutral or defensive) and the x, y coordinates of where the event happened. Using this data, we can create useful plots (using Seaborn, for example) to visualize shot attempts and passes from the home and away teams. For example, if I wanted to view the shot attempts:

import matplotlib.pyplot as plt
import seaborn as sns

# One point per shot attempt, colored by the team that took it
sns.scatterplot(data=team_events_determined, x="xPlotCoord", y="yPlotCoord", hue="teamShorthand")
plt.title('Shot Plot (PlotCoords) by Team')
plt.show()

Which gives me a plot that looks like this:

Shot Attempts by Team (Image by Author)

The only bits I’ve left out are filtering out shots taken outside the offensive zone and keeping only shots in the dataframe (i.e., filtering out event types like pass and faceoff). I’ve also adjusted the data so that one team’s shots from all 3 periods appear on the same side of the rink (since we know teams switch sides each period).
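The filtering and side-flipping steps just described could look roughly like the sketch below. It is not the code used for the plot above: the column names `eventType`, `zone` and `period`, the zone label `'oz'`, the choice to flip even-numbered periods, and the assumption that center ice sits at the origin are all hypothetical, and the real Sportslogiq schema may differ.

```python
import pandas as pd

def normalize_shots(events: pd.DataFrame) -> pd.DataFrame:
    """Keep offensive-zone shots and mirror coordinates for flipped periods."""
    # Hypothetical column names and labels; adjust to the real schema.
    is_shot = events["eventType"] == "shot"
    in_offensive_zone = events["zone"] == "oz"
    shots = events.loc[is_shot & in_offensive_zone].copy()

    # Teams switch ends each period, so mirror even-numbered periods
    # through center ice (assumed here to be the point (0, 0)).
    flip = shots["period"] % 2 == 0
    coords = ["xPlotCoord", "yPlotCoord"]
    shots.loc[flip, coords] = -shots.loc[flip, coords]
    return shots

# Made-up events, only to exercise the function.
demo = pd.DataFrame({
    "eventType": ["shot", "pass", "shot", "shot"],
    "zone": ["oz", "oz", "nz", "oz"],
    "period": [1, 1, 2, 2],
    "xPlotCoord": [60.0, 10.0, 5.0, -70.0],
    "yPlotCoord": [10.0, 0.0, 0.0, -20.0],
})
shots = normalize_shots(demo)
```

The result can be passed straight to the scatterplot call in place of the unfiltered dataframe.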

Similarly, we can create plots to show us how teams are passing the puck. For example, this plot shows all the completed passes by Washington defensemen.

import matplotlib.pyplot as plt

# Keep only passes and receptions made by Washington defensemen
team_rosters_events_only_pass_w_d = team_rosters_events_only_pass_w[team_rosters_events_only_pass_w.primaryPosition == 'D']
team_rosters_events_reception_w_d = team_rosters_events_reception_w[team_rosters_events_reception_w.primaryPosition == 'D']
fig = plt.figure()
ax1 = fig.add_subplot(111)
x1 = team_rosters_events_only_pass_w_d['xPlotCoord']
x2 = team_rosters_events_reception_w_d['xPlotCoord']
y1 = team_rosters_events_only_pass_w_d['yPlotCoord']
y2 = team_rosters_events_reception_w_d['yPlotCoord']
ax1.scatter(x1, y1, s=10, c='b', marker="s", label='Pass')
ax1.scatter(x2, y2, s=10, c='r', marker="o", label='Reception')

def connectpoints(x1, y1, x2, y2):
    # Draw a black line from the pass location to the reception location
    plt.plot([x1, x2], [y1, y2], 'k-')

# Assumes the i-th pass row corresponds to the i-th reception row
for i in range(len(x2)):
    connectpoints(x1.iloc[i], y1.iloc[i], x2.iloc[i], y2.iloc[i])
plt.title('WSH Pass Plot Defensemen')
plt.show()

You could similarly look at passes by Pittsburgh, for certain positions, players or for the whole team. The pass plot looks like this:

Pass Plot for Capitals Dmen (Image by Author)

The last thing I want to show is how to calculate Corsi based on the players who are on the ice. Corsi is a statistic that tries to capture whether your team is generating shot attempts while you are on the ice (a good Corsi rating) or whether your line is mostly pinned in the defensive zone during your shift (a bad Corsi rating). In particular, the Corsi ratio is the number of shot attempts for divided by the total shot attempts (for and against) while the player was on the ice.
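As a quick illustration of that ratio, here is a minimal sketch. The function name `corsi_for_pct` and the numbers are made up for the example, and returning NaN when there are no attempts at all is a design choice of this sketch, not something taken from the data.

```python
def corsi_for_pct(attempts_for: int, attempts_against: int) -> float:
    """Share of on-ice shot attempts that were taken by the player's own team."""
    total = attempts_for + attempts_against
    if total == 0:
        # No attempts either way, so the ratio is undefined.
        return float("nan")
    return attempts_for / total

# A player on the ice for 12 attempts for and 8 against (invented numbers)
print(corsi_for_pct(12, 8))  # 12 of 20 attempts -> 0.6
```

A value above 0.5 means the player's team took more than half of the shot attempts while that player was on the ice.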

Given that the column for players on the ice is a dictionary stored as a string, this can be a bit awkward to do. This was my solution:

#### change string dictionary to dictionary ####
rosters_events_shots['playersOnIce'] = rosters_events_shots['playersOnIce'].apply(lambda x: eval(x))

### Split the dictionary for each column into lists based on if the team is the Capitals or the Penguins ###
rosters_events_shots.loc[rosters_events_shots['teamId'] == 30, 'teamPlayersOnIds'] = rosters_events_shots['playersOnIce'].apply(pd.Series)['30']
rosters_events_shots.loc[rosters_events_shots['teamId'] == 30, 'opponentPlayersOnIds'] = rosters_events_shots['playersOnIce'].apply(pd.Series)['31']

#### counts the number of times a particular player is on the ice for a shot attempt ####
rosters_events_shots_success['opponentPlayersOnIds'].explode().value_counts()

### calculate the corsi rating ###
corsi.loc[corsi['team'] == 30, 'corsi'] = (corsi['Sucess_shots_PIT'])/(corsi['Success_shot_WSH'] + corsi['Fail_shots_WSH'])
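To make the parse, split and count steps easier to follow, here is a small self-contained sketch on made-up data. It assumes `playersOnIce` holds a string such as `"{'30': [...], '31': [...]}"` keyed by team id, that exactly two teams appear in each dictionary, and it swaps in `ast.literal_eval` for `eval` because it only accepts Python literals. The player ids are invented and the result names `attempts_for` and `attempts_against` are hypothetical.

```python
import ast
import pandas as pd

# Toy shot-attempt rows: team ids are string keys, player ids are invented.
shots = pd.DataFrame({
    "teamId": [30, 30, 31],
    "playersOnIce": [
        "{'30': [8, 19], '31': [87, 58]}",
        "{'30': [8, 77], '31': [87, 71]}",
        "{'30': [19, 77], '31': [58, 71]}",
    ],
})

# Turn each string into a real dictionary.
shots["playersOnIce"] = shots["playersOnIce"].apply(ast.literal_eval)

def own_and_opponent(on_ice: dict, team_id: int) -> tuple:
    """Split one on-ice dictionary into (shooting team ids, opponent ids)."""
    own = str(team_id)
    opponent = next(key for key in on_ice if key != own)
    return on_ice[own], on_ice[opponent]

pairs = [own_and_opponent(d, t) for d, t in zip(shots["playersOnIce"], shots["teamId"])]
shots["teamPlayersOnIds"] = pd.Series([p[0] for p in pairs], index=shots.index)
shots["opponentPlayersOnIds"] = pd.Series([p[1] for p in pairs], index=shots.index)

# One row per (attempt, player), then count the appearances of each player.
attempts_for = shots["teamPlayersOnIds"].explode().value_counts()
attempts_against = shots["opponentPlayersOnIds"].explode().value_counts()
```

Here `attempts_for` counts how often each player was on the ice for an attempt by their own team, and `attempts_against` counts the attempts they were on the ice for as the opponent.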

Right now there is no Kaggle notebook for this work (it’s on my local computer), but I will link it here when it is available. If you have questions, feel free to reach out using the links below.

Finally, we get a dataframe that we can sort to find the best and worst players by Corsi over the course of this particular game:

Top 5 players by Corsi (Image by Author)
Bottom 5 players by Corsi (Image by Author)
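One way to pull the top and bottom of such a table is with pandas' `nlargest` and `nsmallest`. The sketch below uses invented players and counts, and the column names `attempts_for` and `attempts_against` are hypothetical rather than the ones in my dataframe.

```python
import pandas as pd

# Invented per-player counts, purely for illustration.
corsi = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "attempts_for": [12, 5, 9, 3],
    "attempts_against": [8, 15, 9, 7],
})
total_attempts = corsi["attempts_for"] + corsi["attempts_against"]
corsi["corsi"] = corsi["attempts_for"] / total_attempts

top = corsi.nlargest(2, "corsi")      # best players by Corsi
bottom = corsi.nsmallest(2, "corsi")  # worst players by Corsi
```

With a single game of data the sample is small, so a ranking like this is best read as a snapshot rather than a verdict on a player.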

I had a ton of fun doing this, and I hope it shows you some of the ways NHL teams are calculating advanced metrics to evaluate players!

As usual, thanks for reading. If you want to see more of my work, visit:

https://elicavan.wixsite.com/site

https://www.linkedin.com/in/elijah-cavan-msc-14b0bab1/
