Soccer Analytics with Python

Published in Top Level Sports · 4 min read · Jun 24, 2021

In my last post https://ecavan.medium.com/hockey-analytics-w-python-7974748c6e88, I looked at how tracking data can be used to assess players on NHL teams. I mentioned how important this tracking data is in particular for sports like Hockey and Soccer, where it is hard to evaluate players outside of traditional statistics.

In this post (great timing as the Euro group stage just ended at the time of writing this) I’ll look at how to apply the same type of data to assess Soccer (European football) players.

The data, taken again from Sportslogiq, looks something like this:

As you can see, we get tracking data: the speed and coordinates of the ball, the coordinates of each player inside the frame, the game and ball status (live/dead ball, game pauses), and the ball contact info, which tells us about set pieces. The data is also quite messy, as you can see from the Team and JerseyNumber columns, which are not consistent.

The most interesting part of this data for me is that it can be used to determine the formation of the two teams in every frame. Also, with the helpful package mplsoccer https://pypi.org/project/mplsoccer/, it becomes really easy to plot the formations on a soccer pitch. For example, to get the formation for team 1:

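import matplotlib.pyplot as plt
from mplsoccer import Pitch

# gb1x, gb1y: per-player positions for team 1 from the groupby described below
pitch = Pitch(axis=True, label=True, tick=True)
fig, ax = pitch.draw()
pitch.scatter(gb1x['X2'], gb1y['Y2'], ax=ax, marker="+", color='red')
plt.show()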

Which returns:

Here I’ve skipped some small steps. I needed to rescale the data from the coordinates in the dataframe (-52 to 52 for the width of the pitch, for example) to the coordinates used on the mplsoccer plot (0–120 for the width). I also used a groupby to find each player's typical position on the pitch; I used the median rather than the mean because it is less affected by outliers.
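The skipped steps can be sketched roughly as follows. This is a minimal illustration, not the exact notebook code: the -52 to 52 range and the 0–120 target come from the text above, while the -34 to 34 range for the other axis, the 0–80 target for it, the toy data and the helper names `normalize` and `median_positions` are assumptions made for the example.

```python
import pandas as pd

# Raw x range of -52..52 and target 0..120 are taken from the post.
# The raw y range (-34..34) and target 0..80 are ASSUMED for illustration.
RAW_X_HALF, RAW_Y_HALF = 52.0, 34.0
PITCH_X, PITCH_Y = 120.0, 80.0


def normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Linearly rescale raw X/Y into plot coordinates X2/Y2 (hypothetical helper)."""
    out = df.copy()
    out["X2"] = (out["X"] + RAW_X_HALF) / (2 * RAW_X_HALF) * PITCH_X
    out["Y2"] = (out["Y"] + RAW_Y_HALF) / (2 * RAW_Y_HALF) * PITCH_Y
    return out


def median_positions(df: pd.DataFrame) -> pd.DataFrame:
    """Median plot position per player; the median resists outlier frames."""
    return df.groupby("JerseyNumber")[["X2", "Y2"]].median().reset_index()


# Toy frames for two players (made-up numbers, not Sportlogiq data)
tm1 = pd.DataFrame({
    "JerseyNumber": [1, 1, 1, 7, 7, 7],
    "X": [-52.0, -50.0, -48.0, 0.0, 10.0, 52.0],
    "Y": [0.0, 0.0, 2.0, -34.0, 0.0, 34.0],
})
positions = median_positions(normalize(tm1))
print(positions)
```

The resulting `X2` and `Y2` columns can then be passed to `pitch.scatter` in the same way as in the plotting snippet above.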

In a similar way we can find the formation for team 2:

We can see both teams have very different strategies. This is particularly apparent when we plot both teams on the same pitch, for example:

Both Teams on the Same pitch (Image by Author)

As you can see, the blue team plays a much tighter defence than the red team, which likes to spread out wide.

The other analytics I looked at were the total distance and average speed of the players for each team. For example, for the total distance in the ‘X’ direction for team 1:

tm1['dx'] = tm1.groupby('JerseyNumber')['X'].diff()
gb1dx = tm1.groupby('JerseyNumber').dx.apply(lambda c: c.abs().sum()).reset_index(name='distanceX')

Here I find the difference between consecutive X coordinates for each player (dx) and add up the absolute value of each dx. To find the total distance, I used the Pythagorean formula with the total X and Y distances. To find the average speed, I divide the total distance by the elapsed time, which is the number of ‘dx’ or ‘dy’ values I summed divided by 25 (the data was filmed at a frame rate of 25 fps). I also multiplied by 3.6 to change units from m/s to km/hr. For team 1, I get a dataframe like this:
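Put together, the calculation described above might look something like the sketch below. It assumes the coordinates are in metres and that consecutive rows for a player are consecutive frames; the function name `tracking_metrics`, the output column names other than `distanceX`, and the toy data are made up for illustration.

```python
import numpy as np
import pandas as pd

FPS = 25          # frame rate stated in the post
MS_TO_KMH = 3.6   # m/s -> km/hr


def tracking_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Total distance and average speed per player (hypothetical helper)."""
    df = df.copy()
    # Frame-to-frame displacement per player, as absolute values
    df["adx"] = df.groupby("JerseyNumber")["X"].diff().abs()
    df["ady"] = df.groupby("JerseyNumber")["Y"].diff().abs()
    out = df.groupby("JerseyNumber").agg(
        distanceX=("adx", "sum"),
        distanceY=("ady", "sum"),
        n_steps=("adx", "count"),  # number of dx values actually summed
    ).reset_index()
    # Pythagorean formula applied to the total X and Y distances
    out["distance"] = np.hypot(out["distanceX"], out["distanceY"])
    # Elapsed time in seconds = number of frame steps / frames per second
    out["seconds"] = out["n_steps"] / FPS
    out["avg_speed_kmh"] = out["distance"] / out["seconds"] * MS_TO_KMH
    return out


# Toy data: one player, three frames (assumed to be in metres)
toy = pd.DataFrame({
    "JerseyNumber": [1, 1, 1],
    "X": [0.0, 0.3, 0.3],
    "Y": [0.0, 0.0, 0.4],
})
metrics = tracking_metrics(toy)
print(metrics)
```

An alternative would be to sum the per-frame Euclidean steps, `np.hypot(dx, dy)`, which measures the path length directly and generally gives a larger total than combining the X and Y totals.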

Tracking Metrics for Team 1 (Image by Author)

The player with JerseyNumber 1 is the goalie, so it makes sense that his average speed is the lowest. As for the other players, the top speed for a player in soccer is about 34 km/hr, so it makes sense for them to have average speeds in the low to mid 20s in km/hr. In particular, Jersey Numbers 2 and 7 are quite fast (on Team 2, the fastest player had an average speed of 24 km/hr).

I’ll make sure to link the entire Kaggle notebook when it is available. Thanks for reading!

As usual, if you want to support my work, check out:

https://elicavan.wixsite.com/site

https://www.linkedin.com/in/elijah-cavan-msc-14b0bab1/

Written by Elijah Cavan