DeepxG Tutorial Part 1: Train your own Deep Learning Model to predict Expected Goals (xG)
Dataset creation using Python, StatsBomb open data, mplsoccer and pandas
Introduction
Most of us who follow football have heard of a statistic called Expected Goals, popularly referred to as xG. Expected Goals is a statistic used to quantify the quality of a goal-scoring chance. Usually xG is computed on a per-shot basis, with the goal (pun fully intended) of estimating the probability of a shot resulting in a goal. The figure shown on television screens is usually the total xG of all the shots taken from the start of the game up to the moment the stat is displayed, i.e. the sum of the xG of every shot.
There are plenty of ways xG can be utilized; here are a few:
- Managers can assess whether their team is creating high quality chances or not.
- Managers and attacking players can identify the zones of the pitch that correspond to a high xG, this information lets players know whether to shoot or to move the ball to a different zone.
- Defenders can identify which zones they should occupy. Ideally, defenders would want to occupy zones that correspond to a high xG to deter attacks and force the ball into zones with low xG.
- Analysts and scouts can identify which strikers to buy based on their xG over one or more seasons.
In this series of articles we will walk through how anyone can build their own xG model. We’ll make use of the data open sourced by StatsBomb, available here. We use 17 seasons of event data from La Liga to train, validate and test our model.
This will be a three-part series:
- Data Preparation : In this part we will show how the dataset is created, covering the data transformations, cleaning etc. involved.
- Model Training : We’ll show you how to build, train and test a deep learning model leveraging PyTorch and PyTorch Lightning.
- Model Evaluation : We’ll run some experiments to further evaluate our model and show the different ways an xG model can be used.
Data Preparation
Today’s article focuses on creating a dataset that can be used to train any machine/deep learning model to predict xG. We will primarily use the pandas and mplsoccer libraries. Of the 17 seasons of La Liga data available, we use 15 seasons as our training set, one season for validation and another for testing. To get the event data for each season, we first need the match ids of all the games for which StatsBomb has data available. This is what we do in the code snippet below.
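A minimal sketch of what this step might look like. The competition id 11 corresponds to La Liga in the StatsBomb open data, but the season ids, the CSV layout and the `fetch_match_ids` callable are all illustrative assumptions, not the exact code the article used:

```python
# Hedged sketch of the match-id collection step. The season-id splits below
# are placeholders; only competition_id=11 (La Liga) comes from the open data.
import csv
import os

LA_LIGA_COMPETITION_ID = 11

def save_match_ids(fetch_match_ids, splits, out_dir="."):
    """Write one <split>_match_id.csv file per dataset split.

    fetch_match_ids: callable(competition_id, season_id) -> list of match ids.
    splits: dict mapping split name ("train", ...) -> list of season ids.
    """
    paths = {}
    for split, season_ids in splits.items():
        match_ids = []
        for season_id in season_ids:
            match_ids.extend(fetch_match_ids(LA_LIGA_COMPETITION_ID, season_id))
        path = os.path.join(out_dir, f"{split}_match_id.csv")
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["match_id"])
            writer.writerows([[m] for m in match_ids])
        paths[split] = path
    return paths

def main():
    # Assumed mplsoccer API: Sbopen().match(competition_id, season_id)
    # returns a DataFrame with a match_id column.
    from mplsoccer import Sbopen
    parser = Sbopen()

    def fetch(competition_id, season_id):
        matches = parser.match(competition_id=competition_id,
                               season_id=season_id)
        return matches["match_id"].tolist()

    # Placeholder season ids: several for training, one each for val/test.
    save_match_ids(fetch, {"train": [4, 21, 41], "validation": [22],
                           "test": [23]})
```

Keeping the download behind a `main()` function makes the id-to-CSV logic easy to reuse with any source of match ids.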
Please note that we’ve already obtained the season ids we want to use from the competitions data released by StatsBomb. On running this function we’ll have three match_id.csv files, one for each of the train, validation and test datasets we’ll create.
Features
In the next step we’ll extract some features from the event data for each match to create our datasets. We use the following features to train our model.
- Distance to goal : The distance from where the shot was taken to the center of the goal.
- Angle between goal posts : In his book and course Soccermatics, David Sumpter explains the importance of the visibility of the face of the goal to the player taking a shot in determining whether it is easy or hard to score a goal from a given position. The angle from where the shot is taken to the two goal posts is a good estimate of how much of the face of the goal can be seen.
- Shot Outcome : To determine whether a shot resulted in a goal or not.
- Technique name : The type of shot the player takes, e.g. volley, half-volley etc.
- Under Pressure : To identify whether the player taking the shot is under pressure or not. The more pressure a player is under the harder it is for them to have a clean strike, so theoretically it should be harder for players to score when they are under pressure.
- Body Part Name : The part of the body that was used to take the shot. This feature primarily helps us distinguish between headers and strikes with either feet.
- Position Name : The position of the player taking the shot. We would assume that a defender is worse at finishing compared to a forward. Having this feature helps the model learn which positions the good finishers tend to play in.
- Pass Technique Name : This feature is used to describe the type of pass that was provided to the player taking the shot. It can be used to determine if it was a through ball, cross etc.
- Shot Zone : We divide the pitch into 80 zones similar to how we did in our previous article while creating dynamic passing networks. We use the zone from which the shot is taken as a feature. This is because certain zones are easier to score from compared to others.
- Pass Sequence : We keep track of all the passes involved in the build-up to the shot. The intention is that the passing sequence might be indicative of the amount of space created and the type of chance, for example whether it was a counter-attack or resulted from a counter-pressing situation where the ball was won high up the pitch.
All of this is implemented in the function shown below.
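A hedged sketch of such a feature-extraction function. The flattened column names (type_name, possession_team_name, sub_type_name, pass_shot_assist etc.) follow the style mplsoccer uses for StatsBomb events but should be treated as assumptions, as should the list of sequence-preserving event types; simplified versions of the geometry helpers are inlined to keep the sketch self-contained:

```python
import math

import pandas as pd

# Event types assumed to keep a passing sequence alive.
SEQUENCE_EVENTS = ["Pass", "Ball Receipt*", "Carry", "Pressure",
                   "Dribble", "Shot"]

# Simplified geometry helpers, inlined so the sketch is self-contained
# (StatsBomb pitch: 120 x 80, goal posts at (120, 36) and (120, 44)).
def get_shot_distance(x, y):
    return math.hypot(120.0 - x, 40.0 - y)

def get_shot_angle(x, y):
    return abs(math.atan2(44.0 - y, 120.0 - x)
               - math.atan2(36.0 - y, 120.0 - x))

def get_zone(x, y):
    return min(int(x // 12), 9) * 8 + min(int(y // 10), 7)

def get_features_for_match(events: pd.DataFrame) -> pd.DataFrame:
    """Extract one feature row per shot from a match's event data frame."""
    events = events.sort_values("index")   # restore temporal order
    prev_possession_team = None
    pass_sequence = []                     # passes in the current move
    pass_technique = float("nan")
    rows = []
    for _, event in events.iterrows():
        team = event["possession_team_name"]
        # A turnover or a non-sequence event ends the current move.
        if (team != prev_possession_team
                or event["type_name"] not in SEQUENCE_EVENTS):
            pass_sequence, pass_technique = [], float("nan")
        prev_possession_team = team
        if event["type_name"] == "Pass":
            pass_sequence.append((event["x"], event["y"],
                                  event["end_x"], event["end_y"]))
            if event.get("pass_shot_assist") == True:  # NaN-safe check
                pass_technique = event.get("technique_name")
        elif event["type_name"] == "Shot":
            rows.append({
                "distance_to_goal": get_shot_distance(event["x"], event["y"]),
                "angle": get_shot_angle(event["x"], event["y"]),
                "is_goal": int(event["outcome_name"] == "Goal"),
                "technique_name": event["technique_name"],
                "under_pressure": event.get("under_pressure") == True,
                "body_part_name": event["body_part_name"],
                "position_name": event["position_name"],
                "pass_technique_name": pass_technique,
                "shot_zone": get_zone(event["x"], event["y"]),
                "pass_sequence": list(pass_sequence),
            })
            # Set pieces have no open-play build-up behind them.
            if event.get("sub_type_name") in ("Free Kick", "Penalty"):
                rows[-1]["pass_sequence"] = []
            pass_sequence, pass_technique = [], float("nan")  # move is over
    return pd.DataFrame(rows)
```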
The function first gets the event data for a given match id. We sort the event data frame by its index to ensure that all the events are in the correct temporal order (sorted by time). For each new match we initialize two variables, prev_possession_team and pass_sequence.
prev_possession_team keeps track of the team that was in possession up to the previous event. This variable is important in determining whether a passing sequence has ended: a passing sequence ends when the team that had been making the passes loses the ball and is no longer in possession.
The pass_sequence variable keeps track of all the passes in a given move. A passing sequence can be interrupted in multiple ways, such as a player being dispossessed, miscontrolling the ball, or the ball going out of play. We also keep track of the type of pass, i.e. through ball, inswinging cross etc., if the pass resulted in a shot.
When an event is a shot, we gather all the features associated with it. We note the location of the shot to compute its distance to goal and the angle between the two goal posts. If no pass technique was found prior to the shot being taken, the pass technique is recorded as NaN. If the play was a free kick or a penalty, we empty the passing sequence accumulated up to that point. Finally, we empty the passing sequence array after every shot so that a new attacking move starts fresh.
The last part of the function detects when a passing sequence ends, irrespective of whether a shot was taken or not. We assume the sequence is continuing as long as every event belongs to a list of sequence-preserving event types and possession hasn’t changed to the other team. The Pressure event covers the case where an opposition player is applying pressure on the player in possession of the ball.
Below are the functions used for finding the distance to the goal, the angle between the goal posts and the zone the ball is in. Credit for the get_shot_angle function goes to this post on Stack Overflow.
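A sketch of what these helpers might look like, assuming StatsBomb's 120 × 80 coordinate system with the goal centre at (120, 40) and the posts at (120, 36) and (120, 44); the 10 × 8 grid inside get_zone is an assumption consistent with the 80-zone layout described above:

```python
import math

GOAL_CENTRE = (120.0, 40.0)
POST_NEAR = (120.0, 36.0)
POST_FAR = (120.0, 44.0)

def get_shot_distance(x: float, y: float) -> float:
    """Euclidean distance from the shot location to the centre of the goal."""
    return math.hypot(GOAL_CENTRE[0] - x, GOAL_CENTRE[1] - y)

def get_shot_angle(x: float, y: float) -> float:
    """Angle (in radians) subtended by the two goal posts at the shot
    location, computed from the atan2 of the vectors to each post."""
    angle_near = math.atan2(POST_NEAR[1] - y, POST_NEAR[0] - x)
    angle_far = math.atan2(POST_FAR[1] - y, POST_FAR[0] - x)
    angle = abs(angle_far - angle_near)
    return min(angle, 2 * math.pi - angle)  # guard against wrap-around

def get_zone(x: float, y: float) -> int:
    """Map a pitch coordinate to one of 80 zones (10 columns x 8 rows)."""
    col = min(int(x // 12), 9)   # 120 / 12 = 10 columns along the length
    row = min(int(y // 10), 7)   # 80 / 10 = 8 rows across the width
    return col * 8 + row
```

Shots taken nearer the goal but from tight angles get a small subtended angle, which is exactly the signal the model needs to separate them from central chances at the same distance.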
Data Transformation
The final stage of our dataset preparation involves some data transformations. We perform the following transformations:
- Convert all categorical variables to a unique id. We have the following categorical variables.
CATEGORICAL_VARIABLES = [SHOT_TECHNIQUE_NAME, BODY_PART_NAME, POSITION_NAME, PASS_TECHNIQUE_NAME]
- Convert the co-ordinates from where the shot was taken and where the pass was made to a zone number.
- Standardize the two continuous variables, shot distance and angle, to ensure that the difference in their orders of magnitude isn’t a problem during model training.
First we convert the shot and pass co-ordinates to a zone using the utility function shown earlier. If the co-ordinates are NaN we assign a default zone of 80 (our 80 zones are numbered 0–79, since indexing starts from 0, so index 80 corresponds to a fallback zone).
Next we transform our categorical and continuous variables using some neat pandas tricks. Remember that all parameters for data transformations and mappings must be learned from the training data only. For this reason, we compute the mean and standard deviation of each continuous variable on the training set and save them to a json file. The same mean and standard deviation are then applied to the raw validation and test data after loading the saved json file.
For the categorical variables, we store all the categories of each variable found in the training set in a dictionary and save that dictionary to a json file too. When transforming the validation and test data, any category not present in the training set is given a unique id equal to the number of categories, leveraging the fact that indexing starts from 0, so the index equal to the length of the category list is never used by a real category.
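The fit-on-train/apply-everywhere split can be sketched as follows. The column names and the json layout are assumptions mirroring the constants above, not the article's exact code:

```python
# Hedged sketch: learn transformation parameters on the training set only,
# persist them to json, then apply them unchanged to validation/test data.
import json

import pandas as pd

CONTINUOUS_VARIABLES = ["distance_to_goal", "angle"]
CATEGORICAL_VARIABLES = ["technique_name", "body_part_name",
                         "position_name", "pass_technique_name"]

def fit_transform_params(train_df: pd.DataFrame, path: str) -> dict:
    """Compute means/stds and category vocabularies from the training set
    and save them to a json file."""
    params = {"continuous": {}, "categories": {}}
    for col in CONTINUOUS_VARIABLES:
        params["continuous"][col] = {"mean": float(train_df[col].mean()),
                                     "std": float(train_df[col].std())}
    for col in CATEGORICAL_VARIABLES:
        params["categories"][col] = sorted(train_df[col].dropna()
                                           .unique().tolist())
    with open(path, "w") as f:
        json.dump(params, f)
    return params

def apply_transforms(df: pd.DataFrame, params: dict) -> pd.DataFrame:
    """Standardize continuous columns and map categories to integer ids.
    Categories unseen in training map to len(categories), a reserved id."""
    df = df.copy()
    for col, stats in params["continuous"].items():
        df[col] = (df[col] - stats["mean"]) / stats["std"]
    for col, cats in params["categories"].items():
        mapping = {c: i for i, c in enumerate(cats)}
        df[col] = df[col].map(lambda v: mapping.get(v, len(cats)))
    return df
```

The same `apply_transforms` call is used for all three splits, so no statistic from the validation or test data ever leaks into the features the model trains on.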
Conclusion
Now we have our training, validation and test sets ready to train a deep learning/machine learning model. In Part 2 of this series we’ll walk through some PyTorch and PyTorch Lightning code to set up the data modules, model architecture, training logic etc. Drop a comment below if you think we can use some other cool features to train an xG model.
If you’re interested in continuing to learn more about how data science and machine learning can be applied to the world of football do follow us and check out some of our previous articles!