Women’s Soccer Expected Goals — Data Cleaning — Part-Five: Assessing and Transforming Location Coordinates

9 min read · Nov 16, 2021

Part-five of a series for the data cleaning portion of the data science workflow creating a women’s club soccer expected goals (xG) classification model, explaining the process of assessing location-descriptive features and splitting them into individual coordinate features.

GitHub — wswager/womens_soccer_expected_goals_model: Classification model for expected goals (xG)…

Classification model for expected goals (xG) in women’s club soccer, predicting the likelihood that a shot will score…

github.com

Background

Continued from parts-one-through-four of the data cleaning process:

Soccer Expected Goals — Data Cleaning — Part One: Defining the Target Feature

Explaining the process of defining the target feature for the eventual model, goal.

medium.com

Soccer Expected Goals — Data Cleaning — Part Two: Irrelevant Data

Explaining the process of dropping irrelevant data.

medium.com

Soccer Expected Goals — Data Cleaning — Part Three: Correcting Datatypes

Explaining the process of correcting the datatypes.

medium.com

Soccer Expected Goals — Data Cleaning — Part Four: Processing Missing Values

Explaining the process of identifying, assessing, and processing missing values.

medium.com

Introduction

Expected Goals (xG)

xG is used to indicate the quality of a shot.

xG, as a metric, measures the likelihood that a shot will result in a goal based on the characteristics of the shot and the play preceding the shot.

xG is measured on a scale between zero and one, with one representing a goal. For example, a shot with a 0.5 xG indicates a shot having a 50% chance of being a goal.
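
As a rough illustration of that scale, the sketch below converts xG values to percentages and sums them over a handful of shots to estimate how many goals those shots would be expected to produce. The shot values and the helper name xg_to_percent are made up for the example and are not taken from the project data.

```python
# Hypothetical xG values for four shots (illustrative, not project data).
shot_xg = [0.5, 0.25, 0.125, 0.125]


def xg_to_percent(xg):
    """Express an xG value on the zero-to-one scale as a percentage."""
    return xg * 100


for xg in shot_xg:
    print(f"xG {xg:.3f} -> {xg_to_percent(xg):.1f}% chance of a goal")

# Summing xG over several shots gives the number of goals they would be
# expected to yield on average.
total_xg = sum(shot_xg)
print("Total xG:", total_xg)
```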

What is xG?

How the expected goals (xG) metric is calculated and used within soccer.

medium.com

Classification Model

The metric of expected goals is calculated through the use of a classification model.

A classification model refers to a predictive modeling problem where a class label is predicted for a given example of input data.

For this project, a supervised approach was used: the training data for the model includes which shots were goals.

Data Cleaning

Data cleaning is the process of detecting and correcting corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.

Wikipedia

The Data

The data for this project was extracted from StatsBomb’s Open Data.

StatsBomb are a United Kingdom based football (soccer) data analytics company. StatsBomb provide free access to a segment of their proprietary dataset via GitHub: StatsBomb Open Data

For the purposes of this project the relevant data targeted was, primarily, characteristics of shots and, secondarily, characteristics of the plays creating those shots, from women’s club soccer matches.

Note: Assessment of plays creating shots is subjective and based on domain knowledge specific to the sport of soccer

Data Cleaning

Working Data

Previously, in Women’s Soccer Expected Goals — Data Extraction, shot event data and key pass features were extracted for target women’s club competitions, competition ids, and season ids within StatsBomb Open Data.

extracted_data.head()

print("Total Women's Club Competition Shot Events:", len(extracted_data))
Total Women's Club Competition Shot Events: 6114
print('Updated Shot w/ Key Pass Features:', extracted_data.shape[1])
Updated Shot w/ Key Pass Features: 81

Review from Parts-One-through-Four

Previously, in Women’s Soccer Expected Goals — Data Cleaning — Part One: Defining the Target Feature, shot_outcome was updated from a categorical feature to the boolean feature goal: ‘True’ if a shot resulted in a goal, ‘False’ if not. The new feature, goal, represents the target feature for the eventual supervised classification modeling.

extracted_data['goal'].value_counts()
False    5416
True      664
Name: goal, dtype: bool
print("goal Percent 'True':", round((((sum(extracted_data['goal'])) / (len(extracted_data))) * 100), 2), '%')
goal Percent 'True': 10.92 %

Next, in Women’s Soccer Expected Goals — Data Cleaning — Part Two: Irrelevant Data, irrelevant features were dropped:

Features duplicated as a result of extracting and concatenating shot event and the pass event data
Features deemed as not characteristics of shots or plays creating shots

extracted_data.shape[1]
31
list(extracted_data.columns.values)
['period_x',
 'timestamp_x',
 'play_pattern_x',
 'location_x',
 'under_pressure_x',
 'shot_statsbomb_xg',
 'shot_end_location',
 'shot_technique',
 'goal',
 'shot_type',
 'shot_body_part',
 'shot_one_on_one',
 'shot_open_goal',
 'shot_first_time',
 'shot_redirect',
 'shot_deflected',
 'shot_follows_dribble',
 'pass_length',
 'pass_angle',
 'pass_height',
 'pass_type',
 'pass_switch',
 'pass_through_ball',
 'pass_technique',
 'pass_backheel',
 'pass_cross',
 'counterpress',
 'pass_cut_back',
 'pass_inswinging',
 'pass_straight',
 'pass_outswinging']

Next, in Women’s Soccer Expected Goals — Data Cleaning — Part Three: Correcting Datatypes, the data types of the features were assessed and corrected:

Features identified as boolean in nature were updated from object to boolean datatype
timestamp_x was updated from object to datetime datatype
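
A minimal sketch of both conversions on a toy frame follows. It assumes that the raw boolean features hold True where an attribute applies and are missing otherwise, and that timestamp_x is a time-of-day string such as 00:04:21.052. Neither assumption is confirmed here, and the values are illustrative rather than project data.

```python
import numpy as np
import pandas as pd

# Toy rows (illustrative): an object column holding True or missing, plus a
# time-of-day string. Both layouts are assumptions for this sketch.
shots = pd.DataFrame({
    "shot_first_time": [True, np.nan, True],
    "timestamp_x": ["00:04:21.052", "00:17:03.480", "00:44:59.001"],
})

# object -> bool: anything that is not exactly True (including NaN) becomes False.
shots["shot_first_time"] = shots["shot_first_time"].eq(True)

# object -> datetime: with no date in the string, pandas fills in 1900-01-01.
shots["timestamp_x"] = pd.to_datetime(shots["timestamp_x"], format="%H:%M:%S.%f")

print(shots.dtypes)
```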

extracted_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6080 entries, 0 to 6079
Data columns (total 31 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   period_x              6080 non-null   int64         
 1   timestamp_x           6080 non-null   datetime64[ns]
 2   play_pattern_x        6080 non-null   object        
 3   location_x            6080 non-null   object        
 4   under_pressure_x      6080 non-null   bool          
 5   shot_statsbomb_xg     6080 non-null   float64       
 6   shot_end_location     6080 non-null   object        
 7   shot_technique        6080 non-null   object        
 8   goal                  6080 non-null   bool          
 9   shot_type             6080 non-null   object        
 10  shot_body_part        6080 non-null   object        
 11  shot_one_on_one       6080 non-null   bool          
 12  shot_open_goal        6080 non-null   bool          
 13  shot_first_time       6080 non-null   bool          
 14  shot_redirect         6080 non-null   bool          
 15  shot_deflected        6080 non-null   bool          
 16  shot_follows_dribble  6080 non-null   bool          
 17  pass_length           4138 non-null   float64       
 18  pass_angle            4138 non-null   float64       
 19  pass_height           4138 non-null   object        
 20  pass_type             960 non-null    object        
 21  pass_switch           6080 non-null   bool          
 22  pass_through_ball     6080 non-null   bool          
 23  pass_technique        355 non-null    object        
 24  pass_backheel         6080 non-null   bool          
 25  pass_cross            6080 non-null   bool          
 26  counterpress          6080 non-null   bool          
 27  pass_cut_back         6080 non-null   bool          
 28  pass_inswinging       6080 non-null   bool          
 29  pass_straight         6080 non-null   bool          
 30  pass_outswinging      6080 non-null   bool          
dtypes: bool(17), datetime64[ns](1), float64(3), int64(1), object(9)
memory usage: 813.4+ KB

Next, in Soccer Expected Goals — Data Cleaning — Part-Four: Processing Missing Values, missing values were identified, assessed and processed.

pass_length, pass_angle, and pass_height each had 1,942 missing values. It was assumed that these shot events were not preceded by a pass, ‘no pass shots’:

print('pass_length NA:', sum(extracted_data['pass_length'].isna()), '\n', 'pass_angle NA:', sum(extracted_data['pass_angle'].isna()), '\n', 'pass_height NA:', sum(extracted_data['pass_height'].isna()))
pass_length NA: 1942 
 pass_angle NA: 1942 
 pass_height NA: 1942

Missing values in numerical pass-related features were filled with 0
Missing values in categorical pass-related features were filled with ‘No Pass’
Boolean pass-related features did not require filling, as they already indicated ‘False’
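
The three rules above can be sketched on a toy frame as follows. The rows and values are illustrative, not project data, and this is one possible way to apply the fills rather than the exact code from part-four.

```python
import numpy as np
import pandas as pd

# Toy frame: the second shot has no preceding pass (illustrative values only).
shots = pd.DataFrame({
    "pass_length": [12.5, np.nan],
    "pass_angle": [0.8, np.nan],
    "pass_height": ["Ground Pass", np.nan],
    "pass_cross": [True, False],
})

# Numerical pass features: fill missing values with 0.
numeric_cols = ["pass_length", "pass_angle"]
shots[numeric_cols] = shots[numeric_cols].fillna(0)

# Categorical pass features: fill missing values with 'No Pass'.
shots["pass_height"] = shots["pass_height"].fillna("No Pass")

# Boolean pass features already read False, so they are left untouched.
print(shots)
```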

Additional missing values for pass_type were assumed to be passes from open-play and filled with ‘Open Play.’

extracted_data['pass_type'].value_counts(dropna = False)
Open Play       3177
No Pass         1943
Corner           400
Recovery         305
Free Kick        201
Throw-in          42
Interception      10
Kick Off           1
Goal Kick          1
Name: pass_type, dtype: int64

Additional missing values for pass_technique were assumed to be standard passes and were filled with ‘Standard.’

extracted_data['pass_technique'].value_counts(dropna = False)
Standard        3782
No Pass         1943
Through Ball     198
Inswinging        76
Outswinging       55
Straight          26
Name: pass_technique, dtype: int64
extracted_data.isnull().sum()
period_x                0
timestamp_x             0
play_pattern_x          0
location_x              0
under_pressure_x        0
shot_statsbomb_xg       0
shot_end_location       0
shot_technique          0
goal                    0
shot_type               0
shot_body_part          0
shot_one_on_one         0
shot_open_goal          0
shot_first_time         0
shot_redirect           0
shot_deflected          0
shot_follows_dribble    0
pass_length             0
pass_angle              0
pass_height             0
pass_type               0
pass_switch             0
pass_through_ball       0
pass_technique          0
pass_backheel           0
pass_cross              0
counterpress            0
pass_cut_back           0
pass_inswinging         0
pass_straight           0
pass_outswinging        0
dtype: int64

Working data prior to step-five:

print("Total Women's Club Competition Shot Events:", len(extracted_data))
Total Women's Club Competition Shot Events: 6080
print('Updated Shot w/ Key Pass Features:', extracted_data.shape[1])
Updated Shot w/ Key Pass Features: 31
extracted_data.head()

Location Features

The shot event data contains two location-descriptive features, both of which include multiple coordinates:

location_x describes the location on the field from which the shot was taken and contains x and y-coordinates

extracted_data['location_x'].head()
0    [109.0, 46.0]
1    [113.0, 35.0]
2     [94.0, 43.0]
3     [86.0, 34.0]
4     [94.0, 33.0]
Name: location_x, dtype: object

shot_end_location describes the location on the field where the shot’s path ended and contains x, y, and z-coordinates

extracted_data['shot_end_location'].head()
0         [112.0, 45.0]
1    [120.0, 32.9, 0.4]
2    [120.0, 42.8, 0.5]
3    [119.0, 33.3, 0.5]
4    [120.0, 34.8, 0.5]
Name: shot_end_location, dtype: object
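
Before splitting, it can help to check how many coordinates each row actually holds, since the first row above has only two values while the others have three. The sketch below does this for the five rows shown; the variable names are chosen for the example.

```python
import pandas as pd

# The first five shot_end_location values shown above.
end_location = pd.Series([
    [112.0, 45.0],
    [120.0, 32.9, 0.4],
    [120.0, 42.8, 0.5],
    [119.0, 33.3, 0.5],
    [120.0, 34.8, 0.5],
], name="shot_end_location")

# Count how many coordinates each row holds (2 = no z-coordinate, 3 = with z).
coordinate_counts = end_location.map(len).value_counts()
print(coordinate_counts)
```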

location_x

Split location_x into separate x and y-coordinates

import pandas as pd
shot_location_df = pd.DataFrame(extracted_data['location_x'].tolist(), index = extracted_data.index)
shot_location_df.head()

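Replace location_x with shot_location_x and shot_location_y

Note: StatsBomb lists the lengthwise pitch coordinate first and the widthwise coordinate second. In the naming used here, shot_location_y holds the first (lengthwise) value and shot_location_x holds the second (widthwise) value.

extracted_data.drop('location_x', axis = 1, inplace = True)
extracted_data['shot_location_y'] = shot_location_df[0]
extracted_data['shot_location_x'] = shot_location_df[1]
extracted_data[['shot_location_y', 'shot_location_x']].head()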
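
For reference, the same split can be sketched on a small standalone frame, naming the new columns at construction time and joining them back rather than assigning them one by one. This is an alternative arrangement of the steps above, not the project code; the frame name shots is chosen for the example.

```python
import pandas as pd

# The first five location_x values shown earlier.
shots = pd.DataFrame({
    "location_x": [[109.0, 46.0], [113.0, 35.0], [94.0, 43.0],
                   [86.0, 34.0], [94.0, 33.0]],
})

# Expand each two-element list into two columns, keeping the original index
# and following this article's naming (first element -> shot_location_y).
coords = pd.DataFrame(
    shots["location_x"].tolist(),
    index=shots.index,
    columns=["shot_location_y", "shot_location_x"],
)

# Swap the list column for the two numeric columns.
shots = shots.drop(columns="location_x").join(coords)
print(shots)
```

Passing index=shots.index keeps the new columns aligned with the original rows, which matters when the frame's index is not a simple 0-to-n range.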

extracted_data['shot_location_x'].describe()
count    6043.000000
mean       40.286745
std         9.782598
min        12.400000
25%        33.000000
50%        40.100000
75%        47.000000
max        68.200000
Name: shot_location_x, dtype: float64
extracted_data['shot_location_y'].describe()
count    6043.000000
mean      104.003690
std         8.819068
min        77.800000
25%        97.700000
50%       105.600000
75%       111.000000
max       120.000000
Name: shot_location_y, dtype: float64

shot_end_location

Split shot_end_location into separate x, y, and z-coordinates

end_location_df = pd.DataFrame(extracted_data['shot_end_location'].tolist(), index = extracted_data.index)
end_location_df.head()

Assess the values of shot_end_location’s z-coordinate

print('shot_end_location z-coordinate NA:', (sum(end_location_df[2].isna())), '\n', 'Percent shot_end_location z-coordinate NA:', (round((((sum(end_location_df[2].isna())) / (len(extracted_data))) * 100), 2)), '%')
shot_end_location z-coordinate NA: 1796 
 Percent shot_end_location z-coordinate NA: 29.54 %
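
A compact way to see where those missing values come from is sketched below on the five shot_end_location rows shown earlier. Rows with only two coordinates are padded with NaN in the third column when the lists are expanded, and the mean of the boolean mask gives the missing share directly. The variable names are chosen for the example.

```python
import pandas as pd

# The first five shot_end_location values shown earlier; the first has no z.
end_location = pd.Series([
    [112.0, 45.0],
    [120.0, 32.9, 0.4],
    [120.0, 42.8, 0.5],
    [119.0, 33.3, 0.5],
    [120.0, 34.8, 0.5],
], name="shot_end_location")

# Shorter lists are padded with NaN when expanded into columns.
end_location_df = pd.DataFrame(end_location.tolist(), index=end_location.index)

# True where the z-coordinate (third column) is missing.
z_missing = end_location_df[2].isna()
print("z-coordinate NA:", z_missing.sum())
print("Percent z-coordinate NA:", round(z_missing.mean() * 100, 2), "%")
```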

Drop shot_end_location z-coordinate due to too many missing values

Replace shot_end_location with end_location_x and end_location_y

extracted_data.drop('shot_end_location', axis = 1, inplace = True)
extracted_data['end_location_y'] = end_location_df[0]
extracted_data['end_location_x'] = end_location_df[1]
extracted_data[['end_location_y', 'end_location_x']].head()

extracted_data['end_location_x'].describe()
count    6043.000000
mean       40.155453
std         6.314044
min         0.100000
25%        36.400000
50%        40.000000
75%        43.800000
max        80.000000
Name: end_location_x, dtype: float64
extracted_data['end_location_y'].describe()
count    6043.000000
mean      116.005527
std         6.252442
min        84.000000
25%       115.000000
50%       119.000000
75%       120.000000
max       120.000000
Name: end_location_y, dtype: float64
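
The maxima of 120 and 80 in the summaries above match StatsBomb's 120-by-80 pitch coordinate system. A quick range check along those lines is sketched below on toy rows. The constants and frame name are chosen for the example, and the column naming follows this article, with the _y columns running along the length of the pitch and the _x columns across its width.

```python
import pandas as pd

PITCH_LENGTH = 120  # extent of the lengthwise axis (held in the *_y columns here)
PITCH_WIDTH = 80    # extent of the widthwise axis (held in the *_x columns here)

# Toy rows (illustrative); the last one is deliberately out of range.
shots = pd.DataFrame({
    "end_location_y": [112.0, 120.0, 125.0],
    "end_location_x": [45.0, 32.9, 40.0],
})

# between() is inclusive at both ends by default.
in_bounds = (
    shots["end_location_y"].between(0, PITCH_LENGTH)
    & shots["end_location_x"].between(0, PITCH_WIDTH)
)

# Rows that fall outside the pitch.
print(shots[~in_bounds])
```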

Results

The values for the two location-descriptive features, location_x and shot_end_location, were split into separate coordinates. The location-descriptive features were then dropped and replaced with new numeric features for the separate coordinate values:

shot_location_x
shot_location_y
end_location_x
end_location_y

Note: The z-coordinate for shot_end_location was dropped.

Note: The previous location-descriptive features had the object datatype. The new coordinate features have the float64 datatype.

print("Total Women's Club Competition Shot Events:", len(extracted_data))
Total Women's Club Competition Shot Events: 6080
print('Updated Shot w/ Key Pass Features:', extracted_data.shape[1])
Updated Shot w/ Key Pass Features: 33
extracted_data.head()

extracted_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6080 entries, 0 to 6079
Data columns (total 33 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   period_x              6080 non-null   int64         
 1   timestamp_x           6080 non-null   datetime64[ns]
 2   play_pattern_x        6080 non-null   object        
 3   under_pressure_x      6080 non-null   bool          
 4   shot_statsbomb_xg     6080 non-null   float64       
 5   shot_technique        6080 non-null   object        
 6   goal                  6080 non-null   bool          
 7   shot_type             6080 non-null   object        
 8   shot_body_part        6080 non-null   object        
 9   shot_one_on_one       6080 non-null   bool          
 10  shot_open_goal        6080 non-null   bool          
 11  shot_first_time       6080 non-null   bool          
 12  shot_redirect         6080 non-null   bool          
 13  shot_deflected        6080 non-null   bool          
 14  shot_follows_dribble  6080 non-null   bool          
 15  pass_length           6080 non-null   float64       
 16  pass_angle            6080 non-null   float64       
 17  pass_height           6080 non-null   object        
 18  pass_type             6080 non-null   object        
 19  pass_switch           6080 non-null   bool          
 20  pass_through_ball     6080 non-null   bool          
 21  pass_technique        6080 non-null   object        
 22  pass_backheel         6080 non-null   bool          
 23  pass_cross            6080 non-null   bool          
 24  counterpress          6080 non-null   bool          
 25  pass_cut_back         6080 non-null   bool          
 26  pass_inswinging       6080 non-null   bool          
 27  pass_straight         6080 non-null   bool          
 28  pass_outswinging      6080 non-null   bool          
 29  shot_location_y       6080 non-null   float64       
 30  shot_location_x       6080 non-null   float64       
 31  end_location_y        6080 non-null   float64       
 32  end_location_x        6080 non-null   float64       
dtypes: bool(17), datetime64[ns](1), float64(7), int64(1), object(7)
memory usage: 908.4+ KB

Continued

Part-six continues the series for the data cleaning portion of the data science workflow creating a women’s club soccer expected goals (xG) classification model, explaining the process of identifying, assessing, and processing outliers.

If you liked this post, please give it some applause and follow me, as I will be continuing with a series of posts for each process through the data science workflow of my Women’s Soccer Expected Goals Model:

Wes Swager — Medium

Read writing from Wes Swager on Medium. Data scientist, soccer fan, fitness enthusiast. Every day, Wes Swager and…

medium.com

Also, follow me on Twitter, where I post regularly about tactical observations for soccer:

mobile.twitter.com

I would love to read any feedback you might have in the comments.

Written by Wes Swager