Introduction

Expected Goals (xG)

xG is used to indicate the quality of a shot.

xG, as a metric, measures the likelihood that a shot will result in a goal based on the characteristics of the shot and the play preceding the shot.

xG is measured on a scale from zero to one, where one represents a certain goal. For example, a shot with an xG of 0.5 has a 50% chance of being a goal.
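To make the scale concrete, here is a small sketch using made-up xG values (they are not taken from the dataset). Because each xG value is a probability, adding the values for a set of shots gives the number of goals those shots would be expected to produce on average.

```python
# Hypothetical xG values for five shots (illustrative only, not from StatsBomb data)
shot_xg = [0.5, 0.08, 0.76, 0.03, 0.13]

# Each value is the estimated probability that the shot becomes a goal
for xg in shot_xg:
    print(f"xG {xg:.2f} -> {xg:.0%} chance of a goal")

# The sum of the probabilities is the expected number of goals from these shots
total_xg = sum(shot_xg)
print("Expected goals from these shots:", round(total_xg, 2))
```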

Classification Model

The metric of expected goals is calculated through the use of a classification model.

Classification is a predictive modeling problem in which a class label is predicted for a given example of input data.

For this project, a supervised approach was used: the training data for the model included a label indicating which shots were goals.
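As a rough illustration of supervised classification, the sketch below fits a tiny logistic regression with NumPy. This is not the model built in this series: the single feature (distance to goal), the training rows, and the helper name predict_xg are all assumptions made up for the example. The point is only that labeled shots go in and a goal probability comes out.

```python
import numpy as np

# Made-up training data: distance to goal (arbitrary units) and whether the shot scored
distance = np.array([5.0, 8.0, 10.0, 12.0, 18.0, 22.0, 25.0, 30.0])
goal = np.array([1, 1, 0, 1, 0, 0, 0, 0])

# Standardize the single feature and add an intercept column
mean, std = distance.mean(), distance.std()
X = np.column_stack([np.ones_like(distance), (distance - mean) / std])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit the weights with plain gradient descent on the log loss
weights = np.zeros(X.shape[1])
for _ in range(2000):
    predictions = sigmoid(X @ weights)
    gradient = X.T @ (predictions - goal) / len(goal)
    weights -= 0.1 * gradient

def predict_xg(shot_distance):
    """Return the modeled goal probability for a shot from this distance (hypothetical helper)."""
    z = weights[0] + weights[1] * (shot_distance - mean) / std
    return float(sigmoid(z))

print("Close shot:", round(predict_xg(6.0), 2))
print("Long shot:", round(predict_xg(28.0), 2))
```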

Data Cleaning

Data cleaning is the process of detecting and correcting corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.

Wikipedia

The Data

The data for this project was extracted from StatsBomb’s Open Data.

StatsBomb are a United Kingdom based football (soccer) data analytics company. StatsBomb provide free access to a segment of their proprietary dataset via GitHub: StatsBomb Open Data

For the purposes of this project the relevant data targeted was, primarily, characteristics of shots and, secondarily, characteristics of the plays creating those shots, from women’s club soccer matches.

Note: Assessment of plays creating shots is subjective and based on domain knowledge specific to the sport of soccer.

Data Cleaning

Working Data

Previously, in Women’s Soccer Expected Goals — Data Extraction, shot event data and key pass features were extracted for target women’s club competitions, competition ids, and season ids within StatsBomb Open Data.

extracted_data.head()
print("Total Women's Club Shot Events:", len(extracted_data))
Total Women's Club Competition Shot Events: 6114

print('Updated Shot w/ Key Pass Features:', extracted_data.shape[1])
Updated Shot w/ Key Pass Features: 81

Review from Parts-One-through-Four

Previously, in Women’s Soccer Expected Goals — Data Cleaning — Part One: Defining the Target Feature, shot_outcome was updated from a categorical feature to the boolean feature goal, ‘True’ if a shot resulted in a goal, ‘False’ if not. The new feature, goal represents the target feature for the eventual supervised classification modeling.

extracted_data['goal'].value_counts()
False    5416
True      664
Name: goal, dtype: bool

print("goal Percent 'True':", round((((sum(extracted_data['goal'])) / (len(extracted_data))) * 100), 2), '%')
goal Percent 'True': 10.92 %
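For readers who skipped Part One, the conversion can be sketched roughly as follows. The toy rows and the outcome labels are assumptions for illustration (in particular, that a scored shot is labeled 'Goal'); the actual code lives in the earlier post.

```python
import pandas as pd

# Toy stand-in for the shot data; the outcome labels are assumed, not copied from the dataset
shots = pd.DataFrame({"shot_outcome": ["Goal", "Saved", "Off T", "Blocked", "Goal", "Wayward"]})

# A shot counts as a goal only when its outcome label is exactly 'Goal'
shots["goal"] = shots["shot_outcome"] == "Goal"

# shot_outcome is no longer needed once the boolean target exists
shots = shots.drop(columns="shot_outcome")

# The mean of a boolean column is the share of True values
print(shots["goal"].value_counts())
print("goal Percent 'True':", round(shots["goal"].mean() * 100, 2), "%")
```

Using the column mean avoids the nested sum and len arithmetic while giving the same percentage.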

Next, in Women’s Soccer Expected Goals — Data Cleaning — Part Two: Irrelevant Data, irrelevant features were dropped:

  • Features duplicated as a result of extracting and concatenating shot event and the pass event data
  • Features deemed as not characteristics of shots or plays creating shots
extracted_data.shape[1]
31
list(extracted_data.columns.values)
['period_x',
'timestamp_x',
'play_pattern_x',
'location_x',
'under_pressure_x',
'shot_statsbomb_xg',
'shot_end_location',
'shot_technique',
'goal',
'shot_type',
'shot_body_part',
'shot_one_on_one',
'shot_open_goal',
'shot_first_time',
'shot_redirect',
'shot_deflected',
'shot_follows_dribble',
'pass_length',
'pass_angle',
'pass_height',
'pass_type',
'pass_switch',
'pass_through_ball',
'pass_technique',
'pass_backheel',
'pass_cross',
'counterpress',
'pass_cut_back',
'pass_inswinging',
'pass_straight',
'pass_outswinging']
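Dropping features of this kind is usually a one-line call. In the sketch below the columns player_y and match_id are hypothetical stand-ins for a merge duplicate and a feature unrelated to the shot; they are not the columns that were actually dropped in Part Two.

```python
import pandas as pd

# Toy frame; 'player_y' and 'match_id' are hypothetical names used only for illustration
shots = pd.DataFrame({
    "period_x": [1, 2],
    "shot_type": ["Open Play", "Penalty"],
    "player_y": ["A", "B"],      # stand-in for a column duplicated by the merge
    "match_id": [101, 102],      # stand-in for a column unrelated to the shot itself
})

# List the unwanted columns explicitly so the decision is easy to review
irrelevant = ["player_y", "match_id"]

# drop() raises a KeyError if a listed column is missing, which catches typos
shots = shots.drop(columns=irrelevant)

print(shots.shape[1])
print(list(shots.columns))
```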

Next, in Women’s Soccer Expected Goals — Data Cleaning — Part Three: Correcting Datatypes, the data types of the features were assessed and corrected:

  • Features identified as boolean in nature were updated from object to boolean datatype
  • timestamp_x was updated from object to datetime datatype
extracted_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6080 entries, 0 to 6079
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 period_x 6080 non-null int64
1 timestamp_x 6080 non-null datetime64[ns]
2 play_pattern_x 6080 non-null object
3 location_x 6080 non-null object
4 under_pressure_x 6080 non-null bool
5 shot_statsbomb_xg 6080 non-null float64
6 shot_end_location 6080 non-null object
7 shot_technique 6080 non-null object
8 goal 6080 non-null bool
9 shot_type 6080 non-null object
10 shot_body_part 6080 non-null object
11 shot_one_on_one 6080 non-null bool
12 shot_open_goal 6080 non-null bool
13 shot_first_time 6080 non-null bool
14 shot_redirect 6080 non-null bool
15 shot_deflected 6080 non-null bool
16 shot_follows_dribble 6080 non-null bool
17 pass_length 4138 non-null float64
18 pass_angle 4138 non-null float64
19 pass_height 4138 non-null object
20 pass_type 960 non-null object
21 pass_switch 6080 non-null bool
22 pass_through_ball 6080 non-null bool
23 pass_technique 355 non-null object
24 pass_backheel 6080 non-null bool
25 pass_cross 6080 non-null bool
26 counterpress 6080 non-null bool
27 pass_cut_back 6080 non-null bool
28 pass_inswinging 6080 non-null bool
29 pass_straight 6080 non-null bool
30 pass_outswinging 6080 non-null bool
dtypes: bool(17), datetime64[ns](1), float64(3), int64(1), object(9)
memory usage: 813.4+ KB
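The two datatype corrections could look something like the sketch below. It assumes the raw boolean-like columns hold True where a flag was recorded and a missing value otherwise, and that timestamps are clock strings such as '00:04:21.052'; both are assumptions about the raw layout, and the toy rows are invented.

```python
import pandas as pd

# Toy rows: True where the flag was recorded and missing otherwise (assumed raw layout)
shots = pd.DataFrame({
    "under_pressure_x": [True, None, True],
    "shot_first_time": [None, None, True],
    "timestamp_x": ["00:04:21.052", "00:17:03.900", "00:44:59.120"],
})

# Comparing against True maps recorded flags to True and missing values to False
for column in ["under_pressure_x", "shot_first_time"]:
    shots[column] = shots[column].eq(True)

# Parse the clock strings; with no date given, pandas fills in 1900-01-01
shots["timestamp_x"] = pd.to_datetime(shots["timestamp_x"], format="%H:%M:%S.%f")

print(shots.dtypes)
```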

Next, in Soccer Expected Goals — Data Cleaning — Part-Four: Processing Missing Values, missing values were identified, assessed and processed.

pass_length, pass_angle, and pass_height each had 1,942 missing values. It was assumed that these shot events were not preceded by a pass, ‘no pass shots’:

print('pass_length NA:', sum(extracted_data['pass_length'].isna()), '\n', 'pass_angle NA:', sum(extracted_data['pass_angle'].isna()), '\n', 'pass_height NA:', sum(extracted_data['pass_height'].isna()))
pass_length NA: 1942
pass_angle NA: 1942
pass_height NA: 1942

For these ‘no pass shots’:

  • Missing values in numerical pass-related features were replaced with 0
  • Missing values in categorical pass-related features were replaced with ‘No Pass’
  • Boolean pass-related features did not require filling as they already indicated ‘False’
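A minimal sketch of that fill, assuming a missing pass_length is what marks a ‘no pass’ shot. The toy values are invented, and the real Part Four code may differ in its details.

```python
import numpy as np
import pandas as pd

# Toy rows: the second shot has no preceding pass, so its pass features are missing
shots = pd.DataFrame({
    "pass_length": [12.5, np.nan, 30.1],
    "pass_angle": [0.4, np.nan, -1.2],
    "pass_height": ["Ground Pass", np.nan, "High Pass"],
    "pass_type": [np.nan, np.nan, "Corner"],
})

# Treat a missing pass_length as the marker of a 'no pass' shot
no_pass = shots["pass_length"].isna()

# Numerical features get 0, categorical features get the label 'No Pass'
shots.loc[no_pass, ["pass_length", "pass_angle"]] = 0
shots.loc[no_pass, ["pass_height", "pass_type"]] = "No Pass"

print(shots)
```

Filling through a row mask, rather than a blanket fillna, leaves other gaps (such as the missing pass_type in the first row) untouched so they can be handled separately.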

Additional missing values for pass_type were assumed to be passes from open-play and filled with ‘Open Play.’

extracted_data['pass_type'].value_counts(dropna = False)
Open Play       3177
No Pass         1943
Corner           400
Recovery         305
Free Kick        201
Throw-in          42
Interception      10
Kick Off           1
Goal Kick          1
Name: pass_type, dtype: int64

Additional missing values for pass_technique were assumed to be standard passes and were filled with ‘Standard.’

extracted_data['pass_technique'].value_counts(dropna = False)
Standard        3782
No Pass         1943
Through Ball     198
Inswinging        76
Outswinging       55
Straight          26
Name: pass_technique, dtype: int64

extracted_data.isnull().sum()
period_x 0
timestamp_x 0
play_pattern_x 0
location_x 0
under_pressure_x 0
shot_statsbomb_xg 0
shot_end_location 0
shot_technique 0
goal 0
shot_type 0
shot_body_part 0
shot_one_on_one 0
shot_open_goal 0
shot_first_time 0
shot_redirect 0
shot_deflected 0
shot_follows_dribble 0
pass_length 0
pass_angle 0
pass_height 0
pass_type 0
pass_switch 0
pass_through_ball 0
pass_technique 0
pass_backheel 0
pass_cross 0
counterpress 0
pass_cut_back 0
pass_inswinging 0
pass_straight 0
pass_outswinging 0
dtype: int64
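The remaining gaps can be filled with per-column defaults and then verified, roughly as in this sketch. The toy rows are invented; the default labels follow the assumptions stated above for pass_type and pass_technique.

```python
import numpy as np
import pandas as pd

# Toy rows after the 'No Pass' step: only passes with unrecorded details remain missing
shots = pd.DataFrame({
    "pass_type": ["Corner", np.nan, "No Pass", np.nan],
    "pass_technique": [np.nan, "Through Ball", "No Pass", np.nan],
})

# Default labels for the remaining gaps, one per column
defaults = {"pass_type": "Open Play", "pass_technique": "Standard"}
shots = shots.fillna(value=defaults)

# Confirm that no missing values remain in any column
missing = shots.isnull().sum()
print(missing)
```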

Working data prior to step-five:

print("Total Women's Club Shot Events:", len(extracted_data))
Total Women's Club Competition Shot Events: 6114

print('Updated Shot w/ Key Pass Features:', extracted_data.shape[1])
Updated Shot w/ Key Pass Features: 31

extracted_data.head()

Location Features

The shot event data contains two location-descriptive features, both of which include multiple coordinates:

  • location_x describes the location on the field from which the shot was taken and contains x and y-coordinates
extracted_data['location_x'].head()
0    [109.0, 46.0]
1 [113.0, 35.0]
2 [94.0, 43.0]
3 [86.0, 34.0]
4 [94.0, 33.0]
Name: location_x, dtype: object
  • shot_end_location describes the location on the field where the shot’s path ended and contains x, y, and z-coordinates
extracted_data['shot_end_location'].head()
0         [112.0, 45.0]
1 [120.0, 32.9, 0.4]
2 [120.0, 42.8, 0.5]
3 [119.0, 33.3, 0.5]
4 [120.0, 34.8, 0.5]
Name: shot_end_location, dtype: object

location_x

Split location_x into separate x and y-coordinates

shot_location_df = pd.DataFrame(extracted_data['location_x'].tolist(), index = extracted_data.index)
shot_location_df.head()

Replace location_x with shot_location_x and shot_location_y

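extracted_data.drop('location_x', axis = 1, inplace = True)
# Index 0 runs along the length of the pitch (0 to 120) and index 1 across its width (0 to 80);
# note that the new column names label index 0 as y and index 1 as x
extracted_data['shot_location_y'] = shot_location_df[0]
extracted_data['shot_location_x'] = shot_location_df[1]
extracted_data[['shot_location_y', 'shot_location_x']].head()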
extracted_data['shot_location_x'].describe()
count    6043.000000
mean       40.286745
std         9.782598
min        12.400000
25%        33.000000
50%        40.100000
75%        47.000000
max        68.200000
Name: shot_location_x, dtype: float64

extracted_data['shot_location_y'].describe()
count    6043.000000
mean      104.003690
std         8.819068
min        77.800000
25%        97.700000
50%       105.600000
75%       111.000000
max       120.000000
Name: shot_location_y, dtype: float64
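The split itself is easy to try on a toy frame. The sketch below reuses the first three location_x values shown earlier, with an arbitrary non-default index to show why the index is passed through.

```python
import pandas as pd

# Three coordinate pairs in the same list format as location_x, on an arbitrary index
shots = pd.DataFrame(
    {"location_x": [[109.0, 46.0], [113.0, 35.0], [94.0, 43.0]]},
    index=[3, 7, 9],
)

# tolist() yields a list of lists, which DataFrame expands into columns 0 and 1
coords = pd.DataFrame(shots["location_x"].tolist(), index=shots.index)

# Reusing the original index keeps every coordinate aligned with its shot
print(coords)
print(coords.dtypes)
```

Without index = shots.index, the new frame would be numbered 0, 1, 2 and assigning its columns back would misalign rows whenever the original index is not a clean range.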

shot_end_location

Split shot_end_location into separate x, y, and z-coordinates

end_location_df = pd.DataFrame(extracted_data['shot_end_location'].tolist(), index = extracted_data.index)
end_location_df.head()

Assess the values of shot_end_location’s z-coordinate

print('shot_end_location z-coordinate NA:', (sum(end_location_df[2].isna())), '\n', 'Percent shot_end_location z-coordinate NA:', (round((((sum(end_location_df[2].isna())) / (len(extracted_data))) * 100), 2)), '%')
shot_end_location z-coordinate NA: 1796
Percent shot_end_location z-coordinate NA: 29.54 %
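The missing z values arise because some end locations hold only two coordinates. A small sketch with the first four shot_end_location values shown earlier illustrates how pandas pads the shorter lists.

```python
import pandas as pd

# End locations from the sample above: the first shot has no z-coordinate
end_locations = pd.Series([
    [112.0, 45.0],
    [120.0, 32.9, 0.4],
    [120.0, 42.8, 0.5],
    [119.0, 33.3, 0.5],
])

# Lists of unequal length are padded with NaN when expanded into columns
end_df = pd.DataFrame(end_locations.tolist(), index=end_locations.index)

# isna().mean() gives the share of missing z values directly
z_missing = int(end_df[2].isna().sum())
z_missing_pct = round(end_df[2].isna().mean() * 100, 2)
print("z-coordinate NA:", z_missing, "| percent:", z_missing_pct, "%")
```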

Drop the shot_end_location z-coordinate because nearly 30% of its values are missing

Replace shot_end_location with end_location_x and end_location_y

extracted_data.drop('shot_end_location', axis = 1, inplace = True)
extracted_data['end_location_y'] = end_location_df[0]
extracted_data['end_location_x'] = end_location_df[1]
extracted_data[['end_location_y', 'end_location_x']].head()
extracted_data['end_location_x'].describe()
count    6043.000000
mean       40.155453
std         6.314044
min         0.100000
25%        36.400000
50%        40.000000
75%        43.800000
max        80.000000
Name: end_location_x, dtype: float64

extracted_data['end_location_y'].describe()
count    6043.000000
mean      116.005527
std         6.252442
min        84.000000
25%       115.000000
50%       119.000000
75%       120.000000
max       120.000000
Name: end_location_y, dtype: float64

Results

The values for the two location-descriptive features, location_x and shot_end_location, were split into separate coordinates. The location-descriptive features were then dropped and replaced with new numeric features for the separate coordinate values:

  • shot_location_x
  • shot_location_y
  • end_location_x
  • end_location_y

Note: The z-coordinate for shot_end_location was dropped.

Note: The previous location-descriptive features had the object datatype. The new coordinate features have the float64 datatype.

print("Total Women's Club Shot Events:", len(extracted_data))
Total Women's Club Competition Shot Events: 6114

print('Updated Shot w/ Key Pass Features:', extracted_data.shape[1])
Updated Shot w/ Key Pass Features: 33

extracted_data.head()

extracted_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6080 entries, 0 to 6079
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 period_x 6080 non-null int64
1 timestamp_x 6080 non-null datetime64[ns]
2 play_pattern_x 6080 non-null object
3 under_pressure_x 6080 non-null bool
4 shot_statsbomb_xg 6080 non-null float64
5 shot_technique 6080 non-null object
6 goal 6080 non-null bool
7 shot_type 6080 non-null object
8 shot_body_part 6080 non-null object
9 shot_one_on_one 6080 non-null bool
10 shot_open_goal 6080 non-null bool
11 shot_first_time 6080 non-null bool
12 shot_redirect 6080 non-null bool
13 shot_deflected 6080 non-null bool
14 shot_follows_dribble 6080 non-null bool
15 pass_length 6080 non-null float64
16 pass_angle 6080 non-null float64
17 pass_height 6080 non-null object
18 pass_type 6080 non-null object
19 pass_switch 6080 non-null bool
20 pass_through_ball 6080 non-null bool
21 pass_technique 6080 non-null object
22 pass_backheel 6080 non-null bool
23 pass_cross 6080 non-null bool
24 counterpress 6080 non-null bool
25 pass_cut_back 6080 non-null bool
26 pass_inswinging 6080 non-null bool
27 pass_straight 6080 non-null bool
28 pass_outswinging 6080 non-null bool
29 shot_location_y 6080 non-null float64
30 shot_location_x 6080 non-null float64
31 end_location_y 6080 non-null float64
32 end_location_x 6080 non-null float64
dtypes: bool(17), datetime64[ns](1), float64(7), int64(1), object(7)
memory usage: 908.4+ KB

Continued

Part-six continues the series for the data cleaning portion of the data science workflow creating a women’s club soccer expected goals (xG) classification model, explaining the process of identifying, assessing, and processing outliers.
