Soccer Expected Goals — Data Cleaning — Part-Four: Processing Missing Values

Wes Swager
10 min read · Oct 27, 2021

This is part four of the data cleaning portion of a data science workflow for creating a soccer expected goals classification model. It explains the process of identifying, assessing, and processing missing values.

Expected Goals (xG)

xG measures the quality of a shot.

xG, as a metric, indicates the likelihood that a shot will result in a goal based on the characteristics of the shot and the play preceding the shot.

xG is measured on a scale from zero to one, with one representing a certain goal. For example, a shot with an xG of 0.5 has a 50% chance of resulting in a goal.

Classification Model

The metric of expected goals is calculated through the use of a classification model.

Classification is a predictive modeling problem in which a class label is predicted for a given example of input data.

For this project, a supervised approach was used: the training data for the model records which shots were goals.

The Data

The data for this project was extracted from StatsBomb’s Open Data.

StatsBomb are a United Kingdom-based football (soccer) data analytics company. StatsBomb provide free access to a segment of their proprietary dataset via GitHub: StatsBomb Open Data

Data Cleaning

Data cleaning is the process of detecting and correcting corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.

Wikipedia

Data Cleaning: Identifying Missing Values

Shot events data was previously extracted as part of the data extraction process (see Background — Previous Steps above).

Note: extracted_data contains 6,080 events with 31 features

Review the current non-null counts

extracted_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6080 entries, 0 to 6079
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 period_x 6080 non-null int64
1 timestamp_x 6080 non-null datetime64[ns]
2 play_pattern_x 6080 non-null object
3 location_x 6080 non-null object
4 under_pressure_x 6080 non-null bool
5 shot_statsbomb_xg 6080 non-null float64
6 shot_end_location 6080 non-null object
7 shot_technique 6080 non-null object
8 goal 6080 non-null bool
9 shot_type 6080 non-null object
10 shot_body_part 6080 non-null object
11 shot_one_on_one 6080 non-null bool
12 shot_open_goal 6080 non-null bool
13 shot_first_time 6080 non-null bool
14 shot_redirect 6080 non-null bool
15 shot_deflected 6080 non-null bool
16 shot_follows_dribble 6080 non-null bool
17 pass_length 4138 non-null float64
18 pass_angle 4138 non-null float64
19 pass_height 4138 non-null object
20 pass_type 960 non-null object
21 pass_switch 6080 non-null bool
22 pass_through_ball 6080 non-null bool
23 pass_technique 355 non-null object
24 pass_backheel 6080 non-null bool
25 pass_cross 6080 non-null bool
26 counterpress 6080 non-null bool
27 pass_cut_back 6080 non-null bool
28 pass_inswinging 6080 non-null bool
29 pass_straight 6080 non-null bool
30 pass_outswinging 6080 non-null bool
dtypes: bool(17), datetime64[ns](1), float64(3), int64(1), object(9)
memory usage: 813.4+ KB
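
The same review can be narrowed to only the columns that have gaps. The following is a minimal sketch, assuming a small hypothetical DataFrame named toy in place of extracted_data (its values are illustrative and not taken from the StatsBomb data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-in for extracted_data (illustrative values only)
toy = pd.DataFrame({
    'shot_type': ['Open Play', 'Penalty', 'Open Play', 'Free Kick'],
    'pass_length': [12.5, np.nan, 30.1, np.nan],
    'pass_angle': [0.4, np.nan, -1.2, np.nan],
    'pass_type': [np.nan, np.nan, 'Corner', np.nan],
})

# Count missing values per column, then keep only columns with at least one
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```

Run against extracted_data, a summary like this would be expected to list only the pass-related features discussed in this section.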

Upon initial review, all of the features with missing values are pass-related. These features were extracted and concatenated from pass events identified as key passes for shot events, meaning the pass immediately preceding the shot (the assist, if the shot resulted in a goal), as previously explained in Soccer Expected Goals — Data Extraction — Part-Two: Key Pass Features.

No Pass Shots

Note: A ‘no pass shot’ will refer to a shot event which has been deemed to have not been preceded by a pass (unassisted if the shot were to result in a goal).

print('pass_length NA:', extracted_data['pass_length'].isna().sum(), '\n',
      'pass_angle NA:', extracted_data['pass_angle'].isna().sum(), '\n',
      'pass_height NA:', extracted_data['pass_height'].isna().sum())

pass_length NA: 1942
pass_angle NA: 1942
pass_height NA: 1942

The first observation is that pass_length, pass_angle, and pass_height each have the same number of missing values, 1,942.

no_length = extracted_data.loc[extracted_data['pass_length'].isna()]
no_angle = extracted_data.loc[extracted_data['pass_angle'].isna()]

print('pass_length NA =/= pass_angle NA:', 1942 - no_length['pass_angle'].isna().sum(), '\n',
      'pass_length NA =/= pass_height NA:', 1942 - no_length['pass_height'].isna().sum(), '\n',
      'pass_angle NA =/= pass_height NA:', 1942 - no_angle['pass_height'].isna().sum())

pass_length NA =/= pass_angle NA: 0
pass_length NA =/= pass_height NA: 0
pass_angle NA =/= pass_height NA: 0

Upon further investigation, the 1,942 missing values for pass_length, pass_angle, and pass_height are the same events.
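
The co-occurrence check can also be written without hard-coding the count of 1,942. This is a minimal sketch on hypothetical toy data (the name toy and its values are assumptions for illustration), flagging any event where only some of the three features are missing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real check would run on extracted_data
toy = pd.DataFrame({
    'pass_length': [12.5, np.nan, 30.1, np.nan],
    'pass_angle': [0.4, np.nan, -1.2, np.nan],
    'pass_height': ['Ground Pass', np.nan, 'High Pass', np.nan],
})

pass_cols = ['pass_length', 'pass_angle', 'pass_height']
na_mask = toy[pass_cols].isna()

# Rows where at least one, but not all, of the three features is missing
partial_rows = na_mask.any(axis=1) & ~na_mask.all(axis=1)

# True when the missing values always fall on the same events
same_events = not partial_rows.any()
print('Missing on the same events:', same_events)
```

If no row is partially missing, the three features are missing together on every affected event, which is the condition the subtraction-based check above tests.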

print('Percent shot events w/ no pass:', round(extracted_data['pass_length'].isna().sum() / len(extracted_data) * 100, 2), '%')

Percent shot events w/ no pass: 31.94 %

The 1,942 missing values for pass_length, pass_angle, and pass_height represent 31.94% of the events, which is too significant to drop.
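
The percentage is simply 1,942 divided by 6,080. As a sketch, the same figure can be reproduced on a hypothetical series built to have those counts (the filler value 10.0 is arbitrary), using the fact that the mean of a boolean mask is the fraction of True values:

```python
import numpy as np
import pandas as pd

# Hypothetical series with the same counts as the article:
# 1,942 missing values out of 6,080 events
pass_length = pd.Series([np.nan] * 1942 + [10.0] * 4138)

# The mean of a boolean mask is the share of True values
share_missing = round(pass_length.isna().mean() * 100, 2)
print('Percent shot events w/ no pass:', share_missing, '%')
```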

It can be assumed these shot events are no pass shots.

Note: This observation is subjective, based on domain knowledge specific to the sport of soccer.

A shot not being preceded by a pass is significantly descriptive of the play preceding the shot, in a manner which adds valuable context regarding the potential quality of the shot. This further supports that the values should not be dropped.

The features pass_length, pass_angle, and pass_height are likewise significantly descriptive of the play preceding the shot, in a manner which adds valuable context regarding the potential quality of the shot. Therefore, these features should not be dropped.

Because the values and features should not be dropped, the values will need to be replaced.

Numerical No Pass Shots

pass_length and pass_angle are numerical, and will therefore require numerical filling.

A 0 in these features indicates that the ball was not passed any length, at any angle to reach the point of the shot, implying no preceding pass.

Replace pass_length and pass_angle missing values with 0

extracted_data['pass_length'] = extracted_data['pass_length'].fillna(0)
extracted_data['pass_angle'] = extracted_data['pass_angle'].fillna(0)
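
As an alternative sketch, both numerical columns can be filled in one call by passing a dict to DataFrame.fillna, which leaves the other columns untouched. The toy DataFrame below is a hypothetical stand-in for extracted_data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (illustrative values only)
toy = pd.DataFrame({
    'pass_length': [12.5, np.nan, 30.1, np.nan],
    'pass_angle': [0.4, np.nan, -1.2, np.nan],
    'pass_height': ['Ground Pass', np.nan, 'High Pass', np.nan],
})

# A dict passed to fillna fills only the named columns, each with its own value
toy = toy.fillna({'pass_length': 0, 'pass_angle': 0})
print(toy)
```

Assigning the result back, rather than filling a selected column in place, avoids relying on whether the selected column is a view or a copy of the DataFrame.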

Categorical No Pass Shots

Compare no pass shot events against the pass-related categorical feature values

extracted_data.loc[extracted_data['pass_length'] == 0]['pass_type'].value_counts(dropna = False)

NaN    1943
Name: pass_type, dtype: int64

extracted_data.loc[extracted_data['pass_length'] == 0]['pass_technique'].value_counts(dropna = False)

NaN    1943
Name: pass_technique, dtype: int64

Note: It appears one shot event already had values of 0 for pass_length and pass_angle, indicating this shot event was not preceded by a pass but was recorded differently, with 0 values instead of NaN values. Therefore the count for no pass events has increased to 1,943.

All of the shot events previously identified as no pass shots are also missing values for pass_height, pass_type and pass_technique. Because these features are categorical, the missing values of the no pass events will be filled with the value ‘No Pass.’

Note: This will not replace all missing values for these features, just previously identified no pass shots.

extracted_data.loc[extracted_data['pass_length'] == 0, ['pass_height', 'pass_type', 'pass_technique']] = 'No Pass'
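
The effect of this masked assignment can be seen on a small example. This is a sketch on hypothetical toy data (column values are illustrative), showing that only the rows with a pass_length of 0 are overwritten while other missing values are left in place:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in which pass_length has already been filled with 0
toy = pd.DataFrame({
    'pass_length': [12.5, 0.0, 30.1, 0.0],
    'pass_height': ['Ground Pass', np.nan, 'High Pass', np.nan],
    'pass_type': [np.nan, np.nan, 'Corner', np.nan],
    'pass_technique': ['Through Ball', np.nan, np.nan, np.nan],
})

# Boolean mask selecting the no pass shots
no_pass = toy['pass_length'] == 0
categorical_cols = ['pass_height', 'pass_type', 'pass_technique']

# Overwrite only the masked rows; the scalar is broadcast across the columns
toy.loc[no_pass, categorical_cols] = 'No Pass'
print(toy)
```

The first row keeps its missing pass_type and the third row keeps its missing pass_technique, which is why those two features still need attention afterwards.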

Boolean No Pass Shots

The following are pass-related features with boolean values:

  • pass_switch
  • pass_through_ball
  • pass_backheel
  • pass_cross
  • pass_cut_back
  • pass_inswinging
  • pass_straight
  • pass_outswinging

Assess the values of the boolean features for no pass shots

boolean_pass_features = ['pass_switch', 'pass_through_ball', 'pass_backheel', 'pass_cross',
                         'pass_cut_back', 'pass_inswinging', 'pass_straight', 'pass_outswinging']

extracted_data.loc[extracted_data['pass_length'] == 0][boolean_pass_features].value_counts()

pass_switch  pass_through_ball  pass_backheel  pass_cross  pass_cut_back  pass_inswinging  pass_straight  pass_outswinging
False        False              False          False       False          False            False          False               1943
dtype: int64

All of the shot events previously identified as no pass shots currently have the value ‘False’ for the pass-related boolean features. This is technically accurate as an event with no pass would not have the characteristics specified by these features. For this reason, no filling is required for these features.
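
The same conclusion can be reduced to a single True/False answer. A minimal sketch, using hypothetical toy data with only two of the boolean features for brevity:

```python
import pandas as pd

# Hypothetical toy data (illustrative values only)
toy = pd.DataFrame({
    'pass_length': [12.5, 0.0, 30.1, 0.0],
    'pass_switch': [False, False, True, False],
    'pass_cross': [True, False, False, False],
})

boolean_cols = ['pass_switch', 'pass_cross']
no_pass = toy['pass_length'] == 0

# First any() checks each column, second any() checks across columns
any_flag_set = bool(toy.loc[no_pass, boolean_cols].any().any())
print('Any boolean pass feature set on a no pass shot:', any_flag_set)
```

A result of False means every no pass shot has False in every boolean pass feature, so nothing needs to be filled.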

Additional Missing Values

extracted_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6080 entries, 0 to 6079
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 period_x 6080 non-null int64
1 timestamp_x 6080 non-null datetime64[ns]
2 play_pattern_x 6080 non-null object
3 location_x 6080 non-null object
4 under_pressure_x 6080 non-null bool
5 shot_statsbomb_xg 6080 non-null float64
6 shot_end_location 6080 non-null object
7 shot_technique 6080 non-null object
8 goal 6080 non-null bool
9 shot_type 6080 non-null object
10 shot_body_part 6080 non-null object
11 shot_one_on_one 6080 non-null bool
12 shot_open_goal 6080 non-null bool
13 shot_first_time 6080 non-null bool
14 shot_redirect 6080 non-null bool
15 shot_deflected 6080 non-null bool
16 shot_follows_dribble 6080 non-null bool
17 pass_length 6080 non-null float64
18 pass_angle 6080 non-null float64
19 pass_height 6080 non-null object
20 pass_type 2903 non-null object
21 pass_switch 6080 non-null bool
22 pass_through_ball 6080 non-null bool
23 pass_technique 2298 non-null object
24 pass_backheel 6080 non-null bool
25 pass_cross 6080 non-null bool
26 counterpress 6080 non-null bool
27 pass_cut_back 6080 non-null bool
28 pass_inswinging 6080 non-null bool
29 pass_straight 6080 non-null bool
30 pass_outswinging 6080 non-null bool
dtypes: bool(17), datetime64[ns](1), float64(3), int64(1), object(9)
memory usage: 813.4+ KB

pass_type and pass_technique have remaining missing values after filling no pass shots.

pass_type

extracted_data['pass_type'].value_counts(dropna = False)

NaN             3177
No Pass 1943
Corner 400
Recovery 305
Free Kick 201
Throw-in 42
Interception 10
Kick Off 1
Goal Kick 1
Name: pass_type, dtype: int64

The values defined for pass_type are either:

Set-pieces: Plays restarting play from a stoppage

  • ‘Corner’
  • ‘Free Kick’
  • ‘Throw-in’
  • ‘Kick Off’
  • ‘Goal Kick’

-or-

Defensive Recoveries: Plays that directly result from, and are immediately preceded by, the in-possession team winning possession from the opposition team

  • ‘Recovery’
  • ‘Interception’

It can be assumed the missing values were shots in which the preceding pass was neither a set-piece nor a defensive recovery, and can simply be considered as from open-play. Therefore the additional missing values will be filled with the categorical value ‘Open Play.’

Note: This observation is subjective, based on domain knowledge specific to the sport of soccer.

Replace remaining pass_type missing values with ‘Open Play’

extracted_data['pass_type'] = extracted_data['pass_type'].fillna('Open Play')
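
The order of the two fills matters: because no pass shots were labelled ‘No Pass’ first, only the remaining gaps become ‘Open Play.’ A minimal sketch on a hypothetical series (values chosen for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical pass_type values after the 'No Pass' step has already run
pass_type = pd.Series(
    ['Corner', np.nan, 'No Pass', 'Recovery', np.nan, 'No Pass'],
    name='pass_type',
)

# Only the values that are still missing are replaced
pass_type = pass_type.fillna('Open Play')

counts = pass_type.value_counts(dropna=False)
print(counts)
```

Had the fills been applied in the opposite order, the no pass shots would have been labelled as open play passes instead.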

pass_technique

extracted_data['pass_technique'].value_counts(dropna = False)
NaN 3782
No Pass 1943
Through Ball 198
Inswinging 76
Outswinging 55
Straight 26
Name: pass_technique, dtype: int64

It can be assumed the missing values were shots in which the preceding pass was not a specialized type of pass, and can simply be considered a standard pass. Therefore the additional missing values will be filled with the categorical value ‘Standard.’

Note: This observation is subjective, based on domain knowledge specific to the sport of soccer.

Replace remaining pass_technique missing values with ‘Standard’

extracted_data['pass_technique'] = extracted_data['pass_technique'].fillna('Standard')

Results

extracted_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6080 entries, 0 to 6079
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 period_x 6080 non-null int64
1 timestamp_x 6080 non-null datetime64[ns]
2 play_pattern_x 6080 non-null object
3 location_x 6080 non-null object
4 under_pressure_x 6080 non-null bool
5 shot_statsbomb_xg 6080 non-null float64
6 shot_end_location 6080 non-null object
7 shot_technique 6080 non-null object
8 goal 6080 non-null bool
9 shot_type 6080 non-null object
10 shot_body_part 6080 non-null object
11 shot_one_on_one 6080 non-null bool
12 shot_open_goal 6080 non-null bool
13 shot_first_time 6080 non-null bool
14 shot_redirect 6080 non-null bool
15 shot_deflected 6080 non-null bool
16 shot_follows_dribble 6080 non-null bool
17 pass_length 6080 non-null float64
18 pass_angle 6080 non-null float64
19 pass_height 6080 non-null object
20 pass_type 6080 non-null object
21 pass_switch 6080 non-null bool
22 pass_through_ball 6080 non-null bool
23 pass_technique 6080 non-null object
24 pass_backheel 6080 non-null bool
25 pass_cross 6080 non-null bool
26 counterpress 6080 non-null bool
27 pass_cut_back 6080 non-null bool
28 pass_inswinging 6080 non-null bool
29 pass_straight 6080 non-null bool
30 pass_outswinging 6080 non-null bool
dtypes: bool(17), datetime64[ns](1), float64(3), int64(1), object(9)
memory usage: 973.4+ KB

Conclusion

In part four of the data cleaning portion of the data science workflow for creating a soccer expected goals classification model, missing values were identified, assessed, and processed.

First, missing values were identified throughout the dataset. All of the features with missing values were pass-related.

In assessing the missing values, it was noted that pass_length, pass_angle, and pass_height each had 1,942 missing values, and, after further investigation, it was assumed that these shot events were not preceded by a pass, ‘no pass shots’:

  • Numerical pass-related features for no pass shots were replaced with the value 0.
  • Categorical pass-related features for no pass shots were replaced with the value ‘No Pass.’
  • Boolean pass-related features for no pass shots did not require filling.

Following the filling of no pass shots, pass_type and pass_technique had additional missing values:

  • pass_type missing values were assumed to be passes from open-play and were therefore filled with ‘Open Play.’
  • pass_technique missing values were assumed to be standard passes and were therefore filled with ‘Standard.’
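
The steps summarized above can be sketched as a single function. This is an illustrative consolidation rather than the code used in the project: the function name fill_missing_pass_features and the toy DataFrame are hypothetical, and the fill values are the ones described in this article.

```python
import numpy as np
import pandas as pd


def fill_missing_pass_features(df):
    """Hypothetical helper combining the filling steps from this article."""
    df = df.copy()  # leave the caller's DataFrame unchanged

    # 1. Numerical features: 0 stands for "no preceding pass"
    df = df.fillna({'pass_length': 0, 'pass_angle': 0})

    # 2. Categorical features of no pass shots
    no_pass = df['pass_length'] == 0
    df.loc[no_pass, ['pass_height', 'pass_type', 'pass_technique']] = 'No Pass'

    # 3. Remaining gaps in pass_type and pass_technique
    df = df.fillna({'pass_type': 'Open Play', 'pass_technique': 'Standard'})
    return df


# Hypothetical toy data (illustrative values only)
toy = pd.DataFrame({
    'pass_length': [12.5, np.nan, 30.1, np.nan],
    'pass_angle': [0.4, np.nan, -1.2, np.nan],
    'pass_height': ['Ground Pass', np.nan, 'High Pass', np.nan],
    'pass_type': [np.nan, np.nan, 'Corner', np.nan],
    'pass_technique': ['Through Ball', np.nan, np.nan, np.nan],
})

cleaned = fill_missing_pass_features(toy)
print(cleaned)
print('Any missing values left:', bool(cleaned.isna().any().any()))
```

Keeping the steps in this order preserves the distinction between no pass shots and passes that simply lacked a recorded type or technique.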

The final data contains no missing values, which will allow for better interpretation during the eventual modeling process.

Continued

This series, which explains the processes within the data science workflow for creating a soccer expected goals classification model, continues in part five of the data cleaning process, which covers splitting the location feature coordinates.
