Soccer Expected Goals — Data Cleaning — Part Two: Identifying and Dropping Irrelevant Data

Wes Swager
Oct 11, 2021


Part two of the data cleaning portion of the data science workflow for creating a soccer expected goals classification model, explaining the process of identifying and dropping irrelevant data.

Expected Goals (xG)

xG measures the quality of a shot.

xG, as a metric, indicates the likelihood that a shot will result in a goal based on the characteristics of the shot and the play preceding the shot.

xG is measured on a scale between zero and one, with one representing a goal. For example, a shot with a 0.5 xG indicates a shot having a 50% chance of being a goal.
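As a small worked example of reading xG as a probability, the sketch below uses made-up xG values (they are not taken from the StatsBomb data). Because each value is a probability of scoring, adding them up gives the number of goals a team would be expected to score from those shots.

```python
# Hypothetical xG values for four shots by one team in one match.
shot_xg = [0.5, 0.25, 0.125, 0.125]

# Each value is the estimated probability that the shot is scored.
for xg in shot_xg:
    print(f"xG {xg:.3f} is a {xg:.1%} chance of a goal")

# Summing the per-shot probabilities gives the expected number of goals.
total_xg = sum(shot_xg)
print(f"Total xG: {total_xg}")
```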

Classification Model

The metric of expected goals is calculated through the use of a classification model.

Classification refers to a predictive modeling problem in which a class label is predicted for a given example of input data.

For this project, a supervised approach was used: the training data for the model includes which shots were goals.
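To make the idea concrete, here is a minimal sketch of how a classifier can turn shot characteristics into a probability and then into a class label. It uses a logistic function, which is one common choice and not necessarily the model used in this series. The two input features (distance and angle) and all of the coefficients are hypothetical; a real model would learn its coefficients from the labeled training shots.

```python
import math

# Hypothetical coefficients for illustration only. A trained model would
# learn these from labeled shots (goal = 1, no goal = 0).
BIAS = 0.5
WEIGHT_DISTANCE = -0.15  # farther shots are assumed less likely to score
WEIGHT_ANGLE = 1.2       # a wider view of the goal is assumed to help


def goal_probability(distance, angle):
    """Return an xG-style probability between 0 and 1."""
    score = BIAS + WEIGHT_DISTANCE * distance + WEIGHT_ANGLE * angle
    return 1.0 / (1.0 + math.exp(-score))


def predict_label(distance, angle, threshold=0.5):
    """Turn the probability into a class label: 1 = goal, 0 = no goal."""
    return int(goal_probability(distance, angle) >= threshold)


close_shot = goal_probability(distance=6.0, angle=1.0)
long_shot = goal_probability(distance=30.0, angle=0.2)
print(close_shot, long_shot)
```

The probability itself plays the role of xG; the thresholded label is what makes this a classification problem.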

The Data

The data for this project was extracted from StatsBomb’s Open Data.

StatsBomb are a United Kingdom-based football (soccer) data analytics company. StatsBomb provide free access to a segment of their proprietary dataset via GitHub: StatsBomb Open Data

Data Cleaning

Data cleaning is the process of detecting and correcting corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.

Wikipedia

Data Cleaning: Irrelevant Data

Shot events data was previously extracted as part of the data extraction process (see Background — Previous Steps above).

Note: extracted_data contains 6,080 events with 81 features

Initial Features

After the data extraction process, the data included the following 81 features:

extracted_data.shape[1]
81
list(extracted_data.columns.values)
'id',
'index_x',
'period_x',
'timestamp_x',
'minute_x',
'second_x',
'type_x',
'possession_x',
'possession_team_x',
'play_pattern_x',
'team_x',
'player_x',
'position_x',
'location_x',
'duration_x',
'under_pressure_x',
'related_events_x',
'match_id_x',
'shot_statsbomb_xg',
'shot_end_location',
'shot_key_pass_id',
'shot_technique',
'goal',
'shot_type',
'shot_body_part',
'shot_freeze_frame',
'shot_one_on_one',
'shot_aerial_won',
'shot_open_goal',
'shot_first_time',
'out_x',
'shot_redirect',
'shot_deflected',
'off_camera_x',
'shot_saved_off_target',
'shot_saved_to_post',
'shot_follows_dribble',
'index_y',
'period_y',
'timestamp_y',
'minute_y',
'second_y',
'type_y',
'possession_y',
'possession_team_y',
'play_pattern_y',
'team_y',
'player_y',
'position_y',
'location_y',
'duration_y',
'related_events_y',
'match_id_y',
'pass_recipient',
'pass_length',
'pass_angle',
'pass_height',
'pass_end_location',
'pass_body_part',
'pass_type',
'under_pressure_y',
'pass_outcome',
'pass_aerial_won',
'pass_assisted_shot_id',
'pass_shot_assist',
'off_camera_y',
'pass_switch',
'pass_through_ball',
'pass_technique',
'pass_backheel',
'pass_cross',
'counterpress',
'pass_cut_back',
'pass_deflected',
'pass_goal_assist',
'pass_miscommunication',
'pass_inswinging',
'pass_straight',
'pass_outswinging',
'pass_no_touch',
'out_y'

Duplicate Features

Upon initial review it is apparent that a number of the features extracted were duplicated as a result of extracting data for both shot events and the pass events preceding the shot events.

Note: Conveniently, while combining the two datasets, pandas marked columns whose names appeared in both with _x and _y suffixes.
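The suffix behavior can be reproduced on a toy example. This sketch assumes the two datasets were combined with DataFrame.merge, whose default suffixes are _x (left frame) and _y (right frame); the miniature frames and the pass_id column name are hypothetical stand-ins, not the actual extraction code.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the shot and pass event tables.
shots = pd.DataFrame({
    "id": ["s1", "s2"],
    "minute": [12, 57],
    "shot_key_pass_id": ["p1", "p2"],
})
passes = pd.DataFrame({
    "pass_id": ["p1", "p2"],  # hypothetical name for the pass event id
    "minute": [12, 56],
    "pass_length": [18.5, 7.2],
})

# Join each shot to the pass that set it up. "minute" exists in both
# frames, so pandas renames it to minute_x (shot) and minute_y (pass).
merged = shots.merge(passes, left_on="shot_key_pass_id", right_on="pass_id")
print(list(merged.columns))
```

Columns that exist in only one of the two frames, such as pass_length, keep their original names.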

The first step in cleaning the irrelevant features is dropping the duplicate features:

duplicate_features = ['shot_saved_off_target',
'shot_saved_to_post',
'pass_outcome',
'pass_assisted_shot_id',
'pass_shot_assist',
'pass_goal_assist',
'pass_end_location',
'index_y',
'period_y',
'timestamp_y',
'minute_x',
'second_x',
'minute_y',
'second_y',
'type_y',
'possession_y',
'possession_team_y',
'play_pattern_y',
'team_y',
'player_y',
'position_y',
'location_y',
'duration_y',
'related_events_y',
'match_id_y',
'under_pressure_y',
'off_camera_y',
'out_y']
extracted_data.drop(duplicate_features, axis = 1, inplace = True)
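Most of the names in that list carry the _y suffix, so part of it could be built programmatically instead of typed by hand. The sketch below shows the idea on a hypothetical empty frame that borrows a few column names from the article. It would not cover the entries without a _y suffix (for example pass_outcome or minute_x), which would still need to be listed explicitly.

```python
import pandas as pd

# Hypothetical frame using a handful of the column names listed above.
df = pd.DataFrame(columns=[
    "id", "period_x", "period_y", "team_x", "team_y", "pass_length",
])

# Columns with the "_y" suffix came from the pass side of the join.
pass_side_duplicates = [col for col in df.columns if col.endswith("_y")]

# Without inplace=True, drop returns a new frame and leaves df unchanged.
deduplicated = df.drop(columns=pass_side_duplicates)
print(list(deduplicated.columns))
```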

Non-Shot-Specific Features

The second observation is that a number of features describe neither the shot itself nor the play preceding the shot in a manner which adds valuable context regarding the potential quality of the shot. These features will therefore not benefit the eventual model.

Note: These observations are subjective, based on domain knowledge specific to the sport of soccer.

Features Indicative of the Data’s Organizational Structure

  • id: A unique identification number for the event.
  • match_id_x: A unique identification number for the match in which the event took place.
  • index_x: An identifier indicating the event’s location within StatsBomb’s nested data structure.
  • type_x: Categorizes various event types. (Only shot events have been extracted, therefore, the only event type is ‘shot.’)

These features are useful for navigating and sorting the larger dataset. However, they describe neither the shot itself nor the play preceding it in a manner which adds valuable context regarding the potential quality of the shot.
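Two of the claims above can be checked mechanically: a column like type_x should hold a single value, and a column like id should hold a different value on every row. A minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical rows standing in for extracted shot events.
df = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "type_x": ["Shot", "Shot", "Shot"],
    "goal": [0, 1, 0],
})

# Number of distinct values in each column.
unique_counts = df.nunique()

# A single distinct value means the column cannot separate goals from misses.
constant_columns = unique_counts[unique_counts == 1].index.tolist()

# One distinct value per row means the column acts as a pure identifier.
identifier_columns = unique_counts[unique_counts == len(df)].index.tolist()

print(constant_columns, identifier_columns)
```

On the real data, columns that hold lists (location_x, for example) may need to be converted to tuples or strings before nunique can count them.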

Features Referencing Video Recording

  • shot_freeze_frame: Describes how to locate the event in video recording.
  • off_camera_x: Indicates events which occurred off-camera.

These features are useful for connecting the data associated with an event with video recordings of the event for further analysis. However, they describe neither the shot itself nor the play preceding it in a manner which adds valuable context regarding the potential quality of the shot.

Note: These features could potentially help derive additional features not currently recorded by StatsBomb through analysis of the video recording.

Features Referencing Other Events

  • shot_key_pass_id: Indicates the id values of pass events for passes which directly preceded the shot event (potential assist if the shot were a goal). (Note: During the data extraction process, the values of this feature were used to extract data for the specified pass events which was concatenated with the shot events in order to provide more data specific to the play preceding the shot.)
  • related_events_x: Indicates other events significantly connected to the shot event. (Note: During the data extraction process, the values of this feature were compared with dribble event ids; however, no matches were found. No other event types appear significant toward adding valuable context regarding the potential quality of the shot.)

These features were useful for extracting (or attempting to extract) additional data regarding the play preceding the shot event during the data extraction process. However, they describe neither the shot itself nor the play preceding it in a manner which adds valuable context regarding the potential quality of the shot.

Features Descriptive of the Game-State

  • possession_x: Indicates the uniquely numbered possession within the game which the event was a part of. A possession is a period of time in which a team is in possession of the ball. Within the context of the data, a possession begins when a team wins or receives the ball and ends when the opposing team wins or receives the ball, and every event occurring in between is assigned that possession value.
  • possession_team_x: Indicates which of the two teams was in possession of the ball during the event.
  • team_x: Indicates the team which executed the event. (Because all of the events are shot events, the team executing the event would be the same as possession_team_x.)
  • player_x: Indicates the player who executed the event.
  • position_x: Indicates the playing position of the player who executed the event.

These features are useful for describing the state of the game when the event occurred. However, they describe neither the shot itself nor the play preceding it in a manner which adds valuable context regarding the potential quality of the shot.
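The claim that team_x always matches possession_team_x for shot events is easy to confirm before dropping either column. A minimal sketch, with hypothetical team names standing in for the real values:

```python
import pandas as pd

# Hypothetical shot rows; team names are placeholders.
df = pd.DataFrame({
    "team_x": ["Team A", "Team B", "Team A"],
    "possession_team_x": ["Team A", "Team B", "Team A"],
})

# True only if the two columns agree on every row.
teams_match = bool((df["team_x"] == df["possession_team_x"]).all())

# Rows where they disagree, worth inspecting before dropping anything.
mismatches = df[df["team_x"] != df["possession_team_x"]]

print(teams_match, len(mismatches))
```

The same comparison could be applied to pass_recipient and player_x.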

Features Specific to the Pass Preceding the Shot, but not the Shot

  • pass_recipient: Indicates the player who received the pass of a pass event. (Because pass event data was concatenated with the shot events they preceded, the pass_recipient would be the same as player_x.)
  • pass_body_part: Indicates the body part used to make the pass.
  • pass_aerial_won: Indicates if the pass was made as part of the action of winning the ball out of the air.
  • pass_deflected: Indicates if the pass was deflected while traveling between the passer and intended receiver.
  • pass_miscommunication: Indicates if there was a miscommunication between the passer and intended receiver.
  • pass_no_touch: Indicates if the pass was made with the passer’s first and only touch, without any preceding touch or touches.

These features are descriptive of the pass preceding the shot, but in ways which do not add valuable context regarding the potential quality of the shot.

Features Descriptive of the Shot, but not with Useful Data

  • duration_x: Indicates how much time the event required.
  • out_x: Indicates if the shot ended out of bounds. (This is redundant with the class label ‘goal’: a shot ending out of bounds provides no additional information compared with a shot being classified as not a goal.)

These features are descriptive characteristics of the actual shot, but will not be beneficial toward the development of the eventual model.
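The redundancy argument for out_x can be checked with a cross-tabulation against the goal label. The rows below are hypothetical, and the sketch assumes that a missing out_x value means the shot did not end out of bounds.

```python
import pandas as pd

# Hypothetical shots. None stands for a missing out_x value, assumed
# here to mean the shot did not end out of bounds.
shots = pd.DataFrame({
    "out_x": [True, None, None, True, None],
    "goal": [0, 1, 0, 0, 0],
})

# Treat anything other than an explicit True as "not out".
went_out = shots["out_x"].eq(True)

# Counts of shots for each combination of out/not out and goal/no goal.
overlap = pd.crosstab(went_out, shots["goal"])
print(overlap)

# If out_x adds nothing beyond the label, no goal should also be "out".
goals_out_of_bounds = int((went_out & shots["goal"].eq(1)).sum())
print(goals_out_of_bounds)
```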

The second step in cleaning the irrelevant features is dropping the non-shot-specific features:

non_shot_specific_features = ['id',
'index_x',
'type_x',
'possession_x',
'possession_team_x',
'team_x',
'player_x',
'position_x',
'duration_x',
'related_events_x',
'match_id_x',
'shot_key_pass_id',
'shot_freeze_frame',
'out_x',
'off_camera_x',
'shot_aerial_won',
'pass_recipient',
'pass_body_part',
'pass_aerial_won',
'pass_deflected',
'pass_miscommunication',
'pass_no_touch']
extracted_data.drop(non_shot_specific_features, axis = 1, inplace = True)

Final Set of Features

After dropping the duplicate and non-shot-specific features, the following list of 31 features remains:

extracted_data.shape[1]
31
list(extracted_data.columns.values)
['period_x',
'timestamp_x',
'play_pattern_x',
'location_x',
'under_pressure_x',
'shot_statsbomb_xg',
'shot_end_location',
'shot_technique',
'goal',
'shot_type',
'shot_body_part',
'shot_one_on_one',
'shot_open_goal',
'shot_first_time',
'shot_redirect',
'shot_deflected',
'shot_follows_dribble',
'pass_length',
'pass_angle',
'pass_height',
'pass_type',
'pass_switch',
'pass_through_ball',
'pass_technique',
'pass_backheel',
'pass_cross',
'counterpress',
'pass_cut_back',
'pass_inswinging',
'pass_straight',
'pass_outswinging']
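The feature count can be cross-checked against the lengths of the lists shown in this article: 81 initial features, 28 names in duplicate_features, and 22 names in non_shot_specific_features. The sketch assumes the two drop lists share no names and that every name exists in the initial data.

```python
# Counts taken from the lists shown in this article.
initial_feature_count = 81
duplicate_feature_count = 28
non_shot_specific_feature_count = 22

# Every dropped name removes exactly one column.
remaining_feature_count = (
    initial_feature_count
    - duplicate_feature_count
    - non_shot_specific_feature_count
)
print(remaining_feature_count)
```

The result matches the number of names in the list above.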

Conclusion

As part two of the data cleaning portion of the data science workflow for creating a soccer expected goals classification model, irrelevant data was dropped from the dataframe.

First, features duplicated as a result of extracting data for both shot events and the pass events preceding the shot events were identified and dropped.

Second, features which describe neither the shot itself nor the play preceding the shot in a manner which adds valuable context regarding the potential quality of the shot, and which are therefore not beneficial to the eventual model, were identified through subjective, domain-based judgment and dropped.

The final dataset is significantly more streamlined, retaining only the features relevant to the eventual model.

Continued

This series, explaining the processes within the data science workflow for creating a soccer expected goals classification model, will be continued in part three of the data cleaning process, which explains the process of correcting the datatypes.
