Soccer Expected Goals — Data Cleaning — Part One: Defining the Target Feature

Wes Swager
Oct 22, 2021


Part one of the data cleaning stage in a data science workflow that builds a soccer expected goals classification model. This post explains how the model's target feature, goal, is defined.

Background

Previous Steps

The data extraction process was previously explained in a series of posts:

Expected Goals (xG)

xG measures the quality of a shot.

xG, as a metric, indicates the likelihood that a shot will result in a goal based on the characteristics of the shot and the play preceding the shot.

xG is measured on a scale from zero to one, where one means the shot is certain to result in a goal. For example, a shot with an xG of 0.5 has a 50% chance of being a goal.
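Because each xG value is a probability, the values for several shots can be read individually or added together to give the number of goals those shots would be expected to produce. The sketch below illustrates this with made-up numbers; the list shot_xg and its values are assumptions for illustration, not data from this project.

```python
# Hypothetical xG values for five shots (illustrative only, not project data)
shot_xg = [0.05, 0.5, 0.12, 0.03, 0.3]

# Each value reads directly as a probability of scoring
for xg in shot_xg:
    print(f"xG {xg:.2f} -> {xg:.0%} chance of a goal")

# Summing the probabilities gives the expected number of goals from these shots
total_xg = sum(shot_xg)
print("Expected goals from these shots:", round(total_xg, 2))
```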

Classification Model

The metric of expected goals is calculated through the use of a classification model.

Classification is a predictive modeling problem in which a class label is predicted for a given example of input data.

For this project, a supervised approach was used: the training data for the model includes which shots were goals.

The Data

The data for this project was extracted from StatsBomb’s Open Data.

StatsBomb are a United Kingdom-based football (soccer) data analytics company. StatsBomb provide free access to a segment of their proprietary dataset via GitHub: StatsBomb Open Data

Data Cleaning

Data cleaning is the process of detecting and correcting corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.

Wikipedia

Data Cleaning: Target Feature

Because xG indicates the likelihood that a shot will result in a goal, the target feature for the model is goal, a boolean feature that is ‘True’ if a shot event resulted in a goal and ‘False’ if it did not.

Because the eventual model will be a supervised classifier, each shot event is labeled with the target feature, goal. The models are trained on the other features of the shot events so that they can then predict goals.
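In a supervised setup, the labeled data is usually separated into the target and the remaining features before training. The sketch below shows that separation on a tiny hand-made frame. The frame toy_shots and the columns distance_to_goal and body_part are hypothetical stand-ins, not the project's actual features.

```python
import pandas as pd

# Hypothetical labeled shot events (column names and values are illustrative)
toy_shots = pd.DataFrame({
    'distance_to_goal': [8.0, 25.0, 12.5, 30.0],
    'body_part': ['Right Foot', 'Head', 'Left Foot', 'Right Foot'],
    'goal': [True, False, False, False],
})

# y holds the target feature; X holds everything the model may learn from
y = toy_shots['goal']
X = toy_shots.drop(columns='goal')

print(X.columns.tolist())
print(y.tolist())
```

Keeping the target out of X matters: if goal stayed among the input features, a model could simply read off the answer.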

Shot events data was previously extracted as part of the data extraction process (see Background — Previous Steps above).

Note: extracted_data contains 6,080 events with 81 features

extracted_data currently contains the feature shot_outcome, which holds categorical values:

extracted_data['shot_outcome'].value_counts(dropna = False)

Off T               1912
Saved               1531
Blocked             1460
Goal                 664
Wayward              336
Post                 136
Saved Off Target      24
Saved to Post         17
Name: shot_outcome, dtype: int64
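Passing dropna = False matters here because value_counts leaves missing values out by default, and a missing outcome compared against 'Goal' would quietly be labeled as not a goal. The sketch below shows the difference on a small hypothetical series, toy_outcomes, that contains one missing value. The counts above show no missing category in the real data.

```python
import pandas as pd

# Hypothetical outcomes, including one missing value (illustrative only)
toy_outcomes = pd.Series(['Goal', 'Saved', None, 'Blocked', 'Goal'],
                         name='shot_outcome')

# Default behaviour: the missing value is left out of the counts
counts_default = toy_outcomes.value_counts()

# dropna=False: the missing value gets its own row in the counts
counts_with_missing = toy_outcomes.value_counts(dropna=False)

# A direct count of missing outcomes, useful as a sanity check
missing_rows = int(toy_outcomes.isna().sum())

print(counts_default.sum(), counts_with_missing.sum(), missing_rows)
```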

Convert the shot_outcome values to booleans: True for a goal, False for any other outcome

# True where the outcome is 'Goal', False for every other outcome
extracted_data['shot_outcome'] = extracted_data['shot_outcome'] == 'Goal'

Rename the shot_outcome column to goal

extracted_data.rename(columns = {'shot_outcome' : 'goal'}, inplace = True)
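The conversion and rename can be checked end to end on a small hand-made frame before running them on the full dataset. In the sketch below, toy_shots is a hypothetical stand-in for extracted_data, and rename returns a new frame rather than using inplace, which is a stylistic choice and not the article's method.

```python
import pandas as pd

# Hypothetical stand-in for extracted_data with only the outcome column
toy_shots = pd.DataFrame({
    'shot_outcome': ['Goal', 'Saved', 'Blocked', 'Off T', 'Goal', 'Post'],
})

# True where the outcome is 'Goal', False for every other outcome
toy_shots['shot_outcome'] = toy_shots['shot_outcome'] == 'Goal'

# Rename the converted column to the target feature name
toy_shots = toy_shots.rename(columns={'shot_outcome': 'goal'})

print(toy_shots['goal'].tolist())
# → [True, False, False, False, True, False]
```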

Results

extracted_data['goal'].value_counts()

False    5416
True      664
Name: goal, dtype: int64

# The mean of a boolean column is the share of True values
print("goal Percent 'True':", round(extracted_data['goal'].mean() * 100, 2), '%')

goal Percent 'True': 10.92 %
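The class balance can also be read in one step with value_counts(normalize = True). The sketch below rebuilds a boolean series from the two counts reported above (664 goals, 5,416 non-goals) rather than from the real dataset; the name goal_labels is an assumption for illustration.

```python
import pandas as pd

# Rebuild a boolean series from the counts reported in the results
goal_labels = pd.Series([True] * 664 + [False] * 5416, name='goal')

# Share of each class as a fraction of all shots
class_shares = goal_labels.value_counts(normalize=True)

goal_pct = round(class_shares[True] * 100, 2)
no_goal_pct = round(class_shares[False] * 100, 2)

print("goal:", goal_pct, "%")
print("no goal:", no_goal_pct, "%")
```

With roughly one shot in nine labeled True, the two classes are clearly imbalanced, which is worth keeping in mind when the models are evaluated later.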

Conclusion

To define the target feature for the eventual model, shot_outcome was converted from a categorical feature to the boolean feature goal, which is ‘True’ if a shot event resulted in a goal and ‘False’ if it did not.

Continued

This series on the data science workflow for creating a soccer expected goals classification model continues in part two of the data cleaning process, which explains how irrelevant data is dropped:
