Soccer Expected Goals — Data Cleaning — Part Three: Correcting Datatypes

Wes Swager
5 min read · Oct 22, 2021

This is part three of the data cleaning stage in a data science workflow for building a soccer expected goals classification model. It explains the process of correcting datatypes.

Expected Goals (xG)

xG measures the quality of a shot.

xG, as a metric, indicates the likelihood that a shot will result in a goal based on the characteristics of the shot and the play preceding the shot.

xG is measured on a scale between zero and one, with one representing a certain goal. For example, a shot with an xG of 0.5 has a 50% chance of resulting in a goal.
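Because each xG value is a probability, the values for a set of shots can be added up to estimate how many goals those shots would be expected to produce. The sketch below illustrates this with made-up numbers; `shot_xg` and its values are hypothetical and do not come from the dataset.

```python
# Hypothetical xG values for five shots by one team (illustrative only).
shot_xg = [0.05, 0.50, 0.12, 0.76, 0.03]

# Each value is the estimated probability that the shot becomes a goal,
# so the sum is the number of goals these shots would be expected to yield.
expected_goals = sum(shot_xg)

print(f"Shots: {len(shot_xg)}, expected goals: {expected_goals:.2f}")
```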

Classification Model

The metric of expected goals is calculated through the use of a classification model.

Classification is a predictive modeling problem in which a class label is predicted for a given example of input data.

For this project, a supervised approach was used: the training data for the model includes which shots were goals.
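In practice, a supervised setup means keeping the known outcome (the label) separate from the inputs (the features). The following is a minimal sketch of that split, assuming a toy frame that borrows a few column names from this dataset; the values are invented and this is not necessarily how the final model is prepared.

```python
import pandas as pd

# Toy shot records using column names from this dataset; values are made up.
shots = pd.DataFrame({
    "shot_first_time": [True, False, False, True],
    "under_pressure_x": [False, True, False, False],
    "goal": [True, False, False, True],
})

# The label is the known outcome; the features are everything else.
features = shots.drop(columns="goal")
label = shots["goal"]

# Share of shots that were goals, a simple reference point for any model.
goal_rate = label.mean()

print(features.columns.tolist(), goal_rate)
```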

The Data

The data for this project was extracted from StatsBomb’s Open Data.

StatsBomb are a United Kingdom-based football (soccer) data analytics company. StatsBomb provide free access to a segment of their proprietary dataset via GitHub: StatsBomb Open Data

Data Cleaning

Data cleaning is the process of detecting and correcting corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.

Wikipedia

Data Cleaning: Correcting Datatypes

Shot events data was previously extracted as part of the data extraction process (see Background — Previous Steps above).

Note: extracted_data contains 6,080 events with 31 features

Inspect the current datatypes

extracted_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6080 entries, 0 to 6079
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 period_x 6080 non-null int64
1 timestamp_x 6080 non-null object
2 play_pattern_x 6080 non-null object
3 location_x 6080 non-null object
4 under_pressure_x 1045 non-null object
5 shot_statsbomb_xg 6080 non-null float64
6 shot_end_location 6080 non-null object
7 shot_technique 6080 non-null object
8 goal 6080 non-null bool
9 shot_type 6080 non-null object
10 shot_body_part 6080 non-null object
11 shot_one_on_one 330 non-null object
12 shot_open_goal 70 non-null object
13 shot_first_time 1290 non-null object
14 shot_redirect 22 non-null object
15 shot_deflected 64 non-null object
16 shot_follows_dribble 3 non-null object
17 pass_length 4138 non-null float64
18 pass_angle 4138 non-null float64
19 pass_height 4138 non-null object
20 pass_type 960 non-null object
21 pass_switch 323 non-null object
22 pass_through_ball 198 non-null object
23 pass_technique 355 non-null object
24 pass_backheel 14 non-null object
25 pass_cross 754 non-null object
26 counterpress 5 non-null object
27 pass_cut_back 108 non-null object
28 pass_inswinging 76 non-null object
29 pass_straight 26 non-null object
30 pass_outswinging 55 non-null object
dtypes: bool(1), float64(3), int64(1), object(26)
memory usage: 1.4+ MB

Boolean Features

Upon initial review, it is apparent that a number of the extracted features are boolean in nature but are currently stored as object datatypes.
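One way to back up this kind of visual review is to scan the object columns and keep those whose non-null values are all booleans. The sketch below assumes the flag columns hold True or NaN, as the non-null counts above suggest; `find_flag_columns` and the `sample` frame are hypothetical helpers for illustration, not part of the original workflow.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the extract: flag columns hold True or NaN, others hold text.
sample = pd.DataFrame({
    "shot_open_goal": [True, np.nan, np.nan],
    "pass_cross": [np.nan, True, True],
    "shot_technique": ["Normal", "Volley", "Normal"],
}, dtype=object)


def find_flag_columns(df):
    """Return object columns whose non-null values are all Python booleans."""
    flags = []
    for col in df.select_dtypes(include="object").columns:
        values = df[col].dropna()
        # Skip empty columns; require every remaining value to be a bool.
        if len(values) > 0 and values.map(lambda v: isinstance(v, bool)).all():
            flags.append(col)
    return flags


print(find_flag_columns(sample))
```

A list produced this way is still worth reading through by hand, since a column can be boolean in storage without being meaningful as a flag.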

Define boolean features

boolean_features = ['shot_one_on_one',
'shot_open_goal',
'shot_first_time',
'shot_redirect',
'shot_deflected',
'shot_follows_dribble',
'under_pressure_x',
'counterpress',
'pass_switch',
'pass_through_ball',
'pass_backheel',
'pass_cross',
'pass_cut_back',
'pass_inswinging',
'pass_straight',
'pass_outswinging']

Convert boolean features to boolean datatype

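# NaN is truthy, so a direct cast would turn missing flags into True.
# Assuming a missing flag means the event did not occur, fill with False first.
extracted_data[boolean_features] = extracted_data[boolean_features].fillna(False).astype(bool)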
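A point worth checking with this kind of conversion is how missing values are cast. The sketch below uses a toy Series with hypothetical values to compare a direct `astype(bool)` with filling missing values first. It assumes that a missing flag means the event did not occur.

```python
import numpy as np
import pandas as pd

# Toy flag column: True where the event was tagged, NaN where it was not.
flag = pd.Series([True, np.nan, np.nan, True], dtype=object)

# A direct cast treats NaN as truthy, so every row becomes True.
direct = flag.astype(bool)

# Filling missing values with False first keeps untagged rows as False.
filled = flag.fillna(False).astype(bool)

print(direct.tolist())
print(filled.tolist())
```

Comparing the number of True values after conversion with the non-null counts reported by `info()` beforehand is a quick way to confirm that the flags survived the cast.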

Datetime

The second observation is that timestamp_x is currently stored as an object datatype but should be datetime.

Convert timestamp_x datatype to datetime

extracted_data['timestamp_x'] = extracted_data['timestamp_x'].astype(str)
extracted_data['timestamp_x'] = pd.to_datetime(extracted_data['timestamp_x'])
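If the timestamps are clock strings measuring time elapsed within a period (an assumption here, for example `00:04:21.052`), a datetime has to attach some calendar date to each value. An alternative worth considering is a timedelta, which stores the value as a duration. The sketch below is illustrative only; the `timestamps` values are hypothetical and this is not the conversion used in this series.

```python
import pandas as pd

# Hypothetical timestamps in the HH:MM:SS.mmm form assumed above.
timestamps = pd.Series(["00:04:21.052", "00:17:03.500", "00:44:59.999"])

# to_timedelta keeps the values as durations rather than attaching a date.
elapsed = pd.to_timedelta(timestamps)

# Durations convert directly to a numeric feature, such as elapsed seconds.
elapsed_seconds = elapsed.dt.total_seconds()

print(elapsed_seconds.round(3).tolist())
```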

Results

extracted_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6080 entries, 0 to 6079
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 period_x 6080 non-null int64
1 timestamp_x 6080 non-null datetime64[ns]
2 play_pattern_x 6080 non-null object
3 location_x 6080 non-null object
4 under_pressure_x 6080 non-null bool
5 shot_statsbomb_xg 6080 non-null float64
6 shot_end_location 6080 non-null object
7 shot_technique 6080 non-null object
8 goal 6080 non-null bool
9 shot_type 6080 non-null object
10 shot_body_part 6080 non-null object
11 shot_one_on_one 6080 non-null bool
12 shot_open_goal 6080 non-null bool
13 shot_first_time 6080 non-null bool
14 shot_redirect 6080 non-null bool
15 shot_deflected 6080 non-null bool
16 shot_follows_dribble 6080 non-null bool
17 pass_length 4138 non-null float64
18 pass_angle 4138 non-null float64
19 pass_height 4138 non-null object
20 pass_type 960 non-null object
21 pass_switch 6080 non-null bool
22 pass_through_ball 6080 non-null bool
23 pass_technique 355 non-null object
24 pass_backheel 6080 non-null bool
25 pass_cross 6080 non-null bool
26 counterpress 6080 non-null bool
27 pass_cut_back 6080 non-null bool
28 pass_inswinging 6080 non-null bool
29 pass_straight 6080 non-null bool
30 pass_outswinging 6080 non-null bool
dtypes: bool(17), datetime64[ns](1), float64(3), int64(1), object(9)
memory usage: 813.4+ KB
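The reported memory usage also falls, from 1.4+ MB to 813.4+ KB. A likely reason is storage width: an object column holds a pointer per row, while a bool column holds a single byte per row. The sketch below reproduces the effect on a toy column; the `raw` frame is hypothetical and the exact byte counts depend on the platform.

```python
import numpy as np
import pandas as pd

# Toy flag column stored as object, as in the raw extract.
raw = pd.DataFrame({"pass_cross": [True, np.nan] * 500}, dtype=object)

# Same column after missing values become False and the dtype becomes bool.
converted = raw.fillna(False).astype(bool)

# Index excluded so only the column storage is compared.
raw_bytes = raw.memory_usage(index=False).sum()
converted_bytes = converted.memory_usage(index=False).sum()

print(raw_bytes, converted_bytes)
```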

Conclusion

In part three of the data cleaning stage of the data science workflow for creating a soccer expected goals classification model, datatypes were corrected.

First, features which are boolean in nature were changed from object to boolean datatype.

Second, timestamp_x was changed from object to datetime datatype.

With correct datatypes, each feature can be interpreted properly during the eventual modeling process.

Continued

This series on the data science workflow for creating a soccer expected goals classification model continues in part four of the data cleaning process, which covers identifying and replacing missing values.
