Soccer Expected Goals — Data Cleaning — Part Three: Correcting Datatypes

Oct 22, 2021

Part three of the data cleaning stage in the data science workflow for building a soccer expected goals classification model, covering how the datatypes were corrected.

Background

Previous Steps

The data extraction process was previously explained in a series of posts:

Earlier data cleaning steps were explained in:

Soccer Expected Goals — Data Cleaning — Part One: Defining the Target Feature

Explaining the process of defining the target feature for the eventual model, goal.

Soccer Expected Goals — Data Cleaning — Part Two: Irrelevant Data

Explaining the process of dropping irrelevant data.

Expected Goals (xG)

xG measures the quality of a shot.

xG, as a metric, indicates the likelihood that a shot will result in a goal based on the characteristics of the shot and the play preceding the shot.

xG is measured on a scale between zero and one, with one representing a goal. For example, a shot with a 0.5 xG indicates a shot having a 50% chance of being a goal.

What is xG?

How the expected goals (xG) metric is calculated and used within soccer.

Classification Model

The expected goals metric is calculated using a classification model.

Classification is a predictive modeling problem in which a class label is predicted for a given example of input data.

For this project a supervised approach was used: the training data for the model records which shots were goals.
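To make the idea concrete, here is a minimal sketch of a supervised classifier that turns one shot feature into a goal probability. Everything in it is an assumption for illustration: the distances and outcomes are made up, `predict_xg` is a hypothetical helper, and logistic regression is used only because it is simple to show; the model actually used in this project is not described here.

```python
import numpy as np

# Made-up training data: distance to goal in metres and whether the shot scored.
distance = np.array([4.0, 6.0, 8.0, 11.0, 14.0, 18.0, 22.0, 25.0, 30.0, 35.0])
goal = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0], dtype=float)

# Standardise the single feature so plain gradient descent converges quickly.
mean, std = distance.mean(), distance.std()
x = (distance - mean) / std


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


# Fit a one-feature logistic regression with batch gradient descent.
weight, bias = 0.0, 0.0
learning_rate = 0.1
for _ in range(2000):
    predicted = sigmoid(weight * x + bias)
    error = predicted - goal  # gradient of the log loss w.r.t. the linear term
    weight -= learning_rate * np.mean(error * x)
    bias -= learning_rate * np.mean(error)


def predict_xg(shot_distance):
    """Hypothetical helper: estimated goal probability for a shot distance."""
    return float(sigmoid(weight * (shot_distance - mean) / std + bias))


print(round(predict_xg(5.0), 2), round(predict_xg(30.0), 2))
```

The labels (`goal`) are what make this supervised: the model only learns that closer shots score more often because the training data says which shots were goals.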

The Data

The data for this project was extracted from StatsBomb’s Open Data.

StatsBomb is a United Kingdom-based football (soccer) data analytics company. It provides free access to a segment of its proprietary dataset via GitHub: StatsBomb Open Data

Data Cleaning

Data cleaning is the process of detecting and correcting corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.

Wikipedia

Data Cleaning: Correcting Datatypes

Shot events data was previously extracted as part of the data extraction process (see Background — Previous Steps above).

Note: extracted_data contains 6,080 events with 31 features

Inspect current datatypes

extracted_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6080 entries, 0 to 6079
Data columns (total 31 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   period_x              6080 non-null   int64  
 1   timestamp_x           6080 non-null   object 
 2   play_pattern_x        6080 non-null   object 
 3   location_x            6080 non-null   object 
 4   under_pressure_x      1045 non-null   object 
 5   shot_statsbomb_xg     6080 non-null   float64
 6   shot_end_location     6080 non-null   object 
 7   shot_technique        6080 non-null   object 
 8   goal                  6080 non-null   bool   
 9   shot_type             6080 non-null   object 
 10  shot_body_part        6080 non-null   object 
 11  shot_one_on_one       330 non-null    object 
 12  shot_open_goal        70 non-null     object 
 13  shot_first_time       1290 non-null   object 
 14  shot_redirect         22 non-null     object 
 15  shot_deflected        64 non-null     object 
 16  shot_follows_dribble  3 non-null      object 
 17  pass_length           4138 non-null   float64
 18  pass_angle            4138 non-null   float64
 19  pass_height           4138 non-null   object 
 20  pass_type             960 non-null    object 
 21  pass_switch           323 non-null    object 
 22  pass_through_ball     198 non-null    object 
 23  pass_technique        355 non-null    object 
 24  pass_backheel         14 non-null     object 
 25  pass_cross            754 non-null    object 
 26  counterpress          5 non-null      object 
 27  pass_cut_back         108 non-null    object 
 28  pass_inswinging       76 non-null     object 
 29  pass_straight         26 non-null     object 
 30  pass_outswinging      55 non-null     object 
dtypes: bool(1), float64(3), int64(1), object(26)
memory usage: 1.4+ MB

Boolean Features

Upon initial review it is apparent that a number of the features extracted are boolean in nature, but are currently object datatypes.
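The same review can be done programmatically. The sketch below is one possible approach, not the method used in this project: `find_boolean_candidates` is a hypothetical helper, and `toy` is a made-up stand-in for extracted_data. It flags object columns whose non-null values are all booleans.

```python
import numpy as np
import pandas as pd

# Toy stand-in for extracted_data; the values are made up for illustration.
toy = pd.DataFrame({
    "shot_first_time": pd.Series([True, np.nan, True], dtype=object),
    "shot_technique": pd.Series(["Normal", "Volley", "Normal"], dtype=object),
    "pass_cross": pd.Series([np.nan, np.nan, True], dtype=object),
})


def find_boolean_candidates(df):
    """Return object columns whose non-null values are all booleans."""
    candidates = []
    for column in df.select_dtypes(include="object").columns:
        values = df[column].dropna()
        # Skip columns that are entirely missing; they carry no evidence.
        if len(values) == 0:
            continue
        if values.map(lambda v: isinstance(v, (bool, np.bool_))).all():
            candidates.append(column)
    return candidates


boolean_candidates = find_boolean_candidates(toy)
print(boolean_candidates)
```

A check like this is only a starting point; the resulting list should still be read against the data documentation before converting anything.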

Define boolean features

boolean_features = [
    'shot_one_on_one',
    'shot_open_goal',
    'shot_first_time',
    'shot_redirect',
    'shot_deflected',
    'shot_follows_dribble',
    'under_pressure_x',
    'counterpress',
    'pass_switch',
    'pass_through_ball',
    'pass_backheel',
    'pass_cross',
    'pass_cut_back',
    'pass_inswinging',
    'pass_straight',
    'pass_outswinging',
]

Convert boolean features to boolean datatype

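# Treat missing flags as False before casting; astype(bool) alone would cast NaN to True
extracted_data[boolean_features] = extracted_data[boolean_features].fillna(False).astype(bool)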
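One detail worth checking with sparse flag columns like these is how missing values are cast. NaN is truthy in Python, so casting a column of True/NaN values straight to bool, without filling first, marks every row as True. The sketch below shows the difference on a made-up column; it assumes, as the low non-null counts suggest, that a missing value means the flag did not apply.

```python
import numpy as np
import pandas as pd

# Toy flag column shaped like the sparse boolean features above:
# True where the flag was recorded, NaN everywhere else.
flag = pd.Series([True, np.nan, np.nan, True], dtype=object)

# Direct cast: NaN is truthy, so every row becomes True.
naive = flag.astype(bool)

# Filling missing values with False first keeps only the recorded flags.
# (Some pandas versions emit a downcasting FutureWarning on this fillna.)
safe = flag.fillna(False).astype(bool)

print(naive.tolist())  # [True, True, True, True]
print(safe.tolist())   # [True, False, False, True]
```

After either cast the column reports no missing values, so the non-null counts in info() cannot tell the two results apart; comparing `value_counts()` before and after is a quick way to confirm the flags survived.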

Datetime

The second observation is that timestamp_x is currently object datatype, but should be datetime.

Convert timestamp_x datatype to datetime

extracted_data['timestamp_x'] = extracted_data['timestamp_x'].astype(str)
extracted_data['timestamp_x'] = pd.to_datetime(extracted_data['timestamp_x'])
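A datetime64 value always carries a calendar date as well as a time. If timestamp_x holds the time elapsed within a period in an "HH:MM:SS.mmm" form (an assumption; the raw values are not shown here), a timedelta is an alternative representation that avoids the date component. The sketch below uses made-up timestamps and is not the conversion applied in this project.

```python
import pandas as pd

# Hypothetical timestamps in the assumed "HH:MM:SS.mmm" form.
timestamps = pd.Series(["00:00:04.201", "00:12:30.500", "00:44:59.999"])

# Parse as elapsed time since the start of the period, not a calendar date.
elapsed = pd.to_timedelta(timestamps)

# Seconds into the period: a plain float column that a model can consume.
elapsed_seconds = elapsed.dt.total_seconds()

print(elapsed.dtype)
print(elapsed_seconds.tolist())
```

Whether datetime or timedelta is the better fit depends on how the feature is used later, for example whether it is combined with period_x to order events within a match.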

Results

extracted_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6080 entries, 0 to 6079
Data columns (total 31 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   period_x              6080 non-null   int64         
 1   timestamp_x           6080 non-null   datetime64[ns]
 2   play_pattern_x        6080 non-null   object        
 3   location_x            6080 non-null   object        
 4   under_pressure_x      6080 non-null   bool          
 5   shot_statsbomb_xg     6080 non-null   float64       
 6   shot_end_location     6080 non-null   object        
 7   shot_technique        6080 non-null   object        
 8   goal                  6080 non-null   bool          
 9   shot_type             6080 non-null   object        
 10  shot_body_part        6080 non-null   object        
 11  shot_one_on_one       6080 non-null   bool          
 12  shot_open_goal        6080 non-null   bool          
 13  shot_first_time       6080 non-null   bool          
 14  shot_redirect         6080 non-null   bool          
 15  shot_deflected        6080 non-null   bool          
 16  shot_follows_dribble  6080 non-null   bool          
 17  pass_length           4138 non-null   float64       
 18  pass_angle            4138 non-null   float64       
 19  pass_height           4138 non-null   object        
 20  pass_type             960 non-null    object        
 21  pass_switch           6080 non-null   bool          
 22  pass_through_ball     6080 non-null   bool          
 23  pass_technique        355 non-null    object        
 24  pass_backheel         6080 non-null   bool          
 25  pass_cross            6080 non-null   bool          
 26  counterpress          6080 non-null   bool          
 27  pass_cut_back         6080 non-null   bool          
 28  pass_inswinging       6080 non-null   bool          
 29  pass_straight         6080 non-null   bool          
 30  pass_outswinging      6080 non-null   bool          
dtypes: bool(17), datetime64[ns](1), float64(3), int64(1), object(9)
memory usage: 813.4+ KB
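The lower memory figure is consistent with the conversion: an object column typically stores an 8-byte reference per row, while a bool column stores one byte per row. The sketch below reproduces the effect on a made-up frame (`toy` is a stand-in, not the project data) and checks that every converted column is bool.

```python
import numpy as np
import pandas as pd

# Toy stand-in with two sparse flag columns stored as object.
toy = pd.DataFrame({
    "shot_open_goal": pd.Series([True, np.nan, np.nan], dtype=object),
    "pass_cross": pd.Series([np.nan, True, np.nan], dtype=object),
})

# Shallow memory usage, the same accounting info() reports by default.
before = toy.memory_usage().sum()

converted = toy.fillna(False).astype(bool)
after = converted.memory_usage().sum()

# Confirm every column now has the bool dtype.
all_bool = (converted.dtypes == bool).all()

print(before, after, all_bool)
```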

Conclusion

In part three of the data cleaning stage of the data science workflow for building a soccer expected goals classification model, the datatypes were corrected.

First, features which are boolean in nature were changed from object to boolean datatype.

Second, timestamp_x was changed from object to datetime datatype.

With the correct datatypes in place, each feature can be interpreted and processed appropriately during the eventual modeling process.

Continued

This series on the data science workflow for building a soccer expected goals classification model continues in part four of the data cleaning process, which covers identifying and replacing missing values:

Soccer Expected Goals — Data Cleaning — Part Four: Processing Missing Values

Explaining the process of identifying, assessing, and processing missing values.


Written by Wes Swager