Women’s Soccer Expected Goals — Data Extraction

Wes Swager
8 min read · Oct 29, 2021


This post covers the data extraction step of the data science workflow for creating a women’s club soccer expected goals (xG) classification model, which predicts the likelihood that a shot will score.

Please note: the following was previously published as individual posts, one per step of the data extraction process.

Introduction

Expected Goals (xG)

xG is used to indicate the quality of a shot.

xG, as a metric, measures the likelihood that a shot will result in a goal based on the characteristics of the shot and the play preceding the shot.

xG is measured on a scale between zero and one, with one representing a certain goal. For example, a shot with an xG of 0.5 has a 50% chance of resulting in a goal.
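Because each xG value is a probability, adding up the xG of several shots gives the number of goals those shots would be expected to produce on average. A minimal sketch, using made-up xG values rather than output from the model in this project:

```python
# Hypothetical xG values for three shots (illustrative, not real data)
shot_xg = [0.5, 0.25, 0.125]

# Each value is the probability that the shot scores,
# so the sum is the expected number of goals from these shots
expected_goals = sum(shot_xg)

print(expected_goals)
```

A team taking these three shots would be expected to score a little under one goal, even though no single shot is more likely than not to go in.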

Classification Model

The metric of expected goals is calculated through the use of a classification model.

Classification is a predictive modeling problem in which a class label is predicted for a given example of input data. Here, the label is whether or not a shot is a goal.

For this project a supervised approach was used: the training data for the model includes the outcome of each shot, that is, which shots were goals.
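To make the idea concrete, the sketch below maps two shot features to a probability with a logistic function, which is one common way a classifier can produce an xG-style value. It is not the model built in this project: the function name, the two features, and the weights are all placeholders chosen for illustration, and a real model would learn its weights from the labeled shots.

```python
import math

def shot_probability(distance_m, angle_rad, weights=(-0.1, 1.5), bias=-1.0):
    """Map shot features to a goal probability with a logistic function.

    The weights and bias are made-up placeholders; a supervised model
    would learn them from shots labeled goal / no goal.
    """
    w_distance, w_angle = weights
    score = bias + w_distance * distance_m + w_angle * angle_rad
    return 1.0 / (1.0 + math.exp(-score))

# A close, central shot versus a long-range shot from a tight angle
close_shot = shot_probability(distance_m=6, angle_rad=0.9)
long_shot = shot_probability(distance_m=30, angle_rad=0.2)

print(round(close_shot, 3), round(long_shot, 3))
```

With these placeholder weights the closer shot receives the higher probability, which is the kind of relationship a trained model would be expected to recover from the data.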

Data Extraction

“Data extraction is the act or process of retrieving data out of…data sources for further data processing”

Source: Wikipedia

The Data

The data for this project was extracted from StatsBomb’s Open Data.

StatsBomb are a United Kingdom-based football (soccer) data analytics company. StatsBomb provide free access to a segment of their proprietary dataset via GitHub: StatsBomb Open Data

StatsBomb Open Data is organized in JSON files:

Matches

  • Folders are organized by competition (league or tournament)
  • Files within the folders are organized by season (year) ID
  • Files contain nested dictionaries with descriptive data for each individual match

Events

  • Files organized by match ID
  • Files contain nested dictionaries with descriptive data for each event within each individual match
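The nested layout described above can be illustrated with a few lines of standard-library Python. The sample below is hand-written to resemble an events file; the ids and values are invented, and the field names are assumptions that should be checked against the real files.

```python
import json

# Hand-written sample mimicking the nested layout of an events file;
# ids and values are invented for illustration
sample_events_json = """
[
  {"id": "a1", "type": {"name": "Pass"}, "pass": {"length": 12.5}},
  {"id": "b2", "type": {"name": "Shot"},
   "shot": {"outcome": {"name": "Goal"}, "key_pass_id": "a1"}},
  {"id": "c3", "type": {"name": "Shot"},
   "shot": {"outcome": {"name": "Saved"}}}
]
"""

events = json.loads(sample_events_json)

# Keep only shot events and pull nested values out of each dictionary
shots = [event for event in events if event["type"]["name"] == "Shot"]
outcomes = [shot["shot"]["outcome"]["name"] for shot in shots]

print(len(shots), outcomes)
```

Working with the raw files this way is possible, but flattening every nested dictionary by hand quickly becomes tedious.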

For the purposes of this project the relevant data targeted was, primarily, characteristics of shots and, secondarily, characteristics of the plays creating those shots, from women’s club soccer matches.

Note: Assessment of plays creating shots is subjective and based on domain knowledge specific to the sport of soccer

statsbombpy

The following processes of extracting data from the StatsBomb Open Data rely heavily on the use of statsbombpy.

statsbombpy is a Python package created by StatsBomb that streams StatsBomb data directly into Python. The package provides free access to StatsBomb Open Data, or access to their API with log-in credentials.

Data Extraction

Extract Target Matches Data

Import statsbombpy for extracting StatsBomb data and pandas for combining dataframes

!pip install statsbombpy
from statsbombpy import sb
import pandas as pd

View competitions available through StatsBomb Open Data

competitions_df = sb.competitions()
print('Available Competitions:', competitions_df['competition_name'].unique())

Available Competitions: ['Champions League' "FA Women's Super League" 'FIFA World Cup' 'La Liga'
'NWSL' 'Premier League' "Women's World Cup"]

Isolate and display target women’s competitions, competition ids, and season ids

target_comp_df = competitions_df.loc[competitions_df['competition_gender'] == 'female']
target_comp_ids = target_comp_df['competition_id'].unique()
target_season_ids = target_comp_df['season_id'].unique()
print("Women's Competitions:", target_comp_df['competition_name'].unique(), '\n',
      "Women's competition_ids:", target_comp_ids, '\n',
      "Women's Competition season_ids:", target_season_ids)

Women's Competitions: ["FA Women's Super League" 'NWSL']
Women's competition_ids: [37 49]
Women's Competition season_ids: [42 4 3]

Refine target competitions to women’s club competitions

target_comp_df = competitions_df.loc[competitions_df['competition_id'].isin([37, 49])]
target_comp_ids = target_comp_df['competition_id'].unique()
target_season_ids = target_comp_df['season_id'].unique()
print("Women's Club Competitions:", target_comp_df['competition_name'].unique(), '\n',
      "Women's Club competition_ids:", target_comp_ids, '\n',
      "Women's Club Competition season_ids:", target_season_ids)

Women's Club Competitions: ["FA Women's Super League" 'NWSL']
Women's Club competition_ids: [37 49]
Women's Club Competition season_ids: [42 4 3]

Create dataframes for the matches in each season of the target competitions

matches_df_37_42 = sb.matches(competition_id = 37, season_id = 42)
matches_df_37_4 = sb.matches(competition_id = 37, season_id = 4)
matches_df_49_3 = sb.matches(competition_id = 49, season_id = 3)

Concatenate target season dataframes into combined dataframe

matches_df = pd.concat([matches_df_37_42, matches_df_37_4, matches_df_49_3], ignore_index = True)
matches_df.head()

print("Number of Women's Club Seasons:", len(target_season_ids))
print("Total Women's Club Matches:", len(matches_df))

Number of Women's Club Seasons: 3
Total Women's Club Matches: 231
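The per-season calls and the concatenation can also be written as a loop over (competition_id, season_id) pairs, which avoids one variable per season. The sketch below is one possible arrangement, not the code used for this project: collect_matches and fake_matches are hypothetical helpers, and fake_matches stands in for sb.matches so the example runs without downloading anything.

```python
import pandas as pd

def collect_matches(fetch_matches, season_keys):
    """Fetch one dataframe per (competition_id, season_id) pair and combine them."""
    frames = [
        fetch_matches(competition_id=competition_id, season_id=season_id)
        for competition_id, season_id in season_keys
    ]
    return pd.concat(frames, ignore_index=True)

# Stand-in for sb.matches so the sketch runs offline (hypothetical data)
def fake_matches(competition_id, season_id):
    return pd.DataFrame({
        "match_id": [f"{competition_id}-{season_id}-{n}" for n in range(2)],
        "competition_id": competition_id,
        "season_id": season_id,
    })

season_keys = [(37, 42), (37, 4), (49, 3)]
demo_matches_df = collect_matches(fake_matches, season_keys)

print(len(demo_matches_df))
```

With statsbombpy imported, the equivalent call would be collect_matches(sb.matches, season_keys), since sb.matches accepts the same competition_id and season_id keyword arguments.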

Extract Shot Events Data

Create dataframes for shot events in each season of the target competitions

shots_df_37_42 = sb.competition_events(country = 'England', division = "FA Women's Super League", season = '2018/2019', gender = 'female', split = True)['shots']
shots_df_37_4 = sb.competition_events(country = 'England', division = "FA Women's Super League", season = '2019/2020', gender = 'female', split = True)['shots']
shots_df_49_3 = sb.competition_events(country = 'United States of America', division = 'NWSL', season = '2018', gender = 'female', split = True)['shots']

Concatenate shot events dataframes into combined dataframe

shots_df = pd.concat([shots_df_37_42, shots_df_37_4, shots_df_49_3], ignore_index = True)
shots_df.head()

print("Total Women's Club Competition Shot Events:", len(shots_df))
print('Total Shot Features:', shots_df.shape[1])

Total Women's Club Competition Shot Events: 6114
Total Shot Features: 37
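With split = True, each call returns a dictionary keyed by event type, and the code above keeps only the 'shots' entry. If several event types are needed, one option is to make a single call per season and read every required type from the same dictionary, rather than repeating the download for each type. The sketch below illustrates that idea under stated assumptions: collect_event_types and fake_competition_events are hypothetical helpers, and the fake function replaces sb.competition_events so the example runs offline.

```python
import pandas as pd

def collect_event_types(fetch_events, seasons, event_types):
    """Fetch each season once, then combine the requested event types."""
    frames = {event_type: [] for event_type in event_types}
    for season in seasons:
        # One call per season; the returned dict holds every event type
        season_events = fetch_events(split=True, **season)
        for event_type in event_types:
            frames[event_type].append(season_events[event_type])
    return {
        event_type: pd.concat(parts, ignore_index=True)
        for event_type, parts in frames.items()
    }

# Stand-in for sb.competition_events (hypothetical data, no network access)
def fake_competition_events(country, division, season, gender, split):
    return {
        "shots": pd.DataFrame({"id": [f"{season}-shot"]}),
        "passes": pd.DataFrame({"id": [f"{season}-pass-1", f"{season}-pass-2"]}),
    }

seasons = [
    {"country": "England", "division": "FA Women's Super League",
     "season": "2018/2019", "gender": "female"},
    {"country": "England", "division": "FA Women's Super League",
     "season": "2019/2020", "gender": "female"},
]
demo_events = collect_event_types(fake_competition_events, seasons, ["shots", "passes"])

print(len(demo_events["shots"]), len(demo_events["passes"]))
```

Passing sb.competition_events in place of the stand-in should give the same structure with real data, at the cost of one download per season instead of one per season and event type.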

Extract Key Pass Features

As mentioned previously, characteristics of the play creating a shot can be valuable toward the potential quality of the shot. For this reason, features of other event types related to shot events can be valuable and worth assessment.

shot_key_pass_id is a shot event feature that holds the event id of the pass which led directly to the shot (a potential assist, had the shot scored).

Create dataframes for pass events in each season of the target competitions

passes_df_37_42 = sb.competition_events(country = 'England', division = "FA Women's Super League", season = '2018/2019', gender = 'female', split = True)['passes']
passes_df_37_4 = sb.competition_events(country = 'England', division = "FA Women's Super League", season = '2019/2020', gender = 'female', split = True)['passes']
passes_df_49_3 = sb.competition_events(country = 'United States of America', division = 'NWSL', season = '2018', gender = 'female', split = True)['passes']

Concatenate pass events dataframes into combined dataframe

passes_df = pd.concat([passes_df_37_42, passes_df_37_4, passes_df_49_3], ignore_index = True)
passes_df.head()

print("Total Women's Club Competition Pass Events:", len(passes_df))
print('Total Pass Features:', passes_df.shape[1])

Total Women's Club Competition Pass Events: 209122
Total Pass Features: 45

Create a key_pass_events list from shot_key_pass_id values for shot events

key_pass_events = list(shots_df['shot_key_pass_id'])

Search pass events for key passes

passes_to_shots_df = passes_df[passes_df['id'].isin(key_pass_events)]
print("Pass Events Identified as 'shot_key_pass_id' for Shot Events:", len(passes_to_shots_df))

Pass Events Identified as 'shot_key_pass_id' for Shot Events: 4164

Merge the pass data from passes_df into shots_df, matching each shot to the pass identified as its key pass

passes_df2 = passes_df.rename(columns = {'id': 'shot_key_pass_id'})
extracted_data = pd.merge(shots_df, passes_df2, on = ['shot_key_pass_id'], how = 'left')
extracted_data.head()

print('Updated Shot w/ Key Pass Features:', extracted_data.shape[1])

Updated Shot w/ Key Pass Features: 81
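The merged dataframe has 81 columns because the 37 shot features and 45 pass features share the single key column, shot_key_pass_id. Any other column names that appear in both dataframes receive pandas' default _x and _y suffixes. The sketch below, built on tiny hypothetical dataframes rather than the real data, shows how the suffixes argument can label the pass columns explicitly and how validate can confirm that each key matches at most one pass.

```python
import pandas as pd

# Tiny hypothetical stand-ins for shots_df and passes_df
demo_shots = pd.DataFrame({
    "id": ["s1", "s2"],
    "location": [[110, 40], [100, 35]],
    "shot_key_pass_id": ["p1", None],
})
demo_passes = pd.DataFrame({
    "id": ["p1", "p2"],
    "location": [[90, 30], [50, 50]],
    "pass_length": [22.4, 8.1],
})

# Rename the pass id to the shot's key, then left-merge so every shot is kept
key_passes = demo_passes.rename(columns={"id": "shot_key_pass_id"})
merged = pd.merge(
    demo_shots, key_passes,
    on="shot_key_pass_id", how="left",
    suffixes=("", "_key_pass"),   # shot columns keep their names
    validate="many_to_one",       # raises if a pass id is duplicated
)

print(merged.columns.tolist())
```

The shot without a key pass is kept, with missing values in the pass columns, which mirrors the how = 'left' behavior used above.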

Extract Related Events Features

related_events is a shot event feature that identifies, by event id, events other than shot or pass events which are significantly connected with leading to the shot.

In exploring the types of events available in the StatsBomb data, dribble and carry events appear to be the event types most likely to provide valuable data toward describing the quality of the shot.

Note: Assessment of plays creating shots is subjective and based on domain knowledge specific to the sport of soccer

Create a list from related_events values for shot events

related_events = list(shots_df['related_events'])

Dribble Events

Create dataframes for dribble events in each season of the target competitions

dribbles_df_37_42 = sb.competition_events(country = 'England', division = "FA Women's Super League", season = '2018/2019', gender = 'female', split = True)['dribbles']
dribbles_df_37_4 = sb.competition_events(country = 'England', division = "FA Women's Super League", season = '2019/2020', gender = 'female', split = True)['dribbles']
dribbles_df_49_3 = sb.competition_events(country = 'United States of America', division = 'NWSL', season = '2018', gender = 'female', split = True)['dribbles']

Concatenate dribble event dataframes into a combined dataframe

dribbles_df = pd.concat([dribbles_df_37_42, dribbles_df_37_4, dribbles_df_49_3], ignore_index = True)
dribbles_df.head()

print("Total Women's Club Competition Dribble Events:", len(dribbles_df))
print('Total Dribble Features:', dribbles_df.shape[1])

Total Women's Club Competition Dribble Events: 8187
Total Dribble Features: 23

Search dribble events for related_events

dribbles_to_shots_df = dribbles_df[dribbles_df['id'].isin(related_events)]
print("Dribble Events Identified as 'related_events' for Shot Events:", len(dribbles_to_shots_df))

Dribble Events Identified as 'related_events' for Shot Events: 0

No related_events values for shot events match the id feature values of dribble events, therefore no additional data can be extracted or added.
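One thing worth checking before relying on a count of zero is the shape of the related_events column. If each shot stores a list of event ids (an assumption about the data layout, which shots_df['related_events'].head() would confirm or rule out), then list(shots_df['related_events']) is a list of lists, and a membership test compares each dribble id against whole lists rather than individual ids. The sketch below uses small hypothetical dataframes to show one way of flattening such a column before searching.

```python
import pandas as pd

# Hypothetical data: each shot holds a list of related event ids, or nothing
demo_shots = pd.DataFrame({
    "id": ["s1", "s2", "s3"],
    "related_events": [["d1", "g1"], None, ["c1"]],
})
demo_dribbles = pd.DataFrame({"id": ["d1", "d2"]})

# explode() gives one row per id; dropna() removes shots with no related events
flat_ids = set(demo_shots["related_events"].explode().dropna())

# Membership is now tested against individual ids
matches = demo_dribbles[demo_dribbles["id"].isin(flat_ids)]

print(len(matches))
```

If the real column turns out to hold single ids rather than lists, this step changes nothing and the original search stands as written.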

Carry Events

Create dataframes for carry events in each season of the target competitions

carrys_df_37_42 = sb.competition_events(country = 'England', division = "FA Women's Super League", season = '2018/2019', gender = 'female', split = True)['carrys']
carrys_df_37_4 = sb.competition_events(country = 'England', division = "FA Women's Super League", season = '2019/2020', gender = 'female', split = True)['carrys']
carrys_df_49_3 = sb.competition_events(country = 'United States of America', division = 'NWSL', season = '2018', gender = 'female', split = True)['carrys']

Concatenate carry event dataframes into a combined dataframe

carrys_df = pd.concat([carrys_df_37_42, carrys_df_37_4, carrys_df_49_3], ignore_index = True)
carrys_df.head()

print("Total Women's Club Competition Carry Events:", len(carrys_df))
print('Total Carry Features:', carrys_df.shape[1])

Total Women's Club Competition Carry Events: 168439
Total Carry Features: 19

Search carry events for related_events

carrys_to_shots_df = carrys_df[carrys_df['id'].isin(related_events)]
print("Carry Events Identified as 'related_events' for Shot Events:", len(carrys_to_shots_df))

Carry Events Identified as 'related_events' for Shot Events: 0

No related_events values for shot events match the id feature values of carry events, therefore no additional data can be extracted or added.

Review

  • Target matches, from women’s club competitions, were identified and their data extracted
  • Shot event data was extracted from each of the target competition seasons and concatenated into a combined dataframe
  • Key passes were identified and their data extracted and concatenated with the shot events
  • Related events were explored but it was concluded no additional valuable data was available

Results

print('Total Extracted Events:', len(extracted_data))
print('Total Extracted Features:', extracted_data.shape[1])

Total Extracted Events: 6114
Total Extracted Features: 81

The extraction process yielded 6,114 shot events with 81 features, describing the shots or the play preceding the shots.

More

If you liked this post, please give it some applause and follow me. I will be continuing with a series of posts covering each process in the data science workflow of my Women’s Soccer Expected Goals Model.

Also, follow me on Twitter, where I post regularly about tactical observations for soccer.

I would love to read any feedback you might have in the comments.
