Women’s Soccer Expected Goals — Data Extraction — Part-One: Extracting Shots Data
Part one of a series on the data science workflow behind a women’s club soccer expected goals (xG) classification model. This part explains how to use the statsbombpy package to extract shot event data from StatsBomb.
Introduction
Expected Goals (xG)
xG is used to indicate the quality of a shot.
xG, as a metric, measures the likelihood that a shot will result in a goal based on the characteristics of the shot and the play preceding the shot.
xG is measured on a scale from zero to one, with one representing a certain goal. For example, a shot with an xG of 0.5 has a 50% chance of resulting in a goal.
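Because each xG value is a probability, the values for a set of shots can be added together to give the number of goals those shots would be expected to produce. A minimal sketch, using made-up xG values rather than real StatsBomb data:

```python
# Hypothetical xG values for five shots (illustrative only, not StatsBomb data).
shot_xg = [0.5, 0.25, 0.125, 0.0625, 0.0625]

# Each value is the estimated probability that the shot becomes a goal,
# so every value must lie between zero and one.
assert all(0.0 <= xg <= 1.0 for xg in shot_xg)

# Summing the probabilities gives the expected number of goals from these shots.
expected_goals = sum(shot_xg)

print(f"Shots: {len(shot_xg)}, expected goals: {expected_goals:.2f}")
```

Here five shots of varying quality add up to one expected goal, even though no single shot was more likely than not to score.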
Classification Model
Expected goals is calculated with a classification model.
Classification is a predictive modeling problem in which a class label is predicted for a given example of input data.
This project uses a supervised approach: the training data for the model records which shots were goals.
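As a rough illustration of how a binary classifier turns shot characteristics into a probability, the sketch below applies a logistic function to two hypothetical features. The feature names (`distance_m`, `angle_rad`), the coefficients, and the function `shot_probability` are all assumptions made for this example. They are not the features or the model used in this project, where the coefficients would be learned from labeled shots.

```python
import math

# Hypothetical coefficients, chosen by hand for illustration only.
# A trained model would learn these from shots labeled goal / no goal.
INTERCEPT = -1.0
COEF_DISTANCE = -0.1  # farther from goal -> lower probability
COEF_ANGLE = 1.5      # wider view of the goal -> higher probability


def shot_probability(distance_m: float, angle_rad: float) -> float:
    """Return a value between 0 and 1 by passing a linear score through a logistic function."""
    z = INTERCEPT + COEF_DISTANCE * distance_m + COEF_ANGLE * angle_rad
    return 1.0 / (1.0 + math.exp(-z))


# Supervised labels: 1 if the shot was a goal, 0 otherwise (made-up examples).
labeled_shots = [
    {"distance_m": 6.0, "angle_rad": 1.0, "is_goal": 1},
    {"distance_m": 25.0, "angle_rad": 0.3, "is_goal": 0},
]

for shot in labeled_shots:
    p = shot_probability(shot["distance_m"], shot["angle_rad"])
    print(f"label={shot['is_goal']} predicted xG={p:.2f}")
```

The `is_goal` field plays the role of the class label: during training, the model's predicted probabilities are compared against it.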
Data Extraction
Data extraction is the act or process of retrieving data out of…data sources for further data processing
The Data
The data for this project was extracted from StatsBomb’s Open Data.
StatsBomb are a United Kingdom-based football (soccer) data analytics company. StatsBomb provide free access to a segment of their proprietary dataset via GitHub: StatsBomb Open Data
StatsBomb Open Data is organized in JSON files:
Matches
- Folders are organized by competition (league or tournament)
- Files within the folders are organized by season (year) ID
- Files contain nested dictionaries with descriptive data for each individual match
Events
- Files organized by match ID
- Files contain nested dictionaries with descriptive data for each event within each individual match
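The layout above can be sketched in a few lines. The example assumes the repository keeps its `data/matches/<competition_id>/<season_id>.json` and `data/events/<match_id>.json` layout, and it picks shot events out of a small hand-written sample that imitates the nested event structure. The helper names, the sample events, and the match ID are made up for illustration; none of this is real match data.

```python
import json


def matches_path(competition_id: int, season_id: int) -> str:
    """Path of the matches file for one season of one competition."""
    return f"data/matches/{competition_id}/{season_id}.json"


def events_path(match_id: int) -> str:
    """Path of the events file for one match."""
    return f"data/events/{match_id}.json"


# Hand-written sample imitating the nested event dictionaries (not real data).
sample_events_json = """
[
  {"type": {"name": "Pass"}, "location": [60.0, 40.0]},
  {"type": {"name": "Shot"}, "location": [108.0, 36.0],
   "shot": {"outcome": {"name": "Goal"}}},
  {"type": {"name": "Shot"}, "location": [95.0, 50.0],
   "shot": {"outcome": {"name": "Saved"}}}
]
"""

events = json.loads(sample_events_json)

# Keep only shot events, then the shots whose nested outcome is a goal.
shots = [e for e in events if e["type"]["name"] == "Shot"]
goals = [s for s in shots if s["shot"]["outcome"]["name"] == "Goal"]

print(matches_path(37, 42))
print(events_path(12345))  # 12345 is a made-up match ID
print(len(shots), "shots,", len(goals), "goal")
```

statsbombpy hides this file handling, but knowing the nesting helps when reading the flattened column names it returns.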
This project targets data from women’s club soccer matches: primarily the characteristics of shots and, secondarily, the characteristics of the plays that created those shots.
Note: Assessment of the plays creating shots is subjective and based on domain knowledge specific to the sport of soccer.
statsbombpy
The following processes for extracting data from the StatsBomb Open Data rely heavily on statsbombpy.
statsbombpy is a Python package created by StatsBomb that streams StatsBomb data directly into Python. The package gives free access to StatsBomb Open Data, or access to the StatsBomb API with log-in credentials.
Data Extraction
Extract Target Matches Data
Import statsbombpy for extracting StatsBomb data and pandas for combining dataframes
!pip install statsbombpy
import pandas as pd
from statsbombpy import sb
View competitions available through StatsBomb Open Data
competitions_df = sb.competitions()
print('Available Competitions:', competitions_df['competition_name'].unique())

Available Competitions: ['Champions League' "FA Women's Super League" 'FIFA World Cup' 'La Liga'
'NWSL' 'Premier League' "Women's World Cup"]
Isolate and display target women’s competitions, competition ids, and season ids
target_comp_df = competitions_df.loc[competitions_df['competition_gender'] == 'female']
target_comp_ids = target_comp_df['competition_id'].unique()
target_season_ids = target_comp_df['season_id'].unique()
print("Women's Competitions:", target_comp_df['competition_name'].unique(), '\n',
      "Women's competition_ids:", target_comp_ids, '\n',
      "Women's Competition season_ids:", target_season_ids)

Women's Competitions: ["FA Women's Super League" 'NWSL']
Women's competition_ids: [37 49]
Women's Competition season_ids: [42 4 3]
Refine target competitions to women’s club competitions
target_comp_df = competitions_df.loc[competitions_df['competition_id'].isin([37, 49])]
target_comp_ids = target_comp_df['competition_id'].unique()
target_season_ids = target_comp_df['season_id'].unique()
print("Women's Club Competitions:", target_comp_df['competition_name'].unique(), '\n',
      "Women's Club competition_ids:", target_comp_ids, '\n',
      "Women's Club Competition season_ids:", target_season_ids)

Women's Club Competitions: ["FA Women's Super League" 'NWSL']
Women's Club competition_ids: [37 49]
Women's Club Competition season_ids: [42 4 3]
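Each row of the competitions table describes one season of one competition, so the competition and season IDs are easiest to reuse when kept together as pairs rather than as two separate arrays. A minimal sketch of that idea, using plain dictionaries in place of the real dataframe. The first three rows use the IDs from this walkthrough; the last row and the helper `season_pairs` are made up for illustration.

```python
# Rows imitating the competitions table. The first three use the IDs from the
# walkthrough; the last row is a made-up placeholder so the filter has work to do.
competitions = [
    {"competition_id": 37, "season_id": 42, "competition_gender": "female"},
    {"competition_id": 37, "season_id": 4, "competition_gender": "female"},
    {"competition_id": 49, "season_id": 3, "competition_gender": "female"},
    {"competition_id": 999, "season_id": 1, "competition_gender": "male"},
]

TARGET_COMPETITION_IDS = {37, 49}


def season_pairs(rows, competition_ids):
    """Return (competition_id, season_id) pairs for the wanted competitions."""
    return [
        (row["competition_id"], row["season_id"])
        for row in rows
        if row["competition_id"] in competition_ids
    ]


pairs = season_pairs(competitions, TARGET_COMPETITION_IDS)
print(pairs)  # [(37, 42), (37, 4), (49, 3)]
```

With the real dataframe, iterating over `target_comp_df[['competition_id', 'season_id']].itertuples(index=False)` should give the same kind of pairs.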
Create dataframes for the matches in each season of the target competitions
matches_df_37_42 = sb.matches(competition_id=37, season_id=42)
matches_df_37_4 = sb.matches(competition_id=37, season_id=4)
matches_df_49_3 = sb.matches(competition_id=49, season_id=3)
Concatenate target season dataframes into combined dataframe
matches_df = pd.concat([matches_df_37_42, matches_df_37_4, matches_df_49_3], ignore_index=True)
matches_df.head()
print("Number of Women's Club Seasons:", len(target_season_ids))
print("Total Women's Club Matches:", len(matches_df))

Number of Women's Club Seasons: 3
Total Women's Club Matches: 231
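The three `sb.matches` calls follow the same pattern, so they could also be driven by a list of (competition_id, season_id) pairs. The sketch below keeps the fetching function as a parameter so it can be tried without a network connection. `fetch_seasons` and `fake_matches` are hypothetical names introduced for this example, and `fake_matches` only stands in for `sb.matches`.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def fetch_seasons(pairs: Iterable[Tuple[int, int]], fetch: Callable[..., T]) -> List[T]:
    """Call fetch(competition_id=..., season_id=...) once per pair, in order."""
    return [fetch(competition_id=comp_id, season_id=season_id) for comp_id, season_id in pairs]


# The three target seasons used in this walkthrough.
TARGET_SEASONS = [(37, 42), (37, 4), (49, 3)]


def fake_matches(competition_id: int, season_id: int) -> list:
    """Stand-in for sb.matches that returns one made-up record per call."""
    return [{"competition_id": competition_id, "season_id": season_id}]


season_tables = fetch_seasons(TARGET_SEASONS, fake_matches)
print(len(season_tables))  # one table per season: 3
```

With `sb` and `pd` available, `pd.concat(fetch_seasons(TARGET_SEASONS, sb.matches), ignore_index=True)` should build the same combined dataframe as the explicit version.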
Extract Shot Events Data
Create dataframes for shot events in each season of the target competitions
# Variable suffixes are competition_id and season_id (season_id 42 is 2019/2020, season_id 4 is 2018/2019)
shots_df_37_42 = sb.competition_events(
    country='England', division="FA Women's Super League",
    season='2019/2020', gender='female', split=True)['shots']
shots_df_37_4 = sb.competition_events(
    country='England', division="FA Women's Super League",
    season='2018/2019', gender='female', split=True)['shots']
shots_df_49_3 = sb.competition_events(
    country='United States of America', division='NWSL',
    season='2018', gender='female', split=True)['shots']
Concatenate shot events dataframes into combined dataframe
shots_df = pd.concat([shots_df_37_42, shots_df_37_4, shots_df_49_3], ignore_index = True)
Results
print("Total Women's Club Shot Events:", len(shots_df))
print('Total Shot Features:', shots_df.shape[1])

Total Women's Club Shot Events: 6114
Total Shot Features: 37

The extraction process for shot event data yielded 6,114 shot events with 37 features.
shots_df.head()
Continued
Part two continues this series on the data science workflow behind a women’s club soccer expected goals (xG) classification model. It explains how to use the statsbombpy package to extract key pass features for shot events.
More
If you liked this post, please give it a round of applause and follow me. I will be continuing this series with a post for each step in the data science workflow of my Women’s Soccer Expected Goals Model.
Also, follow me on Twitter, where I post regularly about tactical observations for soccer.
I would love to read any feedback you might have in the comments.