How to Build Your Own Expected Goals (xG) Model

Exploring StatsBomb World Cup 2022 Open Data

Alfian Hakim
15 min read · Apr 12, 2023
World Cup 2022 Final shot map and xG provided by our model

A Little Background

It's been a few months since the World Cup 2022 ended, and Statsbomb kindly made their data on this competition freely accessible. As a football fan and a data enthusiast, I was really interested in exploring it, but I couldn't find the right time back then. Recently, I finally started digging into the data and learning the basics of football analytics. At first, I learned how to access Statsbomb data directly through their API, and I was amazed by how much data football generates: the event data from a single match can run to thousands of rows and dozens of columns. This makes Statsbomb data a great source for learning football analytics. While looking for other references, I found mplsoccer, a very useful library for football analytics that can also be used to collect Statsbomb data, and I learned how to use it to visualize passing networks and shot maps. While studying these, I got the idea to build my own expected goals (xG) model using World Cup 2022 shot data, which is how this article came about.

What is Expected Goals (xG)?

If you are a football fan, you are probably familiar with this term, since the metric has been used all around the world in recent years. Expected goals is the estimated probability of a shot resulting in a goal. The purpose of the metric is to measure the quality of a chance, rather than relying on basic counts like shots on target.

As an example, look at Lionel Messi's shot from outside the penalty box against Mexico and compare it to Lautaro Martinez's chance against Australia. Which of the two chances do you think was of better quality?

Messi’s shot from outside the box
Lautaro Martinez’s chance

Lautaro Martinez's, right? According to FotMob, the xG of his chance is 0.35, while the xG of Messi's shot is only 0.09. How do we interpret these values? Simple: if you took 100 shots in the same situation as Lautaro Martinez, you would expect to score about 35 goals.
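In other words, over n identical attempts the expected number of goals is simply n multiplied by the xG of the chance:

E[goals] = n × xG = 100 × 0.35 = 35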

Now, one complication with xG is that every statistics site has its own xG model, which is natural for a metric that has gained so much popularity. The models differ because they are built on different datasets with different features. The Statsbomb xG model, for instance, gives that same Lautaro Martinez chance an xG of 0.25, while Messi's gets 0.03. Well, since everyone has their own opinion of chance quality, why don't we build our own?

Google Colab Notebook

You can access my notebook for this project by clicking the link above.

Collecting Statsbomb World Cup 2022 Open Data

We can collect Statsbomb data with the mplsoccer library; you just need to instantiate Sbopen() from mplsoccer as a parser. World Cup 2022's competition_id and season_id are 43 and 106, respectively. You can also collect other competitions' data provided by Statsbomb; just make sure you read their API documentation.

from mplsoccer import Sbopen

parser = Sbopen()
df_match = parser.match(competition_id=43, season_id=106)
df_match.info()

This df_match dataframe contains all matches played in WC 2022, and every match has its own unique ID. If you look into this dataframe, it appears like this:

Matches data
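If you just want a quick look at the key columns instead of the full dataframe, something like this works (a small sketch; I'm assuming these column names are present in the match dataframe returned by mplsoccer):

# peek at the identifying columns we rely on later
df_match[['match_id', 'home_team_name', 'away_team_name', 'home_score', 'away_score']].head()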

We can access the event data of these matches by their ID. Since we want to use all shots from every match, we will iterate through all these match IDs.

# iterate through all matches to get the events data
df_matches = {}
for id in df_match['match_id']:
    df_matches[id] = {}
    df_matches[id]['event'], df_matches[id]['related'], df_matches[id]['freeze'], df_matches[id]['tactic'] = parser.event(id)

We store each match's event data in a dictionary. Let's look at a sample of the event data, filtered to shots only.

# example events data, filtered to shots only
df_matches[3857288]['event'][df_matches[3857288]['event']['type_name'] == 'Shot'].head()
Shots from event data for match_id = 3857288

This dataframe has many columns; scrolling left and right can be tiring sometimes.
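If the scrolling gets annoying, you can print the column names or select only a handful of shot-related columns (a small sketch; these are the columns we use later in the article):

# list every column in the event dataframe
print(df_matches[3857288]['event'].columns.tolist())

# or keep only a few shot-related columns
df_events = df_matches[3857288]['event']
df_events.loc[df_events['type_name'] == 'Shot', ['x', 'y', 'outcome_name', 'shot_statsbomb_xg']].head()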

Simple Expected Goals Model Features

Now it's time to decide which columns we will use to build our model. For our first try, let's build something simple with only the location variables (x and y). Raw location data alone doesn't tell us much about a shot, so we need to process it into something useful for measuring chance quality. From the location, we can calculate the shot's angle and distance to goal.

The angle of Messi’s shot
The angle of Martinez’s shot

We can calculate the angle between the shot location and the two goal posts with a bit of vector trigonometry later on. First, we'll prepare the data required for our modelling.

import pandas as pd

# we'll take the location, outcome, and also the xG by Statsbomb to compare it to our model later
df_shot = pd.DataFrame(columns=['x', 'y', 'outcome_name', 'shot_statsbomb_xg'])

for id in df_match['match_id']:
    # we take period <= 4 because Statsbomb also records penalty shots from the penalty shoot-out stage, which we won't be using
    # for our first model, we'll only take open play shots because penalty shots tend to have a way higher goal probability
    # we'll use the other shot scenarios in our next model
    mask_shot = (df_matches[id]['event'].type_name == 'Shot') & (df_matches[id]['event'].period <= 4) & (df_matches[id]['event'].sub_type_name == 'Open Play')
    shots_temp = df_matches[id]['event'].loc[mask_shot, ['x', 'y', 'outcome_name', 'shot_statsbomb_xg']]
    df_shot = pd.concat([df_shot, shots_temp]).reset_index(drop=True)

Notice that for this simple xG model, I only take open-play shots, since penalty shots tend to have a much higher xG and we are not using the play-type column in our first model. We get 1,382 open-play shots, which we can visualize with mplsoccer. We also keep the Statsbomb xG for later comparison.

# visualizing shots
from mplsoccer import VerticalPitch

# filter goals / non-goal shots
df_goals = df_shot[df_shot.outcome_name == 'Goal'].copy()
df_non_goal_shots = df_shot[df_shot.outcome_name != 'Goal'].copy()

# setup the pitch
pitch = VerticalPitch(pad_bottom=0.5,  # pitch extends slightly below halfway line
                      half=True,  # half of a pitch
                      goal_type='box',
                      goal_alpha=0.8,  # control the goal transparency
                      pitch_color='#22312b', line_color='#c7d5cc')

fig, ax = pitch.draw(figsize=(12, 10))

# non-goal shots in red
sc1 = pitch.scatter(df_non_goal_shots.x, df_non_goal_shots.y,
                    c='#ba4f45',
                    marker='o',
                    ax=ax)

# goals in gold
sc2 = pitch.scatter(df_goals.x, df_goals.y,
                    c='#ad993c',
                    marker='o',
                    ax=ax)
All World Cup 2022 Open Play Shots

Preparing Data

Now we’ll define the necessary functions to get angle and distance.

import math
import numpy as np

def calculate_angle(x, y):
    # the goal posts are located at (120, 36) and (120, 44) in Statsbomb coordinates
    g0 = [120, 44]
    g1 = [120, 36]
    p = [x, y]

    # vectors from the shot location to each post
    v0 = np.array(g0) - np.array(p)
    v1 = np.array(g1) - np.array(p)

    # angle between the two vectors, i.e. how much of the goal mouth the shooter can see
    angle = math.atan2(np.linalg.det([v0, v1]), np.dot(v0, v1))
    return abs(np.degrees(angle))

def calculate_distance(x, y):
    # horizontal distance to the goal line, vertical distance to the nearest post (0 if between the posts)
    x_dist = 120 - x
    y_dist = 0
    if y < 36:
        y_dist = 36 - y
    elif y > 44:
        y_dist = y - 44
    return math.sqrt(x_dist**2 + y_dist**2)
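As a quick sanity check, the penalty spot sits at roughly (108, 40) in Statsbomb coordinates, i.e. 12 yards from the goal line and centred between the posts, so we should get a distance of 12 and an angle of about 37 degrees:

# sanity check using the penalty spot
print(calculate_distance(108, 40))  # 12.0
print(calculate_angle(108, 40))     # ~36.9 degrees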

With a few lines, we got our dataset ready for modelling.

df_shot['angle'] = df_shot.apply(lambda row: calculate_angle(row['x'], row['y']), axis=1)
df_shot['distance'] = df_shot.apply(lambda row: calculate_distance(row['x'], row['y']), axis=1)
# we'll create a new column to flag whether the shot was a goal or not
df_shot['goal'] = df_shot.apply(lambda row: 1 if row['outcome_name'] == 'Goal' else 0, axis=1)

# numeric_only=True skips the non-numeric outcome_name column
df_shot.groupby('goal').mean(numeric_only=True)
Mean values by goal or not

We can see that shots resulting in goals have a wider angle and a shorter distance on average, which confirms our intuition. Next, we'll visualize this as scatter plots with regression lines using Altair.

import altair as alt

fig = alt.Chart(df_shot).mark_point().encode(
    x='angle', y='goal')

fig + fig.transform_regression('angle', 'goal').mark_line()
Regression line between angle and goal
fig = alt.Chart(df_shot).mark_point().encode(
    x='distance', y='goal')

fig + fig.transform_regression('distance', 'goal').mark_line()
Regression line between distance and goal

Modelling

We will build models with two different algorithms: linear regression and logistic regression. Our final model will use logistic regression, because linear regression isn't suitable here; I will explain why later. Let's build both models and evaluate them with the R2 score.

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

model_names = ['Linear', 'Logistic']
models = {}
models['Linear'] = {}
models['Linear']['model'] = LinearRegression()
models['Logistic'] = {}
models['Logistic']['model'] = LogisticRegression()

X = df_shot[['angle', 'distance']]
y = df_shot['goal']

for mod in model_names:
    models[mod]['model'].fit(X, y)
    if mod == 'Logistic':
        # for logistic regression, take the predicted probability of the positive class (goal)
        models[mod]['y_pred'] = models[mod]['model'].predict_proba(X)[:, 1]
    else:
        models[mod]['y_pred'] = models[mod]['model'].predict(X)

    models[mod]['r2_score'] = metrics.r2_score(y, models[mod]['y_pred'])
    print("R2 of model {}: {}".format(mod, models[mod]['r2_score']))
R2 Score of our 2 models

Linear regression has a slightly higher score. For reference, the R2 score of the Statsbomb model is around 0.20, which is expected since their model has far more features than ours. Let's store the xG values in our dataframe.

df_shot['xG_Linear'] = models['Linear']['y_pred']
df_shot['xG_Logistic'] = models['Logistic']['y_pred']

Then we visualize these values as scatter plots.

Scatter plot of Linear xG and angle
Scatter plot of Linear xG and distance
Scatter plot of Logistic xG and angle
Scatter plot of Logistic xG and distance
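These plots use the same Altair pattern as before; for example, the chart of Logistic xG against angle can be reproduced with something like this (a sketch using the columns we just stored):

alt.Chart(df_shot).mark_point().encode(x='angle', y='xG_Logistic')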

Why We Can’t Use the Linear Model

  • A linear regression fit is a straight line, which means at some point its predictions become negative.
  • You can see some negative xG predictions in the plots above, which make no sense since there is no such thing as a negative probability.
  • A probability must lie between 0 and 1, so we can't use the linear regression model.
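A quick way to confirm this on our own data (a small sketch, using the xG_Linear column we just stored):

# count the shots the linear model assigns an impossible negative xG
n_negative = (df_shot['xG_Linear'] < 0).sum()
print("Shots with negative linear xG:", n_negative)
print("Lowest linear xG:", round(df_shot['xG_Linear'].min(), 3))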

Evaluation

A common use case for an xG model is to sum the xG of all shots in a match and compare it to the actual score. Here, we'll also compare it to Statsbomb's xG.

We'll evaluate it on the semi-final between Argentina and Croatia, which has match_id 3869519. First we define a function to calculate xG, then we take the match's shot data.

# define a function to calculate xG with our logistic regression model
def calculate_xg(x, y):
    angle = calculate_angle(x, y)
    distance = calculate_distance(x, y)
    # build the feature row with the same column names used when fitting
    X = pd.DataFrame([[angle, distance]], columns=['angle', 'distance'])
    xg = models['Logistic']['model'].predict_proba(X)[:, 1][0]
    return xg

df_evaluate = df_matches[3869519]['event'][df_matches[3869519]['event']['type_name'] == 'Shot'].copy()

# take only open play shots from regular and extra time
evaluate_mask = (df_evaluate.type_name == 'Shot') & (df_evaluate.period <= 4) & (df_evaluate.sub_type_name == 'Open Play')
df_evaluate = df_evaluate[evaluate_mask].copy()

# calculate xG per shot
df_evaluate['our_xg'] = df_evaluate.apply(lambda row: calculate_xg(row['x'], row['y']), axis=1)

for team in df_evaluate.team_name.unique():
    df_team = df_evaluate[df_evaluate.team_name == team]
    actual_goal = len(df_team[df_team.outcome_name == 'Goal'])
    sum_xg = df_team.our_xg.sum()
    sum_xg_sb = df_team.shot_statsbomb_xg.sum()
    print(team)
    print("Actual open play goal: " + str(actual_goal))
    print("Expected open play goal: " + str(round(sum_xg, 2)))
    print("Expected open play goal by Statsbomb: " + str(round(sum_xg_sb, 2)))
Croatia vs Argentina open play xG

According to our model, the two teams' xG totals differ by only 0.12 (0.84 vs 0.96), not a very big gap, while the Statsbomb model sees a much bigger difference (0.4 vs 1.08). Argentina outperformed their xG under both models; they really took their chances effectively. Now that we know how to calculate xG for one match, we'll calculate it for every match of the World Cup and store the results in one dataframe.

# let's try to evaluate it on all matches and store the results in one dataframe
df_summary = df_match[['match_id', 'home_team_name', 'away_team_name']].copy()

home_open_play_goal = []
home_open_play_xg = []
home_open_play_xg_sb = []

away_open_play_goal = []
away_open_play_xg = []
away_open_play_xg_sb = []

for i, id in enumerate(df_match.match_id):
    df_evaluate = df_matches[id]['event'][df_matches[id]['event']['type_name'] == 'Shot']

    # take only open play
    evaluate_mask = (df_evaluate.type_name == 'Shot') & (df_evaluate.period <= 4) & (df_evaluate.sub_type_name == 'Open Play')
    df_evaluate = df_evaluate[evaluate_mask].copy()

    # calculate xG per shot
    df_evaluate['our_xg'] = df_evaluate.apply(lambda row: calculate_xg(row['x'], row['y']), axis=1)

    # home team
    df_home = df_evaluate[df_evaluate.team_name == df_match['home_team_name'][i]]
    home_open_play_goal.append(len(df_home[df_home.outcome_name == 'Goal']))
    home_open_play_xg.append(df_home.our_xg.sum())
    home_open_play_xg_sb.append(df_home.shot_statsbomb_xg.sum())

    # away team
    df_away = df_evaluate[df_evaluate.team_name == df_match['away_team_name'][i]]
    away_open_play_goal.append(len(df_away[df_away.outcome_name == 'Goal']))
    away_open_play_xg.append(df_away.our_xg.sum())
    away_open_play_xg_sb.append(df_away.shot_statsbomb_xg.sum())

df_summary['home_open_play_goal'] = home_open_play_goal
df_summary['home_open_play_xg'] = home_open_play_xg
df_summary['home_open_play_xg_sb'] = home_open_play_xg_sb

df_summary['away_open_play_goal'] = away_open_play_goal
df_summary['away_open_play_xg'] = away_open_play_xg
df_summary['away_open_play_xg_sb'] = away_open_play_xg_sb
Matches’ open play actual goals, xG, and xG by Statsbomb

From this dataframe, you can see the gap between xG and actual goals for each match. If we take the mean of our model's xG and of Statsbomb's, our model tends to give higher values.
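That comparison takes one line (a small sketch over the df_summary columns we just filled):

# average open play xG per match: our model vs Statsbomb
df_summary[['home_open_play_xg', 'home_open_play_xg_sb', 'away_open_play_xg', 'away_open_play_xg_sb']].mean()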

Now let's take a look at our xG for some iconic shots from the World Cup. I'll pick Randal Kolo Muani's last-minute chance in the final and the Bruno Fernandes-Cristiano Ronaldo 'hair of god' goal. We take the x and y of those two shots by finding the id of each match (you can see the code in my notebook) and get the xG by running the calculate_xg() function.

Randal Kolo Muani’s last minute chance
Bruno Fernandes-Cristiano Ronaldo ‘hair of god’

The location of Kolo Muani’s shot is (103.7, 45.1) while Bruno’s is (99.4, 14.6).
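Plugging those coordinates into the calculate_xg() function defined earlier gives the values below (a sketch; the exact numbers depend on the fitted model):

kolo_muani_xg = calculate_xg(103.7, 45.1)
bruno_xg = calculate_xg(99.4, 14.6)
print("Kolo Muani xG:", round(kolo_muani_xg, 4))
print("Bruno xG:", round(bruno_xg, 4))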

Kolo Muani’s xG according to our model
Bruno’s xG according to our model

Our model gives a relatively low xG of just 8.05% to Kolo Muani's chance, whereas the Statsbomb model rates it at 27.75%. For Bruno's chance, our model gives a higher xG, 2.07%, compared to Statsbomb's 0.78%.

A More Advanced xG Model

Now that we have finished our very first xG model with simple features, we will next build a more advanced xG model using additional features available in this Statsbomb data. The features I can think of are:

  • Type of play (open play, penalty, free kick, corner)
  • Body part (header, preferable side)
  • Whether the shot was taken under pressure
  • Shot technique

We’ll collect this data with a similar approach to our previous shot data collection.

new_features = ['x', 'y', 'outcome_name', 'sub_type_name', 'body_part_name', 'under_pressure', 'technique_name', 'shot_statsbomb_xg']
df_shot = pd.DataFrame(columns=new_features)

for id in df_match['match_id']:
    mask_shot = (df_matches[id]['event'].type_name == 'Shot') & (df_matches[id]['event'].period <= 4)
    shots_temp = df_matches[id]['event'].loc[mask_shot, new_features]
    df_shot = pd.concat([df_shot, shots_temp]).reset_index(drop=True)

Preparing Data

We'll apply the same functions to get angle and distance. Then we'll define a function to check whether the shot comes from the player's preferable side. What I mean by preferable side is this: if a right-footed player (often a left winger, like Jack Grealish) takes a shot from the left attacking side, he has a better chance of scoring with his right foot because he can curl a finesse shot instead of hitting it with his left. We'll also create a new column flagging whether the shot is a header. For under pressure, we already have a column; we just need to fill the null values with 0. For types of play and techniques, we'll use one-hot encoding. With the following lines, our data is ready for modelling.

df_shot['angle'] = df_shot.apply(lambda row: calculate_angle(row['x'], row['y']), axis=1)
df_shot['distance'] = df_shot.apply(lambda row: calculate_distance(row['x'], row['y']), axis=1)

def is_preferable_side(y, body_part_name):
    # what I mean by preferable side: a right-footed player shooting from the left side of the pitch (usually a left winger)
    # can hit a finesse right-footed shot, which I think has a bigger probability of scoring than a left-footed one
    preferable_side = 0
    side = 'center'
    if y < 40:
        side = 'left'
    elif y > 40:
        side = 'right'

    if ((side == 'left') & (body_part_name == 'Right Foot')) | ((side == 'right') & (body_part_name == 'Left Foot')):
        preferable_side = 1
    return preferable_side

df_shot['preferable_side'] = df_shot.apply(lambda row: is_preferable_side(row['y'], row['body_part_name']), axis=1)

df_shot['header'] = df_shot.apply(lambda row: 1 if row['body_part_name'] == 'Head' else 0, axis=1)

df_shot['under_pressure'] = df_shot['under_pressure'].fillna(0)
df_shot['under_pressure'] = df_shot['under_pressure'].astype(int)

# one-hot encoding for techniques and sub types
df_shot = pd.get_dummies(df_shot, columns=['technique_name'])
df_shot = pd.get_dummies(df_shot, columns=['sub_type_name'])

df_shot['goal'] = df_shot.apply(lambda row: 1 if row['outcome_name'] == 'Goal' else 0, axis=1)

Modelling

Since we are no longer limited to open-play shots, we now have more rows, 1,453 shots in total. We'll build the model with logistic regression and evaluate it with the R2 score again.

X_cols = ['under_pressure', 'angle', 'distance',
          'preferable_side', 'header', 'technique_name_Backheel',
          'technique_name_Diving Header', 'technique_name_Half Volley',
          'technique_name_Lob', 'technique_name_Normal',
          'technique_name_Overhead Kick', 'technique_name_Volley',
          'sub_type_name_Corner', 'sub_type_name_Free Kick',
          'sub_type_name_Open Play', 'sub_type_name_Penalty']

X = df_shot[X_cols]
y = df_shot['goal']

adv_model = LogisticRegression()
adv_model.fit(X, y)
y_pred = adv_model.predict_proba(X)[:, 1]
metrics.r2_score(y, y_pred)
R2 Score of our new model

We managed to improve the R2 score; it's still below the Statsbomb model, but not bad for our second attempt. Let's store the predicted values in our dataframe and look at the correlation between xG and the features.

df_shot['xG_adv'] = y_pred

corr_cols = ['under_pressure', 'angle', 'distance',
             'preferable_side', 'header', 'technique_name_Backheel',
             'technique_name_Diving Header', 'technique_name_Half Volley',
             'technique_name_Lob', 'technique_name_Normal',
             'technique_name_Overhead Kick', 'technique_name_Volley',
             'sub_type_name_Corner', 'sub_type_name_Free Kick',
             'sub_type_name_Open Play', 'sub_type_name_Penalty',
             'xG_adv']

df_shot[corr_cols].corr().iloc[:, -1].sort_values()
Features’ correlation to xG

It turns out my hypothesis about the preferable side was wrong; its correlation with xG is negative, meaning a left-footed shot from the left side is actually more likely to score than a right-footed one. The penalty scenario has a high correlation, which confirms that penalties have a much higher chance of being scored than open-play shots.
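Correlations with our predicted xG are only a rough guide; as an extra sketch, we could also inspect the fitted logistic regression coefficients directly (these are on the log-odds scale, not probabilities):

# signed contribution of each feature on the log-odds scale
coef_summary = pd.Series(adv_model.coef_[0], index=X_cols).sort_values()
print(coef_summary)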

Evaluation

Now we'll calculate xG for every match, just like before, by defining a function that calculates it for a shot row and then storing the results in one dataframe.

# define a function to calculate xG with the advanced model
def calculate_xg_adv(row):
    under_pressure = 0 if np.isnan(row['under_pressure']) else 1
    angle = calculate_angle(row['x'], row['y'])
    distance = calculate_distance(row['x'], row['y'])
    preferable_side = is_preferable_side(row['y'], row['body_part_name'])
    header = 1 if row['body_part_name'] == 'Head' else 0

    # one-hot encode the technique and play type by hand
    technique_name = {}
    sub_type_name = {}
    technique_name['Backheel'] = technique_name['Diving Header'] = 0
    technique_name['Half Volley'] = technique_name['Lob'] = 0
    technique_name['Normal'] = technique_name['Overhead Kick'] = 0
    technique_name['Volley'] = sub_type_name['Corner'] = 0
    sub_type_name['Free Kick'] = sub_type_name['Open Play'] = 0
    sub_type_name['Penalty'] = 0
    technique_name[row['technique_name']] = 1
    sub_type_name[row['sub_type_name']] = 1

    X = [[under_pressure, angle, distance, preferable_side, header,
          technique_name['Backheel'], technique_name['Diving Header'],
          technique_name['Half Volley'], technique_name['Lob'],
          technique_name['Normal'], technique_name['Overhead Kick'],
          technique_name['Volley'], sub_type_name['Corner'],
          sub_type_name['Free Kick'], sub_type_name['Open Play'],
          sub_type_name['Penalty']]]
    xg = adv_model.predict_proba(X)[:, 1][0]
    return xg

# let's try to evaluate it on all matches and store the results in one dataframe
df_summary = df_match[['match_id', 'home_team_name', 'away_team_name']].copy()

home_goal = []
home_xg = []
home_xg_sb = []

away_goal = []
away_xg = []
away_xg_sb = []

for i, id in enumerate(df_match.match_id):
    df_evaluate = df_matches[id]['event'][df_matches[id]['event']['type_name'] == 'Shot']

    # this time we keep all shot types (not just open play), excluding only the penalty shoot-outs
    evaluate_mask = (df_evaluate.type_name == 'Shot') & (df_evaluate.period <= 4)
    df_evaluate = df_evaluate[evaluate_mask].copy()

    # calculate xG per shot
    df_evaluate['our_xg'] = df_evaluate.apply(lambda row: calculate_xg_adv(row), axis=1)

    # home team
    df_home = df_evaluate[df_evaluate.team_name == df_match['home_team_name'][i]]
    home_goal.append(len(df_home[df_home.outcome_name == 'Goal']))
    home_xg.append(df_home.our_xg.sum())
    home_xg_sb.append(df_home.shot_statsbomb_xg.sum())

    # away team
    df_away = df_evaluate[df_evaluate.team_name == df_match['away_team_name'][i]]
    away_goal.append(len(df_away[df_away.outcome_name == 'Goal']))
    away_xg.append(df_away.our_xg.sum())
    away_xg_sb.append(df_away.shot_statsbomb_xg.sum())

df_summary['home_goal'] = home_goal
df_summary['home_xg'] = home_xg
df_summary['home_xg_sb'] = home_xg_sb

df_summary['away_goal'] = away_goal
df_summary['away_xg'] = away_xg
df_summary['away_xg_sb'] = away_xg_sb
Matches’ actual goals, our advanced xG, xG by Statsbomb

If we calculate the mean of our xG values, our model still gives higher values than Statsbomb's.
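We can also compare tournament-wide totals against the actual goal count (a small sketch over the df_summary columns above):

total_goals = df_summary['home_goal'].sum() + df_summary['away_goal'].sum()
total_xg = df_summary['home_xg'].sum() + df_summary['away_xg'].sum()
total_xg_sb = df_summary['home_xg_sb'].sum() + df_summary['away_xg_sb'].sum()
print("Actual goals:", total_goals, "| our xG:", round(total_xg, 1), "| Statsbomb xG:", round(total_xg_sb, 1))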

Teams Who Overperformed and Underperformed

One thing we can do with xG stats is identify which teams finished well (actual goals > xG) and which finished poorly (actual goals < xG). We achieve that with a simple aggregation by team and then subtracting xG from actual goals.

new_features = ['team_name', 'x', 'y', 'outcome_name', 'sub_type_name', 'body_part_name', 'under_pressure', 'technique_name', 'shot_statsbomb_xg']
df_shot_team = pd.DataFrame(columns=new_features)

for id in df_match['match_id']:
    mask_shot = (df_matches[id]['event'].type_name == 'Shot') & (df_matches[id]['event'].period <= 4)
    shots_temp = df_matches[id]['event'].loc[mask_shot, new_features]
    df_shot_team = pd.concat([df_shot_team, shots_temp]).reset_index(drop=True)

df_shot_team['our_xg'] = df_shot_team.apply(lambda row:calculate_xg_adv(row), axis=1)
df_shot_team['goal'] = df_shot_team.apply(lambda row:1 if row['outcome_name'] == 'Goal' else 0, axis=1)
team_summary = df_shot_team.groupby('team_name')[['our_xg', 'shot_statsbomb_xg', 'goal']].sum().reset_index()

team_summary['difference'] = team_summary['goal']-team_summary['our_xg']

# overperformed
team_summary.sort_values('difference', ascending=False).head(10)
Teams who overperformed
# underperformed
team_summary.sort_values('difference').head(10)
Teams who underperformed

According to our xG model, the Netherlands were the biggest overachievers, with a 4.9-goal gap between actual goals and xG; they scored 10 goals when they were only expected to score 5.08. At the other end, Belgium and Canada were the two unluckiest teams at this World Cup; each had an xG above 4.5 but only managed to score once, with Brazil third on that list. Belgium's place at the top reminds me of those Lukaku misses against Croatia, and Livakovic's performance against Brazil in the quarter-final surely played a big part in Brazil's inclusion.

Visualizing Shot Map with xG

For the last part, I will show you how the image at the top of this article was made. We take the shot data from the final, plot it on an mplsoccer pitch, and then add the text.

from mplsoccer import Pitch

mask_argentina = (df_matches[3869685]['event'].type_name == 'Shot') & (df_matches[3869685]['event'].period <= 4) & (df_matches[3869685]['event'].team_name == 'Argentina')
mask_france = (df_matches[3869685]['event'].type_name == 'Shot') & (df_matches[3869685]['event'].period <= 4) & (df_matches[3869685]['event'].team_name == 'France')
df_argentina = df_matches[3869685]['event'][mask_argentina].copy()
df_france = df_matches[3869685]['event'][mask_france].copy()

df_france['our_xg'] = df_france.apply(lambda row: calculate_xg_adv(row), axis=1)
df_argentina['our_xg'] = df_argentina.apply(lambda row: calculate_xg_adv(row), axis=1)

# setup a full pitch in Statsbomb coordinates
pitch = Pitch(pitch_type='statsbomb')
fig, ax = pitch.draw(figsize=(12, 8))

# masks to split goals / non-goal shots per team
mask_france_goal = (df_france['outcome_name'] == 'Goal')
mask_france_no_goal = (df_france['outcome_name'] != 'Goal')
mask_argentina_goal = (df_argentina['outcome_name'] == 'Goal')
mask_argentina_no_goal = (df_argentina['outcome_name'] != 'Goal')

# plot france non-goal shots (coordinates flipped so France attack the opposite goal)
sc1 = pitch.scatter(120 - df_france[mask_france_no_goal].x, 80 - df_france[mask_france_no_goal].y,
                    s=(df_france[mask_france_no_goal].our_xg * 1000),  # marker size scales with xG
                    c='#002153',
                    marker='o',
                    ax=ax, alpha=0.8)

# plot france goals
sc2 = pitch.scatter(120 - df_france[mask_france_goal].x, 80 - df_france[mask_france_goal].y,
                    s=(df_france[mask_france_goal].our_xg * 1000),
                    edgecolors='#002153',
                    linewidth=0.6,
                    c='white',
                    marker='football',
                    ax=ax, alpha=0.8)

# plot argentina non-goal shots
sc3 = pitch.scatter(df_argentina[mask_argentina_no_goal].x, df_argentina[mask_argentina_no_goal].y,
                    s=(df_argentina[mask_argentina_no_goal].our_xg * 1000),
                    c='#6bade4',
                    marker='o',
                    ax=ax, alpha=0.8)

# plot argentina goals
sc4 = pitch.scatter(df_argentina[mask_argentina_goal].x, df_argentina[mask_argentina_goal].y,
                    s=(df_argentina[mask_argentina_goal].our_xg * 1000),
                    edgecolors='#6bade4',
                    linewidth=0.6,
                    c='white',
                    marker='football',
                    ax=ax, alpha=0.8)

ax.text(x=5, y=8, s='France\nShots',
        size=30,
        color=pitch.line_color,
        va='center', ha='left', weight='bold')
ax.text(x=5, y=15, s='Total Shots: {}, Total xG: {}, Actual Goals: {}'.format(len(df_france),
                                                                              round(df_france.our_xg.sum(), 2),
                                                                              df_france[mask_france_goal].shape[0]),
        size=10,
        color=pitch.line_color,
        va='center', ha='left')

ax.text(x=115, y=8, s='Argentina\nShots',
        size=30,
        color=pitch.line_color,
        va='center', ha='right', weight='bold')
ax.text(x=115, y=15, s='Total Shots: {}, Total xG: {}, Actual Goals: {}'.format(len(df_argentina),
                                                                                round(df_argentina.our_xg.sum(), 2),
                                                                                df_argentina[mask_argentina_goal].shape[0]),
        size=10,
        color=pitch.line_color,
        va='center', ha='right')
World Cup Final shot map visualization

Closing

If you open up my notebook, you'll notice that at the bottom I have also built an expected goals on target (xGOT), or post-shot expected goals (PS-xG), model. I'm planning to write about that after this article, and I hope you'll look forward to it. Well, that's it for my very first article on football analytics. I'm still a newbie in this field, so I'd really appreciate any feedback to help me improve my football analysis and my writing. Thanks for reading; cheers!
