How to Build Your Own Expected Goals on Target (xGOT) Model
Exploring StatsBomb World Cup 2022 Open Data
As promised in last month's article about xG, I will now discuss how to build an xGOT model. This article continues my exploration of the StatsBomb World Cup 2022 Open Data.
What is Expected Goals on Target (xGOT)?
Expected Goals on Target (xGOT) is an extension of xG that focuses specifically on shots that are on target. While xG represents the probability that a shot results in a goal based on the situation when the shot is taken, xGOT is based on the situation after the shot is taken, specifically where the shot ends up. This is why another term for xGOT is Post-Shot Expected Goals (PSxG). To better understand this concept, let’s consider two different goals from the World Cup 2022.
For example, let’s analyze Messi’s goal against Mexico, where the ball ended up in the bottom right corner, making it very difficult for Ochoa to save. According to FotMob, the xGOT value for this shot is 0.47. Now, let’s compare it to Saka’s second goal against Iran, where the ball ended up very close to the bottom middle of the goal. The xGOT value for this shot is only 0.12.
Since the xGOT value depends on the placement of the shot, this metric is also used to assess a goalkeeper’s shot-stopping ability. For a goalkeeper, it is easier to save Saka’s shot than Messi’s shot. This is why Messi’s xGOT value is higher than Saka’s. The xGOT metric assigns higher values to shots that are well-placed, such as in the corners of the goal, as opposed to shots that end up in the center where goalkeepers can easily catch them. Now, I hope you have a clear understanding of the xGOT model concept.
Google Colab Notebook
You can access my notebook for this project by clicking the link above.
Take a Look at StatsBomb Events Data
As you may already know, we will only consider shots that are on target. In order to determine whether a shot is on target or not, we will be using the columns ‘end_y’ and ‘end_z’. Additionally, we need to include the ‘outcome_name’ column to determine the outcome of the shot.
# we'll take the location where the shots ended
# periods 1-4 cover regular and extra time; period 5 (penalty shootout) is excluded
a = pd.DataFrame(columns=['end_y', 'end_z', 'outcome_name'])
for id in df_match['match_id']:
    mask_shot = (df_matches[id]['event'].type_name == 'Shot') & (df_matches[id]['event'].period <= 4)
    shots_temp = df_matches[id]['event'].loc[mask_shot, ['end_y', 'end_z', 'outcome_name']]
    a = pd.concat([a, shots_temp]).reset_index(drop=True)
By executing the code above, we will store the shots in the variable ‘a’. Now, in order to determine whether a shot is on target or not, we need to establish the precise boundaries for the y and z coordinates, which correspond to the goal posts and the crossbar. To determine these boundaries, we will examine the values in the ‘outcome_name’ column, as they indicate whether shots hit the post or crossbar.
a.outcome_name.unique()
Shots that hit the post or crossbar will have an ‘outcome_name’ of “Post”. Therefore, we will examine these specific shots to analyze their characteristics.
a[a.outcome_name == 'Post']
In my previous article, I determined that the goal posts are located at ‘end_y’ values of 36 and 44. Therefore, for a shot to be considered on target, its ‘end_y’ value must fall within this range. The crossbar location was less clear-cut: in the data, the lowest ‘end_z’ value for a shot that hits the crossbar is 2.8 and the highest is 3.0. I will therefore use 2.8 as the threshold for the crossbar height.
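One way to read the crossbar height off the data is to look at the extremes of the “Post” shots. The sketch below does this in plain Python on a handful of made-up records (the values are illustrative, not taken from the StatsBomb data). The helper name `crossbar_range` is hypothetical, and it assumes that a “Post” shot ending strictly between the two posts must have hit the crossbar.

```python
# Illustrative records only: these values are made up, not StatsBomb data.
shots = [
    {"end_y": 36.0, "end_z": 1.1, "outcome_name": "Post"},   # left post
    {"end_y": 40.3, "end_z": 2.8, "outcome_name": "Post"},   # crossbar
    {"end_y": 42.1, "end_z": 3.0, "outcome_name": "Post"},   # crossbar
    {"end_y": 44.0, "end_z": 0.4, "outcome_name": "Post"},   # right post
    {"end_y": 39.5, "end_z": 0.9, "outcome_name": "Saved"},
]


def crossbar_range(shots, left_post=36.0, right_post=44.0):
    """Return (min, max) end_z of 'Post' shots that ended strictly between the posts."""
    heights = [
        s["end_z"]
        for s in shots
        if s["outcome_name"] == "Post" and left_post < s["end_y"] < right_post
    ]
    return min(heights), max(heights)


lowest, highest = crossbar_range(shots)
print(lowest, highest)
```

Taking the lowest value as the threshold is the conservative choice: anything at or below it is treated as inside the frame.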
Filtering Only the Shots on Target
Now, to filter only the shots on target, we will apply the following conditions:
- The ‘end_y’ value must be between 36 and 44.
- The ‘end_z’ value must be less than or equal to 2.8, the crossbar threshold, so shots that go over the bar are excluded.
By applying these conditions, we will obtain the shots that meet the criteria for being on target.
mask_on_target = (a.end_y <= 44) & (a.end_y >= 36) & (a.end_z <= 2.8)
a = a[mask_on_target].reset_index(drop=True)
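A side effect of this mask is worth spelling out: a comparison against a missing value (NaN) is always False, so any shot without a recorded end height is dropped as well. The plain-Python sketch below makes that rule explicit for a single shot; `is_on_target` is a hypothetical helper, and the thresholds are the ones chosen in this article.

```python
import math

GOAL_LEFT, GOAL_RIGHT, CROSSBAR = 36.0, 44.0, 2.8  # thresholds used in this article


def is_on_target(end_y, end_z):
    """True if the shot ended inside the goal frame; missing coordinates count as off target."""
    if end_y is None or end_z is None:
        return False
    if math.isnan(end_y) or math.isnan(end_z):
        return False
    return GOAL_LEFT <= end_y <= GOAL_RIGHT and end_z <= CROSSBAR


print(is_on_target(40.0, 1.0))            # inside the frame
print(is_on_target(35.9, 1.0))            # just wide of the post
print(is_on_target(40.0, float("nan")))   # no end height recorded
```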
Visualization
Now, we will create a visualization that includes a goal image and scatter plots representing the shots to gain a clear understanding of where the shots ended up.
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
#Create figure
fig=plt.figure(facecolor='#22312b')
fig.set_size_inches(12, 4.2)
#Goal post lines
plt.plot([34,46],[0,0], color='#c7d5cc', linewidth=1.5)
plt.plot([36,44],[2.8,2.8], color='#c7d5cc', linewidth=3)
plt.plot([44,44],[0,2.8], color='#c7d5cc', linewidth=3)
plt.plot([36,36],[0,2.8], color='#c7d5cc', linewidth=3)
#Goal net
plt.gca().add_patch(Rectangle((36, 0), 8, 2.8, fill=False, edgecolor='#c7d5cc', hatch='+', alpha=0.2))
#Tidy Axes
plt.axis('off')
goal_mask = a.outcome_name == 'Goal'
no_goal_mask = a.outcome_name != 'Goal'
sc1 = plt.scatter(a[no_goal_mask].end_y, a[no_goal_mask].end_z,
                  marker='o', color='#ba4f45', label='No Goal')
sc2 = plt.scatter(a[goal_mask].end_y, a[goal_mask].end_z,
                  marker='o', color='#ad993c', label='Goal')
plt.ylim(ymin=-0.2, ymax=4)
plt.xlim(xmin=34, xmax=46)
plt.legend()
plt.show()
From this visualization, it becomes evident that most of the goals are scored when the shots are aimed at the corners of the goal, particularly the bottom left or bottom right corners.
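To back up a visual impression like this with numbers, one option is to split the goal mouth into zones and compute the conversion rate in each. The following is a minimal sketch in plain Python on made-up shots. The zone layout (thirds of the goal width, lower and upper half) and the names `zone_of` and `conversion_by_zone` are my own assumptions, and “left” simply means the lower ‘end_y’ side.

```python
from collections import defaultdict


def zone_of(end_y, end_z, center_y=40.0, mid_z=1.4, half_width=4.0):
    """Label a shot by thirds of the goal width and by lower/upper half."""
    offset = end_y - center_y
    if abs(offset) <= half_width / 3:
        side = "center"
    else:
        side = "left" if offset < 0 else "right"
    level = "low" if end_z <= mid_z else "high"
    return f"{level}-{side}"


def conversion_by_zone(shots):
    """Map zone -> (goals, shots, conversion rate) for (end_y, end_z, is_goal) tuples."""
    counts = defaultdict(lambda: [0, 0])
    for end_y, end_z, is_goal in shots:
        zone = counts[zone_of(end_y, end_z)]
        zone[0] += int(is_goal)
        zone[1] += 1
    return {z: (g, n, g / n) for z, (g, n) in counts.items()}


# Made-up shots for illustration: (end_y, end_z, scored?)
sample = [(36.5, 0.3, True), (36.8, 0.5, False), (40.0, 1.0, False),
          (40.2, 0.8, False), (43.6, 0.2, True), (43.1, 2.4, True)]
for zone, (goals, n, rate) in sorted(conversion_by_zone(sample).items()):
    print(f"{zone:12s} {goals}/{n} = {rate:.2f}")
```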
Preparing Data
For this model, the target column will be ‘goal’ or ‘not goal’, just like in the xG model. As for the features, we will transform the ‘end_y’ and ‘end_z’ values into distances from the center of the goal. We can assume that the further the shot ends from the center of the goal, the more likely it is to result in a goal, with the corners being the farthest points.
a['goal'] = a.apply(lambda row:1 if row['outcome_name']=='Goal' else 0, axis=1)
a['end_y_center'] = a.apply(lambda row:abs(40-row['end_y']), axis=1)
a['end_z_center'] = a.apply(lambda row:abs(1.4-row['end_z']), axis=1)
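The two distance features are easy to check by hand. The sketch below recomputes them for a few end points in plain Python; `placement_features` is a hypothetical helper that mirrors the two lambda expressions.

```python
GOAL_CENTER_Y = 40.0   # midpoint between the posts at 36 and 44
GOAL_CENTER_Z = 1.4    # half of the 2.8 crossbar threshold


def placement_features(end_y, end_z):
    """Return (horizontal, vertical) distance of the shot's end point from the goal center."""
    return abs(GOAL_CENTER_Y - end_y), abs(GOAL_CENTER_Z - end_z)


print(placement_features(40.0, 1.4))  # dead center
print(placement_features(36.0, 0.0))  # bottom corner
print(placement_features(44.0, 2.8))  # top corner on the other side
```

Because both features are absolute distances, a bottom corner and a top corner produce exactly the same pair of values, as do the left and right sides. The model therefore cannot tell them apart, which is a simplification to keep in mind when reading the results.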
Afterward, we can visualize the regression line to further analyze the relationship between the distance from the goal center and the likelihood of scoring.
# visualize regression line
fig = alt.Chart(a).mark_point().encode(x='end_y_center',y='goal')
fig + fig.transform_regression('end_y_center','goal').mark_line()
fig = alt.Chart(a).mark_point().encode(x='end_z_center',y='goal')
fig + fig.transform_regression('end_z_center','goal').mark_line()
Based on these charts, the horizontal distance from the goal center appears to have a greater influence on the likelihood of scoring than the vertical distance. This suggests that shots placed close to the posts have a higher probability of resulting in goals, largely regardless of their height.
Modelling
Now, let’s proceed with building the model using Logistic Regression.
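from sklearn import metrics
from sklearn.linear_model import LogisticRegression

X = a[['end_y_center', 'end_z_center']]
y = a['goal']
xgot_model = LogisticRegression()
xgot_model.fit(X, y)
# probability of the positive class (goal) for each shot
y_pred = xgot_model.predict_proba(X)[:, 1]
metrics.r2_score(y, y_pred)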
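Under the hood, a fitted logistic regression turns the two distances into a probability with the logistic function, p = 1 / (1 + exp(-(b0 + b1 * end_y_center + b2 * end_z_center))). The sketch below reproduces that calculation in plain Python. The coefficients are placeholders chosen for illustration, not the fitted values; in scikit-learn the fitted ones are available as `xgot_model.intercept_` and `xgot_model.coef_`.

```python
import math


def xgot_probability(end_y_center, end_z_center, intercept, coef_y, coef_z):
    """Logistic-regression probability for one shot, given the model's coefficients."""
    linear = intercept + coef_y * end_y_center + coef_z * end_z_center
    return 1.0 / (1.0 + math.exp(-linear))


# Placeholder coefficients for illustration only, not the fitted values.
INTERCEPT, COEF_Y, COEF_Z = -1.5, 0.3, 0.1

center = xgot_probability(0.0, 0.0, INTERCEPT, COEF_Y, COEF_Z)
corner = xgot_probability(4.0, 1.4, INTERCEPT, COEF_Y, COEF_Z)
print(f"center: {center:.3f}, corner: {corner:.3f}")
```

With positive coefficients, the probability rises as the shot ends further from the center, which is the behaviour we expect from an xGOT model.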
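The R2 score obtained from the model is approximately 0.02, which is low and suggests that shot placement alone explains only a small part of the variation in outcomes. With the model in place, we can now use it to calculate the xGOT for every match.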
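To see what an R2 score means for predicted probabilities, it helps to compute it by hand. The sketch below does so in plain Python on made-up outcomes, next to the Brier score, which is a common alternative for judging probabilistic predictions (scikit-learn offers it as `brier_score_loss`). The function names `r2_by_hand` and `brier_by_hand` are hypothetical.

```python
def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def brier_by_hand(y_true, y_pred):
    """Mean squared error between outcomes (0/1) and predicted probabilities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up outcomes and predictions for illustration.
goals = [1, 0, 0, 0]
base_rate = [0.25] * 4                  # always predict the overall scoring rate
slightly_better = [0.4, 0.2, 0.2, 0.2]  # leans towards the shot that was scored

print(r2_by_hand(goals, base_rate))
print(round(r2_by_hand(goals, slightly_better), 2))
print(round(brier_by_hand(goals, slightly_better), 2))
```

Always predicting the overall scoring rate gives an R2 of exactly 0, so a value close to 0 means the model is only slightly better than that baseline.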
Evaluation on All Matches
To accomplish this, we will start by defining a function to calculate xGOT.
# define function to calculate xGOT
def calculate_xgot(row):
    if (row.end_y <= 44) & (row.end_y >= 36) & (row.end_z <= 2.8):  # if on target
        end_y_center = abs(40-row['end_y'])
        end_z_center = abs(1.4-row['end_z'])
        X = [[end_y_center, end_z_center]]
        xgot = xgot_model.predict_proba(X)[:, 1][0]
        return xgot
    else:
        return 0
Next, we will apply this function for all matches, similar to how we implemented it for xG in the previous article. We will also include our previous xG model for comparison.
df_summary = df_match[['match_id', 'home_team_name', 'away_team_name']].copy()
home_goal = []
home_xg = []
home_xg_sb = []
home_xgot = []
away_goal = []
away_xg = []
away_xg_sb = []
away_xgot = []
for i, id in enumerate(df_match.match_id):
    df_evaluate = df_matches[id]['event'][df_matches[id]['event']['type_name'] == 'Shot']
    # keep shots from regular and extra time only (periods 1-4), excluding penalty shootouts
    evaluate_mask = (df_evaluate.type_name == 'Shot') & (df_evaluate.period <= 4)
    # copy so the new columns are added to an independent frame, not a view
    df_evaluate = df_evaluate[evaluate_mask].copy()
    # calculate xg and xgot per shot
    df_evaluate['our_xg'] = df_evaluate.apply(lambda row:calculate_xg_adv(row), axis=1)
    df_evaluate['our_xgot'] = df_evaluate.apply(lambda row:calculate_xgot(row), axis=1)
    # home team
    df_home = df_evaluate[df_evaluate.team_name == df_match['home_team_name'][i]]
    home_goal.append(len(df_home[df_home.outcome_name == 'Goal']))
    home_xg.append(df_home.our_xg.sum())
    home_xg_sb.append(df_home.shot_statsbomb_xg.sum())
    home_xgot.append(df_home.our_xgot.sum())
    # away team
    df_away = df_evaluate[df_evaluate.team_name == df_match['away_team_name'][i]]
    away_goal.append(len(df_away[df_away.outcome_name == 'Goal']))
    away_xg.append(df_away.our_xg.sum())
    away_xg_sb.append(df_away.shot_statsbomb_xg.sum())
    away_xgot.append(df_away.our_xgot.sum())
df_summary['home_goal'] = home_goal
df_summary['home_xg'] = home_xg
df_summary['home_xg_sb'] = home_xg_sb
df_summary['home_xgot'] = home_xgot
df_summary['away_goal'] = away_goal
df_summary['away_xg'] = away_xg
df_summary['away_xg_sb'] = away_xg_sb
df_summary['away_xgot'] = away_xgot
df_summary.head(10)
How do we interpret xG and xGOT together?
It’s quite straightforward. If a team has a high xG but a low xGOT, it suggests that they have been able to position themselves well for shooting opportunities, but their shot placement accuracy is lacking. Conversely, if a team has a low xG but a high xGOT, it indicates that they may not have been in advantageous shooting positions, yet they have demonstrated good shot placement skills.
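The comparison described here boils down to the difference between the two numbers. The sketch below expresses it in plain Python with made-up values; the function names and the tolerance of 0.25 are arbitrary choices of mine, not part of the model.

```python
def shooting_delta(xg, xgot):
    """xGOT minus xG: positive means shots were placed better than the chances suggested."""
    return xgot - xg


def describe(xg, xgot, tolerance=0.25):
    """Rough label for a team or player; the tolerance is an arbitrary choice."""
    delta = shooting_delta(xg, xgot)
    if delta > tolerance:
        return "placement added value"
    if delta < -tolerance:
        return "placement lost value"
    return "roughly as expected"


# Made-up values for illustration.
print(describe(xg=2.0, xgot=0.8))
print(describe(xg=0.6, xgot=1.5))
print(describe(xg=1.2, xgot=1.3))
```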
For example, let’s consider the Brazil vs. Serbia match. Brazil generated an xG value of around 2, while the xGOT value exceeded 3. This implies that the Brazilian players were able to place their shots in highly favorable positions, making it challenging for the opposing goalkeeper to make saves. However, since Brazil only managed to score 2 goals, it suggests that the Serbian goalkeeper performed exceptionally well by stopping difficult shots, as Serbia were expected to concede 3.3 goals based on the xGOT metric.
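The goalkeeper side of this reading is a simple subtraction, often referred to as goals prevented. Here is a minimal sketch using the figures quoted above; the function name `goals_prevented` is my own.

```python
def goals_prevented(xgot_faced, goals_conceded):
    """Positive values mean the goalkeeper conceded fewer goals than xGOT expected."""
    return xgot_faced - goals_conceded


# Figures quoted above for Brazil vs. Serbia: about 3.3 xGOT faced, 2 goals conceded.
print(round(goals_prevented(3.3, 2), 1))
```

By this measure, the Serbian goalkeeper kept out roughly 1.3 goals more than our model expected.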
Which Players are Overachievers and Underachievers?
To identify overachievers and underachievers among the players in the World Cup 2022, we can leverage the combination of xG and xGOT metrics. Here’s the approach we can follow:
- Calculate the xG and xGOT values for every shot, including the player’s name as a column in our dataset.
- With the player shots data at hand, group the data by player to summarize their performance.
- Visualize the data using scatter plots, with xG and xGOT as the axes.
new_features = ['player_name', 'x', 'y', 'outcome_name', 'sub_type_name', 'body_part_name', 'under_pressure', 'technique_name', 'shot_statsbomb_xg', 'end_y', 'end_z']
df_shot_player = pd.DataFrame(columns=new_features)
for id in df_match['match_id']:
    mask_shot = (df_matches[id]['event'].type_name == 'Shot') & (df_matches[id]['event'].period <= 4)
    shots_temp = df_matches[id]['event'].loc[mask_shot, new_features]
    df_shot_player = pd.concat([df_shot_player, shots_temp]).reset_index(drop=True)
# calculate xg and xgot
df_shot_player['our_xg'] = df_shot_player.apply(lambda row:calculate_xg_adv(row), axis=1)
df_shot_player['our_xgot'] = df_shot_player.apply(lambda row:calculate_xgot(row), axis=1)
# define goal or not
df_shot_player['goal'] = df_shot_player.apply(lambda row:1 if row['outcome_name'] == 'Goal' else 0, axis=1)
player_summary = df_shot_player.groupby('player_name')[['our_xg', 'our_xgot', 'shot_statsbomb_xg', 'goal']].sum().reset_index()
alt.Chart(player_summary).mark_circle(size=60).encode(
    x='our_xgot:Q',
    y='our_xg:Q',
    color=alt.Color('goal', scale=alt.Scale(range=["#ff8000", "#00b020"])),
    tooltip=['player_name', 'our_xg', 'our_xgot', 'shot_statsbomb_xg', 'goal']
).interactive()
Upon analysis, we find that Lionel Messi consistently stands out, having the highest xG and xGOT values, indicating that his positioning and shot placement skills are superior to those of other players. Mbappe, on the other hand, has a higher xG than xGOT, suggesting that his positioning skill is better than his shot placement. Lewandowski exhibits an xG of around 3 but an xGOT of only around 1, implying that his shot placement in this competition has been poor. Conversely, Enzo Fernandez achieved an xGOT of around 1.5 with just approximately 0.4 xG, indicating that his shots were placed skillfully.
Now, let’s identify the overachievers and underachievers of this competition using our xG and xGOT models. We will calculate the average of xG and xGOT for each player and then subtract that average from the actual goals the player scored, so a positive difference marks an overachiever.
player_summary['difference'] = player_summary['goal']-(player_summary['our_xg'] + player_summary['our_xgot'])/2
# the overachievers
player_summary.sort_values('difference', ascending=False).head(10)
# the underachievers
player_summary.sort_values('difference').head(10)
According to our models, Mbappe emerges as the greatest overachiever, surpassing his expected goal tally by scoring 8 goals while being expected to score around 4.5 goals. Cody Gakpo and Julian Alvarez also deserve recognition for exceeding their expected goals by more than 2.0 goals.
On the contrary, the greatest underachiever is Jamal Musiala, who was unable to score any goals despite being expected to score around 1.6 goals. Lautaro Martinez and Romelu Lukaku had disappointing performances as well, failing to score despite expected goal values of at least 1.2. These three players stand out as underachievers, falling short of their expected goal contributions by more than 1.0 goals.
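The ‘difference’ column behind these rankings can be reproduced by hand. The sketch below does so in plain Python for three made-up players (the numbers are illustrative, not results from the model); `overperformance` is a hypothetical helper that mirrors the formula.

```python
def overperformance(goals, xg, xgot):
    """Goals minus the average of xG and xGOT, as in the 'difference' column."""
    return goals - (xg + xgot) / 2


# Made-up players for illustration: (name, goals, xG, xGOT)
players = [("Player A", 5, 2.5, 3.5), ("Player B", 0, 1.8, 1.4), ("Player C", 2, 2.1, 1.9)]

ranked = sorted(players, key=lambda p: overperformance(*p[1:]), reverse=True)
for name, goals, xg, xgot in ranked:
    print(f"{name}: {overperformance(goals, xg, xgot):+.2f}")
```

Applied to the figures quoted above for Mbappe, 8 goals against an expected tally of around 4.5 gives a difference of about +3.5.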
Closing
Well, that’s all for this article. I hope you now understand how an xGOT model is built. You can try to make your own with a different dataset or a different algorithm. Any feedback is very much welcome. Thank you for reading, cheers!