Published in CodeX

# Introduction

Football is the most widely watched sport in the world, and football prediction has become increasingly popular in the last few years. Many models have been proposed that estimate the probability of a football team winning, losing, or drawing a match. This blog focuses on analyzing football games with xG (Expected Goals). An Expected Goals model estimates how likely a shot is to result in a goal, which helps in analysing football games.

# What are Bayesian Networks

Bayesian Networks have become a popular way of modelling the probabilistic relationships between a set of variables in a particular domain. A Bayesian Network is a Directed Acyclic Graph (DAG) in which each node represents a random variable and each directed edge represents a conditional dependency between the nodes it connects.

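The Bayesian Network represents the joint distribution via the chain rule, with each variable conditioned only on its parents in the graph: P(X₁, …, Xₙ) = ∏ᵢ P(Xᵢ | Parents(Xᵢ))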

The main idea of Bayesian Networks comes from Thomas Bayes' rule (Bayes' Theorem), P(X | Y) = P(Y | X) P(X) / P(Y), where:

• P (X) is the prior probability or marginal probability of X.
• P (X | Y ) is the posterior probability or conditional probability of X given Y.
• P (Y | X) is the conditional probability of Y given X (likelihood of data Y).
• P (Y) is the prior probability or marginal probability of data Y (the evidence).

Consider a real-life example of a Bayesian Network whose nodes are: Win Lottery (L), Rain (R) and Wet Ground (W).

This Bayesian Network tells us that the joint probability factorizes as: P(L,R,W) = P(L) P(R) P(W | R). Here “L” is independent of every other event, whereas “R” and “W” are dependent, i.e. if it rains, the ground is likely to be wet.
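To make the factorization concrete, here is a minimal Python sketch. The probability values are hypothetical and chosen only for illustration; the post does not give any numbers for this network. It builds the joint distribution from the three factors and checks that knowing L tells us nothing about W.

```python
from itertools import product

# Hypothetical probabilities, chosen only for illustration.
P_L = {True: 0.001, False: 0.999}   # P(L): winning the lottery
P_R = {True: 0.2, False: 0.8}       # P(R): rain
P_W_GIVEN_R = {                     # P(W | R): wet ground given rain / no rain
    True: {True: 0.9, False: 0.1},
    False: {True: 0.05, False: 0.95},
}


def joint(l: bool, r: bool, w: bool) -> float:
    """P(L, R, W) = P(L) * P(R) * P(W | R)."""
    return P_L[l] * P_R[r] * P_W_GIVEN_R[r][w]


# The eight joint probabilities must sum to 1.
total = sum(joint(l, r, w) for l, r, w in product([True, False], repeat=3))

# Marginal P(W = wet), summing out L and R.
p_wet = sum(joint(l, r, True) for l, r in product([True, False], repeat=2))

# P(W = wet | L = win): identical to P(W = wet) because L is independent of W.
p_wet_given_win = sum(joint(True, r, True) for r in [True, False]) / P_L[True]

print(f"sum of joint = {total:.3f}")
print(f"P(W) = {p_wet:.3f}, P(W | L) = {p_wet_given_win:.3f}")
```

With these assumed numbers both P(W) and P(W | L) come out the same, which is exactly what the missing edge between L and W encodes.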

Let us consider another example of a Bayesian Network, this time related to soccer. Two events, a harsh tackle (T) and an existing yellow card (Y), influence whether a player receives a red card (R). P(T) = 0.3 means there is a 0.3 chance that a tackle committed by the player is considered harsh and a 0.7 chance that it is considered legal. Given that the player commits a harsh tackle and already has a yellow card, there is a 0.97 chance that the player will receive a red card.

We can assume Harsh Tackle and Yellow Card are independent events. Further, if a player doesn’t have a yellow card and doesn’t commit a harsh tackle, then a red card is very unlikely (0.02). We can calculate the probability of a red card by summing over all combinations of the parent nodes (the law of total probability). Because T and Y are independent, P(T ∩ Y) = P(T) P(Y), and similarly for the other combinations:

P(R) = P(R | TY) * P(T ∩ Y) + P(R | T’Y) * P(T’ ∩ Y) + P(R | TY’) * P(T ∩ Y’) + P(R | T’Y’) * P(T’ ∩ Y’)
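The sum above is easy to evaluate once all the table entries are known. The sketch below uses the three values stated in the text (P(T) = 0.3, P(R | TY) = 0.97, P(R | T’Y’) = 0.02); the remaining values, P(Y), P(R | TY’) and P(R | T’Y), are not given in the post, so the ones used here are assumptions made purely for illustration.

```python
P_T = 0.3  # P(harsh tackle), from the text
P_Y = 0.2  # P(already on a yellow card): ASSUMED, not given in the text

# P(R | T, Y), keyed by (harsh_tackle, has_yellow).
P_R_GIVEN = {
    (True, True): 0.97,    # from the text
    (True, False): 0.40,   # ASSUMED for illustration
    (False, True): 0.10,   # ASSUMED for illustration
    (False, False): 0.02,  # from the text
}


def prob_red_card(p_t: float, p_y: float, cpt: dict) -> float:
    """Marginal P(R) by the law of total probability, with T and Y independent."""
    total = 0.0
    for t in (True, False):
        for y in (True, False):
            # Independence: P(T=t, Y=y) = P(T=t) * P(Y=y)
            p_ty = (p_t if t else 1 - p_t) * (p_y if y else 1 - p_y)
            total += cpt[(t, y)] * p_ty
    return total


p_red = prob_red_card(P_T, P_Y, P_R_GIVEN)
print(f"P(R) = {p_red:.4f}")
```

Changing the assumed entries changes the result, so the printed value should be read as an illustration of the calculation rather than a real estimate.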

# What is Expected Goals (xG)?

xG is the probability that a shot results in a goal, i.e., the chance of the shot ending up in the net. In its simplest form, the xG of a position is the ratio of the number of goals scored from that position to the number of shots attempted from it.

An xG value ranges between 0 and 1, with 0 meaning a certain miss and 1 a guaranteed goal. For example, an xG of 0.2 means that, on average, two out of every 10 such shots are converted into goals. The probability that a shot is converted depends on various characteristics such as the location of the player, the body part used, the type of attack, whether it is a free kick or a penalty, the type of assist, etc.
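The ratio definition can be sketched in a few lines of Python. The shot records below are made up for illustration, and the location names are placeholders rather than values from any dataset.

```python
from collections import defaultdict


def empirical_xg(shots):
    """Map each location to goals / attempts from that location.

    `shots` is an iterable of (location, is_goal) pairs.
    """
    attempts = defaultdict(int)
    goals = defaultdict(int)
    for location, is_goal in shots:
        attempts[location] += 1
        goals[location] += int(is_goal)
    return {loc: goals[loc] / attempts[loc] for loc in attempts}


# Hypothetical history: 10 shots from the centre of the box (2 goals)
# and 4 penalties (3 goals).
shots = [("centre_of_box", i < 2) for i in range(10)]
shots += [("penalty_spot", i < 3) for i in range(4)]

xg = empirical_xg(shots)
for location, value in xg.items():
    print(f"{location}: xG = {value:.2f}")
```

A real xG model conditions on many features at once instead of location alone, which is what the classifiers later in this post try to do.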

xG Model:

By watching a game, a human can intuitively tell which chances were more or less likely to be scored. The viewer takes several things into account: how close the shooter was, who the shooter was, what the angle to goal was, whether the shot was taken with the weaker foot, etc. Similarly, a machine learning algorithm can learn this if it is given the most important features that affect the likelihood of a goal.

We can model each shot as a Bernoulli random variable X, where X = 1 denotes a goal and X = 0 a miss. To estimate a team's expected goals, we sum these Bernoulli random variables (one per shot, each with its own probability); by linearity of expectation, the expected total is simply the sum of the shot probabilities. For example, consider Team-A, whose 5 shots have goal probabilities (0.02, 0.1, 0.3, 0.4, 0.1); the expected goal value for Team-A is then 0.02 + 0.1 + 0.3 + 0.4 + 0.1 = 0.92.
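As a quick check of the Team-A example, the sketch below sums the shot probabilities and also simulates the shots as independent Bernoulli trials. Treating the shots as independent and the simulation parameters (seed, number of runs) are assumptions made for the illustration.

```python
import random


def expected_goals(shot_probs):
    """Expected number of goals: the sum of per-shot goal probabilities."""
    return sum(shot_probs)


def simulate_goals(shot_probs, rng):
    """Draw one Bernoulli outcome per shot and count the goals."""
    return sum(rng.random() < p for p in shot_probs)


team_a = [0.02, 0.1, 0.3, 0.4, 0.1]  # shot probabilities from the text
xg_total = expected_goals(team_a)

# Monte Carlo check: the average over many simulated matches approaches xG.
rng = random.Random(0)
n_sims = 100_000
average_goals = sum(simulate_goals(team_a, rng) for _ in range(n_sims)) / n_sims

print(f"xG = {xg_total:.2f}, simulated average = {average_goals:.2f}")
```

The simulated average only approximates the exact sum, but it shows that xG is a long-run average and not a prediction for any single match.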

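Let yᵢ ~ Bernoulli(pᵢ) be the outcome of the team's iᵗʰ shot and let Fᵢ be the events (features) that the shot depends upon, so that pᵢ = P(yᵢ = 1 | Fᵢ) is the probability that the team scores from that shot. Then the expected number of goals scored by the team over n shots is: E[y₁ + … + yₙ] = p₁ + … + pₙ = Σᵢ P(yᵢ = 1 | Fᵢ)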

# How is xG used?

• Comparing xG to actual goals scored can indicate a player’s finishing quality or luck. For example, if a player has a higher xG than actual goals scored, that may be a result of poor finishing or bad luck; similarly, if a player is scoring more than their xG, that may be a result of good luck or individual brilliance.
• xG can be used to evaluate teams and predict future performance. For example, if a team is performing below average at the start of the season, a look at their xG could reveal whether that run is likely to continue or not.
• xG can be used to predict match outcomes, which can in turn be used for betting.
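The last use case can be illustrated with a small simulation: given per-shot xG values for both teams, we repeatedly sample every shot and count how often each side comes out ahead. This is only a sketch under the assumption that shots are independent. The home team's values reuse the Team-A example, while the away team's values, the function name and the parameters are hypothetical.

```python
import random


def simulate_match(home_shots, away_shots, n_sims=20_000, seed=0):
    """Estimate win/draw/loss probabilities from per-shot xG values."""
    rng = random.Random(seed)
    counts = {"home_win": 0, "draw": 0, "away_win": 0}
    for _ in range(n_sims):
        # Each shot is an independent Bernoulli trial with its own xG.
        home_goals = sum(rng.random() < p for p in home_shots)
        away_goals = sum(rng.random() < p for p in away_shots)
        if home_goals > away_goals:
            counts["home_win"] += 1
        elif home_goals < away_goals:
            counts["away_win"] += 1
        else:
            counts["draw"] += 1
    return {outcome: n / n_sims for outcome, n in counts.items()}


home = [0.02, 0.1, 0.3, 0.4, 0.1]  # Team-A shots from the earlier example
away = [0.05, 0.05, 0.1]           # hypothetical opponent shots

outcome = simulate_match(home, away)
for name, prob in outcome.items():
    print(f"{name}: {prob:.3f}")
```

Because the home side has the higher total xG, it wins more of the simulated matches, although low-scoring games still leave a sizeable share of draws.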

# Example of Bayesian Network :

A large number of factors can affect the outcome of a football match, and one of the difficulties lies in finding the relationships between the different attributes. Here are the main factors which affect the outcome of a football match: 1. History of the last 5 matches, 2. Home game, 3. The psychological state of the players, 4. Average number of matches in a week, 5. Form of key players, 6. Average home goals, 7. Average win rate, 8. Performance of all players of the team, etc.

A Bayesian Network built from the above example attributes would look like this:

# Method Description :

Exploring Data set :

Dataset Used: https://www.kaggle.com/secareanualin/football-events

Shape of Data = 229135 rows x 28 columns

• event_type: 0 Announcement, 1 Attempt, 2 Corner, 3 Foul, 4 Yellow card, 5 Second yellow card, 6 Red card, 7 Substitution, 8 Free kick won, 9 Offside, 10 Hand ball, 11 Penalty conceded
• event_type2: 12 Key Pass, 13 Failed through ball, 14 Sending off, 15 Own goal
• shot_place: 1 Bit too high, 2 Blocked, 3 Bottom left corner, 4 Bottom right corner, 5 Centre of the goal, 6 High and wide, 7 Hits the bar, 8 Misses to the left, 9 Misses to the right, 10 Too high, 11 Top center of the goal, 12 Top left corner, 13 Top right corner
• shot_outcome: 1 On target, 2 Off target, 3 Blocked, 4 Hit the bar
• location: 1 Attacking half, 2 Defensive half, 3 Centre of the box, 4 Left wing, 5 Right wing, 6 Difficult angle and long range, 7 Difficult angle on the left, 8 Difficult angle on the right, 9 Left side of the box, 10 Left side of the six yard box, 11 Right side of the box, 12 Right side of the six yard box, 13 Very close range, 14 Penalty spot, 15 Outside the box, 16 Long range, 17 More than 35 yards, 18 More than 40 yards, 19 Not recorded
• bodypart: 1 right foot, 2 left foot, 3 head
• assist_method: 0 None, 1 Pass, 2 Cross, 3 Headed pass, 4 Through ball
• situation: 1 Open play, 2 Set piece, 3 Corner, 4 Free kick

Data Preparation :

As our model is based only on shots, we consider only the rows with event_type = 1 (i.e. when an attempt is made). The probability of a shot being a goal depends on various things; for now we consider only a few factors: 1. Location, 2. Bodypart, 3. Situation, 4. Assist_method. Expanding these columns into one binary indicator per value gives the following candidate features:

'fast_break', 'no_assist', 'assist_pass', 'assist_cross', 'assist_header', 'assist_through_ball', 'open_play', 'set_piece', 'corner', 'free_kick', 'loc_centre_box', 'loc_diff_angle_lr', 'diff_angle_left', 'diff_angle_right', 'left_side_box', 'left_side_6ybox', 'right_side_box', 'right_side_6ybox', 'close_range', 'penalty', 'outside_box', 'long_range', 'more_35y', 'more_40y', 'not_recorded', 'right_foot', 'left_foot', 'header'

Now we select the top 5 features for our Bayesian Network, which are fast_break, loc_centre_box, close_range, penalty and outside_box, together with the target is_goal, and convert them into binary.
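A minimal pandas sketch of this preparation step is shown below. It is not the original notebook code: the helper name `prepare_shots` and the tiny in-memory frame are made up for illustration, the column names are assumed to match the dataset's event file, and the location codes are taken from the list above (3 Centre of the box, 13 Very close range, 14 Penalty spot, 15 Outside the box).

```python
import pandas as pd

# Location codes from the data dictionary above, mapped to feature names.
LOCATION_FLAGS = {
    "loc_centre_box": 3,
    "close_range": 13,
    "penalty": 14,
    "outside_box": 15,
}


def prepare_shots(events: pd.DataFrame) -> pd.DataFrame:
    """Keep only attempts and build the binary feature columns."""
    shots = events.loc[events["event_type"] == 1].copy()
    features = pd.DataFrame(index=shots.index)
    features["fast_break"] = shots["fast_break"].astype(int)
    for name, code in LOCATION_FLAGS.items():
        # 1 if the shot was taken from this location, else 0.
        features[name] = (shots["location"] == code).astype(int)
    features["is_goal"] = shots["is_goal"].astype(int)
    return features.reset_index(drop=True)


# Tiny hypothetical sample standing in for the real events file.
events = pd.DataFrame(
    {
        "event_type": [1, 3, 1, 1, 2],
        "location": [14, 19, 15, 3, 19],
        "fast_break": [0, 0, 1, 0, 0],
        "is_goal": [1, 0, 0, 1, 0],
    }
)

prepared = prepare_shots(events)
print(prepared)
```

On the real data the same function would be applied to the frame loaded from the dataset's events file instead of the hand-written sample.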

Our New Data Frame looks like this :

Model Dependency :

sklearn, pandas, pgmpy, numpy

Model Implementation:

From our dataset it is clear that we need a binary classifier. Each entry in our dataset contains several event indicators and a variable indicating whether the shot was a goal or not. It is hard to know in advance which classifier is best for our problem, so we choose two, Logistic Regression and a Bayesian Network, and compare them. Logistic Regression works well with linear dependencies when the data is categorical, and Bayesian Networks can model the joint probability over discrete variables. Besides fitting both classifiers, we measure the difference in their performance. The next step is to partition the dataset into training and test sets so that we can check how the models behave on unseen data.
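To show what the Bayesian Network side of this comparison estimates, here is a dependency-free sketch. It is not the pgmpy implementation used for the results. It assumes the simplest possible structure, in which all five features are direct parents of is_goal, and it estimates the conditional probability table by counting, with add-one smoothing. The synthetic rows, the helper names and the split fraction are all hypothetical.

```python
import random
from collections import Counter

FEATURES = ["fast_break", "loc_centre_box", "close_range", "penalty", "outside_box"]


def make_rows(flag, n_shots, n_goals):
    """Synthetic shots with one active feature; the first n_goals are goals."""
    rows = []
    for i in range(n_shots):
        row = {f: 0 for f in FEATURES}
        row[flag] = 1
        row["is_goal"] = 1 if i < n_goals else 0
        rows.append(row)
    return rows


def split(rows, test_fraction, seed):
    """Shuffle and partition the rows into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_cpt(rows, alpha=1.0):
    """Estimate P(is_goal = 1 | features) by counting, with add-alpha smoothing."""
    shots, goals = Counter(), Counter()
    for row in rows:
        key = tuple(row[f] for f in FEATURES)
        shots[key] += 1
        goals[key] += row["is_goal"]
    return {key: (goals[key] + alpha) / (shots[key] + 2 * alpha) for key in shots}


# Hypothetical data: penalties convert often, shots from outside the box rarely.
rows = (
    make_rows("penalty", 40, 30)
    + make_rows("outside_box", 200, 6)
    + make_rows("loc_centre_box", 100, 15)
)
train, test = split(rows, test_fraction=0.01, seed=0)
cpt = fit_cpt(train)

penalty_key = (0, 0, 0, 1, 0)
outside_key = (0, 0, 0, 0, 1)
print(f"P(goal | penalty)     = {cpt[penalty_key]:.2f}")
print(f"P(goal | outside box) = {cpt[outside_key]:.2f}")
```

A shot would then be classified as a goal when its table entry exceeds a chosen threshold such as 0.5. A learned network structure can differ from this fully connected one, so the sketch only illustrates the kind of quantity being estimated.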

# Results :

Bayesian Networks DAG :

Bayesian Networks :

Model Accuracy :

Train data length = 226843
Test data length = 2292
No of Goals in Test Data = 257
No of Goals correctly predicted = 27
No of Not-Goals correctly predicted = 2020
Accuracy for Goals (recall, 27 of 257) = 10.5%
Model accuracy (including goals & not-goals)= 89.3%

Confusion Matrix :

True Positive = 27
True Negative = 2020
False Positive = 15
False Negative = 230

Logistic Regression :

Model Accuracy :

Train data length = 226843
Test data length = 2292
No of Goals in Test Data = 257
No of Goals correctly predicted = 21
No of Not-Goals correctly predicted = 2029
Accuracy for Goals (recall, 21 of 257) = 8.2%
Model accuracy (including goals & not-goals)= 89.4%

Confusion Matrix :

True Positive = 21
True Negative = 2029
False Positive = 6
False Negative = 236

Both results were obtained with the same setup: the training data and the features are exactly the same for Logistic Regression and the Bayesian Network. We can see that both models give almost the same results, with high overall accuracy but very few goals correctly identified.
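The figures reported above can be recomputed directly from the two confusion matrices. The short sketch below does that and adds two quantities not listed in the results: precision, and the accuracy of a trivial baseline that labels every shot as a non-goal. The function name is ours.

```python
def summarize(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard metrics derived from a binary confusion matrix."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),        # share of actual goals that were found
        "precision": tp / (tp + fp),     # share of predicted goals that were goals
        "baseline_accuracy": (tn + fp) / total,  # always predicting "not goal"
    }


# Confusion matrices reported above.
bayes_net = summarize(tp=27, tn=2020, fp=15, fn=230)
log_reg = summarize(tp=21, tn=2029, fp=6, fn=236)

for name, metrics in [("Bayesian Network", bayes_net), ("Logistic Regression", log_reg)]:
    line = ", ".join(f"{key} = {value:.1%}" for key, value in metrics.items())
    print(f"{name}: {line}")
```

Both models beat the all-negative baseline by less than one percentage point of accuracy, so recall and precision are the more informative numbers here.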

# Dealing with Imbalanced Dataset:

An imbalanced dataset is one in which the labels are not evenly distributed. For example, out of 40 shots in a match only 1 to 3 result in goals. In such cases, the minority class is more important than the majority class. Accuracy is not a good measure for evaluating such a model: if our model predicts that all 40 shots are non-goals, it still achieves an accuracy of at least 92.5% (37 of 40 correct) even though it hasn't learned anything.

To deal with an imbalanced dataset we can weight each label by the inverse of its frequency. With a 40:1 ratio of non-goals to goals, for instance, we assign a weight of 40 to the minority label and a weight of 1 to the majority label, so that a wrong prediction on a minority example is penalized 40 times as heavily as one on a majority example.
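The weighting idea can be sketched with a weighted log loss. This is an illustration and not the training code behind the results above; the function names and the toy data (1 goal among 41 shots, giving the 40:1 ratio) are assumptions.

```python
import math


def inverse_frequency_weights(labels):
    """Weight 1 for the majority (non-goal) label, n_neg / n_pos for goals."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return {0: 1.0, 1: n_neg / n_pos}


def weighted_log_loss(labels, probs, weights):
    """Weighted average of the per-example log loss."""
    eps = 1e-12
    total, weight_sum = 0.0, 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        w = weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        weight_sum += w
    return total / weight_sum


labels = [1] + [0] * 40          # 1 goal, 40 non-goals
weights = inverse_frequency_weights(labels)

# A lazy model that gives every shot a 1% chance of being a goal.
lazy_probs = [0.01] * len(labels)

loss_unweighted = weighted_log_loss(labels, lazy_probs, {0: 1.0, 1: 1.0})
loss_weighted = weighted_log_loss(labels, lazy_probs, weights)

print(f"weights = {weights}")
print(f"unweighted loss = {loss_unweighted:.3f}, weighted loss = {loss_weighted:.3f}")
```

Without weights the lazy model looks good because the single missed goal is drowned out by 40 easy non-goals. With weights the missed goal dominates the loss, which pushes a model to pay attention to the minority class.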

# Concluding Remarks and Future Work :

Our model illustrates the capability of Bayesian Networks. It can be improved by adding more events. For example, we could also take into account the position of opposing players when the shot is attempted, since a nearby defender decreases the probability of a goal even when the shot is taken inside the box. Thus, with a richer dataset we can improve the predictions, which teams can then use to improve their own play. Some directions for future work are:

• Create a more player-centred model: we can use player attributes to make better predictions. For example, a player's shooting accuracy or knowledge of their strong foot can help predict team performance.
• One future extension is predicting expected assists, expected red cards, expected penalties, etc.
• We can use Bayesian Networks to predict future events, such as how many shots will be attempted.


GitHub: Football Analytics


## Sumeet Verma

Mtech CSA IISc Bangalore