Predicting Premier League match wins using Bayesian Modelling
Introduction
Football is the greatest and most-watched sport in the world; according to FIFA, it has around 5 billion fans worldwide.
The Premier League is arguably the most difficult and exciting league of all, because it thrives on unpredictability. It's the league where underdogs can triumph and giants can fall.
And that's the reason why we love to watch it. Anything can happen from the starting minute till the last whistle. And as a Man Utd fan, I know that well.
Having said that, I was looking for a challenging weekend project where I could combine two things I like.
Football and Bayesian Modelling.
In this article, I'll take you through my attempt and thought process for this problem: predicting Premier League match wins using Bayesian modelling.
There are similar articles that predict football match wins. Many of them, like this article on Dataquest, use traditional machine learning models such as Random Forest. The closest article I could find was by Pol Marin, who used Bayesian modelling to predict Champions League wins.
From what I could find, a good accuracy for this type of project is around 50–60%.
The Data
The data came from football-data.co.uk. Data from the current 2023–2024 season was used.
Each row represents a football match played, with columns describing the details of the match, such as HomeTeam, AwayTeam, FTHG (full-time home goals), FTAG (full-time away goals), FTR (full-time result) and more.
The data also contains the betting odds for each match. These were discarded from this project because the aim is to let the model learn from team performance rather than from information baked into the odds.
Pre-processing and Feature Engineering
The first task was to get the data into a format needed for a Machine Learning model. My preferred tool for data processing is Python.
Essentially, the go-to questions I ask myself about the data on any machine learning project are:
- What does each row represent?
- What does the target column represent?
- How do the features describe each row and how are they related to the target?
For this project, we needed to know what team was playing, who was their opponent and a binary target column indicating if the team playing won.
This means, for each match, we needed two rows. One row from the perspective of the home team and another from the away team.
So, for example, a match like Manchester United vs Arsenal will produce two rows. In one row, the team playing is Manchester United and their opponent is Arsenal; in the other, the team playing is Arsenal and their opponent is Manchester United.
The target column will be a binary column that represents if the team playing won (1) or didn't win (0).
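As a sketch, this reshaping is straightforward in pandas. The column names `HomeTeam`, `AwayTeam` and `FTR` come from the source data; `team_playing`, `opponent` and `target` are the names used in this article:

```python
import pandas as pd

def to_team_perspective(matches: pd.DataFrame) -> pd.DataFrame:
    """Expand each match into two rows, one per team's perspective."""
    home = pd.DataFrame({
        "team_playing": matches["HomeTeam"],
        "opponent": matches["AwayTeam"],
        "target": (matches["FTR"] == "H").astype(int),  # 1 if the home team won
    })
    away = pd.DataFrame({
        "team_playing": matches["AwayTeam"],
        "opponent": matches["HomeTeam"],
        "target": (matches["FTR"] == "A").astype(int),  # 1 if the away team won
    })
    return pd.concat([home, away], ignore_index=True)

matches = pd.DataFrame({
    "HomeTeam": ["Man United"], "AwayTeam": ["Arsenal"], "FTR": ["A"],
})
print(to_team_perspective(matches))
```

A draw (FTR == "D") correctly yields target 0 in both rows, since the target encodes "won" vs "didn't win".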
In its current form, the dataset describes the statistics of each match: how many goals the team_playing scored, the shots they took, the shots and goals they conceded, and more.
When making future predictions, we won't have these statistics, because those games won't have been played yet.
So the next task was to feature engineer statistics we could get before the game was played.
So I grouped the dataset by the team_playing column and calculated rolling statistics (mean, std, min and max) over the past 5 games, excluding the current row. This gave a total of 90 features.
This left the first five rows for each team with missing values, so I filled them using the next available value.
So now each row tells us the team playing, their opponent, and the mean number of goals scored, goals conceded, shots taken and more by the team playing over their past 5 games.
The choice of the last 5 games for the rolling window was semi-arbitrary; more experimentation could be done to find an optimal value.
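A minimal sketch of this rolling-window step, assuming a `matchday` column (hypothetical name) gives the chronological order. The `shift(1)` is the important part: it excludes the current match so the features only use information available before kick-off:

```python
import pandas as pd

def rolling_form(df: pd.DataFrame, stat_cols, window: int = 5) -> pd.DataFrame:
    """Per-team rolling stats over the previous `window` games (current row excluded)."""
    out = df.sort_values("matchday").copy()  # "matchday" is a hypothetical ordering column
    for col in stat_cols:
        for agg in ("mean", "std", "min", "max"):
            out[f"{col}_rolling_{agg}"] = out.groupby("team_playing")[col].transform(
                # shift(1) drops the current match to avoid target leakage
                lambda s, agg=agg: s.shift(1).rolling(window, min_periods=1).agg(agg)
            )
    return out
```

With `min_periods=1`, only each team's first row is missing (there is no earlier game to average); the stricter default of `min_periods=window` would reproduce the five missing rows per team described above.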
Feature Selection
Although 90 features were extracted, only 6 were used in the Bayesian logistic regression, because there was a lot of multicollinearity. See the figure below.
As you would expect, a team that has been winning recently will have its target_rolling_mean highly correlated with team_playing_ft_goals_rolling_mean, and so on.
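One simple way to whittle down correlated features (a sketch of the idea, not necessarily the exact procedure used here) is to scan the upper triangle of the correlation matrix and greedily drop one feature from every highly correlated pair:

```python
import numpy as np
import pandas as pd

def drop_collinear(features: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from every pair whose |correlation| exceeds threshold."""
    corr = features.corr().abs()
    # Keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return features.drop(columns=to_drop)
```

The threshold of 0.9 is illustrative; a stricter or looser cut-off changes how many of the 90 features survive.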
The final feature set I went with was:
The names of the teams playing were not used as features for this project, because they would require encoding.
One option is label encoding, where each team is assigned a unique number. But since we are using a linear model, it would assume a linear relationship between the features and the target, incorrectly interpreting those numbers as ordinal, which makes no sense for team names.
(This would have been fine with a non-linear model, such as a tree-based model, since those can capture non-linear relationships.)
The other option is one-hot encoding, but that would have increased the dimensionality of our dataset considerably. For these reasons, I decided to drop the team names.
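To make the dimensionality point concrete, here is what one-hot encoding does to the team column (a toy example with three clubs):

```python
import pandas as pd

df = pd.DataFrame({"team_playing": ["Arsenal", "Chelsea", "Liverpool"]})
encoded = pd.get_dummies(df, columns=["team_playing"])
print(encoded.columns.tolist())
# One column per distinct team; with all 20 Premier League clubs,
# this single column would become 20 binary columns.
```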
The Model: Bayesian Logistic Regression
We are trying to predict match wins: either a team won or it didn't, so our target is binary. We define y to be our target and assume it follows a Bernoulli distribution with parameter p.
To model the probability p of a team winning a match, we employ a logistic regression model. In logistic regression, the log odds of the event of interest (in this case, a team winning) are modelled as a linear combination of predictor variables. Specifically, we use the logit function to transform the probability p into the log-odds scale.
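Written out in symbols (matching the JAGS parameterisation used later, where any intercept is absorbed as a constant column of X):

```latex
\begin{aligned}
y_i &\sim \mathrm{Bernoulli}(p_i) \\
\operatorname{logit}(p_i) = \log\frac{p_i}{1 - p_i} &= \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_P X_{iP} \\
\beta_j &\sim \mathcal{N}(0,\, 1000) \quad \text{for } j = 1, \dots, P
\end{aligned}
```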
Here, each β is the coefficient associated with each predictor variable X. These coefficients represent the change in the log odds of winning for a one-unit change in the corresponding predictor variable, holding other variables constant.
To account for uncertainty in the estimation of these coefficients, we introduce weakly informative prior distributions for the coefficients β. In our model, we assume a normal prior for each coefficient, centred at 0 with a variance of 1000. (Note that JAGS parameterises the normal distribution by precision, the reciprocal of variance, so this appears as dnorm(0, 0.001) in the model string.)
I used MCMC via the rjags package in R to estimate the coefficients, running 2 chains of 100,000 samples each, with a burn-in period of 10,000 and thinning set to 20 to reduce autocorrelation. The rjags model string is below.
# JAGS model
model_string <- 'model{
  for(i in 1:N){
    y[i] ~ dbern(p[i])
    logit(p[i]) <- eta[i]
    # Linear predictor using inner product notation
    eta[i] <- inprod(X[i,], beta[])
  }
  # Weakly informative priors for coefficients
  for(j in 1:P){
    beta[j] ~ dnorm(0, 0.001)
  }
}'
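rjags handles the sampling in R; purely to illustrate what MCMC is doing under the hood, here is a minimal random-walk Metropolis sampler for the same Bernoulli-logit model with N(0, 1000) priors, in Python. The data, step size and chain length here are illustrative, not the article's actual setup:

```python
import numpy as np

def log_posterior(beta, X, y, prior_var=1000.0):
    """Log posterior of Bayesian logistic regression with N(0, prior_var) priors."""
    eta = X @ beta
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))  # Bernoulli log-likelihood
    logprior = -0.5 * np.sum(beta ** 2) / prior_var   # independent normal priors
    return loglik + logprior

def metropolis(X, y, n_samples=5000, step=0.1, seed=0):
    """Random-walk Metropolis: propose a Gaussian jump, accept with MH probability."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    current_lp = log_posterior(beta, X, y)
    samples = np.empty((n_samples, X.shape[1]))
    for i in range(n_samples):
        proposal = beta + step * rng.standard_normal(beta.shape)
        proposal_lp = log_posterior(proposal, X, y)
        if np.log(rng.uniform()) < proposal_lp - current_lp:  # accept/reject
            beta, current_lp = proposal, proposal_lp
        samples[i] = beta  # on rejection, the current value is recorded again
    return samples
```

A real run would still discard burn-in and thin the chain, exactly as described above; JAGS's Gibbs/slice samplers are far more efficient than this plain random walk, which is why the package is used in practice.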
For training, all data up to and including matchday 28 of the current 2023–24 season was used. Matchday 29 is our test set, which we will use to assess the model.
We are training the model on 516 matches and then testing on 50.
Results
First, we need to validate our model. To assess convergence, the Gelman-Rubin statistic was used. All point estimates are less than 1.1, suggesting good convergence. Trace plots and moving-average plots were also assessed visually.
From a visual analysis of the trace plots, we see that they look stationary and mixing seems to be good (They have that ‘hairy caterpillar’ look).
The moving average plots also show signs of convergence. We see the two chains start from different locations but come together.
High autocorrelation means that successive samples are similar to each other, which can make the chain take longer to converge, as it may get stuck in one region of the parameter space.
At lag 20, autocorrelation is quite high for most parameters, except beta[2]. At lag 100, it is minimal for all parameters except beta[1]. At lag 200, it is minimal for all betas. The decay here is quite slow.
The posterior means were used to calculate win probabilities for the test set: the matches from matchday 29.
I used a threshold of 0.5 to turn the model's probabilities into binary predictions. The full probabilities for the test set are available here.
The model achieved an overall accuracy of 78%. The recall was 61% which means it correctly identified 61% of the wins in the test set. The precision was 73% which means when the model predicted a win, it was correct 73% of the time.
We managed to get an overall F1 score of 67%. Not too bad.
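These headline numbers all follow from a confusion matrix. The counts below are my own reconstruction, hypothetical but consistent with a 50-match test set at the reported rates:

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)          # of predicted wins, how many were wins
    recall = tp / (tp + fn)             # of actual wins, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for a 50-match test set matching the reported metrics
print(classification_metrics(tp=11, fp=4, fn=7, tn=28))
```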
Let's take a look at some of the distributions from the test set.
This first result looks reasonable. Man City are currently the favourites over Man United. Manchester City ended up winning this game 3–1.
Taking a look at Fulham vs Tottenham. I think it's safe to assume that Tottenham would have been favourites to win this game.
Tottenham who are competing for a Champions League spot, sit in the top 6 while Fulham sit mid-table. The model predicted Fulham to win with an expected probability of 0.64 while Tottenham had 0.34.
Fulham won this game 3–0. Our model predicted that correctly.
Future Predictions
Now comes the fun of predicting next week’s (match week 30) fixtures. These games haven’t been played so let’s see.
Man City vs Arsenal
The expected win probability for Man City is 0.77 and Arsenal 0.79.
This game is tricky as both teams are on a great run of form and challenging for the league. The winner of this game takes second place in the league. As we can see, there is quite a lot of uncertainty in the model’s predictions because it predicts both teams to win. Although, Arsenal takes the edge in the expected win.
Brentford vs Man Utd
The expected win probability for Man Utd is 0.49 and Brentford 0.29.
Both teams have struggled for form recently, with Manchester United winning just 1 of their last 5 games and Brentford winning none. The expected probability of winning is below 0.5 for both teams.
Liverpool vs Brighton
The expected win probability for Liverpool is 0.84 and Brighton 0.39.
Liverpool currently sit first in the table. They are by far the favourites and the model shows that.
Chelsea vs Burnley
The expected win probability for Chelsea is 0.52 and Burnley 0.20.
Chelsea takes the edge here.
Aston Villa vs Wolves
The expected win probability for Aston Villa is 0.46 and Wolves 0.36.
Aston Villa takes the edge.
Improvements and Next Steps
For this model, we used 6 features. Next, we could apply feature selection like Boruta or MRMR. Also, some of the features I used were correlated with each other, so using a better feature selection process and more advanced feature engineering would be good.
Also, we used a rolling window of the last 5 games. We could experiment with different window sizes to see how that affects model performance.
I would have liked to add features like ball possession, player ratings and league position at the time of the match.
Another idea could be to train a logistic model per team rather than one general model.
Thank you for reading. All the code for preprocessing, feature engineering and modelling is available on my GitHub.
Find me on LinkedIn, GitHub and my Portfolio. I'm always open to, and appreciate, feedback.
References
- Said, A. (2024). adilsaid64/bayesian-football-match-prediction. [online] GitHub. Available at: https://github.com/adilsaid64/bayesian-football-match-prediction
- Marin, P. (2024). Using Bayesian Modeling to Predict The Champions League. [online] Medium. Available at: https://towardsdatascience.com/using-bayesian-modeling-to-predict-the-champions-league-8ebb069006ba.
- Albert, J. and Hu, J. (n.d.). Chapter 12 Bayesian Multiple Regression and Logistic Models | Probability and Bayesian Modeling. [online] bayesball.github.io. Available at: https://bayesball.github.io/BOOK/bayesian-multiple-regression-and-logistic-models.html
- Plummer, M., Stukalov, A. and Denwood, M. (2023). rjags: Bayesian Graphical Models using MCMC. [online] R-Packages. Available at: https://cran.r-project.org/web/packages/rjags/index.html
- www.youtube.com. (n.d.). Predict Football Match Winners With Machine Learning And Python. [online] Available at: https://www.youtube.com/watch?v=0irmDBWLrco&ab_channel=Dataquest
- football-data.co.uk. (n.d.). Football Betting | Football Results | Free Bets | Betting Odds. [online] Available at: https://football-data.co.uk/.
- FIFA (2021). The football landscape — The Vision 2020–2023. [online] FIFA Publications. Available at: https://publications.fifa.com/en/vision-report-2021/the-football-landscape/.