Predicting the Outcome of the English Premier League By Using Monte Carlo Method (in R)

Kristóf Menyhért
Published in The Startup
10 min read · Jan 4, 2021

Introduction

I really like football and sometimes I bet on matches. Nowadays bookmakers offer a lot of markets. Basically, you can bet on nearly anything, from ‘Who will be the next president of the USA?’ to ‘Which country is going to win the next Eurovision Song Contest?’ But sports events are still probably the most popular ones, especially football.

Two other things that I like are programming and predictions. I also have some background in math, statistics, and probability theory. Actually, it is more than a hobby: I use them almost every day since I am a data scientist.

In this article, I am going to write about how I used the Monte Carlo method to predict the final standings of each English Premier League club using R.

I also attached the coding part so you can follow along with what I did and you can run it on your own as well.

I will explain how you can modify the probabilities for each remaining match to reflect your predictions.

Then, once you have run your own simulation and looked at the results, you can compare them to the actual odds on the market. If you trust your numbers, you can even bet on the teams whose odds look favorable and maybe win some money.

Background — Understanding how betting works and how to be profitable in the long run

One concept that is essential to understand when it comes to profitable betting is that you need to look for value bets.

What does this mean in practice? That you should look for events that you think are mispriced. Then you bet on outcomes where the real odds are in your favor.

Let me give you a small example. There is a football match in the near future where Southampton is going to play against Liverpool. On Betfair — which operates the world’s largest online betting exchange — you can find the following odds:

Odds for Southampton vs Liverpool

We can convert these odds to probabilities by using the following formula:

Odds to probability

In this case, the odds on the betting site reflect the following probability of a Liverpool win:

Chance provided by Betfair for Liverpool to win.

So if you think that Liverpool's chance of winning is higher than 62.1%, then you should bet on this outcome. However, if you think that the chance is lower, you should bet against it.
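The conversion can be sketched in a few lines of R. This is only an illustration: the decimal odds of 1.61 are an assumed value chosen because they match the 62.1% quoted above, and the function names `implied_probability` and `expected_value` are hypothetical. The sketch also ignores any commission charged by the exchange.

```r
# Convert decimal odds to the probability they imply: p = 1 / odds.
implied_probability <- function(decimal_odds) {
  stopifnot(all(decimal_odds >= 1))
  1 / decimal_odds
}

# Expected profit per unit staked, given your own probability estimate.
# A positive value means the bet is a value bet under your estimate.
expected_value <- function(own_probability, decimal_odds) {
  own_probability * decimal_odds - 1
}

# Assumed example: decimal odds of 1.61 for a Liverpool win.
liverpool_odds <- 1.61
p_liverpool <- implied_probability(liverpool_odds)
round(100 * p_liverpool, 1)            # 62.1

expected_value(0.65, liverpool_odds)   # positive: you rate Liverpool above 62.1%
expected_value(0.60, liverpool_odds)   # negative: you rate Liverpool below 62.1%
```

The sign of `expected_value` flips exactly at the implied probability, which is why comparing your own estimate with 62.1% is enough to decide which side to take.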

If you are correct you are on the right track to win money in the long run.

The (betting) market of the Winner of English Premier League 2020/21

In this article, I am going to predict the final standing of the teams in the Premier League (PL) and I would like to compare my calculated probabilities to odds that are offered on Betfair.

As of today (2021.01.04.), the league standings look like this:

Premier League current standings

And the odds on Betfair are the following for the winner of the PL:

Premier League Winning Probabilities (on 2021.01.04.)

We would like to know whether the odds are in our favor or not. In other words, should we bet on Liverpool, Man Utd, Man City, or any other team?

If you are interested, there are other markets on Betfair which you might want to look at. (e.g.: top 2 finish, top 3 finish, etc.)

Okay, but what is the Monte Carlo Method (MCM)?

I am not a mathematician, so forgive me if I am not using the correct terms, but this is how I would briefly describe MCM:

Let’s say that you would like to know the chance of rolling a total of exactly 12 with 5 dice. It would be possible to calculate this using mathematical formulas, but let’s be honest, it can be complicated. Another approach is to simulate 1 million rolls of 5 dice and check how many of them total exactly 12. In this way, you can approximate the chance by counting the favorable outcomes and dividing by 1 million. The more you simulate, the closer you will get to the real probability. In this case, the estimate will be very close to the real probability since 1 million rolls is a lot. This is an example of how you can use the Monte Carlo Method.
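The dice example can be written as a short R sketch. The variable names are my own, and the seed is fixed only so the run is repeatable. Because 5 dice have just 6^5 = 7776 possible outcomes, the sketch also enumerates them all to get the exact answer for comparison.

```r
set.seed(42)
n_sim <- 1e6

# Each row is one simulated roll of 5 dice.
rolls <- matrix(sample(1:6, size = 5 * n_sim, replace = TRUE), ncol = 5)
totals <- rowSums(rolls)

# Monte Carlo estimate: share of simulated rolls that total exactly 12.
estimate <- mean(totals == 12)

# Exact probability by enumerating all 6^5 = 7776 possible outcomes.
all_rolls <- expand.grid(rep(list(1:6), 5))
exact <- mean(rowSums(all_rolls) == 12)   # 305 / 7776, about 0.0392

c(estimate = estimate, exact = exact)
```

With 1 million simulations the estimate typically agrees with the exact value to about three decimal places.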

It is commonly used for calculating the chance of outcome(s) of non-deterministic processes where calculating the exact chance(s) using the exact mathematical formula(s) is hard or even impossible.

Many times you can approach these problems with MCM. You draw samples from a known probability distribution and simulate the final outcome many, many times. Then you can estimate the chance of each outcome by simply counting how often it occurs and dividing by the number of simulations.

Btw: you can use this method for many things, including approximating Pi. Learn more on the Wikipedia page of Monte Carlo Method.

Let’s start programming

In this tutorial, I will try to explain how I simulated the final standings of each Premier League club step by step.

I used R since this is the most comfortable language for me to use.

First, we grab the data and load it into R, and then we create some functions that we can use later in the simulation part.

Step 1: Load packages

For this purpose, only three or four extra libraries are needed.

Step 2: Read in data

You can find the CSV file that I used in my GitHub repository (linked at the end of this article). I recommend that you download the CSV and change the path.

Initial table (the .CSV) that we have

The .CSV file includes every Premier League match, even those that have not been played yet. Matches that have not been played yet are marked with an ‘x’ in the ‘notyetplayed’ column, and the chances of each outcome are written in the ‘home_chance’, ‘draw_chance’, and ‘away_chance’ columns.

I manually entered these chances/probabilities based on what I think. Sadly, they are not based on any model; they are just my rough estimate for each match. So if you do not think that they are correct, you can modify them or estimate these probabilities on your own. (If you know any method to do that, please let me know.)

It is important to mention here that the Monte Carlo Simulation is using these probabilities to calculate the final standing chances so basically everything is dependent on these imputed values. So if they are very far from the real odds then the final outcome may also be far from reality.
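Since everything depends on these values, it can be worth checking them right after loading the file. Below is a minimal sketch using base R. The columns `Result`, `notyetplayed`, `home_chance`, `draw_chance`, and `away_chance` are the ones described above; `round`, `home_team`, and `away_team` are assumed names, and the two rows are made up so the example runs without the real file.

```r
# Toy stand-in for the real CSV (made-up rows, assumed column names for
# round, home_team and away_team).
csv_text <- "round,home_team,away_team,Result,notyetplayed,home_chance,draw_chance,away_chance
1,Team A,Team B,2 - 1,,,,
2,Team B,Team A,,x,20,25,55"

matches <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Rows flagged with 'x' are the matches still to be played.
unplayed <- matches$notyetplayed %in% "x"

# Sanity check: the three chances of every unplayed match should sum to 100.
chance_cols <- c("home_chance", "draw_chance", "away_chance")
chance_sum <- rowSums(matches[unplayed, chance_cols])
stopifnot(all(abs(chance_sum - 100) < 1e-8))
```

With the real file you would replace the `text = csv_text` argument with the path to the downloaded CSV.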

Step 3: Some Cleaning — Extract Result

We need to tell which one is the winning team based on the result in the ‘Result’ column so I am extracting the numbers from the string.

Create a draw (d), home (h), and away (a) indicator

I am creating a new column that indicates which team won a given match.
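A possible way to do both steps in base R is sketched below. It assumes the ‘Result’ strings look like "2 - 1" (home goals first) and are empty for matches that have not been played; the real file may use a different format. The helper name `nth_goal_count` and the columns `home_goals` and `away_goals` are hypothetical.

```r
# Toy data: three played matches and one that has no result yet.
matches <- data.frame(
  Result = c("2 - 1", "0 - 0", "1 - 3", ""),
  stringsAsFactors = FALSE
)

# Pull every run of digits out of each result string.
goals <- regmatches(matches$Result, gregexpr("[0-9]+", matches$Result))

# Return the n-th number of each result, or NA if the result is missing.
nth_goal_count <- function(goal_list, n) {
  vapply(goal_list, function(g) {
    if (length(g) == 2) as.integer(g[n]) else NA_integer_
  }, integer(1))
}

matches$home_goals <- nth_goal_count(goals, 1)
matches$away_goals <- nth_goal_count(goals, 2)

# h = home win, d = draw, a = away win, NA = not played yet.
matches$hda <- ifelse(matches$home_goals > matches$away_goals, "h",
               ifelse(matches$home_goals < matches$away_goals, "a", "d"))

matches
```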

Step 4: Get team names and initialize the league table

These are the 20 teams in the league.

"Fulham"         "Burnley"        "Man City"       "Crystal Palace" "Liverpool"      "West Ham"       "West Brom" "Spurs" "Sheffield Utd"  "Brighton"       "Everton"        "Leeds"          "Man Utd"        "Arsenal" "Southampton"    "Newcastle"      "Chelsea"        "Leicester"      "Aston Villa"    "Wolves"

Step 5: Create a function that calculates how many points each team has in each of the rounds

We need to write a function that calculates the points for each of the rounds. (Be careful: this function only calculates points for matches that have already been played, since it uses the ‘hda’ column.)

Let's use the function and inspect the output

This function will return a table with the simulation_id, how many matches each team played at that given point, and the points that they have.

output table with calculated points for each round
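For readers who do not have the script at hand, here is one way such a function could look in base R. It is a sketch and not necessarily the version in the repository: the name `calculate_points` and the `simulation_id` and `hda` columns come from this article, while `round`, `home_team`, `away_team`, `matches_played`, and `cum_points` are assumed names.

```r
calculate_points <- function(matches) {
  # Only matches with a known outcome can award points.
  played <- matches[!is.na(matches$hda), ]

  # One row per team per match: 3 points for a win, 1 for a draw, 0 for a loss.
  home <- data.frame(
    simulation_id = played$simulation_id,
    round = played$round,
    team = played$home_team,
    points = unname(c(h = 3, d = 1, a = 0)[played$hda]),
    stringsAsFactors = FALSE
  )
  away <- data.frame(
    simulation_id = played$simulation_id,
    round = played$round,
    team = played$away_team,
    points = unname(c(h = 0, d = 1, a = 3)[played$hda]),
    stringsAsFactors = FALSE
  )
  long <- rbind(home, away)
  long <- long[order(long$simulation_id, long$team, long$round), ]

  # Running match count and running points total per team and simulation.
  long$matches_played <- ave(long$points, long$simulation_id, long$team,
                             FUN = seq_along)
  long$cum_points <- ave(long$points, long$simulation_id, long$team,
                         FUN = cumsum)
  rownames(long) <- NULL
  long
}

# Toy example with four teams; the last match has not been played yet.
matches <- data.frame(
  simulation_id = 1,
  round = c(1, 1, 2, 2),
  home_team = c("A", "C", "B", "D"),
  away_team = c("B", "D", "C", "A"),
  hda = c("h", "d", "a", NA),
  stringsAsFactors = FALSE
)
points_table <- calculate_points(matches)
points_table
```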

Step 6: Create a function that extracts the final position by using the table that we just created before

If this function works correctly, it returns each team’s current league position together with its points.
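A base R sketch of such a function is shown below. The name `get_standings` and the column names `matches_played`, `cum_points`, and `position` are assumptions for illustration. Teams level on points simply keep alphabetical order here, which is arbitrary.

```r
get_standings <- function(points_table) {
  # Keep each team's most recent row within each simulation.
  latest <- points_table[order(points_table$simulation_id,
                               points_table$team,
                               -points_table$matches_played), ]
  latest <- latest[!duplicated(latest[, c("simulation_id", "team")]), ]

  # Rank teams by points within each simulation (highest first).
  latest <- latest[order(latest$simulation_id, -latest$cum_points), ]
  latest$position <- ave(latest$cum_points, latest$simulation_id,
                         FUN = seq_along)
  rownames(latest) <- NULL
  latest
}

# Toy points table: one simulation, four teams.
points_table <- data.frame(
  simulation_id = 1,
  team = c("A", "B", "B", "C", "C", "D"),
  matches_played = c(1, 1, 2, 1, 2, 1),
  cum_points = c(3, 0, 0, 1, 4, 1),
  stringsAsFactors = FALSE
)
standings <- get_standings(points_table)
standings[, c("position", "team", "cum_points")]
```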

Simulation part

We will use the previously created functions in the simulation part.

Step 7: Write probabilities where they are not present

For matches that have already been played, we need to write 100 (%) into the chance column that corresponds to the actual result, so the simulation always reproduces that result.
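A minimal sketch of this step on toy data follows. Writing 0 into the other two columns is my assumption; the point is only that a played match can never be simulated with a different outcome.

```r
# Toy data: two played matches (home win, away win) and one unplayed match.
matches <- data.frame(
  hda          = c("h", "a", NA),
  notyetplayed = c("", "", "x"),
  home_chance  = c(NA, NA, 50),
  draw_chance  = c(NA, NA, 30),
  away_chance  = c(NA, NA, 20),
  stringsAsFactors = FALSE
)

# Everything not flagged with 'x' has already been played.
played <- !(matches$notyetplayed %in% "x")

# The outcome that actually happened gets 100, the other two get 0.
matches$home_chance[played] <- ifelse(matches$hda[played] == "h", 100, 0)
matches$draw_chance[played] <- ifelse(matches$hda[played] == "d", 100, 0)
matches$away_chance[played] <- ifelse(matches$hda[played] == "a", 100, 0)

matches
```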

A small detour — The sample function

The sample function is one of the key functions that we are using. With the help of this function, we can set the probabilities for home (h), draw (d), and away (a) outcomes. So if we set the sample size to 10000 and the probabilities to 0.6, 0.2, and 0.2, we expect around 6000 home wins, 2000 draws, and 2000 away wins.

   a    d    h 
2017 1984 5999

Is that the case? It is, up to some random variation.
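A call along these lines reproduces the experiment. The seed is arbitrary and only makes the run repeatable, so the exact counts will differ from the ones shown above.

```r
set.seed(2021)

# Draw 10000 outcomes with a 60% / 20% / 20% split for home / draw / away.
outcomes <- sample(c("h", "d", "a"),
                   size = 10000,
                   replace = TRUE,
                   prob = c(0.6, 0.2, 0.2))

# Count how often each outcome was drawn.
counts <- table(outcomes)
counts
```

Note that `replace = TRUE` is required here because we draw far more samples than there are distinct outcomes.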

Step 8: Writing the simulation function

Our next task is to create the simulation function.

With the help of this function, we write the outcomes (h/d/a) using the probabilities that we entered for each match. We do this multiple times, so we create many possible outcomes. We can set how many times we would like to simulate with the ‘times’ argument. I recommend using a minimum of 100, but if you have some time you can set it to a higher number; the more simulations you run, the less random noise there is in the estimated chances. (For me it takes a couple of minutes to run, so be patient.)

The output is a table with a simulation_id column, which helps me to identify the different simulations. And of course, you can see the simulated outcome in the ‘hda’ column.
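The idea can be sketched in base R as follows. The function name `simulate_outcomes` is hypothetical and the implementation favors readability over speed; the `times` argument, the `simulation_id` and `hda` columns, and the three chance columns follow the description above.

```r
simulate_outcomes <- function(matches, times = 100) {
  sims <- lapply(seq_len(times), function(i) {
    sim <- matches
    sim$simulation_id <- i
    # Draw one outcome per match using that match's own probabilities.
    # sample() rescales the weights, so percentages work as well as fractions.
    sim$hda <- vapply(seq_len(nrow(sim)), function(r) {
      sample(c("h", "d", "a"), size = 1,
             prob = c(sim$home_chance[r],
                      sim$draw_chance[r],
                      sim$away_chance[r]))
    }, character(1))
    sim
  })
  do.call(rbind, sims)
}

# Toy input: one played match (fixed home win) and one open match.
matches <- data.frame(
  home_team   = c("A", "B"),
  away_team   = c("B", "A"),
  home_chance = c(100, 60),
  draw_chance = c(0, 20),
  away_chance = c(0, 20),
  stringsAsFactors = FALSE
)

set.seed(1)
simulated <- simulate_outcomes(matches, times = 300)
table(simulated$home_team, simulated$hda)
```

Because the played match carries a chance of 100 for its real result, it comes out the same in every simulation, while the open match varies.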

Step 9: Calculate standings

Now the only thing we need to do is to calculate the league standings for each of the simulations using the ‘calculate_points’ function.

The results

Now we have everything put together. We have the simulated progression of each team's standings for each of the rounds, and we have these simulations many times. In this case 300 times.

We can see how the league standings are evolving in each round. This data might be useful for other purposes, but if we are interested only in the final standing of each team we only need to have the last round standings.

There are 20 teams in the English Premier League, so the last round is round 38. This is the number of matches that one team plays in the league: each team plays the other 19 teams twice, once at home and once away, so (20 − 1) × 2 = 38.

Step 10: Get only the last round standings

Side note: I do not take into account what happens when there are equal points. The official rule says that if any clubs finish with the same number of points, their position in the Premier League table is determined by goal difference. Right now I am not dealing with this.

Count the occurrences

Show the occurrences in percentages

Convert the percentages to odds
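These three steps can be sketched on toy data as follows. The table `final_standings` and its contents are made up for illustration (two teams, four simulations) and are not results from the model.

```r
# Toy final standings: one row per team per simulation.
final_standings <- data.frame(
  simulation_id = rep(1:4, each = 2),
  team = c("Team A", "Team B",
           "Team A", "Team B",
           "Team B", "Team A",
           "Team A", "Team B"),
  position = rep(1:2, times = 4),
  stringsAsFactors = FALSE
)

# Count how often each team finished in each position.
counts <- table(final_standings$team, final_standings$position)

# Turn the counts into percentages per team (each row sums to 100).
percentages <- 100 * prop.table(counts, margin = 1)

# Convert percentages to decimal odds: odds = 100 / percentage.
implied_odds <- 100 / percentages

percentages
round(implied_odds, 2)
```

A position that never occurs in the simulations gets a percentage of 0 and therefore infinite odds, which is one more reason to run enough simulations.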

Conclusion and Takeaway

Do you remember the question that I asked at the beginning?

Would you bet on Liverpool or Man City?

Based on the simulation model, you should bet on Liverpool.

Why? Because the model's implied odds of Liverpool winning the league are better (lower) than what you can find on Betfair. In other words, according to the model, Liverpool has a higher chance of winning the league than the odds on the website suggest.

Calculated odds based on the model
Premier League Winning Probabilities on Betfair (on 2021.01.04.)

It does not mean that Liverpool will win for sure, but the chances are higher based on the model than what the current odds reflect.

Misc

I am not sure that I am using the correct probabilities for each of the matches. I am a football fan who knows about machine learning, statistics, and some programming.

Estimating the probabilities of home wins, draws, and away wins is hard, and my estimates might be far from reality.

But you can play with the data and simulate different conservative and not so conservative scenarios by running the Monte Carlo Simulation.

The Monte Carlo Simulation itself will be correct but the underlying chances might not be, so use the results at your own risk and have fun!

Update

The numbers above no longer reflect the current probabilities. On the day I published the article, Liverpool lost against Southampton, and this completely changed their outlook.

I re-ran the model today (2021.01.05.) with the updated results and this is what I got:

New results as of 2021.01.05.

Resources

Scripts and CSV:

https://github.com/krinya/pl_simulation
