Predicting Football Results with Random Forest

Nicholas Utikal
13 min readDec 26, 2019

Project Overview

Football betting has been around since the invention of the game in the 19th century. It is present in commercials, on team sponsorships and in betting shops around the corner (at least in Germany). Although I think I have a solid understanding of current football trends, a good overview of Europe's biggest leagues and know most of the players of my favourite teams (Borussia Dortmund and Hertha Berlin), I have never bet money on the outcome of a game or any other match-related event. I always lacked solid proof that my gut feeling might be correct.

While googling, I ran into multiple data scientists and software engineers explaining the best and most common ways to get started with football result prediction. One of them used the Random Forest Regressor, which operates by building a large number of decision trees during training and taking a different part of the dataset as the training set for each tree. Decision trees have the advantage that they scale very well with additional data, are quite robust to irrelevant features and are interpretable.

Since we have features (goals in past games, average goals per game, …) and targets (home and away goals), we are dealing with a supervised regression machine learning problem.

For our predictions we are going to use free and publicly available data from this website. It is updated after every match day.

If you want to take a closer look into the code feel free to visit the Github repository.

In another blogpost I managed to obtain an accuracy of ~64% clustering the teams with an archetype analysis and then using XGBoost to predict results. You can find it here.

Jupyter-Notebook Structure

I have divided my work into five parts. In the first Jupyter notebook, ‘load_and_clean.ipynb’, I load and explore the raw CSV files which I downloaded from the webpage mentioned in the paragraph above.

Then I saved the result locally with pickle in the same folder and loaded this pickle file into ‘home_team_prediction.ipynb’ and ‘away_team_prediction.ipynb’, where I added additional columns and used the Random Forest Regressor to build decision trees/forests in order to predict, separately, the goals scored by the home and the away team. I then also save those results via pickle and load the pickle files into the last Jupyter notebook, ‘result.ipynb’, to join the data and create an Excel file with the final results and predictions.

Since I am doing similar work in ‘home_team_prediction.ipynb’ and ‘away_team_prediction.ipynb’, I created a Python package with reusable functions, which I can then import and use in both files.

The questions I will target:

How can the number of goals for any given team be predicted correctly?

Is it possible to predict the outcome of a game (win, draw or loss) correctly from the predicted number of goals of each team at least 50% of the time?

Data Exploration

From this website we can download CSV files with match-related data for each league game from all the ‘main’ leagues, such as England, Spain and Germany, and from extra leagues, such as Argentina, China and Denmark. They come with all kinds of information, including ‘Home Team Shots’, ‘Away Team Yellow Cards’ and betting odds data. For our analysis we will focus on the following columns:

year: 2018 and 2019 for all data points
month: number for month of the year
day: number for day of the month
HomeTeam: team hosting the match
AwayTeam: team visiting
FTHG: Full Time Home Team Goals
FTAG: Full Time Away Team Goals
HST: Home Team Shots on Target
AST: Away Team Shots on Target

Besides the given columns, we also have to create two columns, HTGDIFF (Home Team Goal Difference) and ATGDIFF (Away Team Goal Difference), in order to measure a team's overall performance in its last games. We just have to subtract the FTAG column from FTHG for the home side, and vice versa for the away side. This leaves us with a table that looks like the following:
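In pandas this is a one-liner per column; a minimal sketch with made-up scores:

```python
import pandas as pd

# toy frame standing in for the loaded match data
df = pd.DataFrame({"FTHG": [3, 0, 1], "FTAG": [1, 2, 1]})

df["HTGDIFF"] = df["FTHG"] - df["FTAG"]   # home team goal difference
df["ATGDIFF"] = df["FTAG"] - df["FTHG"]   # away team goal difference
```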

Since the data source already provides clean and complete data, we should not expect much basic data wrangling or many anomalies. Nevertheless, it is important to check through visualisation, for example using the pandas scatter_matrix function.

Examining the quantitative statistics and the graphs, we can feel confident in the high quality of our data.

Calling .describe() on our initial dataframe is also helpful to get a first impression of the data we will work with:
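Both checks fit in a few lines; a sketch with a toy frame standing in for the loaded match data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs headless
import pandas as pd
from pandas.plotting import scatter_matrix

# toy frame standing in for the loaded match data
df = pd.DataFrame({"FTHG": [3, 0, 1, 2], "FTAG": [1, 2, 1, 0],
                   "HST": [7, 2, 4, 5], "AST": [3, 6, 4, 1]})

print(df.describe())                       # count, mean, std and quartiles per column
axes = scatter_matrix(df, figsize=(8, 8))  # pairwise scatter plots, histograms on the diagonal
```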

Data Preparation

While I was exploring and looking for ways to predict football results, I found this article. The author uses the Poisson distribution as a classifier to predict win, draw or loss between two teams: ‘We just need to know the average number of goals scored by each team and feed this data into a Poisson model.’ Modelling the result of one match would look like this:

Seeing the diagram, I asked myself if it would be possible to predict the outcome of games mostly with historical averages (per game), predicting goals by analysing both teams separately. Since we are going to work with goal differences, we already include the performance of the defence and/or of the opponent's strikers.

In order to create valuable features for each past game, we are going to add additional columns, which will help our Random Forest Regressor model to predict results with a meaningful accuracy.

AVGHTGDIFF & AVGFTHG per past game

In this paper I read that data scientists in the USA obtained the best results in sports predictions using an average of a feature across the last 20 games. Since I am predicting goal scores for hosting and visiting teams separately, and in most European football leagues home and away games alternate, I will calculate the average home/away team goal difference across the last 10 home or away games.

The value of row 0 in the column ‘AVGHTGDIFF’ is the sum of the previous values in ‘HTGDIFF’ divided by 10. With the following function I managed to calculate the rolling average per row (or for each match day):
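The notebook's exact function is in the repository; a sketch along these lines (the grouping column and window are my assumptions) computes such a per-row rolling average:

```python
import pandas as pd

def rolling_average(df, group_col, value_col, out_col, window=10):
    # Per-team mean of the previous `window` values; shift(1) keeps the current
    # match's own result out of its feature (no leakage into the prediction).
    df = df.copy()
    df[out_col] = (
        df.groupby(group_col)[value_col]
          .transform(lambda s: s.shift(1).rolling(window, min_periods=1).mean())
    )
    return df

# toy usage: three consecutive home games of one team
games = pd.DataFrame({"HomeTeam": ["A", "A", "A"], "HTGDIFF": [2, 0, 1]})
out = rolling_average(games, "HomeTeam", "HTGDIFF", "AVGHTGDIFF")
```

The first game of a team has no history, so its average is NaN; from then on each row averages only past matches.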

Through using a similar function I obtained a rolling average across the last 10 games for the goals scored by the home / away team (‘AVGFTHG’).

Looking at the image above, it is already clearly visible that a top team such as Bayern Munich is identified by its high AVGHTGDIFF and AVGFTHG scores, compared to teams such as Mainz or Paderborn, which are having a rough time in the German first division (Bundesliga).

HTGDIFF, FTHG and HST for past 10 games

In order to get enough data for the model, I am also including the ‘HTGDIFF’, ‘FTHG’ and ‘HST’ (‘Home Team Goal Difference’, ‘Full Time Home Goals’ and ‘Home Team Shots on Target’) values from the last ten HomeTeam games, per past match.

Using the pandas method .assign(), I managed to break the arrays with the past values for ‘HTGDIFF’, ‘FTHG’ and ‘HST’ into separate columns per row:
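A sketch of how .assign() can spread the past values into columns; this is illustrative rather than the notebook's exact code, and n is 3 here instead of 10 for readability:

```python
import pandas as pd

# four consecutive home games of one team
games = pd.DataFrame({"HomeTeam": ["A", "A", "A", "A"],
                      "HTGDIFF": [2, -1, 0, 3]})

# HTGDIFF_1 … HTGDIFF_n hold the value 1 … n games back for the same team
n = 3
past = {f"HTGDIFF_{i}": games.groupby("HomeTeam")["HTGDIFF"].shift(i)
        for i in range(1, n + 1)}
games = games.assign(**past)
```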

Which then looks like:

The shape of our data is now 561 x 33.

Define Features and Targets

Since we are trying to predict goals (in this example home goals), we have to label those as our targets and convert the series to an array using numpy. The Random Forest won't accept any other type and would break.

Then we have to drop all unnecessary features, which would create noise and hurt our accuracy. Through multiple iterations and feature importance analysis I identified features that have minor or no impact on the prediction. We will discuss this further in the paragraph “Variable Importances in %”.

We also have to drop the column with the scored goals, since we want the model to learn from related data and not just copy the results.
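Roughly, the target/feature split looks like this; the exact set of dropped columns lives in the notebook, the ones below are assumptions:

```python
import numpy as np
import pandas as pd

# toy frame with a target column, two features and a non-numeric column
df = pd.DataFrame({"FTHG": [2, 1], "AVGFTHG": [1.5, 1.2],
                   "HST_1": [5, 3], "HomeTeam": ["A", "B"]})

targets = np.array(df["FTHG"])                     # home goals: what we predict
features = df.drop(columns=["FTHG", "HomeTeam"])   # drop the answers and non-numeric noise
feature_list = list(features.columns)              # keep names for importance analysis later
features = np.array(features)
```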

Calculate Baseline

After splitting our data into training and testing sets, where we allow our model to see the answers so it can learn how to predict the goals, we establish our baseline error. A baseline result can tell us whether a change is adding value. Our goal is to get our predictions below the baseline error in order to demonstrate the usefulness of our model. In our case a suitable baseline is the average number of goals scored per home team up to the given date.

This means we have to beat an average error of 1.33 goals. Otherwise we might have to take a different approach.
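With toy numbers, the baseline computation looks like this:

```python
import numpy as np

# toy values standing in for the real home-goal counts
train_targets = np.array([2, 1, 0, 3, 1])
test_targets = np.array([1, 2, 0])

# baseline: always predict the historical average of home goals; the baseline
# error is the mean absolute deviation of the test targets from that average
baseline_pred = train_targets.mean()
baseline_error = np.mean(np.abs(test_targets - baseline_pred))
```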

First Predictions on Test Data

Then I train our model on the features and targets with 1000 estimators. This means our Random Forest model creates 1000 different decision trees and outputs the mean as the result.

Subtracting our test target data from the prediction results, we get a mean absolute error of 1.24 goals. This means we are below our baseline error of 1.33 goals and on a good way.
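A minimal sketch of that training step, using synthetic data in place of the match features:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# synthetic stand-in: one informative feature plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 1.5 * X[:, 0] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(X_train, y_train)                 # each tree sees its own bootstrap sample

mae = mean_absolute_error(y_test, rf.predict(X_test))
```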

For me, it was important to take an actual look at the predicted values by inserting them into the original dataset, recreating the ‘FTHG’ column with the predicted values. It is one thing to beat the baseline error and have a decent accuracy; it is another to double-check the results against the football knowledge you have gathered over the years, to verify that they make sense.

The column ‘FTHG’ shows our predictions for the goals that are going to be scored by the home teams on a weekend with 9 matches. The results look realistic compared to the predictions I had when I started fitting my model (mostly 1's or 0's).

Accuracy

By dividing the errors by the test_target values, I obtain a model accuracy of 21.28% in the first iteration. That does not seem very high, so we might have to improve parameters and drop some features.
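The accuracy figure appears to be 100% minus the mean absolute percentage error; a sketch with toy values (kept non-zero to avoid division by zero):

```python
import numpy as np

test_targets = np.array([2.0, 1.0, 3.0])   # toy goal counts
predictions = np.array([1.0, 2.0, 2.0])

errors = np.abs(predictions - test_targets)
mape = 100 * np.mean(errors / test_targets)   # mean absolute percentage error
accuracy = 100 - mape
```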

Single Decision Tree Visualising

To see how the predictions are made, we can take a closer look at one of the 1000 trees created by the model. We randomly pick tree number 10 and limit it to a depth of 4.
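One way to render a single tree is scikit-learn's export_graphviz; a sketch with toy data and a smaller forest than the article's, where the variable names are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz

# fit a small forest on toy data (100 trees here for brevity)
X = np.random.default_rng(1).normal(size=(100, 3))
y = X[:, 0] + X[:, 1]
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# export tree number 10 as Graphviz .dot text, limited to depth 4
dot = export_graphviz(rf.estimators_[10], out_file=None, max_depth=4,
                      feature_names=["f0", "f1", "f2"],
                      rounded=True, precision=1)
```

The resulting .dot text can be rendered to an image with the Graphviz `dot` tool.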

With this limited tree we can make predictions for all home team games. Let's walk through an example:

We want to predict how many goals ‘Fortuna Dusseldorf’ will score in the match on the 22nd of December, considering all the features deemed valuable (all columns except HomeTeam).

Following the marked path, this decision tree predicts that ‘Fortuna Dusseldorf’ will score one goal (rounded).

Since every tree is trained on a random subset of data points, we might get a different result if we look at another tree. But the mean, and therefore the final prediction across all 1000 trees, is also one goal:

In reality ‘Fortuna Dusseldorf’ scored two goals. Damn, we missed!

Not quite. Since we have a mean absolute error of 1.21 goals we are still in range and can consider our prediction for ‘Fortuna Dusseldorf’ on the 22nd of December a success.

Variable Importances in %

When I started building the model, I added a lot of features hoping they would all have a positive impact on the result:

'Day', 'Month', 'Year', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG',
'HTGDIFF', 'ATGDIFF', 'AVGHTGDIFF', 'AVGFTHG','HS', 'AS', 'HST', 'AST', 'HC', 'AC', 'HF', 'AF', 'HY', 'AY', 'HR', 'AR', 'HTGDIFF_1', 'HTGDIFF_2', 'HTGDIFF_3', 'HTGDIFF_4', 'HTGDIFF_5', 'HTGDIFF_6', 'HTGDIFF_7', 'HTGDIFF_8', 'HTGDIFF_9', 'HTGDIFF_10', 'HST_1',
'HST_2', 'HST_3', 'HST_4', 'HST_5', 'HST_6', 'HST_7', 'HST_8', 'HST_9', 'HST_10', 'FTHG_1', 'FTHG_2', 'FTHG_3', 'FTHG_4', 'FTHG_5', 'FTHG_6', 'FTHG_7', 'FTHG_8', 'FTHG_9', 'FTHG_10'

The underscore suffix means n games back in the past. For example, ‘HST_5’ means ‘Home Team Shots on Target’ five games ago.

In order to quantify the usefulness of all the variables in the entire random forest, we can look at their relative importances:

Feature Importance: AVGFTHG         14.2%
Feature Importance: Day 7.4%
Feature Importance: HST_1 6.35%
Feature Importance: AVGHTGDIFF 4.85%
Feature Importance: Month 4.72%
Feature Importance: HST_3 4.42%
Feature Importance: HTGDIFF_6 4.36%
Feature Importance: HTGDIFF_3 4.18%
Feature Importance: FTHG_1 4.09%
Feature Importance: HTGDIFF_1 4.07%
Feature Importance: HST_2 3.95%
Feature Importance: HTGDIFF_2 3.47%
Feature Importance: HTGDIFF_8 3.46%
Feature Importance: HST_5 3.31%
Feature Importance: HST_4 3.2%
Feature Importance: HST_8 3.12%
Feature Importance: HTGDIFF_4 2.99%
Feature Importance: HST_9 2.67%
Feature Importance: HST_7 2.55%
Feature Importance: HTGDIFF_5 2.5%
Feature Importance: FTHG_2 2.5%
Feature Importance: HST_10 2.48%
Feature Importance: FTHG_6 2.24%
Feature Importance: FTHG_9 2.09%
Feature Importance: Year 0.82%
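Percentages like these can be produced roughly as follows; the data below is a toy stand-in, the real features come from the notebook:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# toy problem where f0 matters most by construction
X = np.random.default_rng(2).normal(size=(200, 3))
y = 2.0 * X[:, 0] + 0.1 * X[:, 1]
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# rank features by the forest's impurity-based importances
feature_list = ["f0", "f1", "f2"]
ranked = sorted(zip(feature_list, rf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, imp in ranked:
    print(f"Feature Importance: {name:10s} {100 * imp:.2f}%")
```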

This shows us that many of the features have only a minor impact on the prediction and could even be harmful, since the Random Forest Regressor model is sensitive to noise.

Through multiple iterations of calculating the cumulative importance, I keep all features that account for 95% of the total importance. This helps us decrease the noise and the run time of our model. While doing the feature reduction, it is important to keep an eye on the accuracy, since it might drop.
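A sketch of the cumulative-importance cut-off with made-up numbers:

```python
import numpy as np

# toy importances, already normalised to sum to 1
importances = np.array([0.40, 0.25, 0.20, 0.10, 0.05])
names = ["AVGFTHG", "HST_1", "AVGHTGDIFF", "Day", "Year"]

order = np.argsort(importances)[::-1]          # most important first
cumulative = np.cumsum(importances[order])
# small tolerance so floating-point round-off does not drop a boundary feature
keep = [names[i] for i, c in zip(order, cumulative) if c <= 0.95 + 1e-9]
```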

In my case I ended up with the features seen in the graph above. If I had continued reducing features, the accuracy would have dropped significantly below 21.28%.

Summing up, I get the same accuracy (21.28%) with 25 features as with 53.

Now it is time to improve the accuracy by using Random Search.

Random Forest Optimisation through Random Search

Explaining hyperparameter tuning is beyond the scope of this blog post, but I can briefly walk you through the steps I took to improve the accuracy by a few percent.

I used randomised hyperparameter search, as explained in this article. I created a reusable function which takes X_train, y_train, n_estimators, n_iter and cv as input:
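A sketch of such a function; the parameter grid below is my assumption, the real one lives in the repository:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

def random_search(X_train, y_train, n_estimators, n_iter=50, cv=3):
    """Run a randomised hyperparameter search and return the fitted search object."""
    param_distributions = {
        "n_estimators": list(n_estimators),
        "max_depth": [None, 3, 5, 10],
        "min_samples_split": [2, 5, 10],
        "max_leaf_nodes": [None, 10, 19, 50],
        "max_features": [0.4, 0.6, 0.8, 1.0],
        "bootstrap": [True, False],
    }
    rs = RandomizedSearchCV(
        RandomForestRegressor(random_state=42),
        param_distributions=param_distributions,
        n_iter=n_iter, cv=cv,
        scoring="neg_mean_absolute_error",
        random_state=42, n_jobs=-1,
    )
    rs.fit(X_train, y_train)
    return rs

# toy usage on a tiny synthetic problem
X = np.random.default_rng(0).normal(size=(60, 3))
y = X[:, 0]
rs = random_search(X, y, n_estimators=range(10, 31, 10), n_iter=3, cv=3)
```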

I mostly experimented with the number of decision trees created (n_estimators), the number of iterations (n_iter) and the number of cross-validation folds.

To improve the prediction accuracy for the home team goals, I kept the range of n_estimators between 10 and 1000 and changed the number of cross-validation folds from 3 to 5.

I then saved the result in a variable:

best_params = rs.best_params_
best_params =
{'n_estimators': 434,
'min_samples_split': 10,
'max_leaf_nodes': 19,
'max_features': 0.6,
'max_depth': 5,
'bootstrap': False}

Then I refitted the Random Forest model with the new parameters:
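Refitting with the found parameters is then mostly configuration; a sketch with toy training data and assumed variable names:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# toy data standing in for the real features/targets
X_train = np.random.default_rng(3).normal(size=(80, 4))
y_train = X_train[:, 0] + X_train[:, 1]

# the parameters found by the random search (values from the article)
best_params = {"n_estimators": 434, "min_samples_split": 10,
               "max_leaf_nodes": 19, "max_features": 0.6,
               "max_depth": 5, "bootstrap": False}

# refit the forest with the tuned parameters
rf_tuned = RandomForestRegressor(**best_params, random_state=42)
rf_tuned.fit(X_train, y_train)
```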

Voila!

For our home team goal prediction, the mean absolute error stayed at 1.24 goals, but the accuracy improved from 21.28% to 24.82%.

Redoing the entire process from above for the away team goal prediction, I managed to reduce the mean absolute error from 0.98 to 0.89 goals and increase the accuracy from 31.21% to 35.46%.

At the end of the ‘home_team_prediction’ and ‘away_team_prediction’ notebooks I save the results locally in Excel files, which are loaded in the fifth notebook.

Merging Predictions

In result.ipynb I loaded the Excel files / results from the ‘home_team_prediction’ and ‘away_team_prediction’ notebooks and merged them into one dataframe. The columns ‘HTGDIFF’ and ‘ATGDIFF’ are the actual/test goal differences from past matches. For clarity I renamed them to ‘test_HTGDIFF’ and ‘test_ATGDIFF’. In order to compare them with my predictions, I created two new columns with the predicted goal differences for home and away teams, naming them ‘pred_HTGDIFF’ and ‘pred_ATGDIFF’. Our dataframe then looks like this:

Using ‘pred_HTGDIFF’ or ‘pred_ATGDIFF’ (it does not matter which, since we will only use absolute numbers) to calculate the mean absolute error and accuracy, we get the following results:

In addition to measuring the accuracy of the goal predictions, I calculated the overall accuracy between my model and the test data to see how successfully I could predict wins, draws and losses:
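Comparing the sign of the actual and the predicted goal differences gives this outcome accuracy; a sketch with made-up differences:

```python
import numpy as np

def outcome(goal_diff):
    # map a goal difference to +1 (win), 0 (draw) or -1 (loss), home team's view
    return np.sign(goal_diff)

# toy actual vs predicted home-team goal differences
test_HTGDIFF = np.array([1, 0, -2, 3, -1])
pred_HTGDIFF = np.array([2, 1, -1, 0, -2])

correct = outcome(test_HTGDIFF) == outcome(pred_HTGDIFF)
outcome_accuracy = correct.mean()   # share of matches with the right result
```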

Of all home team games with a winner, I predicted 51% correctly, for draws 29% and for losses 63%. For the away team predictions, the draws stay at 29%, but the percentages for wins and losses interchange.

Results

Regarding the predictions per goal, I have an accuracy of 23%.
The obtained accuracy for a game's outcome (win, draw or loss) is 51%.

This means we (barely) made it over the 50% mark!

If our goal were to predict, successfully and with high accuracy, the exact number of goals scored by the home and away teams, we would still have a long way to go.

But if we consider predicting the outcome of an individual football match as our aim, we have matched and exceeded randomly guessing who will win, draw or lose. In a future blog post I therefore plan to construct an alpha that takes the strength and the share of wins/draws/losses of the opposing team into consideration, in order to significantly improve this value. Instead of just predicting numbers of goals and calculating the difference, I would weight teams like Bayern Munich heavily and multiply their predicted goals by an alpha that enlarges the prediction. The same goes for teams that performed very poorly in past games, whose predicted number of goals would be decreased accordingly.

Summing up, predicting home/away goals with a random forest regressor on the features I chose is just enough to achieve a positive outcome. Maybe, as written before, weighting teams with an alpha based on historical data or choosing different features might improve the predictions.

Another interesting possibility I see is to work with the historic odds of betting companies, as described in this paper or take a closer look at the Poisson distribution regression model.


References

[1] https://www.football-data.co.uk/germanym.php

[2] https://dashee87.github.io/football/python/predicting-football-results-with-statistical-modelling/

[3] https://www.sciencedirect.com/science/article/pii/S2210832717301485?via%3Dihub

[4] https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74

[5] https://arxiv.org/pdf/1710.02824.pdf

[6] https://towardsdatascience.com/random-forest-in-python-24d0893d51c0

[7] https://towardsdatascience.com/improving-random-forest-in-python-part-1-893916666cd
