Predicting Football Results using Archetype Analysis and XGBoost

Nicholas Utikal
18 min read · Jun 29, 2020

Introduction

People have been betting on sports for as long as mankind exists. Probably the first recorded sports betting took place at the first gladiator games in Rome in 264 BC [1]. Bettors decided which fighter to back based on gut feeling, or simply counted how often he had won (if he had not won, he would not be fighting at this moment). Appearance also mattered: his equipment, how muscular or tall he was, and so on.

It is no different nowadays. Some people play it safe and bet on an overall successful team to minimise the risk. Others bet on their favourite team, even though it is having a miserable season. And others watch how a team has performed over its last few games and estimate, with their gut, how it will perform in the next one.

I used to be one of these guys (though only betting fake money on some online portals). But I came to understand that without a real system, you will most likely end up losing.

Therefore, I am spending some of my free time using Data Science to find patterns, cluster teams and combine multiple models to predict the outcomes of football matches with the highest possible accuracy.

To keep the post short, I am not including all of the code. The complete code used in this project can be found here.

Project Overview

This project has three main parts:

  1. Archetype Analysis (AA)
  2. XGBoost Data Preprocessing
  3. XGBoost Analysis

First, I will use AA as a form of dimensionality reduction to identify a certain number of archetypes in a dataset. Classic archetypes to be expected are high-performing clubs like Real Madrid or Bayern Munich. But medium- and low-performing clubs can also form archetypes in certain combinations of the data, which helps to describe similarly performing teams.

The output of the AA will be n columns, each holding a percentage that describes a team's affiliation with a certain performance group. These columns will be used as features in the subsequent XGBoost analysis.

What is XGBoost? “XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework.” [2] I will be using XGBoost because of its fast execution speed and high modelling performance.

The XGBoost model should predict whether, in an upcoming match, the home or the away team will win, or whether the match will end in a draw.

Problem Statement

In my last analysis, I tried to predict football results by predicting the home and away team goal difference [3]. With a Random Forest Regressor and hyperparameter tuning, I obtained an accuracy of 51%, which is only slightly better than guessing.

I would still not feel comfortable betting real money on football matches.

According to the article ‘A machine learning framework for sport result prediction’ [4], machine learning experts achieve an accuracy of 60–65% when predicting sports results using an Artificial Neural Network (ANN).

My goal for this project is to get “close” to the 60% accuracy mark, and to reach at least 55%.

Metrics

As the basic evaluation metric for this project, I will use classification accuracy, which is defined as follows:

Accuracy = Number of correct predictions / Total number of predictions [5]

In this case, I will evaluate how many matches I predicted correctly, checking the accuracy against historical data.
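To make this concrete, here is a minimal sketch of that check using sklearn (the label encoding 3 = home win, 2 = draw, 1 = away win is the one used later in this post):

```python
from sklearn.metrics import accuracy_score

# Illustrative example: 3 = home win, 2 = draw, 1 = away win
y_true = [3, 2, 1, 3]   # actual results
y_pred = [3, 1, 1, 3]   # predicted results
print(accuracy_score(y_true, y_pred))  # 3 out of 4 correct -> 0.75
```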

Analysis

Data Exploration

The data used for this project can be found at the website footystats.org [6]. It can be divided into two parts:

  • Match related data
  • Team related data

FootyStats Team-related Datasets

The team-related datasets will be used for the archetype analysis. Each set consists of aggregated data for every team in a league for a single season. Depending on the country, the dataset has between 18 and 24 rows (the number of teams in the first division) and 276 columns. Most of the values are split into total, home and away columns.

The set contains data describing a team's basic season figures, such as matches_played, wins, draws and losses, but also goal-related information like total_goal_count, goals_conceded_away or minutes_per_goal_scored_home.

Since the data source already provides clean and complete data, not much basic data wrangling is needed and few anomalies are to be expected.

In order to have enough data points to create meaningful archetypes, team results from the top leagues of five European countries were used: Germany, France, Spain, Italy and England.

FootyStats Matches Datasets

Like the team-related datasets, the match-related datasets can be found at footystats.org [6].

This data will be used to train the XGBoost model and to predict results for upcoming matches. It contains statistics and results for all future and historical matches in a league, including pre-match form. The latter in particular is significant for predicting future results.

In the feature importance analysis (‘XGBoost-Analysis.ipynb’), the following pre-developed features will turn out to be of high relevance [6]:

  • home_ppg — Points Per Game for Home Team — Current
  • away_ppg — Points Per Game for Away Team — Current
  • odds_ft_home_team_win — Odds — Home Team Win at Full Time
  • odds_ft_away_team_win — Odds — Away Team Win at Full Time

In the following screenshot you can find an overview of the team-related dataset:

Figure 2: Overview of the dataset (screenshot)


Algorithms and Techniques

archetype_analysis.ipynb

The proposed solution for predicting football results as accurately as possible is to first cluster football teams from five European top leagues into n archetypes, and then append these new features to the match-related dataset, which is then used to train an XGBoost model.

Friedrich Leisch and Manuel J. Eugster describe it in their paper as follows: ‘The aim of archetypal analysis is to find “pure types”, the archetypes, within a set defined in a specific context’ [7]. AA first appeared in a statistical context in 1994, when the concept was introduced by Cutler and Breiman, who defined archetypes as follows: ‘Archetypes are selected by minimising the squared error in representing each individual as a mixture of archetypes.’ [8]

Figure 3: Archetypal analysis approximating the convex hull of a data set [9]

AA tries to approximate the convex hull of a set of data. As seen in Figure 3 above [9], by iteratively minimising the RSS (residual sum of squares), the approximation can be improved and the number of points outside the convex hull reduced. The main benefit of using AA is that the archetypes themselves are restricted to being mixtures of individual data points, which makes them easily interpretable by human experts [9].

AA is well suited to classifying football teams, since high-, medium- and low-performing teams show clearly identifiable patterns. At first glance, the goals a team scores or concedes give a first indication of its strength. But by creating more than three groups (high, medium, low), you obtain a more detailed view of each category: even among the high-performing teams you can differentiate and create multiple sub-groups. This is especially interesting when two relatively equally strong teams play against each other, because it can then be helpful to see what percentage of a lower or higher group they ‘have in them’.

Let’s say Bayern Munich plays against Borussia Dortmund. At first sight, you might say it ‘will be a tight and interesting game’, since both have been the best German teams in recent years. But taking a closer look at their archetype affiliations can lead to a better prediction. If one of the teams has a small but present affiliation with a group that usually represents low- or medium-performing clubs, this can be an indicator that it shares some common patterns with teams from these groups.

XGBoost_preprocess.ipynb

After creating the archetype groups, we will use each team's percentage affiliation with these groups as features to train the XGBoost model.

Besides the AA, there will be more feature engineering. We will calculate the average goal difference for home and away teams over their last four games (home or away).

Then we are going to append the results from the last three matches per team (win, draw or loss) as columns/features to the data frame.

We will also append the absolute home or away team goal difference and the average home or away team goal difference from the last two games to the data frame.

XGBoost-Analysis.ipynb

In the last notebook we will then fit and train an XGBoost model with the pre-developed and self-engineered features from the previous notebook.

‘XGBoost is an open source library providing a high-performance implementation of gradient boosted decision trees.’ [10] Instead of building multiple trees in parallel, it builds them sequentially, each tree reducing the errors of the previous one.

Since we ‘only’ have around 1,400 rows, it's important to make sure the model does not simply memorise the training data, which leads to overfitting. For this purpose, we will use k-fold cross-validation, which divides the data into k equally large folds, with a parameter to ensure the data is shuffled.

When creating the XGBoost classifier, we will set the ‘objective’ parameter to ‘multi:softprob’, so the result contains the predicted probability of each data point belonging to each class. This is helpful since we have a multi-class rather than a binary classification problem.

At the end we will have a look at the importance of each feature and which ones could be left out.

Benchmark Model

As a benchmark model I will use the random forest classifier model I built in my earlier attempt to predict football results, where I obtained an accuracy of 51% [3].

Besides that, I will train a logistic regression model with the same data I prepared for the XGBoost model.

I will also use the statement from the paper ‘A machine learning framework for sport result prediction’ [4] as a benchmark, where they stated that experts achieved an accuracy of 60–65% using an Artificial Neural Network.

Comparing the results of the XGBoost model with the sources and models mentioned above will give a first impression of my model's performance.

Methodology

Archetype Analysis

Since the AA is an essential part of this project, I started the analysis independently of the rest of the project. The features I create in this part can be used for most other machine learning models.

The first thing I do is import all the packages needed for the AA:
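A minimal sketch of these imports (the notebook imports a few more packages); ‘clustering’ refers to the local clustering.py module described below:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import clustering  # local clustering.py with the AA logic and the plotting helper
```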

Most of them are well known, except ‘clustering’. This package contains the actual logic behind the AA and also includes a visualisation method, newly created by my colleague Dr. Luke Bovard, which displays the results in one graph. We will go into more detail later in this subchapter. The code can be found here.

Then I load the team-related datasets into the notebook and save them as dataframes:
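Roughly, this looks like the sketch below; the file paths are illustrative and depend on how the footystats.org exports were saved locally:

```python
import pandas as pd

# Illustrative paths -- adjust to the downloaded footystats.org team exports
leagues = ['germany', 'france', 'spain', 'italy', 'england']
dfs = {league: pd.read_csv(f'data/{league}-teams-2019-2020.csv') for league in leagues}
```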

These CSV files consist of aggregated data from the German, French, Spanish, English and Italian first divisions for the ongoing 2019–2020 season. Since I will use match-related data from the last five years for the predictions, it's important to understand that teams which used to play in the first division but have since dropped out are not included in the 2019–2020 teams dataset. This leads to errors, since the XGBoost model does not accept NaN values.

For the XGBoost model I will work only with the German Bundesliga, since I did not have time to automate a fix for the issue described above. I looked up the teams that used to play in the German Bundesliga and the year in which they were relegated to the second Bundesliga.

Then I selected the corresponding rows and saved them in a dataframe. In total there were six teams, one of them being ‘Hamburger SV’, or ‘HSV’ for short:

Another issue was that, due to the corona pandemic, all major European leagues were paused after only 25 completed match days (06.03.20–08.03.20). The teams relegated in past years, however, had played a complete season of 34 games (German Bundesliga). Therefore, I multiplied the important columns that could influence the AA result by 0.75 (25 of 34 games).
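A sketch of that adjustment, where df_relegated stands for the dataframe with the six former Bundesliga teams and the listed columns are only examples; the actual list of scaled columns in the notebook is longer:

```python
# Scale full-season counting stats down to 25 of 34 completed match days
season_share = 25 / 34  # ~0.75
cols_to_scale = ['wins', 'draws', 'losses', 'total_goal_count']  # example columns only
df_relegated[cols_to_scale] = df_relegated[cols_to_scale] * season_share
```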

After merging the European leagues with the dataframe that contains the relegated German teams, we have a dataframe (df_all) that looks like the following:

Figure 4: The merged dataframe df_all

It has 104 rows (teams) and 280 columns (features).

During the AA, the model first calculates the archetypes for each feature. Each feature contains a combination of values, including outliers, that can indicate behaviour relevant for the further analysis. For example, a high number in the ‘wins’ column can indicate a high-performing team.

In order to find the archetypes within the features, we first have to transpose the ‘df_all’ dataframe, select only the numerical features and then normalise the data.

Figure 5: The transposed and normalised dataframe

Finally, we save the dataframe as a matrix (as required by the AA algorithm).
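A sketch of this preparation step, assuming df_all carries the team names in a team_name column (the exact column selection differs in the notebook):

```python
from sklearn.preprocessing import MinMaxScaler

# Features become rows, teams become columns, values scaled to [0, 1]
df_t = df_all.set_index('team_name').select_dtypes(include='number').T
X = MinMaxScaler().fit_transform(df_t.values)  # the matrix handed to the AA code
```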

AA Application

It is beyond the scope of this project to explain the maths and technical logic behind the AA, but I want to give you a brief overview of how to create features for commonly used machine learning models.

Most of the Python code used for the AA was written by Artur Miller [11] and my work colleague Dr. Luke Bovard (‘def archetypal_plot’, line 149, clustering.py).

The code can be found here.

There is no concrete rule for choosing the right number of archetypes (k). The best way is to play around with the number of iterations (i) and observe when the RSS curve flattens, which can be an indicator that the right k has been selected [7].

After experimenting with several values of k and i, I got the best result with k=5 and i=50. At the beginning of the iterations, larger reductions in RSS are visible, whereas towards the end the curve flattens out.

Figure 6: Archetypal analysis of the ‘wins’ feature

In Figure 6, we can see the AA for the first feature in our matrix X, ‘wins’. Each blue dot represents one of the 104 teams, and the orange dots represent the five calculated archetypes (Z). In the upper right of the graph is a group of teams that differ strongly from the rest. In the middle we see a group of teams that win on a regular basis, and at the lower left we see the biggest concentration of clubs, indicating teams that do not win very frequently.

In order to obtain the archetypes on a team level, we first have to transform the results by calling the transform function and passing in the initial input dataset X. This returns the five archetypes for the 104 football teams (Germany, France, Spain, Italy and England).

Then we have to convert the results (archetypal) into a two-dimensional array by calling the function map2D, i.e. np.vstack((x, y)). This returns the coordinates of the five archetypes in the graph.

The function ‘transform(self, X)’ calls the function ‘_computeA(…)’:

After taking the dot-product and calling the function archetypal_plot(ax,dat,dp,epsilon=.1), we obtain an overview of the distribution of the football clubs and where clusters were identified:

Figure 7: Distribution of the football clubs and the identified archetypes

Although Figure 7 gives a first overview of the distance between each team and the corresponding archetypes, we still need each team's percentage affiliation with each of the five archetype groups:

Taking a first look at a snippet of the results, and with a basic understanding of the European football world, it becomes visible that groups 4 and 5 tend to represent high-performing clubs (‘BVB 09 Borussia Dortmund’, ‘FC Bayern München’, ‘FC Barcelona’), while groups 1 and 2 tend to represent low- or medium-performing clubs (‘1. FC Köln’, ‘FC Augsburg’, ‘Real Club Celta de Vigo’).

Nevertheless, teams like Barcelona or Real Madrid also have certain properties that make them partly belong to groups 1 and 2. This can be interesting, especially when such equally strong teams play against each other. A team's performance over its last games influences its percentage in each group.

Finally, I save the results together with the labels (team names) in a dataframe and store it as a pickle file.
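This last step is plain pandas; in the sketch below, aa_weights stands for the matrix of affiliation percentages returned by the AA and team_names for the corresponding labels (both variable names are illustrative):

```python
import pandas as pd

df_teams_aa = pd.DataFrame(aa_weights, columns=[f'archetype_{i}' for i in range(1, 6)])
df_teams_aa['team_name'] = team_names
df_teams_aa.to_pickle('teams_aa.pkl')
```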

Additional Feature Engineering

In this subchapter, I will create additional features that include information from past matches. I will focus only on the German Bundesliga and include match-related data from the 2015–2016 season until 2019–2020.

First, I load the pickle file with the AA results and the CSV file with the match-related data.

After creating an empty dataframe (df_empty_columns) with the column names of df_teams_aa, prefixed with “ht_” (home team) or “awt_” (away team), I concatenate it with the match-related dataframe.

Then, by iterating through each row, I assign the AA results to the home and away team of that row.
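A simplified sketch of that loop; it assumes the footystats columns home_team_name and away_team_name and that the prefixed ‘ht_’/‘awt_’ columns already exist in df_matches:

```python
aa_cols = [c for c in df_teams_aa.columns if c.startswith('archetype_')]
lookup = df_teams_aa.set_index('team_name')[aa_cols]

for idx, row in df_matches.iterrows():
    if row['home_team_name'] in lookup.index:
        df_matches.loc[idx, ['ht_' + c for c in aa_cols]] = lookup.loc[row['home_team_name']].values
    if row['away_team_name'] in lookup.index:
        df_matches.loc[idx, ['awt_' + c for c in aa_cols]] = lookup.loc[row['away_team_name']].values
```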

The result is a dataframe with match-related data plus the AA results for the home and away team in each row.

Figure 8: Match-related data with the appended AA results for home and away teams

For most of the upcoming features, I will need the home and away team goal difference (HTGDIFF, ATGDIFF):
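Assuming the footystats goal-count columns home_team_goal_count and away_team_goal_count, the two columns are simply:

```python
df_matches['HTGDIFF'] = df_matches['home_team_goal_count'] - df_matches['away_team_goal_count']
df_matches['ATGDIFF'] = df_matches['away_team_goal_count'] - df_matches['home_team_goal_count']
```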

In my last project [3], I had good results using the average HTGDIFF and average ATGDIFF, so I will reuse some of that code. The outcome is the average goal difference per team over its last four games before the corresponding match day.
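One way to compute such a rolling average is sketched below; it is not necessarily identical to the avg_goal_diff function in the notebook, home_team_name is an assumed footystats column name, and the dataframe is assumed to be sorted chronologically:

```python
def avg_goal_diff(df, team_col, diff_col, window=4):
    """Average goal difference of each team over its previous `window` games.
    Assumes df is sorted chronologically."""
    result = {}
    for team, group in df.groupby(team_col):
        # shift(1) so a match only "sees" the games played before it
        result[team] = group[diff_col].shift(1).rolling(window, min_periods=1).mean()
    return result

avg_htgdiff = avg_goal_diff(df_matches, 'home_team_name', 'HTGDIFF')
```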

After transforming the dictionary (the outcome of avg_goal_diff) into a dataframe, I append it to the match-related dataset as well.

This gives us the ‘result’ column.

In order to get enough data for the model, I am also including:

  · ‘result’ = last three results for home and away team
  · ‘HTGDIFF’ / ‘ATGDIFF’ = last three home and away team goal differences per game
  · ‘AVGFTHG’ / ‘AVGFTAG’ = average goals scored by home and away team over the last games

Using the pandas method .assign(), I break the arrays with the past values for ‘result’, ‘HTGDIFF’/‘ATGDIFF’ and ‘AVGFTHG’/‘AVGFTAG’ into separate columns for each row:
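The idea can be sketched with groupby().shift() inside .assign(); the column names are illustrative and the dataframe is again assumed to be sorted chronologically:

```python
def last_n(df, team_col, value_col, n=3):
    """Previous n values of value_col for the team in team_col, one column per lag."""
    g = df.groupby(team_col)[value_col]
    return {f'{value_col}_prev_{i}': g.shift(i) for i in range(1, n + 1)}

df_matches = df_matches.assign(**last_n(df_matches, 'home_team_name', 'HTGDIFF'))
df_matches = df_matches.assign(**last_n(df_matches, 'away_team_name', 'ATGDIFF'))
```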

The result then looks something like this (partly):

Figure 9: Match dataframe with the appended past-match features

The shape of our data is now 1437 x 80.

Then I select all numerical values, drop all unnecessary features and normalise the values that would otherwise create noise and distort our accuracy.

Since we want to predict future matches, we can select only columns that contain historical averages (such as those we created) or pre-developed features from footystats.org, and drop all the rest.

The rows of df_x (df_norm) with historic matches, up to the index ‘int_for_test’, are used to train the models; int_for_prediction = int_for_test - 20 marks the matches we want to predict with the models. In a future exercise I will automate this process.

In this case (until the next match day) the values are:

X, Y and Z are then saved as pickle files, to be loaded in the notebook ‘XGBoost-Analysis.ipynb’.
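Put together, this step might look like the sketch below; int_for_test and int_for_prediction are illustrative values, and the exact column selection in the notebook differs:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

num_cols = df_matches.select_dtypes(include='number').columns.drop('result')
df_norm = pd.DataFrame(MinMaxScaler().fit_transform(df_matches[num_cols]),
                       columns=num_cols, index=df_matches.index)

int_for_test = 1437                      # illustrative: last row of the dataset
int_for_prediction = int_for_test - 20   # the 20 fixtures to be predicted

X = df_norm.iloc[:int_for_prediction]                # training features
Y = df_matches['result'].iloc[:int_for_prediction]   # training labels
Z = df_norm.iloc[int_for_prediction:int_for_test]    # matches to predict

X.to_pickle('X.pkl')
Y.to_pickle('Y.pkl')
Z.to_pickle('Z.pkl')
```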

XGBoost-Analysis.ipynb

In the last two subchapters, we created or transformed features in order to have enough data to predict future matches.

In ‘XGBoost-Analysis.ipynb’, I first load the pickled X, Y and Z and store them in dataframes.

First, I used ‘train_test_split’ from sklearn to create my test data, with a test_size of 0.3.

Using this train/test split with the XGBoost model gave me a very high accuracy (about 78%). But after discussing it with my work colleagues, we also printed the accuracy on the training set, and it was 100%.

My model was overfitting. Since we ‘only’ have 1,400 rows of data/matches, the model could memorise ‘the noise instead of finding the signal’ [12]. The noise is the irrelevant data in your dataset, and the signal is the true underlying pattern you want your model to learn.

One option would be feature reduction; another is to improve the data split by using cross-validation.

I created four folds and enabled the ‘shuffle’ parameter using the KFold method. This gives me four equally sized train/test folds, shuffles our data and helps us avoid overfitting.

Another way to prevent overfitting is to use the L1 regularisation term on the weights. After playing around, I found that reg_alpha = 0.2 is a good fit and gives me (in combination with the other parameters) a good result.

For a multi-class classification problem, the XGBoost documentation suggests that objective=’multi:softprob’ is a good fit.

To prevent some columns from weighing too heavily in the prediction, you can play around with the colsample_bytree parameter. I chose 0.5, since for datasets with many columns a value between 0.3 and 0.8 is recommended [13].

I chose the other parameters by fitting the model multiple times (restarting the Jupyter notebook kernel every time) and observing which combination gives the best result. In a further analysis I will include more advanced techniques such as grid search for hyperparameter tuning.

Long story short: I tried to make it as hard as possible for my classifier to memorise patterns and overfit, especially since my dataset is not that large.

Then I fit the model with X_train and y_train and evaluate the predictions on both the training and the test folds:
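Putting the pieces together, the training and evaluation loop could look like the sketch below (parameter values as discussed above, everything else left at the defaults; the labels are shifted to 0–2 because the xgboost sklearn API expects classes starting at 0):

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

y = Y - 1  # 3/2/1 (home win / draw / away win) -> 2/1/0

model = xgb.XGBClassifier(objective='multi:softprob',
                          reg_alpha=0.2,
                          colsample_bytree=0.5)

kf = KFold(n_splits=4, shuffle=True, random_state=42)
train_acc, test_acc = [], []
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    model.fit(X_train, y_train)
    train_acc.append(accuracy_score(y_train, model.predict(X_train)))
    test_acc.append(accuracy_score(y_test, model.predict(X_test)))

print(f"XGB train Accuracy: {np.mean(train_acc):.2%}")
print(f"XGB Accuracy: {np.mean(test_acc):.2%}")
```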

Results

As a result, for the XGBoost model I get:

XGB train Accuracy: 70.57%
XGB Accuracy: 63.14%

The difference between the train and test accuracy is relatively small, which indicates that we probably do not have an overfitting problem.

Surprisingly, the result from the simple logistic regression model was slightly better:

LR train Accuracy: 69.71%
LR Accuracy: 65.14%

So, awesome, we did it, and better than expected (the benchmark)!?

Yes and No.

These are the value_counts of our train and test labels:

(Remember: 3 = Home win, 2 = draw, 1 = Away win)

The problem I see is that even after reading a lot and implementing objective=’multi:softprob’, we are having a really hard time predicting draws.

y_train has a draw share of 23%. The draw share in the XGB and LR predictions is only 2–3%.

This is a known problem in the football prediction world. Even professional bookmakers do not bet on draws [14].

Nevertheless, the accuracy of both models is a huge improvement over my previous attempt using random forests, where I had an accuracy of 51%.

Looking at the feature importance analysis for the XGBoost model, we can see that the top 15 features are mostly self-engineered ones.

Another insight is that there are no dominant features with a high importance share; they all lie between 0 and 5%.
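For reference, the listing can be produced directly from the fitted classifier, a minimal sketch:

```python
import pandas as pd

importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(15))
```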

I am also very satisfied with the results from the AA. Observing the feature importance listing above, I can see that the AA results play a relevant role in creating successful predictions.

In order to actually use the XGBoost or LR model, you first have to adjust Z in ‘XGBoost_preprocess.ipynb’.

With the variable int_for_prediction, you can adjust the integer that is subtracted from int_for_test.

It is satisfying to write that my model actually got more than close to the 60% accuracy mark described in the paper ‘A machine learning framework for sport result prediction’ [4]. Even though they used an Artificial Neural Network, I obtained a similar result using an XGBoost model and a logistic regression classifier.

Conclusion

At first glance, the result is satisfying. After checking it multiple times, I can say:

Yes, I might soon start betting real money on football games. But there are still a lot of things left to do.

First, I have to improve the accuracy for draws. Draws are basically non-existent in my predictions, which could be a problem in the long run.

I will continue experimenting with the parameters and start hyperparameter tuning using grid search.

Then I will have to automate a lot of code and create a working data pipeline: downloading/updating the files automatically, saving them in a database, selecting the matches to be predicted and comparing the predictions with real-world results.

But the most promising aspect of this analysis is still the archetype analysis. I am sure I can get much more benefit out of the results than just using them as descriptive features for the XGBoost model. In order not to lose money, I have to create a betting strategy. I could analyse the performance of the archetype groups and their teams: for example, how does the accuracy change if I train the model without matches involving teams from the top-performing groups? How high is my accuracy for matches between teams from the same archetype group? How many ‘surprises’ could I predict?

Thanks for reading.

References

[1] Wikipedia: https://en.wikipedia.org/wiki/Gladiator

[2] Vishal Morde: https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d

[3] Nicholas Utikal: https://medium.com/@nicholasutikal/predict-football-results-with-random-forest-c3e6f6e2ee58

[4] Rory P. Bunker, Fadi Thabtah: https://www.sciencedirect.com/science/article/pii/S2210832717301485#bb0145

[5] Google Machine Learning Crash Course: https://developers.google.com/machine-learning/crash-course/classification/accuracy

[6] Footystats.org: https://footystats.org/download-stats-csv

[7] Friedrich Leisch, Manuel J. Eugster: https://www.researchgate.net/publication/46515738_From_Spider-Man_to_Hero_-_Archetypal_Analysis_in_R

[8] Adele Cutler, Leo Breiman: https://digitalassets.lib.berkeley.edu/sdtr/ucb/text/379.pdf

[9] Christian Bauckhage, Dr. Christian Thurau: https://link.springer.com/chapter/10.1007/978-3-642-03798-6_28

[10] Georg Seif: https://towardsdatascience.com/a-beginners-guide-to-xgboost-87f5d4c30ed7

[11] Artur Miller: https://miller-blog.com/archetypal-analysis/

[12] https://elitedatascience.com/overfitting-in-machine-learning

[13] Félix Revert: https://towardsdatascience.com/fine-tuning-xgboost-in-python-like-a-boss-b4543ed8b1e

[14] https://www.kaggle.com/rambierestelle/match-outcome-predictions
