Clustering Football Teams using Archetype Analysis

Jul 14, 2020

For a while now, I have been trying to predict football results using techniques such as Random Forest Regressor, Linear Regression and XGBoost. So far, XGBoost has given me the best results. Besides using a more powerful tree-based method, I also started to apply Archetype Analysis (AA) as part of the feature engineering.

In this article I want to focus only on the AA. I will include most of the Python code here. I am using Python 3.7, and most of the logic is stored in Jupyter notebooks. The mathematical calculation is imported from a separate package in order to save space in the notebooks.

The entire work regarding AA and XGBoost predicting football results (multi-class) can be found here.

I will use the AA as a form of dimensionality reduction to identify a certain number of archetypes in a dataset. Classical archetypes to be expected are high performing clubs like Real Madrid or Bayern Munich. But medium and low performing clubs can also form archetypes in certain combinations of data, which helps to describe similarly performing teams.

The output of the AA is n columns, each holding a percentage that describes a team's affiliation to a certain performance group. These columns can be used as features in most machine learning models.

Data Set

In this example we are going to use a data set that consists of aggregated data for each team in the league for a single season. Most of the values are divided into total, home and away columns. The data set comes from footystats.org. Most of their data sets have to be paid for, but the Premier League team data set is free and can be downloaded here.

You can also find team-data sets from Italy, Germany, Spain, France and England (13.07.2020) in my GitHub folder.

Since the data source already provides clean and complete data, little basic data wrangling is needed and few anomalies are to be expected.

Algorithms and Techniques

But before we start coding blindly, let's try to understand what an Archetype Analysis is and what it's good for.

The proposed solution for predicting football results as accurately as possible is to first cluster football teams from the five top European leagues into n archetypes, and then append those new features to any other football-related dataset.

The AA is an unsupervised learning method and can be seen as a form of cluster analysis.

Friedrich Leisch and Manuel J. Eugster described in their paper: ‘The aim of archetypal analysis is to find “pure types”, the archetypes, within a set defined in a specific context’ [1]. AA first came up in the statistical context in 1994, when the concept was introduced by Cutler and Breiman. They defined archetypes as follows: ‘Archetypes are selected by minimising the squared error in representing each individual as a mixture of archetypes.’ [2].
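To make the quoted definition a little more concrete, the squared error can be written down as follows. The notation is my own choice for this post and not taken verbatim from the papers: X is the n × m data matrix (n teams, m features), Z holds the k archetypes, A is the n × k matrix of mixture weights per team, and B is the k × n matrix that builds each archetype from the data points.

```latex
\mathrm{RSS} = \lVert X - A Z \rVert_F^2, \qquad Z = B X,
\quad \text{subject to} \quad
a_{ij} \ge 0,\ \sum_{j=1}^{k} a_{ij} = 1,
\qquad
b_{jl} \ge 0,\ \sum_{l=1}^{n} b_{jl} = 1 .
```

The rows of A are the percentages per team that we are after, and the constraints on B are what force every archetype to be a mixture of real teams.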

The AA tries to approximate the convex hull of a set of data. As seen in Figure 1 [3], over multiple iterations that calculate the RSS (residual sum of squares), the approximation can be improved and the points lying outside of the convex hull can be minimised.

The main benefit of using AA is that the archetypes themselves are restricted to being mixtures of individual data points, which makes them easy for human experts to interpret [3].

The AA is very suitable for classifying football teams, since high, medium and low performing teams show clearly identifiable patterns. At first glance, the goals scored or conceded by a team give a first indication of its strength. But by creating more than three groups (high, medium, low), you can obtain a more detailed view of each category. Even among the high performing teams, you can differentiate and create multiple sub-groups. This is especially interesting when two teams of roughly equal strength play against each other: it can then be helpful to see what percentage of a lower or higher group they ‘have in them’.

Let’s say Bayern Munich plays against Borussia Dortmund. At first sight, you might say it ‘will be a tight and interesting game’, since both teams have been the best German teams in recent years. But taking a closer look at their archetype affiliation can help to give a better prediction. If one of the teams has a low but present affiliation to a group which usually represents low or medium performing clubs, this can be an indicator that it shares some common patterns with teams from these groups.

Code Implementation

You can either create a new Jupyter notebook (.ipynb extension) or an empty Python file in your IDE.

The first thing I do is to import all necessary packages needed for the AA:

Most of them are well known, except ‘clustering’. This package contains the actual logic behind the AA and also includes a new visualisation method, created by my colleague Dr. Luke Bovard, which displays the results in one graph. We will go into more detail later in this section. The code can be found here.

Then I load the team related datasets into the notebook and save them as dataframes:
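As a rough sketch of this loading step: the helper below reads one CSV per league and stacks them into a single dataframe. The function name `load_team_tables`, the `league` column and the tiny in-memory CSVs are placeholders I made up for illustration; with the real files you would pass file paths instead.

```python
import io

import pandas as pd


def load_team_tables(sources):
    """Read one CSV per league and stack them into a single dataframe.

    `sources` maps a league label to anything pandas.read_csv accepts
    (a file path or a file-like object).
    """
    frames = []
    for league, source in sources.items():
        frame = pd.read_csv(source)
        frame["league"] = league  # remember where each row came from
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Stand-in data; the column names are assumptions, not the footystats schema.
demo_sources = {
    "england": io.StringIO("team_name,wins,goals_scored\nTeam A,20,60\nTeam B,8,30\n"),
    "germany": io.StringIO("team_name,wins,goals_scored\nTeam C,25,80\n"),
}
df_all = load_team_tables(demo_sources)
print(df_all)
```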

During the AA, the model first calculates the archetypes for each feature. Each feature contains a combination of values, including outliers that can indicate behavior which might be relevant for the further analysis. For example, a high number in the ‘wins’ column can indicate a high performing team.

In order to find the archetypes within the features first, we have to transpose the ‘df_all’ dataframe, select only categorical features and then normalize the data.

At the end we save the data frame as a matrix (which is required by the AA algorithm).
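The preparation steps above could look roughly like the sketch below. Several things here are assumptions on my side: I keep only the numeric columns (those are the ones that can be scaled), I use a simple min-max scaling to the range 0 to 1, and the helper name `to_feature_matrix` and the demo columns are invented. Whether teams end up as rows or columns depends on what the AA implementation expects.

```python
import numpy as np
import pandas as pd


def to_feature_matrix(df):
    """Scale the numeric columns to [0, 1] and return them transposed as a NumPy array."""
    numeric = df.select_dtypes(include="number")
    # Guard against constant columns, which would otherwise divide by zero.
    span = (numeric.max() - numeric.min()).replace(0, 1)
    scaled = (numeric - numeric.min()) / span
    # After the transpose, each row is one feature and each column is one team.
    return scaled.T.to_numpy()


demo = pd.DataFrame(
    {
        "team_name": ["Team A", "Team B", "Team C"],
        "wins": [20, 8, 14],
        "goals_scored": [60, 30, 45],
    }
)
X = to_feature_matrix(demo)
print(X.shape)
```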

It is beyond the scope of this blog post to explain the math and technical logic behind the AA, but I want to give you a brief overview of how to create features for commonly used machine learning models.

Most of the python code used for the AA was written by Artur Miller [4] and my work colleague Dr. Luke Bovard (‘def archetypal_plot’, line 149, clustering.py).

The best way to access the code and call the functions needed to calculate the AA is to create a folder named ‘clustering’ and to paste these files into it. Then you can access the functions by importing the package at the top of your notebook/Python file:

import clustering as cl

There is no concrete rule for choosing the right number of archetypes (k). The best way is to play around with k and the number of iterations (i) and observe when the RSS curve flattens. A flattening curve can be an indicator that you have selected the right k [1].

After experimenting with multiple k and i, I got the best result with k=5 and i=50. At the beginning of the iterations, bigger RSS minimisations are visible, whereas at the end a flattening of the curve is observable.
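If you prefer not to judge the flattening by eye, a small helper like the one below can flag the first iteration at which the relative RSS improvement becomes negligible. This is only a sketch: the function name `flattening_point`, the 1% tolerance and the RSS values are invented for illustration and are not results from the football data.

```python
def flattening_point(rss_values, tolerance=0.01):
    """Return the first index whose relative RSS drop is below `tolerance`.

    Returns None if the curve never flattens within the given values.
    """
    for i in range(1, len(rss_values)):
        previous, current = rss_values[i - 1], rss_values[i]
        if previous <= 0:
            return i  # nothing left to improve
        if (previous - current) / previous < tolerance:
            return i
    return None


# Made-up RSS curve: big drops at first, then almost flat.
demo_rss = [100.0, 60.0, 40.0, 39.8, 39.7]
print(flattening_point(demo_rss))
```

The same check can be repeated for different values of k to compare where, and at which RSS level, each curve levels off.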

In Figure 4, we can see the AA for the first feature in our matrix X, ‘wins’. Each blue dot represents one of 98 teams, and the orange dots represent the five archetypes which were calculated (Z). In the upper right of the graph is a group of teams that differ strongly from the rest. In the middle we can see a group of teams which win on a regular basis, and at the lower left we can see the biggest concentration of clubs, which indicates teams that do not win so frequently.

In order to obtain the archetypes on a team level, we first have to transform the results by calling the transform function and passing in the initial input dataset X. This returns the 5 archetypes for the 98 football teams (Germany, France, Spain, Italy and England).

In order to plot the results, we have to call multiple functions on the results from the AA: archetypal. This shows us the distribution of the football clubs and where clusters were identified:

Although Figure 5 gives us a first overview of the distance between each team and the corresponding archetypes, we still need each team's percentage share of affiliation to each of the five archetype groups:
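To illustrate what such a percentage share means, here is a minimal NumPy sketch that expresses one team as a mixture of given archetypes, with weights that are non-negative and sum to one. It uses projected gradient descent, which I picked because it is short to write down; it is not necessarily how the ‘clustering’ package computes the shares. The function names and the toy archetypes are made up.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of v onto {a : a >= 0, sum(a) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ranks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / ranks > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)


def archetype_shares(x, Z, n_steps=500):
    """Weights a (non-negative, summing to one) that minimise ||a @ Z - x||^2.

    x is one team's feature vector, Z has one archetype per row.
    """
    k = Z.shape[0]
    a = np.full(k, 1.0 / k)  # start from equal shares
    step = 1.0 / np.linalg.norm(Z @ Z.T, 2)  # 1 / Lipschitz constant of the gradient
    for _ in range(n_steps):
        gradient = (a @ Z - x) @ Z.T
        a = project_to_simplex(a - step * gradient)
    return a


# Toy example: three archetypes in a two-feature space and one team inside their hull.
Z_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
team = np.array([0.3, 0.2])
shares = archetype_shares(team, Z_demo)
percentages = np.round(100 * shares, 1)
print(percentages)
```

For this toy team the shares should come out close to 50%, 30% and 20%, because the point is exactly 0.5 times the first archetype plus 0.3 times the second plus 0.2 times the third.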

Taking a first look at a snippet of the results, and with a basic understanding of the European football world, it is apparent that groups 4 and 5 tend to represent high performing clubs (‘BVB 09 Borussia Dortmund’, ‘FC Bayern München’, ‘FC Barcelona’), while groups 1 and 2 tend to represent low or medium performing clubs (‘1. FC Köln’, ‘FC Augsburg’, ‘Real Club Celta de Vigo’).

Nevertheless, teams like Barcelona or Real Madrid also have certain properties that make them part of groups 1 and 2. This might be interesting, especially when those equally strong teams play against each other. The performance during the last games influences the percentage of a team in each group.
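Since the whole point of these shares is to use them as features, here is a hedged sketch of how they could be attached to a match-level table, once for the home team and once for the away team. All column names (`team_name`, `home_team`, `away_team`, `archetype_1`, ...) and the values are invented; your own datasets will most likely use different names.

```python
import pandas as pd


def add_archetype_features(matches, shares, key="team_name"):
    """Join the archetype shares onto each match for both the home and the away team."""
    out = matches
    for side in ("home", "away"):
        # Prefix the share columns so home and away features do not collide.
        renamed = shares.rename(columns=lambda c: c if c == key else f"{side}_{c}")
        out = out.merge(renamed, left_on=f"{side}_team", right_on=key, how="left")
        out = out.drop(columns=key)
    return out


# Made-up shares and a single made-up match.
demo_shares = pd.DataFrame(
    {
        "team_name": ["Team A", "Team B"],
        "archetype_1": [0.1, 0.7],
        "archetype_2": [0.9, 0.3],
    }
)
demo_matches = pd.DataFrame(
    {"home_team": ["Team A"], "away_team": ["Team B"], "result": ["H"]}
)
features = add_archetype_features(demo_matches, demo_shares)
print(features.columns.tolist())
```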

Summary

I hope this simple introduction to Archetype Analysis helps you to cluster your data set and identify your extreme values. The AA can be especially useful if you are observing a market or trend and want to point out the differences over time. Basically, you can use the AA as part of your feature engineering and use the newly gained features in any other machine learning model. If you want to take a deeper look behind the curtains of the Archetype Analysis, I recommend this or this article.

Thanks a lot for reading.

References:

[1] Friedrich Leisch, Manuel J. Eugster: https://www.researchgate.net/publication/46515738_From_Spider-Man_to_Hero_-_Archetypal_Analysis_in_R

[2] Adele Cutler, Leo Breiman: https://digitalassets.lib.berkeley.edu/sdtr/ucb/text/379.pdf

[3] Christian Bauckhage, Dr. Christian Thurau: https://link.springer.com/chapter/10.1007/978-3-642-03798-6_28

[4] Artur Miller: https://miller-blog.com/archetypal-analysis/