[Header image: clustering communicated via a sieve metaphor]

World Cup 2018 Clustering Analysis

In search of clusters and patterns within team data from the world's greatest tournament using unsupervised machine learning

Julian Jordan · Dec 28, 2021 · 17 min read

As a design researcher with experience applying qualitative and quantitative approaches, I like blending the worlds of design and data science where I can. Over the last six months I have returned to studying Python. I am by no means a refined or expert programmer, but I enjoy learning Python to better collaborate with data scientists and to perform my own personal and professional analyses. In this article I share the thinking and coding process I used to run an unsupervised clustering algorithm on teams from the 2018 World Cup.

Football (soccer, in the USA where I grew up) has always been a part of my life — I played organized soccer from kindergarten until graduating from college. Every four years the World Cup produces my favorite 30 days of sport full of interesting data. So when I found a dataset for the 2018 World Cup, I was eager to dive in and test my evolution as a Python beginner. At a high level, I wanted to perform a clustering analysis using average team stats as variables. What would I find? Perhaps teams would cluster by “style of play”, by region, or something else. I used Google Colab, a Google Research product that is a hosted Jupyter notebook service allowing anybody to write and execute arbitrary Python code through a browser. It’s dope.

The dataset I used was from a cool person called DH. I downloaded the CSV and then uploaded it into Colab to get started. Improvement Note: arguably a more efficient way to do this would be to point my Python script directly at the file's URL with df = pd.read_csv(url).
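Here is a minimal sketch of that approach; the URL below is hypothetical and just stands in for wherever the raw CSV is hosted:

import pandas as pd

url = 'https://example.com/world_cup_2018_stats.csv'  # hypothetical location of the raw file
df = pd.read_csv(url)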

Outline of Objective

  1. explore data; clean and preprocess features needed for clustering
  2. run the k-means clustering algorithm to see what clusters emerge (note to reader: skip ahead to section (2) if you just want the clustering output + analysis)
  3. create 2x2 visualizations to see how different teams/clusters compare

(1) Exploring, Cleaning, Preprocessing

🗃 (1A) Importing Data, Initial Looks and Adjustments

First, I import the data.

# import the necessary libraries and the data
import io
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
from google.colab import files

warnings.filterwarnings('ignore')
data = files.upload()

Running the above prompts me to upload the CSV file, and I then read it with the line of code displayed below. This pulls the data in as a pandas dataframe (think of a dataframe as an Excel table).

df = pd.read_csv("world_cup_2018_stats.csv")

To look at the first and last rows of the dataframe (df) I use the .head() and .tail() methods.
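For reference, these are simple one-liners:

df.head()   # first 5 rows of the dataframe
df.tail()   # last 5 rows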

What I see:

  • each game is represented by 2 rows of stats — one row for each team that played (thus, there are 128 rows for the 64 World Cup games)
  • “Group” column shows at what tournament stage a given game occurred

What I do:

  • I like having the option to investigate aggregate differences between tournament stages, so I create a simple dataframe showing each team and the most advanced stage it reached
  • I filter the original dataframe on stage (the “Group” column) to see which teams reached which stages (Group, Round of 16, Quarter-finals, Semi-finals/Final)
  • I create a new dataframe matching each team to its latest stage

Here is the code I used for filtering:

dfrsixteen = df.loc[df['Group'] == 'Round of 16', :]
dfquarter = df.loc[df['Group'] == 'Quarter-finals', :]
dfsemi = df.loc[df['Group'] == 'Semi-finals', :]

I then use what this filtering code produced to create a new dataframe, dfround, for potential use later on.
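My dfround code isn't shown in this article, but a sketch of one way to assemble it from the filters above might look like this (the stage labels are assumptions based on the "Group" column values):

# stack Team/stage pairs from earliest to latest stage...
rounds = pd.concat([
    df[['Team']].assign(Round='Group'),               # every team played the Group stage
    dfrsixteen[['Team']].assign(Round='Round of 16'),
    dfquarter[['Team']].assign(Round='Quarter-finals'),
    dfsemi[['Team']].assign(Round='Semi-finals/Final'),
])
# ...then keep each team's last (most advanced) entry
dfround = rounds.drop_duplicates(subset='Team', keep='last').reset_index(drop=True)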

Simple output of the new dataframe, dfround

👾 (1B) Exploring the Main Dataset

I use df.info() to see what data types I have. I do this because k-means clustering only works with numeric data and I want to see if I need to convert any data types.

The variables I will want to use are integers. The "Pen Shootouts" columns are floats, and "Group", "Team", and "WDL" are objects; I will remove these before clustering.

Then I use df.describe().transpose() to see a summary of the distribution of all numeric data.
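For reference, the two checks together:

df.info()                   # column names, dtypes, non-null counts
df.describe().transpose()   # count, mean, std, min, quartiles, max per numeric column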

In general, the standard deviations look expressive given each variable's max and min, as well as the relevant quartile values. For many variables there is a healthy distance between min and max values — this is good. What would not be good is a variable without much distance between its max and min (as would be the case if a variable only took values like 1, 2, 3). Because k-means is a distance-based algorithm, it is preferable to have variables with expressive differences (not 1, 2, 3) that are continuous.

To decide which variables I'll want to isolate for clustering later on I ask three questions:

  • which variables will best differentiate teams?
  • which variables are more descriptive of how a team plays?
  • which variables have an expressive range of values?

I settle on the following list of variables:
'Goals For', 'Goals Against', 'Attempts', 'On-Target', 'Blocked', 'Woodwork', 'Corners', 'Offsides', 'Ball possession %', 'Pass Accuracy %', 'Passes', 'Distance Covered km', 'Balls recovered', 'Tackles', 'Blocks', 'Clearances', 'Yellow cards', 'Red Cards', 'Fouls Committed'

🔩 (1C) Dataframe Manipulation

I choose to create a team-based view of the data in order to more clearly compare teams. Essentially, I want to calculate a given team's average game stats based on all the games that said team played. The downside: since not all teams played the same number of games, the average stats of teams that went further in the Cup draw on more data. But since I already don't have much data to work with, and since I don't want to overcomplicate things, I decide to accept that imperfection.

I create a new dataframe with a team-based view using the following code.

df_teamsort1 = df.sort_values(['Team'], ascending=True)

Improvement Note: with variable names it is usually preferable to avoid numbers, so "df_team_sorted" might have been better than what I chose above.

Note: I only show part of the output here, as there are still a lot of variables off-screen to the right

This team-based view is interesting not only because it moves me closer to comparing and clustering teams, but also because, if I want, I can see what each team does over the course of the Cup.

Next, I officially select the variables that I want to use for clustering.

# selecting the variables for clustering (Group, Team, WDL come along for
# reference, but will not feed the clustering itself)
df_teamsort1_redux = df_teamsort1[['Group', 'Team', 'WDL', 'Goals For',
    'Goals Against', 'Attempts', 'On-Target', 'Blocked', 'Woodwork',
    'Corners', 'Offsides', 'Ball possession %', 'Pass Accuracy %', 'Passes',
    'Distance Covered km', 'Balls recovered', 'Tackles', 'Blocks',
    'Clearances', 'Yellow cards', 'Red Cards', 'Fouls Committed']]

I need to rename certain columns so they are amenable to potential manipulation later on. For example, if I want to filter or execute operations on a column like “Pass Accuracy %” Python may very well give me syntax errors because of the spaces and symbols. So, I change “Pass Accuracy %” to “Pass_Accuracy_pct”.

# renaming columns
df_teamsort1_redux.columns = ['Group', 'Team', 'WDL', 'Goals_For',
    'Goals_Against', 'Attempts', 'On_Target', 'Blocked', 'Woodwork',
    'Corners', 'Offsides', 'Possession_pct', 'Pass_Accuracy_pct', 'Passes',
    'Distance_km', 'Balls_recovered', 'Tackles', 'Blocks', 'Clearances',
    'Yellows', 'Reds', 'Fouls']

Improvement Note: this line of code is long, and hardcoded parameters can block future code flexibility and expansion. I could have used a list to solve this, with something like the following:

variables_of_interest = ['Group', ...]
df_teamsort1_redux = df_teamsort1[variables_of_interest]

I also perform some operations / add new columns that will be useful for clustering. First, I create a "Goals_delta" variable which gives me, per game, the goals a team scored minus the goals it allowed. This tells more of the story per game (and per team) than simply how many goals a team scored. Second, I create an "On_target_pct" variable, which is nice for k-means because percentages are continuous. I set this variable's type to integer (intuitively it should be a float, but for some odd reason the first time I executed the code I got a type error, one that went away after converting the type to integer. Anyway, onward).

df_teamsort1_redux['Goals_delta'] = (
    df_teamsort1_redux['Goals_For'] - df_teamsort1_redux['Goals_Against']
)

df_teamsort1_redux['On_target_pct'] = (
    df_teamsort1_redux['On_Target'] / df_teamsort1_redux['Attempts']
) * 100

df_teamsort1_redux['On_target_pct'] = df_teamsort1_redux['On_target_pct'].astype(int)
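I never pinned down the cause of that type error, but one plausible hypothesis is a NaN creeping in from the division, since .astype(int) raises on NaN or inf values. A more defensive sketch (the guard is my own addition, not my original code):

# fill any NaN from the division before casting; .round() avoids silent truncation
on_target_pct = (df_teamsort1_redux['On_Target'] / df_teamsort1_redux['Attempts']) * 100
df_teamsort1_redux['On_target_pct'] = on_target_pct.fillna(0).round().astype(int)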

🍩😐 (1D) Normalizing Data

The next steps are noteworthy because I aggregate and normalize the data, making operations less painful going forward.

First, I take the names of the columns I want to use for clustering and collect them into one list variable, columnslist_normalized. This is efficient because I can drop this one variable into future functions and use it to alter all the cluster-relevant columns in just one place.

columnslist_normalized = ['Goals_For', 'Goals_Against', 'Attempts',
    'On_Target', 'Blocked', 'Woodwork', 'Corners', 'Offsides',
    'Possession_pct', 'Pass_Accuracy_pct', 'Passes', 'Distance_km',
    'Balls_recovered', 'Tackles', 'Blocks', 'Clearances', 'Yellows',
    'Reds', 'Fouls', 'Goals_delta', 'On_target_pct']

I put this new variable, columnslist_normalized, to use when I create df_teamavg, a new dataframe that houses the average (mean) stats for each team. Here, I use the groupby function to pull all of a given team's games into one row of averages.

df_teamavg = df_teamsort1_redux.groupby('Team', as_index=False)[columnslist_normalized].mean()

Process note: I create new dataframes (or copies) throughout my work as it helps me keep track of the progression of my data transformations. For me, it makes step-by-step issue diagnosis easier.

Note: the above copy/paste from Colab shows all mean variables for only 9 teams, just to give a general sense of the transformed data

What I see:

  • the variables have different ranges and scales

What I want to do:

  • remove this variation and make the variables more comparable. This calls for normalization, the act of rescaling data so that all values are within a new, comparable range of 0–1.

This post helps to explain why normalization is relevant for k-means: it helps avoid implicit, unwanted variable weighting, which could distort the clustering.

To execute the normalization I use the scikit-learn class MinMaxScaler. Instantiating a class like MinMaxScaler creates an object.

SAY WHAT?

Think of scikit-learn as a “library” with a bunch of “books” (classes). When these "books" are “opened” (instantiated into objects) one can put "ideas and info" (data) into them. Then one can do cool things with said data. So, we are going to pull MinMaxScaler from scikit-learn, instantiate MinMaxScaler and then put some data into MinMaxScaler so that it (MinMaxScaler) can do its thing with said data.

When we put a variable (the data from a column) into MinMaxScaler, it takes each value in that column, subtracts the column's min value, and then divides by the column's range:

y = (x - min) / (max - min)

From my research, a favorable trait of MinMaxScaler is that although it scales values to the 0–1 range, it doesn't distort the variable much: it preserves the shape of the original distribution.
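To make the formula concrete, here is a toy example with one column of values [2, 5, 11], where min = 2 and max = 11:

import numpy as np

x = np.array([[2.0], [5.0], [11.0]])
# (x - min) / (max - min) -> [0.0, 0.333..., 1.0], the same math MinMaxScaler applies per column
y = (x - x.min()) / (x.max() - x.min())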

Below, once MinMaxScaler is defined, I call the fit_transform() method on my main dataset (x) to create a transformed version of my main dataset (x_scaled).

# import preprocessing, the module housing MinMaxScaler
from sklearn import preprocessing

# take the desired columns from df_teamavg, convert them to a numpy array
# with the .values attribute, and store the result as a variable, x
x = df_teamavg[columnslist_normalized].values

# define a MinMaxScaler instance with default hyperparameters and store it
# as a variable, scaler
scaler = preprocessing.MinMaxScaler()

# call fit_transform() on my dataset, x, to create a transformed version of
# the dataset, which will be x_scaled
x_scaled = scaler.fit_transform(x)

# convert x_scaled, still a numpy array, back to a pandas DataFrame, X_norm,
# so I can keep rocking with it
X_norm = pd.DataFrame(x_scaled)

# make a copy of the original df_teamavg because I don't want to
# "back-modify" it
df_teamavg_normalized = df_teamavg.copy()

# take the X_norm DataFrame (which has all the nice normalized data) and
# "pass" its data into the recently-made df_teamavg_normalized, inserting
# only the numeric columns; since X_norm and
# df_teamavg_normalized[columnslist_normalized] have the same dimensions
# they fit together, and I regain the Team column via df_teamavg_normalized
df_teamavg_normalized[columnslist_normalized] = X_norm
Normalized Output:

Improvement Note: a friend and expert data scientist told me that the above bit of code is somewhat complex and arguably should be a function — one with inputs (the original DataFrame and the columns to be transformed) and an output (the new DataFrame). I haven't done that yet, clearly, but I'll get there :)
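For what it's worth, a sketch of that refactor could look like the following (the function name and docstring are my own invention):

from sklearn import preprocessing

def normalize_columns(frame, columns):
    """Return a copy of `frame` with `columns` min-max scaled to the 0-1 range."""
    scaler = preprocessing.MinMaxScaler()
    out = frame.copy()
    out[columns] = scaler.fit_transform(frame[columns].values)
    return out

# usage: df_teamavg_normalized = normalize_columns(df_teamavg, columnslist_normalized)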

(2) Running the K-means Algorithm

👷🏾 (2A) The Coding of K-means

Now I run the k-means algorithm on the normalized data. First, a quick attempt at explaining some of the terminology and code below (thanks to a friend's "cookie" analogy).

kmeans = an object, think of it as a “cookie with certain magical traits”
KMeans = a class, think of it as a “cookie maker” or a “factory of different types of kmeans cookies”

import sklearn.cluster as cluster

# make a kmeans "cookie" from the KMeans "factory" and instantiate it with
# the attribute of 4 clusters (n_clusters=k)
kmeans = cluster.KMeans(n_clusters=4, init='k-means++', max_iter=100, random_state=22)

#fit my input data to the kmeans model
kmeans = kmeans.fit(df_teamavg_normalized[columnslist_normalized])
#get cluster labels based on how the model groups my data (my teams, based on their average stats)
labels = kmeans.predict(df_teamavg_normalized[columnslist_normalized])

I could have also done the above by writing the following code:

kmeans = cluster.KMeans(n_clusters=4).fit(df_teamavg_normalized[columnslist_normalized])

But I wanted to break it into step-by-step parts to make sure I was following what was going on, and so that I could explain it to others more easily.

Important Note on my choosing 4 clusters: there are several methods available to estimate the optimal number of clusters; the elbow and silhouette methods are the most common. While not perfect, they are a good starting point and sanity check when used in conjunction with subject-matter knowledge and project goals. I have executed both methods in other situations, but I elected not to get into the details here, as that was not my main focus. I chose 4 clusters because I felt that more than 4 would make the analysis confusing and spread the data too thin. Additionally, 4 is half the number of tournament groups (8), and it also potentially forces interesting pattern-finding among the 6 FIFA continental zones that participate in the tournament.
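For the curious, a sketch of what those two checks typically look like (not part of my original notebook, shown only to illustrate the idea):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = df_teamavg_normalized[columnslist_normalized]
for k in range(2, 9):
    km = KMeans(n_clusters=k, init='k-means++', max_iter=100, random_state=22).fit(X)
    # elbow: watch where inertia stops dropping sharply; silhouette: higher is better
    print(k, round(km.inertia_, 2), round(silhouette_score(X, km.labels_), 3))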

I now add a column called “Cluster” to my dataframe. “Cluster” is going to populate the dataframe with the clusters that the kmeans algorithm assigned to each team (recall "labels" from the code snippet above). For visualization purposes I use the .pop() function to move the cluster labels to the start (the left) of the dataframe, next to the “Team” column. Then I sort by cluster.

df_teamavg_normalized['Cluster'] = kmeans.labels_

# move the cluster labels to the front: pop the column, then re-insert it
# using insert(position, column_name, first_column)
first_column = df_teamavg_normalized.pop('Cluster')
df_teamavg_normalized.insert(1, 'Cluster', first_column)
df_teamavg_normalized = df_teamavg_normalized.sort_values(['Cluster'], ascending=True)

A view of the output — the normalized average stats of each team and their assigned k-means cluster (second column):

Because there are 32 teams and still lots of variables, I quickly transferred some of the data to Excel to facilitate visualizations.

A last adjustment before analyzing the clusters: I want to see each team, and its cluster, connected to the non-normalized data (which is easier to interpret) in addition to the normalized data. The normalized data served its purpose in that it made clustering easier, but analyzing differences between clusters and drawing conclusions will be easier with the original, non-normalized numbers. To do this, I create a new, small dataframe that I can "attach" to the front of the non-normalized stats. This new "attachable" dataframe is called df_team_clusters.
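The snippet that builds it isn't shown here, but it could be as simple as this sketch:

# just team names plus their assigned clusters, pulled from the normalized frame
df_team_clusters = df_teamavg_normalized[['Team', 'Cluster']].copy()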

I "attach", or merge, the above dataframe (df_team_clusters) to my earlier dataframe of non-normalized stats (df_teamavg) to create df_teamavg_withclusters

df_teamavg_withclusters = df_team_clusters.merge(df_teamavg, on='Team', how='inner')

df_teamavg_withclusters contains teams, their clusters and their average, non-normalized stats.

In order to get a look at the average stats of each cluster, I use .groupby():

Cluster_StatsActual = df_teamavg_withclusters.groupby('Cluster', as_index=False).mean()
Cluster_StatsActual

Below I show (1) the clusters and their respective teams after some formatting in Excel, and (2) the Excel-modified output of Cluster_StatsActual, showing average stats per game for each cluster.

Clusters and their respective teams after some formatting in Excel
Excel-modified output of Cluster_StatsActual, showing average stats per game for each cluster

🔮 (2B) Analyzing the Output

Before visualizing the data, I want to address the clusters produced.

Cluster 0 - "target practice" teams that clearly struggled most in the Cup:

  • only one of its member teams, Denmark, got out of the Group stage
  • performed worst in stats related to goals (Attempts, On Target %, Corners)
  • always “on the defensive”, recording the most clearances, fouls, blocks

Cluster 1 - “powerhouses” with different journeys, unified by key stats:

  • score high in the “attractive football” categories — logging the most Possession, Passes, Attempts, and Corners
  • ran quite a bit (Distance_km), so we can't say they weren't trying
  • didn't recover the ball very much (this could be because they had it more) and they were not as efficient with their chances as Cluster 2
  • do a lot of things better than the field, but not enough to get to the Semis

Cluster 2 - “get it done” teams that advanced most and did well overall

  • didn't turn heads with “attractive football”, but put the ball in the net
  • ran hard, fouled hard, and recovered the ball well
  • their On_target_pct and Goals_delta stats were tops - they were mad efficient
  • makes me wonder if Goals_delta was an overpowering variable

Cluster 3 - “good” teams that, by the stats, didn't do anything remarkable

  • achieved 2nd or 3rd place in all categories, which is not…"bad"
  • do not seem to like running
  • give up the most goals

There were a few things in the data that piqued my curiosity and made me wonder about the impact of outliers:

1) Who is pulling the Distance_km stat up so high for Cluster 2?
2) What were Sweden and Russia doing so well to get that far in the Cup?
3) Did France, Croatia, and Belgium really foul that much?

To quickly gain insight into these questions, I filtered the dataframe to look at specific teams and compare them against the aggregate data. For example, I used the following code to filter:

df_teamavg_withclusters[(df_teamavg_withclusters.Team == 'Sweden') | (df_teamavg_withclusters.Team == 'Denmark') | (df_teamavg_withclusters.Team == 'Croatia') | (df_teamavg_withclusters.Team == 'France') | (df_teamavg_withclusters.Team == 'Russia')]
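Improvement Note: that chain of | conditions works, but .isin() would be tidier; a sketch:

teams = ['Sweden', 'Denmark', 'Croatia', 'France', 'Russia']
df_teamavg_withclusters[df_teamavg_withclusters['Team'].isin(teams)]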

Some answers to my questions above:

1) Russia and Croatia ran their asses off, averaging 125 km and 118 km per game, respectively. Russia, the host, punched above their weight throughout the Cup (likely hopped up on home-crowd excitement). Croatia just seemed to roll like that - they ran all the time, as evidenced by their multiple come-from-behind wins. They really were the hard-working Cinderella of the tourney. France, the champion, actually ran much less than others, with 101 km per game on average.

2) Sweden really showed nothing too impressive… oh wait, they nearly had the highest average On_target_pct and were shot-blocking machines. Russia was a typical Cluster 2 team, and their hard, steady play worked out for them.

3) Yes, France, Croatia, and Belgium (3 of the 4 semi-finalists) fouled a lot, between 13 and 16 times per game. That is higher than I would have imagined for such skilled teams.

(3) 2x2 Visualization Comparisons

📈 (3A) Importing Viz Libraries and Plotting

Now, on to what I enjoy even more than clustering - the visualization of clusters and their related data. Here, I write code to visualize each team (tied to their clusters by color) in 2-by-2 plots.

First, I import the necessary libraries.

import seaborn as sns
import matplotlib.pyplot as plt

The code below contains the parameters needed to produce a readable visualization of clusters grouped by color. It also contains a for loop / iterator that allows me to print the name of each team near its respective data point.

The first question I was trying to answer was “Does aggression pay off?” In other words, do teams that foul more score more goals?

# viz parameters
plt.figure(figsize=(20, 12))
sns.scatterplot(data=df_teamavg_withclusters, x='Fouls', y='Goals_For',
                hue='Cluster', legend=True, alpha=1, palette='bright', s=50)

# for loop to print the name of each team near its dot
for i in range(df_teamavg_withclusters.shape[0]):
    plt.text(x=df_teamavg_withclusters.Fouls[i] + 0.09,
             y=df_teamavg_withclusters.Goals_For[i],
             s=df_teamavg_withclusters.Team[i],
             fontdict=dict(color='black', alpha=1, size=10))

# set x limit
plt.xlim(df_teamavg_withclusters.Fouls.min() - 0.1,
         df_teamavg_withclusters.Fouls.max() + 1)
# title
plt.title('"Converted Aggression" — Goals For x Fouls (Actual)', fontsize=20)
# x label
plt.xlabel('Fouls', fontsize=15)
# y label
plt.ylabel('Goals_For', fontsize=15)
plt.show()

Do teams that foul more score more goals?

There is no strong correlation between fouls and goals. But we can see that Russia clearly outperformed with respect to “converted aggression”. Croatia did as well, to a degree. Morocco and Korea were just hacking for fun it seems, while England and Spain were nice guys.

My second question was “Does running (Distance_km) pay off in the form of a good Goals_delta stat?” In other words, are the teams running the most outscoring their opponents? (I won't insert the code here — it is the same as above except for the new parameters related to this question.)

There seems to be a slight, slight positive correlation… but it's a stretch. Another potential stretch: this particular 2x2 sort of shows me the clusters grouped together (greens near greens, reds near reds).

What this definitely does show, and what intrigues me, is that teams like France, Belgium, and Brazil did not run an absurd amount but were still able to outscore teams. On the other hand, it seems Russia, Croatia, and England really "worked hard for their money". Remember that song?

My third question was “Does passing perfection tie to shooting perfection?” In other words, if a team has great pass accuracy are most of their shots on target?

Again, it seems this 2x2 sort of shows me the clusters grouped together. What is noteworthy here is something I mentioned earlier. Fucking Sweden! They are an outlier of sorts — they cannot pass for shit, but man their shots are on target. Peru surprised me too! Croatia, interestingly enough, was not an overly accurate team — Modrić tried his best, but he couldn’t do it alone.

After seeing these 3 visualizations I can see why France got it done. They are always among the top performers.

The last question that interested me was "Did the amount teams ran change over the course of the Cup?" Remember when waaay back in the beginning of this article I connected each team to its furthest stage in the tournament? Now I join that data (“Round”) with the cluster data to create a new dataframe, df_withclusters_rounds. With this I can visualize a boxplot showing each round of the Cup and how much the teams that reached that round were running. Each dot represents a team at its furthest stage, and the color of the dot represents its cluster.
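The join itself (I don't show my original snippet) could be a sketch like this, assuming dfround carries each team's furthest round in a "Round" column:

df_withclusters_rounds = df_teamavg_withclusters.merge(dfround, on='Team', how='inner')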

plt.figure(figsize=(20, 5))

# define data to populate the boxplot and formatting
ax = sns.boxplot(data=df_withclusters_rounds, x='Round', y='Distance_km',
                 orient='v', color='lightgray', showfliers=False)
plt.setp(ax.artists, alpha=0.5)

# add in points to show each observation
sns.stripplot(x='Round', y='Distance_km', data=df_withclusters_rounds,
              jitter=True, size=6, linewidth=0, hue='Cluster', alpha=0.7)

ax.axes.set_title('Distance per Round with Cluster Dots', fontsize=30)
ax.set_xlabel('Round', fontsize=20)
ax.set_ylabel('Distance_km', fontsize=20)

# define where to place the legend
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

Nothing too earth-shattering here, but let's finish it out. In the Group stage and the Round of 16, teams run about the same amount. It seems like in the Quarter-finals they really take their foot off the gas (this surprised me). However, maybe a better interpretation is that teams simply condensed to the same amount of running. See that green dot way up high in the Quarter-finals column? It has got to be crazy-running Russia or Croatia; both had overtime drama. In the Semis and Final, the median and the upper limit pop up — perhaps because teams are so close to total victory that they run until they drop.

Wow. Well, that was a mouthful.

I hope this, one of my early attempts to put Python into practice with sports data, was at least a little bit interesting to someone. If you made it this far, I hope you enjoyed it. Until next time.
