K-Means Clustering: The Premier League

Editor: Ishmael Njie

DataRegressed Team
Oct 1, 2018


Clustering is a branch of unsupervised learning that uses unlabelled data to uncover underlying structure in a dataset. Because the data is unlabelled, each instance has many features but no label to distinguish it from the others. Clustering aims to form groups of instances that are similar to one another. In this article, we will implement a clustering algorithm to find labels that distinguish teams in the Premier League.

The dataset can be found here. It contains statistics for the teams in the Premier League from the 2006/2007 season to the 2017/2018 season. With it, we can ask the question: which teams have consistently performed at a high level in the Premier League?

The K-Means algorithm is a popular clustering method that iteratively assigns data points to one of K clusters. The algorithm groups similar instances by forming clusters of points that are close to each other. The clusters are chosen to minimise the sum of squared distances:

J = \sum_{c=1}^{K} \sum_{x_i \in c} \lVert x_i - \mu_c \rVert^2

Cost function for K-Means

where \mu_c is the mean of the data points in cluster c (the cluster centroid). By minimising this cost function, the K-Means algorithm forms clusters of data points that are similar with respect to the specified feature vector.
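As a concrete illustration, the cost can be computed with a few lines of NumPy. This is a minimal sketch, assuming X is the feature matrix, labels holds each point's cluster assignment, and centroids is the array of cluster means:

```python
import numpy as np

def kmeans_cost(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        np.sum((X[labels == c] - centroids[c]) ** 2)
        for c in range(len(centroids))
    )
```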

Let’s get into how the algorithm works:

Step 1: Initialise K cluster centroids. Given a value of K, the centroids are chosen at random because their true locations are unknown. This is shown in subplot b).

Step 2: Assign each data point to the cluster with the nearest centroid. Subplot c) shows the assignment of the data points to the centroid closest to them, measured by Euclidean distance.

Step 3: Re-compute each centroid as the mean of its newly assigned data points. Subplot d) shows the movement of the centroids.

Step 4: Repeat steps 2 and 3 until a convergence criterion is met; in this case, until the centroids stop moving.
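Putting the four steps together, a minimal from-scratch sketch in NumPy (not the exact code from the accompanying notebook) could look like this:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialise K centroids by picking K random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: re-compute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```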

Let’s look at implementing this:

Sample of the data

The Premier League is regarded as the most popular football division in the world. The data details the statistics of every team that has appeared in the Premier League between 2006 and 2018. With this data, we can uncover structure that is not explicitly presented in the dataset. It includes goals scored inside and outside the box, each team’s losses per season, and the number of clean sheets per season, all of which contribute to a team’s performance. For this article, we will use each team’s wins and goals as the key indicators of its level of performance.
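As a sketch of the preparation step, the per-season statistics can be aggregated into one row per team with pandas. The file name and column names below (premier_league_stats.csv, team, wins, goals) are assumptions and may differ from the actual dataset:

```python
import pandas as pd

# NOTE: file and column names are assumed; adjust them to match the dataset.
df = pd.read_csv("premier_league_stats.csv")

# Total wins and goals per team across the 2006/07-2017/18 seasons
team_totals = (
    df.groupby("team")[["wins", "goals"]]
      .sum()
      .reset_index()
)
X = team_totals[["wins", "goals"]].to_numpy()
```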

Prem wins vs Prem goals

Here we can see a plot of total Premier League wins against total Premier League goals. Based on these two features, we will look for clusters that show how well teams have performed relative to each other.
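Continuing the sketch above (with the same assumed column names), a plot like this can be reproduced with matplotlib:

```python
import matplotlib.pyplot as plt

# Scatter of total wins (y-axis) against total goals (x-axis), one point per team
plt.scatter(team_totals["goals"], team_totals["wins"])
plt.xlabel("Total Premier League goals")
plt.ylabel("Total Premier League wins")
plt.show()
```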

First of all, we need to choose a value for K. Different values of K produce different clusterings, and the best choice varies from dataset to dataset. One way of finding a value for K is the Elbow Method.

Elbow Method

The plot shows the K-Means algorithm run several times on the same dataset, plotting the number of clusters, K, against the corresponding sum of squared distances (the cost function) for each value of K. The key point is the one that forms an ‘elbow’: beyond it, increasing K yields only marginal reductions in the cost. Here, the optimal value of K appears to be 2.
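An elbow plot like this can be produced by fitting scikit-learn’s KMeans over a range of K values and recording the cost (inertia_) each time. A minimal sketch, continuing with the feature matrix X from earlier:

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

ks = range(1, 11)
costs = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    costs.append(model.inertia_)  # sum of squared distances for this K

plt.plot(list(ks), costs, marker="o")
plt.xlabel("Number of clusters, K")
plt.ylabel("Sum of squared distances")
plt.show()
```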

K = 2 with centroids plotted.

Having run the K-Means algorithm with K = 2, we now have two groups of Premier League performance. Straight away, one can see that the cluster at the top right contains far fewer teams than the one at the bottom left. One interpretation is that the teams at the top right are consistent, high-level performers season after season, while those at the bottom left are less outstanding and are at risk of relegation, or have already been relegated. This result could be interpreted further.
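A sketch of this final step with scikit-learn, again assuming the team_totals frame and feature matrix X from earlier:

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
team_totals["cluster"] = model.labels_

# Colour teams by cluster and mark the two centroids
plt.scatter(team_totals["goals"], team_totals["wins"], c=model.labels_)
plt.scatter(model.cluster_centers_[:, 1], model.cluster_centers_[:, 0],
            marker="x", s=200, c="red")
plt.xlabel("Total Premier League goals")
plt.ylabel("Total Premier League wins")
plt.show()

# List the teams in the smaller, high-performing cluster
top_cluster = team_totals["cluster"].value_counts().idxmin()
print(team_totals.loc[team_totals["cluster"] == top_cluster, "team"].tolist())
```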

Thank you for reading! Find the Machine Learning repo for more tasks like this here! The accompanying notebook for this article can be found there as well.
