Two Minute Thesis: Clustering Algorithms

4 min readMay 30, 2015

I completed a year long undergraduate thesis as part of my Engineering degree. Focussing on a single project for that long was a valuable and challenging experience. Like I suspect many undergraduate theses it taught me the effort required to create something genuinely new through research.

At the end of the year I had a ~100 page LaTeX document with a good overview of existing research and some parts which I felt were genuinely new.

However where I have found my thesis invaluable is during job interviews. In every job interview I have either been asked directly, or mentioned as part of an answer, to give an overview of machine learning and clustering. This has been a great differentiator between candidates who either completed a thesis they don’t care about, or didn't do one as part of their degree. I've got this two minute thesis pitch down to a fine art now, so thought it might be useful for some as an introduction to machine learning and clustering.

What is Machine Learning?

Machine Learning is solving problems which cannot be solved algorithmically. An example of an algorithmic problem is a payroll. Each time payroll is calculated the same steps can be followed to generate the correct answer. An example of a problem which can’t be solved algorithmically is determining someone's age from a photo of them.

Two Types of Machine Learning

There are two main types of machine learning, supervised and unsupervised. Supervised learning starts with an existing set of ‘labelled’ data, data with known solutions. For example to determine someone's age you would start with a collection of photos labelled with the age of the person. You would then analyse the images to extract attributes, and find correlations between those attributes and the person's age. Once you have built a model based on these attributes you would test this model against more labelled photos and see how close the model came to the labels. You can objectively calculate how well your model works by summing the absolute difference between predicted age and labelled age. Once you are confident in your model you can start applying it as a predictor for unlabelled data.

Unsupervised (or exploratory) learning involves building models without starting with labelled data, and this is where clustering comes in.

What is Clustering?

Clustering is taking an unlabelled dataset and breaking it into clusters of similar data. The classic example of clustering is the k-means algorithm, which is quite simple to explain intuitively.

Say you start with a collection of data with x and y coordinates. You start by deciding how many clusters you want to generate, this is known as k. You place k random cluster centers within your data. Then for each point of the data you calculate which center is closest by using a distance measure for example straight line distance. Once all points are allocated to a cluster center your move the cluster center to middle of those points, by taking the average x and average y of all assigned points. This process repeats until the cluster centers stabilize.

The result is then a partition of the data, that is each data point is allocated to a specific cluster.

Say you run an online store which sells wine, how would you apply clustering?. Firstly the store would generate a report linking customers to purchasing, and adding attributes to each purchase, for example the origin of the wine or the discount applied. Then each cluster would be analyzed individually. For example one cluster may only buy wine on sale, another may only buy French wine. When advertizing to these customers instead of sending one email to all customers you would send an email with wines on sale to one cluster, and an email with only French wine to the other cluster.

So what did you do in your thesis?

The issue with clustering is its highly subjective, and there is no perfect way to measure effectiveness. Just in the simple example above we had to pick a value of k, pick an algorithm to use, a distance measure and then analyze the result to see if the clusters made sense.

I performed a research study on various heuristics to make this process less subjective. I then combined the best methods from existing research into an end to end process and tested this process on some commonly used machine learning datasets.

In broad terms this process involves:

Selection of objects and variables — removing outliers and selecting variables with a high effect on the clustering result
Normalization, Distance Measure and Clustering algorithms— testing a combination of normalization, distance and clustering methods
Partition Stability and Comparison — Comparing clustering results using heuristics

The takeaway

The vast majority of businesses could benefit from simple clustering using k-means, followed by investigating the results manually. This enables targeted marketing and a way to understand trends in your data. Unsupervised learning has quite low barriers to entry as you dont need to know what you are looking for, making this applicable to a wide range of datasets.

If value is found in simple clustering methods there are a number of ways to improve this result, without having to subjectively select and optimize parameters along the way. Additionally there are ways to rank the results of these methods without individually analyzing each one.

References

The wine metaphor above is from the excellent Data Smart by John Foreman. This is a highly recommended introduction to Data Science.