Introduction to Machine Learning Pt 4

written by Stephen Wilson

Previous blog posts in this series introduced the concept of supervised learning and outlined two examples of supervised learning, namely linear regression and logistic regression. This final post in our introduction to machine learning concepts will teach you about the principles underlying unsupervised learning.

In contrast to the supervised learning approach of regression and classification, unsupervised learning tasks work with unlabelled data (which is where the name comes from: The system learns from data without having ground truth labels in the training set to guide or “supervise” it). Despite this, the same principles underlie unsupervised learning approaches. We still need to give the system:

· some starting point

· some way for it to know if it is making mistakes

· a mechanism for correcting mistakes, so that it becomes better with each iteration

· a way to know when to stop.

Unsupervised learning tasks do not involve trying to predict a continuous variable based on some input (as in regression) or predict a binary class label (as in classification) but rather are concerned with uncovering some structure or pattern in the training data. A common unsupervised learning technique is “clustering”. The sections below will provide some intuition behind what is involved in applying clustering to a toy Scout24 example.


Clustering refers to a collection of related techniques that seek to automatically group data into smaller sub-groups called “clusters”, where it is assumed that the members of each cluster are similar or related to each other in some way. At ImmobilienScout24 we might like to automatically group the users of our platform according to the property listings that they search for and look at it. Grouping similar types of users together can help us provide a better and more personalised service to them, as we can look at the group as a whole and make recommendations to its members based on what other people generally like.

K-means Clustering

We’ll now outline the steps involved in one of the most commonly used clustering algorithms: k- means clustering. The name might seem a little strange but makes sense when you find out that the k simply refers to the number of clusters we would like the algorithm to find and “means” is just the plural of “mean” as in average. The algorithm is simply going to find k clusters in a data set and use the geometric mean applied to groups of the data in order to help find stable clusters.

The figure below shows the plot of some data set. For simplicity and ease of explanation we are showing the data in two dimensions, though it is important to remember that in real-life scenarios the data will typically be in a higher dimensional space.

The task is to categorise the data into sub-groups or clusters, based on how similar the data points are to each other. In this instance, similarity is measured in terms of space: data points that are spatially close to each other are assumed to be more similar to each other than data points which are further away. The way the k- means clustering algorithm works is it tries to iteratively find k points in the feature space, such that each of these points becomes the geometric centre of a cluster where each data point in the training data is closer to one of the k centres than to any other. The geometric centre of a plane figure is also called its centroid and as you will usually see this term in any discussion of k- means clustering, we will use this term in the rest of this post.

Telling the system how it should start

As we have seen in the previous posts on regression and classification, in machine learning we need to give the system a starting point so it knows where to begin. As in the previous examples for regression and classification, it is reasonable to begin by randomly assigning some values, so we will simply randomly pick the k observations from the data set and just these as our centroids.

The algorithm then measures the distance between each data point in the training set and each of the centroids and assigned to the cluster of the nearest one.

For each cluster, the centroid is computed by simply calculating the geometric mean of all points in the cluster. The distance is then measured between all data points in the training data and these new centroids. Data points are then (once more) assigned to the cluster of the nearest centroid. Some data points will move clusters, some might stay in the initial cluster. The centroids of these new clusters are again calculated and the whole process repeats iteratively until we reach a point where no data points move clusters. Then we can say we have found a stable configuration of clusters for our data. Of course, because we randomly selected the initial centroids, if we ran the clustering algorithm again we would begin with different randomly selected centroids, which means that the resulting clusters will be slightly different each time.

The figure below shows the starting 3 centroids chosen at random, then their position after 1, 5 and the final iteration, respectively. The colours represent the clusters to which the data points have been assigned, based on how close they are to one of the centroids. Note how the centroids of the 3 clusters move from iteration to iteration by tracking how the red dots change position in each of the figures. Note too how data points change cluster from iteration to iteration as the algorithm finds centroids that group the data better (take a look at how data points change colour from iteration to iteration).

Choosing a K

One of the challenges with k- means clustering is that very often it is unclear what a suitable value for k is. If we choose too low a value for k then we will potentially have a small number of clusters each containing a large number of data points. In some cases, having a small number of clusters means that the underlying structure of the data that we are trying to uncover is obscured. Choosing too high a value for k also brings challenges. Imagine if the value of k were close to the number of data points in the training set. This would mean that almost every cluster would consist of a single data point. In this instance the underlying structure in the data is also obscured.

Luckily, there is a method that can help guide us toward finding a good value for k called the elbow method.

The basic idea of the elbow method is that we apply the k- means algorithm to the same data set but for a range of values of k (for this example, let’s imagine that we want to find the optimal value for k in the range 1 to 10).

We first start with the lowest value for k in our range and apply the algorithm. For each cluster we compute the cluster mean and then for each data point in that cluster we compute the squared error by subtracting the data point from the cluster mean and multiplying the result by itself. Subtracting the data point from the mean allows us to measure how “far away” the data point is from the mean, while squaring it ensures that we will end up with a positive value. We then add up all the squared errors and make a note of it.

The intuition behind this is that for clusters made up of very similar data points then the difference between each data point and the cluster mean will be relatively small and consequently the sum of squared errors will be smaller than for clusters made up of more dissimilar data points where the difference between data points and the mean are larger. Take a look at the figures below to see the difference (cluster mean is shown in red).

As we increase the value of k we also increase the number of clusters and consequently the sum of squared errors will decrease (the more clusters we have, the fewer data points each cluster will have so they will naturally be closer to the cluster mean and therefore the sum of squared errors will be smaller).

If we then graph the different values of k against their corresponding sum of squared errors we can then see clearly how the value of k influences the sum of squared errors. Take a look at the figure below.

What is also clear from the graph is that reduction in the sum of squared error tails off after a certain value of k and that we don’t really gain much by increasing the number of clusters. If you use your imagination, you can see that the graph in the figure above resembles an arm that is bent at the elbow. In the elbow method our aim is to identify where this elbow is and take the corresponding value of k as our optimal number of clusters. In the figure above, we can see that there is a clear “elbow” at k= 3 and that after this the drop in the sum of squared error tails off. In this instance it would be reasonable to take 3 as our preferred value for k.

I am one of the Data Scientists in Residence and work mainly for Scout24’s real estate platform ImmobilienScout24. I have a PhD in Computer Science and a background in computational linguistics and speech processing. The Data Science Team is hiring! If you have a passion for machine learning and data science (and like to inspire that same passion in others) then come and join our team. Open positions can be found here.