Introduction to Clustering — An Unsupervised Learning Algorithm

Can machines identify patterns on their own? Clustering don’t need no education!

Akanksha
AI Graduate
9 min read · Feb 1, 2019


Our brain is a master at identifying patterns in our sensory inputs. Almost unknowingly we find patterns in the most obscure things. You can quickly pick up the rhythm of a song and tap along, even if you have never heard it before. Looking up at a cloudy sky you can see different types and shapes of clouds even if you don't have a degree in meteorology. You have been doing such amazing things since you were a small child.

Consider a scenario where a child is shown the following images of birds and animals, with each one explicitly labelled as an animal or a bird. The child learns enough about each category that when it is shown new images, like the ones on the right, it is able to identify them as Animal or Bird. This is an example of Supervised Learning. Examples of Supervised Learning algorithms are Decision Trees, Linear Regression and Logistic Regression.

Figure 1. Supervised Learning

Now, on the other hand, consider a scenario where the child had no learning phase and was shown the images in Figure 1(a) without the labels. In this case, when the child is asked to say whether an image is a bird or an animal, he lacks the information that would help him do so. The best he can do is come up with the following groups, based on common patterns (wings and legs, for example). This is how Unsupervised Learning works: we show a lot of data to our algorithm and ask it to find patterns in the data by itself.

Figure 2. Unsupervised Learning

When you have a lot of data but no labels for the algorithm to learn and evaluate from, and the machine uses just the input data to find patterns, it's called Unsupervised Learning.

Source: https://www.slideshare.net/RMitra1/machine-learning-disrupting-car-insurance-industry/

There are many techniques that form part of Unsupervised Learning, like Clustering and Anomaly Detection. In this article we will cover the basics of a fundamental but very useful one called Clustering.

To understand Clustering, let's walk through a simple real-life example. A mother asks her two children to arrange the following playing blocks.

Figure 3. Playing Blocks

The children come up with 2 different groupings, as shown in the figure below, based on different "similarities" in the blocks. This is clustering! Each of her children came up with a different type of grouping. One child grouped them based on shape whereas the other grouped them based on color. There is no right or wrong way!

In fact, there are a lot of different ways (a.k.a. similarity measures) in which data can be arranged into different groups — size, shape, color, texture and other complex features.

Figure 4. Clustering of playing blocks

How, then, does one pick one set of clusters over the other? That depends on the similarity measure, chosen in this case by the mother. If the similarity measure is shape, then the arrangement of child-1 is better than that of child-2. However, if the similarity measure is color, then the arrangement by child-2 is better. Hence, it is important to define the similarity measure when performing clustering.

Mathematical Walk-through of an Example Clustering Algorithm

Now we understand in broad terms what clustering is and how we can distinguish a good set of clusters from a bad one. Let's build on this foundation and understand a very popular clustering algorithm — K-Means.

From the previous example we know that we need to group together the points that are more similar. In K-Means, the similarity measure is the distance between the data points: the smaller the distance, the more similar the points, as shown by the equation below —

Equation 1. K-Means Similarity Measure
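For reference, the distance K-Means conventionally uses is the Euclidean distance; for two points $p$ and $q$ in $d$ dimensions:

$$d(p, q) = \sqrt{\sum_{i=1}^{d} (p_i - q_i)^2}$$

so two points are considered most similar when this distance is smallest.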
Figure 5. K-Means Clustering Algorithm

The following animation shows the steps of K-Means in action with K = 2. K stands for the number of clusters we want to separate the data into!

Source: https://www.toptal.com/machine-learning/clustering-algorithms
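To make the steps in figure 5 concrete, here is a minimal from-scratch sketch of the K-Means loop in Python with NumPy. The function and variable names are mine, not from the figure, and it assumes no cluster ever ends up empty:

```python
import numpy as np

def k_means(points, k, n_iters=100, seed=0):
    """Minimal K-Means sketch: pick k random points as centers,
    then alternate between assigning points and recomputing means."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k data points as the initial centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of the points assigned to it.
        new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centers no longer move, i.e. assignments have stabilized.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```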

Let's walk through the following toy example step by step using the algorithm in figure 5. The 10 data points that we want to divide into K=2 clusters are shown in figure 6.

Figure 6. Data points for clustering

Iteration 1

  • Step 1: Let's randomly pick 2 points in this dataset as centers — (1,1) and (-3,2). These points will act as the centers of the two clusters that we want to create.
  • Step 2: Now measure the distance of each point from the 2 centers picked in Step 1 and assign it to the cluster of whichever center it is closest to — i.e. the center with the highest similarity (least distance). In the image below each point can be seen turning yellow or red based on which center it is closest to.
  • Basically, Step 1 picks the cluster centers and Step 2 divides all the points in the data into these two clusters. Splitting the data into K=2 clusters is done on the basis of distance from the centers: if point A is closer to center C1 than to center C2, then point A will belong to Cluster 1 (C1's cluster).
  • The GIF below shows the first iteration. The iteration starts when all points are blue.
First Iteration in Action. Carry on to understand the mathematics behind it!
Figure 7. Centers (1,1) and (-3,2)
  • Step 3: Calculate the new group centers by taking the mean of all points belonging to the same group. Hence the new centers are (1.71, 0.57) and (-3, -1.33). A small NumPy check of Steps 2 and 3 follows below.
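The point and cluster members in this check are made up for illustration; the article's actual 10 points are those in figure 6:

```python
import numpy as np

centers = np.array([[1, 1], [-3, 2]])         # the two initial centers
point = np.array([2, 0])                      # a hypothetical data point
# Step 2: distance from the point to each center; it joins the nearest one.
dists = np.linalg.norm(centers - point, axis=1)
print(dists.argmin())                         # 0 -> it joins the (1, 1) cluster
# Step 3: the new center is the mean of all the points in a cluster, e.g.
cluster = np.array([[1, 1], [2, 0], [2, 1]])  # hypothetical cluster members
print(cluster.mean(axis=0))                   # -> approximately [1.67, 0.67]
```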

Iteration 2

  • Using the new group centers (1.71, 0.57) and (-3, -1.33), again assign the points to the center with the highest similarity (lowest distance).
Figure 8. Centers (1.71, 0.57) and (-3,-1.33)
  • Now repeat Step 2 and Step 3 with the new cluster centers. This results in new group centers (2.33, 1.33) and (-2.75, -2), but yields the same clusters as in the previous step — i.e. even after the iteration the coloring of the points doesn't change, which means each cluster contains the same points before and after the iteration!
Figure 9. Centers (2.33, 1.33) and (-2.75, -2)

Since the cluster assignments have stabilized, i.e. they don't change anymore, we can stop the iterations and report these as the final clusters.

It is important to note that the final clusters depend on the initial centers. Try starting with different centers and see if you land on different clusters than the ones above.
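A quick way to see this sensitivity with scikit-learn (the data below is hypothetical; with init="random" and n_init=1 the outcome depends entirely on the seed):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [2, 0], [2, 1], [3, 2], [1, 0],
              [-3, 2], [-3, -2], [-2, -3], [-4, -1], [2, 2]])  # hypothetical
for seed in (0, 1, 2):
    km = KMeans(n_clusters=2, init="random", n_init=1, random_state=seed).fit(X)
    print(seed, km.labels_)
```

In practice, libraries guard against bad initializations by running K-Means several times from different random starts and keeping the best result (scikit-learn's n_init parameter does exactly this).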

How to choose K?

How many clusters do you think there are?

As you saw in the example we walked through, we had prior knowledge of the number of clusters (K=2) that we wanted our data points to be grouped into. In cases where this prior knowledge is missing, there are multiple methods available to determine the number of clusters.

Some would answer 2 and others would answer 4. How, then, do we find the optimal number of clusters?
  • Visualization: Sometimes you can determine the number of clusters by plotting the data.
  • Domain Knowledge: Knowledge about the application guides the choice of the number of clusters, e.g. spam detection will have K = 2, i.e. Spam and Not Spam.
  • Data Driven Methods: Some methods calculate metrics for different values of K to determine the best choice of K, e.g. the Elbow Method, Silhouette Scores, etc.

The Elbow Method helps us determine the optimal number of clusters for a given dataset. As the name suggests, and as shown in figure 10, the elbow of the graph gives the best number of clusters for a given dataset.

Figure 10. Elbow Method to determine the best number of clusters

The marginal change in the objective function is very small as the number of clusters is increased beyond the elbow point, making the elbow the optimal operating point.

Objective Function: A function that helps us decide on the optimal point, e.g. deciding when to brake is an objective function for F1 race cars. Brake too early and you lag behind; brake too late and you can miss a turn.

To find the best number of clusters for our example in the previous section, we will optimize (minimize, in this case) the within-cluster sum of squares, or WSS, which is the sum of squared distances of all points from their cluster center.
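Written out, with $C_j$ denoting the set of points in cluster $j$ and $\mu_j$ its center:

$$WSS = \sum_{j=1}^{K} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$$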

To do this, we will calculate WSS for k=1 to k=10 and plot a graph similar to the one in figure 10. Let's walk through the calculation for k=2, referring to figure 7(a).

Let's sum the squares of all the dark blue values, which represent the distances of the points from the center of their cluster.

Similarly, we can calculate WSS for the final clusters at the other values of k, which gives us the following plot. It shows that the optimum number of clusters for the given dataset is 3. To find the optimum number of clusters in the elbow plot, draw a line from the point for k=1 to the point for k=10; the point with the largest distance from this line is the elbow point. A sketch of this procedure follows figure 11.

Figure 11. Best number of clusters
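Here is the whole procedure in code: compute WSS for each k (scikit-learn exposes WSS as the inertia_ attribute), then pick the point farthest from the line joining the two ends of the curve. The data X is hypothetical, and both axes are normalized first so that k and WSS are on comparable scales:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(30, 2))   # hypothetical data
ks = np.arange(1, 11)
wss = np.array([KMeans(n_clusters=int(k), n_init=10, random_state=0).fit(X).inertia_
                for k in ks])

# Normalize both axes, then measure each point's distance to the line
# joining the first (k=1) and last (k=10) points of the curve.
x = (ks - ks[0]) / (ks[-1] - ks[0])
y = (wss - wss.min()) / (wss.max() - wss.min())
d = np.abs((x[-1] - x[0]) * (y[0] - y) - (x[0] - x) * (y[-1] - y[0]))
d /= np.hypot(x[-1] - x[0], y[-1] - y[0])
print("elbow at k =", ks[d.argmax()])
```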

Additionally, note that sometimes the graph doesn't have a clear elbow and we see a fairly smooth curve, making it unclear how to choose an optimal value of k. In these cases, we might try either of the following things:
1. Try changing the objective function — for example silhouette scores (see the sketch below)
2. Re-evaluate whether clustering is the right approach for the given dataset.
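For option 1, scikit-learn ships a ready-made silhouette metric (higher is better), so we can pick the k that maximizes it. X is hypothetical data again:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(30, 2))   # hypothetical data
scores = {}
for k in range(2, 11):              # the silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
print("best k =", max(scores, key=scores.get))
```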

Applications of Clustering

Why did we go through all this exercise? How does clustering help us solve real issues?

  1. Search Engines — They group similar objects into one cluster and provide results for a query based on the nearest similar objects, i.e. those clustered around the item being searched for.
  2. Spam Detection — We take the initially labelled spam and genuine data and apply clustering to the entire dataset. The unlabeled spam items will end up in the same clusters as the labelled spam items. Using the labelled spam data we can then label the unlabeled data in the same cluster. Thus clustering helps us separate unlabeled data into spam and genuine; a sketch of this idea follows the list.
  3. Customer Segmentation — Telecom providers can cluster prepaid customers to identify patterns in terms of money spent. This would help the company target specific clusters of customers for specific campaigns.
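As a rough sketch of the label-propagation idea in point 2, assuming scikit-learn and an entirely made-up feature matrix and label set:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(20, 5))           # hypothetical features
known = {0: "spam", 1: "spam", 2: "genuine", 3: "genuine"}  # indices we have labels for

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for c in sorted(set(clusters)):
    members = [i for i in range(len(X)) if clusters[i] == c]
    votes = [known[i] for i in members if i in known]
    # Every unlabeled item inherits the majority label of its cluster.
    label = max(set(votes), key=votes.count) if votes else "unknown"
    print(f"cluster {c}: {len(members)} items labelled {label}")
```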

This article covers various other use cases of clustering. I would highly recommend reading through it to get a better understanding of how clustering can be useful in real-world examples.

X8 aims to organize and build a community for AI that is not only open source but also looks at its ethical and political aspects. More such experiment-driven, simplified AI concepts will follow. If you liked this or have some feedback or follow-up questions, please clap and comment below.

Thanks for Reading!
