Your Guide to Unsupervised Machine Learning — Clustering

Harshit Yadav · Published in CodeX · 10 min read · Feb 16, 2022

Building Intuition and Logic for Clustering along with Visualisation of K-Means algorithm on a dataset


Introduction

In a previous blog (Getting Familiar to the World of Machine Learning), we learned about Unsupervised Machine Learning. Unlike Supervised Machine Learning, it does not make use of labelled data; instead, we identify data points in relation to other data points. If you need more clarification on the topic, try reading the “Unsupervised Machine Learning” section of that blog.

Unsupervised Machine Learning

Unsupervised Machine Learning is used when we need to find underlying hidden patterns in an “unlabelled dataset”. It is further classified into two sections:

  1. Clustering — Grouping similar data points to form different clusters.
  2. Association — Finding important relations between data points.

In this blog, we will learn how clustering works and what methodologies we can use to perform clustering over unlabelled data.

Clustering

Suppose you are presented with a basket full of flowers, and the basket can contain any number of different categories of flowers. For example, the basket may contain 5 flowers of type-1, 3 flowers of type-2, etc. The catch here is that you are blindfolded, so you cannot see and distinguish between categories of flowers. Now, how would you differentiate one type of flower from the others?

Since you cannot take the blindfold off, one way to distinguish between different types of flowers is to pick one flower and place it outside the basket, then pick another flower and feel whether it is similar to the first one. If it is similar, keep them together; if it feels different, keep it separate. Then pick a third one and feel it against the other categories, and so on.

When you finish, you will end up with different groups of flowers. This is what clustering is: you compare one data point with other data points and form clusters.

Top clustering-based algorithms

There are different methodologies for performing clustering over unlabelled data. The method you used to segregate different categories of flowers with the blindfold on is one of them. Similarly, there can be many ways to form clusters. In machine learning, these methodologies are known as algorithms, and there are several algorithms for clustering, like:

  1. K-Means Clustering
  2. Mean Shift clustering
  3. DBSCAN
  4. EM using GMM
  5. Agglomerative Hierarchical Clustering
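
(As an aside, all five of these algorithms ship with scikit-learn. The snippet below is a minimal sketch of how they are instantiated; the parameter values shown are illustrative, not recommendations.)

    from sklearn.cluster import KMeans, MeanShift, DBSCAN, AgglomerativeClustering
    from sklearn.mixture import GaussianMixture  # EM using a Gaussian Mixture Model

    kmeans = KMeans(n_clusters=3)                          # K-Means Clustering
    mean_shift = MeanShift()                               # Mean Shift clustering
    dbscan = DBSCAN(eps=0.5, min_samples=5)                # DBSCAN
    gmm = GaussianMixture(n_components=3)                  # EM using GMM
    agglomerative = AgglomerativeClustering(n_clusters=3)  # Agglomerative Hierarchical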

In this blog, we will be covering the K-Means Clustering algorithm.

K-Means Clustering

Suppose you have a dataset that contains features of different categories of flowers, but you do not know how many categories there are or which data point belongs to which category. To find the categories, we perform K-Means clustering over the data to learn more from it.

Each data point represents a flower with given PetalLength and PetalWidth details, and each data point might belong to some category of flower that we do not know about.

(Image by author)

Look at the image below. This is the petalWidth vs petalLength plot of the flower dataset.

petalLength vs petalWidth (image by author)

In this plot, each blue dot represents a flower based on two features: petalWidth and petalLength. We do not yet know how many types of flowers there are, as all the data points look the same, just with different combinations of petal length and petal width.
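
A plot like this takes only a few lines of matplotlib. This is a hypothetical sketch: the file name flowers.csv and its column names PetalLength and PetalWidth are assumed for illustration.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical dataset with PetalLength and PetalWidth columns
    df = pd.read_csv("flowers.csv")

    plt.scatter(df["PetalLength"], df["PetalWidth"], color="blue")
    plt.xlabel("petalLength")
    plt.ylabel("petalWidth")
    plt.show()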

Try finding out which data point belongs to which category. How would you do that?

The first thing you would want to start with is answering how many categories this data contains, because without knowing how many categories there are, how would you put the data into categories?

Now suppose I tell you that there are 2 categories this data represents, say A and B, and I want you to find which data points can be clubbed together under category A and which fall under category B. To answer that, we will need to find the clusters and the exact centre point of each cluster, around which its data points are identified. This centre of the cluster is called the centroid. (You will understand this better as you move further through this blog.)

So, in K-Means clustering, we are focused on finding two main things:

  1. How many clusters are there in the data — There are some methodologies for finding the correct number of clusters in data, like the Elbow and Silhouette methods. We will get to that part later in this blog.
  2. Where the centroids of the clusters lie — Suppose we find out that there are x categories in our data. The next thing we want to find is where the centres of all those x clusters lie and which data points fall under which cluster.

We will be explaining the second one first. We will learn about the centroid of the clusters, and then we will learn how to find the correct number of clusters.

Learning to form clusters of our data

From the image above, we may assume that there are two clusters. However, this might be wrong, and we can't simply go by a guess. We will learn later how to be mathematically sure.

For the time being, let’s assume that there are originally 2 clusters in our data which means that all our data points belong to one of these two clusters.

Step 1

We randomly pick two data points and assume them to be at the centre of the clusters we are looking for. It would look something like this-

(Image by author)

In this, we assume our centroids to be at [0.42372, 0.375] and [0.8305, 0.833]. Next, we need to find which of the data points belong to which of the two clusters; we will learn that in step 2.
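
In code, this random initialisation might look like the sketch below; the array X stands in for our [petalLength, petalWidth] data points, with a few made-up rows for illustration.

    import numpy as np

    # A few made-up [petalLength, petalWidth] points standing in for the dataset
    X = np.array([[0.42, 0.38], [0.83, 0.83], [0.10, 0.15], [0.90, 0.78]])

    # Step 1: pick two random data points and treat them as the initial centroids
    rng = np.random.default_rng(seed=0)
    centroids = X[rng.choice(len(X), size=2, replace=False)]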

Step 2

To find which cluster a data point belongs to, we find the distance between the data point and each of the two cluster centroids; the centroid that gives the minimum distance marks the cluster our data point belongs to. See the image and explanation below:

(Image by author)

Let us take the point marked yellow. We need to find which cluster this point belongs to. To do that, we find the distance between the point and each cluster centre, and the centre with the minimum distance is the one our data point belongs to. Doing this for all the data points tells us which data point belongs to which cluster. It will look something like this:

(Image by author)
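
A minimal sketch of this assignment step, assuming X is an (n_points, 2) NumPy array and centroids is a (k, 2) array as in the previous sketch:

    import numpy as np

    def assign_clusters(X, centroids):
        # Distance from every point to every centroid, shape (n_points, k)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Index of the nearest centroid for each point
        return distances.argmin(axis=1)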

Now we have identified which data point belongs to which cluster. But as we merely guessed the centroids of our two clusters, it may not be the case that we guessed them right. Let's learn how to find the correct locations of the centroids of our clusters in step 3.

Step 3

Now, look at the image just above. Do you think that our red centroid is at the right place with respect to the other points?

The centroid should be at the centre of all the data points present in a cluster. If we find the average of all the data points in a cluster, we get a new location. That location is marked with an “X” in the image below:

(Image by author)

You see, the average location of all the red points is marked with a green cross, and the average location of all the blue points is marked with a yellow cross. These are the points at the centre of our clustered data and a better location to place our centroids. If we shift our centroids, it will look something like this:

(Image by author)
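
The centroid update can be sketched the same way: for each cluster, average the points assigned to it (this simple version assumes no cluster ends up empty):

    import numpy as np

    def update_centroids(X, labels, k):
        # New centroid of each cluster = mean of the points assigned to it
        # (assumes every cluster has at least one point)
        return np.array([X[labels == j].mean(axis=0) for j in range(k)])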

Step 4

But as we shifted our centroids, it may not be the case that all the red points are still closest to the red centroid. We will again need to find which of the centroids each data point is now closest to. It will look something like this:

(Image by author)

Now we again follow from the second step with these new centroids, and we keep doing this until our centroids stop changing position. Once this happens, we can confidently say that we have found the centroids, and the data points closest to each centroid fall under that cluster. This whole process can be understood visually by following the images below:

(Image by author)

We will get the final result as:

(Image by author)

We can see that the new average is the same as the centroid, and it won't change with further iterations; it is fixed. This is the required centroid, and we can visually see that it successfully matches our assumption of two clusters. But one question still remains:

Are there really two clusters in the image? This question is still unanswered; we will get to it in a while.

First, let's see the above process in the form of a GIF:

(GIF by author)
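
Putting the pieces together, the whole iterative process might look like the following sketch, reusing the hypothetical assign_clusters and update_centroids helpers from the earlier sketches:

    import numpy as np

    def k_means(X, k, seed=0, max_iter=100):
        # Step 1: random data points as the initial centroids
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            labels = assign_clusters(X, centroids)          # step 2: assign points
            new_centroids = update_centroids(X, labels, k)  # step 3: move centroids
            if np.allclose(new_centroids, centroids):       # centroids stopped moving
                break
            centroids = new_centroids                       # step 4: repeat
        return centroids, labels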

We have two clusters now, and we know which data point belongs to which cluster. Now suppose we want to understand how our data is spread around the centroids. The best way to do this is to sum the squared distances of all the data points to their cluster centroid; for our data, this value comes out to be 5.179687509974783.

This number is also termed “inertia”. Inertia is calculated on the best positioning of the cluster centroids; that is, we calculate it on the centroids found after the iterations have converged, when we are sure about the centroids we got.
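
Computed on the final centroids, inertia is just a sum of squared distances; a minimal sketch:

    import numpy as np

    def inertia(X, labels, centroids):
        # Sum of squared distances from each point to its own cluster's centroid
        return ((X - centroids[labels]) ** 2).sum()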

Learning to find the “K” in the K-Means:

While learning to find the cluster centroids for our data, we assumed that there are 2 clusters, but this may be wrong, as we do not have any mathematical backing for that guess. Let's learn how to find the correct number of clusters.

We start by making guesses about the number of clusters, or categories, present in our dataset. Suppose we guess that the number of clusters in our data lies somewhere in the range 1 to 15. Now, how do we pick the best guess?

We pick the best guess by finding the placement of centroids that gives the minimum inertia for different values of K (read ahead to get the hang of it).

Let's understand this visually:

Suppose we assume that we have only one category; then there must be only one cluster. As every cluster has a centroid, this cluster must also have one. Now, in order to check whether k=1 is an optimal choice, we find the optimal centroid of the cluster using the strategy above and reach this result:

(Image by author)

The inertia is the sum of squared distances of the data points in a cluster to the cluster's centroid. The inertia for k=1 comes out to be 28.391514358368717. While learning about the centroid locations, we already found the inertia for k=2, which came out to be 5.179687509974783, and for k=3 it comes out to be 1.7050986081225123. As we increase k, our inertia keeps decreasing.

It only seems logical that once our k becomes equal to the number of data points we have, the inertia will reach zero, as every point will be its own centroid and there will be no error. (Think about this point for a while; it can be a brain scratcher.)

So if we plot inertia vs K (the number of clusters), it will look something like this:

(Image by author)
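
Such an elbow plot can be sketched with scikit-learn's KMeans, whose inertia_ attribute stores exactly this value (X is assumed to be the two-column array of petal measurements):

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    ks = range(1, 16)
    inertias = []
    for k in ks:
        # Fit K-Means for each candidate k and record its inertia
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        inertias.append(model.inertia_)

    plt.plot(ks, inertias, marker="o")
    plt.xlabel("K (no. of clusters)")
    plt.ylabel("Inertia")
    plt.show()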

It comes out as expected. Notice how the curve shifts from a steep, exponential-looking drop to a nearly flat, linear one; the point where this shift happens is called the ELBOW point, and it gives the number of clusters present in our data. This means our assumption that the data contains 2 categories was wrong; the correct number of clusters is 3. Let's visualise where our cluster centroids lie:

(Image by author)
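
In practice, the final clustering with the elbow-selected k can be read straight off a fitted scikit-learn model; a short sketch (X assumed as before):

    from sklearn.cluster import KMeans

    # Fit with the elbow-selected number of clusters
    model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(model.cluster_centers_)  # the three centroids
    print(model.labels_)           # cluster index assigned to each data point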

Conclusion

In this blog, we saw how the K-Means clustering algorithm works on a dataset, what should be taken care of while choosing the right number of clusters using the elbow plot, and how the K-Means algorithm finds the centroids of the clusters it forms.

Unsupervised Machine Learning is used to find out how unlabelled data points are related to other data points and how clusters are formed using that information. The main things to find in a clustering algorithm are the cluster centroids and the number of clusters there actually are, which can be found in different ways, one of which, the elbow plot, is explained above.

There are numerous applications of Unsupervised Machine learning like fraud detection, customer segmentation, feature extraction etc.

In previous blogs, we covered Supervised Machine Learning, and this blog explained an Unsupervised Machine Learning algorithm. The next blog will introduce you to the important parameters we need to take care of while working on a Machine Learning model.

Previous Blog — Your Guide to Supervised Machine Learning-Classification


Harshit Yadav
Writer for CodeX

A student of IIT-Roorkee enthusiastic about Blockchain, DeFi, Machine Learning and Artificial Intelligence; learning, implementing and sharing the knowledge.