What Is Clustering and Common Clustering Algorithms?

Anam Fatima
Published in The Startup
Jan 23, 2021 · 5 min read

A beginner’s guide to Clustering Algorithm

WHAT IS CLUSTERING?

As the name suggests, clustering involves dividing data points into groups, where each group consists of similar data points. In theory, data points in the same group should have similar properties, while data points in different groups should have highly dissimilar properties. Clustering is an unsupervised learning problem: it deals with finding structure in a collection of unlabeled data.

WHY CLUSTERING?

The purpose of clustering is to make sense of, and extract value from, large sets of structured and unstructured data. It helps you glance through the data and pull out patterns and structures before going deeper into the nuts-and-bolts analysis. For example, clustering can be used in medical science, and it is also used for customer classification in marketing research.

COMMON CLUSTERING ALGORITHMS

  1. K-Means Clustering is a clustering algorithm whose main goal is to group similar elements or data points into clusters, where K in K-Means represents the number of clusters. It is by far the most popular clustering algorithm, given that it is easy to understand and apply to a wide range of machine learning problems. It works in three steps (a minimal code sketch follows the animation below):
1) The first step is to choose the number of clusters ‘k’ and define k centroids, one for each cluster. It is important to place the initial centroids as far from each other as possible so that variation is low. K-Means++ is the standard initialization algorithm, and it generally works better than Forgy initialization and random partition initialization.
2) After all the centroids are defined, each point of the given data set is associated with its nearest centroid. Once all data points are assigned to their respective clusters, the central point (new centroid) of each cluster is calculated, and the originally allocated centroid is repositioned to the actual centroid of its cluster.
3) All the data points are then reassigned to the newly defined centroids. This process is repeated until the centroids stop moving from their positions.
Animation of K-Means Clustering
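To make these steps concrete, here is a minimal sketch using scikit-learn’s KMeans with the k-means++ initialization mentioned above; the toy blob data, k=3, and the random seeds are assumptions made purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: three loose blobs (assumed for illustration).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# init="k-means++" is the initialization strategy discussed in step 1.
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # cluster index assigned to each point
print(kmeans.cluster_centers_)   # final centroid positions after convergence
```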

2. Hierarchical Clustering bridges a gap left by K-Means: it takes away the problem of having to predefine the number of clusters at the beginning of the model. Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters. There are mainly two types of hierarchical clustering:

  1. Agglomerative Clustering begins with each data point as a separate cluster; at each iteration we merge the closest pair of clusters and repeat this step until only a single cluster is left. Basically, it uses a bottom-up approach. In hierarchical clustering we have a concept called the proximity matrix, which stores the distance between every pair of points. There are multiple distance metrics that can be used for deciding the closeness of two clusters (a small sketch of both follows this list):
  • Euclidean distance: distance((x, y), (a, b)) = √((x - a)² + (y - b)²)
  • Manhattan distance: distance((x1, y1), (x2, y2)) = |x1 - x2| + |y1 - y2|
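Both distance metrics are easy to compute directly; here is a tiny sketch in NumPy, with the two example points chosen arbitrarily:

```python
import numpy as np

p = np.array([1.0, 2.0])   # (x1, y1), assumed example point
q = np.array([4.0, 6.0])   # (x2, y2), assumed example point

euclidean = np.sqrt(np.sum((p - q) ** 2))   # sqrt((x1-x2)^2 + (y1-y2)^2) -> 5.0
manhattan = np.sum(np.abs(p - q))           # |x1-x2| + |y1-y2|           -> 7.0
print(euclidean, manhattan)
```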

2. Divisive Clustering, or the top-down approach, works in the opposite way. Instead of starting with n clusters (in the case of n observations), it starts with a single cluster and assigns all the points to that cluster. Therefore, it doesn’t matter whether we have 10 or 1,000 data points; all of them belong to the same cluster at the beginning. At each iteration, we then split off the farthest point in the cluster and repeat the process until each cluster contains only a single data point.
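scikit-learn does not ship a textbook divisive algorithm, but BisectingKMeans (available in scikit-learn 1.1 and later) follows the same top-down idea of starting from one cluster and repeatedly splitting; this is a hedged sketch with assumed toy data, not the exact procedure described above:

```python
import numpy as np
from sklearn.cluster import BisectingKMeans  # requires scikit-learn >= 1.1

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # assumed toy data

# Start from a single cluster and keep bisecting until n_clusters is reached.
model = BisectingKMeans(n_clusters=4, random_state=0)
labels = model.fit_predict(X)
print(np.bincount(labels))  # number of points in each of the 4 clusters
```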

The above image shows both types of hierarchical clustering

A dendrogram is another concept in hierarchical clustering: it is a tree-like diagram that records the sequence of merges or splits. Whenever two clusters or data points are merged, we join them in the dendrogram, and the height of the join is the distance between these points. We can then set a threshold distance and draw a horizontal line; the number of clusters is the number of vertical lines intersected by that threshold line (a small code sketch follows the figure below).

The image above shows the number of clusters obtained using a dendrogram
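Here is a minimal sketch of agglomerative clustering plus a dendrogram, using SciPy’s linkage and a distance threshold; the toy data, the Ward linkage choice, and the cut height of 5 are assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal((0, 0), 0.5, (30, 2)),
               rng.normal((6, 6), 0.5, (30, 2))])  # two assumed toy blobs

Z = linkage(X, method="ward")   # records the sequence of merges, bottom-up
dendrogram(Z)                   # draws the tree (rendering needs matplotlib)
labels = fcluster(Z, t=5, criterion="distance")  # cut the tree at height 5
print(np.unique(labels))        # cluster ids produced by that threshold
```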

3. Density-Based Spatial Clustering of Applications with Noise, or DBSCAN, not only clusters the data points but also detects noise in the dataset. It groups densely packed data points together, and its most important feature is that it is robust to outliers. The DBSCAN clustering algorithm is used to find associations and structures in data that are hard to find manually.

Now, the first question that comes to mind is: why do we need DBSCAN clustering? Answer: K-Means and hierarchical clustering both fail at creating clusters of arbitrary shapes, and they are not able to form clusters based on varying densities.

Difference between DBSCAN and K-Means Clustering

DBSCAN clustering requires only two parameters:

  • eps: the radius of the circle created around each data point to check the density.
  • minPoints: the minimum number of data points required inside that circle for the point to be considered a core point.

DBSCAN clustering consists of creating a circle of eps radius around every data point and classifying each point as a Core point, Border point, or Noise. A data point is a Core point if the circle around it contains at least ‘minPoints’ points. If the circle contains fewer than ‘minPoints’ points but the point lies inside the eps circle of a Core point, it is classified as a Border point. A point that is neither a Core point nor a Border point is treated as Noise.
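A minimal sketch using scikit-learn’s DBSCAN, where min_samples plays the role of minPoints; the moons dataset, eps=0.2, and min_samples=5 are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: an arbitrary shape K-Means struggles with (assumed toy data).
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                                 # label -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)

# Core points are exposed directly; points with a non-negative label
# that are not core points are the border points.
core_mask = np.zeros_like(labels, dtype=bool)
core_mask[db.core_sample_indices_] = True
```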

That’s It!

There is still a lot more to cover here, but I promise to add more useful concepts.

This is my first article and I hope this blog post has been useful! If there is anything that you guys would like to add to this article, feel free to leave a message and don’t hesitate! Any sort of feedback is truly appreciated.
