Intuition Behind Clustering in Unsupervised Machine Learning

What Is Clustering In Unsupervised Machine Learning? How Does It Work?


When it comes to analyzing and making sense of past data, and understanding the world ahead based on that data, we rely on machine learning methodologies. As I have discussed in my past articles on machine learning fundamentals, this field is broadly categorized into:

  • Supervised Machine Learning
  • Unsupervised Machine Learning

To understand supervised ML, please visit my earlier article on the topic.

Clustering: The World Of Unsupervised Machine Learning

Today, we will dig deeper into the world of unsupervised learning. To help you grasp the concept, let me take the example of e-commerce portals like Flipkart, Amazon, etc.

Do you know how these e-commerce giants, which you use every day, manage to segment a huge list of products into various categories, with an intelligence that customizes your browsing experience based on how you navigate their portal?

This tailor-made intelligence for categorizing products is made possible by one of the popular unsupervised learning techniques, called clustering, where they group customers based on their behavior and try to make sense of the data points generated by those user segments, to offer tailor-made services.

So, some of the popular examples are:

  • Market segmentation
  • Product segmentation
  • User segmentation
  • Organizing system files into groups of folders
  • Organizing emails into different folder categories

Why is it called unsupervised?

Because in this branch of machine learning, the dataset provided to train the ML models doesn't have any pre-defined set of labels or outcomes defined within the data. The model itself has to group the given people, products, or data points into clusters.

For example:

Consider a problem where you are given past data from a bank, containing a list of user attributes along with one target column that labels each user as:

  • Defaulter
  • Non-Defaulter

Now, a model trained on this data with a known target, namely predicting whether any new user entering the loan disbursal system will default or not, is a supervised machine learning model.

But what if you had data with no such target column available, and your model had to group the customers into sets of likely defaulters and non-defaulters? When your model is trained to perform this kind of segmentation, it is known as an unsupervised learning model.
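To make the contrast concrete, here is a minimal Python sketch. The column names (`income`, `loan_amount`, `defaulted`) and the tiny dataset are hypothetical, purely for illustration: the supervised model fits against a known target column, while the unsupervised model receives only the features and must discover the groups on its own.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Hypothetical bank data: two user attributes plus a target column.
data = pd.DataFrame({
    "income":      [25_000, 48_000, 31_000, 90_000, 22_000, 75_000],
    "loan_amount": [20_000, 10_000, 28_000, 15_000, 30_000, 12_000],
    "defaulted":   [1, 0, 1, 0, 1, 0],   # known label -> supervised
})

# Supervised: features X plus a known target y.
X = data[["income", "loan_amount"]]
y = data["defaulted"]
clf = LogisticRegression().fit(X, y)

# Unsupervised: the same features, but no target column at all.
# The model must group the users into two clusters by itself.
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(km.labels_)  # cluster assignment per user, e.g. [0 1 0 1 0 1]
```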

So, with this basic understanding of unsupervised learning, it's time to get into the fundamentals of clustering. Here we will cover:

  • What Is Clustering In Unsupervised ML?
  • What Are The Types Of Clustering?
  • What Is K-Means Clustering?

What Is Clustering?

It is a mechanism of grouping the given data to create segments based on the concept of similarity among the data points. The intuition behind the concept of similarity comes from the notion of distance.

What Is A Cluster?

It is a collection of data objects which are similar.

So, it is important here to understand the two highlighted words in the definitions above:

  • Similarity
  • Distance

The Concept Of Similarity In Clustering:

In cluster analysis, we stress the concept of data-point similarity, where similarity is a measure of the distance between the given data points.

These distances, which measure how close the given data points are, are used to infer how similar those data points are. Some of the popular distance measures are:

  • Manhattan Distance
  • Euclidean Distance
  • Chebyshev Distance
  • Minkowski Distance

Euclidean Distance:

This is probably the most common measure of distance, one we are all familiar with from the data science and mathematical worlds.

As per wiki,

In the field of mathematics, the Euclidean distance or Euclidean metric is the “ordinary” straight-line distance between two points in Euclidean space.

The Euclidean distance between points X and Y is the length of the line segment connecting them. In Cartesian coordinates, the Euclidean distance d from X to Y (or from Y to X) is given by the Pythagorean formula:

d(X, Y) = √((x₁ − y₁)² + (x₂ − y₂)²)

Euclidean Distance: 2-Dimension, 3-Dimension & N-Dimension:

Euclidean distance, as discussed, uses the popular Pythagorean theorem to calculate the distance between the given vectors/points in n-dimensional space.

Below are the formulas for 2-, 3-, and n-dimensional space:

2-D: d(x, y) = √((x₁ − y₁)² + (x₂ − y₂)²)

3-D: d(x, y) = √((x₁ − y₁)² + (x₂ − y₂)² + (x₃ − y₃)²)

n-D: d(x, y) = √(Σᵢ₌₁ⁿ (xᵢ − yᵢ)²)
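As a quick sanity check, here is a minimal NumPy sketch of the n-dimensional formula (the sample points are made up for illustration):

```python
import numpy as np

def euclidean(x, y):
    """n-dimensional Euclidean distance: square root of the sum of squared differences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

print(euclidean([0, 0], [3, 4]))        # 2-D: 5.0 (the classic 3-4-5 triangle)
print(euclidean([1, 2, 3], [4, 6, 3]))  # 3-D: 5.0
```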

Manhattan Distance:

Unlike Euclidean distance, where we took the square root of the sum of squared coordinate differences, here the distance between two points is the sum of the absolute differences of their Cartesian coordinates.

This distance metric is also known as snake distance, city block distance, or Manhattan length. The names take inspiration from the grid layout of most streets on the island of Manhattan, which causes the shortest path a car can take between two intersections in the borough to have a length equal to the intersections' distance in taxicab geometry.

Manhattan distance, also called taxicab distance, is defined by the formula:

d(x, y) = Σᵢ₌₁ⁿ |xᵢ − yᵢ|
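A minimal NumPy sketch of the same formula (sample points again made up for illustration):

```python
import numpy as np

def manhattan(x, y):
    """Manhattan (taxicab) distance: sum of absolute coordinate differences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y))

print(manhattan([0, 0], [3, 4]))  # 7.0, since a taxi must drive 3 blocks + 4 blocks
```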

Chebyshev Distance:

Also popularly called chessboard distance.

It is nothing but the maximum of the absolute coordinate differences, i.e. the largest single-axis distance.

As per wiki,

In mathematics, Chebyshev distance (or Tchebychev distance), maximum metric, is a metric defined on a vector space where the distance between two vectors is the greatest of their differences along any coordinate dimension. It is named after Pafnuty Chebyshev.

It is also known as chessboard distance, since in the game of chess the minimum number of moves needed by a king to go from one square on a chessboard to another equals the Chebyshev distance between the centers of the squares, if the squares have side length one, as represented in 2-D spatial coordinates with axes aligned to the edges of the board.

So, for two vectors or points x and y, with standard coordinates xᵢ and yᵢ respectively, the Chebyshev distance is:

d(x, y) = maxᵢ |xᵢ − yᵢ|

For the 2-dimensional plane, this becomes d(x, y) = max(|x₁ − y₁|, |x₂ − y₂|).
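And the corresponding NumPy sketch (with made-up sample points):

```python
import numpy as np

def chebyshev(x, y):
    """Chebyshev (chessboard) distance: the largest absolute coordinate difference."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.max(np.abs(x - y))

print(chebyshev([0, 0], [3, 4]))  # 4.0, as a king needs max(3, 4) = 4 moves
```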

So, now that we have understood the fundamentals of similarity based on distance measures, it's time to learn the types of clustering and how they make use of the distance metrics discussed above to cluster the given data vectors or objects.

Types Of Clustering In Unsupervised Learning:

There are basically two major categories of clustering in the field of unsupervised learning:

  • Connectivity-based clustering: also known as hierarchical clustering
  • Centroid-based clustering: K-Means being the most popular kind

Connectivity-Based Clustering:

For a tabular dataframe with N rows, if we calculate the distance between every pair of objects (rows) to find which of them are closely related or similar, so they can be clustered together, we call this expensive mechanism connectivity-based clustering. The intuition behind this exhaustive approach is:

Objects are more related to nearby objects than to objects which are farther away.

When the dataset is not very large, this kind of clustering is very effective, but if the dataset is too big, it can be really resource intensive. For example, a dataset with 1,000 rows yields 1000 × 999 / 2 ≈ half a million pairs to be analysed for similarity, which can be extremely costly to process. Imagine if the number of rows becomes 10,000: that is nearly 50 million pairs.

So, to sum up:

These connectivity-based algorithms connect "objects" to form "clusters" based on their distance. A cluster can be described largely by the maximum distance needed to connect its parts. At different distances, different clusters will form, which can be represented using a dendrogram; this is where the common name "hierarchical clustering" comes from. These algorithms do not provide a single partitioning of the data set; instead, they provide an extensive hierarchy of clusters that merge with each other at certain distances.

I have covered hierarchical/connectivity-based clustering in detail in one of my articles linked below; do take some time to understand it in more depth. Meanwhile, the sketch below shows the idea in code.
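A minimal sketch of connectivity-based (hierarchical) clustering using SciPy; the six 2-D points are made up for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six made-up 2-D points forming two loose groups.
points = np.array([[1, 1], [1.5, 1], [1, 1.5],
                   [8, 8], [8.5, 8], [8, 8.5]])

# Agglomerative clustering: every pair of points is compared,
# and the closest clusters are merged step by step (the dendrogram).
Z = linkage(points, method="single", metric="euclidean")

# Cut the hierarchy into 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]
```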

Centroid-Based Clustering:

Unlike hierarchical/connectivity-based clustering, centroid-based clustering organizes the data into non-hierarchical clusters.

Intuition Behind Centroid-Based Clustering:

Here, the number of clusters is pre-defined at the outset. So, instead of visiting each and every pair of objects across the n rows to calculate distances, this algorithm requires you to define the number of clusters you want to obtain; based on that, the centroids of those clusters are identified, and the distance of each data point is calculated with respect to those centroids.

This algorithm is very cheap compared to hierarchical clustering, which can be understood with an example: if you have 1,000 rows and 5 clusters defined at the outset, the algorithm has to process only 5 × 1000 = 5,000 distance computations per pass, versus roughly half a million pairs in the case of the connectivity-based clustering algorithm. The sketch below illustrates this.
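Here is a minimal sketch of centroid-based clustering using scikit-learn's KMeans; the data is randomly generated around five hidden centers, purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: 1000 points scattered around 5 hidden centers.
rng = np.random.default_rng(42)
centers = rng.uniform(-10, 10, size=(5, 2))
X = np.vstack([c + rng.normal(0, 1, size=(200, 2)) for c in centers])

# Pre-define 5 clusters; each iteration measures every point only
# against the 5 centroids (5 x 1000 distances), not every pair of points.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)  # the 5 learned centroids
print(km.labels_[:10])      # cluster assignment of the first 10 points
```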

How do we decide the number of clusters?

We will answer this question properly when we uncover K-Means clustering, but as food for thought, it is related to a popular method known as the Elbow Method, previewed just below.
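As a quick preview, here is a hedged sketch of the Elbow Method: fit K-Means for several values of k and watch where the within-cluster sum of squares (inertia) stops dropping sharply. The data generation mirrors the made-up example above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data around 5 hidden centers, as in the previous sketch.
rng = np.random.default_rng(42)
centers = rng.uniform(-10, 10, size=(5, 2))
X = np.vstack([c + rng.normal(0, 1, size=(200, 2)) for c in centers])

# Fit K-Means for k = 1..9 and record the inertia
# (within-cluster sum of squared distances to each centroid).
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# The "elbow" is the k where inertia stops dropping sharply;
# for this data we would expect it around k = 5.
for k, inertia in zip(range(1, 10), inertias):
    print(k, round(inertia, 1))
```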

K-Means Clustering:

K-means is the most widely used centroid-based clustering algorithm. Centroid-based algorithms are efficient but sensitive to initial conditions and outliers. We will get into the details of K-Means clustering in the next part of this series on unsupervised learning, where we will cover:

  • What Is K-Means Clustering?
  • How does it work?
  • Implementing the k-means clustering algorithm in a hands-on Python lab
