What is Clustering?

Introduction: Ever thought of arranging data by similar features when the data has no labels, classes, or targets? This article gives an overview of clustering: what it is, its main types, and where it is used in practice.

Harshit Dawar
Analytics Vidhya
May 22, 2020

Source: heyerlein via unsplash

Clustering is a family of machine learning techniques that falls under unsupervised learning.

Unsupervised Machine Learning

Unsupervised algorithms are those for which the data contains only features; labels, classes, or targets are not available.

In other words, for any given record it is not known in advance which group it belongs to. This is what defines this category of machine learning algorithms.

Clustering Explanation

Since clustering is an unsupervised technique, there are no target classes for the data.

Clustering algorithms aim to divide the data into a few groups (clusters), based on the similarity between records or on patterns found in the data.

In a nutshell, clustering produces several clusters such that records within the same cluster are as similar as possible, and records in different clusters are as dissimilar as possible.

It follows that the ultimate goal of clustering is to minimize the intra-cluster distance and maximize the inter-cluster distance.
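To make the two distances concrete, here is a minimal sketch on hypothetical toy data. It assumes Euclidean distance, measures intra-cluster distance as the average pairwise distance inside a cluster, and measures inter-cluster distance as the distance between centroids. These are only one common choice of definitions; the function names and the data are made up for illustration.

```python
from itertools import combinations
from math import dist

# Hypothetical toy data: two hand-assigned clusters of 2-D points.
clusters = {
    "a": [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    "b": [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)],
}

def centroid(points):
    """Mean of each coordinate."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def intra_cluster_distance(points):
    """Average pairwise distance between the points of one cluster."""
    pairs = list(combinations(points, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def inter_cluster_distance(points_a, points_b):
    """Distance between the centroids of two clusters."""
    return dist(centroid(points_a), centroid(points_b))

intra_a = intra_cluster_distance(clusters["a"])
intra_b = intra_cluster_distance(clusters["b"])
inter_ab = inter_cluster_distance(clusters["a"], clusters["b"])
print(f"intra a: {intra_a:.2f}, intra b: {intra_b:.2f}, inter: {inter_ab:.2f}")
```

A good clustering of this data keeps both intra-cluster values small compared with the inter-cluster value.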

Types of Clustering

  1. Partition-based Clustering
  2. Hierarchical Clustering
  3. Density-based Clustering

Partition-based Clustering

  • These algorithms generate roughly spherical clusters.
  • They are relatively efficient.
  • They are used for medium or large datasets.
  • Examples: K-Means, Fuzzy C-Means, K-Medians.
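As an illustration of the partition-based idea, below is a minimal sketch of K-Means (Lloyd's algorithm) in plain Python. It assumes Euclidean distance and, to stay deterministic, uses the first k points as the initial centroids; real implementations usually pick the starting centroids more carefully. The function name k_means and the sample points are hypothetical.

```python
from math import dist

def k_means(points, k, max_iter=100):
    """Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

# Hypothetical data: two well-separated groups, interleaved.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Because every point is assigned to its nearest centroid, the resulting clusters tend to be compact and roughly spherical, which matches the first bullet above.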

Hierarchical Clustering

  • These algorithms build a tree of clusters, grouping similar data at each level.
  • They are very intuitive.
  • They are generally a good fit for small datasets.
  • Examples: Agglomerative, Divisive.
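The agglomerative (bottom-up) variant can be sketched as follows. This is a simplified illustration, assuming Euclidean distance and single linkage (the distance between two clusters is the distance between their closest points); other linkage rules exist. The function name agglomerative and the sample points are hypothetical, and the brute-force search is only practical for small datasets.

```python
from math import dist

def agglomerative(points, n_clusters):
    """Bottom-up clustering: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point
    merges = []  # record of merges, i.e. the tree built so far

    def single_linkage(a, b):
        # Distance between the closest pair of points across two clusters.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        x, y = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        merges.append((clusters[x], clusters[y]))
        merged = clusters[x] + clusters[y]
        clusters = [c for i, c in enumerate(clusters) if i not in (x, y)] + [merged]
    return clusters, merges

# Hypothetical data: two close pairs and one isolated point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = agglomerative(points, n_clusters=3)
print(clusters)  # each cluster is a list of indices into points
```

The merges list records which clusters were joined at each step, which is the information a dendrogram visualizes.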

Density-based Clustering

  • They produce clusters of arbitrary shape.
  • They work well when the dataset contains noise, because points in low-density regions can be treated as outliers rather than forced into a cluster.
  • Example: DBSCAN.
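A compact sketch of the DBSCAN idea is shown below, assuming Euclidean distance and a brute-force neighbour search. The parameter names eps (neighbourhood radius) and min_pts (minimum number of neighbours for a core point) follow common usage, while the function name, the NOISE constant, and the sample data are hypothetical.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or NOISE (-1) for outliers."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbours(i):
        # All points within eps of point i, including i itself.
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point, so keep expanding
        cluster_id += 1
    return labels

# Hypothetical data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Clusters grow by following chains of dense neighbourhoods, so they can take any shape, and a point with too few neighbours is left labelled as noise instead of being attached to a cluster.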

Use Cases of Clustering

Retail/Marketing:

  • Identifying buying patterns of customers.
  • Recommending new movies to customers.
  • Recommending new gadgets to customers, etc.

Banking:

  • Identifying segments of customers (e.g., loyal, likely to churn).
  • Fraud detection.

Insurance:

  • Fraud detection in claim analysis.

Publication:

  • Automatically categorizing news based on its content.
  • Recommending similar news articles.
  • Identifying segments of readers (e.g., loyal, likely to churn).

Medicine:

  • Characterising patient behaviour to study the effect of a medicine.
  • Identifying similar drugs by clustering them.

Biology:

  • Clustering genetic markers to identify family ties or generations.
  • Identifying a particular species.

Clustering VS Classification

The most significant difference between the two is that in classification every record has a corresponding label, whereas in clustering there are no labels at all.

I hope this article has explained the essentials of clustering along with its use cases. Thank you for investing your time in my articles!
