Understand clustering in data science without mathematics

shubham badaya
Dec 31, 2023

Photo by Mel Poole on Unsplash

We often wonder what clustering really is, how it differs from classification, and how it is generally done. This article answers those questions for beginners. For in-depth mathematical details, I recommend consulting textbooks; the aim here is an intuitive understanding of clustering.

In this article, we will try to understand the following:

  1. What is clustering?
  2. How is it different from classification?
  3. How is clustering done?

What is Clustering?

Clustering is a technique used in data analysis to group similar objects or data points based on certain characteristics or attributes.

For example, suppose you have a set of marbles of various colors: blue, green, red, and yellow. You want to cluster them based on color. So, green ones will form Group 1, red ones will be in Group 2, yellow ones will be in Group 3, and blue ones will be in Group 4.

Example 1 (source: https://www.baeldung.com/java-k-means-clustering-algorithm)

In the above example, the objects are marbles and the characteristic is color.
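To make the marble example concrete, here is a minimal Python sketch that groups marbles by color. The list of marbles and the function name `group_by_color` are made up for illustration. Note that this is only the intuition: a real clustering algorithm is not handed the group names and has to discover the groups from how similar the data points are.

```python
from collections import defaultdict

# Hypothetical marbles; each one is described only by its color.
marbles = ["blue", "green", "red", "green", "yellow", "blue", "red", "green"]


def group_by_color(items):
    """Collect the positions of all marbles that share the same color."""
    groups = defaultdict(list)
    for index, color in enumerate(items):
        groups[color].append(index)
    return dict(groups)


groups = group_by_color(marbles)
for color, members in groups.items():
    print(color, members)
```

Each key of `groups` plays the role of one cluster, and the list under it holds the marbles that ended up in that cluster.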

How is clustering different from classification?

Clustering is about finding patterns in data without any prior knowledge of the groups, while classification is about predicting labels, categories, or groups that are already defined.

In clustering, the complete dataset is used to find groups (there is no concept of training and test datasets), whereas in classification the model is first trained on a training dataset whose groups are already known, and labels for new observations are predicted later.
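The difference can be sketched with a toy example. Everything here is an assumption made for illustration: the numbers are hypothetical marble diameters, `classify` is a simple nearest-neighbour rule, and `cluster_two_groups` just splits the sorted values at the largest gap. Neither is meant as a production algorithm.

```python
# Classification: the training data comes with known labels.
train = [(10.0, "small"), (11.0, "small"), (20.0, "large"), (21.0, "large")]


def classify(value, labeled):
    """Predict a label by copying the label of the closest training point."""
    nearest = min(labeled, key=lambda pair: abs(pair[0] - value))
    return nearest[1]


# Clustering: the whole dataset is used and no labels exist.
data = [10.0, 11.0, 20.0, 21.0, 10.5]


def cluster_two_groups(values):
    """Split sorted values into two groups at the largest gap between neighbours."""
    ordered = sorted(values)
    gaps = [ordered[i + 1] - ordered[i] for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


print(classify(19.0, train))     # label predicted for a new observation
print(cluster_two_groups(data))  # groups found without any labels
```

The classifier needs the labels "small" and "large" up front, while the clustering function only returns two unnamed groups; giving them a meaning is left to us.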

How is clustering done?

As explained earlier, we cluster data based on characteristics; in the marble example, that characteristic was color. But how does an algorithm decide whether two data points are similar or not?

Example 2 (source: https://statisticsbyjim.com/basics/k-means-clustering/)

Consider the data in Example 2 above. The dataset contains multiple companies, and each company (row) is treated as a vector, with each feature or attribute as one dimension; hence this is a 4-D problem. The distance between two rows is calculated the same way as the distance between two vectors. Rows that are closest to each other are grouped together to form clusters. In short, a simple distance metric is used to measure how similar two rows are.
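As a rough sketch of this idea, the snippet below treats each row as a 4-dimensional vector and finds the two rows that are closest to each other. The company names and feature values are invented for illustration and are not taken from the dataset in Example 2.

```python
import math

# Hypothetical companies, each described by four numeric features.
companies = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [1.0, 2.0, 3.0, 6.0],
    "C": [9.0, 8.0, 7.0, 6.0],
}


def euclidean(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def closest_pair(rows):
    """Return the names of the two rows with the smallest distance between them."""
    names = list(rows)
    pairs = [
        (names[i], names[j])
        for i in range(len(names))
        for j in range(i + 1, len(names))
    ]
    return min(pairs, key=lambda p: euclidean(rows[p[0]], rows[p[1]]))


print(euclidean(companies["A"], companies["B"]))
print(closest_pair(companies))
```

Companies A and B differ in only one feature, so they are the closest pair and would be the first candidates to land in the same cluster.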

The choice of distance metric depends on the problem. Euclidean distance is the most commonly used, but there are plenty of others.
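To see why the choice of metric matters, here is a small sketch comparing Euclidean distance with Manhattan distance, which I picked as one example of an alternative. The three points are hypothetical and chosen so that the two metrics disagree about which point is nearer to the origin.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Sum of absolute differences along each dimension."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Hypothetical 2-D points.
origin = [0.0, 0.0]
p = [3.0, 3.0]
q = [0.0, 5.0]

for name, metric in [("euclidean", euclidean), ("manhattan", manhattan)]:
    nearer = "p" if metric(origin, p) < metric(origin, q) else "q"
    print(name, metric(origin, p), metric(origin, q), "nearer:", nearer)
```

With Euclidean distance `p` is nearer to the origin, while with Manhattan distance `q` is nearer, so the same data can be grouped differently depending on the metric.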
