Master KMeans clustering basics

What is clustering?

Published in Analytics Vidhya · 4 min read · Aug 16, 2020

Clustering is a family of unsupervised machine learning techniques. Unlike supervised algorithms such as linear regression or logistic regression, clustering works with unlabeled data, that is, data without a target variable.

The task of clustering is to group similar data points.

Types of Clustering:

Clustering is a long-standing topic in data mining. It is an active research area, and many clustering algorithms exist.

The following are the main types of clustering algorithms.

K-Means
Hierarchical clustering
DBSCAN

Applications of Clustering:

The following are some of the applications of clustering:

1. Customer Segmentation: This is one of the most important use-cases of clustering in the sales and marketing domain. The aim is to group customers based on their similarities so that a business can define different action items for different groups. One example is Amazon giving different offers to different people based on their buying patterns.
2. Image Segmentation: Clustering is used in image segmentation, where similar pixels are grouped together. Pixels that belong to the same object end up in the same group, and pixels of different objects end up in different groups.
3. Pre-processing steps: Supervised machine learning algorithms are often considered more robust and interpretable than unsupervised algorithms, but you can’t use them unless you have labeled data. In such cases, clustering can be used as a pre-processing step: the unlabeled data is grouped, labels are assigned to the groups, and the labeled data can then be used to train a supervised algorithm.

K-means:

The standard K-means procedure is also known as Lloyd’s algorithm.

Steps in K-means:

1. Initialize k centroids randomly.
2. Assign each point to its nearest centroid.
3. Re-compute all k centroids.
4. Repeat steps 2 and 3 until the centroids no longer change.

The centroid is the point at the center of a cluster, computed as the mean of all the points assigned to that cluster.
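The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, and the toy `points` are assumptions made for this example, and Euclidean distance is assumed as the distance function.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean).
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Toy data: two visually obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

On this toy data the first three points should share one label and the last three the other. In practice a library implementation such as scikit-learn's `KMeans` is the usual choice.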

Distance Functions:

At step 2, each point is assigned to its nearest centroid. The obvious question is: how does the model find out which centroid is the nearest?

There are many ways to calculate the distance between a data point and a centroid. The most common choice for K-means is the Euclidean distance. Two related quantities are used to describe the resulting clusters:

Intra-Cluster Distance: Distance between points within the same cluster.
Inter-Cluster Distance: Distance between points in different clusters.
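As a small illustration of these two quantities, the sketch below averages pairwise Euclidean distances within one cluster and across two clusters. The helper names and the toy clusters are assumptions made for this example, and averaging over all pairs is only one of several possible definitions.

```python
import math
from itertools import combinations, product

# Two hypothetical clusters of 2-D points.
cluster_a = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
cluster_b = [(8.0, 8.0), (9.0, 8.0)]


def mean_intra_distance(cluster):
    """Average Euclidean distance over all pairs inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def mean_inter_distance(first, second):
    """Average Euclidean distance over all pairs that span two clusters."""
    pairs = list(product(first, second))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


intra_a = mean_intra_distance(cluster_a)
inter_ab = mean_inter_distance(cluster_a, cluster_b)
print(intra_a, inter_ab)
```

A good clustering tends to have a small intra-cluster distance and a large inter-cluster distance, which is the case for these two toy clusters.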

How to select the best K?

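K is a hyper-parameter of the K-means algorithm: we have to provide it before the model starts training.

Inertia: This is the sum of squared distances between each data point and its closest centroid.

Run the model with different values of K and check the inertia for each. A smaller inertia means tighter clusters, but inertia always decreases as K grows, so the smallest value alone does not identify the best K. A common approach is to pick the K after which the inertia stops dropping sharply (the "elbow").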
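To make the behaviour of inertia concrete, the sketch below computes it for three hand-made partitions of a toy dataset, standing in for fitted models with K = 1, 2 and 3. It uses the sum of squared distances (the convention that scikit-learn's `inertia_` attribute follows; dividing by the number of points would give a mean instead). The helper names and the data are assumptions made for this example.

```python
import math


def centroid(cluster):
    """Mean of a list of equal-length numeric tuples."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def inertia(clusters):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for cluster in clusters:
        center = centroid(cluster)
        total += sum(math.dist(p, center) ** 2 for p in cluster)
    return total


low = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
high = [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]

# Hand-made partitions standing in for fitted models with K = 1, 2, 3.
partitions = {
    1: [low + high],
    2: [low, high],
    3: [low, high[:1], high[1:]],
}
inertia_by_k = {k: inertia(clusters) for k, clusters in partitions.items()}
for k, value in inertia_by_k.items():
    print(k, round(value, 3))
```

The inertia drops sharply from K = 1 to K = 2 and only slightly from K = 2 to K = 3, which suggests K = 2 for this data.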

Silhouette Metrics:

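For a single data point:

b: Inter-cluster distance, i.e. the mean distance to the points of the nearest other cluster.
a: Intra-cluster distance, i.e. the mean distance to the other points in the same cluster.

s = (b - a) / max(a, b)

-1 ≤ s ≤ 1: The higher the Silhouette score, the better the clustering. The score for a whole clustering is the mean of s over all data points.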
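The Silhouette score can be sketched directly from the definitions of a and b, using the standard formula s = (b - a) / max(a, b) with Euclidean distance. The function name `silhouette_scores`, the toy points and the two labelings are assumptions made for this example. The sketch needs at least two clusters and assigns 0 to points in singleton clusters, which is a common convention.

```python
import math


def silhouette_scores(points, labels):
    """Per-point silhouette s = (b - a) / max(a, b) using Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other points in the same cluster
        # (the distance from p to itself is 0, so it does not affect the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return scores


points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (8.0, 9.0)]
good = silhouette_scores(points, [0, 0, 1, 1])  # sensible grouping
bad = silhouette_scores(points, [0, 1, 0, 1])   # deliberately poor grouping
mean_good = sum(good) / len(good)
mean_bad = sum(bad) / len(bad)
print(mean_good, mean_bad)
```

The sensible grouping gets a mean score close to 1, while the deliberately poor grouping gets a negative one.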

How to initialize the centroids?

We have seen how to select the best value for K using either Inertia or the Silhouette score. Once we have selected K, the next question is how to initialize the K centroids.

Random Initialization: Each data point has an equal probability of being selected as an initial centroid.
K-means++: The first centroid is selected uniformly at random from the data points. Each subsequent centroid is then selected from the data points with probability proportional to the squared distance from its nearest already-chosen centroid. The farther a data point is from the existing centroids, the higher its probability of being selected as the next centroid.
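A minimal sketch of K-means++ initialization follows, assuming the standard scheme in which each point's selection weight is its squared Euclidean distance to the nearest centroid chosen so far. The function name `kmeans_pp_init`, the `seed` parameter and the toy data are assumptions made for this example. The sketch expects at least k distinct points.

```python
import math
import random


def kmeans_pp_init(points, k, seed=0):
    """Choose k initial centroids from points using K-means++ seeding."""
    rng = random.Random(seed)
    # First centroid: chosen uniformly at random from the data.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        weights = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        # Sample the next centroid with probability proportional to that weight;
        # already-chosen points have weight 0 and cannot be picked again.
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
initial_centroids = kmeans_pp_init(points, k=2)
print(initial_centroids)
```

Because far-away points carry most of the weight, the two initial centroids will usually land in different groups, which tends to give Lloyd's algorithm a better starting point than purely random initialization.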

Limitations of K-means:

1. It does not work well if the natural clusters in the original data are of different sizes.
2. It does not work well if the natural clusters in the original data are of different densities.
3. It does not work well if the natural clusters in the original data are non-spherical in shape.

Hierarchical Clustering:

One of the main issues with the K-means algorithm is that we have to provide the value of K before training starts. Hierarchical clustering overcomes this: it builds a hierarchy of nested clusters, and the number of clusters can be chosen afterwards.

K-modes & K-prototypes:

The K-means algorithm works with numerical data only. If you have only categorical data, use the K-modes algorithm. If you have both categorical and numerical data, use the K-prototypes algorithm.

Thanks for your time. If you find this useful then please like and comment.

Happy reading!