The Most Common Clustering Algorithms for Data Science and Their Code

Bhanwar Saini
Published in The Startup
7 min read · Feb 11, 2021

In supervised learning, we know the labels of the data points and their distribution. However, the labels may not always be known. Clustering is the practice of assigning labels to unlabeled data using the patterns that exist in it. Clustering can be either semi-parametric or probabilistic.

1. K-Means Clustering:

K-Means Clustering is an iterative algorithm that starts with k randomly chosen points, which serve as the initial mean values that define the clusters. Each data point belongs to the cluster whose mean value is closest to it. The coordinate of this mean value is called the centroid.

In each iteration, the mean of the data points in each cluster is recomputed, and the new means are used to repeat the assignment until the means stop changing. The disadvantage of K-Means is that it is a local search procedure: the result depends on the initial centroids, so it can settle on a local optimum and miss global patterns.
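The assign-then-update loop described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function names `assign` and `kmeans`, the `max_iter` and `seed` parameters, and the choice to start from k randomly picked data points are all assumptions made for the example.

```python
import numpy as np


def assign(X, centroids):
    """Return, for each row of X, the index of its nearest centroid."""
    # Pairwise Euclidean distances, shape (n_points, k).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means on a 2-D array X; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Assumption: initial centroids are k distinct data points picked at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest mean.
        labels = assign(X, centroids)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        # Stop once the means no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign(X, centroids)
```

Calling `kmeans(X, 2)` on an array with two well-separated groups of points should return one centroid per group. Because the search is local, a different `seed` can produce a different result on less clearly separated data.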

The k initial centroids can be selected at random. Another approach to choosing the initial centroids is to compute the mean of the entire dataset and add k random offsets to it to make k initial points. A third approach is to determine the first principal component of the data, divide the data along it into k equal partitions, and use the mean of each partition as an initial centroid.
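The three initialization strategies might look like the following in NumPy. The function names are hypothetical, and two details are assumptions of this sketch rather than something the text specifies: the random offsets are drawn from a normal distribution scaled by each feature's standard deviation, and "equal partitions" is read as groups with an equal number of points.

```python
import numpy as np


def init_random_points(X, k, rng):
    """Pick k distinct data points at random as initial centroids."""
    return X[rng.choice(len(X), size=k, replace=False)].astype(float)


def init_mean_plus_offsets(X, k, rng, scale=1.0):
    """Dataset mean plus k random offsets.

    Assumption: offsets are Gaussian, scaled by the per-feature standard
    deviation times `scale`.
    """
    offsets = rng.normal(size=(k, X.shape[1])) * X.std(axis=0) * scale
    return X.mean(axis=0) + offsets


def init_pca_partitions(X, k):
    """Split the data into k groups along the first principal component
    and return the mean of each group."""
    centered = X - X.mean(axis=0)
    # The first right singular vector is the first principal direction.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[0]
    # Assumption: "equal partitions" means (nearly) equal point counts.
    groups = np.array_split(np.argsort(scores), k)
    return np.array([X[idx].mean(axis=0) for idx in groups])
```

Each function returns an array of shape `(k, n_features)` that can serve as the starting centroids for the iterative procedure. The principal-component variant is deterministic, which makes runs reproducible; the two random variants need a generator such as `np.random.default_rng(0)`.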
