A Friendly Introduction to the K-Means Clustering Algorithm

Tarlan Ahadli
Jan 7, 2020 · 3 min read


K-means clustering is one of the most common unsupervised learning algorithms in data science. The method follows a few straightforward rules to group an unlabeled dataset into clusters, which we will walk through in this article.

Before jumping into the main topic, let’s imagine a town in which a company wants to sell its new T-shirts. Producing a shirt to fit each individual body size, however, is inefficient and risky. The product should be marketed to meet the requirements of the majority, so that the likelihood of sales is satisfactory.

This is where data scientists come into play. Their role in the company is to generalize and group the body sizes into a predetermined number of classes. It’s worth noting that the letter ‘K’ in the term ‘k-means’ represents this number.

As usual, the first part of the problem is to collect as much data as possible. For educational purposes, we can write a short script to generate an artificial dataset, but the technique applies just as easily to any real-world dataset that is structured similarly.
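As a minimal sketch of such a generator (NumPy-based; the group centres, spreads, and sample counts below are made up purely for illustration):

```python
import numpy as np

# Hypothetical "body size" data: (chest width, height) in cm,
# drawn from a few overlapping normal distributions.
rng = np.random.default_rng(42)
group_centres = [(46, 160), (50, 170), (55, 180), (60, 190)]
X = np.vstack([
    rng.normal(loc=centre, scale=(2.0, 4.0), size=(50, 2))
    for centre in group_centres
])
print(X.shape)  # (200, 2): 200 people, 2 features (width, height)
```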

The next step is to initialize ‘K’ randomly picked points on the plane. Then we iterate over every collected data point and compute its distance to each of the initialized points. A data point belongs to the initialized point to which its distance is minimum.
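A sketch of this assignment step is shown below (Euclidean distance and the helper name assign_clusters are my own choices, not prescribed by the article):

```python
import numpy as np

def assign_clusters(X, centroids):
    """Return, for each row of X, the index of its nearest centroid."""
    # Pairwise Euclidean distances, shape (n_points, k)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

# Tiny toy example: 6 people described by (width, height), K = 2
X = np.array([[46, 160], [47, 162], [48, 161],
              [59, 188], [60, 190], [61, 189]], dtype=float)
rng = np.random.default_rng(0)
centroids = X[rng.choice(len(X), size=2, replace=False)]
print(assign_clusters(X, centroids))  # cluster index for each of the 6 points
```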

Finally, we find the mean of the widths and heights of the points assigned to each initialized point, and set that mean as the point’s new coordinate. These initialized points are also called centroids.
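A sketch of this update step, assuming NumPy again (handling of empty clusters is omitted for brevity):

```python
import numpy as np

def update_centroids(X, labels, k):
    """Move each centroid to the per-feature mean of its assigned points."""
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])

X = np.array([[46, 160], [48, 162], [60, 188], [62, 190]], dtype=float)
labels = np.array([0, 0, 1, 1])
print(update_centroids(X, labels, k=2))
# [[ 47. 161.]
#  [ 61. 189.]]
```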

This process is repeated until convergence is reached, meaning there is no change in the positions of the centroids. These centroids then represent the widths and heights of the T-shirts that will be produced for the various body-size classes. That’s all!
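Putting the two helpers sketched above together, a minimal version of the full loop could look like this (initializing the centroids from K randomly picked data points is an assumption; the article only says the points are picked at random):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Repeat assignment and update until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    # Start from k randomly picked data points (empty-cluster handling omitted)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        labels = assign_clusters(X, centroids)          # from the sketch above
        new_centroids = update_centroids(X, labels, k)  # from the sketch above
        if np.allclose(new_centroids, centroids):       # convergence reached
            return new_centroids, labels
        centroids = new_centroids
    return centroids, labels
```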

How to choose K?

The main drawback of this technique is the ambiguity about the number K of centroids that should be initialized. To overcome this issue, the performance of the algorithm is evaluated for several different numbers of centroids.

To carry out this evaluation, once convergence has occurred, the distance between each data point and the centroid of its cluster is calculated. All of these distances are then summed, and the total serves as the performance measure (the objective function).
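As a hedged sketch of this measure (summing plain Euclidean distances, as described above; note that libraries such as scikit-learn sum squared distances instead and call the result inertia):

```python
import numpy as np

def within_cluster_distance(X, labels, centroids):
    """Sum of distances from every point to the centroid of its own cluster."""
    return sum(
        np.linalg.norm(X[labels == j] - c, axis=1).sum()
        for j, c in enumerate(centroids)
    )
```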

As the number of cluster centroids increases, the value of this objective function decreases.

In the image below, the relationship between the number of clusters K and the value of the objective function is depicted. To choose the optimal K, the elbow method is usually used: in that plot, the elbow-like bend occurs at K = 4.
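To reproduce a plot of this kind, a minimal elbow-method sketch could reuse the kmeans and within_cluster_distance helpers from above, with matplotlib assumed for the plotting:

```python
import matplotlib.pyplot as plt

# X, kmeans and within_cluster_distance come from the earlier sketches
ks = range(1, 10)
objective = []
for k in ks:
    centroids, labels = kmeans(X, k)
    objective.append(within_cluster_distance(X, labels, centroids))

plt.plot(ks, objective, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("objective (total within-cluster distance)")
plt.title("Elbow method")
plt.show()
```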

Putting theory into practice using Python
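The full implementations are linked at the end of the article. As a stand-in here, a short example using scikit-learn’s KMeans on synthetic data similar to the T-shirt scenario (scikit-learn is my choice for this sketch, not necessarily what the linked code uses):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic (width, height) measurements, as in the data-generation sketch above
rng = np.random.default_rng(42)
group_centres = [(46, 160), (50, 170), (55, 180), (60, 190)]
X = np.vstack([rng.normal(loc=c, scale=(2.0, 4.0), size=(50, 2))
               for c in group_centres])

model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(model.cluster_centers_)  # one (width, height) pair per T-shirt size
print(model.labels_[:10])      # cluster index for the first 10 people
print(model.inertia_)          # sum of squared distances to nearest centroids
```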

Conclusion

K-means is one of the most common and intuitive clustering algorithms in Machine Learning. The name ‘k-means’ almost explains the theory itself.

1. ‘K’ centroids are initialized at randomly picked positions.

2. Each data point is assigned to its nearest centroid.

3. The mean of the corresponding features of the points assigned to each centroid is calculated and set as that centroid’s new coordinate.

4. Steps 2 and 3 are repeated until there is no change in the positions of the centroids.

If you have any questions about the topics discussed, feel free to reach out to me on Twitter.

K-means clustering algorithm code: Python, C++

