k-Means Clustering Explained (Part I: Theory)

n0obcoder
4 min readJun 15, 2019



What is Clustering?

Pogo visits his friend’s place. His friend, Deepi, is an anthophile (a person who loves flowers). Deepi has a big pile of flowers in his backyard, all of them belonging to the same species, but of different colors!

Deepi asks Pogo if he can create Clusters out of the pile of flowers lying in his backyard. Pogo looks at the pile of flowers. He sees that all of the flowers look almost the same. They have very similar characteristics in all but one aspect: their color!

So Pogo simply puts the flowers of each color into a separate basket and goes on to enjoy some chilled beer with his buddy Deepi.

So what exactly happened here?

Pogo basically performed the task of grouping a set of objects (flowers, in his case) in such a way that the objects in the same group (called a Cluster) are more similar to each other (color being the common feature in his case) than to those in the other groups.

I hope you now have a decent understanding of what Clustering means.

A little more formally, Clustering is an unsupervised Machine Learning task that involves grouping data points. In theory, data points falling in the same cluster should have similar characteristics/features, while data points falling in different clusters should have highly dissimilar properties/features.

In Data Science, we use cluster analysis to get valuable insights from our data by seeing what groups the data points fall into when we apply a clustering algorithm. Clustering reveals underlying patterns or structure in the data that would otherwise go unnoticed.

There are various clustering algorithms out there, but I am going to talk about the first clustering algorithm I ever came across, and it’s a very simple one. It’s called k-Means Clustering.

k-Means Clustering

It is one of the simplest and easiest-to-implement unsupervised learning algorithms that solve the clustering problem.

Steps involved in k-Means Clustering Algorithm

  1. We start by choosing k, the number of clusters/classes/groups that we want to create in the given dataset, and we initialize the k cluster centers in the n-dimensional feature space. It helps to take a look at the data first and try to come up with the number of distinct clusters, k. Note that each of the k cluster centers is an n-dimensional vector in the n-dimensional feature space.
  2. Each data point is assigned to one of the k clusters by computing its distance to all the k cluster centers and picking the cluster whose center is closest to it.
  3. Based on these assignments, we recompute each cluster center by taking the mean of all the n-dimensional vectors in that cluster.
  4. We repeat steps 2 and 3 until the cluster centers don’t change much (i.e., until the cluster centers converge).
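The four steps above can be sketched in a few lines of NumPy. This is a minimal, illustrative implementation rather than production code: the function name `kmeans` and the choice of initializing centers at randomly picked data points are my own, and it assumes no cluster ever ends up empty.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-Means sketch. X: (n_samples, n_features) array; k: number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize the k cluster centers by picking k distinct data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to the cluster whose center is closest
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (n, k)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned points
        # (assumes no cluster goes empty; a robust implementation would handle that)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centers have converged
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

On two well-separated blobs of points, this converges in a handful of iterations and recovers one center per blob.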

Advantages

  1. k-Means is really fast. Computing the distances between the data points and the cluster centers is essentially all we do, so each iteration has complexity linear in the number of data points, O(n).
  2. It’s very easy to understand and to implement from scratch using just NumPy.

Disadvantages

  1. We have to select the number of clusters, k, that we want to create in the dataset. This is not always a simple task, and ideally we would want the clustering algorithm to figure it out for us.
  2. k-Means starts off by initializing the k cluster centers randomly. It can therefore yield a different clustering every time we run the algorithm. Other clustering algorithms are more consistent.
  3. Since the algorithm takes the mean of all the n-dimensional data points in a cluster to compute the new cluster center, it is sensitive to outliers.

    This problem can be addressed with the k-Medians clustering algorithm, where we take the median instead of the mean of all the n-dimensional data points in a cluster to compute the new cluster center. Because it uses medians, this method is much less sensitive to outliers. However, it is slower on larger datasets, as sorting is also required on each iteration when computing the n-dimensional median vector.
  4. This algorithm is not invariant to non-linear transformations of the data, i.e., with different representations of the data we get different results (data represented in Cartesian coordinates and in polar coordinates will be clustered differently).
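To see why the median update in k-Medians resists outliers while the mean update in k-Means does not, here is a tiny comparison on made-up data: one cluster of points near (1, 1) plus a single extreme outlier.

```python
import numpy as np

# A made-up cluster of points near (1, 1), plus one extreme outlier
points = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.0], [100.0, 100.0]])

mean_center = points.mean(axis=0)          # k-Means update: dragged toward the outlier
median_center = np.median(points, axis=0)  # k-Medians update: stays with the cluster

print(mean_center)    # roughly (20.84, 20.8), nowhere near the actual cluster
print(median_center)  # (1.1, 1.0), right in the middle of the cluster
```

A single outlier pulls the mean far away from the cluster, while the median barely moves.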
Get Your Hands Dirty in Part II

If you guys want to get your hands dirty, you should definitely check out k-Means Clustering Explained (Part II: Python Implementation).

I am writing this blog because I have learnt a lot by reading others’ blogs, and I feel that I should also write and share whatever I know as much as I can. So please leave your feedback down below in the comments section. Also, I am new to writing blogs, so any suggestions on how to improve my writing would be appreciated! :D

Also, I am an independent music artist who likes to play and record music in my free time. Maybe you can check out my artist page on Spotify and show your support : )
8th Floor Harmony on Spotify!
