k-Means Clustering Explained (Part II: Python Implementation from Scratch !)

n0obcoder
5 min readJun 20, 2019

--

Photo by Maxim Dužij on Unsplash

If you are not aware of the inside of k-Means Clustering Algorithm, I would strongly recommend you guys to check k-Means Clustering Explained (Part I: Theory) out.

Once you guys have a decent understanding of this algorithm, we are good to dive into the implementation of this clustering algorithm using Python from scratch !

Like any other implementation tutorial, we start off with importing the libraries that we are going to use.

Next, we want a data generator function that could make fake data for us to cluster them. For the sake of simplicity, we are going to be working on 2-dimensional data. In general, 2D data is always simple to visualize. But there are a few people like Pogo and Deepi, who can even visualize 300-dimensional data 👽 Here is the reference if you guys din’t get the last line 😛

So we define a data generator function which would generate data points falling in the same cluster. We want the function to take in the center (2-dimensional coordinates) of this cluster and the number of data points to be created and return us the data points.

Next, we make an empty np array and fill it with data points falling in different clusters by calling the gen_data() with varying it’s parameters (cluster centers and number of data points).

Data points generated by calling get_gen( ) 3 times.

Now, we need to pick the value of k and initialize the k cluster centers vectors randomly. The cluster centers are bound to exist between the extremes of all the data points that we have. So we first find out the extremes of the data points.

Next, we define a function which would take in the k cluster center points and plot them together. This function would be very helpful when we call them after every iteration of the algorithm.

Let’s, plot the randomly initialized cluster centers on top of all the data points that we have created using the fen_data() function.

Initialization of k cluster centers(Red dots) randomly, where k = 3

We need a way to know which data points fall under which cluster . We can get this information by creating a list which would contain the cluster information of all the data points encoded in color names.

Initially we color all the data points ‘Gray’.

We need another helper function which would map a data point to one of the existing k cluster centers. A data point is assigned to a cluster whose center is nearest to that data point. Isn’t this so simple !

All we need to do is compute Euclidean distances between a data point and all the k cluster centers, for all the data points.

Once, all the data points are assigned a cluster, we need to recompute new cluster centers. And also we need to compare how much a cluster center has shifted.

We set the maximum number of iterations the algorithm would run for.

Also we need a way to stop the algorithm before running maximum number of iterations in case when all the cluster centers have converged(when they stop shifting much). We can do this by creating a list of length equal to the number of clusters k and fill it with False. At the end of every iteration, for every cluster center, we check if the shift is less than a certain threshold, we change the stop list to True for that particular cluster. And when all the elements of the stop list are True, we stop the algorithm.

See how the cluster centers (Red dots) move !

If you guys have any doubts regarding the theory or the code, please let me know in the comments section down below. I will try to respond and clarify your doubts as soon as possible :D

I hope that this blog was helpful to you.

You guys can find the whole code here on my github page.

I am writing this blog because I have learnt a lot by reading other’s blogs and I feel that I should also write and share whatever I know, as much as I can. So please leave your feedback in the comments section down below. Also I am new to writing blogs, so any suggestions on how to improve my writing would be appreciated ! :D

Also I am an independent music artist who likes to play and record music in free-time. Maybe you can check out my artist page on Spotify and show your support : )
8th Floor Harmony on Spotify!

--

--

n0obcoder

DL Engineering in the making and a Struggling Musician