K-Means Clustering using Python from scratch

S Joel Franklin · Published in Analytics Vidhya · Nov 11, 2019

K-means clustering is an unsupervised learning algorithm that aims to partition n observations into k clusters, where each observation belongs to the cluster with the nearest centroid. The algorithm aims to minimize the sum of squared Euclidean distances between each observation and the centroid of the cluster to which it belongs.

Before diving into the code, let us look at the step-by-step approach:

Step 1: First we need to decide upon the number of clusters. In some cases, like market segmentation, the value of ‘k’ (the number of clusters) is defined in the problem statement. In other cases, we have to find the optimum value of ‘k’ ourselves. The elbow method helps select the optimum value of ‘k’ by fitting the model over a range of values of ‘k’. We will discuss it in another article. Here we assume the value of ‘k’ is 3.

Step 2: We then randomly pick ‘k’ of the given data points to serve as the initial centroids.
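This initialization step can be sketched as follows (assuming the data is a NumPy array `X` of shape `(n, 2)`; the function name `initialize_centroids` is my own, not from the article):

```python
import numpy as np

def initialize_centroids(X, k, seed=None):
    """Pick k distinct data points at random as the initial centroids."""
    rng = np.random.default_rng(seed)
    indices = rng.choice(len(X), size=k, replace=False)
    return X[indices]
```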

Step 3: We calculate the squared Euclidean distance between each data point and every centroid. A data point is assigned to the cluster whose centroid is nearest, i.e. the one for which the squared Euclidean distance is minimum.
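One way to write this assignment step with NumPy broadcasting (the helper name `assign_clusters` is my own) is:

```python
import numpy as np

def assign_clusters(X, centroids):
    """Assign each point to the cluster with the nearest centroid,
    measured by squared Euclidean distance."""
    # Pairwise squared distances, shape (n_points, k)
    diffs = X[:, None, :] - centroids[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # Index of the nearest centroid for each point
    return sq_dists.argmin(axis=1)
```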

Step 4: Now that we know which data points belong to each cluster, the centroid of each cluster is recalculated as the mean of its data points.
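A minimal sketch of the centroid update (again with a name of my choosing, `update_centroids`; it assumes no cluster ends up empty):

```python
import numpy as np

def update_centroids(X, labels, k):
    """Recompute each centroid as the mean of the points assigned to it."""
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])
```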

Step 5: We calculate ‘epsilon’, defined as the sum of squared Euclidean distances between each data point and its respective centroid.
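With the cluster labels in hand, this quantity is a one-liner (the name `compute_epsilon` is mine):

```python
import numpy as np

def compute_epsilon(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its centroid."""
    return ((X - centroids[labels]) ** 2).sum()
```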

Step 6: We repeat Steps 3 to 5 until there is no significant change in the value of ‘epsilon’, which suggests there is no significant change in cluster formation.

Now let us understand the Python code for K-means clustering of 2-dimensional data into 3 clusters.
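The article's original listing is not reproduced here, so the following is a minimal from-scratch sketch of Steps 1 to 6 put together (the function name `kmeans`, the empty-cluster fallback, and the use of NumPy are my assumptions):

```python
import numpy as np

def kmeans(X, k=3, n_iter=20, seed=0):
    """Plain k-means on an (n, d) array X. Returns the final labels and
    centroids, plus the epsilon value recorded after each iteration."""
    rng = np.random.default_rng(seed)
    # Step 2: random initial centroids drawn from the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    epsilon = []
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest centroid
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        # Step 4: recompute centroids (keep the old one if a cluster empties)
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: sum of squared distances to the assigned centroids
        epsilon.append(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, epsilon
```

Running it on, say, three Gaussian blobs in 2-D and inspecting `epsilon` per iteration reproduces the behaviour described below: the value drops sharply in the first few iterations and then levels off.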

We get the following scatter plots.

We observe that there is no significant change in cluster formation after the 3rd iteration, as the 3rd, 4th and 5th scatter plots are very similar to each other.

Now let us plot the graph of ‘number of iterations vs square root of epsilon’.

import matplotlib.pyplot as plt

a = list(range(1, 21))  # Number of iterations = 20
plt.plot(a, [x**0.5 for x in epsilon], 'go--', linewidth=1.5, markersize=4)
plt.xlabel('Iteration number')
plt.ylabel('Square root of epsilon')
plt.show()

From the above graph, we again observe that there is no significant change in the square root of epsilon after the 3rd iteration. Hence we can stop after 3–4 iterations.
