Implementing a Stupidly Simple K-Means in Python

Mayank Satnalika
4 min read · Mar 8, 2017


K Means is generally one of the first algorithms you encounter when studying unsupervised learning, and it is a clustering algorithm.

What is clustering? It is simply the grouping of data points into categories. For example, a general store can group its customers by age group and gender, and a multi-cuisine restaurant can cluster the items on its menu into groups like Italian, Mexican or Chinese food, and so on.

How does K Means work?

We take all our data points and randomly select k of them, where k is the number of clusters to be formed. These points are known as centroid points.

while( we are not satisfied )

Step 1: Assign each data point a label between 1 and k, based on the centroid the data point is closest to.

Step 2: For each centroid point, replace it with the average of the data points that have been assigned to that centroid.

It can be better explained by the animation below:

The ‘X’s denote the centroid points, and all data points are grouped into red or green clusters based on their proximity to the nearest centroid. Each centroid is then moved to the average of the points assigned to its cluster.
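Before we apply this to real data, the two steps can be sketched on a tiny made-up dataset with plain NumPy (a minimal sketch of the loop above, not the Titanic code we build below):

```python
import numpy as np

# toy data: two obvious blobs in 2-D
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centroids = points[[0, 2]].copy()  # two "randomly" chosen starting centroids

for _ in range(5):  # "while we are not satisfied"
    # step 1: label each point with the index of its nearest centroid
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # step 2: move each centroid to the mean of its assigned points
    for k in range(len(centroids)):
        centroids[k] = points[labels == k].mean(axis=0)

print(labels)     # the two blobs end up in separate clusters
print(centroids)  # each centroid sits at the middle of its blob
```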

For our Titanic dataset, we are going to group the data into two clusters which, hopefully, will indicate whether the passengers survived or not. For the sake of simplicity, let's take only 2 numerical features, age and fare, as they can be easily visualized. We can use as many features as we want for K Means, but visualizing 3 or more dimensions gets difficult.

Importing the modules:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
from sklearn import preprocessing

Reading and filling incomplete data:

data = pd.read_csv('train.csv')
data = data[:20]
# visualising 900 data points will clutter visualisations
max_clusters = 2
data['Age'].fillna(np.mean(data['Age']), inplace=True)
data['Fare'].fillna(np.mean(data['Fare']), inplace=True)

Preprocessing data:

data['Age'] = preprocessing.scale(data['Age'])
data['Fare'] = preprocessing.scale(data['Fare'])
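preprocessing.scale standardises a column to zero mean and unit variance, which stops one feature (here fare, whose raw range is much larger than age's) from dominating the distance calculation. A quick sanity check on made-up numbers:

```python
import numpy as np
from sklearn import preprocessing

ages = np.array([22.0, 38.0, 26.0, 35.0, 54.0])
scaled = preprocessing.scale(ages)

print(scaled.mean())  # ~0: the column is centred
print(scaled.std())   # ~1: and scaled to unit variance
```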

Now let's get started with the K Means part. First we randomly select 2 data points as our starting centroids (because we need to divide our data into 2 clusters).

centroid_list = []
cols = ['Age', 'Fare']
for i in range(max_clusters):
    OneDict = {}
    randomIndex = random.randrange(0, data.shape[0])
    for col in cols:
        OneDict[col] = data.iloc[randomIndex][col]
    centroid_list.append(OneDict)

centroid_list is a list of dictionaries. Each dictionary maps a column name to the value for that column. If there are k clusters, centroid_list will have k dictionary elements, so here it contains 2 data points, each holding an age value and a fare value, which are my randomly selected centroid points.
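One caveat: random.randrange can pick the same row twice, giving two identical starting centroids. A hedged alternative with the same structure, using random.sample to guarantee distinct indices (shown here on a small made-up DataFrame):

```python
import random
import pandas as pd

# made-up stand-in for the Titanic data
data = pd.DataFrame({'Age': [22.0, 38.0, 26.0, 35.0],
                     'Fare': [7.25, 71.28, 7.92, 53.1]})
cols = ['Age', 'Fare']
max_clusters = 2

centroid_list = []
# random.sample draws max_clusters distinct row indices
for randomIndex in random.sample(range(data.shape[0]), max_clusters):
    OneDict = {col: data.iloc[randomIndex][col] for col in cols}
    centroid_list.append(OneDict)

# two distinct rows, each an {'Age': ..., 'Fare': ...} dictionary
print(centroid_list)
```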

Now we start the repetitive part of the algorithm that groups the data points: we run steps 1 and 2 multiple times (max_iter times).

max_iter = 7
iters = 0
cluster_train = [0] * data.shape[0]  # cluster label for every data point
while iters < max_iter:

    # step 1 begins:
    for i in range(data.shape[0]):
        cluster_index = 0
        mini = float('inf')
        for centroid in centroid_list:
            dist = np.linalg.norm(np.array([centroid[col] for col in cols])
                                  - data[cols].iloc[i].values)
            if dist < mini:
                mini = dist
                cluster_train[i] = cluster_index
            cluster_index += 1

For every data point, we find its distance from every centroid using the function np.linalg.norm() and save it as dist. If the current dist is less than the previous minimum distance, the data point is reassigned to that centroid. Let's plot the centroid points after each iteration:
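np.linalg.norm of a difference vector is just the Euclidean distance; with our two features it reduces to the familiar √((x₁ − x₂)² + (y₁ − y₂)²):

```python
import numpy as np

a = np.array([0.0, 0.0])  # a data point (age, fare), already scaled
b = np.array([3.0, 4.0])  # a centroid

dist = np.linalg.norm(a - b)
print(dist)  # 5.0, the classic 3-4-5 right triangle
```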



    # still inside the while loop: plot the clusters after this iteration
    colours = ["r.", "g."]
    for i in range(data.shape[0]):
        plt.plot(data['Age'].iloc[i], data['Fare'].iloc[i],
                 colours[cluster_train[i]], markersize=10)

    p1_1 = centroid_list[0]['Age']
    p1_2 = centroid_list[0]['Fare']
    p2_1 = centroid_list[1]['Age']
    p2_2 = centroid_list[1]['Fare']
    plt.scatter(p1_1, p1_2, color='red', marker='x', s=130)
    plt.scatter(p2_1, p2_2, color='green', marker='x', s=130)
    plt.show()

For step 2 of the algorithm, for each centroid point in our list, we find the average of the points assigned to that cluster index and change the centroid point to that average.

    # still inside the while loop: step 2
    for ind in range(len(centroid_list)):
        SumCentroidPt = {}
        count = 0
        for col in cols:
            SumCentroidPt[col] = 0
        for i in range(data.shape[0]):
            if cluster_train[i] == ind:
                count = count + 1
                for col in cols:
                    SumCentroidPt[col] = (SumCentroidPt[col] +
                                          data.iloc[i][col])

        for col in cols:
            SumCentroidPt[col] = SumCentroidPt[col] / count
        centroid_list[ind] = SumCentroidPt

    iters += 1

Remember we plotted the centroids and the assigned clusters each time after running step 1; here they are:

Well, that's it. Our cluster_train array contains the cluster each data point is assigned to, and centroid_list contains the final centroid points.
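As an aside, once the labels live in a column, the whole of step 2 collapses into a single pandas groupby. A sketch of the same centroid update, assuming cluster_train holds the labels (shown on small made-up numbers):

```python
import pandas as pd

# made-up scaled features and cluster labels
data = pd.DataFrame({'Age': [-1.0, -0.5, 1.0, 1.5],
                     'Fare': [-1.2, -0.8, 0.9, 1.1]})
cluster_train = [0, 0, 1, 1]

data['cluster'] = cluster_train
# mean of each feature within each cluster = the new centroids
means = data.groupby('cluster')[['Age', 'Fare']].mean()

# means.loc[k] is the new centroid for cluster k
centroid_list = [means.loc[k].to_dict() for k in means.index]
print(centroid_list)
```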

Do press the little heart button below if you liked the article :)
