K-means Clustering

Aditri Srivastava
Analytics Vidhya
Published in
4 min readJan 17, 2021

--

From scratch explanation and implementation

K-means Clustering is an unsupervised machine learning technique. It aims to partition n observations into k clusters.

Kmeans algorithm is an iterative algorithm that tries to partition the dataset into K distinct non-overlapping subgroups (clusters) where each data point belongs to only one group

Source : Wikipedia

GOAL → Partition data among k number of clusters, where k is known to us.

Procedure of kmeans Algorithm

  1. Specify the number of clusters K
  2. Randomly initialize K centres (centroids) .
  3. Iterating over the dataset till we reach maximum number of iterations or before if there is change in centroids of clusters.
  4. Calculating Euclidian distance of each point and from every cluster centre. Assign each point to a cluster whosoever distance is minimum. This step is called E-step (Expectation step). {K-Means is a special case of Expectation Algorithm}
  5. Compute the mean of every cluster and then update it with centroid of that cluster. This step is called M-step(Maximization step)
  6. Repeat these till we reach maximum iteration or to our correct centroid.

K-means can be stated as E-M (Expectation-Maximization) algorithm. E-step populates points while M-step updates centre.

Maths Behind K-means

As we have seen in other Machine learning Algorithms, we have a loss function. So we it here also called as objective function or inertia.

We introduce wik here as we are going to add distance for a cluster that point belongs not other clusters.

We first minimize J with respect to wik keeping μk fixed. Then we minimize J w.r.t. μk and treat wik fixed.

When we are differentiating J w.r.t. wik first and update assignments of cluster which is E-step. Then we differentiate J w.r.t. μk and update the centroids which is called M-step.

E-step is

M-step is

This is same as us taking mean as denominator says the summation of 1’s and 0’s that is number of points in kth cluster. while numerator is sum of values of those points in that cluster.

This method of minimizing error and finding wik and μk is called Coordinate Descent which has a graph like this

Source : Wikipedia

CODE for K-means

# Distance between two points Euclidian Distance
def distance(a1,a2):
return np.sqrt(np.sum((a1-a2)**2))
class KMeans():
def __init__(self,k=5,max_iter=100):
self.k= k
self.max_iter=max_iter
self.clusters = {}
self.label = []

def initialization(self,X):
for i in range(self.k):
center=np.zeros((2,))
cx = np.random.uniform(low=38.8250, high=39.0000)
cy = np.random.uniform(low=-76.9, high=-77.15)
center[0] = cx
center[1] = cy
#center = 10*(2*np.random.random((X.shape[1],))-1)
points = []
cluster = {
'center':center,
'points':points,
'id' :i
}
self.clusters[i]=cluster
self.label = np.zeros((X.shape[0],1))

def assignPointTOClusters(self,X):
for i in range(X.shape[0]):
dist = []
curr_x = X[i]

for ki in range(self.k):
d = distance(curr_x,self.clusters[ki]['center'])
dist.append(d)

current_cluster = np.argmin(dist)
self.clusters[current_cluster]['points'].append(curr_x)
self.label[i]=(self.clusters[current_cluster]['id'])

def check(self,old_c,new_c):
distances = [distance(old_c[i], new_c[i]) for i in range(self.k)]
return sum(distances) == 0

def updateClusters(self):
for kx in range(self.k):
pts = np.array(self.clusters[kx]['points'])

if pts.shape[0]>0: # If cluster has some nonzero points
new_u = pts.mean(axis=0)
self.clusters[kx]['center'] = new_u
# Clear the list
self.clusters[kx]['points'] = []

def plotClusters(self):
for kx in range(self.k):
print(len(self.clusters[kx]['points']))
pts = np.array(self.clusters[kx]['points'])
# plot points , cluster center
try:
plt.scatter(pts[:,0],pts[:,1])
except:
pass
uk = self.clusters[kx]['center']
plt.scatter(uk[0],uk[1],color='black',marker="*")
plt.show()

def fit(self,X):
print(self.k)
self.initialization(X)
for i in range(self.max_iter):
print("i is ",i)
self.assignPointTOClusters(X)
self.plotClusters()
old_c = [self.clusters[i]['center'] for i in range(self.k)]
self.updateClusters()
new_c = [self.clusters[i]['center'] for i in range(self.k)]
if self.check(old_c,new_c):
break

K-Means++ : Just a Introduction

  • K-Means is quite sensitive to initalization, if the init is not good, our algorithm is not able to make desired number of clusters.
  • To overcome this problem, we use technique called K-Means++ (described in paper Robust Seed Selection for K-Means type of Algorithms) which chooses initial centers so that they are statiscallly close to final ones.

--

--