# K-means Clustering

A from-scratch explanation and implementation.

K-means clustering is an unsupervised machine learning technique. It aims to partition *n observations* into *k clusters*.

The k-means algorithm is an iterative algorithm that tries to partition the dataset into K distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group.

GOAL → Partition the data into k clusters, where k is known in advance.

# Procedure of the K-means Algorithm

- Specify the number of clusters *K*.
- Randomly initialize K centres (centroids).
- Iterate over the dataset until the maximum number of iterations is reached, or earlier if the cluster centroids stop changing:
    - Calculate the Euclidean distance of each point from every cluster centre, and assign each point to the cluster whose centre is nearest. This step is called the **E-step (Expectation step)**. *(K-means is a special case of the Expectation-Maximization algorithm.)*
    - Compute the mean of every cluster and update that cluster's centroid to it. This step is called the **M-step (Maximization step)**.
- Repeat these steps until the maximum number of iterations is reached or the centroids stop moving.

K-means can be stated as an E-M (Expectation-Maximization) algorithm: the E-step assigns points to clusters, while the M-step updates the centres.
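One round of this E-step/M-step loop can be sketched in a few lines of NumPy. The toy data and variable names below are my own illustration, not from the original post:

```python
import numpy as np

# Toy data: two obvious groups, and two initial centres (k = 2)
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centers = np.array([[0.0, 0.0], [10.0, 10.0]])

# E-step: assign every point to its nearest centre
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # shape (n, k)
labels = dists.argmin(axis=1)

# M-step: move each centre to the mean of its assigned points
new_centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])

print(labels)       # [0 0 1 1]
print(new_centers)  # [[1.25 1.5 ] [8.5  8.75]]
```

Running the two steps repeatedly until the centres stop moving is the whole algorithm.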

## Maths Behind K-means

As with other machine learning algorithms, we have a loss function; here it is usually called the objective function or inertia:

$$J = \sum_{i=1}^{n}\sum_{k=1}^{K} w_{ik}\,\lVert x_i - \mu_k \rVert^2$$

We introduce the indicator $w_{ik}$ ($w_{ik} = 1$ if point $x_i$ belongs to cluster $k$, and $0$ otherwise) because we only want to add the distance to the cluster a point belongs to, not to the other clusters.

We first minimize $J$ with respect to $w_{ik}$, keeping $\mu_k$ fixed. Then we minimize $J$ with respect to $\mu_k$, treating $w_{ik}$ as fixed.

Differentiating $J$ w.r.t. $w_{ik}$ and updating the cluster assignments is the **E-step**; differentiating $J$ w.r.t. $\mu_k$ and updating the centroids is the **M-step**.

The E-step is

$$w_{ik} = \begin{cases}1 & \text{if } k = \arg\min_j \lVert x_i - \mu_j \rVert^2\\ 0 & \text{otherwise}\end{cases}$$

The M-step is

$$\mu_k = \frac{\sum_{i=1}^{n} w_{ik}\,x_i}{\sum_{i=1}^{n} w_{ik}}$$

This is the same as taking the mean: the denominator is the sum of the 1's and 0's, i.e. the number of points in the $k$-th cluster, while the numerator is the sum of the coordinates of the points in that cluster.
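A small NumPy check makes this concrete (the toy data and names are mine): with 0/1 weights, the M-step formula reproduces the plain per-cluster mean.

```python
import numpy as np

# w_ik as a 0/1 matrix: rows are points, columns are clusters
X = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]])
W = np.array([[1, 0], [1, 0], [0, 1]])

# M-step formula: the numerator sums each cluster's points,
# the denominator counts them
mu = (W.T @ X) / W.sum(axis=0)[:, None]
mean0 = X[:2].mean(axis=0)  # plain mean of cluster 0's two points

print(mu)  # row k is the updated centre mu_k
```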

This method of minimizing the error by alternately solving for $w_{ik}$ and $\mu_k$ is called coordinate descent; plotted against iterations, the objective $J$ decreases in a staircase pattern.
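Because each E-step and each M-step can only lower (or leave unchanged) the objective, the inertia traced over iterations is non-increasing. A minimal sketch with assumed synthetic data:

```python
import numpy as np

# Two well-separated Gaussian blobs as toy data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
centers = X[rng.choice(len(X), size=2, replace=False)].copy()

history = []
for _ in range(10):
    # E-step: nearest-centre assignment
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    history.append(float((d.min(axis=1) ** 2).sum()))  # inertia J
    # M-step: centres move to the cluster means
    for k in range(2):
        if (labels == k).any():
            centers[k] = X[labels == k].mean(axis=0)

print(history)  # a non-increasing sequence
```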

## Code for K-means

```python
import numpy as np
import matplotlib.pyplot as plt

# Distance between two points: Euclidean distance
def distance(a1, a2):
    return np.sqrt(np.sum((a1 - a2) ** 2))

class KMeans():
    def __init__(self, k=5, max_iter=100):
        self.k = k
        self.max_iter = max_iter
        self.clusters = {}
        self.label = []

    def initialization(self, X):
        # Randomly initialize k centres. The uniform ranges below are
        # specific to the latitude/longitude data used in this example.
        for i in range(self.k):
            center = np.zeros((2,))
            cx = np.random.uniform(low=38.8250, high=39.0000)
            cy = np.random.uniform(low=-76.9, high=-77.15)
            center[0] = cx
            center[1] = cy
            # center = 10*(2*np.random.random((X.shape[1],))-1)
            points = []
            cluster = {
                'center': center,
                'points': points,
                'id': i
            }
            self.clusters[i] = cluster
        self.label = np.zeros((X.shape[0], 1))

    def assignPointTOClusters(self, X):
        # E-step: assign each point to the nearest cluster centre
        for i in range(X.shape[0]):
            dist = []
            curr_x = X[i]
            for ki in range(self.k):
                d = distance(curr_x, self.clusters[ki]['center'])
                dist.append(d)
            current_cluster = np.argmin(dist)
            self.clusters[current_cluster]['points'].append(curr_x)
            self.label[i] = self.clusters[current_cluster]['id']

    def check(self, old_c, new_c):
        # Converged when no centre moved at all
        distances = [distance(old_c[i], new_c[i]) for i in range(self.k)]
        return sum(distances) == 0

    def updateClusters(self):
        # M-step: move each centre to the mean of its assigned points
        for kx in range(self.k):
            pts = np.array(self.clusters[kx]['points'])
            if pts.shape[0] > 0:  # if the cluster has at least one point
                new_u = pts.mean(axis=0)
                self.clusters[kx]['center'] = new_u
            # Clear the point list for the next iteration
            self.clusters[kx]['points'] = []

    def plotClusters(self):
        for kx in range(self.k):
            pts = np.array(self.clusters[kx]['points'])
            # Plot the cluster's points, then its centre
            try:
                plt.scatter(pts[:, 0], pts[:, 1])
            except Exception:
                pass  # cluster may be empty
            uk = self.clusters[kx]['center']
            plt.scatter(uk[0], uk[1], color='black', marker="*")
        plt.show()

    def fit(self, X):
        self.initialization(X)
        for it in range(self.max_iter):
            self.assignPointTOClusters(X)
            self.plotClusters()
            old_c = [self.clusters[i]['center'] for i in range(self.k)]
            self.updateClusters()
            new_c = [self.clusters[i]['center'] for i in range(self.k)]
            if self.check(old_c, new_c):
                break
```

# K-Means++: Just an Introduction

- K-means is quite sensitive to initialization: if the initial centres are poorly chosen, the algorithm may fail to form the desired clusters.
- To overcome this problem, we use a technique called K-Means++ (described in the paper *Robust Seed Selection for K-Means Type Algorithms*), which chooses initial centres so that they are statistically close to the final ones.
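A rough sketch of the seeding idea (my own illustration, not the exact algorithm from the cited paper): pick the first centre uniformly at random, then pick each further centre with probability proportional to its squared distance from the nearest centre already chosen, so far-apart points are favoured.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    # First centre: a uniformly random data point
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen centre
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        # Sample the next centre proportionally to d2
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])
centers = kmeans_pp_init(X, 2, rng)
print(centers)  # each centre is an actual data point
```

With well-separated blobs like these, the distance-weighted sampling makes it very likely that one centre lands in each blob.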