K-Means Clustering | Image Compression Done Right

Vedant Jain
Published in Analytics Vidhya
5 min read · Apr 28, 2020

Looking into K-Means, the Elbow Method (WCSS), and image compression in Python

I hope this post finds you in the best of health and working spirits. The lockdown due to Covid-19 has given us unprecedented time to explore more. This is my first attempt at Data Science as a programmer and at Medium as a writer, so bear with me.

Data points start out as scattered pieces of raw observation, with little or no information about how they relate to one another.
Clustering means finding some relation between data points and grouping similar ones together. In the current global situation, one regularly hears in the media about clusters of Covid-19, where Covid-19 is the relation between the data points (humans).

Celestial Clusters

In complex scenarios where the relation between data points isn't easily noticeable, a variety of mathematical and computer-science tools are employed to find it.
The K-Means algorithm is one such method.

How does it do what it does?

First, K-Means picks K random initial points (calling them InitPoints ahead) in the N-dimensional space, N being the number of independent properties/attributes of the data points (for a Covid patient, these attributes could be age, blood pressure, prior respiratory problems, etc.).

Second, a distance (assumed Euclidean below) between every data point and every InitPoint is calculated, and each data point is assigned to its nearest InitPoint, forming the initial set of clusters.

The distance metric could be Euclidean, Minkowski, or any other; Euclidean is in fact the Minkowski distance of order 2.
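To make the metric concrete, here is a small NumPy sketch (the function name and sample points are my own, for illustration) showing that the Minkowski distance of order 2 is the familiar Euclidean distance, while order 1 gives the Manhattan distance:

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p; p=2 is Euclidean, p=1 is Manhattan."""
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

print(minkowski(a, b, 2))  # Euclidean: 5.0 (the classic 3-4-5 triangle)
print(minkowski(a, b, 1))  # Manhattan: 7.0
```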

Third, every InitPoint is shifted to the centroid (layman: the center/middle) of the data points assigned to it.

Then the second step is repeated: the distance between every data point and the new centroids (the shifted InitPoints) is computed, and the data points are re-assigned to their nearest centroid, followed by the third step again.
These two steps are repeated until no data point gets assigned to a different nearest centroid, i.e. the centroids have stabilized.

What we have now is an N-dimensional space with K clusters.

Cluster Centroids Obtained by repeated Distance calculation of points and InitPoints
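The three steps above can be sketched in plain NumPy. This is a minimal illustration (the function name, toy data, and seed are my own), not the scikit-learn implementation used later in this post:

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial InitPoints.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its old position).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Repeat until the centroids stop moving, i.e. no re-assignments happen.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs around (0, 0) and (10, 10); K = 2 should recover them.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0.0, 0.1, (10, 2)), rng.normal(10.0, 0.1, (10, 2))])
centroids, labels = kmeans_sketch(pts, k=2)
```

On this toy data the two recovered centroids land near (0, 0) and (10, 10), one per blob.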

How to decide on the number “K” of random initial points?
Elbow Method

We start with K = 1 and assume all data points are in one cluster; following the K-Means algorithm, we get the right centroid.
Thereafter we calculate what is called the "Within-Cluster Sum of Squares" (WCSS),
which is the sum of squared distances of every data point from its assigned centroid/InitPoint after K-Means has been performed.

Mathematically:

WCSS = Σ (i = 1 to K) Σ (x in Cᵢ) ‖x − μᵢ‖²

where Cᵢ is the i-th cluster and μᵢ its centroid.

We take K = 2, K = 3, and so on, calculating WCSS as we go.
Then we plot the values of K against WCSS,

which comes out something like this:

Thus we see that for K = 3, an elbow-shaped bend is observed for this sample of data points, giving us an idea of the count of InitPoints, K, to pick; that K eventually becomes the count of clusters.
One must observe that if we keep increasing the number of clusters, the WCSS keeps coming down, and if we make as many clusters as there are data points, the WCSS drops to zero, since every point would then be its own centroid.

Such clustering serves no purpose. Also, picking the initial points randomly has its own problem, called the Random Initialization Trap: different starting InitPoints can lead to different end results (sets of clusters). Do read more about "k-means++" to avoid this trap.
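The elbow plot above can be reproduced with scikit-learn, whose `inertia_` attribute is exactly the WCSS of a fitted model. The three-blob toy data below is my own stand-in, chosen so the elbow shows up at K = 3:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: three well-separated blobs, so the elbow should appear at K = 3.
data = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0.0, 5.0, 10.0)])

wcss = []
ks = range(1, 9)
for k in ks:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0).fit(data)
    wcss.append(km.inertia_)  # inertia_ is the WCSS for this fit

plt.plot(list(ks), wcss, marker='o')
plt.xlabel('K (number of clusters)')
plt.ylabel('WCSS')
plt.show()
```

The curve drops steeply until K = 3 and flattens afterwards, which is the "elbow" we pick.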

Image Compression Done Right

The K-Means algorithm has applications varying from predicting the right audience for a marketing campaign to text and image clustering.

Below is a practical implementation of K-Means used for Compressing an image.

I am an avid traveler, so here I pick one of my favorite pictures, from the Kheerganga Trek in the mighty Himalayas, and compress it in Python.

The complete code:

Using KMeans Clustering for Image Compression

Before and After :

Here the initial picture is 1200×1600 pixels. Each pixel's color is made up of three primary colors, red, green, and blue (RGB), and the RGB value of a pixel can be anywhere from (0,0,0) to (255,255,255).
The approach is to read all the pixels from the image and form 64 clusters in the 3-dimensional space (R, G, B); the RGB values of the pixels are our data points here.

The centroid of each cluster holds the RGB value we use to print all the pixels in that cluster.
Essentially, the right image above is made of only 64 colors!

Reading the image pixels:

from PIL import Image
import numpy as np

org_image = Image.open('Kheerganga.jpeg', 'r')
size = org_image.size
pixels = list(org_image.getdata())
final_pixels = []
for i in range(0, len(pixels)):
    final_pixels.append(np.array(pixels[i]))
final_pixels = np.array(final_pixels)

Here we take 64 clusters and predict the cluster for each pixel:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters = 64, init = 'k-means++', n_init = 10, max_iter = 20)
y_kmeans = kmeans.fit_predict(final_pixels)

The code below gives us the centroids:

centroids = kmeans.cluster_centers_

Finally, we traverse every pixel, fetch its cluster (out of 64), and look up the corresponding centroid value (RGB).

We do the plot as:

import matplotlib.pyplot as plt

# Reshape the flat cluster labels into the image's rows x columns grid.
labels = y_kmeans.reshape(org_image.size[1], org_image.size[0])
X = np.arange(0, org_image.size[0])
for y in range(0, org_image.size[1]):
    plt.scatter(np.full((1, org_image.size[0]), y)[0], X,
                c = kmeans.cluster_centers_[labels[y][X]] / 255)
plt.show()
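Plotting one scatter call per row is slow for a 1200×1600 image. A vectorized alternative replaces every pixel with its centroid in one NumPy indexing step and saves the result directly; the tiny 8×8 random array below is a stand-in for the real photo, and the filename is illustrative:

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# Stand-in for the trek photo: a tiny 8x8 RGB image of random colors.
rng = np.random.default_rng(0)
arr = rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8)
final_pixels = arr.reshape(-1, 3)

# 4 clusters here instead of 64, since the stand-in image is tiny.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(final_pixels)

# Replace every pixel with its cluster's centroid, then restore the image shape.
compressed = kmeans.cluster_centers_[y_kmeans].astype(np.uint8).reshape(arr.shape)
Image.fromarray(compressed).save('compressed.png')
```

The fancy-indexing line `kmeans.cluster_centers_[y_kmeans]` does in one step what the pixel-by-pixel loop does row by row.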

That’s my little review of the Algorithm. Constructive criticism, comments, and suggestions are most welcome ;) Until next time. #StayHomeStaySafe


Software Engineer @ Microsoft | Ex-Qualcomm | Voracious Programmer | Ferocious Extrovert | Avid Traveler