Image Compression with K-means Clustering

Jagajith · Published in CodeX · Oct 11, 2021

Hello everyone. We have only seen Supervised learning algorithms in my prior articles, but today we'll dive into our first Unsupervised learning algorithm. In this post, we will learn about K-means, starting with an example 2D dataset. After that, we will apply the K-means algorithm to compress an image by reducing the number of colors in the picture to only those that are most common in it.

In Unsupervised learning, our algorithm learns from unlabelled data instead of labelled data. So, what is Unsupervised learning? I briefly talked about it in What are the Types of Machine Learning?, but it is useful to contrast it with Supervised learning. So here is the Supervised learning problem below:

Supervised Learning

In the case above, we are given a set of labelled examples and fit a hypothesis to them. In contrast, in an Unsupervised learning problem we are given data that does not have any labels associated with it. So we're given a dataset like the one below:

Unsupervised Learning

Here, the dataset is just a set of points with no labels, so our training set is written x⁽¹⁾, x⁽²⁾, …, x⁽ᵐ⁾ and we don't get any labels y. In unsupervised learning we simply ask the algorithm to find some structure or pattern in this unlabelled data. The groups circled above are called clusters, and finding them is our first type of unsupervised learning algorithm.

What is K-means?

The K-means algorithm is a technique for automatically grouping similar data samples together. For example, suppose you are given a training set x⁽¹⁾, …, x⁽ᵐ⁾ and wish to group the data into a few cohesive “clusters”.

Some more concrete examples of its uses are:

  • Market Segmentation: You may have a customer database and wish to divide it into different market segments, so that you can market to each segment separately or serve your different target groups better.
  • Social Network Analysis: On platforms like Facebook, Instagram, etc., you might use information such as who you message most frequently, and who they message most frequently, to discover cohesive groups of people, i.e., to find coherent groups of friends in the social network.
  • Organize Data Centers: If you know which computers in a data center tend to work together, you can use that to decide how to allocate resources, how to lay out the network, and how to design the data center and its communication links.
  • Explore Galaxies: Clustering algorithms are used to study galaxy formation and, through that, to make sense of astronomical data.

Let’s load our dataset really quick:

import numpy as np
from scipy.io import loadmat
import matplotlib.pyplot as plt

# Load the example 2D dataset
mat = loadmat('ex7data2.mat')
X = mat['X']
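If you want a quick look at the raw, unlabelled points before running anything, a simple scatter plot does the job (this is just an optional sanity check, not part of the algorithm itself):

# Optional: visualize the raw 2D points before clustering
plt.scatter(X[:, 0], X[:, 1], s=15)
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()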

K-means Algorithm:

K-means Algorithm

In the figure above, the part highlighted in blue is the cluster assignment step and the part highlighted in red is the move centroid step.

Cluster Assignment Step: In this step, we colour each training sample red or blue based on which cluster centroid it is closest to.
c⁽ⁱ⁾ is a value ranging from 1 to K that indicates which cluster centroid the training sample is closest to (i.e., whether it is closer to the red cross or to the blue cross): c⁽ⁱ⁾ := argminₖ ‖x⁽ⁱ⁾ − μₖ‖².

In this formula, we take a training sample x⁽ⁱ⁾ and measure the distance between x⁽ⁱ⁾ and every centroid μₖ, and we pick the value of k that minimizes that distance. Keep in mind that the k used here is lowercase k, which ranges over the individual cluster centroids, while uppercase K denotes the total number of centroids.

You can gain intuition by seeing the figure below.

Cluster assignment step

Move Centroid Step: In this step, for each centroid k from 1 to K, we average the points assigned to that cluster centroid and move the centroid to that average. For a more concrete example:

Suppose you take training samples x⁽¹⁾, x⁽⁵⁾, x⁽⁸⁾, x⁽¹⁰⁾ and they are assigned to the second cluster centroid (i.e., k=2). This is represented as c⁽¹⁾=2, c⁽⁵⁾=2, c⁽⁸⁾=2, c⁽¹⁰⁾=2 (here 2 refers to cluster centroid k=2 and the superscripts refer to the corresponding training examples). In the move centroid step we do the following: take the second cluster centroid, note that the four samples x⁽¹⁾, x⁽⁵⁾, x⁽⁸⁾, x⁽¹⁰⁾ are assigned to it, and move the centroid to the average of those samples.
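Written out, the update for this example is simply the mean of the assigned points:

μ₂ = ¼ (x⁽¹⁾ + x⁽⁵⁾ + x⁽⁸⁾ + x⁽¹⁰⁾)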

Move Centroid step

By repeating these two steps, we can cluster the data into groups. The gif below can help you understand the iterative process of this algorithm.

Source: Topal

Random Initialization:

  1. Make sure K < m (i.e., the total number of centroids must be less than the number of samples in the training set).
  2. Randomly pick K training examples.
  3. Set μ₁, …, μₖ equal to these K examples; that is, each centroid μₖ is initialized to a particular training example.

Random Initialization:

# Random Initialization
def kMeansInit(X, K):
    m, n = X.shape
    centroids = np.zeros((K, n))

    # Pick K distinct training examples at random as the initial centroids
    rand_idx = np.random.permutation(m)[:K]
    for i in range(K):
        centroids[i] = X[rand_idx[i], :]
    return centroids

Cluster assignment step:

# 1. Cluster Assignment Step
# Finding closest centroid (i.e., c⁽ⁱ⁾)
def findClosestCentroids(X, centroids):
    K = centroids.shape[0]
    m = X.shape[0]

    idx = np.zeros((m, 1))
    temp = np.zeros((K, 1))

    for i in range(m):
        for j in range(K):
            # Euclidean distance between example i and centroid j
            temp[j] = np.sqrt(np.sum((X[i, :] - centroids[j, :])**2))
        idx[i] = np.argmin(temp) + 1   # 1-based cluster index c⁽ⁱ⁾
    return idx
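As a quick sanity check of findClosestCentroids, you can feed it a few hand-picked starting centroids (the three points below are arbitrary values chosen purely for illustration):

# Sanity check: assign every example to one of three hand-picked centroids
example_centroids = np.array([[3.0, 3.0], [6.0, 2.0], [8.0, 5.0]])
idx = findClosestCentroids(X, example_centroids)
print(idx[:3].ravel())   # 1-based cluster assignments of the first three examples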

Move centroid step:

# 2. Move Centroid Step
# Computing Centroids (i.e., averaging over assigned examples μₖ)
def computeCentroids(X, idx, K):
    m, n = X.shape
    centroids = np.zeros((K, n))
    count = np.zeros((K, 1))

    for i in range(m):
        index = int((idx[i] - 1)[0])   # convert the 1-based c⁽ⁱ⁾ to a 0-based row index
        centroids[index, :] += X[i, :]
        count[index] += 1
    return centroids / count           # average of the examples assigned to each centroid

Visualization:

# Visualizing K-means on each iteration
def plotKmeans(X, idx, K, centroids, num_iters):
    m, n = X.shape
    fig, axis = plt.subplots(nrows=num_iters, ncols=1, figsize=(6, 36))

    for i in range(num_iters):
        # Plot each cluster in its own colour, plus the current centroids
        color = 'rgb'
        for k in range(1, K+1):
            grp = (idx == k).reshape(m, 1)
            axis[i].scatter(X[grp[:, 0], 0], X[grp[:, 0], 1], c=color[k-1], s=15)
        axis[i].scatter(centroids[:, 0], centroids[:, 1], s=120, marker="x", c="black", linewidth=3)
        title = "Iteration Number: " + str(i)
        axis[i].set_title(title)

        # One K-means step: move the centroids, then reassign the examples
        centroids = computeCentroids(X, idx, K)
        idx = findClosestCentroids(X, centroids)
    plt.tight_layout()
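To see the 2D example end to end, the helpers above can be wired together like this (K = 3 clusters and 10 iterations are values I'm choosing for this illustration):

# Run and visualize K-means on the example 2D dataset
K = 3
num_iters = 10
initial_centroids = kMeansInit(X, K)
idx = findClosestCentroids(X, initial_centroids)
plotKmeans(X, idx, K, initial_centroids, num_iters)
plt.show()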

Image Compression

Load our image:

mat2 = loadmat('bird_small.mat')
A = mat2['A']  # 128x128 RGB image with pixel values in the range 0-255

Run K-means on the given image:

# Normalize pixel values to [0, 1] and reshape into a (num_pixels x 3) matrix
X2 = (A / 255).reshape(128 * 128, 3)
K2 = 16
num_iters = 10

def runKMeans(X, initial_centroids, K, num_iters):
    idx = findClosestCentroids(X, initial_centroids)

    for i in range(num_iters):
        centroids = computeCentroids(X, idx, K)
        idx = findClosestCentroids(X, centroids)
    return centroids, idx

initial_centroids = kMeansInit(X2, K2)
centroids2, idx2 = runKMeans(X2, initial_centroids, K2, num_iters)

Compressing and visualization:

m2, n2 = X2.shape
X2_recovered = X2.copy()

# Replace every pixel with the colour of its assigned centroid
for i in range(1, K2+1):
    X2_recovered[(idx2 == i).ravel(), :] = centroids2[i-1]

# Reshape the recovered pixels back into a 128x128 image
X2_recovered = X2_recovered.reshape(128, 128, 3)

# Display the original and compressed images side by side
fig, ax = plt.subplots(1, 2)
ax[0].imshow(X2.reshape(128, 128, 3))
ax[1].imshow(X2_recovered)

The image on the right above is compressed down to the 16 most dominant colors in the picture. Try changing the value of K2.
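To see why this counts as compression, here is a rough back-of-the-envelope calculation, assuming the original image stores 24 bits (8 bits per RGB channel) for each pixel:

  • Original: 128 × 128 pixels × 24 bits = 393,216 bits.
  • Compressed: a palette of 16 colours × 24 bits, plus a 4-bit index per pixel (since 2⁴ = 16), i.e., 16 × 24 + 128 × 128 × 4 = 65,920 bits.

That is roughly a factor of 6 reduction in storage.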

Conclusion

Today, we looked under the hood of K-means and saw how it actually works. Then we built it from scratch using Python's NumPy, SciPy, and Matplotlib. The dataset and final code are uploaded on GitHub. Try playing with the value of K2, and you can see the outputs of every block there.

Check it out here: K-means.

If you like this post, then check out my other posts in this series about

1. What is Machine Learning?

2. What are the Types of Machine Learning?

3. Uni-Variate Linear Regression

4. Multi-Variate Linear Regression

5. Logistic Regression

6. What are Neural Networks?

7. Digit Classifier using Neural Networks

8. Dimensionality Reduction on Face using PCA

9. Detect Failing Servers on a Network using Anomaly Detection

Last Thing

If you enjoyed my article, a clap 👏 and a follow would be 🤘massively appreciated🤘 and it is helpful for Medium to promote this article so that others can read it. I am Jagajith and I will catch you in the next one.
