K-Means Clustering for Image Classification

Yes! K-Means clustering can be used for image classification on the MNIST dataset. Here's how.

S Joel Franklin
10 min read · Jan 2, 2020
Image by Gerd Altmann from Pixabay

K-means clustering is an unsupervised learning algorithm which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest centroid. The algorithm aims to minimize the squared Euclidean distances between each observation and the centroid of the cluster to which it belongs.
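To make the two steps concrete, here is a minimal NumPy sketch of the standard (full-batch) k-means loop: assign every observation to its nearest centroid, then move each centroid to the mean of its assigned observations. The function name and toy interface are illustrative only, and empty clusters are not handled.

import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Pick k random observations as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: each observation joins the cluster with the nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned observations
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids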

K-Means clustering is not limited to consumer information or population studies; it can be used for image analysis as well. Here we will use K-Means clustering to classify images from the MNIST dataset.

Getting to know the data

The MNIST dataset is loaded from keras.

# Importing the dataset from keras
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

The MNIST dataset is a benchmark dataset in the machine learning community which consists of 28 x 28 pixel images of digits from 0 to 9. Let us get to know more about the dataset.

# Checking the 'type'
print(type(x_train))
print(type(x_test))
print(type(y_train))
print(type(y_test))

All of them are numpy arrays.

# Checking the shape
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

The output is (60000, 28, 28), (10000, 28, 28), (60000,) and (10000,). 'x_train' and 'x_test' consist of 60000 and 10000 monochrome images respectively. The pixel size of each image is 28 x 28.

Every input image has an output which is the number displayed in the image. Thus 'y_train' and 'y_test' are 1-dimensional arrays of length 60000 and 10000.

# Displaying a grid of 3x3 images
import matplotlib.pyplot as plt
plt.gray()  # B/W Images
plt.figure(figsize = (10,9))  # Adjusting figure size
for i in range(9):
    plt.subplot(3,3,i+1)
    plt.imshow(x_train[i])

Initially the images were of different pixel sizes. Through image scaling they have been reduced to a common size of 28 x 28 pixels. Some detail is lost in this reduction, which is why the images appear blurred.

# Printing examples in 'y_train'
for i in range(5):
    print(y_train[i])

The output is 5, 0, 4, 1, 9. 'y_train' and 'y_test' contain digits from 0 to 9 which indicate the number displayed in each image.

Preprocessing the Data

# Checking the minimum and maximum values of x_train
print(x_train.min())
print(x_train.max())

The minimum and maximum values are 0 and 255 respectively. Each pixel is stored as an 8-bit integer, so it takes values from 0 to 255. (In the RGB color space, red, green and blue use 8 bits each, giving 256 * 256 * 256 = 16,777,216 possible colors. Sounds astonishing?)

Since the dataset contains values ranging from 0 to 255, it has to be normalized. Data normalization is an important preprocessing step which ensures that each input parameter (here, a pixel) has a similar data distribution. This speeds up convergence while training the model. Normalization also makes sure that no single parameter influences the output disproportionately.

A common form of normalization is to subtract the mean from each pixel and divide the result by the standard deviation, which gives a distribution resembling a Gaussian curve centered at zero. For image inputs, however, we want the pixel values to stay positive, so here the input is simply divided by 255 so that all values lie in the range [0, 1].
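For reference, the mean/standard-deviation standardization described above would look something like the snippet below (illustrative only; the article itself uses the simpler divide-by-255 scaling that follows).

# Illustrative only: zero-mean, unit-variance standardization of the pixels.
# The article instead uses the simpler x / 255.0 scaling shown below.
x_standardized = (x_train.astype('float32') - x_train.mean()) / x_train.std()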

# Data Normalization
# Conversion to float
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

# Normalization
x_train = x_train/255.0
x_test = x_test/255.0

Now we again check the minimum and maximum values of input.

# Checking the minimum and maximum values of x_train
print(x_train.min())
print(x_train.max())

The minimum and maximum values are 0 and 1 respectively. The input data is in range of [0,1].

The input data has to be reshaped from a 3-dimensional array of shape (samples, 28, 28) into a 2-dimensional array of shape (samples, 784) before it can be fed into the K-Means clustering algorithm.

# Reshaping input data
X_train = x_train.reshape(len(x_train),-1)
X_test = x_test.reshape(len(x_test),-1)

Now let us check the shape of ‘X_train’ and ‘X_test’.

# Checking the shape
print(X_train.shape)
print(X_test.shape)

The output is (60000,784) and (10000,784). (28 x 28 = 784)

Now that preprocessing of the data is done, we move on to building the model with Mini Batch K-Means.

Building the model

Mini Batch K-Means works similarly to the K-Means algorithm. The difference is that in mini-batch k-means the most computationally costly step is conducted on only a random sample of observations as opposed to all observations. This approach can significantly reduce the time required for the algorithm to converge, with only a small cost in quality.
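Conceptually, a single mini-batch update looks something like the sketch below: sample a small batch, assign each sampled point to its nearest centroid, then nudge each centroid towards its points with a per-centroid learning rate that shrinks as that centroid sees more data. This is a simplified plain-NumPy illustration, not scikit-learn's actual implementation, and the function and argument names are made up for the example.

import numpy as np

def minibatch_kmeans_step(X, centroids, counts, batch_size=100, rng=None):
    rng = rng or np.random.default_rng()
    # Sample a random mini-batch of observations
    batch = X[rng.choice(len(X), size=batch_size, replace=False)]
    # Assign each sampled point to its nearest centroid
    dists = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)
    # Move each centroid towards its points with a decaying per-centroid learning rate
    for x, c in zip(batch, nearest):
        counts[c] += 1
        lr = 1.0 / counts[c]
        centroids[c] = (1 - lr) * centroids[c] + lr * x
    return centroids, counts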

import numpy as np
from sklearn.cluster import MiniBatchKMeans

total_clusters = len(np.unique(y_test))

# Initialize the K-Means model
kmeans = MiniBatchKMeans(n_clusters = total_clusters)

# Fitting the model to the training set
kmeans.fit(X_train)

The model has been fit to the training data. Now let us look at kmeans.labels_.

kmeans.labels_

The images are classified into clusters based on similarity of pixel values. Each image is assigned a cluster label value given by kmeans.labels_. So kmeans.labels_ is an array of length 60000 as there are 60000 images in the training set.
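As a quick aside (not part of the original walkthrough), one way to see how the 60000 training images are spread across the 10 clusters is to count the cluster labels:

import numpy as np

# Number of training images assigned to each of the 10 clusters
print(np.bincount(kmeans.labels_))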

But kmeans.labels_ only denotes the cluster to which an image belongs; it doesn't tell us which digit is displayed in the image. Hence we write a separate function to retrieve that information from kmeans.labels_.

def retrieve_info(cluster_labels, y_train):
    '''
    Associates the most probable digit label with each cluster in the KMeans model.
    Returns: dictionary mapping each cluster to a digit label.
    '''
    # Initializing
    reference_labels = {}
    # Loop over each cluster label
    for i in range(len(np.unique(cluster_labels))):
        index = np.where(cluster_labels == i, 1, 0)
        num = np.bincount(y_train[index == 1]).argmax()
        reference_labels[i] = num
    return reference_labels

Let us run the function on the training cluster labels and look at 'reference_labels'.

reference_labels = retrieve_info(kmeans.labels_, y_train)
print(reference_labels)

The output is:
{0: 8, 1: 1, 2: 2, 3: 1, 4: 6, 5: 7, 6: 4, 7: 3, 8: 0, 9: 9} (i.e. a cluster label of 0 is a cluster of images of 8, a cluster label of 1 is a cluster of images of 1, and so on).

It can be seen that cluster labels 1 and 3 both denote clusters of images of 1, and none of the cluster labels denotes a cluster of images of 5. This can be addressed by optimization, which is discussed later in the article.

We run the 'retrieve_info' function and use its output to build 'number_labels', which denotes the number predicted for each image.

reference_labels = retrieve_info(kmeans.labels_, y_train)

# Preallocate an array, then overwrite each entry with the digit label of its cluster
number_labels = np.random.rand(len(kmeans.labels_))
for i in range(len(kmeans.labels_)):
    number_labels[i] = reference_labels[kmeans.labels_[i]]

Now we print the predicted ‘number_label’ and the actual label for the first 20 training examples.

# Comparing Predicted values and Actual values
print(number_labels[:20].astype('int'))
print(y_train[:20])

The output is:
[8 0 9 1 4 2 1 8 1 4 3 1 3 6 1 7 2 8 1 7]
[5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9]

# Calculating accuracy score
from sklearn.metrics import accuracy_score

print(accuracy_score(number_labels, y_train))

The accuracy score is about 55%.

Now let us optimize the algorithm for better results.

Optimizing the Algorithm

The performance of the model is measured by the following three metrics: inertia, homogeneity score and accuracy score.

Inertia is a measure of how internally coherent the clusters are. It is the sum of squared distances between the data points and the centroid of the cluster to which they belong, so a lower inertia means tighter, more coherent clusters. The higher the number of clusters, the lower the inertia.
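As a sanity check, the inertia reported by the model can be recomputed by hand as the sum of squared distances of every training image to its assigned centroid. This small verification sketch assumes the 'kmeans' model and 'X_train' array defined above.

# Recompute inertia by hand: sum of squared distances of each image to its assigned centroid
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = ((X_train - assigned_centroids) ** 2).sum()
print(manual_inertia, kmeans.inertia_)  # the two values should agree closely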

Data points near the boundary between two clusters could plausibly be assigned to either one. Homogeneity measures the extent to which the data points of a particular cluster belong to a single class; a clustering is perfectly homogeneous when every cluster contains images of only one digit.
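A tiny toy example (illustrative labels only) shows what the score captures: it only cares about whether each cluster is pure, not about which cluster id is attached to which class.

from sklearn.metrics import homogeneity_score

# Each cluster contains only one class, so the clustering is perfectly homogeneous
print(homogeneity_score([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
# Each cluster mixes both classes equally, so homogeneity is 0
print(homogeneity_score([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0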

Accuracy score is the percentage of correctly predicted values.

We define a function that calculates metrics for the model.

# Function to calculate metrics for the model
def calculate_metrics(model, output):
    print('Number of clusters is {}'.format(model.n_clusters))
    print('Inertia : {}'.format(model.inertia_))
    print('Homogeneity : {}'.format(metrics.homogeneity_score(output, model.labels_)))

Now we run the model for different values of ‘number of clusters’.

from sklearn import metrics

cluster_number = [10, 16, 36, 64, 144, 256]
for n in cluster_number:
    # Initialize the K-Means model
    kmeans = MiniBatchKMeans(n_clusters = n)
    # Fitting the model to the training set
    kmeans.fit(X_train)
    # Calculating the metrics
    calculate_metrics(kmeans, y_train)
    # Calculating reference_labels and 'number_labels' (the number displayed in each image)
    reference_labels = retrieve_info(kmeans.labels_, y_train)
    number_labels = np.random.rand(len(kmeans.labels_))
    for i in range(len(kmeans.labels_)):
        number_labels[i] = reference_labels[kmeans.labels_[i]]
    print('Accuracy score : {}'.format(accuracy_score(number_labels, y_train)))
    print('\n')

The output is as given below:

Number of clusters is 10
Inertia : 2374596.25
Homogeneity : 0.44577826488898536
Accuracy score : 0.5496

Number of clusters is 16
Inertia : 2242527.25
Homogeneity : 0.5248907780235825
Accuracy score : 0.6086

Number of clusters is 36
Inertia : 1964164.875
Homogeneity : 0.6916427302065263
Accuracy score : 0.7745

Number of clusters is 64
Inertia : 1810550.125
Homogeneity : 0.7442898553384553
Accuracy score : 0.82385

Number of clusters is 144
Inertia : 1632715.875
Homogeneity : 0.8060049496038609
Accuracy score : 0.8646333333333334

Number of clusters is 256
Inertia : 1515926.875
Homogeneity : 0.8384350512732647
Accuracy score : 0.8941

It can be observed that as the number of clusters increases:

1. The inertia decreases, because the sum of squared distances between data points and their respective cluster centroids decreases and the clusters become more internally coherent.

2. The homogeneity score increases, because the clusters become more differentiable and more of the data points within each cluster share a single class label.

3. The accuracy score increases. The reason is discussed later in the article.

The accuracy is highest for ‘number of clusters’ = 256. Hence we run the model on the testing set for number of clusters = 256.

# Testing the model on the testing set with 256 clusters
kmeans = MiniBatchKMeans(n_clusters = 256)
kmeans.fit(X_test)
calculate_metrics(kmeans, y_test)
# Calculating reference_labels and 'number_labels' (the number displayed in each image)
reference_labels = retrieve_info(kmeans.labels_, y_test)
number_labels = np.random.rand(len(kmeans.labels_))
for i in range(len(kmeans.labels_)):
    number_labels[i] = reference_labels[kmeans.labels_[i]]
print('Accuracy score : {}'.format(accuracy_score(number_labels, y_test)))
print('\n')

The output is as given below:

Number of clusters is 256
Inertia : 246126.453125
Homogeneity : 0.8574948835978999
Accuracy score : 0.903

The accuracy scores on the training and testing sets are very similar and close to 90%, which suggests the model is not overfitting the training data and generalizes well to new data.

A question to ponder: why do we need 256 clusters when there are only 10 digits? It is because there can be multiple ways to write a particular number. The orientation and style of writing can differ, and the algorithm views such variations as quite different images. Hence we need more than one cluster to represent the images of a particular digit.

The above point can be verified by visualizing the centroids of each cluster.

Visualization of Cluster Centroids

Each cluster has a centroid which is the most representative point of the cluster. If we can visualize the cluster centroid, we can get an idea of the other images in the cluster.

# Cluster centroids are stored in 'centroids'
centroids = kmeans.cluster_centers_

Let us look at the shape of ‘centroids’.

centroids.shape

'centroids' has shape (256, 784). There are 256 cluster centroids and each centroid has 784 features.

We reshape the centroids from 2 dimensional format to 3 dimensional format so that we can view them as images.

centroids = centroids.reshape(256,28,28)

We had normalized the data. So now we nullify the normalization effect by multiplying by 255.

centroids = centroids * 255

Now let us visualize the cluster centroids.

plt.figure(figsize = (10,9))
bottom = 0.35
for i in range(16):
    plt.subplots_adjust(bottom = bottom)
    plt.subplot(4,4,i+1)
    plt.title('Number:{}'.format(reference_labels[i]), fontsize = 17)
    plt.imshow(centroids[i])

Each image is a cluster centroid image. It can be seen that there are 5 clusters which denote the number 4. The style and orientation of all 5 cluster centroid images are different.

Suppose there are two images A and B which show two different numbers but are written in the same style. If the number of clusters were small, A and B might be clustered together, which decreases the accuracy of the model.

A particular number can be written in different styles and orientations. Increasing the number of clusters helps to assign a separate cluster to each style and orientation. This improves the accuracy of the model.

Predicting number displayed in user input image

The K-Means clustering model runs successfully on the MNIST data with an accuracy of 90%. But can it be used to predict other handwritten images? Let us try to find out.

The image below shows the number 4, created in Microsoft Paint.

The pixel size of the image is 819 x 460. It is reduced to 28 x 28 pixels, like all other images in the MNIST dataset.
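The resizing step itself isn't shown in the article; one possible way to do it with skimage is sketched below. The file name 'number_4_original.jpg' is a placeholder for the full-size Paint image.

# One possible way to shrink the 819 x 460 Paint image down to 28 x 28
# ('number_4_original.jpg' is a placeholder name for the full-size file).
from skimage import io, transform

big = io.imread('number_4_original.jpg')
small = transform.resize(big, (28, 28), anti_aliasing=True)  # float image in [0, 1]
io.imsave('number_4.jpg', (small * 255).astype('uint8'))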

# Reading the image
image = plt.imread('number_4.jpg')
plt.imshow(image)

The image is blurry because it has been reduced from 819 x 460 to 28 x 28.

image.shape

The shape is (28, 28, 3), which suggests it is an RGB image as there are 3 channels. The RGB image is converted to a monochrome image, since all images in the MNIST dataset are monochrome.

# RGB image is converted to Monochrome image
from skimage import color
from skimage import io
image = color.rgb2gray(io.imread(‘number_4.jpg’))

The 'image' is reshaped into a single row vector so that it can be fed into the K-Means clustering algorithm.

# Reshaping into a row vector
image = image.reshape(1,28*28)

The shape of the ‘image’ is (1,784).

The MNIST data set is imported.

# Importing the dataset from keras
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

Let us check the minimum and maximum pixel values of ‘x_train’ and ‘image’.

print(x_train.min())
print(x_train.max())
print(image.min())
print(image.max())

The minimum and maximum pixel values of ‘x_train’ and ‘image’ are (0,255) & (0,0.5) respectively. This suggests normalization is required for ‘x_train’ but not for ‘image’.

# Normalization of 'x_train'
x_train = x_train.astype('float32')
x_train = x_train/255.0

‘x_train’ is reshaped from 3 dimensional format to 2 dimensional format.

# Reshaping of 'x_train'
x_train = x_train.reshape(60000,28*28)

The model is trained.

# Training the model
kmeans = MiniBatchKMeans(n_clusters=256)
kmeans.fit(x_train)

The 'retrieve_info' function is run and processed to get 'number_labels', which denotes the number displayed in each image.

reference_labels = retrieve_info(kmeans.labels_, y_train)
number_labels = np.random.rand(len(kmeans.labels_))
for i in range(len(kmeans.labels_)):
    number_labels[i] = reference_labels[kmeans.labels_[i]]

The cluster into which the image falls is predicted.

predicted_cluster = kmeans.predict(image)

The number in the image is predicted.

number_labels[predicted_cluster]

The output is 4. The model has successfully predicted the number displayed in ‘image’.

Happy Reading!
