Classification of unlabelled images using KMeans, Transfer Learning & CNN

Brice Nicodem Simeu
13 min read · Jun 9, 2023


Introduction

A Convolutional Neural Network (CNN) used for image classification consists of numerous layers that identify various features in the image, such as edges, corners, and more. These extracted features are then passed to a final fully-connected layer, which classifies the objects present in the image.
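To make the idea concrete, here is a minimal, illustrative Keras sketch of such an architecture (the layer sizes here are arbitrary and are not the model used later in this article):

from tensorflow.keras import layers, models

# Illustrative CNN: convolutional layers extract features,
# the final fully-connected (dense) layer classifies them.
toy_cnn = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(3, activation='softmax')  # e.g. three output classes
])
toy_cnn.summary()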

Transfer Learning is a technique that involves utilizing a pre-trained model and leveraging its feature extraction layers, while replacing the final classification layer with a custom fully-connected layer trained on your own specific images. By adopting this approach, you can take advantage of the base model’s feature extraction capabilities, which were initially trained on a more extensive dataset than what you have access to. Consequently, you can develop a classification model tailored to your specific object classes using the knowledge gained from the pre-existing model’s feature extraction training.

For this article, we will be using a dataset of brain cancer images that I downloaded from Kaggle. This dataset is pre-labeled and consists of three main folders: “brain_glioma”, “brain_menin” and “brain_tumor”. Each folder corresponds to a category of brain cancer and contains the associated images. Our initial step involves creating a new folder and transferring all the images from these existing folders into the new one. By doing this, we will have a dataset that contains all the images but lacks labels.

Next, we will apply the K-means algorithm to perform clustering on these images. The clusters generated will assist us in segregating the brain glioma, brain menin and brain tumor images into separate folders, effectively assigning labels to them. Once this process is completed, we will have a dataset with labeled images ready for further use.

Subsequently, we will employ transfer learning techniques to construct our classification model. This entails utilizing the pre-existing knowledge and parameters from a trained model and adapting them to our specific task. By leveraging transfer learning, we can expedite the training process and improve the performance of our model in classifying glioma, menin & tumor images.

So let us get started…

!mkdir images
# Copy all the files in the sub-folders to the images folder.
# Note: replace 'copy' with 'cp' if you are using Linux or macOS
!copy BrainCancer\brain_glioma\* images
!copy BrainCancer\brain_menin\* images
!copy BrainCancer\brain_tumor\* images

Import the relevant libraries

from tensorflow.keras.preprocessing import image
from tensorflow.keras.utils import load_img,img_to_array
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import layers
import tensorflow as tf
from sklearn.cluster import KMeans
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
import os
import shutil
%matplotlib inline

CNN feature extraction

The images have to be compared with each other in order to be grouped together by KMeans, hence we need to extract features from the images. For this purpose, a pre-trained VGG16 model trained on the ImageNet dataset will be used. The classification head of the model is removed by setting the parameter include_top to False in the constructor of the VGG16 class, so the model contains only the convolutional layers needed for feature extraction.

Let us define a feature extractor function to extract image features. This function first loads the image and resizes it to 224 x 224. It then converts the image to a NumPy array, since by default the Keras load_img function opens the image in PIL format, which has to be converted to a NumPy array for further operations. The function adds a batch dimension to the image along the first axis and then applies the VGG16 preprocessing function, preprocess_input. This function performs the same operations that were applied during training of the original VGG16 model. Finally, the preprocessed image is passed through the feature extractor network, and the features are returned as a flattened NumPy array.

from tensorflow.keras.applications.vgg16 import VGG16,preprocess_input

#Define the vgg feature extractor.
vgg_model=VGG16(weights='imagenet',include_top=False)

def extract_features(image_path):
    img = image.load_img(image_path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    predictions = vgg_model.predict(x)
    predictions = predictions.flatten()
    return predictions

Creating lists of file paths and image features

Let’s start by creating a list of the paths of all files in our dataset, which we will use to map the image features back to their corresponding file paths. For instance, if an image feature belongs to cluster 1, we can use this list to identify the file path of that image. We then iterate through this list, extract the features of each image stored at these paths, and append them to a separate list.

#Create lists of features and paths.
features=[]
images_list=os.listdir('./images')
images_list=list(map(lambda x:'./images/'+x,images_list))

for i in tqdm(range(len(images_list))):
    filepath = images_list[i]
    img_features = extract_features(filepath)
    features.append(img_features)

Reducing the dimensionality of the features using PCA

from sklearn import manifold, decomposition
features_array = np.array(features)
print("Shape of the features before the application of the PCA transformation {}".format(features_array.shape))
pca = decomposition.PCA(n_components=0.9)
reduc_features = pca.fit_transform(features_array)
print("Shape of the features after PCA transformation {}".format(reduc_features.shape))

As the printed shapes above show, the features list contains 25088 values (or features) for each instance (image). You can interpret these as coordinates describing each instance’s location in 25088-dimensional space.

Despite our existence in a three-dimensional world, we can use a mathematical approach known as Principal Component Analysis (PCA) to examine the relationships between features and condense each observation into coordinates along 2486 principal components. In simpler terms, we have converted the 25088-dimensional feature values into 2486-dimensional coordinates, allowing us to summarise and analyse the data in a more manageable format.
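If you want to see where the 2486 comes from, a quick plot of the cumulative explained variance ratio of the fitted pca object (a minimal sketch, using only objects defined above) shows how the 0.9 threshold translates into a number of components:

# Cumulative explained variance of the fitted PCA: the curve crosses
# the 90% line at the number of components retained by n_components=0.9.
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
plt.figure(figsize=(8, 5))
plt.plot(cumulative_variance)
plt.axhline(y=0.9, color='r', linestyle='--', label='90% of variance')
plt.xlabel('Number of principal components')
plt.ylabel('Cumulative explained variance')
plt.legend()
plt.show()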

Clustering of Images

We know that the images we have belong to three categories, i.e. brain glioma, brain menin & brain tumor; hence, let us set the number of clusters to k=3.

Having defined the number of clusters we can fit the KMeans algorithm on these feature images.

k = 3
clusters = KMeans(n_clusters=k)
clusters.fit(reduc_features)
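In our case k=3 comes from the three known folders, but if the number of categories were truly unknown, a common heuristic is the elbow method: fit KMeans for several values of k and look for the point where the inertia stops decreasing sharply. A minimal sketch, assuming reduc_features from the PCA step above:

# Elbow method: plot the KMeans inertia for a range of candidate k values.
inertias = []
k_values = range(2, 10)
for k_candidate in k_values:
    km = KMeans(n_clusters=k_candidate, random_state=42)
    km.fit(reduc_features)
    inertias.append(km.inertia_)

plt.plot(list(k_values), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.title('Elbow method')
plt.show()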

Creation of a cluster table of file paths

On the first line below we create a table from images_list, the file path list built above. We then use the labels_ attribute of the KMeans class to access the list of cluster labels obtained by fitting the model on the image features. The labels have the same index order as the features list, and hence as the file path list. On the second line we create another column, Cluster_label, and assign these labels to it.

df_results = pd.DataFrame(images_list,columns=['image_file'])
df_results["Cluster_label"] = clusters.labels_
df_results.sample(10)

Let’s take a look at the result of KMeans

for cluster in sorted(df_results.Cluster_label.unique()):
    ls_filename = df_results[df_results['Cluster_label'] == cluster]['image_file']
    glioma = 0
    menin = 0
    tumor = 0
    other = 0
    for filename in ls_filename:
        if 'glioma' in filename:
            glioma += 1
        elif 'menin' in filename:
            menin += 1
        elif 'tumor' in filename:
            tumor += 1
        else:
            other += 1
    print("******************* cluster : {} *******************".format(cluster))
    print('glioma :', glioma)
    print('menin  :', menin)
    print('tumor  :', tumor)
    print('other  :', other)

From the above, it is evident that most of the time, tumor images have been assigned a cluster label of 0, menin images a label of 1 and glioma images a label of 2.

Unfortunately, the results obtained by the KMeans algorithm are not satisfactory due to the significant overlap between the clusters, particularly clusters 0 and 1. Note that an approximately equal number of images from the tumor and menin classes were assigned to those two clusters. Consequently, the folders we build from the generated clusters will not be reliable for the rest of our case study.

Let us confirm this by calculating the adjusted Rand score.

df_results['class'] = np.where(
    df_results['image_file'].str.contains("tumor"), "brain tumor",
    np.where(df_results['image_file'].str.contains("menin"), "brain menin",
             np.where(df_results['image_file'].str.contains("glioma"), "brain glioma", 'unknown')))

df_results['real_label'] = np.where(
    df_results['class'].str.contains("brain tumor"), 0,
    np.where(df_results['class'].str.contains("brain menin"), 1,
             np.where(df_results['class'].str.contains("brain glioma"), 2, -1)))

df_results.sample(20)

from sklearn.metrics.cluster import adjusted_rand_score
adjusted_rand_score(list(df_results.real_label), list(df_results.Cluster_label))

The score of roughly 0.1 sits on a scale where 0 corresponds to the agreement expected by chance and 1 to a perfect match with the true labels, so our clustering is only marginally better than a random assignment, which is not very encouraging.
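To get a feel for the scale of this metric, here is a tiny toy example:

# Quick sanity check of the metric's scale: 1.0 for a perfect match
# (even with renamed cluster ids), and values around or below 0 for
# a grouping that is no better than chance.
print(adjusted_rand_score([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1]))  # 1.0
print(adjusted_rand_score([0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]))  # <= 0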

Despite this poor result, we will continue our study until the end. The process is more important here than the results. :-)

Note

Several reasons could explain the low accuracy of our grouping:

  • Images of poor quality
  • Perhaps the pre-trained model we used is not the best candidate

Possible ways of improving:

  • Use a different transfer learning model, MobileNetV2 for example (see the sketch just after this list)
  • Convert the images to shades of gray when extracting features by setting the color_mode parameter of the tf.keras.utils.load_img function to 'grayscale'
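For illustration, here is a minimal sketch of the first suggestion: the same feature extractor as before, but with MobileNetV2 as the backbone. Note that each Keras application ships its own preprocess_input, so it must be swapped along with the model:

from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input as mobilenet_preprocess

# MobileNetV2 backbone without its classification head.
mobilenet_model = MobileNetV2(weights='imagenet', include_top=False)

def extract_features_mobilenet(image_path):
    img = image.load_img(image_path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = mobilenet_preprocess(x)  # MobileNetV2-specific preprocessing
    return mobilenet_model.predict(x).flatten()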

Distribution of data before and after assignment by KMeans.

Let’s first use t-SNE to reduce the dimension of our features to two.

import seaborn as sns
from sklearn import manifold, decomposition

tsne = manifold.TSNE(n_components=2, perplexity=30, n_iter=2000, init='random', random_state=6)
features_embedded = tsne.fit_transform(reduc_features)


df_tsne = pd.DataFrame(features_embedded, columns=['dim1', 'dim2'])
print(df_results.shape)

Before …

df_tsne["class"] = df_results["class"]

plt.figure(figsize=(8,5))
sns.scatterplot(x="dim1", y="dim2", hue="class", data=df_tsne, legend="brief")
plt.title('Original Data')
plt.xlabel('dim1')
plt.ylabel('dim2')
plt.legend(prop={'size': 12})

plt.axis('off')
plt.show()

After KMeans assignment

df_tsne['Cluster_label'] = df_results["Cluster_label"] 

plt.figure(figsize=(8,5))
sns.scatterplot(x="dim1", y="dim2", hue="Cluster_label", data=df_tsne, legend="brief")
plt.title('Cluster Assignments')
plt.xlabel('dim1')
plt.ylabel('dim2')
plt.legend(prop={'size': 12})

plt.axis('off')
plt.show()

As a reminder : 0 : tumor, 1 : menin & 2 : glioma

Using cluster labels to separate the images into separate folders

As mentioned before, tumor images are marked with 0, menin images with 1 and glioma images with 2. Using that information, let’s split the image path data frame into three data frames: tumor_img_df, menin_img_df and glioma_img_df. We then further divide these three tables into train and test sets.

tumor_img_df = df_results[df_results.Cluster_label==0]
menin_img_df = df_results[df_results.Cluster_label==1]
glioma_img_df = df_results[df_results.Cluster_label==2]

# Split the dataset into train and test.

train_tumor_img = tumor_img_df.iloc[0:int(0.9*len(tumor_img_df))]
test_tumor_img = tumor_img_df.iloc[int(0.9*len(tumor_img_df)):]

train_menin_img = menin_img_df.iloc[0:int(0.9*len(menin_img_df))]
test_menin_img = menin_img_df.iloc[int(0.9*len(menin_img_df)):]

train_glioma_img = glioma_img_df.iloc[0:int(0.9*len(glioma_img_df))]
test_glioma_img = glioma_img_df.iloc[int(0.9*len(glioma_img_df)):]

Generating a new dataset by utilizing the above data frames

Let’s create the directories and copy the images into their respective folders. For example, images in train_glioma_img will be copied into the glioma folder of the ./dataset/train directory.

"""Create a new dataset folder having following below structure:
dataset
|
train
- tumor
- menin
- glioma
test
- tumor
- menin
- glioma

and copy the images to these folders.
"""

if os.path.exists('./dataset') is False:
    os.mkdir('dataset')

def create_train_test_folders(folders_ls):
    for folder in folders_ls:
        if os.path.exists('./dataset/' + folder) is False:
            os.mkdir('dataset/' + folder)

# Call the function
folders_ls = ['train', 'test']
create_train_test_folders(folders_ls)

def create_sub_train_test_folders(level, sub_folders_ls):
    for sub_folder in sub_folders_ls:
        if os.path.exists('./dataset/' + level + '/' + sub_folder) is False:
            os.mkdir('dataset/' + level + '/' + sub_folder)

# Call the function
sub_folders_ls = ['tumor', 'menin', 'glioma']
create_sub_train_test_folders('train', sub_folders_ls)
create_sub_train_test_folders('test', sub_folders_ls)



# Copy the image files to the dedicated directories.
def copy_images_to_dedicated_train_test_sub_folder(level, images_df, folder_name):
    for filepath in images_df['image_file']:
        filename = filepath.split('/')[-1]
        shutil.copyfile(filepath, './dataset/' + level + '/' + folder_name + '/' + filename)

# Copy train images
copy_images_to_dedicated_train_test_sub_folder('train', train_tumor_img, sub_folders_ls[0])
copy_images_to_dedicated_train_test_sub_folder('train', train_menin_img, sub_folders_ls[1])
copy_images_to_dedicated_train_test_sub_folder('train', train_glioma_img, sub_folders_ls[2])

# Copy test images
copy_images_to_dedicated_train_test_sub_folder('test', test_tumor_img, sub_folders_ls[0])
copy_images_to_dedicated_train_test_sub_folder('test', test_menin_img, sub_folders_ls[1])
copy_images_to_dedicated_train_test_sub_folder('test', test_glioma_img, sub_folders_ls[2])

Training a classification model on the newly created labeled dataset

This section will cover the training of a classification model on the original labeled dataset (BrainCancer), not the one generated through clustering, since the clustering results did not meet expectations. Transfer learning will be used to train the model, as it is expected to yield better outcomes.

Let us first display a random image from each category of brain cancer images.

import matplotlib.image as mpimg
import random

# The images are in the BrainCancer folder
data_path = './BrainCancer'

# Get the brain cancer class names
classes = os.listdir(data_path)
classes.sort()
print('{} classes: {}'.format(len(classes), classes))

# Show a random image from each category of cancer
fig = plt.figure(figsize=(20, 35))
i = 0
for sub_path in os.listdir(data_path):
    i += 1
    img_file = random.choice(os.listdir(os.path.join(data_path, sub_path)))
    img_path = os.path.join(data_path, sub_path, img_file)
    img = mpimg.imread(img_path)
    a = fig.add_subplot(1, len(classes), i)
    a.axis('off')
    imgplot = plt.imshow(img)
    a.set_title('Class : ' + sub_path)
plt.show()

Prepare the data

The pretrained model has many layers, beginning with convolutional layers that perform the feature extraction on the image data.

For feature extraction to work with our own images, we need to ensure that the image data we use to train our prediction layer has the same number of features (pixel values) as the images originally used to train the feature extraction layers, so we need data loaders for color images that are 224x224 pixels in size.

TensorFlow includes functions for loading and transforming data. We’ll use these to create generators for the training and validation data.

It’s worth noting that preprocess_input, imported from the ResNet module of Keras applications, is used as the preprocessing function. Applying the same transformation that was used for ImageNet classification improves the accuracy of the model.

from tensorflow.keras.applications.resnet import preprocess_input

batch_size = 50
pretrained_size = (224, 224)

# Define the data generator.
print("Getting Data...")
datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    validation_split=0.2)  # hold back 20% of the images for validation

train_generator = datagen.flow_from_directory(
    data_path,
    target_size=pretrained_size,  # resize to match the model's expected input
    batch_size=batch_size,
    class_mode='categorical',
    subset='training')  # set as training data

validation_generator = datagen.flow_from_directory(
    data_path,
    target_size=pretrained_size,  # resize to match the model's expected input
    batch_size=batch_size,
    class_mode='categorical',
    subset='validation')  # set as validation data

classnames = list(train_generator.class_indices.keys())
print("class names: ", classnames)

Prepare the base model

In order to use transfer learning, we require a pre-trained base model whose feature extraction layers can be reused. ResNet is a convolutional neural network image classifier that has been pre-trained on a vast dataset of 3-channel color images of 224 x 224 pixels. We’ll instantiate the model with pre-trained weights while excluding its final prediction layer (include_top=False).

from tensorflow.keras import Model

base_model = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
# print(base_model.summary())

Creating a prediction layer

After downloading the ResNet model without its final prediction layer, we need to merge it with a fully connected layer (also called a dense layer). This dense layer will take the flattened outputs of the ResNet feature extraction layers and generate predictions for each class of brain cancer image.

In order to preserve the learned weights, we must freeze the feature extraction layers. As a result, when we train our model using our own images, only the final prediction layer will learn new weight and bias values. The pre-trained weights for feature extraction will remain unchanged.

from tensorflow.keras.layers import Flatten, Dense

# Freeze the already-trained layers in the base model
for layer in base_model.layers:
    layer.trainable = False

# Create the prediction layer for classification of our images
x = base_model.output
x = Flatten()(x)
prediction_layer = Dense(len(classnames), activation='softmax')(x)
model = Model(inputs=base_model.input, outputs=prediction_layer)

# Compile the model
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Print the complete model, consisting of the layers from the ResNet model
# and the newly added dense layer.
print(model.summary())

Fitting the model

With the layers of the CNN defined, we’re ready to train it using our image data. The weights in the feature extraction layers from the base ResNet model will not be changed by training; only the final dense layer that maps the features to our brain cancer classes will be trained.

num_epochs = 3
history = model.fit(
    train_generator,
    steps_per_epoch=train_generator.samples // batch_size,
    validation_data=validation_generator,
    validation_steps=validation_generator.samples // batch_size,
    epochs=num_epochs)

Loss history

We kept track of the average training and validation loss history for each epoch. We can plot these values to confirm that the loss decreased as the model underwent training and also to identify overfitting. Overfitting is characterized by a sustained drop in training loss, even after the validation loss has leveled off or started to increase.
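Keras stores these per-epoch values in the History object returned by model.fit, so a minimal sketch for the plot looks like this:

# Plot training vs. validation loss per epoch from the History object.
epoch_nums = range(1, num_epochs + 1)
plt.plot(epoch_nums, history.history['loss'], label='training loss')
plt.plot(epoch_nums, history.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()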

Model performance

Although we can observe the final accuracy based on the test data, it is often better to analyse the performance measures in more detail. To assess the effectiveness of the model in predicting each class, we can generate a confusion matrix by plotting the predicted labels against the actual labels.

# Since Tensorflow does not provide a pre-built confusion matrix metric, we will utilize SciKit-Learn instead.
import numpy as np
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
%matplotlib inline

print("Generation of predictions from validation data...")
# Get the image and label arrays for the first batch of validation data
x_test = validation_generator[0][0]
y_test = validation_generator[0][1]

# Use the model to predict the class
class_probabilities = model.predict(x_test)

# The model returns a probability value for each class
# The one with the highest probability is the predicted class
predictions = np.argmax(class_probabilities, axis=1)

# The actual labels are one-hot encoded
true_labels = np.argmax(y_test, axis=1)

# Plot the confusion matrix
cm = confusion_matrix(true_labels, predictions)
plt.imshow(cm, interpolation="nearest", cmap=plt.cm.Blues)
plt.colorbar()
tick_marks = np.arange(len(classnames))
plt.xticks(tick_marks, classnames, rotation=85)
plt.yticks(tick_marks, classnames)
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
plt.show()

The confusion matrix should show a strong diagonal line indicating that there are more correct than incorrect predictions for each brain cancer class.

from sklearn.metrics import classification_report, f1_score
print('\n|===> f1 score: {}\n'.format(f1_score(true_labels, predictions, average='micro')))
print(classification_report(true_labels, predictions))

Testing the prediction on some custom data

def brain_cancer_class_prediction(path, images):
    class_mapping = {0: 'brain glioma', 1: 'brain menin', 2: 'brain tumor'}
    for image_file in images:  # renamed from 'image' to avoid shadowing the Keras image module
        img = tf.keras.preprocessing.image.load_img(path + image_file, target_size=(224, 224))
        input_arr = tf.keras.preprocessing.image.img_to_array(img)
        p_img = preprocess_input(input_arr)  # the ResNet preprocess_input imported earlier
        p_img = np.expand_dims(p_img, axis=0)
        pred = model.predict(p_img).argmax()
        print('image: {} \t Prediction: {} ===> Class : {}'.format(image_file, pred, class_mapping[int(pred)]))

test_path = './test/'
ls_images_file = os.listdir(test_path)

brain_cancer_class_prediction(test_path, ls_images_file)

As we can see, out of the 15 images in our test sample, 14 were predicted correctly (only brain_menin_3996.jpg was incorrectly predicted), which confirms that our model performs well.


Brice Nicodem Simeu

Engineer, Data Scientist & ML enthusiast. Combining mathematics and computer science to exploit data is above all a passion for me.