Image Classification with ResNet (PyTorch)

One secret to better results is cleaning data!

Ang Li-Lian
10 min read · Oct 17, 2022
Image from Oliver Paaske at Unsplash

The aim of this article is to experiment with implementing different image classification neural network models. I will explain some of the best practices I learned during this exploration so you can train an accurate image classifier too!

I will use the images from Kaggle’s Rock Classification Dataset by:

  • Cleaning the image dataset to improve training quality
  • Augmenting images to fix class imbalance
  • Training and comparing ResNet and VGG models

You can find the full code on my GitHub repository, written in Python.

Cleaning image dataset

First, I ensure all the images in the dataset are image files that can actually be read, so we don’t run into problems later in the analysis (one way to check this is sketched after the list below). The original folder has one additional layer of nesting, which I manually removed:

  • Metamorphic: Marble, Quartzite
  • Igneous: Basalt, Granite
  • Sedimentary: Coal, Limestone, Sandstone
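
A quick readability check could look something like the following sketch; the dataset path is a placeholder for wherever the Kaggle folders were extracted.

```python
from pathlib import Path
from PIL import Image

DATA_DIR = Path("data/rocks")  # hypothetical location of the extracted dataset

unreadable = []
for path in DATA_DIR.rglob("*"):
    if path.is_file():
        try:
            with Image.open(path) as img:
                img.verify()  # raises an exception if the file is not a readable image
        except Exception:
            unreadable.append(path)

print(f"Found {len(unreadable)} unreadable files")
```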

Second, I take a look at the sample of images available for each rock type to gain an understanding of the type of images the model will need to work with. After browsing the images, I noticed several challenges with the dataset which will require some preprocessing:

  1. Questionable data quality: some images are not of rocks at all, or I doubt they show the stated rock type
  2. Not all images are photographs; some are illustrations
  3. Some images are combinations of several different images, which might have been better split into individual images
  4. Duplicated images, which distort the dataset’s rock type distribution
Sample coal rock images. [Image from author]

With a dataset of this size, it would not take an absurd amount of time to manually go through all the images and select those of the correct quality. However, since this project is a learning opportunity I will attempt to algorithmically remove problematic images.

Removing duplicate images

From observation, limestone has many visible duplicates, so I calibrated the sensitivity of the duplicate-removal algorithm on this folder. I then display every duplicate before removing it to minimise errors. This is feasible because the dataset is relatively small, and it is worth doing even though the task is menial.

An image is a 2D grid of pixels, and the colour of each pixel is represented by three numbers giving the intensities of red, green and blue. A duplicated image has exactly the same pixels as the original, which means we can compare each pair of images pixel by pixel. If the difference between two images is very small, we can say they are duplicates. This method is called Mean Squared Error (MSE) because that is exactly what is calculated between each pair of images.
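
A minimal sketch of this pairwise comparison (the file paths and the common resize size are placeholders):

```python
import numpy as np
from PIL import Image

def mse(path_a, path_b, size=(250, 250)):
    """Mean squared error between two images, with pixel values scaled to [0, 1]."""
    a = np.asarray(Image.open(path_a).convert("RGB").resize(size), dtype=np.float64) / 255.0
    b = np.asarray(Image.open(path_b).convert("RGB").resize(size), dtype=np.float64) / 255.0
    return float(np.mean((a - b) ** 2))

# Identical images give an MSE of 0; pairs below a calibrated threshold are treated as duplicates.
# duplicate = mse("limestone/img_01.jpg", "limestone/img_02.jpg") < threshold
```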

Pairwise MSE for all images in limestone. [Image from author]

The MSE is close to zero for most pairs of images, most likely because many of them have a white background and limestone rocks are similar in colour.

If I take the MSE of a pair of images I know is a duplicate, I can have a benchmark for how the MSE should be for a duplicate image.

I identified a pair of duplicated images and got an MSE of 0. [Image from author]

The following two images are the same, but their scaling is different, which increases the MSE. Based on this, I set the threshold for flagging a duplicate to be within the same order of magnitude as that MSE.

threshold = 0.001

I identified two copies of the same image at different scales, (250, 250) and (500, 500), which give an MSE of 0.000771.

Using this threshold, I removed 39 images.

Comparing other methods of computing image similarity

There are other image similarity algorithms I could have chosen that better capture perceptual similarity. For example:

- skimage structural_similarity

- OpenCV’s findHomography

However, the duplicates that concern me in this dataset are straightforward, literal copy-pasted images. Images that merely look similar are fine, since I will perform some image augmentation before model training anyway. The goal of removing the duplicates is to get an accurate picture of the rock type distribution, so we can better decide how to handle class imbalance.

Since I introduced two other image similarity algorithms, I will compare the performance of skimage’s structural_similarity algorithm with MSE.

Pairwise similarity for all images in limestone folder. [Image from author]

The duplicated images give a similarity score of 1, while the rescaled images give a score of 0.957. I used the second score to set the threshold at 0.9, which identified the same duplicate images!
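
For reference, a rough equivalent using skimage; converting to greyscale is my simplification, and the paths are placeholders.

```python
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity

def ssim_score(path_a, path_b, size=(250, 250)):
    """Structural similarity between two images resized to a common size (1.0 = identical)."""
    a = np.asarray(Image.open(path_a).convert("L").resize(size))
    b = np.asarray(Image.open(path_b).convert("L").resize(size))
    return structural_similarity(a, b, data_range=255)

# Pairs scoring above the benchmark threshold of 0.9 are treated as duplicates.
# duplicate = ssim_score("limestone/img_01.jpg", "limestone/img_02.jpg") > 0.9
```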

The rock type distribution does not change drastically after removing the duplicates. We still have an imbalanced dataset, with significantly fewer images for basalt and granite.

Distribution of sample images for each rock type before and after removing duplicates. [Image from author]

Manual cleaning

I will remove the rest of the images manually. Here are the images I will remove and the reasoning behind my choices.

  • Coal: statues, parts of homes and buildings, old photographs
  • Quartzite: rock is only a small portion of the image (e.g. text, other objects), images of cells, houses
  • Sandstone: buildings

I was not sure whether to remove images with text at the forefront, but there are similar types of images in all the folders, so I erred on the side of caution and chose to keep them.

The distribution doesn’t change drastically here either but at least the data quality will be better.

Distribution of sample images for each rock type before and after manual cleaning. [Image from author]

Image Augmentation

As we want the model to learn to discern patterns from a diverse set of images, we will increase the number of images by augmenting them with a combination of the following (a sample transform pipeline follows the list):

  • Randomly rotating images
  • Converting images to grey scale
  • Randomly cropping images
  • Adding blur
  • Changing image saturation and brightness
  • Flipping images
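
One way to express these augmentations with torchvision; the parameter values here are illustrative rather than the exact ones used in the project.

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=30),                    # randomly rotate
    transforms.RandomGrayscale(p=0.1),                        # occasionally convert to greyscale
    transforms.RandomResizedCrop(224),                        # randomly crop and resize
    transforms.GaussianBlur(kernel_size=3),                   # add blur
    transforms.ColorJitter(brightness=0.3, saturation=0.3),   # change brightness and saturation
    transforms.RandomHorizontalFlip(),                        # flip images
])
```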

The basalt and granite rock types, which have the lowest number of images, will have roughly 4 augmented versions of each image. This number of copies is reasonable considering the different augmentations we will apply.

Example of how an image might be augmented. The image on the left is the original. [Image from author]

Train, validation and test data

First, I hold out 20 images for parameter tuning (validation) and 20 images for testing the model’s accuracy. Then, I increase the sample size of each class to match the class with the largest number of images (442). Now there is no more class imbalance, but classes that started with small samples contain many similar images.

Number of images for each rock type in the test, train and validation set. [Image from author]
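
A rough sketch of that split-and-oversample logic, assuming one list of file paths per class and that the 20-image hold-outs are taken per class; the helper function and its exact mechanics are mine, not from the repository.

```python
import random

def split_class(image_paths, largest_class_size=442, n_val=20, n_test=20):
    """Hold out validation/test images, then oversample the rest up to the largest class size."""
    random.shuffle(image_paths)
    val = image_paths[:n_val]
    test = image_paths[n_val:n_val + n_test]
    train = image_paths[n_val + n_test:]
    # Repeat training images until the class matches the largest class;
    # each repeat later receives a different random augmentation.
    while len(train) < largest_class_size:
        train.append(random.choice(train))
    return train, val, test
```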

Training Image Classification Model

After resizing and cropping all images so they will be the same size, I loaded the training, validation and testing data.
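
A sketch of this loading step with torchvision’s ImageFolder; the folder names, crop size and batch size are assumptions, and the normalisation uses the usual ImageNet statistics.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),   # all images end up the same size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

train_data = datasets.ImageFolder("data/train", transform=preprocess)
val_data = datasets.ImageFolder("data/val", transform=preprocess)
test_data = datasets.ImageFolder("data/test", transform=preprocess)

train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
val_loader = DataLoader(val_data, batch_size=32)
test_loader = DataLoader(test_data, batch_size=32)
```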

Testing image classification models

There are several image classification models out there, trained on various datasets and built with different architectures. Each model has learned feature representations (embeddings) that help classify images into specific classes. I will fine-tune these models at their final layer so they output the probability of each image belonging to each class. Each architecture comes in multiple versions, denoted by the number of layers. More layers generally mean better accuracy but also longer training time. I have also limited training to 5 epochs here because of time constraints.

I will compare the performance of ResNet and VGG.

ResNet

ResNet is a convolutional neural network (CNN) designed to scale up the number of layers in deep neural networks. As a network gains more layers, we expect its performance to increase, but the model also becomes more complex because there are more parameters to tune. However, the gain in accuracy shrinks as the model grows more complex. ResNet instead uses skip connections so that each block only learns the difference (residual) relative to its input, which keeps very deep networks trainable. The result is a better-performing model without a proportional increase in complexity.

I will use the ResNet version with 50 layers. This video does an excellent job of explaining the paper introducing ResNet if you are keen to learn more.

First, I will freeze the model weights for the top layers since we want to utilise the existing architecture.

Second, I will replace the final layer of the model with two layers to fine-tune it: (1) a fully connected layer to convert the outputs to the correct dimensions, feeding into (2) a log softmax layer that gives the probability of the image being each rock type. I use softmax because I want the probability of a rock belonging to each class rather than an integer telling me whether the rock is in a certain class or not.
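
Those two steps look roughly like this; the seven classes come from the rock types listed earlier.

```python
import torch.nn as nn
from torchvision import models

model = models.resnet50(pretrained=True)

# Freeze the pre-trained weights so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer: a fully connected layer mapping to the rock classes,
# followed by LogSoftmax so the outputs are log-probabilities.
num_classes = 7  # marble, quartzite, basalt, granite, coal, limestone, sandstone
model.fc = nn.Sequential(
    nn.Linear(model.fc.in_features, num_classes),
    nn.LogSoftmax(dim=1),
)
```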

Third, I will set the loss function and optimiser which define how the model will learn.

Loss function

The loss function defines the score for each training round. In linear regression, the loss function is typically the residual or mean squared error, which shows how wrong each prediction is. MSE’s penalty grows steadily with how wrong a prediction is, whereas the negative log-likelihood loss I will use penalises confidently wrong predictions far more steeply.

For example, suppose an image’s true label is `Granite` and the model gives it a probability y of being granite. Its loss score is then -log(y). The closer the probability is to the true score of 1, the lower the loss, meaning a smaller penalty. But the penalty grows sharply as the probability moves away from the true score towards 0.

Change in loss score using a negative log-likelihood. [Image from author].
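
A tiny numeric illustration of this penalty with PyTorch’s NLLLoss; the class index and probabilities are made up.

```python
import torch
import torch.nn as nn

criterion = nn.NLLLoss()    # expects log-probabilities, hence the LogSoftmax layer above
target = torch.tensor([1])  # pretend index 1 is Granite

confident_right = torch.log(torch.tensor([[0.05, 0.90, 0.05]]))
confident_wrong = torch.log(torch.tensor([[0.85, 0.10, 0.05]]))

print(criterion(confident_right, target))  # -log(0.90) ≈ 0.105 (small penalty)
print(criterion(confident_wrong, target))  # -log(0.10) ≈ 2.303 (large penalty)
```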

Optimiser

The optimiser determines how the model uses the calculated loss to learn. Once you know how good/bad your results are, what do you do next to find the lowest possible loss?

In a linear regression problem like y = mx + c, this might mean figuring out how much to change m and c to get the lowest possible MSE, preferably in the fewest steps. Typically, we set a learning rate that says how much we should change the weights each time. We might set an optimiser with a variable learning rate that makes bigger changes to the weights when the loss is high and smaller changes when the loss is low.

I chose the Adam optimiser which is known for relatively fast convergence and is a popular choice in the field.
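
Setting it up only needs the parameters of the new head from the earlier sketch, since everything else is frozen; the learning rate is illustrative.

```python
import torch.optim as optim

# Only the new final layer requires gradients, so it is all the optimiser needs to update.
optimizer = optim.Adam(model.fc.parameters(), lr=0.001)
```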

You can watch a more in-depth video explaining optimizers here.

Now that I’ve set up the model, I can train it. Typically, you would train the model until the loss plateaus, which means the model has converged on a minimum (not necessarily a global one) and will not improve further. However, due to time constraints, I will do a maximum of 50 epochs.
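
Putting the pieces from the previous sketches together, a minimal training loop might look like this:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
criterion = nn.NLLLoss()

for epoch in range(50):
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"Epoch {epoch + 1}: training loss {running_loss / len(train_loader):.3f}")
```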

Training and validation loss over each epoch on ResNet50 model with 50 epochs. The final validation accuracy is 0.621. [Image from author]

It is worth noting that the data cleaning phase made a *big* difference to our results. Before cleaning, the training accuracy remained around 45%-55% with 100 epochs. Now, the training accuracy beats that of another baseline on Kaggle with only 50 epochs!

If I trained this model for longer, it is likely that we will get a better accuracy since the training loss has not plateaued yet!

Training and validation loss over each epoch on ResNet50 model with 100 epochs. The final validation accuracy is 0.578. [Image from author]

VGG

VGG is a CNN pre-trained on over a million images from ImageNet, a database of labelled images spanning 1,000 categories. I am using the version with 16 layers.

As with the ResNet model, I will freeze the top layers and add two layers at the bottom to fine-tune it. I will use the same loss function and optimizer as for the ResNet model and train it for 50 epochs as well.
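
The same recipe applied to VGG16’s classifier head, again assuming seven rock classes:

```python
import torch.nn as nn
from torchvision import models

vgg = models.vgg16(pretrained=True)

# Freeze the pre-trained weights so only the new head is trained.
for param in vgg.parameters():
    param.requires_grad = False

# VGG16's final classifier layer sits at index 6; swap it for the small fine-tuning head.
vgg.classifier[6] = nn.Sequential(
    nn.Linear(vgg.classifier[6].in_features, 7),
    nn.LogSoftmax(dim=1),
)
```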

Training and validation loss over each epoch on VGG16 model with 50 epochs. The final validation accuracy is 0.636. [Image from author]

The VGG16 results are less accurate than ResNet50’s, and its loss appears to be plateauing, which is another signal that it does not perform as well as the ResNet50 model.

Testing

The moment of truth! I will test with the ResNet50 model.
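
An evaluation sketch along those lines, reusing the model and loaders from the earlier sketches; scikit-learn’s classification_report produces the kind of per-class table shown below.

```python
import torch
from sklearn.metrics import classification_report

model.eval()
all_preds, all_labels = [], []
with torch.no_grad():
    for images, labels in test_loader:
        images = images.to(device)
        preds = model(images).argmax(dim=1).cpu()  # most probable class per image
        all_preds.extend(preds.tolist())
        all_labels.extend(labels.tolist())

print(classification_report(all_labels, all_preds, target_names=test_data.classes))
```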

There is markedly better performance for the classes with more unique images (i.e. a higher initial sample size). This makes sense since the model is trained on a wider variety of images, giving it more opportunities to pick out salient features for each class. Before I cleaned the images, I ran the ResNet50 model for 100 epochs, which yielded a training accuracy barely grazing 60%. This model is about 5% better, with 65% overall accuracy.

Classification report for the ResNet50 model on the test data. [Image from author]

The VGG16 model has comparable performance but does not show the same capacity for improvement with longer training.

Classification report for the VGG16 model on the test data. [Image from author]

So how might we improve the accuracy of the results?

The results show what a big difference cleaning the image data makes which really lends credence to the data science saying:

Garbage in, garbage out.

As always, there are ways to make the model better!

  • Source more images to increase training size
  • Increase the training size by adding augmented images for all classes
  • Experiment with adding more layers to the model
  • Add more epochs (increase training time) to find the minimum loss
  • Increase the depth of the ResNet model

In a future article, Eric Chang will build a mobile app to identify rock types from captured photos using this trained model. I might also write another article on applying improvements to the current model!

I hope this has given you a good idea of how you might build and train an image classifier. If you have any suggestions on best practices or ways to improve the current process, please let me know!
