How the nature of images affects what deep learning models learn from them

Attila Balogh
12 min read · Jul 6, 2020


A PyTorch: Zero to GANs course project blog post

Original image by Boudewijn “Bo” Boer

Image classification is an interesting and widely used application of deep learning. There are many image datasets, and even more techniques for implementing image classification, to be found on the Internet. In this project my goal wasn't to build the world's greatest classifier; instead, I tried to find out how the nature of an image influences the performance of a classifier.

In this project I used PyTorch, which is a Python-based scientific computing package.

The data

The dataset I used in this project is the Intel Image Classification Dataset, which I found on Kaggle. It consists of around 25 000 images belonging to 6 classes: buildings, forest, glacier, mountain, sea and street.

After loading the images, I ran into some trouble. Running a few diagnostics, I found some problematic images that didn't share the same dimensions as the others: 48 images in the training set, 7 in the validation set and another 14 in the test set weren't 150 × 150 pixels, so I had to resize them.

Some images from the resized, otherwise untouched dataset

As I said, I wanted to train models on somewhat different images, so I created a few slightly different datasets from the original one.

Determining the mean and standard deviation of the images

First, I determined the mean and the standard deviation (std) of the images so I could normalize them: we subtract the per-channel mean and divide by the per-channel std across all three (RGB) channels. As a result, each channel of the data has a mean of 0 and an std of 1, which puts the inputs on a common scale and can help the model converge more quickly.
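Below is a minimal sketch of how these channel statistics and the normalizing transform can be computed with torchvision; the folder path and variable names (raw_ds, norm_tfms) are illustrative, not the exact code from my notebook.

import torch
from torch.utils.data import DataLoader
from torchvision import transforms as T
from torchvision.datasets import ImageFolder

# Load the raw images as 150 x 150 tensors (the resize also fixes the odd-sized ones)
raw_tfms = T.Compose([T.Resize((150, 150)), T.ToTensor()])
raw_ds = ImageFolder('./seg_train/seg_train', transform=raw_tfms)  # illustrative path

# Accumulate exact per-channel mean and std over the whole training set
loader = DataLoader(raw_ds, batch_size=64)
n_pixels, channel_sum, channel_sq_sum = 0, torch.zeros(3), torch.zeros(3)
for images, _ in loader:                      # images: [B, 3, 150, 150]
    n_pixels += images.numel() / 3            # pixels per channel in this batch
    channel_sum += images.sum(dim=[0, 2, 3])
    channel_sq_sum += (images ** 2).sum(dim=[0, 2, 3])
mean = channel_sum / n_pixels
std = (channel_sq_sum / n_pixels - mean ** 2).sqrt()

# Normalization built from the measured statistics
norm_tfms = T.Compose([
    T.Resize((150, 150)),
    T.ToTensor(),
    T.Normalize(mean.tolist(), std.tolist()),
])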

The transformation settings for the normalized datasets
Some of the normalized image dataset

Normalization can be used by itself, but we can also apply other transformations to the dataset, which can have a positive impact on the training process. This is called data augmentation.

We can apply randomly chosen transformations to the images while loading them from the dataset. Since these transformations are applied randomly, the model sees slightly different inputs at each epoch, which helps it generalize better.

Transformations on the raw dataset

In this case I normalized all the images, then added some randomness on top. Some images may have been flipped horizontally, some may have had areas erased, and others may have been shifted a little in a random direction (see RandomCrop).
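A sketch of what such an augmentation pipeline can look like in torchvision; the exact parameters (the 4-pixel reflect padding for RandomCrop, the erasing settings) are my assumptions based on the description above, not necessarily the values used in the notebook.

from torchvision import transforms as T

aug_tfms = T.Compose([
    T.Resize((150, 150)),
    T.RandomCrop(150, padding=4, padding_mode='reflect'),  # shift a few pixels, mirror the edges
    T.RandomHorizontalFlip(),                               # flip roughly half of the images
    T.ToTensor(),
    T.Normalize(mean.tolist(), std.tolist()),               # stats from the previous step
    T.RandomErasing(),                                      # blank out a random rectangle
])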

Those 25 000 images came in separate folders: a training set with around 14 000 images, a test set with 3 000 and a prediction set with 7 000.

Since there wasn't a dedicated validation set, I used the random_split function to separate 20% of the training data for this purpose, and the remainder became the training set. The prediction set with its 7 000 images doesn't contain any labels (the dataset was part of a Kaggle competition), so I couldn't use it.

Split the dataset into parts

For the modified dataset I had to use a little trick. Since the augmentation was applied to the whole training dataset before the validation set was separated, I paired it with the normalized validation set instead. Because I didn't want any overlap between the modified training set and the normalized validation set (the one we use when training with the modified data), I performed the random_split with the same fixed randomness on both datasets; this way I got two datasets exactly as I wanted.
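A sketch of that trick, assuming norm_ds and aug_ds are two ImageFolder datasets built over the same training folder with the normalizing and the augmenting transforms respectively; seeding both splits identically keeps the index assignment the same.

import torch
from torch.utils.data import random_split

val_size = int(0.2 * len(norm_ds))           # 20% for validation
train_size = len(norm_ds) - val_size

# The same seeded generator is used for both splits, so the augmented
# training set and the normalized validation set never overlap.
norm_train_ds, norm_val_ds = random_split(
    norm_ds, [train_size, val_size],
    generator=torch.Generator().manual_seed(42))
aug_train_ds, _ = random_split(
    aug_ds, [train_size, val_size],
    generator=torch.Generator().manual_seed(42))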

Creating the train and validation sets for the modded and normalized learning
Some images from the modified dataset. We can clearly see the erased areas, and if we look closely (really closely) we can see the reflected edges on some pictures, though the reflection is at most 4 pixels wide on a 150 × 150 pixel image

After I had this many colorful datasets, I got curious how color affects the learning of a model, so I created a grayscale dataset from the original one. I know there must be a dozen built-in or third-party functions for this task that do the job faster and maybe more accurately, but this procedure was one of the few image operations I remembered from school, so I wrote my own.
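My conversion amounted to something like the sketch below, a weighted sum of the three channels with the usual luminance weights; the weights and the function name are illustrative, not necessarily what the notebook uses.

import torch
from torchvision import transforms as T

def to_grayscale(img):
    # Collapse an RGB tensor [3, H, W] into a single-channel [1, H, W] tensor
    weights = torch.tensor([0.299, 0.587, 0.114]).view(3, 1, 1)
    return (img * weights).sum(dim=0, keepdim=True)

gray_tfms = T.Compose([
    T.Resize((150, 150)),
    T.ToTensor(),
    T.Lambda(to_grayscale),
])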

Creating grayscaled image tensors
Some grayscaled images. Judging by the pictures, the result turned out better than the operation's speed did

Before I could start training, I had to check the class distribution of the images; it wouldn't help the learning phase if, say, the majority of the images were buildings.
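Checking the balance only takes a few lines; for example, assuming raw_ds is the ImageFolder dataset from the earlier sketch:

from collections import Counter

# Count how many training images fall into each class
class_counts = Counter(raw_ds.targets)
for idx, count in sorted(class_counts.items()):
    print(f'{raw_ds.classes[idx]:>10}: {count}')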

The distribution of the images in the different classes

Luckily this wasn't the case: the images were well distributed and ready for training, so let's move on to the next step.

GPU

GPUs can carry out certain kinds of tasks much faster than CPUs, and matrix operations are one of them. Since most of the calculations in this project were matrix-related, it made sense to use a GPU instead of a CPU whenever possible. Kaggle provides 30 hours of GPU usage per week, so I moved the data to the GPU with the code cell below.

Some functions that eased the moving process

Even if we don't have a GPU, or don't always have one, we can still use this code. It searches for a CUDA GPU and, if it finds one, sets it as the default device; if there is no CUDA GPU, the CPU remains the default device and nothing changes.
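These helpers follow the pattern used throughout the Zero to GANs course; a minimal sketch of them looks like this (the exact code in my notebook may differ in details):

import torch
from torch.utils.data import DataLoader

def get_default_device():
    # Pick the GPU if a CUDA device is available, otherwise the CPU
    return torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

def to_device(data, device):
    # Move a tensor (or a list/tuple of tensors) to the chosen device
    if isinstance(data, (list, tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)

class DeviceDataLoader:
    # Wrap a DataLoader so every batch is moved to the device on the fly
    def __init__(self, dl, device):
        self.dl, self.device = dl, device

    def __iter__(self):
        for batch in self.dl:
            yield to_device(batch, self.device)

    def __len__(self):
        return len(self.dl)

device = get_default_device()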

Moving the data to the GPU — or the default device

Models

The next task was to define the models we would use. Since my aim wasn't to build the most accurate predictive system, these are just some simple neural networks.

The first model (Model 0) is made of fully-connected layers. The input layer receives the flattened pixels; notice that the number of input features depends on the size (and channel count) of the input images. Since we had grayscale as well as color images, every model has a modified version that accepts grayscale batches; the models in the sketches are drawn for RGB inputs.

Scheme of Model 0
Model 1 in AlexNet style

All the models use the ReLU activation function between layers.

Model 2

All the convolutions use a 3×3 kernel, and I always added 1 pixel of padding.

Model 3

The max-pooling layers I used halve the width and height of an image, so the number of pixels in one channel drops to a quarter of what it was in the previous layer. There is one exception: in Model 1 I used max-pooling with a kernel size and stride of 5×5.
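To make these building blocks concrete, here is a small sketch in the spirit of Models 2 to 5: 3×3 convolutions with 1 pixel of padding, ReLU activations and 2×2 max-pooling. The channel sizes and layer counts are illustrative, not the actual definitions from the notebook.

import torch.nn as nn

class SimpleCnn(nn.Module):
    def __init__(self, in_channels=3, num_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 150 -> 75
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 75 -> 37
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 37 -> 18
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 18 * 18, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))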

Model 4
Model 5

The source code of the models can be found in the project notebook I share at the end of this post. The last model (Model 6) is ResNet9, the residual convolutional network we often used in the course.

ResNet9 a.k.a. Model 6

Training

For training I needed a function that could count the correctly predicted images; the function named get_num_correct did the trick. An important detail is that this function counts the number of correct labels, not the accuracy (in other words it gives us a quantity, not a ratio).
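It probably looked more or less like this short helper (the exact implementation in the notebook may differ):

import torch

def get_num_correct(preds, labels):
    # Number of images whose highest-scoring class matches the true label
    return (preds.argmax(dim=1) == labels).sum().item()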

How many correctly predicted images are in the predictions

The training function I wrote takes several parameters: the model we want to use (defined above); the number of epochs; and the learning rate, which controls how fast the weight matrices are updated by scaling the step size. We also have to give the function a training data loader and the related validation data loader. Besides these there are two optional parameters: the optimizer and the loss function. As defaults I picked SGD (stochastic gradient descent) for the optimizer and cross-entropy for the cost function. The batch size was always 64.

Inside the function, I first print out the model and optimizer being used. Then, before training begins, a helper calculates the loss and accuracy on the training and validation sets (note that at this point these roughly match the metrics we would get by choosing labels at random), stores these values as the first elements of their lists, and prints them as well. After that we can start training the model and updating the weight matrices. At the end of every epoch the current metrics are calculated, stored in the lists, and printed, so we can follow the process.
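The sketch below condenses such a training loop; it only tracks the validation metrics for brevity, and the evaluate helper and parameter names are my own, not the exact function from the notebook.

import torch
import torch.nn.functional as F

@torch.no_grad()
def evaluate(model, loader, loss_fn):
    # Average loss and overall accuracy over a data loader
    model.eval()
    total_loss, total_correct, total_seen = 0.0, 0, 0
    for images, labels in loader:
        preds = model(images)
        total_loss += loss_fn(preds, labels).item() * labels.size(0)
        total_correct += get_num_correct(preds, labels)
        total_seen += labels.size(0)
    return total_loss / total_seen, total_correct / total_seen

def fit(model, epochs, lr, train_loader, val_loader,
        opt_func=torch.optim.SGD, loss_fn=F.cross_entropy):
    optimizer = opt_func(model.parameters(), lr=lr)
    history = [evaluate(model, val_loader, loss_fn)]    # metrics before any training
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        val_loss, val_acc = evaluate(model, val_loader, loss_fn)
        print(f'Epoch {epoch + 1:2d}: val_loss={val_loss:.4f}, val_acc={val_acc:.4f}')
        history.append((val_loss, val_acc))
    return history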

The training function I used in this project

After the last epoch, when the procedure finishes, it returns the loss and accuracy lists for the training and validation sets.

Once we have defined everything we need, we can start training. First, we instantiate a model on the GPU (or the default device; remember, all the data has been moved there), then call the function with the chosen arguments. I also used the %%time built-in magic for timing.
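Wrapping the loaders and instantiating a model on the default device takes only a few calls; here SimpleCnn stands in for one of the actual Model classes, and the dataset names come from the earlier sketches.

from torch.utils.data import DataLoader

train_dl = DeviceDataLoader(DataLoader(norm_train_ds, batch_size=64, shuffle=True), device)
val_dl = DeviceDataLoader(DataLoader(norm_val_ds, batch_size=64), device)
model = to_device(SimpleCnn(in_channels=3, num_classes=6), device)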

Creating the models on the default device

I created a total of 27 models (7 for each of the 4 types of data, except that I didn't create a ResNet9 model for the grayscale images). The training time varied from 1 minute to 25 minutes; although the 1-minute runs were outliers caused by the grayscale sets, the average running time was around 20 minutes per training.

Every training run used the same hyperparameters.

I trained every model for 20 epochs with a learning rate of 10e-5. For the optimizer I used torch.optim.Adam, and the loss function in all cases was cross-entropy.
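Put together, a training cell with these hyperparameters would look roughly like this (a notebook cell, with train_dl and val_dl being the device-wrapped loaders from above; the fit sketch already defaults to cross-entropy loss):

%%time
history = fit(model, epochs=20, lr=10e-5,
              train_loader=train_dl, val_loader=val_dl,
              opt_func=torch.optim.Adam)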

Results

After I finished all the training runs and had all the values, I could start the evaluation.

I won't present all the results here, that would be quite boring, and they are similar for most of the models anyway. The following graphs show the accuracy and loss curves from training Model 3.

Validation and training loss over 20 epochs of training Model 3

We can see how the training and validation loss decrease during training. As was most often the case, the worst result came from the grayscale training set. Training with the plain RGB images gave better results, but the lowest loss was reached with the normalized and the modified images.

Validation and training accuracy over 20 epochs of training Model 3

We can see an interesting phenomenon here. While on the training set (in both loss and accuracy) the award goes to the normalized dataset, on the validation graphs we can observe a difference: there, the training with the modified set wins. How is this possible?
As we discussed at the beginning of this post, we use data augmentation to achieve better generalization. The training loss is higher on the modified set (so it performs worse there than the normalized one), yet training with the modified images generalizes better, so the accuracy on the validation set (which was a normalized set, remember how we set them up) is higher.

I averaged the results by dataset and by model; here they are.

Mean validation accuracy by dataset

We can clearly see that on average the grayscale image set performed the worst, while training with the modified set came out just ahead of training with the normalized set. The RGB set has nothing to be ashamed of either; it didn't perform badly at all.

Mean validation accuracy by model

As for the models, ResNet9 finished in the lead, and the one containing only fully-connected layers was the least effective.

Since the models trained on the modified images and Model 6 did their jobs best, let's see how each of them performed on its own!

How the different models learned on the modified image data

We can see in the figure above that on the modified images (which gave the best overall results) ResNet9 learned at the best rate on average.

The Model 6 (ResNet9) learning curve with the different datasets

Though I didn't train the course model (ResNet9) on the grayscale images, with the color images it behaved the same as the average: on the training data, the normalized and the untouched RGB images gave the best accuracy and almost zero loss, while in the validation phase we could observe that, although the version trained on the modified images had the highest training loss, in actual use it was the best.

We can be happy when we see high accuracies (and low losses), but we don't want to count our chickens before they hatch. We should make sure the model performs well on never-before-seen data. That is why we should always double-check the results on the test dataset. If it then gives us a value we are satisfied with, we may move on and perhaps try it out on real examples.

I will not go into detail here; the values I measured on the test set are about the same as on the validation sets. (The best result was 89.3% with Model 6 (ResNet9) trained on the modified dataset, while the lowest hit count, 1180 / 3000, came from Model 0 (only fully-connected layers) trained on the grayscale images.)

Let’s predict

After finishing the analysis, I wrote a simple function to predict labels for some images in the test set (I tried it on the prediction set too, but that obviously wasn't as much fun, since I didn't know the correct labels).

With the use of the test set, we can pick some images for testing the model

We give the function an image; it shows us the image in question, then draws a graph of what the model 'thinks' the image represents. The correct label's probability is marked in orange.
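A sketch of such a prediction helper, assuming img is an un-normalized [C, H, W] tensor and classes is the list of class names; the plotting details and the function name are illustrative.

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

@torch.no_grad()
def predict_image(img, label, model, classes):
    model.eval()
    logits = model(to_device(img.unsqueeze(0), device))
    probs = F.softmax(logits, dim=1)[0].cpu()

    fig, (ax_img, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))
    ax_img.imshow(img.permute(1, 2, 0).cpu().numpy())    # [C, H, W] -> [H, W, C]
    ax_img.axis('off')
    colors = ['tab:orange' if i == label else 'tab:blue' for i in range(len(classes))]
    ax_bar.bar(classes, probs.tolist(), color=colors)     # the true label's bar is orange
    ax_bar.set_ylabel('probability')
    plt.show()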

Some images the model predicted well (the correct label marked with orange)

Those images' labels were all predicted by Model 6 (ResNet9) trained on the RGB images. Its prediction accuracy is about 88.7% (a simple accuracy: the model gets a point when it labels an image correctly, and none when it makes a mistake).

Now let's see some images where the model was incorrect!

Some images the model predicted wrongly (the correct label’s prediction marked with orange)

Well, looking with a human eye, I can see what went wrong, and I'm not sure I could correctly classify most of them myself either. Most of the mislabeled predictions come from that kind of understandable confusion; there were only a few inexplicable errors (see the prediction in the lower left corner), where the model could clearly use some more training.

But where does the model make the most mistakes? To answer that question, a confusion matrix can help us. (I got the source code for the confusion matrix from DeepLizard.com.)
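The idea behind the matrix is simple; a minimal sketch of accumulating one (not the DeepLizard version I actually used) could look like this:

import torch

@torch.no_grad()
def confusion_matrix(model, loader, num_classes=6):
    # Entry [t, p] counts images of true class t that were predicted as class p
    model.eval()
    cm = torch.zeros(num_classes, num_classes, dtype=torch.long)
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        for t, p in zip(labels.cpu(), preds.cpu()):
            cm[t, p] += 1
    return cm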

Confusion matrix of the predictions

As we can see, the model mostly mixes up buildings with streets and mountains with glaciers, but, for example, it can almost always tell when it 'sees' a forest. This seems understandable: a street contains some buildings, there is snow on mountain tops as well as on glaciers, and what else is as green as a forest? If we used multi-label classification, the majority of these errors would probably be eliminated.

And, of course, the model can also go wrong because it received some doubtful labels from the training dataset.

An image labelled as a sea

Conclusion

I built image classifiers with different models and with slightly modified training datasets. I altered the images' color channels by normalizing them or by collapsing the three channels into one (converting RGB to grayscale), and I flipped or shifted some images and even erased parts of them.

The models ranged from a simple one containing only fully-connected layers, through CNN models, to a residual neural network.

In all cases the three color channels were an advantage for the training process compared to grayscale. Of the models I trained, the residual network achieved the best results, on the modified images.

Although the model didn't learn as fast with the modified images as with the plain RGB or normalized ones, it gave better results on the validation set (and later on the test set as well) thanks to its better generalization.

However, I have to mention that these results hold for these particular cases; we shouldn't generalize too broadly from them.

Thank you!

I would like to thank the whole Jovian.ml team for this PyTorch course, which is the reason this blog post was born. It was really great, and I recommend it to anyone who wants to learn machine learning.

I would also like to thank my friend Gyula, whom I could always bother with my funny questions.

Also, I really appreciate it if you got this far; I hope you didn't regret it.

References

I share my Jovian notebook here; I'll try not to change it later, though it's a bit confusing in places.

https://jovian.ml/attila-balogh/course-project-presenter
