This post continues the one published by Infosimples on 19 Oct 2018: https://medium.com/infosimples/does-cnn-learns-modified-inputs-bc16ae1be498
TL;DR: The best way to deal with images of different sizes is to downscale them to match the dimensions of the smallest image available.
If you read our last post, you know that CNNs are able to learn from images even if their channels are flipped, at a cost in model accuracy.
This post studies a similar problem: suppose each color channel has a different size. What are the best ways to train an image classifier under those circumstances?
First, let's create a simple model to serve as the baseline for the comparisons made in this article:
Layer            Output Shape            Param #
InputLayer       (None, 100, 100, 3)     0
Conv2D           (None, 100, 100, 32)    896
MaxPooling2D     (None, 50, 50, 32)      0
Dropout          (None, 50, 50, 32)      0
Conv2D           (None, 50, 50, 64)      18496
MaxPooling2D     (None, 25, 25, 64)      0
Dropout          (None, 25, 25, 64)      0
Flatten          (None, 40000)           0
Dense            (None, 128)             5120128
Dropout          (None, 128)             0
Dense            (None, 2)               258
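The parameter counts in the summary can be checked with a little arithmetic. The sketch below is not taken from the original code: the helper names are hypothetical, and the 3x3 kernel size is an assumption inferred from the counts themselves (896 = (3 * 3 * 3 + 1) * 32).

```python
def conv2d_params(kernel, in_channels, out_channels):
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return (kernel * kernel * in_channels + 1) * out_channels


def dense_params(in_units, out_units):
    """Weights plus one bias per output unit for a Dense layer."""
    return (in_units + 1) * out_units


# Assumption: both convolutions use 3x3 kernels (inferred, not stated in the post).
conv1 = conv2d_params(3, 3, 32)          # 896
conv2 = conv2d_params(3, 32, 64)         # 18496
flattened = 25 * 25 * 64                 # 40000 features after the second pooling
dense1 = dense_params(flattened, 128)    # 5120128
dense2 = dense_params(128, 2)            # 258

print(conv1, conv2, flattened, dense1, dense2)
```

Almost all of the parameters sit in the first Dense layer, because it is connected to every one of the 40000 flattened features.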
It is a simple model with only two convolutional layers that tells dog pictures apart from non-dog pictures. After training it for 10 epochs on complete 3-channel images of 100x100 pixels, the results are:
The maximum validation accuracy of 77.58% will be used as the reference for the next experiments in this post.
We all know that an image loses quality when you zoom in on it. When a small number of pixels is displayed on a higher-resolution screen, new pixels must be "created" to fill the holes that would otherwise appear. There are many techniques that can do this:
Each one of those images was downscaled to 40x40 and then upscaled back to 160x160 using one of the scaling algorithms above. Although much of the visual quality is lost, we can still tell that this is a picture of a shell, even with only 1/16 of the original pixels.
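The round trip described above can be sketched with NumPy. This is only an illustration, not the code used for the pictures in this post: it assumes block averaging for the downscale and nearest-neighbor repetition for the upscale, and it uses a random array as a stand-in for a single image channel.

```python
import numpy as np


def downscale_mean(img, factor):
    """Downscale a 2D array by averaging non-overlapping factor x factor blocks."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


def upscale_nearest(img, factor):
    """Upscale a 2D array by repeating every pixel factor times along each axis."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)


rng = np.random.default_rng(0)
original = rng.random((160, 160))       # stand-in for one 160x160 channel
small = downscale_mean(original, 4)     # 40x40
restored = upscale_nearest(small, 4)    # back to 160x160, but blocky

ratio = small.size / original.size      # 1600 / 25600 = 1/16
print(small.shape, restored.shape, ratio)
```

The restored array has the original dimensions but holds at most 1600 distinct values, which is why the upscaled pictures look blurry or blocky.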
And what about neural networks? Which upscaling algorithm works best? Or should we downscale the pictures instead? Let's settle the question.
Below are channel slices and their combinations, produced with different upscaling algorithms:
We can also test the following architecture, which uses convolutions to reduce the bigger channels during training:
The architecture above was developed with the idea that convolutions can reduce the channel dimensions while extracting only the most important features. You can check it here:
After training the simple neural network presented at the beginning of this post with several upscaling techniques, we got the following accuracy rates:
If we take only the validation accuracy into consideration, we can conclude that every upscaling technique tested is inferior to downscaling the images to the size of the smallest one. The best thing to do in this case is simply to downscale the pictures to match the dimensions of the smallest channel.
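As a minimal sketch of that recommendation, the snippet below resizes every channel to the dimensions of the smallest one and stacks them into a single image. The function names and the channel sizes (100x100, 50x50 and 25x25) are hypothetical, and nearest-neighbor sampling is just one simple choice of downscaling method, not necessarily the one used in the experiment.

```python
import numpy as np


def resize_nearest(channel, out_h, out_w):
    """Resize a 2D array by nearest-neighbor index sampling."""
    in_h, in_w = channel.shape
    rows = (np.arange(out_h) * in_h) // out_h
    cols = (np.arange(out_w) * in_w) // out_w
    return channel[rows[:, None], cols]


def stack_at_smallest(channels):
    """Downscale all channels to the smallest height and width, then stack them."""
    out_h = min(c.shape[0] for c in channels)
    out_w = min(c.shape[1] for c in channels)
    resized = [resize_nearest(c, out_h, out_w) for c in channels]
    return np.stack(resized, axis=-1)


# Hypothetical example: three color channels with different resolutions.
rng = np.random.default_rng(0)
red = rng.random((100, 100))
green = rng.random((50, 50))
blue = rng.random((25, 25))

image = stack_at_smallest([red, green, blue])
print(image.shape)
```

The smallest channel passes through unchanged, so no pixels have to be invented; the larger channels only lose detail.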
The full source code for this experiment can be found here:
The Portuguese version of this post can be accessed here: https://medium.com/infosimples-br/como-lidar-com-redimensionamento-de-imagens-em-deep-learning-f4215b30f57a