How to deal with image resizing in Deep Learning

This post continues the one published by Infosimples on 19 Oct 2018: https://medium.com/infosimples/does-cnn-learns-modified-inputs-bc16ae1be498

TL;DR: The best way to deal with images of different sizes is to downscale them to match the dimensions of the smallest image available.

If you read our last post, you know that CNNs are able to learn from images even if their channels are flipped, at a cost in model accuracy.

This post studies a similar problem: suppose each color channel has a different size. What are the best ways to train an image classifier under those circumstances?

First, let's create a simple model to serve as a baseline for the comparisons made in this article:

Layer                        Output Shape              Param #
=================================================================
InputLayer                   (None, 100, 100, 3)       0
_________________________________________________________________
Conv2D                       (None, 100, 100, 32)      896
_________________________________________________________________
MaxPooling2D                 (None, 50, 50, 32)        0
_________________________________________________________________
Dropout                      (None, 50, 50, 32)        0
_________________________________________________________________
Conv2D                       (None, 50, 50, 64)        18496
_________________________________________________________________
MaxPooling2D                 (None, 25, 25, 64)        0
_________________________________________________________________
Dropout                      (None, 25, 25, 64)        0
_________________________________________________________________
Flatten                      (None, 40000)             0
_________________________________________________________________
Dense                        (None, 128)               5120128
_________________________________________________________________
Dropout                      (None, 128)               0
_________________________________________________________________
Dense                        (None, 2)                 258
=================================================================
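
The parameter counts in the summary can be sanity-checked by hand. The sketch below assumes 3x3 kernels with "same" padding for both convolutions. The post does not state this, but it is consistent with the output shapes and parameter counts listed. The helper names are made up for illustration.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights of every filter plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(in_features, units):
    """One weight per input/unit pair plus one bias per unit."""
    return in_features * units + units


# Assumed 3x3 kernels; channel counts are taken from the summary.
conv1 = conv2d_params(3, 3, 3, 32)
conv2 = conv2d_params(3, 3, 32, 64)

# Two 2x2 max-poolings take 100x100 down to 25x25 before flattening.
flat = 25 * 25 * 64
dense1 = dense_params(flat, 128)
dense2 = dense_params(128, 2)

total = conv1 + conv2 + dense1 + dense2
print(conv1, conv2, flat, dense1, dense2, total)
# 896 18496 40000 5120128 258 5139778
```

Almost all of the weights sit in the first dense layer, which is one reason the input resolution matters so much for a model this small.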

It's a simple model with only two convolutions, able to tell dog pictures apart from non-dog pictures. After training it for 10 epochs (using complete 3-channel images of 100x100 pixels), the results are:

The maximum validation accuracy of 77.58% will be used as the reference for the next experiments in this post.

Scaling techniques

We all know that an image loses quality when you zoom in on it. When a small number of pixels is shown on a screen with a higher resolution, new pixels have to be "created" to fill the holes that would otherwise appear. There are many techniques that can do this:

Panels, in order: original picture (160x160), nearest-neighbor interpolation, bilinear interpolation, bicubic interpolation, Fourier-based interpolation, edge-directed interpolation algorithms.

Each one of those images was downscaled to 40x40 and then upscaled back to 160x160 using one of the scaling algorithms above. Although a lot of the visual quality is lost, we can still tell that this is a picture of a shell, even with only 1/16 of the pixels we had before.
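
To make the round trip concrete, here is a minimal NumPy sketch of two of those algorithms, nearest-neighbor and bilinear, applied to a single-channel image. The post does not say which library produced its pictures, so the function names and the random test image are assumptions made for illustration only.

```python
import numpy as np


def resize_nearest(img, new_h, new_w):
    """Nearest-neighbor resize: each output pixel copies one source pixel."""
    h, w = img.shape[:2]
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    return img[rows[:, None], cols]


def resize_bilinear(img, new_h, new_w):
    """Bilinear resize of a 2-D (single-channel) image."""
    h, w = img.shape
    img = img.astype(float)
    # Map output pixel centers back into source coordinates.
    ys = np.clip((np.arange(new_h) + 0.5) * h / new_h - 0.5, 0, h - 1)
    xs = np.clip((np.arange(new_w) + 0.5) * w / new_w - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]  # vertical blend weights
    wx = (xs - x0)[None, :]  # horizontal blend weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


rng = np.random.default_rng(0)
original = rng.random((160, 160))           # stand-in for the 160x160 picture
small = resize_nearest(original, 40, 40)    # downscale
restored_nn = resize_nearest(small, 160, 160)
restored_bl = resize_bilinear(small, 160, 160)

print(small.size / original.size)  # 0.0625, i.e. 1/16 of the pixels
```

Both restored images are 160x160 again, but every value in them is derived from the 1,600 pixels that survived the downscale.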

And what about neural networks? Which upscaling algorithm works best? Or should we downscale the pictures instead? Let's settle this question.

Below, we have channel slices and combinations of them using different upscaling algorithms:
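
As a rough sketch of how such a combination could be built, the snippet below upscales each channel to the size of the largest one and stacks them into a regular 3-channel image. The channel sizes (100, 50 and 25 pixels per side) are hypothetical, since the post does not list the exact sizes used, and `upscale_repeat` is an illustrative helper that only handles square channels and integer scale factors.

```python
import numpy as np


def upscale_repeat(channel, target):
    """Nearest-neighbor upscale by an integer factor (square channels only)."""
    factor = target // channel.shape[0]
    return np.repeat(np.repeat(channel, factor, axis=0), factor, axis=1)


rng = np.random.default_rng(0)
# Hypothetical per-channel resolutions, one array per color channel.
channels = [rng.random((100, 100)), rng.random((50, 50)), rng.random((25, 25))]

target = max(c.shape[0] for c in channels)
image = np.stack([upscale_repeat(c, target) for c in channels], axis=-1)

print(image.shape)  # (100, 100, 3)
```

Swapping `upscale_repeat` for a bilinear or bicubic resize would give the other combinations compared in this post.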

We can also test the following architecture, able to reduce bigger channels during training with convolutions:

Let's call this architecture “Multiresolution CNN”

The above architecture was developed with the idea that convolutions are able to reduce the channels' dimensions while extracting only the most important features. You can check it here:


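The general idea of shrinking larger channels with convolutions can be illustrated with a toy NumPy example. This is not the Multiresolution CNN itself: the channel sizes, the 2x2 kernels, the stride of 2 and the function name are all assumptions, and random kernels stand in for weights that training would learn.

```python
import numpy as np


def strided_conv2d(channel, kernel, stride):
    """Unpadded 2-D convolution (cross-correlation, as in most DL libraries)."""
    kh, kw = kernel.shape
    out_h = (channel.shape[0] - kh) // stride + 1
    out_w = (channel.shape[1] - kw) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = channel[top:top + kh, left:left + kw]
            out[i, j] = np.sum(patch * kernel)
    return out


rng = np.random.default_rng(0)
# Hypothetical channel resolutions.
large = rng.random((100, 100))
medium = rng.random((50, 50))
small = rng.random((25, 25))

# Random kernels as placeholders for learned weights.
k1, k2, k3 = (rng.standard_normal((2, 2)) for _ in range(3))

# Each stride-2 convolution halves the side length.
large_reduced = strided_conv2d(strided_conv2d(large, k1, 2), k2, 2)  # 100 -> 50 -> 25
medium_reduced = strided_conv2d(medium, k3, 2)                       # 50 -> 25

merged = np.stack([large_reduced, medium_reduced, small], axis=-1)
print(merged.shape)  # (25, 25, 3)
```

Unlike a fixed resize, the kernels here would be trained, so the network itself decides what to keep from the larger channels.
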
After training the simple neural network presented at the beginning of this post with several upscaling techniques, we got the following accuracy rates:

Post-training results

If we take into consideration only the validation accuracy, we can conclude that every upscaling technique tested is inferior to downscaling the images to the size of the smallest one. The best thing to do in this case is simply to downscale the pictures to match the smallest channel dimensions.
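
A minimal sketch of that recommendation, assuming NumPy and hypothetical channel sizes of 100, 50 and 25 pixels per side: every channel is reduced to the smallest resolution by averaging square blocks of pixels and then stacked. The post does not say which downscaling algorithm it used, so block averaging is just one reasonable choice, and `downscale_mean` is an illustrative helper that requires square channels whose sides are multiples of the target.

```python
import numpy as np


def downscale_mean(channel, target):
    """Downscale a square channel by averaging factor x factor blocks."""
    factor = channel.shape[0] // target
    blocks = channel.reshape(target, factor, target, factor)
    return blocks.mean(axis=(1, 3))


rng = np.random.default_rng(0)
# Hypothetical per-channel resolutions, one array per color channel.
channels = [rng.random((100, 100)), rng.random((50, 50)), rng.random((25, 25))]

target = min(c.shape[0] for c in channels)
image = np.stack([downscale_mean(c, target) for c in channels], axis=-1)

print(image.shape)  # (25, 25, 3)
```

The resulting array can be fed to the baseline model once its input layer is sized to match the smallest channel.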