Does a CNN learn modified inputs?

The Portuguese version of this article is available at "A CNN aprende entradas modificadas?"

The CNN (Convolutional Neural Network) was developed to achieve better accuracy on image recognition problems. Its success is due to two main features of its architecture: three-dimensional inputs and convolutions.

LeNet-5, the first successful CNN, created by Yann LeCun in 1998 and used for digit recognition.

The 3D inputs make the model more realistic, since images have width, height and depth (the depth represents the color channels, such as RGB). The convolutions make it more efficient, since a filter needs far fewer operations and parameters than a fully connected layer.
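To make the parameter saving concrete, here is a minimal sketch that counts the parameters of a convolutional layer and of a fully connected layer producing an output of the same size. The input size, kernel size and number of filters are hypothetical values chosen only for illustration.

```python
def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    """Weights plus one bias per filter for a convolutional layer."""
    return (kernel_h * kernel_w * in_channels + 1) * n_filters


def dense_params(n_inputs, n_outputs):
    """Weights plus one bias per output unit for a fully connected layer."""
    return (n_inputs + 1) * n_outputs


# Hypothetical sizes, chosen only for illustration.
height, width, channels = 32, 32, 3
kernel, n_filters = 5, 16

# Output size of a stride-1 convolution without padding.
out_h, out_w = height - kernel + 1, width - kernel + 1

conv_total = conv_params(kernel, kernel, channels, n_filters)
dense_total = dense_params(height * width * channels, out_h * out_w * n_filters)

print(f"conv layer:  {conv_total:,} parameters")
print(f"dense layer: {dense_total:,} parameters")
```

The convolutional layer reuses the same small set of weights at every position, which is why its count does not depend on the image size.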

We can interpret each layer as extracting features from the image. The first layers capture the most basic features, such as edges, borders and geometric shapes.

Filter representation from the first layers of a CNN

However, are the input channels dependent on each other? That is, if we modify one channel of every input image (mirroring the green channel, for example), will the network still be able to learn with the same accuracy and speed?

On the one hand, we might expect the channels of the learned filters to be modified in the same way: if we mirror a channel horizontally, the corresponding channel of the learned filters will also be mirrored horizontally.

On the other hand, it is hard to imagine what features the network will extract, since modifying a channel produces an image that looks almost random, with many coloured blobs scattered across it.

Motivation

Most likely, this problem will never occur in an image recognition network. However, there are some applications where it is possible. For example, imagine a network designed for soil mapping. In its inputs, each channel would represent a specific soil property (pH, nutrient concentration, etc.), and each element (equivalent to an image pixel) would represent an area of soil.

However, for many reasons, an element in one channel may represent a different area than an element in another channel, so the channels end up with different sizes.

Since a CNN only accepts inputs whose channels all have the same size, how should we train this network? Should we resize each channel so that the information for each location is "aligned", or can the network learn the information from each channel independently?
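To illustrate what "aligning" the channels could look like, here is a rough sketch that upsamples a coarse grid with nearest-neighbour repetition so that it can be stacked with a finer one. The grids, their values and the `upsample_nearest` helper are hypothetical placeholders, not real soil data or a method taken from the references.

```python
import numpy as np


def upsample_nearest(channel, factor):
    """Repeat every element `factor` times along both axes."""
    return np.repeat(np.repeat(channel, factor, axis=0), factor, axis=1)


# Hypothetical grids covering the same region at different resolutions.
ph = np.arange(16, dtype=float).reshape(4, 4)   # one value per small cell
nutrients = np.array([[1.0, 2.0], [3.0, 4.0]])  # one value per 2x2 block of cells

# After upsampling, element (i, j) of both channels refers to the same area.
aligned = np.stack([ph, upsample_nearest(nutrients, 2)], axis=-1)
print(aligned.shape)
```

Nearest-neighbour repetition is only one possible choice here; interpolation would be another.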


Test

To test this, we used a simple network to recognize dogs in the Open Images dataset.

On the left, the original image of a dog; on the right, the same image with the green channel mirrored horizontally and the blue channel mirrored vertically.
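A minimal sketch of this modification with NumPy is shown below. It assumes the image is an array of shape (height, width, 3) in RGB order; the function name `modify_channels` is ours and the tiny array merely stands in for a real photo.

```python
import numpy as np


def modify_channels(image):
    """Mirror green horizontally and blue vertically; leave red untouched.

    `image` is assumed to have shape (height, width, 3) in RGB order.
    """
    modified = image.copy()
    modified[:, :, 1] = image[:, ::-1, 1]  # green: flip left-right
    modified[:, :, 2] = image[::-1, :, 2]  # blue: flip top-bottom
    return modified


# Tiny synthetic "image" so the effect is easy to inspect.
image = np.arange(2 * 3 * 3).reshape(2, 3, 3)
modified = modify_channels(image)
```

Since both flips are their own inverse, applying the function twice gives back the original image, so no information is lost; only the spatial alignment between channels changes.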

A summary of the network used is shown below: CONV -> RELU -> MAXPOOL -> CONV -> RELU -> MAXPOOL -> DROPOUT -> FC -> Softmax

Summary of the network
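To get a feel for how data flows through this sequence of layers, the sketch below traces the tensor shape layer by layer. All hyperparameters (input size, kernel sizes, number of filters, two output classes) are assumptions made for illustration and are not taken from the network actually trained.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


# All hyperparameters below are hypothetical.
height, width, depth = 64, 64, 3
layers = [
    ("CONV", {"kernel": 3, "filters": 32}),
    ("RELU", {}),
    ("MAXPOOL", {"kernel": 2, "stride": 2}),
    ("CONV", {"kernel": 3, "filters": 64}),
    ("RELU", {}),
    ("MAXPOOL", {"kernel": 2, "stride": 2}),
    ("DROPOUT", {}),
    ("FC", {"units": 2}),  # assumed: dog / not dog
    ("SOFTMAX", {}),
]

shapes = []
shape = (height, width, depth)
for name, cfg in layers:
    if name == "CONV":
        h = conv_out(shape[0], cfg["kernel"])
        w = conv_out(shape[1], cfg["kernel"])
        shape = (h, w, cfg["filters"])
    elif name == "MAXPOOL":
        h = conv_out(shape[0], cfg["kernel"], cfg["stride"])
        w = conv_out(shape[1], cfg["kernel"], cfg["stride"])
        shape = (h, w, shape[2])
    elif name == "FC":
        shape = (cfg["units"],)  # the feature map is flattened first
    # RELU, DROPOUT and SOFTMAX leave the shape unchanged.
    shapes.append((name, shape))
    print(f"{name:8s} -> {shape}")
```

Each convolution changes the depth to the number of filters, while each pooling step halves the width and height.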

We trained the same network twice, once with the original images and once with the modified images (green channel mirrored horizontally, blue channel mirrored vertically). We obtained the following results:

We can observe that, although both models learned with practically the same speed and accuracy on the training dataset, on the validation dataset (which is the one we should really analyze) the original images had a better result: 74.3% versus 69.3%.

Furthermore, the validation accuracy for the modified images is slowly going down, indicating that overfitting may be occurring. We can also see this by analyzing the losses:

We can see that the training loss keeps decreasing while the validation loss increases. This effect is much more visible when training with the modified images.

Explanation

To try to explain why this happened (the network can learn from the modified images, but with a greater tendency to overfit), let's recall some concepts of convolution.

The convolution filters, like the CNN inputs, have three dimensions, and their depth is equal to the depth of the previous layer. Therefore, the filters of the first layer have the same depth as the inputs (for an RGB image, the depth is three).

Example of a convolution filter

During convolution, the filter slides along the width and height of the previous layer, performing an element-wise multiplication at each step and summing the results into a single output value.

The fact that the filter has three dimensions, and that each channel of the filter multiplies only the corresponding channel of the previous layer, creates a certain independence between the channels. Thus, each channel of a first-layer filter learns according to what it sees in the respective channel of the input image.
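The operation described above can be sketched in a few lines of NumPy. This is a naive illustration with stride 1 and no padding, using random numbers for a hypothetical 6x6 RGB input and a single 3x3x3 filter; like most deep learning frameworks, it slides the filter without flipping it. The function name `conv2d_valid` is ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 6x6 RGB input and a single 3x3x3 filter.
image = rng.normal(size=(6, 6, 3))
kernel = rng.normal(size=(3, 3, 3))


def conv2d_valid(image, kernel):
    """Slide a (kh, kw, c) filter over an (h, w, c) input, stride 1, no padding."""
    kh, kw, _ = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw, :]
            out[i, j] = np.sum(patch * kernel)  # element-wise product, then sum
    return out


full = conv2d_valid(image, kernel)

# Same result, computed one channel at a time and then added up.
per_channel = [
    conv2d_valid(image[:, :, c:c + 1], kernel[:, :, c:c + 1]) for c in range(3)
]
print(np.allclose(full, sum(per_channel)))
```

Each filter channel only ever multiplies its own input channel, and the three per-channel results are then added into one output map.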

However, modifying the channels seems to add a lot of noise to the images: the filters learn features of the inputs (such as edges and borders), and after the modifications such patterns can no longer be found easily. This made the overfitting effect even greater.

Conclusion

Given that the architecture is very simple and the dataset is not very large (27,000 training and 18,000 validation images), we can conclude that a network can still learn with modified channels, but with lower accuracy (about 5 percentage points lower in this case). However, we should be very careful with such modified data, as it has a greater tendency to overfit.


References:

- Deep learning and Soil Science — Part 2: https://towardsdatascience.com/deep-learning-and-soil-science-part-2-129e0cb4be94
- Convolutional Neural Networks (CNNs / ConvNets): http://cs231n.github.io/convolutional-networks/