CNN — Do we need to go deeper?

Pierre Ecarlat
FiNC Tech Blog
Apr 28, 2017

In the last decade, deep Convolutional Neural Networks (CNNs) have consistently achieved state-of-the-art performance in object recognition tasks, including classification, detection and segmentation. We only need to look at the results of the ILSVRC challenge: since 2012, every winner has been a CNN! Without going into the details of their architectures (each would deserve a whole blog post of its own), the main improvement we can notice is their increasing depth. Let's briefly introduce them and their main characteristics.

ILSVRC Challenge results

2012

AlexNet, released by Alex Krizhevsky, popularized CNNs in computer vision. Its architecture is very similar to LeNet (introduced by LeCun in the 1990s for handwritten pattern recognition), but it is a larger and more complex network, able to learn complex objects with its 5 convolutional layers. It also introduced the use of ReLU as the activation function, dropout to limit overfitting, and max-pooling layers instead of plain pooling.
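
To make this concrete, here is a rough PyTorch sketch of an AlexNet-like network (an illustration for this post, not the original implementation): five convolutional layers with ReLU activations and max-pooling, followed by a classifier that uses dropout.

```python
import torch.nn as nn

# Illustrative AlexNet-style feature extractor: 5 convolutions,
# ReLU activations, and overlapping max-pooling.
features = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(192, 384, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(384, 256, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
)

# Classifier with dropout to limit overfitting (for a 224x224 input,
# the feature maps are 256 x 6 x 6 at this point).
classifier = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(256 * 6 * 6, 4096),
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096),
    nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),  # 1000 ImageNet classes
)
```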

2013

ZFNet, the ILSVRC 2013 winner, is essentially a refined version of AlexNet: it tunes the hyper-parameters by increasing the size of the middle convolutional layers and by reducing the stride and filter size of the first layer.
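
That headline change is easy to express in code. Assuming PyTorch layers purely for illustration, the first layer goes from large 11x11 filters with stride 4 to smaller 7x7 filters with stride 2, which keeps more spatial detail in the first feature maps.

```python
import torch.nn as nn

# AlexNet's first layer: large 11x11 filters, stride 4
first_layer_alexnet = nn.Conv2d(3, 96, kernel_size=11, stride=4)

# ZFNet's adjustment: smaller 7x7 filters, stride 2,
# which preserves more spatial information early in the network
first_layer_zfnet = nn.Conv2d(3, 96, kernel_size=7, stride=2)
```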

Overfeat, also released that year, presents another extension of AlexNet, adding an algorithm that learns to predict bounding boxes around objects, handling detection as well as classification.

2014

GoogleNet was the winner of this edition, introducing a new module: Inception. It significantly reduced the number of parameters the network has to handle (from 60M for AlexNet down to 4M). It also removes the fully connected layers, which usually account for a large share of the parameters.
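
The idea of the Inception module is to run several convolutions of different sizes in parallel and concatenate their outputs, using 1x1 convolutions to keep the channel counts (and thus the parameters) small. Here is a minimal PyTorch sketch, with made-up channel parameters rather than the published configuration:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Simplified Inception module: parallel 1x1, 3x3 and 5x5 convolutions
    plus a max-pooling branch, concatenated along the channel axis."""

    def __init__(self, in_ch, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, c3_reduce, kernel_size=1),  # 1x1 reduces channels first
            nn.Conv2d(c3_reduce, c3, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, c5_reduce, kernel_size=1),
            nn.Conv2d(c5_reduce, c5, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1),
        )

    def forward(self, x):
        branches = [self.branch1(x), self.branch3(x),
                    self.branch5(x), self.branch_pool(x)]
        return torch.cat(branches, dim=1)  # concatenate along the channel axis
```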

The VGGNets, VGG16 and VGG19 (16 and 19 weight layers respectively), from Oxford, were also strong competitors in the 2014 edition. Their innovation is to stack long sequences of small convolutions, and they proposed several new CNN configurations.
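
Their building block is simple enough to sketch: stacks of 3x3 convolutions followed by max-pooling, repeated with more and more channels. The helper below is only an illustration of the VGG16 layout, not the original code.

```python
import torch.nn as nn

def vgg_block(in_ch, out_ch, num_convs):
    """One VGG-style block: a stack of 3x3 convolutions with ReLU,
    closed by a 2x2 max-pooling that halves the spatial resolution."""
    layers = []
    for i in range(num_convs):
        layers.append(nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                                kernel_size=3, padding=1))
        layers.append(nn.ReLU(inplace=True))
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# VGG16's convolutional part chains five such blocks (2, 2, 3, 3, 3 convolutions)
features = nn.Sequential(
    vgg_block(3, 64, 2),
    vgg_block(64, 128, 2),
    vgg_block(128, 256, 3),
    vgg_block(256, 512, 3),
    vgg_block(512, 512, 3),
)
```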

2015

ResNet was the winning network in 2015, introducing the "skip connection": the output of a sequence of convolutional layers is added to its input. Combined with 1x1 convolutional layers that reduce and then restore the number of feature channels, this design can be extended to an incredible number of layers: the best version has 152 layers.
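
A sketch of this bottleneck block makes the trick clear: a 1x1 convolution shrinks the number of channels, a 3x3 convolution does the work, a second 1x1 convolution expands the channels back, and the block's input is added to its output. The PyTorch code below is a simplified illustration (it assumes the input and output channel counts match), not the exact published block.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """Simplified ResNet bottleneck: 1x1 reduce -> 3x3 -> 1x1 expand,
    with the input added back to the output (the skip connection)."""

    def __init__(self, channels, reduced):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1),
            nn.BatchNorm2d(reduced),
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, reduced, kernel_size=3, padding=1),
            nn.BatchNorm2d(reduced),
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Skip connection: the block learns a residual on top of its input
        return self.relu(self.body(x) + x)
```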

2016

CUImage is the most recent winner. The main improvement of their system is the "ensemble" approach, which merges what several networks have learned into a single model. Using this, they were able to combine a Gated Bi-Directional network with Fast R-CNN to perform both detection and classification.
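
Their pipeline is far more sophisticated than this, but the basic idea of an ensemble can be sketched in a few lines: run several independently trained models on the same input and average their predictions.

```python
import torch

def ensemble_predict(models, images):
    """Naive ensemble: average the class probabilities predicted by
    several models trained separately on the same task."""
    probs = [torch.softmax(model(images), dim=1) for model in models]
    return torch.stack(probs).mean(dim=0)
```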

Evolution of depth, error-rate, and number of parameters over the years

From AlexNet, the first efficient CNN, which has 5 convolutional layers, to CUImage, the winner of the 2016 edition, which uses a network with 269 layers, we can definitely speak of a real revolution in depth. Even Google's paper about its Inception module is called "Going deeper with convolutions"!

Why do we keep going deeper?

So the first question is: why do we have to go deeper?

The first reason is that a deeper model convolves the input data more. When a network applies a convolution to its input, it extracts relevant features (mostly edges, shapes, colors, etc.). Allowing the network to perform more convolutions lets it extract with more precision the features it "judges" relevant for the dataset.

The second main reason is simply that we can. Recent advances in computing, in particular fast GPUs, have made it possible to train much deeper and more complex networks than in the past. A deep network intuitively means a large number of parameters, and most of the current best models were impossible to train a few years ago (not enough memory, training time too long, ...).

But in practice, the depth of an architecture also has some drawbacks. Mainly, a network that is too deep is, in a way, too good. Let's take ResNet as an example. One of the goals of He et al.'s paper was to measure how much depth actually helps a deep learning network. By introducing the notion of residual networks, they were able to extend a simple model such as VGG without increasing the number of parameters too much. They then tried four different extensions: ResNet-50, 101 and 152 (which respectively have 50, 101 and 152 convolutional layers) and an aggressively deep version with 1202 layers. While accuracy increased overall across the first three models, the last version did not bring significant improvements, even though its error rate on the training set almost reached 0%. These results suggest that the depth of a network may make it learn "too well" (in other words, overfit), and that depth alone may not be the most efficient way to obtain a generalizable model.

Much recent research has tried to deal with these issues, either to keep going deeper or to solve this overfitting problem. The Inception module used in GoogleNet has been upgraded three times, up to Inception-v4, a complex version which now includes residual connections. Another module is the multi-residual network, which parallelizes small residual networks. Other works (such as CUImage, presented above) build an ensemble of models: the models all learn to do the same task separately, and are then merged into a single model, called an "ensemble". This method helps generalization because it merges distinct learnings, and therefore distinct features.

All of these methods are obviously interesting, and they all improve the accuracy, and sometimes the speed, of the previous best models. But my feeling is that this shouldn't be the way to do it. Building a new module to serve as the new unit of our deep CNNs is interesting, but it won't lead to any fundamental breakthrough. Even with more layers, or more powerful layers, our network won't be able to do anything more than "connect" a set of features to a label taken from a fixed set of categories.

An alternative to deepness

Another solution to improve the accuracy and generalization without changing the model is, obviously, to increase the amount of data. But it is not that easy…

It is the most intuitive solution, but it is also really expensive in time and/or money. Furthermore, it would be completely wrong to assume that we may one day have enough data to train a model able to detect objects with 100% accuracy, since there is an infinite diversity of possible scenarios (and therefore an infinite quantity of possible data) in the real world.

One good point here is that we have spent the last two decades, and even longer, collecting all the data we could. Let's keep the computer vision example: if we want a model able to recognize a bird with 100% accuracy, we would be glad if it could recognize a bird in any of the bird pictures on Google Images or Pinterest, right? With all these images, it should be pretty easy to detect anything in many situations.

But in practice, even these two platforms are not good enough, because of, again, the huge complexity of our world. These data are too noisy and depend on too many distinct contexts for our current models, which are limited to a fixed number of categories.

So what would be another solution? Generating the data. If paying someone to label our data is too expensive in time and money, as we said before, we can still ask a computer to do it. In fact, being able to feed our model with data generated by another model would be the best solution, but to do this, we would need a model able to label the images by itself... and that is exactly what we are trying to build.

It's not easy to escape this circularity, but a first, imperfect method exists and is commonly used: data augmentation. Instead of generating the label for an image, it takes the currently labeled dataset and applies some basic modifications to it (cropping, flipping, etc.). But it is not applicable to everything (rotating an image of the digit 6 would give the digit 9). Furthermore, it increases accuracy slightly, but it won't extend the dataset infinitely. Another method would be to build a really complex, slow and heavy model (using the ensemble approach presented above, for example) to generate new data with high confidence. But again, this would only bring slight improvements, which may converge really early.
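
As an illustration, a typical augmentation pipeline (here with torchvision, purely as an example) generates new training samples by randomly cropping, flipping and slightly recoloring the images we already have labels for:

```python
from torchvision import transforms

# Basic augmentation: random crops, horizontal flips and small color changes.
# Note that not every transformation is safe (as said above, rotating a 6
# turns it into a 9), so the pipeline has to respect the labels.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
```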

So what should we do?

Well, this is the hard part!

To sum up, my view is that classification, detection and segmentation should not be done the way we currently do them. A model, even a really good one, taking advantage of Inception, residual networks and the next innovations, won't be flexible in new situations and won't generalize the way we humans do. At best, it will be perfectly able to find the label of any object belonging to one of the n known categories in a picture, to locate it, and even to segment it. For that specific task, it will be better than us. But let's consider the following picture:

This picture is really fun for us humans, but it is really complicated for an AI. Detecting the humans would be easy here, and detecting the scale too. Some actions can also be identified (people laughing, a man leaning on the scale), but that would require a larger dataset. Recognizing specific people (Obama here, for example) would require an even larger dataset, and so on. Detecting all of this within a single model is a completely different challenge. Andrej Karpathy explained this in detail really well a few years ago in a well-known blog post, listing most of the steps an AI would have to go through to understand why this picture is funny to us. He concluded by saying that we are basically really, really far from building a model able to understand it as we do.

It is complicated to say which direction to take, but in my opinion, the biggest gap in our current computer vision models is that they are not abstract. In the picture above, they would be able to detect the features that "build" a person, so they would guess there is a person there, but they would not see that person as an entity able to interact with the rest of the picture.

Computer vision as I imagine it would have built its own conception of the world, and would be able to reproduce an image in this virtual environment to understand what it is about, what happened before and what could happen after. It would be able to detect that a new, unknown object O1 (let's say an ottoman) is physically really close to an object O2 (a ball), but contextually close to an object O3 (a chair, because people usually sit on it), and so on. Building a map of relations like this would give a more flexible architecture, able to learn unknown things, and maybe even to build the world in a way we never thought about.
