Image Classification Architectures review

This blog post gives a brief overview of how image classification architectures have evolved from AlexNet to SENet.

13 min read · Jul 19, 2018

This blog post assumes that the reader has some knowledge of deep learning, especially topics like convolutions, pooling, activation functions and backpropagation. The intent of this post is to make readers aware of how neural architectures have evolved over time and what to look for going forward.

Though we will look at image classification results in this blog post, these architectures can also be used as back-ends for object detection, recognition and segmentation.

We will discuss the following architectures:

AlexNet (2012)
ZFNet (2013)
VGGNet (2014)
GoogleNet (2014)
ResNet (2015)
ResNeXt (2016)
DenseNet (2016)
SENet (2018)

Some of the architectures, like GoogleNet, have evolved over time, for example with the introduction of residual functions in the ResNet paper; I will mention these at the respective places. There are also architectures like SqueezeNet that were developed to be mobile friendly. I will not be discussing these, as they are trimmed-down versions of the architectures discussed above, designed to be faster.

Short note on ImageNet Dataset.

ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories.
The images were collected from the web and labeled by human labelers using Amazon’s Mechanical Turk crowd-sourcing tool.
Starting in 2010, as part of the Pascal Visual Object Challenge, an annual competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held.
ILSVRC uses a subset of ImageNet with roughly 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images.
As of 2018, the ImageNet classification challenge has been discontinued, as the present solutions have reached human-level accuracy. The community is now focused on object detection and segmentation challenges, which are still at a nascent stage.
For people who are trying to collect their own datasets, it is often good to search the ImageNet database before turning to other sources. It is free.

Let’s review the architectures now.

AlexNet [Paper Link]

Written by Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton, this paper has garnered more than 22,000 citations since its publication in 2012 and is often described as the most influential paper in the deep learning community.
It contains five convolutional and three fully-connected layers.
They used the non-saturating ReLU nonlinearity f(x) = max(0, x), which later became the gold-standard choice of activation function. In the paper's experiment, a network with ReLUs reached the same training error about 6 times faster than an equivalent network with a saturating nonlinearity like tanh.
They used GTX 580 GPUs with 3GB of memory, and the network is spread across two such GPUs. If you are implementing this architecture now, you need not worry about parallelization across GPUs: GPU architectures have evolved considerably since 2012, and this network can be trained much faster on a single GPU (for example, a GTX 780).
Local Response Normalization and overlapping pooling were introduced. Overlapping pooling uses a stride smaller than the kernel size (in the paper, stride = 2 and kernel size = 3).
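
To make the overlapping pooling idea above concrete, here is a minimal one-dimensional sketch in plain Python. The function name and the sample row are made up for illustration; only the kernel size of 3 and stride of 2 come from the paper.

```python
def max_pool_1d(values, kernel, stride):
    """Slide a window of `kernel` elements over `values`, moving `stride` at a time."""
    pooled = []
    for start in range(0, len(values) - kernel + 1, stride):
        pooled.append(max(values[start:start + kernel]))
    return pooled

row = [1, 5, 2, 8, 3, 7, 9]  # hypothetical activations along one image row

# Overlapping: stride (2) < kernel (3), so neighbouring windows share one element.
overlapping = max_pool_1d(row, kernel=3, stride=2)

# Non-overlapping: stride == kernel, so the windows are disjoint.
non_overlapping = max_pool_1d(row, kernel=2, stride=2)

print(overlapping, non_overlapping)
```

Both variants produce three outputs here, but each overlapping window sees one extra neighbouring value.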
Network Architecture

AlexNet architecture. We can now remove the sparse connections between conv layers, as enough GPU memory is available today.

[Diagram 1] ->Param count and Architecture diagram

The network has 60 million parameters in total, compared to only 1.2 million training images. To prevent over-fitting, they used dropout and augmentation of the input images. Dropout is used between the fully connected layers, and it roughly doubles the number of iterations required to converge.

Results: Pre-trained on ImageNet (15 million images, 22,000 categories) and later trained on the ILSVRC-2012 dataset, the network achieved a top-5 error rate of 15.3% and stood first in the challenge. The second-placed team had an error rate of 26.2%.
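
As a rough illustration of dropout, here is a small plain-Python sketch. Note the assumption: it uses the "inverted" formulation that is common today (survivors are rescaled during training), whereas the paper rescales at test time instead. The function name and the drop probability of 0.5 are illustrative choices.

```python
import random

def dropout(activations, p_drop, training, rng):
    """Inverted dropout: zero each unit with probability p_drop while training,
    and scale the survivors by 1 / (1 - p_drop) so the expected value is unchanged."""
    if not training or p_drop == 0.0:
        return list(activations)  # at inference time every unit is kept as-is
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so repeated runs drop the same units
hidden = [1.0] * 8      # hypothetical outputs of a fully connected layer
print(dropout(hidden, p_drop=0.5, training=True, rng=rng))
print(dropout(hidden, p_drop=0.5, training=False, rng=rng))
```

Because a different random subset of units is silenced at every iteration, the layer cannot rely on any single unit, which is the regularizing effect described above.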

ZFNet

Written by Matthew D. Zeiler and Rob Fergus from New York University, this paper has received more than 3,400 citations since its publication. They made simple improvements to the network, analyzed what the architecture is learning, and were among the first to demonstrate transfer learning. We will look into each of these things.
They used the same AlexNet architecture with a smaller first-layer kernel size (7x7 instead of 11x11) and stride (2 instead of 4).
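
The effect of the smaller kernel and stride can be seen with the standard output-size formula for a convolution. This is only a sketch: the input width of 224 and the absence of padding are assumptions made for the example, not values taken from the text above.

```python
def conv_output_size(n, kernel, stride, padding=0):
    """Spatial size of a conv layer's output along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1

input_width = 224  # assumed input width, for illustration only

alexnet_style = conv_output_size(input_width, kernel=11, stride=4)
zfnet_style = conv_output_size(input_width, kernel=7, stride=2)

print(alexnet_style, zfnet_style)
```

With the smaller stride, the first layer samples the image roughly twice as densely in each dimension, so less fine detail is skipped early in the network.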

Training took 12 days on a single GTX 580 GPU.
The paper also discusses at length how to visualize what the layers are learning, using a technique called Deconvnet.

Result: This architecture, when combined with 6 other similar models, achieved a top-5 test error rate of 14.8%.

VGGNet:

Written by Karen Simonyan and Andrew Zisserman, this paper has garnered more than 10,000 citations since it was published. It addresses an important aspect of convnet architecture design: depth.
The input image is passed through a stack of conv layers with filters of a very small receptive field, 3x3, and the conv stride is fixed at 1. Spatial pooling is carried out by 5 max pooling layers. The spatial resolution is preserved after convolution by padding the input. The stack of conv layers is followed by FC layers. All hidden layers are equipped with ReLU. They tested the network at different depths.

Using small filters also significantly reduces the number of parameters. For example, a stack of two 3x3 conv layers has an effective receptive field of 5x5; three such layers have a 7x7 effective receptive field. Now, considering a stack of three 3x3 conv layers with C channels, we get 3(3²C²) = 27C² params, whereas a single 7x7 conv layer contains 7²C² = 49C² params. Using a stack of 3 conv layers also incorporates 3 non-linear rectification layers instead of a single one, which makes the decision function more discriminative.
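
The arithmetic above can be checked with a few lines of Python. This is a small sketch: the helper names are made up, biases are ignored, and C = 256 is just an example channel count.

```python
def conv_params(kernel, channels_in, channels_out):
    """Weight count of a kernel x kernel conv layer (biases ignored)."""
    return kernel * kernel * channels_in * channels_out

def stacked_receptive_field(kernel, num_layers):
    """Effective receptive field of stacked stride-1 convs:
    each additional layer widens the field by (kernel - 1)."""
    return 1 + num_layers * (kernel - 1)

C = 256  # example channel count

three_3x3 = 3 * conv_params(3, C, C)  # 27 * C^2
one_7x7 = conv_params(7, C, C)        # 49 * C^2

print(stacked_receptive_field(3, 2), stacked_receptive_field(3, 3))
print(three_3x3, one_7x7, round(one_7x7 / three_3x3, 2))
```

The stack of three 3x3 layers covers the same 7x7 region with a little over half the weights of the single 7x7 layer.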

Results:

Networks A and A-LRN test whether LRN is important: their top-5 error rates are 10.4% and 10.5% [LRN is not required and was removed in further experiments].
Networks D and C test whether the 1x1 conv layer is important, with top-5 error rates of 8.1% and 8.8%. They chose 3x3 conv layers for further experiments with very deep networks.
Networks D and E have 16 and 19 layers [and are still used in practice by deep learning engineers]. They have top-5 error rates of 8.1% and 8.0%.
The authors also tested several training image scales [256, 384, and a jittered range 256:512] and found that training with jittered scales improved the performance of the model.
Evaluating model E at multiple input image scales [256, 384, 512] achieved an error rate of 7.5%.
Additionally using multiple crops of the image decreased the error rate to 7.1%.
Applying both of the above to model D and ensembling it with E achieved an error rate of 6.8%, and the team stood 2nd in the ILSVRC-2014 challenge.

GoogleNet [Paper Link]

Christian Szegedy from Google, along with several other researchers, published this paper, which has garnered 7,000 citations to date.
It was developed independently, in parallel with VGGNet.
By this time, people had come to understand that larger nets improve model performance. We can increase size either by depth (number of network levels) or by width (number of units at each level).
A larger network has substantially more parameters, and with limited training sets this leads to over-fitting. It also increases the computational requirements [when conv layers are chained, a uniform increase in the number of their filters results in a quadratic increase in computation].

They replaced a simple conv layer with an inception module. An inception module contains 3*3 conv, 5*5 conv, 3*3 max pooling and 1*1 conv branches, each acting independently on the feature maps of the previous layer. In the end, all the feature maps are concatenated.
Back then, with very little GPU memory, running a 5*5 kernel across feature maps of depth 512 was very expensive, so they used 1*1 convolutions to reduce the depth first.
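
To get a feel for how much the 1*1 reduction saves, here is a back-of-the-envelope sketch that counts multiply-adds. Apart from the input depth of 512 mentioned above, every number (the 14x14 map, 64 output filters, reduction to 24 channels) is a hypothetical choice for illustration, not a value from the paper.

```python
def conv_mult_adds(height, width, kernel, channels_in, channels_out):
    """Multiply-adds of a stride-1, 'same'-padded conv layer (biases ignored)."""
    return height * width * kernel * kernel * channels_in * channels_out

H = W = 14         # hypothetical feature-map size
depth_in = 512     # input depth, as in the text above
filters_5x5 = 64   # hypothetical number of 5*5 filters
reduced = 24       # hypothetical depth after the 1*1 reduction

# 5*5 conv applied directly on all 512 input channels.
direct_cost = conv_mult_adds(H, W, 5, depth_in, filters_5x5)

# 1*1 conv to shrink the depth first, then the 5*5 conv on the thinner input.
reduced_cost = (conv_mult_adds(H, W, 1, depth_in, reduced)
                + conv_mult_adds(H, W, 5, reduced, filters_5x5))

print(direct_cost, reduced_cost, round(direct_cost / reduced_cost, 1))
```

Under these assumed sizes the reduced path needs roughly a sixteenth of the computation, which is why the 1*1 layers made wide modules affordable.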

Caution: I haven’t understood the authors’ intuition behind the idea of an inception module. If any reader has understood the paper, please comment. For practitioners, just try this network on your dataset and check the performance.

GoogleNet was the winner of the 2014 ILSVRC classification challenge with a top-5 error rate of 6.67%. They used an ensemble of 7 versions of GoogleNet to achieve this.

ResNet (Residual Networks) [Paper Link]

Written by Kaiming He and his team at Microsoft, this paper has garnered nearly 8,000 citations to date and has been one of the most influential papers in the deep learning community. Residual mappings (which we will see later) have been key to making deep neural networks work.
ResNet-34 achieved a top-5 validation error of 5.71%. ResNet-152 achieved a top-5 validation error of 4.49%. An ensemble of 6 models with different depths achieved a top-5 test error of 3.57%, winning 1st place in ILSVRC-2015.
I have written a two-part blog post series explaining ResNet [Here]. I suggest the reader go through the link for a detailed understanding of how the authors arrived at this.

By this time, researchers thought that adding more layers would improve the accuracy of the models. But there are two problems associated with it.

Vanishing gradient problem: somewhat solved with techniques like batch normalization etc.
The authors observed that adding more layers didn’t improve the accuracy. This is not over-fitting either, as the training error also increases.

The basic intuition is that each stack of conv layers should learn some desired mapping H(x) of its input x. Instead of learning H(x) directly, we can let the layers learn the residual F(x) = H(x) - x, so that the desired mapping becomes H(x) = F(x) + x.
Now suppose we increase the number of layers, say from 18 to 21. In the worst-case scenario, even if layers 19, 20 and 21 don’t have anything to learn, they could simply act as an identity mapping, H(x) = x, and both networks should give the same accuracy. But this didn’t happen in practice.
This observation motivated the authors to use the residual function, where the input x is added to the output of the layers through a skip connection.

Now if the extra layers of the deeper (21-layer) variant really don’t have anything to learn, they output zero (F(x) = 0) and the input to the block becomes its output (x). F(x) is a set of conv layers, so the reader might ask why we don’t let these layers learn H(x) = x directly instead of adding an identity skip connection. It is because it is very difficult for a set of non-linear layers to reproduce their input (x), compared to outputting 0. In simple terms, it is easier to find a function F(x) = 0 than F(x) = x when the function is a set of non-linear layers.
The vanishing gradient problem is also addressed by this design, and is kind of explained here (read only the identity mapping section).
ResNet variants like ResNet-50, 101 and 152 are now used extensively in the community, because implementations and pre-trained weights are available in all the frameworks.
ResNet is also used as the back-end for object detection and segmentation. Facebook’s Detectron has made a lot of pre-trained weights available for this.
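
The skip-connection idea from this section can be sketched in a few lines of plain Python. This is a toy under stated assumptions: small dense layers stand in for the conv layers, the helper names are invented, and the ReLU that real ResNet blocks apply after the addition is left out so that the identity case is exact.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]

def linear(vector, weights):
    """Toy stand-in for a conv layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def residual_block(x, w1, w2):
    f_x = linear(relu(linear(x, w1)), w2)      # F(x): what the layers learn
    return [f + xi for f, xi in zip(f_x, x)]   # block output = F(x) + x

x = [1.0, -2.0, 3.0]
zero_weights = [[0.0] * 3 for _ in range(3)]

# Layers with nothing to learn only need all-zero weights to pass x through.
print(residual_block(x, zero_weights, zero_weights))
```

Without the skip connection, the same layers would have to reproduce x through two weight matrices and a ReLU, which is a much harder target than zero.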

ResNeXt [Paper Link]

Written by Saining Xie, Kaiming He and their co-authors from UC San Diego and Facebook AI Research, this paper has garnered more than 200 citations and has been an important architecture for DL practitioners.
ResNeXt won 2nd place in the ILSVRC 2016 classification task and also showed performance improvements over its ResNet counterpart in the COCO detection and segmentation challenges. Its single model achieved a 3.7% error rate on the ImageNet challenge, and an ensemble achieved 3.03% error.
The paper is simple to read and introduces a new term called cardinality: the number of parallel paths inside a block.

It follows a split-transform-aggregate strategy, and all the paths share the same topology.
It looks very similar to an inception module but, unlike the inception module, the ResNeXt topology is uniform across the depth.
I have written a detailed blog post on this here.
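
One way to see what cardinality buys is to compare weight counts. The sketch below assumes a 256-channel block, a ResNet bottleneck of width 64, and a ResNeXt block with 32 paths of width 4; these are the sizes I recall from the ResNeXt paper's example, so treat them as assumptions. Helper names are invented and biases are ignored.

```python
def bottleneck_params(channels, width):
    """Weights of a 1x1 reduce -> 3x3 -> 1x1 expand path (biases ignored)."""
    return channels * width + 3 * 3 * width * width + width * channels

def resnext_block_params(channels, cardinality, width_per_path):
    """Split-transform-aggregate: `cardinality` identical narrow paths, summed."""
    return cardinality * bottleneck_params(channels, width_per_path)

resnet_block = bottleneck_params(channels=256, width=64)
resnext_block = resnext_block_params(channels=256, cardinality=32, width_per_path=4)

print(resnet_block, resnext_block)
```

Both blocks cost about 70k weights, so the comparison is between one wide path and many narrow ones at roughly equal complexity.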

DenseNet [Paper Link]

The name refers to Densely Connected Convolutional Networks. Written by Gao Huang, Zhuang Liu, Laurens van der Maaten and Kilian Q. Weinberger, the paper has garnered around 700 citations.
By this time, deep learning, which started as an attempt to replicate human vision, had turned into an engineering practice: people were trying different things while looking at accuracies, network weights and feature visualizations. There was a saying around this period that the quality of accepted NIPS conference papers had degraded, with researchers putting a lot of effort into engineering these networks instead of trying something different. Having said all of this, this paper is still a good read and has improved accuracy on image classification challenges.
DenseNet is very similar to ResNet, with two important changes.

Instead of adding up the features as in ResNet, they concatenate the feature maps.
Instead of adding just one skip connection, they add a skip connection from every previous layer. That is, in a ResNet architecture we add up the features of layers (l-1) and l. In DenseNet, for the l-th layer we concatenate the features from all of the layers [1 ... (l-1)].

The authors write: “DenseNet layers are very narrow (e.g., 12 filters per layer), adding only a small set of feature maps to the “collective knowledge” of the network and keep the remaining feature maps unchanged, and the final classifier makes a decision based on all feature-maps in the network.”

Each dense block contains a set of conv blocks, depending on the architecture depth (a conv block is called a bottleneck layer if it contains a 1*1 conv layer to reduce features, followed by a 3*3 conv layer). The following graph shows DenseNet-121, DenseNet-169, DenseNet-201 and DenseNet-264.

Growth rate defines the number of feature maps each layer adds to its dense block. The above diagram uses a growth rate of 32. This kind of architecture greatly reduces the number of parameters we train. For example, a ResNet conv block may contain [96, 96, 96] feature maps inside a block, whereas DenseNet contains [32, 64, 96]. The authors claim, and it has been supported by another paper on stochastic depth networks (where a ResNet architecture is trained by randomly dropping a few layers every iteration), that most of the features learned by ResNet are redundant. The authors’ basic intuition is also that connecting layers in this way, where each layer receives the features of all the previous layers, will not allow it to learn redundant features.
The transition layers in the network reduce the number of feature maps to \theta x m, where m is the number of feature maps coming out of the dense block and \theta takes values in (0, 1]. When \theta < 1 for the transition layers, the model is called DenseNet-C; when bottleneck layers are used as well, it is called DenseNet-BC.
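
The channel bookkeeping of a dense block can be sketched as follows. This assumes the DenseNet paper's reading of growth rate (each layer contributes that many new feature maps, concatenated onto everything before it). The starting width of 64, the 6 layers and \theta = 0.5 are example values, and the helper names are invented.

```python
import math

def dense_block_channels(channels_in, growth_rate, num_layers):
    """Channels seen by each layer of a dense block, and the block's output width."""
    per_layer_inputs = [channels_in + i * growth_rate for i in range(num_layers)]
    channels_out = channels_in + num_layers * growth_rate
    return per_layer_inputs, channels_out

def transition_channels(m, theta):
    """A transition layer keeps floor(theta * m) of the m incoming feature maps."""
    return int(math.floor(theta * m))

layer_inputs, block_out = dense_block_channels(channels_in=64, growth_rate=32, num_layers=6)
after_transition = transition_channels(block_out, theta=0.5)

print(layer_inputs, block_out, after_transition)
```

Each layer only produces 32 new maps, yet the last layer in the block sees 224 input channels because everything before it is concatenated; the transition layer then halves the width before the next block.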

Results:

It uses about 3x fewer parameters than ResNet for a similar number of layers.
Using the same hyperparameter settings as for the ResNet architectures and replacing ResNet's bottleneck layers with dense blocks, the authors saw similar performance on the ImageNet dataset. On CIFAR-10, CIFAR-100 and other datasets, DenseNet blocks showed incremental improvements.

SE-Net [Paper Link]

Written by Jie Hu, Li Shen and Gang Sun, SENet is a wrapper around a lot of existing networks like ResNet; we will discuss the wrapper soon. Released recently, this paper has already garnered 70+ citations.
An ensemble of SENets with a standard multi-scale and multi-crop fusion strategy obtained a top-5 error of 2.251% on the test data in the ILSVRC-2017 classification challenge.
Till this point, everyone was concentrating on stacking layers in different ways to improve accuracy. One possible mistake we usually make when convolving is that we aggregate the information from all channels, sum it up and pass it forward; in doing so, we miss out on a lot of information about channel dependencies. To put it as a question: are all the feature maps I have learned really important? This paper investigated a new aspect of architectural design, the channel relationship, by introducing a new architectural unit termed the “Squeeze-and-Excitation” (SE) block.

The SE block tries to use global information to selectively emphasize informative features and suppress less useful ones. In literal terms, it tries to assign a weight to each and every feature map in the layer.

ResNet Module and corresponding SE-ResNet Module

The squeeze operation does a channel-wise global average pooling. This aggregates each channel's spatial information into a single value, giving a lower-dimensional channel descriptor.
The excitation operation needs to decide which of the feature maps are really important or carry signal. This is learned using two FC layers with a ReLU layer in between. In the end, a sigmoid layer is applied. The sigmoid activations act as channel weights adapted to the input-specific descriptor. The SE block intrinsically introduces dynamics conditioned on the input, helping boost feature discriminability.
The SE block can be used with any standard architecture. The authors tested it on ResNet, ResNeXt, Inception, Inception-ResNet etc.
There is only a very small increase in params and computations (GFLOPs) because of the extra FC layers and pooling operations respectively.
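
Here is a minimal plain-Python sketch of an SE block, assuming the squeeze step is a global average pool and the excitation step is FC, ReLU, FC, sigmoid as in the SE paper. The function names, the tiny 2-channel input and the zero weights are all made up for illustration, and biases are ignored.

```python
import math

def squeeze(feature_maps):
    """Global average pooling: one scalar per channel."""
    return [sum(sum(row) for row in fm) / (len(fm) * len(fm[0])) for fm in feature_maps]

def fully_connected(vector, weights):
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def excite(descriptor, w_reduce, w_expand):
    """FC (C -> C/r), ReLU, FC (C/r -> C), sigmoid: one weight in (0, 1) per channel."""
    hidden = [max(0.0, h) for h in fully_connected(descriptor, w_reduce)]
    return [1.0 / (1.0 + math.exp(-s)) for s in fully_connected(hidden, w_expand)]

def se_block(feature_maps, w_reduce, w_expand):
    scales = excite(squeeze(feature_maps), w_reduce, w_expand)
    return [[[s * v for v in row] for row in fm] for fm, s in zip(feature_maps, scales)]

def se_extra_params(channels, reduction):
    """Weights added by the two FC layers of one SE block."""
    return 2 * channels * channels // reduction

maps = [[[1.0, 2.0], [3.0, 4.0]],   # channel 0, 2x2
        [[0.0, 0.0], [0.0, 8.0]]]   # channel 1, 2x2
w_reduce = [[0.0, 0.0]]             # C=2 -> C/r=1 (untrained, all zeros)
w_expand = [[0.0], [0.0]]           # C/r=1 -> C=2

print(squeeze(maps))
print(se_block(maps, w_reduce, w_expand))
print(se_extra_params(channels=256, reduction=16))
```

With all-zero weights the sigmoid outputs 0.5 for every channel, so each map is simply halved; trained weights would instead give each channel its own input-dependent scale.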

Ending notes:

GoogleNet continued to evolve over time. With the introduction of the residual function, Inception-ResNet-v2 came out. Similarly, an inception variant of SENet has also been proposed. Similar things happened with the ResNet architecture.
I haven’t discussed Dual Path Networks. Maybe I will do it in a second iteration.
Andrew Ng, in his deeplearning.ai course, talks about high bias and high variance cases. Do change your architecture if you are facing high bias.

This is it. From 2012 to 2018, we have seen how the image classification problem has been tackled using different intuitions. The choice of an architecture depends on a lot of factors and is out of scope for this blog post. For a general practitioner, I suggest starting with simple ResNets and then maybe SENet; these have worked for me.

Having discussed so much about architectures, I do feel the crux of deep learning is in how we train these networks. That’s the question I have been experimenting with for some time. I will release a few things over a period of time. Stay tuned.

The “pretrainedmodels” GitHub repo has implementations of all these architectures in PyTorch, along with the pre-trained weights. It’s quite simple to use.

I would like to thank Soumendra P and Prathamesh Sarang from the AI Journal team for their support and valuable feedback. I would also like to thank the Fractal Deep Learning Image team for their support and for rectifying a lot of mistakes. Please clap and share if you find the blog post useful.