Evolution of Convolutional Neural Network Architectures

Aaryan Gupta
The PEN Point
May 17, 2020
Convolutional Neural Networks: an architectural overview | Source

AI has been gathering tremendous support lately for bridging the gap between humans and machines. Amazing discoveries in numerous fields are paving the way for state-of-the-art technologies. One field where we have seen tremendous improvement is Computer Vision, and Deep Learning has driven much of that progress through one particular model: Convolutional Neural Networks (CNNs).

The architecture of CNNs was inspired by the organization of the visual cortex in the human brain. A CNN is essentially a Deep Learning model that takes in images, assigns learnable weights to differentiate them from one another, and performs a given task such as image classification. Compared to hard-coded, primitive approaches, CNNs can learn the required filters on their own, given enough training. Over the years, a number of architectural developments have addressed computational efficiency, error rate, and further improvements in the domain. The following models were developed with these improvements and a number of other factors in mind, as discussed later. We have tried to maintain a chronological order for an easy read.

LeNet-5

LeNet-5 was the first “famous” CNN architecture, developed by LeCun et al. (1998) for the recognition of handwritten digits. LeCun and his fellow researchers had been working on CNN models for a decade before arriving at this efficient architecture. LeNet-5 is greatly responsible for inspiring deep learning researchers to develop the efficient CNN models we use these days.

LeNet-5 Architecture | Source : LeCun et al. (1998)
  • The simple architecture was as follows: INPUT -> CONV -> AVG_POOL -> CONV -> AVG_POOL -> FC -> FC -> OUTPUT
  • Trained on the MNIST database of handwritten digits.
  • It was a very shallow CNN by modern standards, with only about 60,000 parameters to train for an input image of dimensions 32x32x1.
  • As we go deeper into the model, the spatial dimensions of the feature maps tend to decrease, while the number of channels tends to increase.
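
To make the layer sequence concrete, here is a minimal PyTorch sketch of a LeNet-5-style network. Layer sizes follow the original architecture, but the paper's RBF output layer is simplified to a plain fully connected layer, as in most modern re-implementations:

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # 32x32x1 -> 28x28x6
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2),       # -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5),   # -> 10x10x16
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2),       # -> 5x5x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()
print(sum(p.numel() for p in model.parameters()))  # ~61,000 parameters
```

Note how the spatial size shrinks (32 -> 28 -> 14 -> 10 -> 5) while the channel count grows (1 -> 6 -> 16), exactly the pattern described above.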

AlexNet

Before the advent of AlexNet, CNNs were already among the most sought-after models for object recognition: they keep a firm grip on the problem of overfitting and are comparatively easy to train, with performance close to that of standard feedforward neural networks of the same size. Despite these qualities, they proved expensive to apply at large scale to high-resolution images, which was exactly the problem when ImageNet arrived.

AlexNet Architecture | Source: Krizhevsky et al. (2012)
  • Consists of eight layers: five convolutional layers and three fully connected layers.
  • Uses ReLU (Rectified Linear Units) in place of the tanh activation; a ReLU network reached 25% training error on CIFAR-10 about six times faster than an equivalent network using tanh.
  • Paved the way for multi-GPU training by splitting the neurons to be trained across two GPUs. This led to faster training times and allowed a bigger model to be trained.
  • Uses overlapping pooling, where neighboring pooling windows overlap instead of being kept disjoint, reducing the top-1 and top-5 error rates by 0.4% and 0.3%, respectively.
  • With 60 million parameters, overfitting became a serious problem. It was handled by dropout (dropping out neurons with a predetermined probability, here 50%) and data augmentation.
  • The model won the 2012 edition of the ImageNet competition with a top-5 error roughly 11 percentage points lower than that of the runner-up.
  • Though an amazingly powerful model, removing any of its convolutional layers drastically degrades its performance.
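
For a sense of the scale involved, torchvision ships a close single-GPU variant of this architecture (the original paper split the model across two GPUs); a minimal sketch with untrained weights:

```python
import torch
from torchvision import models

# The torchvision reference implementation of AlexNet (single-GPU variant).
model = models.alexnet(num_classes=1000)

# Roughly 61 million parameters, matching the figure cited above.
print(sum(p.numel() for p in model.parameters()))

# A forward pass on a dummy 224x224 RGB image.
x = torch.randn(1, 3, 224, 224)
print(model(x).shape)  # torch.Size([1, 1000])
```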

ZFNet

ZFNet was an improved version of AlexNet, proposed by Zeiler et al. (2013). The main reason ZFNet became widely popular is that it was accompanied by a better understanding of how CNNs work internally. Earlier, researchers were never fully sure why ConvNets worked so well for computer vision. With ZFNet came a novel visualization technique through a deconvolutional network, where deconvolution can be understood as the reconstruction of convolved features into a human-comprehensible visual form. Hence, it helped researchers see what their networks were actually learning.

ZFNet Architecture | Source: Zeiler et al. (2013)
An example of visualization of features in each layer | Source: Zeiler et al. (2013)
  • The architecture was similar to AlexNet, but 7x7 filters were used in the first layer instead of 11x11 filters, in order to prevent the loss of fine features.
  • Trained using mini-batch stochastic gradient descent.
  • Trained on the ImageNet dataset, achieving a top-5 accuracy of 85.2% and surpassing AlexNet.

Inception

Inception Networks were proposed by Szegedy et al. (2014) and brought along a novel concept of multitasking to CNNs. Aiming to reduce the computational cost of CNNs, this architecture suggested that instead of only building extensively deep networks, we can stack multiple convolutions within a single layer. The model also introduced the use of 1x1 filters for dimensionality reduction, in order to generate smaller intermediate layers. In simple words, an Inception module allows us to perform a 3x3 convolution, a 5x5 convolution, and max pooling simultaneously, pass the data through 1x1 convolutions (before the larger convolutions, and after the pooling), and finally concatenate the corresponding outputs along the channel dimension.

An inception module | Source: Szegedy et al. (2014)
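
To make the split-transform-merge idea concrete, here is a minimal PyTorch sketch of one such module. The channel counts are illustrative (they happen to match the first inception stage of GoogLeNet) rather than a definitive implementation:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.p1 = nn.Conv2d(in_ch, 64, kernel_size=1)  # 1x1 path
        self.p2 = nn.Sequential(                        # 1x1 reduce, then 3x3
            nn.Conv2d(in_ch, 96, kernel_size=1),
            nn.Conv2d(96, 128, kernel_size=3, padding=1),
        )
        self.p3 = nn.Sequential(                        # 1x1 reduce, then 5x5
            nn.Conv2d(in_ch, 16, kernel_size=1),
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
        )
        self.p4 = nn.Sequential(                        # pool, then 1x1 project
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1),
        )

    def forward(self, x):
        # All paths see the same input; outputs are concatenated channel-wise.
        return torch.cat([self.p1(x), self.p2(x), self.p3(x), self.p4(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionModule(192)(x).shape)  # torch.Size([1, 256, 28, 28])
```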

Inception networks paved the way for many other CNN architectures based on the same principles, such as GoogLeNet, Inception v3, Inception v4, and Xception, each with some changes in the architecture. GoogLeNet is discussed below.

GoogLeNet

GoogLeNet was proposed by Szegedy et al. in 2015 as the initial version of the Inception architecture; the model put forward state-of-the-art image classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14) and secured first place in the competition. In place of fully connected layers, it uses global average pooling, with 1x1 convolutions in the middle of the network.

The GoogLeNet network | Source: Szegedy et al. (2015)
  • The network is 22 layers deep (27 layers if pooling is included); a very deep model compared to its predecessors!
  • Auxiliary classifiers are attached to intermediate layers during training to improve gradient flow. Each consists of: an average pooling layer with a 5x5 filter size and stride 3; a 1x1 convolution with 128 filters for dimensionality reduction and rectified linear activation; a fully connected layer with 1024 units and ReLU; and a linear layer with softmax, used for classification.
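
Global average pooling, which replaces the bulky fully connected layers of earlier networks, is easy to picture in code; a small sketch with illustrative shapes:

```python
import torch
import torch.nn as nn

# Global average pooling collapses each feature map to a single number,
# so the classifier head needs far fewer parameters than stacked FC layers.
feature_maps = torch.randn(1, 1024, 7, 7)       # output of the last conv stage
pooled = nn.AdaptiveAvgPool2d(1)(feature_maps)  # -> (1, 1024, 1, 1)
logits = nn.Linear(1024, 1000)(pooled.flatten(1))
print(logits.shape)  # torch.Size([1, 1000])
```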

VGG-16

VGG-16 was the next big breakthrough in the deep learning and computer vision domains, as it marked the beginning of very deep CNNs. Earlier models like AlexNet used large filters in the initial layers, but VGG changed this and used small 3x3 filters throughout. This ConvNet, developed by Simonyan and Zisserman (2015), became the best-performing model of its time and fueled further research into deep CNNs.

VGG-16 Architecture | Source: Neurohive.io, Simonyan and Zisserman (2015)
  • It was trained on the ImageNet dataset and achieved state-of-the-art results, with a top-5 accuracy of up to 92.7%, beating GoogLeNet and Clarifai.
  • It had an overwhelming 138 million parameters to train, at least twice as many as in other models used at the time. Hence, it took weeks to train.
  • It had a very systematic architecture: as we move to deeper layers, the image dimensions halve, while the number of channels (the number of filters used in each layer) doubles (see the sketch after this list).
  • A prominent drawback of this model was that it was extremely slow to train and huge in size, making it less practical for real-time deployment.
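
The halve-and-double pattern is easy to express in code; a minimal sketch of the VGG-16 feature extractor, assuming the standard 2-2-3-3-3 stage configuration:

```python
import torch.nn as nn

def vgg_block(in_ch, out_ch, num_convs):
    """One VGG stage: a stack of 3x3 convolutions followed by 2x2 max pooling."""
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))  # halves the spatial dimensions
    return nn.Sequential(*layers)

# Channels double (64 -> 128 -> 256 -> 512) as the spatial size halves
# (224 -> 112 -> 56 -> 28 -> 14 -> 7).
features = nn.Sequential(
    vgg_block(3, 64, 2),
    vgg_block(64, 128, 2),
    vgg_block(128, 256, 3),
    vgg_block(256, 512, 3),
    vgg_block(512, 512, 3),
)
```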

ResNet | ResNeXt

ResNet was put forward by He et al. in 2015: a model that could employ hundreds to thousands of layers whilst providing compelling performance. The problem with very deep neural networks was the vanishing gradient: repeated multiplication as the gradient is propagated back through a deeper network makes it vanishingly small.

Residual Block | Source: He et al. (2015)

ResNet introduces “shortcut connections” that skip one or more layers. These shortcuts perform identity mappings, and their outputs are added to the outputs of the stacked layers. With 152 layers (the deepest back then), ResNet won the ILSVRC 2015 classification competition with a top-5 error of 3.57%. With increasing demand in the research community, different interpretations of ResNet were developed. The following model treats ResNet as an ensemble of many smaller networks.
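
The residual block in the figure maps naturally to code; a minimal sketch (the batch-norm placement follows the common implementation of the basic block):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = F(x) + x, with an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # the shortcut: add the input back in

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```

Because the block only has to learn the residual F(x) rather than the full mapping, gradients can flow through the identity path unimpeded, which is what makes hundreds of layers trainable.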

A block of ResNeXt with cardinality = 32 | Source: Xie et al. (2017)

Xie et al. (2017) proposed this variant of ResNet, called ResNeXt; it is similar in appearance to the Inception module (both perform split-transform-merge), but the outputs of the different paths are added together here, whereas they are depth-concatenated in Inception. Furthermore, every path shares the same topology in ResNeXt, while Inception uses varying topologies for different paths (1x1, 3x3, 5x5 convolution).

  • The authors introduce cardinality, the number of parallel paths, as a new hyperparameter; it makes the model adaptable to different datasets, and accuracy increases with higher values.
  • The input is divided into groups of feature maps on which convolutions are performed; the outputs are then concatenated along the depth dimension and fed into a 1x1 convolution layer (see the sketch below).
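
In modern frameworks, the parallel paths collapse into a single grouped convolution; a minimal sketch of one bottleneck block, with channel sizes chosen to match the figure:

```python
import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """ResNeXt bottleneck with cardinality 32: a grouped 3x3 convolution
    implements the 32 parallel paths of the figure in one call."""
    def __init__(self, channels=256, cardinality=32, width=128):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, width, 1, bias=False),  # reduce: 256 -> 128
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1,
                      groups=cardinality, bias=False),  # 32 groups of 4 channels
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, 1, bias=False),  # restore: 128 -> 256
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.block(x) + x)  # residual shortcut, as in ResNet

x = torch.randn(1, 256, 56, 56)
print(ResNeXtBlock()(x).shape)  # torch.Size([1, 256, 56, 56])
```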

DenseNet | CondenseNet

The idea of DenseNet stemmed from the intuition that CNNs could be substantially deeper, more accurate, and more efficient to train if there were shorter connections between layers close to the input and those close to the output. In sum, every layer is connected to every other layer in a feed-forward fashion.

5-layer dense block with a growth rate of k = 4 | Source: Huang et al. (2016)
  • Has L(L+1)/2 direct connections in a network of L layers; all layers are interconnected.
  • Substantially reduces the number of parameters, handles the vanishing gradient problem, and encourages feature reuse and feature propagation.
  • On the ImageNet dataset, the model is on par with state-of-the-art ResNets whilst requiring fewer parameters and less computational power.
  • Can be scaled to hundreds of layers, with no difficulties in optimization.
  • Shows no sign of degradation or overfitting as the number of parameters increases; accuracy keeps improving.
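
The concatenation pattern is compact in code; a minimal sketch of the 5-layer, growth-rate-4 dense block from the figure:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer sees the concatenation of all previous feature maps and
    contributes k (the growth rate) new channels of its own."""
    def __init__(self, in_ch, growth_rate=4, num_layers=5):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            ch = in_ch + i * growth_rate  # all features produced so far
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth_rate, 3, padding=1),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

x = torch.randn(1, 16, 32, 32)
print(DenseBlock(16)(x).shape)  # torch.Size([1, 36, 32, 32]) = 16 + 5 * 4
```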

CondenseNet was proposed in 2018 by Huang et al. as an improved version of DenseNet with better efficiency. Combined with a novel module called learned group convolution, it facilitates feature reuse and removes connections that are unnecessary. It is easy to implement and outperforms networks like ShuffleNet and MobileNet in computational efficiency at the same accuracy.

Learned group convolutions with 3 groups, condensation factor of C = 3 | Source: Huang et al. (2018)
  • Learns a sparsified network automatically during the training process, producing a regular connectivity pattern that can be implemented efficiently using group convolutions.
  • The filters of a layer are divided into multiple groups, and unnecessary features are removed from these groups during training.
  • The groups of incoming features are learned, not predefined.
  • For similar accuracy levels, it uses about 1/10th of the computational power needed by traditional DenseNets.

ShuffleNet

Even with a good grip over efficiency and results over the years, there was still a need for a solution that performed comparably whilst requiring limited computational power. This was essential for low-end devices such as mobile phones, drones, and robots. ShuffleNet, proposed by Zhang et al. (2017), fulfills this goal through group-wise convolution and channel shuffle, at similar levels of accuracy.

Channel shuffle with two stacked group convolutions | GConv: Group convolution | Source: Zhang et al. (2017)
  • Uses a novel shuffle operation to ease the flow of information across channels (see the sketch below).
  • For ImageNet classification, it obtains a lower top-1 error than MobileNet, and offers roughly a 13x speedup over AlexNet at similar accuracy levels.
  • It is found to consistently outperform MobileNet on platforms with lower computational power. Hence, it paves the way for usage in mobile devices.
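
The shuffle itself is just a reshape, transpose, and flatten along the channel axis; a minimal sketch:

```python
import torch

def channel_shuffle(x, groups):
    """Mix channels across groups so that the next group convolution
    sees information from every group, not just its own."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # interleave the groups
    return x.view(n, c, h, w)                 # flatten back

x = torch.arange(8).float().view(1, 8, 1, 1)
print(channel_shuffle(x, 2).flatten().tolist())
# [0.0, 4.0, 1.0, 5.0, 2.0, 6.0, 3.0, 7.0]: the two groups are interleaved
```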

FractalNet

FractalNet is an interesting CNN because it drifts away from the then-trending ResNets and builds its own deep architecture without any residual blocks. Simply stated, FractalNet can be viewed as an alternative to ResNets for very deep networks. Proposed by Larsson et al. (2017), the figure below shows one fractal unit, which is stacked to form a fractal block, which in turn stacks up to form the FractalNet.

Fractal Architecture | Source: Larsson et al. (2017)
  • Regularization is done using global (a single fixed path) and local (probability-based path selection) drop-path, in order to prevent overfitting.
  • FractalNet outperforms several ResNets on numerous tasks.
  • It achieves an accuracy of 92.61% on the ImageNet dataset, marginally higher than ResNet.

R-CNNs

R-CNNs come with the proposition that only certain regions of an image contain the required features, and only these regions should be fed to a CNN model, hence the name region-based CNNs. Their main application is object detection, which many real-time systems need to perform. We discuss some of the R-CNNs below.

Fast R-CNN architecture | Source: Girshick (2015)

Girshick (2015) improved his own R-CNN to create Fast R-CNN: instead of extracting the regions of interest first, the whole image is fed as input, and the regions of interest are extracted inside the network and reshaped by an RoI pooling layer. This drastically reduced training and test time, because thousands of regions of the same image no longer had to be fed through the model individually.

Region Proposal Network (RPN) / Faster R-CNN architecture | Source: Ren et al. (2015)

Ren et al. (2015) proposed a new network to reduce the computation time of Fast R-CNN even further. Rather than running selective search, the image is first passed through a separate network called the RPN, trained to detect region proposals; the proposed regions are then fed into the rest of the detection network. Faster R-CNN made the R-CNN family fast enough to be deployed for real-time applications.

Mask R-CNN architecture | Source: Medium

Mask R-CNN adds additional functionality to Faster R-CNN: instance segmentation (object masks). An extra branch is added to Faster R-CNN at relatively low computational cost. Proposed by He et al. (2017), it excels at multi-class detection, creating bounding boxes and masks, as well as human pose estimation. And since it does not add much computational load, it is even more deployable and useful for real-time applications.
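
For completeness, torchvision ships a reference Mask R-CNN (with a ResNet-50 + FPN backbone, one of several possible backbones); a minimal usage sketch with untrained weights:

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(num_classes=91)  # 91 = COCO label space
model.eval()

# Inference takes a list of CHW float images scaled to [0, 1].
images = [torch.rand(3, 480, 640)]
with torch.no_grad():
    predictions = model(images)

# Each prediction carries boxes, labels, scores, and per-instance masks.
print(predictions[0].keys())  # dict_keys(['boxes', 'labels', 'scores', 'masks'])
```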

That’s the gist of the amazing science behind CNNs, how their architectures have evolved over time, and what researchers are working on now. Nowadays, computer vision models have hundreds of millions of parameters to train, which makes training a very extensive task. New architectures are coming up all the time, and the old architectures are being refined to improve their accuracy and computation time. Let’s see which model becomes the next ground-breaking one in the upcoming years!

This article was co-written with Hamza Abubakar. Feel free to reach out to either one of us for questions or suggestions.

Aaryan Gupta — Computer Engineering, Nirma University, Ahmedabad


Hamza Abubakar — Computer Engineering, Nirma University, Ahmedabad

