ConvNet Architectures for Beginners Part II

Aryan Kargwal · Published in SRM MIC · Jul 28, 2020 · 6 min read
Convolutional Neural Network (source:- medium.com)

In ConvNet Architectures for Beginners Part I, we talked about how choosing among the many CNN architectures available can overwhelm a beginner in the industry, and we covered a few of them: for new readers, the previous part discussed LeNet-5, AlexNet and VGG-16.

Jump to Architecture:-

  1. Inception (GoogLeNet)
  2. ResNet
  3. DenseNet

Inception (GoogLeNet)

Researchers at Google introduced Inception in 2014 as a submission to the ImageNet recognition challenge (ILSVRC). Inception has had 4 versions to date, with the most recent one beating ResNet (which we will talk about in a moment) in terms of accuracy on the ImageNet dataset.

Structure

Inception module naive version (left), Inception module with dimension reduction (right) (source:- arxiv.org)

The model is built from a basic unit called an Inception Module, in which we perform a series of convolutions at different scales and aggregate the results. In each module, a set of 1x1, 3x3 and 5x5 filters learns to extract features from the input at different scales.

The problem with the naive version of the module is that even a modest number of 3x3 or 5x5 convolutions can be computationally expensive on top of a convolutional layer with a large number of filters. This problem is resolved by reducing the channel dimension of the activations with 1x1 convolutions before passing them through the larger convolutions.
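Below is a minimal sketch of such a dimension-reduced module, assuming PyTorch as the framework (the paper does not prescribe one) and using branch sizes loosely borrowed from GoogLeNet's "3a" block purely for illustration:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """A simplified Inception module with 1x1 dimension reduction
    (a sketch, not the exact published block; sizes are illustrative)."""
    def __init__(self, in_ch, out_1x1, red_3x3, out_3x3, red_5x5, out_5x5, out_pool):
        super().__init__()
        # Branch 1: plain 1x1 convolution
        self.b1 = nn.Conv2d(in_ch, out_1x1, kernel_size=1)
        # Branch 2: 1x1 reduction followed by a 3x3 convolution
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, red_3x3, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(red_3x3, out_3x3, kernel_size=3, padding=1),
        )
        # Branch 3: 1x1 reduction followed by a 5x5 convolution
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, red_5x5, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(red_5x5, out_5x5, kernel_size=5, padding=2),
        )
        # Branch 4: 3x3 max pooling followed by a 1x1 projection
        self.b4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, out_pool, kernel_size=1),
        )

    def forward(self, x):
        # Concatenate the four branches along the channel dimension
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# Roughly the sizes of GoogLeNet's "3a" block
block = InceptionModule(192, 64, 96, 128, 16, 32, 32)
x = torch.randn(1, 192, 28, 28)
print(block(x).shape)  # torch.Size([1, 256, 28, 28])
```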

1x1 Convolution (source:- medium.com)

Why do 1x1 convolutions work:- A 1x1 convolution is a filter that covers a single spatial position but spans all input channels, so a bank of N such filters projects the input onto N new feature maps, each focusing on its own combination of the input features. Using fewer 1x1 filters than there are input channels therefore compresses the channel dimension cheaply before the expensive higher-order convolutions.
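A quick parameter count makes the saving concrete. This is a small PyTorch sketch with illustrative channel sizes (192 inputs, 32 outputs), not numbers taken from the paper:

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# Direct 5x5 convolution: 192 input channels -> 32 output channels
direct = nn.Conv2d(192, 32, kernel_size=5, padding=2)

# Same mapping through a 1x1 "bottleneck": reduce 192 -> 16 channels first
bottleneck = nn.Sequential(
    nn.Conv2d(192, 16, kernel_size=1),
    nn.Conv2d(16, 32, kernel_size=5, padding=2),
)

print(n_params(direct))      # 153,632 parameters
print(n_params(bottleneck))  # 15,920 parameters (roughly 10x fewer)
```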

Inception v1 (source:- arxiv.org)

These dimension-reduced modules come together to form Inception v1, more popularly known as GoogLeNet. The architecture comprises 9 such modules, making up 22 layers (27 including pooling layers). At the end of the architecture, a global average pooling layer, which computes the average of every feature map, replaces the fully connected layers, which considerably reduces the number of parameters.
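A small sketch of what that final stage looks like, again assuming PyTorch and the 1024-channel, 7x7 output of the last Inception module:

```python
import torch
import torch.nn as nn

# Global average pooling collapses each 7x7 feature map to a single value,
# so only one small linear layer is needed for the 1000 ImageNet classes
# instead of a large fully connected stack on 1024*7*7 flattened inputs.
features = torch.randn(1, 1024, 7, 7)   # output of the last Inception module
gap = nn.AdaptiveAvgPool2d(1)           # average every feature map
pooled = gap(features).flatten(1)       # shape: (1, 1024)
classifier = nn.Linear(1024, 1000)
print(classifier(pooled).shape)         # torch.Size([1, 1000])
```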

Inception v3 (source:- www.jeremyjordan.me)

Inception v3:- The third version of the Inception network comprises more advanced inception modules, adding up to a 44-layer-deep network.

Parameters

~5 million (V1) and ~23 million (V3)

Application and Achievements

Inception v1 produced the record lowest error on the ImageNet classification dataset at ILSVRC 2014. Google eventually used the Inception network to develop its Deep Dream program, which uses a convolutional neural network to find and enhance patterns in images via algorithmic pareidolia, thus creating a dream-like appearance in the deliberately over-processed images.

We need to go deeper (source:- knowyourmeme.com)

Fun Fact:- The Inception network derives its name from the famous 2010 movie of the same name, in which the protagonist dives deep into his target’s subconscious through dreams to steal information.

ResNet

Residual Neural Network, or ResNet, was introduced by researchers at Microsoft in 2015. Contrary to Inception, the purpose here was to make it possible to work with much deeper neural networks instead of trying to limit the number of layers (hundreds of layers as opposed to tens).

Structure

Training deeper neural networks led to something unexpected: accuracy seemed to decrease as layers were added, which on further research was found to be largely due to vanishing gradients.

Accuracy decreasing with deeper networks (source:- arxiv.org)

Vanishing Gradients:- Certain activation functions, like the sigmoid, squash a large range of inputs into a small output range, so even a large change in the input of the sigmoid produces only a small change in its output, which makes its derivative small during the back-propagation step. In deep networks these small derivatives are multiplied together layer after layer, which affects the weights of the initial layers the most, making their gradients extremely small, hence the “Vanishing Gradients”.
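A toy PyTorch example makes the effect visible: chaining sigmoids and backpropagating through the chain shows how little gradient reaches the first "layer" (a deliberately simplified sketch with no weights, just repeated activations):

```python
import torch

# The sigmoid derivative is at most 0.25, so each layer in the chain scales
# the gradient down by a factor of 0.25 or less during backpropagation.
x = torch.tensor(0.5, requires_grad=True)
y = x
for _ in range(30):          # a 30-deep chain of sigmoid activations
    y = torch.sigmoid(y)
y.backward()
print(x.grad)                # around 1e-20: the gradient has effectively vanished
```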

Canonical form of Residual Blocks (source:- wikipedia.org)

The paper introduced the concept of Residual Blocks in deep neural networks. A residual block fast-forwards the activations of a layer to deeper layers of the network through a skip connection, which means the model only learns how to adjust the incoming feature map for the deeper layers, building on the activations of the shallower layers instead of learning an entirely new mapping from scratch.
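A minimal sketch of a basic residual block, assuming PyTorch and the two-3x3-convolution variant used in ResNet-34 (shortcuts that change dimensions are omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """A basic residual block: the identity shortcut skips both convolutions."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        identity = x                           # the skip path
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                   # add the shortcut: F(x) + x
        return F.relu(out)

block = ResidualBlock(64)
x = torch.randn(1, 64, 56, 56)
print(block(x).shape)  # torch.Size([1, 64, 56, 56])
```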

ResNet-34 (source:- towardsdatascience.com)

Each colored block of layers represents a series of convolutions with the same feature-map dimensions. The feature maps are periodically down-sampled by strided convolution, accompanied by a doubling of the channel depth to preserve the time complexity per layer.

Parameters

~25 Million (ResNet-50)

Application and Achievements

ResNet-152 (a deeper version of the ResNet-34 shown above) was the winner of ILSVRC 2015 (ImageNet) in image classification, detection, and localization, as well as the winner of the MS COCO 2015 detection and segmentation challenges.

MS COCO:- An object detection dataset with 80 classes, about 80,000 training images and 40,000 validation images.

DenseNet

DenseNet is a collaboration between researchers at Cornell University and Facebook that took inspiration from Microsoft’s ResNet. The principal purpose of DenseNet was to build on ResNet’s idea of referring back to earlier layers for the feature maps, the only difference being that instead of taking the feature map from just one earlier layer, it concatenates the feature maps of multiple preceding layers following a general rule.

Structure

The persistent problem with deep neural networks remained the vanishing gradient problem, which ResNet addressed to some extent but at the cost of a fresh one: the sheer number of parameters to be learned, even for a network as small as 50 layers, translated into a rather computationally expensive model to train. The introduction of DenseNet considerably decreased these parameters while keeping the validation error competitive with ResNet’s.

DenseNet vs ResNet (source:-arxiv.org)

The principal idea of DenseNet is to build the network out of pre-defined Dense Blocks, sets of layers grouped together in which every layer is connected to every other layer. Within a Dense Block, these connections concatenate the feature maps of all the layers preceding the layer in focus.

Dense Block (source:- arxiv.org)

One might ask how DenseNet ends up with fewer learnable parameters than ResNet despite the increased inter-connectivity of its layers. This is where DenseNet makes its first departure: instead of summing the output feature maps of a layer with the incoming feature maps, it concatenates them, turning the equation of ResNet from

x_l = H_l(x_{l-1}) + x_{l-1}

to

x_l = H_l([x_0, x_1, …, x_{l-1}])

where [·] denotes channel-wise concatenation. Because each layer can reuse everything that came before it, it only needs to add a small number of new feature maps of its own (the growth rate), which keeps the parameter count low.
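A minimal dense block sketch in PyTorch, with an assumed growth rate of 12 and 4 layers (the real DenseNet layer is a BN-ReLU-Conv composite; this simplification keeps only the concatenation idea):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseBlock(nn.Module):
    """Every layer receives the concatenation of all previous feature maps
    and contributes `growth_rate` new maps of its own."""
    def __init__(self, in_channels, growth_rate=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                      kernel_size=3, padding=1)
            for i in range(num_layers)
        )

    def forward(self, x):
        features = [x]
        for conv in self.layers:
            # x_l = H_l([x_0, x_1, ..., x_{l-1}]): concatenate, then convolve
            out = conv(F.relu(torch.cat(features, dim=1)))
            features.append(out)
        return torch.cat(features, dim=1)

block = DenseBlock(in_channels=16, growth_rate=12, num_layers=4)
x = torch.randn(1, 16, 32, 32)
print(block(x).shape)  # torch.Size([1, 64, 32, 32]) -> 16 + 4*12 channels
```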

The beauty of DenseNet lies in the fact that each layer has access to all the preceding feature maps, and therefore to the network’s collective knowledge, and that each layer then adds new information to this collective knowledge, much like us humans.

DenseNet Architecture (source:- arxiv.org)

The architecture of DenseNet has an underlying familiarity to the ResNet architecture, with the Dense Blocks as the repeated unit.

Parameters

~0.8 Million (DenseNet-100)

Achievements

The DenseNet paper was awarded Best Paper at CVPR 2017.

CVPR:- The Conference on Computer Vision and Pattern Recognition is an annual conference on computer vision and pattern recognition, which is regarded as one of the most important conferences in its field. (source:- wikipedia.org)
