GoogLeNet (Google’s Inception Net)

Juber Gandharv
Oct 18, 2019


In this blog we will review GoogLeNet, introduced by Google in 2014, which won the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). It is a significant improvement over ZFNet (the winner in 2013) and AlexNet (the winner in 2012), and has a relatively lower error rate than VGGNet (the 1st runner-up in 2014). ImageNet is a dataset of over 15 million labeled high-resolution images spanning around 22,000 categories. ILSVRC uses a subset of ImageNet with around 1,000 images in each of 1,000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 100,000 testing images.

Error Rate in ILSVRC 2014 (%)

As the authors said, the idea for the name “Inception” comes from the famous internet meme below: “WE NEED TO GO DEEPER”

Source: https://knowyourmeme.com/memes/we-need-to-go-deeper

Table of Contents:

Background

Inception Network Motivation

The 1x1 Convolutions

Inception Module

Proposed Architectural Details

Different Versions of Inception Net

Background:

The standard CNN structure up until 2014 was a stack of convolutional layers, optionally followed by max-pooling, and then one or more fully-connected layers.

This has its limits:

  • Large memory footprint
  • Large computation demand
  • Prone to overfitting
  • Vanishing and exploding gradients
Standard CNN structure

Inception Network Motivation:

The most straightforward way of improving the performance of deep neural networks is to increase their size. This includes both increasing the depth (the number of network levels) and the width (the number of units at each level). This is an easy and safe way of training higher-quality models, especially given the availability of a large amount of labeled training data. However, this simple solution comes with two major drawbacks.

  • Bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting, especially if the number of labeled examples in the training set is limited.
  • The other drawback of uniformly increased network size is the dramatically increased use of computational resources.

A fundamental way of solving both of these issues would be to introduce sparsity and replace the fully connected layers with sparse ones, even inside the convolutions. Besides mimicking biological systems, this would also have the advantage of firmer theoretical underpinnings due to the groundbreaking work of Arora et al.

Densely connected architecture
Sparsely connected architecture

The 1x1 Convolutions:

Before going further into the Inception net, let’s first cover some basics and understand how a 1x1 convolution works and how it reduces the computational cost.

The 1x1 convolution was originally introduced by NIN (Network in Network) to add non-linearity. In GoogLeNet, however, the 1×1 convolution is used as a dimension-reduction module to reduce the computation. By relieving this computation bottleneck, depth and width can be increased.

Let’s understand how it reduces the computational cost with a simple example. Suppose we need to perform a 5×5 convolution without a 1×1 convolution, as below:

Number of operations = 28x28 (size of the output feature map) x 5x5 (size of the filter) x 192 (number of input channels) x 32 (number of filters) = 120,422,400 operations

5x5 convolution without 1x1 convolution in between (image source)

With a 1x1 convolution in between, the number of operations becomes: 28x28x1x1x192x16 + 28x28x5x5x16x32 = 12,443,648, roughly a tenfold reduction in the number of operations.

5x5 convolution using 1x1 bottleneck convolution (image source)
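As a quick sanity check, the two multiplication counts above can be reproduced with a few lines of plain Python (the 28x28 feature-map size, 192 input channels, 16 bottleneck filters, and 32 output filters are the numbers from the example):

```python
# Multiplications for a convolution = H_out * W_out * kH * kW * C_in * C_out
def conv_mults(h, w, k_h, k_w, c_in, c_out):
    return h * w * k_h * k_w * c_in * c_out

# Direct 5x5 convolution: 192 -> 32 channels on a 28x28 feature map
direct = conv_mults(28, 28, 5, 5, 192, 32)

# 1x1 bottleneck (192 -> 16) followed by the 5x5 convolution (16 -> 32)
with_bottleneck = conv_mults(28, 28, 1, 1, 192, 16) + conv_mults(28, 28, 5, 5, 16, 32)

print(direct)           # 120422400
print(with_bottleneck)  # 12443648 -> roughly a tenfold reduction
```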

Inception Module:

An Inception layer is a combination of a 1×1 convolutional layer, a 3×3 convolutional layer, a 5×5 convolutional layer, and a 3×3 max-pooling branch, with their output filter banks concatenated into a single output vector that forms the input of the next stage. As we can see in the image of the naive version of the Inception module, direct 3×3 and 5×5 convolutions are used, which are too expensive to compute. So, in the second image of the Inception module, 1×1 convolutions are used to reduce the dimensions, which leads to a less expensive computation (how a 1x1 convolution reduces the computation is explained in the 1x1 convolution section of this blog).
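To make the structure concrete, here is a minimal PyTorch-style sketch of the dimension-reduced Inception module; the framework choice is mine, and the channel split shown in the usage line is the one commonly quoted for the inception (3a) stage, used here purely for illustration:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Inception module with 1x1 dimension-reduction bottlenecks.
    ReLUs after the final convolutions are omitted for brevity."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        # Branch 1: plain 1x1 convolution
        self.b1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        # Branch 2: 1x1 reduction followed by a 3x3 convolution
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_red, c3, kernel_size=3, padding=1))
        # Branch 3: 1x1 reduction followed by a 5x5 convolution
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_red, c5, kernel_size=5, padding=2))
        # Branch 4: 3x3 max-pooling followed by a 1x1 projection
        self.b4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1))

    def forward(self, x):
        # Filter banks of all branches are concatenated along the channel axis
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# Channel split commonly quoted for the "inception (3a)" stage: 192 -> 256 channels
block = InceptionModule(192, 64, 96, 128, 16, 32, 32)
out = block(torch.randn(1, 192, 28, 28))  # shape: (1, 256, 28, 28)
```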

Having covered the basics of the 1x1 convolution and the Inception block, let’s now deep dive into the overall architecture of the Inception net.

Proposed Architectural Details:

GoogLeNet architecture with nine inception modules (image source)

There are 22 layers in total, which is already a very deep model compared with the earlier AlexNet, ZFNet, and VGGNet (though not so deep compared with ResNet, invented afterward). We can see that numerous Inception modules are connected together to go deeper.

Auxiliary classifier:

Auxiliary classifier module

Why an auxiliary module?

As the model is deep, there is a chance of the vanishing gradient problem (gradients become very small during backpropagation, so earlier layers barely learn). To overcome this problem, the authors introduce 2 auxiliary classifiers.

Every auxiliary classifier injects extra gradient signal into the earlier layers, so the gradients do not vanish during backpropagation. The total loss function is a weighted sum of the auxiliary losses and the real loss; the weight used in the paper is 0.3 for each auxiliary loss.
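A minimal sketch of how such a combined loss could be assembled is shown below; only the 0.3 weighting comes from the paper, while the cross-entropy loss and the helper function itself are illustrative assumptions:

```python
import torch.nn.functional as F

def googlenet_loss(main_logits, aux1_logits, aux2_logits, targets, aux_weight=0.3):
    """Total loss = real loss + 0.3 * each auxiliary loss. The auxiliary heads
    are only used during training and are discarded at inference time."""
    main_loss = F.cross_entropy(main_logits, targets)
    aux1_loss = F.cross_entropy(aux1_logits, targets)
    aux2_loss = F.cross_entropy(aux2_logits, targets)
    return main_loss + aux_weight * (aux1_loss + aux2_loss)
```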

Different Versions of Inception Net

Inception V2 -

Inception v2 and Inception v3 were presented in the same paper. The authors proposed a number of upgrades which increased the accuracy and reduced the computational complexity. Inception v2 explores the following:

Reduce the representational bottleneck: the intuition was that neural networks perform better when convolutions don’t alter the dimensions of the input drastically. Reducing the dimensions too much may cause a loss of information, known as a “representational bottleneck”.

Using smart factorization methods, convolutions can be made more efficient in terms of computational complexity.

Factorize the 5x5 convolution into two 3x3 convolution operations to improve computational speed. Although this may seem counterintuitive, a 5x5 convolution is 2.78 times more expensive than a 3x3 convolution, so replacing it with two stacked 3x3 convolutions (which cover the same 5x5 receptive field at only twice the cost of a single 3x3) in fact leads to a boost in performance. This is illustrated in the below image.
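The 2.78x figure follows directly from counting weights per output position and per channel pair, as this small Python check shows:

```python
# Cost per output position and per (input channel, output channel) pair
cost_5x5 = 5 * 5              # 25 multiplications
cost_3x3 = 3 * 3              # 9 multiplications
cost_two_3x3 = 2 * cost_3x3   # 18 multiplications

print(cost_5x5 / cost_3x3)          # 2.78 -> one 5x5 costs 2.78x a single 3x3
print(1 - cost_two_3x3 / cost_5x5)  # 0.28 -> two stacked 3x3s are ~28% cheaper than one 5x5
```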

To get a computationally cheaper network, the authors of the paper factorized an nxn convolution into a combination of a 1xn and an nx1 convolution. For example, a 3x3 convolution is equivalent in receptive field to first performing a 1x3 convolution and then performing a 3x1 convolution on its output. They found this method to be 33% cheaper than the single 3x3 convolution.
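The 33% saving can be checked the same way, and the factorization itself can be written as two stacked asymmetric convolutions (the PyTorch layers and the 64-channel width below are only illustrative):

```python
import torch.nn as nn

# 3x3 = 9 weights per channel pair; 1x3 followed by 3x1 = 3 + 3 = 6 weights
print(1 - (3 + 3) / (3 * 3))  # ~0.33 -> about 33% cheaper

# Asymmetric factorization of a 3x3 convolution (64 channels chosen arbitrarily)
factorized_3x3 = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=(1, 3), padding=(0, 1)),  # 1x3 convolution
    nn.Conv2d(64, 64, kernel_size=(3, 1), padding=(1, 0)),  # 3x1 convolution
)
```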

Here, put n=3 to obtain the equivalent of the previous image. The left-most 5x5 convolution can be represented as two 3x3 convolutions, which in turn are represented as 1x3 and 3x1 in series.

The filter banks in the module were expanded (made wider instead of deeper) to remove the representational bottleneck. If the module was made deeper instead, there would be an excessive reduction in dimensions, and hence loss of information. This is illustrated in the below image.

Making the inception module wider

Label Smoothing

In brief: “a mechanism to regularize the classifier by estimating the effect of label-dropout during training”
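In practice this means mixing the one-hot target with a uniform distribution over the classes. Below is a minimal, illustrative sketch (the epsilon value of 0.1 is the one commonly quoted for Inception v3; the helper itself is an assumption, not the paper’s code):

```python
import torch

def smooth_labels(targets, num_classes, epsilon=0.1):
    """Mix the one-hot targets with a uniform distribution so the network is
    never pushed to assign 100% probability to a single class."""
    one_hot = torch.zeros(targets.size(0), num_classes)
    one_hot.scatter_(1, targets.unsqueeze(1), 1.0)
    return (1.0 - epsilon) * one_hot + epsilon / num_classes

# Example: 3 samples, 5 classes
smoothed = smooth_labels(torch.tensor([0, 2, 4]), num_classes=5)
# Each row is 0.92 for the true class and 0.02 for every other class
```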

Inception V3 -

The authors noted that the auxiliary classifiers didn’t contribute much until near the end of the training process, when accuracies were nearing saturation. They argued that they function as regularizers, especially if they have BatchNorm or Dropout operations.

They also investigated possibilities to improve on Inception v2 without drastically changing the modules.

  • Inception Net v3 incorporated all of the above upgrades stated for Inception v2 and, in addition, used the following:
  1. RMSProp Optimizer.
  2. Factorized 7x7 convolutions.
  3. BatchNorm in the Auxiliary Classifiers.
  4. Label Smoothing (a type of regularizing component added to the loss formula that prevents the network from becoming too confident about a class; prevents overfitting).

Inception V4 -

Inception v4 introduces a more uniform, simplified architecture with more Inception modules than Inception-v3, as shown below:

This is a pure Inception variant without any residual connections. It can be trained without partitioning the replicas, with memory optimization to backpropagation. We can see that the techniques from Inception-v1 to Inception-v3 are used. (Batch Normalisation is also used but not shown in the figure.)

Inception-ResNet v1 and v2 are hybrid Inception networks inspired by the performance of ResNet.

Inception-ResNet-V1 and V2

There are two sub-versions of Inception ResNet, namely v1 and v2. Before we check out the salient features, let us look at the minor differences between these two sub-versions.

  1. Inception-ResNet v1 has a computational cost similar to that of Inception v3, while Inception-ResNet v2 has a computational cost similar to that of Inception v4.
  2. They have different stems, as illustrated in the below image.
Left: stem for Inception-ResNet v1 | Right: stem for Inception-ResNet v2 and Inception v4

3. Both sub-versions have the same structure for modules A, B, and C and for the reduction blocks. The only difference is the hyper-parameter settings. In this section, we’ll only focus on the structure; refer to the paper for the exact hyper-parameter settings, shown below:

Inception-ResNet-A, Inception-ResNet-B, and Inception-ResNet-C modules for Inception-ResNet v1
Inception-ResNet-A, Inception-ResNet-B, and Inception-ResNet-C modules for Inception-ResNet v2

Each Inception block is followed by a filter-expansion layer (1×1 convolution without activation), which is used for scaling up the dimensionality of the filter bank before the addition, to match the depth of the input.

In the case of Inception-ResNet, batch-normalization is used only on top of the traditional layers, but not on top of the summations.

Scaling of Residuals

According to the authors, if the number of filters exceeded 1000, the residual variants started to exhibit instabilities and the network has just “died” early in the training, meaning that the last layer before the average pooling started to produce only zeros after a few tens of thousands of iterations. This could not be prevented, neither by lowering the learning rate nor by adding an extra batch normalization to this layer.


According to them, scaling down the residuals before adding them to the previous layer's activation seemed to stabilize the training. To scale the residuals, scaling factors between 0.1 and 0.3 were picked.
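Putting the filter-expansion layer and the residual scaling together, a minimal PyTorch-style sketch of such a block might look like the following; the module name, channel arguments, and the ReLU after the summation are illustrative assumptions, while the 1x1 expansion without activation and the 0.1 scale follow the description above:

```python
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    """Inception-ResNet style block: inception branch -> 1x1 filter expansion
    (no activation) -> scale the residual down -> add it to the input."""
    def __init__(self, channels, branch, branch_out_ch, scale=0.1):
        super().__init__()
        self.branch = branch  # any inception-style branch that preserves H and W
        # 1x1 expansion without activation, to match the depth of the input
        self.expand = nn.Conv2d(branch_out_ch, channels, kernel_size=1)
        self.scale = scale    # scaling factors between 0.1 and 0.3 per the paper
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.expand(self.branch(x))
        return self.relu(x + self.scale * residual)
```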

Thank you for reading this blog! Hit the clap button if this article helped you get a clear picture of the GoogLeNet concept. If you have any questions, send an email to juber269@gmail.com.
