Know about Inception v2 and v3; Implementation using Pytorch

Sahil · Published in Nerd For Tech · Jun 26, 2021

Hi guys! In this blog, I will share what I learned after reading this research paper and what it is all about. Before I proceed, I want you to know that I did not study it very extensively; my goal was only to understand:

  • What is this research paper all about?
  • How is it different from the previous state-of-the-art models?
  • What were the results of this novel approach compared to the previous ones?

So, all of this is written here as key points.

Inception v2 is an extension of Inception that uses factorized (including asymmetric) convolutions and label smoothing.

Inception v3 (Inception v2 + BN-auxiliary) is the variant chosen for having the best experimental results among the different Inception v2 models.

Abstract

  • Although increased model size and computational cost tend to translate to immediate quality gains for most tasks, computational efficiency and low parameter count are still enabling factors for various use cases.
  • Here they explore ways to scale up networks that aim to utilize the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization.

Factorizing Convolutions with Large Filter Size

  • Since Inception networks are fully convolutional, each weight corresponds to one multiplication per activation. Therefore, any reduction in computational cost results in a reduced number of parameters.
  • Convolutions with larger spatial filters (e.g. 5 × 5 or 7 × 7) tend to be disproportionally expensive in terms of computation.

For example, as per the research paper:

Mini-network replacing a 5×5 convolution with two stacked 3×3 convolutions

A 5 × 5 convolution with n filters over a grid with m filters is 25/9 = 2.78 times more computationally expensive than a 3 × 3 convolution with the same number of filters. Of course, a 5 × 5 filter can capture dependencies between activations of units further away in the earlier layers, so a reduction of the geometric size of the filters comes at a large cost of expressiveness. However, we can ask whether a 5 × 5 convolution could be replaced by a multi-layer network with fewer parameters but the same input size and output depth. In the figure above, if we zoom into the computation graph of the 5 × 5 convolution, we see that each output looks like a small fully connected network sliding over 5 × 5 tiles of its input.

  • Since there is a two-layer replacement for the 5 × 5 layer, it seems reasonable to reach this expansion in two steps: increasing the number of filters by √α in both steps.
  • This ends up with a net (9 + 9)/25 reduction of computation, resulting in a relative gain of 28% from this factorization (a minimal PyTorch sketch of this replacement follows below).
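As a rough illustration of this factorization, here is a minimal PyTorch sketch (the channel counts are illustrative, not taken from the paper) that replaces a single 5 × 5 convolution with two stacked 3 × 3 convolutions covering the same receptive field, and compares parameter counts:

```python
import torch
import torch.nn as nn

in_ch, out_ch = 64, 64  # illustrative channel counts

# Single 5x5 convolution (padding keeps the spatial size).
conv5x5 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)

# Two stacked 3x3 convolutions cover the same 5x5 receptive field.
# The paper found a ReLU between the two layers works better than a linear first layer.
factorized = nn.Sequential(
    nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, in_ch, 35, 35)
assert conv5x5(x).shape == factorized(x).shape  # same spatial size and output depth

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(conv5x5), count(factorized))  # 102464 vs 73856: roughly the (9+9)/25 ~ 28% saving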

But it raises two questions:

Q1- Does this replacement result in any loss of expressiveness?

Q2- If our main goal is to factorize the linear part of the computation, would it not suggest to keep linear activations in the first layer?

Ans: They ran several control experiments (see, for example, Figure 1) and found that using linear activations was always inferior to using rectified linear units in all stages of the factorization.

Figure 1.

They attribute this gain to the enhanced space of variations that the network can learn, especially when the output activations are batch-normalized. One can see similar effects when using linear activations for the dimension-reduction components.

Spatial Factorization into Asymmetric Convolutions

  • One can ask whether the 3 × 3 convolutions should be factorized further into smaller convolutions, for example 2 × 2. However, it turns out that one can do even better than 2 × 2 by using asymmetric convolutions, e.g. n × 1.
Figure 3: Mini-network replacing the 3 × 3 convolutions. The lower layer of this network consists of a 3 × 1 convolution with 3 output units

For example (as per the research paper), using a 3 × 1 convolution followed by a 1 × 3 convolution is equivalent to sliding a two-layer network with the same receptive field as a 3 × 3 convolution (see Figure 3).

Still, the two-layer solution is 33% cheaper for the same number of output filters, if the number of input and output filters is equal. By comparison, factorizing a 3 × 3 convolution into two 2 × 2 convolutions represents only an 11% saving of computation.

  • One can replace any n × n convolution by a 1 × n convolution followed by an n × 1 convolution, and the computational cost saving increases dramatically as n grows (see Figure 6 below and the PyTorch sketch after it).
Figure 6: Inception modules after the factorization of the n × n convolutions.
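Below is a minimal PyTorch sketch of the asymmetric factorization, using the 3 × 3 → (3 × 1, 1 × 3) example from the text (channel counts are illustrative). In the 17 × 17 Inception modules of Figure 6, the paper uses n = 7, i.e. 1 × 7 followed by 7 × 1 convolutions.

```python
import torch
import torch.nn as nn

in_ch = out_ch = 64  # illustrative channel counts

conv3x3 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

# 3x1 followed by 1x3: same receptive field as a single 3x3 convolution.
asymmetric = nn.Sequential(
    nn.Conv2d(in_ch, out_ch, kernel_size=(3, 1), padding=(1, 0)),
    nn.ReLU(inplace=True),
    nn.Conv2d(out_ch, out_ch, kernel_size=(1, 3), padding=(0, 1)),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, in_ch, 17, 17)
assert conv3x3(x).shape == asymmetric(x).shape  # same spatial size and output depth

weights = lambda m: sum(p.numel() for p in m.parameters() if p.dim() > 1)  # conv weights only
print(weights(conv3x3), weights(asymmetric))  # 36864 vs 24576: the 33% saving mentioned above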

Utility of Auxiliary Classifiers

They found that

  • Auxiliary classifiers did not result in improved convergence early in training: the training progression of networks with and without the side head looks virtually identical before both models reach high accuracy.
  • Near the end of training, the network with the auxiliary branches starts to overtake the accuracy of the network without any auxiliary branch and reaches a slightly higher plateau.

The auxiliary classifiers act as a regularizer. This is supported by the fact that the main classifier of the network performs better if the side branch is batch-normalized or has a dropout layer.
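As a hedged sketch of what a batch-normalized auxiliary ("side") branch can look like in PyTorch (the layer sizes here are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    """Sketch of a batch-normalized auxiliary classifier on the 17x17 feature map.

    "BN-auxiliary" (Inception v3) means the fully connected layer of the side
    branch is batch-normalized as well, not just the convolution.
    Layer sizes here are illustrative assumptions.
    """

    def __init__(self, in_ch: int = 768, num_classes: int = 1000):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=5, stride=3)   # 17x17 -> 5x5
        self.conv = nn.Conv2d(in_ch, 128, kernel_size=1)
        self.bn_conv = nn.BatchNorm2d(128)
        self.fc = nn.Linear(128 * 5 * 5, 1024)
        self.bn_fc = nn.BatchNorm1d(1024)
        self.out = nn.Linear(1024, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.bn_conv(self.conv(self.pool(x))))
        x = torch.flatten(x, 1)
        x = torch.relu(self.bn_fc(self.fc(x)))
        return self.out(x)

aux_logits = AuxClassifier()(torch.randn(2, 768, 17, 17))  # -> shape [2, 1000]
```

During training, the loss from this side head is added (with a small weight) to the main loss; at inference time the branch is discarded.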

Efficient Grid Size Reduction

  • In order to avoid a representational bottleneck, the activation dimension of the network filters is expanded before applying maximum or average pooling.
  • We can use two parallel stride-2 blocks, P and C: P is a pooling layer (either average or maximum pooling) and C is a convolutional block. Both operate with stride 2, and their filter banks are concatenated as in Figure 10 (a minimal PyTorch sketch follows the figure).
Figure 10
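A minimal PyTorch sketch of this grid-size reduction, with a stride-2 convolution branch and a stride-2 pooling branch concatenated along the channel dimension (channel counts are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class GridReduction(nn.Module):
    """Sketch of the efficient grid-size reduction: a stride-2 convolution
    branch (C) and a stride-2 pooling branch (P) run in parallel and their
    filter banks are concatenated. Channel counts are illustrative."""

    def __init__(self, in_ch: int = 288, conv_ch: int = 384):
        super().__init__()
        self.conv_branch = nn.Sequential(
            nn.Conv2d(in_ch, conv_ch, kernel_size=3, stride=2),
            nn.BatchNorm2d(conv_ch),
            nn.ReLU(inplace=True),
        )
        self.pool_branch = nn.MaxPool2d(kernel_size=3, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The spatial grid is halved while the filter bank is expanded
        # (conv_ch + in_ch channels), avoiding a representational bottleneck.
        return torch.cat([self.conv_branch(x), self.pool_branch(x)], dim=1)

y = GridReduction()(torch.randn(1, 288, 35, 35))
print(y.shape)  # torch.Size([1, 672, 17, 17])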

Inception-v2

Table 1: Architecture of Inception-v2
  • The traditional 7 × 7 convolution is factorized into three 3 × 3 convolutions (a PyTorch sketch of this stem is given at the end of this section).
  • For the Inception part of the network, there are 3 traditional Inception modules at the 35 × 35 resolution with 288 filters each. This is reduced to a 17 × 17 grid with 768 filters using the grid reduction technique.
Figure 5: (3xInception) in Table 1
  • This is followed by 5 instances of the factorized Inception modules as depicted in Figure 6.

Figure 6: (5xInception) in Table 1 — shown above, under the "Spatial Factorization into Asymmetric Convolutions" section.

Figure 7: (2xInception) in Table 1
  • This is reduced to an 8 × 8 × 1280 grid with the grid reduction technique depicted in Figure 10 (under the "Efficient Grid Size Reduction" section).
  • At the coarsest 8 × 8 level, there are two Inception modules as depicted in Figure 7, with a concatenated output filter bank size of 2048 for each tile.

This network is 42 layers deep, its computational cost is only about 2.5× higher than that of GoogLeNet, and it is still much more efficient than VGGNet.
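As a sketch of the factorized stem from the first bullet above (three 3 × 3 convolutions in place of a single 7 × 7 one), with strides and channel widths assumed to follow Table 1 of the paper:

```python
import torch
import torch.nn as nn

# Factorized stem: three 3x3 convolutions in place of a single 7x7.
# Strides/widths assumed from Table 1 of the paper; BN and ReLU omitted for brevity.
stem = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2),              # 299x299x3  -> 149x149x32
    nn.Conv2d(32, 32, kernel_size=3, stride=1),              # 149x149x32 -> 147x147x32
    nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),   # 147x147x32 -> 147x147x64 (padded)
)

print(stem(torch.randn(1, 3, 299, 299)).shape)  # torch.Size([1, 64, 147, 147])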

Model Regularization via Label Smoothing
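In the paper, label-smoothing regularization replaces the one-hot ground-truth distribution with a mixture of the one-hot distribution and a uniform distribution over the K classes, using ε = 0.1 for ImageNet (K = 1000). Below is a minimal PyTorch sketch of such a loss; recent PyTorch versions also expose this directly via the label_smoothing argument of nn.CrossEntropyLoss.

```python
import torch
import torch.nn.functional as F

def label_smoothing_ce(logits: torch.Tensor, target: torch.Tensor,
                       epsilon: float = 0.1) -> torch.Tensor:
    """Cross-entropy against a smoothed target: probability (1 - epsilon) on the
    true class and epsilon / K spread uniformly over all K classes."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)
    uniform = -log_probs.mean(dim=-1)  # cross-entropy with the uniform distribution
    return ((1.0 - epsilon) * nll + epsilon * uniform).mean()

loss = label_smoothing_ce(torch.randn(8, 1000), torch.randint(0, 1000, (8,)))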

Training Methodology

  • Stochastic gradient descent using the TensorFlow distributed machine learning system, with 50 replicas each running on an NVidia Kepler GPU
  • Batch size=32
  • Epoch=100
  • RMSProp with decay 0.9 and ε = 1.0
  • Learning rate of 0.045, decayed every two epochs using an exponential rate of 0.94. In addition, gradient clipping with threshold 2.0 was found to be useful to stabilize the training (these settings are sketched in PyTorch below).
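A minimal sketch mapping these hyperparameters onto their PyTorch equivalents (the model and data loader below are tiny placeholders, not the Inception network or the paper's distributed TensorFlow setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder model and data; the paper trained Inception on ImageNet with a
# distributed TensorFlow setup, which is not reproduced here.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 1000),
)
loader = [(torch.randn(4, 3, 299, 299), torch.randint(0, 1000, (4,)))]  # dummy (images, labels)

optimizer = torch.optim.RMSprop(model.parameters(), lr=0.045, alpha=0.9, eps=1.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.94)  # decay 0.94 every 2 epochs

for epoch in range(100):                      # 100 epochs, batch size 32 in the list above
    for images, labels in loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(images), labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0)  # gradient clipping at 2.0
        optimizer.step()
    scheduler.step()                          # step the learning-rate schedule once per epoch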

Experimental Results and Comparisons

  • Table 3 of the paper shows the experimental results on the recognition performance of the proposed architecture (Inception v2) and its variants.

BN-auxiliary refers to the version in which the fully connected layer of the auxiliary classifier is also batch-normalized, not just the convolutions.

Complete Code from Scratch
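The full from-scratch implementation was embedded in the original post and is not reproduced here; a from-scratch version would combine the factorized modules, grid-reduction block, and BN-auxiliary head sketched above. As a quick alternative (assuming torchvision is installed), the reference Inception v3 can be loaded like this:

```python
import torch
from torchvision.models import inception_v3

# Reference Inception v3 from torchvision, randomly initialized, with the
# auxiliary classifier enabled (the aux head is only used during training).
model = inception_v3(aux_logits=True, init_weights=True)
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 299, 299))  # Inception v3 expects 299x299 inputs
print(logits.shape)  # torch.Size([1, 1000])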

That’s all the key points from this research paper. Hope you got it.

Thank you for reading, and have a nice day! :D

Here is my LinkedIn profile.
