Understanding Inception-v3
Rethinking the Inception Architecture for Computer Vision
1. Introduction
Rethinking the Inception Architecture for Computer Vision has been one of the most complicated research papers I have read. Both the paper’s language and its content are dense, and it was also difficult to find a detailed breakdown of the paper on the internet. So, the goal of this article is to break down the Inception-v3 research paper in great detail.
2. General Design Principles
The research paper discusses a few principles that should be followed while designing convolutional neural network architectures.
- We should avoid representational bottlenecks, especially early in the network. A representational bottleneck occurs when the dimensionality of the representation is reduced suddenly and drastically, which discards information that cannot be recovered later. So, to avoid representational bottlenecks, the representation size should decrease gently from the input towards the final representation.
- Higher-dimensional representations are easier to process locally within a network. Modelling compressed data is difficult, while having more dimensions helps to capture complex, disentangled features. With a lower-dimensional representation, capturing the same complexity would require more entangled features.
- Spatial aggregation can be done over lower-dimensional embeddings without much or any loss in representational power. This means that we can reduce the dimension with 1x1 convolutions before spatial aggregation (e.g. a 3x3 convolution) without losing much information.
- We should balance the width and depth of the network. Optimal performance is reached by balancing the number of filters per stage (width) against the depth of the network; increasing both in parallel gives the best improvement for a given computational budget. The same concept was later used while designing the architecture of EfficientNets.
3. Factorizing Convolutions with Large Filter Size
The paper attributes much of the original gain of the GoogLeNet network to its generous use of dimensionality reduction, i.e. the 1x1 convolutions placed before the 3x3 and 5x5 convolutions. This reduces the number of parameters per stage and hence allows the depth of the network to be increased. In the same spirit, we will see a few techniques that can further reduce the computational cost.
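To make this reduction pattern concrete, here is a minimal PyTorch sketch (not the paper's code; the channel sizes and feature-map size are illustrative assumptions) showing a cheap 1x1 convolution shrinking the channel dimension before a more expensive 3x3 convolution:

```python
import torch
import torch.nn as nn

# Hypothetical channel sizes, chosen only for illustration.
in_channels, reduced, out_channels = 256, 64, 128

# GoogLeNet-style dimension reduction: a 1x1 convolution shrinks the
# channel dimension before the expensive 3x3 spatial convolution.
reduce_then_convolve = nn.Sequential(
    nn.Conv2d(in_channels, reduced, kernel_size=1),              # 256 -> 64 channels
    nn.ReLU(inplace=True),
    nn.Conv2d(reduced, out_channels, kernel_size=3, padding=1),  # 3x3 on the reduced tensor
    nn.ReLU(inplace=True),
)

x = torch.randn(1, in_channels, 35, 35)   # dummy 35x35 feature map
print(reduce_then_convolve(x).shape)      # torch.Size([1, 128, 35, 35])
```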
3.1. Factorization into smaller convolutions
Large filters like 7x7 and 5x5 are computationally very expensive. For example, a 5x5 convolution with n filters over a grid is 25/9 ≈ 2.78 times more computationally expensive than a 3x3 convolution with the same number of filters. However, a 5x5 filter can capture dependencies between activations of units that are further apart in the earlier layer, so simply shrinking the filter size comes at a cost in expressiveness. So, can we build the effect of a 5x5 convolution out of 3x3 convolutions? Yes: stacking two 3x3 convolutions covers the same receptive field and produces the same output size as one 5x5 convolution. This way, we end up with a net (9+9)/25 reduction of computation, a relative saving of 28% from this factorization.
Still, this setup raises two general questions: does this replacement result in any loss of expressiveness? And if our main goal is to factorize the linear part of the computation, would it not suggest keeping a linear activation in the first layer? Regarding the second question: if we would normally apply a ReLU after the 5x5 convolution, then only the second of the two 3x3 convolutions would be followed by a ReLU, keeping a linear activation after the first one. The authors tested both variants and found that applying a ReLU after both 3x3 convolutions performs better than keeping the first activation linear.
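A small PyTorch sketch of this factorization, with illustrative filter counts that are my own assumption rather than the paper's, comparing the weight cost of one 5x5 convolution against two stacked 3x3 convolutions (with a ReLU after each, as the paper found to work best):

```python
import torch
import torch.nn as nn

channels = 64  # illustrative filter count

# One 5x5 convolution ...
conv5x5 = nn.Conv2d(channels, channels, kernel_size=5, padding=2)

# ... versus two stacked 3x3 convolutions covering the same 5x5 receptive field.
two_3x3 = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(channels, channels, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
)

def conv_weight_count(module):
    # Count only convolution weights (ignoring biases) for the 25 vs 9+9 comparison.
    return sum(p.numel() for name, p in module.named_parameters() if name.endswith("weight"))

print(conv_weight_count(conv5x5))  # 64*64*25 = 102400
print(conv_weight_count(two_3x3))  # 2*64*64*9 = 73728, i.e. roughly 28% fewer

x = torch.randn(1, channels, 35, 35)
print(conv5x5(x).shape, two_3x3(x).shape)  # both torch.Size([1, 64, 35, 35])
```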
3.2. Spatial Factorization into Asymmetric Convolutions
A natural follow-up question is whether we should factorize further, for example 3x3 convolutions into 2x2 convolutions. The paper states that factorizing a 3x3 convolution into two 2x2 convolutions saves only 11% of computation, while factorizing it into a 3x1 convolution followed by a 1x3 convolution saves 33%. They found that this asymmetric factorization does not work well in early layers, but it gives very good results on medium grid sizes (on m × m feature maps, where m ranges between 12 and 20). On that level, very good results can be achieved by using 1x7 convolutions followed by 7x1 convolutions.
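A sketch of the asymmetric factorization in PyTorch, again with channel counts and feature-map size chosen only for illustration: a 7x7 convolution is replaced by a 1x7 convolution followed by a 7x1 convolution, cutting the per-position weight cost from 49 to 14.

```python
import torch
import torch.nn as nn

channels = 128  # illustrative

# A 7x7 convolution factorized into a 1x7 followed by a 7x1 convolution.
# Weight cost per input/output channel pair drops from 7*7 = 49 to 7 + 7 = 14.
asymmetric_7x7 = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=(1, 7), padding=(0, 3)),
    nn.ReLU(inplace=True),
    nn.Conv2d(channels, channels, kernel_size=(7, 1), padding=(3, 0)),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, channels, 17, 17)   # a medium grid size, as the paper suggests
print(asymmetric_7x7(x).shape)         # torch.Size([1, 128, 17, 17])
```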
4. Utility of Auxiliary Classifiers
GoogLeNet introduced the concept of auxiliary classifiers to tackle the problem of vanishing gradients and to improve convergence of deeper CNN networks. However, the paper found that auxiliary classifiers did not result in improved convergence early in training: the training progression of the network with and without the side head looks virtually identical before both models reach high accuracy. Near the end of training, the network with the auxiliary branches starts to overtake the accuracy of the network without any auxiliary branch and reaches a slightly higher performance. Instead, the paper argues that the auxiliary classifiers act as regularizers. This is supported by the fact that the main classifier of the network performs better if the side branch is batch-normalized or has a dropout layer.
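For intuition, here is a rough PyTorch sketch of an auxiliary classifier head attached to an intermediate 17x17 feature map. The layer sizes are illustrative assumptions, not the paper's exact configuration; the batch normalization in the side branch is what the paper credits for the regularization effect.

```python
import torch
import torch.nn as nn

class AuxHead(nn.Module):
    """Illustrative auxiliary classifier head on a 17x17 intermediate feature map."""
    def __init__(self, in_channels=768, num_classes=1000):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=5, stride=3)      # 17x17 -> 5x5
        self.conv = nn.Conv2d(in_channels, 128, kernel_size=1)
        self.bn = nn.BatchNorm2d(128)                          # batch-normalized side branch
        self.fc = nn.Linear(128 * 5 * 5, num_classes)

    def forward(self, x):
        x = self.pool(x)
        x = torch.relu(self.bn(self.conv(x)))
        return self.fc(torch.flatten(x, 1))

aux = AuxHead()
print(aux(torch.randn(2, 768, 17, 17)).shape)  # torch.Size([2, 1000])
```

During training, the loss of this side head is added (with a small weight) to the main loss; at inference time the branch is discarded.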
5. Efficient Grid Size Reduction
To reduce the grid size efficiently while avoiding a representational bottleneck, the paper presents a variant that reduces the computational cost even further. We can use two parallel stride-2 blocks: P and C. P is a pooling layer (either average or max pooling) and C is a convolutional block; both operate with stride 2, and their filter banks are concatenated.
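A minimal sketch of this grid-size reduction in PyTorch, assuming illustrative channel counts (288 in, 384 from the convolutional branch): the stride-2 convolution branch C and the stride-2 pooling branch P run in parallel and are concatenated along the channel axis.

```python
import torch
import torch.nn as nn

class GridReduction(nn.Module):
    """Parallel stride-2 convolution and pooling branches, concatenated."""
    def __init__(self, in_channels=288, conv_channels=384):
        super().__init__()
        self.conv_branch = nn.Sequential(                       # branch C
            nn.Conv2d(in_channels, conv_channels, kernel_size=3, stride=2),
            nn.ReLU(inplace=True),
        )
        self.pool_branch = nn.MaxPool2d(kernel_size=3, stride=2)  # branch P

    def forward(self, x):
        return torch.cat([self.conv_branch(x), self.pool_branch(x)], dim=1)

block = GridReduction()
print(block(torch.randn(1, 288, 35, 35)).shape)  # torch.Size([1, 672, 17, 17])
```

The spatial resolution is halved while the channel count grows, so the representation is never squeezed through a narrow intermediate tensor.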
6. Label Smoothing
The paper proposes a mechanism to regularize the classifier layer by estimating the marginalized effect of label-dropout during training. For each training example x, the model computes the probability of each label k ∈ {1 . . . K}: p(k|x), using a softmax layer. Consider the ground-truth distribution over labels q(k|x) for this training example, normalized so that the sum of q(k|x) over all labels is 1. So p(k|x) is the probability distribution predicted by the model and q(k|x) is the ground-truth distribution. Normally q(k|x) is a Dirac delta, q(k|x) = δ(k,y), i.e. q(y|x) = 1 for the true label y and q(k|x) = 0 for every other k.

The exact Dirac delta cannot be achieved by p(k|x), but p can get close to it if p(y|x) >> p(k|x) for all k ≠ y. This causes two problems. First, it may result in over-fitting: if the model learns to assign full probability to the ground-truth label for each training example, it is not guaranteed to generalize. Second, it encourages the difference between the largest logit and all others to become large, which reduces the ability of the model to adapt. Intuitively, this happens because the model becomes too confident about its predictions. To solve this problem, the paper modifies the ground-truth distribution.
The smoothed ground-truth distribution mixes the original Dirac delta with a uniform distribution u(k) that is independent of the example x: q'(k|x) = (1 − e) δ(k,y) + e/K, where e is a hyper-parameter and K is the number of labels. While training on ImageNet data with K = 1000 classes, the paper used u(k) = 1/1000 and e = 0.1.
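A short PyTorch sketch of this smoothing, using dummy logits and labels for illustration: the one-hot target is replaced by the mixture q'(k|x) and the cross-entropy is taken against it.

```python
import torch
import torch.nn.functional as F

def smoothed_targets(labels, num_classes=1000, eps=0.1):
    # q'(k|x) = (1 - eps) * delta(k, y) + eps / K, as described above.
    one_hot = F.one_hot(labels, num_classes).float()
    return (1.0 - eps) * one_hot + eps / num_classes

logits = torch.randn(4, 1000)            # dummy model outputs
labels = torch.tensor([3, 17, 999, 0])   # dummy ground-truth labels
q = smoothed_targets(labels)             # each row still sums to 1
loss = -(q * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
print(loss)
```

Recent PyTorch versions also expose this directly as the `label_smoothing` argument of `nn.CrossEntropyLoss`.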
7. Training Methodology
The paper achieved its best model using RMSProp with decay 0.9 and ε = 1.0. They used a learning rate of 0.045, decayed every two epochs using an exponential rate of 0.94. In addition, gradient clipping with threshold 2.0 was found to be useful to stabilize training. Model evaluations are performed using a running average of the parameters computed over time.
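A minimal sketch of that training setup in PyTorch, with a placeholder model; mapping the paper's TensorFlow-style "decay" to PyTorch's `alpha` and decaying the learning rate once per epoch via `StepLR` are my assumptions about how to translate the reported hyper-parameters.

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder for the Inception-v3 network

# RMSProp with decay 0.9 (alpha) and epsilon 1.0, learning rate 0.045,
# multiplied by 0.94 every two epochs, plus gradient clipping at 2.0.
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.045, alpha=0.9, eps=1.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.94)

# One dummy training step:
loss = model(torch.randn(8, 10)).sum()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0)
optimizer.step()
scheduler.step()  # call once per epoch so the rate decays every two epochs

# The paper evaluates with a running average of the parameters; something like
# torch.optim.swa_utils.AveragedModel can play that role.
```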
8. Conclusion
Inception-v3 achieved excellent results on the ILSVRC 2012 classification benchmark, not only outperforming its predecessor Inception-v1 (GoogLeNet) but also becoming the state of the art at the time. With this I conclude the breakdown of the Rethinking the Inception Architecture for Computer Vision research paper, which introduced us to factorized convolutions, batch-normalized auxiliary classifiers and label smoothing.
9. References
[1] Rethinking the Inception Architecture for Computer Vision — Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna