INCEPTION-v1 Paper Walkthrough

Karunesh Upadhyay
7 min read · Oct 2, 2023


INTRODUCTION

  • The main hallmark of this architecture is the improved utilization of computational resources inside the network, achieved through a carefully crafted design. It allows the depth and width of the network to be increased while keeping the computational budget constant.
  • The design is based on the Hebbian principle and the intuition of multi-scale processing.
  • The Hebbian principle is a concept in neuroscience and machine learning that states “cells that fire together, wire together.” It suggests that when two neurons are activated at the same time, the connection (synapse) between them strengthens. This principle is foundational in understanding how neural networks and biological brains learn through synaptic plasticity.
  • Multiscale processing, in the context of information processing, involves analyzing data at multiple levels of detail or granularity simultaneously. The intuition behind multiscale processing is that by considering information at different scales, you can capture both fine-grained details and overarching patterns in data, leading to a more comprehensive understanding of complex systems or phenomena.
  • It was submitted to ILSVRC14 as a 22-layer deep architecture named GoogLeNet and evaluated in the context of both classification and detection.
  • At the time this paper was released, most of the progress in deep learning came not only from more powerful hardware, larger datasets and bigger models, but mainly from new ideas, algorithms and improved network architectures.
  • Also, instead of fixating only on accuracy, the researchers kept an eye on the computational budget (multiply-adds) so that the models could be put to real-world use.
  • The architecture was code-named Inception, inspired by the famous “we need to go deeper” internet meme.

MOTIVATION

The most straightforward way of improving the performance of deep learning models is to increase their size (both depth and width), which leads to the following problems:

  • Increase in number of parameters
  • Bigger size might lead to overfitting
  • Requirement of creating larger amounts of high-quality labeled training data
  • Increased computational resources

Thus, the authors were trying to work on options other than just increasing the number of layers in the architecture.

Why odd-numbered filters?

  • Convolution is an element-wise multiplication followed by a summation, used to encode the source data matrix in terms of the filter/kernel. When the filter is placed over the source matrix, we want to capture information about a source pixel and its relationship with its neighbours. It therefore makes sense to gather information from the surroundings in a symmetrical manner: an equal number of pixels on each side, plus the centre pixel itself. Hence 2n+1 is the natural filter size.
  • If an even-sized filter is used, an aliasing effect can appear in the output. Aliasing occurs when high-frequency components of the original signal become indistinguishable from its low-frequency components. As an example, if a non-zero element is surrounded by zeros, an odd-sized filter can be trained to highlight that particular cell in its output regardless of the stride, whereas an even-sized kernel might not be able to keep it centred (imagine stride = 1). A short sketch follows below.
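As a minimal illustration (my own sketch, not from the paper, assuming PyTorch purely for convenience): an odd kernel of size k = 2n + 1 has a well-defined centre pixel and, with padding n, preserves the spatial size of the input, whereas an even kernel has no centre and no symmetric padding that keeps the size unchanged.

# Sketch: odd kernels have a centre pixel and admit symmetric "same" padding.
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                      # dummy input: N, C, H, W

odd = nn.Conv2d(3, 8, kernel_size=3, padding=1)    # padding = (3 - 1) // 2
print(odd(x).shape)                                # [1, 8, 32, 32] -> size preserved

even = nn.Conv2d(3, 8, kernel_size=2, padding=1)   # any symmetric padding changes the size
print(even(x).shape)                               # [1, 8, 33, 33] -> size drifts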

ARCHITECTURE

  • Instead of committing to a single filter size, the authors use three different filter sizes (1x1, 3x3 and 5x5) plus a pooling branch inside an inception module; each branch convolves over the same input, and the results are concatenated along the channel dimension.
  • This helps in extracting multi-level features from the image, where the 1x1 filters can be imagined to capture local information and the 5x5 filters more global information.
  • Despite concerns raised by many researchers that max-pool layers might result in the loss of accurate spatial information, a pooling branch was still included, since pooling has been essential to the success of many state-of-the-art architectures. This first design was termed the naive version: because of the 5x5 filters and the 3x3 max pool (whose output keeps the full input depth), it would cause a computational blow-up in later stages.
Naive vs Implemented version of Inception-v1 (Source : Link)
  • This led to a more evolved version in which 1x1 convolutions are used to reduce the number of channels of the feature maps before they are fed to the 3x3 and 5x5 filters. Note that the 1x1 convolution is applied before the 3x3 and 5x5 kernels but after the 3x3 max-pool layer.
  • The 1x1 convolutions serve a dual purpose here: dimension reduction, and adding extra non-linearity, since each one is followed by a rectified linear activation.
  • The authors argue that even the lower-dimensional embeddings (produced by the 1x1 convolutions feeding the 3x3 and 5x5 branches) may still contain a lot of information about a relatively large image patch. Using 1x1 convolutions provides the following two benefits:

(a) Significantly reducing the computational cost by limiting the number of channels passed from one layer to the next before the expensive 3x3 and 5x5 convolutions.

(b) Multi-scale feature extraction due to the various filter sizes. A minimal sketch of such a module is given below.
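The following is a minimal sketch of an inception module with 1x1 reductions. It assumes PyTorch purely for illustration (the original model was not implemented in PyTorch); the example filter counts in the usage line are the ones listed for inception (3a) in the paper's table.

# Sketch of an inception module with 1x1 reduction branches (PyTorch assumed).
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        # Branch 1: plain 1x1 convolution
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        # Branch 2: 1x1 reduction, then 3x3 convolution
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU(inplace=True))
        # Branch 3: 1x1 reduction, then 5x5 convolution
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU(inplace=True))
        # Branch 4: 3x3 max pool, then 1x1 projection
        self.b4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Every branch preserves the spatial resolution, so the outputs can be
        # concatenated along the channel dimension.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# Inception (3a) filter counts from the paper: 28x28x192 in, 28x28x256 out.
m = InceptionModule(192, c1=64, c3_red=96, c3=128, c5_red=16, c5=32, pool_proj=32)
print(m(torch.randn(1, 192, 28, 28)).shape)   # torch.Size([1, 256, 28, 28])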

  • The following is an example of the reduction in total computational cost when 1x1 convolutions are used. We compare the computation for scenarios where the number of input channels is low (3) and high (100).
Left: high number of input channels; right: low number of input channels
  • Using a 1x1 convolution to reduce computation is helpful mostly when the number of input channels is high; a worked example with hypothetical numbers is sketched after this list.
  • The Inception network is an architecture made of such inception modules stacked on top of each other, with occasional max pooling (stride = 2) to halve the resolution of the feature maps.
  • In addition to the pros, the authors also point out one con: Inception requires careful manual design of such architectures, and changing them requires additional manual effort.
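Below is a back-of-the-envelope sketch of the multiply-add counts with and without a 1x1 bottleneck. The spatial size (28x28), bottleneck width (16) and output width (32) are hypothetical numbers chosen only for illustration; just the low/high input-channel counts (3 and 100) come from the comparison above.

# Multiply-adds for a k x k convolution producing an H x W x C_out output.
def conv_cost(h, w, c_in, c_out, k):
    return h * w * c_out * k * k * c_in

H = W = 28      # feature-map resolution (hypothetical)
C_OUT = 32      # number of 5x5 filters we ultimately want (hypothetical)
C_RED = 16      # width of the 1x1 bottleneck (hypothetical)

for c_in in (3, 100):   # low vs. high number of input channels
    direct = conv_cost(H, W, c_in, C_OUT, 5)
    reduced = conv_cost(H, W, c_in, C_RED, 1) + conv_cost(H, W, C_RED, C_OUT, 5)
    print(f"C_in={c_in:3d}: direct={direct:,}  with 1x1 bottleneck={reduced:,}")

# C_in=  3: direct ~1.9M,  bottleneck ~10.1M -> the 1x1 reduction hurts (16 > 3 channels)
# C_in=100: direct ~62.7M, bottleneck ~11.3M -> roughly a 5-6x saving

This matches the observation above: the 1x1 reduction only pays off when it actually shrinks the number of channels seen by the large filters.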

GoogLeNet

  • The name GoogLeNet is an homage to Yann LeCun's pioneering LeNet-5 network; GoogLeNet refers to the particular incarnation of the Inception architecture used for the competition submission.
GoogLeNet Architecture (Source : Link)
  • In the table, the “reduce” columns (#3x3 reduce, #5x5 reduce) denote the number of 1x1 filters applied before the 3x3 and 5x5 convolutions, and “pool proj” is the number of 1x1 filters applied after the built-in max pooling.
  • Inception block: within an inception block, the spatial resolution of the input and output feature maps stays the same; only the number of filters (channels) changes. The resolution is reduced only by the max-pool layers between blocks. The output depth is the sum of the filter counts of the four branches; for inception layer 3a the table lists 64 (1x1) + 128 (3x3) + 32 (5x5) + 32 (pool projection) = 256 channels.
  • Global average pooling: with a fully connected layer, all inputs are connected to all outputs, so converting 7x7x1024 to 1x1x1024 requires 7x7x1024x1024 ≈ 51.3M parameters. Doing the same reduction with a global average pooling layer requires zero weights, since averaging has no parameters. This motivated the authors to switch from fully connected layers to global average pooling; the switch also improved top-1 accuracy by about 0.6% (a short parameter-count sketch is given at the end of this section).
  • Auxiliary classifiers: because of the large depth of the network, propagating gradients back through all the layers was a concern. The authors also wanted the middle of the network to be highly discriminative, so they attached auxiliary classifiers to intermediate layers to encourage class discrimination in the lower stages. During training their losses are added with a discounted weight, as shown below:

# Total loss used by GoogLeNet during training: the two auxiliary losses are
# added with a discounted weight of 0.3; the auxiliary classifiers themselves
# are removed at inference time.
total_loss = real_loss + 0.3 * aux_loss_1 + 0.3 * aux_loss_2

GoogLeNet with Auxiliary Classifiers (Source : Link)
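As a quick sanity check on the parameter counts quoted above, here is a small sketch (again assuming PyTorch for illustration) comparing the weights of a fully connected 7x7x1024 -> 1024 mapping with global average pooling:

# Weights of a dense 7x7x1024 -> 1024 mapping vs. global average pooling.
import torch
import torch.nn as nn

fc = nn.Linear(7 * 7 * 1024, 1024)
gap = nn.AdaptiveAvgPool2d(1)

print(sum(p.numel() for p in fc.parameters() if p.dim() > 1))  # 51,380,224 weights (~51.3M)
print(sum(p.numel() for p in gap.parameters()))                # 0 -- averaging has no weights

x = torch.randn(1, 1024, 7, 7)
print(gap(x).shape)   # torch.Size([1, 1024, 1, 1]) -> same 1024-d descriptor, no parameters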

PERFORMANCE

  • At the time, GoogLeNet produced competitive results compared to other architectures for both classification and detection tasks, as shown in the following comparison tables.
Classification Performance (Source : Link)
Detection Performance (Source : Link)

CONCLUSION

  • Our exploration of the Inception-v1 paper has revealed its pivotal role in deep learning and computer vision. This innovative architecture, known as GoogLeNet, has redefined efficiency and performance in neural networks. Its emphasis on dense building blocks and multi-scale processing strikes a balance between computational efficiency and model quality, yielding competitive results in image classification and object detection.
  • Inception-v1’s impact extends beyond its accuracy; it has inspired subsequent architectures and shaped the future of deep learning. It serves as a testament to the power of innovative design in this field.
  • As we conclude, Inception-v1’s legacy continues to influence the development of neural networks, offering a robust foundation for addressing complex computer vision tasks. It reminds us that the pursuit of efficiency and creativity in deep learning remains a dynamic and promising journey in the realm of artificial intelligence.

REFERENCE

  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A. “Going Deeper with Convolutions.” CVPR 2015 (arXiv:1409.4842).