Two articles ago, we dissected the structures of AlexNet and the VGG family. While the two networks differ in their choices of filters, strides and depths, they both have a straightforward linear architecture. GoogLeNet (and later the whole Inception family), in contrast, has a more complex structure.
At first glance of the well-known GoogLeNet structure diagram and table (see below), we tend to be overwhelmed by the sophistication of the design and baffled by the specific filter choices. However, once we understand the philosophy behind the “Inception” module, the whole structure of GoogLeNet quickly collapses into a simple, traceable pattern.
- Explanation of the Inception module
- Overall architecture of GoogLeNet
- PyTorch implementation and discussions
With its stylized name paying homage to LeNet, GoogLeNet was the winner of the ImageNet 2014 classification challenge. A single GoogLeNet achieves a top-5 error rate of 10.07%. With a 7-model ensemble and an aggressive cropping policy during testing, the error rate can be slashed to 6.67%. In comparison, VGG, the first runner-up in the same competition, has an error rate of 7.32%, while AlexNet in 2012 had an error rate of 16.4% when trained without external data.
Its impressive performance comes down to the novel Inception module. In fact, GoogLeNet is but an epitome of this design heuristic. In the paper, the authors described GoogLeNet as a “particular incarnation of the Inception architecture”. So let’s look into the Inception module.
As shown below, the naive version of Inception module is just a set of filters of different sizes along with a max pooling block. The outputs of these individual components are concatenated in the channel dimension to form the output of the inception module.
The authors argue that on the feature map of a specific layer, some statistical relationships are local, while other relationships are formed by “feature pixels” that are more spatially spread out. If we are to cover all these relationships effectively, we should use filters of different sizes. 1x1 filters can capture local statistics while 3x3 and 5x5 can be used for statistics that are more spread out.
Compare this with the VGG family, which exclusively uses 3x3 filters. In terms of the receptive field on the original image, VGG’s filters in the same layer only collect statistics from regions of the same size. In comparison, every Inception module in GoogLeNet gets to collect statistics from regions of different sizes. This gives GoogLeNet more representational power. (This illustration is of course not entirely accurate: beyond a certain depth, all filters are already looking at the whole image. But you get the idea.)
However, the naive version of the module is impractical. As we go deeper into the network, the number of channels in the feature maps increases, and the cost of a convolution scales with both its input channels and its number of filters. The computation for the 3x3 and 5x5 filters can quickly become prohibitive. Therefore, we should reduce the number of input channels fed to the 3x3 and 5x5 filters.
1x1 filters in GoogLeNet are not only used to capture local statistics, but also to reduce the input channels. For example, suppose an input feature map’s dimension is 14x14x512; we can reduce the number of channels of this feature map by passing it through 24 1x1 filters. The resulting feature map will then have a dimension of 14x14x24. Now we can apply larger filters on this reduced feature map with less computation required. This explains the 1x1 convolutions before the 3x3 and 5x5 convolutions in the final Inception module (figure below). There is also a 1x1 convolution after the 3x3 max pooling block. The effect of this 1x1 filter is similar: it keeps the output channels in check. Without it, the output channel count would only grow monotonically as modules are stacked, since the pooling branch preserves all of its input channels.
Let’s do a bit of math to demonstrate the reduction in computation from 1x1 filters. Take the 3rd (5x5) branch of the naive version and of the final Inception module, using inception-4b (see table below) as an example. It has an input size of 14x14x512, and the 3rd branch has 64 5x5 filters. With the naive version, the third branch incurs 14x14x512 x 5x5x64 ≈ 161M floating-point multiplications. With the 24-filter 1x1 reduction, this drops to 14x14x512 x 1x1x24 + 14x14x24 x 5x5x64 ≈ 10M. This amounts to a reduction of about 16 times!
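The arithmetic above can be double-checked with a few lines of Python:

```python
# Multiplication counts for the 3rd (5x5) branch of inception-4b.
# Input feature map: 14x14x512; 24 1x1 reduction filters; 64 5x5 filters.
naive = 14 * 14 * 512 * (5 * 5 * 64)             # direct 5x5 convolution
reduced = (14 * 14 * 512 * (1 * 1 * 24)          # 1x1 reduction first...
           + 14 * 14 * 24 * (5 * 5 * 64))        # ...then 5x5 on 24 channels
print(f"naive:    {naive / 1e6:.0f}M multiplications")    # ~161M
print(f"reduced:  {reduced / 1e6:.0f}M multiplications")  # ~10M
print(f"speed-up: {naive / reduced:.1f}x")
```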
Dual purpose. Note that these 1x1 reduction filters are also followed by ReLU activations. Thus, in addition to dimensionality reduction, they add extra non-linearity to the layer.
When we treat the Inception modules as basic building blocks, GoogLeNet condenses back into a linear model for inference (for training, it requires two auxiliary side classifiers; we will get to them soon!).
The table below and the diagram above illustrate its structure very clearly. The “S” and “V” in the diagram refer to “same” and “valid” padding, which is how padding is specified in TensorFlow. There are detailed Stack Overflow answers explaining the two modes, and we can use the formulae there to compute the corresponding paddings we need to add in PyTorch.
Note: in this article, we ignore the Local Response Normalization layers.
GoogLeNet has 22 convolutional/linear layers (if we sum up the depth column in the table). It was considered very deep at that time.
Large depth generally induces concerns about gradient backpropagation, efficiency of the model, and overfitting. To alleviate these issues, the authors proposed adding two auxiliary classifiers to intermediate layers.
The auxiliary classifiers are due to two considerations. The first comes from the insight that many shallow networks can have strong performance on image classification tasks. It means that the feature maps after a few rounds of convolution should already be discriminative enough. Therefore, we attach classifiers to lower layers to ensure this property. In the process, we also force the parameters of the lower layers to generate more interpretable features, preventing them from just projecting the features into whatever space best fits the training data. This constraint on the parameters offers a regularizing effect. (I have to admit that the last two sentences are my own rationalization of how the regularizing effect comes about. I may have committed the mistake of over-interpretation 😐.)
The second consideration comes from gradient flow. Many gradients in the backward pass can be killed by ReLU units or diminished by small weights and intermediate features. The auxiliary classifiers provide more direct gradients to the lower layers, enhancing their training.
The final loss for our optimizer will be main_loss + 0.3 * aux_loss_1 + 0.3 * aux_loss_2, as the auxiliary losses are weighted by a factor of 0.3.
Don’t be daunted by the table and long diagram of GoogLeNet! Its actual implementation is in fact very straightforward. Just like how we understood its structure, once we finish implementing the Inception module, the whole GoogLeNet structure is almost as linear as our previously coded AlexNet and VGG.
Firstly, let’s implement a helper class called Conv2dBn. We use this class to retrofit batchnorm to GoogLeNet by appending an nn.BatchNorm2d layer behind every convolution.
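A sketch of such a helper might look like the following (the use_bn flag and the keyword pass-through are assumptions about the original interface):

```python
import torch
import torch.nn as nn

class Conv2dBn(nn.Module):
    """Convolution, an optional retrofitted BatchNorm2d, then ReLU."""
    def __init__(self, in_channels, out_channels, use_bn=False, **kwargs):
        super().__init__()
        layers = [nn.Conv2d(in_channels, out_channels, **kwargs)]
        if use_bn:
            # Retrofit batchnorm right behind the convolution.
            layers.append(nn.BatchNorm2d(out_channels))
        layers.append(nn.ReLU(inplace=True))
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)
```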
Next, let’s implement the core of GoogLeNet, Inception. The arguments of the __init__() constructor refer to the number of channels of the input feature maps and the number of filters for each convolution, except use_bn, which specifies whether we are retrofitting batchnorm. In the forward pass, torch.cat() is used to concatenate the outputs of the different branches.
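A minimal sketch of the module follows. The argument names mirror the table’s columns but are assumptions, and a small conv_block function stands in for the Conv2dBn helper so the snippet runs on its own:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, **kw):
    # Stand-in for the Conv2dBn helper: convolution followed by ReLU.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, **kw), nn.ReLU(inplace=True))

class Inception(nn.Module):
    def __init__(self, in_ch, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj):
        super().__init__()
        self.branch1 = conv_block(in_ch, ch1x1, kernel_size=1)
        self.branch2 = nn.Sequential(
            conv_block(in_ch, ch3x3red, kernel_size=1),            # 1x1 reduction
            conv_block(ch3x3red, ch3x3, kernel_size=3, padding=1))
        self.branch3 = nn.Sequential(
            conv_block(in_ch, ch5x5red, kernel_size=1),            # 1x1 reduction
            conv_block(ch5x5red, ch5x5, kernel_size=5, padding=2))
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1, ceil_mode=True),  # keeps H, W
            conv_block(in_ch, pool_proj, kernel_size=1))           # caps channels

    def forward(self, x):
        # Concatenate the four branches along the channel dimension.
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)
```

For example, inception-3a in the table takes a 192-channel input and produces 64 + 128 + 32 + 32 = 256 output channels.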
There are two things worth pointing out. Firstly, in the official torchvision implementation of GoogLeNet, the 5x5 convolution is replaced by a 3x3 convolution. Yes, they used two 3x3 branches. This is a known and acknowledged mistake. In our implementation, we stay true to the original paper.
Secondly, notice the ceil_mode = True for nn.MaxPool2d()? By default, ceil_mode is set to False, and the output dimension is computed as O = floor((W - K + 2P)/S) + 1. If ceil_mode is set to True, the ceiling is used instead of the floor in the formula. This extra flag is set to True to match the “same” padding implementation in TensorFlow.
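A quick sketch of the difference, using the feature-map size after GoogLeNet’s first convolution:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 112, 112)  # e.g. the output of the first 7x7/2 conv
floor_pool = nn.MaxPool2d(kernel_size=3, stride=2)                 # ceil_mode=False
ceil_pool = nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True)
print(floor_pool(x).shape)  # floor((112 - 3) / 2) + 1 = 55
print(ceil_pool(x).shape)   # ceil((112 - 3) / 2) + 1 = 56, matching TF "same"
```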
We need one last ingredient before assembling everything into the final GoogLeNet. Let’s code the SideClassifier now. We did not go through the structure of the side classifier in this article, but it can easily be found at the bottom of page 6 of the paper.
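Based on that description (5x5 average pooling with stride 3, a 128-filter 1x1 convolution, a 1024-unit fully connected layer, 70% dropout, and a final linear classifier), a sketch could be (class and attribute names are assumptions):

```python
import torch
import torch.nn as nn

class SideClassifier(nn.Module):
    """Auxiliary classifier, per the description on page 6 of the paper.
    Returns raw logits; the softmax is folded into the loss function."""
    def __init__(self, in_channels, num_classes=1000):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=5, stride=3)  # 14x14 -> 4x4
        self.conv = nn.Conv2d(in_channels, 128, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)
        self.dropout = nn.Dropout(p=0.7)
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x):
        x = self.relu(self.conv(self.pool(x)))
        x = torch.flatten(x, 1)
        x = self.dropout(self.relu(self.fc1(x)))
        return self.fc2(x)
```

Note that the 128 * 4 * 4 flattened size assumes the 14x14 feature maps where the paper attaches these classifiers.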
Now, we can define the GoogLeNet class. With all the previous preparation, coding the main structure is as easy as copying the numbers from the table. In the constructor, self.stem combines the few standalone layers at the beginning of the model, and self.classifier holds the auxiliary and main classifiers. All the Inception modules can be initialized by copying the corresponding numbers of channels from the table.
The forward() method routes the input through the stem, the Inception modules, and the classifiers. The two auxiliary classifiers are attached after inception_4a and inception_4d respectively. If the model is in training mode (i.e. by default, or after we call model.train()), we return the outputs of all classifiers. Otherwise (after we call model.eval()), we only return the main classifier’s output for inference.
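Putting the pieces together, here is a compact, self-contained sketch of the whole network. Inception and SideClassifier are re-defined in minimal form so the snippet runs standalone, and the layer names (inception_4a etc.) are assumptions that mirror the table:

```python
import torch
import torch.nn as nn

def conv(in_ch, out_ch, **kw):
    # Minimal stand-in for Conv2dBn: convolution followed by ReLU.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, **kw), nn.ReLU(inplace=True))

class Inception(nn.Module):
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, proj):
        super().__init__()
        self.b1 = conv(in_ch, c1, kernel_size=1)
        self.b2 = nn.Sequential(conv(in_ch, c3r, kernel_size=1),
                                conv(c3r, c3, kernel_size=3, padding=1))
        self.b3 = nn.Sequential(conv(in_ch, c5r, kernel_size=1),
                                conv(c5r, c5, kernel_size=5, padding=2))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1, ceil_mode=True),
                                conv(in_ch, proj, kernel_size=1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

class SideClassifier(nn.Module):
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.AvgPool2d(5, stride=3), conv(in_ch, 128, kernel_size=1),
            nn.Flatten(), nn.Linear(128 * 4 * 4, 1024), nn.ReLU(inplace=True),
            nn.Dropout(0.7), nn.Linear(1024, num_classes))

    def forward(self, x):
        return self.net(x)

class GoogLeNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        pool = lambda: nn.MaxPool2d(3, stride=2, ceil_mode=True)
        self.stem = nn.Sequential(          # standalone layers before inception
            conv(3, 64, kernel_size=7, stride=2, padding=3), pool(),
            conv(64, 64, kernel_size=1),
            conv(64, 192, kernel_size=3, padding=1), pool())
        # Channel numbers copied from the table in the paper.
        self.inception_3a = Inception(192, 64, 96, 128, 16, 32, 32)
        self.inception_3b = Inception(256, 128, 128, 192, 32, 96, 64)
        self.pool3 = pool()
        self.inception_4a = Inception(480, 192, 96, 208, 16, 48, 64)
        self.inception_4b = Inception(512, 160, 112, 224, 24, 64, 64)
        self.inception_4c = Inception(512, 128, 128, 256, 24, 64, 64)
        self.inception_4d = Inception(512, 112, 144, 288, 32, 64, 64)
        self.inception_4e = Inception(528, 256, 160, 320, 32, 128, 128)
        self.pool4 = pool()
        self.inception_5a = Inception(832, 256, 160, 320, 32, 128, 128)
        self.inception_5b = Inception(832, 384, 192, 384, 48, 128, 128)
        self.aux1 = SideClassifier(512, num_classes)  # after inception_4a
        self.aux2 = SideClassifier(528, num_classes)  # after inception_4d
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Dropout(0.4), nn.Linear(1024, num_classes))

    def forward(self, x):
        x = self.pool3(self.inception_3b(self.inception_3a(self.stem(x))))
        x = self.inception_4a(x)
        aux1 = self.aux1(x) if self.training else None
        x = self.inception_4d(self.inception_4c(self.inception_4b(x)))
        aux2 = self.aux2(x) if self.training else None
        x = self.inception_5b(self.inception_5a(self.pool4(self.inception_4e(x))))
        main = self.head(x)
        return (main, aux1, aux2) if self.training else main
```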
So far, we have finished coding GoogLeNet. However, to simplify our training script, we can create a helper class called GoogLeNetWithLoss. It folds the loss computation into the forward pass and weights the main and auxiliary losses.
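One way to sketch such a wrapper (the class and argument names are assumptions; it only relies on the model returning a (main, aux1, aux2) tuple in training mode and plain logits in eval mode):

```python
import torch
import torch.nn as nn

class GoogLeNetWithLoss(nn.Module):
    """Wraps a GoogLeNet-style model so the forward pass returns the
    weighted sum of the main and auxiliary losses."""
    def __init__(self, model, aux_weight=0.3):
        super().__init__()
        self.model = model
        self.aux_weight = aux_weight
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, images, targets):
        if self.model.training:
            main, aux1, aux2 = self.model(images)
            return (self.criterion(main, targets)
                    + self.aux_weight * self.criterion(aux1, targets)
                    + self.aux_weight * self.criterion(aux2, targets))
        return self.criterion(self.model(images), targets)

# Quick check with a toy stand-in model that mimics GoogLeNet's output contract.
class _Toy(nn.Module):
    def forward(self, x):
        logits = x.flatten(1)[:, :10]
        return (logits, logits, logits) if self.training else logits

wrapper = GoogLeNetWithLoss(_Toy())
images = torch.randn(4, 3, 2, 2)        # flatten -> 12 features, sliced to 10
labels = torch.randint(0, 10, (4,))
wrapper.train()
train_loss = wrapper(images, labels)    # main + 0.3 * aux1 + 0.3 * aux2
wrapper.eval()
eval_loss = wrapper(images, labels)     # main loss only
```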
With this, we have finished implementing GoogLeNet. If you have some idle GPU time, feel free to plug it into our training scripts for training!
As mentioned in the introduction, the use of “Inception” is not unique to GoogLeNet. Many later works expanded on the idea of “cardinality”, the number of paths/branches in the Inception module, to build even more powerful models. Many of them are as inspiring as GoogLeNet. I hope we can take a look at them in later articles.
Based on the three networks we have learnt so far, it seems that deeper networks are generally more powerful. However, gradient flow can be an issue for deep architectures. GoogLeNet used auxiliary classifiers to overcome this hindrance, but is there a more systematic and elegant design to solve this problem?
Yes, there is. In the next article, we will go through skip connections and one of the most popular network families in computer vision, the ResNet family.