Scratch to SOTA: Build Famous Classification Nets 2 (AlexNet/VGG)

Wayne
Published in The Startup
Aug 10, 2020 · 11 min read

Introduction

In the last article, we reviewed how some of the most famous classification networks are evaluated on ImageNet. We also finished building a PyTorch evaluation dataset class as well as efficient evaluation functions. We will soon see how they come in handy for validating our model structures and training.

We will build the AlexNet and VGG models in this article. Despite their influential contributions to computer vision and deep learning, their structures are straightforward in retrospect. Therefore, in addition to building them, we will also play with “weight porting” and a sliding-window implementation based on convolutional layers.

I personally find borrowing weights from other models a simple but useful technique to practice. Besides using it to double-check our model structure, it can be used to port model weights between different frameworks, as well as to initialize the backbone of a detector or a modified network by reshaping the weights (we will dabble in this for our sanity check).

Cool, so let’s begin.

Overview

  • Explaining the structures of AlexNet and VGG family
  • Building our own library modules for them
  • Implementing sliding window with convolution
  • Replacing the fully connected classifier head with a convolutional classifier head for “dense evaluation”
  • Discussion on weight initialization
  • Sanity check by porting weights from the pretrained models

AlexNet

AlexNet is often regarded as the model that marked the dawn of the current deep learning era. It won ImageNet 2012 with a top-5 error rate of 15.3%, beating the runner-up of that year by a whopping 10.9 percentage points.

It has 60 million parameters, and given GPUs’ limited memory nearly 10 years ago, AlexNet had to be split and trained across two GTX 580 3GB GPUs. For this reason, it can be confusing to decipher the exact single-GPU structure of AlexNet from the original paper. In fact, the official PyTorch implementation of AlexNet takes reference from this paper (check footnote 1 on page 5), although PyTorch’s implementation still differs from that paper by using 256 kernels in the 4th convolutional layer instead of the 384 described in the paper (aargh!).

We will also ignore the Local Response Normalization (LRN) feature of the network. While the paper states that LRN reduces the top-1 and top-5 error rates by a non-negligible amount (1.4% and 1.2% respectively), it is not often used nowadays as its effect is insignificant for most networks. Batch normalization is the default scheme to apply if we want the model to learn better; we will add it to our VGG networks.

In this article, we will follow PyTorch’s AlexNet structure so that we can use its pretrained weights. The structure we will use is illustrated in the following figure. The spatial dimension of the feature maps after convolution/max-pooling can be computed as floor((W - F + 2P) / S) + 1, where W is the current feature map’s width/height, F is the filter size, P is the padding size and S is the stride. floor() means we round down to the nearest integer. The “thickness” or “depth” of the output feature map depends on the number of kernels (filters) we use at the current stage.
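To make the formula concrete, here is a tiny helper (purely illustrative, not part of the model code) that evaluates it:

import math

def conv_output_size(w, f, p, s):
    # floor((W - F + 2P) / S) + 1
    return math.floor((w - f + 2 * p) / s) + 1

# AlexNet's first convolution: 11 x 11 kernels, stride 4, padding 2, on a 224 x 224 input
print(conv_output_size(224, 11, 2, 4))  # -> 55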

Torchvision’s Implementation of AlexNet

With the PyTorch library, this AlexNet structure is pretty easy to implement. In the __init__() method, we can ignore the head argument for now; it controls what kind of classification head we want to use.

AlexNet __init__() ConvNet Feature Extractor
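A minimal sketch of this feature extractor, assuming torchvision’s layer configuration (the head argument and the attribute names here are illustrative, not necessarily those in the repository):

import torch
import torch.nn as nn

class AlexNet(nn.Module):
    def __init__(self, num_classes=1000, head='fc'):
        super().__init__()
        self.head = head
        # Convolutional feature extractor, mirroring torchvision's AlexNet
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),  # 224 -> 55
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                   # 55 -> 27
            nn.Conv2d(64, 192, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                   # 27 -> 13
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                   # 13 -> 6
        )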

As we can see in the code, and as you may already know, one beautiful property of convolutional layers is that they do not care about your input size. Given any input size, they will generate the corresponding output size according to the formula above. However, if we append fully connected layers to the convolutional layers, then we need to consider the final feature map’s total number of elements. In AlexNet, the flattened feature map must be a 9216-dimensional (256 x 6 x 6) vector before going into the fully connected classifier.

There are two ways to satisfy this constraint. We can either resize the input images to 224 x 224, or we can “resize” the final feature map to 256 x 6 x 6 before putting it into the fully connected layers. The second solution can be accomplished with nn.AdaptiveAvgPool2d(). It is loosely analogous to interpolation. A good explanation of it can be found here. With this bit of knowledge, we can build our fully connected classifier.

AlexNet __init__() ConvNet Classification Head
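Continuing the __init__() above, a sketch of the fully connected head (layer sizes follow torchvision’s AlexNet; the exact module ordering in the repository may differ):

        # Fully connected classification head (still inside __init__)
        if head == 'fc':
            # "Resize" whatever comes out of the feature extractor to 256 x 6 x 6
            self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
            self.classifier = nn.Sequential(
                nn.Dropout(),
                nn.Linear(256 * 6 * 6, 4096),
                nn.ReLU(inplace=True),
                nn.Dropout(),
                nn.Linear(4096, 4096),
                nn.ReLU(inplace=True),
                nn.Linear(4096, num_classes),
            )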

ConvNet’s Implementation of Sliding Windows

Fully connected layers are somewhat annoying in neural networks for computer vision tasks. They have too many parameters and impose the strict input-dimension requirement. So the question is: can we do without them? Luckily, yes. In many later networks, fully connected layers are replaced by simple average pooling. In our case, if we want to stay true to the original network structure, we can replace them with an equivalent convolutional implementation, ditching the need for a fixed input size.

Replacing FC Layers with Convolution

As in the above diagram, the 256 x 6 x 6 final feature map is on the left. It can be flattened to a 9216-dimensional feature vector and passed through a 4096-unit fully connected layer (top right). Every unit (green dot) has 9216 connections, one to each element of the feature map, and the layer produces a 4096 x 1 feature vector. In this way, each FC unit is effectively a 256 x 6 x 6 kernel (bottom right). We can thus replace the 4096-unit FC layer with a convolutional layer of 4096 kernels, each of size 256 x 6 x 6, which produces a 4096 x 1 x 1 feature map instead. Aside from the extra spatial dimensions, the two outputs are clearly equivalent.

With this derivation, we can implement the ConvNet version of classification head. Continuing from the previous code snippet, we have:
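A sketch of that convolutional head, again inside __init__() (the 'conv' value of the head argument is an assumption):

        # Convolutional classification head (still inside __init__)
        if head == 'conv':
            self.classifier = nn.Sequential(
                nn.Dropout(),
                nn.Conv2d(256, 4096, kernel_size=6),    # each kernel plays the role of one FC unit
                nn.ReLU(inplace=True),
                nn.Dropout(),
                nn.Conv2d(4096, 4096, kernel_size=1),   # a 1 x 1 convolution == FC on a 1 x 1 map
                nn.ReLU(inplace=True),
                nn.Conv2d(4096, num_classes, kernel_size=1),
            )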

At first glance, this conversion seems redundant. However, remember the beautiful property of ConvNets we mentioned minutes ago? Now we can allow inputs of any size, as long as they are at least 224 x 224.

For example, if we have an input image whose size is between 287 x 287 and 318 x 318, we will get a final feature map of 256 x 8 x 8. As our fully connected layers can only take in a flattened feature vector corresponding to a 256 x 6 x 6 feature map, we have to apply nn.AdaptiveAvgPool2d(). Alternatively, we can try “dense evaluation”: slide a 6 x 6 window across the feature map to get 9 crops of 256 x 6 x 6, and pass them through the FC layers one by one to generate outputs (Figure below, top). The outputs can then be averaged.

However, if we are using the ConvNet’s implementation of the FC layers, we are naturally using sliding windows (Figure above, bottom). The output feature map can be averaged across the spatial dimension to get the final output.

With ConvNet’s implementation of the classification head, we can now be lenient with our input image size.

Let’s write the last bit of code to complete our AlexNet implementation. The forward() method of the class is:
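A sketch, consistent with the __init__() above:

    def forward(self, x):
        x = self.features(x)
        if self.head == 'fc':
            x = self.avgpool(x)
            x = torch.flatten(x, 1)   # N x 9216
        x = self.classifier(x)        # N x 1000 for 'fc', N x 1000 x H x W for 'conv'
        return x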

In the same script, we can define a builder function that helps us generate the specified AlexNet. When we have pretrained weights ready, we can load them into the network here.
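For instance (the function name and arguments below are illustrative):

def alexnet(head='fc', weights_path=None, num_classes=1000):
    # Build the model and optionally load previously ported weights
    model = AlexNet(num_classes=num_classes, head=head)
    if weights_path is not None:
        model.load_state_dict(torch.load(weights_path, map_location='cpu'))
    return model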

With that, we have finished our AlexNet model. We can now move on to sanity check with weight porting.

Sanity Check with Weight Porting

We are going to instantiate a pretrained AlexNet from torchvision and copy its weights to our model. In this section, we will go through how the weights are indexed and how to reshape them for our ConvNet implementation of the FC classifier.

First, we instantiate all three networks:
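Something along these lines (variable names are illustrative):

import torchvision

torch_alexnet = torchvision.models.alexnet(pretrained=True)  # reference model with pretrained weights
fc_alexnet = AlexNet(head='fc')                               # ours, FC classification head
conv_alexnet = AlexNet(head='conv')                           # ours, convolutional classification head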

Let’s print out all the entries in the three networks’ state dicts. If we run the script at this stage, we should get something like the output below.

Torch name: features.0.weight      torch.Size([64, 3, 11, 11])
FC name   : features.0.weight      torch.Size([64, 3, 11, 11])
Conv name : features.0.weight      torch.Size([64, 3, 11, 11])
Torch name: features.0.bias        torch.Size([64])
FC name   : features.0.bias        torch.Size([64])
Conv name : features.0.bias        torch.Size([64])
Torch name: features.3.weight      torch.Size([192, 64, 5, 5])
FC name   : features.3.weight      torch.Size([192, 64, 5, 5])
Conv name : features.3.weight      torch.Size([192, 64, 5, 5])
...
Torch name: classifier.0.weight    torch.Size([4096, 25088])
FC name   : classifier.2.weight    torch.Size([4096, 25088])
Conv name : classifier.0.weight    torch.Size([4096, 512, 7, 7])
Torch name: classifier.0.bias      torch.Size([4096])
FC name   : classifier.2.bias      torch.Size([4096])
Conv name : classifier.0.bias      torch.Size([4096])
Torch name: classifier.3.weight    torch.Size([4096, 4096])
FC name   : classifier.5.weight    torch.Size([4096, 4096])
Conv name : classifier.3.weight    torch.Size([4096, 4096, 1, 1])
...

To transfer the weights from PyTorch’s pretrained AlexNet to our AlexNet with the FC classification head, we can create an OrderedDict() that stores the pretrained AlexNet’s weights under our model’s parameter names, and then load this OrderedDict into our model.
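A sketch of this copy; it assumes the two state dicts list their parameters in the same order, which holds when the modules are defined in the same order:

from collections import OrderedDict

fc_state = OrderedDict()
for (our_name, _), (_, torch_param) in zip(fc_alexnet.state_dict().items(),
                                           torch_alexnet.state_dict().items()):
    # Same parameter order, different names: store the pretrained tensor under our name
    fc_state[our_name] = torch_param.clone()

fc_alexnet.load_state_dict(fc_state)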

This process can be repeated for our AlexNet with the convolutional head. However, as convolutional layers and FC layers have weights of different shapes, we need to reshape the weights for our classifier.
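A sketch of the reshaping, under the same ordering assumption as above:

conv_state = OrderedDict()
for (our_name, our_param), (_, torch_param) in zip(conv_alexnet.state_dict().items(),
                                                   torch_alexnet.state_dict().items()):
    # An FC weight of shape (out, in) becomes a conv kernel of shape (out, C, H, W);
    # reshape() leaves the layers whose shapes already match untouched
    conv_state[our_name] = torch_param.reshape(our_param.shape).clone()

conv_alexnet.load_state_dict(conv_state)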

The weights load without any error, indicating that we have the parameter names and shapes right. However, to guarantee that everything is correct, we should test our model with the evaluation script we wrote in the last article.

The torchvision pretrained AlexNet and our AlexNet have exactly the same accuracy, indicating we have done everything correctly.

Next, let’s conduct the same center-crop evaluation with our convolutionally headed AlexNet. As this AlexNet outputs a feature map instead of a feature vector, we first need to write a wrapper that averages the model’s output.
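A minimal sketch of such a wrapper:

class SpatialAverage(nn.Module):
    # Hypothetical wrapper: average the conv head's output over its spatial dimensions
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, x):
        out = self.model(x)           # N x 1000 x H x W
        return out.mean(dim=(2, 3))   # N x 1000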

Then, we can pass the wrapped model to our evaluation function to get the outcome.

Finally, we can test the “dense evaluation” by passing in images larger than 224 x 224 and averaging the output for prediction.

As can be seen, the dense evaluation is about 1.3% more accurate than the center-crop evaluation. Yet, because most of the computations from the convolutional feature extractor are shared, the evaluation time isn’t too much longer.

Hooray! Now we have our own enhanced AlexNet.

VGG

When I first studied neural networks, I felt overwhelmed by the huge combinatorial space of possible network structures. How deep do I need to go? What should the kernel sizes be? How many kernels should I use? What should the strides be?… AlexNet did not help with these questions; there isn’t much of a pattern in AlexNet that we can follow. Then the VGG networks came to the rescue!

The VGG family is a landmark in deep learning, not only because it was the runner-up in the 2014 ImageNet competition (with an impressive 7.3% top-5 error rate), but also because it helped standardise the structure of networks.

Patterns in VGG

  1. Use 3 x 3 convolutional kernels across the whole network. Two 3 x 3 kernels stacked together have a 5 x 5 receptive field (i.e. one element on the output feature map is derived from a 5 x 5 region of the input). Three stacked together have a 7 x 7 receptive field. If we use a stack of three 3 x 3 kernels instead of one 7 x 7 kernel, not only do we keep the same receptive field, we also imbue the stack with three times more non-linearity (via ReLU) and use far fewer parameters (3 x (3 x 3 x C x C) vs 7 x 7 x C x C, assuming the input and output maps both have C channels).
  2. Only use max-pooling of size 2 and stride 2 to downsample feature maps. Unlike AlexNet, where some convolutional layers also downsample the feature map, VGG downsamples only with max-pooling. This means all the convolutional layers have stride 1.
  3. Double the number of channels after every downsampling. As the feature map’s width and height are halved, its channels are doubled by applying twice as many kernels at each convolutional layer.

Now we have the luxury of restricted choices while building our networks. In fact, many later networks also adopt these patterns in their structure. We seldom see large kernels nowadays, and the heuristic of only doubling the channels after downsampling gives rise to the concept of a “module”: a stack of convolutional layers with the same number of kernels.

With the above heuristic, the structure of the whole VGG family can be summarized with a single table as in the paper.

VGG Family Structures

With this structured design, we can easily code all the networks in the VGG family.

The code here takes heavy reference from the official code for torchvision’s VGG models. I have (hopefully) added value with detailed comments.

We first define the structure of the VGG networks. It is basically putting the table above into a dictionary.
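These configurations are the same as torchvision’s: each number is the kernel count of a 3 x 3 convolution, and 'M' marks a 2 x 2 max-pooling layer.

cfgs = {
    'A': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],                                          # VGG-11
    'B': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],                                 # VGG-13
    'D': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],                  # VGG-16
    'E': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],   # VGG-19
}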

Next, we can define the class’s __init__() and forward() methods:
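A sketch of both methods; the helper names _get_classifier() and _initialize_weights() are illustrative assumptions (the latter is covered in the Weight Initialization section below):

class VGG(nn.Module):
    def __init__(self, cfg, num_classes=1000, bn=False, head='fc'):
        super().__init__()
        self.head = head
        self.features = self._get_conv_layers(cfg, bn)
        self.classifier = self._get_classifier(num_classes, head)
        if head == 'fc':
            # VGG's final feature map is 512 x 7 x 7 for a 224 x 224 input
            self.avgpool = nn.AdaptiveAvgPool2d((7, 7))
        self._initialize_weights()

    def forward(self, x):
        x = self.features(x)
        if self.head == 'fc':
            x = self.avgpool(x)
            x = torch.flatten(x, 1)   # N x 25088
        return self.classifier(x)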

As with our AlexNet, we will also implement a convolutional classification head for VGG (“dense evaluation” originates from the VGG paper, after all). The bn (batch normalization) argument determines whether we include batch normalization in the network. Batch normalization is a technique that came after VGG; we will retrofit it, as it improves both accuracy and training speed.

We now move on to the _get_conv_layers() method.
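A sketch of it:

    def _get_conv_layers(self, cfg, bn):
        layers = []
        in_channels = 3
        for v in cfg:
            if v == 'M':
                # Only max-pooling downsamples; every convolution keeps stride 1
                layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
            else:
                layers.append(nn.Conv2d(in_channels, v, kernel_size=3, padding=1))
                if bn:
                    # Batch norm sits after the convolution but before the non-linearity
                    layers.append(nn.BatchNorm2d(v))
                layers.append(nn.ReLU(inplace=True))
                in_channels = v
        return nn.Sequential(*layers)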

As mentioned in the comments in the code snippet above, if we choose to add batch normalization, it is usually added after convolution but before non-linearity.

The classification head’s code is very straightforward too. One thing to note is that dropout, unlike batch normalization, is usually added after the activation and before the next layer.
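A sketch of both head variants (the method name is again an assumption):

    def _get_classifier(self, num_classes, head):
        if head == 'fc':
            return nn.Sequential(
                nn.Linear(512 * 7 * 7, 4096),
                nn.ReLU(inplace=True),
                nn.Dropout(),                  # dropout after the activation
                nn.Linear(4096, 4096),
                nn.ReLU(inplace=True),
                nn.Dropout(),
                nn.Linear(4096, num_classes),
            )
        # Convolutional head for dense evaluation
        return nn.Sequential(
            nn.Conv2d(512, 4096, kernel_size=7),
            nn.ReLU(inplace=True),
            nn.Dropout(),                      # dropout after ReLU, before the next convolution
            nn.Conv2d(4096, 4096, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Conv2d(4096, num_classes, kernel_size=1),
        )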

Weight Initialization

We are almost done here. However, unlike AlexNet, whose weights can all be casually initialized from a zero-mean normal distribution with a standard deviation of 0.01, caution has to be taken when initializing the weights of the VGG networks, as they are much deeper and do not converge easily. In fact, for the ImageNet competition, the authors of VGG first trained shallower versions of the network and then gradually added layers to make it deeper.

We do not need to go through that tedious process ourselves: with Kaiming or Xavier initialization, we can train the whole deep network from scratch. These PDF slides explain the two types of initialization quite well.

However, there are a few confusing choices to make. Firstly, for each type of initialization, should we use a Gaussian or a uniform distribution? This StackExchange discussion mentions that for Xavier initialization a uniform distribution seems to be slightly better, while for Kaiming initialization a Gaussian distribution is used for all layers in the original ResNet paper, so I guess we can go with Kaiming normal and Xavier uniform.

The second question is that Kaiming initialization has two modes, “fan-in” and “fan-out”; which one should we use? This PyTorch forum discussion states that “fan-in” should be the default mode. That sounds good, except that, as mentioned in the same discussion, torchvision’s ResNet as well as its VGG use “fan-out”. I did quite a bit of searching online but could not find an explanation for this choice.

In this script, I decided to use “fan-in” mode, so the code looks like this:
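A sketch of the initialization method:

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                # Kaiming normal in "fan-in" mode, suited to layers followed by ReLU
                nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)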

Just like in AlexNet, we can write some builder functions. Two examples are below:
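For instance (names and signatures are illustrative):

def vgg16(bn=False, head='fc', num_classes=1000):
    # Configuration 'D' in the table above
    return VGG(cfgs['D'], num_classes=num_classes, bn=bn, head=head)

def vgg19_bn(head='fc', num_classes=1000):
    # Configuration 'E' with batch normalization
    return VGG(cfgs['E'], num_classes=num_classes, bn=True, head=head)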

With this, we have completed our implementation of the VGG family.

Kudos!

Sanity Check

The sanity check for VGG is the same as the one we wrote for AlexNet above. As with AlexNet, “dense evaluation” achieves higher accuracy.

The completed codes for the models can be found in this repository.

Conclusion

In this article, we implemented AlexNet and the VGG family. The networks themselves are not difficult to implement, but the ideas of using convolutional layers to implement sliding windows, weight initialization, and weight porting may be trickier to understand.

In the next article, we will write a training script. We will discuss training data augmentation, PyTorch’s data parallelism and distributed data parallelism.
