Scratch to SOTA: Build Famous Classification Nets 1 (Evaluation)

Wayne · Published in The Startup · 10 min read · Jul 26, 2020

This series of articles serves as documentation for the PyTorch code I am writing. I include detailed remarks here, partly to (re-)organize my code and thoughts, and partly for deep learning newcomers who want a guide to building famous networks and training them to their SOTA performance.

(source: https://blog.paperspace.com/convert-full-imagenet-pre-trained-model-from-mxnet-to-pytorch/)

Introduction to the Series

As a deep learning practitioner, I constantly apply the newest models to the problems at hand to get the desired performance. Since machine learning is a very active field and its practitioners are open to sharing, state-of-the-art models with pretrained weights are easily accessible to everyone. I seldom find myself building and training a randomly initialized model from scratch; most of the time I am fine-tuning pretrained models on specific datasets.

There is nothing to complain about… except for the times when I feel unsettled about whether I applied the optimal training techniques for a specific dataset, or when I regret the insight and skills I might have gained had I reproduced others’ research TRULY from SCRATCH.

Therefore, here I am, soothing my agitation and rectifying my regrets.

I am sure there are many people like me who won’t be fully assured until we have tried it ourselves. This code and this series are for us.

The models that I plan to build and (hopefully, with enough vacant GPU time) train are:

  • AlexNet
  • VGG families
  • GoogLeNet
  • ResNet families
  • MobileNetV1
  • MobileNetV2
  • EfficientNets

Preparation

  1. The intended audience for this series of articles is people who already know the basics of deep learning and PyTorch. I will focus on implementing the networks, training and evaluation scripts, and explaining the rationale behind the particular implementation choices. If you have finished the beginner tutorials on the PyTorch website, you should be able to follow along.
  2. Download ImageNet data and sort both the training data and validation data into folders of their categories, so the structure will look something like this:
|---root
    |---train
        |---class1
        |---class2
        ...
        |---class1000
    |---val
        |---class1
        |---class2
        ...
        |---class1000

Yes, it is quite painful to wait for the download to finish. We have all been there. Be patient and it will be fine.

Introduction

When we think of deep learning, the complex structure of layered networks is likely the first thing that comes to mind. In practice, however, data (pre-/post-)processing, training, and evaluation are the components that require most of the thought and effort. Building a popular network for a project often comes down to that one line of code calling a built-in function. In this article, let us first build a good customized evaluation Dataset class and an evaluate function.

They will later serve as our guide for checking whether we have got the model structure and training right.

Overview

  • Reviewing how some famous models are evaluated on ImageNet
  • Intuition on “dense” evaluation (from OverFeat, used in VGG/ResNet evaluation)
  • Implementing a dataset class (inheriting from and modifying the ImageFolder class) and various transformations that make different evaluation schemes efficient
  • Sanity check with TensorBoard
  • Implementing efficient evaluation functions and helper classes
  • Sanity check by evaluating pretrained models with different evaluation configurations

Evaluation Dataset

The evaluation metrics for ImageNet classification are top-1 and top-5 accuracy. The network outputs a vector of probabilities/confidence levels over all the classes an image can belong to (1,000 classes for ImageNet). Under top-k accuracy, the model’s prediction is considered correct if the ground-truth label of the image is among the k highest-confidence classes predicted by the model. We then compute the percentage of images that the model gets right under this criterion.

First, we will build an evaluation dataset class. To give robust predictions, test-time augmentation is often employed: the model takes different crops of the input image and generates a confidence vector for each crop. These confidence vectors are then averaged to make the final prediction for the image.

For an efficient implementation of the different test-time crop augmentations, we want this dataset class to carry out whichever cropping scheme we need. To figure out the requirements for this class, let’s check how some of the famous networks are evaluated on ImageNet.

Summary on how models are evaluated on ImageNet

Brief Explanation on “Dense” Evaluation

The summary above focuses on earlier models. Most of the evaluation schemes should be clear enough, as they are just different ways of cropping an image.

I want to give some intuition on the “dense” method listed above. Here are the paper and Andrew Ng explaining it (using convolutional layers instead of fully connected layers for a sliding-window implementation). The idea is that since we do not know where the object is in the image, we should pass a sliding window pixel by pixel over the image and collect all the crops. Hopefully a few of the crops tightly enclose the target object so that our model can make a more accurate prediction on them. When we average all the predictions, we hope these tightly enclosing crops help bring up the accuracy.

However, this way we would need hundreds or thousands of crops, and evaluation would be very slow. Therefore, instead of dragging a sliding window over the image, we split the network into the “convolutional feature extractor” and the “fully-connected classifier”. We extract the feature map of the whole image with the feature extractor once, then slide the classifier over the feature map to make predictions.

This way, we only need to pass the image through the “feature extractor” part of the network once; the additional computational cost comes only from the extra passes of the crops through the “classifier” part. The feature-extractor computation is thus shared. (Of course, since the feature map is downsampled from the input image by max-pooling and convolutional strides, the corresponding sliding window on the input image no longer moves pixel by pixel but several pixels at a time.)
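
To make this concrete, here is a minimal sketch (my own illustration, not the VGG authors’ code) of how a fully-connected classifier that expects a 7x7x512 feature map can be rewritten as convolutions and slid over a larger feature map:

import torch
import torch.nn as nn

# Fully-connected classifier trained on 7x7x512 feature maps (for contrast).
fc_classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),
)

# The same classifier expressed as convolutions: a 7x7 conv replaces the
# first Linear, 1x1 convs replace the rest, so it can slide over any size.
conv_classifier = nn.Sequential(
    nn.Conv2d(512, 4096, kernel_size=7), nn.ReLU(inplace=True),
    nn.Conv2d(4096, 4096, kernel_size=1), nn.ReLU(inplace=True),
    nn.Conv2d(4096, 1000, kernel_size=1),
)

# On a larger-than-training feature map (e.g. 9x11 instead of 7x7), the conv
# classifier outputs a grid of class scores, one position per implicit crop.
feature_map = torch.randn(1, 512, 9, 11)
scores = conv_classifier(feature_map)    # shape: (1, 1000, 3, 5)
avg_scores = scores.mean(dim=(2, 3))     # average the "dense" predictions
print(scores.shape, avg_scores.shape)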

As can be seen, the input size for all the networks listed above is 224x224. The main processing steps the validation images go through are resizing, cropping, and flipping. To comply with the current norm, we should also normalize the images by converting the pixel values from [0, 255] to [0, 1], then subtracting the mean RGB pixel values of ImageNet and dividing by their standard deviation. ImageNet’s RGB pixel means and standard deviations are [0.485, 0.456, 0.406] and [0.229, 0.224, 0.225]. Check this forum discussion for details.
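
For the single-center-crop setting, for example, the standard resize/crop/normalize flow with these statistics looks roughly like this (a sketch using torchvision’s built-in transforms; the exact pipeline depends on the evaluation configuration):

import torchvision.transforms as tsfm

eval_transform = tsfm.Compose([
    tsfm.Resize(256),                 # resize the shorter side to 256
    tsfm.CenterCrop(224),             # take the 224x224 center crop
    tsfm.ToTensor(),                  # PIL image -> tensor with values in [0, 1]
    tsfm.Normalize(mean=[0.485, 0.456, 0.406],
                   std=[0.229, 0.224, 0.225]),
])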

Let’s design a class that inherits from torchvision.datasets.ImageFolder. This class automatically converts an organized folder (like the structure we are using) into a dataset of images and their labels. In the script below, let us first ignore the content of the _get_transforms() method; it just returns the required sequence of transforms according to how we wish the validation data to be processed.

The __init__() arguments are documented in the script. They are used in the order: rescale image size --> take center_square (?) --> how to crop the images to the network input_size (no crop for dense evaluation, as in VGG) --> add the crops' horizontal_flips (?) --> normalize the crops' pixel values according to the mean and std. You can check that this image pre-processing flow can reproduce all the evaluation settings in the table above.

The arguments are then used to initialize the parent class. We override the class’s __getitem__() method so that it returns a dictionary containing the image and label, as well as the image’s file name when fname = True . The addition of fname is helpful for debugging, when we want to know exactly which image the model is processing. It is, in fact, the main reason for overriding this method.

Evaluation Dataset Class
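
The actual class lives in root_folder/data/dataset.py. Below is a minimal sketch of the idea with simplified, illustrative arguments; the full version described above also handles multi-scale rescaling, the center_square option, grid/multi-crop modes and decouples flipping from the crop mode.

import torch
import torchvision.datasets as datasets
import torchvision.transforms as tsfm
import torchvision.transforms.functional as TF

class EvalDataset(datasets.ImageFolder):
    """Sketch of the evaluation dataset (argument names are illustrative)."""

    def __init__(self, root, rescale_size=256, crop=224, horizontal_flip=False,
                 mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225), fname=False):
        super().__init__(root, transform=self._get_transforms(
            rescale_size, crop, horizontal_flip, mean, std))
        self.fname = fname

    @staticmethod
    def _get_transforms(rescale_size, crop, flip, mean, std):
        tr = [tsfm.Resize(rescale_size)]            # rescale the shorter side
        if crop is not None:
            # Simplified: 10-crop (corners + center, plus flips) when flips are
            # requested, otherwise a single center crop.
            tr.append(tsfm.TenCrop(crop) if flip else
                      tsfm.Lambda(lambda img: [TF.center_crop(img, crop)]))
        else:
            tr.append(tsfm.Lambda(lambda img: [img]))   # "dense": no cropping
        tr.append(tsfm.Lambda(lambda crops: torch.stack(
            [TF.normalize(TF.to_tensor(c), list(mean), list(std)) for c in crops])))
        return tsfm.Compose(tr)                     # PIL image -> (num_crops, C, H, W)

    def __getitem__(self, index):
        image, label = super().__getitem__(index)
        item = {"image": image, "label": label}
        if self.fname:
            item["fname"] = self.samples[index]     # (file path, class index)
        return item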

Now let’s deal with all the transformations.

A torch transform class is a callable class that operates on PIL images or torch tensors and returns PIL images or torch tensors. A list of transform classes can be chained together with torchvision.transforms.Compose() so that they operate on images in sequence.

The torchvision module has many built-in transform classes; however, most of them can only operate on a single image or tensor. As we need to take crops from an image, and often crop the crops again, all our transform classes need to operate on a list of images/tensors and return a list. Most of the transform classes in the scripts below are thin wrappers that achieve this.

Tracing back a bit to the EvalDataset class: in the _get_transforms() method, the final transformation appended to the list is tsfm.Lambda(lambda crops: torch.stack(crops)). It stacks the list of torch tensors into a 4D tensor of shape [num_crops, channels, h, w].

To summarize, the transforms returned by _get_transforms() map a PIL image to a 4D tensor of its required crops.
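
As an illustration, a thin wrapper of this kind (the name ListTransform is mine) could look like the sketch below, chained with the stacking Lambda described above:

import torch
import torchvision.transforms as tsfm

class ListTransform:
    """Applies a single-image torchvision transform to every element of a list."""
    def __init__(self, transform):
        self.transform = transform

    def __call__(self, images):
        return [self.transform(img) for img in images]

# Example: normalize every crop in a list, then stack into a 4D tensor.
pipeline = tsfm.Compose([
    ListTransform(tsfm.ToTensor()),
    ListTransform(tsfm.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])),
    tsfm.Lambda(lambda crops: torch.stack(crops)),   # -> (num_crops, C, H, W)
])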

Sanity check

We put all the scripts above into a file called root_folder/data/dataset.py. At the bottom of the file, let us check if our data processing set-up is correct.

Checking the evaluation data processing set-up
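
The actual check runs several configurations (AlexNet 10-crop, VGG dense, multi-crop, GoogLeNet-style crops) to produce the output further below. A trimmed sketch with just one configuration, using the illustrative class from earlier, might look like this:

# Sketch of a sanity check at the bottom of root_folder/data/dataset.py
# (dataset arguments are illustrative, matching the sketch above).
if __name__ == "__main__":
    import torchvision
    from torch.utils.tensorboard import SummaryWriter

    val_dir = "../../datasets/imagenet/ILSVRC2015/Data/CLS-LOC/val"
    writer = SummaryWriter("./runs/eval_dataset_check")

    dataset = EvalDataset(val_dir, rescale_size=256, crop=224,
                          horizontal_flip=True, fname=True)
    sample = dataset[900]
    print("AlexNet eval shape: ", sample["image"].shape)   # e.g. (10, 3, 224, 224)
    print("Label:", sample["label"])
    print("Filename is:", sample["fname"])

    # Write the crops to TensorBoard as an image grid for visual inspection.
    grid = torchvision.utils.make_grid(sample["image"], nrow=5)
    writer.add_image("alexnet_10_crops", grid)
    writer.close()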

The console output should resemble the output below.

AlexNet eval shape:  torch.Size([10, 3, 224, 224])
Label: 13
Filename is: ('../../datasets/imagenet/ILSVRC2015/Data/CLS-LOC/val/n01534433/ILSVRC2012_val_00017970.JPEG', 13)
VGG16_dense eval shape: torch.Size([1, 3, 256, 316])
Label: 13
Filename is: ('../../datasets/imagenet/ILSVRC2015/Data/CLS-LOC/val/n01534433/ILSVRC2012_val_00017970.JPEG', 13)
VGG16_Multi_Crop eval shape: torch.Size([150, 3, 224, 224])
Label: 13
Filename is: ('../../datasets/imagenet/ILSVRC2015/Data/CLS-LOC/val/n01534433/ILSVRC2012_val_00017970.JPEG', 13)
GoogLeNet eval shape: torch.Size([144, 3, 224, 224])
Label: 13
Filename is: ('../../datasets/imagenet/ILSVRC2015/Data/CLS-LOC/val/n01534433/ILSVRC2012_val_00017970.JPEG', 13)

If we change the working directory to the folder where the TensorBoard event files are stored, run the command tensorboard --logdir ./ , and access the corresponding port, we can see what the processed images look like under different data processing configurations.

The TensorBoard IMAGES page will look something like the screenshot below, showing the different crops of the images. We can now be confident that our evaluation image processing routine is working.

Augmented Images in Tensorboard

Evaluation

Now we can write an evaluation function for ImageNet. It provides a way to validate our models in the future.

First, we write a function to compute the average accuracy given the model’s (averaged) probability outputs and the corresponding ground-truth labels. The probabilities and labels are torch tensors of shape (batch_size, num_categories) and (batch_size, 1) . The labels are integers in the range [0, num_categories), which is consistent with our evaluation dataset’s output and different from scripts that use one-hot vectors as labels.
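
A sketch of such a function using torch.topk (the version in root_folder/tools/eval.py may differ in its exact signature):

import torch

def accuracy(probs, labels, topk=(1, 5)):
    """Fraction of samples whose ground-truth label is among the top-k classes.
    probs: (batch_size, num_categories), labels: (batch_size, 1) integer indices."""
    maxk = max(topk)
    # Indices of the maxk highest-probability classes, shape (batch_size, maxk).
    _, pred = probs.topk(maxk, dim=1, largest=True, sorted=True)
    correct = pred.eq(labels.view(-1, 1))   # broadcast comparison against labels
    return [correct[:, :k].any(dim=1).float().mean().item() for k in topk]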

Then we write a helper class called AverageMeter . It is very useful whenever we need to track the running average of a value. It will be used again in our training script, where we need to record the average loss across an epoch.
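
A typical implementation is only a few lines; here is a sketch (the version in root_folder/tools/utils.py may differ in detail):

class AverageMeter:
    """Tracks the running average of a value."""
    def __init__(self):
        self.reset()

    def reset(self):
        self.sum, self.count, self.avg = 0.0, 0, 0.0

    def update(self, value, n=1):
        # `value` is assumed to be an average over `n` samples (e.g. one batch).
        self.sum += value * n
        self.count += n
        self.avg = self.sum / self.count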

Finally, we create an evaluate function that takes a model and a dataloader and computes the model’s top-k accuracy on the dataloader’s data. Take note of model.eval() ; I sometimes make the rookie mistake of forgetting to switch the model to eval mode and waste much time figuring out why its performance is so bad.

The dataloader stacks several dataset[i] items together and returns a 5D tensor of shape (batch_size, num_crops, channels, h, w) . As a PyTorch model takes a tensor of shape (batch_size, channels, h, w) , we merge the first two dimensions before passing the tensor to the model. For the model output, we reverse the process and reshape it into (batch_size, num_crops, num_categories) .

As some evaluation procedures average the predictions of multiple crops (which is precisely why we have the num_crops dimension in the model input and output), we use torch.mean() to average the probabilities across the num_crops dimension before passing the final tensor to our accuracy() function.
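
Putting the pieces together, a sketch of such an evaluate() function, reusing accuracy() and AverageMeter from above (device handling and the softmax on the logits are my own choices):

import torch

@torch.no_grad()
def evaluate(model, dataloader, device="cuda", topk=(1, 5)):
    """Top-k accuracy of `model` over `dataloader`, averaging over crops."""
    model.eval()                                   # do not forget this!
    meters = [AverageMeter() for _ in topk]

    for batch in dataloader:
        images, labels = batch["image"], batch["label"]
        b, n_crops, c, h, w = images.shape         # (batch, num_crops, C, H, W)
        images = images.view(b * n_crops, c, h, w).to(device)
        labels = labels.view(-1, 1).to(device)

        logits = model(images)                     # (batch * num_crops, num_categories)
        probs = torch.softmax(logits, dim=1)
        probs = probs.view(b, n_crops, -1).mean(dim=1)   # average over the crops

        for meter, acc in zip(meters, accuracy(probs, labels, topk)):
            meter.update(acc, n=b)

    return [meter.avg for meter in meters]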

Sanity Check

For the final stage of this article, let’s check the evaluation functions together with our EvalDataset.

Let’s put the AverageMeter class in root_folder/tools/utils.py . We will add more utility functions to this file over the course of this series. The accuracy() and evaluate() functions go into root_folder/tools/eval.py .

At the bottom of root_folder/tools/eval.py , let’s conduct our sanity check.

We create and load a pretrained PyTorch model using that one line of code (told ya): model = torchvision.models.alexnet(pretrained=True). The dataset_config dictionary in the script above specifies the single-center-crop data processing configuration. As the models are trained on images normalized with the ImageNet mean and standard deviation, the mean and std cannot be changed. Feel free to change the model or the other parameters in dataset_config.
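
A sketch of what this check might look like (the config keys and import paths are illustrative, following the simplified EvalDataset sketch from earlier):

# Sketch of a sanity check at the bottom of root_folder/tools/eval.py
if __name__ == "__main__":
    import torchvision
    from torch.utils.data import DataLoader
    from data.dataset import EvalDataset

    model = torchvision.models.alexnet(pretrained=True).cuda()

    dataset_config = {                      # single center crop configuration
        "rescale_size": 256,
        "crop": 224,
        "horizontal_flip": False,
        "mean": (0.485, 0.456, 0.406),      # must stay at the ImageNet statistics
        "std": (0.229, 0.224, 0.225),
    }
    val_dir = "../../datasets/imagenet/ILSVRC2015/Data/CLS-LOC/val"
    dataset = EvalDataset(val_dir, **dataset_config)
    dataloader = DataLoader(dataset, batch_size=64, num_workers=8)

    top1, top5 = evaluate(model, dataloader, device="cuda", topk=(1, 5))
    print("Top 1 accuracy:", top1)
    print("Top 5 accuracy:", top5)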

There is one caveat, however. For “dense” evaluation, we would set "crop" = None . This works in the script, but it does not perform the real “dense” evaluation as in the VGG paper: the mismatch between the final feature map and the fully-connected layers is resolved by nn.AdaptiveAvgPool2d . For real “dense” evaluation, we need to replace the fully-connected layers with convolutional layers. We will include that option in the models we build.

You should get results similar to those below when you run eval.py from the tools folder.

alexnet
single center crop
Top 1 accuracy: 0.5651800000572205
Top 5 accuracy: 0.7906999998283386
10 crops
Top 1 accuracy: 0.5943400001716613
Top 5 accuracy: 0.8127000001144409
Fake dense evaluation
Top 1 accuracy: 0.5750200000572204
Top 5 accuracy: 0.8012999999237061
Gridcrops with rescale_sizes = [224, 256, 288]
(yes, 224 gives repeated crops, but oh well)
Top 1 accuracy: 0.5999799847789109
Top 5 accuracy: 0.8181799780726433
vgg16
single center crop
Top 1 accuracy: 0.7159200002861023
Top 5 accuracy: 0.9038200002670288

The “single center crop” results for these two models are almost the same as those published on the official torchvision.models page, indicating that we have done things correctly. Kudos.

Conclusion

In this article, we first reviewed how some famous classification networks are evaluated on ImageNet. We then built an efficient dataset class along with various transformations for test-time image augmentation. Finally, we implemented our evaluation functions. It is a good beginning, as we now have a “supervisor” at our disposal to check our future networks and training.

In the next article, we will build AlexNet and VGG families from scratch.
