Review: NoC — Winner in 2015 COCO & ILSVRC Detection (Object Detection)

NoCs: Combining Faster R-CNN and Residual Network, With Maxout, Won 2015 COCO & ILSVRC Detection Challenges

MS COCO (http://cocodataset.org/)

In this story, NoCs (“Networks on Convolutional feature maps”), by University of Science and Technology of China, Microsoft Research, Xi'an Jiaotong University, and Facebook AI Research (FAIR), is reviewed. Although the effective ResNet and Faster R-CNN are also part of the system, the design of NoCs is an essential element of the 1st-place winning entries in the ImageNet and MS COCO challenges 2015. It was published in TPAMI in 2017 and has over 100 citations. (Sik-Ho Tsang @ Medium)


What Is Covered?

  1. What is NoC
  2. Using MLP as NoC
  3. Using ConvNet as NoC
  4. Maxout for Scale Selection (From Maxout to Maxout NoC)
  5. Other Analyses for NoC
  6. Results of NoC for Faster R-CNN With ResNet / GoogLeNet

1. What is NoC

Overview of NoC

1.1. A prevalent strategy for object detection

  • Use convolutional layers to extract region-independent features
  • Then, perform ROI pooling followed by region-wise multi-layer perceptrons (MLPs) or fully connected (fc) layers for classification.
  • This strategy was, however, historically driven by pre-trained classification architectures similar to AlexNet and VGGNets that end with MLP classifiers.

1.2. NoCs

  • stands for “Networks on Convolutional feature maps”
  • focus on region-wise classifier architectures, as shown in the figure above.
  • To choose an optimal NoC, a detailed ablation study is carried out, as described below.

2. Using MLP as NoC

NoC as MLP for PASCAL VOC 07 Using a ZFNet
  • A simple design of NoC is to use fc layers only, known as a multilayer perceptron (MLP).
  • 2 to 4 fc layers are investigated.
  • The last fc layer is always (n+1)-d with softmax, and the other fc layers are 4,096-d with ReLU.
  • Without any pretraining of the NoC, the 4fc NoC has 7.8% higher mAP than an SVM classifier trained on the same RoI features.
  • In the special case of 3fc layers, the NoC becomes a structure similar to the region-wise classifiers popularly used in SPPNet and Fast/Faster R-CNN.
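The fc-only designs above can be written down as a list of layer shapes. The following is a minimal sketch rather than the authors' code: the helper names `mlp_noc_shapes` and `count_weights` are hypothetical, and the 256×6×6 size of the flattened RoI-pooled feature is an assumption made for illustration. With 20 PASCAL VOC classes, the (n+1)-d output is 21-d.

```python
def mlp_noc_shapes(num_fc, in_dim=256 * 6 * 6, hidden_dim=4096, num_classes=20):
    """Return (in_features, out_features) for each fc layer of an fc-only NoC.

    in_dim is an ASSUMED size for the flattened RoI-pooled feature.
    """
    if num_fc < 1:
        raise ValueError("need at least one fc layer")
    shapes = []
    prev = in_dim
    for _ in range(num_fc - 1):
        shapes.append((prev, hidden_dim))   # 4,096-d hidden layer, followed by ReLU
        prev = hidden_dim
    shapes.append((prev, num_classes + 1))  # (n+1)-d output, followed by softmax
    return shapes


def count_weights(shapes):
    """Total number of weights and biases over all fc layers."""
    return sum(i * o + o for i, o in shapes)


# The 2fc, 3fc and 4fc variants investigated in the ablation.
for k in (2, 3, 4):
    shapes = mlp_noc_shapes(k)
    print(k, shapes, count_weights(shapes))
```

Each extra fc layer adds one more 4,096×4,096 block, which is where most of the additional parameters of the deeper MLP variants come from.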

3. Using ConvNet as NoC

NoC as ConvNet for PASCAL VOC 07 Using a ZFNet
  • 1 to 3 additional conv layers with ReLU in a NoC are investigated.
  • When trained on the VOC 07 trainval set alone, deeper NoCs degrade in mAP, because the set is too small to train deeper models.
  • This degradation is a result of overfitting.
  • NoCs with conv layers show improvements when trained on the larger VOC 07+12 trainval set, and the 2conv3fc NoC improves over the 3fc baseline to 58.9% mAP.
  • mAP gets saturated when using three additional conv layers.

4. Maxout for Scale Selection (From Maxout to Maxout NoC)

4.1. Maxout

Maxout Network with k=3
  • Maxout was proposed by Goodfellow et al.; Goodfellow also invented the GAN. It comes from the 2013 ICML paper “Maxout Networks”, which has over 1300 citations.
  • It is named Maxout because its output is the max of a set of inputs, and because it is a natural companion to dropout.
  • It is used as a kind of activation function.
  • As shown in the figure above, the purple-pink area is the Maxout Network.
  • A maxout feature map is constructed by taking the maximum across k affine feature maps.
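To make the "maximum across k affine feature maps" concrete, here is a minimal pure-Python sketch of a maxout layer on a single input vector. The function names and the toy weights are made up for illustration; this is not code from the Maxout paper.

```python
def affine(x, weights, bias):
    """Compute W x + b, with the weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def maxout(x, pieces):
    """Maxout layer: element-wise max over k affine maps of the same input.

    pieces is a list of k (weights, bias) pairs with identical output size.
    """
    outputs = [affine(x, w, b) for w, b in pieces]
    return [max(vals) for vals in zip(*outputs)]


# Toy example with k = 3 affine pieces, 2 inputs and 2 output units.
x = [1.0, -2.0]
pieces = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),    # identity
    ([[-1.0, 0.0], [0.0, -1.0]], [0.0, 0.0]),  # negation
    ([[0.0, 0.0], [0.0, 0.0]], [0.5, 0.5]),    # constant 0.5
]
print(maxout(x, pieces))  # [1.0, 2.0]
```

With these particular pieces each unit computes max(|x|, 0.5), which shows how a maxout unit can form a piecewise-linear activation that is learned rather than fixed.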

4.2. Maxout NoC

A maxout NoC of “c256-mo-c256-f4096-f4096-f21”
Maxout NoC for PASCAL VOC 07 Using a ZFNet
  • In a Maxout NoC, the two feature maps (one per scale) are merged into a single feature map of the same dimensionality using element-wise max.
  • There are two pathways before the Maxout, and their weights are shared. Thus, the total number of weights is unchanged when using Maxout.
  • The 4 Maxout variants are all better than the non-Maxout NoC.
  • However, besides Maxout, there are many alternative ways to merge two feature maps, e.g.: 1) element-wise addition, 2) concatenation with/without L2 normalization, followed by a 1×1 convolution to reduce the dimension, as in U-Net or ParseNet, or 3) element-wise multiplication, as in DSSD. An ablation study on these alternatives would be worthwhile.
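The weight sharing and the element-wise max merge can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: a small linear layer with ReLU plays the role of the shared conv pathway, and the names `shared_pathway` and `maxout_merge` as well as all numbers are hypothetical.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def shared_pathway(feature, weights, bias):
    """One pathway before the merge; both scales call it with the SAME weights."""
    return relu([sum(w * f for w, f in zip(row, feature)) + b
                 for row, b in zip(weights, bias)])


def maxout_merge(feat_a, feat_b):
    """Element-wise max of two same-sized features; the output keeps that size."""
    if len(feat_a) != len(feat_b):
        raise ValueError("features must have the same dimensionality")
    return [max(a, b) for a, b in zip(feat_a, feat_b)]


# One set of weights, reused for both scales (so no extra parameters).
weights = [[1.0, -1.0], [0.5, 0.5]]
bias = [0.0, 0.0]

roi_feature_scale_1 = [2.0, 1.0]  # RoI feature pooled at the first scale (toy values)
roi_feature_scale_2 = [1.0, 3.0]  # same RoI pooled at the second scale (toy values)

out_1 = shared_pathway(roi_feature_scale_1, weights, bias)  # [1.0, 1.5]
out_2 = shared_pathway(roi_feature_scale_2, weights, bias)  # [0.0, 2.0]
merged = maxout_merge(out_1, out_2)
print(merged)  # [1.0, 2.0]
```

Swapping `max(a, b)` for `a + b` or `a * b` inside `maxout_merge` would give the element-wise addition and multiplication alternatives listed above.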

5. Other Analyses for NoC

5.1. Fine-Tuning

NoC for PASCAL VOC 07 Using ZF/VGG-16 Nets with Different Initialization
  • With VGG-16, the fc layers are initialized from the pre-trained model and the additional conv layers are initialized to the identity mapping, so the initial network state is equivalent to the pre-trained three fc structure.
  • 68.8% mAP is obtained.
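The identity initialization can be illustrated with a small pure-Python sketch: a conv kernel that has a 1 at the spatial centre of the matching channel and 0 elsewhere passes a non-negative feature map through unchanged, even with a ReLU after it. The 3×3 kernel size, the function names, and the toy feature map are assumptions for illustration.

```python
def identity_conv_kernel(channels, size=3):
    """kernel[out][in][y][x]: 1 at the spatial centre when out == in, else 0."""
    c = size // 2
    return [[[[1.0 if (o == i and y == c and x == c) else 0.0
               for x in range(size)]
              for y in range(size)]
             for i in range(channels)]
            for o in range(channels)]


def conv2d_same(fmap, kernel):
    """Zero-padded, stride-1 convolution; fmap is indexed as [channel][y][x]."""
    channels, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    size = len(kernel[0][0])
    pad = size // 2
    out = []
    for o in range(len(kernel)):
        plane = [[0.0] * width for _ in range(height)]
        for i in range(channels):
            for y in range(height):
                for x in range(width):
                    for ky in range(size):
                        for kx in range(size):
                            yy, xx = y + ky - pad, x + kx - pad
                            if 0 <= yy < height and 0 <= xx < width:
                                plane[y][x] += kernel[o][i][ky][kx] * fmap[i][yy][xx]
        out.append(plane)
    return out


def relu(fmap):
    return [[[max(0.0, v) for v in row] for row in plane] for plane in fmap]


# A non-negative feature map (e.g. the output of an earlier ReLU): 2 channels of 2x2.
fmap = [[[0.0, 1.0], [2.0, 3.0]], [[4.0, 0.5], [0.0, 1.5]]]
out = relu(conv2d_same(fmap, identity_conv_kernel(2)))
print(out == fmap)  # True
```

Because the new conv layers start as a no-op, fine-tuning begins from the behavior of the pre-trained three fc structure instead of from a randomly perturbed one.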

5.2. Error Analysis

Distribution of top-ranked True Positives (TP) and False Positives (FP), Cor (correct), Loc (false due to poor localization), Sim (confusion with a similar category), Oth (confusion with a dissimilar category), BG (fired on background).
  • The localization error is substantially reduced compared with the three fc baseline.
  • NoCs mainly account for localizing objects.
  • Localization-sensitive information is only extracted after RoI pooling and is used by NoCs.

6. Results of NoC for Faster R-CNN With ResNet / GoogLeNet

After studying NoC using Fast R-CNN with ZFNet or VGGNet as above, we can conclude that using a ConvNet as NoC is the best of the NoC architectures studied.

6.1. MS COCO

Detection Results of Faster R-CNN on the MS COCO Val Set
  • NoC for Faster R-CNN With ResNet or GoogLeNet is used.
  • With ResNet-101 (feature map enlarged by hole algorithm), feature extracted at res4b22, using ConvNet as NoC (res5a,5b,5c,fc81), 27.2% overall mAP is obtained.
  • The results below are from the supplementary section of the ResNet paper.
Object detection improvements on MS COCO using Faster R-CNN and ResNet-101.
  • Box Refinement: For inference, a new feature is pooled from the regressed box to obtain a new classification score and a new regressed box. 29.9% mAP is obtained on the val set.
  • Global Context: Given the full-image conv feature map, a feature is pooled by global Spatial Pyramid Pooling with a “single-level” pyramid (SPPNet). This global feature is concatenated with the original per-region feature. 30.0% mAP is obtained on val set, and 32.2% mAP is obtained on test-dev set.
  • Multi-Scale Testing: With a trained model, conv feature maps are computed on an image pyramid, where the images' shorter sides are in {200, 400, 600, 800, 1000}. Two adjacent scales from the pyramid are selected, ROI-pooled, and merged by Maxout. 34.9% mAP is obtained on the test-dev set.
  • Ensemble: With an ensemble of 3 networks, 37.4% mAP is obtained on the test-dev set. This result won 1st place in the detection task in COCO 2015.
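The text says that two adjacent pyramid scales are selected per region but does not spell out the rule. The sketch below shows one plausible heuristic, stated purely as an assumption: pick the scale at which the rescaled RoI area is closest to 224×224, then the better of its two neighbours. The function name, the 224 target, and the example numbers are all hypothetical.

```python
SCALES = [200, 400, 600, 800, 1000]  # shorter-side lengths of the image pyramid


def select_adjacent_scales(roi_w, roi_h, image_shorter_side,
                           scales=SCALES, target=224):
    """Pick two neighbouring pyramid scales for one RoI.

    ASSUMPTION (not confirmed by the source): scales are ranked by how close
    the rescaled RoI area is to target**2; the best scale and the better of
    its neighbours are returned in increasing order.
    """
    def gap(s):
        factor = s / image_shorter_side
        return abs((roi_w * factor) * (roi_h * factor) - target * target)

    best = min(range(len(scales)), key=lambda i: gap(scales[i]))
    neighbours = [i for i in (best - 1, best + 1) if 0 <= i < len(scales)]
    second = min(neighbours, key=lambda i: gap(scales[i]))
    lo, hi = sorted((best, second))
    return scales[lo], scales[hi]


# A small RoI is best seen at the largest scales, a large RoI at the smallest.
print(select_adjacent_scales(100, 100, image_shorter_side=500))  # (800, 1000)
print(select_adjacent_scales(300, 300, image_shorter_side=500))  # (200, 400)
```

The RoI would then be pooled from the conv feature maps of both returned scales, and the two pooled features merged by element-wise max.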

6.2. PASCAL VOC

Detection results on the PASCAL VOC 2007 test set
  • The single model trained on the COCO dataset is fine-tuned on the PASCAL VOC sets.
  • The system “baseline+++” includes the Box Refinement, Global Context, and Multi-Scale Testing, mentioned in 6.1 above. (No ensembling, only single model)
  • 85.6% mAP is obtained on the PASCAL VOC 2007 test set.
Detection results on the PASCAL VOC 2012 test set
  • Similarly, 83.8% mAP is obtained on the PASCAL VOC 2012 test set.

6.3. ILSVRC Detection (DET) Task

ImageNet detection dataset
  • The networks are pre-trained on the 1000-class ImageNet classification set, and are fine-tuned on the DET data.
  • Also with Box Refinement, Global Context, and Multi-Scale Testing, 58.8% mAP is obtained using a single model.
  • With an ensemble of 3 models, 62.1% mAP is obtained. This result won 1st place in the ImageNet detection task in ILSVRC 2015.

To understand NoC, it is recommended to read the Maxout Networks paper, the NoC paper, and the supplementary section of the arXiv version of the ResNet paper.