Review: TDM — Top-Down Modulation (Object Detection)

Encoder-Decoder Architecture Using TDM with Faster R-CNN

In this story, TDM (Top-Down Modulation) is briefly reviewed. By combining high-level and low-level features using TDM, many hard objects can be detected, which gives a significant boost on the COCO benchmark. I chose to review this paper because some later state-of-the-art approaches, such as YOLOv3 and RetinaNet, selected TDM for comparison, which shows that TDM has a certain importance in object detection. It is a 2017 arXiv tech report with over 70 citations. (Sik-Ho Tsang @ Medium)


Outline

  1. TDM Network
  2. Details of TDM
  3. Ablation Study
  4. Results

1. TDM Network

TDM Network
A Basic TDM Module
  • The bottom-up path is a standard conv path for feature extraction. However, the feature maps get smaller and smaller, so location information is lost.
  • In the top-down path, TDM gradually enlarges the feature map with the help of the feature map from the bottom-up path, as shown in the basic TDM module.
  • Finally, the ROI proposal and ROI classifier are applied.
  • There were many concurrent works on encoder-decoder architectures at that time, for example DSSD for object detection, SharpMask for instance segmentation, U-Net for biomedical image segmentation, and RED-Net for image restoration. TDM is the one based on Faster R-CNN for object detection.
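To make the shrinking of the bottom-up feature maps concrete, here is a small sketch. The cumulative strides follow the usual VGG-16 layout and the 600×1000 input size is just an illustrative choice of mine. Neither is taken from the paper, and the ceil division is a simplification of how pooling layers round.

```python
def feature_map_sizes(height, width, strides):
    """Return {stage: (rows, cols)} using ceil division by each stage's stride."""
    return {stage: (-(-height // s), -(-width // s)) for stage, s in strides.items()}


# Assumed cumulative strides for a VGG-16-style bottom-up path (illustrative only).
VGG16_STRIDES = {"conv1": 1, "conv2": 2, "conv3": 4, "conv4": 8, "conv5": 16}

# Hypothetical 600x1000 input image.
sizes = feature_map_sizes(600, 1000, VGG16_STRIDES)
for stage, (rows, cols) in sizes.items():
    cells = 32 / VGG16_STRIDES[stage]  # cells spanned by a 32-pixel-wide object
    print(f"{stage}: {rows}x{cols} map, a 32 px object spans {cells:g} cell(s)")
```

Under these assumptions, a 32-pixel-wide object covers only about 2 cells at conv5, which is why the top-down path borrows the finer bottom-up maps.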

2. Details of TDM

Details of TDM

2.1. TDM Structure

  • The bottom-up feature goes through a 3×3 conv (L2). This is called the lateral module.
  • The top-down feature goes through a 3×3 conv (T3,2) and is then up-sampled to match the higher resolution if necessary. (No upsampling by T4.)
  • The two are then concatenated and go through a 1×1 conv (T2out) to become the output feature of the TDM.
  • The output feature then goes to the next TDM as its top-down feature.
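The steps above can be sketched in plain Python on nested lists. This is only a toy illustration under my own assumptions: the channel counts, the constant weights, and nearest-neighbour upsampling are hypothetical choices, non-linearities are omitted, and it is not the authors' implementation.

```python
def conv2d(x, weights):
    """Zero-padded, stride-1 conv (cross-correlation, as in deep learning
    frameworks). x is [channel][row][col]; weights are [out_ch][in_ch][k][k]."""
    in_ch, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(weights[0][0])
    pad = k // 2
    out = []
    for w_out in weights:
        plane = [[0.0] * w for _ in range(h)]
        for c in range(in_ch):
            for i in range(h):
                for j in range(w):
                    for di in range(k):
                        for dj in range(k):
                            y, xx = i + di - pad, j + dj - pad
                            if 0 <= y < h and 0 <= xx < w:
                                plane[i][j] += w_out[c][di][dj] * x[c][y][xx]
        out.append(plane)
    return out


def upsample_nearest(x, factor):
    """Repeat every value `factor` times along rows and columns."""
    return [[[v for v in row for _ in range(factor)]
             for row in plane for _ in range(factor)] for plane in x]


def tdm_module(bottom_up, top_down, w_lateral, w_topdown, w_out):
    lateral = conv2d(bottom_up, w_lateral)   # lateral module: 3x3 conv
    top = conv2d(top_down, w_topdown)        # top-down module: 3x3 conv
    if len(top[0]) != len(lateral[0]):       # upsample only if resolutions differ
        top = upsample_nearest(top, len(lateral[0]) // len(top[0]))
    merged = lateral + top                   # concatenate along the channel axis
    return conv2d(merged, w_out)             # 1x1 conv gives the output feature


def constant_weights(out_ch, in_ch, k, value):
    """Hypothetical helper: a weight tensor filled with one value."""
    return [[[[value] * k for _ in range(k)] for _ in range(in_ch)]
            for _ in range(out_ch)]


# Toy inputs: a 2-channel 4x4 bottom-up map and a 3-channel 2x2 top-down map.
bottom_up = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]
top_down = [[[1.0] * 2 for _ in range(2)] for _ in range(3)]
out = tdm_module(bottom_up, top_down,
                 constant_weights(2, 2, 3, 0.1),   # lateral 3x3
                 constant_weights(2, 3, 3, 0.1),   # top-down 3x3
                 constant_weights(5, 4, 1, 1.0))   # output 1x1
print(len(out), len(out[0]), len(out[0][0]))
```

The output has the spatial size of the bottom-up input, so it can be fed to the next TDM as its top-down feature.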

2.2. Training

  • A pre-trained bottom-up network is used.
  • The top-down modules are then added progressively, one at a time. For example, (L4, T5,4) is added and the network is trained for object detection. After that, (L3, T4,3) is added and trained, and so on.
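As a rough sketch of this progressive schedule, the snippet below only lists which (lateral, top-down) pairs are attached at each stage. The function name and the level range are my own hypothetical choices, the module names follow the T5,4 naming pattern, and the actual detector training is not shown.

```python
def progressive_schedule(top_level=5, lowest_level=2):
    """Return one list of (lateral, top-down) module pairs per training stage."""
    stages, modules = [], []
    for level in range(top_level - 1, lowest_level - 1, -1):
        # Attach the next pair, e.g. ("L4", "T5,4"), on top of the earlier ones.
        modules.append((f"L{level}", f"T{level + 1},{level}"))
        stages.append(list(modules))
    return stages


for number, attached in enumerate(progressive_schedule(), start=1):
    # In a real pipeline, the detector would be trained here with `attached`.
    print(f"stage {number}: {attached}")
```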

3. Ablation Study

VGG-16, ResNet-101, and Inception-ResNet-v2 are used as the backbones for Faster R-CNN in the experiments.

3.1. How low should the Top-Down Modulation go?

All methods are trained on trainval and evaluated on the minival set of COCO.
  • Skip-Pool: Similar to ION, instead of using top-down modules, features are taken from different layers, then L2-normalized, concatenated, and scaled back.
  • For VGG-16+TDM, mAP degrades from 29.9% to 29.8% when one more TDM is added. I guess convergence becomes difficult due to the absence of skip connections.
  • For ResNet-101+TDM, 35.7% mAP is obtained.
  • For Inception-ResNet-v2+TDM, mAP even reaches 38.1%.
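For the Skip-Pool baseline mentioned above, the normalize, concatenate, and rescale idea can be sketched as follows. I assume each layer's pooled feature is already flattened into a vector and that a single fixed scale is used. Both are simplifications of mine, not details from the paper.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Divide a vector by its L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def skip_pool(features_per_layer, scale=1.0):
    """L2-normalize each layer's feature, concatenate, then rescale."""
    fused = []
    for vec in features_per_layer:
        fused.extend(l2_normalize(vec))
    return [scale * v for v in fused]


# Two hypothetical pooled features with very different magnitudes.
fused = skip_pool([[3.0, 4.0], [0.0, 500.0]], scale=10.0)
print(fused)
```

Normalizing first keeps a layer with large activations from dominating the concatenated feature.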

3.2. No Lateral Modules

No Lateral Module (Left), With Lateral Module (Right), Using VGG-16
  • Results improve by a large margin when a lateral module is used, which shows that the lateral module is important.

3.3. Pre-Training

Impact of Pre-training
  • Pre-training on COCO is a bit better.

4. Results

COCO Results from Paper (Top), Re-implemented Faster R-CNN (Middle), TDM (Bottom)

4.1. Overall AP

4.2. Improved Localization

  • If we look at AP⁷⁵, comparing TDM (Bottom) with the baseline Faster R-CNN variants (Middle), AP⁷⁵ is improved by a large margin.
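AP⁷⁵ counts a detection as correct only when its IoU with the ground-truth box is at least 0.75, so it rewards tighter localization than the looser 0.5 threshold. Below is a minimal sketch of the IoU check. The boxes are made-up (x1, y1, x2, y2) examples, not data from the paper.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
prediction = (2, 0, 12, 10)  # same size, shifted 2 px to the right

overlap = iou(ground_truth, prediction)
print(f"IoU = {overlap:.3f}")
print("match at IoU 0.50:", overlap >= 0.50)
print("match at IoU 0.75:", overlap >= 0.75)
```

A box that is only slightly shifted still matches at 0.5 but fails at 0.75, so a gain in AP⁷⁵ indicates better-localized boxes.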

4.3. Improvement on Small Objects

  • If we look at AP^S, comparing TDM (Bottom) with the baseline Faster R-CNN variants (Middle), AP^S is improved by a large margin as well.
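For reference, COCO splits objects into size groups by pixel area, and AP^S covers only the small group (area below 32²). A tiny sketch of that bucketing follows. The function name is my own, and the example areas are arbitrary.

```python
def coco_size_bucket(area):
    """Map an object's pixel area to COCO's small / medium / large groups."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


for side in (20, 50, 100):
    print(f"{side}x{side} object -> {coco_size_bucket(side * side)}")
```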

4.4. Qualitative Results

COCO minival set