Review — CoupleNet: Coupling Global Structure with Local Parts for Object Detection (Object Detection)

With Local & Global FCN Branches Capturing Local & Global Information, CoupleNet Outperforms R-FCN, Faster R-CNN, SSD, & ION.


A toy example of object detection by combining local and global information

In this story, CoupleNet: Coupling Global Structure with Local Parts for Object Detection (CoupleNet), by the Chinese Academy of Sciences, University of Chinese Academy of Sciences, Nanjing Audit University, and Indiana University, is reviewed. In this paper:

  • The object proposals obtained by the Region Proposal Network (RPN) are fed into the coupling module, which consists of two branches.
  • One branch adopts the position-sensitive RoI (PSRoI) pooling to capture the local part information of the object.
  • The other employs RoI pooling to encode the global and context information.

This is a paper in 2017 ICCV with over 130 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. CoupleNet: Network Architecture
  2. Experimental Results

1. CoupleNet: Network Architecture

CoupleNet: Network Architecture
  • ImageNet pre-trained ResNet-101 is used as the backbone, with the last average pooling layer and the fc layer removed.
  • Then each proposal flows into two different branches: the local FCN and the global FCN.
  • Finally, the outputs of the global and local FCNs are coupled together as the final score of the object, as sketched below.
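
To make the two-branch flow concrete, here is a minimal PyTorch-style sketch of the coupling head; `CoupleNetHead`, the branch modules, and all shapes are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class CoupleNetHead(nn.Module):
    """Couples a local (PSRoI pooling) branch with a global (RoI pooling) branch."""

    def __init__(self, local_branch: nn.Module, global_branch: nn.Module):
        super().__init__()
        self.local_branch = local_branch    # captures part-level information
        self.global_branch = global_branch  # captures whole-object and context information

    def forward(self, feats: torch.Tensor, rois: torch.Tensor) -> torch.Tensor:
        local_scores = self.local_branch(feats, rois)    # (num_rois, C+1)
        global_scores = self.global_branch(feats, rois)  # (num_rois, C+1)
        # Element-wise sum is the coupling that works best (see Section 1.3).
        return local_scores + global_scores
```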

1.1. Local FCN

  • A set of part-sensitive score maps is generated by appending a 1×1 convolutional layer with k²(C + 1) channels, where k means the object is divided into k×k local parts (k = 7) and C + 1 is the number of object categories plus background.
  • For each category, there are in total k² channels and each channel is responsible for encoding a specific part of the object.
  • The final score of a category is determined by voting over the k² responses.
  • Average pooling is used for voting. Then, a (C + 1)-d vector is obtained which indicates the probability that the object belongs to each class.
An intuitive description of the use of Local FCN

As shown above, e.g. for the truncated person, one can hardly get a strong response from the global description, while the local FCN can effectively capture several specific parts, such as the nose and mouth.
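
A minimal sketch of this branch, assuming torchvision's PSRoIPool and a ResNet-101 feature map with stride 16; the tensor sizes and variable names are illustrative:

```python
import torch
import torch.nn as nn
from torchvision.ops import PSRoIPool

k, C = 7, 20  # k×k parts, C object classes (e.g., PASCAL VOC)

# 1×1 conv produces k²(C+1) part-sensitive score maps from backbone features.
score_conv = nn.Conv2d(2048, k * k * (C + 1), kernel_size=1)
ps_pool = PSRoIPool(output_size=k, spatial_scale=1.0 / 16)  # stride-16 features

feats = torch.randn(1, 2048, 38, 63)              # backbone feature map
rois = torch.tensor([[0, 50., 40., 300., 250.]])  # (batch_idx, x1, y1, x2, y2)

maps = score_conv(feats)                # (1, k²(C+1), H, W)
pooled = ps_pool(maps, rois)            # (num_rois, C+1, k, k), one bin per part
local_scores = pooled.mean(dim=(2, 3))  # average-pool voting -> (num_rois, C+1)
```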

1.2. Global FCN

  • In addition to the conventional RoI pooling (yellow region in the global FCN), one more RoI pooling layer (green region in the global FCN) is inserted as a context region to extract the global structure description of the object.
  • Specifically, this context region is 2× the size of the original proposal.
  • Then the features RoI-pooled from the original region and the context region are concatenated together.
  • Two convolutional layers, with kernel sizes k×k and 1×1 respectively (k is set to the default value of 7), are used to further abstract the global representation of the RoI.
  • Finally, the output of the 1×1 convolution is fed into the classifier, whose output is also a (C + 1)-d vector.
An intuitive description of the use of Global FCN

For objects with a simple spatial structure that encompass considerable background in the bounding box, e.g. a dining table, the global FCN helps capture the global context to boost detection performance.
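
A minimal sketch of the global branch under the same assumptions (stride-16 features, k = 7); the channel widths and the boundary handling of the enlarged region are illustrative guesses, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
from torchvision.ops import RoIPool

def enlarge_rois(rois: torch.Tensor, factor: float = 2.0) -> torch.Tensor:
    """Scale each RoI around its center by `factor` to form the context region.
    (A real implementation would also clip the result to the image bounds.)"""
    idx, x1, y1, x2, y2 = rois.unbind(dim=1)
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    hw, hh = (x2 - x1) * factor / 2, (y2 - y1) * factor / 2
    return torch.stack([idx, cx - hw, cy - hh, cx + hw, cy + hh], dim=1)

k, C = 7, 20
roi_pool = RoIPool(output_size=k, spatial_scale=1.0 / 16)
conv_kxk = nn.Conv2d(2 * 2048, 1024, kernel_size=k)  # k×k conv over the concat
conv_1x1 = nn.Conv2d(1024, C + 1, kernel_size=1)     # classifier

feats = torch.randn(1, 2048, 38, 63)
rois = torch.tensor([[0, 50., 40., 300., 250.]])

orig = roi_pool(feats, rois)               # (num_rois, 2048, k, k)
ctx = roi_pool(feats, enlarge_rois(rois))  # (num_rois, 2048, k, k), 2× region
x = torch.cat([orig, ctx], dim=1)          # (num_rois, 4096, k, k)
global_scores = conv_1x1(conv_kxk(x)).flatten(1)  # (num_rois, C+1)
```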

1.3. Coupling structure

Effects of different normalization operations and coupling methods.
  • To merge the local FCN and global FCN, there are many possible approaches.
  • It is found that using a 1×1 convolution to rescale the responses is much better than L2 normalization.
  • This is because L2 normalization reduces the output gap between different categories, which results in a smaller score gap and reduces accuracy.
  • Element-wise sum always achieves the best performance, regardless of the normalization method.
  • Element-wise product is even unstable during training.
  • Element-wise maximum, to some extent, equals an ensemble model within the network, so the authors also compare against model ensembles.
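
A minimal sketch of the winning combination (1×1-conv rescaling followed by element-wise sum); applying one 1×1 convolution per branch is an assumption of this sketch:

```python
import torch
import torch.nn as nn

C = 20
# Learnable 1×1 convs rescale each branch's (C+1)-d response before coupling.
rescale_local = nn.Conv2d(C + 1, C + 1, kernel_size=1)
rescale_global = nn.Conv2d(C + 1, C + 1, kernel_size=1)

local_scores = torch.randn(8, C + 1, 1, 1)   # stand-in local-FCN outputs
global_scores = torch.randn(8, C + 1, 1, 1)  # stand-in global-FCN outputs

# Element-wise sum performed best; product was unstable during training,
# and maximum behaves like an in-network ensemble.
final = (rescale_local(local_scores) + rescale_global(global_scores)).flatten(1)
```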
CoupleNet vs. model ensemble
  • The improvement brought by model ensembling (Rows 4 & 5) is less than 1 point, far below CoupleNet's 81.7%.
  • On the other hand, CoupleNet enjoys end-to-end training, and there is no need to train multiple models, which greatly reduces training time.

1.4. Complexity

  • The whole network is fully convolutional.
  • The global branch can be regarded as a lightweight Faster R-CNN: the depth of its RoI-wise subnetwork is only 2.
  • Its computational complexity is far less than that of the subnetwork in the ResNet-based Faster R-CNN system, whose depth is 10.
  • Thus, CoupleNet performs inference efficiently: it runs slightly slower than R-FCN but much faster than Faster R-CNN.

2. Experimental Results

2.1. PASCAL VOC 2007

Comparisons with Faster R-CNN and R-FCN using ResNet-101.
  • The models are trained on the union set of VOC 2007 trainval and VOC 2012 trainval (“07+12”), and evaluated on VOC 2007 test set.
  • CoupleNet without the context region and without multi-scale training already outperforms Faster R-CNN, as well as R-FCN with or without multi-scale training.
  • With both context region and multi-scale training, CoupleNet obtains even higher mAP of 82.7%.
Results on PASCAL VOC 2007 test set
  • CoupleNet also outperforms other SOTA approaches such as ION and SSD.

2.2. PASCAL VOC 2012

Results on PASCAL VOC 2012 test set
  • CoupleNet obtains a top mAP of 80.4%, which is 2.8 points higher than R-FCN.

Without using extra tricks in the testing phase, CoupleNet is the first detector to achieve an mAP higher than 80%.

  • Some detection examples of CoupleNet are shown below:
Detection examples of CoupleNet on PASCAL VOC 2012 test set

2.3. MS COCO

Results on COCO 2015 test-dev
  • CoupleNet is trained on the union of the 80k training set and the 40k validation set, and evaluated on the 20k test-dev set.
  • For multi-scale training, the scales are randomly sampled from {480, 576, 672, 768, 864}, while testing uses a single scale (see the sketch after this list).
  • Single-scale CoupleNet already achieves a result of 33.1%, which outperforms R-FCN by 3.9 points.
  • In addition, multi-scale training further improves the performance up to 34.4%.
  • It is observed that the more challenging the dataset, the larger the improvement (e.g., 2.2% for VOC07, 2.8% for VOC12, and 4.5% for COCO).
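
As a rough illustration, the per-iteration scale sampling could look like the following, assuming the common convention of resizing the shorter image side; `resize_for_training` is a hypothetical helper, not the authors' code:

```python
import random
import torchvision.transforms.functional as F

TRAIN_SCALES = [480, 576, 672, 768, 864]

def resize_for_training(image):
    """Resize so the shorter image side equals a randomly sampled scale."""
    scale = random.choice(TRAIN_SCALES)
    # With an int size, torchvision matches the shorter edge, keeping aspect ratio.
    return F.resize(image, scale)
```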
