Learning Day 64: Object detection 3 — Fast R-CNN and Faster R-CNN

De Jun Huang
Jun 18, 2021

Fast R-CNN

Advantages

  • Higher detection accuracy (mAP) than R-CNN and SPP-Net
  • Faster than R-CNN and SPP-Net (training: 8.8x faster than R-CNN; single-image test: 146x faster, at 0.32 s per image)
  • End-to-end, single-stage training of the detector
  • All layers can be fine-tuned

New techniques

1. RoI pooling on Selective Search proposals

  • RoI pooling is a special case of SPP pooling
  • SPP pooling applies several grid sizes to the same region
  • RoI pooling uses only one grid size, the finest one (e.g. 7x7 for VGG)
  • Max pooling is performed within each grid cell
  • Bounding-box regression: find a function f that moves the proposed box as close to the ground-truth box as possible. It applies a translation first, then a scaling
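
To make the RoI pooling bullets concrete, here is a minimal pure-Python sketch for a single channel. The function name `roi_max_pool` and the floor/ceil rule for cell boundaries are assumptions made for illustration; real implementations work on batched multi-channel tensors and may round differently.

```python
import math

def roi_max_pool(feature_map, roi, out_h=7, out_w=7):
    """Max-pool one RoI of a 2-D feature map into a fixed out_h x out_w grid.

    feature_map: list of rows (H x W), a single channel for simplicity.
    roi: (x1, y1, x2, y2) in feature-map coordinates, both corners inclusive.
    """
    x1, y1, x2, y2 = roi
    roi_h = y2 - y1 + 1
    roi_w = x2 - x1 + 1
    pooled = []
    for i in range(out_h):
        # Row range of grid cell i; floor/ceil keeps every cell non-empty.
        r0 = y1 + math.floor(i * roi_h / out_h)
        r1 = y1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + math.floor(j * roi_w / out_w)
            c1 = x1 + math.ceil((j + 1) * roi_w / out_w)
            # Max pooling inside the cell.
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled
```

Whatever the size of the RoI, the output is always out_h x out_w, which is what lets the following FC layers accept proposals of arbitrary shape.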

2. Multi-task loss

  • Combine the classification and box-regression losses into a single loss
Multi-task loss (ref)
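
As a rough sketch of how the two losses combine for a single RoI: the classification term is a log loss over the class probabilities, and the regression term is a smooth L1 loss that is only counted for non-background RoIs. The function names and the `lam` weighting parameter below are illustrative choices, not taken from the post.

```python
import math

def smooth_l1(x):
    """Smooth L1: quadratic for |x| < 1, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def multi_task_loss(class_probs, true_class, pred_offsets, target_offsets, lam=1.0):
    """L = L_cls + lam * [true_class >= 1] * L_loc for one RoI.

    class_probs: softmax probabilities over K+1 classes (index 0 = background).
    pred_offsets / target_offsets: (tx, ty, tw, th) for the true class.
    """
    l_cls = -math.log(class_probs[true_class])
    if true_class == 0:
        # Background RoIs have no ground-truth box, so no regression term.
        return l_cls
    l_loc = sum(smooth_l1(p - t) for p, t in zip(pred_offsets, target_offsets))
    return l_cls + lam * l_loc
```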

Training/Fine tuning procedure

Mini batch sampling

  • Batch size (128) = images per batch (2) x RoIs per image (64)
  • RoIs are grouped by their overlap (IoU) with ground-truth boxes, using the following rules
  • (1) 25% objects: RoIs with IoU ≥ 0.5
  • (2) 75% background: RoIs with IoU in [0.1, 0.5)
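
The sampling rule can be sketched as follows for one image. The helper name `sample_rois` and its input format (a list of RoI/IoU pairs) are hypothetical; the thresholds and the 25%/75% split come from the rules above.

```python
import random

def sample_rois(rois_with_iou, rois_per_image=64, fg_fraction=0.25, seed=None):
    """Pick RoIs for one image: up to 25% foreground, the rest background.

    rois_with_iou: list of (roi, max_iou_with_any_groundtruth_box) pairs.
    Returns (foreground_rois, background_rois).
    """
    rng = random.Random(seed)
    fg = [roi for roi, iou in rois_with_iou if iou >= 0.5]
    bg = [roi for roi, iou in rois_with_iou if 0.1 <= iou < 0.5]
    # Cap each group by what is actually available in this image.
    n_fg = min(int(rois_per_image * fg_fraction), len(fg))
    n_bg = min(rois_per_image - n_fg, len(bg))
    return rng.sample(fg, n_fg), rng.sample(bg, n_bg)
```

With 64 RoIs per image this gives at most 16 object RoIs and fills the rest with background; RoIs with IoU below 0.1 are never sampled.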

Other details

  • Because the number of RoIs is large, nearly half of the forward-pass time is spent in the FC layers. They can be accelerated with truncated SVD
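
A small NumPy sketch of the SVD idea: one FC weight matrix W of size u x v is replaced by two thinner layers, cutting the parameter count from u*v to t*(u+v). The function name `compress_fc` and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def compress_fc(W, t):
    """Split an FC weight matrix W (u x v) into two thinner layers via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    first = s[:t, None] * Vt[:t]   # (t x v): first layer, applied to the input
    second = U[:, :t]              # (u x t): second layer, keeps the original bias
    return first, second

# Toy check on a matrix that is exactly rank 16, so t=16 loses nothing.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 16)) @ rng.standard_normal((16, 256))
first, second = compress_fc(W, t=16)
x = rng.standard_normal(256)
same_output = np.allclose(W @ x, second @ (first @ x))
```

Here the original layer has 64 x 256 = 16384 weights and the two replacement layers have 16 x (64 + 256) = 5120. On real FC layers the weights are not exactly low rank, so t trades speed against a small loss in accuracy.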

Faster R-CNN

  • Faster R-CNN = Fast R-CNN + RPN (Region Proposal Network)
  • Even faster (single image test: 0.198s)
  • Replaces the last non-neural-network component, Selective Search, with a neural network: the RPN
Faster R-CNN illustration (ref)

Region Proposal Network (RPN)

Advantages

  • Enable weight sharing for conv layers
  • No more offline Selective Search
  • Fewer region proposals, but of higher quality

How it works

  • Take the conv5 feature map produced by the shared conv layers
  • At each sliding-window position, place k anchor boxes of various sizes
  • Use a 3x3 conv layer to get a 256-d feature at each position
  • Use a 1x1 conv layer to get a 4k-d output for regression (4 box offsets per anchor)
  • Use another 1x1 conv layer to get a 2k-d output for classification (object / non-object score per anchor)
How RPN works (ref)
  • Anchor boxes: e.g. k = 9 → 3 scales (128, 256, 512) x 3 aspect ratios (1:1, 1:2, 2:1)
Example of anchor boxes at different scales and ratios (ref)
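
The k = 9 anchors at one sliding-window position can be generated as in the sketch below. The function name `make_anchors` is hypothetical, and treating each ratio as height / width with the anchor area fixed at scale squared is an assumption about the convention.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return len(scales) * len(ratios) anchors (x1, y1, x2, y2) centred at (cx, cy).

    Each anchor has area scale**2; ratio is taken as height / width (an assumption).
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)   # wider when r < 1
            h = s * math.sqrt(r)   # taller when r > 1
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

Calling `make_anchors` once per feature-map position (with the centre mapped back to image coordinates) yields the full anchor set that the 4k-d and 2k-d outputs refer to.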

RPN loss

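  • Lcls: classification loss, object vs. non-object
  • Lreg: box-regression loss, uses smooth L1
  • Mini-batch sampling (256 anchors drawn from a single image):
      - up to 128 positive samples: anchors with IoU > 0.7 against a ground-truth box, or the anchor with the highest IoU for a ground-truth box
      - the remaining negative samples (128 when enough positives exist): anchors with IoU < 0.3 against all ground-truth boxes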
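
The positive/negative rule for anchors can be sketched like this. The helper names `iou` and `label_anchors`, and the use of -1 for anchors that are neither positive nor negative, are illustrative choices; the sketch also assumes at least one ground-truth box.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Return one label per anchor: 1 = object, 0 = background, -1 = ignored."""
    ious = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in ious:
        best = max(row)
        labels.append(1 if best > pos_thr else 0 if best < neg_thr else -1)
    # Each ground-truth box also claims its highest-IoU anchor as positive,
    # so every object gets at least one positive sample.
    for j in range(len(gt_boxes)):
        best_i = max(range(len(anchors)), key=lambda i: ious[i][j])
        if ious[best_i][j] > 0:
            labels[best_i] = 1
    return labels
```

The mini-batch is then drawn at random from the anchors labelled 1 and 0; anchors labelled -1 do not contribute to the loss.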

Training procedure for Faster R-CNN

Step 1. Train RPN

  • Initialise conv layers with pretrained weights from ImageNet
  • Generate region proposals and pass them to Fast R-CNN

Step 2. Train Fast R-CNN

  • Initialise conv layers with pretrained weights from ImageNet

(Note that the conv layers in steps 1 and 2 are two separate copies; nothing is shared yet)

Step 3. Fine tune RPN

  • Initialise conv layers with Fast R-CNN weights
  • Fix the conv layers, fine-tune only the RPN-specific layers
  • Generate better region proposals and pass them to Fast R-CNN

Step 4. Fine-tune Fast R-CNN

  • Fix the conv layers, fine-tune only the Fast R-CNN-specific layers

(Note that the conv layers in steps 3 and 4 are shared: they are the layers trained in step 2)
