Learning Day 64: Object detection 3 — Fast R-CNN and Faster R-CNN

Published in dejunhuang, Jun 18, 2021 (3 min read)

Fast R-CNN

Advantages

Better performance
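Faster than R-CNN and SPP-Net (training: 8.8x faster than R-CNN; single-image test: 146x faster, at 0.32 s per image, excluding region proposal time)
End-to-end, single-stage training (region proposals still come from an external method)
All layers can be fine-tuned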

New techniques

1. RoI pooling on Selective Search proposals

It is a special case of SPP pooling
SPP pooling applies several grid sizes (pyramid levels) to the same region
RoI pooling uses only one grid size (e.g. 7x7 for VGG)
Max pooling is performed within each grid cell, so every RoI yields a fixed-size output
Bounding-box regression: learn a function f that maps a proposal box as close as possible to the ground-truth box. It translates the box centre first, then scales the width and height
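
A minimal pure-Python sketch of the two ideas above, for illustration only. `roi_max_pool` and `apply_deltas` are hypothetical helper names, the 2x2 output grid is chosen just to keep the example small, and the box update assumes the usual R-CNN parameterisation (centre shifts scaled by box size, log-space width and height scaling).

```python
import math

def roi_max_pool(feature, roi, out_h=2, out_w=2):
    """Max-pool one RoI of a 2-D feature map into a fixed out_h x out_w grid.

    feature: list of rows (H x W) of numbers
    roi: (x1, y1, x2, y2) in feature-map cells, inclusive on both ends
    """
    x1, y1, x2, y2 = roi
    roi_h = y2 - y1 + 1
    roi_w = x2 - x1 + 1
    pooled = []
    for i in range(out_h):
        # floor/ceil bin edges keep every bin non-empty, even for tiny RoIs
        r0 = y1 + math.floor(i * roi_h / out_h)
        r1 = y1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + math.floor(j * roi_w / out_w)
            c1 = x1 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(feature[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled

def apply_deltas(box, deltas):
    """Translate the centre first, then scale width and height.

    box: (cx, cy, w, h); deltas: (tx, ty, tw, th)
    """
    cx, cy, w, h = box
    tx, ty, tw, th = deltas
    return (cx + tx * w, cy + ty * h, w * math.exp(tw), h * math.exp(th))

# A 4x4 feature map whose value at (row, col) is row * 4 + col
feature = [[r * 4 + c for c in range(4)] for r in range(4)]
whole = roi_max_pool(feature, (0, 0, 3, 3))
part = roi_max_pool(feature, (1, 1, 3, 3))
```

Whatever the RoI size, the output grid has the same shape, which is what lets the fully connected layers that follow accept proposals of any size.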

2. Multi-task loss

Combine the classification and bounding-box regression losses into a single loss

Multi-task loss (ref)
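
For reference, the multi-task loss as I recall it from the Fast R-CNN paper, written out in LaTeX. Here p is the predicted class distribution, u the true class (u = 0 is background), t^u the predicted box offsets for class u, v the regression target, and λ a balancing weight; the bracket [u ≥ 1] switches the box loss off for background RoIs.

```latex
L(p, u, t^u, v) = L_{cls}(p, u) + \lambda \, [u \ge 1] \, L_{loc}(t^u, v)

L_{cls}(p, u) = -\log p_u

L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t^u_i - v_i)

\mathrm{smooth}_{L_1}(x) =
\begin{cases}
0.5\,x^2 & \text{if } |x| < 1 \\
|x| - 0.5 & \text{otherwise}
\end{cases}
```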

Training/Fine tuning procedure

Mini batch sampling

Batch size (128) = images per batch (2) x RoIs per image (64)
RoIs are grouped by their overlap (IoU) with the ground-truth boxes, using the following rules
(1) 25% objects (foreground): IoU ≥ 0.5
(2) 75% background: IoU in [0.1, 0.5)
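
The sampling rules above can be sketched as follows. This is an illustrative pure-Python version, not the original implementation: `iou` and `sample_rois` are hypothetical names, boxes are assumed to be `(x1, y1, x2, y2)` tuples, and at least one ground-truth box is assumed per image.

```python
import random

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def sample_rois(rois, gt_boxes, per_image=64, fg_fraction=0.25, rng=random):
    """Return (foreground indices, background indices) for one image."""
    # Best overlap of each RoI with any ground-truth box
    best = [max(iou(r, g) for g in gt_boxes) for r in rois]
    fg = [i for i, v in enumerate(best) if v >= 0.5]
    bg = [i for i, v in enumerate(best) if 0.1 <= v < 0.5]
    # Cap foreground at 25% of the per-image budget, fill the rest with background
    n_fg = min(int(per_image * fg_fraction), len(fg))
    n_bg = min(per_image - n_fg, len(bg))
    return rng.sample(fg, n_fg), rng.sample(bg, n_bg)

gt = [(0, 0, 10, 10)]
rois = [(0, 0, 10, 10), (0, 0, 10, 6), (0, 0, 10, 3), (0, 0, 10, 2), (20, 20, 30, 30)]
fg_idx, bg_idx = sample_rois(rois, gt, per_image=4)
```

RoIs with IoU below 0.1 (the last box here) are never sampled; with two images per batch and 64 RoIs per image this gives the batch of 128.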

Other details

Because of the large number of RoIs, almost half of the forward-pass time is spent computing the FC layers. This can be accelerated by compressing them with truncated SVD
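
To see why truncated SVD helps, here is a small arithmetic sketch. A u x v weight matrix factorised with its top t singular values is replaced by two thinner layers, so the parameter count drops from u*v to t*(u + v). The layer size below (25088 x 4096, as in VGG16's first FC layer) and t = 1024 are example values chosen for illustration; the function names are hypothetical.

```python
def fc_params(u, v):
    """Weights in a single u x v fully connected layer."""
    return u * v

def svd_fc_params(u, v, t):
    """Weights after splitting the layer into a t x v layer and a u x t layer."""
    return t * (u + v)

full = fc_params(25088, 4096)
compressed = svd_fc_params(25088, 4096, 1024)
ratio = full / compressed  # how many times fewer weights (and multiply-adds)
```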

Faster R-CNN

Faster R-CNN = Fast R-CNN + RPN (Region Proposal Network)
Even faster (single-image test: 0.198 s, including proposal generation)
Replaces the last non-neural-network component, Selective Search, with a neural network, the RPN

Faster R-CNN illustration (ref)

Region Proposal Network (RPN)

Advantages

Enable weight sharing for conv layers
No more offline Selective Search
Fewer region proposals, but of higher quality

How it works

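Take the conv5 feature map produced by the earlier (shared) conv layers
Slide a small window over it; at each position, consider k anchor boxes of various sizes and aspect ratios
Use a 3x3 conv layer to map each window to a 256-d feature
Use a 1x1 conv layer to get a 4k-d output for regression (4 box offsets per anchor)
Use another 1x1 conv layer to get a 2k-d output for classification (object / not-object scores per anchor)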

How RPN works (ref)
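
As a quick sanity check on the sizes involved, the sketch below computes the output shapes of the RPN head for a given feature map. `rpn_output_shapes` is a hypothetical helper, and the 40 x 60 feature map is just an example size; shapes are given as (channels, height, width).

```python
def rpn_output_shapes(feat_h, feat_w, k=9, mid_channels=256):
    """Shapes produced by the 3x3 conv and the two sibling 1x1 convs."""
    return {
        "intermediate": (mid_channels, feat_h, feat_w),  # after the 3x3 conv
        "reg": (4 * k, feat_h, feat_w),                  # 4 offsets per anchor
        "cls": (2 * k, feat_h, feat_w),                  # object / not-object per anchor
        "num_anchors": feat_h * feat_w * k,              # k anchors at every position
    }

shapes = rpn_output_shapes(40, 60)
```

Because both heads are convolutions, the same weights are applied at every sliding-window position.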

For the anchor boxes, e.g. k = 9 → 3 scales (128, 256, 512) x 3 aspect ratios (1:1, 1:2, 2:1)

Example of anchor boxes at different scales and ratios (ref)
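
A small sketch of how the 9 anchors at one position could be generated. It assumes each scale s means an anchor area of s x s pixels and that a ratio r is height divided by width; both conventions, and the name `make_anchors`, are assumptions made for illustration.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return k = len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the area at s * s while changing the aspect ratio h / w = r
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(0, 0)
```

The same 9 shapes are reused at every sliding-window position, shifted to that position's centre.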

RPN loss

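Lcls: binary classification, object vs. not object
Lreg: bounding-box regression, using smooth L1
Mini-batch sampling:
- each mini-batch comes from a single image
- up to 128 positive samples: anchors with IoU > 0.7 with a ground-truth box, or the anchor with the highest IoU for a ground-truth box
- 128 negative samples: anchors with IoU < 0.3 with every ground-truth box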
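
The labelling rules and the smooth L1 function can be sketched as below. This is an illustrative version that takes a precomputed IoU table (`ious[a][g]` for anchor a and ground-truth box g); `label_anchors` and the label convention (1 positive, 0 negative, -1 ignored) are assumptions for the example.

```python
def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

def label_anchors(ious, pos_thresh=0.7, neg_thresh=0.3):
    """Label each anchor: 1 = object, 0 = background, -1 = ignored in the loss."""
    labels = []
    for row in ious:
        best = max(row)  # best overlap of this anchor with any ground-truth box
        if best > pos_thresh:
            labels.append(1)
        elif best < neg_thresh:
            labels.append(0)
        else:
            labels.append(-1)
    # The highest-IoU anchor for each ground-truth box is also positive,
    # so every object gets at least one positive anchor
    for g in range(len(ious[0])):
        top = max(range(len(ious)), key=lambda a: ious[a][g])
        if ious[top][g] > 0:
            labels[top] = 1
    return labels

labels_a = label_anchors([[0.8], [0.5], [0.1]])
labels_b = label_anchors([[0.6], [0.2]])
```

Anchors labelled -1 fall between the two thresholds and do not contribute to training; the mini-batch is then drawn from the positives and negatives.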

Training procedure for Faster R-CNN

Step 1. Train RPN

Initialise conv layers with pretrained weights from ImageNet
Generate region proposals and pass them to Fast R-CNN

Step 2. Train Fast R-CNN

Initialise conv layers with pretrained weights from ImageNet

(Note that the conv layers in steps 1 and 2 are two separate copies and are not shared)

Step 3. Fine tune RPN

Initialise conv layers with the Fast R-CNN weights from step 2
Fix the conv layers, fine-tune only the RPN-specific layers
Generate better region proposals and pass them to Fast R-CNN

Step 4. Fine-tune Fast R-CNN

Fix the conv layers, fine-tune only the Fast R-CNN-specific layers

(Note that the conv layers in steps 3 and 4 are shared: they are the layers that resulted from training in step 2)
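
The four steps can be summarised as a small table in code, which makes it easy to see where the conv layers are shared. This is only a bookkeeping sketch of the schedule described above; the dictionary keys and the `shares_conv` helper are hypothetical.

```python
STEPS = [
    {"step": 1, "train": "RPN",        "conv_init": "ImageNet",            "conv_frozen": False},
    {"step": 2, "train": "Fast R-CNN", "conv_init": "ImageNet",            "conv_frozen": False},
    {"step": 3, "train": "RPN",        "conv_init": "Fast R-CNN (step 2)", "conv_frozen": True},
    {"step": 4, "train": "Fast R-CNN", "conv_init": "Fast R-CNN (step 2)", "conv_frozen": True},
]

def shares_conv(a, b):
    """Two steps use identical conv weights only if both start from the
    same weights and neither of them updates those weights."""
    return a["conv_frozen"] and b["conv_frozen"] and a["conv_init"] == b["conv_init"]

shared_3_4 = shares_conv(STEPS[2], STEPS[3])
shared_1_2 = shares_conv(STEPS[0], STEPS[1])
```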

Written by De Jun Huang
