Learning Day 65: Object detection 4 — R-FCN

De Jun Huang
Jun 19, 2021

Past models

R-CNN

  • Selective Search (SS) to generate region proposals
  • Every region proposal goes through the conv + FC layers separately
  • Separate classifier (object classification) and regressor (bounding-box location)

SPP-Net

  • Feed the entire image to the conv layers, followed by SPP pooling for feature extraction
  • No fine-tuning of the conv layers
  • Separate classifier and regressor

Fast R-CNN

  • RoI pooling instead of SPP
  • Has fine-tuning
  • End-to-end network with a multi-task loss
  • Almost a fully network-based structure, except for the selective search step

Faster R-CNN

  • RPN replaces selective search; Faster R-CNN = RPN + Fast R-CNN
  • RoI pooling
  • Has fine-tuning
  • End-to-end network with a multi-task loss
  • Fully network-based structure
  • Weight sharing in the conv layers between the RPN and Fast R-CNN

Drawbacks of past models

  • Based on the traditional CNN structure (conv + FC layers), so weight sharing exists in the conv layers only
  • RoI-wise sub-networks: each RoI goes through its own sub-network after RoI pooling, so this computation is not shared between RoIs

R-FCN

  • Following the trend in CNNs, R-FCN aims to make the R-CNN family fully conv-layer-based, with no per-RoI FC layers

Dilemma in detection and classification tasks

  • Detection needs to be sensitive to object translation (translation variance)
  • Classification needs to be insensitive to translation (translation invariance)
  • With deeper conv layers, translation invariance becomes more dominant

New techniques to boost translation variance

  • Position-sensitive score maps
  • Position-sensitive RoI pooling
The centre coloured blocks are the position-sensitive score maps; the right-most coloured block is the result of position-sensitive RoI pooling (ref)

Position-sensitive score maps

  • Each group of score maps (identified by a different colour) actually consists of C+1 layers, where C is the number of classes and +1 is the background class
  • Each layer of a score map contains information about one class at a particular location. E.g. the first group on the right (light blue) represents the bottom-right grid cell of the pooling layer (see the sketch below)
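To make the shapes concrete, here is a minimal PyTorch-style sketch of how the score maps could be produced: a 1×1 convolution on the shared backbone features outputting k²(C+1) channels. The layer name, input channel count, and feature-map size below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

C = 20               # number of object classes (e.g. PASCAL VOC)
k = 3                # pooling grid is k x k (3 x 3 in the figure)
in_channels = 1024   # channels of the shared backbone feature map (assumption)

# 1x1 conv turning the shared feature map into k^2 * (C + 1)
# position-sensitive score maps: one (C + 1)-channel group per grid cell
ps_score_conv = nn.Conv2d(in_channels, k * k * (C + 1), kernel_size=1)

features = torch.randn(1, in_channels, 50, 50)   # dummy backbone output
score_maps = ps_score_conv(features)             # shape: (1, 189, 50, 50)
```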

Position-sensitive RoI pooling

  • It serves as a voting summary
  • For every class, it averages the score maps of one colour group within the corresponding bin and feeds that score into the corresponding position (a code sketch follows the figure captions below)
  • So the information of the 9 grid cells is squeezed into 1, but the thickness stays the same at C+1
  • In the figure above, the pooling grid is 3×3. E.g. if the bottom-right grid cell (light blue) has the highest score for a certain class Ci, there is a high probability that the region contains a class-Ci object.
In example 1 here, the RoI bounds a person, and the voting map has generally high scores (represented by light colours) (ref)
In example 2 here, the RoI is shifted, so the voting map has only a few high-score areas and many low-score areas (represented by dark colours) (ref)
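Below is a minimal sketch of position-sensitive RoI pooling followed by the voting step. It assumes a single image, an RoI already mapped to feature-map coordinates, and a simple "position group first" channel layout; these are simplifying assumptions, and real implementations use an optimized op (e.g. torchvision.ops.ps_roi_pool) rather than Python loops.

```python
import torch

C, k = 20, 3                                        # classes and pooling grid size
score_maps = torch.randn(k * k * (C + 1), 50, 50)   # dummy k^2(C+1) score maps

def ps_roi_pool_and_vote(score_maps, roi, k, depth):
    """roi = (x1, y1, x2, y2) in feature-map coordinates; depth = C + 1."""
    x1, y1, x2, y2 = roi
    roi_w, roi_h = x2 - x1, y2 - y1
    pooled = torch.zeros(depth, k, k)
    for i in range(k):                               # grid row
        for j in range(k):                           # grid column
            # spatial extent of bin (i, j) inside the RoI
            ys, ye = y1 + round(i * roi_h / k), y1 + round((i + 1) * roi_h / k)
            xs, xe = x1 + round(j * roi_w / k), x1 + round((j + 1) * roi_w / k)
            # channel group dedicated to this grid position (one colour in the figure)
            g = (i * k + j) * depth
            bin_scores = score_maps[g:g + depth, ys:ye, xs:xe]
            pooled[:, i, j] = bin_scores.mean(dim=(1, 2))   # average within the bin
    # voting: average the k x k bins -> one score per class (thickness stays C + 1)
    return pooled.mean(dim=(1, 2))

scores = ps_roi_pool_and_vote(score_maps, roi=(10, 10, 40, 40), k=k, depth=C + 1)
print(scores.shape)   # torch.Size([21])
```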

OHEM (Online Hard Example Mining)

  • Use the RPN to get RoIs and sort both the positive and negative samples by loss
  • Pick the top N RoIs (the harder examples), choosing negative examples so that positive:negative = 1:3 (see the sketch below)
  • Training on hard examples makes the model more robust
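A toy sketch of the selection step, assuming the per-RoI losses have already been computed in a forward pass. The function and variable names are hypothetical, and the 1:3 split follows the ratio stated above.

```python
import torch

def select_hard_rois(roi_losses, roi_labels, num_keep, neg_per_pos=3):
    """Keep the hardest RoIs, with roughly 1 positive per `neg_per_pos` negatives."""
    pos_idx = torch.nonzero(roi_labels > 0, as_tuple=True)[0]    # label 0 = background
    neg_idx = torch.nonzero(roi_labels == 0, as_tuple=True)[0]

    # sort each group by loss, hardest (largest loss) first
    pos_idx = pos_idx[roi_losses[pos_idx].argsort(descending=True)]
    neg_idx = neg_idx[roi_losses[neg_idx].argsort(descending=True)]

    num_pos = min(len(pos_idx), max(1, num_keep // (1 + neg_per_pos)))
    num_neg = min(len(neg_idx), num_keep - num_pos)
    return torch.cat([pos_idx[:num_pos], neg_idx[:num_neg]])     # indices to backprop

roi_losses = torch.rand(300)                  # dummy per-RoI losses from a forward pass
roi_labels = torch.randint(0, 21, (300,))     # dummy labels, 0 = background
hard_idx = select_hard_rois(roi_losses, roi_labels, num_keep=128)
print(hard_idx.shape)
```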

Training procedure

  • It is similar to Faster R-CNN: train the RPN and R-FCN alternately

Performance improvement

  • Improved position sensitivity compared to Faster R-CNN
  • Training and test times are about 2.5× faster than Faster R-CNN with a ResNet-101 backbone
  • Using OHEM does not noticeably affect the time spent

Reference

link1
