Learning Day 65: Object detection 4 — R-FCN
Jun 19, 2021
Past models
R-CNN
- Selective search (SS) to get region proposals
- Every region proposal goes through the conv + FC layers separately
- Separate classifier (object classification) and regressor (bounding box location)
SPP-Net
- Feed the entire image to the conv layers, followed by SPP (spatial pyramid pooling) for feature extraction
- No fine tuning
- Separate classifier and regressor
Fast R-CNN
- ROI pooling instead of SPP
- Has fine tuning
- End to end network with multi-task loss
- Almost fully network-based; only the selective search step stays outside the network
Faster R-CNN
- RPN replaces selective search. Faster R-CNN = RPN + Fast R-CNN
- ROI pooling
- Has fine tuning
- End to end network with multi-task loss.
- Fully network-based structure
- Weight sharing for conv layers between RPN and Fast R-CNN
Drawbacks of past models
- Based on the traditional CNN structure: conv + FC layers. Weight sharing exists in the conv layers only.
- RoI-wise sub-networks: each RoI goes through the sub-network separately, so nothing is shared between RoIs
R-FCN
- Follows the trend in CNNs towards fully convolutional networks: tries to make R-CNN fully conv layer-based
Dilemma in detection and classification tasks
- Detection needs to be sensitive to object translation (translation variance)
- Classification needs to be insensitive to translation (translation invariance)
- With deeper conv layers, translation invariance becomes more dominant
New techniques to boost translation variance
- Position-sensitive score maps
- Position-sensitive RoI pooling
Position-sensitive score maps
- Each group of score maps (identified by a different colour) actually consists of C+1 layers, where C is the number of classes and the +1 is the background class
- Each group of score maps contains the information for every class at one particular position. E.g. the first group on the right (light blue) represents the bottom-right cell of the pooling grid
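To make the shapes concrete, here is a minimal NumPy sketch of how the score maps could be organised for a k x k grid. The function name `split_score_maps`, the bin-major channel order, and the sizes (k = 3, C = 20, a 40 x 60 feature map) are assumptions for illustration only; a real implementation may order its channels differently.

```python
import numpy as np

def split_score_maps(score_maps, k, num_classes):
    """Split (k*k*(C+1), H, W) score maps into k*k groups of C+1 layers.

    Assumed layout: channels are ordered bin-major, so group g holds the
    C+1 class layers for grid cell (g // k, g % k).
    """
    layers_per_group = num_classes + 1  # C classes + 1 background
    if score_maps.shape[0] != k * k * layers_per_group:
        raise ValueError("expected k*k*(C+1) channels")
    return score_maps.reshape(k * k, layers_per_group, *score_maps.shape[1:])

k, C = 3, 20                                   # hypothetical sizes
maps = np.zeros((k * k * (C + 1), 40, 60), dtype=np.float32)
groups = split_score_maps(maps, k, C)
print(groups.shape)                            # (9, 21, 40, 60)
bottom_right = groups[k * k - 1]               # the group for the bottom-right cell
```

With this layout, each of the 9 "colours" is one entry along the first axis, and each entry is C+1 layers deep.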
Position-sensitive RoI pooling
- It serves as a voting summary
- For every class, it averages the score maps of one colour and feeds that score into the corresponding position of the pooled grid
- So the information from 9 groups of score maps is squeezed into 1 grid, but the depth stays the same at C+1
- In the above figure, the pooling grid is 3x3. E.g. if the bottom-right cell (light blue) has the highest score for a certain class Ci, there is a high probability that that region contains a class Ci object.
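The pooling and voting steps above can be sketched in NumPy as follows. This is a simplified illustration, not the reference implementation: it assumes the score maps are already arranged as (k*k, C+1, H, W) with group index `i * k + j` for grid cell (row i, column j), that the RoI is given in feature-map coordinates and lies inside the map, and that bin edges are simply rounded outwards. The name `ps_roi_pool` and the toy values are made up for the example.

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k):
    """Position-sensitive RoI pooling followed by average voting.

    score_maps: array (k*k, C+1, H, W), one group of C+1 layers per grid cell.
    roi: (x0, y0, x1, y1) in feature-map pixels, end-exclusive.
    Returns pooled scores (C+1, k, k) and per-class votes (C+1,).
    """
    x0, y0, x1, y1 = roi
    xs = np.linspace(x0, x1, k + 1)            # bin edges along x
    ys = np.linspace(y0, y1, k + 1)            # bin edges along y
    pooled = np.zeros((score_maps.shape[1], k, k))
    for i in range(k):                         # grid row
        for j in range(k):                     # grid column
            ya, yb = int(np.floor(ys[i])), int(np.ceil(ys[i + 1]))
            xa, xb = int(np.floor(xs[j])), int(np.ceil(xs[j + 1]))
            yb, xb = max(yb, ya + 1), max(xb, xa + 1)   # at least one pixel
            group = score_maps[i * k + j]      # only this cell's group is read
            pooled[:, i, j] = group[:, ya:yb, xa:xb].mean(axis=(1, 2))
    votes = pooled.mean(axis=(1, 2))           # average the k*k cells per class
    return pooled, votes

# Toy input: group g, class c is filled with the constant g + 10 * c.
k, C = 3, 2
score_maps = np.zeros((k * k, C + 1, 12, 12))
for g in range(k * k):
    for c in range(C + 1):
        score_maps[g, c] = g + 10 * c

pooled, votes = ps_roi_pool(score_maps, roi=(0, 0, 9, 9), k=k)
probs = np.exp(votes - votes.max())
probs /= probs.sum()                           # softmax over the C+1 votes
```

Each cell of the pooled grid only ever reads from its own group of score maps, which is what makes the pooling position-sensitive; the vote then collapses the k x k grid to one score per class.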
OHEM (Online Hard Example Mining)
- Use the RPN to get RoIs, then sort both the positive and the negative samples by loss
- Pick the top N RoIs (the harder examples), choosing the negatives so that positive:negative = 1:3
- Training on hard examples makes the model more robust
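One possible reading of the selection steps above, as a small NumPy sketch. The function name `select_hard_examples`, the way the 1:3 ratio is enforced (N // 4 positives, then three negatives per chosen positive), and the toy loss values are all assumptions for illustration; the per-RoI losses would normally come from a forward pass of the detector.

```python
import numpy as np

def select_hard_examples(losses, is_positive, n, neg_per_pos=3):
    """Pick up to n hard RoIs, keeping positive:negative = 1:neg_per_pos.

    losses: per-RoI loss values; is_positive: per-RoI boolean labels.
    Returns (indices of chosen positives, indices of chosen negatives).
    """
    losses = np.asarray(losses, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    pos_idx = np.flatnonzero(is_positive)
    neg_idx = np.flatnonzero(~is_positive)
    # Hardest first: sort each group by descending loss.
    pos_idx = pos_idx[np.argsort(-losses[pos_idx], kind="stable")]
    neg_idx = neg_idx[np.argsort(-losses[neg_idx], kind="stable")]
    n_pos = min(len(pos_idx), n // (1 + neg_per_pos))
    n_neg = min(len(neg_idx), n_pos * neg_per_pos)
    return pos_idx[:n_pos], neg_idx[:n_neg]

# Toy example: 8 RoIs, the first two are positives.
losses = [0.9, 0.1, 0.5, 0.8, 0.2, 0.7, 0.3, 0.6]
is_positive = [True, True, False, False, False, False, False, False]
hard_pos, hard_neg = select_hard_examples(losses, is_positive, n=4)
```

Only the selected RoIs would then contribute to the backward pass; the easy, low-loss RoIs are skipped.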
Training procedure
- It is similar to Faster R-CNN: train the RPN and R-FCN alternately
Performance improvement
- Improves position sensitivity compared to Faster R-CNN
- Speeds up training and testing by about 2.5x with a ResNet-101 backbone
- OHEM does not add to the time spent