Learning Day 65: Object detection 4 — R-FCN

De Jun Huang
3 min read · Jun 19, 2021

Past models


R-CNN

  • Selective search (SS) generates the region proposals
  • Every region proposal goes through the conv + FC layers separately
  • Separate classifier (object classification) and regressor (bounding box location)


SPP-Net

  • Feed the entire image to the conv layers once, followed by SPP pooling for feature extraction
  • No fine tuning
  • Separate classifier and regressor

Fast R-CNN

  • RoI pooling instead of SPP
  • Has fine tuning
  • End to end network with multi-task loss
  • Almost fully network-based, except for the selective search step
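
To make the multi-task loss concrete, here is a minimal NumPy sketch for a single RoI: a classification log loss plus a smooth L1 box regression loss that is only counted for non-background RoIs. The function names, the `lam` weight and the convention that class 0 is background are assumptions made for illustration, not code from any particular implementation.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5 * x^2 where |x| < 1, |x| - 0.5 elsewhere."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

def multi_task_loss(class_probs, true_class, box_pred, box_target, lam=1.0):
    """Classification loss plus box regression loss for one RoI.

    Assumption: class 0 is background, which has no box to regress.
    """
    class_probs = np.asarray(class_probs, dtype=float)
    diff = np.asarray(box_pred, dtype=float) - np.asarray(box_target, dtype=float)
    cls_loss = -np.log(class_probs[true_class])
    loc_loss = smooth_l1(diff).sum() if true_class > 0 else 0.0
    return cls_loss + lam * loc_loss

# Hypothetical RoI: 3 classes (background + 2), 4 box offsets.
loss = multi_task_loss([0.1, 0.7, 0.2], 1, [0.1, 0.2, 2.0, 0.0], [0.0, 0.0, 0.0, 0.0])
```

Because both terms are summed into one scalar, a single backward pass can update the classifier and the regressor together, which is what makes end to end training possible.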

Faster R-CNN

  • RPN replaces selective search. Faster R-CNN = RPN + Fast R-CNN
  • RoI pooling
  • Has fine tuning
  • End to end network with multi-task loss
  • Fully network-based structure
  • Weight sharing for conv layers between RPN and Fast R-CNN

Drawbacks of past models

  • Based on the traditional CNN structure: conv + FC layers. Computation is shared across RoIs in the conv layers only
  • RoI-wise sub-networks: each RoI goes through the FC layers on its own, so that part of the computation is not shared between RoIs


R-FCN

  • Follows the trend in CNNs for classification: make the R-CNN family fully convolutional

Dilemma in detection and classification tasks

  • Detection needs to be sensitive to object translation (translation variance)
  • Classification needs to be insensitive to translation (translation invariance)
  • With deeper conv layers, translation invariance becomes more dominant
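
A toy NumPy illustration of the dilemma, using a made-up 6x6 feature map with one strong activation: global max pooling gives the same answer when the pattern moves (what classification wants), while the argmax position moves with it (what detection needs).

```python
import numpy as np

feature = np.zeros((6, 6))
feature[1, 1] = 1.0  # a single strong activation

# Same pattern, moved 2 rows down and 3 columns to the right.
shifted = np.roll(feature, shift=(2, 3), axis=(0, 1))

# Global max pooling discards location: both inputs give the same value.
global_max = float(feature.max())
global_max_shifted = float(shifted.max())

# Keeping the argmax retains location: the answer moves with the pattern.
pos = tuple(int(i) for i in np.unravel_index(feature.argmax(), feature.shape))
pos_shifted = tuple(int(i) for i in np.unravel_index(shifted.argmax(), shifted.shape))

print(global_max == global_max_shifted)  # True
print(pos, pos_shifted)                  # (1, 1) (3, 4)
```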

New techniques to boost translation variance

  • Position-sensitive score maps
  • Position-sensitive RoI pooling
The centre coloured blocks are the score maps. The rightmost coloured block is the result of RoI pooling (ref)

Position-sensitive score maps

  • Each score map (identified by a different colour) actually consists of C+1 layers, where C is the number of classes and the +1 is the background class
  • Each score map holds the per-class scores for one particular position in the pooling grid. E.g. the first one on the right (light blue) represents the bottom-right cell of the pooling grid
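
As a rough sketch of how these maps could be laid out in memory: with a k x k pooling grid there are k*k banks of C+1 channels each. The sizes below and the bank-by-bank channel ordering are assumptions for illustration; real implementations may order the channels differently.

```python
import numpy as np

k = 3              # pooling grid is k x k
num_classes = 20   # C (hypothetical, e.g. a 20-class dataset)
H, W = 40, 60      # hypothetical feature map size

# One bank of (C+1) maps per grid cell: k*k*(C+1) channels in total.
score_maps = np.zeros((k * k * (num_classes + 1), H, W))

def channel_index(row, col, cls):
    """Channel holding the score of class `cls` for grid cell (row, col)."""
    return (row * k + col) * (num_classes + 1) + cls

# The bottom-right cell (row 2, col 2) owns the last bank of 21 channels.
first = channel_index(2, 2, 0)
last = channel_index(2, 2, num_classes)
```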

Position-sensitive RoI pooling

  • It serves as a voting summary
  • For every class, it averages the scores inside each grid cell of the RoI, using only the score map of the matching colour, and writes that average into the corresponding position of the pooled output
  • The 9 cells are then averaged into 1 cell (the vote), while the depth stays the same at C+1
  • In the above figure, the pooling grid is 3x3. E.g. if the bottom-right cell (light blue) has the highest score for a certain class Ci, there is a high probability that this region contains the bottom-right part of a class Ci object
In example 1, the RoI tightly bounds a person, so the voting map has generally high scores (shown in light colours) (ref)
In example 2, the RoI is shifted, so the voting map has only a few high scores and many low scores (shown in dark colours) (ref)
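
The pooling and voting described above can be sketched in NumPy as follows. This is a simplified version under stated assumptions: the channels are ordered bank by bank (one bank of C+1 channels per grid cell), the RoI is given in feature-map coordinates with exclusive end points, and bin edges are rounded to integers. The function name `ps_roi_pool` is hypothetical.

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k, num_classes):
    """Position-sensitive RoI pooling followed by average voting.

    score_maps: shape (k*k*(C+1), H, W), one bank of C+1 channels per cell (assumed).
    roi: (x0, y0, x1, y1) in feature-map coordinates, end-exclusive.
    Returns (pooled, votes): shapes (k, k, C+1) and (C+1,).
    """
    depth = num_classes + 1
    x0, y0, x1, y1 = roi
    ys = np.linspace(y0, y1, k + 1).astype(int)  # row edges of the k bins
    xs = np.linspace(x0, x1, k + 1).astype(int)  # column edges of the k bins
    pooled = np.zeros((k, k, depth))
    for i in range(k):
        for j in range(k):
            # Cell (i, j) only reads from its own bank of score maps.
            start = (i * k + j) * depth
            bank = score_maps[start:start + depth]
            # Keep every bin at least one cell wide for very small RoIs.
            y_hi = max(ys[i + 1], ys[i] + 1)
            x_hi = max(xs[j + 1], xs[j] + 1)
            pooled[i, j] = bank[:, ys[i]:y_hi, xs[j]:x_hi].mean(axis=(1, 2))
    votes = pooled.mean(axis=(0, 1))  # k*k cells vote: one score per class
    return pooled, votes

# Toy input: 3x3 grid, C = 2, every channel filled with its own index.
k, C = 3, 2
maps = np.arange(k * k * (C + 1), dtype=float)[:, None, None] * np.ones((1, 9, 9))
pooled, votes = ps_roi_pool(maps, (0, 0, 9, 9), k, C)
print(pooled.shape, votes)  # (3, 3, 3) [12. 13. 14.]
```

Note that there are no learnable weights after the score maps: the per-RoI work is only averaging, which is why the cost per RoI is so small.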

OHEM (Online Hard Example Mining)

  • Use the RPN to get RoIs, compute the loss of each one and sort them (both positive and negative samples) by loss
  • Pick the top N RoIs (the harder examples), choosing negatives so that positive:negative = 1:3
  • Training on hard examples makes the model more robust
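
One possible way to combine the loss-based sorting with the 1:3 ratio is sketched below. How the two rules interact is an assumption here (positives and negatives are each ranked by loss, then the hardest of each group are taken); the function name and parameters are hypothetical.

```python
import numpy as np

def select_hard_examples(losses, is_positive, batch_size, neg_per_pos=3):
    """Return indices of the hardest RoIs with at most neg_per_pos negatives per positive."""
    losses = np.asarray(losses, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    order = np.argsort(-losses)          # highest loss (hardest) first
    pos = order[is_positive[order]]      # positives, hardest first
    neg = order[~is_positive[order]]     # negatives, hardest first
    n_pos = min(len(pos), batch_size // (1 + neg_per_pos))
    n_neg = min(len(neg), n_pos * neg_per_pos)
    return np.concatenate([pos[:n_pos], neg[:n_neg]])

# Toy example: 8 RoIs, 2 of them positive, keep a batch of 4.
losses = [0.9, 0.1, 0.5, 0.8, 0.2, 0.7, 0.3, 0.6]
is_positive = [True, False, False, False, True, False, False, False]
selected = select_hard_examples(losses, is_positive, batch_size=4)
print(selected.tolist())  # [0, 3, 5, 7]
```

Only the selected RoIs would contribute to the backward pass; the losses of all RoIs are already available from the forward pass.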

Training procedure

  • It is similar to Faster R-CNN: train the RPN and R-FCN alternately

Performance improvement

  • Better position sensitivity than Faster R-CNN
  • Training and test time improve by about 2.5x with a ResNet-101 backbone
  • OHEM does not add to the training time



