Region-CNN (R-CNN) is one of the state-of-the-art CNN-based deep learning object detection approaches. Building on it, Fast R-CNN and Faster R-CNN improve detection speed, while Mask R-CNN extends it to object instance segmentation. There are also other object detection approaches, such as YOLO and SSD.
To understand deep learning object detection approaches well, R-CNN is a must-read. It is a 2014 CVPR paper with about 6000 citations at the time of writing. (Sik-Ho Tsang @ Medium)
For object detection, we need to predict both the class of each object and the size and location of its bounding box.
Conventionally, a sliding window is used to search every position within the image, as shown below. It is a simple solution. However, different objects, or even the same kind of object, can have different aspect ratios and sizes depending on the object and its distance from the camera, and different image sizes also affect the effective window size. This process becomes extremely slow if we run a deep CNN classifier at every location.
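As a rough illustration (not from the paper), a sliding-window search enumerates every position and scale. Even modest window sizes and strides produce thousands of candidate windows, each of which would need a full CNN forward pass:

```python
# Naive sliding-window search (illustrative sketch only): enumerate every
# window position and size that a classifier would have to score.
def sliding_windows(img_w, img_h, win_sizes, stride):
    """Yield (x, y, w, h) boxes for every window position and size."""
    for w, h in win_sizes:
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                yield (x, y, w, h)

# Hypothetical settings: a 640x480 image, three window shapes, stride 16.
windows = list(sliding_windows(640, 480, [(64, 64), (128, 128), (256, 128)], stride=16))
```

Even this small configuration yields a few thousand windows, which is why region proposals are such a big win over exhaustive search.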
- First, R-CNN uses selective search to generate about 2K region proposals, i.e. candidate bounding boxes for image classification.
- Then, for each bounding box, image classification is done through a CNN.
- Finally, each bounding box can be refined by regression.
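The three steps above can be sketched as the following pseudocode. `selective_search`, `cnn_features`, `svm_scores`, and `refine_box` are hypothetical helpers standing in for the components described in the rest of this review:

```python
# High-level R-CNN pipeline sketch (my own paraphrase of the paper's flow).
# All four helper functions are hypothetical stand-ins.
def rcnn_detect(image, selective_search, cnn_features, svm_scores, refine_box):
    detections = []
    for box in selective_search(image):    # ~2K region proposals per image
        feat = cnn_features(image, box)    # warp the crop, CNN forward pass
        scores = svm_scores(feat)          # one SVM score per class
        cls = max(scores, key=scores.get)  # highest-scoring class
        detections.append((cls, scores[cls], refine_box(box, feat)))
    return detections
```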
What will be covered:
- Selective Search
- CNN-based Classification and Scoring
1. Selective Search
Selective search is proposed in "Selective Search for Object Recognition" [2013 IJCV].
- First, color similarity, texture similarity, region size, and region filling are used as non-object-based segmentation cues. This yields many small segmented areas, as shown at the bottom left of the image above.
- Then, a bottom-up approach is used: small segmented areas are merged together to form larger segmented areas.
- Thus, about 2K region proposals (bounding box candidates) are generated, as shown in the image.
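A toy sketch of the bottom-up grouping idea, greatly simplified: real selective search combines the color, texture, size, and fill similarities above, while here a region is just a box and the similarity function is supplied by the caller. Every intermediate merged region is kept as a proposal:

```python
# Toy bottom-up grouping: repeatedly merge the most similar pair of
# regions, recording every intermediate region as a proposal.
# Regions are (x1, y1, x2, y2) boxes; `similarity` is any pairwise score.
def greedy_merge(regions, similarity):
    proposals = list(regions)              # every region ever seen is a proposal
    regions = list(regions)
    while len(regions) > 1:
        # Find the most similar pair of remaining regions.
        i, j = max(
            ((a, b) for a in range(len(regions)) for b in range(a + 1, len(regions))),
            key=lambda p: similarity(regions[p[0]], regions[p[1]]),
        )
        # Merge them into their bounding box union.
        merged = (min(regions[i][0], regions[j][0]), min(regions[i][1], regions[j][1]),
                  max(regions[i][2], regions[j][2]), max(regions[i][3], regions[j][3]))
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged)
    return proposals
```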
2. CNN-based Classification and Scoring
AlexNet is used to extract the CNN features.
For each proposal, a 4096-dimensional feature vector is computed by forward propagating a mean-subtracted 227×227 RGB image through five convolutional layers and two fully connected layers.
The CNN input has a fixed size of 227×227 while bounding boxes have various shapes and sizes, so all pixels in a tight bounding box are warped to the 227×227 size.
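A minimal sketch of this warping step, assuming nearest-neighbor resampling and placeholder per-channel mean values (the actual pipeline subtracts the training-set mean and uses proper image resampling):

```python
import numpy as np

# Crop the tight bounding box, resize it to the fixed 227x227 CNN input
# with nearest-neighbor sampling, and subtract a per-channel mean.
# The mean values below are placeholders, not the paper's.
def warp_region(image, box, size=227, mean=(122.0, 116.0, 104.0)):
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2, :].astype(np.float32)
    h, w = crop.shape[:2]
    ys = np.arange(size) * h // size       # nearest source row per output row
    xs = np.arange(size) * w // size       # nearest source column per output column
    warped = crop[ys][:, xs]               # nearest-neighbor resize to size x size
    return warped - np.asarray(mean, dtype=np.float32)
```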
The feature vector is scored by an SVM trained for each class.
For each class, bounding boxes with high IoU (Intersection over Union) overlap are rejected by greedy non-maximum suppression, since they are likely bounding the same object.
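A minimal sketch of this greedy non-maximum suppression: keep the highest-scoring box, drop every remaining box whose IoU with it exceeds a threshold, and repeat (the threshold value here is illustrative):

```python
# Intersection over Union of two (x1, y1, x2, y2) boxes.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

# Greedy NMS over one class's detections: list of (box, score) pairs.
def nms(boxes_scores, iou_thresh=0.5):
    remaining = sorted(boxes_scores, key=lambda bs: bs[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)            # highest-scoring box survives
        kept.append(best)
        # Drop boxes that overlap it too much: they bound the same object.
        remaining = [bs for bs in remaining if iou(bs[0], best[0]) <= iou_thresh]
    return kept
```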
The predicted bounding box can be further fine-tuned by another bounding box regressor.
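The refinement step can be sketched as follows, using the center-and-size parameterization from the paper's bounding-box regression: the learned regressor predicts offsets (tx, ty, tw, th) from the CNN features, and the refined box is recovered from the proposal as below (the regressor itself is omitted here):

```python
import math

# Apply predicted bounding-box regression offsets to a proposal.
# proposal = (px, py, pw, ph): center x, center y, width, height.
# t = (tx, ty, tw, th): offsets predicted by the learned regressor.
def apply_bbox_regression(proposal, t):
    px, py, pw, ph = proposal
    tx, ty, tw, th = t
    gx = pw * tx + px                  # shift center by a fraction of the size
    gy = ph * ty + py
    gw = pw * math.exp(tw)             # scale width and height log-linearly
    gh = ph * math.exp(th)
    return (gx, gy, gw, gh)
```

The log-space width/height offsets keep the predicted size positive regardless of what the regressor outputs.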
3. Results
3.1 VOC 2010
R-CNN and R-CNN BB obtain the highest mAP (mean average precision).
3.2 ILSVRC 2013
R-CNN BB even outperforms OverFeat, the winner of the ILSVRC 2013 localization task!
3.3 VOC 2007
As you may already know, the CNN used in R-CNN can be replaced by any CNN used for image classification.
When R-CNN BB uses VGG-16, a 16-layer VGGNet, the mAP even increases to 66.0%.
If interested, please read also my reviews about AlexNet, VGGNet, and OverFeat. (Links at the bottom)
And I will write more reviews for other state-of-the-art deep learning approaches.
- [2014 CVPR] [R-CNN]
Rich feature hierarchies for accurate object detection and semantic segmentation
- [2013 IJCV] [Selective Search]
Selective Search for Object Recognition
- [2012 NIPS] [AlexNet]
ImageNet Classification with Deep Convolutional Neural Networks
- [2014 ICLR] [OverFeat]
OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks
- [2015 ICLR] [VGGNet]
Very Deep Convolutional Networks for Large-Scale Image Recognition
- Review: AlexNet, CaffeNet — Winner of ILSVRC 2012 (Image Classification)
- Review: OverFeat — Winner of ILSVRC 2013 Localization Task (Object Detection)
- Review: VGGNet — 1st Runner-Up of ILSVRC 2014 (Image Classification)