Region-CNN (R-CNN) is one of the state-of-the-art CNN-based deep learning object detection approaches. Building on it, Fast R-CNN and Faster R-CNN improve detection speed, while Mask R-CNN extends it to object instance segmentation. There are also other object detection approaches, such as YOLO and SSD.
To understand deep learning object detection approaches well, R-CNN is a must-read. It is a 2014 CVPR paper with about 6000 citations at the time of writing. (Sik-Ho Tsang @ Medium)
For object detection, we need to predict both the class of each object and the size and location of its bounding box.
Conventionally, a sliding window is used to search every position within the image, as shown below. It is a simple solution. However, different objects, or even the same kind of object, can have different aspect ratios and sizes depending on the object and its distance from the camera, and different image sizes also affect the effective window size. This process becomes extremely slow if we run a deep CNN classifier at every location.
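As a rough illustration (not from the paper), a sliding-window search enumerates every position and scale. Even modest window sizes and strides produce thousands of candidate windows, each of which would need a full CNN forward pass:

```python
# Naive sliding-window search (illustrative sketch only): enumerate every
# window position and size that a classifier would have to score.
def sliding_windows(img_w, img_h, win_sizes, stride):
    """Yield (x, y, w, h) boxes for every window position and size."""
    for w, h in win_sizes:
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                yield (x, y, w, h)

# Hypothetical settings: a 640x480 image, three window shapes, stride 16.
windows = list(sliding_windows(640, 480, [(64, 64), (128, 128), (256, 128)], stride=16))
```

Even this small configuration yields a few thousand windows, which is why region proposals are such a big win over exhaustive search.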
- First, R-CNN uses selective search to generate about 2K region proposals, i.e. candidate bounding boxes for image classification.
- Then, for each bounding box, image classification is done through a CNN.
- Finally, each bounding box can be refined by regression.
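The three steps above can be sketched as the following pseudocode. `selective_search`, `cnn_features`, `svm_scores`, and `refine_box` are hypothetical helpers standing in for the components described in the rest of this review:

```python
# High-level R-CNN pipeline sketch (my own paraphrase of the paper's flow).
# All four helper functions are hypothetical stand-ins.
def rcnn_detect(image, selective_search, cnn_features, svm_scores, refine_box):
    detections = []
    for box in selective_search(image):    # ~2K region proposals per image
        feat = cnn_features(image, box)    # warp the crop, CNN forward pass
        scores = svm_scores(feat)          # one SVM score per class
        cls = max(scores, key=scores.get)  # highest-scoring class
        detections.append((cls, scores[cls], refine_box(box, feat)))
    return detections
```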
What will be covered:
- Selective Search
- CNN-based Classification and Scoring
1. Selective Search
Selective search is proposed in "Selective Search for Object Recognition" [2013 IJCV].
- First, color similarity, texture similarity, region size, and region filling are used as non-object-based segmentation cues. This yields many small segmented areas, as shown at the bottom left of the image above.
- Then, a bottom-up approach is used: small segmented areas are merged together to form larger segmented areas.
- Thus, about 2K region proposals (bounding box candidates) are generated, as shown in the image.
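A toy sketch of the bottom-up grouping idea, greatly simplified: real selective search combines the color, texture, size, and fill similarities above, while here a region is just a box and the similarity function is supplied by the caller. Every intermediate merged region is kept as a proposal:

```python
# Toy bottom-up grouping: repeatedly merge the most similar pair of
# regions, recording every intermediate region as a proposal.
# Regions are (x1, y1, x2, y2) boxes; `similarity` is any pairwise score.
def greedy_merge(regions, similarity):
    proposals = list(regions)              # every region ever seen is a proposal
    regions = list(regions)
    while len(regions) > 1:
        # Find the most similar pair of remaining regions.
        i, j = max(
            ((a, b) for a in range(len(regions)) for b in range(a + 1, len(regions))),
            key=lambda p: similarity(regions[p[0]], regions[p[1]]),
        )
        # Merge them into their bounding box union.
        merged = (min(regions[i][0], regions[j][0]), min(regions[i][1], regions[j][1]),
                  max(regions[i][2], regions[j][2]), max(regions[i][3], regions[j][3]))
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged)
    return proposals
```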
2. CNN-based Classification and Scoring
AlexNet is used to extract the CNN features.
For each proposal, a 4096-dimensional feature vector is computed by forward propagating a mean-subtracted 227×227 RGB image through five convolutional layers and two fully connected layers.
The CNN input has a fixed size of 227×227 while bounding boxes have various shapes and sizes, so all pixels in a tight bounding box are warped to the 227×227 size.
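A minimal sketch of this warping step, assuming nearest-neighbor resampling and placeholder per-channel mean values (the actual pipeline subtracts the training-set mean and uses proper image resampling):

```python
import numpy as np

# Crop the tight bounding box, resize it to the fixed 227x227 CNN input
# with nearest-neighbor sampling, and subtract a per-channel mean.
# The mean values below are placeholders, not the paper's.
def warp_region(image, box, size=227, mean=(122.0, 116.0, 104.0)):
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2, :].astype(np.float32)
    h, w = crop.shape[:2]
    ys = np.arange(size) * h // size       # nearest source row per output row
    xs = np.arange(size) * w // size       # nearest source column per output column
    warped = crop[ys][:, xs]               # nearest-neighbor resize to size x size
    return warped - np.asarray(mean, dtype=np.float32)
```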
The feature vector is scored by an SVM trained for each class.
For each class, bounding boxes with high IoU (Intersection over Union) overlap are rejected by greedy non-maximum suppression, since they are likely bounding the same object.
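A minimal sketch of this greedy non-maximum suppression: keep the highest-scoring box, drop every remaining box whose IoU with it exceeds a threshold, and repeat (the threshold value here is illustrative):

```python
# Intersection over Union of two (x1, y1, x2, y2) boxes.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

# Greedy NMS over one class's detections: list of (box, score) pairs.
def nms(boxes_scores, iou_thresh=0.5):
    remaining = sorted(boxes_scores, key=lambda bs: bs[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)            # highest-scoring box survives
        kept.append(best)
        # Drop boxes that overlap it too much: they bound the same object.
        remaining = [bs for bs in remaining if iou(bs[0], best[0]) <= iou_thresh]
    return kept
```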
The predicted bounding box can be further fine-tuned by another bounding box regressor.
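The refinement step can be sketched as follows, using the center-and-size parameterization from the paper's bounding-box regression: the learned regressor predicts offsets (tx, ty, tw, th) from the CNN features, and the refined box is recovered from the proposal as below (the regressor itself is omitted here):

```python
import math

# Apply predicted bounding-box regression offsets to a proposal.
# proposal = (px, py, pw, ph): center x, center y, width, height.
# t = (tx, ty, tw, th): offsets predicted by the learned regressor.
def apply_bbox_regression(proposal, t):
    px, py, pw, ph = proposal
    tx, ty, tw, th = t
    gx = pw * tx + px                  # shift center by a fraction of the size
    gy = ph * ty + py
    gw = pw * math.exp(tw)             # scale width and height log-linearly
    gh = ph * math.exp(th)
    return (gx, gy, gw, gh)
```

The log-space width/height offsets keep the predicted size positive regardless of what the regressor outputs.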
3. Results
3.1 VOC 2010
R-CNN and R-CNN BB obtain the highest mAP (mean average precision).
3.2 ILSVRC 2013
R-CNN BB even outperforms OverFeat, the winner of the ILSVRC 2013 localization task!
3.3 VOC 2007
As you may already know, the CNN used in R-CNN can be replaced by any CNN used for image classification.
When R-CNN BB uses VGG-16, a 16-layer VGGNet, the mAP even increases to 66.0%.
If interested, please read also my reviews about AlexNet, VGGNet, and OverFeat. (Links at the bottom)
And I will write more reviews for other state-of-the-art deep learning approaches.
- [2014 CVPR] [R-CNN]
Rich feature hierarchies for accurate object detection and semantic segmentation
- [2013 IJCV] [Selective Search]
Selective Search for Object Recognition
- [2012 NIPS] [AlexNet]
ImageNet Classification with Deep Convolutional Neural Networks
- [2014 ICLR] [OverFeat]
OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks
- [2015 ICLR] [VGGNet]
Very Deep Convolutional Networks for Large-Scale Image Recognition
- Review: AlexNet, CaffeNet — Winner of ILSVRC 2012 (Image Classification)
- Review: OverFeat — Winner of ILSVRC 2013 Localization Task (Object Detection)
- Review: VGGNet — 1st Runner-Up of ILSVRC 2014 (Image Classification)