Region-Based Convolutional Neural Network (R-CNN)

Ramji Balasubramanian
Published in Analytics Vidhya
3 min read · Jan 27, 2021

R-CNN is a region-based object detection algorithm developed by Girshick et al. at UC Berkeley in 2014. Before jumping into the algorithm, let's try to understand what object detection actually means and how it differs from image classification.

What is Object Detection?

Take the image above as an example and think about image classification. In the first example we can classify the image as a dog or a cat, whereas in the second example we cannot, unless we detect the dog and the cat individually. In the real world, a single image almost always contains multiple objects. This is where object detection becomes important.

Image Classification: Predict the type or class of an object in an image.

  • Input: An image with a single object, such as a photograph.
  • Output: A class label (e.g. one or more integers that are mapped to class labels).

Object Detection: Locate the objects in an image with bounding boxes and predict the type or class of each located object.

  • Input: An image with one or more objects, such as a photograph.
  • Output: One or more bounding boxes (e.g. defined by a point, width, and height), and a class label for each bounding box.
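
To make the difference between the two outputs concrete, here is a minimal sketch in Python. The class names `BoundingBox` and `Detection`, and the example boxes and labels, are made up for illustration and do not come from any library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BoundingBox:
    x: float       # left edge of the box, in pixels
    y: float       # top edge of the box, in pixels
    width: float   # box width, in pixels
    height: float  # box height, in pixels


@dataclass
class Detection:
    box: BoundingBox  # where the object is
    label: str        # what the object is


# Image classification: a single label for the whole image.
classification_output: str = "dog"

# Object detection: one (box, label) pair per object found in the image.
detection_output: List[Detection] = [
    Detection(BoundingBox(30, 40, 120, 200), "dog"),
    Detection(BoundingBox(180, 60, 100, 150), "cat"),
]

for det in detection_output:
    print(det.label, det.box)
```

Classification returns one answer per image, while detection returns a list whose length depends on how many objects are present.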

R-CNN Algorithm

To understand R-CNN, we need some prior knowledge of how a Convolutional Neural Network works and of the Mean Average Precision (mAP) metric, which is used to measure detection performance.

Links:

CNN: https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/

mAP: https://datanics.blogspot.com/2020/11/understanding-mean-average-precision.html?m=1

R-CNN (paper) was one of the first large and successful applications of convolutional neural networks to the problems of object localization, detection, and segmentation. The approach was demonstrated on benchmark datasets, achieving then state-of-the-art results on the VOC-2012 dataset and the 200-class ILSVRC-2013 object detection dataset.

Architecture:

Region-based CNN consists of three modules: Region Proposal, Feature Extractor, and Classifier.

Region Proposal: Given an input image, the region proposal module generates around 2000 candidate regions of different sizes and aspect ratios (R-CNN uses the selective search algorithm for this). In other words, it draws many bounding boxes on the input image, as shown below.

Region Proposal Result
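
As a rough illustration of "many boxes in different sizes and aspect ratios", here is a minimal sketch. Note that it is not selective search: it is a simple sliding-window stand-in, and the function name `propose_regions` and its default scales, ratios, and stride are assumptions chosen only for this example.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def propose_regions(img_w: int, img_h: int,
                    scales=(64, 128, 256),
                    aspect_ratios=(0.5, 1.0, 2.0),
                    stride: int = 32) -> List[Box]:
    """Hypothetical stand-in for a region proposal method.

    Slides boxes of several sizes and aspect ratios over the image.
    Selective search instead groups similar pixels into regions.
    """
    boxes: List[Box] = []
    for scale in scales:
        for ratio in aspect_ratios:
            # ratio = width / height, with the area kept near scale * scale
            w = int(round(scale * ratio ** 0.5))
            h = int(round(scale / ratio ** 0.5))
            # Only keep boxes that fit entirely inside the image.
            for y in range(0, img_h - h + 1, stride):
                for x in range(0, img_w - w + 1, stride):
                    boxes.append((x, y, w, h))
    return boxes


proposals = propose_regions(500, 375)
print(len(proposals), "candidate boxes")
```

Every box produced here is a candidate that the later stages would have to process, which is why the number of proposals matters so much for speed.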

Feature Extractor: Each proposed region is warped to a fixed size and passed through a CNN, and the output of the last layer before classification (4096 values) is taken as the feature vector for that region. The final output of the feature extractor therefore has shape (number of proposed regions) x 4096.
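
The data flow of this step can be sketched as follows. This is only a shape-level illustration: `placeholder_features` is a made-up stand-in that samples pixel values and is not a real CNN, the image is a small synthetic grayscale grid, and the 227 x 227 input size is taken from the R-CNN paper rather than from this article.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)
Image = List[List[float]]        # grayscale image as rows of pixel values

INPUT_SIZE = 227    # fixed input side length (assumption, from the R-CNN paper)
FEATURE_DIM = 4096  # length of the feature vector per region


def crop_and_warp(image: Image, box: Box, size: int = INPUT_SIZE) -> Image:
    """Crop the box and resize it to size x size (nearest-neighbour sampling)."""
    x, y, w, h = box
    return [
        [image[y + (r * h) // size][x + (c * w) // size] for c in range(size)]
        for r in range(size)
    ]


def placeholder_features(patch: Image, dim: int = FEATURE_DIM) -> List[float]:
    """Placeholder for the CNN forward pass (NOT a real network).

    It samples `dim` pixel values so the output has the right length.
    """
    flat = [p for row in patch for p in row]
    return [flat[(i * len(flat)) // dim] for i in range(dim)]


def extract_features(image: Image, boxes: List[Box]) -> List[List[float]]:
    """Return one feature vector per proposed region."""
    return [placeholder_features(crop_and_warp(image, b)) for b in boxes]


# Synthetic 100 x 80 image and three hypothetical proposals inside it.
image = [[float((r + c) % 256) for c in range(100)] for r in range(80)]
boxes = [(0, 0, 50, 40), (10, 20, 60, 30), (40, 10, 30, 70)]

features = extract_features(image, boxes)
print(len(features), "x", len(features[0]))  # 3 x 4096
```

With around 2000 proposals instead of three, the same loop would produce a 2000 x 4096 feature matrix for a single image.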

Classifier: Once the features are extracted, we need to classify the object inside each region. To do this, linear SVMs are trained on the extracted features, specifically one SVM per class.
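
Here is a minimal sketch of how one linear SVM per class could score a region at test time. The weights and biases below are made up (not trained), the features have 4 dimensions instead of 4096 to keep the example readable, and treating a region as "background" when no class scores above zero is an assumption of this sketch.

```python
from typing import Dict, List, Tuple

# One (weights, bias) pair per class. Values are invented for illustration.
SVMS: Dict[str, Tuple[List[float], float]] = {
    "dog": ([1.0, -1.0, 0.0, 0.0], -0.5),
    "cat": ([-1.0, 1.0, 0.0, 0.0], -0.5),
}


def svm_score(weights: List[float], bias: float, feature: List[float]) -> float:
    """Linear SVM decision value: w . x + b."""
    return sum(w * f for w, f in zip(weights, feature)) + bias


def classify_region(feature: List[float],
                    svms: Dict[str, Tuple[List[float], float]]) -> Tuple[str, float]:
    """Score the region with every class SVM and keep the best class."""
    scores = {label: svm_score(w, b, feature) for label, (w, b) in svms.items()}
    best = max(scores, key=scores.get)
    # If no class SVM fires (score <= 0), treat the region as background.
    if scores[best] <= 0:
        return "background", scores[best]
    return best, scores[best]


region_features = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
for feature in region_features:
    print(classify_region(feature, SVMS))
```

Each SVM only answers "is this region my class or not", so adding a new class means training one more SVM on the same features.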

Cons of R-CNN:

  • It takes a huge amount of time to train the network, as around 2000 region proposals per image have to be processed and classified.
  • It cannot run in real time, as it takes around 47 seconds for each test image.
  • The selective search algorithm is a fixed algorithm, so no learning happens at the region proposal stage. This could lead to the generation of bad candidate region proposals.
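
A quick back-of-the-envelope calculation shows what these numbers mean in practice. The 47 seconds and 2000 proposals come from the points above; the dataset size of 1000 images is a hypothetical value chosen for this example.

```python
SECONDS_PER_IMAGE = 47    # test-time figure quoted above
REGIONS_PER_IMAGE = 2000  # approximate number of proposals per image


def images_per_second(seconds_per_image: float) -> float:
    """Throughput implied by a per-image processing time."""
    return 1.0 / seconds_per_image


def hours_for_dataset(num_images: int, seconds_per_image: float) -> float:
    """Total time to process a dataset, in hours."""
    return num_images * seconds_per_image / 3600.0


def region_passes(num_images: int, regions_per_image: int) -> int:
    """Number of regions that must go through the CNN."""
    return num_images * regions_per_image


print(f"{images_per_second(SECONDS_PER_IMAGE):.3f} images per second")        # 0.021
print(f"{hours_for_dataset(1000, SECONDS_PER_IMAGE):.1f} hours for 1000 images")  # 13.1
print(region_passes(1000, REGIONS_PER_IMAGE), "regions for 1000 images")      # 2000000
```

At roughly 0.02 images per second, this is far from the tens of frames per second that real-time video would need.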

Thanks for reading!!!
