Complete Guide to Region-Based Convolutional Neural Networks [R-CNN]

VENKATESH MUNGI
5 min read · Oct 13, 2022


Although the information has been taken from different sources, my only intention is to provide complete knowledge of Region-Based Convolutional Neural Networks.

What is a Region-Based Convolutional Neural Network?

A Convolutional Neural Network (CNN) with a fully connected layer is not able to deal with a varying number of objects, or with multiple objects, in an image. One way around this could be to use a brute-force sliding-window search to select a region and apply the CNN model to it, but the problem with this approach is that the same object can appear in an image at different sizes and aspect ratios. Once these factors are taken into account, we end up with a huge number of candidate regions, and applying a deep CNN to all of those regions would be computationally very expensive.
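To make the scale of this problem concrete, below is a minimal sketch (not part of the R-CNN pipeline itself) that simply enumerates brute-force sliding windows at a few scales and aspect ratios; even for a modest 640×480 image this yields thousands of candidate windows before a CNN has scored a single one.

```python
# Hypothetical illustration: enumerate sliding windows over an image at a few
# scales and aspect ratios to show how quickly the number of candidates grows.
import itertools

def sliding_windows(img_w, img_h, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0), stride=16):
    """Yield (x, y, w, h) candidate windows over an img_w x img_h image."""
    for scale, ratio in itertools.product(scales, ratios):
        w = int(scale * ratio ** 0.5)   # window width for this scale/aspect ratio
        h = int(scale / ratio ** 0.5)   # window height for this scale/aspect ratio
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                yield x, y, w, h

# Roughly 6000 windows for a 640x480 image even with this modest configuration,
# and a deep CNN would have to be run on every single one of them.
print(sum(1 for _ in sliding_windows(640, 480)))
```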

Ross Girshick et al. in 2013 proposed an architecture called R-CNN (Region-based CNN) to deal with this challenge of object detection. The R-CNN architecture uses the Selective Search algorithm, which generates approximately 2000 region proposals. These 2000 region proposals are then fed to a CNN that computes CNN features. These features are then passed to an SVM model to classify the object present in each region proposal. An extra step is to run a bounding-box regressor to localize the objects present in the image more precisely.

“The original goal of R-CNN was to take an input image and produce a set of bounding boxes as output, where each bounding box contains an object and also the category (e.g., car or pedestrian) of the object.”

More recently, R-CNN has been extended to perform other computer vision tasks.

History and Development of RCNN

R-CNN is a region-based object detection algorithm developed by Girshick et al. from UC Berkeley in 2013. The following covers some of the versions of R-CNN that have been developed.

· November 2013: R-CNN. Given an input image, R-CNN begins by applying a mechanism called Selective Search to extract regions of interest (ROIs), where each ROI is a rectangle that may represent the boundary of an object in the image. Depending on the scenario, there may be as many as two thousand ROIs. After that, each ROI is fed through a neural network to produce output features. For each ROI’s output features, a collection of support-vector machine classifiers is used to determine what type of object (if any) is contained within the ROI.

· April 2015: Fast R-CNN. While the original R-CNN independently computed the neural network features on each of as many as two thousand regions of interest, Fast R-CNN runs the neural network once on the whole image. At the end of the network is a novel method called ROI Pooling, which slices out each ROI from the network’s output tensor, reshapes it, and classifies it. As in the original R-CNN, Fast R-CNN uses Selective Search to generate its region proposals.

· June 2015: Faster R-CNN. While Fast R-CNN used Selective Search to generate ROIs, Faster R-CNN integrates the ROI generation into the neural network itself.

· March 2017: Mask R-CNN. While previous versions of R-CNN focused on object detection, Mask R-CNN adds instance segmentation. Mask R-CNN also replaced ROI Pooling with a new method called ROI Align, which can represent fractions of a pixel.

· June 2019: Mesh R-CNN adds the ability to generate a 3D mesh from a 2D image.

NOTE: To understand R-CNN, we need some prior knowledge of how a Convolutional Neural Network works and of the Mean Average Precision (mAP) metric used to measure performance.

CNN: https://www.edureka.co/blog/convolutional-neural-network/

mAP: https://datanics.blogspot.com/2020/11/understanding-mean-average-precision.html?m=1

A region-based CNN consists of three modules: Region Proposal, Feature Extractor, and Classifier.

1. Region Proposals:

Region proposals are simply the smaller regions of the image that possibly contain the objects we are searching for in the input image. Given an input image, the region proposal step tries to detect different regions (~2000) of different sizes and aspect ratios. In other words, it draws multiple bounding boxes on the input image, as shown below.

Region Proposal Result

To reduce the number of region proposals, R-CNN uses a greedy algorithm called Selective Search.
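As a minimal sketch of this step, the snippet below uses OpenCV’s built-in Selective Search (available in the opencv-contrib-python package, as covered in the learnopencv tutorial linked further down); the file name input.jpg is just a placeholder.

```python
# Generate region proposals with OpenCV's Selective Search implementation.
import cv2

image = cv2.imread("input.jpg")               # placeholder input image

ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()              # switchToSelectiveSearchQuality() gives more (slower) proposals

rects = ss.process()                          # each proposal is (x, y, w, h)
print(f"{len(rects)} region proposals")

# Draw the first few proposals, similar to the figure above.
for (x, y, w, h) in rects[:100]:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 1)
cv2.imwrite("proposals.jpg", image)
```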

2. Feature Extractor: Each proposed region is passed through a CNN, and the output of its last layer (4096 features) is extracted as the feature vector, so the final output of the feature extractor is (number of proposed regions) × 4096.
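A hedged sketch of this stage follows, using a pretrained VGG16 from Keras as a stand-in for the backbone (the original paper used AlexNet with 227×227 inputs); image and rects are assumed to come from the Selective Search snippet above, and only a small subset of proposals is processed to keep the example light.

```python
# Warp each proposed region to a fixed size and extract the 4096-d output of
# the last fully connected layer of a pretrained CNN.
import cv2
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.models import Model

base = VGG16(weights="imagenet", include_top=True)
extractor = Model(inputs=base.input, outputs=base.get_layer("fc2").output)   # 4096-d layer

def region_features(image, rects, size=(224, 224)):
    crops = []
    for (x, y, w, h) in rects:
        crop = cv2.resize(image[y:y + h, x:x + w], size)        # warp to the CNN's fixed input size
        crops.append(cv2.cvtColor(crop, cv2.COLOR_BGR2RGB))     # OpenCV loads BGR; VGG16 expects RGB
    batch = preprocess_input(np.array(crops, dtype=np.float32))
    return extractor.predict(batch)

features = region_features(image, rects[:64])   # a small subset for illustration
print(features.shape)                           # (number of proposed regions, 4096)
```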

3. Classifier: Once the features are extracted, we need to classify the object inside each region. To do this, a linear SVM model is trained for classification, specifically one SVM model for each class.
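As a minimal sketch (assuming labelled training proposals are already available as train_features, an N × 4096 array, and train_labels, N class names including "background"), one linear SVM per class could be trained with scikit-learn like this; the class names are purely illustrative.

```python
# One binary linear SVM per class: "this class" vs. everything else.
import numpy as np
from sklearn.svm import LinearSVC

classes = ["car", "person"]                              # illustrative class names
svms = {}
for cls in classes:
    y = (np.asarray(train_labels) == cls).astype(int)    # 1 = this class, 0 = everything else
    svms[cls] = LinearSVC(C=1.0).fit(train_features, y)

def classify_region(feature_vector):
    """Return the best-scoring class, or 'background' if every SVM says no."""
    scores = {cls: svm.decision_function([feature_vector])[0] for cls, svm in svms.items()}
    best_cls, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_cls if best_score > 0 else "background"
```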

Selective Search: Selective Search is a greedy algorithm that combines smaller segmented regions to generate region proposals. It takes an image as input and outputs region proposals for it. Its advantage over random proposal generation is that it limits the number of proposals to approximately 2000, and these region proposals have high recall.

Algorithm Of Selective Search:

1. Generate an initial sub-segmentation of the input image using the method described by Felzenszwalb et al. in their paper “Efficient Graph-Based Image Segmentation”.

2. Recursively combine the smaller similar regions into larger ones. A greedy algorithm is used to combine similar regions into larger regions. The algorithm is written below, followed by a simplified code sketch.

Greedy Algorithm:

1. From the set of regions, choose the two that are most similar.

2. Combine them into a single, larger region.

3. Repeat the above steps for multiple iterations.
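The sketch below is a deliberately simplified, hypothetical version of this greedy grouping: each region is just a bounding box and the similarity function is a toy one, whereas real Selective Search starts from Felzenszwalb superpixels and combines colour, texture, size, and fill similarities. It only illustrates the control flow of the three steps above.

```python
# Greedily merge the most similar pair of regions until one region remains,
# recording every intermediate merge as a region proposal.

def box_union(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return (min(ax1, bx1), min(ay1, by1), max(ax2, bx2), max(ay2, by2))

def similarity(a, b):
    # Toy similarity: pairs whose combined bounding box stays small are "more similar".
    x1, y1, x2, y2 = box_union(a, b)
    return -((x2 - x1) * (y2 - y1))

def greedy_grouping(regions):
    """regions: list of (x1, y1, x2, y2) boxes from an initial segmentation."""
    proposals = list(regions)
    while len(regions) > 1:
        # 1. From the set of regions, choose the two that are most similar.
        i, j = max(((i, j) for i in range(len(regions)) for j in range(i + 1, len(regions))),
                   key=lambda ij: similarity(regions[ij[0]], regions[ij[1]]))
        # 2. Combine them into a single, larger region and keep it as a proposal.
        merged = box_union(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged)
        # 3. Repeat until everything has merged into one region.
    return proposals

print(greedy_grouping([(0, 0, 10, 10), (12, 0, 20, 10), (0, 12, 10, 20)]))
```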

Read more about Selective Search:

http://www.huppelen.nl/publications/selectiveSearchDraft.pdf

For implementations of Selective Search, please look through the following:

https://learnopencv.com/selective-search-for-object-detection-cpp-python/

https://towardsdatascience.com/step-by-step-r-cnn-implementation-from-scratch-in-python-e97101ccde55

https://towardsdatascience.com/object-localization-using-pre-trained-cnn-models-such-as-mobilenet-resnet-xception-f8a5f6a0228d

https://medium.com/analytics-vidhya/object-localization-using-keras-d78d6810d0be

https://blog.paperspace.com/object-localization-using-pytorch-1/

https://blog.paperspace.com/object-localization-pytorch-2/

https://pyimagesearch.com/2020/07/06/region-proposal-object-detection-with-opencv-keras-and-tensorflow/

https://pyimagesearch.com/2020/06/22/turning-any-cnn-image-classifier-into-an-object-detector-with-keras-tensorflow-and-opencv/

Challenges of R-CNN:

1. The Selective Search algorithm is very rigid and no learning happens in it. This sometimes leads to the generation of bad region proposals for object detection.

2. Since there are approximately 2000 candidate proposals, it takes a lot of time to train the network. Also, multiple stages need to be trained separately (the CNN architecture, the SVM model, and the bounding-box regressor), which makes it very slow to implement.

3. R-CNN cannot be used in real time because it takes approximately 50 seconds to test an image with the bounding-box regressor.

4. Since the feature maps of all the region proposals need to be saved, the amount of disk space required during training also increases.

References:

https://www.geeksforgeeks.org/r-cnn-region-based-cnns/

https://en.wikipedia.org/wiki/Region_Based_Convolutional_Neural_Networks

https://medium.com/analytics-vidhya/region-based-convolutional-neural-network-rcnn-b68ada0db871

https://www.geeksforgeeks.org/selective-search-for-object-detection-r-cnn/

Image Resources:

https://miro.medium.com/max/1280/1*03Is1NmjgaZkXwxCmw9I_g.jpeg

At the end, I express my heartfelt thanks to all the authors. Thank you all!
