Regions with CNN features (R-CNN) was proposed by Girshick et al. in 2013 and fundamentally changed the object detection field. By combining a CNN with SVMs, Girshick et al. achieved very good results on VOC 2012, reaching a mean average precision (mAP) of 53.3%.
This story introduces R-CNN, while later stories in the series will cover Faster R-CNN and Mask R-CNN, which were also introduced by Girshick and other team members. Besides these, there are other object detection approaches such as Single Shot MultiBox Detector (SSD) and You Only Look Once (YOLO).
The objective of image classification is to assign a category to the whole image. Object detection, on the other hand, involves not only classification but also localization, which Girshick et al. treated as a regression problem.
This story discusses R-CNN (Girshick et al., 2013) and covers the following:
- Architecture of R-CNN
- Selective Search
- Feature Extraction
Architecture of R-CNN
Given an image, R-CNN uses selective search to generate around 2000 region proposals, which are regions that may contain an object. Each proposal is warped to a 227 x 227 RGB image to fit the CNN input. The CNN layers extract features, which are then passed to multiple binary classifiers to determine the class of each region.
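The stages described above can be sketched as a single loop. This is only a rough outline under stated assumptions: `propose_regions`, `warp`, and `extract_features` are hypothetical placeholders for selective search, warping, and the CNN, and `svms` is assumed to hold one linear binary classifier per class as a `(weights, bias)` pair.

```python
def rcnn_detect(image, propose_regions, warp, extract_features, svms):
    """Run the R-CNN stages on one image (illustrative sketch only).

    All inputs besides `image` are hypothetical placeholders:
      propose_regions(image) -> list of boxes (x1, y1, x2, y2)
      warp(image, box)       -> fixed-size crop of the region
      extract_features(crop) -> list of floats (the CNN feature vector)
      svms: dict mapping class name -> (weights, bias) of a linear binary SVM
    Returns a list of (box, class_name, score), one per box and class.
    """
    detections = []
    for box in propose_regions(image):
        features = extract_features(warp(image, box))
        for name, (weights, bias) in svms.items():
            # Linear SVM decision value: w . f + b
            score = sum(w * f for w, f in zip(weights, features)) + bias
            detections.append((box, name, score))
    return detections
```

Every proposal goes through the feature extractor once and is then scored by every class-specific classifier, which is why the number of proposals dominates the running time.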
Selective Search
Selective search is applied to find the region proposals for classification. It is designed to capture objects at all possible scales while keeping the computational cost low. Around 2000 regions are selected from each image.
The design considerations of selective search are:
- Capture all scales
- Fast to compute
To achieve that, the Hierarchical Grouping Algorithm is used to group similar regions in a bottom-up manner. The features used to calculate similarity include color, texture, region size, and region fill.
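To make the similarity measures more concrete, here is a minimal sketch of three of them, following the definitions as I understand them from Uijlings et al. (2012): histogram intersection for color (texture is handled the same way on a different histogram), and simple area ratios for size and fill. The function names, the box format `(x1, y1, x2, y2)`, and the use of pixel counts are assumptions made for illustration.

```python
def color_similarity(hist_a, hist_b):
    """Histogram intersection of two L1-normalized histograms (1.0 = identical)."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


def size_similarity(size_a, size_b, image_size):
    """Close to 1 when both regions are small, so small regions merge early."""
    return 1.0 - (size_a + size_b) / image_size


def fill_similarity(box_a, box_b, size_a, size_b, image_size):
    """Close to 1 when the two regions fit tightly inside their joint bounding box."""
    x1, y1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    x2, y2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    joint_box_area = (x2 - x1) * (y2 - y1)
    return 1.0 - (joint_box_area - size_a - size_b) / image_size
```

The individual measures are then combined into one similarity score for each pair of neighboring regions, for example by summing them.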
Hierarchical Grouping Algorithm
It is a greedy search for region proposals. The procedure is:
- Obtain the initial regions by segmenting the image
- Calculate the similarity of neighboring regions
- Group similar regions into the same bucket
- Combine the most similar regions within the same bucket (repeat this step)
Different attributes are calculated to measure the similarity. Possible attributes include light intensity, shading, size, etc.
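The grouping procedure above can be sketched in a few lines. This is a simplified illustration, not the original implementation: regions are reduced to a bounding box and a pixel count, the initial segmentation and neighbor pairs are assumed to be given, and the similarity is a toy size-only measure so that the example stays self-contained.

```python
def union_box(a, b):
    """Smallest box (x1, y1, x2, y2) covering both boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def hierarchical_grouping(boxes, sizes, adjacent_pairs, image_size):
    """Greedy bottom-up grouping.

    boxes: bounding boxes of the initial regions
    sizes: pixel counts of the initial regions
    adjacent_pairs: (i, j) index pairs of neighboring initial regions
    Returns the boxes of all regions, initial and merged, as proposals.
    """
    boxes, sizes = list(boxes), list(sizes)

    def sim(i, j):
        # Toy similarity (an assumption): prefer merging small regions first.
        return 1.0 - (sizes[i] + sizes[j]) / image_size

    sims = {tuple(sorted(p)): sim(*p) for p in adjacent_pairs}
    while sims:
        i, j = max(sims, key=sims.get)           # most similar neighboring pair
        boxes.append(union_box(boxes[i], boxes[j]))
        sizes.append(sizes[i] + sizes[j])
        new = len(boxes) - 1
        neighbors = set()
        for pair in [p for p in sims if i in p or j in p]:
            neighbors.update(pair)               # remember who touched i or j
            del sims[pair]                       # drop similarities of merged regions
        neighbors -= {i, j}
        for n in neighbors:
            sims[(n, new)] = sim(n, new)         # similarities for the new region
    return boxes
```

Because every merged region is kept as a proposal, regions at all scales end up in the output, from the small initial segments to the region covering everything that is connected.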
Feature Extraction
A CNN is chosen to perform the feature extraction. As mentioned before, selective search aims to capture all regions, which implies that the proposals come in different scales and aspect ratios. To fit the CNN, every region is warped to a 227 x 227 RGB image. A 4096-dimensional feature vector is then extracted via 5 convolutional layers and 2 fully connected layers.
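Resizing every region to the fixed 227 x 227 input means stretching it regardless of its aspect ratio. A minimal sketch of that step is shown below, assuming the image is a plain list of pixel rows and using nearest-neighbor sampling. The function name and box format are hypothetical, and a real pipeline would more likely use an image library's resize function.

```python
def warp_region(image, box, out_size=227):
    """Crop box (x1, y1, x2, y2) from image and stretch it to out_size x out_size.

    image: list of rows, each row a list of pixels (e.g. RGB tuples)
    The aspect ratio is deliberately ignored, so the region is distorted.
    """
    x1, y1, x2, y2 = box
    width, height = x2 - x1, y2 - y1
    warped = []
    for r in range(out_size):
        src_y = y1 + r * height // out_size      # nearest source row
        row = [image[src_y][x1 + c * width // out_size] for c in range(out_size)]
        warped.append(row)
    return warped
```

Whatever the shape of the proposal, the output always has the same size, so every region can be fed to the same network.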
Every region is classified by multiple binary SVM classifiers, one per class. Greedy non-maximum suppression is then applied for each class: a region is rejected if it has a high intersection-over-union (IoU) overlap with a higher-scoring selected region.
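A minimal sketch of IoU and greedy non-maximum suppression for a single class follows. The box format `(x1, y1, x2, y2)` and the default threshold of 0.3 are assumptions chosen for illustration, not values taken from this story.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.3):
    """Return indices of the boxes kept for one class, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the third box does not overlap anything and is kept. In a full detector this would be run once per class on that class's scores.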
- There are some drawbacks to using selective search to identify region proposals. Time consumption is one of the main issues, since it produces a very large number of region proposals (~2k) and all of them have to pass through the CNN model. It takes around 53 seconds for 1 picture.
- There are 3 models in total: the CNN model, the SVM classification model, and the bounding box regression model. It is a challenge to train 3 models separately.
I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics. Feel free to connect with me on LinkedIn or follow me on Medium or GitHub.
- R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. 2013
- J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, and A.W.M. Smeulders. Selective Search for Object Recognition. 2012