Understanding the roots of Object Detection

RCNN
Region-based Convolutional Neural Network (RCNN) is used for object detection. RCNN identifies the objects in an image and outputs bounding box coordinates to localize them.
According to the original paper (https://arxiv.org/abs/1311.2524):
Our object detection system consists of three modules. The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector. The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region. The third module is a set of class specific linear SVMs.
Hence, in the case of RCNN, many regions (approximately 2000) are first proposed using selective search, and then a CNN is run over all of them, converting each region into a fixed-length embedding/encoding.
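For illustration, here is a minimal sketch of generating such proposals with OpenCV's built-in selective search. This needs the opencv-contrib-python package, and the image path is a placeholder; the original paper used its own implementation, so this is only a stand-in:

```python
import cv2

# Selective search as shipped in opencv-contrib-python (a stand-in for the
# implementation used in the paper).
img = cv2.imread("example.jpg")  # placeholder path; use any test image

ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()  # a slower "quality" mode also exists

rects = ss.process()      # array of (x, y, w, h) candidate regions
proposals = rects[:2000]  # RCNN keeps roughly 2000 proposals per image
print(len(rects), "proposals generated")
```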
At the end there is a set of SVMs to classify each region, i.e. whether it contains an object or simply background, and if it is an object, its class too.
Once it is confirmed that the region contains an object, it is passed into a regressor, which predicts a corrected set of coordinates so that the bounding box fits the object better.
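The regressor does not predict raw coordinates; it predicts offsets relative to the proposal. Here is a small NumPy sketch of the parameterization used in the paper (scale-invariant center shifts, log-space size corrections):

```python
import numpy as np

def refine_box(proposal, t):
    """Apply predicted offsets t = (tx, ty, tw, th) to a proposal box.

    Uses the parameterization from the R-CNN paper: center shifts are
    scaled by the proposal's width/height, sizes are corrected in log space.
    """
    px, py, pw, ph = proposal  # proposal as (center_x, center_y, w, h)
    tx, ty, tw, th = t
    gx = pw * tx + px          # shifted center
    gy = ph * ty + py
    gw = pw * np.exp(tw)       # rescaled width / height
    gh = ph * np.exp(th)
    return np.array([gx, gy, gw, gh])

# e.g. nudge a 100x100 proposal slightly right and make it 10% wider
print(refine_box(np.array([50., 50., 100., 100.]),
                 np.array([0.05, 0.0, np.log(1.1), 0.0])))
```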
Note that RCNN is a three-module pipeline, hence both training and testing are quite lengthy.
Q — What happens when multiple overlapping boxes are predicted for the same object at test time (with varying amounts of confidence)?
A — Non-Max Suppression.
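Here is a minimal NumPy version of greedy non-max suppression: repeatedly keep the highest-scoring box and discard every remaining box that overlaps it beyond an IoU threshold:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-max suppression. boxes: (N, 4) as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the top box against all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # keep only the boxes that do not overlap the winner too much
        order = order[1:][iou <= iou_thresh]
    return keep

boxes = np.array([[10, 10, 100, 100],
                  [12, 12, 98, 105],
                  [200, 200, 300, 300]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # -> [0, 2]; the near-duplicate box 1 is suppressed
```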
Fast RCNN
RCNN is quite an intuitive pipeline…
- Predict ~2000 regions for every image.
- Every region in every image is passed through a CNN.
- After the feature/embedding is generated, it is run through per-class SVMs to predict the class (on the basis of class scores, IoU and Non-Max Suppression).
- After the class is predicted, the region is passed through a regressor to correct the bounding box fit.
In short, the pipeline is as follows…
Image -> ~2k proposals (using the selective search algorithm) -> run a CNN on every proposal -> SVMs to score the object class predictions -> regressor to fit the bounding box
The process is intuitive but quite slow, since it runs three different modules separately. Running a CNN over so many regions (~2k) proves to be the bottleneck for the whole pipeline.
Fast RCNN exploits the fact that computation for many of the ~2000 regions can be shared, hence it swaps the order of operations in RCNN: the CNN runs once over the whole image, and the regions are then cut out of the resulting feature map.
The input to the network is the image along with a set of regions of interest (still generated using selective search, same as RCNN), and the output is the predicted classes and bounding boxes. Citing from the original paper:
A Fast R-CNN network takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map.
First a feature map is generated using convolutional layers, and then an RoI max pooling layer is applied. It is needed because the regions are of different sizes and therefore have to be resized to the same dimensions to make them fit the next set of FC layers.
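An illustrative sketch using torchvision's roi_pool (the ResNet-18 backbone and all sizes here are arbitrary choices, not the ones from the paper). Note that the expensive CNN forward pass happens once per image, not once per region; this is exactly the computation sharing mentioned above:

```python
import torch
import torchvision

# Truncated ResNet-18 as a stand-in backbone; keeps only the conv layers.
resnet = torchvision.models.resnet18(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

image = torch.randn(1, 3, 512, 512)  # dummy input image
fmap = backbone(image)               # one shared forward pass -> (1, 512, 16, 16)

# Two example proposals in image coordinates: (batch_idx, x1, y1, x2, y2)
rois = torch.tensor([[0.,  10.,  10., 200., 300.],
                     [0., 120.,  40., 500., 480.]])

# RoI pooling maps each (differently sized) region to a fixed 7x7 grid,
# so the following FC layers always see the same input size.
pooled = torchvision.ops.roi_pool(fmap, rois, output_size=(7, 7),
                                  spatial_scale=16 / 512)  # stride-32 backbone
print(pooled.shape)                  # torch.Size([2, 512, 7, 7])
```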
In the end the FC layers branch out into a softmax layer for class prediction and a regressor for bounding box coordinate prediction.
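A minimal sketch of such a two-branch head. The layer sizes are illustrative, loosely following VGG-style 4096-d FC layers:

```python
import torch
import torch.nn as nn

class FastRCNNHead(nn.Module):
    """Illustrative two-branch head: a shared FC trunk that splits into
    class scores and per-class box offsets."""
    def __init__(self, in_features=512 * 7 * 7, num_classes=21):  # 20 classes + background
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_features, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
        )
        self.cls_score = nn.Linear(4096, num_classes)      # softmax logits per class
        self.bbox_pred = nn.Linear(4096, num_classes * 4)  # (tx, ty, tw, th) per class

    def forward(self, pooled_rois):
        x = self.trunk(pooled_rois)
        return self.cls_score(x), self.bbox_pred(x)

head = FastRCNNHead()
pooled = torch.randn(2, 512, 7, 7)         # stand-in for RoI-pooled features
cls_logits, box_deltas = head(pooled)
print(cls_logits.shape, box_deltas.shape)  # torch.Size([2, 21]) torch.Size([2, 84])
```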
Hence, compared to RCNN, Fast RCNN has quite a few differences…
- The input is passed through a CNN first, as compared to RCNN where a CNN is run on every region.
- An RoI pooling layer is used, so the network becomes faster and supports end-to-end training (as compared to the modular training in RCNN).
- A softmax classifier is used, as compared to the SVMs in RCNN.
Faster RCNN
Faster RCNN is essentially Fast RCNN, except that it uses a CNN to predict the candidate regions in the image, i.e. it no longer depends on the selective search algorithm for region proposals but uses a Region Proposal Network (RPN).
The rest of the steps are the same as in Fast RCNN; the only difference is the use of the RPN for predicting the regions in the image.
The RPN is a small CNN that slides over the feature map and, at every spatial location, evaluates k anchor boxes of different sizes and aspect ratios. For each anchor it outputs an objectness score and refined bounding box coordinates; during training, the anchors are labeled by comparing them with the ground truth boxes.
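A NumPy sketch of generating the k anchors at every feature-map location. Three scales times three aspect ratios gives k = 9, as in the paper; the stride and scale values below are the commonly used ones for a VGG-style backbone, so adjust them for yours:

```python
import numpy as np

def generate_anchors(fmap_h, fmap_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """All k = len(scales) * len(ratios) anchors at every feature-map cell,
    returned as (fmap_h * fmap_w * k, 4) boxes in image coordinates."""
    anchors = []
    for y in range(fmap_h):
        for x in range(fmap_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # cell center in the image
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)  # same area, varying shape
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

a = generate_anchors(38, 50)  # e.g. a ~600x800 image at stride 16
print(a.shape)                # (17100, 4) -> 38 * 50 * 9 anchors
```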
This is the end of this post; next time I will try to explain the YOLO model for object detection.