Object Detection in Deep Learning (Part 2)

Published in

AI³ | Theory, Practice, Business

Sep 22, 2019

R-CNN & Fast R-CNN

Following Part 1, an object detection algorithm has to draw one or more bounding boxes, each representing a different object of interest within the image, and you do not know beforehand how many there will be.

A direct (brute-force) approach to this problem would be to take different regions of interest from the image and use a CNN to classify whether the object is present within each region. The problem is that the objects of interest can appear at different spatial locations within the image and with different aspect ratios. Hence, you would have to select a huge number of regions, and the computation quickly becomes intractable. Therefore, algorithms like R-CNN and YOLO have been developed to find these occurrences, and to find them fast.
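To get a feel for why the brute-force approach blows up, here is a minimal sketch that enumerates sliding windows over an image at several scales and aspect ratios. The image size, scales, aspect ratios and stride are illustrative assumptions, not values taken from any particular detector; every window produced would need its own CNN classification pass.

```python
import math
from typing import Iterator, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def sliding_windows(img_w: int, img_h: int,
                    scales: Sequence[int],
                    aspect_ratios: Sequence[float],
                    stride: int) -> Iterator[Box]:
    """Enumerate every candidate window a brute-force detector would classify."""
    for scale in scales:
        for ratio in aspect_ratios:
            # Keep the window area close to scale**2 while varying its shape.
            w = round(scale * math.sqrt(ratio))
            h = round(scale / math.sqrt(ratio))
            if w > img_w or h > img_h:
                continue  # this window does not fit inside the image
            for y in range(0, img_h - h + 1, stride):
                for x in range(0, img_w - w + 1, stride):
                    yield (x, y, w, h)


if __name__ == "__main__":
    # Illustrative settings (assumptions): 640x480 image, 4 scales, 3 ratios, stride 8.
    windows = list(sliding_windows(640, 480, scales=[32, 64, 128, 256],
                                   aspect_ratios=[0.5, 1.0, 2.0], stride=8))
    print(f"{len(windows)} windows, each needing its own CNN forward pass")
```

With these settings the smallest square windows alone already exceed 2000 positions. Halving the stride roughly quadruples the count, and every extra scale or aspect ratio adds another full sweep over the image.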

R-CNN

The idea proposed in [1] is that, instead of trying to classify a huge number of regions, you work with only about 2000 regions per image. These 2000 region proposals are generated using the selective search algorithm.
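Selective search builds its proposals bottom-up: it starts from an over-segmentation of the image and repeatedly merges the two most similar neighbouring regions, keeping the bounding box of every region it ever forms. The sketch below shows only that hierarchical-grouping loop, under simplifying assumptions: the initial segments are given as hypothetical `Region` records (bounding box, pixel count, normalised colour histogram), two regions count as neighbours when their boxes touch or overlap, and similarity combines only colour and size. The real algorithm uses a graph-based initial segmentation, more similarity measures and several colour spaces.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple

BBox = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class Region:
    """Hypothetical initial segment: bounding box, pixel count, colour histogram."""
    bbox: BBox
    size: int
    hist: List[float]  # normalised so the bins sum to 1


def boxes_touch(a: BBox, b: BBox) -> bool:
    # Crude adjacency test (assumption): the two boxes overlap or share an edge.
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def similarity(a: Region, b: Region, image_size: int) -> float:
    colour = sum(min(p, q) for p, q in zip(a.hist, b.hist))  # histogram intersection
    size = 1.0 - (a.size + b.size) / image_size              # small regions merge first
    return colour + size


def merge(a: Region, b: Region) -> Region:
    bbox = (min(a.bbox[0], b.bbox[0]), min(a.bbox[1], b.bbox[1]),
            max(a.bbox[2], b.bbox[2]), max(a.bbox[3], b.bbox[3]))
    size = a.size + b.size
    # Size-weighted average keeps the merged histogram normalised.
    hist = [(a.size * p + b.size * q) / size for p, q in zip(a.hist, b.hist)]
    return Region(bbox, size, hist)


def hierarchical_grouping(regions: List[Region], image_size: int) -> List[BBox]:
    """Greedily merge the most similar neighbours; every region ever formed is a proposal."""
    active = list(regions)
    proposals = [r.bbox for r in active]
    while len(active) > 1:
        pairs = [(i, j) for i, j in combinations(range(len(active)), 2)
                 if boxes_touch(active[i].bbox, active[j].bbox)]
        if not pairs:
            break  # the remaining regions are not adjacent to each other
        i, j = max(pairs, key=lambda p: similarity(active[p[0]], active[p[1]], image_size))
        merged = merge(active[i], active[j])
        active = [r for k, r in enumerate(active) if k not in (i, j)] + [merged]
        proposals.append(merged.bbox)
    return proposals


if __name__ == "__main__":
    # Four hypothetical segments tiling a 100x100 image: bright top, dark bottom.
    segments = [Region((0, 0, 50, 50), 2500, [1.0, 0.0]),
                Region((50, 0, 100, 50), 2500, [0.9, 0.1]),
                Region((0, 50, 50, 100), 2500, [0.0, 1.0]),
                Region((50, 50, 100, 100), 2500, [0.1, 0.9])]
    for box in hierarchical_grouping(segments, image_size=100 * 100):
        print(box)
```

On this toy input the proposals are the four original boxes, the top and bottom halves (the similar-coloured pairs merge first), and finally the whole image. On a real photo the same loop yields thousands of boxes at many scales, which is where the roughly 2000 proposals per image come from.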

Each of these 2000 candidate region proposals is warped to a fixed size and fed into a convolutional neural network that produces a 4096-dimensional feature vector. The CNN acts as a feature extractor: its output dense layer holds the features extracted from the region, and these features are fed into a classifier (class-specific linear SVMs in the original paper) that detects whether an object is present within that candidate region proposal. In addition, the algorithm predicts four offset values that refine the bounding box and make it more precise.
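In the R-CNN paper, the four offsets are defined relative to the proposal box: the shift of the box centre is normalised by the proposal's width and height, and the change in width and height is expressed in log space. A minimal sketch of encoding a ground-truth box into these regression targets and decoding predicted offsets back into a refined box, with boxes written as (centre x, centre y, width, height) and made-up example numbers:

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]      # (center_x, center_y, width, height)
Offsets = Tuple[float, float, float, float]  # (t_x, t_y, t_w, t_h)


def encode(proposal: Box, target: Box) -> Offsets:
    """Regression targets that move `proposal` onto `target`."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = target
    return ((gx - px) / pw,      # horizontal shift, normalised by proposal width
            (gy - py) / ph,      # vertical shift, normalised by proposal height
            math.log(gw / pw),   # log-space width scaling
            math.log(gh / ph))   # log-space height scaling


def decode(proposal: Box, offsets: Offsets) -> Box:
    """Apply predicted offsets to a proposal to obtain the refined box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = offsets
    return (px + pw * tx, py + ph * ty, pw * math.exp(tw), ph * math.exp(th))


if __name__ == "__main__":
    proposal = (50.0, 50.0, 40.0, 20.0)
    ground_truth = (55.0, 48.0, 50.0, 25.0)
    t = encode(proposal, ground_truth)
    print(t)
    print(decode(proposal, t))  # recovers ground_truth up to floating-point error
```

Normalising by the proposal size makes the targets independent of how large the box is, and the log keeps width and height positive after decoding.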

The main problem with this technique is that it still takes a huge amount of time to train the network, because the CNN has to process and classify 2000 region proposals per image.

Fast R-CNN

Fast R-CNN is a faster object detection algorithm that addresses this problem.

The approach is similar to R-CNN. However, instead of feeding each region proposal to the CNN, we feed the whole input image to the CNN to generate a convolutional feature map. The region proposals are then projected onto this feature map, and an RoI pooling layer reshapes each of them into a fixed size so that it can be fed into a fully connected layer. From the resulting RoI feature vector, a softmax layer predicts the class of the proposed region and a regression layer predicts the offset values for its bounding box.
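The RoI pooling step can be sketched in NumPy: each proposal, projected onto the shared feature map, is divided into a fixed grid of roughly equal bins and each bin is max-pooled, so every RoI yields the same output shape regardless of its size. The 7×7 output matches the setting used with VGG16 in the Fast R-CNN paper; the stride-16 `spatial_scale` and the exact rounding scheme are simplifying assumptions of this sketch.

```python
import numpy as np


def roi_max_pool(feature_map: np.ndarray, roi, output_size=(7, 7),
                 spatial_scale: float = 1.0 / 16) -> np.ndarray:
    """Max-pool one region of interest into a fixed (C, out_h, out_w) array.

    feature_map: (C, H, W) array computed once for the whole image.
    roi: (x1, y1, x2, y2) in input-image pixels.
    spatial_scale: image-to-feature-map factor (1/16 assumes a stride-16 backbone).
    """
    channels, fh, fw = feature_map.shape
    out_h, out_w = output_size
    # Project the RoI onto the feature map (end coordinates treated as inclusive).
    x1, y1, x2, y2 = (round(c * spatial_scale) for c in roi)
    x1, y1 = min(max(x1, 0), fw - 1), min(max(y1, 0), fh - 1)
    x2, y2 = max(min(x2 + 1, fw), x1 + 1), max(min(y2 + 1, fh), y1 + 1)
    region = feature_map[:, y1:y2, x1:x2]
    rh, rw = region.shape[1:]
    # Split the region into an out_h x out_w grid of roughly equal bins.
    ys = np.linspace(0, rh, out_h + 1)
    xs = np.linspace(0, rw, out_w + 1)
    pooled = np.empty((channels, out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        y_lo = int(np.floor(ys[i]))
        y_hi = max(int(np.ceil(ys[i + 1])), y_lo + 1)  # every bin covers >= 1 cell
        for j in range(out_w):
            x_lo = int(np.floor(xs[j]))
            x_hi = max(int(np.ceil(xs[j + 1])), x_lo + 1)
            pooled[:, i, j] = region[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return pooled


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for conv features of a roughly 600x800 image (512 channels, stride 16).
    fmap = rng.standard_normal((512, 38, 50)).astype(np.float32)
    for roi in [(0, 0, 160, 160), (100, 200, 700, 500)]:
        print(roi, "->", roi_max_pool(fmap, roi).shape)  # (512, 7, 7) for every RoI
```

The feature map is computed once and only sliced per RoI, which is exactly the saving Fast R-CNN makes over running the CNN on every proposal.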

Fast R-CNN is faster than R-CNN because you don’t have to feed 2000 region proposals to the convolutional neural network every time. Instead, the convolution operation is done only once per image, and all the region proposals share the resulting feature map.

The drawback of this method shows up at test time: when the time spent generating region proposals is included, Fast R-CNN is significantly slower than when that time is left out, so the region proposal step becomes the bottleneck.
Both of the above algorithms (R-CNN and Fast R-CNN) use selective search to find the region proposals, and selective search is a slow, time-consuming process that limits the performance of the network.

Faster R-CNN

Similar to Fast R-CNN, the image is provided as input to a convolutional network, which produces a convolutional feature map. Instead of running the selective search algorithm to identify the region proposals, a separate network, the Region Proposal Network (RPN), predicts the region proposals from the feature map. The predicted region proposals are then reshaped by an RoI pooling layer, and the pooled features are used to classify the content of each proposed region and to predict the offset values for its bounding box.
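In the Faster R-CNN paper, the separate proposal network is the Region Proposal Network (RPN). It slides over the feature map and, at every position, scores and refines k reference boxes called anchors (k = 9 in the paper: 3 scales × 3 aspect ratios). A minimal NumPy sketch of generating those anchors, assuming a stride-16 backbone and the paper's scales of 128, 256 and 512 pixels; the RPN would then predict an objectness score and four box offsets for each anchor.

```python
import numpy as np


def make_anchors(feat_h: int, feat_w: int, stride: int = 16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)) -> np.ndarray:
    """Return (feat_h * feat_w * k, 4) anchors as (x1, y1, x2, y2) in image pixels."""
    # k base shapes centred at the origin: area == scale**2, height / width == ratio.
    base = []
    for scale in scales:
        for ratio in ratios:
            w = scale / np.sqrt(ratio)
            h = scale * np.sqrt(ratio)
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)                                    # (k, 4)
    # Centre of every feature-map cell, expressed in input-image pixels.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)                             # each (feat_h, feat_w)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    # Broadcast: every cell centre gets all k base shapes.
    return (shifts + base[None, :, :]).reshape(-1, 4)


if __name__ == "__main__":
    anchors = make_anchors(38, 50)
    print(anchors.shape)  # (17100, 4): 38 * 50 positions * 9 anchors
```

Because the anchors are fixed, the RPN only has to learn which anchors contain an object and how to nudge them, which is much cheaper than running selective search.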

YOLO: Real-Time Object Detection

(You Only Look Once)

The algorithm applies a single neural network to the entire image. The network divides the image into regions and predicts bounding boxes and probabilities for each region, and the bounding boxes are weighted by the predicted probabilities. In YOLOv3, the objectness score of each box is predicted using logistic regression, and independent logistic classifiers are used for class prediction.
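Using independent logistic classifiers means each class score goes through its own sigmoid, so the scores do not have to sum to 1 and one box can carry overlapping labels (for example "person" and "woman"); a softmax would force the classes to compete. A small NumPy comparison, using made-up logits for three hypothetical classes, which also weights the class probabilities by the box's objectness score:

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()


# Made-up network outputs for one predicted box (illustrative values only).
objectness_logit = 2.0
class_logits = np.array([3.0, 2.5, -4.0])  # e.g. "person", "woman", "car"

objectness = sigmoid(objectness_logit)     # logistic regression on the box score
independent = sigmoid(class_logits)        # one logistic classifier per class
competing = softmax(class_logits)          # what a softmax head would give instead

# Final per-class scores: class probabilities weighted by the box's objectness.
scores = objectness * independent
print("sigmoid:", independent.round(3), "sum =", independent.sum().round(3))
print("softmax:", competing.round(3), "sum =", competing.sum().round(3))
print("scores :", scores.round(3))
```

With sigmoids both "person" and "woman" can score high at the same time, while the softmax splits the probability mass between them.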

YOLO divides the input image into an S×S grid. The grid cell that contains the center of an object is responsible for detecting that object, and each grid cell predicts only one object.

For each grid cell:

it predicts B bounding boxes, and each box has one box confidence score;
it detects only one object, regardless of the number of boxes B;
it predicts C conditional class probabilities (one per class, giving the likelihood of that object class).
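Put together, the grid produces an S×S×(B·5 + C) output: each of the B boxes has four coordinates plus a confidence, and the C class probabilities are shared by the whole cell. The original YOLO paper uses S = 7, B = 2 and C = 20 on PASCAL VOC, giving a 7×7×30 tensor. A minimal NumPy sketch that turns such an output into class-specific confidence scores (box confidence × conditional class probability); the per-cell memory layout is an assumption made for readability, not necessarily how a given implementation stores it.

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (YOLOv1 settings for PASCAL VOC)


def class_specific_scores(output: np.ndarray) -> np.ndarray:
    """Turn an (S, S, B*5 + C) prediction into (S, S, B, C) class-specific confidences.

    Assumed per-cell layout: B blocks of (x, y, w, h, confidence), then C class probabilities.
    """
    if output.shape != (S, S, B * 5 + C):
        raise ValueError(f"unexpected shape {output.shape}")
    boxes = output[..., : B * 5].reshape(S, S, B, 5)
    box_conf = boxes[..., 4]              # (S, S, B): Pr(object) * IoU
    class_prob = output[..., B * 5 :]     # (S, S, C): Pr(class | object), one set per cell
    # Every box in a cell shares that cell's class probabilities (one object per cell).
    return box_conf[..., :, None] * class_prob[..., None, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pred = rng.random((S, S, B * 5 + C))
    scores = class_specific_scores(pred)
    print(pred.shape, "->", scores.shape)  # (7, 7, 30) -> (7, 7, 2, 20)
    row, col, box, cls = np.unravel_index(scores.argmax(), scores.shape)
    print(f"best: cell ({row}, {col}), box {box}, class {cls}")
```

Sharing one set of class probabilities per cell is what limits each cell to a single object, no matter how many boxes B it predicts.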


Further reading and references:

Written by Amin Ag