Object Detection: A journey from R-CNN to Mask R-CNN and YOLO

Published in

Augmented AI

5 min read · Feb 6, 2023

Before going into the details of R-CNN, Fast R-CNN, Faster R-CNN and Mask R-CNN, you might be wondering why we can't simply use a Convolutional Neural Network (CNN) for object detection. CNNs give very good results for classification, such as deciding whether an image contains a cat or a dog, and they also perform well on multi-class classification. In object detection, however, we also need to figure out where the object is in the image. So object detection is image classification plus localization. One way to use a CNN for object detection is to divide the image into small blocks and classify each block. But objects come in different sizes, so the blocks have to be small and numerous to cover them all. Dividing the image into, say, a 1000 x 1000 grid means classifying a million blocks, which is computationally expensive.
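
To get a feel for why an exhaustive search is expensive, the sketch below counts how many crops a classifier would have to score when windows of several sizes are slid over one image. The image size, window sizes and stride are illustrative assumptions, not values taken from any particular detector.

```python
def count_windows(image_w, image_h, window_sizes, stride):
    """Count the square crops produced by sliding each window size over the image."""
    total = 0
    for size in window_sizes:
        if size > image_w or size > image_h:
            continue  # this window does not fit inside the image
        steps_x = (image_w - size) // stride + 1
        steps_y = (image_h - size) // stride + 1
        total += steps_x * steps_y
    return total


# Hypothetical setup: a 1000 x 1000 image, three window sizes, a stride of 8 pixels.
n_windows = count_windows(1000, 1000, window_sizes=[64, 128, 256], stride=8)
print(n_windows)  # 34860
```

Under these assumed settings the classifier would have to score 34,860 crops for a single image, which is why detectors look for ways to examine far fewer candidate regions.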

Region based Convolutional Neural Networks (R-CNN):

In 2014, Ross Girshick et al. came up with the idea of R-CNN. The goal of R-CNN is to take an image and correctly identify where the objects are via bounding boxes.

Instead of a grid, R-CNN divides the image into a number of boxes/regions. It uses the selective search algorithm to extract about 2000 regions from the image, called region proposals.

In R-CNN, we identify different regions in the image and then pass each of them to the feature extractor.

R-CNN is based on the following steps:

Generate a set of region proposals (candidate bounding boxes) using the selective search algorithm.
Warp the image crop inside each bounding box to a fixed size and run it through a pretrained AlexNet to extract features.
Apply an SVM to the features to decide which object, if any, the bounding box contains.
Once the object has been classified, run the bounding box through a linear regression model to output tighter coordinates for the box.
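
The last step, refining a box with regression outputs, can be sketched as follows. The parametrization (center shifts scaled by the proposal size, width and height scaled in log space) follows the one described in the R-CNN paper. The function name and the example numbers are illustrative assumptions, and the regression model that would produce the deltas is not shown.

```python
import math


def apply_box_deltas(proposal, deltas):
    """Refine a proposal (x1, y1, x2, y2) with regression outputs (dx, dy, dw, dh)."""
    x1, y1, x2, y2 = proposal
    dx, dy, dw, dh = deltas
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h

    # Shift the center by a fraction of the proposal size.
    new_cx = cx + dx * w
    new_cy = cy + dy * h
    # Scale width and height in log space so they always stay positive.
    new_w = w * math.exp(dw)
    new_h = h * math.exp(dh)

    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


# Hypothetical deltas: move right by 10% of the width and halve the height.
refined = apply_box_deltas((100, 100, 200, 300), (0.1, 0.0, 0.0, math.log(0.5)))
print(refined)  # approximately (110, 150, 210, 250)
```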

Primary Challenge with R-CNN

Very slow, as it has to run the CNN on and classify about 2000 regions per image.
It cannot be used in real time, as it takes around 47 seconds for each test image.

Fast R-CNN:

In 2015, Ross Girshick, the lead author of R-CNN, came up with Fast R-CNN to address the drawbacks of R-CNN.

In Fast R-CNN, the whole input image is passed through the Convolutional Neural Network (feature extractor) once to generate a convolutional feature map. The region proposals are then located on this feature map instead of being cropped from the image.

The reason why Fast R-CNN is faster than R-CNN is that we don't need to feed 2000 region proposals through the Convolutional Neural Network one by one. Instead, the convolution operation is done only once per image to generate the feature map. The selective search algorithm still extracts about 2000 region proposals from the image, but each proposal is simply projected onto the shared feature map and pooled to a fixed size (RoI pooling) before classification and box regression.
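
Fast R-CNN turns each proposal's patch of the shared feature map into a fixed-size grid with an RoI pooling layer. Below is a minimal single-channel sketch of that idea in plain Python. The function names, the tiny feature map and the 2 x 2 output size are assumptions chosen for illustration. A real layer works on multi-channel tensors and first maps image coordinates to feature-map coordinates.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool the region roi = (x1, y1, x2, y2), given in feature-map cells
    with x2/y2 exclusive, into an out_size x out_size grid."""
    x1, y1, x2, y2 = roi
    roi_w, roi_h = x2 - x1, y2 - y1
    pooled = []
    for i in range(out_size):
        # Row range of bin i; floor/ceil bounds keep every bin non-empty.
        ys = y1 + (i * roi_h) // out_size
        ye = y1 + ceil_div((i + 1) * roi_h, out_size)
        row = []
        for j in range(out_size):
            xs = x1 + (j * roi_w) // out_size
            xe = x1 + ceil_div((j + 1) * roi_w, out_size)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# One shared feature map, computed once per image (toy values).
fmap = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]

# Two proposals of different sizes both come out as 2 x 2 grids.
print(roi_max_pool(fmap, (1, 0, 5, 4)))  # [[9, 11], [21, 23]]
print(roi_max_pool(fmap, (0, 0, 3, 3)))  # [[8, 9], [14, 15]]
```

Because every proposal reuses the same feature map, the expensive convolution runs once, and only this cheap pooling step is repeated per region.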

Primary Challenges with the Fast R-CNN:

Not fast enough, especially for large datasets.
Fast R-CNN performs better than R-CNN. But at test time, Fast R-CNN is much slower when the time for generating region proposals is included than when it is excluded. Region proposal generation is therefore the main bottleneck of Fast R-CNN.

Faster R-CNN

R-CNN and Fast R-CNN use the selective search algorithm to find the region proposals, which is a slow and time-consuming process.

Faster R-CNN is an object detection algorithm that eliminates the selective search algorithm and lets the network learn the region proposals.

The key difference between Faster R-CNN and the prior object detection algorithms R-CNN and Fast R-CNN is the Region Proposal Network (RPN).

Key Takeaways from Faster R-CNN:

R-CNN and Fast R-CNN use the selective search algorithm to find the region proposals, which is slow.
Faster R-CNN does not use the selective search algorithm. Instead, it lets the network learn the region proposals directly.
In Faster R-CNN, a separate network (the Region Proposal Network) predicts the region proposals in place of the selective search algorithm.
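
In the Faster R-CNN paper, the Region Proposal Network does not predict boxes from scratch: at every feature-map location it scores and refines a fixed set of reference boxes called anchors. The sketch below only generates those anchors and leaves out the learned part. The three scales and three aspect ratios are the ones reported in the paper, while the function names and the stride of 16 are assumptions for illustration.

```python
import math


def make_anchors(center_x, center_y, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors (x1, y1, x2, y2) centered on one location.

    Each anchor has area scale * scale and aspect ratio height / width = ratio.
    """
    anchors = []
    for scale in scales:
        area = scale * scale
        for ratio in ratios:
            w = math.sqrt(area / ratio)
            h = w * ratio
            anchors.append((center_x - w / 2, center_y - h / 2,
                            center_x + w / 2, center_y + h / 2))
    return anchors


def anchors_for_feature_map(fm_w, fm_h, stride=16):
    """Place the anchor set at the image-space center of every feature-map cell."""
    all_anchors = []
    for row in range(fm_h):
        for col in range(fm_w):
            cx = col * stride + stride / 2
            cy = row * stride + stride / 2
            all_anchors.extend(make_anchors(cx, cy))
    return all_anchors


print(len(make_anchors(0, 0)))             # 9 anchors per location
print(len(anchors_for_feature_map(4, 3)))  # 4 * 3 * 9 = 108 anchors
```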

Primary Challenges with Faster R-CNN:

Faster R-CNN outputs only bounding boxes and class labels. It provides no pixel-level segmentation of the objects.

Mask R-CNN

Mask R-CNN is built using Faster R-CNN.

The main idea behind Mask R-CNN is to extend Faster R-CNN to pixel level segmentation.

In addition to the bounding box and class label, Mask R-CNN outputs an object mask for each detection.

In Mask R-CNN, a small fully convolutional network is added on top of the CNN features of Faster R-CNN as an extra branch that generates the mask output.

Mask R-CNN uses a technique called RoIAlign to locate the relevant areas down to the pixel level. A commonly used backbone for Mask R-CNN is ResNet-101.
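
The core of RoIAlign is that region coordinates are not rounded to whole feature-map cells. Instead, the feature map is sampled at fractional locations using bilinear interpolation. The sketch below shows only that sampling step on a single-channel toy feature map. The function name and the values are assumptions for illustration, and the full layer (several sample points per bin, then averaging or max) is not reproduced here.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Sample a single-channel feature map at a fractional (x, y) location."""
    h, w = len(feature_map), len(feature_map[0])
    # Clamp the sample point to the valid range of cell centers.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)

    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0

    # Interpolate along x on the two neighboring rows, then along y.
    top = feature_map[y0][x0] * (1 - fx) + feature_map[y0][x1] * fx
    bottom = feature_map[y1][x0] * (1 - fx) + feature_map[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


fmap = [[0.0, 10.0],
        [20.0, 30.0]]

print(bilinear_sample(fmap, 0.5, 0.5))   # 15.0, a blend of all four cells
print(bilinear_sample(fmap, 0.25, 0.0))  # 2.5, a quarter of the way from 0 to 10
```

Rounding the same sample points to whole cells would shift the sampled features by up to half a cell. That is harmless for a coarse class label but visible in a pixel-level mask.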

YOLO (You Only Look Once)

R-CNN, Fast R-CNN, Faster R-CNN and Mask R-CNN use regions to localize the objects within the image. The network does not look at the complete image. Instead, it looks at the parts of the image that have the highest probability of containing an object. YOLO is an object detection algorithm that differs from these region-based algorithms. In YOLO, a single Convolutional Neural Network predicts the bounding boxes and the class probabilities for these boxes in one pass over the whole image.

In YOLO, the input image is split into an S x S grid, and within each grid cell we take m bounding boxes. For each bounding box, the network outputs a class probability and offset values for the box. The bounding boxes whose class probability is above a threshold value are selected and used to locate the objects within the image.
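
The selection step described above can be sketched as follows. This follows the simplified description in this article rather than the exact output encoding of any specific YOLO version. The layout of `predictions`, the function name and the example values are all assumptions made for illustration.

```python
def decode_predictions(predictions, grid_size, image_size, threshold=0.5):
    """Turn per-cell predictions into image-space boxes above a threshold.

    predictions[row][col] is a list of boxes for that grid cell, each a tuple
    (x_off, y_off, w, h, class_id, class_prob), where x_off / y_off are offsets
    in [0, 1] inside the cell and w / h are fractions of the image size.
    """
    cell = image_size / grid_size
    detections = []
    for row in range(grid_size):
        for col in range(grid_size):
            for x_off, y_off, w, h, class_id, prob in predictions[row][col]:
                if prob < threshold:
                    continue  # drop low-probability boxes
                cx = (col + x_off) * cell
                cy = (row + y_off) * cell
                bw, bh = w * image_size, h * image_size
                detections.append((class_id, prob,
                                   cx - bw / 2, cy - bh / 2,
                                   cx + bw / 2, cy + bh / 2))
    return detections


# Hypothetical network output for a 2 x 2 grid on a 100 x 100 image.
predictions = [
    [[], [(0.5, 0.5, 0.4, 0.2, "dog", 0.9)]],
    [[(0.1, 0.1, 0.1, 0.1, "cat", 0.2)], []],
]
detections = decode_predictions(predictions, grid_size=2, image_size=100)
print(detections)  # only the "dog" box survives the 0.5 threshold
```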

Limitations:

The limitation of the YOLO algorithm is that it struggles with small objects within the image. It might fail to detect very small objects because of the spatial constraints of the algorithm: each grid cell predicts only a limited number of bounding boxes.
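
One way to see this spatial constraint is to check which grid cell each object's center falls into. The sketch below assumes a 448 x 448 input and a 7 x 7 grid, the values used in the original YOLO paper. The function name and the object positions are made up for illustration.

```python
def cell_of(center_x, center_y, image_size, grid_size):
    """Return the (row, col) of the grid cell that contains a box center."""
    cell = image_size / grid_size
    col = min(int(center_x // cell), grid_size - 1)
    row = min(int(center_y // cell), grid_size - 1)
    return row, col


# Two small, nearby objects (hypothetical centers, in pixels).
bird_a = cell_of(100, 100, image_size=448, grid_size=7)
bird_b = cell_of(120, 110, image_size=448, grid_size=7)
print(bird_a, bird_b)  # (1, 1) (1, 1): both centers land in the same cell
```

When several small objects share one cell, they compete for that cell's few bounding boxes, so some of them can be missed.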

Written by Muhammad Moin