Introduction to Object Detection with RCNN Family Models

Sairaj Neelam · Published in Analytics Vidhya · Aug 28, 2021 · 11 min read


In this post, you will discover a gentle introduction to the problem of object detection and state-of-the-art deep learning models designed to address it.

After reading this post, you will know:

  • Region-Based Convolutional Neural Networks, or R-CNNs, are a family of techniques for addressing object detection tasks, designed primarily for model performance (accuracy) rather than speed.

Let’s get started.

Overview

This article is divided into two parts; they are:

  1. Basic terminologies and challenges faced in Object Detection
  2. R-CNN Model Family

So, let's start with some basic terminology:

  1. Image Classification: Predict the type or class of the object in an image.

· Input: An image with a single object, such as a photograph.

· Output: A class label (e.g. one or more integers that are mapped to class labels).

  2. Object Localization: Locate the objects in an image and output their locations with bounding boxes.

· Input: An image with one or more objects, such as a photograph.

· Output: One or more bounding boxes (e.g. defined by a point, width, and height).

  3. Object Detection: Locate the objects in an image with bounding boxes and predict the types or classes of the located objects.

· Input: An image with one or more objects, such as a photograph.

· Output: One or more bounding boxes (e.g. defined by a point, width, and height), and a class label for each bounding box.

A bounding box is parametrized by the components (x, y, w, h, confidence), where (x, y) is the center of the box, w is its width, h is its height, and confidence is the score for the box containing an object.
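As a quick illustration, here is a minimal sketch (the helper name is mine, not from any library) converting the center parametrization to corner coordinates, which the IOU computation later in this post will use:

```python
def center_to_corners(x, y, w, h):
    """Convert a (center x, center y, width, height) box
    to (x1, y1, x2, y2) corner coordinates."""
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)

# Example: a 100x50 box centered at (60, 40)
print(center_to_corners(60, 40, 100, 50))  # (10.0, 15.0, 110.0, 65.0)
```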

Challenges in Object Detection

  1. Multiple Outputs:

In image classification we have a single output for every image, but here we need to output the whole set of detected objects, and each image might contain many different objects. So we need to build a model that can output a variable number of detections.

2. Multiple types of output:

For each detected object we have two different types of output:

a. Category Label

b. Bounding Box

3. Computational Problem:

Object detection typically requires working on high-resolution images. Since we want to identify many different objects in an image, we need enough spatial resolution on each object, so the overall resolution of the image needs to be quite high.

Before diving deep into the R-CNN family, we must understand the concept of region proposals.

Idea of Region Proposal:

The idea is this: since there is no way we can evaluate an object detector on every possible region in an image, we can use an external algorithm that generates a set of candidate regions for us, such that the set of regions per image is small but has a high probability of covering all the objects in the image.

One of the most famous methods for region proposals is selective search. The selective search algorithm gives you about 2,000 object proposals per image in a couple of seconds of processing on a CPU, and these 2,000 region proposals have a very high probability of covering all the interesting objects we care about in the image.
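For a taste of how this looks in practice, OpenCV's contrib package ships a selective search implementation; a minimal sketch of generating proposals could look like this (the image path is a placeholder; requires opencv-contrib-python):

```python
import cv2

image = cv2.imread("input.jpg")  # placeholder path

# Selective search from opencv-contrib-python
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()  # the faster, lower-quality mode

rects = ss.process()  # array of (x, y, w, h) proposals
print(len(rects), "region proposals")  # typically around 2,000
```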

Once we have the idea of region proposals, we have a way to train object detectors with deep neural networks. This brings us to the very famous R-CNN paper.

Now, let’s see the RCNN’s Model Family.

Region-Based Convolutional Neural Network (R-CNN):

This is one of the most influential papers in deep learning, published in 2014: "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation" by Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik.

Architecture and Working of RCNN:

Architecture for RCNN

Working of RCNN:

Step1: We start with the input image and run a region proposal method like selective search, which gives us about 2,000 candidate region proposals in the image that we need to evaluate.

Step2: Region proposals can have different sizes and different aspect ratios, so we warp each candidate region into a fixed size, say 224×224.

Step3: We run each warped image region independently through a convolutional neural network (CNN), which outputs a classification score for each region.
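To make steps 2 and 3 concrete, here is a minimal sketch that warps each proposal and scores it with a pretrained classifier. It assumes `image` and `rects` from the selective search sketch above, and uses torchvision's ResNet-18 purely as a stand-in for the per-region CNN (the original paper used AlexNet):

```python
import cv2
import torch
import torchvision.transforms as T
from torchvision.models import resnet18

cnn = resnet18(weights="IMAGENET1K_V1").eval()  # stand-in per-region CNN
preprocess = T.Compose([
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

scores = []
with torch.no_grad():
    for (x, y, w, h) in rects[:2000]:
        region = image[y:y + h, x:x + w]               # crop the proposal
        region = cv2.resize(region, (224, 224))        # warp to a fixed size
        region = cv2.cvtColor(region, cv2.COLOR_BGR2RGB)
        logits = cnn(preprocess(region).unsqueeze(0))  # independent forward pass
        scores.append(logits.softmax(dim=1).max().item())
```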

But there is a slight problem here:

What happens if the region proposals we get from selective search do not exactly match the objects we want to detect in the image?

To overcome this problem, the CNN outputs one additional thing: a transformation that turns the region proposal box into the final box we want to output for the object of interest.
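Concretely, the standard R-CNN parametrization of this transform predicts four deltas (dx, dy, dw, dh); a minimal sketch of applying them to a proposal in center/width/height form:

```python
import math

def apply_box_transform(px, py, pw, ph, dx, dy, dw, dh):
    """Turn a proposal box (px, py, pw, ph) into the final box
    using predicted deltas, as parametrized in the R-CNN paper."""
    x = px + pw * dx       # shift the center, scaled by proposal size
    y = py + ph * dy
    w = pw * math.exp(dw)  # scale width and height in log space
    h = ph * math.exp(dh)
    return x, y, w, h
```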

Putting it all together, the architecture of R-CNN looks like this:

Step1: Run the region proposal method to compute about 2,000 candidate region proposals.

Step2: Resize each region to a fixed size (224×224) and run it independently through the CNN to predict class scores and a bounding box transform.

Step3: Use the scores to select a subset of region proposals to output.

Step4: Compare the predictions with the ground-truth boxes.

Now, the question arises: how do we compare a prediction to the ground-truth box?

We can compare these bounding boxes with a metric called Intersection over Union (IOU):

IOU = (Area of Intersection) / (Area of Union)

More generally, IOU is a measure of the overlap between two bounding boxes.

If IOU < 0.5 we call it 'bad'; IOU > 0.5 is 'decent'; IOU > 0.7 is 'good'; IOU > 0.9 is 'almost perfect'.
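Here is a minimal sketch of IOU for boxes in (x1, y1, x2, y2) corner format:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)  # zero if boxes don't overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 4, 4), (2, 2, 6, 6)))  # 4 / 28 ≈ 0.14 -> 'bad'
```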

There is also another problem: the object detector often outputs multiple bounding boxes for the same object. How do we solve this?

The solution is to post-process the raw detections using Non-Max Suppression (NMS).

NMS is the way to make sure your algorithm detects each object only once.

What NMS does is clean up the other, unwanted detections so we end up with one detection per object.

How does NMS work?

1. First it looks at the probability (Pc) associated with each detection for a particular object.

2. It takes the largest Pc, which is the most confident detection for the object.

3. Having done that, NMS looks at all the remaining bounding boxes, picks all the boxes that have a high Intersection over Union (IOU) with the highest-Pc box, and suppresses them.

4. Then, among the remaining bounding boxes, it again finds the one with the highest Pc, suppresses the remaining boxes that have a high IOU with it, and repeats.

So for this example:

1. It takes the largest Pc, which is 0.9 in this case.

2. It checks the IOU of all the remaining bounding boxes (i.e. 0.6 and 0.7 for car 1, and 0.8 and 0.7 for car 2).

3. NMS suppresses the 0.6 and 0.7 boxes for car 1, as they have a high IOU with the Pc = 0.9 box; this way we get only one bounding box for car 1, highlighted in the image.

4. Next, among the remaining bounding boxes the highest is Pc = 0.8 for car 2, and we again check the IOU of the remaining boxes (i.e. 0.9 for car 1 and 0.7 for car 2).

5. NMS suppresses the 0.7 box, as it has a high IOU with the Pc = 0.8 box. And we get only one bounding box for car 2 as well.
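Putting the steps above into code, a minimal sketch of greedy NMS (reusing the `iou` helper from earlier) could look like this:

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy Non-Max Suppression.

    boxes: list of (x1, y1, x2, y2); scores: the Pc of each box.
    Returns the indices of the boxes that survive.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # most confident remaining detection
        keep.append(best)
        # suppress remaining boxes that overlap the best box too much
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```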

However, R-CNN is very slow and cannot be used in real time.

Fast-RCNN

The way the researchers made this fast is by swapping the CNN and warping steps: we warp the regions after we run the CNN on the whole image. Doing this gives us Fast R-CNN, by Ross Girshick.

Let's see the working and the architecture.

Architecture for Fast-RCNN

Step1: Take the input image and process the whole image with a single CNN (without the fully connected layers). The output is a convolutional feature map giving us convolutional features. The ConvNet we run here is often called the backbone network (it can be AlexNet, VGG, ResNet, etc.).

Step2: Run the region proposal method and crop & resize the features.

Step3: Run a light CNN (i.e. a shallow network) per region.

This is fast, because most of the computation happens in the backbone network, while the network we run per region is relatively small, lightweight, and fast to run.

What does it mean to crop and resize features? How do we crop features?

It can be done via Region of Interest Pooling (RoI Pooling)

Its purpose is to perform max pooling on inputs of non-uniform sizes to obtain fixed-size feature maps (e.g. 7×7).

Let’s just understand this by example,

Let’s consider a small example to see how it works. We’re going to perform region of interest pooling on a single 8×8 feature map, one region of interest and an output size of 2×2. Our input feature map looks like this:

Let’s say we also have a region proposal (top left, bottom right coordinates): (0, 3), (7, 8). In the picture it would look like this:

Normally, there’d be multiple feature maps and multiple proposals for each of them, but we’re keeping things simple for the example.
By dividing it into (2×2) sections (because the output size is 2×2) we get:

Note that the size of the region of interest doesn’t have to be perfectly divisible by the number of pooling sections (in this case our RoI is 7×5 and we have 2×2 pooling sections).
The max values in each of the sections are:

And that's the output from the Region of Interest pooling layer. Here's our example presented in the form of a nice animation:

The result is that from a list of rectangles with different sizes we can quickly get a list of corresponding feature maps with a fixed size. Note that the dimension of the RoI pooling output doesn’t actually depend on the size of the input feature map nor on the size of the region proposals. It’s determined solely by the number of sections we divide the proposal into.
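The following sketch reproduces this walkthrough with made-up feature values (the actual numbers in the post came from an image), treating the RoI coordinates as inclusive so that the region is 7×5, as stated above:

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool an RoI (x1, y1, x2, y2, inclusive) of a 2-D feature
    map into a fixed output_size grid; bins need not divide evenly."""
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2 + 1, x1:x2 + 1]
    h, w = region.shape
    out_h, out_w = output_size
    ys = np.linspace(0, h, out_h + 1).round().astype(int)  # uneven bin edges
    xs = np.linspace(0, w, out_w + 1).round().astype(int)
    out = np.empty(output_size)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

rng = np.random.default_rng(0)
fmap = rng.random((8, 8)).round(2)   # stand-in for the 8x8 feature map
print(roi_pool(fmap, (0, 3, 6, 7)))  # 7x5 RoI pooled down to 2x2
```

In a real network you would use a batched library implementation such as torchvision.ops.roi_pool, which does the same thing over many proposals at once.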

What’s the benefit of RoI pooling? One of them is processing speed. If there are multiple object proposals on the frame (and usually there’ll be a lot of them), we can still use the same input feature map for all of them. Since computing the convolutions at early stages of processing is very expensive, this approach can save us a lot of time.

Now, to make Fast R-CNN even faster, the researchers added a Region Proposal Network after the backbone network. This gives us,

Faster-RCNN

(Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, by Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun)

Architecture for Faster-RCNN

Here we eliminate the selective search algorithm and instead train a convolutional neural network to predict the region proposals for us. The way we do that is very similar to Fast R-CNN, except that after running the backbone network we insert a tiny network called the Region Proposal Network (RPN) that is responsible for predicting region proposals.

Basically, Fast R-CNN and Faster R-CNN work the same way once we have the region proposals.

Step 1: Run the input image through the backbone network to get image-level features.

Step 2: Pass the image-level features to the RPN to get our region proposals.

Step 3: Crop the region proposals with RoI pooling.

Step 4: Pass the warped features to the light CNN to predict the final classification and bounding box transformations.
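If you just want to try Faster R-CNN end to end, torchvision ships a pretrained implementation; a minimal sketch (the image path is a placeholder):

```python
import torch
from torchvision.io import read_image
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

img = read_image("street.jpg").float() / 255.0  # placeholder path; CHW in [0, 1]
with torch.no_grad():
    pred = model([img])[0]  # dict with 'boxes', 'labels', 'scores'

keep = pred["scores"] > 0.8  # keep only confident detections
print(pred["boxes"][keep], pred["labels"][keep])
```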

The question here is: how can a CNN output region proposals?

Architecture for Region Proposal Network

The CNN image features coming out of the backbone network are all aligned to positions in the input image. So at each point in the CNN feature map we can imagine an anchor box, which slides around the image: we place an anchor box at every position of the feature map coming out of the backbone network.

Now our task is to train a little CNN that classifies these anchor boxes as either containing an object or not containing an object.

Here we have one problem: what if the anchor box has the wrong shape or aspect ratio?

The solution is to use k different anchor boxes of different shapes and aspect ratios at each point in the image. The scale, size, and number of anchor boxes are hyper-parameters of the object detector.
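A minimal sketch of this idea, with k = 9 anchors (3 scales × 3 aspect ratios; the particular values are example choices, not from the paper) placed at every feature-map position, where the stride maps feature coordinates back to image coordinates:

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Place k = len(scales) * len(ratios) anchor boxes, in center
    form (x, y, w, h), at every position of the feature map."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            cx, cy = j * stride, i * stride  # feature cell -> image coords
            for s in scales:
                for r in ratios:  # r = w / h, area stays s * s
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)

print(make_anchors(2, 2).shape)  # (2 * 2 * 9, 4) = (36, 4)
```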

Basically, Faster R-CNN is a two-stage process:

1st stage (run once per image), consisting of:

Backbone Network

Region Proposal Network

2nd stage (run once per region), where we:

Crop Features → Region of Interest Pooling

Predict Object Class

Predict Bounding Box offset

Now the question arises: do we really need the second stage?

It seems we could actually get away with using just the 1st stage and asking it to do everything. This would simplify the system a bit and make it even faster, because we would not have to run the separate computation per region.

There is a method for object detection called a single-stage object detector, which basically looks like the Region Proposal Network (RPN) from Faster R-CNN, but rather than classifying anchor boxes as object or not object, it makes the full classification decision for the category of the object.

This gives us two famous single-stage detectors:

  1. YOLO (You Only Look Once)
  2. SSD (Single Shot MultiBox Detector)

We will see YOLO and its implementation in the next article.

Summary

In this post, you discovered a gentle introduction to the problem of object detection and state-of-the-art deep learning models designed to address it.

Specifically, you learned:

  • You learned the basic difference between image classification, object localization, and object detection.
  • You learned that Region-Based Convolutional Neural Networks, or R-CNNs, are a family of techniques for addressing object detection tasks, designed for model performance.

References:

R-CNN family papers:

  • Girshick, R., Donahue, J., Darrell, T., Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR 2014.
  • Girshick, R. Fast R-CNN. ICCV 2015.
  • Ren, S., He, K., Girshick, R., Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NeurIPS 2015.
