Deep Learning for Object Detection and Localization using R-CNN

A deep dive into R-CNN for object detection

Saurabh Bagalkar
Sep 9 · 8 min read


Object detection is at the forefront of computer vision research, largely because of its many practical applications in almost every field. With technology progressing so fast, it is important to develop a deep understanding of what these methods are and how they have evolved over time as we innovate beyond them. This is the first piece in a new series that will cover the concepts behind some very powerful state-of-the-art object detection and localization methods that use region-based object detectors. Scene understanding, of which object detection is a major part, helps to automate many computer vision tasks, such as pedestrian detection, vehicle detection and localization, and traffic sign recognition. We will start with R-CNN, proposed by Ross Girshick et al., which is the basis for understanding its more efficient and widely used successors: Fast R-CNN, Faster R-CNN and eventually Mask R-CNN, all of which will be covered later in this series.

What is R-CNN?

R-CNN is an object detection and localization method. It aims to “find” objects of interest in an image, draw a bounding box around each of them and assign each one a class. The name R-CNN stands for Regions with CNN (Convolutional Neural Network) features.

R-CNN attempts to categorize each object in an image with its associated bounding box. Image source

Motivation Behind R-CNN

Object recognition has traditionally been done using SIFT and HOG features. These methods, although very successful, break down when images contain other objects or considerable noise around the object of interest. Recognition in the human visual cortex is much more complicated than just identifying gradients and clustering them, which is roughly what HOG does. We need a more sophisticated, hierarchical and multistage process if we are going to reach human-level visual recognition.

CNNs classify images by extracting their high- and low-level features. Image source

Why is Object Detection and Localization a non-trivial task?

Localizing an object can be framed as a regression problem by itself, but that is not very effective: it yields poor results and is very slow. Another approach is to use a sliding-window detector, the traditional way of applying CNNs to detection, but as the number of layers and their strides increase, spatial resolution is lost. This makes precise localization challenging, and the approach is too slow for many real-world applications. A further issue with the sliding-window approach is that objects appear at different locations and with different aspect ratios, so windows of many shapes have to be evaluated at every position. This greatly increases the computational cost.

The sliding-window approach is very cumbersome when used to localize cars in an image. Image source
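To get a feel for why exhaustive sliding windows become expensive, here is a rough back-of-the-envelope sketch in Python. The function name, window sizes, aspect ratios and stride are illustrative assumptions, not values from the R-CNN paper; the only point is that the number of windows to classify multiplies with every extra scale and aspect ratio.

```python
def count_sliding_windows(image_w, image_h, window_sizes, aspect_ratios, stride):
    """Count the windows a naive sliding-window detector would have to classify.

    window_sizes  : side lengths (pixels) of the square reference windows
    aspect_ratios : width / height ratios tried for each reference window
    stride        : step (pixels) between neighbouring window positions
    """
    total = 0
    for size in window_sizes:
        for ratio in aspect_ratios:
            # Keep the window area roughly size * size while changing its shape.
            win_w = int(round(size * ratio ** 0.5))
            win_h = int(round(size / ratio ** 0.5))
            if win_w > image_w or win_h > image_h:
                continue  # this window does not fit inside the image
            positions_x = (image_w - win_w) // stride + 1
            positions_y = (image_h - win_h) // stride + 1
            total += positions_x * positions_y
    return total


if __name__ == "__main__":
    # Hypothetical settings for a 640 x 480 image.
    n = count_sliding_windows(640, 480, [64, 128, 256], [0.5, 1.0, 2.0], stride=8)
    print("windows to classify:", n)
```

Every one of these windows would need its own pass through the classifier, which is the cost that a region proposal step tries to avoid by keeping only a limited set of candidate regions.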

R-CNN building blocks and architecture

To deal with the aforementioned localization issues, which hurt not just detection speed but also accuracy, R-CNN was introduced. It takes a very clever approach to detection and localization.

  1. Feature extraction using CNN: for image classification of each region
  2. Classification and Localization
  • Bounding-box regression for localization
R-CNN modules. Image source

Region Proposals

A clever way to select coherent regions within an image. Image source
  • Texture
  • Size
  • Shape Compatibility
Selective Search uses a bottom up approach to propose similar regions. Image source
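As a rough illustration of how two of the cues listed above can be turned into numbers, the sketch below computes a size similarity and a shape-compatibility ("fill") similarity for a pair of regions, following the definitions commonly given for Selective Search. The region representation (a dict with a bounding box and a pixel count) and all function names are assumptions made for this example, and the texture cue is left out for brevity.

```python
def box_union(box_a, box_b):
    """Smallest box (x1, y1, x2, y2) that contains both input boxes."""
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))


def box_area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def size_similarity(region_a, region_b, image_size):
    """Close to 1 when both regions are small, so small regions merge early."""
    return 1.0 - (region_a["size"] + region_b["size"]) / image_size


def fill_similarity(region_a, region_b, image_size):
    """Close to 1 when the regions fit together without leaving a large gap."""
    merged_box = box_union(region_a["box"], region_b["box"])
    gap = box_area(merged_box) - region_a["size"] - region_b["size"]
    return 1.0 - gap / image_size


if __name__ == "__main__":
    image_size = 100 * 100  # total number of pixels in a hypothetical image
    left = {"box": (0, 0, 10, 10), "size": 100}
    right = {"box": (10, 0, 20, 10), "size": 100}   # touches `left`
    far = {"box": (50, 50, 60, 60), "size": 100}    # far away from `left`
    print(fill_similarity(left, right, image_size))
    print(fill_similarity(left, far, image_size))
```

In a bottom-up scheme, the pair of neighbouring regions with the highest combined similarity is merged first, and the process repeats until the whole image is a single region; the boxes of all intermediate regions become the proposals.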

Feature Extraction using CNN

Since AlexNet was proposed by Krizhevsky et al., CNNs have become hugely popular for feature extraction from images. This module extracts a 4096-dimensional feature vector for each region proposed by the region proposals module. The overarching steps for feature extraction are as follows:

  • Forward propagate each resized region through a pre-trained AlexNet to get a 4096-dimensional feature vector for every region in each image. Finally, the 1000-way classification layer is replaced by an (N+1)-way layer, where N is the number of object classes and the extra one is for the background.
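The resizing step mentioned above can be sketched as follows. This is a minimal pure-Python illustration that crops a proposed box and warps it to a fixed square size with nearest-neighbour sampling, ignoring the original aspect ratio. The function name and the image representation (a list of pixel rows) are assumptions for this example; a real pipeline would use an image library and then feed the warped crop to the CNN.

```python
def warp_region(image, box, out_size):
    """Crop `box` from `image` and warp it to out_size x out_size.

    image    : list of rows, each row a list of pixel values
    box      : (x1, y1, x2, y2) with x2 and y2 exclusive
    out_size : side length of the fixed-size output the CNN expects
    """
    x1, y1, x2, y2 = box
    region_w, region_h = x2 - x1, y2 - y1
    warped = []
    for out_y in range(out_size):
        # Map each output row back to a source row inside the box.
        src_y = y1 + (out_y * region_h) // out_size
        row = []
        for out_x in range(out_size):
            src_x = x1 + (out_x * region_w) // out_size
            row.append(image[src_y][src_x])
        warped.append(row)
    return warped


if __name__ == "__main__":
    # Tiny 4 x 6 "image" whose pixel value encodes its position (row * 10 + column).
    image = [[row * 10 + col for col in range(6)] for row in range(4)]
    for line in warp_region(image, (0, 0, 4, 2), out_size=4):
        print(line)
```

Because every region ends up with the same dimensions, the network can produce a feature vector of the same length for each proposal, regardless of the proposal's original shape.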

Classification and Localization

Object classification is performed using a linear SVM model.

  • During training, the classifier learns which objects to focus on: we explicitly capture false positives and tell the classifier to treat them as negative examples in the next round of training, which makes the classifier more robust.
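The idea of feeding false positives back as negatives can be sketched like this. It is a simplified illustration, assuming a linear classifier given by a weight vector and a bias; the function names and the score threshold of 0 are assumptions for this example, and the retraining step itself is omitted.

```python
def linear_score(weights, bias, features):
    """Score of a linear classifier; positive means 'object', negative 'background'."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def mine_hard_negatives(weights, bias, background_features, threshold=0.0):
    """Return the background examples that the current classifier gets wrong.

    These false positives are the 'hard' negatives that get added to the
    negative set before the classifier is trained again.
    """
    return [features for features in background_features
            if linear_score(weights, bias, features) > threshold]


if __name__ == "__main__":
    weights, bias = [1.0, -1.0], 0.0                        # hypothetical classifier
    backgrounds = [[2.0, 1.0], [0.0, 3.0], [1.0, 1.0]]      # all truly background
    print(mine_hard_negatives(weights, bias, backgrounds))
```

Repeating this loop (train, collect false positives, add them to the negatives, train again) concentrates the training set on the background regions that are easiest to confuse with real objects.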
Transformation between predicted and ground truth coordinates. Image source
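The transformation between a proposed box and its ground-truth box can be written out explicitly. The sketch below uses the parameterization from the R-CNN paper, where a box is given by its center, width and height: the center offsets are scaled by the proposal's size and the size changes are expressed in log space. The function names are assumptions for this example, and the regression model that would predict these offsets from CNN features is not shown.

```python
import math


def regression_targets(proposal, ground_truth):
    """Offsets (tx, ty, tw, th) that map a proposal onto its ground-truth box.

    Both boxes are (center_x, center_y, width, height).
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    tx = (gx - px) / pw          # horizontal shift, relative to proposal width
    ty = (gy - py) / ph          # vertical shift, relative to proposal height
    tw = math.log(gw / pw)       # log of the width scaling factor
    th = math.log(gh / ph)       # log of the height scaling factor
    return (tx, ty, tw, th)


def apply_offsets(proposal, offsets):
    """Inverse of regression_targets: correct a proposal with predicted offsets."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = offsets
    return (px + pw * tx, py + ph * ty, pw * math.exp(tw), ph * math.exp(th))


if __name__ == "__main__":
    proposal = (50.0, 50.0, 20.0, 40.0)
    ground_truth = (55.0, 45.0, 40.0, 20.0)
    targets = regression_targets(proposal, ground_truth)
    print(targets)
    print(apply_offsets(proposal, targets))  # recovers the ground-truth box
```

Expressing the targets relative to the proposal's own size keeps them on a similar scale for small and large boxes, which makes them easier to fit with a single regression model per class.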

Recap of R-CNN process

In a nutshell, R-CNN does the following:

R-CNN process
  1. The CNN extracts a 4096-dimensional feature vector for each proposed region in an image, which for 2000 regions creates a 2000 x 4096 matrix.
  2. Each feature vector is passed on to binary SVMs, one trained separately for each class, which give the classification of the object.
  3. Precise localization is achieved using a least squares regression model, which minimizes the difference between the proposed coordinates and the ground truth coordinates by correcting the offset values.
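The classification step of this recap can be sketched as scoring every region's feature vector against one linear classifier per class. In the example below, tiny 2-dimensional features stand in for the 4096-dimensional ones, and the class names, weights and the rule "label as background if no class scores above zero" are assumptions made for illustration only.

```python
def classify_regions(features, class_models):
    """Assign each region the best-scoring class, or 'background'.

    features     : list of feature vectors, one per region proposal
    class_models : dict mapping class name -> (weights, bias) of a binary classifier
    """
    labels = []
    for feature in features:
        best_label, best_score = "background", 0.0
        for name, (weights, bias) in class_models.items():
            score = sum(w * x for w, x in zip(weights, feature)) + bias
            if score > best_score:
                best_label, best_score = name, score
        labels.append(best_label)
    return labels


if __name__ == "__main__":
    # Three hypothetical region features and two hypothetical per-class classifiers.
    features = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
    class_models = {
        "cat": ([1.0, 0.0], -0.5),
        "dog": ([0.0, 1.0], -0.5),
    }
    print(classify_regions(features, class_models))
```

With the full-size features, scoring all regions against all classes amounts to a single product between the 2000 x 4096 feature matrix and a 4096 x N weight matrix.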

Why does R-CNN need improvement?

Although R-CNN was a breakthrough paper in its time, it faced several drawbacks:



Written by Saurabh Bagalkar

Machine Learning Researcher with interest in Computer Vision, Deep Learning, Localization and the field of perception in general


