
Region Proposal Network (RPN) & RoI Pooling: From Image Classification to Object Detection

I used to wonder what actually changes when you go from image classification to object detection.


In my previous internship as an AI Research intern, my main duty involved mostly computer vision tasks, specifically image classification. Image classification wasn’t easy to begin with, but it is still a relatively simple concept: assign the image a label based on its most dominant object. Going from image classification to object detection takes more than identifying a single dominant object, as there may be many objects in the image that we want to locate and recognize. After spending some time with object detection, I now know the difference between the two. Let’s break it down in this article.

From image classification to object detection

1. Extracting features using CNN

A Convolutional Neural Network (CNN) is an artificial neural network used mostly in computer vision tasks, typically to classify an image based on a single object in it. It can be considered one of the building blocks for more complex computer vision tasks like instance segmentation.

While there are already thousands of articles and tutorials on CNNs, let’s review the concept so that we can put it to use in object detection tasks later on. Remember, a CNN architecture consists of three main types of layers:

  • Convolutional layer, which extracts feature maps from the input image using filters (also called kernels)
  • Pooling layer, which helps in reducing the dimensions of the feature maps by downsampling them
  • Fully connected layer, which connects every neuron in one layer to every neuron in the next and produces the final prediction
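
To make the three layer types concrete, here is a minimal sketch in plain Python (no framework, lists instead of tensors). The toy image, the 2x2 filter, and the fully connected weights are all made-up values chosen for illustration; a real CNN learns its filters and weights during training and adds non-linear activations between layers.

```python
def conv2d(image, kernel):
    """Valid (no padding) sliding-window filter, as applied by a convolutional layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + di][j * size + dj]
                for di in range(size) for dj in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def fully_connected(inputs, weights, biases):
    """Each output neuron is a weighted sum of every input value plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


# A 5x5 toy image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical 2x2 filter that responds to left-to-right intensity increases.
kernel = [[-1, 1], [-1, 1]]

features = conv2d(image, kernel)      # 4x4 feature map
pooled = max_pool(features, size=2)   # 2x2 map after downsampling
flat = [value for row in pooled for value in row]
scores = fully_connected(flat,
                         weights=[[1, 1, 1, 1], [-1, -1, -1, -1]],
                         biases=[0, 0])
```

The feature map lights up only where the edge is, pooling halves each dimension, and the fully connected layer turns the flattened values into one score per class.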

As a result, we get an architecture like the one displayed in the image above. Fairly simple. However, a question arises: how do we find and recognize multiple objects in a single image, given that the convolutional layers extract features of the whole image rather than of individual objects? Enter Region-based CNN (R-CNN).

2. Recognizing different objects within a single image using R-CNN

As we already know, CNNs can extract feature maps from images. So to perform object detection, we can treat each object as an image of its own, by drawing a bounding box around it of course!

Say we have many objects in a single image. If we can draw a bounding box around each of them, we can extract individual feature maps and classify them individually. The problem, then, is how the model should come up with the bounding boxes in the first place. For this purpose, the authors of R-CNN proposed using selective search to extract around 2000 candidate regions per image, also known as region proposals. Each of these regions is then resized to a fixed size and fed into a CNN, which extracts its features before a Support Vector Machine (SVM) classifies them.

Selective Search Algorithm:

  1. Generate initial sub-segmentation within the input image based on this paper.
  2. Recursively combine smaller regions into larger ones using a greedy algorithm.
  3. Generate the final candidate region proposals using the combined larger regions.
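
The greedy merging in steps 2 and 3 can be sketched roughly as follows. This is a heavily simplified illustration, not the real algorithm: it assumes the initial segments are already given as bounding boxes with a mean intensity and a pixel count, and it uses closeness of mean intensity as a stand-in for the richer similarity measure that selective search uses. The function names and the three toy regions are hypothetical.

```python
from itertools import combinations


def union_box(a, b):
    """Smallest (x1, y1, x2, y2) box that encloses both boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def boxes_touch(a, b):
    """True when two (x1, y1, x2, y2) boxes overlap or share an edge."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def greedy_merge(regions):
    """Merge the most similar neighbouring regions until one region is left.

    regions: list of dicts with 'box', 'mean' (average intensity) and 'size'.
    Returns every box seen along the way, which serve as the region proposals.
    """
    regions = list(regions)
    proposals = [r["box"] for r in regions]
    while len(regions) > 1:
        pairs = [(i, j) for i, j in combinations(range(len(regions)), 2)
                 if boxes_touch(regions[i]["box"], regions[j]["box"])]
        if not pairs:
            break
        # Greedy step: pick the neighbouring pair with the closest mean intensity.
        i, j = min(pairs,
                   key=lambda p: abs(regions[p[0]]["mean"] - regions[p[1]]["mean"]))
        a, b = regions[i], regions[j]
        size = a["size"] + b["size"]
        merged = {
            "box": union_box(a["box"], b["box"]),
            "mean": (a["mean"] * a["size"] + b["mean"] * b["size"]) / size,
            "size": size,
        }
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged["box"])
    return proposals


# Three toy segments: two dark neighbours on top, one bright strip below.
initial = [
    {"box": (0, 0, 4, 4), "mean": 10, "size": 16},
    {"box": (4, 0, 8, 4), "mean": 12, "size": 16},
    {"box": (0, 4, 8, 8), "mean": 200, "size": 32},
]
proposals = greedy_merge(initial)
```

The two dark segments merge first because they are the most alike, and the bright strip is only absorbed in the final step, so the proposals cover the scene at several scales.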

The next problem is that this step is computationally expensive, given there are 2000 regions to feed into the network one by one. Plus, selective search is a fixed algorithm in which no learning takes place. Fast R-CNN solved the first problem while Faster R-CNN fixed the second. Here’s how.

3. Fast R-CNN vs Faster R-CNN

Fast R-CNN is similar to R-CNN, with essentially one modification: instead of feeding 2000 region proposals through the CNN one by one, Fast R-CNN runs the CNN once on the whole image to produce a single convolutional feature map, then projects the region proposals onto that feature map. An operation called Region of Interest Pooling (RoI Pooling) then reshapes each projected region into a fixed-size feature map. Here’s what RoI Pooling does under the hood:

  1. An Nx5 matrix representing a list of regions of interest (RoI) is generated. N represents the number of RoIs while the first of the five columns represents the image index and the remaining four represent the bounding box coordinates.
  2. Each RoI is divided into a fixed number of sections. Say the output of this layer has a dimension of 2x2; then each RoI is divided into a 2x2 grid of sections, whatever the size of the RoI. Max-pooling is performed on each section, so we obtain feature maps of a fixed size from a list of regions with different sizes.
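
The two steps above can be sketched in plain Python. This is a minimal illustration under a few assumptions: one single-channel feature map per image, RoI coordinates already expressed in feature-map cells (inclusive on both ends), and section boundaries rounded outward so that no section is empty. The name `roi_pool` and the toy values are made up for this example.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def roi_pool(feature_maps, rois, output_size=2):
    """Max-pool each RoI into an output_size x output_size grid.

    feature_maps: list of 2D lists, indexed by image index.
    rois: rows of (image_index, x1, y1, x2, y2), the Nx5 matrix from step 1.
    """
    pooled = []
    for image_index, x1, y1, x2, y2 in rois:
        fmap = feature_maps[image_index]
        height, width = y2 - y1 + 1, x2 - x1 + 1
        grid = []
        for i in range(output_size):
            # Row range of section i; boundaries are rounded outward.
            row_start = y1 + (i * height) // output_size
            row_end = y1 + ceil_div((i + 1) * height, output_size)
            row = []
            for j in range(output_size):
                col_start = x1 + (j * width) // output_size
                col_end = x1 + ceil_div((j + 1) * width, output_size)
                row.append(max(fmap[r][c]
                               for r in range(row_start, row_end)
                               for c in range(col_start, col_end)))
            grid.append(row)
        pooled.append(grid)
    return pooled


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
rois = [
    (0, 0, 0, 3, 3),  # the whole 4x4 map
    (0, 1, 1, 3, 2),  # a 3-wide, 2-tall region
]
pooled = roi_pool([feature_map], rois, output_size=2)
```

Both RoIs come out as 2x2 grids even though one covers 16 cells and the other only 6, which is what lets the following fully connected layers accept regions of any size.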

Here, I’ll link to the animated explanation that helped me understand this concept of RoI pooling even better!

Remember the second downside? Selective search is a fixed algorithm that does not learn, which makes it impossible to train the network end to end. In Faster R-CNN, the authors replaced selective search with a small network that predicts region proposals: the Region Proposal Network (RPN). The RPN has both a classifier and a regressor: the classifier gives the probability that a region contains an object and the regressor gives the coordinates of its bounding box. Both have loss functions like any other machine learning model, so the RPN is trainable and can learn to generate regions of interest!
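
To show what "having loss functions" means here, the sketch below combines a binary cross-entropy loss for the classifier with a smooth L1 loss for the regressor, the two terms used in the Faster R-CNN paper. It is a simplified illustration: the box values stand for whatever box parameterization the network uses, the averaging is a plain mean rather than the paper's exact normalization, and `box_weight` and the sample numbers are assumptions made for this example.

```python
import math


def objectness_loss(prob, label):
    """Binary cross-entropy for one region; label is 1 (object) or 0 (background)."""
    eps = 1e-7
    prob = min(max(prob, eps), 1 - eps)  # avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


def smooth_l1(x):
    """Quadratic near zero, linear for large errors."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5


def rpn_loss(probs, labels, pred_boxes, target_boxes, box_weight=1.0):
    """Classification loss over all regions plus box loss over positive regions only."""
    cls = sum(objectness_loss(p, y) for p, y in zip(probs, labels)) / len(labels)
    positives = [i for i, y in enumerate(labels) if y == 1]
    reg = sum(smooth_l1(p - t)
              for i in positives
              for p, t in zip(pred_boxes[i], target_boxes[i]))
    reg /= max(len(positives), 1)
    return cls + box_weight * reg


# Two candidate regions: the first contains an object, the second is background.
loss = rpn_loss(
    probs=[0.9, 0.1],
    labels=[1, 0],
    pred_boxes=[[0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    target_boxes=[[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
)
```

Because the total is a single differentiable number, gradient descent can push the classifier toward better objectness scores and the regressor toward tighter boxes at the same time.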

Now that the fundamentals of object detection specifically using R-CNN are out of the way, let’s try to get our hands on training an object detection model using TensorFlow.

4. TensorFlow Object Detection

Implementing object detection using TensorFlow is rather simple too: to run it on a video stream, we only need to pass each frame as an input to the detection model of our choice. The TensorFlow website has an object detection example for still images; here we make slight modifications to that code so that it takes each frame as input.

Helper functions:

Loading the Faster R-CNN + Inception v2 object detection model:

Running inference on each frame:
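
Once the model has run on a frame, its raw output still has to be turned into boxes we can draw. Here is a minimal sketch of that post-processing step, assuming the output follows the TensorFlow Object Detection API convention (a dictionary with `detection_boxes` as normalized `[ymin, xmin, ymax, xmax]`, `detection_scores`, and `detection_classes`) with the batch dimension already removed. The function name `filter_detections`, the 0.5 threshold, and the sample values are hypothetical.

```python
def filter_detections(output, frame_width, frame_height, min_score=0.5):
    """Keep confident detections and convert their boxes to pixel coordinates."""
    results = []
    for box, score, class_id in zip(output["detection_boxes"],
                                    output["detection_scores"],
                                    output["detection_classes"]):
        if score < min_score:
            continue
        ymin, xmin, ymax, xmax = box  # normalized to the range [0, 1]
        results.append({
            "class_id": int(class_id),
            "score": float(score),
            # (left, top, right, bottom) in pixels, ready for drawing.
            "box_px": (round(xmin * frame_width), round(ymin * frame_height),
                       round(xmax * frame_width), round(ymax * frame_height)),
        })
    return results


# Made-up model output for one 640x480 frame.
sample_output = {
    "detection_boxes": [[0.1, 0.2, 0.5, 0.6], [0.0, 0.0, 1.0, 1.0]],
    "detection_scores": [0.9, 0.3],
    "detection_classes": [1.0, 18.0],
}
detections = filter_detections(sample_output, frame_width=640, frame_height=480)
```

Only the first detection survives the threshold, and its box is now in pixel units that a drawing routine can use directly on the frame.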


Now, the inference speed is relatively low compared to other object detection models. For a complete comparison, you can find the performance metrics of other object detection models here. In the next article, before training our own custom object detection model, we will talk about how Faster R-CNN differs from the other detection models in speed and accuracy for real-time object detection.

5. What’s next?

As shown above, object detection is simple and straightforward nowadays, especially with built-in TensorFlow modules and countless online tutorials that provide step-by-step guides. Object detection also paves the way for more complex computer vision tasks, because sometimes we want to know exactly which pixels in the image make up the target object. Instance and semantic segmentation are both computer vision tasks that label each pixel instead of drawing bounding boxes around objects of interest. In a future article, we will see how we can first create a custom image dataset for the purpose of segmentation, then train a model to create pixel-wise masks that tell us which pixel belongs to which object.


Here, I have shown how in 4 steps we can essentially turn an image classification task into an object detection task. We also explored how Region of Interest (RoI) Pooling and a Region Proposal Network work, the ideas behind Fast R-CNN and Faster R-CNN respectively. However, Faster R-CNN is notoriously slow, hence we will explore different models that give faster results but with slightly lower accuracy. Stay tuned!


Comparison between R-CNN vs Fast R-CNN vs Faster R-CNN

Region Proposal Networks

Region of Interest (RoI) Pooling

TensorFlow Object Detection





Embedded Software Engineer and Indie Game Developer