How has object detection evolved through the years with Deep Learning?

Object detection is particularly useful in many realms, including people detection, surveillance, and cancer localization. However, object detection is harder than simple classification: in addition to assigning a label, you need to draw a precise box around each object of interest. Moreover, there might be many objects of interest, so the size of the output varies with each task and each image, whereas classification has a fixed number of outputs. Understanding how detection algorithms have evolved, the reasons that led to this evolution, and the means that were used to achieve it is paramount to keep improving both our knowledge of this field and the performance of the algorithms. This is what we will explore in this article.

By Théo Dupuis, Data Scientist at LittleBigCode

An early approach was to draw bounding boxes centered around each image pixel and to classify each bounding box separately using a CNN (Convolutional Neural Network). The problem then becomes a simple classification problem. However, each object in the image can have a different size and proportion; therefore, at each pixel, we need an impressive number of bounding boxes if we want to be sure not to miss any object.

Multiplying the bounding boxes means increasing the number of images to classify and the computation time. This “naïve” approach quickly becomes too slow to be usable in practice. Therefore, many strategies have been devised to find algorithms that are efficient in terms of performance, memory footprint, and computation time. Among them, we find R-CNN, which then gave birth to Fast R-CNN and Faster R-CNN.
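
To get a sense of the scale, here is a quick back-of-the-envelope sketch in Python (the image size, scales, and aspect ratios below are illustrative assumptions, not values from any particular paper):

```python
# Back-of-the-envelope count of candidate boxes in the "naive" approach,
# assuming one box per pixel for every (scale, aspect ratio) pair.
height, width = 600, 800            # a typical input image (assumption)
scales = [32, 64, 128, 256]         # hypothetical box sizes in pixels
aspect_ratios = [0.5, 1.0, 2.0]     # hypothetical width/height ratios

boxes_per_pixel = len(scales) * len(aspect_ratios)
total_boxes = height * width * boxes_per_pixel
print(f"{total_boxes:,} boxes to classify")  # 5,760,000 boxes for one image
```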

Example of lesion detection on chest images
Image from the chest-xray-images topic on GitHub

How does R-CNN work?

R-CNN is a region-based convolutional network. Instead of classifying boxes around every pixel, only a small subset of regions (around two thousand per image) goes through a deep ConvNet (another name for CNN) to get classified. R-CNN uses selective search to extract this small number of regions of interest, called region proposals.

Selective search works like a basic segmentation algorithm. It starts from an initial set of small regions and progressively groups them into bigger ones based on a similarity score between neighboring regions. We won’t elaborate much on selective search here; the main idea is to group regions in order to reduce the number of areas on which we run the CNN.
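
As an illustration, OpenCV’s contrib package ships a selective search implementation; a minimal sketch, assuming opencv-contrib-python is installed and the image filename is hypothetical, could look like this:

```python
import cv2

# Selective search as shipped in opencv-contrib-python (cv2.ximgproc).
image = cv2.imread("chest_xray.jpg")          # hypothetical input image

ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()              # trades some recall for speed

rects = ss.process()                          # (x, y, w, h) region proposals
print(f"{len(rects)} region proposals")       # typically a few thousand
proposals = rects[:2000]                      # R-CNN keeps ~2,000 of them
```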

Mechanism of R-CNN
Image from Analytics Vidhya

The training of the R-CNN

The training of R-CNN is still awfully expensive in terms of memory space and time. This is mainly because the training is performed in three steps:

  • First, R-CNN fits a ConvNet/CNN on the object proposals;
  • Then, R-CNN fits SVMs to classify the proposals;
  • Finally, R-CNN fits a regression on the bounding boxes’ edges to refine the detection.

Moreover, prediction is also slow (around 40 s per image), as R-CNN extracts features from each object proposal (around 2,000 of them) through a forward pass of a ConvNet, which takes time.
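
A rough sketch of this per-proposal loop makes the bottleneck visible (R-CNN used an AlexNet-style network; the torchvision model, the helper name, and the crop logic below are our stand-ins, not the original implementation):

```python
import torch
import torchvision.models as models
import torchvision.transforms.functional as F

# Stand-in feature extractor (R-CNN originally used an AlexNet-style CNN).
cnn = models.alexnet(weights="DEFAULT").features

def rcnn_features(image, proposals):
    """One ConvNet forward pass *per proposal* -- the R-CNN bottleneck."""
    feats = []
    for (x, y, w, h) in proposals:                # ~2,000 proposals per image
        crop = image[:, y:y + h, x:x + w]         # crop the region from the image
        crop = F.resize(crop, [227, 227])         # warp to the fixed input size
        with torch.no_grad():
            feats.append(cnn(crop.unsqueeze(0)))  # a full forward pass each time
    return feats                                  # then SVMs + bbox regression
```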

How to get a Fast R-CNN?

The high computation time of R-CNN is due to the thousands of region proposals going through the neural network for feature extraction. A simple way to counteract that is to swap the feature extraction phase and the region proposal phase. This way, the image goes through the ConvNet only once, and the region proposals are then made on the feature map directly.

Mechanism of Fast R-CNN
Image from Analytics Vidhya

However, swapping those two steps in the code is not as simple as it seems. The region proposal phase returns boxes of varied sizes and shapes, which then go through fully connected layers that accept only fixed-size inputs. Therefore, Ross Girshick devised a new layer, the RoI pooling layer, which acts as a max pooling layer that accepts regions of different sizes on the feature map and pools each of them into a fixed-size feature map.

Mechanism of RoI Pooling
Image from Medium
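
This operation is available out of the box in torchvision; here is a minimal sketch on a dummy feature map (the shapes and the 1/16 downsampling factor are illustrative assumptions):

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 256, 50, 50)     # (batch, channels, H, W) from the backbone

# Two RoIs of different sizes, given as (batch_index, x1, y1, x2, y2)
# in the coordinate system of the *input image*.
rois = torch.tensor([[0.,  10.,  10., 200., 150.],
                     [0.,  50.,  80., 400., 400.]])

# spatial_scale maps image coordinates onto the feature map
# (assuming the backbone downsamples by a factor of 16).
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -- a fixed size for any RoI
```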

In addition to that, Fast R-CNN replaces the SVM classification with a softmax layer, which reduces computation time while improving the performance of the algorithm. Overall, Fast R-CNN trains around 10 times faster than R-CNN and is far faster still at test time.

What about Faster R-CNN?

Following the improvements brought by Fast R-CNN, we almost reach real-time detection capability. The region proposal step has now become the bottleneck in terms of time efficiency.

Faster R-CNN uses a region proposal network (RPN) to get the RoIs instead of the traditional selective search. The idea here is to reuse most of the already-in-place network that outputs the feature map. Therefore, the region proposal network is a convolutional network that shares a certain number of convolutional layers with the backbone network. On top of these layers, a small network is added to output a set of rectangular object proposals, each with an objectness score.

Mechanism of RPN
Image from the official paper “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”

The RPN is a deep network that shares most of its layers with the backbone network. It then slides a window over the feature map and, at each position, evaluates k anchor boxes of distinct sizes and aspect ratios to identify objects of various shapes. Therefore, for each position of the sliding window, the RPN outputs 2k objectness scores (k being the number of anchors) and 4k coordinates for the associated bounding boxes.
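
A minimal PyTorch sketch of such an RPN head (the layer sizes follow the paper’s description for a VGG16 backbone; the module itself is our illustration):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sliding-window head: 2k objectness scores and 4k box deltas per position."""
    def __init__(self, in_channels=512, k=9):   # k = 3 scales x 3 aspect ratios
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(512, 2 * k, kernel_size=1)   # object vs. background
        self.reg = nn.Conv2d(512, 4 * k, kernel_size=1)   # (dx, dy, dw, dh) per anchor

    def forward(self, feature_map):
        h = torch.relu(self.conv(feature_map))
        return self.cls(h), self.reg(h)

scores, deltas = RPNHead()(torch.randn(1, 512, 50, 50))
print(scores.shape, deltas.shape)  # (1, 18, 50, 50) and (1, 36, 50, 50)
```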

Faster R-CNN works in 4 steps

  1. Pass the input through a ConvNet to get a feature map;
  2. Use a RPN on the feature map to get region proposals with an objectness score;
  3. Apply a RoI pooling layer (inherited from Fast R-CNN) to the region proposals to bring them down to the same size;
  4. Pass the pooled regions through fully connected layers, with a softmax layer to classify them and, in parallel, a linear regression layer to refine the bounding boxes.

Mechanism of Faster R-CNN
Image from Toward Data Science
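
The full pipeline is available off the shelf in torchvision; a minimal inference sketch (the pretrained model choice and the confidence threshold are our assumptions, not part of the original paper):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Pretrained Faster R-CNN (ResNet-50 + FPN backbone, trained on COCO).
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = torch.rand(3, 600, 800)              # stand-in for a real RGB image in [0, 1]
with torch.no_grad():
    prediction = model([image])[0]           # all 4 steps happen inside this call

keep = prediction["scores"] > 0.5            # arbitrary confidence threshold
print(prediction["boxes"][keep])             # refined bounding boxes
print(prediction["labels"][keep])            # predicted classes
```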

However, generating region proposals with a convolutional network is not as easy as it may seem. First, a new loss function must be defined to train the RPN. This loss function is divided into a classification loss and a bounding box loss, and it must not take all evaluated anchors into account.

In fact, anchors are considered to correspond to an object depending on their intersection-over-union (IoU) ratio with the ground truth objects. But most anchors won’t match any object, and considering all of them as negatives would create a class imbalance during training. Therefore, the loss function is a bit more intricate than first expected.
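
A sketch of this anchor labeling and balanced sampling, using torchvision’s IoU helper (the 0.7/0.3 thresholds follow the paper; the minibatch sizes are the commonly used defaults):

```python
import torch
from torchvision.ops import box_iou

def label_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """1 = object, 0 = background, -1 = ignored during training."""
    iou = box_iou(anchors, gt_boxes)              # (num_anchors, num_gt)
    best_iou, _ = iou.max(dim=1)                  # best overlap per anchor
    labels = torch.full((len(anchors),), -1, dtype=torch.long)
    labels[best_iou < neg_thr] = 0                # clear background
    labels[best_iou >= pos_thr] = 1               # clear positives
    labels[iou.argmax(dim=0)] = 1                 # best anchor for each ground truth
    return labels

def sample_minibatch(labels, batch_size=256, pos_fraction=0.5):
    """Keep at most 128 positives and fill the rest with negatives."""
    pos = torch.where(labels == 1)[0]
    neg = torch.where(labels == 0)[0]
    num_pos = min(len(pos), int(batch_size * pos_fraction))
    pos = pos[torch.randperm(len(pos))[:num_pos]]
    neg = neg[torch.randperm(len(neg))[:batch_size - num_pos]]
    return pos, neg   # only these anchors contribute to the loss
```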

Moreover, as the RPN and the backbone network share layers, the training of the common layers requires some attention: an RPN and a Fast R-CNN backbone trained independently would modify the shared convolutional layers in different ways.

Therefore, a 4-step training pipeline has been implemented:

  1. The RPN is trained alone;
  2. A separate Fast R-CNN (with no common layers) is trained using the region proposals of the RPN from step 1;
  3. The RPN is re-initialized with the layers of the Fast R-CNN from step 2; the common layers are frozen and only the layers unique to the RPN are fine-tuned;
  4. Finally, keeping the shared convolutional layers frozen, the layers unique to the Fast R-CNN are fine-tuned.
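
Freezing the shared layers in steps 3 and 4 boils down to disabling their gradients; here is a minimal PyTorch sketch, with small stand-in modules in place of the real networks:

```python
import torch
import torch.nn as nn

# Stand-ins for the real modules (illustrative shapes only).
shared_backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
rpn_head = nn.Conv2d(64, 2 * 9, 1)          # objectness head, as sketched earlier
detection_head = nn.Linear(64 * 7 * 7, 21)  # classifier over pooled RoIs

def freeze(module):
    """Exclude a module's parameters from gradient updates."""
    for p in module.parameters():
        p.requires_grad = False

# Step 3: freeze the shared conv layers, fine-tune only the RPN-specific layers.
freeze(shared_backbone)
rpn_optimizer = torch.optim.SGD(rpn_head.parameters(), lr=1e-3, momentum=0.9)

# Step 4: shared layers stay frozen, fine-tune the Fast R-CNN-specific layers.
det_optimizer = torch.optim.SGD(detection_head.parameters(), lr=1e-3, momentum=0.9)
```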

R-CNN at a glance

Train and Test time comparison between the different algorithms
Image from Toward Data Science

Conclusion

As we have seen, the task of detection has evolved through the years toward much faster and much more performant algorithms. From a naïve approach, where millions of areas of an image had to go through a deep network, to a more complex model able to generate regions of interest and evaluate their relevance, we have been able to divide the training time for this task by ten and the test time by a hundred, which is essential for many applications.

The evolution in the field does not stop there; many new and more performant models have been developed since Faster R-CNN. While algorithms such as YOLO enable almost instantaneous predictions (especially useful for autonomous cars, for example, where you need a real-time response), others such as RetinaNet and Retina U-Net have focused on increasing the accuracy of such algorithms (particularly useful in the medical realm, for example, where errors are not permitted). Research is still ongoing in this field, with new implementations, new architectures, loss functions, learning rate schedulers, etc. developed every week.

Consult all the articles of LittleBigCode by clicking here: https://medium.com/hub-by-littlebigcode

Follow us on Linkedin & Youtube + https://LittleBigCode.fr/en
