Evolution of Object Detection

Chinmoy Borah
Published in Analytics Vidhya
Nov 1, 2020

Computer vision has advanced considerably, but it still falls short of the precision of human perception. Even so, it is worth seeing how far we have come. Object detection, the task of detecting and recognizing an unknown number of individual objects within an image, was considered an extremely difficult problem only a few years ago; today it is not only feasible but has been productized by companies like Google and IBM.

In this story we will review the history of object detection, from the “traditional object detection period (before 2014)” to the “deep learning based detection period (after 2014)”.

Traditional Object Detection era:

Turning the clock back 20 years, we would witness “the wisdom of the cold weapon era”. Due to the lack of effective image representations at that time, most of the early object detection algorithms were built on handcrafted features.

Viola-Jones Detector:

Developed in 2001 by Paul Viola and Michael Jones, this object detection framework allows the detection of human faces in real time. It uses sliding windows to go through all possible locations and scales in an image to see whether any window contains a human face. The sliding windows essentially search for ‘Haar-like’ features (named after Alfred Haar, who developed the concept of Haar wavelets).

(Figure: Haar-like features.)

Thus the Haar wavelet is used as the feature representation of an image. To speed up detection, it uses the integral image, which makes the computational complexity of each sliding window independent of its window size. Another trick the authors used to improve detection speed is the AdaBoost algorithm for feature selection, which selects a small set of features that are most helpful for face detection from a huge pool of random features. The algorithm also uses detection cascades, a multi-stage detection paradigm that reduces computational overhead by spending less computation on background windows and more on face targets.
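
To see this in practice, here is a minimal sketch of face detection with the pretrained Haar cascade that ships with OpenCV (the image path photo.jpg is a placeholder):

# Face detection with a pretrained Haar cascade (Viola-Jones style).
import cv2

# Load the frontal-face cascade bundled with opencv-python.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("photo.jpg")                     # placeholder input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# detectMultiScale slides windows over all locations and scales,
# evaluating the cascade of Haar-feature stages at each window.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:                        # draw the detections
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)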

HOG Detector:

Originally proposed in 2005 by N. Dalal and B. Triggs, HOG (Histogram of Oriented Gradients) is an improvement on the scale-invariant feature transform and shape contexts of its time. HOG works with blocks (similar to a sliding window): dense pixel grids in which gradients are computed from the magnitude and direction of change in the intensities of the pixels within the block. HOGs are widely known for their use in pedestrian detection. To detect objects of different sizes, the HOG detector rescales the input image multiple times while keeping the size of the detection window unchanged.
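
As an illustration, here is a minimal sketch of pedestrian detection with OpenCV’s built-in HOG descriptor and its default linear SVM people detector, in the spirit of the Dalal-Triggs method (people.jpg is a placeholder path):

# Pedestrian detection with HOG features + a pretrained linear SVM.
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

img = cv2.imread("people.jpg")                    # placeholder input image
# detectMultiScale rescales the image internally while keeping the
# 64x128 detection window fixed, exactly as described above.
rects, weights = hog.detectMultiScale(img, winStride=(8, 8), scale=1.05)

for (x, y, w, h) in rects:                        # draw the detections
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 0, 255), 2)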

Deformable Part-based Model (DPM):

DPM was originally proposed by P. Felzenszwalb in 2008 as an extension of the HOG detector; later, a variety of improvements were made by R. Girshick. DPM follows a ‘divide and conquer’ strategy: the problem of detecting a “car”, for example, can be broken down into detecting its window, body, and wheels. The training process involves learning a proper way of decomposing an object, and inference involves ensembling the detections of the different object parts.

A DPM detector consists of a root filter and a number of part filters. A weakly supervised learning method is developed in DPM, where all configurations of the part filters (size, location, etc.) can be automatically learned as latent variables. To improve detection accuracy, R. Girshick formulated this as a special case of multi-instance learning, and applied other important techniques such as “hard negative mining”, “bounding box regression”, and “context priming”. He later used a cascade architecture, which achieved over a 10x speedup without sacrificing accuracy.
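
To make the part-based scoring idea concrete, here is a toy, brute-force sketch (not the real implementation, which uses generalized distance transforms for speed): the score at each root location combines the root-filter response with each part’s best response minus a quadratic deformation cost.

# Toy DPM-style scoring: root response + best displaced part responses.
import numpy as np

def dpm_score(root_resp, part_resps, anchors, defo_weight=0.1):
    """root_resp: 2D root-filter response map.
    part_resps: list of 2D part-filter response maps.
    anchors: list of (dy, dx) ideal part offsets relative to the root."""
    H, W = root_resp.shape
    score = root_resp.copy()
    ys, xs = np.mgrid[0:H, 0:W]
    for resp, (ay, ax) in zip(part_resps, anchors):
        best = np.full((H, W), -np.inf)
        for py in range(H):
            for px in range(W):
                # Quadratic penalty for placing the part at (py, px)
                # instead of at its anchor relative to each root location.
                cost = defo_weight * ((ys + ay - py) ** 2 + (xs + ax - px) ** 2)
                best = np.maximum(best, resp[py, px] - cost)
        score += best        # each part contributes its best placement
    return score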

Deep Learning era:

Unfortunately, object detection reached a plateau after 2010 as the performance of hand-crafted features became saturated. However, in 2012 the world saw the rebirth of convolutional neural networks, and deep convolutional networks proved successful at learning robust, high-level feature representations of an image. The deadlock in object detection was broken in 2014 with the proposal of Regions with CNN features (RCNN). In this deep learning era, object detectors are grouped into two genres: “two-stage detection” and “one-stage detection”.

RCNN:

RCNN starts with the extraction of a set of object proposals (object candidate boxes) by selective search. Each proposal is then rescaled to a fixed-size image and fed into a pre-trained CNN model to extract features. Finally, linear SVM classifiers are used to predict the presence of an object within each region and to recognize object categories.
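
A rough sketch of this pipeline, assuming recent opencv-contrib-python and torchvision builds (the original used a Caffe-based AlexNet; the image path and feature extractor here are illustrative):

# R-CNN-style pipeline: selective-search proposals -> warped crops ->
# CNN features; per-class linear SVMs would then score each feature.
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
img = cv2.imread("image.jpg")                     # placeholder input image
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()
proposals = ss.process()[:2000]                   # (x, y, w, h) boxes

cnn = models.alexnet(weights="DEFAULT")
cnn.classifier = cnn.classifier[:-1]              # drop the final FC layer
cnn.eval()
to_input = T.Compose([T.ToTensor(), T.Resize((224, 224))])

features = []
with torch.no_grad():
    for (x, y, w, h) in proposals:
        crop = cv2.cvtColor(img[y:y + h, x:x + w], cv2.COLOR_BGR2RGB)
        features.append(cnn(to_input(crop).unsqueeze(0)))  # 4096-d vector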

Although RCNN does much better than traditional methods, it has several drawbacks. The redundant feature computations on a large number of overlapping proposals (over 2000 boxes from one image) lead to an extremely slow detection speed. Also, selective search is a fixed algorithm, so no learning happens at that stage, which can lead to the generation of bad candidate region proposals.

SPPNet:

In 2014, K. He et al. proposed Spatial Pyramid Pooling Networks. Conventionally, at the transition from the convolutional layers to the fully connected layers there is a single pooling layer, or even no pooling layer at all; SPPNet instead uses multiple pooling layers with different scales. Previous CNN models also required a fixed-size input. The Spatial Pyramid Pooling (SPP) layer in SPPNet enables a CNN to generate a fixed-length representation regardless of the size of the image or region of interest, without rescaling it.
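
An SPP layer is easy to express with adaptive pooling. Here is a minimal PyTorch sketch in which 1x1, 2x2 and 4x4 pooling grids are concatenated into one fixed-length vector, whatever the input size:

# Spatial pyramid pooling: multi-scale adaptive max pooling, concatenated.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramidPooling(nn.Module):
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.levels = levels              # 1x1, 2x2 and 4x4 pooling grids

    def forward(self, x):                 # x: (N, C, H, W), any H and W
        pooled = [F.adaptive_max_pool2d(x, level).flatten(1)
                  for level in self.levels]
        return torch.cat(pooled, dim=1)   # length C * (1 + 4 + 16), fixed

spp = SpatialPyramidPooling()
print(spp(torch.randn(1, 256, 13, 13)).shape)  # torch.Size([1, 5376])
print(spp(torch.randn(1, 256, 21, 30)).shape)  # torch.Size([1, 5376])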

The process works as follows. The input image goes through the convolutional network only once. Selective search is used to generate region proposals, just as in R-CNN. At the last convolutional layer, the feature maps bounded by each region proposal go through the SPP layer and then the FC layers.

Compared with R-CNN, SPPNet processes the image through the conv layers only once, while R-CNN runs the conv layers as many times as there are region proposals. Its drawbacks: training is still multi-stage, and SPPNet only fine-tunes its fully connected layers while simply ignoring all previous layers.

Fast RCNN:

Compared to an R-CNN model, a Fast R-CNN model uses the entire image as the CNN input for feature extraction, rather than each proposed region. Selective search is applied to the image; suppose it generates n proposed regions, whose different shapes indicate regions of interest (RoIs) of different shapes. Fast R-CNN introduces RoI pooling, which takes the CNN output and the RoIs as input and outputs a concatenation of the features extracted from each proposed region, which is then fed into fully connected layers. During category prediction, the fully connected layer output is transformed to shape n×q and softmax regression is used (q is the number of categories and n is the number of proposed regions). During bounding box prediction, the fully connected layer output is transformed to shape n×4. In other words, we predict the category and bounding box for each proposed region.
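
Modern libraries ship RoI pooling as a ready-made op. Here is a minimal sketch using torchvision’s roi_pool, with illustrative shapes and a spatial scale of 1/16 (typical for a VGG16 conv5 feature map):

# RoI pooling: every region, whatever its shape, becomes a fixed 7x7 grid.
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 50, 50)          # CNN output, one image
rois = torch.tensor([[0, 64., 64., 256., 320.],    # (batch_idx, x1, y1, x2, y2)
                     [0, 128., 32., 400., 400.]])  # in image coordinates

pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)                                # torch.Size([2, 512, 7, 7])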

The reason “Fast R-CNN” is faster than R-CNN is because we don’t have to feed all region proposals to the convolutional neural network every time. Instead, the convolution operation is done only once per image and a feature map is generated from it.

Although Fast R-CNN successfully integrates the advantages of R-CNN and SPPNet, its detection speed is still limited by proposal detection.

Faster RCNN:

In 2015, S. Ren et al. proposed the Faster RCNN detector, shortly after Fast RCNN. It is the first end-to-end and the first near-real-time deep learning object detector. All of the above algorithms (R-CNN, SPPNet and Fast R-CNN) use selective search to find the region proposals, which is a slow and time-consuming process that limits the performance of the network. Faster R-CNN eliminates the selective search algorithm and lets the network learn the region proposals.

Similar to Fast R-CNN, the image is provided as input to a convolutional network, which produces a convolutional feature map. Instead of running the selective search algorithm on the feature map to identify the region proposals, a separate network is used to predict them. The predicted region proposals are then reshaped using an RoI pooling layer, which is then used to classify the image within each proposed region and predict the offset values for the bounding boxes.

As a part of the Faster R-CNN model, the region proposal network is trained together with the rest of the model. In addition, the Faster R-CNN objective function includes the category and bounding box predictions in object detection, as well as the category and bounding box predictions for the anchor boxes in the region proposal network. Finally, the region proposal network can learn how to generate high-quality proposed regions, which reduces the number of proposed regions while maintaining the precision of object detection.
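
For reference, running a pretrained Faster R-CNN today takes only a few lines with torchvision (assumes torchvision 0.13+; street.jpg is a placeholder path):

# Inference with a pretrained Faster R-CNN (ResNet-50 + FPN backbone).
import torch
from torchvision.io import read_image
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import convert_image_dtype

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
img = convert_image_dtype(read_image("street.jpg"), torch.float)

with torch.no_grad():
    outputs = model([img])                # one dict per input image
print(outputs[0]["boxes"], outputs[0]["labels"], outputs[0]["scores"])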

Although Faster RCNN breaks through the speed bottleneck of Fast RCNN, there is still computational redundancy at the subsequent detection stage. Later, a variety of improvements were proposed, including R-FCN and Light-Head RCNN.

Feature Pyramid Networks (FPN):

In 2017, T.-Y. Lin et al. proposed Feature Pyramid Networks. If we dig into Faster RCNN, we see that it mostly fails to catch small objects in an image. To address this, a simple image pyramid can be used to scale the image to different sizes and feed each scale to the network; the predictions from each scale can then be combined using various methods.

Before FPN, most deep learning based detectors ran detection only on a network’s top layer. Although the features in the deeper layers of a CNN are beneficial for category recognition, they are not conducive to localizing objects. In FPN, a top-down architecture with lateral connections is developed to build high-level semantics at all scales. Since a CNN naturally forms a feature pyramid through its forward propagation, FPN shows great advances for detecting objects across a wide variety of scales.
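
Here is a minimal PyTorch sketch of that top-down pathway with lateral connections, assuming ResNet-like backbone maps c3, c4, c5 (channel sizes are illustrative):

# FPN top-down pathway: 1x1 lateral convs, upsample-and-add, 3x3 smoothing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        self.smooths = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
            for _ in in_channels)

    def forward(self, c3, c4, c5):        # each map half the size of the last
        p5 = self.laterals[2](c5)
        p4 = self.laterals[1](c4) + F.interpolate(p5, scale_factor=2)
        p3 = self.laterals[0](c3) + F.interpolate(p4, scale_factor=2)
        # Detection heads run on every level, so each scale carries
        # high-level semantics.
        return [s(p) for s, p in zip(self.smooths, (p3, p4, p5))]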

FPN has now become a basic building block of many of the latest detectors.

You Only Look Once (YOLO):

All of the previous object detection algorithms use regions to localize the object within the image. The network does not look at the complete image; instead, it looks at the parts of the image that have high probabilities of containing the object.

YOLO trains on full images and directly optimizes detection performance. With YOLO, a single CNN simultaneously predicts multiple bounding boxes and class probabilities for those boxes; it predicts all bounding boxes across all classes for an image in one pass. It divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object. Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and how accurate it thinks the predicted box is.

Each bounding box is represented by (pc, bx, by, bh, bw, c), where pc is the confidence of an object being present in the bounding box; bx, by, bh, bw represent the bounding box itself; and c is a vector containing the class probabilities. We also define anchor boxes by exploring the training data to choose reasonable height/width ratios that represent the different classes; each grid cell can have multiple anchor boxes. For each anchor box (of each grid cell) we compute the element-wise product pc*c[i] to obtain a probability score of the box containing a certain class. The class with the maximum score is assigned to the anchor box, along with the score itself.

We can then get rid of low-scoring boxes by thresholding the probability score. However, we still get a lot of boxes, so we apply non-maximum suppression (NMS), keeping only one box when several overlapping boxes detect the same object.
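
A minimal sketch of this post-processing in PyTorch, with illustrative tensor shapes and torchvision’s class-aware batched_nms:

# YOLO-style post-processing: score = pc * class probs, threshold, then NMS.
import torch
from torchvision.ops import batched_nms

def filter_and_nms(boxes, pc, class_probs, score_thresh=0.6, iou_thresh=0.5):
    """boxes: (N, 4) as (x1, y1, x2, y2); pc: (N,); class_probs: (N, C)."""
    scores, classes = (pc.unsqueeze(1) * class_probs).max(dim=1)
    keep = scores > score_thresh                  # drop low-confidence boxes
    boxes, scores, classes = boxes[keep], scores[keep], classes[keep]
    keep = batched_nms(boxes, scores, classes, iou_thresh)  # per-class NMS
    return boxes[keep], scores[keep], classes[keep]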

In spite of its great improvement in detection speed, YOLO suffers a drop in localization accuracy compared with two-stage detectors, especially for small objects. YOLO’s subsequent versions (YOLOv2, YOLOv3 and the latest YOLOv4) and the later-proposed SSD (Single Shot MultiBox Detector) have paid more attention to this problem.

Single Shot MultiBox Detector (SSD):

SSD was proposed by W. Liu et al. in 2015 (with C. Szegedy among the co-authors), and the paper SSD: Single Shot MultiBox Detector set new records in performance and precision for object detection tasks. It is a one-stage object detector, just like YOLO. The main contribution of SSD is the introduction of multi-reference and multi-resolution detection techniques, which significantly improve the detection accuracy of a one-stage detector, especially for small objects. Released after YOLO and Faster RCNN, SSD achieves 74.3% mAP at 59 FPS for 300x300 input images; this network is called SSD300. Similarly, SSD512 achieves 76.9% mAP, surpassing Faster R-CNN’s results.

It starts with a base network to extract feature maps: a standard pretrained classification network, truncated before any classification layers. In the paper, the authors used the VGG16 network; other networks like VGG19 and ResNet can be used and should produce good results. Multi-scale feature layers, a series of convolutional filters, are then added after the base network. These layers decrease in size progressively, allowing predictions of detections at multiple scales. Finally, non-maximum suppression is used to eliminate overlapping boxes, keeping only one box for each detected object.
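
A minimal PyTorch sketch of the multi-scale idea (channel and box counts are illustrative, not the paper’s exact configuration):

# SSD-style multi-scale heads: stride-2 convs shrink the map step by step,
# and a small predictor conv runs on every scale.
import torch
import torch.nn as nn

class MultiScaleHead(nn.Module):
    def __init__(self, channels=256, num_classes=21, boxes_per_cell=4):
        super().__init__()
        self.extra = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1)
            for _ in range(4))                    # each conv halves the map
        out = boxes_per_cell * (4 + num_classes)  # offsets + class scores
        self.pred = nn.ModuleList(
            nn.Conv2d(channels, out, 3, padding=1) for _ in range(5))

    def forward(self, x):                     # x: base-network feature map
        preds, feat = [self.pred[0](x)], x
        for extra, pred in zip(self.extra, self.pred[1:]):
            feat = extra(feat)                # smaller map -> larger objects
            preds.append(pred(feat))
        return preds                          # detections at five scales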

RetinaNet:

It was discovered that there is an extreme class imbalance problem during the training of dense one-stage detectors, and this is believed to be the central cause of why the performance of one-stage detectors, despite their high speed and simplicity, is inferior to that of two-stage detectors. A new loss function named “focal loss” was introduced in RetinaNet, in which “easy” negative samples contribute a lower loss so that the detector puts more focus on hard, misclassified examples during training. Focal loss enables one-stage detectors to achieve accuracy comparable to two-stage detectors while maintaining a very high detection speed.
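
Here is a minimal sketch of the focal loss for per-anchor binary classification (torchvision also ships a ready-made torchvision.ops.sigmoid_focal_loss):

# Focal loss: the (1 - pt)^gamma factor shrinks the loss of easy examples.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits and targets have the same shape; targets are 0.0 or 1.0."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    pt = torch.where(targets == 1, p, 1 - p)        # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - pt) ** gamma * ce).sum()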

With ResNet+FPN as the backbone for feature extraction, plus two task-specific subnetworks for classification and bounding box regression, RetinaNet achieves state-of-the-art performance and outperforms well-known two-stage detectors like Faster R-CNN.

That’s all for now.

We have briefly discussed the evolution of object detection, a very challenging, highly complex and rapidly evolving domain in computer vision. Every year, new algorithms keep outperforming the previous ones, and today there is a plethora of pre-trained models for object detection. Object detection has found applications in many interesting fields, including object tracking (like tracking the ball during a football World Cup match), automated CCTV surveillance, person detection (used in intelligent video surveillance frameworks), and vehicle detection. It is also central to autonomous driving, one of the most interesting and highly anticipated innovations of the modern era.

References:

  1. Object Detection in 20 Years: A Survey → https://arxiv.org/pdf/1905.05055.pdf
  2. RCNN → https://arxiv.org/pdf/1311.2524.pdf
  3. Fast RCNN → https://arxiv.org/pdf/1504.08083.pdf
  4. Faster RCNN → https://arxiv.org/pdf/1506.01497.pdf
  5. YOLO → Coursera Deep Learning Specialization by Andrew Ng.
  6. HOG → https://www.learnopencv.com/histogram-of-oriented-gradients/
  7. SSD → https://arxiv.org/abs/1512.02325
  8. FPN → https://jonathan-hui.medium.com/understanding-feature-pyramid-networks-for-object-detection-fpn-45b227b9106c
  9. SPPNet → https://arxiv.org/abs/1406.4729
