Object Detection With Deep Learning: RCNN, Anchors, Non-Maximum Suppression

Uri Almog · Published in The Startup · Oct 3, 2020 · 10 min read

Object detection in images is a common task in computer vision. Its use cases range from missile guidance to automated production lines, console games and cleaning robots. Object detection algorithms existed in computer vision long before the field's migration to deep learning:

In their 2001 paper Rapid Object Detection Using a Boosted Cascade of Simple Features, Viola and Jones present a method that slides rectangular hand-crafted filters over the image to extract features correlated with facial features. The extracted features are fed to a classifier which learns which features are more significant than others, but does not modify the filters themselves. In today's jargon this task would be termed 'image classification', because it lacks a localization prediction for the object in the image, but it demonstrates how the process works. In principle, you could split the original image into many patches and run the proposed classifier on each one of them, localizing the object that way (though you would have to make many assumptions about the size of objects in images).

Sliding hand-crafted filters to extract facial features. Source: Rapid Object Detection using a Boosted Cascade of Simple Features, Viola and Jones, 2001

Advances in both training-data availability and compute power turned attention to implementing object detection with neural networks. The most remarkable advantage of neural networks over rule-based algorithms is their ability to learn both the filters that represent complex attributes and the coefficients for mixing them, for best performance.

Before we dive into the first implementations of object detection with deep learning, let’s formulate the problem we’re trying to solve:

Given an image with an object contained in a box GT (short for Ground Truth) we would like our model to predict a bounding box BB such that:

IoU(BB, GT) > threshold

Intersection over Union (IoU): a — IoU = 0.16 (poor), b — IoU = 0.9 (good)

That task is called localization. Most detectors are also required to predict the class of the object in the bounding box. That task is called classification.
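As a concrete reference, here is a minimal sketch of computing IoU between two axis-aligned boxes in Python (the (x1, y1, x2, y2) corner format is my assumption, not something fixed by the article):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap at all.
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A prediction counts as a detection when iou(BB, GT) > threshold (0.5 is a common choice).
print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.14, a poor match
```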

How are these tasks performed jointly?

It is assumed you are familiar with the basic concepts of object classification, convolution layers and training, so we will only briefly state the main attributes of classification networks:

The input to a classification network is assumed to contain a single object. The classifier, except for its last two layers, consists of a structure called a Feature Extractor (F.E.). There are many variations of the F.E. architecture (e.g. ResNet, DenseNet, MobileNet etc.), but basically an F.E. is a sequence of convolution layers with activation functions producing multiple feature maps that flow down the network, with an occasional spatial compression operation called max pooling (to keep the amount of computation reasonable as the number of feature maps grows). During training the F.E. learns to represent significant attributes in different feature maps. While the first layers are limited to representing only simple attributes like edges and simple shapes, the network learns to combine them in complex ways, so that the final features of the F.E. can represent high-level attributes such as ‘this area has metallic texture’, ‘this area has wheels’, ‘this area has pointy ears’, ‘this area has furry texture’ etc.

The output of the feature extractor still preserves some spatial information, although at low resolution. The final segment of the classifier takes the final feature maps and replaces each map with its spatial-average value (an operation called global average-pool). This operation destroys the spatial information and gives the average intensity of each attribute in the original image, regardless of location. The last layer is a fully-connected (F.C.) layer that learns the best coefficients for linearly combining the attribute intensities in a way that predicts the object class. For example, the coefficients the classifier will learn for combining the ‘metallic texture’ and ‘this area has wheels’ features will be such that a strong intensity in both features will result in the class prediction: ‘car’.

ResNet-18 classifier architecture. Note the global average pool preceding the fully connected layer on the far right. Everything that comes before that is the feature extractor. Source: Deep Residual Learning for Image Recognition
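To make the last two stages concrete, here is a minimal sketch of a classifier in PyTorch (my choice of framework, not the article's): a stub feature extractor, a global average pool that collapses each feature map to its average intensity, and a fully connected layer that linearly mixes those intensities into class scores.

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self, num_classes=1000, num_features=512):
        super().__init__()
        # Stand-in for a real feature extractor (ResNet, DenseNet, MobileNet, ...).
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(3, num_features, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        self.global_pool = nn.AdaptiveAvgPool2d(1)  # (N, C, H, W) -> (N, C, 1, 1)
        self.fc = nn.Linear(num_features, num_classes)

    def forward(self, x):
        features = self.feature_extractor(x)            # feature maps of attribute intensities
        pooled = self.global_pool(features).flatten(1)  # average intensity per attribute map
        return self.fc(pooled)                          # learned linear mix -> class scores

logits = TinyClassifier()(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```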

Detection Networks

RCNN

One of the earliest models that addressed object detection using deep learning is the RCNN. Introduced in 2014 by Girshick et al., RCNN (Regions with CNN features) divides this task into several modules, each in charge of a simpler sub-task:

  1. The first module, which is non-neural, is a Selective Search algorithm (presented by Uijlings et al. in 2012). This non-trainable algorithm scans the image and generates crops that are likely to contain an object, based on multiple hand-crafted attributes. The algorithm assigns each of these Regions of Interest (RoIs) an “objectness score”, corresponding to the confidence that the patch indeed contains an object with a high enough IoU. These RoIs already localize the object by definition.
  2. A feature extractor (which was harvested from an already trained classification network) is then fed with the resized RoI patches to generate classification information for each patch separately.
  3. A classification head predicts the class probabilities for the object in each RoI (a rough sketch of this flow follows the figure below).
RCNN (Regression head not shown). Source: Rich feature hierarchies for accurate object detection and semantic segmentation
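Here is a rough sketch of the RCNN-style inference flow described in the list above; all of the callables (selective_search, crop_and_resize, feature_extractor, classifier_head) are hypothetical placeholders, and the point to notice is that every RoI makes its own pass through the feature extractor:

```python
def rcnn_detect(image, selective_search, crop_and_resize,
                feature_extractor, classifier_head, input_size=(224, 224)):
    """Hypothetical RCNN-style inference: one feature-extractor pass per RoI."""
    detections = []
    for roi in selective_search(image):                  # ~2000 class-agnostic proposals
        patch = crop_and_resize(image, roi, input_size)  # warp each RoI to a fixed size
        features = feature_extractor(patch)              # repeated work for overlapping RoIs
        class_probs = classifier_head(features)          # per-RoI classification
        detections.append((roi, class_probs))
    return detections
```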

At the time of its publication, RCNN was state-of-the-art in object detection metrics and scored higher than any existing (non-CNN-based) method for object detection. However, it had several drawbacks:

  1. The region proposal mechanism was non-trainable.
  2. The selective search produced thousands of highly overlapping RoIs for each image, so each pixel passed through the feature extractor thousands of times per image, repeating the same computations. The result was a very long processing time: a single image took 47 seconds to process.

Girshick went on to improve the model with the Fast-RCNN and later collaborated with more researchers to develop the Faster-RCNN.

Fast-RCNN

Fast-RCNN. Source: https://arxiv.org/pdf/1504.08083.pdf

Fast-RCNN also starts with a non-trainable algorithm that generates object proposals, but its main achievement is that the image passes only once through the feature extractor. The patches corresponding to the RoIs are cropped from the F.E. output and fed into two parallel branches: the first is a classifier head, containing several convolutions followed by a global average pool and a fully connected layer; the second is a small network, trained with a regression loss, that predicts the scale and center of the object contained in that RoI. In a sense, this branch fine-tunes the coordinates proposed by the original selective search. This clever way of avoiding multiple passes through the F.E. gained a speedup of about 150x (0.32 seconds per image) compared with RCNN. Note that there is still some inefficiency, since the overlapping pixels in the F.E. output are processed from scratch for each RoI, but it is nevertheless a major improvement.
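As a sketch of that cropping step, here is a simplified NumPy version of mapping a RoI from image coordinates onto the feature-extractor output and max-pooling it to a fixed grid (the stride of 16 and the 7x7 output grid are assumptions; the paper's RoI pooling layer handles the details more carefully):

```python
import numpy as np

def roi_pool(feature_map, roi, stride=16, output_size=(7, 7)):
    """Crop a RoI from a (C, H, W) feature map and max-pool it to a fixed grid.

    roi is (x1, y1, x2, y2) in input-image pixels; stride is the total
    downsampling factor of the feature extractor (e.g. 16 for a VGG-style F.E.).
    """
    x1, y1, x2, y2 = [int(round(c / stride)) for c in roi]  # project RoI onto the feature map
    crop = feature_map[:, y1:max(y2, y1 + 1), x1:max(x2, x1 + 1)]
    c, h, w = crop.shape
    out = np.zeros((c, *output_size), dtype=feature_map.dtype)
    # Divide the crop into an output_size grid and take the max of each cell.
    ys = np.linspace(0, h, output_size[0] + 1).astype(int)
    xs = np.linspace(0, w, output_size[1] + 1).astype(int)
    for i in range(output_size[0]):
        for j in range(output_size[1]):
            cell = crop[:, ys[i]:max(ys[i + 1], ys[i] + 1), xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = cell.max(axis=(1, 2))
    return out

pooled = roi_pool(np.random.rand(512, 32, 32), roi=(40, 60, 200, 180))
print(pooled.shape)  # (512, 7, 7)
```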

We won’t elaborate on loss mechanisms, but I would like to note that the box regression process is philosophically different from classification in at least one sense. In classification training we feed the model 1-hot GT vectors and it learns to reproduce them as best it can. In the box regression task, where the object size is, say, (30, 55) pixels, the values (30, 55) appear nowhere in the ground truth, and yet the network learns to produce them in order to achieve a high IoU with the ground truth object.

Faster-RCNN

At the heart of the Faster-RCNN is the understanding that the representation power of the feature extractor is strong enough that an external RoI generator is not required. The entire image is fed once to a feature extractor, and its output diverges into a classification head and a Region Proposal Network (RPN), which generates proposals and objectness scores. This time the box regression head receives the F.E. output for the entire image, not crops corresponding to RoIs. Non-neural code loops over the proposals: for each proposal, the F.E. output corresponding to that patch is sent to a classifier.

Faster RCNN. Source: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
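A minimal sketch of an RPN-style head in PyTorch (the layer sizes and the single-score-per-anchor objectness output are assumptions; the full Faster-RCNN adds anchor decoding, proposal sampling and NMS on top): a shared 3x3 convolution over the feature map, followed by two 1x1 convolutions that output, for every spatial cell and every anchor, an objectness score and four box-regression values.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.shared = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.objectness = nn.Conv2d(in_channels, num_anchors, kernel_size=1)      # 1 score per anchor
        self.box_deltas = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)  # 4 offsets per anchor

    def forward(self, feature_map):
        x = torch.relu(self.shared(feature_map))
        return self.objectness(x), self.box_deltas(x)

scores, deltas = RPNHead()(torch.randn(1, 512, 32, 32))
print(scores.shape, deltas.shape)  # (1, 9, 32, 32) (1, 36, 32, 32)
```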

Comparing results of different architectures is a tricky business: many test sets have been developed and updated over the years for detection challenges, and different researchers may quote their model's results on different datasets. We can fairly compare models by using an up-to-date framework that supports the training and testing of various models. One such framework is GluonCV (developed by Amazon). Its Detection Model Zoo contains many models that were trained and tested under comparable conditions.

Detector performance chart. Source: GluonCV

As one can see, Faster RCNN models (with varying feature extractors) achieve the highest mean Average Precision. However, these architectures are extremely slow. Other architectures, like YOLO and CenterNet, also achieve high accuracy but are much faster. The SSD model family (using VGG and MobileNet feature extractors) is superseded by these models.

Anchors

Now we are ready to discuss the mechanism that actually performs the bounding box prediction. The most common method uses anchors: priors, or predefined basic box predictions.

Each output cell in the box-regression head of a detector is assigned several (usually 3 or 5) anchors differing in shape. An anchor is a (W, H) box (W and H differ between the anchors) centered at the center of the output cell.

The model learns the necessary transformations (offset from pixel center, width and height scaling relative to anchor) for those anchors in order to capture objects successfully.

Illustration of anchors. Each anchor has a different shape and results in a separate set of output activations for the bounding box and the class distribution. In the feature map, the 3 colors indicate 3 anchors, each containing 4 features: 2 for offsets and 2 for scales. A separate output layer (not shown here) gives the class confidence distribution associated with each of the anchors.
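Here is a minimal sketch of the decoding transformation mentioned above (the exact parameterization varies between detectors; the exponential width/height scaling below follows the common Faster-RCNN/YOLO-style convention and is an assumption here): the network predicts a small offset of the box center from the cell center, plus log-scale factors for width and height relative to the anchor.

```python
import numpy as np

def decode_box(anchor_wh, cell_center, prediction):
    """Turn a raw (tx, ty, tw, th) prediction into an absolute (cx, cy, w, h) box.

    anchor_wh   : (W, H) of the anchor assigned to this output cell
    cell_center : (x, y) of the cell center in image pixels
    prediction  : raw network outputs (tx, ty, tw, th) for this anchor
    """
    tx, ty, tw, th = prediction
    w_a, h_a = anchor_wh
    cx = cell_center[0] + tx * w_a   # shift the center, scaled by anchor size
    cy = cell_center[1] + ty * h_a
    w = w_a * np.exp(tw)             # widen or narrow the anchor
    h = h_a * np.exp(th)
    return cx, cy, w, h

# A wide anchor (96x48) at cell center (112, 112), lightly adjusted by the network:
print(decode_box((96, 48), (112, 112), (0.1, -0.05, 0.2, 0.1)))
```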

The exact sizes and aspect ratios of the anchors are hyperparameters of the model, and can be determined by a statistical analysis of the object shape distribution in the training data (via a k-means analysis, for example).
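A sketch of that kind of analysis, assuming the ground-truth boxes are available as (width, height) pairs (random data stands in for them here) and using scikit-learn's KMeans; note that YOLO-style implementations often cluster with a 1 - IoU distance instead of plain Euclidean distance:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the (width, height) of every ground-truth box in the training set.
box_shapes = np.abs(np.random.randn(5000, 2)) * np.array([60.0, 40.0]) + 20.0

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(box_shapes)
anchors = kmeans.cluster_centers_   # 5 (W, H) anchor shapes for the detector
print(np.round(anchors, 1))
```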

During training the detector learns that several features that are spatially close to each other should result in a box that encompasses them (rewarding the network with a low loss). Several anchor geometries are used simultaneously to help the network capture objects with different aspect ratios, e.g. a short and wide anchor will likely learn to capture a car more easily than a tall and slim anchor.

Using N anchors enables the network to predict up to N objects centered in the same pixel (e.g. a cat napping on a couch) without having to compromise and predict a cat-couch class mix.

Cat and Couch objects centered at the same pixel. Original photo by Steve Mitchell on Unsplash

One drawback of using anchors is that each object gives rise to multiple bounding boxes from which we need to pick the best-fitting one.

An even more severe drawback is that since each spatial output cell is forced to predict bounding boxes even if the confidence is extremely low, we end up with hundreds or thousands of bounding boxes per image, by design.

For example, assume an 80-class detector with a 224x224 input and a 7x7 output grid, each cell having 3 anchors. Then each image will result in 7x7x3x80 = 11,760 bounding boxes (there is a bounding box copy per class; see the next section for the reason).

The commonly used method to sift through the bounding boxes and pick only the interesting ones is called NMS.

NMS — Non-Maximum Suppression

Object detection with multiple proposals per object. Only the green boxes are desired as actual detections. Source: Learning non-maximum suppression, Hosang, Benenson and Schiele, 2017

This process is considered part of the post-processing.

NMS is a greedy algorithm that loops over all the classes, and for each class it checks for overlaps (IoU — Intersection over Union) between all the bounding boxes. If the IoU between two boxes of the same class is above a certain threshold (usually 0.7), the algorithm concludes that they refer to the same object, and discards the box with the lower confidence score (which is a product of the objectness score and the conditional class probability).

The loop is performed per class because we want to keep the ability to detect classes other than the highest-scoring one, in case the highest-scoring class copy is discarded due to its high overlap with another box.

High overlap with different classes. We don’t want to drop the lower class scores. Photo by Tyler Nix on Unsplash
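Here is a minimal sketch of that greedy per-class loop in NumPy, reusing the iou function sketched earlier; the (x1, y1, x2, y2) box format and the 0.7 threshold follow the description above:

```python
import numpy as np

def nms_per_class(boxes, scores, classes, iou_threshold=0.7):
    """Greedy NMS: boxes (N, 4) as (x1, y1, x2, y2), scores (N,), classes (N,)."""
    keep = []
    for c in np.unique(classes):
        idx = np.where(classes == c)[0]
        # Process this class's boxes from highest to lowest confidence.
        idx = idx[np.argsort(-scores[idx])]
        while idx.size > 0:
            best = idx[0]
            keep.append(best)
            ious = np.array([iou(boxes[best], boxes[i]) for i in idx[1:]])
            # Drop every remaining box of this class that overlaps the winner too much.
            idx = idx[1:][ious <= iou_threshold]
    return np.array(keep)

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [200, 200, 240, 240]])
scores = np.array([0.9, 0.8, 0.75])
classes = np.array([0, 0, 0])
print(nms_per_class(boxes, scores, classes))  # [0 2]: the near-duplicate box 1 is suppressed
```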

NMS is very expensive computationally: If there are n boxes then the algorithm performs O(n²) comparisons and quickly becomes a serious bottleneck, especially if the input image is large and the processor is weak.

Several ways have been invented to mitigate this NMS bottleneck. One is to sift through the boxes and remove those with a very low confidence level even before the NMS. This gets rid of about 90% of the boxes on the Pascal VOC dataset, for example. While this method is useful, it still requires O(n) operations, and the remaining high-confidence boxes still require the NMS.
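That pre-filter is a single vectorized comparison; a sketch, assuming the same boxes, scores and classes arrays as in the NMS sketch above and an arbitrary threshold of 0.05:

```python
confidence_threshold = 0.05                # assumed value; tuned per application
mask = scores > confidence_threshold       # boolean mask over all N raw boxes
boxes, scores, classes = boxes[mask], scores[mask], classes[mask]  # discard implausible boxes
```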

Another approach is to abandon the anchor methodology and use a prediction architecture that inherently generates fewer proposals. CenterNet is such an architecture (one I'm personally fond of), and I will elaborate on it in a future post.


A machine learning engineer writing about computer vision. Loves trekking, scuba diving, the visual arts, and combinations thereof. https://www.linkedin.com/in/urialmog/