NeuroNuggets: Object Detection

Sergey Nikolenko
Published in Neuromation
Mar 19, 2018 · 11 min read


This is the second post in our NeuroNuggets series, where we discuss the demos already available on the recently released NeuroPlatform. But the series is not so much about the demos themselves as about the ideas behind each of these models. Along the way, we also get to meet the new Neuromation deep learning team hired at our new office in St. Petersburg, Russia.

The first installment was devoted to the age and gender estimation model, the simplest neural architecture among our demos, but even there we had quite a few ideas to discuss. Today we take convolutional neural networks for image processing one step further: from straightforward image classification to object detection. It is also my pleasure to introduce Aleksey Artamonov, one of our first hires in St. Petersburg and the co-author of this post:

Object Detection: Not Quite Trivial

Looking at a picture, you can recognize not only what objects are in it but also where they are located. A simple convolutional neural network (CNN) like the one we considered in the previous post, however, cannot do that: all it can do is estimate the probability that a given object is present in the image. For practical purposes, this is insufficient: real-world images contain lots of objects that can interact with each other in nontrivial ways, and we need to know the position and class of each object in order to extract more semantic information from the scene. We have already seen this with face detection:

The simplest approach, used before the advent of convolutional networks, consists of a sliding window and a classifier. If we need, say, to find human faces in a photo, we first train a classifier that estimates how likely it is that a given picture contains a face and then apply it to every possible bounding box (a rectangle in the photo where an object could appear), choosing the bounding boxes where this probability is highest.

This approach would actually work pretty well… if anyone could afford to apply it. Detection with a sliding window has to look through a huge number of different bounding boxes, varying in position, aspect ratio, and scale. If we count all the options we would have to look through, we get about 10 million combinations for a 1-megapixel image, which means that the naive approach would need to run the classification network 10 million times to find the actual position of a face. Naturally, this would never do.
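To get a feeling for where this number comes from, here is a quick back-of-the-envelope count. The image size, stride, scales, and aspect ratios below are our own illustrative assumptions rather than parameters of any specific detector:

```python
# Back-of-the-envelope count of sliding windows for a roughly 1-megapixel image.
# The stride, scales, and aspect ratios are illustrative assumptions.

def count_windows(img_w=1280, img_h=800, stride=4,
                  scales=(32, 48, 64, 96, 128, 192, 256),
                  aspect_ratios=(0.5, 1.0, 2.0)):
    total = 0
    for scale in scales:
        for ar in aspect_ratios:
            w = int(scale * ar ** 0.5)
            h = int(scale / ar ** 0.5)
            if w > img_w or h > img_h:
                continue
            nx = (img_w - w) // stride + 1
            ny = (img_h - h) // stride + 1
            total += nx * ny
    return total

# Roughly a million windows even with a stride of 4 pixels;
# with stride 1 the count grows well past ten million.
print(count_windows())
```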

Classical Computer Vision: the Viola-Jones Algorithm

Our next stop is an algorithm that embodies classical computer vision approaches to object detection. By “classical” here we mean computer vision as it was before the deep learning revolution made every kind of image processing into different flavours of CNNs. In 2001 Paul Viola and Michael Jones proposed an algorithm for real-time face detection. It employs three basic ideas:

  • Haar feature selection;
  • boosting algorithm;
  • cascade classifier.

Before describing these stages, let us spell out what we actually want the algorithm to achieve. A good object detection algorithm has to be fast and have a very low false positive rate. We have 10 million possible bounding boxes and only a handful of faces in the photo, so we cannot afford a false positive rate much higher than 10⁻⁶ unless we want to be overwhelmed by incorrect bounding boxes. With this in mind, let us jump into the algorithm.

The first part is the Haar transform; it is best to begin with a picture:

We overlay different filters on the image. The activation of a Haar filter is the sum of the pixel values under the white parts of the rectangle minus the sum of the values under the black parts.

The main property of these filters is that they can be computed across the entire image very quickly. Consider the integral version I* of the original image I: the integral image is the image whose intensity at coordinate (x, y), denoted I*(x, y), is the total intensity over the whole rectangle that begins at the top left corner and ends at (x, y):

Let us see a Haar filter overlaid on the image and its integral version in action:

Haar features for this filter can be computed in just a few operations. E.g., the horizontal Haar filter activation shown on the Terminator's face equals 2C − 2D + B − A + F − E, where the letters denote intensities at the corresponding points of the integral image. We won't go into the formulas, but you can check this yourself; it is a good exercise for understanding the transform.
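To make the integral image trick concrete, here is a minimal NumPy sketch of our own (not the original implementation): two cumulative sums build the integral image, after which the sum over any rectangle, and hence any Haar feature, takes only a handful of lookups. The particular two-rectangle filter and window size are just an example:

```python
import numpy as np

def integral_image(img):
    """I*(x, y): sum of all pixels in the rectangle from (0, 0) to (x, y) inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x1, y1, x2, y2):
    """Sum of pixels in the rectangle [x1, x2] x [y1, y2] via four integral-image lookups."""
    total = ii[y2, x2]
    if x1 > 0:
        total -= ii[y2, x1 - 1]
    if y1 > 0:
        total -= ii[y1 - 1, x2]
    if x1 > 0 and y1 > 0:
        total += ii[y1 - 1, x1 - 1]
    return total

def haar_two_rect(ii, x, y, w, h):
    """White top half minus black bottom half of a (w x h) window at (x, y)."""
    top = rect_sum(ii, x, y, x + w - 1, y + h // 2 - 1)
    bottom = rect_sum(ii, x, y + h // 2, x + w - 1, y + h - 1)
    return top - bottom

img = np.random.randint(0, 256, size=(100, 100)).astype(np.int64)
ii = integral_image(img)
print(haar_two_rect(ii, 10, 20, 24, 24))
```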

To select the best Haar features for face detection, the Viola-Jones algorithm uses the AdaBoost classification algorithm. Boosting models (in particular, AdaBoost) are machine learning models that combine and build upon simpler classifiers. AdaBoost can take weak classifiers based on features like Haar filters, each only a little better than tossing a coin, and learn a combination of them in such a way that the final decision rule is much stronger. It would take a whole separate post to explain AdaBoost (perhaps we will write one some day), so we'll just add a couple of links and leave it at that.
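Still, if you want to play with the idea in code, here is an illustrative scikit-learn sketch: AdaBoost over decision stumps on synthetic data, standing in for boosting over weak Haar-feature classifiers. This is not how Viola and Jones trained their detector, just the same principle in miniature:

```python
# Illustrative only: AdaBoost over decision stumps (one threshold on one feature)
# on synthetic data, as a stand-in for boosting weak Haar-feature classifiers.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The default weak learner in scikit-learn's AdaBoost is a depth-1 decision tree
# ("stump"), which on its own is only slightly better than a coin toss.
model = AdaBoostClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```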

The third and final idea is to combine the classifiers into a cascade. The thing is, even boosted classifiers still produce too many false positives. A simple two-feature classifier can achieve an almost 100% detection rate, but with a 50% false positive rate. Therefore, the Viola-Jones algorithm uses a cascade of progressively more complex classifiers, where a bounding box can be rejected at every step but has to pass all checks to produce a positive answer:

This approach leads to much better detection rates. Roughly speaking, if we have 10 stages in the cascade, each stage has a 0.3 false positive rate and a 0.01 false negative rate, and all stages are independent (a big assumption, of course, but in practice it still works pretty well), the resulting cascade classifier achieves a 0.3¹⁰ ≈ 6·10⁻⁶ false positive rate and a 0.99¹⁰ ≈ 0.9 detection rate. Here is how a cascade works on a group photo:

Further research in classical computer vision focused on detecting objects of specific classes such as pedestrians, vehicles, traffic signs, and faces. For the more complex classifiers at the later stages of a cascade, we can use other features and models such as histograms of oriented gradients (HOG) and support vector machines (SVM). Instead of computing Haar features on a grayscale image, we can use image channels in different color spaces (CIELab or HSV) and image gradients in different directions. All these features are still computed on integral images, by summing over rectangles and comparing against an adjustable threshold.
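In practice, you do not have to implement any of this yourself: OpenCV ships pretrained Viola-Jones-style cascades. Here is a minimal usage sketch; the input path is a placeholder, and the scaleFactor and minNeighbors values are typical starting points rather than tuned settings:

```python
# Minimal sketch: OpenCV's pretrained Haar cascade face detector
# (a Viola-Jones-style classifier); "photo.jpg" is a placeholder path.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("photo.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# scaleFactor controls the step between window scales;
# minNeighbors crudely suppresses overlapping detections.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces.jpg", img)
```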

An important problem that appeared already in classical computer vision is that near a real object, our algorithms will find multiple intersecting bounding boxes: you can draw many different rectangles around a given face, and all of them will have rather high confidence. To choose the best one, classical computer vision usually employs the non-maximum suppression (NMS) algorithm. However, this remains an open and difficult problem, because there are many situations when two or more objects have intersecting bounding boxes, and a simple greedy implementation of non-maximum suppression can discard good bounding boxes. This part of the problem is relevant for all object detection algorithms and remains an active area of research.
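For reference, here is what a simple greedy non-maximum suppression looks like in NumPy. This is a generic sketch, not the exact variant used in any particular detector:

```python
import numpy as np

def iou(box, boxes):
    """Intersection over union between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop everything that overlaps it too much, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) < iou_threshold]
    return keep

boxes = np.array([[10, 10, 100, 100], [12, 12, 102, 98], [200, 50, 260, 120]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the second box is suppressed by the first
```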

R-CNN

Initially, in object detection tasks neural networks were treated as tools for extracting features (descriptors) at the late stages of a cascade. Neural networks by themselves had always been very good at image classification, i.e., predicting the class or type of an object. But for a long time, there was no mechanism for locating that object in the image with neural networks.

With the deep learning revolution, all of this changed rather quickly. By now, there are several competing approaches to object detection that are all based on deep neural networks: YOLO (“You Only Look Once”, a model initially optimized for speed), SSD (single-shot detectors), and so on. We may return to them in later installments, but in this post we concentrate on a single family of object detection approaches, the R-CNN (Region-Based CNN) line of models.

The original R-CNN model, proposed in 2013, performs object detection in three steps:

  • generate hypotheses for reasonable bounding boxes with an external region proposal algorithm;
  • warp each proposed region in the image and pass it through a CNN trained for image classification to extract features;
  • pass the resulting features to a separate SVM classification model that actually classifies the regions and chooses which of them contain meaningful objects.

Here is an illustration from the R-CNN paper:

R-CNN brought the deep learning revolution to object detection. With R-CNN, mean average precision on the Pascal VOC (2010) dataset grew from 40% (the previous record) up to 53%, a huge improvement.

But improved object detection quality is only part of the story. R-CNN worked well but was hopelessly slow. The main problem was that the CNN had to be run separately for every bounding box; as a result, object detection with R-CNN took more than forty seconds per image on a modern GPU! Not quite real time. It was also very hard to train, because you had to juggle three components, two of which (the CNN and the SVM) are machine learning models that have to be trained separately. And, among other things, R-CNN requires an external algorithm to propose bounding boxes for further classification. In short, something had to be done.

Fast R-CNN

The main problem of R-CNN was speed, so when researchers from Microsoft Research rolled out an improvement, there was no doubt what to name the new model: Fast R-CNN was indeed much, much faster. The basic idea (we will see this common theme again below) was to put as much as possible directly into the neural network. In Fast R-CNN, the neural network itself performs both classification and bounding box regression instead of an SVM:


In order to make detection independent of the size of the object in the image, Fast R-CNN uses the spatial pyramid pooling (SPP) layers introduced in SPPnet. The idea of SPP is brilliant: instead of cropping and warping the region to construct an input image for a separate run of the classification CNN, SPP crops the region of interest (RoI) projection at a deeper convolutional layer, before the fully connected layers.

This means that we can reuse the lower layers of the CNN, running them only once instead of thousands of times as in basic R-CNN. But a problem arises: a fully connected layer has a fixed input size, while the size of our RoI can be anything. The task of the spatial pyramid pooling layer is to solve this problem: it divides the window into 21 parts, as shown in the figure below, and summarizes (pools) the values in each part. Thus, the size of the layer's output no longer depends on the size of the input window:
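Here is a minimal PyTorch sketch of the idea (our own illustration, not the original implementation): pooling the RoI's feature map onto 4×4, 2×2, and 1×1 grids with adaptive max pooling always yields 16 + 4 + 1 = 21 values per channel, no matter how large the region is:

```python
# Spatial pyramid pooling in miniature: whatever the spatial size of the RoI's
# feature map, pooling at 4x4, 2x2, and 1x1 grids gives 21 bins per channel.
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(roi_features, levels=(4, 2, 1)):
    # roi_features: (channels, H, W) crop of a convolutional feature map
    pooled = [
        F.adaptive_max_pool2d(roi_features, output_size=n).flatten(start_dim=1)
        for n in levels
    ]
    return torch.cat(pooled, dim=1)  # (channels, 16 + 4 + 1)

small_roi = torch.randn(256, 7, 5)    # a small region of interest
large_roi = torch.randn(256, 31, 45)  # a much larger one
print(spatial_pyramid_pool(small_roi).shape)  # torch.Size([256, 21])
print(spatial_pyramid_pool(large_roi).shape)  # torch.Size([256, 21])
```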

Fast R-CNN is about 200 times faster than R-CNN when applied to a test image. But it is still not enough for actual real-time object detection because of the external algorithm for generating bounding box hypotheses. On a real photo, this algorithm can take about 2 seconds, so no matter how fast we make the neural network, we have at least a 2-second overhead for every image. Can we get rid of this bottleneck?

Faster R-CNN

What could be faster than Fast R-CNN? The aptly named Faster R-CNN, of course! And what kind of improvement could we make to Fast R-CNN to get an even faster model? It turns out that we can get rid of the only external algorithm left in the model: extracting region proposals.

The beauty of Faster R-CNN is that we can use the same neural network to extract region proposals. We only need to augment it with a few new layers, called the Region Proposal Network (RPN):

The idea is that the first (nearest to the input) layers of a CNN extract universal features that could be useful for everything, including region proposals. At the same time, these layers have not yet lost all the spatial information: their features are still rather “local” and correspond to relatively small patches of the image. Thus, the RPN uses precomputed feature values from these layers to propose regions with objects for further classification. The RPN is a fully convolutional network, with no fully connected layers in its architecture, so the computational overhead is almost nonexistent (about 10 ms per image), and we can now completely remove the external region proposal algorithm!
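To show why the overhead is so small, here is a sketch of an RPN-style head along the lines of the Faster R-CNN paper: a 3×3 convolution followed by two 1×1 convolutions that output objectness scores and box offsets for each of several candidate boxes (the anchors discussed just below) at every spatial location. The channel counts and feature map size here are illustrative assumptions:

```python
# Sketch of an RPN-style head: fully convolutional, so it runs on a feature map
# of any size with negligible overhead. Channel counts are illustrative.
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        # per spatial location: 2 objectness scores and 4 box offsets per anchor
        self.objectness = nn.Conv2d(in_channels, num_anchors * 2, kernel_size=1)
        self.box_deltas = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.objectness(x), self.box_deltas(x)

features = torch.randn(1, 512, 38, 50)  # backbone features for one image
scores, deltas = RPNHead()(features)
print(scores.shape, deltas.shape)  # (1, 18, 38, 50) and (1, 36, 38, 50)
```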

One more good idea is to use anchor boxes with different scales and aspect ratios instead of a spatial pyramid. At the time of its appearance, Faster R-CNN became the state-of-the-art object detection model, and while there has been a lot of new research in the last couple of years, it is still going pretty strong. You can try Faster R-CNN at the NeuroPlatform.
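For illustration, here is how the anchor boxes at a single location can be generated; the scales and aspect ratios below follow the defaults from the Faster R-CNN paper, but the exact parameterization varies between implementations:

```python
# Generating the 3 x 3 = 9 anchor boxes centered at one feature-map location.
# Real implementations tile these anchors over every location of the feature map.
import itertools

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    boxes = []
    for scale, ratio in itertools.product(scales, ratios):
        w = scale * ratio ** 0.5
        h = scale / ratio ** 0.5
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

for box in anchors_at(400, 300):
    print([round(v, 1) for v in box])
```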

Object Detection at the NeuroPlatform

And finally we are ready to see Faster R-CNN in action! Here is a sequence of steps showing how this works with a pretrained model that is already available at the NeuroPlatform.

  1. Log in at https://mvp.neuromation.io
  2. Go to “AI Models”:
  3. Click “Add more” and “Buy on market”:
  4. Select and buy the object detection demo model:
  5. Launch it with the “New Task” button:
  6. Try the demo! You can upload your own photo for detection:
  7. And here you go!
