Deep Neural Networks for Object Detection

Philippe Beaudoin
Published in Academic Origami · Jun 11, 2016

Christian Szegedy, Alexander Toshev, and Dumitru Erhan. NIPS 2013.

Link to full paper
Part of my Academic Origami series. How do I read this?

The goal of this paper is to use deep neural networks to find these nice little “bounding boxes” you often see around objects or faces in computer-analyzed images.

The approach used here looks obsolete; the popular ones these days are Region-based Convolutional Networks (R-CNN) and MultiBox. I still read it because it looked like one of the first DNN + localization papers. Also, the idea of using a mask to express the bounding box, as proposed here, is one you might come up with if you were faced with this problem.

In a nutshell: the output of the DNN is going to be a mask of “binary pixels” with 1 where the object is and 0 where it isn’t. The mask has a significantly lower resolution than the image.

In addition to the full object mask, the DNN outputs four other masks covering the left, top, right and bottom halves of the object. This makes it possible to tell two adjacent objects apart. In this implementation, the original image is 225×225 and the mask is 24×24.

The result is a bit too coarse, so they refine it by cropping the image to a smaller window, then sliding this window across the image and merging the masks reported by each of these sliding windows. (The second line in Fig. 2.)

There is another refinement process indicated in Fig. 2 with the “refine” arrow: the algorithm “zooms in” on the detected bounding box, crops the image within it, and applies the DNN localizer again.

You can see some gray pixels in the mask in Fig. 2. They appear because a mask pixel value between 0 and 1 indicates that the real bounding box occupies only a fraction of the area covered by that mask pixel.

In this excerpt, [14] is the seminal paper introducing the first deep Convolutional Neural Net (CNN) that kicked ass in the ImageNet competition. I’ve already origamized that paper, in case you want to learn more about it. The CNN in [14] outputs a vector over the classes of objects it wants to detect. Only one of these outputs should be close to 1 while the others should be close to 0, which is achieved using softmax.

In the current paper, the authors want to output a d×d image of mask pixels where each can have a value between 0 and 1. Therefore the softmax layer is replaced by a regression layer which, as far as I can tell, is just a layer of d×d regular neurons with no non-linearity applied.
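To make the contrast concrete, here’s a minimal numpy sketch of the two kinds of output layers; the function and weight names are mine, and the weights are assumed to come from the rest of the network:

```python
import numpy as np

d = 24  # mask resolution used in the paper (24x24)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Classification head, as in [14]: a probability per class, summing to 1.
def classification_head(features, W_cls, b_cls):
    return softmax(W_cls @ features + b_cls)

# Localization head, as I understand this paper: d*d linear outputs, one per
# mask pixel, with no non-linearity, so each value can independently sit
# anywhere between 0 and 1 (and is pushed there by the training loss).
def regression_head(features, W_reg, b_reg):
    return (W_reg @ features + b_reg).reshape(d, d)
```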

That weird equation is the cost function used to train the DNN. The Diag(m)… part just means that every output where m = 1 gets multiplied by sqrt(1 + 𝜆) and every output where m = 0 gets multiplied by sqrt(𝜆). This way, mask pixels equal to 1 are weighted more heavily than those equal to 0, which makes sure the network doesn’t simply output all zeros when the object is small. Note that the smaller the 𝜆, the stronger the bias towards mask pixels equal to 1. I couldn’t find which value they used for 𝜆.
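Turning that description into code, each pixel’s squared error gets weight (1 + 𝜆) where the ground-truth mask is 1 and weight 𝜆 where it is 0. A minimal sketch, with 𝜆 = 0.1 as a pure placeholder since the paper doesn’t give the value:

```python
import numpy as np

def masked_l2_loss(pred_mask, true_mask, lam=0.1):
    # Per-pixel weights: (1 + lam) where true_mask == 1, lam where it == 0,
    # which is what the Diag(m) + lambda*I factor boils down to.
    weights = true_mask + lam
    return np.sum(weights * (pred_mask - true_mask) ** 2)
```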

In section 5.1, the authors add a twist to their algo by using 5 masks instead of 1: a mask for the full bounding box, one for the top half of the bounding box, one for the bottom half, one for the left half and one for the right half. To understand which problem this solves, imagine an image with two objects X and Y side-by-side versus an image with a single object Z: their full object masks would look exactly the same.

From the full masks alone there is therefore no way to disambiguate the two cases. On the other hand, if you generated the left and right masks you’d get different patterns, and the optimal choice becomes two bounding boxes in the first case and only one in the second. The toy sketch below makes this concrete.
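Here is that toy situation written out as one-row masks (the values are hand-written for illustration, not produced by any network, and four mask pixels is obviously much coarser than the paper’s 24×24):

```python
import numpy as np

# Case A: two adjacent objects, X on columns 0-1 and Y on columns 2-3.
full_A = np.array([1, 1, 1, 1])   # merged full masks of X and Y
left_A = np.array([1, 0, 1, 0])   # merged left-half masks of X and Y

# Case B: one wide object Z covering columns 0-3.
full_B = np.array([1, 1, 1, 1])   # full mask of Z
left_B = np.array([1, 1, 0, 0])   # left-half mask of Z

print(np.array_equal(full_A, full_B))  # True  -> full masks can't tell the cases apart
print(np.array_equal(left_A, left_B))  # False -> the left masks can
```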

Equation (1) caused me a small headache the first time I read the paper. It simply indicates how to calculate the value of a mask pixel m at coordinates i, j given a bounding box bb and a type of mask h (full, top, bottom, left or right). The equation says that m should be equal to the fraction of that mask pixel covered by the bounding box (when h = full) or by the corresponding half-bounding-box (when h = something else). A case of math-induced obfuscation if you ask me.
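In code, Eq. (1) is just an overlap-over-pixel-area computation. A sketch, assuming bb is given as (x1, y1, x2, y2) in image coordinates and that each mask pixel covers a cell_w × cell_h patch of the image (225/24 ≈ 9.4 pixels here); for h ≠ full you would pass the corresponding half of the bounding box instead:

```python
def mask_value(bb, i, j, cell_w, cell_h):
    # Fraction of mask pixel (i, j) covered by the box bb, which is how I
    # read Eq. (1). bb = (x1, y1, x2, y2) in image coordinates.
    px1, py1 = j * cell_w, i * cell_h          # extent of this mask pixel
    px2, py2 = px1 + cell_w, py1 + cell_h
    x1, y1, x2, y2 = bb
    overlap_w = max(0.0, min(x2, px2) - max(x1, px1))
    overlap_h = max(0.0, min(y2, py2) - max(y1, py1))
    return (overlap_w * overlap_h) / (cell_w * cell_h)
```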

This paragraph basically says that the five d×d masks can be generated using a single DNN with 5×d×d neurons on the output layer. The last sentence is strange to me: I was assuming they trained a single network with all the bounding boxes, no matter which class of object was in the image. Now it seems I might be wrong and they are training as many different networks as there are classes. This may mean 1000 different networks for today’s datasets!

This section explains how to find the bounding boxes (i.e. the top-left and bottom-right pixels in the original image) from the five masks obtained at the previous stage.

For the full bounding box (h = full) they look for the bounding box bb that maximizes a score of the form (1/area(bb)) × Σ over mask pixels of m(i,j) × the area of bb falling inside that mask pixel.

The 1/area term keeps the bounding box as small as possible. The term in the summation is maximal when the value of the mask pixel m(i,j) matches the portion of that mask pixel covered by the bounding box.
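As a sketch (mask is a d×d numpy array of predicted mask values; the per-pixel overlap is computed the same way as in the Eq. (1) snippet above):

```python
def score(bb, mask, cell_w, cell_h):
    # (1 / area(bb)) * sum over mask pixels of m(i, j) * area(bb ∩ pixel),
    # which is how I read the expression described above.
    x1, y1, x2, y2 = bb
    rows, cols = mask.shape
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            px1, py1 = j * cell_w, i * cell_h
            px2, py2 = px1 + cell_w, py1 + cell_h
            ow = max(0.0, min(x2, px2) - max(x1, px1))
            oh = max(0.0, min(y2, py2) - max(y1, py1))
            total += mask[i, j] * ow * oh
    return total / ((x2 - x1) * (y2 - y1))
```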

For the half-bounding-boxes (h = top, bottom, left or right), the quantity to maximize is the score of the matching half of bb against the partial mask m^h, minus the score of the opposite half (h-bar in the equation) against that same mask.

The first term ensures that the half-bounding-box precisely matches the mask pixels equal to 1, while subtracting the second term ensures that the other half-bounding-box falls on mask pixels equal to 0.
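Reusing score() from the previous sketch, the half-box objective for one value of h could look like this; cutting the candidate box exactly down the middle is my assumption:

```python
def half_score(bb, mask_h, h, cell_w, cell_h):
    # Score of the named half of bb against the partial mask m^h, minus the
    # score of the opposite half (the h-bar term) against that same mask.
    x1, y1, x2, y2 = bb
    xm, ym = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    halves = {
        "top":    ((x1, y1, x2, ym), (x1, ym, x2, y2)),
        "bottom": ((x1, ym, x2, y2), (x1, y1, x2, ym)),
        "left":   ((x1, y1, xm, y2), (xm, y1, x2, y2)),
        "right":  ((xm, y1, x2, y2), (x1, y1, xm, y2)),
    }
    own_half, other_half = halves[h]
    return (score(own_half, mask_h, cell_w, cell_h)
            - score(other_half, mask_h, cell_w, cell_h))
```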

From this paragraph it appears that the full exhaustive search is too complex, so they limit it to 90 different bounding box shapes that are slid across the image with a stride of 5 pixels. Rather than defining the bounding box shapes with their width and height, they use the mean dimension (average of width and height) and the aspect ratio (width/height). This ensures that the bounding boxes are scaled relative to the image. They select 10 aspect ratios by analyzing the shape of the boxes in the training data.
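Recovering a width and height from that parameterization is a bit of algebra: with mean = (w + h) / 2 and ratio = w / h, you get h = 2 × mean / (1 + ratio) and w = ratio × h. A sketch; the specific mean dimensions and aspect ratios below are placeholders, since the text only says there are 90 shapes in total and 10 aspect ratios taken from the training data:

```python
def box_shape(mean_dim, aspect_ratio):
    # mean_dim = (w + h) / 2, aspect_ratio = w / h  =>  solve for w and h.
    h = 2.0 * mean_dim / (1.0 + aspect_ratio)
    w = aspect_ratio * h
    return w, h

# Placeholder values: 9 mean dimensions (as fractions of the image size, so
# boxes scale with the image) x 10 aspect ratios = 90 shapes. Each shape is
# then slid across the image with a stride of 5 pixels.
mean_dims = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
aspect_ratios = [0.5, 0.67, 0.8, 1.0, 1.25, 1.5, 2.0, 2.5, 3.0, 3.5]
shapes = [box_shape(m, a) for m in mean_dims for a in aspect_ratios]
```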

Remember that a mask can encode multiple different bounding boxes whereas the exhaustive search looks at one bounding box at a time. To handle that, they score all the candidate bounding boxes and keep the ones with strong enough scores. They then apply the classifier and discard the bounding boxes it rejects. The non-maximum suppression algorithm is easy: first sort all bounding boxes by score, then select them in order, dropping any bounding box that overlaps an already-selected one by more than 50%.
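A sketch of that non-maximum suppression; the text doesn’t say exactly how overlap is measured, so intersection-over-union is my assumption:

```python
def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def non_max_suppression(boxes, scores, threshold=0.5):
    # Sort by score, then keep a box only if it overlaps every
    # already-kept box by at most the threshold.
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    kept = []
    for k in order:
        if all(iou(boxes[k], boxes[j]) <= threshold for j in kept):
            kept.append(k)
    return [boxes[k] for k in kept]
```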

Again, the sentence “W.r.t to the class of the current detector” makes it sound like they need to train one neural network per class. This sounds like overkill.

Section 5.3 explains how they apply the DNN on multiple crops of the image: the full image, and a number of sliding windows at 50% and 25% of the full image size. These sliding windows overlap by 20%.

For inference, they apply the DNN to all cropped images, at all scales (typically fewer than 40 cropped images). They obtain three masks, one for each of the three scales (100%, 50%, 25%), by merging the masks generated by all the sliding windows at that scale. Merging is done by taking the maximum of all the mask pixels at coordinates i, j. They then use the mechanism in section 5.2 to identify the top 5 bounding boxes per scale, obtaining 15 bounding boxes.
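The merging step might look like this; placing every window’s mask onto a common grid (the offsets below, in mask-pixel units) is my simplification of whatever coordinate bookkeeping the paper actually does:

```python
import numpy as np

def merge_masks(window_masks, window_offsets, grid_size, d=24):
    # One scale at a time: start from an all-zero grid and take, at each
    # coordinate, the maximum over every window mask that covers it.
    merged = np.zeros((grid_size, grid_size))
    for mask, (row, col) in zip(window_masks, window_offsets):
        region = merged[row:row + d, col:col + d]
        np.maximum(region, mask, out=region)  # element-wise max, in place
    return merged
```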

Another pass is made by “zooming in” on these 15 bounding boxes to obtain finer bounding boxes. First, the bounding box is enlarged by a factor of 1.2, probably to capture a margin around the object, in case some information is present there. Then the image is cropped w.r.t. the enlarged box, and the DNN is run on that cropped image. I imagine they do not apply the multi-scale refinement at that point, just the basic DNN. Also, I’m guessing they only keep the top bounding box.
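A sketch of that second pass, with dnn_localizer standing in for the whole machinery of sections 5.1 and 5.2, treated here as a black box that maps an image crop to a bounding box in crop coordinates:

```python
def refine(bb, image, dnn_localizer, enlarge=1.2):
    # Enlarge the detected box by 1.2, crop the image to it, and run the
    # localizer again on the crop (single scale, no sliding windows).
    x1, y1, x2, y2 = bb
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * enlarge, (y2 - y1) * enlarge
    ex1, ey1 = max(0, int(cx - w / 2)), max(0, int(cy - h / 2))
    ex2 = min(image.shape[1], int(cx + w / 2))
    ey2 = min(image.shape[0], int(cy + h / 2))
    crop = image[ey1:ey2, ex1:ex2]
    fx1, fy1, fx2, fy2 = dnn_localizer(crop)  # box in crop coordinates
    return (fx1 + ex1, fy1 + ey1, fx2 + ex1, fy2 + ey1)  # back to image coords
```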

I’m not sure I fully understand section 6, so this is my interpretation.

They train two things: first the classifier, then the mask generator. The classifier is the CNN they use for pruning in section 5.2. I find it strange that they explain how to train it here; I had the impression it was the basic CNN of [14].

For training the mask generator, I believe they generate a training set composed of (image crop, mask) pairs. 40% of these pairs are positive examples (the mask matches an object in the image crop) and 60% are negative examples (the mask doesn’t match anything).
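If that reading is right, assembling the training set could be as simple as the sketch below; the inputs are assumed to be lists of pre-computed (image crop, mask) pairs, and the whole helper is my guess at their procedure:

```python
import random

def build_training_set(positive_pairs, negative_pairs, n_total):
    # Mix (image crop, mask) pairs with the 40/60 positive/negative split
    # described above, then shuffle.
    n_pos = int(0.4 * n_total)
    n_neg = n_total - n_pos
    pairs = random.sample(positive_pairs, n_pos) + random.sample(negative_pairs, n_neg)
    random.shuffle(pairs)
    return pairs
```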

The part about training the classifier I don’t get. Why are they talking about bounding boxes in this paragraph if it’s used to detect whether an image crop contains a given class of object?
