My Machine Learning Diary: Day 71

Junhong Wang
10 min read · Dec 28, 2018


Day 71

I finally got the basic ideas of YOLO today. I would like to summarize YOLO and what I learned yesterday about object detection.

Object Localization

We don’t just want to know if a given image contains a cat or not, but also want to know where the cat is in the image. Object localization is defined as finding out where an object is given an image.

Object Localization

In order to do object localization, we first need to adjust the output y so that it also contains information about the location of the object. Concretely, y would look something like the following:

Output y of Object Localization

For the bounding box, we treat the top left corner of the image as (0, 0) and the bottom right corner as (1, 1). If there is no object in the image, we expect Pc to be 0, and we don't care about the values of the rest. If a car is found in the image, we expect Pc to be 1 and one of the cᵢ to be 1.

No Object Found (Left) and Car Found (Right)
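For concreteness, here is a minimal sketch of such a label vector as a NumPy array. The ordering [Pc, bx, by, bh, bw, c₁, c₂, c₃] follows the figure, and the numbers are made up for illustration:

```python
import numpy as np

# Label layout: [Pc, bx, by, bh, bw, c1, c2, c3]
# (c1 = pedestrian, c2 = car, c3 = motorcycle)

# A car roughly centered at (0.5, 0.7), about 0.3 high and 0.4 wide
y_car = np.array([1.0, 0.5, 0.7, 0.3, 0.4, 0.0, 1.0, 0.0])

# No object in the image: only Pc = 0 matters, the remaining values are "don't care"
y_empty = np.zeros(8)
```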

Landmark Detection

We can also detect specific points of a particular object. For example, we can detect where the eyes are in a human face or where the joints are in a human pose.

Detecting Eyes (Left) and Joints (Right)

These points on the objects are called landmarks. One application of landmark detection is augmented reality, where we can overlay a pair of sunglasses or a hat using the landmarks of the eyes and head.

Object Detection

Object localization can only detect one object, but we want to detect more than one. Object detection is about finding multiple objects in a given image.

Sliding Windows

One method of object detection is to use sliding windows. First, we draw a small square region and place it at the top left corner of the image. Then we feed that portion of the image into a CNN and perform object localization. We slide the window a little bit to the right and do the same thing. When we have slid the window all the way to the bottom right, we change the size of the window and repeat the process.

Sliding Windows
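A rough Python sketch of this brute-force procedure, assuming the image is a NumPy-style array and a hypothetical classify_crop function runs the localization CNN on a single crop:

```python
def sliding_windows(image, window_sizes, stride, classify_crop):
    """Run a (hypothetical) localization CNN on every window position and size."""
    height, width = image.shape[:2]
    detections = []
    for size in window_sizes:                           # try several window sizes
        for top in range(0, height - size + 1, stride):
            for left in range(0, width - size + 1, stride):
                crop = image[top:top + size, left:left + size]
                pc, box, classes = classify_crop(crop)  # object localization on the crop
                if pc > 0.5:
                    detections.append((top, left, size, pc, box, classes))
    return detections
```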

There are some problems with this approach. Sliding windows are computationally too expensive. Moreover, the windows may miss objects if the stride is too big or the window is too small. If we make the stride smaller or try more window sizes, the algorithm becomes even more expensive.

Sliding Windows with Convolutions

Fortunately, we can solve the problem of expensive computation. To understand the implementation, we first need to know how to convert fully connected layers into convolutional layers.

Converting FC layers to Convolutional layers

We can express a fully connected layer of 400 units as a 1x1x400 convolutional volume. The conversion is done with a convolution whose filter size matches the spatial size of the input. For example, we can convert 5x5x16 into 1x1x400 with 400 filters of size 5x5, and a 1x1 convolution with 400 filters maps that 1x1x400 to the next 1x1x400. Each filter computes the same linear combination over the input volume that a fully connected unit would.
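As a sanity check, here is a small PyTorch sketch (the framework choice is mine; the post doesn't assume one) showing that a 5x5 convolution with 400 filters over a 5x5x16 volume computes exactly the same linear function as a 400-unit fully connected layer:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 5, 5)               # a 5x5x16 feature map (batch of 1)

fc = nn.Linear(16 * 5 * 5, 400)             # fully connected view: 400 inputs -> 400 units
conv = nn.Conv2d(16, 400, kernel_size=5)    # convolutional view: 400 filters of size 5x5x16

# Copy the FC weights into the conv filters to show the two are the same computation
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(400, 16, 5, 5))
    conv.bias.copy_(fc.bias)

out_fc = fc(x.flatten(1))                   # shape (1, 400)
out_conv = conv(x).flatten(1)               # shape (1, 400)
print(torch.allclose(out_fc, out_conv, atol=1e-5))  # True
```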

Now let's see how to do sliding windows with convolutions. For illustration purposes, we only draw the 2D part of it.

sliding windows with convolutions

As we can see above, this particular combination of convolution operations on a 16x16x3 image is actually computing, all at once, the outputs of sliding windows of size 14 with stride 2.
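A toy PyTorch sketch of the same idea (the layer sizes follow the lecture's example; everything else is illustrative): the fully convolutional network that maps a 14x14 crop to a single prediction maps a 16x16 image to a 2x2 grid of predictions, one per window, in a single forward pass.

```python
import torch
import torch.nn as nn

# 14x14x3 -> 10x10x16 -> 5x5x16 -> 1x1x400 -> 1x1x400 -> 1x1x4
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5),     # 14x14 -> 10x10
    nn.MaxPool2d(2),                     # 10x10 -> 5x5
    nn.Conv2d(16, 400, kernel_size=5),   # 5x5   -> 1x1  (converted FC layer)
    nn.Conv2d(400, 400, kernel_size=1),  # second converted FC layer
    nn.Conv2d(400, 4, kernel_size=1),    # 4 output values per window
)

print(net(torch.randn(1, 3, 14, 14)).shape)  # torch.Size([1, 4, 1, 1]) -- one window
print(net(torch.randn(1, 3, 16, 16)).shape)  # torch.Size([1, 4, 2, 2]) -- 2x2 grid of windows
```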

Now the computation of the sliding windows is not that expensive anymore, but we still have the problem of choosing the window size and stride.

YOLO (You Only Look Once)

YOLO is an algorithm developed by Joseph Redmon et al. in 2015. It was motivated by the fact that humans can glance at an image once and figure out where the objects are. Instead of choosing multiple window sizes and strides and running a CNN on each crop, YOLO runs a CNN over the image only once to do it all.

[Reference]
Original Paper: J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," 2015.

Bounding Box Predictions

The idea is that we are going to draw grid lines as follows:

Bounding box predictions

For illustration purposes, let's split the image into a 3x3 grid. Each cell is responsible for detecting one object. A cell is responsible for an object if the object's center falls within the cell. For the picture above, the cells at (1, 3) and (2, 1) are responsible for detecting objects. In practice, we split the image into a 19x19 grid so that it is unlikely that more than one object's center falls within the same cell.

We are going to use the technique from sliding windows with convolutions to simulate the grid. Concretely, the network should map the input image to a 3x3xd output volume.

Sliding Windows w/ Convolutions

So the top right cell of the cuboid is responsible for detecting the center of an object within the top right cell of the image.

Each cell is responsible for detecting one center of an object

What does the cuboid represent exactly? Each cell contains the type of output y we saw in the object localization section. Concretely, if c₁ is a pedestrian, c₂ is a car, and c₃ is a motorcycle, we expect the cuboid to be as follows:

Cuboid Ground Truth Values

Note the center of the object must be within the cell, but the bounding box can exceed the cell.

Range of b’s (0 < x, y < 1, 0 < h, w)
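A NumPy sketch of how such a ground-truth cuboid could be filled in (the grid size, indices, and numbers are purely illustrative):

```python
import numpy as np

GRID = 3     # 3x3 grid for illustration (19x19 in practice)
DEPTH = 8    # [Pc, bx, by, bh, bw, c1, c2, c3] per cell

# Every cell starts as "no object" (Pc = 0)
y = np.zeros((GRID, GRID, DEPTH))

# Suppose a car's center falls in the cell at row 1, column 2 (0-based indices).
# bx, by are the center relative to that cell (between 0 and 1); bh, bw are
# relative to the cell size, so they can exceed 1 when the box spills over the cell.
y[1, 2] = [1.0, 0.4, 0.6, 2.5, 3.0, 0.0, 1.0, 0.0]
```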

Why Does it Work?

Now we may wonder how such a small cell can detect an object. It seems impossible to figure out the center and the bounding box of an object given only a small portion of the image, say just the door handle of a car. Indeed, it is impossible. What's actually happening is that the outputs are produced based on the whole image. This is the point where Andrew's lecture wasn't clear and caused some confusion. If we look at the original paper, the image is converted into the cuboid not just with convolutional layers but also with fully connected layers.

YOLO network

Indeed, when the cuboid is 7x7x1024, each grid cell only has information about a small portion of the image. But when it gets to the fully connected layers, all the portions of the image are combined and carefully redistributed into the 7x7x30 cuboid. Therefore, it looks like a cell is able to detect objects outside of its region. The convolutional layers are where the algorithm extracts features from the input image, and the fully connected layers are where it performs classification and detection.

Intersection Over Union (IOU)

To understand the loss function of the YOLO algorithm, we need to understand how to evaluate object detection with Intersection Over Union (IOU). IOU is a measure of the overlap between two bounding boxes.

IOU

The value of IOU is defined as (size of intersection / size of union). IOU ranges from 0 to 1.
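A straightforward implementation, assuming each box is given by its corner coordinates (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corners."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```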

Non-max Suppression

If we just perform bounding box predictions, each cell is going to predict a bounding box. If the grid is 19x19, then we are going to have 361 bounding boxes.

Detecting two cars w/ 19x19 Grid

Is this the output of the YOLO algorithm? No. We know each cell is only responsible for one object, but there can be multiple cells detecting the same car. This will make the algorithm predict multiple bounding boxes for one object.

Multiple Cells detecting same objects

To get the final output, we need to do something called non-max suppression.

non-max suppression

In non-max suppression, we first discard all boxes with a low Pc (the probability that the cell contains the center of an object). Then we pick the box with the largest Pc as an output and discard any remaining boxes that have a high IOU with the box we picked. We then pick the box with the next largest Pc and repeat the process until there are no boxes left.
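A Python sketch of the procedure, reusing the iou function from above (the thresholds are typical values, not ones fixed by the paper):

```python
def non_max_suppression(boxes, pc_threshold=0.6, iou_threshold=0.5):
    """boxes: list of (pc, (x1, y1, x2, y2)) predictions for one class."""
    # 1. Discard boxes with a low Pc
    boxes = [b for b in boxes if b[0] >= pc_threshold]
    # 2. Repeatedly keep the highest-Pc box and drop remaining boxes that overlap it too much
    boxes.sort(key=lambda b: b[0], reverse=True)
    kept = []
    while boxes:
        best = boxes.pop(0)
        kept.append(best)
        boxes = [b for b in boxes if iou(best[1], b[1]) < iou_threshold]
    return kept
```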

After Applying non-max suppression

The two light blue bounding boxes will be the final output of the YOLO algorithm. Then we will use this output to compute the cost.

Anchor Boxes

Before we talk about the loss function, there is one more problem with the YOLO algorithm so far. What if the centers of two objects fall within the same cell?

Anchor Boxes

The solution is simple. We predict two bounding boxes per cell, and the output y changes accordingly as shown above. Now, instead of each cell being responsible for detecting one object's center, each anchor box is responsible for one.

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall. (Original Paper)

In the picture above, type 1 anchor box is responsible for detecting the human and type 2 anchor box is responsible for detecting the car.
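A NumPy sketch of how the target volume could change with two anchor boxes per cell (grid size, anchor shapes, and numbers are illustrative):

```python
import numpy as np

GRID = 3
NUM_ANCHORS = 2   # e.g. anchor 1: tall and narrow (person), anchor 2: wide and flat (car)
DEPTH = 8         # [Pc, bx, by, bh, bw, c1, c2, c3] per anchor

# With anchor boxes the target volume grows to GRID x GRID x (NUM_ANCHORS * DEPTH)
y = np.zeros((GRID, GRID, NUM_ANCHORS * DEPTH))

# Suppose a person and a car share the center cell (1, 1):
# the person's box best matches anchor 1, so it fills the first 8 slots,
# and the car's box best matches anchor 2, so it fills the next 8 slots.
y[1, 1, 0:8]  = [1.0, 0.5, 0.5, 2.8, 0.9, 1.0, 0.0, 0.0]   # pedestrian
y[1, 1, 8:16] = [1.0, 0.5, 0.5, 1.2, 3.5, 0.0, 1.0, 0.0]   # car
```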

Loss Function

For simplicity we are going to use a squared-error metric for the loss function. As the authors point out, this is not an ideal choice of error metric.

We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the "confidence" scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on. (Original Paper)

To remedy this problem, we are going to penalize the bounding boxes that are responsible for object detections more heavily. Let's say the grid is 2x2.

Image Detection w/ 2x2 Grids

Each cell will predict B bounding boxes. For simplicity, let's say B=2, so each cell predicts two bounding boxes.

Each Cell Predicts Two Bounding Boxes

Let's look at the loss for the cells that don't contain the centers of objects, namely cells (1, 1) and (2, 2).

Loss w/o object in the cell

For bounding boxes in cells without an object, we only care about Pc, so the loss depends only on Pc.

Now let's look at the loss for the cells that contain the centers of objects, namely cells (1, 2) and (2, 1). As the authors write, we don't penalize all of those bounding boxes, only the ones that are responsible for the object detection.

Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is "responsible" for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell). (Original Paper)

This is because we may want to use the other boxes to detect other objects in the same cell. So we pick the bounding boxes that are responsible for the object detections and then compute the loss of each bounding box as follows:

Loss w/ object in the cell

For the size loss, the reason we don't use the width and height directly is the following:

Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly. (Original Paper)

In other words, the bounding boxes only need to be reasonably precise for large objects, but they need to be very precise for small objects.
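Putting the pieces together, here is a simplified per-box loss in Python. It follows the description above plus the paper's λ_coord = 5 and λ_noobj = 0.5 weights and the square roots on width and height, but it is a sketch, not a line-for-line reproduction of the paper's loss:

```python
import numpy as np

LAMBDA_COORD = 5.0    # weights from the original paper
LAMBDA_NOOBJ = 0.5

def box_loss(pred, truth, responsible):
    """Sum-squared loss for one predicted box in one cell.

    pred, truth: [Pc, bx, by, bh, bw, c1, c2, c3]
    responsible: True if this predictor has the highest IOU with the ground truth box.
    """
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    if truth[0] == 0:                   # no object center in this cell: only penalize Pc
        return LAMBDA_NOOBJ * (pred[0] - truth[0]) ** 2
    if not responsible:                 # another box in this cell handles the object
        return 0.0
    coord = np.sum((pred[1:3] - truth[1:3]) ** 2)
    size = np.sum((np.sqrt(pred[3:5]) - np.sqrt(truth[3:5])) ** 2)
    conf = (pred[0] - truth[0]) ** 2
    cls = np.sum((pred[5:] - truth[5:]) ** 2)
    return LAMBDA_COORD * (coord + size) + conf + cls
```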

R-CNN

Another popular object detection algorithm is R-CNN, which stands for Region-based Convolutional Neural Network. Recall that the problem with sliding windows was that it's computationally too expensive to search all the regions in the image. The idea of R-CNN is that we first perform region proposal on the image to find out which regions might contain objects. Then we run a CNN only on those regions.

R-CNN

That’s it for today.

