Deeplearning.ai CNN week 3: Object detection

Heading to the YOLO algorithm

Published in datatype, Feb 6, 2018

Landmark detection follows the same idea as object localization; the difference is that the network returns “points” instead of a bounding box.

Each label is now a set of (x, y) coordinate pairs, one pair per landmark to be learned, and the ConvNet outputs these coordinates.
Labels have to be consistent across training images: a given landmark must always sit at the same position in the label.

In the testing phase, detection works by sliding windows of different sizes over the image → each crop is pushed through the ConvNet → the ConvNet predicts whether the crop contains an object.
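A minimal sketch of the cropping step described above, assuming a square window and a fixed stride. The function name `sliding_windows` and the sizes used are illustrative, not taken from the course code.

```python
import numpy as np

def sliding_windows(image, window, stride):
    """Yield (row, col, crop) for every window x window crop of the image."""
    height, width = image.shape[:2]
    for row in range(0, height - window + 1, stride):
        for col in range(0, width - window + 1, stride):
            yield row, col, image[row:row + window, col:col + window]

# Hypothetical sizes: a 16x16x3 image scanned with a 14x14 window, stride 2.
image = np.zeros((16, 16, 3))
crops = list(sliding_windows(image, window=14, stride=2))
print(len(crops))  # 4 crops, each needing its own forward pass
```

Every crop would be sent through the ConvNet separately, which is where the cost comes from.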

The main disadvantage is computational cost: there are many possible crops of the image at different scales, and running the ConvNet on each crop separately takes time → slow.

The goal is to keep the same network flow (input/output sizes) at each step, but without using fully connected layers anymore.
Remember the “network in network” idea: a fully connected layer applied to a 1x1 volume is equivalent to a 1x1 convolution. With suitably sized filters, this goal can be achieved.
The first fully connected layer has 400 nodes and its input is 5x5x16, so we can replace it with 400 convolution filters of size 5x5x16. The result is a 1x1x400 volume, i.e. 400 nodes.
The next fully connected layers are rebuilt with 1x1 convolutions, quite simple :).
BUT a convolution filter ITSELF slides over regions of the image to produce its output. So the convolution is equivalent to the sliding window operation, done for all windows in a single pass.
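The equivalence between the 400-node fully connected layer and 400 filters of size 5x5x16 can be checked numerically. The sketch below uses random weights, and the helper names `fc_forward` and `conv_forward` are made up for illustration.

```python
import numpy as np

def fc_forward(x, w_fc):
    """Fully connected layer: flatten the volume, then multiply by the weights."""
    return w_fc @ x.reshape(-1)

def conv_forward(x, w_fc):
    """The same weights viewed as one filter per node, each the size of the input."""
    filters = w_fc.reshape(-1, *x.shape)          # (400, 5, 5, 16)
    # A "valid" convolution of a 5x5x16 filter over a 5x5x16 input has a
    # single position, so each filter yields exactly one number.
    return np.einsum("khwc,hwc->k", filters, x)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 5, 16))                   # input volume of the first FC layer
w_fc = rng.normal(size=(400, 5 * 5 * 16))         # random stand-in weights

print(np.allclose(fc_forward(x, w_fc), conv_forward(x, w_fc)))  # True
```

Both paths give the same 400 numbers; the convolutional form simply keeps them as a 1x1x400 volume.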

For example, a 16x16x3 test image is 2 pixels larger in each dimension than the 14x14x3 training images. If it is fed directly to the trained model, the final output is a 2x2x4 tensor in which each 1x1x4 entry is the response for one 14x14x3 region of the test image.
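The 2x2 output can be verified with simple size arithmetic. The layer stack below (a 5x5 convolution, a 2x2 max pool with stride 2, then the converted layers) is an assumption chosen to match the 5x5x16 volume mentioned earlier; only the spatial size is tracked.

```python
def out_size(n, kernel, stride=1):
    """Spatial size after a 'valid' convolution or pooling layer."""
    return (n - kernel) // stride + 1

def network_output_size(n):
    """Spatial size of the final output for an n x n input (assumed layer stack)."""
    n = out_size(n, kernel=5)             # 5x5 convolution
    n = out_size(n, kernel=2, stride=2)   # 2x2 max pool
    n = out_size(n, kernel=5)             # first FC layer as a 5x5 convolution
    n = out_size(n, kernel=1)             # second FC layer as a 1x1 convolution
    n = out_size(n, kernel=1)             # output layer as a 1x1 convolution
    return n

print(network_output_size(14))  # 1 -> a single 1x1x4 prediction
print(network_output_size(16))  # 2 -> a 2x2x4 grid of predictions
```

Under this assumed stack the max pool has stride 2, so the four outputs correspond to 14x14 windows placed 2 pixels apart.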

The size of the sliding window is still critical: because regions are captured one by one at a fixed size, there is no guarantee that any window frames the object accurately.

YOLO algorithm (You Only Look Once)
For example: a 100x100 image is divided into a 3x3 grid. In the training data, each grid cell needs a label “y” of 8 variables.
p_c: probability that the cell contains an object
b_x, b_y: the coordinates of the center point of the object's bounding box
b_h, b_w: the height and width of the bounding box
c_1, c_2, c_3: binary response for each class label

The output of YOLO is then a 3x3x8 tensor. The deep learning framework is used as usual; only the output format changes.
It is image classification + localization + convolutional implementation.
b_x, b_y, b_h and b_w are encoded as fractions of the grid cell size: b_x and b_y lie between 0 and 1, while b_h and b_w can exceed 1 when the object is larger than the cell.
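As a rough illustration of this labeling, the sketch below builds the 3x3x8 target for a 100x100 image. The function name `encode_label` and the input format (center x, center y, width, height in pixels, class index) are assumptions made for the example.

```python
import numpy as np

def encode_label(boxes, image_size=100, grid=3, num_classes=3):
    """Build a grid x grid x (5 + num_classes) target tensor.

    Each box is (center_x, center_y, width, height, class_index) in pixels.
    Per-cell layout: [p_c, b_x, b_y, b_h, b_w, c_1, ..., c_n].
    """
    y = np.zeros((grid, grid, 5 + num_classes))
    cell = image_size / grid
    for cx, cy, w, h, cls in boxes:
        col = min(int(cx // cell), grid - 1)   # cell containing the midpoint
        row = min(int(cy // cell), grid - 1)
        y[row, col, 0] = 1.0                   # p_c: an object is present
        y[row, col, 1] = cx / cell - col       # b_x relative to the cell
        y[row, col, 2] = cy / cell - row       # b_y relative to the cell
        y[row, col, 3] = h / cell              # b_h, may exceed 1
        y[row, col, 4] = w / cell              # b_w, may exceed 1
        y[row, col, 5 + cls] = 1.0             # one-hot class
    return y

# One hypothetical object of class 0 centered in the image, 20 wide and 40 high.
y = encode_label([(50, 50, 20, 40, 0)])
print(y.shape)              # (3, 3, 8)
print(np.round(y[1, 1], 2))
```

Cells without an object keep p_c = 0; in this sketch their other entries are left at zero as placeholders.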

One object can be found in many boxes, so the detections overlap. Intersection over union (IoU) measures that overlap.
It is the size of the intersection of two boxes / the size of their union.
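A small sketch of this ratio for axis-aligned boxes, assuming each box is given as corner coordinates (x1, y1, x2, y2); the corner format is a choice made for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0, identical boxes
print(iou((0, 0, 2, 2), (5, 5, 6, 6)))  # 0.0, no overlap
```

The value ranges from 0 (disjoint boxes) to 1 (identical boxes).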

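One object can be detected multiple times, so the detections need to be cleaned up for the final result (non-max suppression).

Take the bounding box with the highest probability, discard the remaining lower-probability boxes that overlap it heavily (high IoU), and repeat with the boxes that are left.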
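A sketch of this clean-up step as a greedy loop. The 0.5 IoU threshold, the (x1, y1, x2, y2) box format and the function names are assumptions for illustration; real pipelines usually also drop low-probability boxes first.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-probability box still available
        keep.append(best)
        # Drop boxes that overlap the chosen one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed; the third box is far away and survives.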

What about overlapping objects? How can we recognize multiple things in the same grid cell?
Predefine anchor boxes, one per type of shape, and repeat the same 8-variable label structure in “y” once per anchor box.

Algorithm with 2 anchor boxes: each object in a training image is assigned to the grid cell that contains the object's midpoint, and to the anchor box of that grid cell with the highest IoU with the object's shape.
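The anchor choice can be sketched by comparing shapes only, i.e. computing the IoU as if the object box and the anchor box shared the same center. The anchor shapes and function names below are hypothetical.

```python
def shape_iou(wh_a, wh_b):
    """IoU of two boxes given as (width, height), aligned at the same center."""
    intersection = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - intersection
    return intersection / union

def best_anchor(box_wh, anchors):
    """Index of the anchor whose shape has the highest IoU with the box."""
    return max(range(len(anchors)), key=lambda k: shape_iou(box_wh, anchors[k]))

# Two hypothetical anchors: a tall one (pedestrian-like) and a wide one (car-like).
anchors = [(1.0, 2.0), (2.0, 1.0)]
print(best_anchor((1.8, 0.9), anchors))  # 1, the wide anchor
print(best_anchor((0.9, 1.9), anchors))  # 0, the tall anchor
```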

Choosing the anchor boxes, both their shapes and how many to use, is tough. It is mostly done by hand, from human priors.
Luckily, two objects rarely fall in the same grid cell IF the grid is fine enough, e.g. 19x19 for a 100x100 image.
A K-means algorithm can also be used to cluster the shapes of the object boxes in the training set and pick the anchor boxes from the resulting clusters.
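A minimal K-means sketch for picking anchor shapes from (width, height) pairs. It uses plain Euclidean distance and takes the first k boxes as initial centers, both simplifications chosen for this example; an IoU-based distance is a common alternative. The data is made up.

```python
import numpy as np

def kmeans_anchors(box_wh, k, iterations=20):
    """Cluster (width, height) pairs and return k cluster centers as anchors."""
    box_wh = np.asarray(box_wh, dtype=float)
    centers = box_wh[:k].copy()                    # simple deterministic start
    for _ in range(iterations):
        # Distance of every box to every center, shape (num_boxes, k).
        distances = np.linalg.norm(box_wh[:, None, :] - centers[None, :, :], axis=2)
        assignment = distances.argmin(axis=1)
        for j in range(k):
            members = box_wh[assignment == j]
            if len(members) > 0:                   # keep old center if cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

# Hypothetical box shapes: three tall boxes and three wide boxes.
shapes = [(1.0, 2.0), (2.0, 1.0), (1.2, 2.2), (2.2, 1.2), (0.8, 1.8), (1.8, 0.8)]
print(kmeans_anchors(shapes, k=2))
```

With this toy data the two centers settle on one tall and one wide shape.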

Putting all the above components together. For example: classify three objects: car, pedestrian, motorcycle.

There are two kinds of anchor boxes. The red bounding box of the car has a higher IoU with the 2nd anchor box, so the label for the grid cell containing the car is as in the above image (no object information in the first 8 variables, which belong to the 1st anchor box).
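A sketch of such a label for one grid cell with 2 anchor boxes and 3 classes, giving 16 values. The box numbers, the class order (car first) and the name `anchor_label` are assumptions; entries of the unused anchor are left at zero as placeholders.

```python
import numpy as np

def anchor_label(anchor_index, box, class_index, num_anchors=2, num_classes=3):
    """Label vector for one grid cell holding a single object.

    Layout per anchor: [p_c, b_x, b_y, b_h, b_w, c_1, ..., c_n].
    """
    per_anchor = 5 + num_classes
    y = np.zeros(num_anchors * per_anchor)
    start = anchor_index * per_anchor
    y[start] = 1.0                          # p_c for the matched anchor
    y[start + 1:start + 5] = box            # b_x, b_y, b_h, b_w
    y[start + 5 + class_index] = 1.0        # one-hot class
    return y

# Hypothetical car (class 0) matched to the 2nd anchor box (index 1).
y = anchor_label(anchor_index=1, box=(0.5, 0.6, 0.4, 0.9), class_index=0)
print(y[:8])   # all zeros: nothing assigned to the 1st anchor
print(y[8:])   # p_c = 1, the box values, then the car class
```

For a 3x3 grid the full target would stack one such vector per cell, giving a 3x3x16 tensor.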
Making the prediction