Learning Day 66: Object detection 5 — YOLO v1, v2 and v3

dejunhuang

Jun 20, 2021

YOLO v1 (You Only Look Once)

Previous object detection algorithms (e.g. Faster R-CNN, R-FCN) solve two problems: classification and regression.
YOLO reframes detection as a pure regression problem.
It uses a single neural network to directly predict 1) the bounding box (bbox) location, 2) the probability that the box contains an object, and 3) the probability of each class.
It then uses NMS (Non-Maximum Suppression) to filter out overlapping boxes.
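To make the NMS step concrete, here is a minimal sketch of greedy NMS in plain Python. It is an illustration rather than the YOLO authors' implementation: the corner box format (x1, y1, x2, y2), the helper names `iou` and `nms`, and the 0.5 threshold are all assumptions chosen for this example.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes on one object, plus one box elsewhere (made-up data).
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this toy example the second box overlaps heavily with the first, higher-scoring one, so only the first and third boxes survive.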

The network is pretrained on ImageNet with an input size of 224x224; the input resolution is then doubled to 448x448 for detection in YOLOv1.
It contains both conv and FC layers, unlike R-FCN, which is fully convolutional.
The final output is 7x7x30 (for the Pascal VOC dataset, which determines the last dimension of 30, as explained below).

The first two dimensions, 7x7, indicate that YOLOv1 has conceptually divided the input image into 7x7 grid cells.

Each grid cell has a depth of 30, which is the third dimension of 7x7x30.
The depth of 30 has three components: bbox1 (depth=5), bbox2 (depth=5) and class probabilities (depth=20).

Bbox1 & 2 (depth=5 each)

YOLO v1 uses two bboxes per grid cell.
The first 4 channels contain the location, width and height of bbox1: x, y, w, h.
The 5th channel contains a confidence score from 0 to 1, which reflects both the probability that the box contains any object and how well the box is positioned. If the grid cell contains no object, the target score is 0; if it does, the target score is the IOU of the predicted bbox and the groundtruth.
The next 5 channels serve the same purpose for the 2nd bbox.
The last 20 channels are the probabilities of each object class at this grid cell, as the Pascal VOC dataset has 20 classes.
So the output dimension is written generically as S*S*(B*5+C), where S is the grid size, B is the no. of bboxes per grid cell and C is the no. of classes.
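As a rough illustration of this layout, the sketch below splits one grid cell's depth-30 vector into its two bboxes and 20 class probabilities. The channel order (bbox1, bbox2, classes) follows the description above; real implementations may order channels differently, and the helper names `split_cell` and `best_detection` are made up for this example. Multiplying the box confidence by the class probability to get a class-specific score is how the YOLO v1 paper scores detections at test time.

```python
S, B, C = 7, 2, 20
DEPTH = B * 5 + C  # 30 for Pascal VOC


def split_cell(cell):
    """Split one grid cell's vector into B boxes and C class probabilities."""
    boxes = [tuple(cell[b * 5:(b + 1) * 5]) for b in range(B)]  # (x, y, w, h, conf)
    class_probs = cell[B * 5:]
    return boxes, class_probs


def best_detection(cell):
    """Pick the higher-confidence box and its class-specific score."""
    boxes, class_probs = split_cell(cell)
    x, y, w, h, conf = max(boxes, key=lambda box: box[4])
    cls = max(range(C), key=lambda c: class_probs[c])
    return (x, y, w, h), cls, conf * class_probs[cls]


# Dummy cell: bbox1 has confidence 0.8, bbox2 has 0.3, class 11 is most likely.
cell = [0.5, 0.5, 0.2, 0.3, 0.8] + [0.4, 0.6, 0.1, 0.1, 0.3] + [0.0] * C
cell[B * 5 + 11] = 0.9
box, cls, score = best_detection(cell)
```

The full network output is simply S*S = 49 such vectors, one per grid cell.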

The loss function is a summation of 5 terms, which are explained well in another post.
2 terms take care of the bbox location and size, 2 terms take care of the confidence score, and 1 term takes care of the object classification.
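For reference, the five terms can be written out as in the YOLO v1 paper. The notation below comes from the paper rather than from this post: the indicator 1_{ij}^{obj} is 1 when bbox j in grid cell i is responsible for an object, 1_{ij}^{noobj} is its complement, 1_{i}^{obj} is 1 when cell i contains an object, hatted symbols are predictions, and the paper sets λ_coord = 5 and λ_noobj = 0.5.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
    \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
 &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
    \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
 &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The first two lines are the location and size terms, the next two are the confidence terms (with and without an object), and the last line is the classification term. The square roots on w and h make the same absolute error count more for small boxes than for large ones.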

Advantage

Disadvantages

Sacrifices accuracy for speed.
Prone to bbox localization errors.
Not fantastic for small object detection (since the grid size 7x7 is pretty coarse).

YOLOv2’s backbone is Darknet-19.

Batch normalization included.
Higher res classifier: the classifier pretrained on ImageNet is fine-tuned on 448x448 images before being trained for detection.
Convolution with anchor boxes: the FC layers are removed and bboxes are predicted as offsets from anchor boxes (instead of just 2 directly predicted bboxes per grid cell).
Dimension clusters: instead of hand-picking anchor boxes, run K-means on the groundtruth bbox dimensions of the training set to obtain good anchor shapes (B=5 was chosen as a trade-off between recall and model complexity).
Direct location prediction: as predicting unconstrained x, y offsets with anchor boxes is unstable, use a sigmoid to bound the predicted center offsets to 0–1 relative to the grid cell. So the center of each bbox always stays inside the grid cell that predicts it.
Fine-grained features: object detection is done on a 13x13 feature map (instead of 7x7) to get finer details. A passthrough layer (similar to the identity mappings in ResNet) combines high res and low res features.
Multi-scale training: YOLOv2 is a fully convolutional network, which makes this possible: every 10 batches, randomly choose a different input size for training (if an FC layer is present, the input size is fixed).
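The direct location prediction step can be sketched as follows, using the decoding equations from the YOLOv2 paper: bx = sigmoid(tx) + cx, by = sigmoid(ty) + cy, bw = pw * exp(tw), bh = ph * exp(th). The function name `decode_box`, the choice to express everything in grid-cell units and the example anchor size are assumptions made for this illustration.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t, cell_xy, anchor_wh, grid_size=13):
    """Turn raw outputs (tx, ty, tw, th) into a box in [0, 1] image coordinates."""
    tx, ty, tw, th = t
    cx, cy = cell_xy    # top-left corner of the grid cell, in cell units
    pw, ph = anchor_wh  # anchor (prior) width and height, in cell units
    # The sigmoid keeps the center offset in 0-1, so the center cannot
    # leave the grid cell (cx, cy) that predicts it.
    bx = (sigmoid(tx) + cx) / grid_size
    by = (sigmoid(ty) + cy) / grid_size
    # Width and height scale the anchor box, so they are always positive.
    bw = pw * math.exp(tw) / grid_size
    bh = ph * math.exp(th) / grid_size
    return bx, by, bw, bh


# All-zero raw outputs give a box centered in cell (6, 6) with the anchor's size.
box = decode_box((0.0, 0.0, 0.0, 0.0), cell_xy=(6, 6), anchor_wh=(3.0, 2.0))
```

However large or small tx and ty get, the decoded center stays between 6/13 and 7/13 here, which is what makes training more stable than with unbounded offsets.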

YOLOv3's backbone is Darknet-53.
Different class prediction loss: from softmax at v2 to independent logistic classifiers (binary cross-entropy loss) at v3, so one box can carry multiple labels.
Different no. of anchor boxes: from 5 at v2 to 9 at v3 (3 scales x 3 anchor boxes per scale).
Performs detection 3 times. These detections are done on feature maps of different sizes to detect objects at different scales.
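To get a feel for the three detection scales, the sketch below computes the output shape at each scale. YOLOv3 detects at strides of 32, 16 and 8; the 416x416 input size and the 80 classes (as in COCO) are assumptions chosen for this example and are not taken from this post, and the function name is made up.

```python
def yolo_v3_output_shapes(input_size=416, num_classes=80, anchors_per_scale=3):
    """Grid size and channel depth of each of the 3 detection outputs."""
    strides = (32, 16, 8)  # downsampling factor at each detection scale
    depth = anchors_per_scale * (5 + num_classes)  # (x, y, w, h, conf) + classes
    return [(input_size // s, input_size // s, depth) for s in strides]


shapes = yolo_v3_output_shapes()
# Every grid cell at every scale predicts 3 boxes.
total_boxes = sum(h * w * 3 for h, w, _ in shapes)
```

Under these assumptions the grids are 13x13, 26x26 and 52x52: the coarse grid handles large objects and the fine grid handles small ones, which addresses the small-object weakness of v1.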