Learning Day 66: Object detection 5 — YOLO v1, v2 and v3
YOLO v1 (You Only Look Once)
- In previous object detection algorithms (e.g. Faster R-CNN, R-FCN), there are two problems to solve: classification and regression.
- YOLO reframes detection as a pure regression problem.
- A single neural network directly predicts 1) the bounding box (bbox) location, 2) the probability that the box contains an object, and 3) the probability of each class.
- NMS (Non-Maximum Suppression) is then used to filter out redundant, overlapping boxes (a minimal sketch follows this list).
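A minimal sketch of plain greedy IoU-based NMS (with an illustrative threshold; the exact variant YOLO uses is not detailed in this post):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes; boxes are [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box, drop remaining boxes that overlap it too much, repeat."""
    order = np.argsort(scores)[::-1]   # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        if rest.size == 0:
            break
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_thresh]   # suppress heavily overlapping boxes
    return keep
```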
Network architecture
- The backbone is pretrained on ImageNet with a 224x224 input; for detection, YOLOv1 then takes a 448x448 input.
- It contains both conv and FC layers, unlike R-FCN.
- The final output is 7x7x30 (the last dimension, 30, is determined by the Pascal VOC dataset; explained below).
How to interpret output 7x7x30
- The first two dimensions, 7x7, indicate that YOLOv1 conceptually divides the input image into a 7x7 grid of cells.
- Each grid cell has a depth of 30, the third dimension of 7x7x30.
- The depth of 30 consists of three components: bbox1 (depth=5), bbox2 (depth=5) and the class probabilities (depth=20).
Bbox1 & 2 (depth=5 each)
- YOLO v1 uses two bboxes per grid cell.
- The first 4 channels contain bbox1’s location and size: x, y, w, h.
- The 5th channel is a confidence score in [0, 1]: the probability that this grid cell contains any object, weighted by how well the bbox is positioned. If an object is present, the target score is the IoU between the predicted bbox and the ground truth (i.e., confidence = Pr(object) × IoU).
- The next 5 channels serve the same purpose for the 2nd bbox.
- The last 20 channels are the probabilities of each object class for this grid cell, since the Pascal VOC dataset has 20 classes.
- So the output dimension is written generically as S×S×(B×5+C), where S is the grid size, B is the number of bboxes per cell and C is the number of classes; for VOC this gives 7×7×(2×5+20) = 7×7×30 (see the sketch below).
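A small sketch of how one grid cell’s 30-dim vector splits into those parts (the cell index and variable names here are illustrative, not from the post):

```python
import numpy as np

S, B, C = 7, 2, 20                        # grid size, boxes per cell, classes (Pascal VOC)
output = np.random.rand(S, S, B * 5 + C)  # stand-in for the network's 7x7x30 output

cell = output[3, 4]                       # one arbitrary grid cell, depth 30
bbox1 = cell[0:5]                         # x, y, w, h, confidence of the 1st box
bbox2 = cell[5:10]                        # x, y, w, h, confidence of the 2nd box
class_probs = cell[10:]                   # 20 conditional class probabilities

# Class-specific score for each box = box confidence * class probability
scores1 = bbox1[4] * class_probs
scores2 = bbox2[4] * class_probs
print(output.shape, scores1.shape)        # (7, 7, 30) (20,)
```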
Loss function
- It is a summation of 5 terms, which are well explained in another post here.
- 2 terms take care of the bbox location and size, 2 terms take care of the confidence score, and 1 term takes care of the object classification (written out below).
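For reference, transcribed from the YOLOv1 paper (not from the linked post), the five terms with λ_coord = 5 and λ_noobj = 0.5 are:

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
  \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} (C_i - \hat{C}_i)^2
 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} (C_i - \hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2
\end{aligned}
```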
YOLOv1 characteristics
Advantages
- Fast.
- Low false positive rate.
- Able to learn abstract object features.
Disadvantages
- Sacrifices accuracy for speed.
- Prone to bbox localization errors.
- Not great at detecting small objects (the 7x7 grid is quite coarse).
YOLO v2
YOLOv2’s backbone is Darknet-19.
7 changes from YOLOv1
- Batch normalization included.
- Higher-resolution classifier: the classification backbone is fine-tuned on 448x448 images before detection training.
- Convolution with anchor boxes (instead of just 2 directly predicted bboxes).
- Dimension clusters: run k-means on the ground-truth box dimensions to find good anchor box priors (resulting in B=5).
- Direct location prediction: predicting unconstrained anchor offsets for x, y is unstable early in training, so YOLOv2 predicts box centres relative to the grid cell and applies a sigmoid to bound the values to 0–1, keeping each centre inside its cell (see the decoding sketch after this list).
- Fine-grained features: detection is done on a 13x13 feature map (instead of 7x7) to capture finer detail, and a passthrough layer (similar to a ResNet skip connection) combines high-resolution and low-resolution features.
- Multi-scale training: YOLOv2 is a fully convolutional network, which makes this possible: every 10 batches, a different input size is randomly chosen for training (with an FC layer, the input size would be fixed).
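A minimal sketch of the direct location prediction / anchor decoding described above (variable names and example values are mine; coordinates are in grid-cell units):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """YOLOv2-style box decoding: (cx, cy) is the cell's top-left corner in grid units,
    (pw, ph) is the anchor prior; the returned box is also in grid units."""
    bx = sigmoid(tx) + cx          # centre x stays within the cell [cx, cx + 1]
    by = sigmoid(ty) + cy          # centre y stays within the cell [cy, cy + 1]
    bw = pw * np.exp(tw)           # width scales the anchor prior
    bh = ph * np.exp(th)           # height scales the anchor prior
    return bx, by, bw, bh

# e.g. raw predictions (0.2, -0.1, 0.5, 0.3) in cell (6, 4) with anchor prior (1.9, 3.5)
print(decode_box(0.2, -0.1, 0.5, 0.3, cx=6, cy=4, pw=1.9, ph=3.5))
```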
YOLO v3
- The backbone is Darknet-53.
- Different classification loss: the softmax of v2 is replaced by independent logistic classifiers (binary cross-entropy) in v3.
- Different number of anchor boxes: 9 (3 per scale × 3 scales) in v3 versus 5 in v2.
- Detection is performed 3 times, on feature maps of different sizes, to detect objects at different scales (see the shape sketch below).
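A rough illustration of the three detection scales, assuming a 416x416 input and COCO’s 80 classes (neither is stated above); each anchor predicts x, y, w, h, objectness and the class scores:

```python
num_classes = 80                                 # assumed COCO; Pascal VOC would be 20
anchors_per_scale = 3
depth = anchors_per_scale * (5 + num_classes)    # x, y, w, h, objectness + classes per anchor

# For a 416x416 input, strides 32, 16 and 8 give the three grid sizes
for stride in (32, 16, 8):
    grid = 416 // stride
    print(f"stride={stride}: {grid}x{grid}x{depth}")
# stride=32: 13x13x255  (large objects)
# stride=16: 26x26x255  (medium objects)
# stride=8:  52x52x255  (small objects)
```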