Learning Day 66: Object detection 5 — YOLO v1, v2 and v3

De Jun Huang
Jun 20, 2021

YOLO v1 (You Only Look Once)

  • Previous object detection algorithms (e.g. Faster R-CNN, R-FCN) solve two problems: classification and regression.
  • YOLO reframes detection as a pure regression problem.
  • Use a neural network to directly predict 1) bounding box (bbox) location, 2) probability of including an object, and 3) probability of a certain class.
  • Use NMS (Non-Maximum Suppression) to filter out overlapping boxes.
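
The NMS step in the last bullet can be sketched as follows. This is a minimal greedy version with NumPy, not the exact post-processing of the YOLO code: the `[x1, y1, x2, y2]` box format, the function names and the 0.5 threshold are assumptions made for illustration.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    # Clip at 0 so that non-overlapping boxes give zero intersection
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it, repeat."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first too much
```

In practice NMS is usually applied per class, so boxes of different classes do not suppress each other.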

Network architecture

YOLO v1 network architecture (ref)
  • The network is pretrained on ImageNet with an input size of 224x224; the input resolution is then increased to 448x448 for detection in YOLOv1.
  • It contains both conv and FC layers, unlike R-FCN.
  • The final output is 7x7x30 (the last dimension, 30, is determined by the Pascal VOC dataset and is explained below).

How to interpret output 7x7x30

  • The first two dimensions, 7x7, indicate that YOLOv1 has conceptually divided the input image into 7x7 grid cells.
Illustration of grid cell and bboxes (ref)
  • Each grid cell has a depth of 30, which is the third dimension of 7x7x30.
  • The depth of 30 is made up of three components: bbox1 (depth=5), bbox2 (depth=5) and the class probabilities (depth=20).
Visualisation of output of YOLO v1 (ref)

Bbox1 & 2 (depth=5 each)

  • YOLO v1 uses two bboxes per grid cell.
  • The first 4 channels contain the location, width and height of bbox1: x, y, w, h.
  • The 5th channel contains a confidence score (from 0 to 1) that reflects both the probability of this grid cell containing any object and how well the bbox at this grid cell is positioned. If there is an object inside, the target score is the IOU of the predicted bbox and the groundtruth; otherwise it is 0.
  • The next 5 channels serve the same purpose for the 2nd bbox.
  • The last 20 channels are the probabilities of each object class at this grid cell, as the Pascal VOC dataset has 20 classes.
  • So the output dimension is generically written as: S*S*(B*5+C) where S is the grid size, B is the no. of bboxes per cell and C is the no. of classes.
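
To make the S*S*(B*5+C) layout concrete, here is a small sketch that slices a 7x7x30 array into its three components. The channel order follows the description above (bbox1, bbox2, then classes); real implementations may order the channels differently, and `split_output` is a hypothetical helper name.

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, bboxes per cell, classes (Pascal VOC)

def split_output(pred):
    """Split an (S, S, B*5+C) prediction into boxes, confidences and class probs."""
    assert pred.shape == (S, S, B * 5 + C)
    boxes_conf = pred[..., :B * 5].reshape(S, S, B, 5)
    boxes = boxes_conf[..., :4]      # x, y, w, h for each bbox
    confidence = boxes_conf[..., 4]  # one confidence score per bbox
    class_probs = pred[..., B * 5:]  # C class probabilities shared by the cell
    return boxes, confidence, class_probs

pred = np.random.rand(S, S, B * 5 + C)  # random stand-in for a network output
boxes, confidence, class_probs = split_output(pred)
print(boxes.shape, confidence.shape, class_probs.shape)
# (7, 7, 2, 4) (7, 7, 2) (7, 7, 20)

# Class-specific score of each bbox: confidence times class probability
scores = confidence[..., None] * class_probs[..., None, :]
print(scores.shape)  # (7, 7, 2, 20)
```

These class-specific scores are what gets thresholded and passed to NMS at test time.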

Loss function

  • It is a summation of 5 terms that have been well explained in another post here.
  • 2 terms take care of the bbox location and size, 2 terms take care of the confidence score, 1 term takes care of the object classification.
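
For reference, the five terms can be written out as below. This follows the form given in the original YOLO paper as I understand it, where the hatted symbols are predictions, the indicator 1_ij^obj is 1 when bbox j in cell i is responsible for an object, and the paper sets λ_coord = 5 and λ_noobj = 0.5.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
   \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
 &+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
   \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
 &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The square roots on w and h make the same absolute error count more for small boxes than for large ones.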

YOLOv1 characteristics

Advantages

  • Fast.
  • Low false positive rate.
  • Able to learn abstract object features.

Disadvantages

  • Sacrifices accuracy for speed.
  • Prone to bbox localization errors.
  • Not great at small object detection (since the 7x7 grid is pretty coarse).

YOLO v2

YOLOv2’s backbone is Darknet-19.

7 changes from YOLOv1

  1. Batch normalization included.
  2. High-resolution classifier: the classification network is fine-tuned on ImageNet at 448x448 before being trained for detection.
  3. Convolution with anchor boxes: the FC layers are removed and bboxes are predicted as offsets to anchor boxes (instead of just 2 directly predicted bboxes per cell).
  4. Dimension clusters: run K-means on the training-set bbox dimensions to pick the anchor box shapes instead of hand-picking them; k=5 (so B=5) was chosen as a trade-off between recall and model complexity.
  5. Direct location prediction: predicting unconstrained x, y offsets with anchor boxes is unstable, so the raw center outputs are passed through a sigmoid to bound them to 0–1 relative to the grid cell. The predicted bbox center therefore always stays inside the grid cell responsible for it.
  6. Fine-grained features: object detection is done on a 13x13 feature map (instead of 7x7) to get finer details. A passthrough layer (similar to the identity mappings in ResNet) combines high-res and low-res features.
  7. Multi-scale training: YOLOv2 is a fully convolutional network, which makes this possible: every 10 batches, a different input size is randomly chosen for training (if an FC layer were present, the input size would be fixed).
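
The direct location prediction in item 5 can be sketched as below, using the decoding equations from the YOLOv2 paper: b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y, b_w = p_w·e^(t_w), b_h = p_h·e^(t_h). The function and argument names are hypothetical, and all values are in grid-cell units.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(t, cell_xy, anchor_wh):
    """Decode raw outputs (tx, ty, tw, th) into a bbox in grid-cell units.

    cell_xy   -- top-left corner (cx, cy) of the grid cell
    anchor_wh -- width and height (pw, ph) of the anchor box
    """
    tx, ty, tw, th = t
    cx, cy = cell_xy
    pw, ph = anchor_wh
    bx = sigmoid(tx) + cx  # sigmoid keeps the center inside the cell
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)   # anchor size scaled by a positive factor
    bh = ph * np.exp(th)
    return bx, by, bw, bh

# Zero raw outputs give a box centered in the cell with the anchor's size
print(decode_box((0.0, 0.0, 0.0, 0.0), cell_xy=(3, 4), anchor_wh=(2.0, 1.5)))
```

However large or small the raw t_x and t_y are, the decoded center cannot leave the cell, which is what stabilizes training.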

YOLO v3

  • Backbone is Darknet 53.
  • Different class prediction loss: from softmax in v2 to independent logistic classifiers (binary cross-entropy) in v3, so a bbox can carry more than one label.
  • Different no. of anchor boxes: from 5 in v2 to 9 (3 scales x 3 anchors per scale) in v3.
  • Performs 3 detections, on 3 feature maps of different sizes, to detect objects at different scales.
3 detections are performed at 3 different feature maps of different sizes to detect features at different scales (ref)
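
As a rough illustration of the three detection scales, the sketch below computes the output shape at each scale. The 416x416 input and the 80 COCO classes are assumptions not stated in this post; the strides 32, 16 and 8 are the downsampling factors of the three detection layers.

```python
def yolo_v3_output_shapes(input_size=416, num_classes=80, anchors_per_scale=3):
    """Return the (height, width, depth) of each of the 3 detection outputs."""
    strides = (32, 16, 8)  # coarse to fine
    # Each anchor predicts 4 bbox values, 1 objectness score and the class scores
    depth = anchors_per_scale * (5 + num_classes)
    return [(input_size // s, input_size // s, depth) for s in strides]

for shape in yolo_v3_output_shapes():
    print(shape)
# (13, 13, 255)
# (26, 26, 255)
# (52, 52, 255)
```

The coarse 13x13 map handles large objects, while the 52x52 map gives the small-object resolution that YOLOv1's 7x7 grid lacked.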

Reference

Link1
