# [Paper Review] 5. YOLO ver 1

# 1. Contribution of YOLO

- *YOLO* views the object detection task as predicting spatially separated bounding boxes and their corresponding class probabilities.
- Unlike previous research, it uses **a single neural network architecture** to predict bounding box locations and the associated class probabilities **simultaneously (one-stage, end-to-end)**.
- *YOLO* achieves a real-time evaluation speed (45 frames per second).
- YOLO has relatively weak localization performance; however, it produces fewer false positives on background.

## 1.1 Why are previous networks slow?

This is due to the nature of the architectures of DPM (Deformable Parts Models) and R-CNN. In DPM, a sliding window is applied to every evenly divided region of an image to search for objects, which is time-consuming and computationally expensive. Similarly, R-CNN introduces a region proposal method that proposes potential object locations; those locations are then fed into a classifier to make class predictions. Subsequently, a post-processing step such as NMS (non-maximum suppression) deletes duplicate predictions. As one can guess, this evaluation pipeline of R-CNN is complex and slow.

The authors of *YOLO* proposed a simpler and more intuitive approach to address this issue.

## 1.2 Unifying regression and classification tasks into one.

Unlike R-CNN, a representative network of the two-stage approach, *YOLO*, as shown in *Fig 1*, has **only a single convolutional network** that is responsible for both bounding box regression and classification simultaneously.

By using the described architecture, we can obtain two benefits.

**(1) the pipeline becomes simpler and remarkably fast since the complex RoI(region of interest) proposal step is not required.**

**(2) Unlike sliding-window-based methods, YOLO can reason about the global context of an image.**

Since sliding-window methods only look at individual, isolated regions, they often lose the global context of objects in the image (see the detector’s view in *Fig 2*). Further, not all objects are box-shaped, meaning that the window a detector views is likely to contain unnecessary pixels that do not help describe an object.

*YOLO*, on the other hand, compensates for this disadvantage of sliding-window methods by piling up many convolutional layers, increasing the receptive field and thereby capturing global context.

# 2. YOLO Architecture

Now, let’s take a look at how YOLO achieves its advantages in detail.

By passing a 448⨯448 image through Conv and FC layers, we eventually retrieve a 7⨯7 (or **S**⨯**S**, depending on the image size and FC weights) feature map with 30 (5×**B** + **C**, where the number of bounding box predictions **B** = 2 and the number of classes **C** = 20) channels. Since each grid point (pixel) of the last feature map is responsible for detecting an object in a roughly 64⨯64 region of the original input image, the global context is ensured (not perfectly, but it catches context better than sliding-window approaches).
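As a concrete illustration of this output tensor, here is a minimal sketch; random values stand in for real network activations, and the shapes follow the S = 7, B = 2, C = 20 setting above:

```python
import numpy as np

S, B, C = 7, 2, 20                       # grid size, boxes per cell, classes (PASCAL VOC)
out = np.random.rand(S, S, B * 5 + C)    # stand-in for the network's final 7x7x30 feature map

# Each grid cell's 30 channels split into B box predictions and C class scores.
boxes = out[..., : B * 5].reshape(S, S, B, 5)   # (x, y, w, h, conf_score) per box
classes = out[..., B * 5 :]                     # conditional class probabilities per cell

print(out.shape, boxes.shape, classes.shape)    # (7, 7, 30) (7, 7, 2, 5) (7, 7, 20)
```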

## 2.1 Regression

More specifically, regarding the bounding box localization of *YOLO*: each grid point predicts **B** bounding boxes and the associated confidence scores of objectness (foreground or background), meaning that a predicted bounding box can be parameterized by **(x, y, w, h, conf_score)**.

**(x, y)** indicates the center of the box relative to the bounds of the grid cell, and **(w, h)** are the width and height relative to the image size. Lastly, the confidence score of objectness is defined as Pr(Object) × IOU, indicating that if an object in the image falls into a grid cell, the predicted confidence score should be the intersection over union (*IOU*) between the predicted bounding box **(x, y, w, h)** and its corresponding GT box. Otherwise, the confidence score should be zero.
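The cell-relative parameterization can be illustrated with a small helper (a hypothetical `decode_box` function, not from the paper; it only assumes the 448⨯448 input and S = 7 grid described above):

```python
def decode_box(row, col, x, y, w, h, S=7, img_size=448):
    """Convert a cell-relative YOLO prediction to absolute pixel coordinates.

    (x, y) are offsets within grid cell (row, col); (w, h) are relative
    to the whole image. All four values fall in [0, 1].
    """
    cell = img_size / S                      # 64 px per cell for 448 / 7
    cx = (col + x) * cell                    # absolute box center
    cy = (row + y) * cell
    bw, bh = w * img_size, h * img_size      # absolute width / height
    return cx, cy, bw, bh

# A box centered in grid cell (3, 3), spanning half the image in each dimension:
print(decode_box(3, 3, 0.5, 0.5, 0.5, 0.5))   # → (224.0, 224.0, 224.0, 224.0)
```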

## 2.2 Classification

Each grid cell of the last 7⨯7 feature map predicts not only regression but also classification parameters, which are the conditional class distribution Pr(Class_i | Object) over the **C** classes.

This implies the probability of an object belonging to a certain class given the fact that the object exists in the grid cell.

## 2.3 Test time

In order to retrieve a class-specific confidence score at each grid cell, the objectness confidence score of each bounding box and the conditional class probability are multiplied: Pr(Class_i | Object) × Pr(Object) × IOU = Pr(Class_i) × IOU.

In short, the described process above can be visualized in Fig 4.

For boosting your understanding, see the following description for each probability term in *YOLO*.
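The multiplication above can be sketched as a broadcasted product over the prediction tensors (random values stand in for network outputs; the `(S, S, B, C)` score layout is an assumption for illustration):

```python
import numpy as np

S, B, C = 7, 2, 20
conf = np.random.rand(S, S, B)        # Pr(Object) * IOU, one score per predicted box
cls_prob = np.random.rand(S, S, C)    # Pr(Class_i | Object), one set per grid cell

# Class-specific confidence: Pr(Class_i) * IOU = per-box objectness * class prob.
scores = conf[..., None] * cls_prob[:, :, None, :]   # shape (S, S, B, C)
print(scores.shape)   # (7, 7, 2, 20)
```

Note that since the class probabilities are shared across the B boxes of a cell, every box in a cell reuses the same conditional distribution.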

## 2.4 Loss function

The loss function of YOLO basically consists of squared errors, which have the advantage of being easy to optimize. Briefly, the function is the sum of the bounding box regression, objectness, and classification errors, each of which is squared, as shown below.

Here are explanations for each term in the loss function above.

**(1) BBOX regression loss for center coordinates (x, y)**: Sum of squared errors for the *x* and *y* coordinates. Note that *(x, y)* are parameterized as offsets to the i-th grid cell. This reduces the search space of the regression task and has a normalization-like effect, as both *x* and *y* fall between 0 and 1.

**(2) BBOX regression loss for width and height coordinates (w, h)**: Similar to (1), but with one notable difference: the square roots of *w* and *h* are used in the squared residual error, because small deviations are more critical when predicting the location of a small GT object than that of a large object. Note that *w* and *h* are relative to the image size, and are therefore in the range [0, 1].

**(3) Objectness confidence score for positive (matched) bboxes**: If an object exists in the i-th grid cell and it is matched to the j-th predicted bbox, the objectness confidence score should be one.

**(4) Objectness confidence score for negative (un-matched) bboxes**: The opposite case of (3); the objectness confidence score should be zero.

**(5) Conditional class probabilities**: If an object exists in the i-th grid cell, the classification loss at that cell is the squared error of the conditional class probabilities over all classes.
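Putting the five terms together, a simplified numpy sketch of the loss might look as follows. The tensor layout and the `resp`/`obj` masks are assumptions about how targets could be encoded, not the paper's implementation; the weights λ_coord = 5 and λ_noobj = 0.5 follow the paper:

```python
import numpy as np

def yolo_v1_loss(pred_box, pred_cls, gt_box, gt_cls, resp, obj,
                 lambda_coord=5.0, lambda_noobj=0.5):
    """Simplified sum-of-squared-errors YOLO v1 loss.

    pred_box, gt_box: (S, S, B, 5) as (x, y, w, h, conf), w and h non-negative
    pred_cls, gt_cls: (S, S, C) conditional class probabilities
    resp: (S, S, B) 1 for the predictor responsible for a GT object, else 0
    obj:  (S, S)    1 if any GT object falls into the cell, else 0
    """
    noobj = 1.0 - resp
    # (1) center coordinates, only for responsible predictors
    xy = lambda_coord * np.sum(
        resp[..., None] * (pred_box[..., :2] - gt_box[..., :2]) ** 2)
    # (2) square roots damp the error contribution of large boxes
    wh = lambda_coord * np.sum(
        resp[..., None] * (np.sqrt(pred_box[..., 2:4]) - np.sqrt(gt_box[..., 2:4])) ** 2)
    # (3) confidence of responsible boxes is pushed toward its target
    conf_obj = np.sum(resp * (pred_box[..., 4] - gt_box[..., 4]) ** 2)
    # (4) confidence of non-responsible boxes is pushed toward zero,
    #     down-weighted so background cells do not dominate the gradient
    conf_noobj = lambda_noobj * np.sum(noobj * pred_box[..., 4] ** 2)
    # (5) class probabilities, only where an object exists
    cls = np.sum(obj[..., None] * (pred_cls - gt_cls) ** 2)
    return xy + wh + conf_obj + conf_noobj + cls
```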

## 2.5 NMS

After training, at inference time, many overlapping and duplicate predictions are produced due to the nature of the YOLO design, in which multiple bounding boxes are predicted at each grid cell.

To compensate for this behavior, the well-known non-maximum suppression (NMS) is applied, which is visualized in *Fig 5*.

**The NMS process is briefly described below:**

(1) For the bounding boxes of each class, sort them by confidence score.

(2) Compute the IOU between bounding boxes.

(3) If two bounding boxes have an IOU higher than 0.5, remove the one with the lower confidence score.

(4) Repeat the above steps for all classes.
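The steps above can be sketched as a greedy per-class NMS (a minimal illustration, not an optimized implementation; boxes are assumed to be in `(x1, y1, x2, y2)` corner format, and the function is run once per class):

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS for one class: keep the highest-scoring box, drop overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                  # highest remaining confidence
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

# Two heavily overlapping boxes and one distinct box:
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))   # → [0, 2]
```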

# 3. Limitation

(1) Only one set of class probabilities is predicted per grid cell, regardless of the number of predicted bounding boxes.

(2) Speed-accuracy trade-off

(3) No prior object shapes (anchor boxes) are defined.

# 4. Reference

[1] YOLO: Redmon et al., *You Only Look Once: Unified, Real-Time Object Detection*

[2] CS 376: Computer Vision, taught by Prof. **Kristen Grauman**

[3] https://medium.com/@venkatakrishna.jonnalagadda/object-detection-yolo-v1-v2-v3-c3d5eca2312a

[4] https://jonathan-hui.medium.com/real-time-object-detection-with-yolo-yolov2-28b1b93e2088

**Any corrections, suggestions, and comments are welcome.**