YOLO : You Only Look Once.

Atharva Gundawar
4 min read · Dec 6, 2019


Although Computer Vision (CV) has only exploded recently (the breakthrough moment came in 2012, when AlexNet won the ImageNet challenge), it certainly isn't a new scientific field.

Computer vision has been with us for a long time now... so why is it that it has boomed so much lately?

Let's first talk about the history of image processing and computer vision.

First stop: the Viola-Jones algorithm, a widely used mechanism for object detection. Its main property is that training is slow, but detection is fast. A fixed-size detection window is moved step by step across the image, and at each position a cascade of simple feature tests decides whether a face is present.
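The sliding-window step can be sketched in a few lines. This is a minimal illustration, not the actual Viola-Jones cascade: `classify` stands in for the trained classifier and is purely hypothetical here.

```python
import numpy as np

def sliding_window_detect(image, classify, window=24, stride=8, threshold=0.5):
    """Slide a fixed-size window across the image and keep every window
    whose classifier score exceeds the threshold (Viola-Jones style)."""
    h, w = image.shape[:2]
    detections = []
    for y in range(0, h - window + 1, stride):
        for x in range(0, w - window + 1, stride):
            patch = image[y:y + window, x:x + window]
            score = classify(patch)  # stand-in for the cascade of feature tests
            if score >= threshold:
                detections.append((x, y, window, window, score))
    return detections

# Toy usage: a "classifier" that simply fires on bright patches.
img = np.zeros((64, 64))
img[16:40, 16:40] = 1.0
hits = sliding_window_detect(img, classify=lambda p: p.mean())
```

Note how the cost grows with the number of window positions (and, in practice, window scales) — this exhaustive scan is exactly what YOLO later avoids.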

In reality, the features describing facial structure and their relations to each other were hand-crafted rather than learned. Unsurprisingly, the model wasn't very robust: it failed if the image was rotated by an arbitrary angle.

So let's skip all the other algorithms after that and jump to the one which changed the world.

YOLO:

So, in layman's terms, what this algorithm does is look at the entire picture just once. Instead of sliding windows across the image, it divides the image into a grid of cells, and each cell predicts a fixed number of bounding boxes of different sizes.

For each of these boxes, the network predicts a confidence score: the probability that the box actually contains an object, and how well the box fits it.

At the same time, each grid cell predicts class probabilities, so every confident box is output together with the detected object's name, all in a single forward pass of the network.
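A minimal sketch of how that output can be decoded into detections, assuming YOLOv1's defaults (a 7×7 grid, 2 boxes per cell, 20 classes). The function and threshold here are illustrative, not the paper's exact post-processing (which also applies non-max suppression):

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (YOLOv1 defaults)

def decode_predictions(pred, conf_threshold=0.2):
    """Turn the S x S x (B*5 + C) output tensor into a list of detections.
    Each box stores (x, y, w, h, conf); x, y are offsets within the cell."""
    detections = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            class_probs = cell[B * 5:]
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                class_id = int(np.argmax(class_probs))
                score = conf * class_probs[class_id]  # class-specific confidence
                if score >= conf_threshold:
                    # convert cell-relative x, y to image-relative coordinates
                    cx = (col + x) / S
                    cy = (row + y) / S
                    detections.append((cx, cy, w, h, score, class_id))
    return detections

# Toy usage: one confident box in grid cell (3, 3), class 7.
pred = np.zeros((S, S, B * 5 + C))
pred[3, 3, 0:5] = [0.5, 0.5, 0.2, 0.3, 0.9]  # x, y, w, h, confidence
pred[3, 3, B * 5 + 7] = 1.0                  # class 7 with probability 1
dets = decode_predictions(pred)
```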

Network Architecture

YOLO model network architecture

The model consists of 24 convolutional layers followed by 2 fully connected layers. Alternating 1×1 convolutional layers reduce the feature space from the preceding layers. (1×1 convolutions were also used in GoogLeNet to reduce the number of parameters.)
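The role of those 1×1 layers can be shown in a few lines: a 1×1 convolution mixes channels at each pixel independently, with no spatial mixing, so it can shrink, say, 512 feature maps down to 256 before an expensive 3×3 convolution. A toy NumPy sketch (the shapes are illustrative):

```python
import numpy as np

def conv1x1(x, weights):
    """A 1x1 convolution is just a per-pixel linear map across channels:
    (H, W, C_in) -> (H, W, C_out), leaving the spatial layout untouched."""
    # einsum over the channel axis: a matrix multiply applied at every pixel
    return np.einsum('hwc,co->hwo', x, weights)

# Reduce 512 feature channels to 256, as the alternating 1x1 layers
# in the YOLO backbone do before the larger 3x3 convolutions.
x = np.random.rand(7, 7, 512)
w = np.random.rand(512, 256)
y = conv1x1(x, w)
```

A subsequent 3×3 convolution on 256 channels costs roughly half the multiply-adds of one on 512 channels, which is the whole point of the bottleneck.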

Fast YOLO has fewer convolutional layers (9 instead of 24) and fewer filters in those layers. The network pipeline is summarized below:

Whole Network Pipeline

Loss Function


The loss is a sum of squared errors made up of 5 terms.

1. 1st term (x, y): The bounding box x and y coordinates are parametrized as offsets of a particular grid cell location, so they are bounded between 0 and 1. The sum of squared errors (SSE) is computed only when there is an object.

2. 2nd term (w, h): The bounding box width and height are normalized by the image width and height so that they fall between 0 and 1. SSE is computed only when there is an object. Since small deviations matter less in large boxes than in small ones, the square roots of the width w and height h are used instead of the width and height directly, which partially addresses this problem.

3. 3rd term and 4th term (The confidence) (i.e. the IOU between the predicted box and any ground truth box): In every image, many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects, and makes the model unstable. Thus, the loss from confidence predictions for boxes that don’t contain objects is decreased, i.e. λnoobj=0.5.

4. 5th term (Class Probabilities): SSE of the class probabilities, computed only for grid cells that contain an object.

5. λcoord: For the same reason mentioned in the 3rd and 4th terms, λcoord = 5 increases the loss from the bounding box coordinate predictions.
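Putting the weights together, a simplified per-box version of the loss might look like the following. This is a sketch, not the paper's full formulation: it ignores the indicator sums over grid cells and the "responsible box" selection.

```python
import numpy as np

LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5  # weights from the YOLOv1 paper

def yolo_loss_terms(pred, target, has_object):
    """Simplified per-box YOLO loss. pred/target are dicts with keys
    x, y, w, h, conf, classes. Returns the weighted sum of the terms."""
    if has_object:
        # coordinate terms, with sqrt on w and h (terms 1 and 2)
        coord = ((pred['x'] - target['x']) ** 2 + (pred['y'] - target['y']) ** 2
                 + (np.sqrt(pred['w']) - np.sqrt(target['w'])) ** 2
                 + (np.sqrt(pred['h']) - np.sqrt(target['h'])) ** 2)
        conf = (pred['conf'] - target['conf']) ** 2          # term 3
        cls = float(np.sum((pred['classes'] - target['classes']) ** 2))  # term 5
        return LAMBDA_COORD * coord + conf + cls
    # no object: only the down-weighted confidence term contributes (term 4)
    return LAMBDA_NOOBJ * (pred['conf'] - target['conf']) ** 2

# Toy usage: a perfect prediction incurs zero loss.
p = {'x': 0.5, 'y': 0.5, 'w': 0.2, 'h': 0.3, 'conf': 1.0,
     'classes': np.zeros(20)}
loss = yolo_loss_terms(p, p, has_object=True)
```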

Unlike Haar-cascade classifiers, which rely on hand-crafted features rather than deep learning, YOLO learns everything end to end. That said, it shares the same goal of fast detection, and it was the pioneer that took computer vision to the next level.

Sources:

1. https://sergioskar.github.io/blog/

2. https://towardsdatascience.com/yolov1-you-only-look-once-object-detection-e1f3ffec8a89


#computervision #yolo
