YOLO Explained

Ani Aggarwal
Dec 27, 2020 · 12 min read

What is YOLO?

YOLO, or You Only Look Once, is a popular real-time object detection algorithm. YOLO combines what was once a multi-step process into a single neural network that performs both classification and prediction of bounding boxes for detected objects. As such, it is heavily optimized for detection performance and runs much faster than two separate networks that detect and classify objects in sequence. It does this by repurposing traditional image classifiers for the regression task of predicting bounding boxes. This article looks only at YOLOv1, the first of the many iterations this architecture has gone through. Although the subsequent iterations feature numerous improvements, the basic idea behind the architecture stays the same. YOLOv1, referred to here simply as YOLO, can run at 45 frames per second, faster than real time, making it a great choice for applications that require real-time detection. It looks at the entire image at once, and only once (hence the name You Only Look Once), which allows it to capture the context of detected objects. As a result, it makes fewer than half the false-positive background errors of R-CNN-style detectors, which look at different parts of the image separately. Additionally, YOLO learns generalizable representations of objects, making it more applicable to a variety of new environments. Now that we have a general overview of YOLO, let's take a look at how it really works.

How Does YOLO Work?

YOLO is based on the idea of dividing an image into smaller regions and making predictions for all of them in a single pass. The image is split into a square grid of dimensions S×S, and each grid cell is responsible for detecting objects whose centers fall inside it. Each cell predicts B bounding boxes, each consisting of x and y center coordinates, a width, a height, and a confidence score, along with C conditional class probabilities, so the network's output for the whole image is a single S × S × (C + B ∗ 5) tensor. The grid looks like so:

(Figure: an example image divided into the S × S grid. Credit: research paper.)
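To make the grid idea concrete, here is a small, illustrative Python sketch (the function name and conventions are my own, not anything prescribed by the paper) of how a ground-truth box center, normalized to [0, 1], maps to the grid cell responsible for predicting it, using S = 7 as in the paper:

```python
S = 7  # grid size from the paper

def responsible_cell(x_center, y_center, s=S):
    """Return (row, col) of the grid cell containing a box center.

    x_center, y_center are normalized to [0, 1] relative to the image.
    In YOLO, only this cell is responsible for detecting the object.
    """
    col = min(int(x_center * s), s - 1)  # clamp centers exactly at 1.0
    row = min(int(y_center * s), s - 1)
    return row, col

# A box centered at (0.52, 0.31) falls in cell (2, 3) of the 7x7 grid.
print(responsible_cell(0.52, 0.31))
```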
(Figure: an example of an IOU, or intersection over union: the area of intersection of the ground-truth and predicted boxes, in green, divided by the area of the union of the two boxes, in purple. IOU ranges between 0 and 1: 0 if the boxes don't overlap at all, and 1 if they are the same box. Therefore, a higher IOU is better, as it indicates a more accurate prediction. Credit: image from research paper, modified by me.)
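Since IOU shows up throughout YOLO, both for choosing which box predictor is responsible for an object and for evaluating predictions, here is a minimal sketch of how it can be computed for axis-aligned boxes. The (x1, y1, x2, y2) corner format is my choice for illustration:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) corner coordinates.
    Returns a value in [0, 1]: 0 for no overlap, 1 for identical boxes.
    """
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes give zero intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x1 square: IOU = 1 / 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```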

YOLO Architecture

The YOLO model is made up of three key components: the backbone, the neck, and the head. The backbone is the convolutional part of the network that detects key features of an image and processes them. It is first trained on a classification dataset, such as ImageNet, and typically at a lower resolution than the final detection model, as detection requires finer detail than classification. The neck takes the features from the backbone's convolutional layers and feeds them through fully connected layers to make predictions on class probabilities and bounding box coordinates. The head is the final output layer of the network, which can be interchanged with other layers with the same input shape for transfer learning. As discussed earlier, the head outputs an S × S × (C + B ∗ 5) tensor, which is 7 × 7 × 30 in the original YOLO research paper, with a grid size S of 7, C = 20 classes, and B = 2 predicted bounding boxes per cell. These three portions of the model work together to first extract key visual features from the image, then classify and bound them.
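To see how the pieces fit together, here is a heavily simplified PyTorch sketch of the backbone/neck/head split. Only the final 7 × 7 × 30 output shape comes from the paper; the toy backbone (the real one has 24 convolutional layers) and all intermediate layer sizes are illustrative stand-ins:

```python
import torch
import torch.nn as nn

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (paper values)

class TinyYOLO(nn.Module):
    """Illustrative skeleton only; not the paper's actual architecture."""

    def __init__(self):
        super().__init__()
        # Backbone: convolutional feature extractor (stand-in layers).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.AdaptiveAvgPool2d((S, S)),  # force an S x S feature map
        )
        # Neck: fully connected layers over the flattened features.
        self.neck = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * S * S, 496), nn.LeakyReLU(0.1),
        )
        # Head: linear output reshaped to S x S x (C + B * 5).
        self.head = nn.Linear(496, S * S * (C + B * 5))

    def forward(self, x):
        out = self.head(self.neck(self.backbone(x)))
        return out.view(-1, S, S, C + B * 5)  # (batch, 7, 7, 30)

x = torch.randn(1, 3, 448, 448)
print(TinyYOLO()(x).shape)  # torch.Size([1, 7, 7, 30])
```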

YOLO Training

As discussed previously, the backbone of the model is pre-trained on an image classification dataset. The original paper used the ImageNet 1000-class competition dataset and pre-trained the first 20 of the 24 convolutional layers, followed by an average-pooling layer and a fully connected layer. The authors then added 4 more convolutional layers and 2 fully connected layers, as it has been shown that adding both convolutional and fully connected layers to a pre-trained network improves performance. They also increased the input resolution from 224 × 224 to 448 × 448 pixels, as detection requires finer detail. The final layer, which predicts both class probabilities and bounding box coordinates, uses a linear activation function, while all other layers use a leaky ReLU. The model was trained for 135 epochs on the Pascal VOC 2007 and 2012 datasets with a batch size of 64. Data augmentation and dropout were used to prevent overfitting: a dropout layer with a rate of 0.5 between the first and second fully connected layers discourages the two from learning the same things (preventing co-adaptation). More details on the learning rate scheduling and other training hyperparameters are available in the original paper.
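For concreteness, here is a minimal sketch of that training setup in PyTorch. The epoch count, batch size, momentum, and weight decay are the values reported in the paper; the optimizer construction and the TinyYOLO model (from the sketch above) are illustrative stand-ins, not the authors' implementation:

```python
import torch

EPOCHS = 135      # reported in the paper
BATCH_SIZE = 64   # reported in the paper

model = TinyYOLO()  # illustrative skeleton from the previous sketch

# The paper trains with SGD using momentum 0.9 and weight decay 0.0005;
# the learning rate is warmed up to 1e-2, then decayed in steps.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-3,
    momentum=0.9,
    weight_decay=5e-4,
)
```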

YOLO Loss Function

YOLO is trained end-to-end with a sum-squared error loss over the output tensor. Plain sum-squared error weighs localization and classification errors equally and lets the many grid cells that contain no object overwhelm the gradient from the cells that do, so the paper scales the coordinate terms up with \lambda_{coord} = 5 and the confidence terms for empty cells down with \lambda_{noobj} = 0.5. The full loss is:

\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right]
+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right]
+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} (C_i - \hat{C}_i)^2
+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} (C_i - \hat{C}_i)^2
+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2

Here \mathbb{1}_{ij}^{obj} is 1 when the j-th box predictor in cell i is responsible for an object (it has the highest IOU with the ground truth among the cell's B predictors) and 0 otherwise, and \mathbb{1}_{ij}^{noobj} is its complement. Note that the width and height terms penalize (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 rather than the plain (w_i - \hat{w}_i)^2: on raw dimensions, a few pixels of error would cost the same in a large box as in a small one, but taking square roots makes the same absolute error matter more for small boxes, which is the behavior we want.
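As a hedged illustration of the coordinate terms only (not the authors' code, and assuming the responsibility mask has already been computed during target assignment), the x/y and square-root width/height portion of the loss might look like this in PyTorch:

```python
import torch

def coord_loss(pred, target, obj_mask, lambda_coord=5.0):
    """Coordinate terms of the YOLO loss for one box predictor.

    pred, target: tensors of shape (batch, S, S, 4) holding x, y, w, h.
    obj_mask: (batch, S, S) float mask, 1 where the predictor is
              responsible for an object, 0 elsewhere (assumed given).
    """
    # Center terms: plain squared error on x and y.
    xy = ((pred[..., :2] - target[..., :2]) ** 2).sum(-1)
    # Size terms: squared error on square roots of w and h, so a small
    # absolute error costs more for small boxes than for large ones.
    wh = ((pred[..., 2:].clamp(min=0).sqrt()
           - target[..., 2:].sqrt()) ** 2).sum(-1)
    return lambda_coord * (obj_mask * (xy + wh)).sum()
```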

Limitations of YOLO

YOLO can only predict a limited number of bounding boxes per grid cell (2 in the original research paper), and though that number can be increased, only one class prediction can be made per cell. This limits detection when multiple objects of different classes fall within a single grid cell. As a result, YOLO struggles with groups of small objects, such as flocks of birds, and with multiple small objects of different classes close together.

(Figure: YOLO detects only 5 of the at least 8 people in the lower left-hand corner of the image. Credit: Image source)

Conclusion

YOLO is an incredible computer vision model for object detection and classification. Hopefully, this article helped you understand how YOLO works at a high level. If you want to see the nitty-gritty details of a Python implementation, stick around: I will be publishing a follow-up blog on a PyTorch implementation of YOLO from scratch, and following along with the code is a great way to really test your understanding. YOLO is also only the first step in a larger project: a recurrent YOLO model, dubbed ROLO, which uses recurrent networks in conjunction with YOLO to further improve object detection and tracking across multiple frames. Give me a follow to see that implementation. Thanks for reading, happy coding!

Links

  • The original research paper can be found as a PDF here
  • More info on the research paper and the authors' other publications is available on their site here
  • All other sources are linked as they are used.
