Object Detection YOLO v1, v2, v3

Jan 31, 2019

Object detection reduces human effort in many fields. In our case, we use YOLO v3 to detect objects. YOLO v3 is built on Darknet-53; with its 53 convolutional layers, the model is powerful enough to identify even small objects in an image. YOLO v3 can identify 80 different classes of object in a single image, and it brings the error rate down considerably. YOLO v3 also draws thin bounding boxes.

How YOLO works:

The YOLO algorithm divides any given input image into an SxS grid. Each grid cell is responsible for detecting an object whose centre falls inside that cell, and each grid cell predicts a fixed number of bounding boxes for that object. [8]
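
As a rough illustration of the grid idea, the sketch below finds the cell that would be responsible for an object, taking "responsible" to mean the cell that contains the object's centre, as in the original YOLO paper. The function name `responsible_cell`, the default S=7 and the 448x448 image size are assumptions chosen for the example, not values taken from this article.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell that contains the box centre.

    cx, cy are the centre coordinates in pixels; S is the grid size
    (S=7 is an assumed default for illustration).
    """
    # Scale the centre into grid units, then clamp so that a centre lying
    # exactly on the right or bottom edge still maps to the last cell.
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col


# Hypothetical example: an object centred at pixel (300, 100) in a 448x448 image.
print(responsible_cell(300, 100, 448, 448))  # (1, 4)
```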

Every bounding box has five elements: (x, y, w, h, confidence score). x and y are the coordinates of the centre of the box in the input image, and w and h are its width and height respectively. The confidence score reflects both the probability that the box contains an object and how accurate the bounding box is.
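
One way to make "how accurate the bounding box is" concrete is intersection over union (IoU) between a predicted box and the ground-truth box. The sketch below assumes the training-time definition from the original YOLO paper, confidence = P(object) x IoU; the helper names `iou` and `confidence` are made up for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_centre, y_centre, w, h)."""
    # Convert centre/size form to corner coordinates.
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # The overlap is zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def confidence(p_object, predicted_box, truth_box):
    """Assumed training target: P(object) multiplied by the IoU with the truth."""
    return p_object * iou(predicted_box, truth_box)


# Two 2x2 boxes shifted by one unit overlap in a 1x2 strip: IoU = 2 / 6.
print(iou((1, 1, 2, 2), (2, 1, 2, 2)))
```

A perfectly placed box on a real object would therefore score close to 1, while a box over background would score close to 0.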

Figure 1: Bounding boxes (source: Google)

YOLO algorithm:

YOLO is based on regression: detection, localization and classification of the objects in the input image all take place in a single pass. Algorithms of this type are commonly used for real-time object detection.

Figure 2: Object detection and classification [9]

YOLO v1:

YOLO v1 architecture:

YOLO v1 is built with the Darknet framework and is pre-trained on the ImageNet-1000 dataset. It works as described above, but it has many limitations that restrict its use. It cannot find small objects when they appear in a cluster. The architecture also has difficulty generalising to objects when the image has dimensions different from those of the training images. The major issue is the localization of objects in the input image. [8]

Problem with YOLO v1:

Figure 4: From the above image, we can see that YOLO v1 has a limitation when objects are close together. YOLO detects only 5 Santas in the lower left corner, although there are 9 Santas.

YOLO v2:

The second version of YOLO, named YOLO9000, was published by Joseph Redmon and Ali Farhadi at the end of 2016. Its major improvements make this version better, faster and more advanced, so that it can compete with Faster R-CNN (an object detection algorithm that uses a Region Proposal Network to identify objects in the input image [1]) and SSD (Single Shot MultiBox Detector).

The changes from YOLO to YOLO v2:

Batch Normalization: it normalises the inputs of a layer by shifting and scaling the activations. Batch normalization reduces the shift in the values of the hidden units (covariate shift) and thereby improves the stability of the neural network. Adding batch normalization to the convolutional layers of the architecture improved mAP (mean average precision) by 2% [2]. It also helps regularise the model, which reduces overfitting overall.
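
To make the shifting and scaling concrete, here is a minimal sketch of the batch normalization arithmetic for a single channel over one mini-batch. It shows only the training-time formula; the running statistics used at inference are left out, and the parameter names `gamma`, `beta` and `eps` follow common convention rather than anything stated in this article.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one channel's activations over a mini-batch, then scale and shift."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    # Subtract the batch mean, divide by the batch standard deviation, then
    # apply the learnable scale (gamma) and shift (beta).
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


batch = [2.0, 4.0, 6.0, 8.0]   # hypothetical activations for one channel
normalised = batch_norm(batch)
print([round(v, 3) for v in normalised])
```

After normalization the activations have a mean of roughly zero and a variance of roughly one, whatever the scale of the incoming values.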

Higher Resolution Classifier: the input size in YOLO v2 has been increased from 224x224 to 448x448. This increase in input image size improved mAP (mean average precision) by up to 4%. The larger input size is applied while training the YOLO v2 architecture, Darknet-19, on the ImageNet dataset. [3]

Anchor Boxes: one of the most notable changes in YOLO v2 is the introduction of anchor boxes. YOLO v2 does classification and prediction in a single framework. The anchor boxes are used to predict bounding boxes, and they are designed for a given dataset by clustering its ground-truth boxes (k-means clustering). [4]
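
The clustering step can be sketched as below. This is an illustrative k-means over (width, height) pairs that assigns each box to the centroid it overlaps most (highest IoU), which is the distance choice described in the YOLO9000 paper; the function names, the tiny dataset and the fixed seed are all assumptions made for the example.

```python
import random


def wh_iou(box, centroid):
    """IoU of two (w, h) boxes, assuming both share the same centre."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs into k anchor shapes."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        # Assignment step: each box joins the centroid with the highest IoU.
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: wh_iou(box, centroids[i]))
            clusters[best].append(box)
        # Update step: each centroid moves to the mean width and height.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                w = sum(b[0] for b in cluster) / len(cluster)
                h = sum(b[1] for b in cluster) / len(cluster)
                new_centroids.append((w, h))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids)


# Hypothetical ground-truth box sizes: three small squares and three tall boxes.
boxes = [(10, 12), (12, 10), (11, 11), (50, 100), (55, 95), (45, 105)]
anchors = kmeans_anchors(boxes, k=2)
print(anchors)
```

On this toy data the two anchors settle on one small, roughly square shape and one tall shape, which is the kind of dataset-specific prior the anchor boxes are meant to capture.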

Fine-Grained Features: one of the main issues that had to be addressed in YOLO v1 was the detection of smaller objects in the image. YOLO v2 resolves this by dividing the image into a 13x13 grid, whose cells are smaller than those of the previous version. This enables YOLO v2 to identify and localize smaller objects in the image while remaining effective with larger objects. [4, 5]

Multi-Scale Training: YOLO v1 has a weakness in detecting objects at different input sizes: if YOLO is trained with small images of a particular object, it has trouble detecting the same object in a larger image. This has been resolved to a great extent in YOLO v2, which is trained on images randomly resized to dimensions ranging from 320x320 to 608x608 [5]. This allows the network to learn and predict objects across various input dimensions with accuracy.
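
A small sketch of how the training resolution might be picked is shown below. It assumes, following the YOLO9000 paper rather than this article, that sizes are multiples of 32 because the network downsamples by a factor of 32; under that assumption a 416x416 input gives the 13x13 grid mentioned earlier. The function names are hypothetical.

```python
import random


def random_input_size(rng, low=320, high=608, stride=32):
    """Pick a square training resolution that is a multiple of the stride."""
    return rng.choice(range(low, high + 1, stride))


def grid_size(input_size, stride=32):
    """Number of grid cells per side for a given input resolution."""
    return input_size // stride


rng = random.Random(0)  # fixed seed so the example is repeatable
for _ in range(3):
    size = random_input_size(rng)
    cells = grid_size(size)
    print(f"train at {size}x{size} -> {cells}x{cells} grid")
```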

Darknet-19: YOLO v2 uses the Darknet-19 architecture, with 19 convolutional layers, 5 max pooling layers and a softmax layer for classifying objects. The architecture of Darknet-19 is shown below. Darknet is a neural network framework written in C and CUDA. It is very fast at object detection, which is important for real-time prediction.

With these advances in several areas, YOLO v2 is better, faster and stronger, as its authors put it [6]. With multi-scale training, the network is able to detect and classify objects with different configurations and dimensions. YOLO v2 is also much more accurate at detecting smaller objects, which its predecessor struggled with.

YOLO v3:

YOLO v3 is an incremental improvement on the previous version, and that is how its authors describe it [7]. Many object detection algorithms have been around for a while now, so the competition is all about how accurately and quickly objects are detected. YOLO v3 has all we need for detecting and classifying objects accurately in real time.

Let us look at these so-called incremental improvements in YOLO v3:

Bounding Box Predictions: YOLO v3 gives an objectness score for each bounding box. It uses logistic regression to predict the objectness score.

Class Predictions: YOLO v3 uses independent logistic classifiers for every class instead of the softmax used in the previous YOLO v2. This allows multi-label classification in YOLO v3. With a softmax layer, if the network is trained for both person and man, the probability is shared between the two classes, say 0.4 and 0.47. Independent classifiers instead give a separate probability for each class. For example, if the network is trained for person and man, it could give a probability of 0.85 to person and 0.8 to man and label the object in the picture as both man and person.
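
The difference between the two schemes can be seen in a few lines. The sketch below applies softmax and independent sigmoids to the same raw scores; the scores, the class list and the 0.5 threshold are hypothetical values picked so that the sigmoid outputs land near the 0.85 and 0.8 used in the example above.

```python
import math


def softmax(logits):
    """Probabilities that compete with each other and sum to 1."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Independent probability for a single class."""
    return 1.0 / (1.0 + math.exp(-z))


classes = ["person", "man", "dog"]
logits = [1.73, 1.39, -3.0]              # hypothetical raw class scores

shared = softmax(logits)
independent = [sigmoid(z) for z in logits]

for name, s, i in zip(classes, shared, independent):
    print(f"{name:>6}: softmax={s:.2f}  sigmoid={i:.2f}")

# With independent classifiers, every class above the threshold becomes a label.
multi_labels = [c for c, p in zip(classes, independent) if p > 0.5]
print(multi_labels)
```

Because the softmax outputs must sum to 1, person and man split the probability between them, whereas the sigmoid outputs can both be high at once.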

Feature Pyramid Networks (FPN): YOLO v3 makes predictions in a way similar to FPN: predictions are made at 3 different scales, with 3 boxes predicted at every location of each scale's feature map. This makes YOLO v3 better at detecting objects at different scales [5]. As explained in the paper [7], each prediction is composed of a bounding box, an objectness score and 80 class scores. Upsampling the feature maps from previous layers and combining them with earlier feature maps yields meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map. Adding a few more convolutional layers to process this combined feature map improves the output [7].
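
The size of the prediction tensors follows directly from these numbers. The sketch below assumes a 416x416 input and downsampling strides of 32, 16 and 8 for the three scales, which are common YOLO v3 settings but are not stated in this article; the function name is made up for the example.

```python
def yolo_v3_output_shapes(input_size=416, num_classes=80,
                          boxes_per_cell=3, strides=(32, 16, 8)):
    """Return the (rows, cols, depth) of the prediction tensor at each scale."""
    # Each box carries 4 coordinates, 1 objectness score and the class scores.
    depth = boxes_per_cell * (4 + 1 + num_classes)
    return [(input_size // s, input_size // s, depth) for s in strides]


shapes = yolo_v3_output_shapes()
print(shapes)  # [(13, 13, 255), (26, 26, 255), (52, 52, 255)]

# Total number of boxes predicted for one image across the three scales.
total_boxes = sum(rows * cols * 3 for rows, cols, _ in shapes)
print(total_boxes)  # 10647
```

The finer 52x52 grid is what gives the network more chances to pick up small objects, while the coarse 13x13 grid covers large ones.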

Darknet-53: the predecessor, YOLO v2, used Darknet-19 as its feature extractor, while YOLO v3 uses the Darknet-53 network, which has 53 convolutional layers. It is much deeper than the network in YOLO v2 and also has shortcut connections [6]. Darknet-53 is composed mainly of 3x3 and 1x1 filters with shortcut connections.

Object detection reduces human effort in many fields. Detecting objects accurately and in real time is one of the major requirements in a world where self-driving cars are becoming a reality. There is a lot of scope for improvement in object detection algorithms such as YOLO v3, Faster R-CNN, SSD and many others. Even the slightest improvements in these algorithms can entirely change perception in the real world.

REFERENCES:

[1]. Towards Data Science. (2018). R-CNN, Fast R-CNN, Faster R-CNN, YOLO — Object Detection Algorithms. [online] Available at: https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e [Accessed 6 Dec. 2018].

[2]. Towards Data Science. (2018). Batch normalization in Neural Networks — Towards Data Science. [online] Available at: https://towardsdatascience.com/batch-normalization-in-neural-networks-1ac91516821c [Accessed 5 Dec. 2018].

[3]. Medium. (2018). YOLOv3: A Huge Improvement — Anand Sonawane — Medium. [online] Available at: https://medium.com/@anand_sonawane/yolo3-a-huge-improvement-2bc4e6fc44c5 [Accessed 1 Dec. 2018].

[4]. Medium. (2018). (Part 1) Generating Anchor boxes for Yolo-like network for vehicle detection using KITTI dataset. [online] Available at: https://medium.com/@vivek.yadav/part-1-generating-anchor-boxes-for-yolo-like-network-for-vehicle-detection-using-kitti-dataset-b2fe033e5807 [Accessed 2 Dec. 2018].

[5]. Medium. (2018). YOLOv3: A Huge Improvement — Anand Sonawane — Medium. [online] Available at: https://medium.com/@anand_sonawane/yolo3-a-huge-improvement-2bc4e6fc44c5 [Accessed 6 Dec. 2018].

[6]. Redmon, J. and Farhadi, A. (2016). YOLO9000: Better, Faster, Stronger.

[7]. Redmon, J. and Farhadi, A. (2018). YOLOv3: An Incremental Improvement.

[8]. Medium. (2018). Real-time Object Detection with YOLO, YOLOv2 and now YOLOv3. [online] Available at: https://medium.com/@jonathan_hui/real-time-object-detection-with-yolo-yolov2-28b1b93e2088 [Accessed 8 Dec. 2018].

[9]. Kdnuggets.com. (2018). Object Detection and Image Classification with YOLO. [online] Available at: https://www.kdnuggets.com/2018/09/object-detection-image-classification-yolo.html [Accessed 4 Dec. 2018].

Written by Venkata Krishna Jonnalagadda