YOLO Object Detection Made Easy

Nitin Chauhan
Published in Analytics Vidhya
6 min read · Apr 11, 2020

You Only Look Once (YOLO) is one of the state-of-the-art architectures in the field of object detection. It is fast enough to be used in a number of real-time applications.

So at first it may seem like rocket science. You may want to skip all the theoretical explanation and jump straight to the implementation. But what if I told you that understanding this is easy?

So I will first list all the important things you need to know before implementing YOLO: how it works, how the bounding boxes are decided, the loss function, and so on.

And second, just as important, its implementation: especially what you should not do, and how to do it right.

So let's start! Be ready 👏👀

So first of all, YOLO has three versions:

  1. YOLO v1
  2. YOLO v2
  3. YOLO v3

What makes all of its versions interesting is the network's output: a grid. The convolutions make it possible to compute predictions at different positions in the image in an optimized way.

The most important feature of v3 is that it makes predictions at 3 different scales. A cell can have multiple bounding boxes.

It takes an image as input and divides it into an S × S grid (where S is a natural number).

So the shape of the detection kernel is 1×1×(B×(5+C)), and for the complete grid it is S×S×(B×(5+C)).

Where

→ S×S is the number of grid cells into which YOLO divides the input image.

→ B is the number of bounding boxes predicted per cell.

→ C is the number of Classes

and the reason behind the 5 is that each bounding box carries four box attributes (x, y, width, height) and one objectness (confidence) score.
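As a sanity check, the output tensor size can be computed directly. A minimal sketch, assuming the YOLO v3 convention in which each of the B boxes carries 4 coordinates, 1 objectness score, and its own C class scores:

```python
def yolo_output_shape(S, B, C):
    """Spatial grid S x S, with B boxes per cell and C classes.

    Each box predicts 4 coordinates, 1 objectness score, and
    C class scores, so the depth per cell is B * (5 + C).
    """
    return (S, S, B * (5 + C))

# 13x13 grid, 3 boxes per cell, 80 COCO classes
print(yolo_output_shape(13, 3, 80))  # (13, 13, 255)
```

The depth of 255 is exactly the filters=255 value you will meet later when editing the config file.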

Anchor boxes: this is a term you will hear while implementing YOLO and reading its documentation. Anchor boxes are a set of predefined bounding boxes of a certain height and width. These boxes are defined to capture the scale and aspect ratio of the specific object classes you want to detect, and they are typically chosen based on object sizes in your training dataset.

YOLO v3 uses 9 anchor boxes in total: three for each scale.
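For reference, here is a sketch of how those nine anchors are grouped, using the (width, height) values shipped in the stock yolov3.cfg (estimated from COCO; your own config may use different values):

```python
# Default anchors from the stock yolov3.cfg (COCO), as (width, height) in pixels.
ANCHORS = [(10, 13), (16, 30), (33, 23),       # finest grid (small objects)
           (30, 61), (62, 45), (59, 119),      # medium grid (medium objects)
           (116, 90), (156, 198), (373, 326)]  # coarsest grid (large objects)

# Three anchors are assigned to each of the three prediction scales.
scales = [ANCHORS[0:3], ANCHORS[3:6], ANCHORS[6:9]]
print(len(ANCHORS), [len(s) for s in scales])  # 9 [3, 3, 3]
```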

When any object detection algorithm runs, it predicts a number of bounding boxes, often several on the same object. The output is in a format like:

Object 1 : (X₁, Y₁, Height₁, Width₁), Classₓ

Object 2 : (X₂, Y₂, Height₂, Width₂), Classₓ

and it goes so on…….

So it is the algorithm's responsibility to decide which bounding box is identifying which object. Two bounding boxes may be identifying the same object; it is also possible that two bounding boxes at nearly the same coordinates are detecting different objects.

To resolve this there are several techniques, and one of them is non-maximum suppression based on IoU (Intersection over Union). Based on it, overlapping predictions are filtered and only the retained bounding boxes keep their confidence scores.

IoU: it is simply the ratio of the area of overlap to the area of union between two boxes (typically a predicted box and the ground truth).
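To make both ideas concrete, here is a minimal sketch of IoU for axis-aligned boxes in (x1, y1, x2, y2) corner format, plus the greedy non-maximum suppression that uses it to discard duplicate detections. The box format and the 0.5 threshold are illustrative assumptions, not something fixed by YOLO itself:

```python
def iou(a, b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop every box that overlaps it above the threshold, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] -- the near-duplicate box 1 is suppressed
```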

mAP (mean Average Precision) is used as an evaluation criterion for YOLO. It is the mean of the average precisions over all classes, where each class's average precision is the average of 11 points on the PR curve, one for each recall threshold.
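The 11-point scheme can be sketched in a few lines: for each recall level r in {0.0, 0.1, …, 1.0}, take the best precision achieved at recall ≥ r, and average the eleven values. The PR points in the example are made-up numbers for illustration:

```python
def average_precision_11pt(recalls, precisions):
    """PASCAL-VOC-style 11-point interpolated average precision.

    For each recall level r in {0.0, 0.1, ..., 1.0}, take the maximum
    precision achieved at recall >= r, then average the 11 values.
    """
    ap = 0.0
    for t in [i / 10 for i in range(11)]:
        candidates = [p for r, p in zip(recalls, precisions) if r >= t]
        ap += max(candidates) if candidates else 0.0
    return ap / 11

# A perfect detector keeps precision 1.0 at every recall level.
print(average_precision_11pt([0.0, 0.5, 1.0], [1.0, 1.0, 1.0]))  # 1.0
```

mAP is then just the mean of this value over all classes.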

More bounding boxes per image 🤔

YOLO v3 predicts more bounding boxes than its previous version (which also makes it a bit slower). That is because it predicts boxes at 3 different scales. YOLO v3 predicts around 10× as many bounding boxes as the v2 architecture.

Understanding its Loss Function

This is the loss function of YOLO v2.
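The equation image did not survive extraction, so for reference here is the sum-of-squared-error loss from the original YOLO paper; v2 keeps the same overall structure:

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[(x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2\right] \\
 &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
 &+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i-\hat{C}_i\right)^2
  + \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i-\hat{C}_i\right)^2 \\
 &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
```

Here 𝟙ᵢⱼ^obj is 1 when box j of cell i is responsible for an object, C is the objectness confidence, and p(c) the class probability.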

It seems a bit intimidating, so let's break it down piece by piece.

👉 The first term penalizes the coordinates predicted for the bounding box. The second is for the predicted height and width of the object. And the last one penalizes the class predicted for the object.

In YOLO v2 these terms are computed as sums of squared errors. In the v3 architecture, the objectness and class terms are instead computed with log-loss, as in logistic regression.

So these were some of the important concepts behind YOLO. Now let's jump to the implementation part.

Implementation

I will cover the Darknet implementation here. There are also other implementations, such as Darkflow and Keras ports of YOLO.

So first, and importantly, there are two GitHub repositories for the Darknet model: pjreddie's (the official one) and AlexeyAB's fork.

Clone the second one if you want all the functionality to work. You will find many commands on the internet that only work with AlexeyAB's repo, since it has been modified and extended over time to match users' needs.

The best way to use YOLO is to first read the repository's documentation completely, otherwise you may get stuck halfway and curse yourself 🤐.

I will list some of the important things mentioned there.

The first thing is that you have to label your dataset in YOLO format. There are a lot of tools for this; one of them is labelImg, which is pretty simple to install and use.
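In YOLO format, each image gets a .txt file with one line per object: `<class_id> <x_center> <y_center> <width> <height>`, where the last four values are fractions of the image size. A minimal converter from a pixel-coordinate corner box (the example box and image size are hypothetical):

```python
def to_yolo_line(class_id, x1, y1, x2, y2, img_w, img_h):
    """Convert a pixel-space corner box (x1, y1, x2, y2) to a YOLO
    label line: class id, then center x/y and width/height, all
    expressed as fractions of the image dimensions (0..1)."""
    xc = (x1 + x2) / 2 / img_w
    yc = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# A 100x100 box centered in a 416x416 image, labeled class 0
print(to_yolo_line(0, 158, 158, 258, 258, 416, 416))
# 0 0.500000 0.500000 0.240385 0.240385
```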

First, clone the repository on your system (you can also use Colab, as it gives you good GPU support).

  1. Create file yolo-obj.cfg with the same content as yolov3.cfg (or copy yolov3.cfg to yolo-obj.cfg) and:
  • change line batch to batch=64
  • change line subdivisions to subdivisions=16
  • change line max_batches to classes*2000 (but not less than 4000), e.g. max_batches=6000 if you train for 3 classes
  • change line steps to 80% and 90% of max_batches, e.g. steps=4800,5400
  • set the network size width=416 height=416, or any other multiple of 32
  • change line classes=80 to your number of objects in each of the 3 [yolo] layers
  • change filters=255 to filters=(classes + 5)×3 in the 3 [convolutional] layers before each [yolo] layer; keep in mind it only has to be the last [convolutional] before each of the [yolo] layers.

So if classes=1, then filters=18; if classes=2, then write filters=21.
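This filters count just follows from the detection kernel depth: three anchors per [yolo] layer, each predicting 4 coordinates, 1 objectness score, and the class scores. A quick check:

```python
def yolo_filters(classes, anchors_per_layer=3):
    """filters = (classes + 5) * anchors, where 5 = 4 box coords + objectness."""
    return (classes + 5) * anchors_per_layer

for c in (1, 2, 80):
    print(c, yolo_filters(c))  # 1 -> 18, 2 -> 21, 80 -> 255
```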

2. Create file obj.names in the directory build\darknet\x64\data\, with the object names, each on a new line.

3. Create file obj.data in the directory build\darknet\x64\data\, containing the paths below (where classes = number of objects).
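For example, a typical obj.data for two classes looks like this (the paths follow the repository's layout; adjust them to your own setup):

```
classes = 2
train = data/train.txt
valid = data/test.txt
names = data/obj.names
backup = backup/
```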

4. Put the image files (.jpg) of your objects in the directory build\darknet\x64\data\obj\

Place your train.txt and test.txt files in the data folder.

5. Download the pre-trained weights for the convolutional layers (darknet53.conv.74) and put the file in the directory build\darknet\x64

6. Start training by using the command line: ./darknet detector train data/obj.data yolo-obj.cfg darknet53.conv.74

The best thing is that the weights are saved regularly: every 100 iterations the latest weights are written to yolo-obj_last.weights in the backup folder. So if you lose your session for any reason, you can continue from the saved weights using the command:

./darknet detector train data/obj.data yolo-obj.cfg backup/yolo-obj_last.weights

When to stop

Usually 2000 iterations per class (object) are sufficient, but not less than 4000 iterations in total. For a more precise idea of when to stop training, use the following guide:

  1. During training you will see varying indicators of error; you should stop when the 0.XXXXXXX avg value (average loss) no longer decreases:

Region Avg IOU: 0.798363, Class: 0.893232, Obj: 0.700808, No Obj: 0.004567, Avg Recall: 1.000000, count: 8
Region Avg IOU: 0.800677, Class: 0.892181, Obj: 0.701590, No Obj: 0.004574, Avg Recall: 1.000000, count: 8

9002: 0.211667, 0.60730 avg, 0.001000 rate, 3.868000 seconds, 576128 images
Loaded: 0.000000 seconds

  • 9002 — iteration number (number of batch)
  • 0.60730 avg — average loss (error) — the lower, the better

You can find an implementation of YOLO here for your reference.

Till then Happy Learning!!

Please Clap if you like it👏👏
