All About YOLO Object Detection and its 3 versions (Paper Summary and Codes!!)

Mehul Gupta
Data Science in your pocket
6 min read · Sep 6, 2019


OVERVIEW:

Object Detection has been among the hottest areas in Data Science. A lot of models have been explored and have achieved tremendous success, but the first and foremost that comes to mind is YOLO, i.e. You Only Look Once. Thanks to its tremendous speed, it has found its way into a number of real-time applications. The name itself explains a lot: the network looks at the entire image only once! In this post we will explore how YOLO works and compare its different versions.

ABOUT YOLO:

YOLO has been among the most popular object detection algorithms. Some important things to know are:

  • It takes an image and divides it into an S x S grid (where S is a natural number).
  • Each grid cell is responsible for a fixed number of bounding box predictions (5 in our case). A cell is made responsible for an object when the object's centre falls inside that cell, and out of all the boxes it predicts, only the one that best matches the object is kept as responsible for that detection; the other predictions are rejected.
  • Each cell also predicts C conditional class probabilities (one per class, the likelihood of each object class given that the cell contains an object).

Total number of values predicted per image = S x S x ((B * 5) + C)

Where

S x S = the number of grid cells YOLO divides the input image into

B * 5 = B is the number of bounding boxes predicted per grid cell (before any threshold is applied). For each bounding box, 5 values are predicted:

the detected object's centre coordinates (x, y),

its height and width,

and a confidence score.

C = the number of classes (one conditional class probability per class).

Hence, if the image is divided into a 2 x 2 grid and each cell predicts 10 boxes over 3 classes (Dog, Cat, Mouse), we get 2 x 2 x (10 * 5 + 3) = 212 predicted values.
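As a quick sanity check, the same arithmetic in a few lines of Python (the values are the ones from the example above, not YOLO's defaults):

    S, B, C = 2, 10, 3             # 2 x 2 grid, 10 boxes per cell, 3 classes
    values_per_cell = B * 5 + C    # x, y, w, h, confidence per box + class probabilities
    print(S * S * values_per_cell) # 212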

  • Two thresholds are taken into consideration for selecting the final detections (a small IoU sketch follows these two points):
  1. IoU threshold: IoU (Intersection over Union) measures the overlap between two boxes, the area of their intersection divided by the area of their union. Here the two boxes being compared are the predicted box and the true object's box.

2. Confidence threshold: the minimum confidence the model must have in a detected object (its box confidence score).
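To make the IoU threshold concrete, here is a minimal IoU sketch in Python. It assumes boxes are given as (x1, y1, x2, y2) corner coordinates; this is only an illustration, not Darknet's implementation.

    def iou(box_a, box_b):
        # intersection rectangle
        x1 = max(box_a[0], box_b[0])
        y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2])
        y2 = min(box_a[3], box_b[3])
        intersection = max(0, x2 - x1) * max(0, y2 - y1)

        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return intersection / (area_a + area_b - intersection)

    print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14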

  • A few scores are calculated for every detected box: the box confidence score = P(object) x IoU, the conditional class probability = P(class | object), and the class confidence score = P(class | object) x P(object) x IoU.

P(object) is 1 if a box is detected, else 0. To be clear, the IoU here isn't the IoU threshold mentioned above but the predicted IoU between the box and the ground truth.
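For illustration, the same scores with made-up numbers (the variable names are mine, not the paper's):

    p_object = 1.0             # 1 because this cell is responsible for a detected object
    predicted_iou = 0.85       # predicted IoU with the ground-truth box
    p_dog_given_object = 0.9   # conditional class probability for, say, "dog"

    box_confidence = p_object * predicted_iou               # how sure we are the box holds an object
    class_confidence = box_confidence * p_dog_given_object  # how sure we are the box holds a dog
    print(box_confidence, class_confidence)                 # 0.85 0.765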

Hence the entire process of detection can be summed up as:

  • Divide the image into an S x S grid
  • Predict bounding boxes for each cell
  • Calculate the various scores mentioned above
  • Depending on the confidence and IoU thresholds, select some boxes
  • Use Non-Max Suppression: it avoids duplicate detections by rejecting multiple predictions of the same object. Given the predicted boxes that cover the same object, it keeps only the detection with the maximum confidence (a small sketch follows this list).
  • If a detection clears both the IoU and the confidence thresholds, it is taken as a final prediction
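A minimal Non-Max Suppression sketch in Python, assuming each box is a tuple (x1, y1, x2, y2, confidence) and reusing the iou() function sketched earlier; again, an illustration rather than Darknet's code:

    def non_max_suppression(boxes, iou_threshold=0.5):
        boxes = sorted(boxes, key=lambda b: b[4], reverse=True)  # highest confidence first
        kept = []
        for box in boxes:
            # keep a box only if it does not overlap too strongly with an already-kept box
            if all(iou(box[:4], k[:4]) < iou_threshold for k in kept):
                kept.append(box)
        return kept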

mAP:

Mean Average Precision (mAP) is the metric used to evaluate YOLO.

Loss function:

It has 4 major parts:

  • Error in the (x, y) centre coordinates of the detected object (weighted up by λ_coord to emphasise localisation)
  • Error in the height and width of the detected object (square roots are taken so that the same error counts less for large boxes than for small ones)
  • Error in classification
  • Error in the confidence of the detected object (weighted down by λ_noobj for cells that contain no object)
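A simplified sketch of these four terms in NumPy, assuming one box per grid cell, non-negative predicted widths and heights, and a prediction tensor laid out as [x, y, w, h, confidence, class probabilities...]. It is meant to mirror the structure of the loss, not to reproduce Darknet's exact implementation:

    import numpy as np

    def yolo_v1_style_loss(pred, target, lambda_coord=5.0, lambda_noobj=0.5):
        obj = target[..., 4] == 1.0   # cells whose box is responsible for an object
        noobj = ~obj

        # 1. error in the (x, y) centre coordinates
        xy = np.sum((pred[..., 0:2] - target[..., 0:2])[obj] ** 2)

        # 2. error in width/height, on square roots so large boxes are penalised less
        wh = np.sum((np.sqrt(pred[..., 2:4]) - np.sqrt(target[..., 2:4]))[obj] ** 2)

        # 3. confidence error, down-weighted for cells that contain no object
        conf_err = (pred[..., 4] - target[..., 4]) ** 2
        conf = np.sum(conf_err[obj]) + lambda_noobj * np.sum(conf_err[noobj])

        # 4. classification error for object-containing cells
        cls = np.sum((pred[..., 5:] - target[..., 5:])[obj] ** 2)

        return lambda_coord * (xy + wh) + conf + cls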

Different Versions

YOLO requires a neural network framework for training, and for this Darknet is used. The first version has 26 layers in total: 24 convolution layers followed by 2 fully connected layers. The major problem with YOLOv1 is its inability to detect very small objects.

After the first version, two more versions of YOLO were released:

YOLO9000 / YOLOv2:

  • Batch Normalization layers were added after each convolution layer
  • It has 30 layers, compared to YOLOv1's 26
  • Anchor Boxes were introduced.

Anchor boxes are predefined boxes provided to Darknet by the user; they give the network a prior idea of the typical positions and dimensions of the objects to be detected. They have to be calculated from the objects in the training set (a small clustering sketch appears at the end of this list).

  • No fully connected layers are present
  • Random dimensions are taken for the training images, ranging from 320 to 608 (multi-scale training)
  • Multiple labels might apply to the same object, but it is still a multiclass problem (the WordTree concept), i.e. either the parent or the child becomes the final label, not both
  • Still bad with small objects
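For a feel of how anchors can be derived, here is a rough sketch: k-means clustering over the (width, height) of the ground-truth boxes in the training set. The paper's variant actually uses an IoU-based distance, and Darknet has its own calc_anchors command (shown later); this scikit-learn version with plain Euclidean distance and random placeholder data is only an illustration.

    import numpy as np
    from sklearn.cluster import KMeans

    # (width, height) of every ground-truth box, normalised to [0, 1];
    # random placeholders here instead of real training labels
    box_dims = np.random.rand(500, 2)

    kmeans = KMeans(n_clusters=5, n_init=10).fit(box_dims)  # YOLOv2 uses 5 anchors
    print(kmeans.cluster_centers_)  # each row is one anchor's (width, height)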

YOLOv3:

  • A 106-layer neural network
  • Detection at 3 scales, for objects from small to very large
  • 9 anchor boxes are used, 3 per scale; hence more bounding boxes are predicted than in YOLO9000 and YOLOv1 (a quick count follows this list)
  • The multiclass problem is turned into a multilabel problem
  • Certain changes to the loss function
  • Quite good with small objects
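A quick count of how many boxes YOLOv3 predicts for the default 416 x 416 input, assuming the usual strides of 32, 16 and 8 for the three scales:

    grid_sizes = [416 // 32, 416 // 16, 416 // 8]   # 13, 26, 52
    boxes = sum(g * g * 3 for g in grid_sizes)      # 3 anchors per scale
    print(boxes)                                    # 10647 boxes per image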

Implementation:

You can find a Vehicle Number Plate Detection example using YOLOv3 and Darknet at the link below:

Vehicle_Number_Plate_Detection for your reference

A brief description of the major changes you need to make for custom training:

  • Label your training dataset as described below: each image gets a text file with one line per object, in the format <label_id> <x_center> <y_center> <width> <height>, with the coordinates normalised by the image width and height (an example follows this list).
  • For multiple objects, use the same format for all objects in the same file, one below the other, with the correct label_id. Also, the text file has the same name as the image.
  • For creating a labelled dataset, you might want to look at YOLO_MARK.
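For example, if cat_and_dog.jpg contains two objects, its label file cat_and_dog.txt could look like this (the numbers below are made up):

    0 0.52 0.47 0.31 0.28
    1 0.18 0.66 0.10 0.15

Here 0 and 1 are the label IDs, and the four numbers on each line are the normalised centre x, centre y, width and height of that object's box.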

Once done with the dataset, follow the steps below (the first change is made inside yolo-obj.cfg in the cfg folder):

  • Change the 'classes' value in each [yolo] layer to your number of classes, and change the 'filters' value (not everywhere, only in the [convolutional] block just above each 'classes' line) to (classes + 5) * 3, as shown in the snippet after this list.
  • Create train.txt and test.txt containing the path of each training/validation image.
  • Once done with this, download the initial weights for training (darknet53.conv.74, used in the training command later) from the below link:
  • Now we are ready with all our guns loaded; we just need to place everything in the right place:
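The relevant part of yolo-obj.cfg then looks roughly like this for a 3-class model (there are three such [yolo] blocks in YOLOv3, and each one, together with the [convolutional] block just above it, needs the same change; unrelated lines are omitted here):

    [convolutional]
    ...
    # (classes + 5) * 3 = (3 + 5) * 3
    filters=24
    activation=linear

    [yolo]
    ...
    classes=3
    num=9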

Commands to run darknet for different purposes (on Linux):

  • ./darknet detector train data/obj.data yolo-obj.cfg darknet53.conv.74 -dont_show >> yolo_rotate_1.log

This trains your model using obj.data, yolo-obj.cfg and the initial weights you downloaded. -dont_show suppresses the display window, and the >> redirection writes the training output to yolo_rotate_1.log.

  • ./darknet detector test data/obj.data yolo-obj.cfg backup/yolo-obj_final.weights -ext_output -dont_show -out result.json < data/valid.txt

This tests the model using the final weights stored in the backup folder. -ext_output prints the coordinates of the detected objects, and data/valid.txt lists the validation images.

  • ./darknet detector calc_anchors data/obj.data -num_of_clusters 9 -width 416 -height 416

This calculates anchor boxes from the training dataset (9 clusters for a 416 x 416 network size).
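For reference, a typical obj.data for the 3-class example might look like the following (the paths are placeholders), with obj.names simply listing one class name per line and train.txt/test.txt listing one image path per line.

data/obj.data:

    classes = 3
    train = data/train.txt
    valid = data/test.txt
    names = data/obj.names
    backup = backup/

data/train.txt:

    data/obj/img_0001.jpg
    data/obj/img_0002.jpg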

  • Remember that the paths used in train.txt and test.txt should be relative to the darknet folder.
  • Do get a GPU, else your life will be quite hectic. If you have one, set GPU=1 in the Makefile.
  • With a GPU you can also set CUDNN=1 for faster training; on CPU alone, training still works but is not at all fast 😅

Conclusion:

YOLO has its ups and downs. The trained model size can go up to 250 MB, which is considerable. It performs exceptionally well on large objects but still needs some modification for small objects. Even so, I would suggest you use YOLOv3, so let the detection game begin!!!
