Prototyping YOLO: A First Step Towards Computer Vision

Munesh Lakhey
Published in Towards Data Science · 7 min read · Apr 3, 2020

In this post, I will walk through the key steps of YOLO, with the main emphasis on building the stack of input images and target labels. The loss function in YOLO is a multi-parameter regression problem; understanding it and preparing the data accordingly is the key to a successful implementation.

Introduction

YOLO (You Only Look Once) was first published in 2015 by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, and has been a popular computer vision algorithm ever since. Unlike most other detection algorithms, YOLO uses a single network to carry out the entire mapping, detecting multiple objects and their bounding boxes in one pass. As a result, it can process images significantly faster, reaching 45 fps or more.

Motivation

The YOLO paper is long and complex. Most implementations rely on libraries to extract and load the data, and different implementations often use slightly different loss functions. However, it is quite possible, and much more interesting, to implement it from scratch.

Another interesting aspect of YOLO is that it optimizes the same network to do multiple tasks at the same time, succinctly illustrating that neural networks are basically highly flexible mapping functions that can be optimized over a number of constraints to find the desired solutions.

Getting Data

For this write-up I am going to use the PASCAL VOC 2007 object classes. The first step is to download and extract the tar file. The extracted archive contains metadata, .xml annotation files, and images. The ground truths corresponding to each image are then extracted from the .xml files by navigating the XML tree, as sketched below.
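
A minimal sketch of that extraction step, using only the standard library; the helper name and the example path are my own, not taken from the original code.

```python
# Parse one VOC 2007 annotation file and collect the ground-truth boxes.
import xml.etree.ElementTree as ET

def parse_voc_annotation(xml_path):
    """Return image size and a list of (class name, xmin, ymin, xmax, ymax)."""
    root = ET.parse(xml_path).getroot()
    size = root.find('size')
    width = int(size.find('width').text)
    height = int(size.find('height').text)
    objects = []
    for obj in root.findall('object'):
        name = obj.find('name').text
        box = obj.find('bndbox')
        coords = [int(float(box.find(tag).text))
                  for tag in ('xmin', 'ymin', 'xmax', 'ymax')]
        objects.append((name, *coords))
    return width, height, objects

# Example (path is illustrative):
# w, h, objs = parse_voc_annotation('VOCdevkit/VOC2007/Annotations/000005.xml')
```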

Pre-processing (preparing data for training)

As mentioned above, this is the key step in implementing YOLO. Even though object detection is the ultimate goal, YOLO is set up as a regression problem across multiple variables, so matching each image with all of its key numbers (targets) is the backbone of the architecture.

The bounding boxes in the data are given as coordinates of the upper-left and lower-right corners, and the images come in different sizes. We need to resize every image to a fixed size (416 x 416 here) and, for each object, find the coordinates of its center along with its width and height relative to the resized image. All of these quantities are expressed as fractions of the image size, so coordinates and sizes range from 0 to 1. For example, an object of size 208 x 208 in the resized image will have a final size of 0.5 x 0.5. Pandas is an excellent tool for this.

The following code transforms the bounding box data from pixel values to ratios with respect to the resized image. These ratios are then used as targets to compare with the model outputs during training.
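
A hedged pandas sketch of that conversion; the column names (width, height, xmin, ymin, xmax, ymax) are assumptions about how the table was laid out.

```python
import pandas as pd

IMG_SIZE = 416  # every image is resized to 416 x 416

def to_ratios(df):
    """Add center/size columns expressed as fractions of the image size."""
    out = df.copy()
    # Because everything is divided by the original width/height, the same
    # fractions also apply to the 416 x 416 resized image.
    out['x_r'] = (df['xmin'] + df['xmax']) / 2.0 / df['width']
    out['y_r'] = (df['ymin'] + df['ymax']) / 2.0 / df['height']
    out['w_r'] = (df['xmax'] - df['xmin']) / df['width']
    out['h_r'] = (df['ymax'] - df['ymin']) / df['height']
    return out
```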

In the following illustration, there are 5 objects in an image (000005.jpg) with their respective center coordinates (x_r, y_r) and sizes (w_r, h_r).

Once we have found location and sizes for each object in all images. The next task is to decide the anchor boxes. For sake of simplicity, we will use just 2 anchor boxes, we will consider the scenario where the images are either wide or tall. Finally, we will have to decide about the number of grids each image will have. For this exercise, we will keep 3 x 3 or 9 cells per image. Therefore each image will have final 3 x 3 x 2 = 18 target vectors or ground truth (2 vectors for each grid cell) and the output of our network will also contain 3 x 3 x 2 = 18 vectors for each image. Next, we need to create a dictionary defining image category as integers and update our table. The table below contains all the information needed to compile ground truth (targets) for each image.
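
One way this bookkeeping might look; the VOC class list is standard, but the helper names and the exact cell/anchor conventions are my own, chosen to match the figure below (cells 0-8 left to right then down, anchor 0 for tall, 1 for wide).

```python
# Map the 20 VOC class names to integer ids and assign each object to a cell/anchor.
CLASSES = ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car',
           'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike',
           'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']
CLASS_TO_ID = {name: i for i, name in enumerate(CLASSES)}

GRID = 3  # 3 x 3 cells per image

def grid_index(x_r, y_r):
    """Cell number 0-8, counted left to right, then down."""
    col = min(int(x_r * GRID), GRID - 1)
    row = min(int(y_r * GRID), GRID - 1)
    return row * GRID + col

def anchor_index(w_r, h_r):
    """0 for tall objects, 1 for wide ones."""
    return 0 if h_r >= w_r else 1
```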

We can visualize the transformed sizes on the resized images to verify that all the numbers are indeed correct.

Images are rescaled to 416 x 416. Object names are printed at the center of each box; the number above the label is the anchor orientation (0 → tall, 1 → wide) and the integer below the label is the grid address (note that there are 9 grid cells, numbered 0 to 8, going left to right and then down, starting at the top left).
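
A rough sketch of how such a verification plot could be drawn with matplotlib and PIL; the per-image `rows` table and its column names are the same assumptions as above.

```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image

def show_boxes(image_path, rows, img_size=416):
    """Draw the recovered boxes and class names on the resized image."""
    img = Image.open(image_path).resize((img_size, img_size))
    fig, ax = plt.subplots()
    ax.imshow(img)
    for _, r in rows.iterrows():
        # Convert center/size ratios back to top-left pixel coordinates.
        x = (r['x_r'] - r['w_r'] / 2) * img_size
        y = (r['y_r'] - r['h_r'] / 2) * img_size
        ax.add_patch(patches.Rectangle((x, y), r['w_r'] * img_size,
                                       r['h_r'] * img_size,
                                       fill=False, edgecolor='red'))
        ax.text(r['x_r'] * img_size, r['y_r'] * img_size, r['name'], color='yellow')
    plt.show()
```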

We are done with preparing the inputs and targets. The next task is to organize this information into tensors: when a grid cell contains no object, all of its values are filled with 0, and whenever a cell is marked as having no object (confidence 0), the rest of its entries are ignored and not used for optimization during training. The code below helps to do that.
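
A minimal sketch of building one such target tensor, assuming `rows` is the per-image slice of the table above with the columns introduced earlier plus precomputed grid and anchor indices; the slot ordering (cell-major, then anchor) is an assumption.

```python
import numpy as np

def build_target(rows, grid=3, anchors=2):
    """One 6-value vector per (cell, anchor): [objectness, x, y, w, h, class id]."""
    target = np.zeros((grid * grid * anchors, 6), dtype=np.float32)
    for _, r in rows.iterrows():
        slot = int(r['grid']) * anchors + int(r['anchor'])
        target[slot] = [1.0, r['x_r'], r['y_r'], r['w_r'], r['h_r'],
                        CLASS_TO_ID[r['name']]]
    # Slots without an object stay all zeros and are skipped by the loss.
    return target
```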

Finally, we have a tensor for each image. It has 18 stacked vectors, 2 for each grid cell (one for each of the 2 possible anchors, tall or wide), as seen below.

A tensor representing the target values for one image. The first column signifies the presence or absence of an object in that particular slot. The center coordinates (2nd and 3rd columns), width and height (4th and 5th columns), and object category (6th column) are relevant only when an object is present in the corresponding grid/anchor position.

YOLO Architecture

While the original YOLO uses Darknet, we will use a Darknet-style simple residual network to train and predict. As illustrated below, the network accepts a (resized) image and produces a grid-sized output (with some arbitrary number of channels). Fully connected layers at the end are used to extract vectors of the appropriate length. For example, our network produces a 3 x 3 grid, each cell has 2 anchors, and each anchor holds 25 pieces of information, so the output of the fully connected layer has 9 x 2 x (5 + 20) = 450 values. Note that each object category is represented as a one-hot vector with 20 dimensions.

The image below illustrates how selected vectors in the target tensor/matrix represent the information (objects and their bounding boxes) in each image. The object category is unpacked as a 20-dimensional one-hot vector. All parameters in the network are optimized by regressing the corresponding pairs.

The resized image is fed into the convolutional neural network, followed by a fully connected network. Sigmoid and softmax activations at the end are used to turn the raw outputs into probability-like confidence scores and class probabilities; the whole set of scores then goes through the regression loss while the parameters are optimized. For a detailed explanation of the loss function refer to this and this article.
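
A hedged PyTorch sketch of that activation step: reshape the 450-value output into 18 vectors of length 25, squash the confidence and box values with a sigmoid, and run a softmax over the 20 class scores. The function name is my own.

```python
import torch

def activate(raw, grid=3, anchors=2, num_classes=20):
    """raw: (batch, 450) network output -> (batch, 18, 25) activated scores."""
    out = raw.view(-1, grid * grid * anchors, 5 + num_classes)
    conf_xywh = torch.sigmoid(out[..., :5])        # confidence, x, y, w, h in [0, 1]
    classes = torch.softmax(out[..., 5:], dim=-1)  # class probabilities
    return torch.cat([conf_xywh, classes], dim=-1)
```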

Training and Testing

It is much better to use transfer learning than to train from scratch, but for this example we will quickly build a small network to do the job. It is basically a residual network with a few layers. The conv-net accepts a batch of images of size 416 x 416 and has 5 stages; the first stage is a bit different in that it uses a much larger kernel. Over the stages the number of channels grows (3 → 64 → 256 → 512 → 1024) and, through strided kernels and pooling, the spatial size shrinks to the desired 3 x 3. The final fully connected layer outputs 450 values, which are reshaped to 9 x 2 x 25 (18 vectors of size 25 per image).

The picture below illustrates the whole equation; the loss function is simply the sum of all the individual pieces. If necessary, click the source link of the image for a detailed explanation.

source: https://towardsdatascience.com/yolov1-you-only-look-once-object-detection-e1f3ffec8a89
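
Since, as noted earlier, implementations differ slightly in their loss functions, here is one possible rendering of the idea as plain MSE terms in PyTorch; the lambda values and the exact treatment of width/height are assumptions, not the article's original code. Targets are assumed to be in the same 25-value layout as the activated predictions, with one-hot class vectors.

```python
import torch

def yolo_loss(pred, target, l_coord=5.0, l_noobj=0.5):
    obj = target[..., 0:1]   # 1 where a box is assigned to this slot, else 0
    noobj = 1.0 - obj
    # Localization: centers plus square roots of width and height.
    xy_loss = (obj * (pred[..., 1:3] - target[..., 1:3]) ** 2).sum()
    wh_loss = (obj * (pred[..., 3:5].clamp(min=1e-6).sqrt()
                      - target[..., 3:5].clamp(min=1e-6).sqrt()) ** 2).sum()
    # Confidence: penalized everywhere, down-weighted where there is no object.
    conf_loss = (obj * (pred[..., 0:1] - target[..., 0:1]) ** 2).sum() \
              + l_noobj * (noobj * pred[..., 0:1] ** 2).sum()
    # Classification: only where an object is present.
    cls_loss = (obj * (pred[..., 5:] - target[..., 5:]) ** 2).sum()
    return l_coord * (xy_loss + wh_loss) + conf_loss + cls_loss
```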

Below is the code for building the network, defining the optimizer, and batching.
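
A hedged PyTorch sketch of such a small residual backbone; the exact kernel sizes, strides, and residual wiring of the original are guesses that merely follow the description above (3 → 64 → 256 → 512 → 1024 channels, large first kernel, 3 x 3 output, 450-value head).

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two 3x3 convs with a skip connection (channel count unchanged)."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.LeakyReLU(0.1),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
    def forward(self, x):
        return torch.relu(x + self.conv(x))

class TinyYolo(nn.Module):
    def __init__(self, grid=3, anchors=2, num_classes=20):
        super().__init__()
        def stage(c_in, c_out):
            return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                                 nn.BatchNorm2d(c_out), nn.LeakyReLU(0.1),
                                 ResBlock(c_out), nn.MaxPool2d(2))
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.LeakyReLU(0.1),  # 416 -> 208
            nn.MaxPool2d(2),                                              # -> 104
            stage(64, 256),                                               # -> 52
            stage(256, 512),                                              # -> 26
            stage(512, 1024),                                             # -> 13
            nn.AdaptiveAvgPool2d(grid))                                   # -> 3 x 3
        self.head = nn.Linear(1024 * grid * grid,
                              grid * grid * anchors * (5 + num_classes))
    def forward(self, x):
        x = self.features(x).flatten(1)
        return self.head(x)  # (batch, 450) for a 3x3 grid with 2 anchors

model = TinyYolo()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Batching: wrap (image, target) pairs in a standard DataLoader, e.g.
# loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)
```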

For training, we need to batch the images along with their matching targets. IOU (Intersection over Union) is a key concept for the confidence score: it is simply the degree of overlap between a predicted bounding box and the actual one (ground truth). IOUs help improve the quality and accuracy of the bounding boxes, particularly when the model outputs high confidence. Note that IOUs are used again to filter the predicted boxes on test images. Here I compute these two IOUs a bit differently: the computation for test images is more stringent and also takes the exact location of each box into account to find the exact overlap. The image below illustrates IOUs.

Getting IOUs
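
A common corner-based IoU helper, included here as a hedged stand-in for the snippet above; boxes are assumed to be (x_center, y_center, w, h) on the same 0-1 scale as the targets.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_center, y_center, w, h) boxes."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0
```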

Finally train:
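
A bare-bones training loop under the assumptions above (the `model`, `optimizer`, `activate`, `yolo_loss`, and a `loader` yielding image/target batches are the hypothetical pieces sketched earlier).

```python
for epoch in range(50):
    running = 0.0
    for images, targets in loader:
        optimizer.zero_grad()
        pred = activate(model(images))   # forward pass + sigmoid/softmax
        loss = yolo_loss(pred, targets)  # regression over all 18 vectors per image
        loss.backward()
        optimizer.step()
        running += loss.item()
    print(f'epoch {epoch}: loss {running / len(loader):.4f}')
```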

Testing and using YOLO

Once training is done, we can use the model to get the predicted tensor, from which we extract each vector's confidence level, box, and object category. The steps to get the predicted boxes are: post-process the raw outputs by applying the sigmoid and softmax functions to get scores in the appropriate format (as used in the loss function), discard boxes with low confidence, apply Non-Max Suppression, and finally convert the ratios back to pixel coordinates to draw on top of the image.

Non-Max Suppression (NMS) is the key concept for filtering out overcrowded boxes. Redundant boxes appear when neighboring grid cells wrongly detect the same object and each predict a separate box. The steps used for NMS, implemented in the sketch after the caption below, are:

  1. Group the predicted objects according to class or category.
  2. If an object occurs only once, no filtering is required.
  3. For the remaining groups, sort by confidence level and start filtering, comparing against the box with the highest confidence.
  4. Find the overlap/IOU with every other box in the same category; if there is significant overlap (for example > 0.4), discard those predictions/boxes.
  5. Repeat the process until none of the remaining predictions have significant overlap.
Before and after NMS. Note: the network is still training and the boxes are a bit off.

Code for putting YOLO to use.
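
A hedged sketch of that post-processing, following the NMS steps listed above and reusing the `iou()` helper from earlier; the detection tuple format and thresholds are my own choices.

```python
def non_max_suppression(detections, conf_thresh=0.5, iou_thresh=0.4):
    """detections: list of (confidence, box, class_id) with box = (x, y, w, h) ratios."""
    # 1. Drop low-confidence boxes, then group by class.
    detections = [d for d in detections if d[0] >= conf_thresh]
    kept = []
    for cls in set(d[2] for d in detections):
        group = sorted((d for d in detections if d[2] == cls),
                       key=lambda d: d[0], reverse=True)
        # 2-5. Keep the most confident box, discard overlapping ones, repeat.
        while group:
            best = group.pop(0)
            kept.append(best)
            group = [d for d in group if iou(best[1], d[1]) < iou_thresh]
    return kept

# The surviving boxes are scaled back from 0-1 ratios to 416 x 416 pixel
# coordinates before being drawn on top of the image.
```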

Munesh Lakhey is an AI freelancer interested in AI for Finance & Health Care.