Object Detection State of the Art: YOLO-V3

Sezaz Qureshi · Published in Analytics Vidhya · Jul 18, 2021 · 8 min read

Problem Formulation:

Sometimes we want to find out what the contents of an image are so that we can use this information for various purposes. For example, if we know which objects are in an image, their extent, and their exact locations, we can use that for multiple tasks, such as classifying the different types of objects in a single image; this comes in very handy when you are designing a self-driving car. In recent years we have achieved much better accuracy on such complex tasks using deep learning models. In this blog, we are going to look at one such model, YOLO (You Only Look Once), and in particular its 3rd version.

The task that object detection tries to solve is detecting multiple objects in a single image and localizing them by drawing rectangles over them, called bounding boxes. A single image can contain multiple objects, and hence multiple bounding boxes. Have a look at the image below for a better understanding.

Object detection. Source: Google image search.

So the task our model solves is: given an input image, it should return the bounding box for each object, which is detection and localization, and for each box the corresponding class label, which is classification. The set of class labels differs from problem to problem.

YOLO-V3 model details:

Now let's look inside the YOLO-V3 model. The model consists of many parts, but first let's talk about the backbone network, also called the feature extractor, which is used to extract features important for object localization and classification. It is a fully convolutional network (FCN), which means it has no dense layers and no max-pooling layers. Earlier versions of this model used VGG and ResNet as the backbone, but in YOLO-V3 they use an FCN called DarkNet-53, a 53-layer fully convolutional network; below is an image of the same.

YOLO-V3 architecture

The typical input image size fed into the model is 416 x 416 x 3.

Convolution Block:

convolution block

A convolution block here means a convolution operation with the number of filters mentioned beside each block; the kernel size is also specified. If you see '3 x 3 / 2' in the kernel-size column, it means the stride is 2: because we don't use max-pooling operations here, we rely on strided convolutions to reduce the spatial size. You will also notice in the output column that the image size has been reduced by a factor of 2.

After the convolution operation, the result is passed to a BatchNormalization layer followed by a LeakyReLU activation.
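As a minimal sketch of this block, assuming PyTorch (the class name `DarknetConv` and its arguments are illustrative, not taken from the paper or the original DarkNet code):

```python
import torch.nn as nn

class DarknetConv(nn.Module):
    """Convolution -> BatchNorm -> LeakyReLU; stride 2 is used instead of max-pooling."""
    def __init__(self, in_ch, out_ch, kernel_size, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size, stride=stride,
                      padding=kernel_size // 2, bias=False),  # BatchNorm makes the bias redundant
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```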

Residual Block:

block

The residual block idea is derived from ResNet, which uses shortcut connections that let the signal and its gradients skip over convolution operations that don't add much information. This keeps the flow of gradients smooth in a deep network and lets the network ignore non-essential transformations.

In DarkNet, the residual block works as follows. As you can see, there are blocks nested inside a bigger block; let's call it a mega-block. Inside this mega-block there are two convolution blocks with kernel sizes 1 and 3 and 32 and 64 filters respectively, followed by a residual connection.

So, from the figure above, the input entering our mega-block has shape 128*128*64. It is passed through the two convolution blocks, after which the tensor shape is again 128*128*64. What the residual connection does is simply add the input that was received by the mega-block (128*128*64) to the output of the second convolution block, which has the same shape (128*128*64). So it is adding a shortcut connection.

You might have noticed '1x, 2x, 4x, 8x' written outside the mega-blocks. This means the whole mega-block is repeated that many times. Because the input and output tensor shapes are identical, the block can be repeated without any shape mismatch.
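A minimal sketch of such a residual block, again in PyTorch and reusing the `DarknetConv` helper from the previous snippet (the name `DarknetResidual` is illustrative):

```python
class DarknetResidual(nn.Module):
    """A 1x1 conv halves the channels, a 3x3 conv restores them, then a shortcut add."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = DarknetConv(channels, channels // 2, kernel_size=1)  # e.g. 64 -> 32
        self.conv2 = DarknetConv(channels // 2, channels, kernel_size=3)  # e.g. 32 -> 64

    def forward(self, x):
        return x + self.conv2(self.conv1(x))  # shortcut: add the block's input back
```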

So once an image is passed into the model, its spatial size is shrunk by a factor of 32, because a convolution block with stride 2 is applied five times, giving 2⁵ = 32. If the input image shape is 416*416*3, the extracted feature map has shape 13*13*1024 (1024 is the number of filters in the last mega-block).
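To check the 2⁵ = 32 arithmetic, here is a rough sketch of the DarkNet-53 body built from the two helper modules defined above (PyTorch assumed; `downsample_stage` is an illustrative helper, and the neck and prediction heads are omitted):

```python
import torch
import torch.nn as nn

def downsample_stage(in_ch, out_ch, num_repeats):
    """One DarkNet stage: a stride-2 convolution followed by repeated residual blocks."""
    layers = [DarknetConv(in_ch, out_ch, kernel_size=3, stride=2)]
    layers += [DarknetResidual(out_ch) for _ in range(num_repeats)]
    return nn.Sequential(*layers)

# Five stride-2 stages, so the spatial size shrinks by 2**5 = 32.
backbone = nn.Sequential(
    DarknetConv(3, 32, kernel_size=3),
    downsample_stage(32, 64, 1),
    downsample_stage(64, 128, 2),
    downsample_stage(128, 256, 8),
    downsample_stage(256, 512, 8),
    downsample_stage(512, 1024, 4),
)

x = torch.randn(1, 3, 416, 416)
print(backbone(x).shape)  # torch.Size([1, 1024, 13, 13])
```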

So now we have successfully extracted features with the backbone network; let's move ahead.

3-scale output:

This part combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections.

As you can see in the architecture, three outputs go to the prediction heads from three different mega-blocks: scale 1 with tensor shape 52*52*256, scale 2 with shape 26*26*512, and scale 3 with shape 13*13*1024. This idea is taken from Feature Pyramid Networks (FPN), which is itself another approach to object detection.
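A rough sketch of the top-down pathway in PyTorch; the actual YOLO-V3 neck also inserts extra convolution blocks and 1x1 channel-reduction layers between the steps, which are left out here for brevity:

```python
import torch
import torch.nn as nn

# Feature maps from the three backbone scales for one 416 x 416 image (dummy values).
scale1 = torch.randn(1, 256, 52, 52)
scale2 = torch.randn(1, 512, 26, 26)
scale3 = torch.randn(1, 1024, 13, 13)

upsample = nn.Upsample(scale_factor=2, mode="nearest")

# Top-down pathway: upsample the coarser map and concatenate it with the next
# finer one (the lateral connection).
p3 = scale3                                    # 13 x 13 prediction branch
p2 = torch.cat([upsample(p3), scale2], dim=1)  # 26 x 26 prediction branch
p1 = torch.cat([upsample(p2), scale1], dim=1)  # 52 x 52 prediction branch
print(p3.shape, p2.shape, p1.shape)
```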

Model output formulation:

So far we have seen how the image is passed through the network; now let's see what the output of the model should look like.

So far we have down-sampled our image of shape 416*416 into 52*52, 26*26, and 13*13 feature maps, but for ease of understanding let's just consider the 13*13*1024 output feature block. Have a look at the image below.

Yolo output

For prediction purposes, we first divide our image into a 13 x 13 grid. In our case, with a 416*416 input image, each block of the grid (marked in red) covers a region of 416/13 = 32, i.e. 32*32 pixels, and there are 13*13 such blocks.

Each grid cell predicts 3 anchor boxes, as per the paper, and for each anchor box there are 85 values. Let me break this down.

Let's say you are training on the MS-COCO dataset for object detection; we know it has 80 classes. So this number 85 breaks down into 4 bounding box offsets + 1 objectness score + 80 class probabilities = 85.
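To make the layout concrete, here is a small sketch (PyTorch assumed, and the reshape convention is illustrative) that splits a raw 13 x 13 prediction into those pieces:

```python
import torch

num_classes = 80
num_anchors = 3

# Raw head output for one image at the 13 x 13 scale: 3 * (4 + 1 + 80) = 255 channels.
raw = torch.randn(1, num_anchors * (5 + num_classes), 13, 13)

# Reshape to (batch, anchors, 5 + classes, H, W) so each anchor's 85 values line up.
pred = raw.view(1, num_anchors, 5 + num_classes, 13, 13)

box_offsets = pred[:, :, 0:4]   # tx, ty, tw, th
objectness  = pred[:, :, 4:5]   # objectness score (before the sigmoid)
class_probs = pred[:, :, 5:]    # 80 class scores (before the sigmoid)
print(box_offsets.shape, objectness.shape, class_probs.shape)
```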

Anchor Boxes:

Anchor boxes are a set of predefined bounding boxes of a certain height and width. This idea is taken from YOLO-V2, where k-means clustering was run on the dimensions of the ground-truth bounding boxes in the MS-COCO dataset to get good priors for the model. The image below shows the average IOU obtained with various choices of k; for YOLO-V2 they found that k = 5 gives a good trade-off between recall and model complexity, while YOLO-V3 uses 9 clusters, 3 per scale.

anchor boxes.
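A rough sketch of how such anchor priors could be computed, assuming NumPy; the distance metric 1 − IOU follows the idea described in the YOLO papers, but the helper names and the synthetic box sizes below are purely illustrative:

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between (width, height) pairs, as if all boxes shared the same top-left corner."""
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """k-means on box dimensions with distance = 1 - IOU."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)  # nearest centroid = highest IOU
        centroids = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i) else centroids[i]
                              for i in range(k)])
    return centroids

# Synthetic (width, height) pairs of ground-truth boxes, in pixels.
boxes = np.abs(np.random.default_rng(1).normal(100, 40, size=(500, 2)))
print(kmeans_anchors(boxes, k=9))
```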

bounding box offsets:

tx and ty: x and y coordinates of the offset of the bounding box for this cell.

tw and th: width and height offsets of the bounding box for this cell.

These are the predictions for each cell. Remember that these are not the actual bounding box coordinates; they are just offsets plus a width and a height term, and we will calculate the actual bounding box from them.

Objectness Score:

It is not possible for every anchor box in the 13*13 grid to contain an object, so to capture this information we have a binary label that is 1 if an object is present in the grid cell and 0 otherwise.

Class Probabilities:

Now that we have detected an object, we also want to know its type, e.g., in MS-COCO: cat, dog, bird, etc. So if our dataset has 80 different classes, there will be 80 class probability values, one for each class.

Bounding Box Representation:

bounding box representation.
  • Here bx, by are the x, y center coordinates. We pass our tx, ty values through the sigmoid function to squash them between 0 and 1 and then add cx, cy, the top-left coordinates of the grid cell. We use the sigmoid because, if the prediction went above 1, the center of the bounding box could drift into another grid cell, which would break the assumption behind YOLO: if we postulate that the red cell is responsible for predicting the object, the center of that object must lie inside that cell and nowhere else. So we use a sigmoid to keep the center inside the cell.
  • bw, bh are the width and height of our bounding box. If the predicted bw and bh for the box containing the object are (0.3, 0.8), then the actual width and height on the 13 x 13 feature map are (13∗0.3, 13∗0.8).
  • tx, ty, tw, th are what we got from our prediction.
  • cx and cy are the top-left coordinates of the grid cell.
  • pw and ph are the anchor dimensions for the box. These pre-defined anchors are obtained by running k-means clustering on the dataset, as described above.
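Putting these bullet points together, a minimal decoding sketch in PyTorch, following the figure above (bx = σ(tx) + cx, by = σ(ty) + cy, bw = pw·e^tw, bh = ph·e^th); the concrete numbers are hypothetical:

```python
import torch

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Turn raw offsets into a box center and size on the feature-map grid."""
    bx = torch.sigmoid(tx) + cx   # sigmoid keeps the center inside its grid cell
    by = torch.sigmoid(ty) + cy
    bw = pw * torch.exp(tw)       # scale the anchor width by exp(tw)
    bh = ph * torch.exp(th)       # scale the anchor height by exp(th)
    return bx, by, bw, bh

# One prediction in the cell whose top-left corner is (cx, cy) = (6, 7),
# using an anchor of (pw, ph) = (3.6, 2.4) grid units.
print(decode_box(torch.tensor(0.2), torch.tensor(-0.1),
                 torch.tensor(0.5), torch.tensor(0.3),
                 6.0, 7.0, 3.6, 2.4))
```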

Multiscale output:

In the explanation above we only looked at the prediction for the 13*13 feature map; the same concepts apply to the 26*26 and 52*52 feature outputs. By applying the transformation at different grid resolutions we detect objects of different sizes: the coarse 13*13 grid detects large objects, the 26*26 grid detects medium-sized objects, and the fine 52*52 grid detects small objects. By concatenating the outputs we get the results of all three scales combined on a single image.

At each scale, each grid cell predicts 3 bounding boxes using 3 predefined anchors, making the total number of anchors used 9. (The anchors are different for different scales).

Non-maximum Suppression

For a single input image the model predicts (52*52 + 26*26 + 13*13)*3 = 10647 boxes, and that is a lot. To tackle this we first drop the boxes whose class confidence is below a threshold, typically 0.5, and then use the Non-maximum Suppression technique for the multiple bounding boxes surrounding the same object. We compute the IOU (intersection over union) score, which measures how much two boxes overlap, between the remaining predicted boxes; for each object we keep the highest-scoring box and suppress the other predicted boxes that overlap it too much.
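A minimal greedy NMS sketch in PyTorch (boxes are in corner format (x1, y1, x2, y2); the boxes and scores below are hypothetical):

```python
import torch

def iou(box, boxes):
    """IOU of one box against many; boxes are (x1, y1, x2, y2)."""
    x1 = torch.maximum(box[0], boxes[:, 0])
    y1 = torch.maximum(box[1], boxes[:, 1])
    x2 = torch.minimum(box[2], boxes[:, 2])
    y2 = torch.minimum(box[3], boxes[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the best box, drop overlapping ones."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        best = order[0]
        keep.append(best.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]
    return keep

# Two overlapping boxes around the same object plus one far away.
boxes = torch.tensor([[10., 10., 60., 60.], [12., 12., 58., 62.], [100., 100., 140., 150.]])
scores = torch.tensor([0.9, 0.75, 0.8])
print(nms(boxes, scores))  # [0, 2] -- the weaker overlapping box is suppressed
```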

So, our required output shape at this scale is 13*13*255, i.e. 3*(80+4+1), where 3 is the number of anchor boxes per cell, but we have a feature map of shape 13*13*1024. To convert the features to our desired output we simply use a 1 x 1 convolution layer to make this transformation.

13*13*1024 (feature map) —→ (1x1 Conv, 255 filters) —→ 13*13*255 (output)
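A one-line sketch of that head, assuming PyTorch:

```python
import torch
import torch.nn as nn

# 1x1 convolution mapping the 1024-channel feature map to 3 * (4 + 1 + 80) = 255 channels.
head = nn.Conv2d(1024, 3 * (4 + 1 + 80), kernel_size=1)

features = torch.randn(1, 1024, 13, 13)  # backbone output for one image
print(head(features).shape)              # torch.Size([1, 255, 13, 13])
```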

Now our model is complete, and we can train it.

Thank You
