YOLOv3 — Real-time object detection

Karlijn Alderliesten · Published in Analytics Vidhya · May 28, 2020 · 10 min read

A general outline of the YOLOv3 approach to real-time object detection, explained with a quick dive into convolutional neural networks. To keep things comprehensible I left out some details and additional steps. This piece assumes some basic knowledge of neural networks.


YOLOv3 is the most recent variation of the You Only Look Once (YOLO) approach. This family of models is popular for real-time object detection and was introduced in 2015 in the paper “You Only Look Once: Unified, Real-Time Object Detection” by Joseph Redmon et al. Before diving into the YOLOv3 method, let’s first explore the concepts of image classification and object localization.

One of the most well-known computer vision tasks is image classification, which aims to assign each image in a dataset to one of two or more categories. For example, an image classification algorithm may be designed to distinguish images containing dogs from ones containing cats. Basically, it answers the question “What is the main object in this picture?”

If you want not only to detect the presence of an object, but also to locate where this object is within the picture, you can use a method called object localization. This changes our question to “What is the main object in this picture and where is it?”.

Now, the most complex, and interesting, process of the three is object detection (leaving out object segmentation). In most real-life scenarios, we want our model to go beyond recognizing and locating a single object and detect multiple objects in the same image. Object detection does just this and draws a so-called bounding box around each individual object. The question ends up being “What are the objects in this picture and where are they?”. A clear example where object detection is used is self-driving cars. It can find other vehicles, traffic lights, pedestrians, signs and other entities that influence driving behaviour. Because detection happens as the scene unfolds, it is labelled real-time object detection.

Differences between classification, object localization and object detection

How does this work? All of the aforementioned approaches incorporate a convolutional neural network (CNN), a currently popular deep learning architecture. CNNs mimic some features of the visual cortex, a region in the brain that receives, integrates and processes visual information. By describing how a CNN works, YOLOv3 is partly explained.

Convolutional Neural Network

As humans, when we look at a picture of a cat, we understand that it is a cat because of its characteristics, like the shape of its whiskers and fur. A computer only recognizes low-level features like dots, curves, bright spots and texture. So what a CNN does is match parts of the image instead of the entire picture. It reduces the image into a form that is easier to process, without losing the features that are critical for a good class prediction.

Rather than an image, a computer sees a tensor with shape (image height, image width, colour channels), for example a 4 x 4 x 3 array of numbers. Each number takes a value from 0 to 255, describing the intensity of that pixel. This array of numbers (the image) passes through a series of layers to eventually end up with a predicted class label.

Representation of how the computer sees an image
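As a quick sketch, such a tiny 4 x 4 RGB image could be represented in NumPy like this:

```python
import numpy as np

# A tiny 4 x 4 RGB image as the computer "sees" it: a
# (height, width, channels) array of pixel intensities in [0, 255].
image = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)

print(image.shape)   # (4, 4, 3)
print(image[0, 0])   # the three colour values of the top-left pixel
```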

The image below represents a general CNN pipeline. It might look a little baffling at this moment, but everything will be explained.

Convolutional Neural Network architecture

Convolutional layer

The first layer is called the convolutional layer. In the convolutional layer a filter (also called a kernel) is applied to the input. A filter, which is also an array of numbers, projects itself onto a region of the input (the receptive field) and slides across all areas of the input image. One step or slide of the filter is called a stride, which varies per CNN. Importantly, the depth of the filter has to be equal to the depth of the input. At each position the filter’s values are multiplied element-wise with the underlying input values and summed to a single number, which is placed in the corresponding unit of the two-dimensional output array, a “feature map”. Technically, because the filter is not flipped before being applied (as it would be in a mathematical convolution), the operation described above is actually a cross-correlation.

Convolutional layer process
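To make the sliding-and-summing concrete, here is a minimal single-channel sketch in NumPy (no padding, and using a hypothetical vertical-edge filter purely for illustration):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide a kernel over a single-channel image and sum the element-wise
    products at every position (a cross-correlation, which is what deep
    learning frameworks call "convolution")."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# A simple vertical-edge filter applied to a toy 5 x 5 image.
image = np.array([[0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 1]], dtype=float)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)
print(conv2d(image, kernel))   # strong responses along the dark-to-bright edge
```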

This process is comparable to identifying the previously mentioned curves and colours in a picture of a cat. With different filters, operations such as edge detection, blurring and sharpening can be performed. If a large number of pixels is needed to differentiate objects in an image recognition problem, use a large filter. If smaller, local features are important for recognizing objects, use a smaller one. To eventually recognize higher-level features like a cat’s nose and ears, not just one layer but a whole network is needed.

(To keep things simple I left out the concept of padding. Feel free to look it up when interested.)

Non-linear activation function

Right after the convolution has taken place, each value of the feature map is passed through a non-linear activation function. As the name already gives away, it introduces a non-linear property, which allows you to model a response variable (the class label) that varies non-linearly with its explanatory variables. In other words, it enables complex mappings between the input and output layers of the network. The choice of activation function affects the model’s final outcome, accuracy and computational efficiency. The activation is basically a mathematical “gate” that determines whether each neuron’s input is relevant for the model’s prediction: if it meets a certain threshold it passes through the gate, and if not it is disregarded. The function brings the relevant features of the image into focus. The most commonly used non-linear activation function at the moment is the Rectified Linear Unit (ReLU).

Graphical representation of ReLU

It is applied per pixel and replaces all negative pixel values in the feature map with zero: f(x) = max(0, x).
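In code, ReLU is a one-liner:

```python
import numpy as np

feature_map = np.array([[-3.0, 1.5],
                        [ 0.2, -0.7]])
# ReLU: every negative value becomes zero, positive values pass unchanged.
relu = np.maximum(0, feature_map)
print(relu)   # [[0.  1.5]
              #  [0.2 0. ]]
```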

Pooling layer

After the feature map is adjusted, it is passed on to the pooling layer, or downsampling layer. The pooling layer reduces the spatial size of the representation, which in turn cuts down the number of parameters and the amount of computation needed to process the data in the network. Only the most dominant features are kept, which keeps training the model effective. Besides that, the reduced number of parameters helps keep the model from overfitting. Overfitting takes place when a model performs relatively well on the training set but poorly on the test set during evaluation.

There are several pooling options, with the two most popular being max pooling and average pooling. In the pooling layer a filter (normally of size 2x2) with a stride of two is applied to the input feature map, producing an output with a single number for every subregion the filter slides over. In essence, it summarizes the features detected in the input. For max pooling this number is the maximum value in each subregion.

Process of max pooling
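A minimal sketch of max pooling with a 2x2 filter and a stride of two could look like this:

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Keep only the maximum value of every size x size subregion."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = feature_map[i * stride:i * stride + size,
                                j * stride:j * stride + size]
            pooled[i, j] = patch.max()
    return pooled

feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 0],
                        [1, 2, 9, 8],
                        [0, 3, 4, 7]], dtype=float)
print(max_pool2d(feature_map))   # [[6. 5.]
                                 #  [3. 9.]]
```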

Together, the convolutional layer, the non-linear activation function and the pooling layer extract useful features from an image, introduce non-linearity and reduce the feature dimensions. These layers are the elements of the hidden layers displayed in the figure of the CNN architecture and are repeated multiple times in the network. The aim of the hidden layers is to build up features that are increasingly robust to scale and translation.

The first layers pick up combinations of low-level features, such as lines and the brightness of colours. In the following layers, features of increasingly higher order surface, and this continues throughout the depth of the network.

Fully connected layer

After the hidden layers, the model is able to understand the features and we can start the classification process. The output of the last pooling layer is flattened into one long vector and passed through a fully connected layer, which is a feed-forward neural network (with backpropagation applied at every training iteration). Each of the values in this vector represents a probability that a certain feature belongs to one of the classes. Coming back to the example of the cat, the features representing a cat’s nose or ears should have high probabilities of being classified as “cat”. The final layer then uses the softmax activation function to obtain the ultimate probabilities of the image fitting into a particular class. It normalizes the CNN’s outputs to values between zero and one that together represent a probability distribution.
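As a rough sketch (with random, untrained weights purely for illustration), flattening, a fully connected layer and softmax could look like this:

```python
import numpy as np

def softmax(logits):
    """Normalize raw scores into a probability distribution."""
    e = np.exp(logits - logits.max())   # subtract max for numerical stability
    return e / e.sum()

# Pretend the pooling layers produced a 2 x 2 x 8 volume of features.
features = np.random.rand(2, 2, 8)
flat = features.flatten()               # one long vector of length 32

# A single fully connected layer mapping the vector to 3 class scores
# (weights here are random; in practice they are learned by backpropagation).
weights = np.random.randn(flat.size, 3)
bias = np.zeros(3)
logits = flat @ weights + bias

probabilities = softmax(logits)
print(probabilities, probabilities.sum())   # three class probabilities summing to 1
```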

Hyperparameters

To date there is no real viable theory for optimizing hyperparameters. The number of convolutional layers to use, for example, depends on the type of data you have. The same goes for the filter size and stride. It all depends on the processing task and your input data, which varies in size, image complexity, quality and so on. There are a few hyperparameter optimization toolsets available, but many are not yet mature enough for large-scale deep learning scenarios. One approach to choosing hyperparameters for your case is to find a combination that creates abstractions of the image at a proper scale.

In the end, you want your model to accurately predict object classes for new, incoming data. One way to reduce overfitting, where the weights of the network end up excessively tuned to the training data, is to use an optional dropout layer. This layer “drops” a percentage of randomly chosen activations by setting them to zero. This may sound like a counterintuitive process, as you lose some information. However, it forces the model to classify a specific example correctly even if some information is lost. This layer is only active in the training phase and is disabled at inference time.
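A minimal sketch of such a dropout layer, assuming the commonly used “inverted dropout” formulation (which rescales the surviving activations so their expected value stays the same):

```python
import numpy as np

def dropout(activations, rate=0.5, training=True):
    """Randomly zero out a fraction of activations during training.
    At inference time the layer does nothing."""
    if not training:
        return activations
    mask = np.random.rand(*activations.shape) >= rate
    return activations * mask / (1.0 - rate)

activations = np.ones((2, 4))
print(dropout(activations, rate=0.5, training=True))   # roughly half become zero
print(dropout(activations, rate=0.5, training=False))  # unchanged at inference
```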

Now that we have a basic understanding of CNNs, we can move on to the YOLO family of models.

YOLO

The YOLO models are end-to-end deep learning models and are well-liked because of their detection speed and accuracy. Additionally, they learn generalizable representations of objects, which is essential when a model is applied in real life. The structure of a YOLO network is similar to a normal CNN: it consists of several convolutional and max pooling layers, ending with two fully connected layers.

Previous methods, like region-based convolutional neural networks (R-CNN), require thousands of network evaluations to make predictions for a single image, which can be time-consuming and painful to optimize. They focus on specific regions of the image and train each individual component of the pipeline separately. A YOLO model, on the other hand, passes the image through the neural network only once (“You Only Look Once”).

The network divides the image into a grid of cells, each of which predicts five bounding boxes and object classifications. Boxes with a low probability of containing an object, and boxes sharing large areas with other boxes, are removed by a process called non-maximum suppression.

Division into a grid, five bounding boxes per cell and the resulting bounding boxes
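A simplified sketch of non-maximum suppression (the score and overlap thresholds here are illustrative, not the ones YOLO uses) could look like this:

```python
import numpy as np

def iou(box, boxes):
    """Intersection over union of one box against an array of boxes.
    Boxes are [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, score_threshold=0.5, iou_threshold=0.5):
    """Drop low-confidence boxes, then greedily keep the highest-scoring
    box and remove the boxes that overlap it too much."""
    keep_mask = scores >= score_threshold
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = scores.argsort()[::-1]
    kept = []
    while order.size > 0:
        best = order[0]
        kept.append(best)
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps < iou_threshold]
    return boxes[kept], scores[kept]

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.75, 0.8])
print(non_max_suppression(boxes, scores))   # the near-duplicate box is suppressed
```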

YOLOv3

The most recent of the main versions is the third iteration of the approach, YOLOv3. Each version improved on the previous one. The initial version proposed the general architecture, after which the second variation improved accuracy significantly while also becoming faster. YOLOv3 refined the design further with tricks such as multi-scale prediction and bounding box prediction through logistic regression. While accuracy increased considerably with this version, it traded off against speed, which dropped from 45 to 30 frames per second.

YOLOv3 uses a variant of Darknet, a framework to train neural networks, which originally has 53 layers. For the detection task another 53 layers are stacked on top, resulting in a 106-layer fully convolutional architecture. This explains the reduction in speed compared with the second version, which has only 30 layers.

YOLOv3 architecture

In the convolutional layers, kernels of shape 1x1 are applied to feature maps of three different sizes at three different places in the network. The algorithm makes predictions at three scales, obtained by downsampling the dimensions of the image by strides of 32, 16 and 8 respectively. Downsampling, the reduction in spatial resolution while keeping the same image representation, is done to reduce the size of the data. Every scale uses three anchor boxes per layer: the three largest boxes for the first scale, three medium ones for the second scale and the three smallest for the last scale. This way each layer specializes in detecting large, medium or small objects.
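Assuming the commonly used 416 x 416 input resolution, the three strides translate into the following grid sizes and numbers of predicted boxes:

```python
# Assuming a 416 x 416 input, each stride produces one detection grid,
# with 3 anchor boxes predicted per grid cell.
input_size = 416
for stride in (32, 16, 8):
    grid = input_size // stride
    print(f"stride {stride:>2}: {grid} x {grid} grid, {grid * grid * 3} predicted boxes")

# stride 32: 13 x 13 grid, 507 predicted boxes
# stride 16: 26 x 26 grid, 2028 predicted boxes
# stride  8: 52 x 52 grid, 8112 predicted boxes
```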

Earlier YOLO versions used the softmax activation function to determine the classes of objects in the bounding boxes. The authors of YOLOv3 refrained from softmaxing the classes, since that method rests on the assumption that classes are mutually exclusive. For example, if the dataset contains classes like “cat” and “animal” and one of the objects in a bounding box is a cat, this assumption fails, because a cat is also an animal. Instead, independent logistic classifiers predict each class score and a threshold is used to perform multilabel classification for the detected objects. Whether an element belongs to a certain class is not influenced by whether it belongs to another class (binary cross-entropy loss).
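A minimal sketch of such independent logistic (sigmoid) classifiers with a threshold, using made-up class names and scores:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Raw class scores for one detected box over the classes
# ["cat", "dog", "animal"] (illustrative labels and values).
logits = np.array([2.3, -1.7, 3.1])

# Independent logistic classifiers: each class gets its own probability,
# so the probabilities do not have to sum to one.
probs = sigmoid(logits)
labels = probs > 0.5          # simple threshold for multilabel prediction
print(probs)                  # approx. [0.91 0.15 0.96]
print(labels)                 # [ True False  True] -> both "cat" and "animal"
```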

YOLOv3 performance on COCO 50 Benchmark

As shown in the graph above, YOLOv3 achieves the best speed-accuracy trade-off on the MS COCO dataset, a large-scale object detection dataset. With a mean average precision (mAP) of 57.9% in 51 ms, YOLOv3 is on par with RetinaNet-101, which reaches 57.5% in 198 ms. YOLOv3 is thus almost four times faster at an equal mAP.

In the video below you can see YOLOv3 in action. Pretty cool, right?

https://www.youtube.com/watch?v=MPU2HistivI&list=PLS8mdS1BdhgmpXgJg7wiehognGht8hxmf&index=257
