Object Detection in 2020 — From RCNNs to YOLOv4

Code Heroku
Jun 28, 2020 · 10 min read

What is Object Detection?

In layman's terms, what exactly is object detection?

Let’s walk through a sample image:

As humans, we understand that the image contains moving cars. But how do we give a computer that same understanding? Pretty fascinating if you think about it.

The problem can be broken down into two sub-problems: object localization and image classification.

The image classification task, if performed successfully, tells us whether the image contains a ‘banana’, an ‘apple’, or both. It assigns a label to the image.

The object localization task, if performed successfully, tells us where the objects are in the image by drawing bounding boxes around them.

Object detection is nothing but a combination of localization and classification.

In other words, it is the problem of finding and classifying a variable number of objects in an image.

So how does one approach the challenge of object detection?

Classical machine learning approaches that could run in real time are, in fact, still used today. One of them is the Viola-Jones framework, an algorithm that uses Haar features for real-time face detection.
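
For a quick feel of that classical approach, here is a minimal sketch of Haar-cascade face detection using the opencv-python package (the image path and output file are hypothetical, not from the original post):

import cv2

# pre-trained frontal-face Haar cascade bundled with opencv-python
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("sample.jpg")                 # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # Haar features are computed on grayscale

# returns a list of (x, y, w, h) boxes around detected faces
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces.jpg", img)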

However, with the development of deep learning approaches, algorithms now have become much more robust in understanding not just spatial, but contextual information as well.

For example, consider a human body. Spatial features would include figuring out what a head looks like, or what a leg looks like. Contextual features would include information such as the head being connected to the neck, or an arm being connected to the torso, and so on.

Existing Network Architectures:

One of the best-performing deep learning approaches is the R-CNN family. It was first proposed in 2013, and the name stands for Region-based Convolutional Neural Networks.

How does R-CNN work?

It's a multi-stage pipeline that outputs a label as well as a bounding box for a variable number of objects.

The first stage can broadly be referred to as an image segmentation task. Its purpose is to find region proposals, or in simple words, the parts of the image that are of interest.

Source: Nvidia blog

The rest of the pipeline then focuses only on the regions output by the first stage. The algorithm used in the paper is as follows:

Selective Search:

1. Generate an initial sub-segmentation, producing many small candidate regions

2. Use a greedy algorithm to recursively combine similar regions into larger ones

3. Use the generated regions to produce the final candidate region proposals
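
To get a feel for what Selective Search produces, OpenCV's contrib modules ship an implementation. A minimal sketch, assuming opencv-contrib-python is installed and sample.jpg is a hypothetical input image:

import cv2

img = cv2.imread("sample.jpg")

# OpenCV's implementation of Selective Search (ximgproc contrib module)
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()   # the faster, lower-recall mode

rects = ss.process()               # array of (x, y, w, h) region proposals
print(len(rects), "region proposals")
for (x, y, w, h) in rects[:100]:   # draw only the first 100 proposals
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 1)
cv2.imwrite("proposals.jpg", img)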

However good its proposals may be, this algorithm is slower than most other deep learning approaches, for two main reasons: the greedy, hand-crafted procedure involves no learning, and the segmented regions are classified one at a time, which requires a lot of computation.

To tackle the cost of classifying objects one at a time, Fast R-CNN was introduced in 2015; it replaced fully connected layers with convolutional ones so that much of the computation could be shared.

How does Fast R-CNN work?

Let's take an example: a normal image classification network architecture would look like this:

Now the same fully connected layers can be converted into convolution layers, which in turn gives us the benefit of shared computation.

[Note that the softmax after the fully connected layer above could also be reproduced as an activation for the convolution layer below, but this isn't done. Instead of binary outputs, a feature map is generated.]

Source: Deep Learning Specialization by Coursera
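
The idea behind the conversion can be sketched in a few lines of PyTorch (illustrative only, not from the original post): a fully connected layer that consumes a 7×7×512 feature map is equivalent to a 7×7 convolution, and once written as a convolution it can slide over a larger feature map, sharing computation across positions:

import torch
import torch.nn as nn

fc = nn.Linear(512 * 7 * 7, 4096)            # dense head trained on 7x7x512 feature maps
conv = nn.Conv2d(512, 4096, kernel_size=7)   # the equivalent convolution

# reuse the same weights, just reshaped
conv.weight.data = fc.weight.data.view(4096, 512, 7, 7)
conv.bias.data = fc.bias.data

x = torch.randn(1, 512, 14, 14)              # a larger feature map than the head was trained on
print(conv(x).shape)                         # torch.Size([1, 4096, 8, 8]): a whole grid of outputs in one pass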

This solution was widely adopted in other areas of computer vision research as well. But even with per-region classification solved, the algorithm is still slow.

Faster R-CNN comes to the rescue!

The selective search algorithm in R-CNN is a “fixed” algorithm: the network doesn't learn anything at this stage, which can lead to poor candidate region proposals.

Also, when the runtimes were compared, it became clear that region proposal was the bottleneck in the performance of this family of algorithms.

Source: missinglink.ai

So how did Faster R-CNN promise to be a better candidate?

Instead of running the selective search algorithm on the feature map to identify region proposals, a separate network (the Region Proposal Network) was used to predict them. This eliminated the selective search algorithm and let the overall network learn the region proposals too.

And since it was implemented with convolutions, this became the fastest and most accurate object detection algorithm in its family.
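
If you just want to try Faster R-CNN without building the pipeline yourself, torchvision ships a pre-trained model. A minimal sketch, assuming a 2020-era torchvision (newer releases use the weights= argument instead of pretrained=) and a random tensor standing in for a real image:

import torch
import torchvision

# Faster R-CNN with a ResNet-50 FPN backbone, pre-trained on COCO
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

img = torch.rand(3, 480, 640)      # stand-in for a real image, values in [0, 1]
with torch.no_grad():
    preds = model([img])           # a list with one dict per input image

# each dict holds 'boxes' (x1, y1, x2, y2), 'labels' and 'scores'
print(preds[0]["boxes"].shape, preds[0]["scores"][:5])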

But let's not count our chickens before they hatch; let's look at a speed vs. mAP (mean Average Precision) graph.

YOLO runs inference much faster!

Why is that? Let's look at the differences between YOLO and R-CNN:

YOLO and Faster R-CNN share some similarities. They both use an anchor-box-based network structure, and both use bounding box regression.

How are they different?

The thing that sets YOLO apart from Faster R-CNN is that it performs classification and bounding box regression at the same time.

However, YOLO does have its drawbacks in object detection. It doesn't generalize well when objects in the image have unusual aspect ratios.

Faster R-CNN, on the other hand, does detect small objects well; however, its two-step architecture keeps it from running in real time.
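
Both families are scored with mAP, which counts a detection as correct only when its IoU (Intersection over Union) with a ground-truth box passes a threshold, commonly 0.5. A minimal sketch of IoU for boxes given as (x1, y1, x2, y2):

def iou(a, b):
    # intersection rectangle
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    # union = area(a) + area(b) - intersection
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 100, 100), (50, 50, 150, 150)))   # 0.1428... = 2500 / 17500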

With that said, let's jump into setting up YOLOv4 on a local machine, since it can be trained on an inexpensive local GPU.

Training on a custom dataset with YOLOv4

Let's start by downloading the CUDA and cuDNN libraries for GPU acceleration.

Note: Currently only NVIDIA GPUs are supported. Check https://developer.nvidia.com/cuda-gpus to see if your GPU is supported.

  1. Go to https://developer.nvidia.com/cuda-toolkit-archive, download the 10.0 toolkit. If you’re on Linux, download the runfile(local) version.
  2. While downloading, open a terminal and type the following command:
nvidia-smi

If you get a nicely structured table with your GPU listed, the graphics drivers are already installed; in the installer step below, type 'n' when it asks to install the graphics drivers.

  3. After the download completes, open a terminal at the download location and run the following command:
sudo sh <the file you downloaded>

Make sure you type ‘y’ when the installer asks to create a symlink with /usr/local/cuda

  4. After completion, the installer will ask you to add /usr/local/cuda to your PATH and LD_LIBRARY_PATH environment variables.
  5. If you're on a Bash shell, just type:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

Alternatively, you could add these commands to ~/.bashrc so they are exported automatically every time you open a terminal.

We're done installing the CUDA and cuDNN libraries!

Let's install OpenCV to speed up data augmentation.

  1. Download the source files from the OpenCV releases page; you can use the latest version, 4.3.0: https://opencv.org/releases/
  2. Download the source files from OpenCV contrib repository, here’s the link:

https://github.com/opencv/opencv_contrib/releases

  3. Go to the downloaded OpenCV folder and, inside it, create a folder named ‘build’:
cd opencv-x.x.x
mkdir build && cd build
  4. Run the following command to open the cmake GUI:
cmake-gui

Initially it will all be empty; press Configure to generate the configuration entries.

  5. The only important flag to check here is:

BUILD_opencv_world

  6. Set OPENCV_EXTRA_MODULES_PATH to:
<path to opencv_contrib>/modules

Press Generate.

  7. Now, from the build folder, run the following command:
make -j8

(Here 8 is the number of CPU cores to use for parallel compilation; if you don't know how many cores your CPU has, 1 works fine.)

Note: This step will take some time.

  8. After the above command finishes without any errors, type the following command:
export OpenCV_DIR=<path to the build folder>

That's it! We have finished building the OpenCV library.

So far we have installed the CUDA and cuDNN libraries for GPU computation, and OpenCV for faster data augmentation and processing of images and videos.

Let’s now compile Darknet.

Heard of TensorFlow and PyTorch? Darknet is yet another framework, written in C and built for computationally intensive tasks. We will use Darknet to train YOLO and run inference.

Compiling Darknet:

  1. Clone the darknet repository given by AlexeyAB: https://github.com/AlexeyAB/darknet
git clone https://github.com/AlexeyAB/darknet.git
cd darknet

To compile it on Linux using CMake:
./build.sh

To compile it on Windows using CMake:
./build.ps1

Alternatively, we can use the make command after modifying build parameters such as:

GPU=1 to build with CUDA and accelerate by using the GPU (CUDA should be in /usr/local/cuda)

CUDNN=1 to build with cuDNN v5–v7 to accelerate training by using the GPU (cuDNN should be in /usr/local/cudnn)

CUDNN_HALF=1 to build for Tensor Cores (on Titan V / Tesla V100 / DGX-2 and later), which speeds up detection 3x and training 2x

OPENCV=1 to build with OpenCV 4.x/3.x/2.4.x — allows detection on video files and video streams from network cameras or webcams

DEBUG=1 to build a debug version of Yolo

OPENMP=1 to build with OpenMP support to accelerate Yolo by using a multi-core CPU

Where do we make these changes? Simply edit the ‘Makefile’ in the repository with a text editor, set the desired flags to 1, and then run make.

  2. If the build runs successfully, there will be a binary named ‘darknet’, which we will use to run training and inference.

We’re done!

Training YOLOv4 on a custom dataset:

Now that we have Darknet compiled, we'll use it to train the YOLOv4 architecture on a custom dataset.

Training prerequisites:

Let's assume we have a folder containing some images on which we will train the architecture.

  1. If the data is unlabeled, i.e. the ground truths are not available, we can use the GUI utility by AlexeyAB: https://github.com/AlexeyAB/Yolo_mark
  2. If the data is labeled, we convert it to the format the pipeline accepts.
  3. The format is as follows:

<object-class> <x_center> <y_center> <width> <height>

object-class is the integer id of the class the object is labeled with (more on that below)

The remaining 4 parameters are relative to the image size, so they must be in the range 0–1.

Note: x_center and y_center are the center point of the box, not its top-left corner.

  4. For example, suppose we have an image named img1.jpg. A corresponding img1.txt will be created by the labeling utility (or by converting already-labeled data to the required format). The content of img1.txt should look like this:
1 0.111 0.112 0.113 0.114
0 0.121 0.122 0.123 0.124
1 0.131 0.132 0.133 0.134

The 3 lines denote that the image contains 3 objects, 2 of which belong to class 1.
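
If your existing annotations are in absolute pixel coordinates (xmin, ymin, xmax, ymax), the conversion looks roughly like this (a minimal sketch; the function name and inputs are illustrative, not part of any particular tool):

def to_yolo(class_id, xmin, ymin, xmax, ymax, img_w, img_h):
    # convert absolute corner coordinates to the relative center/size format
    x_center = (xmin + xmax) / 2.0 / img_w
    y_center = (ymin + ymax) / 2.0 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# e.g. a class-1 object spanning pixels (120, 80) to (280, 240) in a 640x480 image
print(to_yolo(1, 120, 80, 280, 240, 640, 480))   # 1 0.312500 0.333333 0.250000 0.333333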

  5. Copy the contents of yolov4-custom.cfg, located at <darknet repository>/cfg/, into a new file named yolo-obj.cfg
  6. Edit the content as follows:
  • Change max_batches to (classes*2000); if we have 4 classes, max_batches=8000.

Note: max_batches should not be smaller than the number of training images (and not smaller than 6000).

  • Change steps to (0.8*max_batches, 0.9*max_batches); if we have 4 classes, steps=6400,7200.
  • Set the network size: width=416, height=416 (or any other multiple of 32).
  • Change classes=80 to classes=<number of classes>; if we have 4 classes, classes=4.

Note: This needs to be done in each of the 3 [yolo] layers. (Find and Replace works wonders here.)

  • Change filters=255 to filters=(classes + 5)*3 in the [convolutional] layer immediately before each of the 3 [yolo] layers; if we have 4 classes, filters=27.
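
Since all of these numbers derive from the class count, a tiny helper (purely illustrative) can compute them before you edit the cfg:

def yolo_cfg_params(num_classes):
    max_batches = num_classes * 2000                       # e.g. 8000 for 4 classes
    steps = (int(0.8 * max_batches), int(0.9 * max_batches))
    filters = (num_classes + 5) * 3                        # conv layer before each [yolo] layer
    return max_batches, steps, filters

print(yolo_cfg_params(4))   # (8000, (6400, 7200), 27)
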
  7. Create a file obj.names in the directory <darknet repository>/build/darknet/x64/data/, with the object names, each on a new line.
  8. Create a file obj.data in the directory <darknet repository>/build/darknet/x64/data/ with the following content:

classes = <number of classes> (for example, classes = 4)

train = data/train.txt

valid = data/test.txt

names = data/obj.names

backup = backup/

  9. Place the images and their .txt label files into <darknet repository>/build/darknet/x64/data/obj/
  10. Create a file train.txt in the directory <darknet repository>/build/darknet/x64/data/, listing the image filenames, one per line, with paths relative to the darknet binary, for example:

data/obj/img1.jpg

data/obj/img2.jpg

data/obj/img3.jpg
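
Writing train.txt by hand gets tedious for larger datasets; here is a minimal sketch that generates it, assuming the images are .jpg files already placed under data/obj/ and the script is run from the directory containing the darknet binary:

import glob

# list all images with paths relative to the darknet binary
paths = sorted(glob.glob("data/obj/*.jpg"))
with open("data/train.txt", "w") as f:
    f.write("\n".join(paths) + "\n")
print("wrote", len(paths), "image paths to data/train.txt")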

We're done setting up the training prerequisites!

We can now start training, or restore training from the backup folder we defined in obj.data (see step 8).

Before training, download the pre-trained convolutional weights (yolov4.conv.137) from here: https://drive.google.com/file/d/1JKF-bdIklxOOVy-2Cr5qdvjgGpmGfcbp/view

Place it in <darknet repository>/build/darknet/x64/

Open up a terminal and type:

./darknet detector train data/obj.data yolo-obj.cfg yolov4.conv.137
  • The file yolo-obj_last.weights will be saved to <darknet repository>/build/darknet/x64/backup/ every 100 iterations
  • The file yolo-obj_xxxx.weights will be saved to <darknet repository>/build/darknet/x64/backup/ every 1000 iterations
  • To disable Loss-Window use:
./darknet detector train data/obj.data yolo-obj.cfg yolov4.conv.137 -dont_show

To see the mAP & loss chart during training on a remote server without a GUI, use:

./darknet detector train data/obj.data yolo-obj.cfg yolov4.conv.137 -dont_show -mjpeg_port 8090 -map

Then open the URL http://ip-address:8090 in a Chrome/Firefox browser.

We’ve started training!

Use this command to test the trained model:

./darknet detector test data/obj.data yolo-obj.cfg yolo-obj_8000.weights

Note: 8000 is the training iteration at which the weights were saved. You can also append the path of an image to test at the end of the command.

Other Implementations:

YOLOv4 has also been implemented in TensorFlow, which can be converted to TFLite and used in an Android application!

Here is the link: https://github.com/hunglc007/tensorflow-yolov4-tflite

The TensorFlow version has also been published on PyPI, and can be used to train, run inference, and convert to TFLite, all in one place!

Here is the link: https://pypi.org/project/yolov4/

Credits: This post was originally created by Aitik Gupta as part of summer internship program @Code Heroku. You can reach him at aitikgupta@gmail.com
