Build a Real-Time Object Detection Model With YOLOv3

Did you ever want to build an Object Detection model but lacked a GPU to train the model on? Then here’s an example of a real-world project for Computer Vision enthusiasts.

Dhrumilparikh

Published in

Discover Computer Vision

4 min read · Oct 23, 2020

Example images passed through our YOLOv3 COCO model.

Introduction:

You only look once (YOLO) is a state-of-the-art, real-time object detection system. YOLOv3 is extremely fast and accurate.

YOLOv3 is an object detection model that can run on a laptop or desktop computer lacking a Graphics Processing Unit (GPU). The ‘You Only Look Once’ v3 (YOLOv3) model is widely used in deep learning-based object detection. The pre-trained model was trained on the COCO dataset, where it achieves a mean Average Precision (mAP) of 57.9% at an IoU threshold of 0.5. In this project, YOLOv3 runs at around 20 FPS on a non-GPU computer.

YOLO has gone through three iterations, each a gradual improvement over the previous one. You can check each one of the articles:

Pre-requisites

Python (3.0 or above)
COCO dataset
YOLOv3 pre-trained weights
Numpy
OpenCV

What is YOLO?

YOLO is a clever convolutional neural network (CNN) for doing object detection in real-time. The algorithm applies a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities. High-scoring regions of the image are considered detections.

YOLO has several advantages over classifier-based systems. It looks at the whole image at test time, so its predictions are informed by the global context in the image. It also makes predictions with a single network evaluation, unlike systems such as R-CNN, which require thousands of evaluations for a single image. This makes it extremely fast: more than 1000x faster than R-CNN and 100x faster than Fast R-CNN.

The architecture of YOLOv3:

Each bounding box contains 5 elements: (x, y, w, h) and a box confidence score. The confidence score reflects how likely it is that the box contains an object and how accurate the bounding box is. The bounding box width w and height h are normalized by the image width and height, and x and y are offsets relative to the corresponding grid cell. Hence, x, y, w, and h are all between 0 and 1. Each cell also has 20 conditional class probabilities. The conditional class probability is the probability that the detected object belongs to a particular class (one probability per category for each cell). So, with an S×S grid, B boxes per cell, and C classes, YOLO’s prediction has a shape of (S, S, B×5 + C) = (7, 7, 2×5 + 20) = (7, 7, 30). These numbers come from the original YOLO design; YOLOv3 keeps the same grid idea but predicts boxes at three scales, and its COCO model has 80 classes.

The major concept of YOLO is to build a CNN that predicts a (7, 7, 30) tensor. It uses a CNN to reduce the spatial dimension to 7×7 with 1024 output channels at each location. YOLO then performs a linear regression using two fully connected layers to make 7×7×2 bounding box predictions (the middle picture below). To make a final prediction, keep only the boxes with high box confidence scores (greater than 0.50).
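To make the shape arithmetic above concrete, here is a small sketch in plain Python. The function names `yolo_output_shape` and `cell_box_to_pixels` are made up for illustration, and the 448×448 image size is only an example value.

```python
def yolo_output_shape(S=7, B=2, C=20):
    """Prediction tensor shape: an S x S grid, B boxes of 5 values each, C class scores."""
    return (S, S, B * 5 + C)


def cell_box_to_pixels(row, col, x, y, w, h, img_w, img_h, S=7):
    """Convert a cell-relative box to pixel corners (left, top, right, bottom).

    x, y are offsets inside grid cell (row, col), in [0, 1].
    w, h are fractions of the full image width and height, in [0, 1].
    """
    center_x = (col + x) / S * img_w
    center_y = (row + y) / S * img_h
    box_w = w * img_w
    box_h = h * img_h
    return (center_x - box_w / 2, center_y - box_h / 2,
            center_x + box_w / 2, center_y + box_h / 2)


shape = yolo_output_shape()
total_boxes = shape[0] * shape[1] * 2  # 7 x 7 cells, 2 boxes per cell
print(shape, total_boxes)  # (7, 7, 30) 98

# A box centered in the middle cell of a 448 x 448 image (example values).
box = cell_box_to_pixels(row=3, col=3, x=0.5, y=0.5, w=0.25, h=0.5,
                         img_w=448, img_h=448)
print(box)  # (168.0, 112.0, 280.0, 336.0)
```

Calling `yolo_output_shape(S=13, B=3, C=80)` shows how the same formula grows with a finer grid and more classes.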

Building Object Detection model:

1. Create a Jupyter notebook, import all the necessary packages, and load YOLO

Importing the necessary packages & loading YOLOv3

2. Load the trained weights and configuration file of YOLOv3
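The original code for this step is not reproduced here, so the following is only a sketch of how loading could look with OpenCV's `dnn` module. The file names `yolov3.weights`, `yolov3.cfg`, and `coco.names` are assumptions about how the downloaded files are named, and `parse_class_names` and `load_yolo` are hypothetical helper names.

```python
from pathlib import Path


def parse_class_names(text):
    """Turn the contents of a class-names file (one label per line) into a list."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_yolo(weights_path="yolov3.weights", cfg_path="yolov3.cfg",
              names_path="coco.names"):
    """Load a Darknet model with OpenCV. The default file names are assumptions."""
    import cv2  # imported lazily so parse_class_names works without OpenCV

    net = cv2.dnn.readNet(weights_path, cfg_path)
    # Names of the layers whose outputs hold the detections.
    output_layers = net.getUnconnectedOutLayersNames()
    classes = parse_class_names(Path(names_path).read_text(encoding="utf-8"))
    return net, output_layers, classes
```

In a notebook this would be used as `net, output_layers, classes = load_yolo()`, provided the three files sit next to the notebook.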

Enable OpenCV to capture video feed
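A minimal sketch of the video-capture step, assuming the default webcam at index 0. The helper names `open_camera` and `iter_frames` are hypothetical; `iter_frames` only relies on the capture object having a `read()` method that returns `(ok, frame)`, which is how `cv2.VideoCapture` behaves.

```python
def open_camera(index=0):
    """Open a webcam with OpenCV. Index 0 is assumed to be the default camera."""
    import cv2  # imported lazily so iter_frames can be used without OpenCV

    capture = cv2.VideoCapture(index)
    if not capture.isOpened():
        raise RuntimeError(f"Could not open camera {index}")
    return capture


def iter_frames(capture, max_frames=None):
    """Yield frames until a read fails or max_frames frames have been produced."""
    count = 0
    while max_frames is None or count < max_frames:
        ok, frame = capture.read()
        if not ok:
            break
        yield frame
        count += 1
```

A typical loop would be `for frame in iter_frames(open_camera()): ...`, followed by `capture.release()` once the loop ends.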

3. Detecting Objects and predicting bounding boxes

Run the Deep Neural Net on the video feed
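Here is a hedged sketch of the detection step. It assumes each output row from the network has the layout `[center_x, center_y, width, height, objectness, class scores...]` with coordinates normalized to the range 0 to 1, which is the layout OpenCV returns for Darknet YOLO models. The 416×416 input size and the helper names `run_network` and `parse_detections` are assumptions, and the 0.5 threshold follows the confidence cut-off mentioned earlier.

```python
import numpy as np


def run_network(net, output_layers, frame, size=(416, 416)):
    """Forward one frame through the network. The input size is an assumption."""
    import cv2  # imported lazily so parse_detections works without OpenCV

    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, size, swapRB=True, crop=False)
    net.setInput(blob)
    return net.forward(output_layers)


def parse_detections(outputs, frame_w, frame_h, conf_threshold=0.5):
    """Convert raw output rows into pixel boxes [x, y, w, h], scores, and class ids."""
    boxes, confidences, class_ids = [], [], []
    for output in outputs:
        for row in output:
            scores = row[5:]                      # per-class scores
            class_id = int(np.argmax(scores))     # best class for this box
            confidence = float(scores[class_id])
            if confidence <= conf_threshold:
                continue                          # drop low-confidence boxes
            # Scale the normalized center/size back to pixels.
            center_x, center_y = row[0] * frame_w, row[1] * frame_h
            w, h = row[2] * frame_w, row[3] * frame_h
            boxes.append([int(center_x - w / 2), int(center_y - h / 2), int(w), int(h)])
            confidences.append(confidence)
            class_ids.append(class_id)
    return boxes, confidences, class_ids
```

Because `parse_detections` only needs NumPy arrays, it can be tried on hand-made rows before wiring it to a real network.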

4. Showing pieces of information on the screen

Make inferences from the trained model
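Thresholding on confidence alone usually leaves several overlapping boxes on the same object. This article's text does not describe how that is handled, so the following non-maximum suppression routine is an assumption about a typical pipeline rather than the author's code. OpenCV ships an equivalent as `cv2.dnn.NMSBoxes`; the pure-Python version below is just to show the idea, and the 0.4 overlap threshold is an example value.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as [x, y, w, h]."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, confidences, iou_threshold=0.4):
    """Return indices of boxes to keep, highest confidence first."""
    order = sorted(range(len(boxes)), key=lambda i: confidences[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

The returned indices can then be used to pick the surviving boxes, scores, and class ids before anything is drawn on the frame.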

5. Labeling the image & you’re done!

Labeling the inferred predictions
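A sketch of the labeling step using OpenCV's drawing functions. The green color, font, and text offset are arbitrary choices, and `label_text` and `draw_detections` are hypothetical helper names. Boxes are assumed to be `[x, y, w, h]` in pixels, and `keep` is assumed to be a list of indices of the boxes to draw.

```python
def label_text(class_name, confidence):
    """Format the caption shown next to a box, e.g. 'person 0.75'."""
    return f"{class_name} {confidence:.2f}"


def draw_detections(frame, boxes, confidences, class_ids, classes, keep):
    """Draw a rectangle and caption on the frame for every index in keep."""
    import cv2  # imported lazily so label_text works without OpenCV

    color = (0, 255, 0)  # green in BGR; an arbitrary choice
    for i in keep:
        x, y, w, h = boxes[i]
        cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
        text = label_text(classes[class_ids[i]], confidences[i])
        # Place the caption just above the box, but never off the top edge.
        cv2.putText(frame, text, (x, max(y - 5, 15)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
    return frame
```

The annotated frame can then be shown with `cv2.imshow` inside the capture loop.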

Results of object-detection

Example image to predict the objects on & make inferences.

Inferred output after predicting the objects on the image.

Inferred image with the bounding boxes around the objects | Source: Author

Summary:

In this project, YOLOv3 achieved its goal of bringing object detection to non-GPU computers. In addition, YOLOv3 offers contributions to the field of object detection. First, YOLOv3 shows that shallow networks have immense potential for lightweight real-time object detection. Running at 20 FPS on a non-GPU computer is very promising for such a small system. Second, YOLOv3 shows that the use of batch normalization should be questioned when it comes to smaller, shallow networks. Progress in lightweight real-time object detection is the last frontier in making object detection usable in everyday settings.

Further Learning Resources: