A Real-Time Object Detection model using YOLOv3 algorithm for non-GPU Computers

Published in Nerd For Tech, Oct 22, 2020

Co-author Dhrumilparikh

You only look once (YOLO) is a state-of-the-art, real-time object detection system. YOLOv3 is extremely fast and accurate.

This article builds an object detection model with YOLOv3 that runs on a laptop or desktop computer without a Graphics Processing Unit (GPU). The ‘You Only Look Once’ v3 (YOLOv3) model is widely used in deep learning-based object detection. The model was trained on the COCO dataset, achieving a mean Average Precision (mAP) of 57.9% at an IoU threshold of 0.5. YOLOv3 runs at around 20 FPS on a non-GPU computer.

YOLO has gone through three iterations, each a gradual improvement over the previous one. You can check the article for each one.

Pre-requisites

OpenCV
NumPy
Configuration file and weight file of YOLO
Pandas
COCO name file
Python (3.0 or above)
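
Before running the detector, it can help to confirm that the model files are in place. The sketch below assumes they sit in the working directory under the names used later in this article (yolov3.weights, yolov3.cfg, coco.names). The helper name missing_files is purely illustrative.

```python
from pathlib import Path

# File names as used in the implementation section; adjust if yours differ.
REQUIRED_FILES = ["yolov3.weights", "yolov3.cfg", "coco.names"]


def missing_files(directory=".", required=REQUIRED_FILES):
    """Return the required files that are not present in `directory`."""
    base = Path(directory)
    return [name for name in required if not (base / name).is_file()]


# An empty list means every file was found in the current directory.
print(missing_files())
```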

What is YOLO?

YOLO is a clever convolutional neural network (CNN) for real-time object detection. The algorithm applies a single neural network to the full image. The network divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities, and high-scoring regions of the image are considered detections.

YOLO has several advantages over classifier-based systems. It looks at the whole image at test time, so its predictions are informed by global context in the image. It also makes predictions with a single network evaluation, unlike systems such as R-CNN, which require thousands of evaluations for a single image. This makes it extremely fast: more than 1000x faster than R-CNN and 100x faster than Fast R-CNN.

The architecture of YOLOv3:

The design is easiest to follow in the original YOLO, which YOLOv3 builds on. Each boundary box contains 5 elements: (x, y, w, h) and a box confidence score. The confidence score reflects how likely it is that the box contains an object and how accurate the boundary box is. The bounding box width w and height h are normalized by the image width and height, and x and y are offsets to the corresponding cell. Hence, x, y, w and h are all between 0 and 1. Each cell has 20 conditional class probabilities. The conditional class probability is the probability that the detected object belongs to a particular class (one probability per category for each cell). So, with an S×S grid, B boxes per cell and C classes, YOLO’s prediction has a shape of (S, S, B×5 + C) = (7, 7, 2×5 + 20) = (7, 7, 30).
The major concept of YOLO is to build a CNN that predicts a (7, 7, 30) tensor. It uses the CNN to reduce the spatial dimension to 7×7 with 1024 output channels at each location. YOLO then performs a linear regression using two fully connected layers to make 7×7×2 boundary box predictions. To make a final prediction, keep only the boxes with high box confidence scores (greater than 0.50). YOLOv3 keeps this single-pass idea but replaces the fully connected layers with a fully convolutional network that predicts boxes at three scales.
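
The arithmetic above can be checked with a few lines of Python. The sketch below assumes the settings quoted in the text (S = 7, B = 2, C = 20). The function names and the 448×448 image size in the usage lines are illustrative assumptions, not part of any library.

```python
def prediction_shape(S=7, B=2, C=20):
    """Output tensor shape: each cell holds B boxes of 5 values plus C class probabilities."""
    return (S, S, B * 5 + C)


def box_count(S=7, B=2):
    """Total number of candidate boxes predicted for one image."""
    return S * S * B


def decode_box(row, col, x, y, w, h, img_w, img_h, S=7):
    """Convert one cell-relative prediction into a pixel box (left, top, width, height).

    x and y are offsets inside grid cell (row, col); w and h are fractions
    of the full image width and height.
    """
    center_x = (col + x) / S * img_w
    center_y = (row + y) / S * img_h
    box_w = w * img_w
    box_h = h * img_h
    return (center_x - box_w / 2, center_y - box_h / 2, box_w, box_h)


print(prediction_shape())  # (7, 7, 30)
print(box_count())         # 98
# A box centred in the middle cell of a 448x448 image (assumed size)
print(decode_box(3, 3, 0.5, 0.5, 0.25, 0.5, 448, 448))  # (168.0, 112.0, 112.0, 224.0)
```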

Implementation:

1. Create a Python file in any Python IDE and import all the necessary packages.

import cv2
import numpy as np
import time

2. Load the trained weights and configuration file of YOLOv3.

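# Load the network from the pre-trained weights and its configuration
net = cv2.dnn.readNet('yolov3.weights', 'yolov3.cfg')
# Read the COCO class names, one per line
classes = []
with open('coco.names', 'r') as f:
    classes = f.read().splitlines()
# Open the default webcam (device 0)
cap = cv2.VideoCapture(0)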

3. Now the network divides each frame into multiple regions and predicts bounding boxes for them. We keep only the boxes whose confidence is greater than 0.5.

font = cv2.FONT_HERSHEY_DUPLEX
starting_time = time.time()
frame_id = 0
while True:
    ok, img = cap.read()
    # Stop when the camera returns no frame
    if not ok:
        break
    height, width, _ = img.shape
    frame_id += 1
    # Scale pixels to [0, 1], resize to 416x416 and swap BGR to RGB
    blob = cv2.dnn.blobFromImage(img, 1/255, (416, 416), (0, 0, 0),
        swapRB=True, crop=False)
    net.setInput(blob)
    output_layers_names = net.getUnconnectedOutLayersNames()
    layersOutputs = net.forward(output_layers_names)
    boxes = []
    confidences = []
    class_ids = []
    for output in layersOutputs:
        for detection in output:
            # The first 5 values describe the box; the rest are class scores
            scores = detection[5:]
            class_id = np.argmax(scores)
            confidence = scores[class_id]
            if confidence > 0.5:
                # Box values are relative to the frame size
                center_x = int(detection[0]*width)
                center_y = int(detection[1]*height)
                w = int(detection[2]*width)
                h = int(detection[3]*height)
                # Convert the box centre to its top-left corner
                x = int(center_x - w/2)
                y = int(center_y - h/2)
                boxes.append([x, y, w, h])
                class_ids.append(class_id)
                confidences.append(float(confidence))
    print(len(boxes))
    # Non-maximum suppression: score threshold 0.5, overlap threshold 0.4
    indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
    colors = np.random.uniform(0, 255, size=(len(boxes), 3))
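
The call to cv2.dnn.NMSBoxes removes overlapping boxes that describe the same object. To show the idea, here is a minimal pure-Python sketch of greedy non-maximum suppression on [x, y, w, h] boxes. It illustrates the general technique and is not OpenCV's implementation. The function names iou and nms and the sample boxes are hypothetical, and the thresholds mirror the 0.5 and 0.4 used above.

```python
def iou(a, b):
    """Intersection over union of two boxes given as [x, y, w, h]."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, confidences, score_threshold=0.5, nms_threshold=0.4):
    """Return the indices of the boxes that are kept, highest confidence first."""
    order = sorted(
        (i for i, c in enumerate(confidences) if c > score_threshold),
        key=lambda i: confidences[i],
        reverse=True,
    )
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one
        if all(iou(boxes[i], boxes[j]) <= nms_threshold for j in kept):
            kept.append(i)
    return kept


# Two boxes on the same object and one box elsewhere (made-up values)
boxes = [[100, 100, 50, 50], [105, 105, 50, 50], [300, 200, 40, 60]]
confidences = [0.9, 0.8, 0.7]
print(nms(boxes, confidences))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, which is above 0.4, so only the higher-scoring one survives.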

4. Finally, draw a rectangle of a different colour around each detected object and label it with the class predicted by the pre-trained weights. That’s it. Enjoy!

    for i in range(len(boxes)):
        if i in indexes:
            x, y, w, h = boxes[i]
            label = str(classes[class_ids[i]])
            confidence = str(round(confidences[i], 2))
            # Pass the colour to OpenCV as a plain list of numbers
            color = colors[i].tolist()
            cv2.rectangle(img, (x, y), (x+w, y+h), color, 8)
            cv2.putText(img, label + " " + confidence, (x, y+20), font, 1, (255, 255, 255), 2)
    # Average FPS since the start of the capture
    elapsed_time = time.time() - starting_time
    fps = frame_id / elapsed_time
    cv2.putText(img, "FPS: " + str(round(fps, 2)), (10, 30), font, 1, (0, 0, 0), 1)
    cv2.imshow('Output', cv2.resize(img, (700, 500)))
    key = cv2.waitKey(1)
    # If the 'c' key is pressed, stop the loop
    if key == ord('c'):
        break
cap.release()
cv2.destroyAllWindows()
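
The FPS figure above is the average since the capture started, so it reacts slowly when the speed changes. As a possible alternative, the sketch below averages over only the most recent frames. The class name RollingFps and the 30-frame window are assumptions made for illustration, and the clock is passed in so the class can be tried without a camera.

```python
import time
from collections import deque


class RollingFps:
    """Frames per second averaged over the last `window` frames."""

    def __init__(self, window=30, clock=time.time):
        self.clock = clock
        # Timestamps of the most recent frames; older ones drop off automatically
        self.stamps = deque(maxlen=window)

    def tick(self):
        """Record one frame and return the current FPS estimate."""
        self.stamps.append(self.clock())
        if len(self.stamps) < 2:
            return 0.0
        elapsed = self.stamps[-1] - self.stamps[0]
        return (len(self.stamps) - 1) / elapsed if elapsed > 0 else 0.0
```

Inside the loop, creating one RollingFps object before `while True:` and calling its tick() method once per frame would take the place of frame_id / elapsed_time.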

Results will look like this:

Summary:

In this model, YOLOv3 achieved its goal of bringing object detection to non-GPU computers. The model can also detect objects in recorded images or videos. In addition, YOLOv3 offers contributions to the field of object detection. It shows that shallow networks have immense potential for lightweight real-time object detection, and running at around 20 FPS on a non-GPU computer is very promising for such a small system. YOLOv3 also suggests that the use of batch normalization should be questioned when it comes to smaller shallow networks. Progress in lightweight real-time object detection is the last frontier in making object detection usable in everyday instances.
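
Switching from the webcam to a recorded video only changes the argument given to cv2.VideoCapture, which accepts either a device index or a file path. A minimal sketch of choosing that argument follows. The helper name parse_source and the file name in the example are hypothetical.

```python
def parse_source(arg=None):
    """Return a camera index for numeric input, otherwise the path unchanged."""
    if arg is None:
        return 0  # default webcam, as in the implementation above
    return int(arg) if arg.isdigit() else arg


print(parse_source())               # 0
print(parse_source("1"))            # 1
print(parse_source("traffic.mp4"))  # traffic.mp4
# The result would then be used as: cap = cv2.VideoCapture(source)
```

A single image would instead be read with cv2.imread and passed once through the same blob and forward steps.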