Guide to Object Detection using YOLO

Jantakarn
9 min read · May 5, 2020


This article explains the principles of YOLO in simple terms and walks through object detection step by step.

Kanittha 6214552611 and Jantakarn 6214552620.

Zebra Detection

Object detection is a part of computer vision that involves locating objects in an image and identifying their type. Combining detection and classification is a challenging problem, and in recent years deep learning has made great progress on it. Well-known methods include R-CNN, Fast R-CNN, Faster R-CNN, YOLO, and SSD. We are interested in “You Only Look Once” (YOLO), a kind of convolutional neural network which, when tested, gives accurate results at a satisfactory speed. Therefore, we studied how YOLOv3 works and used YOLOv3-tiny to detect objects in both images and videos.

What is object detection?

Object detection in an image

Object detection is a computer technology, related to computer vision and image processing, that is used in AI to detect instances of specific object classes, such as humans, cars, and buildings, in images or videos. Common methods include R-CNN, Fast R-CNN, Faster R-CNN, YOLO, and SSD. In this article, we will study the use of YOLO to detect objects.

What is YOLO?

YOLO (You Only Look Once) is an algorithm for detecting objects. Object detection consists of determining the position of objects in an image as well as classifying them. Earlier methods, such as R-CNN, Fast R-CNN, and Faster R-CNN, are slow and difficult to optimize because each component must be trained separately. YOLO is a convolutional neural network (CNN) that performs object detection in real time. The algorithm applies a single neural network to the full image, divides the image into regions, and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities.

How does YOLO work?

Grid cells: the idea of dividing the image into grid cells is unique to YOLO. The image is divided into an S x S grid, and the grid cell that contains the center of an object is responsible for predicting that object.

Each grid cell's output consists of:

Outputs of each grid cell
  • pc is the probability that an object is present in the grid cell. If there is no object, the remaining values are ignored.
  • bx, by are the coordinates of the object's center within the grid cell. If no object's center falls in that grid cell, they do not need to be calculated.
  • bw, bh are the width and height of the bounding box.
  • c represents the class scores, one value for each of the specified classes.

As an example:

Sample picture to explain how YOLO works

From the example, we divide the image into a 3x3 grid, and there are 3 classes (person, dog, and cat) to keep things easy to understand. So this image will have a label vector y as follows:

From the picture, you can see that the centers of the objects fall in different grid cells, so y can be found for each object as follows:

Calculating the y value of each object

For all grid cells together, the output is a 3 x 3 x 8 tensor.
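To make this concrete, here is a minimal sketch of the label tensor for this 3-class example. The cell index, coordinates, and class order (person, dog, cat) are illustrative assumptions, not values taken from the figure:

import numpy as np

# One grid cell's label: [pc, bx, by, bw, bh, c_person, c_dog, c_cat]
# The numbers here are made up for illustration.
y_dog = np.array([1.0, 0.4, 0.6, 0.3, 0.5, 0, 1, 0])  # cell contains a dog
y_none = np.zeros(8)                                   # pc = 0, rest ignored

# Full label tensor for the 3 x 3 grid: shape (3, 3, 8)
Y = np.zeros((3, 3, 8))
Y[2, 1] = y_dog  # place the dog's vector in its (illustrative) grid cell
print(Y.shape)   # (3, 3, 8)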

Anchor Box

Anchor boxes are YOLO's mechanism for separating objects when the centers of multiple objects fall in the same grid cell.

From the picture, you can see that the centers of the two objects are in the same grid cell. The solution is to add a dimension to the output, so that we can look at these two objects separately.

Find the y of each anchor box, then stack them together.

Therefore, the prediction output for all grid cells will be 3 X 3 X (2 * 8) = 3 X 3 X 16, which in general is S X S X (B * (5 + C)).

  • S x S is the grid size
  • B is the number of anchor boxes
  • C is the number of classes
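As a quick sanity check on the example's numbers, here is a small sketch using the values above:

S, B, C = 3, 2, 3    # grid size, anchor boxes, classes (from the example)
depth = B * (5 + C)  # each anchor contributes pc, bx, by, bw, bh plus C class scores
print(depth)         # 16, so the output tensor is 3 x 3 x 16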

Intersection over Union (IoU)

IoU is an evaluation metric used to measure the accuracy of a detector. It compares the predicted box with the ground-truth box: how can we decide whether a prediction is good? It is calculated from the areas as IoU = area of intersection / area of union:

If the IoU value is close to 1, then our predictions are highly accurate.
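A minimal sketch of the calculation, assuming boxes are given as (x1, y1, x2, y2) corner coordinates (the sample boxes are made up):

def iou(box_a, box_b):
    # Intersection rectangle corners
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14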

Non-Max Suppression

Non-max suppression is a general algorithm used when multiple boxes are predicted for the same object. It considers the probability of each box together with the IoU value: among the candidate boxes for a detected object, we first keep the box with the highest probability, then compute the IoU between it and each remaining box and discard those whose IoU exceeds a specified threshold.
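A minimal sketch of this procedure, reusing the iou helper sketched above (OpenCV's built-in cv2.dnn.NMSBoxes, used later in this article, does the same job):

def non_max_suppression(boxes, scores, iou_threshold=0.4):
    # Sort box indices by score, best first
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard remaining boxes that overlap the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep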


Implementing YOLO in Python

First, we will try to detect the objects in the image.

Import the libraries

import cv2
import numpy as np

Load the YOLO network from yolov3-tiny.weights and yolov3-tiny.cfg.

  • Weights file: the trained model for the object detection algorithm.
  • Cfg file: the configuration file that holds the settings for the algorithm.
  • Names file: contains the names of the objects that the model can detect.
net = cv2.dnn.readNet("yolov3-tiny.weights", "yolov3-tiny.cfg")
# Load the class names, one per line
classes = []
with open("coco.names", "r") as f:
    classes = [line.strip() for line in f.readlines()]
layer_names = net.getLayerNames()
# Names of the output layers (on some newer OpenCV versions
# getUnconnectedOutLayers() returns plain ints, so use layer_names[i - 1])
output_layers = [layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers()]
# One random color per class for drawing
colors = np.random.uniform(0, 255, size=(len(classes), 3))

Load the image using cv2.imread and resize it with cv2.resize.

img = cv2.imread("boy.jpg")
img = cv2.resize(img, None, fx=0.7, fy=0.7)  # scale to 70% of the original size
height, width, channels = img.shape

Use cv2.dnn.blobFromImage to convert the image to a blob, setting the scale factor as needed (here 0.00392 ≈ 1/255, which scales pixel values to [0, 1]). We have to convert the image to a blob to extract features from it and resize it. YOLO accepts 3 input sizes:

  • 320 × 320: small, faster but less accurate
  • 416 × 416: medium, a trade-off between speed and accuracy
  • 609 × 609: large, more accurate but slower

Therefore, we select 416 × 416.

Use net.setInput(blob) to pass the blob to the network, then use net.forward(output_layers) to run a forward pass up to the output layers and store the result in the outs variable.

blob = cv2.dnn.blobFromImage(img, 0.00392, (416, 416), (0, 0, 0), True, crop=False)
net.setInput(blob)
outs = net.forward(output_layers)
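Each row of each output array holds one candidate detection: the first four values are the box center, width, and height (normalized to [0, 1]), followed by an objectness score and one score per class. A quick way to inspect this (the exact row count varies with the input size):

for out in outs:
    print(out.shape)  # e.g. (N, 85): 4 box values + objectness + 80 COCO class scores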

At this point, the detection is complete and we just need to display the results on the screen.

Create three empty lists: class_ids, confidences, and boxes. For each detection row, the class scores (from index 5 onward) are stored in the scores variable. Then recover the box: the positions centerx and centery at indices 0 and 1 are multiplied by the image width and height, and the box size w and h at indices 2 and 3 are likewise multiplied by the width and height.

class_ids = []
confidences = []
boxes = []
for out in outs:
    for detection in out:
        scores = detection[5:]
        class_id = np.argmax(scores)
        confidence = scores[class_id]
        if confidence > 0.5:
            # Object detected: recover the box center and size in pixels
            centerx = int(detection[0] * width)
            centery = int(detection[1] * height)
            w = int(detection[2] * width)
            h = int(detection[3] * height)
            # Rectangle coordinates (top-left corner)
            x = int(centerx - w / 2)
            y = int(centery - h / 2)
            boxes.append([x, y, w, h])
            confidences.append(float(confidence))
            class_ids.append(class_id)

Use the non-max suppression function to remove redundant overlapping boxes.

indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)  # score threshold 0.5, IoU threshold 0.4

Draw a rectangle at each box's x and y position, use cv2.imshow to show the result on the screen, and cv2.imwrite to save it.

font = cv2.FONT_HERSHEY_PLAIN
for i in range(len(boxes)):
    if i in indexes:
        x, y, w, h = boxes[i]
        label = str(classes[class_ids[i]])
        color = colors[class_ids[i]]  # one color per class
        cv2.rectangle(img, (x, y), (x + w, y + h), color, 2)
        cv2.putText(img, label, (x, y), font, 2, color, 2)
cv2.imshow("Image", img)
cv2.imwrite("save.jpg", img)
cv2.waitKey(0)
cv2.destroyAllWindows()

Before

After

Before

After

Now that we have detected objects in images, we will experiment with detecting objects in video.

Import the cv2, numpy, and time libraries.

import cv2
import numpy as np
import time

Load the YOLO network from yolov3-tiny.weights and yolov3-tiny.cfg, exactly as before.

net = cv2.dnn.readNet("yolov3-tiny.weights", "yolov3-tiny.cfg")
classes = []
with open("coco.names", "r") as f:
    classes = [line.strip() for line in f.readlines()]
layer_names = net.getLayerNames()
output_layers = [layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers()]
colors = np.random.uniform(0, 255, size=(len(classes), 3))

Load the video, or to use a notebook's camera change the line to cap = cv2.VideoCapture(0). Also set the path for saving the output video.

cap = cv2.VideoCapture('filename.avi')
frame_width = int(cap.get(3))   # CAP_PROP_FRAME_WIDTH
frame_height = int(cap.get(4))  # CAP_PROP_FRAME_HEIGHT
output = cv2.VideoWriter('outpy.avi', cv2.VideoWriter_fourcc('M', 'J', 'P', 'G'), 10, (frame_width, frame_height))
font = cv2.FONT_HERSHEY_PLAIN
starting_time = time.time()
frame_id = 0

Run a loop that reads the video frame by frame.

while True:
    ret, frame = cap.read()
    if not ret:  # stop when the video ends
        break
    frame_id += 1
    height, width, channels = frame.shape
    blob = cv2.dnn.blobFromImage(frame, 0.00392, (416, 416), (0, 0, 0), True, crop=False)
    net.setInput(blob)
    outs = net.forward(output_layers)

The detection for this frame is complete, and we just need to display the results on the screen.

    class_ids = []
    confidences = []
    boxes = []
    for out in outs:
        for detection in out:
            scores = detection[5:]
            class_id = np.argmax(scores)
            confidence = scores[class_id]
            if confidence > 0.2:
                # Object detected: recover the box center and size in pixels
                center_x = int(detection[0] * width)
                center_y = int(detection[1] * height)
                w = int(detection[2] * width)
                h = int(detection[3] * height)
                # Rectangle coordinates (top-left corner)
                x = int(center_x - w / 2)
                y = int(center_y - h / 2)
                boxes.append([x, y, w, h])
                confidences.append(float(confidence))
                class_ids.append(class_id)
    indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.4, 0.3)  # score threshold 0.4, IoU threshold 0.3

Draw a rectangle at each box's x, y position using the x, y, w, h values stored in the boxes list. The cv2.putText command displays the detected label and its confidence. The cv2.imshow('frame', frame) command displays the result in a window named 'frame'. To stop the program, press the 'q' key. The output.write(frame) call saves each frame to the output video.

    for i in range(len(boxes)):
        if i in indexes:
            x, y, w, h = boxes[i]
            label = str(classes[class_ids[i]])
            confidence = confidences[i]
            color = colors[class_ids[i]]
            cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
            cv2.rectangle(frame, (x, y), (x + w, y + 30), color, -1)  # filled label background
            cv2.putText(frame, label + " " + str(round(confidence, 2)), (x, y + 30), font, 2, (255, 255, 255), 2)
            print(label, confidence, x, y, w, h)
    elapsed_time = time.time() - starting_time
    fps = frame_id / elapsed_time
    cv2.putText(frame, "FPS: " + str(round(fps, 2)), (10, 50), font, 2, (0, 0, 0), 2)
    output.write(frame)
    cv2.imshow('frame', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
output.release()
cv2.destroyAllWindows()

When the complete program runs, it can detect objects in the video, as shown below.

Summary

When we look at images or videos, we can locate and identify the objects of interest within moments. Passing this ability on to computers is object detection: locating an object and identifying it. Object detection has found applications in a wide variety of domains, such as video surveillance, image retrieval systems, and autonomous vehicles. Various algorithms can be used for object detection, but we have focused on YOLOv3. YOLO is one of the fastest real-time object detection algorithms (45 frames per second) compared to the R-CNN family (R-CNN, Fast R-CNN, Faster R-CNN, etc.). This article showed how YOLOv3 works and how to use it for object detection, using YOLOv3-tiny for testing because it needs less memory and is the fastest of all the YOLO variants. Our experiments show that YOLOv3-tiny can detect objects quickly and with sufficient precision.
