Object Detection using YOLO

Tauseef Ahmad
5 min read · Jul 26, 2022


In this article, I share a step-by-step method for building a simple object detector that uses YOLO and your laptop's webcam feed to identify a specific object.


Note: Having basic knowledge of the YOLO model will help you understand this article even better. Also, the code is written in Python.

What is YOLO?

YOLO, an acronym for ‘You Only Look Once’ (inspired by the famous quote ‘You Only Live Once’), is an object detection algorithm that uses convolutional neural networks (CNNs) to detect objects in real time. As the name suggests, it is a single-stage object detection model that requires only a single forward propagation through a neural network to detect objects. To learn more about YOLO and its history, I found this article from PyImageSearch quite helpful, but you can refer to many other useful resources available online.

Initial requirements

For this project, we will NOT be training the model from scratch; instead, we will use the pre-trained configuration and weights of YOLOv4. To do that, the following files need to be downloaded first:

  • Config file: Stores the model architecture.
  • Weights file: Pre-trained model weights, trained with the Darknet code base on the MS COCO dataset.
  • Classes file: Contains the names of the 80 object categories in the COCO dataset.

Now, we complete the initial setup by loading the YOLO model with the OpenCV DNN function cv2.dnn.readNetFromDarknet. Then we read the classes file coco.names, which contains the list of 80 object classes the YOLO model was trained on, and store them in a Python list. Finally, we define the object that needs to be detected. In this particular case, we want to detect a cell phone in the camera feed.

Initial configuration and set-up
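
Here is a minimal sketch of this set-up. The file names yolov4.cfg, yolov4.weights, and coco.names are assumptions based on the standard YOLOv4 release; adjust the paths to wherever you downloaded the files:

```python
import cv2
import numpy as np

# Load the pre-trained YOLOv4 network from the Darknet config and weights
net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")

# Read the 80 COCO class names into a Python list
with open("coco.names") as f:
    classes = [line.strip() for line in f if line.strip()]

# The object we want to detect in the camera feed
target_class = "cell phone"
```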

Object detection function

This is where the ‘magic’ happens: we define an object detection function. This function takes an image frame from your camera feed and detects the intended object. The code for the camera feed will be explained later; for now, let’s assume the function receives a single image as input.

First, we get the height and width of the input image. Then we transform the image into a blob, a 4-D NumPy array of shape (images, channels, height, width), using the cv2.dnn.blobFromImage function. This step prepares the input image in the format the model expects. To learn more about what a blob is and how cv2.dnn.blobFromImage works, refer to this blog. The input parameters of this function depend on the model being used. For YOLO, the following parameters are used:

  • the image to transform
  • the scale factor (1/255, to scale the pixel values to [0..1])
  • the size, here a 416x416 square image
  • the mean value (default=0)
  • the option swapRB=True (since OpenCV uses BGR channel order)

The blob object is then set as the input to the network, and a forward pass is performed through the YOLO network after determining its output layers.

Object detection function
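
Below is a sketch of how this part of the function might look, assuming the set-up code above is in scope. The function name imgRead is taken from the camera-feed section later in this article:

```python
def imgRead(img):
    # Height and width of the input frame
    height, width = img.shape[:2]

    # Transform the frame into a 4-D blob: scale pixel values to [0..1],
    # resize to 416x416, no mean subtraction, swap BGR -> RGB
    blob = cv2.dnn.blobFromImage(img, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)

    # Forward pass through the YOLO output layers
    layer_outputs = net.forward(net.getUnconnectedOutLayersNames())
```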

We also need to visualize the results after the object is detected. But first, let’s initialize a few lists to store the required information:

  • boxes: Bounding boxes around the object.
  • confidences: The confidence score that the YOLO model assigns to a detection. We set a minimum probability score of 0.5 to filter out weak detections; lower confidence values indicate that the object might not be what the network thinks it is.
  • class_ids: The labels of the detected object classes.

Next, we loop over each layer output and then over each detection in the output. We extract the confidence scores for all object classes, which start at index 5 of the detection vector (the first four values are the box coordinates and the fifth is the objectness score), and select the class id with the maximum confidence score. If the detected class id matches the desired object (i.e. cell phone in this example) and the confidence score is greater than the threshold (to filter out weak detections), we visualize the detection by drawing a bounding box and adding the object’s label.

The YOLO model returns the center (x, y) coordinates of the bounding box, followed by the box’s width and height. Before we can actually use them, we first need to scale these values relative to the size of the image. After scaling, we use the center coordinates, width, and height of the bounding box to derive its top-left corner, then update the boxes, confidences, and class_ids lists, as shown in the sketch below.
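
Continuing the imgRead sketch, the detection loop might look like the following (the 0.5 confidence threshold is the one mentioned above; everything below continues the function body):

```python
    # Lists to store the detection results
    boxes, confidences, class_ids = [], [], []

    for output in layer_outputs:
        for detection in output:
            # The first 4 values are the box (cx, cy, w, h), the 5th is
            # the objectness score; the class scores start at index 5
            scores = detection[5:]
            class_id = int(np.argmax(scores))
            confidence = float(scores[class_id])

            # Keep only the desired object above the confidence threshold
            if classes[class_id] == target_class and confidence > 0.5:
                # Scale the box back to the original image size
                box = detection[0:4] * np.array([width, height, width, height])
                center_x, center_y, w, h = box.astype(int)

                # Derive the top-left corner from the center coordinates
                x = int(center_x - w / 2)
                y = int(center_y - h / 2)

                boxes.append([x, y, int(w), int(h)])
                confidences.append(confidence)
                class_ids.append(class_id)
```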

YOLO does not apply non-maxima suppression by default, so we need to apply it explicitly using the cv2.dnn.NMSBoxes function. This function suppresses significantly overlapping bounding boxes, keeping only the most confident ones and excluding redundant boxes. The input parameters are the boxes, the confidences, the confidence threshold (i.e. 0.5), and the NMS threshold.
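
Still inside imgRead, the call might look like this; note that the 0.4 NMS overlap threshold is an assumed value, as the article only fixes the 0.5 confidence threshold:

```python
    # Suppress overlapping boxes: 0.5 is the confidence threshold,
    # 0.4 the (assumed) NMS overlap threshold
    idxs = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
```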

Assuming the intended object (i.e. cell phone in this example) has been detected, we loop over the indexes kept by non-maxima suppression to draw the bounding box and label text on the image using random class colors. Finally, we display the resulting image.
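
Closing the imgRead sketch, the drawing step might look like this (generating one random color per class on every call is a simplification; in practice you would generate the colors once, outside the function):

```python
    # One random color per class, to match the "random class colors" above
    colors = np.random.randint(0, 255, size=(len(classes), 3))

    if len(idxs) > 0:
        for i in np.array(idxs).flatten():
            x, y, w, h = boxes[i]
            color = [int(c) for c in colors[class_ids[i]]]

            # Draw the bounding box and the class label with its confidence
            cv2.rectangle(img, (x, y), (x + w, y + h), color, 2)
            label = f"{classes[class_ids[i]]}: {confidences[i]:.2f}"
            cv2.putText(img, label, (x, y - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
    return img
```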

Capture video from the camera and identify the intended object

As the objective of this project is to identify a specific object in a camera feed, we first need to capture the live stream from the camera (webcam). To do so, we create an object of the VideoCapture class from the OpenCV library. As input, the VideoCapture class receives the index of the device we want to use. If we have a single camera connected to the computer, we can pass a value of 0.

After this, inside a while loop, we start reading the video from the camera frame by frame using the read method of the VideoCapture object. This method takes no arguments and returns a tuple: the first value is a Boolean indicating whether a frame was read correctly (True) or not (False), and the second is the frame itself. Next, we pass each frame to our object detection function imgRead. If the intended object (i.e. a cell phone in this example) appears in the camera feed, a bounding box along with the object’s label will be drawn around it (something like the image shown below). The cv2.imshow() method displays the video or image in a window.

Now, if we want to close the camera feed, we can simply press the key q on the keyboard.
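
Putting it all together, a minimal capture loop might look like this:

```python
# Open the default webcam (device index 0)
cap = cv2.VideoCapture(0)

while True:
    # read() returns (success_flag, frame)
    ret, frame = cap.read()
    if not ret:
        break

    # Detect the target object and draw the results on the frame
    frame = imgRead(frame)
    cv2.imshow("Object detection", frame)

    # Press 'q' to close the camera feed
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
```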

References:

  1. https://pyimagesearch.com/2018/11/12/yolo-object-detection-with-opencv/
  2. https://opencv-tutorial.readthedocs.io/en/latest/yolo/yolo.html
