Object Detection With Deep Learning For Computer Vision

Aishwarya Deshmane
Modali Consulting
Jul 20, 2023

This article gives you a basic knowledge of computer vision, with extra focus on object detection and how to build a YOLO model for it. Everything is explained in layman's terms and kept as simple as possible.

The article is divided into 3 sections:

I. Basics of computer vision

II. Basics of YOLO model

III. YOLO model implementation with chess pieces

SECTION I : BASICS OF COMPUTER VISION

Computer vision is all about passing an image/image dataset/video to a neural network and getting back a classified, detected or segmented image/video. It tries to classify and localize the single object or multiple objects you are interested in.

Say you have images, and every image has a single object to be identified. Every input image gives only one output; this is called multi-class image classification. Now say you have images with multiple objects in them to be identified, and every input image gives you multiple class outputs; this is called multi-label image classification.

In this section, let’s learn about:

1. Image classification vs Object detection vs Image segmentation

2. Bounding box representation

3. Evaluation metrics

1. Image classification, Object detection and Image segmentation:

Image classification — What object is in the image. This will give class labels as the outputs

Object detection — What object it is and where it is in the image. While detecting the objects, it localizes the object. This will give class labels plus the bounding box coordinates with probability scores as the output

Image segmentation — What object it is and, pixel-wise, where it is in the image. Here the model goes through the image pixel by pixel and classifies + segments the object. It is always good to have a restricted or finite set of labels here

Basically, the model goes through every image that has been passed to it in small segments, tries to identify all the objects in the image, plots bounding boxes around the objects and detects the objects from them.

2. Bounding box representation:

A bounding box is nothing but the XY coordinates of the object(s) of interest in a particular image.

In computer vision, image coordinates most of the time start from the top left corner, as shown in the image. It is always important to understand that every model follows a different format to represent its bounding boxes.


There are many huge datasets out there like Pascal VOC, COCO and ImageNet, with thousands or millions of images, and models are usually trained on these huge datasets to increase model accuracy. Some datasets and models use normalized bounding box values while others use raw pixel values. To get an appropriate result, paying attention to this is a must.

Let's consider the image shown here. It has a bounding box for the object (a cat) given by the coordinates of the top left and bottom right corners, i.e. the (Xmin, Ymin, Xmax, Ymax) format.

Below are a few examples of how the bounding box format changes with the dataset or model used (a conversion sketch follows the list)[2]:

  • The Pascal VOC dataset and the Fast.ai library use the (Xmin, Ymin, Xmax, Ymax) format for bounding boxes. The bounding box for the cat would be (98, 345, 420, 462)
  • Albumentations uses normalized (Xmin, Ymin, Xmax, Ymax) format. To get the normalized values, Xmin and Xmax should be divided by the width of the image, Ymin and Ymax with the height. So, the bounding box here would be (98/640, 345/480, 420/640, 462/480) which are (0.153125, 0.71875, 0.65625, 0.9625)
  • COCO dataset uses (Xmin, Ymin, width, height) format and it means width and height of the object (width = Xmax-Xmin, height = Ymax-Ymin), which looks like (98, 345, 322, 117)
  • YOLO represents its bounding box in normalized (Xcenter, Ycenter, width, height) format. When we say normalized, the values should always lie between 0 and 1. (((420+98)/2)/640, ((462+345)/2)/480, 322/640, 117/480) which are (0.4046875, 0.840625, 0.503125, 0.24375). Because of YOLO’s bounding box format, resizing the image would not affect the detection.
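To make these formats concrete, here is a minimal sketch in Python (the helper names are mine, not from any library) that converts a Pascal VOC style box into the other three formats, using the 640x480 image size and cat box from the example above:

```python
def voc_to_albumentations(box, img_w, img_h):
    """(xmin, ymin, xmax, ymax) in pixels -> normalized (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box
    return (xmin / img_w, ymin / img_h, xmax / img_w, ymax / img_h)

def voc_to_coco(box):
    """(xmin, ymin, xmax, ymax) in pixels -> (xmin, ymin, width, height) in pixels."""
    xmin, ymin, xmax, ymax = box
    return (xmin, ymin, xmax - xmin, ymax - ymin)

def voc_to_yolo(box, img_w, img_h):
    """(xmin, ymin, xmax, ymax) in pixels -> normalized (xcenter, ycenter, width, height)."""
    xmin, ymin, xmax, ymax = box
    return (
        ((xmin + xmax) / 2) / img_w,
        ((ymin + ymax) / 2) / img_h,
        (xmax - xmin) / img_w,
        (ymax - ymin) / img_h,
    )

cat_box = (98, 345, 420, 462)          # Pascal VOC format, 640x480 image
print(voc_to_yolo(cat_box, 640, 480))  # (0.4046875, 0.840625, 0.503125, 0.24375)
```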

3. Evaluation metrics:

In tasks like this, scores like Mean Average Precision (mAP), precision, recall and F1 matter a lot, while plain accuracy is less important.

Precision focuses on how accurate the predictions are, i.e. the percentage of predicted positives that are correct: Precision = TP / (TP + FP). Recall measures how well one finds all the actual positives: Recall = TP / (TP + FN).

Here TP + FP = predicted positives and TP + FN = actual positives.

Average Precision (AP) and mean Average Precision (mAP) show how accurate the detections within the bounding boxes are. mAP is a very popular metric for measuring how strong and robust a model is, and it is considered a very good metric for comparing different models. It is based on the Precision-Recall (PR) curve: precision is plotted against recall for different confidence thresholds and the area under the PR curve (AUC) is computed, which makes the assessment more balanced. Average Precision is calculated for each label separately, and the average over all the categories is then taken, hence the name mAP.

The formula for AP, where n indexes the thresholds at which the PR curve is sampled, is AP = Σn (Rn − Rn−1) · Pn. mAP can then be calculated by summing the per-class APs and dividing by the total number of classes [15]
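As an illustration, here is a minimal Python sketch (not the exact procedure of any particular benchmark) that computes AP as the area under a precision-recall curve, assuming the recall values are already sorted in increasing order:

```python
import numpy as np

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (all-point interpolation)."""
    # pad the curve at recall 0 and 1
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # make precision monotonically non-increasing from right to left (standard smoothing)
    for i in range(p.size - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # integrate precision over recall
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

# hypothetical precision-recall points for one class
print(average_precision(np.array([0.2, 0.4, 0.8]), np.array([1.0, 0.9, 0.6])))  # 0.62
```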

Intersection over Union (IoU):

How can we decide whether a predicted bounding box is a good outcome (or a bad one)? How accurate or good is the bounding box? This is where Intersection over Union comes into the picture. It calculates the ratio of the intersection to the union of the actual bounding box (red box in the image below) and the predicted bounding box (blue box). Let's consider the actual and predicted bounding boxes for a car (big objects are easier to explain with) as shown below[3]:


If the IoU is greater than 0.5, one can say that the detection is good enough [14]. The IoU threshold can be changed from this default value. The greater the IoU threshold, the smaller the mAP, but the stricter and more accurate the accepted detections.
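Here is a minimal Python sketch of the IoU computation for two boxes in (Xmin, Ymin, Xmax, Ymax) pixel format; the predicted box below is made up for illustration:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (xmin, ymin, xmax, ymax) format."""
    # coordinates of the intersection rectangle
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])

    inter = max(0, x2 - x1) * max(0, y2 - y1)  # 0 if the boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0

# ground-truth cat box vs a hypothetical prediction
print(iou((98, 345, 420, 462), (110, 350, 430, 470)))  # ~0.84, a good detection (> 0.5)
```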

Usually, the mAP scores are represented as 90% mAP50, mAP@0.5 or mAP (IoU=0.5), which can be read as 90% mAP at an IoU threshold of 0.5.

Non-max suppression:

An object might be detected multiple times instead of just once, which is a very common problem in object detection. This is where non-max suppression comes into the picture: it filters out the redundant boxes. Let's consider the example below[3]:


Here, the cars are identified more than once, and non-max suppression cleans up this mess. It helps provide just a single detection per object and gets rid of the redundancy. It looks at the probabilities associated with the detections, keeps the one with the largest probability and suppresses the others that overlap it heavily.
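A minimal sketch of greedy non-max suppression, reusing the iou() helper from the sketch above; the 0.5 overlap threshold is just a common default:

```python
def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # suppress remaining boxes that overlap the kept box above the threshold
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep  # indices of the detections to keep
```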

We now have enough of a basic understanding to run the YOLO model!!

SECTION II : BASICS OF YOLO MODEL

YOLO is a deep learning model. Traditionally, detection models use two passes through the neural network, looking at the image twice: once for the classification and once to get the bounding box coordinates. YOLO stands for You Only Look Once, which means the model passes the image through the neural network and looks at it only once. In one go you get the labels and the bounding box coordinates, with no need for a second pass. This reduces the computation time enormously and increases speed and accuracy, which is why YOLO, a state-of-the-art (SOTA) model, became very popular. State-of-the-art, or SOTA, refers to a model that is the best among its peers at performing a given task.

This article implements the YOLOv8 model out of its many versions, though the concepts would remain the same.

This section contains:

1. Bounding box representation in YOLO

2. Concepts YOLO uses to increase its accuracy

3. Data saving format

4. Models and Sub-models

1. Bounding box format:

As discussed above, YOLO has its own format to represent bounding boxes: the normalized 'Xcenter Ycenter Width Height'. Since it takes all these values in normalized form, resizing the image does not affect the detection.

2. Concepts YOLO uses to increase its accuracy:

YOLO uses the Intersection over Union (IoU) and non-max suppression (NMS) concepts discussed earlier, which are very important and powerful.

3. Saving the data:

YOLO needs a yaml file and dataset folders. The yaml file contains the paths to the train, valid and test folders along with the total number of classes (nc) and the names of the classes (names), for example as sketched below.
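An illustrative data.yaml might look like this; the paths and class names are placeholders and should be adjusted to your own dataset (the chess dataset used later in this article has nc: 12):

```yaml
# illustrative data.yaml; adjust paths, nc and names to your own dataset
train: ../train/images
val: ../valid/images
test: ../test/images

nc: 2
names: ['cat', 'dog']
```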

YOLO has a particular way of handling its files and folders and taking inputs from them. The file structure has to be organized in a particular manner, and one should stick to it. The format for saving the dataset with images and labels is:

Train → “images”, “labels” folders

Valid → “images”, “labels” folders

Test → “images”, “labels” folders

It takes all the images from the “images” folder and the labels, as .txt files, from the “labels” folder. One just needs to mention the image folder path. Using these exact folder names is mandatory, because YOLO assumes the “labels” folder sits next to the “images” folder and looks for it by that name. It needs a separate text file for each image, with the same name as the image file. The text file must contain:

“label bounding_box” → “1 Xcenter Ycenter Width Height”

one line per object in that particular image, where 1 is the class label. The paths to the train, validation and test image data and the class details go into the yaml file mentioned above.
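For instance, a hypothetical label file (say train/labels/cat_001.txt, using the cat box computed earlier as class 1 plus a second made-up object of class 3) would contain:

```
1 0.4046875 0.840625 0.503125 0.24375
3 0.62 0.33 0.10 0.15
```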

4. Models and Sub-models in YOLO:

There are 3 types of models in YOLO: classify, detect and segment. Each type comes in 5 different sizes, from smallest to largest: nano (n), small (s), medium (m), large (l) and extra-large (x)[6].

The bigger the model, the slower it is to train, but the better the performance. The model can be chosen based on its number of parameters.

SECTION III: YOLO MODEL IMPLEMENTATION FOR OBJECT DETECTION

For ease of understanding, let's build a model for multi-label object detection of chess pieces; the dataset is downloaded from Roboflow[5]. The train set contains 606 images, the valid set 58 and the test set 29, with 12 labels/classes.

Use this GitHub link to find the code.

All you need to do is install the ultralytics library and use a few lines of code; it's that simple to train the YOLO model.

The YOLO model here is trained starting from the pre-trained model "yolov8m.pt", where m stands for the medium version of YOLOv8 and .pt indicates a pre-trained PyTorch weights file.
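A minimal training sketch using the ultralytics API; the epoch count and image size below are illustrative, and data.yaml is assumed to describe the chess dataset:

```python
# pip install ultralytics
from ultralytics import YOLO

# load the pre-trained medium model and fine-tune it on the chess dataset
model = YOLO("yolov8m.pt")
results = model.train(data="data.yaml", epochs=50, imgsz=640)
```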

  • Results:

Look at the results above; this looks like a perfect output for multi-label object detection. Once the model is trained, it detects the objects in an image and provides the bounding box and the probability/confidence for each of them.
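Running inference with the trained model is just as short; a sketch reusing the model object from the training snippet above (the image path is hypothetical):

```python
# run detection on a new image and print box coordinates, confidences and class ids
results = model("test/images/chess_board.jpg")
for box in results[0].boxes:
    print(box.xyxy, box.conf, box.cls)
```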

This is what fine-tuning the model over the epochs looks like:

The results are saved to runs/detect/train. The resulting .pt files contain the weights of the model we have trained and can be reused on the same kind of dataset later. If you wish to stop the training, just keep the .pt file from the last epoch and use it to resume training later; isn't that great!!
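A sketch of resuming an interrupted run, assuming the default save location used by ultralytics:

```python
from ultralytics import YOLO

# load the last checkpoint of the interrupted run and continue training
model = YOLO("runs/detect/train/weights/last.pt")
model.train(resume=True)
```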

Training the model is quite easy, but understanding and interpreting the training output is very important:

Epoch: Indicates current epoch out of total number of epochs.

GPU_mem: This shows GPU memory usage during that epoch.

box_loss: How well the model is predicting the coordinates of the bounding boxes. The lower the value, the better the performance.

cls_loss: How well the model is classifying objects within the bounding boxes. The lower the value, the better the performance.

dfl_loss: This loss increases when the model struggles with hard or under-represented examples. Say a dataset has 10 images with 100 houses and only 5 trees; whenever the model tries to detect a tree, the dfl_loss tends to increase because the dataset is imbalanced and the model is not well trained to detect trees. In technical words, this is the Distribution Focal Loss used for the bounding box regression.

Instances: This indicates the total number of objects/bounding boxes the model processed in that epoch.

Size: This refers to the size or the image resolution of the input images

The progress bar provides a visual representation of the completion percentage of the epoch. Each iteration processes a batch of data, and the progress bar shows the progress made so far.

The next section displays the metrics for the model’s performance on the validation dataset. It provides an evaluation of how well the model is performing on unseen data:

Class: Specifies the class of objects being detected. In this case, it is set to “all” to represent all classes combined.

Images: Represents the number of images in the validation dataset. In the given example, there are 58 images in the validation set.

Instances: Indicates the total number of instances (bounding boxes) in the validation dataset. In this case, there are 386 instances.

Box(P): Represents the precision (P) for bounding box predictions. Precision measures the proportion of correctly predicted bounding boxes out of all predicted bounding boxes.

R: Refers to the recall (R) for bounding box predictions. Recall measures the proportion of correctly predicted bounding boxes out of all ground-truth bounding boxes.

mAP50: Represents the mean average precision (mAP) at an IoU (Intersection over Union) threshold of 0.5. mAP is a commonly used metric to evaluate object detection models; it summarizes both precision and recall over the whole precision-recall curve at that IoU threshold.

mAP50–95: Represents the mean average precision (mAP) across IoU thresholds from 0.5 to 0.95. This provides a broader evaluation of the model’s performance at various levels of IoU.
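The same validation metrics can also be read programmatically after training; a sketch using the ultralytics validation API with the model object from above:

```python
# evaluate the trained model on the validation split defined in data.yaml
metrics = model.val()
print(metrics.box.map50)  # mAP at IoU=0.5
print(metrics.box.map)    # mAP averaged over IoU=0.5:0.95
```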

At the end of the training, we could get a mAP50 of 98.7%, which is very satisfying. We can also see the instances, precision, recall and mAP scores for the whole validation set and for every label separately.

This model is easy to run, its speed is very good, and its accuracy is very high because the architecture behind it is great.

Hope I could help you get started with object detection using YOLO.

CITATIONS:

1. https://docs.ultralytics.com

2. https://albumentations.ai/docs/getting_started/bounding_boxes_augmentation/

3. https://www.analyticsvidhya.com/blog/2018/12/practical-guide-object-detection-yolo-framewor-python/

4. https://hasty.ai/docs/mp-wiki/metrics/average-precision

5. https://roboflow.com/

6. https://www.freecodecamp.org/news/how-to-detect-objects-in-images-using-yolov8/

7. GitHub link to my repo
