Introduction to Deep Learning Object Detection.

B.Thushar Marvel
11 min read · Apr 10, 2022


Object detection is the task of locating the objects present in an image and classifying the objects found at those locations. The input to an object detection model is a set of images with annotation files (bounding-box coordinates and class labels). The output of a trained object detection model is the image enclosed with rectangular boxes, each with a class label and a confidence score. Animal detection, plant disease detection, fruit counting and face detection are some examples of object detection tasks.

Examples (models): R-CNN, YOLO, SSD, etc.

Source: blog by Gaurav Sinha | Towards Data Science

Basic concept of object detection

A simple way of doing object detection with a CNN is to split the image into multiple segments and run a classification model over each segment. After classifying the objects in each segment, we merge all segments back to obtain the original image with the detected objects in it.

Object detection

The approach discussed above is a simple object detection process, but it has several problems: handling different image aspect ratios and object sizes, the computation time, the number of segments to evaluate, and so on.
To overcome these issues, many optimized approaches have been developed in recent years. Some of the well-known models are Faster R-CNN, SSD and YOLO.

Object detectors are generally categorized into two types:

  1. Single-stage object detectors: the objects within the image are detected in a single pass (e.g. YOLO and SSD).
  2. Two-stage object detectors: the detection is done in a two-stage process (e.g. R-CNN and Faster R-CNN).

What are base models?

Base models are CNN-based architectures consisting of a combination of convolutional and pooling layers, used to extract only the relevant features from the input image.
A base model extracts the important information from an image by passing filters over it. It ignores unimportant information and thus reduces the size of the feature vector.

How is it useful in classification and object detection tasks?

Instead of working with the whole image, what if we had an image containing only the required information? It would be quicker for models to detect the objects with high accuracy, right? Yes: base models create a feature map (a featured image) from the input image, and this feature map is then fed to the state-of-the-art detection models for better detection.
Below are some of the commonly used base models.

MobileNet

MobileNet is a deep neural network architecture consisting of a combination of convolutional and pooling layers. The purpose of the base model is to extract features from the image. It is a lightweight CNN architecture developed by Google researchers to make applications on mobile devices more efficient. It uses the concept of depthwise separable convolutions. In a normal convolution, each kernel is applied across the full depth of the input to that layer. MobileNet instead applies the convolution in two stages: in the first stage, a depthwise filter is applied spatially to each input channel, and in the second stage, a pointwise (1x1) convolution combines the results across channels.
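To make the two stages concrete, here is a minimal PyTorch sketch of a depthwise separable convolution block. The channel counts and the BatchNorm/ReLU placement are illustrative assumptions, not the exact MobileNet configuration:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Illustrative MobileNet-style block: a depthwise 3x3 convolution
    followed by a pointwise 1x1 convolution (sizes are example values)."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # Stage 1: one 3x3 filter per input channel (groups=in_channels)
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)
        # Stage 2: 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        x = self.relu(self.bn2(self.pointwise(x)))
        return x

# Example: a 32-channel feature map mapped to 64 channels
block = DepthwiseSeparableConv(32, 64)
out = block(torch.randn(1, 32, 112, 112))
print(out.shape)  # torch.Size([1, 64, 112, 112])
```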

Source: MobileNet

ResNet

ResNet, short for Residual Network, is also a deep neural network architecture built from convolutional operations and is widely used as a feature extractor. The ResNet architecture introduced skip connections within the model, on top of its multiple convolution and pooling operations. Skip connections help improve the learning of the architecture and overcome the problem of vanishing gradients during backpropagation. ResNet is most popularly used for image classification and object detection tasks. In ResNet-50 the total number of layers is 50, and similarly for ResNet-101 and so on. Skip connections also help the network capture global features.
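The skip connection itself is simple to express in code. Below is a minimal PyTorch sketch of a residual block; the channel counts and the two-convolution layout are illustrative and do not reproduce the exact ResNet-50 bottleneck design:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified ResNet-style block: the input is added back to the
    output of two convolutions (the 'skip connection')."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                        # saved for the skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                # gradients flow straight through this addition
        return self.relu(out)

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```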

Image source: ResNet-50 by jeremyjordan

Darknet

Darknet is a CNN-based framework used to extract the important features for the YOLO model. Darknet is also a combination of convolution and pooling layers. The name Darknet-53 refers to the total number of convolutional layers it has. The figure shows the model architecture: it uses 1x1 and 3x3 convolutional operations with skip connections.

Source: Darknet-53

YOLO

YOLO is an object detection model that uses convolutional neural networks (CNNs). YOLO was introduced by Joseph Redmon and Ali Farhadi, who later developed an updated model, YOLOv3 (2018), which detects objects at three different stages of the network. The algorithm is designed so that it performs localization and classification of the objects in only one pass, hence the name You Only Look Once. YOLOv3 uses 3x3 and 1x1 convolutional layers, and its backbone has 53 convolutional layers. Each convolutional layer is followed by batch normalization and a Leaky ReLU activation.

YOLO architecture

During training of the YOLO algorithm, the basic concept is that the input image is split into a number of grid cells, say an M × M grid. Each grid cell is then associated with a predefined set of anchor boxes (default boxes with different aspect ratios) that fit over the grid. If an object is present in a grid cell, these anchor boxes predict the output bounding box along with its confidence score.

As shown in the figure below, every grid cell predicts an objectness score together with its class and location. As a result, each object gets multiple bounding boxes with different confidence scores. The technique used to remove the redundant bounding boxes around an object is called non-maximum suppression (NMS): it keeps only the box with the highest confidence score and discards the remaining overlapping ones.
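A minimal NumPy sketch of greedy non-maximum suppression is shown below. The box format [x1, y1, x2, y2] and the 0.5 IoU threshold are assumptions for illustration:

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS sketch. boxes: (N, 4) array of [x1, y1, x2, y2],
    scores: (N,) confidence scores. Returns indices of boxes to keep."""
    order = scores.argsort()[::-1]          # highest confidence first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        # IoU of the kept box with all remaining boxes
        x1 = np.maximum(boxes[best, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[best, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[best, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[best, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                    (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_best + area_rest - inter)
        # drop boxes that overlap the kept box too much
        order = order[1:][iou < iou_threshold]
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 150, 150]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(non_max_suppression(boxes, scores))   # [0, 2]: the second box is suppressed
```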

During training, the model refines the anchor-box sizes and its predictions (the box parameters) so that the loss function is minimized.

The figure above shows how objects are detected using anchor boxes in YOLO.

In the figure, the input image is split into 9 grid cells (a 3 × 3 grid). Each grid cell is associated with a set of anchor boxes; in this example we use 2 boxes. Each box predicts a bounding box and its class confidences, and the box with the higher score in each grid cell is kept for further processing. The length of the output vector depends on the number of anchor boxes and the number of classes. Here, for explanation purposes, we use 3 classes and 2 boxes, so the final output vector consists of 16 values.

Here pc is the objectness score, bx, by, bh, bw are the bounding-box coordinates, and c1, c2 and c3 are the class scores.
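As a quick sanity check, the length of that output vector can be computed directly; the values B = 2 and C = 3 below are simply the ones used in this example:

```python
# Output vector length per grid cell, as used in the example above:
# each of the B anchor boxes predicts objectness (pc), 4 box values, and C class scores.
B, C = 2, 3                      # 2 anchor boxes, 3 classes (illustrative values)
values_per_box = 1 + 4 + C       # pc + (bx, by, bh, bw) + (c1, c2, c3) = 8
output_length = B * values_per_box
print(output_length)             # 16, matching the figure
```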

As shown in the image above, the YOLOv3 model is designed to extract 3 feature maps at different scales. These 3 feature maps are fed to the YOLOv3 detection heads, and this 3-stage detection is used to detect small, medium and large objects separately. YOLOv3 detects multiple objects, predicts their classes and identifies their locations, and thanks to this architectural design it detects small objects in the image very well.

The base model used by the authors of YOLOv3 is the Darknet-53 framework, which contributes 53 convolutional layers. The detection head adds another 53 layers on top of it, giving 106 convolutional layers in total.

Single Shot MultiBox Detector (SSD)

The SSD framework was first introduced by Liu et al. (2015). It is a deep learning algorithm that uses a CNN for object localization and classification. Similar to YOLO, this architecture is designed to take an image and process it in a single pass. It is faster than Faster R-CNN, so the model can be used for real-time detection. Like other object detection models, SSD also requires a base model to extract the required features from the images; a lightweight MobileNet architecture is used in this task.

In SSD, the image is passed through a single deep convolutional neural network, and convolutional feature maps are created at different scales, as shown in the figure above. These feature maps are used to detect objects of different sizes: larger objects are detected by the deeper layers, while smaller objects are located and classified by the earlier layers.

The term Single Shot implies that both localization and classification are performed in a single pass of the image through the network. As the image propagates through the SSD layers, feature maps are extracted at different scales, and a set of default boxes is fitted over each of these feature maps. The official research paper uses 6 feature maps of different resolutions. These default boxes are similar to the anchor boxes in YOLO.

During training, the default boxes are refined based on the loss function, so that the loss is minimized and the default boxes overlap the actual bounding boxes in the training dataset as closely as possible.

During the training of an SSD network, images with their ground-truth bounding boxes are provided to the network, and the predefined default boxes are matched against these ground-truth boxes. The default boxes have different aspect ratios and sizes, as shown.

The number of default boxes can differ from layer to layer: the earlier layers have more default boxes, while the later layers have fewer. For each default box, the method predicts the shape offsets and the class probability scores.
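As an illustration of how such default boxes can be generated, here is a small sketch in the spirit of the SSD paper, where a box of scale s and aspect ratio a has width s·sqrt(a) and height s/sqrt(a). The particular scale and aspect-ratio values are assumptions for illustration:

```python
import numpy as np

def default_boxes_for_cell(cx, cy, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Illustrative SSD-style default boxes centred on one feature-map cell.
    cx, cy and scale are in normalized [0, 1] image coordinates."""
    boxes = []
    for ar in aspect_ratios:
        w = scale * np.sqrt(ar)    # wider boxes for ar > 1
        h = scale / np.sqrt(ar)    # taller boxes for ar < 1
        boxes.append((cx, cy, w, h))
    return boxes

# e.g. the centre cell of a feature map, with boxes covering 20% of the image
print(default_boxes_for_cell(cx=0.5, cy=0.5, scale=0.2))
```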

The overall loss function of SSD depends on both localization and classification loss

Total loss

Classification loss

Localization loss (regression loss)

Here x_ij^p is an indicator that the i-th default box is matched to the j-th ground-truth box of class p, and N is the total number of matched default boxes. The ground-truth box parameters are denoted by g, the predicted box parameters by l, and the class confidences by c.
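Since the rendered equations are not reproduced above, here is a reconstruction of the SSD loss from Liu et al. (2015), using the symbols just defined (N matched boxes, indicator x, confidences c, predicted boxes l, ground-truth boxes g):

```latex
% SSD loss, reconstructed from Liu et al.; not a reproduction of the original images.
L(x, c, l, g) = \frac{1}{N}\Big(L_{conf}(x, c) + \alpha\, L_{loc}(x, l, g)\Big)

L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p}\,\log\big(\hat{c}_i^{p}\big)
                 -\sum_{i \in Neg}\log\big(\hat{c}_i^{0}\big),
\qquad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}

L_{loc}(x, l, g) = \sum_{i \in Pos}^{N}\ \sum_{m \in \{cx,\, cy,\, w,\, h\}}
                   x_{ij}^{k}\,\mathrm{smooth}_{L1}\big(l_i^{m} - \hat{g}_j^{m}\big)
```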

Faster R-CNN

Faster R-CNN is a deep learning based object detection architecture, first introduced by Ren et al. (2015). This model uses convolutional neural networks for localization and classification, and it applies a two-step process to localize and classify objects. Faster R-CNN uses a Region Proposal Network (RPN), a network that localizes objects in an image irrespective of their class. The feature map from the feature extractor is fed to the RPN, which in return gives object locations and their confidence scores.

The RPN takes the image features as input and generates rectangular regions within the image known as object proposals or region proposals. After feature extraction, a small network slides over the feature map; at each sliding-window position, a set of anchor boxes is responsible for generating the object proposals.

Anchors are simply a set of reference boxes that indicate possible objects. As shown in the figure below, multiple anchors with different aspect ratios are placed at each window position. The network moves across the feature map and, for each of the K anchors, refines the coordinates to generate Regions of Interest (object proposals).

The convolutional feature map generated by the base model becomes the input of the RPN. A sliding window of size n × n (3 × 3 in the figure) is applied over this feature map in a convolutional manner, and a fixed number of default boxes (anchors) is created for each position of this sliding window. After this, two parallel convolutional layers with 1×1 kernels are applied:

1) Regression layer

2) Classification layer.

For the regression layer, the output depth is 4K, where K is the number of anchor boxes per position, since each box has four parameters [cx, cy, height, width]. For the classification layer, the depth is 2K, as each box has two scores, one for foreground and one for background.
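A minimal PyTorch sketch of this RPN head is given below; the input channel count (512) and K = 9 anchors per position are illustrative values, not fixed by this article:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of the RPN head described above: an n x n (here 3x3) sliding
    convolution followed by two parallel 1x1 convolutions, one predicting
    4K box offsets and one predicting 2K objectness scores (K anchors per position)."""
    def __init__(self, in_channels=512, K=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.reg = nn.Conv2d(in_channels, 4 * K, kernel_size=1)   # regression layer
        self.cls = nn.Conv2d(in_channels, 2 * K, kernel_size=1)   # classification layer
        self.relu = nn.ReLU(inplace=True)

    def forward(self, feature_map):
        x = self.relu(self.conv(feature_map))
        return self.reg(x), self.cls(x)

fmap = torch.randn(1, 512, 40, 40)           # feature map from the base model
reg_out, cls_out = RPNHead()(fmap)
print(reg_out.shape, cls_out.shape)          # (1, 36, 40, 40) and (1, 18, 40, 40)
```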

The total loss is the combination of classification loss and regression loss.

The equation above is summed over the set of anchors i used during training. Lcls(pi, pi*) denotes the classification loss.

pi is the predicted probability that anchor i contains an object, and pi* is the ground-truth label (0 for a negative anchor, 1 for a positive anchor).

Lreg(ti, ti*) denotes the regression loss; it is calculated only for positive anchors, i.e. only when the anchor contains an object.

[tx, ty, tw, th] — predicted bounding-box offsets

[tx*, ty*, tw*, th*] — ground-truth bounding-box offsets

Here x and y are the center coordinates of a box, h is its height and w is its width. xa refers to the anchor box, and x* to the ground-truth box (and likewise for y, w and h).
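Since the rendered equations are not shown above, here is a reconstruction of the RPN loss and the box parameterization from the Faster R-CNN paper (Ren et al., 2015), using the symbols just defined:

```latex
% RPN loss and box parameterization, reconstructed from Ren et al. (2015).
L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)
                    + \lambda \frac{1}{N_{reg}} \sum_i p_i^*\, L_{reg}(t_i, t_i^*)

t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad
t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a}

t_x^* = \frac{x^* - x_a}{w_a}, \quad t_y^* = \frac{y^* - y_a}{h_a}, \quad
t_w^* = \log\frac{w^*}{w_a}, \quad t_h^* = \log\frac{h^*}{h_a}
```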

The figure below gives a complete description of the detection process. The steps are as follows:

● The input image (e.g. 640x640x3) is fed to the feature extractor (base model)

● The resulting feature map is fed to the RPN

● The RPN uses a sliding window to generate object proposals and their box coordinates

● These proposals are fed to an ROI pooling layer, where the proposed regions are cropped and resized

● Finally, the cropped features are fed to the classifier and regressor to predict each object's class and location.
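For readers who want to try this pipeline without building it from scratch, torchvision ships a pretrained Faster R-CNN. The sketch below assumes torchvision is installed and uses a random tensor in place of a real image:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Faster R-CNN with a ResNet-50 FPN backbone, pretrained on COCO.
# (In recent torchvision versions the argument is weights="DEFAULT" instead of pretrained=True.)
model = fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# A dummy 640x640 RGB image with values in [0, 1]; in practice, load and convert a real image.
image = torch.rand(3, 640, 640)

with torch.no_grad():
    predictions = model([image])     # a list with one dict per input image

# Each prediction dict contains bounding boxes, class labels and confidence scores.
print(predictions[0]["boxes"].shape)
print(predictions[0]["labels"][:5])
print(predictions[0]["scores"][:5])
```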

Evaluation metrics for object detection

Intersection Over Union (IOU)

IoU is an evaluation metric commonly used in object detection to measure the overlap between a predicted box and the actual ground-truth bounding box. Object detection algorithms usually use this metric for both training and testing. It also helps in eliminating multiple boxes over a single object, based on the confidence score and the overlap with the ground-truth bounding box.

This metric is also used when calculating the precision of the model.
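A minimal Python implementation of IoU for two boxes in [x1, y1, x2, y2] format might look like this; the corner-coordinate format is an assumption, and some libraries use [x, y, w, h] instead:

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as [x1, y1, x2, y2]."""
    # intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Perfect overlap gives 1.0, no overlap gives 0.0
print(iou([0, 0, 10, 10], [5, 5, 15, 15]))   # 25 / 175 ≈ 0.143
```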

Average Precision (AP) and Mean Average Precision (mAP)

Precision refers to how accurate our predictions are; for detection, a prediction counts as correct at a particular IoU decision threshold. Average Precision (AP) is the mean of the precision values over recall levels ranging from 0 to 1; equivalently, the value of AP is the area under the precision-recall curve. AP ranges from 0 to 1, where 0 indicates poor performance and 1 indicates a perfect predictive model. AP is calculated per individual class, while mAP takes the AP of all classes and averages it.

If we have multiple classes to detect, the mean of the average precision over all classes is the Mean Average Precision (mAP). mAP@0.5 means the AP is computed with an IoU threshold of at least 50%; similarly, mAP@0.75 uses an IoU threshold of 0.75. Generally mAP@0.5 is sufficient to indicate performance. For a stricter evaluation, the AP is computed at IoU thresholds ranging from 0.5 to 0.95 in steps of 0.05, and the mean of these values gives the final mAP.
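The averaging over IoU thresholds can be made concrete with a small sketch; the per-threshold mAP values below are purely hypothetical numbers used for illustration:

```python
import numpy as np

# COCO-style evaluation: average AP over IoU thresholds 0.50, 0.55, ..., 0.95
iou_thresholds = np.arange(0.5, 1.0, 0.05)

# Hypothetical per-threshold mAP values for a trained detector (illustration only)
map_per_threshold = np.linspace(0.80, 0.35, len(iou_thresholds))

print(len(iou_thresholds))            # 10 thresholds
print(map_per_threshold.mean())       # the final mAP over all thresholds
```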

Please leave a clap/follow if the content is useful.
Thank you
