Introduction to Object Detection and Overview of Deep Learning Models

Published in Embla Tech, 7 min read, Sep 21, 2022

This article explains what object detection is and gives an overview of recent deep learning algorithms for it. Object detection is essential to artificial intelligence because it lets computers perceive their surroundings by finding objects in images and video.

What is Object Detection? Why is it important?

Object detection is a technique of computer vision that helps detect and recognize things in images or footage. In particular, object detection draws bounding boxes around the objects it finds and lets us figure out where these objects are in a scene or how they move through it.

Object detection is one of the fundamental problems of computer vision. It forms the basis of many downstream computer vision tasks, such as segmentation, image captioning, and object tracking. Specific object detection applications include pedestrian detection, people counting, face detection, text detection, pose detection, and number-plate recognition.

Object Detection vs Image Classification

Image classification and object detection are easy to confuse, so it helps to spell out the difference between them.

Take a look at the pictures below:

You probably knew right away that it is a dog. Step back and think about how you reached that decision. You looked at a picture and decided which class it belonged to (a dog, in this instance). That is essentially all there is to image classification: it tells us what an image contains.

As you saw, there is only one object here: a dog. We can use an image classification model to predict that there is a dog in the given image.

But what if we have both a cat and a dog in a single image?

In that case, we can train a multi-label classifier. There is one more catch, though: the classifier tells us that both animals are present, but not where either of them is in the picture.

Image Localization comes into play at this point. Image Localization will specify the location of a single object in an image.

We use object detection when there is more than one object in the image. Object detection predicts both where each object is and what it is.
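To make the difference between these tasks concrete, here is a minimal sketch of what each one might return. The `Box` and `Detection` types, the pixel coordinates, and the scores are all illustrative assumptions, not the output format of any particular library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates (assumed convention)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass
class Detection:
    """One object: what it is, where it is, and how confident the model is."""
    label: str
    box: Box
    score: float


# Image classification: one label for the whole image.
classification_output: str = "dog"

# Multi-label classification: several labels, still no locations.
multi_label_output: List[str] = ["dog", "cat"]

# Image localization: one label plus one box.
localization_output = Detection("dog", Box(40, 60, 220, 300), 0.97)

# Object detection: one (label, box, score) entry per object in the image.
detection_output: List[Detection] = [
    Detection("dog", Box(40, 60, 220, 300), 0.97),
    Detection("cat", Box(250, 90, 400, 310), 0.91),
]

for det in detection_output:
    print(det.label, det.box, det.score)
```

The only structural change from localization to detection is that the output becomes a list, which is why detectors need a way to propose and score many candidate boxes per image.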

How Does Object Detection Work?

Object detection is not as new as it might seem, and it has changed considerably over the past 20 years. Its history can be split into two periods: before and after deep learning was adopted for detection.

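Traditional object detection period: Detectors from this period rely on hand-crafted features combined with classical machine learning classifiers rather than on features learned by deep networks, and they typically need far less training data. OpenCV is a popular tool for these image processing tasks.

Examples: Viola-Jones detector, HOG detector (HOG is a popular feature descriptor for object detection in computer vision and image processing), and DPM (Deformable Part Model).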

Deep learning detection period: Most computer vision tasks use either supervised or unsupervised learning, with supervised methods being the norm. Training and inference speed depend heavily on GPU throughput, which improves every year.

Supervised learning: The machine learns under supervision. The model learns to make predictions from a labeled dataset, that is, a dataset where the target answer is already known.
Unsupervised learning: The machine uses unlabeled data and learns on its own, without any supervision. It tries to find patterns in the unlabeled data and responds based on them.

Deep learning object detectors are generally grouped into two categories:

Single-stage object detectors: A single-stage detector removes the RoI extraction process and directly classifies and regresses the candidate anchor boxes.
Two-stage object detectors: Two-stage detectors divide the object detection task into two stages: extract RoIs (regions of interest), then classify and regress the RoIs.
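Both categories start from candidate boxes. As a rough sketch, dense anchor boxes can be generated by placing a few box shapes at every cell of a feature grid. The function name, the grid size, the stride, and the sizes and aspect ratios below are assumptions chosen for illustration, not the settings of any specific detector.

```python
from typing import List, Tuple

AnchorBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def generate_anchors(
    grid_h: int,
    grid_w: int,
    stride: int,
    sizes: Tuple[float, ...] = (32.0, 64.0),
    aspect_ratios: Tuple[float, ...] = (0.5, 1.0, 2.0),
) -> List[AnchorBox]:
    """Place len(sizes) * len(aspect_ratios) anchors at each grid cell center."""
    anchors: List[AnchorBox] = []
    for row in range(grid_h):
        for col in range(grid_w):
            # Center of this grid cell in input-image pixels.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for size in sizes:
                for ratio in aspect_ratios:
                    # ratio = width / height; the area stays at size * size.
                    w = size * ratio ** 0.5
                    h = size / ratio ** 0.5
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = generate_anchors(grid_h=4, grid_w=4, stride=16)
print(len(anchors))  # 4 * 4 * 2 * 3 = 96
```

A single-stage detector would classify and regress every one of these anchors directly, while a two-stage detector would first reduce them to a smaller set of RoIs and only then classify and refine those.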

Object Detection Model Architecture Overview

Here’s a brief summary of the model families used in object detection.

R-CNN Model Family (Region Convolutional Neural Networks)

The R-CNN Model family includes the following:

R-CNN uses selective search to locate RoIs in the input image and a deep convolutional network (DCN)-based region-wise classifier to classify each RoI independently.
SPPNet and Fast R-CNN improve on R-CNN by extracting the RoIs from the feature maps instead of running the network on every region separately, which makes them much faster than the original R-CNN architecture.
Faster R-CNN improves on Fast R-CNN by introducing an RPN (region proposal network), which allows the detector to be trained end to end. The RPN generates RoIs by regressing anchor boxes, and anchor boxes have since been widely used in object detection.
Mask R-CNN adds a mask prediction branch to Faster R-CNN, so it can detect objects and predict their masks simultaneously.
R-FCN replaces the fully connected layers with position-sensitive score maps to detect objects more accurately.
Cascade R-CNN addresses the problem of overfitting at training and quality mismatch at inference by training a sequence of detectors with increasing IoU thresholds.
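The IoU (intersection over union) thresholds mentioned for Cascade R-CNN decide which candidate boxes count as positive matches for a ground-truth box. Here is a minimal sketch of the IoU computation; the example boxes and the three thresholds are illustrative assumptions.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0.0, 0.0, 10.0, 10.0)
proposal = (2.0, 0.0, 12.0, 10.0)
overlap = iou(ground_truth, proposal)
print(round(overlap, 3))  # 0.667

# Assumed, increasingly strict thresholds for successive detector stages.
stage_thresholds = (0.5, 0.6, 0.7)
is_positive = [overlap >= t for t in stage_thresholds]
print(is_positive)  # [True, True, False]
```

The same proposal is a positive example for the first two stages but not for the third, which shows how raising the threshold makes each later stage demand better-localized boxes.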

YOLO Model Family (You Only Look Once)

The YOLO model family includes the following:

YOLO divides the input image into an S × S grid and predicts only a few boxes per grid cell, so it regresses and classifies fewer candidate boxes than dense-anchor detectors. It was built on the Darknet neural network framework.
YOLOv2 improves the performance by using more anchor boxes and a new bounding box regression method.
YOLOv3 is an enhanced version of YOLOv2 with a deeper feature extraction network and minor representational changes. Its inference is relatively fast, at roughly 30 ms per image on a GPU.
YOLOv4 (an upgrade of YOLOv3) breaks the object detection task into two parts: regression to locate objects via bounding boxes, and classification to determine each object’s class.
YOLOv5 builds on YOLOv4 and relies on mosaic augmentation, among other techniques, to improve general performance.
YOLOv6 shows impressive capabilities in small object detection in densely packed environments, and it defines the model parameters directly in Python.
YOLOv7 was the state-of-the-art object detector in the YOLO family when this article was written, and its authors report it as the fastest and most accurate real-time object detector at that time. YOLOv7 reduces the number of parameters by about 40% and computation by about 50% compared with the baseline models.
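The S × S grid used by the original YOLO can be illustrated with a short sketch that finds the grid cell containing an object's center. The function name and the normalized-coordinate convention are assumptions for illustration, and S = 7 is used only as an example value.

```python
from typing import Tuple


def responsible_cell(cx: float, cy: float, s: int) -> Tuple[int, int, float, float]:
    """Map a normalized object center (cx, cy in [0, 1]) to a grid cell.

    Returns (row, col, x_offset, y_offset), where the offsets give the
    position of the center inside the cell, measured in cell units.
    """
    # Clamp so that a center exactly on the right or bottom edge stays in range.
    col = min(int(cx * s), s - 1)
    row = min(int(cy * s), s - 1)
    x_offset = cx * s - col
    y_offset = cy * s - row
    return row, col, x_offset, y_offset


row, col, dx, dy = responsible_cell(cx=0.5, cy=0.25, s=7)
print(row, col, dx, dy)  # 1 3 0.5 0.75
```

In a grid-based detector of this kind, the cell that contains the center is the one trained to predict that object's box and class.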

CenterNet Model Family

This group covers the single-stage and keypoint-based detectors that lead up to CenterNet:

SSD places anchor boxes densely over an input image and uses features from different convolutional layers to regress and classify the anchor boxes.
DSSD introduces a deconvolution module into SSD to combine low-level and high-level features, while R-SSD uses pooling and deconvolution operations in different feature layers to combine low-level and high-level features.
RON proposes a reverse connection and an objectness prior to extract multiscale features effectively.
RefineDet refines the anchor boxes' locations and sizes twice, inheriting the merits of both one- and two-stage approaches.
CornerNet is a keypoint-based approach that directly detects an object using a pair of corners (top-left and bottom-right). Although CornerNet achieves high performance, it still has room to improve.
CenterNet explores the visual patterns within each bounding box. One model published under this name uses a triplet, rather than a pair, of keypoints to detect an object. Another, also called CenterNet, models each object as a single point by predicting the x and y coordinates of the object’s center together with its size (width and height). These approaches are reported to be competitive with, and on some benchmarks better than, detectors from the SSD and R-CNN families.
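The keypoint representations above can be converted into an ordinary bounding box with simple arithmetic. This is a minimal sketch with hypothetical helper names, not code from either model's implementation.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Point = Tuple[float, float]  # (x, y)


def box_from_center(cx: float, cy: float, width: float, height: float) -> Box:
    """Center-point style: a center plus a width and height."""
    return (cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2)


def box_from_corners(top_left: Point, bottom_right: Point) -> Box:
    """Corner-pair style: a top-left and a bottom-right keypoint."""
    return (top_left[0], top_left[1], bottom_right[0], bottom_right[1])


center_box = box_from_center(cx=100, cy=80, width=40, height=20)
corner_box = box_from_corners((80.0, 70.0), (120.0, 90.0))
print(center_box)  # (80.0, 70.0, 120.0, 90.0)
print(center_box == corner_box)  # True
```

Both representations describe the same box; the difference lies in which keypoints the network has to find and how they are grouped into objects.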

Object Detection Applications

Face and person detection

Face detection is one of the most common uses of object detection, and most of us already rely on it when we unlock our phones with our face. Person detection is also commonly used to count people in retail stores or to monitor social distancing.
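People counting usually comes down to counting the detections of one class that pass a confidence threshold. The sketch below assumes the detector output has already been reduced to (label, score) pairs; the labels, scores, and the 0.5 threshold are made up for illustration.

```python
from typing import List, Tuple

LabelScore = Tuple[str, float]  # (class label, confidence score)


def count_label(
    detections: List[LabelScore],
    label: str = "person",
    min_score: float = 0.5,
) -> int:
    """Count detections of `label` whose score is at least `min_score`."""
    return sum(1 for name, score in detections if name == label and score >= min_score)


frame_detections = [
    ("person", 0.92),
    ("person", 0.41),  # below the threshold, so it is ignored
    ("shopping cart", 0.88),
    ("person", 0.77),
]
people = count_label(frame_detections)
print(people)  # 2
```

The threshold trades missed people against false counts, so in practice it would be tuned on footage from the actual camera.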

Autonomous vehicles

Object detection is used by self-driving cars to find and avoid people, other cars, and obstacles on the road. Autonomous vehicles with LIDAR will sometimes use 3D object detection by placing cuboids around objects.

Defect Inspection

Manufacturing companies can use object detection to spot defects on the production line. Neural networks can be trained to detect minute defects, from folds in fabric to dents or flash in injection-moulded plastics. Unlike traditional machine learning approaches, deep learning-based object detection can also spot defects in objects whose appearance varies heavily, such as food.