Object Detection State of the Art 2022

Pedro Azevedo
11 min read · Jun 8, 2022

Object detection has been a hot topic ever since the boom of Deep Learning techniques. This article goes over the most recent state-of-the-art object detectors.

First, we will start with an introduction to object detection itself and its key metrics.

The evolution of object detectors began with the Viola-Jones detector, which enabled detection in real time. Traditionally, object detection algorithms used hand-crafted features to capture relevant information from images and a structured classifier to deal with spatial structure.

Example of object detection in an urban scenario. Image credit: https://17bce011.medium.com/complete-guide-to-object-detection-using-deep-learning-23ffc99ab072

However, these traditional approaches are not able to fully exploit extremely large data volumes or deal with the endless variations of object appearance and shape. Even though they do not require historical data for training and are unsupervised in nature, these techniques have many restrictions, especially in complex scenarios involving illumination changes, occlusion, and clutter. The new era of object detection is built upon Deep Learning techniques.

When distinguishing traditional machine learning from Deep Learning in computer vision, we can say that traditional Machine Learning extracts hand-crafted features from images and then performs classification, whereas Deep Learning techniques learn the features and classify them in one step.

To fully understand the impact of the new techniques, some key performance metrics used in object detection need to be introduced first.

Key Metrics:

To establish a fair comparison between different detectors, many metrics have been defined over the years, the most dominant one being mean average precision (mAP). A few supporting metrics are needed to fully understand mAP, so those are explained first.

Definition of Terms:

  • True Positive (TP) — Correct detection
  • False Positive (FP) — Incorrect detection
  • False Negative (FN) — A Ground-truth missed (not detected) by the detector

IoU

The IoU metric is the ratio between the area of overlap and the area of union. In other words, it measures the degree of overlap between the ground truth (gt) and the prediction (pd). It ranges from 0 to 1, where 1 is a perfect overlap between ground truth and prediction.
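In symbols:

$$\text{IoU} = \frac{\text{area}(gt \cap pd)}{\text{area}(gt \cup pd)}$$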

Image credit: Object detection in autonomous vehicles: Status and open challenges, 2022.
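For the hands-on minded, here is a minimal IoU implementation for axis-aligned boxes in [xmin, ymin, xmax, ymax] format (a sketch; the function name and box convention are my own choices, not from any particular library):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes
    given as [xmin, ymin, xmax, ymax]."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Clamp to zero when the boxes do not overlap at all
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175 ≈ 0.143
```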

Precision and Recall

Precision attempts to answer the question “What proportion of positive identifications was correct?”, while recall answers “What proportion of actual positives was correctly identified?”.

Equations for Precision and Recall

So, for example, for a vehicle detector, precision answers “of all the cars predicted by the model, how many were actually cars?”, while recall answers “of all the actual cars in the images, how many did the model identify?”.
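Written out with the counts defined above:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$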

Average Precision (AP)

Plotting the precision-recall curve at a fixed IoU threshold and taking the area under it gives the Average Precision.
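In symbols (reconstructing the formula the original figure showed):

$$AP_{\alpha} = \int_{0}^{1} p(r)\,dr$$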

where α is the IoU threshold at which detections are counted as correct, p the precision, and r the recall.

Mean Average Precision (mAP)

For a multi-class object detector, mAP is the mean of the AP across all classes.
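In symbols:

$$mAP = \frac{1}{Q} \sum_{q=1}^{Q} AP(q)$$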

where AP(q) is the average precision of class q and Q is the number of classes.

In recent years, object detection performance has improved significantly: on the public PASCAL VOC object detection benchmark, mAP rose from around 30% to more than 90% by 2018. The main driver of these improved results was the use of Deep Learning. With the development of new technologies that allow faster and easier pipelines, and with the availability of large-scale open datasets, building powerful models has become a reality.

2D vs 3D Detectors

Both 2D and 3D object detectors are important depending on the context of the problem. For example, self-driving cars use both 2D and 3D bounding boxes for perception. 2D object detectors output bounding boxes with four Degrees of Freedom (DOF) in the form [xmin, ymin, xmax, ymax].

2D vs 3D bounding box. [1]

This gives the position of objects in the 2D image plane but lacks information about their depth, which is crucial to predict properties like the shape, size, and position of objects in 3D space. While 2D detectors take images as input, 3D detectors use data from cameras, LIDAR, or radar to generate 3D bounding boxes. There are multiple approaches: some of these detectors take advantage of their 2D counterparts and project LIDAR/radar information onto the image, fusing it to obtain depth and a 3D bounding box. Lately, some authors have leveraged monocular image-based methods to perform 2D-to-3D lifting and produce 3D detection results. A taxonomy summary can be seen here:

Taxonomy of object detectors with some example models (Adapted from [1]).

In this blog post we will be focusing on 2D object detection.

Types of Object Detectors

Object detection tries to answer the question “Are there any objects in the image? If so, what objects, and where?”. We can divide object detectors into two main types: one-stage and two-stage detectors. Object detectors extract features from the input image or video frame and perform two main tasks: first they find the objects (and their bounding boxes), then they classify them. The architecture of both types can be seen in the next Figure. In a two-stage architecture these steps are separated: the detector first generates object region proposals and then classifies each one based on the features extracted from the proposed region. These architectures achieve very high accuracy but are rather slow, which makes them unfit for real-time applications like self-driving vehicles.

Architecture of an object detector (adapted from the YOLOv4 paper).

Some examples of two-stage detectors include R-CNN, Fast R-CNN, and Mask R-CNN. Usually, object detectors are built around a backbone: a network that acts as the feature generator for detection, a task for which CNNs are usually chosen. To achieve the best accuracy and efficiency when building a model, we must choose the most adequate backbone. More accurate backbones are generally deeper and denser; some examples are ResNet and ResNeXt. On the other hand, if inference speed or computational power is a concern, a lightweight backbone might be a better choice, especially in mobile applications. For real-time detection systems, we must often adapt the detection backbone and make a fair trade-off between accuracy and speed.

In short, deeper and more densely connected backbones replace shallower and sparsely connected ones when more detection accuracy is needed.

One-stage detectors predict bounding boxes over the image without a region proposal step, achieving greater detection speeds. A sample architecture of each type can be seen in the Figure below; note that the ROI generation step differs between two-stage and one-stage detectors.

Two-stage vs one-stage detector diagram [1].

Some examples of one-stage detectors are YOLO, SSD, RetinaNet, and EfficientDet-D7.

Comparison between different 2D and 3D object detection models [1].

The next Figure shows an overview of the top-performing real-time models each year, measured by mAP on the COCO dataset. COCO is a multi-class object detection dataset composed of 80 classes, and it is the current standard benchmark for comparing object detectors.

mAP of various models performing Real-Time Object Detection on COCO (Adapted from paperswithcode).
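In practice, COCO mAP is usually computed with the official pycocotools package. A minimal sketch (the file names annotations.json and detections.json are placeholders; the detections must be in the standard COCO results format):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations.json")            # ground-truth annotations
coco_dt = coco_gt.loadRes("detections.json")  # model detections

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP averaged over IoU 0.50:0.95, i.e. COCO mAP
```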

It is important to keep in mind that, even though there is a big speed difference between one-stage and two-stage detectors, some two-stage detectors like Mask R-CNN can actually run inference in real time, which is why both types appear in this Figure. The current top-performing model is a version of YOLO; the series of versions of this algorithm will be further detailed in the following blog posts.

A shift in the computer vision paradigm

The models in the previous Figure are the real-time detectors; as discussed before, the inference speed of a detector depends on the model and architecture used. Many of the improvements in real-time object detectors are ported from non-real-time detectors once new papers are published, so the latter can be seen as a preview of where real-time object detection could be in terms of accuracy. This makes it important to analyse not only real-time detectors but also non-real-time ones. The backbones of traditional detectors are typically CNN- or R-CNN-based; with the development of new technologies such as Transformer neural networks, object detection is taking a turn. In 2022 there seems to be a paradigm shift towards Swin Transformers as backbones for object detectors, reaching SOTA results like DINO, which significantly reduces model size and pre-training data size while achieving better results than the previous SOTA. However, at the time of writing, there is no solution for using Swin Transformers in real-time object detection, since their inference speed is still low (they are a recent breakthrough). Mentioning this type of algorithm is important, since it points to possibly better solutions in the future, for example some sort of YOLO-style detector with a transformer backbone. The future of computer vision seems to be transformer-based. The next Figure shows the top-performing detectors regardless of their inference speed.

mAP of the top-performing models on the COCO dataset, regardless of whether they are capable of real-time inference (adapted from paperswithcode).

YOLO Series:

The YOLO series currently provides the SOTA of real-time object detection. In this blog post these YOLO algorithms are discussed briefly; in the following blog posts they will be covered in greater detail.

The authors of YOLO proposed an approach different from the object detectors existing at the time. Although models were increasing in accuracy each year, they lacked the speed needed to perform in real time. Instead of re-purposing a classifier to perform object detection, YOLO framed object detection as a regression problem to spatially separated bounding boxes with associated class probabilities. By doing so, the entire object detection pipeline was turned into a single network that could be optimized end-to-end directly on detection performance (instead of training a classifier first and building the detector around it).

YOLO divides the image into a regular grid and performs detection and localization within those grid cells. Each cell returns three things: the bounding box coordinates with respect to the cell, the object label, and the probability that an object is present in the cell.

YOLO divides the image into an S × S grid and for each grid cell predicts B bounding boxes.
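To make the grid idea concrete: in the original YOLO paper the network output is an S × S × (B·5 + C) tensor (S = 7, B = 2, C = 20 on PASCAL VOC, hence 7 × 7 × 30). A sketch of unpacking one cell's prediction (variable names are mine, and random values stand in for real network output):

```python
import numpy as np

S, B, C = 7, 2, 20                       # grid size, boxes per cell, classes (YOLOv1 on VOC)
pred = np.random.rand(S, S, B * 5 + C)   # stand-in for the network output tensor

row, col = 3, 4                          # pick one grid cell
cell = pred[row, col]

for b in range(B):
    # Each box: x, y (offsets within the cell), w, h (relative to the image), confidence
    x, y, w, h, conf = cell[b * 5 : b * 5 + 5]
    print(f"box {b}: center=({x:.2f}, {y:.2f}), size=({w:.2f}, {h:.2f}), conf={conf:.2f}")

class_probs = cell[B * 5 :]              # C conditional class probabilities for this cell
print("most likely class:", int(np.argmax(class_probs)))
```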

This makes the algorithm fast and lowers computation cost, since detection and recognition are done in a single shot by the cells. One drawback of this approach is that it creates duplicate predictions, because each cell predicts a bounding box; the same object often ends up predicted by multiple different bounding boxes. All this “noise” is passed through a Non-Maximum Suppression (NMS) algorithm, which suppresses the bounding boxes with lower probability scores. In summary, YOLO divides the image into grids of equal size, performs object detection and classification, and eliminates noise with NMS: it picks the box with the highest probability score, suppresses all remaining boxes whose IoU with that box exceeds a threshold, and repeats these steps in a loop until the final bounding boxes are obtained, as seen in the next Figure.

Example of Non-Maximum Suppression.
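A minimal greedy NMS sketch, reusing the iou helper from the metrics section above (the 0.5 threshold is illustrative):

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy Non-Maximum Suppression.
    boxes: list of [xmin, ymin, xmax, ymax]; scores: matching confidences.
    Returns the indices of the boxes that are kept."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-scoring box still in play
        keep.append(best)
        # Drop every remaining box that overlaps the winner too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```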

The architecture consists of three key components: the backbone, the neck, and the head. The backbone is the part of the network made of convolutional layers that extract key features from the image; these first layers are typically pre-trained for classification on a large dataset like ImageNet, at a lower resolution. The neck then aggregates those features across scales, and the head produces the final predictions of class probabilities and bounding boxes. The head can be interchanged with other layers with the same input shape for transfer learning.
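As a structural sketch of that decomposition (PyTorch-flavoured and purely illustrative; this is not the code of any specific YOLO version):

```python
import torch.nn as nn

class Detector(nn.Module):
    """Illustrative backbone / neck / head decomposition."""
    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone  # convolutional feature extractor, often ImageNet-pretrained
        self.neck = neck          # aggregates features across scales (e.g. FPN/PAN style)
        self.head = head          # outputs bounding boxes, objectness and class scores

    def forward(self, images):
        features = self.backbone(images)
        fused = self.neck(features)
        return self.head(fused)
```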

YOLOv2, v3, v4, v5, YOLOR, and YOLOX will be approached in detail in a later blog post.

However, here is a figure that sums them up:

Continuation:

This article is part of a three-part series I am writing on object detection and the YOLO family:

Part 2: From YOLO to YOLOv4

https://medium.com/@pedroazevedo6/from-yolo-to-yolov4-3dcba691d96a

Part 3: What is the Best YOLO?

https://medium.com/@pedroazevedo6/what-is-the-best-yolo-8526b53414af

References

Microsoft Research. Image recognition: Current challenges and emerging opportunities.

S. Chung, K. Shek, J. Butterfield, A. Murphy, J. Butterfield, and I. Spence. Current state of the art in object detection for autonomous systems, 2021.

A. Balasubramaniam and S. Pasricha. Object detection in autonomous vehicles: Status and open challenges. 2022.

F. Nobis, M. Geisslinger, M. Weber, J. Betz, and M. Lienkamp. A deep learning-based radar and camera sensor fusion architecture for object detection. 2019 Symposium on Sensor Data Fusion: Trends, Solutions, Applications, SDF 2019, May 2020.

T. Wang, X. Zhu, J. Pang, and D. Lin. FCOS3D: Fully convolutional one-stage monocular 3D object detection. 2021.

S. K. Pal, A. Pramanik, J. Maiti, and P. Mitra. Deep learning in multi-object detection and tracking: state of the art. Applied Intelligence, 51:6400–6429, September 2021.

Y. Fang, S. Yang, S. Wang, Y. Ge, Y. Shan, and X. Wang. Unleashing vanilla vision transformer with masked image modeling for object detection. 2022.

Pedro Azevedo

Master's student at the University of Aveiro, Portugal. Focused on Deep Learning and Computer Vision for Autonomous Driving.