[Paper Summary] Comprehensive Survey on Camera-based 3D Object Detection for Autonomous Driving

jun94 · Published in jun-devpBlog · Apr 28, 2024

This article organizes and rephrases what I studied from the paper 3D Object Detection for Autonomous Driving: A Comprehensive Survey.

Thus, most of the content comes from the authors of the original survey paper.

1. Problem Definition

1.1 What is 3D Object Detection

Illustration of 3D object detection, image from [1]

3D object detection aims to predict the locations, sizes, and classes of critical objects, e.g., cars, pedestrians, cyclists, in the 3D space.

A 3D bounding box B is represented by its center point (x, y, z), size (l, w, h), and orientation (heading angle theta).
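Written compactly in the notation above, the parameterization is the 7-dimensional tuple below (some works additionally attach a class label or a velocity term to the box):

```latex
% 3D bounding box parameterization used in this article
B = [\, x_c, \; y_c, \; z_c, \; l, \; w, \; h, \; \theta \,]
% (x_c, y_c, z_c): box center in 3D space
% (l, w, h):       box length, width, and height
% \theta:          heading (yaw) angle around the up-axis
```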

In other words, 3D object detection performs localization and classification simultaneously for all objects observed by the given modality (or modalities), e.g., a camera.

Types of 3D object detection vary by which type(s) of modality is used. In this article, for the sake of simplicity and consistency, only camera-based approaches and concepts are addressed.

1.2 Sensory Input: Camera

Cameras have an advantage in affordability compared to other modalities like LiDAR (Light Detection and Ranging).

  • They can play crucial roles in understanding semantics in driving scenes, e.g., traffic signs and lights.
  • Typically, cameras produce dense representations, i.e., RGB images of size W × H × 3, where W and H denote the width and height of the image, respectively.

Despite their usefulness, cameras suffer from inherent limitations that are challenging but need to be overcome for robust autonomous driving.

  • They only capture appearance information, meaning cameras cannot directly obtain the 3D structural information of a scene.
  • Camera images are, in general, vulnerable to extreme weather and time-of-day conditions, e.g., due to rain/snow particles on the camera lens or a lack of illumination.

1.3 Comparisons with 2D Object Detection

In contrast to widely-applied 2D object detection, where the goal is to find 2D bounding boxes that separate object boundaries in the 2D image space, 3D object detection does the same job with 3D bounding boxes in real-world 3D coordinate systems, e.g., the camera or ego-vehicle coordinate system.

Left: 2D bounding box, Right: 3D bounding box.

3D object detection methods have adapted many design paradigms from 2D object detection, for example anchor-based proposal generation, proposal refinement, and non-maximum suppression (NMS). However, in many aspects, 3D object detection is not a naive adaptation of 2D detection methods to 3D space: it requires accurate 3D geometric information, for the following reasons.

  1. 3D object detection methods normally leverage distinct projected views to generate object predictions: in contrast to 2D object detection, where predictions are made only in the perspective view, 3D approaches have to consider different views to detect 3D objects, e.g., the bird’s-eye view (BEV), point view, and cylindrical view.
  2. 3D object detection has a high demand for accurate localization of objects in the 3D space. Even a decimeter-level localization error can lead to a detection failure for small objects such as pedestrians and cyclists, while in 2D object detection the same degree of localization error barely harms the detection score, since the Intersection over Union (IoU) in image space is not heavily deteriorated by such an error (a small numeric sketch follows this list).
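To make the second point concrete, here is a back-of-the-envelope sketch; the numbers (a 0.6 m × 0.6 m pedestrian footprint, a 0.3 m depth error, an object 30 m away) are illustrative assumptions, not values from the survey:

```python
# Back-of-the-envelope: how a 0.3 m depth error affects 3D (BEV) IoU vs. 2D IoU.
# All numbers below are illustrative assumptions.

def iou_1d(a_min, a_max, b_min, b_max):
    """IoU of two 1D intervals."""
    inter = max(0.0, min(a_max, b_max) - max(a_min, b_min))
    union = (a_max - a_min) + (b_max - b_min) - inter
    return inter / union

# BEV view: a pedestrian footprint ~0.6 m deep whose center is off by 0.3 m in depth.
# The lateral axis is unaffected, so the BEV IoU equals the 1D IoU along the depth axis.
bev_iou = iou_1d(0.0, 0.6, 0.3, 0.9)
print(f"BEV IoU after a 0.3 m depth error: {bev_iou:.2f}")    # ~0.33 -> likely a missed detection

# Image view: for an object ~30 m away, the same depth error only rescales the
# projected box by 30 / 30.3, so (for a roughly centered box) the 2D IoU is ~scale^2.
scale = 30.0 / 30.3
print(f"2D IoU after the same depth error: {scale ** 2:.2f}")  # ~0.98 -> barely noticeable
```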

2. Camera-based 3D Object Detection

The rest of this article addresses camera-based 3D object detection methods.

  • Two tasks: Monocular 3D object detection and multi-view 3D object detection.
  • Three types of methods: image-only, depth-assisted, and prior-guided approaches.

3. Monocular 3D Object Detection

Detecting objects in the 3D space from monocular images is an ill-posed problem since a single image cannot provide sufficient depth information. Many endeavors have been made to tackle the object localization problem, e.g., inferring depth from images and leveraging geometric constraints and shape priors. However, the problem is far from being solved: monocular 3D detection methods still perform much worse than LiDAR-based methods due to their poor 3D localization ability.

3.1 Image-only Monocular 3D Object Detection

Inspired by 2D detection approaches, a straightforward solution is to directly regress the 3D box parameters from images via a convolutional neural network. These direct-regression methods naturally borrow designs from 2D detection network architectures and can be trained in an end-to-end manner. They can be categorized into single-stage and two-stage methods, or into anchor-based and anchor-free methods.

3.1.1 Single-stage anchor-based methods

Anchor-based monocular detection methods resort to a set of 2D-3D anchor boxes placed at each image pixel and use a 2D convolutional neural network to regress the object parameters relative to these anchors.

Step 1. For each pixel [u, v] on the image plane, a set of 3D anchors, 2D anchors, and depth anchors are pre-defined.

Step 2. The image is passed through a convolutional network to predict the 2D and 3D box offsets relative to the pre-defined anchors.

Step 3. The 2D and 3D bounding boxes, b_{2D} and b_{3D}, can then be decoded from the anchors and the predicted offsets, where [u^{c}, v^{c}] is the projected object center on the image plane.

Step 4. Lastly, the projected center [u^{c}, v^{c}] and its depth d^{c} are transformed into the 3D object center [x, y, z]_{3D}.

Here, K and T denote the camera intrinsics and extrinsics, respectively. A sketch of the decoding in Steps 3 and 4 is given below.
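A minimal sketch, assuming an M3D-RPN-style offset parameterization (anchor values carry the subscript a; the exact offset definitions differ between papers):

```latex
% Step 3 -- decode the 2D box and the 3D box parameters from anchors + offsets
b_{2D} = [\, u_a + \delta_x w_a, \; v_a + \delta_y h_a, \; w_a e^{\delta_w}, \; h_a e^{\delta_h} \,]_{2D}
[u^{c}, v^{c}] = [\, u_a + \delta_x w_a, \; v_a + \delta_y h_a \,], \qquad d^{c} = d_a + \delta_d
[w, h, l]_{3D} = [\, w_a e^{\delta_w}, \; h_a e^{\delta_h}, \; l_a e^{\delta_l} \,], \qquad \theta = \theta_a + \delta_\theta

% Step 4 -- back-project the projected center and its depth into the 3D center,
% i.e., solve the pinhole relation below for [x, y, z]_{3D}:
d^{c} \, [u^{c}, v^{c}, 1]^{T} = K \, T \, [x, y, z, 1]^{T}
```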

3.1.2 Single-stage anchor-free methods

Anchor-free monocular detection approaches predict the attributes of 3D objects from images without the aid of anchors. In other words, an image is passed through a 2D convolutional neural network and then multiple heads are applied to predict the object attributes separately. The prediction heads generally include components explained below.

  • a category head to predict the object’s category
  • a keypoint head to predict the coarse object center [u, v]
  • an offset head to predict the center offset [\delta_{x}, \delta_{y}] relative to the coarse center [u, v]
  • a depth head to predict the depth offset \delta_{d}
  • a size head to predict the object size [w, h, l]
  • an orientation head to predict the observation angle \alpha.

The 3D object center [x, y, z] can then be converted from the projected center [u^{c}, v^{c}] and the depth d^{c}. In the formulation sketched below, \sigma and \alpha denote the sigmoid function and the observation angle, respectively.
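A minimal sketch, assuming a CenterNet/SMOKE-style decoding with camera principal point (c_u, c_v) and focal lengths (f_u, f_v); the exact formulas vary between methods:

```latex
% Refine the projected center and decode the depth from the head outputs
[u^{c}, v^{c}] = [u, v] + [\delta_x, \delta_y], \qquad d^{c} = \frac{1}{\sigma(\delta_d)} - 1

% Back-project into the 3D object center using the camera intrinsics
z = d^{c}, \qquad x = \frac{(u^{c} - c_u) \, z}{f_u}, \qquad y = \frac{(v^{c} - c_v) \, z}{f_v}

% Convert the observation angle into the egocentric yaw
\theta = \alpha + \arctan\!\left( \frac{x}{z} \right)
```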

3.1.3 Two-stage Methods

Two-stage monocular detection approaches generally extend convolutional two-stage 2D detection architectures to 3D object detection. First stage: a 2D detector generates 2D bounding boxes from the input image.

Second stage: the 2D boxes are lifted up to the 3D space by predicting the 3D object parameters from the 2D RoIs.

3.1.4 Analysis: Potentials and Challenges of the image-only methods

The image-only methods aim to directly regress the 3D box parameters from images via a modified 2D object detection framework.

A critical challenge of the image-only methods is to accurately predict the depth d^{c} of each 3D object. Note that simply replacing the predicted depth with the ground truth yields more than 20% car AP gain on the KITTI dataset, while replacing other parameters only results in marginal gains. This observation indicates that depth error dominates the total error and is the most critical factor hampering accurate monocular detection.

3.2 Depth-assisted Monocular 3D Object Detection

To achieve more accurate monocular detection results, many papers resort to pre-training an auxiliary depth estimation network. Specifically, a monocular image is first passed through a pre-trained depth estimator, e.g., MonoDepth or DORN, to generate a depth image.

Then, there are mainly two categories of methods to deal with depth images and monocular images.

Illustration of depth-assisted monocular 3D object detection methods.
  1. The depth-image based methods fuse images and depth maps with a specialized neural network to generate depth-aware features that could enhance the detection performance.
  2. The pseudo-LiDAR based methods convert a depth image into a pseudo-LiDAR point cloud, and LiDAR-based detectors can then be applied to the point cloud to predict 3D objects.

3.2.1 Depth-image-based methods

Most depth-image based methods leverage two backbone networks for RGB and depth images, respectively. They obtain depth-aware image features by fusing the information from the two backbones with specialized operators. More accurate 3D bounding boxes can be learned from the depth-aware features and can be further refined with depth images.

  • MultiFusion introduces the depth-image based detection framework.
  • The following papers adopt similar design paradigms with improvements in network architectures, operators, and training strategies, e.g., point-based attention, depth-guided convolutions, disentangling appearance and localization features, and a novel depth pre-training framework.

3.2.2 Pseudo-LiDAR-based methods

Pseudo-LiDAR based methods transform a depth image into a pseudo-LiDAR point cloud, and LiDAR-based detectors can then be employed to detect 3D objects from the point cloud.

A pseudo-LiDAR point cloud is a data representation in which a depth map D is converted into a pseudo point cloud P.

Specifically, for each pixel [u, v] with depth value d in the depth image, the corresponding 3D point [x, y, z] in the camera coordinate system is computed as

z = d, x = (u - c_{u}) d / f_{u}, y = (v - c_{v}) d / f_{v},

where [c_{u}, c_{v}] is the camera principal point, and f_{u} and f_{v} are the focal lengths along the horizontal and vertical axes, respectively. Thus, P can be obtained by back-projecting each pixel in D into the 3D space. P is referred to as the pseudo-LiDAR representation: it is essentially a 3D point cloud, but it is extracted from a depth image instead of a real LiDAR sensor.

Finally, LiDAR-based 3D object detectors can be directly applied on the pseudo-LiDAR point cloud P to predict 3D objects.
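A minimal sketch of the depth-map-to-pseudo-LiDAR conversion described above; the function name, array shapes, and example intrinsics are assumptions for illustration:

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fu, fv, cu, cv):
    """Back-project a depth map (H x W, metric depth per pixel) into an
    (H*W, 3) pseudo-LiDAR point cloud in the camera coordinate system."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cu) * z / fu   # horizontal axis
    y = (v - cv) * z / fv   # vertical axis
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Example usage with KITTI-like intrinsics (illustrative values only):
# P = depth_to_pseudo_lidar(predicted_depth, fu=721.5, fv=721.5, cu=609.6, cv=172.9)
# A LiDAR-based detector can then be run on P, usually after transforming the
# camera-frame points into the LiDAR/ego coordinate frame.
```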

PatchNet challenges the conventional idea of leveraging the pseudo-LiDAR representation P for monocular 3D object detection, arguing that the benefit comes mainly from the 2D-to-3D coordinate transformation rather than from the point cloud representation itself.

3.3 Prior-guided Monocular 3D Object Detection

Numerous approaches try to tackle the ill-posed monocular 3D object detection problem by leveraging the hidden prior knowledge of object shapes and scene geometry from images.

The broadly adopted prior knowledge includes object shapes, geometry consistency, temporal constraints, and segmentation information.

Illustration of the prior types in monocular 3D object detection methods.
  • Object shapes: Many methods resort to shape reconstruction of 3D objects directly from images. The reconstructed shapes can be further leveraged to determine the locations and poses of the 3D objects. There are 5 types of reconstructed representations: computer-aided design (CAD) models, wireframe models, signed distance functions (SDFs), points, and voxels.
  • Geometry consistency: Given the extrinsics matrix T that transforms a 3D coordinate from the object frame to the camera frame, and the camera intrinsic matrix K that projects the 3D coordinate onto the image plane, the projection of a 3D point [x, y, z] in the object frame onto the image pixel coordinate [u, v] can be written as d [u, v, 1]^{T} = K T [x, y, z, 1]^{T}, where d is the depth of the transformed 3D coordinate in the camera frame.
  • The above equation provides a geometric relationship between 3D points and 2D image pixel coordinates, which can be leveraged in various ways to encourage consistency between the predicted 3D objects and the 2D objects on images. There are mainly 5 types of geometric constraints in monocular detection: 2D-3D box consistency, keypoints, the object’s height-depth relationship, inter-object relationships, and ground plane constraints (a small projection sketch follows this list).
  • Temporal constraints: The temporal association of 3D objects can be leveraged as strong prior knowledge. Temporal object relationships have been exploited for depth ordering and for multi-frame object fusion with a 3D Kalman filter.
  • Segmentation information: Object segmentation masks are crucial for instance shape reconstruction. Further, segmentation indicates whether an image pixel is inside a 3D object from the perspective view, and this information has been utilized to help localize 3D objects.
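To make the 2D-3D box consistency constraint concrete, here is a small sketch that projects the eight corners of a predicted 3D box onto the image with K and T and takes their bounding rectangle, which can then be compared against the predicted 2D box; the function, shapes, and axis conventions are illustrative assumptions:

```python
import numpy as np

def project_box_to_2d(center, size, yaw, K, T):
    """Project the 8 corners of a 3D box onto the image plane.

    center: (3,)  box center, expressed in the frame that T maps to the camera frame
    size:   (l, w, h)
    yaw:    heading angle around the vertical axis
    K:      (3, 3) camera intrinsic matrix
    T:      (3, 4) or (4, 4) extrinsic matrix into the camera frame
    Returns the axis-aligned rectangle (u_min, v_min, u_max, v_max) enclosing the corners.
    """
    l, w, h = size
    # Corner offsets in the box's local frame (x forward, y left, z up -- an assumed convention)
    xs = np.array([ 1,  1,  1,  1, -1, -1, -1, -1]) * l / 2
    ys = np.array([ 1,  1, -1, -1,  1,  1, -1, -1]) * w / 2
    zs = np.array([ 1, -1,  1, -1,  1, -1,  1, -1]) * h / 2
    rot = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                    [np.sin(yaw),  np.cos(yaw), 0.0],
                    [0.0,          0.0,         1.0]])
    corners = rot @ np.stack([xs, ys, zs]) + np.asarray(center).reshape(3, 1)  # (3, 8)
    corners_h = np.vstack([corners, np.ones((1, 8))])                          # homogeneous
    cam = np.asarray(T)[:3] @ corners_h                                        # camera frame (3, 8)
    uvd = K @ cam                                                              # d * [u, v, 1]^T
    uv = uvd[:2] / uvd[2:]                                                     # perspective divide
    return uv[0].min(), uv[1].min(), uv[0].max(), uv[1].max()

# The resulting rectangle can be compared (e.g., via IoU or an L1 loss) with the
# predicted 2D box to enforce 2D-3D consistency during training.
```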

4. Multi-view 3D Object Detection

Autonomous vehicles are generally equipped with multiple cameras to obtain complete environmental information from multiple viewpoints. Some multi-view 3D object detection approaches try to construct a unified BEV space by projecting multi-view images into bird’s-eye view, and then employ a BEV-based detector on top of the unified BEV feature map to detect 3D objects.

The transformation from camera views to the bird’s-eye view is ambiguous without accurate depth information, so image pixels and their BEV locations are not perfectly aligned. How to build reliable transformations from camera views to the bird’s-eye view is a major challenge in these methods. Other methods resort to 3D object queries that are generated from the bird’s-eye view and Transformers where cross-view attention is applied to object queries and multi-view image features. The major challenge is how to properly generate 3D object queries and design more effective attention mechanisms in Transformers.

4.1 BEV-based Multi-view 3D Object Detection

LSS is a pioneering work that proposes a lift-splat-shoot paradigm to resolve the problem of BEV perception from multi-view cameras. There are three steps in LSS, and a rough sketch of the lift and splat steps follows the list below.

  • Lift: bin-based depth prediction is conducted on image pixels and multi-view image features are lifted to 3D frustums with depth bins.
  • Splat: 3D frustums are splatted into a unified bird’s-eye view plane and image features are transformed into BEV features in an end-to-end manner.
  • Shoot: downstream perception tasks are performed on top of the BEV feature map.
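A rough, unbatched sketch of the lift and splat steps for a single camera, assuming the frustum coordinates have been precomputed in BEV grid-cell units from the intrinsics/extrinsics; the real LSS implementation uses a more efficient cumulative-sum pooling, which is omitted here:

```python
import torch

def lift_splat(feat, depth_logits, frustum_cells, bev_shape):
    """feat:          (C, H, W)    image features
       depth_logits:  (D, H, W)    per-pixel scores over D discrete depth bins
       frustum_cells: (D, H, W, 2) BEV grid cell (ix, iy) of every (bin, pixel) pair,
                      precomputed from camera intrinsics/extrinsics and grid resolution
       bev_shape:     (X, Y)       number of BEV grid cells
       Returns a (C, X, Y) BEV feature map."""
    C, H, W = feat.shape
    X, Y = bev_shape

    # Lift: outer product of the categorical depth distribution with the image
    # features -> one feature vector per (depth bin, pixel) frustum point.
    depth_prob = depth_logits.softmax(dim=0)                       # (D, H, W)
    frustum_feat = depth_prob.unsqueeze(1) * feat.unsqueeze(0)     # (D, C, H, W)

    # Splat: sum-pool every frustum feature into the BEV cell it falls into
    # (the height dimension is simply collapsed).
    ix = frustum_cells[..., 0].long().clamp(0, X - 1).reshape(-1)  # (D*H*W,)
    iy = frustum_cells[..., 1].long().clamp(0, Y - 1).reshape(-1)
    flat_idx = ix * Y + iy
    feats = frustum_feat.permute(1, 0, 2, 3).reshape(C, -1)        # (C, D*H*W)

    bev = torch.zeros(C, X * Y)
    bev.index_add_(1, flat_idx, feats)
    return bev.view(C, X, Y)   # Shoot: a BEV detection head runs on this map
```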

This paradigm has been successfully adopted by many following works. BEVDet improves LSS with a four-step multi-view detection pipeline, where the image view encoder encodes features from multi-view images, the view transformer transforms image features from camera views to the bird’s-eye view, the BEV encoder further encodes the BEV features, and the detection head is employed on top of the BEV features for 3D detection.

Note that the major bottleneck in BEVDet is depth prediction, as it is normally inaccurate and will result in inaccurate feature transforms from camera views to the bird’s-eye view. To obtain more accurate depth information, many papers resort to mining additional information from multi-view images and past frames.

  • e.g., depth supervision, surround-view temporal stereo, dynamic temporal stereo.

Note that some papers completely abandon the design of depth bins and categorical depth prediction. They simply assume that the depth distribution along the ray is uniform, so the camera-to-BEV transformation can be conducted with higher efficiency.

4.2 Query-based Multi-view 3D Object Detection

In addition to the BEV-based approaches, there is also a category of methods where object queries are generated from the bird’s-eye view and interact with camera view features.

DETR3D introduces a sparse set of 3D object queries, where each query corresponds to a 3D reference point. The 3D reference points collect image features by projecting their 3D locations onto the multi-view image planes, and the object queries then interact with these image features through Transformer layers. Finally, each object query decodes a 3D bounding box.
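A rough sketch of that core idea, i.e., projecting each query's 3D reference point into every camera view and sampling image features there; the function name, tensor shapes, and the use of grid_sample are illustrative assumptions rather than the official DETR3D implementation:

```python
import torch
import torch.nn.functional as F

def sample_query_features(ref_points, img_feats, projections):
    """ref_points:  (Q, 3)        one 3D reference point per object query
       img_feats:   (N, C, H, W)  feature maps from N cameras
       projections: (N, 3, 4)     per-camera projection matrices (intrinsics @ extrinsics)
       Returns (Q, C) features averaged over the cameras that see each point."""
    N, C, H, W = img_feats.shape
    Q = ref_points.shape[0]
    pts_h = torch.cat([ref_points, torch.ones(Q, 1)], dim=1)       # homogeneous (Q, 4)

    feats = torch.zeros(Q, C)
    hits = torch.zeros(Q, 1)
    for n in range(N):
        uvd = (projections[n] @ pts_h.T).T                         # (Q, 3): d * [u, v, 1]
        d = uvd[:, 2:3]
        uv = uvd[:, :2] / d.clamp(min=1e-5)                        # pixel coordinates
        valid = ((d[:, 0] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < W)
                               & (uv[:, 1] >= 0) & (uv[:, 1] < H)).float().unsqueeze(1)

        # Normalize to [-1, 1] and bilinearly sample the feature map at each point.
        grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                            uv[:, 1] / (H - 1) * 2 - 1], dim=-1)   # (Q, 2)
        sampled = F.grid_sample(img_feats[n:n + 1], grid.view(1, Q, 1, 2),
                                align_corners=True)                # (1, C, Q, 1)
        feats += sampled[0, :, :, 0].T * valid                     # mask points outside this view
        hits += valid

    return feats / hits.clamp(min=1)   # these features then update the object queries
```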

Many following papers try to improve this design paradigm, such as by introducing spatially-aware cross-view attention or adding 3D positional embeddings on top of image features. BEVFormer introduces dense grid-based BEV queries, where each query corresponds to a pillar that contains a set of 3D reference points. Spatial cross-attention is applied between the BEV queries and sparse image features to obtain spatial information, and temporal self-attention is applied between the current and past BEV queries to fuse temporal information.
