Lifting 2D object detection to 3D in autonomous driving
Monocular 3D object detection predicts 3D bounding boxes from a single monocular (typically RGB) image. This task is fundamentally ill-posed, as the critical depth information is missing from the RGB image. Luckily, in autonomous driving, cars are rigid bodies with (largely) known shape and size. A critical question, then, is how to effectively leverage these strong priors about cars to infer the 3D bounding box on top of conventional 2D object detection.
In contrast to conventional 2D object detection, which yields 4 degrees of freedom (DoF) axis-aligned bounding boxes with center (x, y) and 2D size (w, h), 3D bounding boxes in the autonomous driving context generally have 7 DoF: 3D physical size (w, h, l), 3D center location (x, y, z), and yaw. Note that roll and pitch are normally assumed to be zero. Now the question is, how do we recover a 7-DoF object from a 4-DoF one?
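To make the two parameterizations concrete, here is a minimal sketch (the class and field names are mine, chosen for illustration, not from any particular codebase) of the 4-DoF 2D box versus the 7-DoF 3D box:

```python
from dataclasses import dataclass

@dataclass
class Box2D:
    # 4 DoF: axis-aligned box in image space
    x: float  # box center, image u-coordinate (pixels)
    y: float  # box center, image v-coordinate (pixels)
    w: float  # box width (pixels)
    h: float  # box height (pixels)

@dataclass
class Box3D:
    # 7 DoF: metric box in camera coordinates
    x: float    # 3D center (meters)
    y: float
    z: float
    w: float    # physical width (meters)
    h: float    # physical height (meters)
    l: float    # physical length (meters)
    yaw: float  # rotation about the vertical axis (radians); roll and pitch assumed zero
```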
One popular way, proposed by the pioneering work of Deep3DBox (3D Bounding Box Estimation Using Deep Learning and Geometry, CVPR 2017), is to regress the observation angle (or local yaw, or allocentric yaw, as explained in my previous post) and the 3D object size (w, h, l) from the image patch enclosed by the 2D bounding box. Both the local yaw and the 3D object size (which usually follows a unimodal distribution with small variance around the per-subtype mean) are strongly tied to…
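As a side note on why the observation angle (rather than the global yaw) is regressed: the network only sees the cropped patch, which encodes the object's appearance relative to the viewing ray, so the local yaw is what can actually be inferred from pixels. The global yaw is then recovered by adding the viewing-ray angle. Below is a minimal sketch of this standard conversion (function names are mine, used for illustration under the KITTI-style convention where global yaw = local yaw + ray angle):

```python
import numpy as np

def local_to_global_yaw(alpha, x, z):
    """Convert observation angle (local/allocentric yaw) alpha to global
    (egocentric) yaw, given the object center (x, z) in camera coordinates."""
    ray = np.arctan2(x, z)   # angle of the viewing ray toward the object
    theta = alpha + ray      # global yaw
    # wrap to (-pi, pi] for convenience
    return (theta + np.pi) % (2 * np.pi) - np.pi

def ray_angle_from_pixel(u, cx, fx):
    """Approximate the viewing-ray angle from the 2D box center column u,
    given the camera principal point cx and focal length fx (in pixels)."""
    return np.arctan2(u - cx, fx)
```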