An introduction to 3D Object Detection in Autonomous Navigation.

Matthes Krull
Feb 8, 2022


Object detection in point clouds is considerably more complex than its 2D little brother. Here you will find a brief introduction to the most important parts to consider on your journey of implementing an object detector for 3D data.


Structure

  • (1) Why 3D data
  • (2) Sensors
  • (3) Datasets
  • (4) Neural Networks

(1) Why 3D data

3D data is the key to understanding the world on another level, but unfortunately it is not easy to obtain. This is also the reason why 3D networks have not taken over yet. The third dimension enables understanding the world in a more human way, or even better. Just imagine we are driving a car towards the following gate-like structure:

With 2D data alone we could not possibly know that the gate can be passed by driving right through it (and then avoiding the obstacle behind it). Another real-world example would be passing through a tunnel.

3D data

3D data is not described by RGB values but by its position in space (x, y, z coordinates). The origin (0, 0, 0) is set to the object that captures the data (in this case a drone). Accordingly, everything else, like the person, is described in relation to that origin. For example, the person could be at point (x, y, z) = (5, 1, 0). Other orderings of the (x, y, z) axes are possible depending on the dataset you are working with. For example, the KITTI data is represented like this:

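To make this representation concrete, here is a minimal sketch (using NumPy, with made-up coordinates) of how such a point cloud can be stored and how an axis convention can be changed:

```python
import numpy as np

# A point cloud is just an (N, 3) array of x, y, z coordinates in meters,
# expressed relative to the sensor at the origin (0, 0, 0).
points = np.array([
    [5.0, 1.0, 0.0],    # e.g. the person from the drone example above
    [7.2, -0.5, 0.3],
    [2.1, 3.4, 1.8],
])

# Euclidean distance of every point from the sensor.
distances = np.linalg.norm(points, axis=1)
print(distances)

# Switching to another axis convention is just a re-ordering / sign flip.
# Hypothetical example: swap the y and z columns, then negate the new z.
points_other = points[:, [0, 2, 1]] * np.array([1.0, 1.0, -1.0])
```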

(2) Sensors

LiDAR

LiDAR sensors calculate the distance with the speed of light.

When we talk about point clouds, LiDAR of course has to be mentioned. The most well-known sensor is the Velodyne HDL-64E, a 12.7 kg beast that was launched in 2007 and dominated the high-end LiDAR market. In 2021 Velodyne announced it would discontinue the sensor, replacing it with lighter models like the PUCK (below 1 kg). The working principle of these sensors is simple: they send out multiple vertically aligned light beams to measure distance. After being reflected from surfaces, the beams are captured by the sensor again, and the distance can be calculated:

d = (cₐᵢᵣ · Δt) / 2, where cₐᵢᵣ is the speed of light in air and Δt is the round-trip time of the light pulse. You might wonder: what do we do with a single distance measurement? Nothing! Instead, we need thousands of them to gain a meaningful representation of our surroundings. Therefore, LiDAR sensors rotate, and while doing so they continuously take measurements to obtain a full 360° view. The final result is a point cloud:


A single point cloud consists of a list of (x, y, z) point coordinates, with the center point (0, 0, 0) located at the sensor. You probably noticed the ring-like distribution of points around the center (the location of the car), which is caused by the rotation of the LiDAR sensor.
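As a quick sanity check of the time-of-flight formula above, here is a minimal sketch with a made-up round-trip time:

```python
C_AIR = 2.997e8  # approximate speed of light in air, m/s

def tof_distance(round_trip_time_s: float) -> float:
    """Distance to a surface from the round-trip time of a light pulse.

    The pulse travels to the surface and back, hence the division by two.
    """
    return C_AIR * round_trip_time_s / 2.0

# A pulse returning after ~200 nanoseconds corresponds to roughly 30 m.
print(tof_distance(200e-9))  # ~29.97
```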

Besides Velodyne, there are other LiDAR manufacturers, such as Ouster, Innoviz, Aeva, and Luminar.

  • LiDAR pro: high range / accuracy
  • LiDAR con: high weight / cost

Solid-State LiDAR (SSD)

The next sensor type we want to look at is the solid-state LiDAR. Even though mechanical LiDARs have come a long way, the lightest solution (the Puck LITE) is still quite heavy at around 0.5 kg. For small robots such as drones, this is significant. Additionally, the moving parts of mechanical LiDARs require a lot of energy and maintenance because of vibration, mechanical shocks, and temperature and humidity changes. Therefore a new technique was developed, called SSD.

Optical emitters send out bursts of photons in specific patterns and phases to create directional emission.

The two main approaches to SSD are OPA (optical phased arrays) and MEMS (microelectromechanical systems). An OPA sends out bursts of photons in specific patterns and phases to create a directional emission. MEMS-based scanners, on the other hand, use micro-mirrors to control the direction of emission and focus. In other words, OPA and MEMS differ in the number of lasers used: whereas OPA uses many, MEMS uses only one, steered by micro-mirrors. Solid-state LiDAR techniques are also used in the iPhone 12 Pro and the 2020 iPad Pro.

Left: OPA, Right: MEMS

Since SSD is a relatively new technique, not many sensors have been released yet. The first devices available on the market came from Microsoft and Intel, which launched their solid-state solutions in 2019 and 2020. The Microsoft Azure Kinect and the Intel L515 both have a low price of around $400; however, the Kinect weighs more than four times as much as the L515 (440 g vs. 100 g). The Azure Kinect, on the other hand, seems to have a slightly better range and accuracy, as shown in test footage. Neither camera is made for autonomous driving, as their limited maximum range of about 9 m indicates. Intel also clearly states that the L515 is not suitable for outdoor usage, since sunlight interferes with its laser operating at 860 nm. Also have a look at this article, where the L515 is tested in different light conditions. The Microsoft Azure Kinect has the same issue. Looking ahead, other manufacturers like Velodyne and Ouster have already announced long-range outdoor solid-state sensors coming in 2022. These new sensors might finally make LiDAR technology affordable for the mass market by reducing the cost and making it resilient to rough environments like roadways, farmlands, construction sites, and mines.

  • SSD pro: low weight / medium accuracy
  • SSD con: low range / indoor only (both will likely change soon)

Stereo-Vision

Stereo-Vision is a non-LiDAR approach that reduces cost, weight, and mechanical complexity by using regular RGB or infrared cameras to obtain depth information. For the sake of completeness, it should be mentioned that non-LiDAR approaches can be either monocular (single camera) or multi-view (multiple cameras). Here we will concentrate on multi-view approaches, since they are more accurate without significant drawbacks. Because stereo vision does not measure distance directly (as LiDAR does), camera-based methods use computer vision techniques to find depth cues in the image. For a better understanding of the core problem, see the following image, where a 3D scene is mapped to the 2D camera plane (labeled "near clip plane").


This mapping is also called 2D projection and describes the normal function of a camera: producing a 2D representation of the real 3D world. Since the depth axis is irreversibly discarded during the 2D projection, reading exact depth information back from the image is impossible.

Addition: when using a single camera, algorithms can still estimate depth with the same techniques found in the human brain (for example, by searching for so-called depth cues like size, texture, perspective, and motion).

If multiple cameras are used (stereo vision), a powerful technique called stereopsis [49] can be applied to gain depth information. Stereopsis describes the possibility of calculating distances by triangulation, using two camera images of the same object taken from different perspectives. An intuitive example is shown in the next figure, where closer objects (the index finger) are displaced more than objects far away (the house) when viewed from two different angles. This displacement is called retinal disparity and enables us to calculate the distance exactly with the equation z = (f · b) / d, where z is the distance we are looking for, f is the focal length of the camera, b is the distance between the two cameras (the baseline), and d = x₁ − x₂ is the disparity.

Calculating distance from retinal disparity.
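Here is a minimal sketch of the triangulation formula, with made-up camera parameters (focal length in pixels, baseline in meters):

```python
def depth_from_disparity(f_px: float, baseline_m: float, disparity_px: float) -> float:
    """z = (f * b) / d: depth from stereo disparity.

    f_px          focal length of the cameras, in pixels
    baseline_m    distance b between the two cameras, in meters
    disparity_px  displacement d = x1 - x2 of the same point in the
                  left and right images, in pixels
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return f_px * baseline_m / disparity_px

# Made-up numbers: 640 px focal length, 6.5 cm baseline, 20 px disparity.
print(depth_from_disparity(640.0, 0.065, 20.0))  # ~2.08 m
```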

Even though the triangulation equation enables us to precisely calculate the distance to any object, the true difficulty lies in the previous step: determining the displacement (disparity) of an object between the two images. This problem can be solved efficiently using epipolar geometry (I will write a separate article about it); a naive brute-force sketch follows after the list below. Still, the distance calculation is prone to inaccuracies in the following situations:

  • Repetitively textured region: a pattern contains multiple identical-looking areas, making matches ambiguous.
  • Textureless region: Many pixels have the same pixel intensity.
  • Reflective surface: Mirror-like objects show textures of other objects.
  • Occlusion: Object is occluded in one view but not the other.
  • Violating the Lambertian property: the brightness of an object changes with the viewing angle, so the same point looks different in the two views.
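To make the displacement step more tangible, here is a naive block-matching sketch: a brute-force search along one row of two rectified grayscale images. This is not the epipolar-geometry pipeline used in practice, and the function name and parameters are my own:

```python
import numpy as np

def match_disparity(left: np.ndarray, right: np.ndarray,
                    row: int, col: int, block: int = 5, max_disp: int = 64) -> int:
    """Disparity (in pixels) of the block around (row, col) in the left image.

    Assumes rectified grayscale images, so the matching block in the right
    image lies on the same row, shifted to the left by the disparity.
    """
    half = block // 2
    ref = left[row - half:row + half + 1, col - half:col + half + 1].astype(np.float32)
    best_disp, best_cost = 0, np.inf
    for d in range(max_disp):
        c = col - d
        if c - half < 0:
            break
        cand = right[row - half:row + half + 1, c - half:c + half + 1].astype(np.float32)
        cost = np.abs(ref - cand).sum()  # sum of absolute differences (SAD)
        if cost < best_cost:
            best_cost, best_disp = cost, d
    return best_disp
```

On textureless or repetitively textured regions, many candidate blocks produce nearly identical costs, which is exactly where the failure cases above come from.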

Manufacturers

The market for stereo-vision sensors is dominated by Intel and Stereolabs. Intel's stereo-vision sensors are the D415, D435i, and D455. The lightest model, the D435i, weighs 72 g and is therefore the second lightest among all manufacturers. Stereolabs offers the ZED 2 and the ZED Mini. The main difference is that Intel provides on-board computation and an infrared emitter. On-board computation allows the camera to output a depth map directly, using the Intel RealSense Vision Processor D4. In comparison, the ZED 2 and ZED Mini both rely on an external GPU (like a Jetson Nano) for computation, which means additional power consumption (5–10 watts for the Jetson Nano alone). Another advantage of the D435i is that it can operate in complete darkness thanks to its infrared cameras (Stereolabs uses RGB).

Depth images created by the Intel RealSense D435i.

In my experiments, the D435i produced useful results up to a maximum distance of 7–10 meters, with an optimal operating range much lower (1 to 3 meters), whereas the ZED series can be used up to 20 m, as shown by this YouTuber.

(3) Datasets

There are three categories of datasets that are interesting for 3D object detection:

a) Autonomous Driving

b) 3D Object Detection

c) Human Pose Detection

a) The most obvious datasets are of course the ones for autonomous driving, where we find annotations for 3D cuboids, LiDAR segmentation labels, 2D boxes, instance masks, and 2D segmentation masks. KITTI is the most well-known, since it is one of the earliest 3D datasets for autonomous driving. nuScenes is more versatile, since it also includes recordings at night and in different weather conditions, which is important for training a robust network. Just remember: if something is not included in the training data, the network is likely to make errors later in real life. So choose wisely when selecting your data. The sensor used for most of these datasets is the Velodyne HDL-64E or other high-cost / high-accuracy options.

b) The second category of datasets is made for the general task of 3D object detection outside autonomous driving. Common objects here are mostly found in households, like chairs and cups. Annotations are again 2D / 3D boxes and segmentations. In contrast to category a), most datasets are created with RGB-D sensors like the Kinect or artificially with 3D software. It must be said that RGB-D sensors are far less accurate than LiDARs, and hence the noise in the point clouds (distance error) is higher. For example, in this image you see the output of the super lightweight Intel RealSense D435i at distances of one and three meters:

Here you see a person from the side, sitting on a chair. At three meters distance the point cloud already becomes so noisy that the person is hardly recognizable. However, the datasets listed were not recorded with the D435i, but with other sensors that have slightly higher accuracy, leading to better data. Keep in mind, though, that depending on which sensor you are using, you might not find training data that works for you. The D435i, for example, has no public datasets available.

c) The last category of datasets does not directly fall into 3D detection, since most of them only contain 2D data and annotations. Here, certain keypoints like the joints of a person are of interest. The CMU Panoptic dataset, however, was recorded with RGB-D sensors and could therefore also be used for networks that operate on point clouds.

(4) Neural Networks

Most publications can be assigned to one of these categories:

The three approaches for 3D object detection.

A: Point-wise Feature Extraction

The first approach to 3D object detection is the point-wise extraction of features. This can be done with either linear layers or 2D/3D convolutional layers. No matter which layer type you use, the network will be very slow due to the massive number of points found in every cloud (10k to 200k points). An example is PointNet (2017).
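To illustrate the idea, here is a minimal PointNet-style sketch in TensorFlow: a shared MLP applied to every point independently, followed by a max pool over all points. The layer widths are made up and do not match the original PointNet configuration:

```python
import tensorflow as tf

def pointwise_feature_extractor(num_points: int = 1024) -> tf.keras.Model:
    """PointNet-style global feature: a shared MLP per point, then max pooling.

    Input:  (batch, num_points, 3) raw x, y, z coordinates.
    Output: (batch, 256) global feature describing the whole cloud.
    """
    inputs = tf.keras.Input(shape=(num_points, 3))
    x = inputs
    # Conv1D with kernel size 1 == the same dense layer applied to every point.
    for width in (64, 128, 256):
        x = tf.keras.layers.Conv1D(width, kernel_size=1, activation="relu")(x)
    # Max pooling over the point dimension makes the feature order-invariant.
    global_feature = tf.keras.layers.GlobalMaxPooling1D()(x)
    return tf.keras.Model(inputs, global_feature)

model = pointwise_feature_extractor()
model.summary()
```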

B: Aggregate View Object Detection (AVOD)

The second approach, which currently gets the most attention in research, is the aggregation of points into subsets, where each subset can be seen as a representation of a certain area in 3D space. All points that fall into an area are associated with it and aggregated by convolutions. Finally, you can treat these aggregated areas the same way as the points in section A (point-wise feature extraction), by forwarding them through a stack of layers. These approaches can be very sophisticated in how the aggregation process considers correlations between points. The most common types of areas for aggregation are grids, sets, and graphs, where the simplest solution is the grid. For the grid approach, the whole 3D space is split into subsections (equally sliced or diced), and the points in each subsection are aggregated into a single feature vector. This process is also called voxelization. Famous networks that use voxelization are VoxelNet (2017), SECOND (2018), PointPillars (2019), and SE-SSD (2021).
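Here is a minimal voxelization sketch in NumPy: every point is snapped to a voxel index and the points of each voxel are aggregated, in this case simply by averaging (networks like VoxelNet or PointPillars learn this aggregation step instead):

```python
import numpy as np

def voxelize(points: np.ndarray, voxel_size: float = 0.2) -> dict:
    """Group an (N, 3) point cloud into voxels with edge length voxel_size.

    Returns a dict mapping voxel index (i, j, k) -> mean point of that voxel.
    A learned detector would replace the mean with a small network per voxel.
    """
    indices = np.floor(points / voxel_size).astype(np.int64)
    voxels = {}
    for idx, pt in zip(map(tuple, indices), points):
        voxels.setdefault(idx, []).append(pt)
    return {idx: np.stack(pts).mean(axis=0) for idx, pts in voxels.items()}

cloud = np.random.uniform(-5.0, 5.0, size=(1000, 3))  # fake point cloud
voxel_features = voxelize(cloud)
print(len(voxel_features), "occupied voxels from", len(cloud), "points")
```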

C: Frustum / Region of Interest

The third type of network tries to exploit the high accuracy of 2D detectors by (1) extracting regions of interest (ROIs) from an RGB image captured at the same time as the point cloud, and (2) considering only those ROIs when detecting 3D boxes in the point cloud. For example, if the 2D detector found a person in the RGB image, we can map the point cloud onto the 2D image plane using a perspective transformation, so that image pixels and points from the cloud can be associated with each other (they then lie in the same plane). Now we can easily select only the points from the cloud where the 2D detector thought a person was, and forward these points to another network for 3D processing (a small sketch of this step follows after the list below). This approach works well because the 2D detector makes a preselection for the 3D detector, thus reducing the error. However, the approach has disadvantages:

  • Processing two data streams instead of one requires more processing power
  • The point cloud sensor and RGB camera need to be perfectly aligned to calculate the perspective transformations
  • Multiple sensors mean multiple sources of failure

One network for frustum-based extractions is Frustum PointNets.
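Here is a minimal sketch of the projection-and-selection step, assuming the point cloud is already expressed in the camera coordinate frame and the pinhole intrinsic matrix K is known (all numbers below are made up):

```python
import numpy as np

def select_frustum_points(points_cam: np.ndarray, K: np.ndarray,
                          box_2d: tuple) -> np.ndarray:
    """Keep only the points whose image projection falls inside a 2D box.

    points_cam: (N, 3) points in the camera frame (z pointing forward)
    K:          (3, 3) pinhole camera intrinsic matrix
    box_2d:     (x_min, y_min, x_max, y_max) from the 2D detector, in pixels
    """
    pts = points_cam[points_cam[:, 2] > 0]   # keep points in front of the camera
    proj = (K @ pts.T).T                     # perspective projection
    uv = proj[:, :2] / proj[:, 2:3]          # divide by depth -> pixel coordinates
    x_min, y_min, x_max, y_max = box_2d
    inside = ((uv[:, 0] >= x_min) & (uv[:, 0] <= x_max) &
              (uv[:, 1] >= y_min) & (uv[:, 1] <= y_max))
    return pts[inside]                       # the "frustum" behind the 2D box

# Made-up intrinsics and a made-up person detection box:
K = np.array([[720.0, 0.0, 640.0],
              [0.0, 720.0, 360.0],
              [0.0, 0.0, 1.0]])
cloud_cam = np.random.uniform(-10.0, 10.0, size=(5000, 3))
person_points = select_frustum_points(cloud_cam, K, (500, 200, 780, 700))
```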

Speed


Speed and accuracy obviously have to be balanced: the more elaborate a network is, the more time it takes to compute. It can be seen that PointPillars from 2019 is still on top with a speed of nearly 50 fps, followed by SE-SSD with 30 fps, which also had the best accuracy at the time the chart was created (April 2021).
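If you want to compare detectors on your own hardware, a simple way to estimate frames per second is to time repeated inference calls (a minimal sketch; run_inference stands in for whatever model you are benchmarking):

```python
import time

def measure_fps(run_inference, sample, warmup: int = 10, runs: int = 100) -> float:
    """Average frames per second of run_inference(sample) over `runs` calls."""
    for _ in range(warmup):            # warm-up: GPU allocation, JIT, caches
        run_inference(sample)
    start = time.perf_counter()
    for _ in range(runs):
        run_inference(sample)
    return runs / (time.perf_counter() - start)
```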

Summary:

This was a brief overview of the topics that matter in 3D object detection. For the future, there is a lot to expect regarding network speed and accuracy, but also sensors. New navigation opportunities will arise from more reliable perception and therefore more confident navigation, especially with the rise of more sophisticated solid-state LiDAR solutions. Just imagine the possibilities of perfect perception: a drone could fly at high speed without accidents, like a bot in a computer game. Perfect perception is an important step towards higher-intelligence algorithms.

Future topics:

Future articles will consider:

  • Augmentations for point clouds
  • Creating point clouds with stereo vision and epipolar geometry
  • Perspective projections from 2D to 3D
  • TensorFlow implementation of the PointPillars network

More articles about this and other ML topics are coming soon, so subscribe if you like.

Medium: Matthes Krull— Medium

About me: I’m a passionate deep learning engineer based in Berlin. You can contact me at krull.matthes@gmail.com.

