Fusing LIDAR and Camera data — a survey of Deep Learning approaches

Both LIDAR and cameras output high-volume data. Radar output is mostly lower volume, as radars primarily output an object list. However, with recent advances in imaging radars at 80 GHz, it is conceivable that some of these will optionally output point-cloud-type data.

A large body of work exists on image-based object detection. Point cloud work is relatively recent, and hence not as mature. Also, because point clouds are sparse, semi-regular, and 3D, processing them is more compute intensive and challenging. Hence the way point cloud and image data are fused is currently dictated primarily by how the point cloud is processed.

There appear to be 3 major approaches for processing point cloud data: 3D voxel grid based, 2D Bird’s Eye View (BEV) based, and ad hoc.

In the 3D Voxel Grid category, there are VoxelNet [1], Li [2], and Fast-and-Furious [5].

In the 2D BEV category, there are MV3D [3], Wang et al. [4], Fast-and-Furious [5], PIXOR [6], and AVOD [7].

In the ad hoc category, there are PointFusion [11] and Frustum PointNet [12].

For the record, the figures are copied from the original papers along with their original captions.

VoxelNet [1]:

VoxelNet as published is not exactly a LIDAR and image fusion architecture. However, it is a highly performant architecture for 3D bounding box detection, and can potentially be integrated/fused with 2D bounding box detection architectures for better accuracy.

VoxelNet divides the space around a LIDAR into voxels of a fixed size. The authors used 0.2m x 0.2m x 0.4m (HxWxD) voxels for car and pedestrian detection.

The architecture is composed of 3 major parts: Feature Learning Network (FLN), Convolutional Middle Layers (CML), and Region Proposal Network (RPN), as shown in Fig A.

Fig A: Voxelnet architecture.

The Feature Learning Network first voxelizes the space and then bucketizes the points into voxels. To keep the buckets somewhat balanced in number of points, random sampling and a per-voxel point count limit are used. Each point in a voxel is represented as a 7-dimensional element composed of 3 absolute co-ordinates, 3 relative co-ordinates, and reflectance. Each point is processed individually through Voxel Feature Encoding (VFE) layers to generate per-point features. The per-point features are also max-pooled to generate a per-voxel feature, which is then concatenated with each per-point feature. Each VFE layer is a Dense/Fully-Connected layer with Batch Normalization (BN) and ReLU activation. As about 90% of voxels are empty, the encoding uses sparse tensor processing and handles only the non-empty voxels. The last VFE layer generates the voxel-wise feature. The features of empty voxels are set to zero for dense-mode processing in the following stages.
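To make the bucketing and the VFE step more concrete, here is a minimal NumPy sketch, not the authors' implementation. The function names `group_points` and `vfe_layer`, the point limit of 35, the feature width of 16, and the choice of the centroid of a voxel's points as the reference for the relative co-ordinates are assumptions for illustration. Batch Normalization is left out for brevity.

```python
import numpy as np

def group_points(points, voxel_size, max_points, rng):
    """Bucket points (N, 4: x, y, z, reflectance) into voxels of 7-dim elements."""
    keys = np.floor(points[:, :3] / voxel_size).astype(int)
    buckets = {}
    for key, point in zip(map(tuple, keys), points):
        buckets.setdefault(key, []).append(point)
    voxels = {}
    for key, members in buckets.items():
        pts = np.stack(members)
        if len(pts) > max_points:  # random sampling caps crowded voxels
            pts = pts[rng.choice(len(pts), max_points, replace=False)]
        # Relative co-ordinates: offset from the centroid of the voxel's points (assumed).
        offsets = pts[:, :3] - pts[:, :3].mean(axis=0)
        voxels[key] = np.hstack([pts[:, :3], offsets, pts[:, 3:4]])  # (T, 7)
    return voxels

def vfe_layer(x, weight, bias):
    """Shared dense layer + ReLU per point, then concatenate the per-voxel max."""
    pointwise = np.maximum(x @ weight + bias, 0.0)       # (T, C)
    voxelwise = pointwise.max(axis=0, keepdims=True)     # (1, C)
    return np.hstack([pointwise, np.repeat(voxelwise, len(x), axis=0)])  # (T, 2C)

rng = np.random.default_rng(0)
cloud = rng.uniform(0.0, 2.0, size=(500, 4))             # synthetic points
voxels = group_points(cloud, np.array([0.2, 0.2, 0.4]), max_points=35, rng=rng)
weight, bias = rng.normal(size=(7, 16)), np.zeros(16)    # shared by all voxels
features = {key: vfe_layer(v, weight, bias) for key, v in voxels.items()}
```

Only the non-empty voxels ever appear as keys in `voxels`, which is the sparse processing the paragraph above refers to.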

Each Convolutional Middle Layer (CML) is composed of a 3D convolution, a Batch Normalization layer, and a ReLU layer, applied sequentially. Each 3D convolutional layer has 3D kernels, strides, and paddings, e.g. Conv3D(Cin, Cout, (kx,ky,kz), (sx,sy,sz), (px,py,pz)). As with 2D pixels, the convolutional layers aggregate voxel-wise features within a progressively expanding receptive field, adding more context to the shape description. The authors used 3 CMLs in their architecture. The CML’s 3D convolutions also reduce the Z dimension from 10 to 2, which is eventually merged with the channel dimension. So, in essence, the rest of the network processes a 2D view, although the RPN regresses in the depth dimension and hence provides 3D bounding boxes.
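The reduction of the Z dimension follows from the standard convolution output-size formula. The short sketch below traces it. The per-layer (kernel, stride, padding) values along Z and the channel count of 64 are assumptions chosen to reproduce the 10 to 2 reduction, not values stated in this article.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

# Assumed (kernel, stride, padding) along Z for the 3 middle layers.
z_layers = [(3, 2, 1), (3, 1, 0), (3, 2, 1)]

depth = 10
trace = [depth]
for kernel, stride, padding in z_layers:
    depth = conv_out(depth, kernel, stride, padding)
    trace.append(depth)

channels = 64                        # assumed channel count after the last CML
merged_channels = channels * depth   # Z is folded into the channel axis for the 2D RPN
print(trace, merged_channels)        # [10, 5, 3, 2] 128
```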

The RPN, shown in Fig B, is the last stage of processing and is composed of 3 sub-blocks of 2D convolutional layers followed by 3 deconv layers. The 2D convolutional sub-blocks are composed of down-sampling, squeezing, batch normalization, and ReLU. The outputs of the deconv layers are concatenated. The concatenated feature map is 1/4th the size of the original image and can be looked at as a simplified version of the image. Each point in the feature map is considered an anchor point. After concatenation, one more 2D convolution is applied to generate the class probability and the bounding box regression, both of which are encoded in the channels. The authors considered only one size of anchor box at a fixed Z location with 2 yaw poses.
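As a rough illustration of the anchor layout described above (one box size, a fixed Z, 2 yaw poses per feature-map location), the NumPy sketch below builds such a grid. The function name `make_anchors`, the 7-number box encoding (x, y, z, length, width, height, yaw), and all numeric values are assumptions for illustration.

```python
import numpy as np

def make_anchors(rows, cols, cell, size_lwh, z_center, yaws=(0.0, np.pi / 2)):
    """Return anchors of shape (rows, cols, len(yaws), 7)."""
    ys, xs = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    # Anchor centers sit in the middle of each feature-map cell.
    centers = np.stack([(xs + 0.5) * cell, (ys + 0.5) * cell], axis=-1)
    anchors = np.zeros((rows, cols, len(yaws), 7))
    anchors[..., 0:2] = centers[:, :, None, :]   # x, y differ per cell
    anchors[..., 2] = z_center                   # fixed Z for every anchor
    anchors[..., 3:6] = size_lwh                 # a single box size
    anchors[..., 6] = yaws                       # 2 yaw poses per cell
    return anchors

# Hypothetical car-like box on a 4 x 6 feature map with 0.4 m cells.
anchors = make_anchors(4, 6, 0.4, (3.9, 1.6, 1.56), z_center=-1.0)
print(anchors.shape)  # (4, 6, 2, 7)
```

The RPN output then only needs to hold, per anchor, a score and the corrections to these 7 numbers.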

Unlike in the Faster-RCNN approach, no explicit cropping is used. The loss function used essentially forces the VFE, CML, and RPN layers to focus on minimizing the difference between targets and anchor+corrections (score and regression map).

Fig B: VoxelNet RPN

VoxelNet was reported to have achieved 81.97% / 65.46% / 62.85% mAP (easy/medium/hard) in the car category of the KITTI 3D detection data set.

PointNet and PointNet++:

One of the major advantages claimed for PointNet is that it can handle point clouds in any order (i.e. it is invariant to permutations of the points in the cloud).

In PointNet [9], the original point cloud is passed through a few 2D convolution layers that output a vector of length 1024 for each point. A max-pool is then taken across all the points for each element of the 1024 vector, so the point cloud is transformed into a single 1024-dimensional feature vector. This vector is further processed with Dense/Fully-Connected layers to get the final classification. The authors of the PointNet paper also extended the basic classification to segmentation and object detection.
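The permutation invariance comes from applying the same weights to every point and then using a symmetric function (max) across the points. Below is a minimal NumPy sketch with small, assumed layer widths (32 instead of 1024, to keep the example light); `pointnet_global_feature` is a hypothetical name, and the random weights stand in for trained ones.

```python
import numpy as np

def pointnet_global_feature(points, layers):
    """Shared per-point layers followed by a max-pool across the points."""
    h = points
    for weight, bias in layers:
        h = np.maximum(h @ weight + bias, 0.0)   # same weights for every point
    return h.max(axis=0)                          # symmetric: order does not matter

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(3, 16)), np.zeros(16)),
          (rng.normal(size=(16, 32)), np.zeros(32))]
cloud = rng.normal(size=(100, 3))
shuffled = cloud[rng.permutation(len(cloud))]

same = np.allclose(pointnet_global_feature(cloud, layers),
                   pointnet_global_feature(shuffled, layers))
print(same)  # True
```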

In PointNet++ [10], the authors extended the idea to address a shortcoming of PointNet: it is only able to capture large global features. The authors closed that gap by introducing a hierarchy of groups of points and then applying PointNet to the hierarchical groups. Eventually, the vectors from all layers are concatenated and passed through fully connected layers to determine the class. From one angle, PointNet++ looks like a “generalization” of voxel based processing.

Fig C: PointNet++ hierarchy.

PointFusion [11]:

PointFusion is a fusion-only network that uses PointNet along with a 2D object detector. The overall network architecture is shown in Fig D. As can be seen in Fig D, there are 2 branches: one for the point cloud and one for images.

Fig D: PointFusion network architecture.

During training, 2D ground truth boxes (which the authors also augmented by randomly resizing and shifting them by 10%) are used to crop the image for the image branch. The cropped image is passed through an image classifier such as ResNet to extract features. For the point cloud branch, only the points that project onto the cropped image are selected. Features are extracted in that branch using PointNet.
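Selecting the points that project onto the crop can be sketched as follows. This is a minimal NumPy illustration, assuming points already expressed in the camera frame and a 3x4 pinhole projection matrix; `points_in_box` and the matrix values are hypothetical.

```python
import numpy as np

def points_in_box(points_xyz, proj, box):
    """Keep points whose image projection falls inside box = (x1, y1, x2, y2)."""
    homogeneous = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])
    uvw = homogeneous @ proj.T                    # (N, 3)
    in_front = uvw[:, 2] > 0                      # discard points behind the camera
    depth = np.where(in_front, uvw[:, 2], 1.0)    # safe divisor for discarded points
    u, v = uvw[:, 0] / depth, uvw[:, 1] / depth
    x1, y1, x2, y2 = box
    keep = in_front & (u >= x1) & (u <= x2) & (v >= y1) & (v <= y2)
    return points_xyz[keep]

# Hypothetical camera: focal length 100 px, principal point (50, 50).
proj = np.array([[100.0, 0.0, 50.0, 0.0],
                 [0.0, 100.0, 50.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0]])
cloud = np.array([[0.0, 0.0, 10.0],    # projects to (50, 50): inside
                  [5.0, 0.0, 10.0],    # projects to (100, 50): outside
                  [0.0, 0.0, -5.0]])   # behind the camera
selected = points_in_box(cloud, proj, box=(40, 40, 60, 60))
print(len(selected))  # 1
```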

The image feature, the point cloud global feature, and the point cloud per-point features are concatenated to produce a fused feature. The fused feature is passed through a number of Fully Connected/Dense/Inner Product layers. The authors tried 2 different types of outputs: one outputs the absolute positions of the 8 corners of the 3D bounding box; the other outputs the relative offsets (and a score) from each point of the point cloud.
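A shape-level NumPy sketch of the fusion and of the second output type is shown below. The feature sizes are placeholders, the function names are hypothetical, and picking the box of the highest-scoring point as the final prediction is an assumption made for this illustration.

```python
import numpy as np

def fuse_features(image_feat, global_feat, point_feats):
    """Tile the two global vectors and append them to every per-point feature."""
    shared = np.concatenate([image_feat, global_feat])
    tiled = np.tile(shared, (len(point_feats), 1))
    return np.hstack([point_feats, tiled])        # (N, P + I + G)

def corners_from_offsets(points_xyz, offsets, scores):
    """offsets: (N, 8, 3) corner offsets per point; scores: (N,)."""
    best = int(np.argmax(scores))                 # assumed selection rule
    return points_xyz[best] + offsets[best]       # (8, 3) absolute corners

rng = np.random.default_rng(0)
fused = fuse_features(rng.normal(size=32),        # image feature (placeholder size)
                      rng.normal(size=16),        # point cloud global feature
                      rng.normal(size=(50, 8)))   # per-point features
print(fused.shape)  # (50, 56)
```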

Frustum PointNet [12]:

The lead author of PointNet and PointNet++ extended his network to 3D object detection in this work [12] while interning at the autonomous vehicle company Nuro. The end-to-end architecture is composed of 3 networks: Frustum Proposal, 3D Instance Segmentation, and Amodal (partially occluded) 3D Box Estimation, as shown in Fig E.

Fig E: Frustum PointNet architecture.

The Frustum Proposal network first detects objects in 2D using a pre-trained Fast-RCNN and FPN. It then constructs a frustum (the part of a solid between two parallel planes) extending from the 2D bounding box in the image to as far as the point cloud goes. In this case it is a pyramidal frustum, as the bounding boxes are rectangular. The authors apply a couple of co-ordinate transformations: they rotate the frustum so that its central axis is perpendicular to the image plane, and they move the origin to the centroid of the point cloud in the frustum.
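The two co-ordinate transformations can be sketched in NumPy as below. This is an illustration under assumptions, not the authors' code: the camera frame is taken as x right, y down, z forward; only the horizontal rotation about the vertical axis is applied; and `fx` (focal length in pixels), `cx` (principal point column), and the function names are hypothetical.

```python
import numpy as np

def rotate_to_frustum_center(points_xyz, box, fx, cx):
    """Rotate about the y axis so the ray through the box center becomes +z."""
    u_center = 0.5 * (box[0] + box[2])            # box = (x1, y1, x2, y2) in pixels
    theta = np.arctan2(u_center - cx, fx)         # horizontal angle of the center ray
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, 0.0, -s],
                    [0.0, 1.0, 0.0],
                    [s, 0.0, c]])
    return points_xyz @ rot.T

def center_on_centroid(points_xyz):
    """Move the origin to the centroid of the given points."""
    return points_xyz - points_xyz.mean(axis=0)

# A point lying on the center ray of a box centered at column 150.
on_ray = np.array([[10.0, 0.0, 10.0]])
rotated = rotate_to_frustum_center(on_ray, box=(140, 0, 160, 20), fx=100.0, cx=50.0)
print(np.round(rotated, 3))  # [[ 0.     0.    14.142]]
```

After the rotation, frustums pointing in different directions look alike to the downstream networks, which is the motivation for normalizing them this way.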

A PointNet-based network is used to segment the LIDAR points in the frustum. The segmentation decides whether each point is part of the 3D object or not (the assumption is that there is only one object in the frustum). The authors also use the available 2D semantic class information to guide the shape.

The authors use a lightweight PointNet to further tune the bounding box, as the centroid of the segmented point cloud can be far from the centroid of the amodal box.

On the KITTI 3D object detection test data set, the authors were able to get 81.2% / 51.2% / 71.96% mAP for the easy car/pedestrian/cyclist categories.


AVOD [7]:

The AVOD architecture uses the same type of feature extractor for both the point cloud BEV and the camera image. See Fig F for the overall architecture. The feature extractor is composed of an encoder and a decoder, Fig G. The encoder is a modified VGG16, and the decoder is composed of deconv (transposed-convolution) layers whose outputs are concatenated and mixed with the corresponding outputs from the encoder. The decoder produces feature maps of the same size as the original image and BEV.

Both feature maps are taken through 1x1 squeeze layers to reduce the number of channels, and then through crop+resize layers. The crop+resize layers get their cropping boxes by projecting 3D anchor boxes from an anchor generation layer. The crops from the image and the BEV are both 2D and of the same size, and hence can be fused by taking the element-wise mean. The dense layers following the fusion generate objectness and bounding box regression for 3D boxes, as is done by CNN layers for 2D boxes in the Faster-RCNN RPN. The NMS layer at the end of the “RPN” block reduces the number of ROIs by an order of magnitude.
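The crop, resize, and mean-fusion step can be sketched in NumPy as follows. This is a simplified illustration: it uses nearest-neighbour sampling, the boxes are given directly in pixel co-ordinates rather than projected from 3D anchors, and `crop_resize` and `fuse_mean` are hypothetical names.

```python
import numpy as np

def crop_resize(feature, box, out_size):
    """Crop box = (row1, col1, row2, col2) from feature (H, W, C) and resize
    it to (out_size, out_size, C) with nearest-neighbour sampling."""
    r1, c1, r2, c2 = box
    rows = np.linspace(r1, r2 - 1, out_size).round().astype(int)
    cols = np.linspace(c1, c2 - 1, out_size).round().astype(int)
    return feature[np.ix_(rows, cols)]

def fuse_mean(image_crop, bev_crop):
    """Element-wise mean of two equally sized crops."""
    return 0.5 * (image_crop + bev_crop)

# Feature maps of different sizes; the crops become comparable after resizing.
image_feat = np.random.default_rng(0).normal(size=(48, 160, 8))
bev_feat = np.random.default_rng(1).normal(size=(70, 80, 8))
fused = fuse_mean(crop_resize(image_feat, (10, 40, 30, 90), 7),
                  crop_resize(bev_feat, (20, 20, 36, 28), 7))
print(fused.shape)  # (7, 7, 8)
```

Because both crops are resized to the same grid, the mean is well defined even though the image and BEV boxes cover different numbers of pixels.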

The proposal ROIs from the “RPN” NMS are projected onto the image and BEV feature spaces to generate the respective crops. This is the start of the “RCNN” part. The projected boxes are used to crop+resize the respective features. These are again fused through an element-wise mean, then classified, regressed, and NMS’ed to generate the final classes and 3D bounding boxes.

Fig F: The AVOD architecture.
Fig G: AVOD Feature Extractor

On the KITTI 3D object detection data set they were able to get 81.94%/71.88%/66.38% mAP for easy/medium/hard objects (cars).

Fast and Furious [5]:

The Fast and Furious paper out of Uber does not address fusion of multiple sensor modalities; however, the authors do an interesting fusion of LIDAR data over the temporal domain. In addition, they address 3D object detection, tracking, and motion forecasting.

Their approach is a hybrid of voxelization and bird’s eye view. They first voxelize the space into 0.2m x 0.2m x 0.2m cubes. Then they treat the height dimension as the channel, Fig H, thus converting the 3D point cloud into a 2D BEV frame. They treat time as the 3rd dimension by stacking the 2D frames for 5 consecutive time instances.
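A minimal NumPy sketch of this representation follows. The binary occupancy encoding, the spatial ranges, and the name `bev_occupancy` are assumptions for illustration; only the 0.2m voxel size and the 5 stacked frames come from the description above.

```python
import numpy as np

def bev_occupancy(points_xyz, x_range, y_range, z_range, voxel):
    """Return a grid of shape (X, Y, Z) where the Z axis acts as the channel."""
    ranges = (x_range, y_range, z_range)
    shape = [int(round((hi - lo) / voxel)) for lo, hi in ranges]
    origin = np.array([lo for lo, _ in ranges])
    idx = np.floor((points_xyz - origin) / voxel).astype(int)
    inside = np.all((idx >= 0) & (idx < shape), axis=1)   # drop out-of-range points
    grid = np.zeros(shape, dtype=np.float32)
    grid[tuple(idx[inside].T)] = 1.0                      # mark occupied voxels
    return grid

rng = np.random.default_rng(0)
frames = [bev_occupancy(rng.uniform(-1.0, 9.0, size=(200, 3)),
                        (0, 8), (0, 8), (0, 2), voxel=0.2)
          for _ in range(5)]                              # 5 consecutive sweeps
stacked = np.stack(frames)                                # time is the leading axis
print(stacked.shape)  # (5, 40, 40, 10)
```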

Fig H: Voxelization in Fast and Furious.

They studied 2 ways of handling the temporal information, which they call Early and Late Fusion, shown in Fig I. In Early Fusion, temporal information is aggregated in the first layer with a 1D convolution that reduces the temporal dimension from n to 1. This amounts to producing a single point cloud from the n temporal frames, and lacks the ability to distinguish complex temporal features.

Fig I: Early and Late fusion of temporal point cloud.

In Late Fusion, the authors use 2 layers of 3D convolution (e.g. kernel 3x3x3). No padding is used in the temporal dimension, hence after 2 layers it collapses from n to 1. The following layers perform 2D convolution (e.g. kernel 3x3).
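The difference between the two schemes along the time axis can be sketched with a convolution that uses no temporal padding. The NumPy example below is a simplification: it keeps only the temporal part of the kernel (the spatial 3x3 part is omitted), uses averaging weights in place of learned ones, and assumes n = 5; `temporal_conv_valid` is a hypothetical name.

```python
import numpy as np

def temporal_conv_valid(frames, kernel):
    """Convolve along axis 0 (time) with no padding; frames has shape (T, X, Y, C)."""
    k = len(kernel)
    steps = frames.shape[0] - k + 1
    return np.stack([np.tensordot(kernel, frames[t:t + k], axes=1)
                     for t in range(steps)])

frames = np.random.default_rng(0).random((5, 4, 4, 2))    # n = 5 stacked BEV frames

# Early Fusion: one temporal kernel of length n collapses time in the first layer.
early = temporal_conv_valid(frames, np.ones(5) / 5)

# Late Fusion: two layers with temporal kernel size 3 shrink time as 5 -> 3 -> 1.
mid = temporal_conv_valid(frames, np.ones(3) / 3)
late = temporal_conv_valid(mid, np.ones(3) / 3)

print(early.shape, mid.shape, late.shape)  # (1, 4, 4, 2) (3, 4, 4, 2) (1, 4, 4, 2)
```

Late Fusion keeps an intermediate temporal axis of length 3, so the second layer can still combine features that were computed over different time windows.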

The authors then add convolution layers to output bounding box predictions for the current time instance and n-1 time instances into the future, see Fig J. In addition, they add layers to output the probability of being a vehicle for the bounding boxes of the current time instance.

Fig J: Bounding box

The authors use SSD as the detection mechanism. They used 6 predefined boxes (5 meters in the real world with aspect ratios of 1:1, 1:2, 2:1, 1:6, and 6:1, and 8 meters with an aspect ratio of 1:1) at each point of the feature map.

At each time step, the model produces bounding boxes for all the n time instances, which lends the authors’ network its tracking capability. If there is significant overlap between a past predicted bounding box and a current bounding box, they are considered to be the same object, and the bounding boxes are averaged.
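The overlap test and the averaging can be sketched in plain Python. The axis-aligned (x1, y1, x2, y2) box format, the use of IoU as the overlap measure, the threshold of 0.5 for "significant overlap", and the function names are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def merge_if_same(past_prediction, current, threshold=0.5):
    """Average the two boxes when they overlap enough, else return None."""
    if iou(past_prediction, current) >= threshold:
        return tuple((p + c) / 2 for p, c in zip(past_prediction, current))
    return None

predicted_earlier = (0.0, 0.0, 4.0, 2.0)   # box forecast for "now" at a past time step
detected_now = (1.0, 0.0, 5.0, 2.0)
print(iou(predicted_earlier, detected_now))            # 0.6
print(merge_if_same(predicted_earlier, detected_now))  # (0.5, 0.0, 4.5, 2.0)
```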

The authors used custom data sets for training and testing. They were able to get 30 ms processing time at 83.10% mAP at IOU=0.7 on a system with 4 Titan XP GPUs.


  1. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection, https://arxiv.org/abs/1711.06396
  2. 3D Fully Convolutional Network for Vehicle Detection in Point Cloud, https://arxiv.org/abs/1611.08069
  3. Multi-View 3D Object Detection Network for Autonomous Driving, https://arxiv.org/abs/1611.07759
  4. Fusing Bird View LIDAR Point Cloud and Front View Camera Image for Deep Object Detection, https://arxiv.org/abs/1711.06703
  5. Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net, http://openaccess.thecvf.com/content_cvpr_2018/papers/Luo_Fast_and_Furious_CVPR_2018_paper.pdf
  6. PIXOR: Real-time 3D Object Detection from Point Clouds, http://openaccess.thecvf.com/content_cvpr_2018/papers/Yang_PIXOR_Real-Time_3D_CVPR_2018_paper.pdf
  7. Joint 3D Proposal Generation and Object Detection from View Aggregation, https://arxiv.org/abs/1712.02294, Aggregate View Object Detection (AVOD) code: https://github.com/kujason/avod
  8. SPLATNet: Sparse Lattice Networks for Point Cloud Processing, http://vis-www.cs.umass.edu/splatnet/
  9. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, https://arxiv.org/abs/1612.00593
  10. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, https://arxiv.org/abs/1706.02413, https://github.com/charlesq34/pointnet2
  11. PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation, https://arxiv.org/abs/1711.10871
  12. Frustum PointNets for 3D Object Detection from RGB-D Data, https://arxiv.org/abs/1711.08488, https://github.com/charlesq34/frustum-pointnets
  13. KITTI 3D object detection benchmark, http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d