Accelerated Lidar-Based 3D Object Detection on Texas Instruments’ TDA4 Processors

Deepak Poddar
8 min read · Mar 7, 2022


Introduction

Object detection is one of the most important problems in the computer vision domain. It has become a must-have feature in many applications such as Advanced Driver Assistance Systems (ADAS) and autonomous driving (AD). Compared to 2D object detection, 3D object detection provides additional information such as the object's exact location in the 3D world, its height, and its orientation. This precise information in turn provides a better understanding of the scene around the ego vehicle in the driving scenario and increases overall safety. Accurate 3D object detection needs a point cloud as input, which can be generated by a Lidar sensor mounted on the vehicle.

PointPillars [1] is one of the most popular Convolutional Neural Network (CNN) based techniques for 3D object detection from a point cloud. PointPillars provides good 3D object detection accuracy with reasonable usage of computational resources. However, it poses a few challenges before it can be implemented efficiently on embedded platforms. For optimal execution on an embedded platform, it is important to represent the whole network as a single graph. However, an out-of-the-box PointPillars network in a training framework (e.g. mmdetection3d [2]), when exported to ONNX, creates multiple sub-graphs, which are inefficient because data must be handled at the boundary of each sub-graph.

To tackle this problem, we provide an enhanced training framework based on mmdetection3d [2] that can export a unified ONNX model which executes efficiently with the TI Deep Learning Library (TIDL) [4] on the TDA4x series of processors. This framework can also be used to train the PointPillars network on a custom dataset. Users can evaluate the compute performance on TI Edge AI Cloud [6] and test the algorithm accuracy on the KITTI dataset using the TI Edge-AI-Benchmark tool [7]. The efficient implementation of the PointPillars CNN model on the TI TDA4 SoC (Figure 1) is discussed in the next section.

Figure 1: TDA4 SOC Block Diagram

Implementation of PointPillars on TDA4

The PointPillars network compute blocks are shown in Figure 2. The Lidar sensor captures points in a 360-degree field of view; however, only the points in the front camera view are useful for detecting objects in that field of view (FoV). Hence the user may prune out the points that are not in front of the driving view or not within the range of interest. The range of interest is a user-configurable parameter and can be altered at inference time. The remaining point cloud is then segregated into vertical grids called voxels, as shown in Figure 3. The voxel size is 16 cm x 16 cm in bird’s eye view (BEV) by default; it is user-configurable but must match between training and inference. Voxels that do not contain any 3D point are discarded from the computation. After voxelization the data shape is D x N x P, where P is the total number of non-empty voxels (always less than or equal to the BEV resolution W x H), N is the maximum number of points allowed in a voxel, and D is the number of features per Lidar point (typically 9 or 10). After the PointNet convolution operations the data volume changes from D x N x P to C x N x P; in our use case C is 64.

Figure 2: PointPillars [1] Building Blocks. Picture Source Credit to [1]
Figure 3: Voxelization (source: https://openaccess.thecvf.com/content_cvpr_2018/papers/Luo_Fast_and_Furious_CVPR_2018_paper.pdf)
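To make the D x N x P layout concrete, here is a minimal NumPy sketch of the pruning, voxelization, and per-point feature decoration described above. It is an illustrative reimplementation only, not TI's or mmdetection3d's code; the range, voxel size, and point caps are representative values, and the pillar-mean offsets of the 9-feature decoration are left at zero for brevity.

```python
import numpy as np

def voxelize(points, x_range=(0.0, 69.12), y_range=(-39.68, 39.68),
             voxel_size=0.16, max_points=32, max_voxels=10000):
    """Illustrative pillar creation; not TI's production implementation.

    points: (M, 4) array of x, y, z, reflectance.
    Returns decorated features of shape (D, N, P) and the (P, 2) BEV cell
    index of every non-empty pillar.
    """
    # Keep only the points in the front view / range of interest.
    keep = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    points = points[keep]

    # Assign every remaining point to a 16 cm x 16 cm BEV cell.
    ix = ((points[:, 0] - x_range[0]) / voxel_size).astype(np.int32)
    iy = ((points[:, 1] - y_range[0]) / voxel_size).astype(np.int32)
    cells, inverse = np.unique(np.stack([ix, iy], axis=1),
                               axis=0, return_inverse=True)
    cells = cells[:max_voxels]            # drop overflow pillars for brevity

    D, N, P = 9, max_points, len(cells)   # feature dim, points/pillar, pillars
    features = np.zeros((D, N, P), dtype=np.float32)
    counts = np.zeros(P, dtype=np.int32)
    for p_idx, v_idx in enumerate(inverse):
        if v_idx >= P or counts[v_idx] >= N:
            continue                      # pillar dropped or already full
        x, y, z, r = points[p_idx]
        # Offsets to the pillar center; offsets to the pillar mean are left
        # as zeros here to keep the sketch short.
        cx = x_range[0] + (cells[v_idx, 0] + 0.5) * voxel_size
        cy = y_range[0] + (cells[v_idx, 1] + 0.5) * voxel_size
        features[:, counts[v_idx], v_idx] = [x, y, z, r, 0, 0, 0, x - cx, y - cy]
        counts[v_idx] += 1
    return features, cells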

On the output of the PointNet, a maximum operation is performed on each feature over all the points (across the N dimension) in a given voxel; hence the data size after the voxel encoder is C x 1 x P, where C is the number of channels, 64 in this case. This is referred to as the learned features in Figure 2. The features of size C x P are then scattered over the BEV pseudo-image at the corresponding locations. One point in the BEV pseudo-image represents one voxel. After pseudo-image creation, the 3D-space problem is converted into an image-like representation of size C x W x H.

This pseudo-image creation process, shown in Figure 4, is not an embedded-friendly operation. Because the data movement is sparse rather than dense, it rules out Single Instruction Multiple Data (SIMD) processing or any other fast data-movement scheme: consecutive bytes in the source are not contiguous in the destination memory. To address this, a specialized implementation of the scatter operation was developed, available as the TIDL_ScatterElements layer in the TI Deep Learning Library [4]. This accelerated layer connects the two otherwise disjoint convolution networks, the one before the pseudo-image creation block and the backbone network after it, shown as the two dashed blocks in Figure 2.

Figure 4: Bird’s Eye View Scatter
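Functionally, the scatter step amounts to the small NumPy sketch below: each pillar's C-dimensional learned feature is written to one (x, y) location of an otherwise zero BEV pseudo-image. The sketch is only meant to show why the memory access pattern is sparse; the TIDL_ScatterElements layer implements this efficiently on the C7x/MMA.

```python
import numpy as np

def scatter_to_bev(pillar_features, pillar_indices, bev_w=432, bev_h=496):
    """Illustrative BEV scatter (functional equivalent of the scatter block).

    pillar_features: (C, P) learned features after the max over N points.
    pillar_indices:  (P, 2) integer BEV cell coordinates (x_idx, y_idx).
    Returns a dense pseudo-image of shape (C, H, W).
    """
    C, P = pillar_features.shape
    pseudo_image = np.zeros((C, bev_h, bev_w), dtype=pillar_features.dtype)
    # Sparse writes: consecutive pillars generally land at non-contiguous
    # destination addresses, which is why plain SIMD copies do not apply.
    pseudo_image[:, pillar_indices[:, 1], pillar_indices[:, 0]] = pillar_features
    return pseudo_image

# Usage: max over the N points of each pillar, then scatter.
#   learned = features.max(axis=1)          # (C, N, P) -> (C, P)
#   bev = scatter_to_bev(learned, indices)  # (C, 496, 432)
```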

Typically, output data is stored in external memory such as SDRAM/DDR at the end of a network execution. If a task is composed of multiple networks, output data is written to DDR at the end of each one. This can be avoided if the multiple networks are unified into a single network, as is done in TI’s PointPillars solution. Executing the PointPillars network end to end on the C7x/MMA processor reduces the data transfer to and from DDR and hence achieves significant acceleration of 3D object detection on TDA4-based SoCs.

How to Use TI’s PointPillars?

TI provides a cloud infrastructure where a user can remotely connect to a TDA4 EVM and run models from TI’s model zoo, or other similar models, on real hardware. It is called TI Edge AI Cloud and can be accessed at [6]; TI’s PointPillars model can also be tested there. On the welcome page of TI Edge AI Cloud, select ‘3D Object Detection’ with the ONNX runtime environment, as shown in Figure 5. In this example the preprocessing step, i.e. computing the stacked pillars and pillar indices (as shown in Figure 2), happens on the host machine in Python code; however, reference C APIs for the same are available as part of the TIADALG library [8], a collection of APIs for non-deep-learning modules. The APIs are named ‘tiadalg_voxelization_cn’ and ‘tiadalg_voxel_feature_compute_cn’ and are located in the <psdkra_rel/tiadalg/tiadalg_voxelization> folder of the PSDK-RA (Processor SDK RTOS) release [12]. PSDK-RA can be used together with either Processor SDK Linux (PSDK Linux) or Processor SDK QNX (PSDK QNX) to form a multi-processor software development platform for TDA4VM and DRA829 SoCs within TI’s Jacinto™ platform. Interface information for these APIs is available at [8]. The output of these APIs acts as input to TIDL executing on the C7x/MMA deep learning core.

Figure 5: TI Edge AI Cloud Selection for 3D OD

After launching and reserving an EVM, the user can execute the cells in the Jupyter notebook for step-by-step processing. The core processing cell is shown and briefly described in Figure 6.

Figure 6: 3D OD Core Processing in Jupyter Notebook
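For reference, the core inference call boils down to running the unified ONNX model through ONNX Runtime with TIDL offload. The sketch below is an approximation under assumptions: the ‘TIDLExecutionProvider’ name and the ‘artifacts_folder’ option follow the conventions of TI’s edgeai-tidl-tools examples, while the file names, input shapes/order, and the voxelize() helper (sketched earlier in this article) are placeholders rather than the actual notebook contents.

```python
import numpy as np
import onnxruntime as ort

# Assumed names: provider and delegate options follow edgeai-tidl-tools
# conventions; model/artifacts paths are placeholders.
delegate_options = {"artifacts_folder": "./model-artifacts/pointpillars"}
session = ort.InferenceSession(
    "pointpillars_kitti.onnx",
    providers=["TIDLExecutionProvider", "CPUExecutionProvider"],
    provider_options=[delegate_options, {}],
)

# Host-side preprocessing (stacked pillars + pillar indices), cf. Figure 2.
points = np.fromfile("000001.bin", dtype=np.float32).reshape(-1, 4)
stacked_pillars, pillar_indices = voxelize(points)

# Feed the two preprocessed tensors; input names/shapes are assumptions.
input_names = [i.name for i in session.get_inputs()]
outputs = session.run(None, {
    input_names[0]: stacked_pillars[np.newaxis, ...].astype(np.float32),
    input_names[1]: pillar_indices[np.newaxis, ...].astype(np.int32),
})
print("3D detection output shape:", outputs[0].shape)
```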

Algorithm Accuracy:

The deep learning accelerator on the TDA4 SoC supports multiple fixed-point precisions (8b/16b/32b). 8b quantization of features and weights is desired for the best compute efficiency; however, it has been observed that 8b quantization is not sufficient to get good accuracy for the PointPillars network. Hence, for a better trade-off between accuracy and speed, the Lidar input data and the last convolution layers need to be in 16b. TIDL supports layer-wise bit precision selection (8b or 16b) for activations and weights. This advanced feature helps reach 16b accuracy by selecting only the first and last layers in 16b precision and keeping the remaining layers in 8b. Table 1 below depicts the accuracy of different configurations on the KITTI 3D OD dataset [3].

Table 1: Algorithm Accuracy
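As an illustration of the layer-wise precision selection described above, the hedged sketch below shows how the first and last layers could be kept in 16b while the rest of the network stays at 8b. The option keys follow the runtime options documented for TI’s edgeai-tidl-tools/edgeai-benchmark, and the tensor names are placeholders, not the actual layer names of the released PointPillars model; consult the TIDL documentation [4] for your SDK version.

```python
# Hedged sketch of TIDL compilation (delegate) options for mixed precision.
# Option keys follow edgeai-tidl-tools conventions and may differ per SDK
# version; tensor names below are placeholders.
compile_options = {
    "artifacts_folder": "./model-artifacts/pointpillars",
    "tensor_bits": 8,  # default precision for the whole network
    # Keep the activations of the first and last convolution stages in 16b.
    "advanced_options:output_feature_16bit_names_list":
        "voxel_encoder_conv_out, detection_head_conv_out",
    # Keep the corresponding weights in 16b as well.
    "advanced_options:params_16bit_names_list":
        "voxel_encoder_conv_out, detection_head_conv_out",
}
# This dictionary would typically be passed as provider options when
# compiling the model (e.g. with the TIDL compilation provider).
```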

From Table 1 it can be observed that the accuracy at 50% IOU does not degrade even for 8b inference, which may be sufficient for many use cases. A sample output is shown in Figure 7: Sample 3D output.

Figure 7: Sample 3D output

Compute Performance:

Table 2 presents the compute performance of PointPillars-based 3D object detection for 8-bit precision and mixed precision. The maximum number of non-empty voxels is configured as 10,000, and the BEV pseudo-image resolution is 496 x 432. Increasing the maximum number of non-empty voxels does not increase the compute time in the same proportion, as it affects only the first few layers of the network. After the BEV scatter layer, the complexity (and hence latency) of each convolution layer is proportional to the BEV pseudo-image resolution.

Table 2: PointPillars compute performance
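As a rough illustration of this scaling behavior, the back-of-the-envelope computation below estimates the MAC count of a single backbone convolution on the pseudo-image. The layer shapes are representative assumptions, not the exact shapes of the released model.

```python
# Rough MAC count of one backbone convolution on the BEV pseudo-image.
# Representative numbers, not the exact layer shapes of the released model.
H, W = 496, 432            # BEV pseudo-image resolution
C_in, C_out, k = 64, 64, 3 # channels and kernel size

macs = H * W * C_in * C_out * k * k
print(f"{macs / 1e9:.1f} GMAC per 3x3 conv layer")  # ~7.9 GMAC

# Doubling the number of non-empty pillars leaves H x W unchanged, so every
# layer after the BEV scatter costs the same; only the small PointNet stage
# before the scatter grows with the pillar count.
```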

PointPillars Training for a Custom Dataset:

A PyTorch-based training framework forked from https://github.com/open-mmlab/mmdetection3d will be made available, and information on it will be provided at [9]. One can use this framework to train the PointPillars network on one’s own dataset and export an ONNX model which works directly with TIDL. A pre-trained model (on the KITTI dataset [3]) and the related meta-architecture information can be downloaded from [9].
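After export, it is worth confirming that the model really is a single unified graph containing the scatter operation, since that is what allows TIDL to run it end to end. A minimal check with the onnx Python package might look like the sketch below (the model file name is a placeholder).

```python
import onnx

# Load the exported model (file name is a placeholder) and confirm that the
# BEV scatter is embedded as a ScatterElements node inside one single graph.
model = onnx.load("pointpillars_kitti.onnx")
onnx.checker.check_model(model)

op_types = [node.op_type for node in model.graph.node]
assert "ScatterElements" in op_types, "BEV scatter not found in the graph"
print(f"Unified graph with {len(op_types)} nodes, "
      f"{op_types.count('ScatterElements')} ScatterElements node(s)")
```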

For TDA4 product and overall software information, refer to [10], [11], [12], [13]. For any technical help on this topic, please post your query at [14].

Other Important Links

(1) PointPillars: Fast Encoders for Object Detection from Point Clouds. https://arxiv.org/pdf/1812.05784.pdf

(2) MMdetection3d: https://github.com/open-mmlab/mmdetection3d

(3) KITTI 3D Object Detection Dataset: http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d

(4) TI Deep Learning Library: https://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/psdk_rtos/docs/user_guide/sdk_components_j721e.html#ti-deep-learning-product-tidl

(5) Jacinto AI Model Zoo: https://git.ti.com/cgit/jacinto-ai/jacinto-ai-modelzoo/about/

(6) TI Edge-AI cloud: https://dev.ti.com/edgeai/

(7) TI Edge-AI-Benchmark: https://github.com/TexasInstruments/edgeai-benchmark

(8) TIADALG: https://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/tiadalg/docs/user_guide_html/tiadalg__voxelization_8h.html

(9) ONNX Model & Training Code: https://github.com/TexasInstruments/edgeai-modelzoo/tree/master/models/vision/detection_3d

(10) TDA4VM product information: https://www.ti.com/product/TDA4VM

(11) Evaluation Board Purchase/Request

a. Common Board: https://www.ti.com/tool/J721EXCPXEVM

b. TDA4VM Processor SOM: https://www.ti.com/tool/J721EXSOMXEVM

(12) SDK Download Link: https://www.ti.com/tool/PROCESSOR-SDK-DRA8X-TDA4X

(13) Jacinto 7 Video Training series: https://training.ti.com/jacinto7-platform

(14) Technical Support from E2E Processor Forum: https://e2e.ti.com/support/processors/f/791
