part.1: Didi Udacity Challenge 2017 — Car and pedestrian Detection using Lidar and RGB

Mar 18, 2017

The code for part.1 is at:

I would like to keep a diary of the development process for my entry to the Didi Udacity Challenge 2017 at

As an engineer, I always abide by the rule: “Love the problem and not the technology”. After understanding the problem and its issues, the first step is to set up a baseline system in the shortest possible time. I chose to implement the latest CVPR 2017 paper [1]. Other relevant papers that I have found are [2], [3], [4].

According to [1], the basic network structure is as follows:

I begin with the 3D proposal net, which is essentially a modified Faster R-CNN net.

[A. Pre-processing]
During development, visualization is very important. It helps to ensure that the code implementation and mathematical formulation are correct. I first convert a rectangular region of the lidar 3D point cloud into a multi-channel top-view image. I use the KITTI dataset for my initial development (2011_09_26_drive_0005_sync).

Paper [1] has a rather good processing method for this conversion. The left image below shows the top view of the lidar point cloud (with the target ground-truth box in yellow). The right image shows the pre-processed result after conversion. Note that the object “stands out” well from the background.

Here is the same pre-processed image broken into its 8 channels.
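To make the conversion more concrete, here is a minimal sketch in plain Python. It assumes a channel layout in the spirit of [1]: several height-slice channels, one intensity channel and one density channel. The function name `lidar_to_top_view`, the ranges, the 0.1 m cell size and the 6 + 1 + 1 channel split are illustrative assumptions, not necessarily what the actual code uses.

```python
import math

def lidar_to_top_view(points, x_range=(0.0, 40.0), y_range=(-20.0, 20.0),
                      z_range=(-2.0, 0.4), res=0.1, n_slices=6):
    """Rasterize (x, y, z, intensity) lidar points into a top-view image.

    Returns a nested list indexed as image[row][col][channel] with
    n_slices height channels, then one intensity and one density channel.
    All ranges and sizes are illustrative assumptions.
    """
    rows = int(round((x_range[1] - x_range[0]) / res))
    cols = int(round((y_range[1] - y_range[0]) / res))
    n_ch = n_slices + 2
    image = [[[0.0] * n_ch for _ in range(cols)] for _ in range(rows)]
    counts = [[0] * cols for _ in range(rows)]
    top_z = [[-math.inf] * cols for _ in range(rows)]
    slice_h = (z_range[1] - z_range[0]) / n_slices

    for x, y, z, intensity in points:
        # Skip points outside the rectangular region of interest.
        if not (x_range[0] <= x < x_range[1] and
                y_range[0] <= y < y_range[1] and
                z_range[0] <= z < z_range[1]):
            continue
        r = min(int((x - x_range[0]) / res), rows - 1)
        c = min(int((y - y_range[0]) / res), cols - 1)
        s = min(int((z - z_range[0]) / slice_h), n_slices - 1)
        cell = image[r][c]
        # Height channel: tallest point (measured from z_min) in this slice.
        cell[s] = max(cell[s], z - z_range[0])
        # Intensity channel: reflectance of the highest point in the cell.
        if z > top_z[r][c]:
            top_z[r][c] = z
            cell[n_slices] = intensity
        counts[r][c] += 1

    # Density channel: log-normalized point count, capped at 1.
    for r in range(rows):
        for c in range(cols):
            if counts[r][c]:
                image[r][c][n_slices + 1] = min(
                    1.0, math.log(counts[r][c] + 1) / math.log(64))
    return image
```

The explicit loops keep the logic easy to check against a visualization. A real pipeline would most likely vectorize the binning with NumPy, since a single lidar scan contains a large number of points.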

[B. Training Net]
To make sure that my code is bug-free, I try to overfit a single training image. The loss should decrease to almost zero in this case, giving rise to “perfect” results.
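The idea behind this sanity check can be shown on a toy problem. The sketch below is not the detection network: it fits a tiny linear model to one sample with plain gradient descent, and the function name `overfit_single_sample`, the learning rate and the step count are assumptions chosen for illustration. If the gradient and update code are correct, the loss on that one sample should collapse towards zero.

```python
def overfit_single_sample(x, target, lr=0.05, steps=200):
    """Fit y = w . x + b to ONE sample with plain gradient descent.

    Returns the squared-error loss recorded at every step.
    """
    w = [0.0] * len(x)
    b = 0.0
    losses = []
    for _ in range(steps):
        pred = sum(wi * xi for wi, xi in zip(w, x)) + b
        err = pred - target
        losses.append(err * err)
        # d(err^2)/dw_i = 2 * err * x_i and d(err^2)/db = 2 * err.
        w = [wi - lr * 2.0 * err * xi for wi, xi in zip(w, x)]
        b -= lr * 2.0 * err
    return losses

losses = overfit_single_sample([0.5, -1.0, 2.0], target=3.0)
print(losses[0], losses[-1])
```

A loss that stalls well above zero on a single sample usually points to a bug in the loss, the targets or the update step rather than to a lack of model capacity.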

In this work, I learned to use tf.py_func() to create a customized TensorFlow layer. For example, in the generation of positive (red) and negative (gray) anchor-box training samples for the 3D proposal net:
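A minimal sketch of such a labeling step is given here in plain Python. It assumes axis-aligned `(x1, y1, x2, y2)` boxes and the IoU thresholds of the original Faster R-CNN paper (0.7 for positives, 0.3 for negatives). The names `iou` and `label_anchors` and the thresholds are assumptions for illustration and may differ from the actual code. A function of this kind, rewritten to operate on NumPy arrays, is the sort of Python code that tf.py_func() can wrap as a graph op.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return one label per anchor: 1 = positive, 0 = negative, -1 = ignored."""
    labels = []
    for anchor in anchors:
        # Best overlap of this anchor with any ground-truth box.
        best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
        if best >= pos_thresh:
            labels.append(1)
        elif best < neg_thresh:
            labels.append(0)
        else:
            labels.append(-1)  # ambiguous overlap, excluded from the loss
    return labels
```

The sketch leaves out one rule from Faster R-CNN: the anchor with the highest IoU for each ground-truth box is also marked positive, so that every object gets at least one positive sample.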

The “perfect” results on my single training image: yellow marks the anchor boxes detected as containing the target object after classification, and red marks the predicted object size after 2D-box regression.
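For reference, 2D-box regression in Faster R-CNN predicts offsets relative to an anchor rather than absolute coordinates. The sketch below shows that standard parameterization for boxes given as `(cx, cy, w, h)`. The function names `encode_box` and `decode_box` are hypothetical, and the actual code may use a different box format.

```python
import math

def encode_box(anchor, gt):
    """Regression targets (tx, ty, tw, th) that map an anchor onto a box."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    return ((gx - ax) / aw,        # center shift in units of anchor width
            (gy - ay) / ah,        # center shift in units of anchor height
            math.log(gw / aw),     # log-scale change in width
            math.log(gh / ah))     # log-scale change in height

def decode_box(anchor, deltas):
    """Invert encode_box: apply predicted deltas to an anchor."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = deltas
    return (ax + tx * aw, ay + ty * ah, aw * math.exp(tw), ah * math.exp(th))
```

When overfitting a single image, decoding the network output for a positive anchor should reproduce the ground-truth box almost exactly, which is a quick way to check both functions against the visualization.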

The score map of the classification for each of the 9 anchor box types:
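A common way to arrive at 9 anchor types is the Faster R-CNN convention of 3 scales × 3 aspect ratios. The sketch below assumes that convention. The base size, scales and ratios are placeholder values, and the name `make_anchor_shapes` is hypothetical, so the anchors used in the actual code may be defined differently.

```python
import math

def make_anchor_shapes(base_size=16.0, scales=(1.0, 2.0, 4.0),
                       ratios=(0.5, 1.0, 2.0)):
    """Return (width, height) for every scale/ratio pair.

    ratio is height / width; all shapes at one scale share the same area.
    """
    shapes = []
    for scale in scales:
        area = (base_size * scale) ** 2
        for ratio in ratios:
            w = math.sqrt(area / ratio)
            h = w * ratio
            shapes.append((w, h))
    return shapes

for w, h in make_anchor_shapes():
    print(round(w, 1), round(h, 1))
```

Each of the 9 shapes is placed at every position of the feature map, which is why the classifier produces one score map per anchor type.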

[C. Conclusion]
Just from these 2 simple steps, I observed the following possible issues:

  1. Small object detection. This is a well-known weakness of the original plain Faster R-CNN net.
  2. Creation of the 2D top-view image could be slow. There are quite a number of 3D points that need to be processed.

Now that I am sure that the implementation is correct, the next step will be to start training with the actual dataset, which contains many images.


[1] “Multi-View 3D Object Detection Network for Autonomous Driving”, Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, Tian Xia, CVPR 2017

[2] “Vote3Deep: Fast Object Detection in 3D Point Clouds Using Efficient Convolutional Neural Networks”, Martin Engelcke, Dushyant Rao, arXiv Mar-2017

[3] “Vehicle Detection from 3D Lidar Using Fully Convolutional Network”, Bo Li, Tianlei Zhang and Tian Xia, arXiv 2016

[4] “3D Fully Convolutional Network for Vehicle Detection in Point Cloud”, Bo Li, arXiv 2016