PointPillars: Fast Encoders for Object Detection from Point Clouds — Brief

Mike Chan
3 min read · Jun 27, 2019

There are two main directions for object detection from lidar data:
1. Project the 3D point cloud to a 2D image
2. Extract features directly from the 3D point cloud

The first approach is fast and cheap at inference, but less precise. The second is more accurate, but inference is slow.

At present, several common point cloud object detection neural networks are as follows:

  1. MV3D (as M)
  2. AVOD (as A)
  3. ContFuse (as C)
  4. VoxelNet (as V)
  5. Frustum PointNet (as F)
  6. SECOND (as S)
  7. PIXOR++ (as P+)

When these networks are evaluated on the KITTI dataset, inference speed is usually below 30 Hz, and most of them (PIXOR++ excepted) even run below 20 Hz.

Accuracy also varies considerably. Ranked on the car category: ContFuse > PIXOR++ > AVOD ≈ Frustum PointNet > VoxelNet ≈ SECOND > MV3D.

The PointPillars method (abbreviated PP) introduced in this paper performs remarkably well. Its accuracy on the car category is roughly on par with ContFuse, while its inference speed reaches 60 Hz, far higher than the other networks. This makes near-real-time object detection on point cloud data practical, which in turn greatly advances the prospect of faster unmanned vehicles.

The principle is as follows:

As the architecture figure shows, PP is composed of three main parts:
1. Pillar Feature Net -> 2. Backbone (2D CNN) -> 3. Detection Head (SSD)
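
To make the data flow concrete, here is a minimal PyTorch skeleton of that pipeline. The class and argument names are my own for illustration, not from the official implementation.

```python
import torch.nn as nn

class PointPillars(nn.Module):
    """Illustrative composition of the three PP stages."""
    def __init__(self, pillar_net, backbone, head):
        super().__init__()
        self.pillar_net = pillar_net  # points -> (C, H, W) pseudo-image
        self.backbone = backbone      # 2D CNN over the pseudo-image
        self.head = head              # SSD-style class/box predictions

    def forward(self, points, coords, grid_hw):
        pseudo_image = self.pillar_net(points, coords, grid_hw)
        features = self.backbone(pseudo_image)
        return self.head(features)
```

Sketches of each stage follow in the sections below.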

1. Pillar Feature Net
The Pillar Feature Net first looks at the point cloud from the top down and discretizes it into a grid on the x-y plane, forming pillars, the basic unit of PP's feature encoding. Each point in a pillar is then decorated into a D = 9 dimensional vector: [x, y, z, r, xc, yc, zc, xp, yp]
where:
x, y, z, r are the point's coordinates and reflectance
xc, yc, zc are the point's offsets from the arithmetic mean of all points in the pillar
xp, yp are the point's offsets from the pillar's x, y center

These decorated points are then stacked into a dense (D, P, N) tensor,
where:
D = 9 is the per-point feature dimension
P is the number of (non-empty) pillars in the sample
N is the maximum number of points per pillar
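
As an illustration, here is a hedged NumPy sketch of this grouping and decoration step. The grid extents, pillar size, and the caps max_pillars and max_points are example values of my own choosing, not necessarily the paper's settings.

```python
import numpy as np
from collections import defaultdict

def build_pillar_tensor(points, x_range=(0.0, 70.4), y_range=(-40.0, 40.0),
                        pillar_size=0.16, max_pillars=12000, max_points=100):
    """Group lidar points (x, y, z, reflectance) into x-y pillars and build
    the dense (D=9, P, N) tensor of decorated point features."""
    # Keep only points that fall inside the grid.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    points = points[mask]

    # Assign each point to a pillar (one cell of the x-y grid).
    ix = ((points[:, 0] - x_range[0]) // pillar_size).astype(int)
    iy = ((points[:, 1] - y_range[0]) // pillar_size).astype(int)
    pillars = defaultdict(list)
    for point, gx, gy in zip(points, ix, iy):
        pillars[(gx, gy)].append(point)

    tensor = np.zeros((9, max_pillars, max_points), dtype=np.float32)
    coords = np.zeros((max_pillars, 2), dtype=np.int64)  # grid cell of each pillar
    for pi, ((gx, gy), pts) in enumerate(list(pillars.items())[:max_pillars]):
        pts = np.stack(pts[:max_points])              # (n, 4): x, y, z, r
        mean = pts[:, :3].mean(axis=0)                # arithmetic mean of the pillar's points
        center = np.array([x_range[0] + (gx + 0.5) * pillar_size,
                           y_range[0] + (gy + 0.5) * pillar_size])
        n = len(pts)
        tensor[:4, pi, :n] = pts.T                    # x, y, z, r
        tensor[4:7, pi, :n] = (pts[:, :3] - mean).T   # xc, yc, zc
        tensor[7:9, pi, :n] = (pts[:, :2] - center).T # xp, yp
        coords[pi] = (gx, gy)
    return tensor, coords
```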

Next, a simplified PointNet (a linear layer with Batch-Norm and ReLU) maps this tensor to (C, P, N); a max operation over the N points of each pillar reduces it to (C, P); finally, the pillar features are scattered back to their grid locations to produce a (C, H, W) pseudo-image, i.e. a 2D feature map of the point cloud.
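
A hedged PyTorch sketch of this encoder, reconstructed from the description above (the output channel count C = 64 is an assumption):

```python
import torch
import torch.nn as nn

class PillarFeatureNet(nn.Module):
    """Simplified PointNet: per-point linear + BatchNorm + ReLU, then a max
    over each pillar's points, then scatter back to a 2D canvas."""
    def __init__(self, in_dim=9, out_dim=64):
        super().__init__()
        # A linear layer applied to every point is a 1x1 convolution over (P, N).
        self.conv = nn.Conv2d(in_dim, out_dim, kernel_size=1)
        self.bn = nn.BatchNorm2d(out_dim)

    def forward(self, x, coords, grid_hw):
        # x: (B, D, P, N) decorated points; coords: (B, P, 2) pillar grid cells.
        x = torch.relu(self.bn(self.conv(x)))  # (B, C, P, N)
        x = x.max(dim=3).values                # max over points -> (B, C, P)
        B, C, P = x.shape
        H, W = grid_hw
        canvas = x.new_zeros(B, C, H, W)
        for b in range(B):
            # Scatter each pillar's feature to its (row, col) = (y, x) cell.
            # Note: padded (empty) pillar slots should be masked out in practice.
            canvas[b, :, coords[b, :, 1], coords[b, :, 0]] = x[b]
        return canvas
```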

2. Backbone
The backbone is a standard 2D CNN (refer to the figure): a top-down network extracts features at progressively smaller spatial resolutions, and the multi-scale features are then upsampled to a common resolution and concatenated.
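
Based on that description, a hedged sketch of the backbone; the channel counts, strides, and block depths here are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, stride, n_convs=3):
    """One top-down stage: a downsampling conv followed by stride-1 convs."""
    layers = [nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1),
              nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
    for _ in range(n_convs - 1):
        layers += [nn.Conv2d(c_out, c_out, 3, padding=1),
                   nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class Backbone(nn.Module):
    """Top-down 2D CNN over the pseudo-image with upsampled, concatenated outputs."""
    def __init__(self, c_in=64):
        super().__init__()
        self.down1 = conv_block(c_in, 64, stride=2)   # 1/2 resolution
        self.down2 = conv_block(64, 128, stride=2)    # 1/4 resolution
        self.down3 = conv_block(128, 256, stride=2)   # 1/8 resolution
        # Upsample every scale back to 1/2 resolution before concatenating.
        self.up1 = nn.Sequential(nn.ConvTranspose2d(64, 128, 1, stride=1),
                                 nn.BatchNorm2d(128), nn.ReLU(inplace=True))
        self.up2 = nn.Sequential(nn.ConvTranspose2d(128, 128, 2, stride=2),
                                 nn.BatchNorm2d(128), nn.ReLU(inplace=True))
        self.up3 = nn.Sequential(nn.ConvTranspose2d(256, 128, 4, stride=4),
                                 nn.BatchNorm2d(128), nn.ReLU(inplace=True))

    def forward(self, x):
        x1 = self.down1(x)
        x2 = self.down2(x1)
        x3 = self.down3(x2)
        return torch.cat([self.up1(x1), self.up2(x2), self.up3(x3)], dim=1)
```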

3. Detection Head
PP uses a Single Shot Detector (SSD) head for the bounding-box predictions.
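
A hedged sketch of such a head: parallel 1x1 convolutions predict, for each anchor at each cell, class scores, a 7-value box (x, y, z, w, l, h, theta), and a heading direction bin (the direction classifier comes from SECOND). The anchor and class counts here are illustrative.

```python
import torch.nn as nn

class SSDHead(nn.Module):
    """Per-cell anchor predictions over the backbone feature map."""
    def __init__(self, c_in=384, n_anchors=2, n_classes=1):
        super().__init__()
        self.cls = nn.Conv2d(c_in, n_anchors * n_classes, 1)  # class logits
        self.box = nn.Conv2d(c_in, n_anchors * 7, 1)          # (x, y, z, w, l, h, theta)
        self.dir = nn.Conv2d(c_in, n_anchors * 2, 1)          # heading direction bins

    def forward(self, x):
        return self.cls(x), self.box(x), self.dir(x)
```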
