An In-Depth Look at PointNet

Apr 12, 2019

PointNet [1] is a seminal paper in 3D perception, applying deep learning to point clouds for object classification and part/scene semantic segmentation. The paper has been re-implemented in TensorFlow 2.0, and the code can be found at github.com/luis-gonzales/pointnet_own.

Input Data
Architecture
Permutation Invariance
Transformation Invariance
Analysis and Visualization
TensorFlow 2.0 Implementation

Input Data

Fig. 1: Point cloud visualization (source)

PointNet takes raw point cloud data as input, which is typically collected by a lidar or radar sensor. Unlike 2D pixel arrays (images) or 3D voxel arrays, a point cloud is unstructured: the data is simply a collection (more specifically, a set) of the points captured during a sensor scan. To leverage existing techniques built around 2D and 3D convolutions, many researchers and practitioners discretize a point cloud by taking multi-view projections onto 2D space or by quantizing it into 3D voxels. Because both approaches alter the original data, they can have negative effects: projections discard part of the 3D structure, and voxelization introduces quantization artifacts while inflating the data volume.
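To make the cost of discretization concrete, here is a minimal NumPy sketch of voxel quantization. The point coordinates and the `voxel_size` value are made up for illustration; they are not taken from the paper or the linked repository.

```python
import numpy as np

# Toy point cloud: each row is one (x, y, z) point (made-up values).
points = np.array([
    [0.12, 0.40, 0.05],
    [0.14, 0.43, 0.07],   # very close to the first point
    [0.90, 0.10, 0.55],
    [0.52, 0.77, 0.31],
])

voxel_size = 0.25  # hypothetical grid resolution

# Quantize: replace each point by the integer index of the voxel it falls in.
voxel_indices = np.floor(points / voxel_size).astype(int)

# Points that share a voxel collapse into a single occupied cell,
# so their exact positions can no longer be recovered.
occupied = np.unique(voxel_indices, axis=0)

print(voxel_indices)
print(f"{len(points)} points -> {len(occupied)} occupied voxels")
```

The first two points land in the same voxel, which is the kind of information loss that motivates consuming the raw points directly.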

For simplicity, it’ll be assumed that a point in a point cloud is fully described by its (x, y, z) coordinates. In practice, other features may be included, such as surface normal and intensity.

Architecture

Given that PointNet consumes raw point cloud data, it was necessary to develop an architecture that conformed to the unique properties of point sets. Among these, the authors emphasize:

Permutation (Order) Invariance: given the unstructured nature of point cloud data, a scan made up of N points has N! permutations. The subsequent data processing must be invariant to the different representations.
Transformation Invariance: classification and segmentation outputs should be unchanged if the object undergoes certain transformations, including rotation and translation.
Point Interactions: the interaction between neighboring points often carries useful information (i.e., a single point should not be treated in isolation). Whereas classification need only make use of global features, segmentation must be able to leverage local point features along with global point features.

Fig. 2: PointNet classification and segmentation networks (source)

The architecture is surprisingly simple and quite intuitive. The classification network uses a shared multi-layer perceptron (MLP) to map each of the n points from three dimensions (x, y, z) to 64 dimensions. It’s important to note that a single multi-layer perceptron is shared across the n points (i.e., the same mapping is applied to each point independently). This procedure is repeated to map the n points from 64 dimensions to 1024 dimensions. With the points in a higher-dimensional embedding space, max pooling is used to create a global feature vector in ℝ¹⁰²⁴. Finally, a three-layer fully-connected network is used to map the global feature vector to k output classification scores. The details on the “input transform” and “feature transform” are covered in the “Transformation Invariance” section below.
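The data flow of the classification network can be sketched in a few lines of NumPy. This is an untrained toy forward pass, not the repository's TensorFlow code: the intermediate layer widths (64, 128, 512, 256) follow my reading of Fig. 2, the helper names `shared_mlp` and `init_layers` are hypothetical, and the input/feature transforms, batch normalization, and dropout are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layers(dims):
    """Random (untrained) weights and zero biases for consecutive layer sizes."""
    return [(rng.normal(scale=0.1, size=(i, o)), np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def shared_mlp(x, layers):
    """Apply the same dense + ReLU layers to every row (point) of x."""
    for w, b in layers:
        x = np.maximum(x @ w + b, 0.0)
    return x

n, k = 128, 10                          # points per cloud, classes (arbitrary)
points = rng.normal(size=(n, 3))        # stand-in for an (n, 3) point cloud

mlp_1 = init_layers([3, 64, 64])            # per-point: 3 -> 64
mlp_2 = init_layers([64, 64, 128, 1024])    # per-point: 64 -> 1024
fc_hidden = init_layers([1024, 512, 256])   # on the global vector
w_out, b_out = init_layers([256, k])[0]     # final linear layer to k scores

local_feats = shared_mlp(points, mlp_1)     # (n, 64)
high_feats = shared_mlp(local_feats, mlp_2) # (n, 1024)
global_feat = high_feats.max(axis=0)        # max pool over points -> (1024,)
hidden = shared_mlp(global_feat, fc_hidden) # (256,)
scores = hidden @ w_out + b_out             # (k,) classification scores

print(local_feats.shape, global_feat.shape, scores.shape)
```

Because the weights are applied row by row, the "shared MLP" is equivalent to a dense layer applied to each point, which is why it is often implemented as a 1-by-1 convolution.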

As for the segmentation network, each of the n input points needs to be assigned to one of m segmentation classes. Because segmentation relies on local and global features, the points in the 64-dimensional embedding space (local point features) are concatenated with the global feature vector (global point features), resulting in a per-point vector in ℝ¹⁰⁸⁸. Similar to the multi-layer perceptrons used in the classification network, MLPs are used (identically and independently) on the n points to lower the dimensionality from 1088 to 128 and again to m, resulting in an array of n x m.
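The concatenation step is easy to get wrong shape-wise, so here is a small NumPy sketch of it. The feature arrays are random stand-ins, and the two weight matrices collapse each shared MLP into a single layer for brevity, which is a simplification rather than the network's actual layer count.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 4                                  # points, segmentation classes (arbitrary)

local_feats = rng.normal(size=(n, 64))       # stand-in for per-point 64-dim features
global_feat = rng.normal(size=(1024,))       # stand-in for the global feature vector

# Repeat the global vector once per point, then concatenate along the feature axis.
tiled_global = np.tile(global_feat, (n, 1))                      # (n, 1024)
per_point = np.concatenate([local_feats, tiled_global], axis=1)  # (n, 1088)

# Shared per-point layers: 1088 -> 128 -> m (one layer each in this sketch).
w1 = rng.normal(scale=0.05, size=(1088, 128))
w2 = rng.normal(scale=0.05, size=(128, m))
hidden = np.maximum(per_point @ w1, 0.0)     # (n, 128)
scores = hidden @ w2                         # (n, m) per-point class scores
labels = scores.argmax(axis=1)               # (n,) predicted class per point

print(per_point.shape, scores.shape, labels.shape)
```

Every point sees the same 1024 global values but its own 64 local values, which is what lets the per-point prediction depend on both the whole shape and the individual point.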

The following sections will elaborate on the motivation/use of max pooling and the transformation networks.

Permutation Invariance

As mentioned, point clouds are inherently unstructured data and are represented as numerical sets. Specifically, given N data points, there are N! permutations.

In order to make PointNet invariant to input permutations, the authors turned to symmetric functions, those whose value given n arguments is the same regardless of the order of the arguments [2]. For binary operators, this is also known as the commutative property. Common examples include:

sum(a, b) = sum(b, a)
average(a, b) = average(b, a)
max(a, b) = max(b, a)
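A quick NumPy check of the property, using a random feature array as a stand-in for embedded points. Reducing over the point axis with sum, mean, or max gives the same result for any row order, whereas simply flattening the points into one long vector does not.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 4))   # 5 points, 4 features each (toy values)
reordered = features[::-1]           # same set of points, reversed order

# Symmetric functions: reduce over the point axis (axis 0).
invariant = {
    name: bool(np.allclose(fn(features, axis=0), fn(reordered, axis=0)))
    for name, fn in [("sum", np.sum), ("mean", np.mean), ("max", np.max)]
}

# Counter-example: flattening the points keeps their order in the output.
flatten_is_order_dependent = not np.array_equal(features.ravel(), reordered.ravel())

print(invariant)
print(flatten_is_order_dependent)
```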

Specifically, the authors make use of the symmetric function once the n input points are mapped to higher-dimensional space, as shown below. The result is a global feature vector that aims to capture an aggregate signature of the n input points. Naturally, the expressiveness of the global feature vector is tied to the dimensionality of it (and thus the dimensionality of the points that are input to the symmetric function). The global feature vector is used directly for classification and is used alongside local point features for segmentation.

Fig. 3: Usage of max pool, a symmetric function (source)

PointNet implements the symmetric function with max pooling. The authors empirically tested alternatives, including summing and averaging, but found them to be inferior, as shown below.

Fig. 4: Empirical testing of symmetric functions (source)

Transformation Invariance

The classification (and segmentation) of an object should be invariant to certain geometric transformations (e.g., rotation). Motivated by Spatial Transformer Networks (STNs) [3], the “input transform” and “feature transform” are modular sub-networks that seek to provide pose normalization for a given input.

In order to appreciate the adoption of STNs in PointNet, let’s try to gain a high-level understanding of how they function. Shown in Fig. 5 are various inputs and corresponding outputs of a Spatial Transformer (ST). As can be seen, the ST provides pose normalization to an otherwise rotated input. Using this type of pose normalization in a digit classifier would relax the constraints of a downstream algorithm and reduce the extent to which data augmentation is needed. Pose normalization is beneficial in the case of point clouds as well, as objects can similarly take on an unlimited number of poses.

Fig. 5: Various inputs and corresponding outputs of a Spatial Transformer (source)

Taking a further look, Fig. 6 shows the components of the Spatial Transformer. Based on input U, a small regression network, the localization net, outputs transformation parameter θ. In order to construct output V given U and θ, a grid generator and sampler are used. To motivate the grid generator and sampler further, imagine that the output of a localization net corresponds to rotating a handwritten “7” by an angle θ; in order to create a new image with the proper rotation, the original image needs to undergo appropriate sampling. Note that the ST is not confined to the input space and can operate on any downstream feature/embedding space.

Fig. 6: Components of the Spatial Transformer (source)

Going back to PointNet, a similar approach can be taken: for a given input point cloud, apply an appropriate rigid or affine transformation to achieve pose normalization. Because each of the n input points is represented as a vector and is mapped to the embedding spaces independently, applying a geometric transformation simply amounts to matrix multiplying each point with a transformation matrix. Unlike the image-based application of Spatial Transformers, no sampling is needed. Fig. 7 shows a snapshot of the input transform. Similar to the localization net in STs, the T-Net is a regression network that is tasked with predicting an input-dependent 3-by-3 transformation matrix that is then matrix multiplied with the n-by-3 input.
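The "no sampling needed" point can be illustrated with plain matrix multiplication. In this sketch the rotation is hand-picked rather than predicted by a T-Net, and the point values are arbitrary; it only shows that transforming an n-by-3 cloud is a single matmul, and that a suitable 3-by-3 matrix undoes the pose change.

```python
import numpy as np

points = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0]])          # (n, 3), one point per row

theta = np.pi / 2                              # rotate 90 degrees about the z-axis
rotation_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                       [np.sin(theta),  np.cos(theta), 0.0],
                       [0.0,            0.0,           1.0]])

# Points are rows, so rotating each point p as R @ p is points @ R.T.
rotated = points @ rotation_z.T

# Pose normalization: an (n, 3) @ (3, 3) product with the right matrix undoes
# the rotation. For a rotation, R^-1 = R^T, so the matrix to apply to rows is R.
transform = rotation_z
restored = rotated @ transform

print(np.round(rotated, 3))
print(np.allclose(restored, points))
```

In PointNet the matrix playing the role of `transform` is not known in advance; the T-Net is trained to predict it from the input.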

The operations comprising the T-Net are motivated by the higher-level architecture of PointNet. MLPs (or fully-connected layers) are used to map the input points independently and identically to a higher-dimensional space; max pooling is used to encode a global feature vector whose dimensionality is then reduced to ℝ²⁵⁶ with FC layers. The input-dependent features at the final FC layer are then combined with globally trainable weights and biases, resulting in a 3-by-3 transformation matrix.
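The last step of the T-Net can be sketched as follows. The 256-dimensional feature is a random stand-in for the output of the final FC layer, and the initialization (zero weights, identity bias, so that the predicted transform starts out as the identity) is an assumption on my part rather than something stated above.

```python
import numpy as np

rng = np.random.default_rng(0)

feat = rng.normal(size=(256,))        # stand-in for input-dependent features

# Globally trainable parameters of the final step (assumed initialization).
w = np.zeros((256, 9))                # zero weights: output ignores feat at first
b = np.eye(3).ravel()                 # bias set to a flattened 3x3 identity

transform = (feat @ w + b).reshape(3, 3)

points = rng.normal(size=(10, 3))     # toy (n, 3) point cloud
aligned = points @ transform          # (n, 3) @ (3, 3)

# After some training the weights are non-zero, so the matrix depends on the input.
w_trained = w + 0.01 * rng.normal(size=w.shape)   # hypothetical trained weights
transform_trained = (feat @ w_trained + b).reshape(3, 3)

print(transform)
print(np.allclose(aligned, points))
```

Starting from the identity means the sub-network initially leaves the point cloud untouched and only learns to deviate from that when it helps the downstream task.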

The concept of pose normalization is extended to the 64-dimensional embedding space (“feature transform” in Fig. 2). The corresponding T-Net is nearly identical to that of Fig. 8 except for the dimensionality of the trainable weights and biases, which become 256-by-4096 and 4096, respectively, resulting in a 64-by-64 transformation matrix. The increased number of trainable parameters leads to the potential for overfitting and instability during training, so a regularization term is added to the loss function. The regularization term given in the paper [1], L_reg = ‖I − AAᵀ‖²_F, encourages the resulting 64-by-64 transformation matrix A to approximate an orthogonal transformation.
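A small NumPy sketch of the orthogonality penalty, taken here as the squared Frobenius norm of I − AAᵀ as in the PointNet paper. The function name `orthogonality_penalty` and the test matrices are illustrative; in training, the penalty would be scaled by a weight and added to the task loss.

```python
import numpy as np

def orthogonality_penalty(a):
    """Squared Frobenius norm of (I - A A^T); zero when A is orthogonal."""
    identity = np.eye(a.shape[0])
    return float(np.sum((identity - a @ a.T) ** 2))

rng = np.random.default_rng(0)

# A random orthogonal 64x64 matrix (Q factor of a QR decomposition).
q, _ = np.linalg.qr(rng.normal(size=(64, 64)))

# The same matrix pushed away from orthogonality by additive noise.
noisy = q + 0.1 * rng.normal(size=(64, 64))

penalty_orthogonal = orthogonality_penalty(q)
penalty_noisy = orthogonality_penalty(noisy)

print(penalty_orthogonal, penalty_noisy)
```

An orthogonal transformation preserves lengths and angles, so pushing A toward orthogonality keeps the feature transform from discarding information in the embedding.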

Analysis and Visualization

There is a considerable amount of intuition that can be drawn from the global feature vector. Firstly, the dimensionality of the vector, referred to by the authors as the bottleneck dimension and symbolized by K, relates directly to the expressiveness of the model, as mentioned previously. Naturally, a larger value of K leads to a more complex (and likely more accurate) model, and vice versa. For reference, PointNet is designed with K=1024. Fig. 9 shows the accuracy of PointNet across K and the number of points comprising an input point cloud.

Also, recall that the feature vector was the result of a thoughtfully applied symmetric function (for permutation invariance). In particular, PointNet makes use of max pooling. Similar to using the max operator to compress multiple real-valued inputs to a single value, the output of max pooling compresses the n points of the input point cloud to a subset of points. In fact, at most K points can contribute to the global feature vector. The points that do contribute to and define the global feature vector are referred to as the critical point set and encode the input with a sparse set of key points.
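The critical point set falls out of the max pooling step directly: it is the set of points that supply at least one of the K maxima. Below is a minimal sketch using random per-point features as a stand-in for the embedded point cloud, with a small K chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 16                              # points, bottleneck dimension (toy sizes)
point_feats = rng.normal(size=(n, K))       # stand-in for per-point K-dim features

global_feat = point_feats.max(axis=0)       # (K,) global feature vector
winners = point_feats.argmax(axis=0)        # (K,) index of the point behind each maximum
critical_idx = np.unique(winners)           # critical point set, at most K points

# Keeping only the critical points reproduces the global feature vector exactly.
feat_from_critical = point_feats[critical_idx].max(axis=0)

print(len(critical_idx), "of", n, "points are critical")
print(np.array_equal(feat_from_critical, global_feat))
```

Since each of the K output dimensions is decided by a single point, the remaining n minus (at most) K points could be deleted without changing the global feature vector.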

Fig. 10: Visualization of critical point sets and upper-bound shapes (source)

Similar to how the output of the max operator is unchanged by inputs that are less than the true maximum, there exists a bound on input points that won’t impact the global feature vector. This bound is represented above by the upper-bound shape. Note that noise beyond the upper-bound shape alters the global feature vector but may not necessarily result in misclassification. In summary, the global feature vector is unchanged for points between the critical point set and the upper-bound shape, resulting in considerable robustness.
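The same idea can be checked numerically in feature space. In this sketch the per-point features are random stand-ins, and the extra points are constructed directly as feature vectors rather than as (x, y, z) inputs passed through a trained network, which is a simplification of how the upper-bound shape is defined.

```python
import numpy as np

rng = np.random.default_rng(0)
point_feats = rng.normal(size=(100, 8))     # stand-in for per-point features
global_feat = point_feats.max(axis=0)

# Extra points whose features never exceed the current maxima
# (the analogue of points inside the upper-bound shape).
inside = global_feat - np.abs(rng.normal(size=(20, 8)))
with_inside = np.vstack([point_feats, inside])

# An outlier whose first feature exceeds the current maximum
# (the analogue of noise beyond the upper-bound shape).
outlier = global_feat.copy()
outlier[0] += 1.0
with_outlier = np.vstack([point_feats, outlier])

unchanged = np.array_equal(with_inside.max(axis=0), global_feat)
changed = not np.array_equal(with_outlier.max(axis=0), global_feat)

print(unchanged, changed)
```

The outlier shifts only one of the eight feature dimensions here, which hints at why a changed global feature vector does not automatically flip the predicted class.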

Finally, the robustness described above can be visualized in a more quantitative manner, as shown below. Missing data refers to deleting points from the input point cloud, whereas outlier refers to insertion of random/noisy points.