PointNet
PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, by Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas

The paper introduces a deep learning architecture that can reason directly about 3D geometric data and learn features from it.
The main problem with deep learning on point clouds is that typical convolutional architectures require a highly regular input format, like an image grid or temporal features. Since point clouds are not in a regular format, the common approaches transform the data into a regular 3D voxel grid or into 2D projections.
PointNet was the first neural network of its kind to consume unordered point clouds directly, handling the permutation invariance of the points in the cloud. PointNet supports tasks ranging from object classification and part segmentation to scene semantic parsing. A notable property of PointNet is its robustness to input perturbation and corruption. The network can also learn to summarize a shape by a sparse set of key points.
In the basic version, each point is represented by just its three coordinates (x, y, z); additional dimensions such as normals or intensity can be added depending on availability.
Applications of PointNet
- Object Classification: The input point cloud is either sampled directly or pre-segmented from a scene, and the network predicts k scores for the k candidate classes.
- Semantic Segmentation: The input can be a single object for part-region segmentation, or a sub-volume of a 3D scene for object-region segmentation, and the model predicts n x m scores for the n points and m semantic subcategories.
Unordered Point Cloud:
From a data-structure point of view, a point cloud is an unordered set of vectors, unlike the pixel array of an image. So a network that consumes a set of N 3D points needs to be invariant to the N! permutations of the input set in data-feeding order.

Interaction among points:
The points live in a space with a distance metric, so neighboring points form meaningful subsets. The model therefore needs to capture local structure from neighboring points, as well as the combinatorial interactions among local structures.
Invariance under transformation:
The learned representation of the point set needs to be invariant to certain transformations, like rotation and translation.
f({x1, x2, …, xn}) ≈ g(h(x1), h(x2), …, h(xn))
PointNet approximates h by a multilayer perceptron and g by the composition of a single-variable function and a max pooling function.
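A minimal NumPy sketch of this composition, using one illustrative linear-plus-ReLU layer for h (a stand-in for the shared MLP) and max pooling for g; the weights and sizes here are made up, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shared "MLP" h: one linear layer + ReLU, applied to every point.
W = rng.normal(size=(3, 64))  # lifts each (x, y, z) point to 64 dimensions
b = rng.normal(size=64)

def h(points):
    """Per-point feature function, shared across all points."""
    return np.maximum(points @ W + b, 0.0)

def g(point_features):
    """Symmetric aggregation: elementwise max over the point dimension."""
    return point_features.max(axis=0)

def f(points):
    return g(h(points))

cloud = rng.normal(size=(128, 3))          # 128 points, xyz
shuffled = cloud[rng.permutation(128)]     # same set, different order

# The global feature is identical for any ordering of the points.
assert np.allclose(f(cloud), f(shuffled))
```

Because g reduces over the point axis with a symmetric operation, any reordering of the rows of the input leaves f unchanged, which is exactly the permutation invariance required above.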
Spatial Transformers: PointNet has a data-dependent spatial transformer network that attempts to canonicalize the data by applying a rigid or affine transformation to the input; since each point transforms independently, this is inexpensive to apply.
Architecture

The network has three key modules:
1. A max pooling layer as a symmetric function to aggregate features from all the points.
Symmetric Function:
Three strategies exist to make the network invariant to input permutations:
- Sort the input into a canonical order.
- Treat the input as a sequence and train an RNN, augmenting the training data with all kinds of permutations.
- Use a simple symmetric function to aggregate the information from each point. Examples of symmetric binary functions are + and *.
The key idea in PointNet is the use of a single symmetric function, max pooling. Here the symmetric function takes n vectors as input and outputs a new vector that is invariant to the input order.
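A small NumPy illustration with random stand-in features: max pooling is not only order-invariant, it also records which point "won" each feature dimension, which is the basis of the sparse key-point summary mentioned earlier:

```python
import numpy as np

rng = np.random.default_rng(1)
feats = rng.normal(size=(100, 8))   # n = 100 points, 8 illustrative features each

global_feat = feats.max(axis=0)     # symmetric: the order of rows is irrelevant
winners = feats.argmax(axis=0)      # index of the point contributing each feature

perm = rng.permutation(100)
assert np.allclose(feats[perm].max(axis=0), global_feat)

# At most one "critical point" per feature dimension, so far fewer than n.
assert len(set(winners.tolist())) <= 8
```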
2. A local and global information combination structure.
The output of the symmetric function forms a global signature of the input set, a vector [f1, f2, f3, …, fk]. These features can be fed to a multilayer perceptron for classification. They can also be used for segmentation by concatenating the local and global feature vectors, so the network is aware of both local and global context; the same combined features can also be used to predict per-point normals.
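A sketch of the segmentation-branch concatenation, assuming illustrative sizes (64-dim per-point features, a 1024-dim global feature); the global feature is tiled and appended to every point's local feature:

```python
import numpy as np

rng = np.random.default_rng(2)
n, local_dim, global_dim = 1024, 64, 1024

point_feats = rng.normal(size=(n, local_dim))   # per-point local features
global_feat = rng.normal(size=(global_dim,))    # stand-in for the max-pooled signature

# Broadcast the single global vector to all n points, then concatenate.
combined = np.concatenate(
    [point_feats, np.tile(global_feat, (n, 1))], axis=1
)
assert combined.shape == (n, local_dim + global_dim)   # (1024, 1088)
```

Each row now carries both its own local descriptor and the whole shape's signature, so a per-point classifier can reason about a point in the context of the full object.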
3. Two joint alignment networks that align both the input points and the point features.
The network's predictions need to be invariant to certain geometric transformations of the point cloud. So, PointNet aligns the input set to a canonical space before feature extraction.
PointNet predicts an affine transformation matrix with a mini-network (T-Net) and applies the transformation directly to the coordinates of the point cloud.
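Applying a predicted 3x3 transform to the raw coordinates is a single batched matrix multiply. A sketch with a placeholder matrix standing in for the mini-network's output:

```python
import numpy as np

rng = np.random.default_rng(3)
cloud = rng.normal(size=(32, 500, 3))   # batch of 32 clouds, 500 points each

# Placeholder for the T-Net output: identity plus a small perturbation
# (a trained network would predict this per cloud from the data).
transforms = np.tile(np.eye(3), (32, 1, 1)) + 0.01 * rng.normal(size=(32, 3, 3))

# Batched matmul: (32, 500, 3) @ (32, 3, 3) -> (32, 500, 3)
aligned = cloud @ transforms
assert aligned.shape == cloud.shape
```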

The same idea is further extended to feature space as well. PointNet has another alignment network that predicts a feature transformation matrix to align the features from different point clouds.

A transformation in feature space has much higher dimension than the 3x3 spatial transform, which increases the difficulty of optimization. So a regularization term is added to the loss, constraining the feature transformation matrix to be close to an orthogonal matrix.
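The regularizer penalizes the distance of the predicted feature transform A from orthogonality, L_reg = ||I − A·Aᵀ||²_F, averaged over the batch. A NumPy sketch (the 64x64 size matches the paper's feature transform; the batch size is illustrative):

```python
import numpy as np

def orthogonality_loss(A):
    """Mean of ||I - A A^T||_F^2 over a batch of predicted transforms."""
    k = A.shape[-1]
    diff = np.eye(k) - A @ np.swapaxes(A, -1, -2)
    return (diff ** 2).sum(axis=(-2, -1)).mean()

rng = np.random.default_rng(4)
A = np.tile(np.eye(64), (8, 1, 1))              # perfectly orthogonal batch
assert np.isclose(orthogonality_loss(A), 0.0)   # zero penalty when A A^T = I

noisy = A + 0.1 * rng.normal(size=A.shape)      # deviation from orthogonality
assert orthogonality_loss(noisy) > 0.0          # penalized
```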
Analysis


PointNet is a deep neural network that consumes 3D point clouds directly and provides a unified approach for tasks such as classification and segmentation.
Reference
- PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation (Paper)
