Interpreting 3D Point Cloud Data Requiring Little Training Data via Contrastive Learning

Cyrill Stachniss
Published in StachnissLab
Sep 21, 2022 · 4 min read
Semantic segmentation of a 3D LiDAR scan from an urban scene in KITTI

Fine-grained scene interpretation is crucial for a robot or autonomous vehicle to understand its surroundings and act appropriately. Often, this task is addressed using cameras observing the robot’s surroundings and interpreting the images with neural networks. Extensive research on convolutional neural networks, or CNNs for short, has provided solutions to 2D image-based semantic segmentation, instance segmentation, and panoptic segmentation. We can do something similar by installing a 3D laser scanner, also called LiDAR, on the vehicle, which leads to an interpreted scene in 3D space. Semantic information in 3D enables robots to interact safely with their surroundings.

LiDAR data is often more challenging to interpret than images due to the lack of color information and the sparser representation of objects. Thus, fairly large amounts of training data are needed. Supervised methods require a substantial amount of manually labeled 3D point clouds to achieve high performance, which is particularly hard to acquire for real-world LiDAR data. Data annotation is expensive due to the density of measurements, which are sensor-specific and also depend on the sensor mounting. For semantic segmentation, point-wise labels are needed, making it even harder to annotate 3D points compared to images. There are also fewer labeled LiDAR datasets available compared to image datasets. Given the sensor-specific characteristics, features learned from one dataset cannot be transferred easily to a dataset with a different sensor setup. Therefore, recent work in this domain addresses label efficiency or transferability across datasets.

Self-supervised representation learning has the prospect of using the data in an unsupervised way to learn a robust feature representation that can be transferred to different supervised downstream tasks. One area of self-supervised learning that has received increasing attention recently is contrastive learning. Contrastive learning methods use data augmentation to generate augmented versions of one anchor sample. They aim to learn a representation that maps these augmented samples to similar features in the feature space while pushing them apart from the features of different samples. The network pre-trained with this paradigm is then fine-tuned for different downstream tasks.
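To make the principle more concrete, here is a minimal sketch of a generic InfoNCE-style contrastive loss in PyTorch. It illustrates the pull-together/push-apart idea described above and is not the exact loss used in SegContrast; the function name and parameters are made up for this example. Negatives typically come from other samples in the batch or from a memory bank.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor_feats, positive_feats, negative_feats, temperature=0.1):
    """Generic InfoNCE-style contrastive loss (illustrative sketch).

    anchor_feats, positive_feats: (N, D) features of two augmented views
    of the same samples; negative_feats: (M, D) features of other samples.
    """
    # Normalize so the dot product equals cosine similarity
    anchor = F.normalize(anchor_feats, dim=1)
    positive = F.normalize(positive_feats, dim=1)
    negative = F.normalize(negative_feats, dim=1)

    # Similarity between each anchor and its own positive: (N, 1)
    pos_sim = (anchor * positive).sum(dim=1, keepdim=True)
    # Similarity between each anchor and all negatives: (N, M)
    neg_sim = anchor @ negative.t()

    # The positive sits at index 0 of the logits; cross-entropy then
    # pulls the positive pair together and pushes the negatives apart.
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```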

Semantic segmentation result when training with contrastive learning using only 0.1% of the labeled data

A 2022 work by Lucas Nunes and colleagues called “SegContrast” addresses this challenge and aims at 3D point cloud feature representation learning through self-supervised segment discrimination; both the paper and the code are available (see below). The idea is to use two stages. The first stage is free of labeled data and operates in a self-supervised manner. It is only there to compute a good representation of the LiDAR data. This stage clusters the point cloud to segment the structures in the scene. It exploits a momentum encoder network and maintains a feature bank during training. Point-wise features are computed for a pair of augmented point clouds, and a segment mapping is used to extract the segment-wise points and features to learn a representation for the paired segments. The second stage starts from the feature representation of the first stage and fine-tunes it for the final task, now using labeled data. At that stage, only a small amount of training data is needed. The paper shows good results with only 1/1000th of the original labeled data.
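To illustrate the kind of segmentation used in the first stage, below is a rough sketch of extracting segments from a single LiDAR scan by removing ground points and clustering what remains. The height-based ground removal, the DBSCAN clustering, and all thresholds are simplifying assumptions for this example; the actual pipeline in the paper may differ.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def extract_segments(points, ground_height=-1.4, eps=0.5, min_points=30):
    """Illustrative segment extraction from one LiDAR scan.

    points: (N, 3) array of x, y, z coordinates.
    Returns an (N,) array of segment ids; -1 marks ground/noise points.
    The parameter values are placeholders, not the paper's settings.
    """
    segment_ids = np.full(len(points), -1, dtype=int)

    # Very crude ground removal via a height threshold; a real pipeline
    # would typically fit a ground plane (e.g., with RANSAC) instead.
    non_ground = points[:, 2] > ground_height

    # Cluster the remaining points; each cluster becomes one segment.
    labels = DBSCAN(eps=eps, min_samples=min_points).fit_predict(points[non_ground])
    segment_ids[non_ground] = labels
    return segment_ids
```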

SegContrast overview
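With segments available in both augmented views, the point-wise features belonging to one segment can be aggregated into a single segment descriptor. The snippet below is a hypothetical illustration of such pooling using a simple per-segment mean, not necessarily the paper's exact aggregation. The paired segment descriptors from the two views then serve as positives, while the feature bank filled by the momentum encoder provides the negatives for a contrastive loss like the one sketched earlier.

```python
import torch

def pool_segment_features(point_feats, segment_ids):
    """Average point-wise features per segment (illustrative sketch).

    point_feats: (N, D) point-wise features from the encoder.
    segment_ids: (N,) integer tensor of segment ids; -1 means 'no segment'.
    Returns a dict mapping segment id -> (D,) segment descriptor.
    """
    seg_feats = {}
    for seg_id in torch.unique(segment_ids):
        if seg_id.item() == -1:
            continue  # skip ground/noise points without a segment
        mask = segment_ids == seg_id
        seg_feats[seg_id.item()] = point_feats[mask].mean(dim=0)
    return seg_feats
```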

The main contribution of the work is a contrastive representation learning method for 3D LiDAR point clouds that is able to learn structural information. It extracts segments from the LiDAR data and uses contrastive learning to discriminate between similar and dissimilar structures. The learned feature representation is then used as a starting point for supervised fine-tuning, reducing the amount of labeled training data needed. The results reported by Nunes et al. suggest that the method learns structural information better and yields a more descriptive feature representation during the self-supervised pre-training, surpassing previous point cloud-based contrastive methods in different evaluations. The evaluations show that the approach learns efficiently with fewer labels, reaches competitive performance even when pre-trained for fewer epochs, describes fine-grained structures better, and transfers better between different datasets.

For more information, see:

L. Nunes, R. Marcuzzi, X. Chen, J. Behley, and C. Stachniss, “SegContrast: 3D Point Cloud Feature Representation Learning through Self-supervised Segment Discrimination,” IEEE Robotics and Automation Letters (RA-L), vol. 7, iss. 2, pp. 2116–2123, 2022. doi:10.1109/LRA.2022.3142440

Paper: http://www.ipb.uni-bonn.de/pdfs/nunes2022ral-icra.pdf
Code: https://github.com/PRBonn/segcontrast
