Edge-selective Feature Weaving for point cloud matching (MIRU2022)
We are happy to announce that one of our project will be presented in a long oral session at MIRU2022.
This is an achievement by our internship trainee, Rintaro Yanagi (Ph.D candidate at Hokkaido University).
Matching, a task to identify the one-to-one correspondence between visual elements, is a fundamental image processing problem. In this research, we address the problem of 3D point cloud matching (Figure 1), a task to find a point-by-point correspondence between two 3D point clouds. 3D point cloud matching can be used for fine-level human pose tracking, as shown in Fig. 1, time-series point tracking in LiDAR observation. Automatic driving and robotics are typical industrial applications of this technology.
Our method has impressively boosted the performance of CorrNet3D , a SoTA method at CVPR2021, by replacing a correspondence indicator module with a trainable one (Figure 2).
 Yiming Zeng et al., “CorrNet3D: Unsupervised End-to-End Learning of Dense Correspondence for 3D Point Clouds,” CVPR2021
Why our method works better?
3D point cloud matching is generally realized by two modules: a feature extractor and a correspondence matrix identifier. The former extracts the feature of each point in the two point-clouds. Then, similarities of the extracted features are fed to the latter module. The proposed method boosts the performance by innovation in the latter part (Figure 3).
The problem of finding one-to-one correspondence from pairwise similarity is known as bipartite graph matching (a.k.a linear assignment problem). We can solve the problem optimally by the Hungarian algorithm; however, the algorithm is non-differentiable and not applicable for training a neural network. Thus, a differentiable substitute, such as the Sinkhorn operation, is used to train the feature extractor in an end-to-end manner.
It looks elegant to cut out the bipartite graph matching as a subproblem of the task; however, the optimal solution for the separated subproblem is not necessarily optimal for the entire problem. For example, no matter how smart the feature extractor is, the extracted features may have deviations from the true values due to sensor noise or data loss caused by occlusion. However, the conventional algorithm adopted for the subproblem assumes no errors in the input feature similarity, which may be suboptimal.
In this research, we have developed a data-driven algorithm for the sub-problem, under the assumption that the data-driven algorithm would have a chance to reach closer to the optimal solution in the entire system even when features are erroneous. Based on this intention, we developed a novel neural network architecture as a trainable matching algorithm, achieving the significant performance gain in Fig. 2.
Further technical details
The architecture consists of two parts: “stacked feature weaving layers” and an “edge selection part.” The former part is a kind of graph neural network (GNN). Conventional GNNs involve operations to aggregate features to vertices at each layer. However, when the graph is dense, such feature aggregation has the side effect of smoothing the features of two adjacent vertices, which is not desirable for any tasks that need to distinguish node elements. Bipartite graphs are typical dense graphs. To enable the learning of GNNs on such bipartite graphs, we proposed a feature weaving operation that forwards the values to the next layer for each edge rather than aggregating for each node.
The latter part is prepared to fix the critical problem of feature weaving layers; it consumes a large memory space during training. In order to back-propagate errors, the layer requires keeping calculation results for every edge. Thus, for matching a point cloud with 1024 points, a single layer would require to store 2*102⁴²≒1M linear operations (e.g., when input and output are 32-dimensional vectors, 1G for each layer). To save memory, we applied edge-pruning before the feature weaving part. We found that the pruning reduced not only the memory consumption but also the diversity of the input distribution at each layer, resulting in more accurate matching than a model without the pruning.
Any suggestion for the industrial use of this method is welcome. In that case, please contact us at email@example.com.
At OMRON SINIC X, we will continue fundamental research on natural language processing, computer vision, machine learning, robotics, and human computer interaction. If you are interested in working with us as an intern, please visit our call for internship page.
This study was conducted on the AI Bridging Cloud Infrastructure (ABCI), AIST. This work was supported by the JST-Mirai Program Grant Number JPMJMI21G2 and the JSPS KAKENHI Grant Numbers 21H04910.