Deep Learning on 3D Meshes

A learned solution to node-level classification on irregular graphs via graph neural networks.

Anya Fries
Stanford CS224W GraphML Tutorials
7 min read · Jan 26, 2022


By Paul Aurel Diederichs and Anya Fries as part of the Stanford CS224W course project.

Deep learning has shown tremendous success on 2D visual recognition tasks. Tasks such as image classification and segmentation, deemed extremely difficult a decade ago, are now solved by neural networks with human-like performance. This success is attributed to convolutional neural networks (CNNs), which have replaced handcrafted descriptors. Driven by this success, researchers have extended CNNs to irregular graph data, such as 3D meshes. The extension of the convolution operation to unstructured graph data is nontrivial, as the notion of relative position is lost. This blog post and its accompanying Colab are therefore dedicated to answering how CNNs can be generalized to irregular graphs. In particular, we discuss how to modify CNNs to predict segmentation masks on 3D meshes.

Why are CNNs not applicable?

Regular graphs defined on a Euclidean grid have a fixed node ordering and neighborhood size. Generic graphs have no ordering of their neighbors, and their number of neighbors is variable.

Unfortunately, CNNs do not directly translate to irregular graphs. CNNs operate in Euclidean space on regular grids, where neighboring nodes can be distinguished from one another via their relative position to the central node, and where the number of neighbors is constant. The convolution operation leverages both of these properties to derive informative node embeddings. Irregular, unstructured graphs satisfy neither property: they have no notion of relative position (there is no top, bottom, left, or right), and the neighborhood connectivity/size can vary across nodes. Consequently, standard convolutions cannot be applied directly to irregular graphs.

Update formula for computing node embeddings h with a 3x3 CNN layer. Note that unique weight matrices W are learned for each neighbor. This is possible because neighbors can be differentiated using their relative position to the center node.
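Written out, the update the caption describes might read as follows (a reconstruction, where N(v) denotes the grid neighbors of pixel v):

$$h_v^{(l+1)} = \sigma\Big(\sum_{u \in \mathcal{N}(v) \cup \{v\}} W_u^{(l)}\, h_u^{(l)}\Big)$$

Here each neighbor u has its own weight matrix W_u^{(l)}, indexed by its fixed relative position to v, which is exactly what an irregular graph cannot provide.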

Graph Convolutional Networks

Since vanilla convolution does not apply to irregular graphs, the operation needs to be reformulated. Graph convolutional networks (GCNs) provide such a reformulation. Like CNNs, GCNs reason over local neighborhoods: they compute node embeddings by aggregating information from each node's direct neighbors. Yet GCNs ensure that this aggregation does not depend on the relative positions of neighbors or on the node ordering, and is independent of the neighborhood size, invariances not enforced by CNNs.

  • Irregular graphs have no canonical node ordering, so the output of a GCN must be independent of the order in which nodes are presented: feeding the network two different orderings of the same graph must yield the same output. This property is known as permutation invariance. We illustrate it below.
  • The connectivity of nodes can differ across the graph. Thus, the graph convolution operation must be independent of the neighborhood size.
For different order plans, the learned function must map each node to the same embedding irrespective of the ordering (permutation equivariance, illustrated), and must map the whole graph to the same vector irrespective of the node ordering (permutation invariance).

A single GCN layer is defined by three operations: the message computation, the aggregation, and the update. These operations are performed to derive node embeddings.

  • Message computation: Message computation amounts to transforming the feature vector of each node in the graph. The same learned transformation is applied to every node, which enforces permutation invariance.
  • Aggregation: Once all messages are derived, each node aggregates the messages of its neighbors via a permutation invariant function, e.g. the mean, max, or sum. This aggregation of messages enables information to propagate across the graph.
  • Update: Finally, an update function combines the aggregated messages with the previous embedding of the current node, thereby updating the node's embedding. By combining information from the node's neighborhood and the node itself, the derived descriptor encodes both structural and node-level information.
General message, aggregation and update formula for a GCN with node embeddings h. Messages are normalized by the node degree, the aggregation is a sum over the neighbors and the current node, and an activation is applied to update the embedding. Note that W does not depend on the current node or its neighbors.
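In symbols, one reconstruction consistent with this caption is:

$$h_v^{(l+1)} = \sigma\Big(\sum_{u \in \mathcal{N}(v) \cup \{v\}} \frac{W^{(l)}\, h_u^{(l)}}{|\mathcal{N}(v)|}\Big)$$

Because the same W^{(l)} is applied to every neighbor and the sum is order independent, the update is permutation invariant and handles arbitrary neighborhood sizes.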

For a more detailed explanation, have a look at lectures 6 and 7 of Stanford’s course on ML with graphs — CS224W.

Application

To explore the introduced graph convolution operation, we turn to a graph segmentation task. Similar to image segmentation, where pixels are assigned to semantic classes, graph segmentation amounts to finding a mapping between the nodes of a graph and a set of classes. We focus on the problem of assigning body-part labels to the nodes of humanoid 3D meshes.

Graph segmentation task: each vertex in the mesh is assigned to one of twelve body parts.

3D Mesh Data

To solve the presented segmentation task, we leverage all data encoded in 3D meshes. A 3D mesh defines a surface via a collection of vertices and triangular faces. It is represented by two matrices:

  • A vertex matrix with dimensions (n, 3), where each row specifies the spatial position [x, y, z] of a vertex in 3D space.
  • A face integer matrix with dimensions (m, 3), where each row holds three indices of the vertex matrix that define a triangular face.

Note that the vertex matrix captures node-level feature information and the face matrix describes the node connectivity. Formally, each mesh can be transformed into a graph G = (X, A) with vertices V, where X has dimension (|V|, 3) and defines the spatial xyz-features for each node u in V, and A, the adjacency matrix, has dimension (|V|, |V|) and defines the connected neighborhood of each node. We work with this induced mesh graph.
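To make this concrete, here is a minimal sketch of the face-to-adjacency conversion in PyTorch. The function name and the sparse edge_index format (the (2, |E|) edge list that PyG uses in place of a dense adjacency matrix) are our own choices for illustration:

```python
import torch

def faces_to_edge_index(faces: torch.Tensor) -> torch.Tensor:
    """Convert an (m, 3) face matrix into a (2, |E|) PyG-style edge_index.

    Each triangular face (i, j, k) contributes the undirected edges
    (i, j), (j, k) and (k, i).
    """
    edges = torch.cat([faces[:, [0, 1]],
                       faces[:, [1, 2]],
                       faces[:, [2, 0]]], dim=0)
    # Symmetrize (add reversed edges), then drop duplicate rows.
    edges = torch.cat([edges, edges.flip(1)], dim=0)
    return torch.unique(edges, dim=0).t().contiguous()
```

The vertex matrix can then serve directly as the node feature matrix X.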

Implementation

We rely on Feature-Steered graph convolutions to derive node-level body-part assignments from the induced mesh graph. This GCN layer was proposed by Verma et al. [1].

A node-level embedding update via Feature-Steered graph convolution.

The Feature-Steered graph convolution works as follows. The embedding of the central node (in red) is updated by aggregating transformed embeddings of its neighboring nodes (in green) as well as a transformed embedding of itself. In particular, M different messages are created for each node by passing its embedding through M weight matrices. Therefore, a total of M × |Nv| messages are created per central node. These messages are aggregated via a learnable weighted average. Following the notation of Verma et al. [1], this can be expressed as:
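$$h_v^{(l+1)} = b \;+\; \frac{1}{|\mathcal{N}_v|} \sum_{u \in \mathcal{N}_v} \sum_{m=1}^{M} q_m\big(h_v^{(l)}, h_u^{(l)}\big)\, W_m\, h_u^{(l)}$$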

where Nv is the set of adjacent vertices (neighbors) of vertex v (including v itself).

The M messages are weighted via a learnable attention mechanism. Following [1], the attention weights are computed for each layer as:
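$$q_m(h_v, h_u) \;=\; \frac{\exp\!\big(u_m^\top (h_u - h_v) + c_m\big)}{\sum_{m'=1}^{M} \exp\!\big(u_{m'}^\top (h_u - h_v) + c_{m'}\big)}$$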

where u_m and c_m are learnable parameters, specific to each layer. Note that this attention function is translation invariant, since the weights depend only on the difference h_u − h_v. This is desirable when working with spatial input features: translating an input mesh does not affect the computed attention weights.

Implementation in PyG

This graph convolution operation can be implemented using the PyTorch Geometric (PyG) MessagePassing class. Three class methods define a PyG message passing layer: forward (performs a forward pass), message (constructs the node-level messages), and aggregate (combines the messages of each neighborhood). The message and forward functions are defined below:
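Here is a condensed sketch of such a layer. It follows the equations above, but the naming and details may differ from the Colab version:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import MessagePassing
from torch_geometric.utils import add_self_loops

class FeaStConv(MessagePassing):
    """Sketch of the Feature-Steered graph convolution of Verma et al. [1]."""

    def __init__(self, in_channels, out_channels, num_heads):
        super().__init__(aggr="mean")  # 1/|Nv| normalization via mean aggregation
        self.num_heads = num_heads
        self.out_channels = out_channels
        # One weight matrix W_m per attention head, stacked into a single map.
        self.lin = nn.Linear(in_channels, num_heads * out_channels, bias=False)
        # Attention parameters u_m and c_m.
        self.u = nn.Linear(in_channels, num_heads, bias=False)
        self.c = nn.Parameter(torch.zeros(num_heads))
        self.bias = nn.Parameter(torch.zeros(out_channels))

    def forward(self, x, edge_index):
        # Include each node in its own neighborhood, since Nv contains v itself.
        edge_index, _ = add_self_loops(edge_index, num_nodes=x.size(0))
        return self.propagate(edge_index, x=x) + self.bias

    def message(self, x_i, x_j):
        # Translation-invariant attention over the M heads: the weights depend
        # only on the difference x_j - x_i between neighbor and central node.
        q = F.softmax(self.u(x_j - x_i) + self.c, dim=-1)           # (E, M)
        msgs = self.lin(x_j).view(-1, self.num_heads, self.out_channels)
        return (q.unsqueeze(-1) * msgs).sum(dim=1)                  # (E, out)
```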

The full implementation can be found in our Colab.

Network Architecture

The Feature-Steered graph convolution forms the backbone of our body-part labeling network. Our general network architecture is as follows:

  1. A multilayer perceptron (MLP) encoder first embeds the node-level input feature vectors (xyz coordinates).
  2. The encoded node-level features are then passed sequentially through four Feature-Steered convolutional layers. Each layer aggregates messages from 12 attention heads (matching the number of body-part output labels).
  3. Finally, the refined node-level features are passed through a prediction head, namely a second MLP. It outputs a soft class correspondence for each node. A sketch of this architecture is given below.
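A minimal sketch of this architecture, reusing the FeaStConv layer from above (the hidden size is illustrative, not taken from the Colab):

```python
import torch
import torch.nn as nn

class MeshSegNet(nn.Module):
    """MLP encoder -> 4 Feature-Steered conv layers -> MLP prediction head."""

    def __init__(self, in_channels=3, hidden=64, num_classes=12, num_heads=12):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_channels, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.convs = nn.ModuleList(
            [FeaStConv(hidden, hidden, num_heads) for _ in range(4)])
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_classes))

    def forward(self, data):
        x = self.encoder(data.x)                      # embed xyz coordinates
        for conv in self.convs:
            x = torch.relu(conv(x, data.edge_index))  # refine node features
        return self.head(x)                           # per-node class logits
```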

We train the neural network to minimize the cross-entropy loss between the predictions and the ground-truth segmentation labels. This loss is back-propagated through the entire network to update the parameters. The resulting training loop is defined below:
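A minimal sketch of one training epoch (the optimizer, data loader, and device setup are assumed to exist; see the Colab for the full version):

```python
import torch.nn.functional as F

def train_epoch(model, loader, optimizer, device):
    """One pass over the training meshes; returns the mean loss."""
    model.train()
    total_loss = 0.0
    for data in loader:                         # one mesh graph at a time
        data = data.to(device)
        optimizer.zero_grad()
        logits = model(data)                    # (num_nodes, num_classes)
        loss = F.cross_entropy(logits, data.y)  # node-level segmentation labels
        loss.backward()                         # back-propagate through the network
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)
```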

Finally, we evaluate the network’s performance by computing the accuracy between the predicted and ground-truth segmentation labels. This computation is shown below:
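A sketch of the accuracy computation (again, the names are illustrative):

```python
import torch

@torch.no_grad()
def accuracy(model, loader, device):
    """Fraction of vertices assigned the correct body-part label."""
    model.eval()
    correct, total = 0, 0
    for data in loader:
        data = data.to(device)
        pred = model(data).argmax(dim=-1)       # hard label per vertex
        correct += (pred == data.y).sum().item()
        total += data.y.numel()
    return correct / total
```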

Dataset

We train our network on the MPI FAUST dataset [2]. It contains 100 watertight humanoid meshes. We generated segmentation labels by labeling a single mesh. These labels were transferred to all other meshes via the ground-truth vertex correspondences included in the dataset. We use 80 meshes to train our neural network and evaluate on the remaining 20.

10 poses of the humanoid meshes in the MPI FAUST dataset.

The network's learning process on this dataset is visualized below. We see the class predictions converge to the correct body-part labels. The trained network achieved an accuracy of 95% on the test dataset.

Conclusion

In conclusion, we showed how to generalize the convolution operation from regular graphs defined on Euclidean space to irregular graphs with arbitrary connectivity. This generalization led us to GCNs, an extension of CNNs. Unlike traditional CNNs, GCNs can operate on irregular graphs. Thus, GCNs can be applied to diverse data: meshes as we illustrated, molecules, social networks, and even 2D images.

References

[1] Nitika Verma, Edmond Boyer, and Jakob Verbeek. "FeaStNet: Feature-Steered Graph Convolutions for 3D Shape Analysis". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018. arXiv: 1706.05206 [cs.CV].

[2] Federica Bogo et al. "FAUST: Dataset and Evaluation for 3D Mesh Registration". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ, USA: IEEE, June 2014.
