Expressive 3D Human Pose and Shape Estimation, Part 2: Mesh Estimation and 3D Rotational Pose Prediction for the Whole Body

SNU AI · Published in SNU AIIS Blog · Apr 1, 2022 · 12 min read

By Sue Hyun Park

In the Expressive 3D Human Pose and Shape Estimation series, we introduce our novel methods to compose 3D human pose and shape information from a single RGB image.

In part 1, we describe our work on 3D human pose estimation. We focus on localizing joints of human bodies and hands in the 3D space in order to lay the cornerstone for vital 2D-to-3D conversion techniques. These include accurately measuring a subject’s relative distance from the camera for multi-person scenarios and depicting the complex sequence of interacting hands.

In part 2, we describe how our work extends to estimating the human mesh, a widely used data format for 3D human shape representation. In the end, we simultaneously localize joints and mesh vertices of all human parts, including body, hands, and face, for more rich and comprehensive 3D figures. Reaching this state, we discuss how advanced 3D human pose and shape estimation methods can be applied to industries to lead the advent of new communication technologies.

If you would like to review the start of our discussion, read part 1.

Among point clouds, voxels, and meshes, the mesh is the most popular 3D data format for representing human shape. Connecting mesh vertices represents the surface as a set of polygons, which suits the articulated frame of the human body well. Above all, the mesh format is known for its compactness and usability.

Different representations of 3D data: point cloud, voxel, and mesh in order (Source: Nabil MADALI)

Using the mesh format, this blog, part 2, describes our 3D human pose and shape estimation methods that focus on delivering:

  • dense representation of human
  • additional shape representation of human
  • expressive representation of human, especially in the details of facial expressions and wrist/hand rotations
3D human pose and mesh estimation and Expressive 3D human pose and mesh estimation

We also share prospects for advanced 3D human pose and shape estimation applications, centered on future communication channels and human behavior understanding.

Refining 3D Human Pose & Mesh Estimation

Why a Lixel-based 1D Heatmap is Necessary

Previous approaches for 3D human pose and mesh estimation aim to produce mesh vertex coordinates from the input image. Conventional coordinate-based methods suffer from two drawbacks that are detrimental to test accuracy:

  • Breaking the spatial relationship between pixels in the input image: flattening the target representation into vectors at the output stage turns the prediction into a highly non-linear mapping
  • Failing to model the uncertainty of the prediction, because the predicted coordinates are single fixed values

For 3D human pose estimation, these drawbacks have already been addressed by using a heatmap as the prediction target: each value of the heatmap represents the likelihood that a human joint exists at the corresponding pixel position of the input image and discretized depth value. Specifically, the voxel-based 3D heatmap, where a voxel is a quantized cell in three-dimensional space, is widely used.

However, when extended to 3D human pose and mesh estimation, the voxel-based 3D heatmap becomes inefficient for dense mesh vertex localization. There are thousands of mesh vertices to handle, which makes predicting 3D heatmaps for all of them computationally infeasible: GPU memory usage grows toward its limit very quickly, so neither voxel-based 3D nor pixel-based 2D heatmaps are practical, as shown in the figure below.

Therefore, we propose a lixel (line + pixel)-based 1D heatmap to represent each mesh vertex. The lixel structure has far lower memory complexity than its counterparts; because the memory cost grows only linearly with the heatmap resolution, our system can predict heatmaps at sufficient resolution.
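As a back-of-the-envelope illustration of this memory argument, the snippet below simply counts the number of heatmap values each representation needs for V mesh vertices at a per-axis resolution D. The vertex count and resolution are illustrative values, not the paper's exact configuration.

```python
# Rough element-count comparison of heatmap prediction targets for V mesh vertices.
# V and D are illustrative values, not the exact figures from the I2L-MeshNet paper.
V = 6890   # e.g., the number of vertices in the SMPL body mesh
D = 64     # heatmap resolution along each axis

voxel_3d = V * D ** 3    # one 3D heatmap per vertex: grows cubically with D
pixel_2d = V * D ** 2    # one 2D heatmap per vertex: grows quadratically with D
lixel_1d = V * 3 * D     # three 1D heatmaps (x, y, z) per vertex: grows linearly with D

print(f"voxel-based 3D: {voxel_3d:,} values")  # 1,806,172,160
print(f"pixel-based 2D: {pixel_2d:,} values")  # 28,221,440
print(f"lixel-based 1D: {lixel_1d:,} values")  # 1,322,880
```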

Voxel vs. Pixel vs. Lixel, including the memory complexity comparison. V denotes # of vertices and D denotes # of pixels required for one vertex.

I2L-MeshNet: Image-to-Lixel Prediction Network

Qualitative results of the proposed I2L-MeshNet on the MSCOCO and FreiHAND datasets.

Our proposed system, I2L-MeshNet, estimates a per-lixel likelihood on 1D heatmaps for each mesh vertex coordinate. The process follows a coarse-to-fine scheme: we first predict the coarse human joints, which carry the essential articulation information, and then produce the fine mesh vertices. Hence, we design I2L-MeshNet as a cascaded network architecture that consists of PoseNet and MeshNet:

  1. PoseNet predicts the lixel-based 1D heatmaps of each 3D human joint coordinate.
  2. MeshNet utilizes the output of the PoseNet as an additional input along with the image feature to predict the lixel-based 1D heatmaps of each 3D human mesh vertex coordinate.

We note that our system naturally extends heatmap-based 3D human pose estimation to heatmap-based 3D human pose and mesh estimation. In other words, 3D multi-person pose and mesh estimation becomes available by reusing our well-performing base 3D multi-person pose estimation framework from part 1, with its single-person pose network replaced by I2L-MeshNet. The base system's RootNet supplies the absolute root depth that extends the new pose and mesh estimation to multi-person cases.
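The sketch below shows, in hedged pseudocode, how such a multi-person pipeline could be assembled: detect each person, estimate the absolute root depth with RootNet, and run I2L-MeshNet on each crop. The callables (detector, rootnet, i2l_meshnet, crop_person) are stand-ins for the actual models and utilities, not the released API.

```python
# Hypothetical sketch of plugging I2L-MeshNet into the multi-person framework from part 1.
# All callables here are stand-ins; names and signatures are illustrative only.

def estimate_multi_person_meshes(image, detector, rootnet, i2l_meshnet, crop_person):
    meshes = []
    for box in detector(image):              # 1. detect every person in the image
        crop = crop_person(image, box)       # 2. crop and resize the detected person
        root_depth = rootnet(crop, box)      # 3. RootNet: absolute depth of the root joint
        vertices = i2l_meshnet(crop)         # 4. I2L-MeshNet: root-relative mesh vertices
        vertices[..., 2] += root_depth       # 5. shift the mesh to its absolute depth
        meshes.append(vertices)
    return meshes
```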

Overall pipeline of the proposed I2L-MeshNet.

The figure illustrates how PoseNet estimates three lixel-based 1D heatmaps, one along each of the x, y, and z axes, for each human joint from image features extracted by ResNet. MeshNet has a similar architecture.
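To make the per-lixel likelihood concrete, here is a minimal PyTorch sketch of decoding three such 1D heatmaps into continuous 3D coordinates with a soft-argmax. The tensor shapes and the plain softmax normalization are assumptions for illustration, not the exact I2L-MeshNet implementation.

```python
import torch
import torch.nn.functional as F

def soft_argmax_1d(heatmap):
    """Differentiably decode 1D heatmaps into continuous coordinates.

    heatmap: (batch, num_points, D) raw scores along one axis.
    Returns: (batch, num_points) expected coordinate in [0, D).
    """
    prob = F.softmax(heatmap, dim=2)  # per-lixel likelihood along the axis
    grid = torch.arange(heatmap.shape[2], dtype=prob.dtype, device=prob.device)
    return (prob * grid).sum(dim=2)   # expectation over lixel positions

# Assumed shapes: three 1D heatmaps (x, y, z) per mesh vertex, 64 lixels each.
hx, hy, hz = (torch.randn(1, 6890, 64) for _ in range(3))
xyz = torch.stack([soft_argmax_1d(hx), soft_argmax_1d(hy), soft_argmax_1d(hz)], dim=2)
print(xyz.shape)  # torch.Size([1, 6890, 3])
```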


Experiment

When our I2L-MeshNet and previous state-of-the-art methods are trained on the same datasets (left table), I2L-MeshNet outperforms them by a large margin. In contrast, when the methods are trained on different datasets (right table), the performance gap narrows. This is because model-based methods like HMR and SPIN benefit from in-the-wild 2D human pose datasets: their pre-defined human models impose a prior distribution, unlike model-free approaches such as ours, so 2D pose-based weak supervision helps them handle in-the-wild images with diverse appearances. This highlights the need for more in-the-wild image-3D mesh training data for better generalization.

(Left) The MPJPE and PA MPJPE comparison on Human3.6M and 3DPW. All methods are trained on Human3.6M and MSCOCO. (Right) The MPJPE and PA MPJPE comparison on Human3.6M. Each method is trained on different datasets.

Nevertheless, our I2L-MeshNet significantly outperforms all previous works even though it does not use groundtruth scale information at inference time. Below are some qualitative results.

The PA MPVPE, PA MPJPE, and F-score comparison between state-of-the-art methods and the proposed I2L-MeshNet on FreiHAND. The checkmark denotes that a method uses groundtruth information during inference time.
Comparison of estimated meshes between our I2L-MeshNet and GraphCMR.

Our I2L-MeshNet also won first place on the joint orientation metric in the unknown-association track of the 3DPW Challenge.

See our released I2L-MeshNet code here.

The Full Package: Developing Expressiveness

Key Concepts

Expressive 3D human pose and mesh estimation differs from our previous studies in that it attempts to capture all human parts, including the hands and face, which are most responsible for conveying human intention and feeling. The simultaneous recovery of all human parts is enabled by a strong statistical parametric model such as SMPL-X, which takes pose, shape, and facial expression parameters and produces an expressive 3D mesh. We construct a new framework that effectively predicts SMPL-X parameters, implementing a regression-based approach proven to be fast and reliable by ExPose. Note that we take advantage of a unified human model to boost prediction accuracy for in-the-wild images, overcoming the I2L-MeshNet weakness mentioned earlier.

Here we introduce three crucial features to estimate:

  1. 3D positional pose encodes the 3D positions of human joints, providing 3D geometric evidence.
  2. Joint-level contextual features are obtained from the 3D positional pose and provide global contextual information around the human joints.
  3. 3D rotational pose encodes the 3D rotations of human joints and is the key characteristic that differentiates the 3D shape of body parts.

3D rotational poses constitute the final pose parameters we want to regress, so what matters is how well we can predict them. The first important factor is fully capturing human articulation information for predicting the 3D rotational pose. Recent works either break the spatial domain by performing global average pooling (GAP) on the instance-level image feature or use an error-prone 3D positional pose to directly predict the 3D rotational pose. The second important factor is accurately capturing the 3D hands, which is key for expressiveness. For 3D wrist rotations, using both body and hand features is necessary for an anatomically plausible and smooth connection with the body joints. But for 3D finger rotations, the same approach can be harmful: hands occupy only a small portion of the body feature map, so the hand information it provides is very coarse and degrades the quality of the combined features.

ExPose outputs implausible 3D wrist rotations because it lacks knowledge of the body features. The right shows failure cases when the hands are occluded.

We come up with novel methods to overcome these shortcomings. First, we capture much more human articulation information than previous works by using a combination of the 3D positional pose and joint-level contextual features, preserving the spatial domain and exploiting contextual cues. Second, we capture accurate 3D hands with better 3D wrist and finger rotations. We predict 3D wrist rotations from the 3D positional pose and joint-level contextual features of both the body joints and the eight hand MCP joints (the four finger root joints of each hand, excluding the thumb root). In addition, we leave out the coarse hand information from the body image and use only the fine hand information from the hand images.

Visualization of a 3D wrist rotation in three axes. The red circles represent four hand MCP joints. 3D wrist rotation is highly related to four hand MCP joints as they are child nodes of a wrist in the hand kinematic chain.

Pose2Pose: a 3D Positional Pose-guided 3D Rotational Pose Prediction Framework

Qualitative results of the proposed Pose2Pose on MSCOCO. Gender is only used for visualization.

We present Pose2Pose, a 3D positional pose-guided 3D rotational pose prediction framework for expressive 3D human pose and mesh estimation. Our system has two main modules that rest on the key concepts:

  1. PositionNet predicts the 3D positional pose from an input image in a fully convolutional way. Then, positional pose-guided pooling extracts joint-level contextual features from the image feature map at the predicted joint positions.
  2. RotationNet predicts the 3D rotational pose from the 3D positional pose and the joint-level contextual features.

Our Pose2Pose consists of body, hand, and face branches, which take a cropped body, hand, and face image, respectively. The outputs of each branch are the 3D human model parameters of that part, which are fed to the SMPL-X layer to obtain the final expressive 3D human pose and mesh. Our system is trained in an end-to-end manner.

The overall pipeline of Pose2Pose for expressive 3D human pose and mesh estimation.

The SMPL-X model is defined by a function M(θ,β,ψ) that produces a 3D mesh M for the human body. We define the parameters as below.

  • θ: 3D rotational pose parameters, consisting of θ_f for the jaw joint, θ_{rh} and θ_{lh} for the right- and left-hand joints, θ_b for the remaining body joints including the wrists, neck, and head, and θ_b^g for the global body rotation
  • β_b: the shape parameter shared by the body, face, and hands
  • ψ ∈ ℝ^{10}: the facial expression parameter

The shape parameter β_b is predicted from an image feature using GAP and a fully connected layer. The 3D jaw rotation θ_f and the facial expression parameter ψ are directly regressed using ResNet-18 and a fully connected layer, as the face keypoints do not move according to 3D joint rotations.
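For reference, the sketch below builds an SMPL-X layer with the open-source smplx Python package and runs one forward pass with the parameters named above. The model path is a placeholder (the package needs the downloaded SMPL-X model files), and the zero tensors and dimensions follow the package's default axis-angle interface rather than anything specific to Pose2Pose.

```python
import torch
import smplx  # open-source SMPL-X implementation (pip install smplx)

# 'models/' is a placeholder path to the downloaded SMPL-X model files.
model = smplx.create('models/', model_type='smplx', gender='neutral', use_pca=False)

batch = 1
output = model(
    global_orient=torch.zeros(batch, 3),         # θ_b^g: global body rotation (axis-angle)
    body_pose=torch.zeros(batch, 21 * 3),        # θ_b: remaining body joint rotations
    jaw_pose=torch.zeros(batch, 3),              # θ_f: jaw rotation
    left_hand_pose=torch.zeros(batch, 15 * 3),   # θ_lh: left-hand finger rotations
    right_hand_pose=torch.zeros(batch, 15 * 3),  # θ_rh: right-hand finger rotations
    betas=torch.zeros(batch, 10),                # β_b: shared body/face/hand shape
    expression=torch.zeros(batch, 10),           # ψ: facial expression coefficients
    return_verts=True,
)
print(output.vertices.shape)  # (1, 10475, 3): the expressive 3D mesh
```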

We will further describe how the 3D positional pose-guided 3D rotational pose prediction scheme is utilized in the body branch.

Body branch

First, PositionNet extracts a 2D image feature map F_b and predicts 3D heatmaps H_b of the human body joints from it. The 3D positional pose P_b is computed from H_b by a soft-argmax operation. Meanwhile, the body branch also predicts hand and face bounding boxes, which are used as inputs to the respective branches.

Next, positional pose-guided pooling extracts the joint-level contextual features F_b^P from the image feature map at the positions given by the predicted 3D positional pose P_b.

Finally, RotationNet takes a vector v_b, the concatenation of the flattened 3D positional pose P_b and the flattened joint-level features F_b^P. The hand branch obtains a vector v_m in a similar fashion for the hand MCP joints and passes it to RotationNet, so the final 3D body pose parameters are regressed from the concatenation of v_b and v_m, as sketched below.

Illustration of the positional pose-guided pooling in the body branch. For simplicity, we describe only the right elbow and ankle.
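Below is a minimal PyTorch sketch of how positional pose-guided pooling and the RotationNet input vector could be assembled, assuming the joint-level features are sampled from the backbone feature map with bilinear grid sampling. The shapes and helper names are assumptions for illustration, not the released Pose2Pose implementation.

```python
import torch
import torch.nn.functional as F

def positional_pose_guided_pooling(feature_map, joints_xy):
    """Sample a joint-level contextual feature at each predicted 2D joint position.

    feature_map: (batch, C, H, W) image feature map F_b from the backbone.
    joints_xy:   (batch, J, 2) x, y joint coordinates normalized to [-1, 1].
    Returns:     (batch, J, C) one feature vector per joint.
    """
    grid = joints_xy.unsqueeze(2)                                   # (batch, J, 1, 2)
    sampled = F.grid_sample(feature_map, grid, align_corners=True)  # (batch, C, J, 1)
    return sampled.squeeze(3).permute(0, 2, 1)                      # (batch, J, C)

# Assumed sizes for the body branch: J joints, C feature channels.
batch, C, H, W, J = 1, 512, 8, 8, 25
F_b = torch.randn(batch, C, H, W)          # image feature map
P_b = torch.rand(batch, J, 3) * 2 - 1      # 3D positional pose, here already in [-1, 1]
F_b_P = positional_pose_guided_pooling(F_b, P_b[..., :2])

# RotationNet input: flattened positional pose concatenated with flattened joint features.
# The hand-branch MCP vector v_m would be concatenated here as well before regression.
v_b = torch.cat([P_b.flatten(1), F_b_P.flatten(1)], dim=1)
print(v_b.shape)  # torch.Size([1, 12875]) = J*3 + J*C
```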

Experiment

We show that taking MCP features in the body branch is necessary for accurate 3D elbow and wrist rotation prediction.

By taking MCP joint features in the body branch, ours produces more accurate 3D wrist rotations than (Zhou et al., 2021), which uses only body features for 3D wrist rotation.

Moreover, our Pose2Pose outperforms all previous state-of-the-art methods on the EHF dataset by a large margin. Compared to ExPose, ours produces far more stable results for 3D wrist rotations and for the overall body and hands.

Comparison of 3D errors on EHF. The hand numbers are averaged over the left and right hands.
Qualitative results comparison between the proposed Pose2Pose and ExPose on MSCOCO validation set.

On the AGORA validation/test images, our Pose2Pose achieves the best results.

Leaderboard of the MPI AGORA Evaluation

Promising Applications

What opportunities can expressive 3D human pose and mesh estimation methods offer? We are looking forward to more synchronous, vivid communication and interaction, in extended reality and in real life.

The More Interactive Metaverse through VR Communication

The metaverse has been booming since quarantine and social distancing forced our lives online. The status quo, in which people embody personas in a virtual environment for entertainment, is just the beginning of an extensive hybrid experience. The next stepping stone will be the capability to render real-time actions seamlessly into the platform interface. Instead of manually manipulating characters and avatars, we can take full control of our actions and synchronize them in the metaverse. This will remove the barriers of distance, allowing us to engage in more natural, comfortable interactions and feel more present with people. And this is every metaverse platform's pursuit.

We anticipate that VR communication driven by 3D human pose and mesh estimation technology can be the medium. Facebook is steadily making progress in eye and face tracking to animate photo-realistic avatars of a user. Extending such authentic projection to the full-body and hands necessitates more sophisticated, accurate 3D pose and shape representation, and we believe expressive 3D human pose and mesh estimation technology will guide the way.

Facebook’s Oculus has introduced an API for building and testing apps that blend the real and virtual worlds. (Source: Oculus)
Photo-realistic telepresence can be achieved with realtime social interactions in AR/VR avatars that look, move, and sound just like you. (Source: gfycat @darthbuzzard, original video by UW Reality Lab)

Beyond Human Action Recognition, Towards Human Behavior Understanding

The ability to capture detailed human parts that signal human intention and emotion is valuable in various domains. Surveillance systems, rehabilitation and therapeutic services, and fitness assistant solutions already use human pose estimation technologies to recognize human actions. What's more, AI might be able to mimic or generate realistic actions by understanding the motives behind human actions. Existing AI audio assistants would be able to prepare assistance in advance, and non-player characters (NPCs) in games could act in response to the player's motives.

Conclusion

3D human pose and mesh estimation requires a careful choice of the format for representing mesh vertices, because there are vastly more vertices than joints. We therefore propose the lixel-based 1D heatmap as the prediction target, where the lixel structure enables both memory efficiency and sufficient resolution. It also preserves the spatial relationships in the input image and models the uncertainty of the prediction, leading to highly accurate 3D shape estimation.

Expressive 3D human pose and shape estimation requires a holistic approach while retaining high accuracy for each body part. We use joint-level contextual features to ensure cohesion, but for accurate 3D finger rotations, we use only the fine hand information.

What could be our next milestone? Perhaps the prediction target could become a fully clothed person instead of a naked body, which would prompt a spike in hyper-personalization and self-expression in the metaverse. Or the input data could be videos instead of a single RGB image.

Across the methodology and the target area, the 3D human pose and shape estimation field is rapidly progressing. As this task advances to a higher level, we can expect shifts in the way people connect and interact.

We hope our work promotes a better understanding of human action and behavior to serve amusing experiences in the changing world.

Acknowledgment

This blog post is based on the following papers:

  • Moon, Gyeongsik, and Kyoung Mu Lee. “I2L-MeshNet: Image-to-Lixel Prediction Network for Accurate 3D Human Pose and Mesh Estimation from a Single RGB Image.” ECCV. 2020. (arXiv, code)
  • Moon, Gyeongsik, and Kyoung Mu Lee. “Pose2Pose: 3D Positional Pose-Guided 3D Rotational Pose Prediction for Expressive 3D Human Pose and Mesh Estimation.” 2021. (arXiv)

We would like to thank Gyeongsik Moon for providing valuable insights to this blog post.

This post was originally published on our Notion blog on August 24, 2021.


AIIS is an intercollegiate institution of Seoul National University, committed to integrating and supporting AI-related research at Seoul National University.