Expressive 3D Human Pose and Shape Estimation, Part 1: Multi-person and Interacting Hand Pose

SNU AI · Published in SNU AIIS Blog
Apr 1, 2022 · 13 min read

By Sue Hyun Park

AI-based fitness apps (Source: Solution Analysts)

Pose is a typical form of non-verbal human expression. As poses make up gestures and actions, human pose information is important for human behavior understanding, human-computer interaction, and AR/VR. Besides the body pose, poses of specific body parts like the hands also matter, because hand motions can communicate intention or feeling that larger body motions cannot, and hands are widely used for interactions with objects. Human pose estimation is the computer vision task of detecting and analyzing human posture, technically by localizing semantic keypoints (i.e., joints) of human body parts in 3D space.

As the task has been studied for decades, you can find many real-life applications with pose estimation technologies integrated: motion capture in movies, fitness assistants, games, surveillance cameras, and so on.

Real-time moving avatar (Source: UnityList)

A variety of services capture the volume and shape too.

Virtual try-on and AR fitting (Source: FXMirror)
Interactive “blobs” with the Leap Motion Controller on the new portrait Looking Glass. (Source: Ultraleap)
AR Emoji (Source: Samsung)

There are several possible input sources for estimating 3D human pose and shape: synthesized or real images, a depth map or an RGB image, and multiple views or just a single view.

Comparison of hand images from datasets of Mueller et al., RHP, Simon et al., and FreiHAND. An RGBD image is a combination of an RGB image and its corresponding depth image.

A 3D human pose and shape estimation model is more adaptable to everyday cases when it is trained on real single RGB images. After all, the cameras we normally use are RGB-based, and we expect a single input to yield a single output. The shortcoming is that a flat 2D image carries depth and scale ambiguity, which makes recovering 3D human articulation even more complicated.

In the following two blog posts, we introduce our novel methods to compose 3D human pose and shape information from a single RGB image despite the 2D-to-3D ambiguity.

In part 1, we describe our work on 3D human pose estimation. We focus on localizing joints of human bodies and hands in the 3D space in order to lay the cornerstone for vital 2D-to-3D conversion techniques. These include accurately measuring a subject’s relative distance from the camera for multi-person scenarios and depicting the complex sequence of interacting hands.

In part 2, we describe how our work extends to estimating the human mesh, a widely used data format for 3D human shape representation. In the end, we simultaneously localize joints and mesh vertices of all human parts, including body, hands, and face, for richer and more comprehensive 3D figures. Building on this, we discuss how advanced 3D human pose and shape estimation methods can be applied in industries to lead the advent of new communication technologies.

This blog, part 1, is dedicated to 3D human pose estimation techniques that focus on delivering:

  • a compact representation of the human
  • essential articulation information of the human
  • information about multiple subjects from multi-person and interacting-hand images

We cover two tasks: 3D multi-person pose estimation and 3D interacting-hand pose estimation.

Our Base Framework that Captures 3D Multi-person Poses

The Difficulty of Estimating Absolute Camera Depth

Most previous 3D human pose estimation methods estimate the 3D pose relative to a person's center joint, commonly the pelvis. This reference point is coined the root. The final 3D pose is obtained by adding the 3D coordinates of the root to the estimated root joint-relative 3D pose. However, this setup only applies to single-person cases. In a camera-centered 3D space containing multiple persons, we also need to know each person's absolute distance from the camera, that is, each person's root joint position. The goal of 3D multi-person pose estimation is to capture both the root joint-relative 3D pose and the root joint position of each person to produce absolute 3D poses.

The root joint-relative 3D pose depicts a single person’s pose, while the absolute 3D pose determined by the help of the root joint position informs us of the spatial relationship between multiple persons.

Here is the main question: how can we extract 3D coordinates from a 2D input image when depth and scale are ambiguous? We tackle this task step-by-step:

  1. Identify the human area.
  2. Estimate where each person is located in the 3D space, i.e., 3D root joint position.
  3. Estimate each person’s 3D pose, i.e., root-joint relative 3D pose.

For step 1, we select a top-down approach that detects humans and draws bounding boxes around each person. For step 3, as it is a sub-problem of 3D single-person pose estimation, we utilize a known model.

Top: Typical Top-Down approach. Bottom: Typical Bottom-Up approach. (Source: KDNuggets)

This task is all about step 2: a newly posed challenge of predicting the absolute distance of each person from the camera. Previous approaches obtain the 3D root joint position by 3D-to-2D fitting, which minimizes the distance between an estimated 2D pose and the projection of the estimated 3D pose. However, this fitting is error-prone since it does not learn from or refine its estimate using image features. Thus, we propose a fully learning-based approach that estimates camera depth adjusted to contextual information.

Our Fully Learning-based Camera Distance-aware Top-down Approach

We propose a fully learning-based approach to estimate absolute root joint positions of multiple persons. The figure below is the pipeline of the proposed system consisting of three modules.

  1. DetectNet, a human detection network, detects the bounding boxes of humans in an input image, making our system a top-down approach.
  2. RootNet, the proposed 3D human root localization network, takes the cropped human image from the DetectNet and estimates the camera-centered coordinates of the detected humans’ roots.
  3. PoseNet, a root-relative 3D single-person pose estimation network, takes the same cropped human image, produces 3D heatmaps for each joint, and estimates the root-relative 3D pose for each detected human.

Given a single real RGB image as input, our system outputs absolute camera-centered coordinates of multiple persons’ keypoints. Any existing human detection and 3D single-person pose estimation models can be plugged into our framework.

The overall pipeline of the proposed framework for 3D multi-person pose estimation from a single RGB image. The proposed framework can recover the absolute camera-centered coordinates of multiple persons’ keypoints.
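To make the decomposition concrete, here is a minimal Python sketch of how the three modules could be chained. The functions `detect_humans`, `estimate_root`, and `estimate_relative_pose` are hypothetical stand-ins for DetectNet, RootNet, and PoseNet, not the released interfaces.

```python
import numpy as np

def estimate_absolute_poses(image, detect_humans, estimate_root, estimate_relative_pose):
    """Sketch of the camera distance-aware top-down pipeline (illustrative only)."""
    absolute_poses = []
    for box in detect_humans(image):                      # step 1: human bounding boxes
        crop = image[box.top:box.bottom, box.left:box.right]
        root_cam = estimate_root(crop, box)               # step 2: camera-centered root (X, Y, Z)
        relative_pose = estimate_relative_pose(crop)      # step 3: (J, 3) root-relative 3D pose
        # absolute 3D pose = root-relative pose translated by the root position
        absolute_poses.append(relative_pose + root_cam[np.newaxis, :])
    return absolute_poses
```

Because the three modules only communicate through bounding boxes, cropped images, and 3D coordinates, any detector or single-person pose estimator can be swapped in, as noted above.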

We use Mask R-CNN for DetectNet and the state-of-the-art model by Sun et al. for PoseNet. Our novel method for RootNet, the core module, will be explained next.

RootNet

The RootNet estimates the camera-centered coordinates of the human root R=(x_R, y_R, Z_R) from a cropped human image. To obtain them, RootNet separately estimates the 2D image coordinates (x_R, y_R) and the depth value (the distance from the camera) Z_R of the human root. The estimated 2D image coordinates of the root are back-projected to the camera-centered coordinate space using the estimated depth value, which becomes the final output of RootNet.

Network architecture of the RootNet. The RootNet estimates the 3D human root coordinate.
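The back-projection step follows the standard pinhole camera model. A minimal sketch, assuming the camera intrinsics (focal lengths f_x, f_y and principal point c_x, c_y) are known:

```python
import numpy as np

def back_project(x_img, y_img, z_depth, fx, fy, cx, cy):
    """Lift a 2D image point with an estimated depth to camera-centered 3D coordinates.

    (fx, fy): focal lengths in pixels, (cx, cy): principal point; both are assumed known here.
    """
    X = (x_img - cx) / fx * z_depth
    Y = (y_img - cy) / fy * z_depth
    return np.array([X, Y, z_depth])
```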

How can we infer the depth of the root joint from a single cropped RGB image? Derived from the pinhole camera model, we introduce a new distance measure k for the depth between the camera and the root joint:

k = sqrt(α_x · α_y · A_{real} / A_{img})

  • A_{real}: area of the human in real space, assumed to be a constant 2m × 2m (meter²)
  • A_{img}: area of A_{real} in the image space (pixel²)
  • α_x, α_y: focal lengths divided by the per-pixel distance factors of the x- and y-axes (pixel). These are camera-intrinsic parameters.

k approximates the absolute depth from the camera to the object using the ratio between the actual area and the imaged area of the object, given the camera parameters.

The pitfall is that the actual camera depth may differ from what the imaged area suggests. Consider two factors that can largely affect the size of A_{img}:

  • (a) Pose: The catcher and the batter in the left picture are at the same distance from the camera, but the catcher’s crouching pose makes his A_{img} smaller than it should be, so his k is overestimated.
  • (b) Physique and looks: The girl with a ball in the right picture is closer to the camera than the standing man, but their height difference is indistinguishable from the imaged area, so the girl’s A_{img} is smaller than it should be and her k value is overestimated.
Examples where k fails to represent the distance between a human and the camera because of incorrect A_{img}. Red boxes indicate the obtained A_{img} and blue boxes indicate the correct A_{img}, which corresponds to A_{real}.

To handle this issue, we design RootNet to interpret the pose and appearance in the image. RootNet outputs the correction factor γ from the image feature and corrects k. The examples above have k values higher than they are supposed to be, so γ > 1 and the corrected k value k/sqrt(γ) becomes closer to the real distance value.

As correction factor γ is determined only from the input image, it is a focal length-normalized value and does not rely on a specific camera setting. In other words, at inference time our system does not use any groundtruth information and thus any in-the-wild images can be lifted to the focal length-normalized 3D space.
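Putting the pieces together, here is a minimal sketch of the distance measure and its correction; the function names are ours, and the 2m × 2m constant for A_{real} follows the definition above.

```python
import numpy as np

def distance_measure_k(a_img, alpha_x, alpha_y, a_real=2.0 * 2.0):
    """Approximate the camera-to-root depth from the real-to-imaged area ratio (k)."""
    return np.sqrt(alpha_x * alpha_y * a_real / a_img)

def corrected_depth(k, gamma):
    """Apply RootNet's correction factor: rescaling A_img by gamma gives depth k / sqrt(gamma)."""
    return k / np.sqrt(gamma)
```

For the crouching catcher above, γ > 1 pulls the overestimated k back toward the true distance.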

Experiment

We conduct experiments on the largest 3D single-person pose benchmark, Human3.6M, and on the 3D multi-person pose estimation datasets MuCo-3DHP and MuPoTS-3D.

We compare our proposed system with state-of-the-art 3D human pose estimation methods. First, on the Human3.6M dataset, our method achieves comparable performance despite not using any groundtruth information at inference time. The upper table shows MPJPE (mean per joint position error) results and the lower table shows PA MPJPE (MPJPE after further alignment) results under different experimental protocols. It is worth noting that under the same setting (“without groundtruth information at inference time”), we achieve significantly better estimation accuracy. Considering that previous methods perform coordinate regression, we attribute the performance gain to our PoseNet’s 3D heatmap representation. This is an important finding for our subsequent 3D human pose and mesh estimation approach.

MPJPE (upper table) and PA MPJPE (lower table) comparison with SOTA methods on the Human3.6M dataset, using five subjects (S1, S5, S6, S7, S8) for training and two subjects (S9, S11) for testing. The best results are in bold. ∗ used extra synthetic data for training.
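For reference, MPJPE is the mean Euclidean distance between predicted and groundtruth joints, typically computed after aligning the two root joints; PA MPJPE additionally applies a Procrustes alignment before measuring. A minimal sketch of the plain MPJPE:

```python
import numpy as np

def mpjpe(pred, gt, root_idx=0):
    """Mean per joint position error after root alignment.

    pred, gt: (J, 3) arrays of 3D joint coordinates; root_idx selects the pelvis joint.
    """
    pred_rel = pred - pred[root_idx]
    gt_rel = gt - gt[root_idx]
    return np.linalg.norm(pred_rel - gt_rel, axis=1).mean()
```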

Second, on the MuCo-3DHP and MuPoTS-3D datasets, our proposed system significantly outperforms SOTA methods in most of the test sequences and joints. The table below shows a sequence-wise 3DPCK_{rel} comparison; the metric is the percentage of correct 3D keypoints after root alignment with the groundtruth.

Sequence-wise 3DPCK_{rel} comparison with state-of-the-art methods on the MuPoTS-3D dataset. ∗ used extra synthetic data for training.
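3DPCK_{rel} applies the same root alignment and then counts the fraction of joints within a distance threshold. In the sketch below the 150 mm default is the commonly used value and should be treated as an assumption.

```python
import numpy as np

def pck3d_rel(pred, gt, root_idx=0, threshold_mm=150.0):
    """Percentage of correct 3D keypoints after root alignment with the groundtruth."""
    dists = np.linalg.norm((pred - pred[root_idx]) - (gt - gt[root_idx]), axis=1)
    return 100.0 * (dists < threshold_mm).mean()
```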

On in-the-wild images, our proposed method shows impressive qualitative results as well.

See our released RootNet code here and PoseNet code here.

Into a Complex Body Part: Interacting Hands

Human hands are a critical body part in that we use them to interact with objects and other people. That said, a good 3D human hand pose estimation model should cover all realistic hand postures by training on both single-hand and interacting-hand sequences.

The obstacle is that, apart from existing single-hand scenarios, quality data with 3D joint coordinate annotations for interacting hands is lacking. In particular, datasets based on real single RGB images are limited in scale and image resolution, suggesting the need for a finer dataset.

Therefore, we first propose a large-scale dataset, InterHand2.6M, with the following specifications:

  • a variety of single-hand and interacting-hand sequences, 2.6 million frames in total, captured from 26 unique subjects (19 male, 7 female)
  • composed of real-captured RGB images
  • high 512 × 334 image resolution of the hand area (downsized from the initial 4096 × 2668 resolution for fingerprint privacy)
  • accurate and less jittery 3D hand joint coordinate annotations generated by a semi-automatic method

Moreover, we propose a baseline network, InterNet, capable of simultaneously estimating 3D single and interacting hand pose from a single RGB image.

Qualitative 3D interacting hand pose estimation results from our InterNet on the proposed InterHand2.6M

We will first describe how we captured and annotated the hand images, then continue to the pipeline of InterNet.

InterHand2.6M

Data Capture

We define two types of hand sequences and choose a variety of poses and conversational gestures.

  • peak pose (PP): a short transition from the neutral pose to a pre-defined hand pose and back to the neutral pose. Pre-defined hand poses include sign languages and extreme poses (e.g., fingers maximally bent).
  • range of motion (ROM): conversational gestures performed following minimal instructions.

In essence, the proposed InterHand2.6M covers a reasonable and general range of hand poses instead of choosing an optimal hand pose set for specific applications.

Visualization of some PP and ROM sequences

Annotation

Annotating hand keypoints from a single 2D image is hard because keypoints are frequently occluded: the skin hides the rotation center of a joint, fingers occlude other fingers, or the view itself is at an oblique angle. Therefore, we develop a 3D rotation center annotation tool that allows a human annotator to view and annotate 6 images simultaneously. We also employ machine annotation to accelerate the process and compensate for mistakes in the human annotation. The proposed semi-automatic approach is a two-stage procedure:

1) Manual human annotation aided by our annotation tool

Annotators manually annotate 2D hand joint positions in two views. Then our annotation tool automatically triangulates these 2D annotations to get 3D keypoints, which are projected to the remaining views.

2) Automatic machine annotation

Using the images annotated in the previous stage, we train a state-of-the-art 2D keypoint detector with EfficientNet as a backbone. The detector runs through unlabeled images and obtains 3D keypoints by triangulation.

Manually clicked joint positions (red circles), automatically re-projected points (green circles).
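Both annotation stages rely on triangulating 2D keypoints from calibrated views into 3D. A standard direct linear transform (DLT) sketch of that step, assuming the 3×4 projection matrices of two views are available:

```python
import numpy as np

def triangulate_joint(P1, P2, uv1, uv2):
    """Triangulate one 3D joint from its 2D positions in two calibrated views (DLT).

    P1, P2: 3x4 camera projection matrices; uv1, uv2: 2D pixel coordinates of the joint.
    """
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)           # least-squares solution: last right singular vector
    X = Vt[-1]
    return X[:3] / X[3]                   # homogeneous -> Euclidean coordinates
```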

InterNet

Our InterNet takes a high-resolution cropped image I and extracts an image feature F using a ResNet with its fully-connected layers removed. From F, InterNet simultaneously predicts handedness, 2.5D right and left hand pose, and right hand-relative left hand depth. The 2.5D pose is represented as a 3D heatmap for each joint. Unlike direct regression of 3D joint coordinates, which is a highly non-linear mapping, this representation makes learning easier and provides state-of-the-art performance.

Three outputs of the proposed InterNet.

1) Handedness estimation

To decide which hand is included in the input image, we design our InterNet to estimate the probability of the existence of the right and left hand.

2) 2.5D hand pose estimation

The 2.5D hand pose consists of the 2D pose along the x- and y-axes and the root joint (i.e., wrist)-relative depth along the z-axis.

3) Right hand-relative left hand depth estimation

To lift 2.5D hand pose to 3D space, we normally obtain an absolute depth of the root joint from our previously proposed RootNet. However, when both right and left hands are visible in the input image, RootNet tends to output unreliable depth values. To resolve high depth ambiguity in the cropped image, we design InterNet to predict right hand-relative left hand depth by leveraging the appearance of the interacting hand from the input image. This relative depth can be used instead of the output of the RootNet.
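A schematic PyTorch sketch of the three prediction heads described above; the backbone choice, head shapes, and layer names are illustrative assumptions rather than the released InterNet architecture.

```python
import torch
import torch.nn as nn
import torchvision

class InterNetSketch(nn.Module):
    """Illustrative only: ResNet feature -> handedness, per-joint 3D heatmaps, relative depth."""

    def __init__(self, num_joints=21, depth_bins=64):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool and fc
        feat_dim = 2048
        # 1) handedness: probabilities that the right / left hand appear in the crop
        self.handedness = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(feat_dim, 2), nn.Sigmoid())
        # 2) 2.5D pose: a 3D heatmap (x, y in the image, wrist-relative z) per joint, per hand
        self.heatmap = nn.Conv2d(feat_dim, 2 * num_joints * depth_bins, kernel_size=1)
        # 3) right hand-relative left hand depth (a single scalar)
        self.rel_depth = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                       nn.Linear(feat_dim, 1))
        self.num_joints, self.depth_bins = num_joints, depth_bins

    def forward(self, image):
        feat = self.backbone(image)                        # (B, 2048, H/32, W/32)
        hand_prob = self.handedness(feat)                  # (B, 2)
        h = self.heatmap(feat)                             # (B, 2*J*D, h, w)
        heatmaps = h.view(h.shape[0], 2, self.num_joints, self.depth_bins, *h.shape[2:])
        rel_depth = self.rel_depth(feat)                   # (B, 1)
        return hand_prob, heatmaps, rel_depth
```

A real implementation would typically upsample the backbone feature before predicting the heatmaps; the sketch keeps the coarse resolution for brevity. Lifting the 2.5D outputs to 3D then uses the absolute right-hand root depth (e.g., from RootNet), with the predicted relative depth added for the left hand.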

Experiment

Compared to previous state-of-the-art 3D hand pose estimation methods, the proposed InterNet performs better without relying on groundtruth information at inference time.

EPE (end point error) comparison with previous state-of-the-art methods on STB and RHP. The checkmark denotes a method that uses groundtruth information at inference time. S and H denote scale and handedness, respectively.

Even on general images from another dataset, which provides only 2D groundtruth joint coordinates, our InterNet produces good results.

Qualitative results on the dataset of Tzionas et al. and Demo of InterNet trained on InterHand2.6M

See our released InterHand2.6M dataset here and our released InterNet here.

Wrapping Up Part 1

Solving the 2D-to-3D ambiguity is the primary challenge in 3D human pose estimation from a single RGB image. For the multi-person case, we propose a 3D human root localization network, RootNet, that outputs absolute camera-centered coordinates of the roots while taking the image context into consideration. For the interacting-hand case, we estimate the right hand-relative left hand depth to complement the output of RootNet.

We expect our flexible 3D multi-person pose estimation framework to be used as a base framework for diverse purposes. With RootNet, it is easy to extend 3D single-person pose estimation techniques to absolute 3D pose estimation of multiple persons.

While the 3D pose is the essential articulation of the human, accentuating the shape enriches our understanding of human behavior. For instance, fitting a 3D hand model to our InterHand2.6M dataset to obtain 3D rotation and mesh data can lead to more expressive figures of the human hand.

The 3D human pose and shape estimation task extends from the pose-only case. Our research continues.

Read Expressive 3D Human Pose and Shape Estimation, Part 2

Acknowledgment

This blog post is based on the following papers:

  • Moon, Gyeongsik, Ju Yong Chang, and Kyoung Mu Lee. “Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image.” ICCV. 2019. (arXiv, RootNet code, PoseNet code)
  • Moon, Gyeongsik, Shoou-I Yu, He Wen, Takaaki Shiratori, and Kyoung Mu Lee. “InterHand2.6M: A Dataset and Baseline for 3D Interacting Hand Pose Estimation from a Single RGB Image.” ECCV. 2020. (arXiv, InterHand2.6M homepage, InterNet code)

We would like to thank Gyeongsik Moon for providing valuable insights to this blog post.

This post was originally published on our Notion blog on August 23, 2021.
