Expressive 3D Human Pose and Shape Estimation, Part 1: Multi-person and Interacting Hand Pose
By Sue Hyun Park
Pose is a typical non-verbal expression of humans. As poses make up gestures and actions, the human pose information is important for human behavior understanding, human computer interaction, and AR/VR. Besides the body pose, poses of specific body parts like the hand are also important because hand motions can communicate intention or feeling that body-driven large motions cannot, and hands are widely used for interactions with objects. Human pose estimation is a computer vision task of detecting and analyzing human posture, technically by localizing semantic keypoints (i.e., joints) of human body parts in 3D space.
The task has been studied for decades, and pose estimation technologies are now built into many real-life applications: motion capture in movies, fitness assistants, games, surveillance cameras, and so on.
A variety of services capture the volume and shape too.
There are several input sources from which to estimate 3D human pose and shape: synthesized or real data, a depth map or an RGB image, and multiple views or just a single view. The 3D human pose and shape estimation task expands from the pose-only case, but we work on slightly different aspects.
A 3D human pose and shape estimation model can be more adaptable to everyday cases when it is trained with a real single RGB image. After all, the cameras we normally use are RGB-based and we expect a single input for a single output. The shortcoming is that a flat 2D image has depth and scale ambiguity, making the human articulation process even more complicated.
In the following two blog posts, we introduce our novel methods to compose 3D human pose and shape information from a single RGB image despite the 2D-to-3D ambiguity.
In part 1, we describe our work on 3D human pose estimation. We focus on localizing joints of human bodies and hands in the 3D space in order to lay the cornerstone for vital 2D-to-3D conversion techniques. These include accurately measuring a subject’s relative distance from the camera for multi-person scenarios and depicting the complex sequence of interacting hands.
In part 2, we describe how our work extends to estimating the human mesh, a widely used data format for 3D human shape representation. In the end, we simultaneously localize joints and mesh vertices of all human parts, including body, hands, and face, for richer and more comprehensive 3D figures. From there, we discuss how advanced 3D human pose and shape estimation methods can be applied in industry to bring about new communication technologies.
This blog, part 1, is dedicated to 3D human pose estimation techniques that focus on delivering:
- a compact representation of the human
- the essential articulation information of the human
- information on multiple subjects from multi-person and interacting-hands images
Our Base Framework that Captures 3D Multi-person Poses
The Difficulty of Estimating Absolute Camera Depth
Most previous 3D human pose estimation methods estimate the 3D pose relative to a center joint of the human body, commonly the pelvis. This reference point is called the root. The final 3D pose is obtained by adding the 3D coordinates of the root to the estimated root joint-relative 3D pose. However, this method only applies to single-person cases. In a camera-centered 3D space containing multiple persons, we also need to know the absolute distance of each person from the camera, i.e., the root joint position of each person. The goal of multi-person pose estimation is to capture both the root joint-relative 3D pose and the root joint position of each person in order to produce absolute 3D poses.
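The relationship between the root-relative pose and the absolute pose can be illustrated with a minimal NumPy sketch. The joint layout, the coordinates, and the function name `to_absolute_pose` are made up for illustration and do not come from the released code.

```python
import numpy as np

def to_absolute_pose(root_relative_pose, root_position):
    """Shift a root-relative pose of shape (J, 3) by the camera-centered root position (3,)."""
    root_relative_pose = np.asarray(root_relative_pose, dtype=float)
    root_position = np.asarray(root_position, dtype=float)
    # Broadcasting adds the same root offset to every joint.
    return root_relative_pose + root_position

# Hypothetical pose with three joints, in meters; joint 0 is the root (pelvis).
pose_rel = np.array([[0.0, 0.0, 0.0],
                     [0.0, -0.5, 0.05],
                     [0.1, 0.45, -0.02]])

# Two people sharing the same root-relative pose but standing at different places.
roots = np.array([[-0.8, 0.1, 3.0],
                  [0.6, 0.0, 5.5]])

absolute_poses = np.stack([to_absolute_pose(pose_rel, r) for r in roots])
print(absolute_poses.shape)  # (2, 3, 3)
```

The root-relative pose alone cannot tell the two people apart, which is why the root position has to be estimated per person.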
Here is the main question: how can we extract 3D coordinates from a 2D input image when the depth and scale are ambiguous? We tackle this task step-by-step:
- Step 1: Identify the human area.
- Step 2: Estimate where each person is located in the 3D space, i.e., the 3D root joint position.
- Step 3: Estimate each person’s 3D pose, i.e., the root joint-relative 3D pose.
For step 1, we select a top-down approach that detects humans and draws bounding boxes around each person. For step 3, as it is a sub-problem of 3D single-person pose estimation, we utilize a known model.
This task is all about step 2 — a newly posed challenge of predicting the absolute distance of each person from the camera. Previous approaches obtain the 3D root joint position by 3D-to-2D fitting, in which the distance between an estimated 2D pose and the subsequently projected 3D pose is minimized. However, this method is error-prone since it does not learn and refine upon image features. Thus, we propose a fully learning-based approach to estimate camera depth adjusted to contextual information.
Our Fully Learning-based Camera Distance-aware Top-down Approach
We propose a fully learning-based approach to estimate absolute root joint positions of multiple persons. The figure below is the pipeline of the proposed system consisting of three modules.
- DetectNet, a human detection network, detects the bounding boxes of humans in an input image, making our system a top-down approach.
- RootNet, the proposed 3D human root localization network, takes the cropped human image from the DetectNet and estimates the camera-centered coordinates of the detected humans’ roots.
- PoseNet, a root-relative 3D single-person pose estimation network, takes the same cropped human image, produces 3D heatmaps for each joint, and estimates the root-relative 3D pose for each detected human.
Given a single real RGB image as input, our system outputs absolute camera-centered coordinates of multiple persons’ keypoints. Any existing human detection and 3D single-person pose estimation models can be plugged into our framework.
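The data flow of the three modules can be sketched as follows. This is a simplified outline, not the released implementation: the three networks are passed in as plain callables, the dummy functions stand in for trained models, and the joint count of 17 is an arbitrary placeholder.

```python
import numpy as np

def estimate_absolute_poses(image, detect_net, root_net, pose_net):
    """Top-down pipeline: detect people, then localize the root and pose of each crop."""
    poses = []
    for (x0, y0, x1, y1) in detect_net(image):
        crop = image[y0:y1, x0:x1]
        root_xyz = root_net(crop)   # camera-centered root position, shape (3,)
        rel_pose = pose_net(crop)   # root-relative 3D pose, shape (J, 3)
        poses.append(rel_pose + root_xyz)
    return poses

# Hypothetical stand-ins so the sketch runs without any trained model.
def dummy_detect(image):
    return [(0, 0, 4, 8), (4, 0, 8, 8)]

def dummy_root(crop):
    return np.array([0.0, 0.0, float(crop.shape[1])])  # placeholder depth

def dummy_pose(crop):
    return np.zeros((17, 3))

image = np.zeros((8, 8, 3))
poses = estimate_absolute_poses(image, dummy_detect, dummy_root, dummy_pose)
print(len(poses), poses[0].shape)  # 2 (17, 3)
```

Because the modules only communicate through bounding boxes, crops, and coordinates, swapping in a different detector or single-person pose estimator leaves the rest of the pipeline unchanged.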
We use Mask R-CNN for DetectNet and the state-of-the-art model by Sun et al. for PoseNet. Our novel method for RootNet, the core module, will be explained next.
RootNet
The RootNet estimates the camera-centered coordinates of the human root R=(x_R, y_R, Z_R) from a cropped human image. To obtain them, RootNet separately estimates the 2D image coordinates (x_R, y_R) and the depth value (the distance from the camera) Z_R of the human root. The estimated 2D image coordinates of the root are back-projected to the camera-centered coordinate space using the estimated depth value, which becomes the final output of RootNet.
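The back-projection step follows the standard pinhole camera model. The sketch below assumes known focal lengths and a known principal point; the numeric values and the function name `back_project` are illustrative only.

```python
import numpy as np

def back_project(x_px, y_px, depth, focal, principal_point):
    """Map image coordinates plus depth to camera-centered coordinates (pinhole model)."""
    fx, fy = focal
    cx, cy = principal_point
    X = (x_px - cx) / fx * depth
    Y = (y_px - cy) / fy * depth
    return np.array([X, Y, depth], dtype=float)

# Hypothetical camera: 1500-pixel focal lengths, principal point at the image center.
root_cam = back_project(x_px=1160.0, y_px=540.0, depth=3.0,
                        focal=(1500.0, 1500.0), principal_point=(960.0, 540.0))
print(root_cam)  # [0.4 0.  3. ]
```

A root that lies 200 pixels to the right of the principal point at a depth of 3 m ends up 0.4 m to the right of the optical axis.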
How can we infer the depth of the root joint from a single cropped RGB image? Derived from the pinhole camera model, we introduce a new distance measure k to define the depth between the camera and the root joint:
$$k = \sqrt{\alpha_x \alpha_y \dfrac{A_{real}}{A_{img}}}$$
where:
A_{real}: area of the human in real space, assumed as a constant area of 2m × 2m (meter²)
A_{img}: area of A_{real} in the image space (pixel²)
α_x,α_y: focal lengths divided by the per-pixel distance factors of x- and y- axes (pixel). These are camera-intrinsic parameters.
k approximates the absolute depth from the camera to the object using the ratio of the actual area and the imaged area of the object, given camera parameters.
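As a quick numeric check, the sketch below evaluates k assuming it takes the form k = sqrt(α_x · α_y · A_{real} / A_{img}) given in the ICCV 2019 paper cited at the end of this post. The focal lengths and the bounding box size are made-up values.

```python
import math

def distance_measure_k(focal_x, focal_y, area_img_px2, area_real_m2=4.0):
    """k = sqrt(alpha_x * alpha_y * A_real / A_img), in meters.

    area_real_m2 defaults to the assumed constant 2 m x 2 m human area.
    """
    return math.sqrt(focal_x * focal_y * area_real_m2 / area_img_px2)

# Hypothetical camera with 1500-pixel focal lengths and a 600 x 600 pixel human box.
k = distance_measure_k(1500.0, 1500.0, area_img_px2=600.0 * 600.0)
print(k)  # 5.0
```

Halving the side length of the box (a quarter of the area) doubles k, which matches the intuition that a person who looks half as tall is twice as far away.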
The pitfall is that the actual camera depth may be different from what it appears to be according to the imaged area. Think about two factors that can largely affect the size of A_{img}:
- (a) Pose: The catcher and the batter in the left picture are at the same distance from the camera. But the catcher’s crouching pose shrinks his A_{img} while A_{real} is still assumed to be 2m × 2m, so his k is overestimated.
- (b) Physique and looks: The girl with a ball in the right picture is closer to the camera than the standing man. But the difference in their physique cannot be read from the imaged area alone: her A_{img} is small because she is small, not because she is far away, so her k is overestimated.
To handle this issue, we design RootNet to interpret the pose and appearance in the image. RootNet outputs the correction factor γ from the image feature and corrects k. The examples above have k values higher than they are supposed to be, so γ > 1 and the corrected k value k/sqrt(γ) becomes closer to the real distance value.
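The effect of the correction factor can be shown with a small sketch. The value of γ here is chosen by hand for illustration; in the actual system it is predicted by RootNet from the image feature.

```python
import math

def correct_k(k, gamma):
    """Apply the correction factor: corrected distance = k / sqrt(gamma)."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    return k / math.sqrt(gamma)

k = 5.0          # overestimated distance, e.g. for a crouching person
gamma = 1.5625   # hypothetical correction factor, greater than 1
corrected = correct_k(k, gamma)
print(corrected)  # 4.0
```

Dividing k by sqrt(γ) is the same as multiplying the imaged area by γ before computing k, so γ > 1 acts as if the person occupied a larger area in the image.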
As correction factor γ is determined only from the input image, it is a focal length-normalized value and does not rely on a specific camera setting. In other words, at inference time our system does not use any groundtruth information and thus any in-the-wild images can be lifted to the focal length-normalized 3D space.
Experiment
We conduct experiments on the largest 3D single-person pose benchmark, Human3.6M dataset, and the 3D multi-person pose estimation datasets, MuCo-3DHP and MuPoTS-3D datasets.
We compare our proposed system with the state-of-the-art 3D human pose estimation methods. First, on the Human3.6M dataset, our method achieves comparable performance despite not using any groundtruth information at inference time. The upper table shows MPJPE (mean per joint position error) results and the lower table shows PA MPJPE (MPJPE after Procrustes alignment of the prediction to the groundtruth) results in different experimental protocols. It is worth noting that under the same setting (“without groundtruth information in inference time”), we achieve significantly better estimation accuracy. Considering that previous methods perform coordinate regression, we attribute the performance gain to our PoseNet’s 3D heatmap representation. This is an important finding for our subsequent 3D human pose and mesh estimation approach.
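For readers unfamiliar with the two metrics, here is a minimal NumPy sketch of how they are commonly computed. It is not the benchmark's official evaluation code; the Procrustes step uses the usual SVD-based similarity alignment, and the test poses are random.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and groundtruth joints, both (J, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def procrustes_align(pred, gt):
    """Best similarity transform (scale, rotation, translation) of pred onto gt."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(p.T @ g)
    # Flip the last singular direction if needed so the result is a proper rotation.
    d = 1.0 if np.linalg.det(U @ Vt) >= 0 else -1.0
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()
    return scale * p @ R + mu_g

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment of the prediction to the groundtruth."""
    return mpjpe(procrustes_align(pred, gt), gt)

# A prediction that differs from the groundtruth only by scale, rotation, and shift.
rng = np.random.default_rng(0)
gt = rng.normal(size=(17, 3))
theta = np.deg2rad(20.0)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta), np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
pred = 0.5 * gt @ Rz + np.array([0.3, -0.1, 2.0])

print(mpjpe(pred, gt) > 0.1)      # True: raw error is large
print(pa_mpjpe(pred, gt) < 1e-8)  # True: alignment removes it entirely
```

PA MPJPE therefore measures only the error in the articulated pose itself, while MPJPE also penalizes global scale, orientation, and position errors.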
Second, on the MuCo-3DHP and MuPoTS-3D datasets, our proposed system significantly outperforms SOTA methods in most of the test sequences and joints. The table below shows a sequence-wise comparison of 3DPCK_{rel}, a metric defined as the 3D percentage of correct keypoints after root alignment with the groundtruth.
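A minimal sketch of the metric is shown below. The 150 mm threshold is an assumption based on common practice for this benchmark rather than something stated in this post, and the joint coordinates are made up.

```python
import numpy as np

def pck3d_rel(pred, gt, root_idx=0, threshold=150.0):
    """Percentage of joints within `threshold` of the groundtruth after root alignment.

    pred, gt: arrays of shape (J, 3) in millimeters. The threshold value is an assumption.
    """
    pred_rel = pred - pred[root_idx]
    gt_rel = gt - gt[root_idx]
    dist = np.linalg.norm(pred_rel - gt_rel, axis=-1)
    return float((dist < threshold).mean() * 100.0)

# Hypothetical 4-joint pose in millimeters; joint 0 is the root.
gt = np.array([[0.0, 0.0, 3000.0],
               [0.0, -500.0, 3000.0],
               [200.0, -450.0, 3050.0],
               [-200.0, -450.0, 2950.0]])
pred = gt + np.array([100.0, 0.0, 400.0])  # global shift, removed by root alignment
pred[2] += np.array([0.0, 200.0, 0.0])     # 200 mm error: counted as wrong
pred[3] += np.array([50.0, 0.0, 0.0])      # 50 mm error: counted as correct

score = pck3d_rel(pred, gt)
print(score)  # 75.0
```

Because both poses are shifted to their own roots first, an error in the absolute root position does not affect this metric.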
On in-the-wild images, our proposed method shows impressive qualitative results as well.
Into a Complex Body Part: Interacting Hands
Human hands are a critical body feature in that we use hands to interact with objects and other people. A good 3D human hand pose estimation model should therefore cover all realistic hand postures by training on both single-hand and interacting-hands sequences.
The obstacle is that, unlike for single-hand scenarios, quality data with 3D joint coordinate annotations of interacting hands is lacking. In particular, datasets based on real single RGB images are limited by small scale and low image resolution, suggesting the need for a finer dataset.
Therefore, we first propose a large-scale dataset, InterHand2.6M, with the following specifications:
- a variety of single-hand and interacting-hands sequences which are 2.6 million frames in total and are captured from 26 unique subjects (19 male, 7 female)
- composed of real-captured RGB images
- high 512 × 334 image resolution of the hand area (downsized from the initial 4096 × 2668 resolution to protect fingerprint privacy)
- accurate and low-jitter 3D hand joint coordinate annotations generated by a semi-automatic method
Moreover, we propose a baseline network, InterNet, capable of simultaneously estimating 3D single and interacting hand pose from a single RGB image.
We will describe how we captured and annotated hand images. Next, we will continue on the pipeline of InterNet.
InterHand2.6M
Data Capture
We define two types of hand sequences and choose a variety of poses and conversational gestures.
- peak pose (PP) is a short transition from neutral pose to pre-defined hand poses and then transition back to neutral pose. Pre-defined hand poses include sign languages and extreme poses (e.g., fingers maximally bent).
- range of motion (ROM) represents conversational gestures performed with minimal instructions.
In essence, the proposed InterHand2.6M covers a reasonable and general range of hand poses instead of choosing an optimal hand pose set for specific applications.
Annotation
Annotating hand keypoints from a single 2D image is hard because the keypoints are frequently occluded. For instance, the skin can occlude the rotation center of a joint; fingers can occlude other fingers; or the view itself has an oblique angle. Therefore, we develop a 3D rotation center annotation tool that allows a human annotator to view and annotate 6 images simultaneously. We also implement machine annotation to accelerate the process and make up for any mistakes in human annotation. The proposed semi-automatic approach is a two-stage procedure:
1) Manual human annotation leveraged by our annotation tool
Annotators manually annotate 2D hand joint positions in two views. Then our annotation tool automatically triangulates the 2D human annotations to get 3D keypoints, which are projected to the remaining views.
2) Automatic machine annotation
Using the images annotated in the previous stage, we train a state-of-the-art 2D keypoint detector with EfficientNet as a backbone. The detector runs through unlabeled images and obtains 3D keypoints by triangulation.
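Both stages rely on triangulating 2D keypoints from calibrated views. The sketch below uses the standard linear (DLT) triangulation as an illustration; the camera matrices and the 3D point are synthetic, and the actual annotation tool may use a different or more robust solver.

```python
import numpy as np

def triangulate(points_2d, projections):
    """Linear (DLT) triangulation of one 3D point from two or more calibrated views."""
    rows = []
    for (u, v), P in zip(points_2d, projections):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    X = Vt[-1]                      # null-space direction of the stacked constraints
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point with a 3x4 projection matrix."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def camera(K, angle_deg, t):
    """Hypothetical camera rotated about the y-axis, P = K [R | t]."""
    a = np.deg2rad(angle_deg)
    R = np.array([[np.cos(a), 0.0, np.sin(a)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(a), 0.0, np.cos(a)]])
    return K @ np.hstack([R, np.asarray(t, dtype=float).reshape(3, 1)])

K = np.array([[1000.0, 0.0, 320.0], [0.0, 1000.0, 240.0], [0.0, 0.0, 1.0]])
P1 = camera(K, 0.0, [0.0, 0.0, 0.0])
P2 = camera(K, 30.0, [-0.3, 0.0, 0.1])
P3 = camera(K, -30.0, [0.3, 0.0, 0.1])

X_true = np.array([0.05, -0.02, 0.6])               # synthetic hand joint, in meters
clicks = [project(P1, X_true), project(P2, X_true)]  # "annotations" in two views
X_est = triangulate(clicks, [P1, P2])
reprojected = project(P3, X_est)                     # shown to the annotator in view 3
print(np.allclose(X_est, X_true, atol=1e-6))  # True
```

With noise-free 2D points the recovered 3D point matches the true one, and projecting it into a third view gives the annotator a prediction to verify or correct.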
InterNet
Our InterNet takes a high-resolution cropped image I and extracts the image feature F using a ResNet whose fully-connected layers are trimmed. From F, InterNet simultaneously predicts handedness, 2.5D right and left hand pose, and right hand-relative left hand depth. The final result is a 3D heatmap of each joint. Unlike direct regression of 3D joint coordinates, which is a highly non-linear mapping, the heatmap representation makes learning easier and provides state-of-the-art performance.
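To make the heatmap representation concrete, the sketch below reads a joint location out of a 3D heatmap with a soft-argmax, one common way to do so. Whether InterNet uses a soft-argmax or a plain argmax is not stated in this post, so treat the choice as an assumption; the heatmap size and peak location are made up.

```python
import numpy as np

def soft_argmax_3d(heatmap):
    """Expected (x, y, z) voxel location under a softmax over a (D, H, W) heatmap."""
    D, H, W = heatmap.shape
    prob = np.exp(heatmap - heatmap.max())  # subtract the max for numerical stability
    prob /= prob.sum()
    z = (prob.sum(axis=(1, 2)) * np.arange(D)).sum()
    y = (prob.sum(axis=(0, 2)) * np.arange(H)).sum()
    x = (prob.sum(axis=(0, 1)) * np.arange(W)).sum()
    return np.array([x, y, z])

# Synthetic heatmap for one joint with a sharp peak at x=20, y=10, z=5.
heatmap = np.zeros((8, 16, 32))
heatmap[5, 10, 20] = 50.0

joint_xyz = soft_argmax_3d(heatmap)
print(np.round(joint_xyz, 3))
```

The network only has to place probability mass near the right voxel, which is a more local prediction than regressing three real-valued coordinates directly from the image feature.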
1) Handedness estimation
To decide which hand is included in the input image, we design our InterNet to estimate the probability of the existence of the right and left hand.
2) 2.5D hand pose estimation
The 2.5D hand pose consists of the 2D pose along the $x$- and $y$-axes and the depth relative to the root joint (i.e., the wrist) along the $z$-axis.
3) Right hand-relative left hand depth estimation
To lift 2.5D hand pose to 3D space, we normally obtain an absolute depth of the root joint from our previously proposed RootNet. However, when both right and left hands are visible in the input image, RootNet tends to output unreliable depth values. To resolve high depth ambiguity in the cropped image, we design InterNet to predict right hand-relative left hand depth by leveraging the appearance of the interacting hand from the input image. This relative depth can be used instead of the output of the RootNet.
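The lifting step can be sketched as below, under the assumption that the left-hand root depth is obtained by adding the predicted relative depth to the right-hand root depth. The camera parameters, joint coordinates, and depth values are made up, and each hand has only two joints to keep the example short.

```python
import numpy as np

def lift_25d_to_3d(pose_25d, root_depth, focal, principal_point):
    """Lift a 2.5D pose (pixel x, pixel y, root-relative depth) to camera-centered 3D."""
    fx, fy = focal
    cx, cy = principal_point
    z = pose_25d[:, 2] + root_depth          # absolute depth of every joint
    x = (pose_25d[:, 0] - cx) / fx * z       # pinhole back-projection
    y = (pose_25d[:, 1] - cy) / fy * z
    return np.stack([x, y, z], axis=1)

focal, pp = (1500.0, 1500.0), (256.0, 167.0)   # hypothetical intrinsics

# Rows: wrist (root), one fingertip. Columns: x (px), y (px), root-relative depth (m).
right_25d = np.array([[300.0, 200.0, 0.0], [320.0, 150.0, 0.05]])
left_25d = np.array([[220.0, 200.0, 0.0], [210.0, 150.0, 0.03]])

right_root_depth = 0.60       # e.g. from a root localization network
rel_left_depth = -0.04        # predicted right hand-relative left hand depth
left_root_depth = right_root_depth + rel_left_depth

right_3d = lift_25d_to_3d(right_25d, right_root_depth, focal, pp)
left_3d = lift_25d_to_3d(left_25d, left_root_depth, focal, pp)
print(left_3d[0, 2] - right_3d[0, 2])  # approximately -0.04
```

Only one absolute depth is needed in this setup; the spatial relation between the two hands comes from the relative depth predicted from the image.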
Experiment
The proposed InterNet outperforms previous state-of-the-art 3D hand pose estimation methods without relying on groundtruth information at inference time.
Even on general images from another dataset that provides only 2D groundtruth joint coordinates, our InterNet produces good results.
See our released InterHand2.6M dataset here and our released InterNet here.
Wrapping Up Part 1
Solving 2D-to-3D ambiguity is the primary challenge in 3D human pose estimation from a single RGB image. For the multi-person case, we propose a 3D human root localization network, RootNet, which outputs absolute camera-centered coordinates of the roots while taking the image context into consideration. For the interacting-hand case, we figure out the right hand-relative left hand depth with the assistance of RootNet.
We expect our flexible 3D multi-person pose estimation framework to be used as a base framework for diverse purposes. With RootNet, it is easy to extend 3D single-person pose estimation techniques to absolute 3D pose estimation of multiple persons.
While the 3D pose is the essential articulation of the human, accentuating the shape enriches our understanding of human behavior. Say, fitting a 3D hand model to our dataset InterHand2.6M for the 3D rotation and mesh data can lead to more expressive figures of the human hand.
The 3D human pose and shape estimation task extends from the pose-only case. Our research continues.
Acknowledgment
This blog post is based on the following papers:
- Moon, Gyeongsik, Ju Yong Chang, and Kyoung Mu Lee. “Camera Distance-aware Top-down Approach for 3D Multi-person Pose Estimation from a Single RGB Image.” ICCV. 2019. (arXiv, RootNet code, PoseNet code)
- Moon, Gyeongsik, Shoou-i Yu, He Wen, Takaaki Shiratori, Kyoung Mu Lee. “InterHand2.6M: A Dataset and Baseline for 3D Interacting Hand Pose Estimation from a Single RGB Image.” ECCV. 2020. (arXiv, InterHand2.6M homepage, InterNet code)
We would like to thank Gyeongsik Moon for providing valuable insights to this blog post.
This post was originally published on our Notion blog on August 23, 2021.