An Overview of Human Pose Estimation with Deep Learning

An introduction to the techniques used in Human Pose Estimation based on Deep Learning.

Bharath Raj
Apr 28, 2019 · 9 min read

Written by Bharath Raj with feedback from Yoni Osin.

Photo by Alain Pham on Unsplash

A Human Pose Skeleton represents the orientation of a person in a graphical format. Essentially, it is a set of coordinates that can be connected to describe the pose of the person. Each coordinate in the skeleton is known as a part (or a joint, or a keypoint). A valid connection between two parts is known as a pair (or a limb). Note that not all part combinations give rise to valid pairs. A sample human pose skeleton is shown below.

Left: COCO keypoint format for human pose skeletons. Right: Rendered human pose skeletons. (Source)

Knowing the orientation of a person opens avenues for several real-life applications, some of which are discussed towards the end of this blog. Several approaches to Human Pose Estimation have been introduced over the years. The earliest (and slowest) methods typically estimated the pose of a single person in an image that contained only one person to begin with. These methods often identify the individual parts first, and then form connections between them to create the pose.

Naturally, these methods are not particularly useful in many real-life scenarios where images contain multiple people.

Multi-Person Pose Estimation

  • The simple approach is to incorporate a person detector first, followed by estimating the parts and then calculating the pose for each person. This method is known as the top-down approach.
  • Another approach is to detect all parts in the image (i.e. parts of every person), followed by associating/grouping parts belonging to distinct persons. This method is known as the bottom-up approach.
Top: Typical Top-Down approach. Bottom: Typical Bottom-Up approach. (Image Source)

Typically, the top-down approach is easier to implement than the bottom-up approach, as adding a person detector is much simpler than adding associating/grouping algorithms. It is hard to judge which approach performs better overall, as that ultimately comes down to whether the person detector or the associating/grouping algorithm performs better.
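The contrast between the two pipelines can be sketched in a few lines of code. Everything below is a hypothetical stand-in for illustration only (the detector, estimator, and grouping functions return toy values, not real model output); the point is the order of operations, not the implementation.

```python
def detect_people(image):
    # Hypothetical person detector: returns bounding boxes (x, y, w, h).
    return [(10, 10, 50, 100), (80, 20, 40, 90)]

def estimate_single_pose(image, box):
    # Hypothetical single-person pose estimator: keypoints inside the box.
    x, y, _, _ = box
    return [(x + 5, y + 5), (x + 5, y + 20)]  # e.g. nose, neck

def detect_all_parts(image):
    # Hypothetical part detector: every keypoint of every person,
    # with no person identity attached yet.
    return [("nose", (15, 15)), ("neck", (15, 30)),
            ("nose", (85, 25)), ("neck", (85, 40))]

def group_parts(parts):
    # Hypothetical grouping step: naively pair the i-th nose with the
    # i-th neck; real methods use learned association cues instead.
    noses = [p for name, p in parts if name == "nose"]
    necks = [p for name, p in parts if name == "neck"]
    return [[n, k] for n, k in zip(noses, necks)]

def top_down(image):
    # Detect people first, then estimate one pose per detected person.
    return [estimate_single_pose(image, box) for box in detect_people(image)]

def bottom_up(image):
    # Detect all parts first, then group them into per-person poses.
    return group_parts(detect_all_parts(image))

print(len(top_down(None)), len(bottom_up(None)))  # 2 2
```

Either route ends with one pose skeleton per person; they differ only in whether the person is isolated before or after the parts are found.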

In this blog, we will focus on multi-person human pose estimation using deep learning techniques. In the next section, we will review some of the popular top-down and bottom-up approaches for the same.

Deep Learning Methods

1. OpenPose

As with many bottom-up approaches, OpenPose first detects parts (keypoints) belonging to every person in the image, followed by assigning parts to distinct individuals. Shown below is the architecture of the OpenPose model.

Flowchart of the OpenPose architecture. (Source)

The OpenPose network first extracts features from an image using the first few layers (VGG-19 in the above flowchart). The features are then fed into two parallel branches of convolutional layers. The first branch predicts a set of 18 confidence maps, with each map representing a particular part of the human pose skeleton. The second branch predicts a set of 38 Part Affinity Fields (PAFs), which represent the degree of association between parts.
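Each confidence map is essentially a heatmap: the cell with the highest score marks where that part most likely sits. A minimal sketch of reading a part location out of one map, using a toy 5x5 grid in place of real network output:

```python
def peak(confidence_map, threshold=0.1):
    """Return (row, col, score) of the highest-scoring cell, or None."""
    best = None
    for r, row in enumerate(confidence_map):
        for c, score in enumerate(row):
            if score >= threshold and (best is None or score > best[2]):
                best = (r, c, score)
    return best

# Toy confidence map for one part (e.g. the nose); real maps are
# image-sized and produced by the first branch of the network.
nose_map = [
    [0.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.2, 0.9, 0.2, 0.0],
    [0.0, 0.1, 0.3, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]

print(peak(nose_map))  # (1, 2, 0.9)
```

In a multi-person image a map contains several peaks, one per person, so practical implementations look for all local maxima rather than a single global one.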

Steps involved in human pose estimation using OpenPose. (Source)

Successive stages are used to refine the predictions made by each branch. Using the part confidence maps, bipartite graphs are formed between pairs of parts (as shown in the above image). Using the PAF values, weaker links in the bipartite graphs are pruned. Through the above steps, human pose skeletons can be estimated and assigned to every person in the image. For a more thorough explanation of the algorithm, you may refer to their paper and to this blog post.
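The role the PAFs play in pruning can be sketched as follows: a candidate limb between two parts is scored by sampling the 2-channel vector field along the segment joining them and averaging the dot product with the segment's unit direction. This is a toy illustration, with a hypothetical constant field standing in for real network output:

```python
import math

def paf_at(x, y):
    # Toy PAF: unit vectors pointing along +x everywhere. A real PAF
    # is predicted per pixel by the second branch of the network.
    return (1.0, 0.0)

def limb_score(p1, p2, samples=10):
    """Average alignment of the PAF with the candidate limb p1 -> p2."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    norm = math.hypot(dx, dy)
    if norm == 0:
        return 0.0
    ux, uy = dx / norm, dy / norm
    total = 0.0
    for i in range(samples):
        t = i / (samples - 1)
        fx, fy = paf_at(p1[0] + t * dx, p1[1] + t * dy)
        total += fx * ux + fy * uy  # dot product: field vs. limb direction
    return total / samples

# A horizontal candidate limb agrees with the toy field; a vertical
# candidate does not, so its edge in the bipartite graph would be pruned.
print(limb_score((0, 0), (10, 0)))  # 1.0
print(limb_score((0, 0), (0, 10)))  # 0.0
```

Edges with high scores survive, and the surviving edges connect parts into per-person skeletons.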

2. DeepCut

DeepCut is a bottom-up approach that jointly handles person detection and pose estimation, by solving the following problems:

  1. Produce a set D of body part candidates. This set represents all possible locations of body parts for every person in the image. Select a subset of body parts from this set of candidates.
  2. Label each selected body part with one of C body part classes. The body part classes represent the types of parts, such as “arm”, “leg”, “torso” etc.
  3. Partition body parts that belong to the same person.
Pictorial representation of the approach. (Source)

The above problems are jointly solved by modeling them as an Integer Linear Programming (ILP) problem over triples (x, y, z) of binary random variables, with domains as stated in the image below.

Domains of the binary random variables. (Source)

Consider two body part candidates d and d' from the set of body part candidates D and classes c and c' from the set of classes C. The body part candidates were obtained through a Faster RCNN or a Dense CNN. Now, we can develop the following set of statements.

  • If x(d,c) = 1 then it means that body part candidate d belongs to class c.
  • Also, y(d,d') = 1 indicates that body part candidates d and d' belong to the same person.
  • They also define z(d,d',c,c') = x(d,c) * x(d',c') * y(d,d'). If the above value is 1, then it means that body part candidate d belongs to class c, body part candidate d' belongs to class c', and finally body part candidates d,d' belong to the same person.

The last statement can be used to partition poses belonging to different people. Clearly, the above statements can be formulated in terms of linear equations as functions of (x,y,z). In this way, the Integer Linear Program (ILP) is set up, and the pose of multiple persons can be estimated. For the exact set of equations and a much more detailed analysis, you can check out their paper here.
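The coupling z(d,d',c,c') = x(d,c) * x(d',c') * y(d,d') can be verified directly: z is 1 exactly when both labelings hold and the two candidates are grouped into the same person. The ILP linearizes this product; the tiny check below just enumerates every binary assignment to confirm the equivalence.

```python
from itertools import product

def z(x_dc, x_dpcp, y_ddp):
    # z(d,d',c,c') as defined in the text: the product of the two
    # labeling variables and the same-person grouping variable.
    return x_dc * x_dpcp * y_ddp

# Enumerate all 8 binary assignments and verify z behaves as a logical AND.
for x_dc, x_dpcp, y_ddp in product((0, 1), repeat=3):
    expected = 1 if (x_dc == 1 and x_dpcp == 1 and y_ddp == 1) else 0
    assert z(x_dc, x_dpcp, y_ddp) == expected

print("z = 1 only when all three variables are 1")
```

This is why z suffices to partition poses: a nonzero z simultaneously asserts both class labels and the shared person identity.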

3. RMPE (AlphaPose)

RMPE is a popular top-down method. Top-down approaches depend heavily on the person detector's output: duplicate bounding box predictions and inaccurate (low-confidence) boxes can cause the pose extraction to perform sub-optimally, as illustrated below.

Effect of duplicate predictions (left) and low confidence bounding boxes (right). (Source)

To resolve this issue, the authors proposed the usage of a Symmetric Spatial Transformer Network (SSTN) to extract a high-quality single-person region from an inaccurate bounding box. A Single Person Pose Estimator (SPPE) is used in this extracted region to estimate the human pose skeleton for that person. A Spatial De-Transformer Network (SDTN) is used to remap the estimated human pose back to the original image coordinate system. Finally, a parametric pose Non-Maximum Suppression (NMS) technique is used to handle the issue of redundant pose detections.
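The structure of pose NMS can be sketched as a greedy loop: keep the highest-scoring pose, suppress every remaining pose that is "too close" to it, and repeat. The paper's parametric pose NMS learns its distance criterion; the version below substitutes a simple average keypoint distance and a hand-picked threshold, both assumptions made purely for illustration.

```python
import math

def pose_distance(pose_a, pose_b):
    # Simplified distance: mean Euclidean distance between matching
    # keypoints. The paper uses a learned, parametric metric instead.
    return sum(math.dist(a, b) for a, b in zip(pose_a, pose_b)) / len(pose_a)

def pose_nms(poses, scores, threshold=5.0):
    """Greedily keep high-scoring poses, dropping near-duplicates."""
    order = sorted(range(len(poses)), key=lambda i: -scores[i])
    kept = []
    for i in order:
        if all(pose_distance(poses[i], poses[j]) >= threshold for j in kept):
            kept.append(i)
    return [poses[i] for i in kept]

poses = [
    [(10, 10), (10, 30)],   # person A
    [(11, 11), (11, 31)],   # near-duplicate detection of person A
    [(80, 10), (80, 30)],   # person B
]
scores = [0.9, 0.6, 0.8]
print(len(pose_nms(poses, scores)))  # 2
```

The duplicate of person A is suppressed because it sits within the distance threshold of a higher-scoring pose, leaving one skeleton per actual person.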

Furthermore, the authors introduce a Pose Guided Proposals Generator to augment training samples that can better help train the SPPE and SSTN networks. The salient feature of RMPE is that this technique can be extended to any combination of a person detection algorithm and an SPPE.

4. Mask RCNN

Flowchart describing the Mask RCNN Architecture. (Source)

The basic architecture first extracts feature maps from an image using a CNN. These feature maps are used by a Region Proposal Network (RPN) to get bounding box candidates for the presence of objects. The bounding box candidates select an area (region) from the feature map extracted by the CNN. Since the bounding box candidates can be of various sizes, a layer called RoIAlign is used to resize the extracted features so that they are all of uniform size. The extracted features are then passed into parallel CNN branches for the final prediction of the bounding boxes and the segmentation masks.

Let us focus on the branch that performs segmentation. Suppose an object in our image can belong to one among K classes. The segmentation branch outputs K binary masks of size m x m, where each binary mask represents all objects belonging to that class alone. We can extract keypoints belonging to every person in the image by modeling each type of keypoint as a distinct class and treating this like a segmentation problem.
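The keypoint-as-segmentation idea reduces to something very simple at read-out time: one m x m binary mask per keypoint type, with a single "hot" cell marking that keypoint's location within the region. A minimal sketch over toy 4x4 masks (the mask contents and keypoint names are invented for illustration):

```python
def keypoint_from_mask(mask):
    """Return the (row, col) of the single active cell in a binary mask."""
    for r, row in enumerate(mask):
        for c, value in enumerate(row):
            if value == 1:
                return (r, c)
    return None  # keypoint not visible in this region

# Two keypoint classes (e.g. "nose" and "neck") over a 4x4 region.
masks = {
    "nose": [[0, 0, 0, 0],
             [0, 1, 0, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0]],
    "neck": [[0, 0, 0, 0],
             [0, 0, 0, 0],
             [0, 1, 0, 0],
             [0, 0, 0, 0]],
}

keypoints = {name: keypoint_from_mask(m) for name, m in masks.items()}
print(keypoints)  # {'nose': (1, 1), 'neck': (2, 1)}
```

Combining these per-region coordinates with the region's position in the image yields the keypoints in full-image coordinates.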

In parallel, the object detection algorithm can be trained to identify the location of the persons. By combining each person's location with their set of keypoints, we obtain the human pose skeleton for every person in the image.

This method closely resembles the top-down approach, but the person detection stage is performed in parallel with the part detection stage. In other words, the keypoint detection stage and person detection stage are independent of each other.

5. Other Methods

Applications

1. Activity Recognition

  • Applications to detect if a person has fallen down or is sick.
  • Applications that can autonomously teach proper workout regimes, sports techniques, and dance activities.
  • Applications that can understand full-body sign language. (Ex: Airport runway signals, traffic policemen signals, etc.).
  • Applications that can enhance security and surveillance.
Tracking a person's gait is useful for security and surveillance purposes. (Image source)

2. Motion Capture and Augmented Reality

Example of CGI Rendering. (Source)

A good visual example of what is possible can be seen through Animoji. Even though the above only tracks the structure of a face, the idea can be extrapolated for the keypoints of a person. The same concepts can be leveraged to render Augmented Reality (AR) elements that can mimic the movements of a person.

3. Training Robots

4. Motion Tracking for Consoles

The Kinect sensor in action. (Source)

Conclusion


Written by Bharath Raj
Exploring Computer Vision and Machine Learning | https://thatbrguy.github.io

BeyondMinds

BeyondMinds leads the way to hyper-customized AI products for enterprises. Based on BeyondMinds Modular Engine (‘BME’), we solve the inherent market challenge of working with unstructured data, facilitating self-supervised learning, maintaining and re-training of models
