Human Pose Estimation

Shivanshu Yadav
14 min read · Mar 27, 2022

Download the dataset from this link: kaggle.com/datasets/niharika41298/yoga-poses-dataset?resource=download

Abstract

Pose estimation is a computer vision task that is usually tackled through deep learning. It is one of the most interesting areas of research that has gained a lot of traction because of its usefulness and versatility — it finds applications in a wide range of fields including gaming, healthcare, AR, and sports.
This report will give you a comprehensive overview of what human pose estimation is and how it works. In this document, I’ll explain what Human Pose Estimation (HPE) is.
Human pose estimation localizes body key points to accurately recognise the postures of individuals in a given image. This is a crucial prerequisite for multiple computer vision tasks, including human action recognition, human tracking, human-computer interaction, gaming, sign language recognition, and video surveillance. Therefore, we present this survey project report to fill the knowledge gap and shed light on research in 2D human pose estimation. A brief introduction is followed by a classification into single-person and multi-person pose estimation, based on the number of people to be tracked. The approaches used in human pose estimation are then described, before listing some applications and the challenges facing pose estimation. Following that, attention is given to briefly discussing research with a significant effect on human pose estimation, examining the novelty, motivation, architecture, and working principles of each model, together with its practical applications and drawbacks, the datasets used, and the evaluation metrics used to assess the model. This review is presented as a baseline for newcomers and as a guide for researchers to discover new models by observing the procedural and architectural flaws of existing research.

Introduction

Human Pose Estimation (HPE) is a way of identifying and classifying the joints in the human body. Essentially, it is a way to capture a set of coordinates for each joint (arm, head, torso, etc.), known as a key point, that can describe a person's pose. The connection between these points is known as a pair. The connection formed between the points has to be significant, which means not all points can form a pair. From the outset, the aim of HPE is to form a skeleton-like representation of the human body and then process it further for task-specific applications.
There are three types of approaches to modelling the human body:
• Skeleton-based model
• Contour-based model
• Volume-based model

These approaches are primarily used in computer vision to understand the geometric and motion information of the human body, which can be very intricate.
This section explores the two approaches: the classical approach and the deep learning-based approach to Human pose estimation. We will also explain how classical approaches fail to capture the geometric and motion information of the human body, and how deep learning algorithms such as the CNNs excel at it.
Bottom-up vs. Top-down methods
All approaches for pose estimation can be grouped into bottom-up and top-down methods.
• Bottom-up methods estimate each body joint first and then group them to form a unique pose. Bottom-up methods were pioneered with DeepCut (a method we will cover later in more detail).
• Top-down methods run a person detector first and estimate body joints within the detected bounding boxes.

Classical approaches to 2D Human Pose Estimation
Classical approaches usually refer to techniques and methods involving shallow machine learning algorithms. For instance, earlier work on human pose estimation included implementing random forests within a “pictorial structure framework”. This was used to predict joints in the human body.
The pictorial structure framework (PSF) is commonly referred to as one of the traditional methods to estimate human pose. PSF contained two components:
• Discriminator: It models the likelihood of a certain body part being present at a particular location. In other words, it identifies the body parts.
• Prior: It models the probability distribution over poses using the output from the discriminator; the modelled pose should be realistic.

In essence, the pictorial structure framework's objective is to represent the human body as a collection of coordinates for each body part in each input image. The pictorial structure framework uses nonlinear joint regressors, ideally a two-layered random forest regressor.
These models work well when the input image has clear and visible limbs; however, they fail to capture and model limbs that are hidden or not visible from a certain angle.
To overcome these issues, feature-building methods like the histogram of oriented gradients (HOG), contours, histograms, etc., were used. Despite using these methods, the classical models lacked accuracy, correlation, and generalization capabilities, so adopting a better approach was just a matter of time.
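To make the classical feature pipeline concrete, here is a minimal sketch of computing a HOG descriptor with scikit-image. scikit-image and its built-in test image are assumptions for illustration only; they are not part of the pipeline used later in this article.

```python
# A minimal sketch of the hand-crafted features used by classical pipelines:
# a HOG (histogram of oriented gradients) descriptor computed with scikit-image.
from skimage import data
from skimage.feature import hog

image = data.camera()  # built-in grayscale test image standing in for a person crop

# Gradient orientations are pooled over small cells and normalised over blocks
# of cells; classical pose estimators fed such descriptors to regressors or
# classifiers instead of learning features from data.
features = hog(
    image,
    orientations=9,
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    block_norm="L2-Hys",
)
print(features.shape)  # one long feature vector describing the whole image
```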

3D Human Body Modelling
In human pose estimation, the location of human body parts is used to build a human body representation (such as a body skeleton pose) from visual input data. Therefore, human body modelling is an important aspect of human pose estimation. It is used to represent features and key points extracted from visual input data. Typically, a model-based approach is used to describe and infer human body poses and render 2D or 3D poses.
Most methods use an N-joints rigid kinematic model where a human body is represented as an entity with joints and limbs, containing body kinematic structure and body shape information.
There are three types of models for human body modelling:
• Kinematic Model, also called skeleton-based model, is used for 2D pose estimation as well as 3D pose estimation. This flexible and intuitive human body model includes a set of joint positions and limb orientations to represent the human body structure. Therefore, skeleton pose estimation models are used to capture the relations between different body parts. However, kinematic models are limited in representing texture or shape information.
• Planar Model, or contour-based model, which is used for 2D pose estimation. The planar models are used to represent the appearance and shape of a human body. Usually, body parts are represented by multiple rectangles approximating the human body contours. A popular example is the Active Shape Model (ASM), which is used to capture the full human body graph and the silhouette deformations using principal component analysis.
• Volumetric model, which is used for 3D pose estimation. There exist multiple popular 3D human body models used for deep learning-based 3D human pose estimation and for recovering the 3D human mesh. For example, GHUM & GHUML(ite) are fully trainable end-to-end deep learning pipelines trained on a high-resolution dataset of full-body scans of over 60,000 human configurations to model statistical and articulated 3D human body shape and pose. They can be used to infer the full-body 3D pose and shape.

Deep Learning-based approaches to 2D Human Pose Estimation

Deep learning-based approaches are defined by their ability to approximate any function (given a sufficient number of nodes in the hidden layers). When it comes to computer vision tasks, deep convolutional neural networks (CNNs) surpass all other algorithms, and this is true in human pose estimation as well.
CNNs have the ability to extract patterns and representations from a given input image with more precision and accuracy than any other algorithm; this makes them very useful for tasks such as classification, detection, and segmentation. Unlike the classical approach, where the features were handcrafted, CNNs can learn complex features when provided with enough training data.
Toshev et al. in 2014 first used a CNN to estimate human pose, switching from the classical approach to the deep learning-based approach, in work they named DeepPose: Human Pose Estimation via Deep Neural Networks. In the paper they released, they framed the whole problem as a CNN-based regression towards body joint coordinates.
The authors also proposed an additional method in which they implemented a cascade of such regressors in order to get more precise and consistent results. They argued that the proposed deep neural network can model the given data in a holistic fashion, i.e., the network has the capability to model hidden poses, which was not true for the classical approach.
With the strong and promising results shown by DeepPose, human pose estimation research naturally gravitated towards deep learning-based approaches.
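To illustrate the idea of framing pose estimation as coordinate regression, here is a minimal Keras sketch. The backbone, layer sizes, and joint count are illustrative assumptions, not the architecture from the DeepPose paper (which used an AlexNet-style network).

```python
# A Keras sketch of direct coordinate regression: a small CNN backbone
# followed by a dense layer that outputs (x, y) for each of K joints.
import tensorflow as tf

NUM_JOINTS = 14  # assumed joint count for illustration

inputs = tf.keras.Input(shape=(220, 220, 3))
x = tf.keras.layers.Conv2D(32, 3, activation="relu")(inputs)
x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Conv2D(64, 3, activation="relu")(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dense(256, activation="relu")(x)
coords = tf.keras.layers.Dense(2 * NUM_JOINTS)(x)   # normalised (x, y) per joint

model = tf.keras.Model(inputs, coords)
# Regression towards joint coordinates is trained with an L2 (MSE) loss
model.compile(optimizer="adam", loss="mse")
model.summary()
```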

Human Pose Estimation using Deep Neural Networks

As research and development started to take off in human pose estimation, it brought forth new challenges. One of them was tackling multi-person pose estimation. DNNs are very proficient at estimating the pose of a single human, but they struggle with multiple humans because:
1. An image can contain multiple people in different positions.
2. As the number of people increases, the interactions between them increase, which leads to computational complexity.
3. An increase in computational complexity often leads to an increase in real-time inference latency.

In order to tackle these problems, researchers introduced two approaches:
• Top-down: Localize the humans in the image or video first, then estimate the body parts and calculate the pose for each person.
• Bottom-up: Estimate the human body parts in the image first, then group them to calculate each pose.
Here I’ll be using MediaPipe, an open-source library, to perform deep neural network-based pose estimation on a given image or video frames.

Main Challenges of Pose Detection

• Human pose estimation is a challenging task, as the appearance of the body's joints changes dynamically due to diverse forms of clothing, arbitrary occlusion, occlusions due to the viewing angle, and background contexts.
• Pose estimation needs to be robust to challenging real-world variations such as lighting and weather.

Therefore, it is challenging for image processing models to identify the fine-grained joint coordinates. It is especially difficult to track small and barely visible joints.

Introduction to Media-Pipe

Google's open-source MediaPipe was first introduced in June 2019. It aims to make our lives easier by providing integrated computer vision and machine learning features. MediaPipe is a framework for building multimodal (e.g., video, audio, or any time-series data), cross-platform (i.e., Android, iOS, web, edge devices) applied ML pipelines. MediaPipe also facilitates the deployment of machine learning technology into demos and applications on a wide variety of different hardware platforms.
Notable Applications
Face Detection
Multi-hand Tracking
Hair Segmentation
Object Detection and Tracking
Objectron: 3D Object Detection and Tracking
AutoFlip: Automatic video cropping pipeline

The image above shows the architecture of BlazePose, which is a multi-stage CNN.

ML Pipeline

The solution utilizes a two-step detector-tracker ML pipeline, proven to be effective in our MediaPipe Hands and MediaPipe Face Mesh solutions. Using a detector, the pipeline first locates the person/pose region-of-interest (ROI) within the frame. The tracker subsequently predicts the pose landmarks and segmentation mask within the ROI using the ROI-cropped frame as input. Note that for video use cases the detector is invoked only as needed, i.e., for the very first frame and when the tracker could no longer identify body pose presence in the previous frame. For other frames, the pipeline simply derives the ROI from the previous frame’s pose landmarks.
The pipeline is implemented as a MediaPipe graph that uses a pose landmark subgraph from the pose landmark module and renders using a dedicated pose renderer subgraph. The pose landmark subgraph internally uses a pose detection subgraph from the pose detection module.
Note: To visualize a graph, copy the graph and paste it into MediaPipe Visualizer. For more information on how to visualize its associated subgraphs, please see the visualizer documentation.

Process of Estimation

• In the first step, the image is passed through a baseline CNN network to extract feature maps of the input. In the OpenPose paper, the authors used the first 10 layers of the VGG-19 network.
• The feature map is then processed in a multi-stage CNN pipeline to generate the Part Confidence Maps and Part Affinity Fields:
o Part Confidence Maps: 2D heatmaps encoding the confidence that a particular body part (e.g., an elbow) occurs at each pixel location.
o Part Affinity Fields: 2D vector fields encoding the location and orientation of limbs, i.e., the degree of association between pairs of body parts.
In the last step, the Confidence Maps and Part Affinity Fields that are generated above are processed by a greedy bipartite matching algorithm to obtain the poses for each person in the image.
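As a toy illustration of that last step, the sketch below greedily matches two sets of candidate joints using a score matrix. In OpenPose the scores come from integrating the Part Affinity Field along the line joining each candidate pair; here the scores and candidates are made-up numbers for illustration only.

```python
# Toy greedy matching between candidate elbows (rows) and wrists (columns).
import numpy as np

scores = np.array([
    [0.9, 0.1, 0.2],   # elbow 0 vs. wrist candidates 0..2
    [0.2, 0.8, 0.1],   # elbow 1 vs. wrist candidates 0..2
])

pairs = []
used_rows, used_cols = set(), set()
# Visit candidate connections from the highest score downwards and greedily
# accept a pair if neither endpoint has been used and the score is high enough.
for r, c in sorted(np.ndindex(scores.shape), key=lambda rc: -scores[rc]):
    if r not in used_rows and c not in used_cols and scores[r, c] > 0.5:
        pairs.append((r, c, float(scores[r, c])))
        used_rows.add(r)
        used_cols.add(c)

print(pairs)  # keeps (elbow 0, wrist 0) and (elbow 1, wrist 1)
```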

Optionally, MediaPipe Pose can predict a full-body segmentation mask represented as a two-class segmentation (human or background).

The pose estimation component of the system predicts the location of all 33 person keypoints and uses the person alignment proposal provided by the first stage of the pipeline. The authors adopted a combined heatmap, offset, and regression approach. The heatmap and offset losses are used only in the training stage, and the corresponding output layers are removed from the model before running inference. Thus, the model effectively uses the heatmap to supervise a lightweight embedding, which is then utilized by the regression encoder network.
This approach is partially inspired by the Stacked Hourglass approach of Newell et al., but in this case a tiny encoder-decoder heatmap-based network is stacked with a subsequent regression encoder network. The authors actively utilize skip connections between all stages of the network to achieve a balance between high- and low-level features. However, the gradients from the regression encoder are not propagated back to the heatmap-trained features. They found this not only improves the heatmap predictions but also substantially increases the coordinate regression accuracy.
MediaPipe implements its pose backend on the BlazePose architecture, and BlazePose takes OpenPose as its baseline estimation approach. Thus, both of them are quite efficient pose landmark detectors.
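Before the full implementation, here is a minimal sketch of invoking MediaPipe Pose on a single image (the file name yoga.jpg is hypothetical).

```python
# Minimal MediaPipe Pose usage on one image ("yoga.jpg" is a hypothetical file).
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

image = cv2.imread("yoga.jpg")
with mp_pose.Pose(static_image_mode=True, enable_segmentation=True) as pose:
    # MediaPipe expects RGB input; OpenCV loads images as BGR
    results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.pose_landmarks:
    # 33 landmarks, each with normalised x, y, z and a visibility score
    for lm in results.pose_landmarks.landmark:
        print(lm.x, lm.y, lm.z, lm.visibility)
```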

Implementation

We can download image data from Kaggle datasets from this link:
Dataset link: kaggle.com/datasets/niharika41298/yoga-poses-dataset?resource=download

• The first step is to import some libraries for data processing into our Python notebook (see the import sketch after this list):
o MediaPipe (getting pose landmarks)
o OpenCV-Python (processing image files)
o NumPy (performing mathematical operations)
o Pandas (creating and maintaining the dataset)
o os (reading the filesystem)
o Matplotlib (plotting numerical data in a visual format)
o scikit-learn (splitting the data and computing evaluation metrics)
o XGBoost (machine learning model)
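A sketch of the corresponding imports; the exact selection is an assumption based on the steps described later in this article:

```python
# Imports for the data-processing and training pipeline described below
import os

import cv2                       # OpenCV-Python: reading and converting images
import mediapipe as mp           # MediaPipe: pose landmark detection
import numpy as np               # NumPy: array maths
import pandas as pd              # Pandas: building and saving the dataset
import matplotlib.pyplot as plt  # Matplotlib: plotting images and curves
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
from xgboost import XGBClassifier
```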

• Now we can check some example images from our datasets.

• The third step is to count the data present in our dataset, reading it from the file system and displaying it with print statements inside a for loop iterating over os.listdir().
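A sketch of this counting step; the folder layout (a TRAIN and a TEST folder, each with one sub-folder per pose) is an assumption about how the downloaded Kaggle dataset unpacks:

```python
# A sketch of counting images per class with os.listdir().
import os

DATA_DIR = "DATASET"  # hypothetical extraction path of the Kaggle download

for split in ("TRAIN", "TEST"):
    total = 0
    for pose_class in os.listdir(os.path.join(DATA_DIR, split)):
        count = len(os.listdir(os.path.join(DATA_DIR, split, pose_class)))
        print(split, pose_class, count)
        total += count
    print(split, "total images:", total)
```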

Output:

So, we have a total of 1551 images, with 1081 train and 470 test images, to process into a dataset from which our machine learning model can predict the pose.

  • We have 5 classes of poses to train our model on:
    o Down dog

o Goddess

o Plank

o Tree

o Warrior 2

  • Load the landmark model from MediaPipe into a variable and make a data frame with the necessary headers. Our data frame must have the X, Y, Z, and visibility values for each predicted landmark, plus a label for the pose. The pose is our target column to get predictions on.

A total of 133 dataset columns are to be made: 33 landmarks × 4 values (x, y, z, visibility) = 132 feature columns, plus one pose label column.
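A sketch of loading the MediaPipe pose model and building those 133 columns; the column naming scheme is an illustrative choice:

```python
# Load the MediaPipe pose model and prepare the 133 column headers:
# 33 landmarks x 4 values (x, y, z, visibility) = 132 feature columns,
# plus one "pose" label column.
import mediapipe as mp
import pandas as pd

mp_pose = mp.solutions.pose
pose_model = mp_pose.Pose(static_image_mode=True)

columns = []
for i in range(33):
    columns += [f"x{i}", f"y{i}", f"z{i}", f"vis{i}"]
columns.append("pose")        # the target column

df = pd.DataFrame(columns=columns)
print(len(columns))           # 133
```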
• Now let us process our data into a valid data frame and save it to a CSV file to train our model on (a sketch of this loop follows the process steps below).
• PROCESS
o Read images from the folders named test and train
o Convert each BGR image to RGB format to deal with any colour interchanges
o Create a black image of the same shape as the image that was read from the file
o Get the predicted landmark results and save them into a new row of our data frame
o If no results are obtained from a certain image, then that image is invalid, and it will be ignored by the algorithm.
o Draw the landmarks on the black image, then plot it on a graph to see the resultant figure for the given image.
o Print the percentage of work done; eventually, this shows what percentage of the provided images is valid.
o Save all the landmarks in a data frame and plot the images with matplotlib pyplot, to see what the data looks like.

  • The output shows the detected landmarks drawn on a black canvas for each processed image, plotted with matplotlib.

Finally, save the data into a file named data.csv for further use in training our model.
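Here is the promised sketch of the processing loop, reusing the folder layout and column names assumed earlier; progress printing is omitted for brevity:

```python
# A sketch of the processing loop: read each image, run MediaPipe Pose,
# skip invalid images, optionally draw the landmarks on a black canvas,
# and collect one row of 132 landmark values plus the pose label.
import os

import cv2
import mediapipe as mp
import numpy as np
import pandas as pd

mp_pose = mp.solutions.pose
mp_drawing = mp.solutions.drawing_utils

DATA_DIR = "DATASET"          # hypothetical extraction path (as assumed earlier)
rows = []

with mp_pose.Pose(static_image_mode=True) as pose:
    for split in ("TRAIN", "TEST"):
        for pose_class in os.listdir(os.path.join(DATA_DIR, split)):
            class_dir = os.path.join(DATA_DIR, split, pose_class)
            for name in os.listdir(class_dir):
                image = cv2.imread(os.path.join(class_dir, name))
                if image is None:
                    continue
                # MediaPipe expects RGB, OpenCV reads BGR
                results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
                if not results.pose_landmarks:
                    continue      # invalid image, ignore it
                # Draw the landmarks on a black canvas for visual inspection
                canvas = np.zeros_like(image)
                mp_drawing.draw_landmarks(canvas, results.pose_landmarks,
                                          mp_pose.POSE_CONNECTIONS)
                row = []
                for lm in results.pose_landmarks.landmark:
                    row += [lm.x, lm.y, lm.z, lm.visibility]
                rows.append(row + [pose_class])

columns = [f"{axis}{i}" for i in range(33) for axis in ("x", "y", "z", "vis")]
df = pd.DataFrame(rows, columns=columns + ["pose"])
df.to_csv("data.csv", index=False)
```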
Model Training
• The final task is training our ML classifier to predict the pose classes.
• For the training model I'm using XGBoost's XGBClassifier, fitting it on the dataset to give us train and validation losses accordingly.
• First, split the whole data into two sets:
o Test (a set to test the model on)
o Train (a set to train the model on)
• For splitting the data, we can use the sklearn library's train_test_split() method.
• Make one eval set and pass it to the fit method (a split sketch follows).
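A sketch of the split; the 20% test fraction and the label encoding are illustrative choices (XGBoost's scikit-learn wrapper expects integer-encoded class labels):

```python
# Split the landmark data into train and test sets; "data.csv" is the file
# saved above.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("data.csv")
X = df.drop(columns=["pose"])                       # 132 feature columns
y = LabelEncoder().fit_transform(df["pose"])        # pose names -> 0..4

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
eval_set = [(X_test, y_test)]   # evaluation set passed to fit() later
```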

  • Initialize a classifier using XGBClassifier and set some arguments
    o booster=’gbtree’
    o objective=’multi:softprob’
    o random_state=42
    o eval_metric=”mlogloss” (Multiclass Log Loss)
    o num_class=num_of_classes (Total number of classes is 5)
  • Actually, I used two classifiers: one to plot losses and another to plot the area under the curve. The second uses the same arguments except for the evaluation metric:
    o booster=’gbtree’
    o objective=’multi:softprob’
    o random_state=42
    o eval_metric="auc" (area under the curve)
    o num_class=num_of_classes (Total number of classes is 5)
  • Finally, fit the data with our classifiers and get the validation results.
  • We can get our predicted values using the classifier's predict() method
    o Pass the landmark data frame after dropping the target column from it (132 feature columns).
    o This will return a list of predicted pose values for each row.
  • Get the accuracy score using sklearn's accuracy_score method
    o Pass predicted labels and expected labels
    o Returns accuracy score in scaled floating value from 0 to 1
  • Get the area-under-curve score using sklearn's roc_auc_score method
    o Pass predicted labels and expected labels
    o Returns area under the curve in scaled floating value from 0 to 1
  • Accuracy score: 0.9868421052631579
  • ROC AUC score: 0.9911849372289125
  • Displaying predicted vs. expected results
    o create a data frame using pandas
    o add predicted values under the “Predicted” label
    o add expected values under the “Expected” label
    o display data frame
  • From the accuracy score, we can see the accuracy is around 98.7%.
  • Finally, get plots of the multiclass log loss and the area under the curve for each training iteration (a training and evaluation sketch follows the list).
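A sketch of the training and evaluation flow, continuing from the split above (X_train, X_test, y_train, y_test). The hyper-parameters mirror the bullet points; the variable names, the probability-based ROC AUC computation, and the plotting details are illustrative assumptions:

```python
# Training and evaluation sketch, continuing from the split above.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score
from xgboost import XGBClassifier

clf = XGBClassifier(
    booster="gbtree",
    objective="multi:softprob",
    random_state=42,
    eval_metric="mlogloss",
    # num_class is inferred automatically by the scikit-learn wrapper
)
clf.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

# Predictions and metrics
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, clf.predict_proba(X_test),
                                multi_class="ovr"))

# Predicted vs. expected labels side by side
comparison = pd.DataFrame({"Predicted": y_pred, "Expected": y_test})
print(comparison.head())

# Validation multiclass log loss per boosting iteration
history = clf.evals_result()
plt.plot(history["validation_0"]["mlogloss"])
plt.xlabel("Boosting iteration")
plt.ylabel("Multiclass log loss")
plt.show()
```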

Fig — log loss curve

Fig — area under the curve
