Computer Vision Part 8: Pose Estimation, stick figures using AI

Ilias Mansouri
17 min read · Jul 28, 2020


Introduction

As the name suggests, with pose estimation we try to infer an object’s or person’s pose from an image. This involves the identification and localization of keypoints on the body. Due to small joints, occlusions, the rotation and orientation of the body, and a lack of context, identifying keypoints is quite a challenging task. In the case of human pose estimation, which we will mostly focus on for the rest of this article, major joints like the knees, elbows, shoulders and wrists represent these keypoints.

Taxonomy wise, pose estimators can be classified into the following:

  • dimensionality (2D vs 3D)
  • single and multi pose (detect one object or multiple)
  • methodology (keypoints based vs instance based)

Using 2D pose estimators, we can predict the 2D location of keypoints in an image or video frame, while 3D pose estimators transform an object in an image into a 3D object by adding depth to the prediction. Obviously, going 3D is more challenging (and will be discussed another time). Single-pose estimators typically aim to detect and track one person or object, while multi-pose approaches detect and track multiple people or objects. In terms of methodology, broadly speaking, we find models which try to detect all instances of a particular keypoint and then attempt to group the keypoints into skeletons. Instance-based pose estimators instead first use an object detector to detect instances of an object, and then estimate the keypoints within each cropped region. In the literature, this is often referred to as the bottom-up vs. top-down approach.

Top-down approaches consist of applying a person detector to an image; for each detected person, a single-person pose estimator is then used for keypoint inference. If your person detector fails, so will your pose estimation. Furthermore, the amount of required processing is proportional to the number of persons. Bottom-up approaches suffer less from these drawbacks, but associating keypoint candidates with individual persons remains challenging nevertheless.

DeepPose

In this paper, the authors propose the first application of a Deep Neural Network (DNN) to the human pose estimation challenge. Below, we find the architecture used. For the astute among you, this is basically an AlexNet.

From the input image, each body joint’s location can be regressed directly. By passing the initial pose estimate of the original image to a cascade of such DNNs, the joint predictions can be further refined, achieving then-SOTA results.
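To make this concrete, here is a minimal sketch of a DeepPose-style stage-1 regressor, assuming PyTorch/torchvision, 16 joints and a plain AlexNet backbone (illustrative choices, not the paper’s original code). The paper’s cascade would crop a region around each initial estimate and feed it to the next such network for refinement.

```python
import torch
import torch.nn as nn
from torchvision import models

class DeepPoseSketch(nn.Module):
    """Minimal DeepPose-style regressor: backbone -> 2K joint coordinates."""
    def __init__(self, num_joints=16):  # assumption: 16 joints (MPII-style)
        super().__init__()
        self.num_joints = num_joints
        self.backbone = models.alexnet(weights=None)
        # Swap the 1000-way classifier for a 2K-dimensional regression head.
        self.backbone.classifier[6] = nn.Linear(4096, 2 * num_joints)

    def forward(self, x):
        # (batch, num_joints, 2): an (x, y) estimate per joint.
        return self.backbone(x).view(-1, self.num_joints, 2)

model = DeepPoseSketch()
coords = model(torch.randn(1, 3, 224, 224))
print(coords.shape)  # torch.Size([1, 16, 2])
```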

Deep(er)Cut

With DeepCut, the pose estimation problem for an unknown number of persons in an image was reformulated as an optimization problem consisting of three subproblems:

  • Create a set of all body part candidates in an image, from which a subset is selected.
  • From this subset, classify each body part (e.g., arm, leg or head).
  • Cluster the body parts belonging to the same person together.

These three problems are then solved jointly by casting them as an Integer Linear Program (ILP).

To find all the body parts in an image, an adapted version of Fast R-CNN was used (AFR-CNN). Specifically, the adaptation consists of replacing the Selective Search proposal generation with a Deformable Parts Model (DPM) and altering the detection size to allow the DPM to capture more context.

This idea departs from a 70’s study that dealt with the following problem: given some description of a visual object, how do we find this object in an actual photograph? In true engineeresque form, an object is modeled as a collection of parts arranged in a deformable configuration.

Part-based model of a human

A human is represented by a collection of parts arranged in a deformable configuration. The appearance of each part is modeled separately, and pairs of parts are connected by springs to introduce the necessary deformability.

Suspecting that using DPMs might be suboptimal (it was), a dense CNN built on VGG was trained instead. The detection of body parts is then reformulated as a multi-label classification: the model outputs a part probability scoremap for each candidate. Furthermore, as in other segmentation tasks, dilated convolutions were used to reduce the stride to 8 px for finer part localization.

DeeperCut is based on DeepCut’s dense CNN, but with a ResNet backbone instead. As with the VGG backbone, the original stride of 32 px is too coarse, yet using the hole algorithm throughout was infeasible due to memory constraints. The ResNet architecture was therefore tweaked: the last layers were removed, the stride of the first convolutional layers was reduced to prevent down-sampling, holes were added to all 3x3 convolutions in the fifth conv block, and deconvolutional layers were used for up-sampling.
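As an aside, torchvision exposes this stride-for-dilation trade directly. The sketch below (illustrative, not DeeperCut’s actual code) builds a ResNet-50 trunk whose output stride is 8 instead of 32:

```python
import torch
import torch.nn as nn
from torchvision import models

# replace_stride_with_dilation turns the strided 3x3 convs of the last two
# ResNet stages into dilated convs, dropping the output stride from 32 to 8.
resnet = models.resnet50(weights=None,
                         replace_stride_with_dilation=[False, True, True])
# Drop the classification head; keep the convolutional trunk.
trunk = nn.Sequential(*list(resnet.children())[:-2])

feats = trunk(torch.randn(1, 3, 256, 256))
print(feats.shape)  # torch.Size([1, 2048, 32, 32]) -> stride 8, not 32
```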

DeeperCut also benefits from the larger receptive field to reason about the locations of other parts in the vicinity. This insight, called Image-Conditioned Pairwise Terms, allows the model to compute pairwise probabilities between parts.

Pairwise part-to-part predictions: for each pair, the regressed offsets and angles are used as features to train a logistic regression, resulting in a pairwise probability

Where DeepCut solved one instance of the ILP for all body part candidates in an image, DeeperCut proposes an incremental 3-stage optimization (sketched after the list below), where:

  1. the ILP is solved for heads and shoulders
  2. elbows and wrists are added to the stage-1 solution and the ILP is re-optimized
  3. the remaining body parts are added to the stage-2 solution and the ILP is re-optimized
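Schematically, the incremental optimization might look as follows; solve_ilp is a hypothetical stand-in for the actual ILP solver, and the part groupings follow the three stages above:

```python
# Schematic sketch of DeeperCut's 3-stage incremental optimization.
# solve_ilp is a hypothetical helper standing in for the real ILP solver:
# it selects, labels and clusters the given candidates, optionally
# warm-starting from an earlier partial solution.

STAGES = [
    ["head", "shoulder"],        # stage 1
    ["elbow", "wrist"],          # stage 2
    ["hip", "knee", "ankle"],    # stage 3: remaining body parts
]

def incremental_pose_ilp(candidates, solve_ilp):
    solution, active_parts = None, []
    for stage_parts in STAGES:
        active_parts += stage_parts
        stage_candidates = [c for c in candidates if c.part in active_parts]
        # Re-optimize over all parts seen so far, seeded with the
        # previous stage's solution.
        solution = solve_ilp(stage_candidates, warm_start=solution)
    return solution
```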

Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation

In this paper, the detection pipeline consists of a Convolutional Network and a Markov Random Field (MRF). Similar to before, the ConvNet architecture is used for body part localization. The architecture is shown below:

Multi-Resolution Sliding-Window With Overlapping Receptive Fields

The architecture processes the input image using a sliding-window approach, resulting in a pixelwise heatmap representing the likelihood of each joint location. There are two overlapping multi-resolution fields: one takes a 64x64 input (upper convolution path), the other a 128x128 input down-sampled to 64x64, so that more “context” is fed into the lower convolution path. Both are normalized using Local Contrast Normalization (LCN) before being passed into the network. The authors mention that the main advantage of using overlapping fields is being able to see a larger part of the image for a relatively small increase in weights. Furthermore, thanks to LCN, the overlapping spectral content between both windows is minimal. As this demands considerable computing power, the model was improved as shown below.

Both concepts, multi-resolution (lower ConvNet) and sliding window (upper ConvNet), are kept. The high-context, low-resolution input needs half the stride of the sliding-window model; as such, four down-sampled images need to be processed. The feature maps of the sliding window are replicated, and the low-resolution feature maps are added and interleaved, resulting in an output heatmap of lower resolution than the input.

The Part-Detector will output many poses that are anatomically incorrect, as no implicit constraints between the body keypoints were modeled. This was cleverly addressed by using a high-level Spatial Model to impose constraints in terms of joint interconnections and anatomical consistency of the poses. This Spatial Model is formulated as an MRF. By first training a Part-Detector and reusing the resulting heatmap outputs to train a Spatial Model, we obtain an MRF that formulates joint dependencies in a graphical model. Finally, fine-tuning and backpropagation occur on a unified model (Part-Detector + Spatial Model).

Efficient Object Localization Using Convolutional Networks

Based on the previously mentioned work, this research implemented a multi-resolution ConvNet to estimate the joint offset location within a small region of the image. Below, we find the architecture and easily see the similarity with the previously discussed architecture.

In addition, a Spatial Dropout layer was added. It was found that applying standard dropout did not prevent overfitting, due to the strong spatial correlation in the feature maps. The solution is to drop entire feature maps instead, promoting independence between feature maps. As before, the (coarse) heatmaps are passed to an MRF which filters out the anatomically infeasible poses.
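In PyTorch terms, this spatial dropout corresponds to nn.Dropout2d, which zeroes entire channels (feature maps) rather than individual activations; a minimal illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# nn.Dropout2d zeroes whole channels, which is the SpatialDropout
# behaviour described above.
spatial_drop = nn.Dropout2d(p=0.5)
x = torch.ones(1, 4, 3, 3)  # 4 feature maps of size 3x3
y = spatial_drop(x)
# Each channel is either all zeros or all 2.0 (survivors are scaled
# by 1 / (1 - p) during training).
print(y[0, :, 0, 0])
```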

The next step is to recover the spatial information that was lost due to pooling. This is achieved by using another ConvNet to refine the results of the coarse heatmaps.

Convolutional Pose Machines

Convolutional Pose Machines (CPM) inherit and build upon the Pose Machine (PM) architecture, which incorporates rich spatial interactions between body parts and across different scales into a modular and sequential framework. As we will see, CPM takes PM further by utilizing convolutional architectures that learn feature representations for both image and spatial context.

As we see below, a PM is a sequential prediction algorithm that emulates the mechanics of message passing to predict a confidence for each body part. The rationale is that the estimated confidence for each body part is iteratively improved through each stage. Message passing can be understood as a sequence of probabilistic classifications, where the output of one predictor (any type of multi-class classifier) becomes the input of the next.

Architecture of a 1 Stage Pose Machine (a) and a 2 Stage Pose Machine (b)

In each stage, a classifier predicts a location with a confidence for each body part, based on the output of the previous classifier and the features of the image. At each stage, the predictions are thus refined. Lastly, we can observe that for each image a hierarchical representation is created by reusing the image at different scales. At level 1, as seen in the image, a coarse representation of the whole body is made, whereas level 2 represents compositions of body parts, and level 3, the finest representation, consists of a region around a keypoint. A single multi-class predictor per stage is trained across all hierarchy levels, meaning that each predictor is trained to output a set of confidences for each keypoint from a feature vector which can originate from any hierarchy level. Below in row (a), we can observe how spatial correlations between confidences for each body part are constructed by concatenating the confidence scores at location z, resulting in a vectorized patch. To obtain long-range interactions, non-maximum suppression is applied to obtain a list of peaks (high-confidence locations) for each keypoint/body part, from which offsets in polar coordinates can be calculated.

Replacing the prediction and feature extraction parts with a CNN then yields our CPM, an end-to-end architecture.

Architecture of the Pose Machine (a & b) and Convolutional Pose Machine (c & d)

The first stage of this architecture creates a feature map from a growing receptive field over the input image. Subsequent stages use both the input image and the feature maps from the previous stage to refine the predictions for each body part. The use of an intermediate loss layer prevents the gradients from vanishing during training. As stated in the paper, subsequent predictors can use previous feature maps as strong cues for where certain parts should be, and thus help eliminate wrong estimations. By gradually increasing the receptive field, the model can learn to combine contextual information in the feature maps, allowing it to learn complex relations between body parts without having to model any graphical representation of the human body.
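A minimal sketch of this multi-stage refinement with intermediate supervision (layer sizes and stage counts are illustrative, not the paper’s exact configuration):

```python
import torch
import torch.nn as nn

class CPMSketch(nn.Module):
    """CPM-style multi-stage refinement with intermediate supervision."""
    def __init__(self, num_parts=14, num_stages=3, feat_ch=32):
        super().__init__()
        self.features = nn.Sequential(  # shared image features
            nn.Conv2d(3, feat_ch, 9, padding=4), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.stage1 = nn.Conv2d(feat_ch, num_parts, 1)
        # Later stages see the image features plus the previous belief maps.
        self.refine = nn.ModuleList([
            nn.Conv2d(feat_ch + num_parts, num_parts, 11, padding=5)
            for _ in range(num_stages - 1)
        ])

    def forward(self, img):
        f = self.features(img)
        beliefs = [self.stage1(f)]
        for stage in self.refine:
            beliefs.append(stage(torch.cat([f, beliefs[-1]], dim=1)))
        return beliefs  # one belief map per stage

model = CPMSketch()
maps = model(torch.randn(1, 3, 368, 368))
target = torch.zeros_like(maps[0])  # ground-truth heatmaps would go here
# Intermediate supervision: the loss is summed over all stages.
loss = sum(nn.functional.mse_loss(b, target) for b in maps)
```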

Stacked Hourglass Networks

Motivated by the need to capture information at every scale, a novel CNN architecture was developed in which features across all scales are processed to capture the spatial relationships of the human body. Local information is necessary for identifying body parts, whereas anatomical consistency is better recognized at larger scales.

Architecture of an hourglass module

In the picture above, we can immediately discern the symmetric pairing of bottom-up and top-down processing. This type of architecture was discussed previously in the context of semantic segmentation, where it was referred to as a conv-deconv or encoder-decoder architecture.

Generally, a set of convolution and max-pooling layers processes the input features. After each max-pooling layer, the network branches off and applies another set of convolutions to the original, pre-pooled feature input. In the image above, each block consists of such a set of layers; the precise configuration of conv layers is quite flexible. Following ResNet’s success, the authors implement a residual module in each block. Once the lowest resolution is attained, the decoder, or top-down path, starts, where the network effectively combines the features across different scales. Finally, and not visible in the image, two 1x1 convolutions are applied to produce a set of heatmaps, where each heatmap predicts the probability of a keypoint’s presence.
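The recursive structure of the module lends itself to a compact sketch (residual blocks simplified to single 3x3 convolutions for brevity; the real module uses full residual blocks throughout):

```python
import torch
import torch.nn as nn

class Hourglass(nn.Module):
    """Minimal recursive hourglass: pool down, recurse, upsample, add skip."""
    def __init__(self, depth, ch):
        super().__init__()
        self.skip = nn.Conv2d(ch, ch, 3, padding=1)  # branch at this scale
        self.pool = nn.MaxPool2d(2)
        self.low1 = nn.Conv2d(ch, ch, 3, padding=1)
        # Recurse until the lowest resolution is reached.
        self.low2 = (Hourglass(depth - 1, ch) if depth > 1
                     else nn.Conv2d(ch, ch, 3, padding=1))
        self.low3 = nn.Conv2d(ch, ch, 3, padding=1)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, x):
        low = self.low3(self.low2(self.low1(self.pool(x))))
        return self.skip(x) + self.up(low)  # fuse features across scales

hg = Hourglass(depth=4, ch=64)
out = hg(torch.randn(1, 64, 64, 64))  # same spatial size in and out
```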

By creating a sequence of hourglass modules, where the output of one feeds the input of the next, a mechanism for reevaluating the features and higher-order spatial relationships is obtained. As before, it proved vital to have intermediate loss functions. As is, the loss (or supervision) can only be applied after the up-sampling stage, so there is no way for the features to be reevaluated in a larger global context. In other words, if we want the network to refine its predictions, these predictions must not only be of local scale but of larger scale, so that they can relate across a larger context of the image. Below, we can observe the proposed solution:

Overview of intermediate supervision process where a loss is applied on produced heatmaps (blue)

Intermediate heatmaps are generated and a loss is applied to them; a 1x1 convolution then remaps these heatmaps to the feature space, where they are combined with the features output by the previous hourglass module.

Training is done on a sequence of a whopping 8 hourglass modules, with no weights shared between modules. Each module uses the same mean-squared-error loss on its heatmaps against the same ground truth.

OpenPose

OpenPose, which is also the first open-source library for real-time keypoint detection, is an improved CMUPose. CMUPose presented the first bottom-up pose estimator using Part Affinity Fields (PAFs).

Given an input image, heatmaps representing the probability of a keypoint being present at each pixel, as well as vector fields of part affinities, are generated. Both are produced by the two-branch multi-stage CNN shown below.

The input image is passed through the first 10 layers of a fine-tuned VGG, from which a feature map F is generated. This feature map F is then used as input to the first stage of each branch. Branch 1 (top) predicts the confidence maps for the keypoints, whereas Branch 2 predicts the part affinity fields. Refinement of the confidence maps and affinity fields happens by concatenating the previous predictions from both branches with the feature map F. At the end of each stage, an L2 loss is applied between estimations and ground truth.

As we have seen repeatedly, confidence maps are 2D heatmaps expressing the belief that a keypoint is present at a given pixel. Part affinity fields are 2D vector fields encoding the direction from one part of a limb to the other. This feature representation has the advantage of preserving both location and orientation information across the limb’s region of support. Performing non-maximum suppression, we obtain a set of body part location candidates, each of which could be assigned to any of several persons. Using the line integral, which quantifies the effect of a field along a curve, body parts are matched to humans via the affinity fields.
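A sketch of this association score as a discretized line integral; the field arrays and candidate coordinates are assumed to come from the network’s outputs:

```python
import numpy as np

def paf_score(paf_x, paf_y, p1, p2, num_samples=10):
    """Approximate the line integral of the PAF along the segment p1 -> p2.

    paf_x, paf_y: 2D arrays holding the x/y components of the affinity field.
    p1, p2: (x, y) candidate locations of the two body parts (in bounds).
    """
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    limb = p2 - p1
    norm = np.linalg.norm(limb)
    if norm < 1e-8:
        return 0.0
    unit = limb / norm  # direction the field should align with
    score = 0.0
    for t in np.linspace(0.0, 1.0, num_samples):
        x, y = (p1 + t * limb).round().astype(int)
        # Dot product of the field with the limb direction at this sample.
        score += paf_x[y, x] * unit[0] + paf_y[y, x] * unit[1]
    return score / num_samples
```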

Building on top of the work of CMUPose, OpenPose refines only the PAFs across stages, dropping the stage-wise refinement of body part confidence maps. Below, we can observe that the PAFs, which represent the part-to-part associations, are encoded first and then fed into a CNN to infer the detection confidence maps.

Architecture of multi-stage OpenPose

Network depth is increased by replacing the 7x7 conv layers with three consecutive 3x3 kernels whose outputs are concatenated. Computation-wise, processing is halved, as it is no longer necessary to refine both PAFs and confidence maps at each stage: the PAFs are refined first and passed on to the next stage, after which the confidence maps are refined. If the PAFs are known, body part locations can be inferred; the reverse, however, is not true.
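A sketch of that replacement block (channel sizes and the PReLU non-linearity are illustrative choices, not necessarily the paper’s exact configuration):

```python
import torch
import torch.nn as nn

class Conv3x3Block(nn.Module):
    """Three stacked 3x3 convs with concatenated outputs: the receptive
    field of a 7x7 conv with fewer parameters and extra non-linearities."""
    def __init__(self, in_ch, mid_ch):
        super().__init__()
        self.c1 = nn.Sequential(nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.PReLU())
        self.c2 = nn.Sequential(nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.PReLU())
        self.c3 = nn.Sequential(nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.PReLU())

    def forward(self, x):
        y1 = self.c1(x)
        y2 = self.c2(y1)
        y3 = self.c3(y2)
        return torch.cat([y1, y2, y3], dim=1)  # 3 * mid_ch output channels

block = Conv3x3Block(128, 96)
out = block(torch.randn(1, 128, 46, 46))  # -> (1, 288, 46, 46)
```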

(Higher)HRNet

A novel architecture is discussed in which high-to-low resolution sub-networks are connected in parallel, rather than in series as in most existing solutions, thereby maintaining high-resolution representations throughout.

HRNet architecture

Rich high-resolution features are obtained by multi-scale fusions across sub-networks, such that each of the high-to-low resolution representations receives information from the other parallel representations. Down-sampling uses strided convolutions, whereas up-sampling happens via a 1x1 convolution followed by nearest-neighbor up-sampling. Heatmaps are regressed from the main high-resolution branch.
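A two-branch sketch of such a fusion (exchange) step, with illustrative channel sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchFusion(nn.Module):
    """HRNet-style exchange between a high-res and a low-res branch:
    strided 3x3 conv to go down, 1x1 conv + nearest upsampling to go up."""
    def __init__(self, hi_ch=32, lo_ch=64):
        super().__init__()
        self.down = nn.Conv2d(hi_ch, lo_ch, 3, stride=2, padding=1)
        self.up = nn.Conv2d(lo_ch, hi_ch, 1)

    def forward(self, hi, lo):
        new_hi = hi + F.interpolate(self.up(lo), scale_factor=2, mode="nearest")
        new_lo = lo + self.down(hi)
        return new_hi, new_lo

fuse = TwoBranchFusion()
hi, lo = fuse(torch.randn(1, 32, 64, 64), torch.randn(1, 64, 32, 32))
```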

Based on this initial work, HigherHRNet addresses 2 main challenges:

  • How to improve the inference performance of small persons without sacrificing the inference performance of large persons?
  • How to generate high-resolution heatmaps for keypoints detection of small persons?

Using HRNet as backbone, HigherHRNet (below) adds a deconvolution module where heatmaps are predicted from higher resolution feature maps.

The stem is a sequence of two strided 3x3 conv layers that decreases the resolution to a quarter, after which the input passes through the HRNet backbone. The 4x4 deconvolutional layer, followed by BatchNorm and ReLU, takes the features and predicted heatmaps as input and generates a feature map twice the input size. Four residual blocks are added after the deconv layer to refine the high-resolution feature map. Finally, the heatmaps of the feature pyramid are aggregated by bilinearly up-sampling the low-resolution heatmaps, and the final prediction is obtained by averaging over all heatmaps.

PifPaf

PifPaf was developed with the goal of estimating human poses in crowded urban settings, making it suited for self-driving cars, delivery robots and the like. Below, we observe that a ResNet backbone is used with 2 heads: the Part Intensity Field (PIF) head predicts the location, size and confidence of each keypoint, whereas the Part Association Field (PAF) head predicts associations between keypoints.

PifPaf Architecture

More specifically, PIF outputs a confidence, a vector component pointing to the closest keypoint with a spread factor, and a scale. As seen below, the confidence map is quite coarse. Its localization is therefore improved by fusing it with the vector field, generating a higher-resolution confidence map. The scale, or spatial extent, of a joint can then be learned from this field. This scale and the aforementioned spread help improve pose estimation performance across humans of different sizes.

Left: confidence map, Middle: vector field, Right: fused confidence map

Joint locations are connected bottom-up into poses using the PAF head, which tries to connect pairs of keypoints. Examples of the 19 associations are:

  • left ankle to left knee
  • left hip to right hip
  • nose to right eye

PAF associating left shoulder with left hip

For a given feature map, at each location, PAF predicts a confidence for the origin of the two vectors of a keypoint association (upper image, left). Associations with a confidence above 0.5 are shown on the right side.

Finally, the decoder takes both fields (PIF & PAF) and converts them into a set of 17 coordinates representing a human skeleton. A greedy algorithm creates a priority queue of all keypoint types in descending order of confidence. These points serve as candidate seeds, which are popped from the queue, and connections to other joints are added with the help of the PAF fields. The PAF associations are scored, as multiple connections between the current and next keypoints can occur. Finally, non-maximum suppression is applied per keypoint type to produce the human skeleton.
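A schematic sketch of this greedy decoding; the seed tuples and the paf_lookup helper are hypothetical stand-ins for the real PIF/PAF outputs:

```python
import heapq

def greedy_decode(seeds, paf_lookup, num_keypoints=17):
    """Grow one skeleton from PIF seeds, highest confidence first.

    seeds: iterable of (confidence, keypoint_type, x, y) PIF detections.
    paf_lookup: hypothetical helper returning PAF-scored connections
    (confidence, next_type, x, y) reachable from a given keypoint.
    """
    queue = [(-conf, kp_type, x, y) for conf, kp_type, x, y in seeds]
    heapq.heapify(queue)  # highest-confidence candidates pop first
    skeleton = {}
    while queue and len(skeleton) < num_keypoints:
        neg_conf, kp_type, x, y = heapq.heappop(queue)
        if kp_type in skeleton:
            continue  # keypoint already placed by a stronger connection
        skeleton[kp_type] = (x, y, -neg_conf)
        # Push PAF-scored connections to not-yet-placed keypoints.
        for conf, next_type, nx, ny in paf_lookup(kp_type, x, y):
            heapq.heappush(queue, (-conf, next_type, nx, ny))
    return skeleton
```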

DirectPose

The first multi-person pose estimator was proposed in which keypoint annotations were used for end-to-end training, while at inference the model maps an input to keypoints for each individual instance without any box detections. Motivated by the emergence of anchor-free object detection, which directly regresses the two corners of a target bounding box, the researchers investigated whether such a detection technique could be used to detect keypoints, the rationale being that the detection task can be reformulated as a special bounding box with more than 2 corner points. They showed it performs poorly, mainly because only one feature vector is used to regress all the keypoints. They solve this challenge by extending the Fully Convolutional One-Stage Object Detection (FCOS) architecture with an output branch for keypoint detection.

FCOS Architecture

FCOS reformulates the object detection task in a per-pixel fashion. Similar to semantic segmentation, FCOS treats the pixels of the input image as training samples, instead of the anchor boxes used in anchor-based detectors. Pixels which fall into a ground-truth bounding box are considered positive and acquire the following:

  • the ground truth’s class label
  • a 4D vector representing the distances from the location to the four sides of the bounding box, used as the regression target for that location

Making use of a Feature Pyramid Network (FPN) ensures better robustness across object scales. The feature maps generated by the backbone (ResNet-50) are followed by a 1x1 convolution. The feature levels P3, P4, P5, P6 and P7 have strides 8, 16, 32, 64 and 128. Except for P6 and P7, the respective lateral connections and top-down pathways are merged by addition. Multi-level prediction also handles the case where two bounding boxes of different sizes overlap: FCOS restricts the regression at the different feature levels with the thresholds 0, 64, 128, 256, 512 and infinity for levels P3 to P7, representing the maximum distance a feature level needs to regress. If overlapping bounding boxes still occur, the smaller one is chosen. As different feature levels regress different size ranges, different heads are required. Lastly, the authors introduce the concept of center-ness to suppress the many low-quality predicted bounding boxes far from an object’s center; this head predicts a normalized distance, computed from the four side distances, between the location and the center of its bounding box.
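A sketch of how those per-location regression targets can be computed for a single ground-truth box (coordinates in feature-map units; names are illustrative):

```python
import torch

def fcos_targets(locations, box):
    """Per-pixel FCOS-style targets: (l, t, r, b) distances to the box sides.

    locations: (N, 2) tensor of (x, y) positions; box: (x0, y0, x1, y1).
    Returns the targets for positive locations and the positivity mask.
    """
    x, y = locations[:, 0], locations[:, 1]
    x0, y0, x1, y1 = box
    l, t = x - x0, y - y0
    r, b = x1 - x, y1 - y
    targets = torch.stack([l, t, r, b], dim=1)
    inside = targets.min(dim=1).values > 0  # positives fall inside the box
    return targets[inside], inside

locs = torch.tensor([[10.0, 12.0], [50.0, 60.0]])
targets, mask = fcos_targets(locs, (5.0, 5.0, 40.0, 40.0))
print(targets)  # only the first location lies inside the box
```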

DirectPose regards keypoints as a very special bounding box with K corner points. However, during their experiments an inferior performance was observed due to the lack of alignment between the features and the predicted keypoints. This is because many keypoints are distant from the center of the feature vector’s receptive field, and as an input signal deviates farther from the center of the receptive field, the intensity of the feature’s response to that input decays.

As such, a Keypoint Align Module (KPAM) is proposed. Taking a 256-channel feature map as input, KPAM slides densely over this feature map. The locator, as the name suggests, predicts the indices of where a feature vector should predict a keypoint instance, from which the feature sampler samples feature vectors of length 256. For the n-th keypoint, the n-th conv layer takes the n-th feature vector as input and predicts coordinates relative to the location of the sampled feature vector. By summing the K offsets from the locator and from KPAlign, we obtain coordinates that need to be rescaled to match the original feature map. A small final tweak groups keypoints which always appear in the same area (nose, eyes and ears) so that they share the same feature vector.

Finally, we can see how KPAM replaces the bounding-box module of the aforementioned FCOS architecture. We also observe an additional heatmap branch, used as an auxiliary task/loss to make the regression-based task more feasible.

DirectPose Architecture

Conclusion

Clearly, the task of estimating poses is quite a considerable challenge. It has repeatedly been shown that bottom-up approaches outperform top-down approaches, but then one needs to associate keypoints with persons. This grouping or assembling process to produce the final instance-aware keypoints can be accomplished using heuristics, human skeleton modelling (pictorial structures) and/or stacked confidence maps. Furthermore, the complexity explodes when one considers that an unknown number of people can appear anywhere and at any scale in the image. Human interactions, articulations and of course occlusions make the keypoint assembly process complex.

Pose estimation finds significant applications in fields like human-computer interaction, action recognition, surveillance, image understanding, threat prediction, robotics, AR and VR, animation and gaming.
