A Short Guide to Pose Estimation in Computer Vision

5 min read · Apr 2, 2020

This article introduces pose estimation: how it works, how the main approaches compare, and what their pros and cons are within computer vision. We will also look at applications of this cool technology.

What is Pose Estimation?

Simply put, pose estimation is the localization of human joints in images or videos. It comes in two forms: 2-D pose estimation and 3-D pose estimation, which adds the dimension of depth. Pose estimation is a subfield of computer vision and has applications in several fields.

Here is an example of 2-D versus 3-D pose estimation:

As the image above shows, certain joints are used to compute keypoints and angles. We will now examine the models on which pose estimation is built.

How it works

Pose estimation works by locating specific joints of the human body. These joints are known as “keypoints” within the pose estimation system. Two models are commonly referenced in pose estimation: the classical pictorial structures framework and the deformable parts model. Both describe the body as a set of parts with spatial relationships between them, which allows joint positions and the angles between connecting limbs to be parameterized as vectors. Once detected, the keypoints are labeled and connected to form a skeleton. Keypoints usually line up with real human joints such as the elbow, wrist, and knee. Here is an example of labeled keypoints in pose estimation:

Labeled Pose Keypoints for a Sample Image
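
To make the idea of keypoints, connecting limbs, and joint angles concrete, here is a minimal sketch in Python. The keypoint names, the pixel coordinates, and the helper functions limb_vector and joint_angle are all made up for illustration; they do not come from any particular pose estimation library.

```python
import math

# Hypothetical 2-D keypoints (x, y) in pixels for one arm; values are illustrative.
keypoints = {
    "shoulder": (100.0, 100.0),
    "elbow": (100.0, 150.0),
    "wrist": (150.0, 150.0),
}

# Limbs are pairs of keypoint names that get connected when drawing the skeleton.
limbs = [("shoulder", "elbow"), ("elbow", "wrist")]


def limb_vector(start, end):
    """Vector pointing from keypoint `start` to keypoint `end`."""
    (x0, y0), (x1, y1) = keypoints[start], keypoints[end]
    return (x1 - x0, y1 - y0)


def joint_angle(a, joint, b):
    """Angle in degrees at `joint` between the segments joint->a and joint->b."""
    ux, uy = limb_vector(joint, a)
    vx, vy = limb_vector(joint, b)
    cos_theta = (ux * vx + uy * vy) / (math.hypot(ux, uy) * math.hypot(vx, vy))
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


elbow_angle = joint_angle("shoulder", "elbow", "wrist")
print(f"Elbow angle: {elbow_angle:.1f} degrees")
```

In this toy arm the upper arm points straight down and the forearm points straight right, so the two segments meet at a right angle at the elbow. A 3-D version would store (x, y, z) per keypoint and use the same dot-product formula.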

We will now compare and contrast the various open-source pose estimation algorithms and analyze the benefits and drawbacks of each.

OpenPose

OpenPose represents the first real-time multi-person system to jointly detect human body, hand, facial, and foot keypoints (in total 135 keypoints) on single images.

Pros:
Non-parametric Part Affinity Fields (PAFs) make the association of body parts with individuals more accurate
High accuracy without compromising execution performance
Confidence maps locate individual body parts/regions (e.g. the shoulder)
Greedy parsing algorithm is efficient in terms of runtime
Runs much better on GPU than on CPU
Cons:
Slight tradeoff between speed and accuracy (e.g. R-CNN runs faster)
Current human pose performance metrics are based on keypoint accuracy, so comparisons between methods are not completely fair
Failure cases still exist (e.g. occluded foot and leg, rare joint positions)
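
The confidence maps and PAFs mentioned in the list above can be illustrated with a toy example. The sketch below is not OpenPose code: the function names, the 20x20 grid, and the hand-built field are assumptions made for illustration. It builds a Gaussian confidence map for one body part, reads off its peak, and then scores two candidate limbs by averaging how well a vector field aligns with each candidate segment.

```python
import numpy as np


def confidence_map(height, width, center, sigma=2.0):
    """Gaussian heatmap peaking at `center` = (x, y); one such map per body part."""
    ys, xs = np.mgrid[0:height, 0:width]
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))


def peak(heatmap):
    """Return the (x, y) location of the strongest response."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(x), int(y)


def paf_score(paf_x, paf_y, start, end, num_samples=10):
    """Average alignment between the field and the segment start->end."""
    start = np.asarray(start, dtype=float)
    end = np.asarray(end, dtype=float)
    direction = end - start
    norm = np.linalg.norm(direction)
    if norm == 0:
        return 0.0
    unit = direction / norm
    scores = []
    for t in np.linspace(0.0, 1.0, num_samples):
        x, y = np.rint(start + t * direction).astype(int)
        # Dot product of the field vector at (x, y) with the limb direction.
        scores.append(paf_x[y, x] * unit[0] + paf_y[y, x] * unit[1])
    return float(np.mean(scores))


# Toy 20x20 image: an elbow at (5, 10) and two wrist candidates.
elbow = peak(confidence_map(20, 20, (5, 10)))
wrists = [(15, 10), (5, 18)]

# Toy forearm field: unit vectors pointing in +x along the row y = 10.
paf_x = np.zeros((20, 20))
paf_y = np.zeros((20, 20))
paf_x[10, 5:16] = 1.0

scores = [paf_score(paf_x, paf_y, elbow, w) for w in wrists]
best_wrist = wrists[int(np.argmax(scores))]
print(elbow, scores, best_wrist)
```

Picking the highest-scoring candidate for a single elbow is a heavily reduced stand-in for greedy parsing. A real system has to match many candidates across many people and limb types at once.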

MultiPoseNet

“MultiPoseNet can jointly handle person detection, keypoint detection, person segmentation and pose estimation problems. The novel assignment method is implemented by the Pose Residual Network (PRN) which receives keypoint and person detections, and produces accurate poses by assigning keypoints to person instances.”

Pros:
Pose Residual Network (PRN) achieves higher precision with less training error, since layers reformulated as residual functions are easier to optimize and gain accuracy from
Assigns keypoints to person instances
Use of the unary relations between a certain keypoint and a specific group
Bottom-up method that outperformed previous bottom-up methods on the COCO keypoint dataset and performed on par with the best top-down methods
Faster at inference while smaller in model size
PRN is also relatively lightweight, so inference is cheaper
Cons:
Use of an FPN (Feature Pyramid Network) trades away some resolution in the feature representations
Spatial performance is dependent on input and output resolution
Use of one-stage detector (RetinaNet) enables faster inference but leads to lower accuracy compared to two-stage detectors
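
To show what "assigning keypoints to person instances" means in practice, here is a deliberately naive sketch. It is not the Pose Residual Network, which is a learned model. It is a hand-written geometric baseline, with made-up function names and coordinates, that hands each detected keypoint to the person box containing it and breaks ties by distance to the box center.

```python
def box_center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def contains(box, point):
    x0, y0, x1, y1 = box
    x, y = point
    return x0 <= x <= x1 and y0 <= y <= y1


def assign_keypoints(person_boxes, keypoints):
    """Map each person index to {keypoint type: (x, y)}.

    A keypoint goes to the box that contains it; if several boxes contain it,
    the box with the nearest center wins. Keypoints outside every box are
    dropped, and the first keypoint of a given type assigned to a person is kept.
    """
    poses = {i: {} for i in range(len(person_boxes))}
    for kp_type, point in keypoints:
        candidates = [i for i, box in enumerate(person_boxes) if contains(box, point)]
        if not candidates:
            continue

        def dist2(i):
            cx, cy = box_center(person_boxes[i])
            return (point[0] - cx) ** 2 + (point[1] - cy) ** 2

        owner = min(candidates, key=dist2)
        poses[owner].setdefault(kp_type, point)
    return poses


# Two overlapping hypothetical person boxes (x0, y0, x1, y1) and three detections.
person_boxes = [(0, 0, 100, 200), (80, 0, 180, 200)]
detections = [("nose", (50, 20)), ("nose", (130, 25)), ("left_wrist", (85, 100))]
poses = assign_keypoints(person_boxes, detections)
print(poses)
```

The wrist at (85, 100) falls inside both boxes, and the rule gives it to the first person because that box center is closer. Overlap cases like this are where a fixed rule tends to go wrong, which is the motivation for learning the assignment instead.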

Detectron/Mask R-CNN

“Detectron2 includes high-quality implementations of state-of-the-art object detection algorithms, including DensePose, panoptic feature pyramid networks, and numerous variants of the pioneering Mask R-CNN model family also developed by FAIR. Its extensible design makes it easy to implement cutting-edge research projects without having to fork the entire codebase.”

Pros:
Generates segmentation mask for instances
Simple to train, according to the paper
Uses a ResNet backbone for the pose estimation network
Requires minimal domain knowledge
Single pipeline that can predict boxes, segments, and keypoints simultaneously
Cons:
Not built specifically for pose estimation
Segmentation errors can lead to incorrect human bounding boxes and therefore incorrect joint poses
Fails on rare poses and in a few other cases (e.g. overlapping objects)
Fast, but not optimized for speed
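
Mask R-CNN treats keypoint detection much like mask prediction: for each detected person box it predicts one small heatmap per keypoint type. The sketch below shows one simple way such heatmaps could be turned back into image coordinates. The function name decode_keypoints, the 4x4 heatmap size, and the box values are assumptions for illustration, and this is not Detectron2's actual decoding code.

```python
import numpy as np


def decode_keypoints(heatmaps, box):
    """Convert per-keypoint heatmaps predicted inside `box` to image coordinates.

    heatmaps: array of shape (K, m, m), one map per keypoint type.
    box: (x0, y0, x1, y1) in image pixels.
    Returns an array of shape (K, 3) holding x, y, and the peak score.
    """
    num_keypoints, size_y, size_x = heatmaps.shape
    x0, y0, x1, y1 = box
    results = np.zeros((num_keypoints, 3))
    for k in range(num_keypoints):
        row, col = np.unravel_index(np.argmax(heatmaps[k]), heatmaps[k].shape)
        # Map the center of the winning cell back into the box.
        results[k, 0] = x0 + (col + 0.5) * (x1 - x0) / size_x
        results[k, 1] = y0 + (row + 0.5) * (y1 - y0) / size_y
        results[k, 2] = heatmaps[k, row, col]
    return results


# Two hypothetical keypoint heatmaps for a person box that is 40 px wide and 80 px tall.
heatmaps = np.zeros((2, 4, 4))
heatmaps[0, 1, 2] = 0.9
heatmaps[1, 3, 0] = 0.6
decoded = decode_keypoints(heatmaps, (100, 200, 140, 280))
print(decoded)
```

Because every keypoint is decoded relative to its box, a wrong box shifts or clips every joint inside it, which is the failure mode noted in the cons above.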

AlphaPose

“Alpha Pose is a very Accurate Real-Time multi-person pose estimation system. It is the first open-sourced system that can achieve 70+ mAP (72.3 mAP) on COCO dataset and 80+ mAP (82.1 mAP) on MPII dataset. To associate poses that indicates the same person across frames, we also provide an efficient online pose tracker called Pose Flow. It is also the first open-sourced online pose tracker that can both satisfy 60+ mAP (66.5 mAP) and 50+ MOTA (58.3 MOTA) on PoseTrack Challenge dataset.”
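
The COCO mAP figure in the quote above is built on a per-person score called Object Keypoint Similarity (OKS), which plays the role that IoU plays in object detection. Here is a minimal sketch of OKS for a single person. The three-keypoint example and the sigma values are placeholders chosen for illustration, not the official COCO constants.

```python
import numpy as np


def object_keypoint_similarity(pred, truth, visible, area, sigmas):
    """OKS between predicted and ground-truth keypoints of one person.

    pred, truth: arrays of shape (K, 2) in pixels.
    visible: boolean array of shape (K,), True where the keypoint is labeled.
    area: ground-truth object area in pixels squared (the scale term s**2).
    sigmas: per-keypoint falloff constants, shape (K,).
    """
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    visible = np.asarray(visible, dtype=bool)
    if not visible.any():
        return 0.0
    d2 = np.sum((pred - truth) ** 2, axis=1)          # squared pixel distances
    k2 = (2.0 * np.asarray(sigmas, dtype=float)) ** 2  # k_i = 2 * sigma_i
    similarity = np.exp(-d2 / (2.0 * area * k2))
    return float(similarity[visible].mean())


truth = [[10, 10], [20, 20], [30, 30]]
shifted = [[12, 10], [20, 23], [30, 30]]
visible = [True, True, True]
sigmas = [0.05, 0.07, 0.09]  # placeholder values, not the COCO constants
area = 400.0                 # hypothetical 20 px x 20 px person

perfect_oks = object_keypoint_similarity(truth, truth, visible, area, sigmas)
shifted_oks = object_keypoint_similarity(shifted, truth, visible, area, sigmas)
print(perfect_oks, shifted_oks)
```

A prediction counts as correct when its OKS exceeds a threshold, and mAP averages precision over a range of thresholds. Because the score depends only on keypoint distances, it says nothing about runtime or robustness.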

Pros:
Built to avoid errors in existing models such as incorrect localization or recognition
Accurate pose estimation even when bounding box or segmentation is inaccurate
Two-step framework is more accurate than the commonly used one-stage detectors or frameworks
Uses a spatial transformer network (STN), which demonstrates high performance
Hyperparameters of the network were tuned to increase accuracy
Cons:
More failure cases compared to other state-of-the-art models for pose estimation
Not optimized for accuracy
Two-step framework sometimes sacrifices speed or runtime performance

Conclusions

We can conclude that pose estimation methods are closing in on the state of the art in computer vision. These methods have concrete applications in both industry and research. However, because of occluded joints and unusual angles, it will be a while before a practically perfect pose estimation implementation exists.