A Short Guide to Pose Estimation in Computer Vision

Siddharth Sharma
Apr 2, 2020

This article covers pose estimation in computer vision: how it works, how different approaches compare, and what their pros and cons are. We will also look at applications of this cool technology.

What is Pose Estimation?

Simply put, pose estimation is the localization of human joints in images or videos. It comes in two forms: 2-D pose estimation and 3-D pose estimation, which adds the dimension of depth. Pose estimation has applications in several fields and is a branch of computer vision.

Examples of Pose Estimation (Source)

Here is an example of 2-D versus 3-D pose estimation:

2D v. 3D pose estimation

As the image above shows, specific joints are used as keypoints and to compute angles. We will now examine the models that pose estimation is built on.

How it works

Pose estimation works by locating specific joints of the human body. These joints are known as “keypoints” within the pose estimation system. Two models are commonly referenced in pose estimation: the classical pictorial structures framework and the deformable parts model. These models represent the keypoints and the limbs connecting them through spatial arrangements between parts, which allows the angles and joint positions to be parameterized as vectors. The keypoints are typically labeled and connected at the end. Pose estimation keypoints usually line up with real human joints such as the elbow, wrist, and knee. Here is an example of labeled keypoints in pose estimation:

Labeled Pose Keypoints for a Sample Image
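
To make the idea of parameterizing joint positions and angles as vectors concrete, here is a minimal sketch in plain Python. The keypoint names and pixel coordinates are made up for illustration, and `joint_angle` is a hypothetical helper, not part of any pose estimation library.

```python
import math

# Hypothetical 2-D keypoints in pixel coordinates (x, y); the values are made up.
keypoints = {
    "shoulder": (100.0, 100.0),
    "elbow": (100.0, 200.0),
    "wrist": (200.0, 200.0),
}


def joint_angle(a, b, c):
    """Return the angle at b, in degrees, between the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        raise ValueError("two of the keypoints coincide")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


elbow_angle = joint_angle(
    keypoints["shoulder"], keypoints["elbow"], keypoints["wrist"]
)
print(f"elbow angle: {elbow_angle:.1f} degrees")
```

The same helper works for any three connected keypoints, for example hip, knee, and ankle.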

We will now compare and contrast the various open-source pose estimation algorithms and analyze the benefits and drawbacks of each.


OpenPose

OpenPose represents the first real-time multi-person system to jointly detect human body, hand, facial, and foot keypoints (in total 135 keypoints) on single images.

OpenPose Algorithm

Pros


Use of non-parametric Part Affinity Fields (PAFs) improves the accuracy of associating body parts with the right person

High accuracy without compromise on execution performance

Use of confidence maps to locate individual body parts/regions (e.g., the shoulder)

Greedy parsing algorithm is effective in terms of runtime

Scales well to GPU over CPU

Cons


Slight tradeoff between speed and accuracy (e.g., R-CNN runs faster)

Current human pose performance metrics are based on keypoint accuracy alone, so comparisons between systems are not completely fair

Failure cases still exist (e.g., an occluded foot or leg, rare joint positions)
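
The PAF scoring and greedy parsing mentioned in this section can be sketched roughly as follows. This is a toy illustration under stated assumptions, not OpenPose's implementation: `toy_paf` is a hypothetical hand-written field standing in for a network output, the candidate coordinates are made up, and the score is a sampled approximation of the line integral of the field along each candidate limb.

```python
import math


def toy_paf(x, y):
    """Hypothetical affinity field: points along +x inside two limb bands."""
    in_band = any(abs(y - centre) <= 1.0 for centre in (5.0, 20.0))
    return (1.0, 0.0) if 0.0 <= x <= 10.0 and in_band else (0.0, 0.0)


def association_score(p, q, paf, samples=10):
    """Average projection of the field onto the direction p->q along the segment."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length
    total = 0.0
    for i in range(samples):
        t = (i + 0.5) / samples  # midpoint sampling along the segment
        vx, vy = paf(p[0] + t * dx, p[1] + t * dy)
        total += vx * ux + vy * uy
    return total / samples


def greedy_match(starts, ends, paf):
    """Accept candidate limbs best score first; each keypoint is used at most once."""
    scored = sorted(
        (
            (association_score(p, q, paf), i, j)
            for i, p in enumerate(starts)
            for j, q in enumerate(ends)
        ),
        reverse=True,
    )
    used_starts, used_ends, pairs = set(), set(), []
    for score, i, j in scored:
        if score > 0 and i not in used_starts and j not in used_ends:
            pairs.append((i, j))
            used_starts.add(i)
            used_ends.add(j)
    return pairs


# Made-up elbow and wrist candidates for two people.
elbows = [(0.0, 5.0), (0.0, 20.0)]
wrists = [(10.0, 20.0), (10.0, 5.0)]
pairs = greedy_match(elbows, wrists, toy_paf)
print(sorted(pairs))
```

Candidate limbs that run along the field score close to 1, while limbs that cut across two people score much lower, so the greedy pass picks the consistent pairing without an exhaustive search.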


MultiPoseNet

“MultiPoseNet can jointly handle person detection, keypoint detection, person segmentation and pose estimation problems. The novel assignment method is implemented by the Pose Residual Network (PRN) which receives keypoint and person detections, and produces accurate poses by assigning keypoints to person instances.”

MultiPoseNet Algorithm

Pros


The Pose Residual Network achieves higher precision with lower training error because its layers are reformulated as residual functions, which are easier to optimize

Assigns keypoints to person instances

Use of the unary relations between a certain keypoint and a specific group

Bottom-up method outperformed previous top-down methods in 2016 COCO keypoint challenge

Faster in time while smaller in size

PRN is also relatively lightweight so inference is easier

Cons


Trades resolution for richer representations by using FPNs (Feature Pyramid Networks)

Spatial performance is dependent on input and output resolution

Use of one-stage detector (RetinaNet) enables faster inference but leads to lower accuracy compared to two-stage detectors

Detectron/Mask R-CNN

“Detectron2 includes high-quality implementations of state-of-the-art object detection algorithms, including DensePose, panoptic feature pyramid networks, and numerous variants of the pioneering Mask R-CNN model family also developed by FAIR. Its extensible design makes it easy to implement cutting-edge research projects without having to fork the entire codebase.”

Mask R-CNN algorithm for segmentation

Pros


Generates segmentation mask for instances

Simple to train, according to the paper

Uses a ResNet as the pose estimation network

Minimal domain knowledge

Pipeline that can predict boxes, segments, and key points simultaneously

Cons


Not built primarily for pose estimation

Semantic segmentation can lead to incorrect human bounding boxes and therefore incorrect joint poses

Fails to account for rare poses and a few failure cases (e.g., overlapping objects)

Fast, but not optimized for speed


AlphaPose

“Alpha Pose is a very Accurate Real-Time multi-person pose estimation system. It is the first open-sourced system that can achieve 70+ mAP (72.3 mAP) on COCO dataset and 80+ mAP (82.1 mAP) on MPII dataset. To associate poses that indicates the same person across frames, we also provide an efficient online pose tracker called Pose Flow. It is also the first open-sourced online pose tracker that can both satisfy 60+ mAP (66.5 mAP) and 50+ MOTA (58.3 MOTA) on PoseTrack Challenge dataset.”

AlphaPose (RMPE) Framework

Pros


Built to avoid errors in existing models such as incorrect localization or recognition

Accurate pose estimation even when bounding box or segmentation is inaccurate

The two-step framework is more accurate than the commonly used one-stage detectors and frameworks

Uses a spatial transformer network (STN), which demonstrates high performance

Hyperparameter optimization was performed on network to increase accuracy

Cons


More failure cases compared to other state-of-the-art models for pose estimation

Not optimized for accuracy

The two-step framework at times sacrifices runtime performance
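
The COCO mAP figures quoted above rest on a per-person score called object keypoint similarity (OKS), which plays the role that box overlap plays in object detection. Below is a minimal sketch of that score. The keypoint coordinates, object area, and per-keypoint constants are made up for illustration; the real benchmark uses its own fixed constants for each keypoint type.

```python
import math


def oks(pred, truth, visible, area, k):
    """Object keypoint similarity, averaged over the labeled keypoints.

    pred, truth: lists of (x, y) positions
    visible:     ground-truth visibility flags (0 means not labeled)
    area:        object area in pixels squared, used as the squared scale
    k:           per-keypoint constants controlling the falloff
    """
    total, count = 0.0, 0
    for (px, py), (tx, ty), v, ki in zip(pred, truth, visible, k):
        if v > 0:
            d2 = (px - tx) ** 2 + (py - ty) ** 2
            total += math.exp(-d2 / (2.0 * area * ki ** 2))
            count += 1
    return total / count if count else 0.0


# Made-up example: three keypoints, the last one unlabeled.
truth = [(50.0, 50.0), (60.0, 80.0), (40.0, 80.0)]
pred = [(53.0, 54.0), (60.0, 80.0), (0.0, 0.0)]
visible = [2, 2, 0]
constants = [0.1, 0.1, 0.1]  # hypothetical values, not the benchmark's

score = oks(pred, truth, visible, area=2000.0, k=constants)
print(f"OKS = {score:.3f}")
```

A prediction counts as a match when its OKS exceeds a chosen threshold, and average precision is then computed over those matches. Because the unlabeled keypoint is skipped, an occluded joint that was never annotated does not lower the score.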


Conclusion

We can conclude that pose estimation methods are closing in on the state of the art in computer vision. These methods have concrete applications in industry and research. However, because of occluded joints and anomalous angles, it will be a while before we see a practically perfect pose estimation implementation.


[1] https://nanonets.com/blog/human-pose-estimation-2d-guide/

[2] OpenPose

[3] MultiPoseNet

[4] Mask R-CNN

[5] AlphaPose