Review CMUPose & OpenPose — Winner in COCO KeyPoint Detection Challenge 2016 (Human Pose Estimation)

First Open-Source Realtime System for Multi-Person 2D Pose Detection

Sik-Ho Tsang
Analytics Vidhya
7 min read · Mar 15, 2020


In this story, CMUPose & OpenPose are reviewed. CMUPose is the name of the team from Carnegie Mellon University that won the COCO keypoint detection challenge 2016. The approach was published at 2017 CVPR with over 2000 citations.

CMUPose: Winner of the COCO keypoint detection challenge 2016 (http://cocodataset.org/#keypoints-leaderboard)

Afterwards, the enhanced OpenPose was proposed by University of California, Carnegie Mellon University, and Facebook Reality Lab, with the first combined body and foot keypoint dataset and detector. It is also the first open-source realtime system for multi-person 2D pose detection. It was published in 2019 TPAMI with over 300 citations. (Sik-Ho Tsang @ Medium)

Since OpenPose is the enhanced version of CMUPose, this story mainly focuses on OpenPose.

Outline

  1. Overall Pipeline
  2. CMUPose Network Architecture
  3. OpenPose Network Architecture
  4. OpenPose Loss Function and Other Details
  5. OpenPose Extended Foot Detection
  6. Results (in OpenPose Paper)

1. Overall Pipeline

Overall Pipeline
  • (a): A color image of size w×h as an input image.
  • (b): A feedforward network simultaneously predicts a set of 2D confidence maps (CM) S of body part locations, and
  • (c): a set of 2D vector fields L of part affinities, or part affinity fields (PAF), which encode the degree of association between parts
  • The set S = (S1, S2, …, SJ ) has J confidence maps, one per part.
  • The set L = (L1, L2, …, LC) has C vector fields, one per limb. Each image location in Lc encodes a 2D vector.
  • (d): Then, the confidence maps and the affinity fields are parsed by greedy inference, and
  • (e): output the 2D keypoints for all people in the image (a minimal sketch of this pipeline is given below).
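
To make the data flow concrete, here is a minimal sketch of the pipeline in Python. It is not the authors' API: the network and the greedy parser are passed in as callables, and the names predict_maps and greedy_parse are hypothetical.

```python
from typing import Callable, List, Tuple
import numpy as np

def estimate_poses(image: np.ndarray,
                   predict_maps: Callable[[np.ndarray], Tuple[np.ndarray, np.ndarray]],
                   greedy_parse: Callable[[np.ndarray, np.ndarray], List[np.ndarray]]) -> List[np.ndarray]:
    """(a) image -> (b) confidence maps S and (c) PAFs L -> (d) greedy parsing -> (e) keypoints."""
    S, L = predict_maps(image)   # S: (J, h', w') confidence maps, L: (C, 2, h', w') vector fields
    people = greedy_parse(S, L)  # one (J, 2) array of 2D keypoints per detected person
    return people
```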

2. CMUPose Network Architecture

CMUPose Network Architecture
  • The image is first analyzed by the first 10 layers of VGG-19, generating a set of feature maps F that is input to the first stage of each branch.
  • At the first stage, the network produces a set of CMs, S1 = ρ1(F), and a set of PAFs, L1 = φ1(F).
  • S and L can be refined iteratively to improve the detection results.
  • At stage t, it becomes: St = ρt(F, St-1, Lt-1) and Lt = φt(F, St-1, Lt-1), for t ≥ 2, i.e., the predictions from the previous stage are concatenated with the image features F and refined.
  • Based on S and L, the human poses can be detected (a minimal sketch of this multi-stage, two-branch design is given below).
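
Below is a minimal PyTorch sketch of the multi-stage, two-branch idea, not the authors' released code. The backbone stand-in, the channel widths, and the default number of stages are illustrative assumptions; the real network uses the first 10 layers of VGG-19 and larger per-stage branches.

```python
import torch
import torch.nn as nn

def conv(in_ch, out_ch, k):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, padding=k // 2), nn.ReLU(inplace=True))

class Branch(nn.Module):
    """One stage of one branch: the CM branch outputs J maps, the PAF branch outputs 2*C maps."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(conv(in_ch, 128, 7), conv(128, 128, 7),
                                 conv(128, 128, 1), nn.Conv2d(128, out_ch, 1))
    def forward(self, x):
        return self.net(x)

class CMUPoseSketch(nn.Module):
    def __init__(self, feat_ch=128, J=19, C=19, n_stages=6):
        super().__init__()
        self.backbone = conv(3, feat_ch, 3)   # stand-in for the first 10 VGG-19 layers producing F
        in_chs = [feat_ch] + [feat_ch + J + 2 * C] * (n_stages - 1)
        self.cm_branches = nn.ModuleList([Branch(c, J) for c in in_chs])        # rho^t
        self.paf_branches = nn.ModuleList([Branch(c, 2 * C) for c in in_chs])   # phi^t
    def forward(self, img):
        F = self.backbone(img)
        x, outputs = F, []
        for rho, phi in zip(self.cm_branches, self.paf_branches):
            S, L = rho(x), phi(x)              # S^t = rho^t(...), L^t = phi^t(...)
            outputs.append((S, L))
            x = torch.cat([F, S, L], dim=1)    # next stage refines using F, S^t, L^t
        return outputs                         # predictions of all stages, for intermediate supervision

# Usage sketch: maps = CMUPoseSketch()(torch.randn(1, 3, 128, 128)); S_last, L_last = maps[-1]
```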

3. OpenPose Network Architecture

OpenPose Network Architecture
  • The architecture of OpenPose is different from the CMUPose one.
  • The network first produces a set of PAFs, Lt.
  • Then, it produces a set of CMs, St, on top of the refined PAFs.
  • It is found that refining the PAFs is more critical than refining the CMs and is sufficient for high accuracy. Thus, the confidence map refinement stages are removed while the network depth is increased.
PAF Refinement Across Stages
  • In CMUPose, the network architecture included several 7×7 convolutional layers. In OpenPose, each 7×7 convolutional kernel is replaced by 3 consecutive 3×3 kernels. The receptive field is preserved and the number of operations is reduced: 97 operations for the former versus only 51 for the latter.
  • Additionally, the outputs of the 3 convolutional kernels are concatenated, following an approach similar to DenseNet. The number of non-linearity layers is tripled, and the network can keep both lower-level and higher-level features (a sketch of this block is given below).
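
Here is a hedged PyTorch sketch of this replacement block; the channel sizes and the ReLU activation are assumptions, while the concatenation of the three 3×3 outputs is the DenseNet-like part described above.

```python
import torch
import torch.nn as nn

class TripleConv3x3(nn.Module):
    """Stand-in for one former 7x7 layer: three 3x3 convs whose outputs are concatenated."""
    def __init__(self, in_ch=128, mid_ch=128):
        super().__init__()
        self.c1 = nn.Sequential(nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.c2 = nn.Sequential(nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.c3 = nn.Sequential(nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True))
    def forward(self, x):
        y1 = self.c1(x)          # three stacked 3x3 convs give the same 7x7 receptive field
        y2 = self.c2(y1)
        y3 = self.c3(y2)
        return torch.cat([y1, y2, y3], dim=1)   # keeps both lower- and higher-level features

# Per-pixel operation counts quoted in the paper: 2*7^2 - 1 = 97 vs 6*3^2 - 3 = 51
print(2 * 7**2 - 1, 6 * 3**2 - 3)   # -> 97 51
```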

4. Loss Function and Other Details

4.1. Loss Function

  • An L2 loss is used between the estimated predictions and the groundtruth maps and fields.
  • Lc* is the groundtruth PAF, Sj* is the groundtruth part confidence map, and W is a binary mask with W(p) = 0 when the annotation is missing at pixel p (a sketch of this masked loss is given after this list).
  • The intermediate supervision at each stage addresses the vanishing gradient problem by replenishing the gradient periodically.
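
As a concrete reading of the description above, here is a minimal NumPy sketch of the masked L2 loss at a single stage; the array shapes and the summation over stages are illustrative assumptions.

```python
import numpy as np

def stage_loss(S_pred, S_gt, L_pred, L_gt, W):
    """S_*: (J, h, w) confidence maps, L_*: (C, 2, h, w) PAFs, W: (h, w) mask, 0 where unlabeled."""
    f_S = np.sum(W * np.sum((S_pred - S_gt) ** 2, axis=0))        # L2 over all parts j
    f_L = np.sum(W * np.sum((L_pred - L_gt) ** 2, axis=(0, 1)))   # L2 over all limbs c
    return f_S + f_L

# Intermediate supervision: the total objective sums stage_loss over every stage t, so each
# stage receives a direct gradient signal and the vanishing gradient problem is mitigated.
```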

4.2. Confidence Maps (CM)

  • The groundtruth confidence map S*j,k for each person k and each body part j is a Gaussian centered at the annotated part location xj,k: S*j,k(p) = exp(−||p − xj,k||² / σ²).
  • As the equation shows, it is a Gaussian dot with a gradual falloff, whose peak is at the center of the dot; σ controls the spread of the peak. The per-person maps are aggregated into S*j by a pixel-wise maximum, so that nearby peaks stay distinct.
  • These CMs are actually similar to the heat maps used in Tompson NIPS’14.
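
A minimal NumPy sketch of building such a groundtruth confidence map for one body part j is shown below; the map size, σ, and the example coordinates are assumptions.

```python
import numpy as np

def confidence_map(part_positions, h, w, sigma=7.0):
    """part_positions: list of (x, y) groundtruth locations of part j, one per person k."""
    ys, xs = np.mgrid[0:h, 0:w]
    S_j = np.zeros((h, w))
    for (px, py) in part_positions:
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / sigma ** 2)   # Gaussian "dot" at x_{j,k}
        S_j = np.maximum(S_j, g)   # max aggregation keeps peaks of nearby people distinct
    return S_j

S_head = confidence_map([(40, 30), (100, 60)], h=128, w=128)   # two people in a 128x128 map
```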

4.3. Part Affinity Fields (PAF)

  • For multi-person keypoint detection, we need to know which body part links to which body part.
  • For example, if there are multiple people in the image, there are multiple detected heads and shoulders. Especially when people are closely grouped together, it is difficult to tell which head and which shoulder belong to the same person.
  • Therefore, a link is needed to associate a specific head detection with the shoulder detection that belongs to the same person.
  • This kind of linkage is represented by the PAF in this paper. The stronger the PAF between two body parts, the higher the confidence that these two parts are linked and belong to the same person.
  • The groundtruth PAF L* at pixel p is the unit vector pointing along the limb (from one body part towards the other) if p lies on the limb, and 0 otherwise.
  • The predicted part affinity field Lc is sampled along the line segment between two candidate part locations dj1 and dj2 to measure the confidence E that the two candidates belong to the same limb (see the sketch after this list).
  • For multiple people, the total association score E over all candidate connections needs to be maximized.
  • There are multiple approaches to connect the body parts:
  • (a): Two-person body parts.
  • (b): Matching by considering all edges.
  • (c): Matching by minimal tree edges.
  • (d): Greedy algorithm used by OpenPose, which only uses the minimal edges.
  • There are actually many more details about the CMs and PAFs; please read the paper for them.
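
Here is a minimal NumPy sketch of the association score E between two candidate part locations dj1 and dj2: the predicted field Lc is sampled along the connecting segment and projected onto its unit direction. The number of samples and the (channel, y, x) layout of Lc are assumptions, and the candidates are assumed to lie inside the map.

```python
import numpy as np

def paf_score(L_c, d_j1, d_j2, n_samples=10):
    """L_c: (2, h, w) vector field for limb c; d_j1, d_j2: (x, y) candidate locations."""
    d_j1, d_j2 = np.asarray(d_j1, float), np.asarray(d_j2, float)
    v = d_j2 - d_j1
    norm = np.linalg.norm(v)
    if norm == 0:
        return 0.0
    v = v / norm                                   # unit vector from d_j1 to d_j2
    score = 0.0
    for u in np.linspace(0.0, 1.0, n_samples):     # sample points p(u) along the segment
        x, y = np.rint(d_j1 + u * (d_j2 - d_j1)).astype(int)
        score += L_c[0, y, x] * v[0] + L_c[1, y, x] * v[1]   # dot product: field . direction
    return score / n_samples                       # approximate line integral (average)

# A high score means the field consistently points from d_j1 to d_j2, i.e. the two
# detections likely belong to the same limb of the same person.
```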

5. OpenPose Extended Foot Detection

  • OpenPose has proposed the first combined body and foot keypoint dataset and detector as shown above.
  • By including the foot keypoints, it is able to detect the ankle correctly as shown at the right of the above figure.

6. Results (in OpenPose Paper)

6.1. MPII Multi-Person

MPII Dataset
  • For the 288-image subset as well as the full testing set, OpenPose obtains high mAP, outperforming or being comparable to DeepCut, DeeperCut, and Newell ECCV’16.
  • Left: Using the ground-truth keypoint locations with the proposed parsing algorithm gives 88.3% mAP.
  • Using the ground-truth connections with the proposed keypoint detection gives 81.6% mAP.
  • Right: The more stages, the higher the mAP.

6.2. COCO Keypoints Challenge

Test-Dev Leaderboard
  • Top-down approaches detect each person first and then detect the keypoints, while bottom-up approaches detect the keypoints first and then group them into person skeletons.
  • In the above table, OpenPose does not perform so well. This is mainly because of a larger drop in accuracy when considering only people of larger scales (AP^L).
Validation Set
  • 5PAF-1CM: 5 stages of PAF and 1 stage of CM gives the highest mAP of 65.3%.
  • 3CM-3PAF: only 61.0% mAP. The accuracy of the part confidence maps increases considerably when the PAFs are used as a prior.

6.3. Inference Time

  • OpenPose maintains nearly the same runtime regardless of the number of people per image.
  • For top-down approaches like Alpha-Pose and Mask R-CNN, the runtime is directly proportional to the number of people per image.

6.4. Foot Keypoint Dataset

Foot Validation Set
  • High AP and AR are obtained by OpenPose.

6.5. Vehicle Pose Estimation

  • Vehicle keypoints can also be detected with high AP and AR.

6.6. Qualitative Results

6.7. Failure Cases

Common Failure Cases: (a) rare pose or appearance, (b) missing or false part detection, (c) overlapping parts, i.e., part detections shared by two people, (d) wrong connections associating parts from two people, (e-f) false positives on statues or animals.
Foot Failure Cases

Indeed, there are still many results posted in the OpenPose paper that I haven’t mentioned here. Please feel free to read their papers. :)
