Towards Accurate Multi-Person Pose Estimation in the Wild

Papandreou, George, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, and Kevin Murphy. 2017. “Towards Accurate Multi-Person Pose Estimation in the Wild.” arXiv [cs.CV]. arXiv.

Key points of this paper:

  • This paper presents new state-of-the-art results on the MS COCO keypoints challenge using a top-down¹ approach (instead of the bottom-up approach used by the previous state-of-the-art method²).
  • The authors present a two-stage approach: 1. Faster R-CNN with Inception-ResNet predicts the location and scale of person bounding boxes. 2. For each detection, a fully convolutional network (FCN) based on ResNet predicts a vicinity heatmap and an offset heatmap for each keypoint.
  • The novelty in this paper is in stage 2 and the various tricks in cropping and post-processing the results.
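To make the two-stage structure concrete, here is a minimal sketch of a top-down pipeline. The names `estimate_poses`, `detect_people`, and `predict_keypoints` are hypothetical placeholders, not the authors' code; the two callables stand in for the Faster R-CNN detector and the ResNet-based FCN. The sketch also skips the paper's cropping tricks and uses a plain crop.

```python
import numpy as np

def estimate_poses(image, detect_people, predict_keypoints):
    """Top-down pipeline: find people first, then keypoints inside each box.

    detect_people(image)    -> iterable of (x0, y0, x1, y1) boxes (hypothetical stage 1)
    predict_keypoints(crop) -> (K, 2) array of (x, y) in crop coordinates (hypothetical stage 2)
    """
    poses = []
    for (x0, y0, x1, y1) in detect_people(image):
        crop = image[y0:y1, x0:x1]          # plain crop; the paper applies extra cropping tricks
        keypoints = predict_keypoints(crop)  # stage 2 runs once per detected person
        poses.append(keypoints + np.array([x0, y0]))  # shift back to image coordinates
    return poses
```

Note that stage 2 runs once per detection, so the cost of this design grows with the number of people in the image.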

Stage 2 Details (landmark detection and localization):

  • The FCN ResNet outputs 2 things for each of the 17 keypoints:
  • 1. A keypoint heatmap that predicts whether a pixel is in the vicinity of the keypoint.
  • 2. x and y offset heatmaps to refine the keypoint location. The offset heatmaps help with keypoints that are occluded, or with cases where several keypoints of the same kind appear in the cropped image.
  • To obtain the final location, the authors weight each pixel's offset vote by the probability from the keypoint heatmap. (See eq. 1 for details.)
  • The stage 2 network also outputs an extra heatmap prediction from an intermediate layer, which is used as an auxiliary loss.
  • Interestingly, the authors use the keypoint probabilities (from stage 2) to replace the person detection probabilities (from stage 1) and found that this “significantly” improves AP.
  • The authors use Huber (aka smooth L1) loss for the offset heatmap.
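The fusion of the two outputs can be sketched as follows. This is my reading of the voting scheme, not the authors' implementation: each pixel votes for the position it points to (its own position plus its offset), the vote is weighted by that pixel's keypoint probability, and the vote is spread over the four nearest pixels with bilinear weights. The function names, the array layout, and the omission of any normalization constant are assumptions. A small Huber function is included for the offset loss.

```python
import numpy as np

def fuse_heatmap_and_offsets(prob, offsets):
    """prob: (H, W) keypoint probabilities; offsets: (H, W, 2) holding (dx, dy) per pixel."""
    H, W = prob.shape
    ys, xs = np.mgrid[0:H, 0:W]
    tx = xs + offsets[..., 0]  # x position each pixel votes for
    ty = ys + offsets[..., 1]  # y position each pixel votes for
    x0 = np.floor(tx).astype(int)
    y0 = np.floor(ty).astype(int)
    fx, fy = tx - x0, ty - y0  # fractional parts drive the bilinear weights
    fused = np.zeros((H, W))
    corners = (
        (0, 0, (1 - fx) * (1 - fy)),
        (1, 0, fx * (1 - fy)),
        (0, 1, (1 - fx) * fy),
        (1, 1, fx * fy),
    )
    for dx, dy, w in corners:
        xi, yi = x0 + dx, y0 + dy
        inside = (xi >= 0) & (xi < W) & (yi >= 0) & (yi < H)
        # np.add.at accumulates correctly when several votes land on the same pixel
        np.add.at(fused, (yi[inside], xi[inside]), (w * prob)[inside])
    return fused

def locate_keypoint(fused):
    """Return (x, y, peak value) at the maximum of the fused map."""
    y, x = np.unravel_index(np.argmax(fused), fused.shape)
    return int(x), int(y), float(fused[y, x])

def huber(residual, delta=1.0):
    """Huber (smooth L1) penalty: quadratic near zero, linear for large residuals."""
    a = np.abs(residual)
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))
```

Averaging the peak values over all keypoints is one plausible way to build the keypoint-based confidence that replaces the detector score, though the exact rescoring formula should be checked against the paper.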


  • Object keypoint similarity (OKS) is used instead of regular IoU when performing non-maximum suppression (NMS), so that the suppression takes the predicted keypoints (from stage 2) into account.
  • This means the authors must have run landmark detection for every detection from stage 1! (Note that OKS is also the evaluation metric in MS COCO.)
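A minimal sketch of OKS-based NMS, assuming the standard COCO form of OKS (a Gaussian falloff on keypoint distance, scaled by object size and a per-keypoint constant). Visibility flags are ignored here, and the function names, the `threshold` value, and the use of the kept pose's area as the scale are assumptions for illustration.

```python
import numpy as np

def oks(kps_a, kps_b, scale_sq, kappas):
    """Similarity between two poses.

    kps_a, kps_b: (K, 2) keypoint coordinates
    scale_sq:     squared object scale (e.g. the box or segment area)
    kappas:       (K,) per-keypoint falloff constants
    """
    d2 = np.sum((kps_a - kps_b) ** 2, axis=1)
    return float(np.mean(np.exp(-d2 / (2.0 * scale_sq * kappas ** 2))))

def oks_nms(poses, scores, areas, kappas, threshold=0.5):
    """Greedy NMS: keep the best-scoring pose, drop later poses too similar to a kept one."""
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    for i in order:
        if all(oks(poses[i], poses[j], areas[j], kappas) < threshold for j in keep):
            keep.append(int(i))
    return keep
```

Unlike box IoU, this comparison needs keypoints for every candidate, which is why stage 2 has to run before suppression.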

My takeaways:

  • Disappointingly, the authors did not explain why their top-down approach is superior, nor did they compare it qualitatively with the bottom-up approach. It would be interesting to see the results of combining the heatmap scheme from this paper with the part affinity fields from the previous state of the art.
  • This approach feeds a lot of information backwards from stage 2 to stage 1, which points to a deficiency in the Faster R-CNN person detector. I wonder how much of this manual tweaking could be eliminated if the networks were trained end-to-end.


¹ A top-down approach is one in which keypoints are detected locally inside each person bounding box. This contrasts with a bottom-up approach, in which keypoints are detected globally across the whole image and then associated with each other to form a “person detection”. The major advantage of the bottom-up approach is that it is not limited by the person detector.

² Previous MS COCO 2016 Keypoint Challenge state of the art: Cao, Zhe, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2016. “Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields.” arXiv [cs.CV]. arXiv.

Like what you read? Give Felix Lau a round of applause.
