PoseFlow — real-time pose tracking

Multi-person pose tracking pipeline in an unconstrained video sequence using Detectron2 as a person keypoint estimator.

Jarosław Gilewski
Dec 29, 2019 · 5 min read
Image by StockSnap from Pixabay

This story presents one of the methods for multi-person articulated pose tracking in video sequences, called PoseFlow, and its adaptation to the Detectron2 COCO Person Keypoint Detection Baseline.

Detectron2 is a robust framework for object detection and segmentation (see the model zoo). It allows us to detect person keypoints (eyes, ears, and main joints) and create human pose estimation.

Person keypoint estimation is done on individual images. To fully understand human behaviour and analyse the whole scene, we need to track each person from frame to frame. Person tracking opens up possibilities for action recognition, person re-identification, human-object interaction understanding, sports video analysis, and much more.

Project setup

We will use the source code based on my previous story:

I encourage you to read it first! If you followed it, just run the commands below in the project directory:

$ git pull
$ git checkout 3acd755f03c16f3f3f0adeb6e12e18e721134808
$ conda env update -f environment.yml

or, if you prefer to start from the beginning, follow with:

$ git clone git://github.com/jagin/detectron2-pipeline.git
$ cd detectron2-pipeline
$ git checkout 3acd755f03c16f3f3f0adeb6e12e18e721134808
$ conda env create -f environment.yml
$ conda activate detectron2-pipeline

Pose estimation

To check what the pose estimation is all about, run the command:

$ python process_video.py -i assets/videos/walk.small.mp4 -p -d --config-file configs/COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml

We will get the following results on the screen:

Pose estimation on a video sequence (Video by sferrario1968 from Pixabay)

As you can see, Detectron2 gives us the bounding box of each human and their keypoint estimates, thanks to the available COCO Person Keypoint Detection model with Keypoint R-CNN.

This model is based on Mask R-CNN, which is flexible enough to be extended to human pose estimation. Each keypoint's location is modelled as a one-hot mask, and Mask R-CNN is adapted to predict K masks, one for each of the K keypoint types (e.g., left shoulder, right elbow).
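To make the one-hot mask idea concrete, here is a minimal sketch of how K predicted score maps can be decoded back into keypoint coordinates by taking the peak of each map. This is a simplified illustration, not Detectron2's actual post-processing; the array shapes and the `decode_keypoints` helper are assumptions for the example.

```python
import numpy as np

def decode_keypoints(heatmaps):
    """Turn K predicted score maps of shape (K, H, W) into K keypoints.

    Each map scores one keypoint type (e.g. left shoulder); the keypoint
    is placed at the peak location, with the peak value as its confidence.
    """
    K, H, W = heatmaps.shape
    flat = heatmaps.reshape(K, -1)
    idx = flat.argmax(axis=1)                  # peak index per keypoint
    ys, xs = np.unravel_index(idx, (H, W))
    scores = flat.max(axis=1)                  # peak confidence per keypoint
    return np.stack([xs, ys, scores], axis=1)  # shape (K, 3): x, y, score

# Toy example: 2 keypoint types on a 5x5 grid
maps = np.zeros((2, 5, 5))
maps[0, 1, 3] = 0.9   # keypoint 0 peaks at (x=3, y=1)
maps[1, 4, 0] = 0.7   # keypoint 1 peaks at (x=0, y=4)
print(decode_keypoints(maps))
```

The real model predicts these maps per detected person box, which is what makes the approach top-down, as described next.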

It’s a top-down method: we first detect human proposals and then estimate keypoints within each box independently.
There is also a bottom-up approach, which directly infers the keypoints of all persons in the image, and the connections between them, without a human detector.

Other very popular alternative estimators are:

There is a good article Human Pose Estimation with Deep Learning summarizing different approaches to human pose estimation.

Pose tracking

If you are perceptive, you will notice in the result above that the human poses are already tracked, at least by the colour of the person's bounding box. It's a very naive heuristic that assigns the same colour to the same instance, for visualization purposes, based on the intersection over union (IoU) of boxes or masks (see video_visualizer.py).

Using bounding box IoU to track pose instances will most likely fail when an instance moves fast, so that the boxes no longer overlap, and in crowded scenes, where boxes may not correspond one-to-one with pose instances.
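For reference, the IoU measure behind this heuristic, and its failure mode under fast motion, can be sketched in a few lines (a minimal illustration; the `(x1, y1, x2, y2)` box format is an assumption for the example):

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Slow motion: boxes overlap between frames, IoU is high -> tracking works
print(box_iou((0, 0, 10, 10), (2, 0, 12, 10)))   # 0.666...
# Fast motion: boxes no longer overlap, IoU drops to 0 -> tracking breaks
print(box_iou((0, 0, 10, 10), (20, 0, 30, 10)))  # 0.0
```

This is why a tracker needs more signals than box overlap alone, which is exactly the gap PoseFlow addresses.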

Multi-person articulated pose tracking in unconstrained videos is a very challenging problem, and many solutions have been proposed.

The solution I would like to present is PoseFlow described in the paper PoseFlow: Efficient Online Pose Tracking by Yuliang Xiu, Jiefeng Li, Haoyu Wang, Yinghong Fang, Cewu Lu. The source code for the solution is available on GitHub, and it is also included in the AlphaPose repository.

Overall PoseFlow pipeline: 1) Pose Estimator, 2) PoseFlow Builder, 3) PoseFlow NMS. First, we estimate multi-person poses. Second, we build pose flows by maximizing overall confidence and purify them with PoseFlow NMS. Finally, reasonable multi-pose trajectories are obtained. (source: https://arxiv.org/abs/1802.00977)

In my opinion, the paper is quite hard to read, and the correspondence to the implementation code is a little bit blurry. Even though the method is called online pose tracking, you are not able to use it directly on a video stream. It’s an academic research code presenting the idea and preparing results for PoseTrack Challenge validation set.
Looking at the PoseTrack 2017 Leaderboard, we can see that it sits in 13th place in the multi-person pose tracking challenge.

With the idea of the processing pipeline in mind, I’ve moved the necessary part of the PoseFlow code to the detectron2-pipeline repository (pipeline/utils/pose_flow.py, pipeline/libs/pose_tracker.py) and added pipeline/track_pose.py pipeline step.

From the code perspective, the algorithm can be simplified to the following steps (see pipeline/libs/pose_tracker.py, lines 59–87):

  1. match ORB descriptor vectors between the current and previous frame using the FLANN (Fast Library for Approximate Nearest Neighbors) based matcher from OpenCV (see OpenCV Feature Matching),
  2. stack together the info of all already-tracked people from the last --track-link-len frames (default 100),
  3. resolve the assignment problem using the Hungarian matching algorithm, with node weights such as the IoU of bounding boxes, the IoU of poses, and the bounding box scores,
  4. assign the person identification attributes (pid, score, box).
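The assignment step above can be sketched with SciPy's Hungarian solver. This is a simplified illustration of the matching idea only: the real PoseFlow cost function also weighs pose IoU, ORB descriptor matches, and box scores, whereas here the cost is box IoU alone, and the `match_tracks` helper is an assumption for the example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def match_tracks(prev_boxes, cur_boxes):
    """Assign each current box to a previous track by maximizing total IoU."""
    cost = np.zeros((len(prev_boxes), len(cur_boxes)))
    for i, p in enumerate(prev_boxes):
        for j, c in enumerate(cur_boxes):
            cost[i, j] = -box_iou(p, c)   # negate: the solver minimizes cost
    rows, cols = linear_sum_assignment(cost)
    # keep only pairs that actually overlap; others start new tracks
    return [(i, j) for i, j in zip(rows, cols) if -cost[i, j] > 0]

# Two tracked people; detections arrive in a different order in the new frame
prev = [(0, 0, 10, 10), (50, 50, 60, 60)]
cur = [(52, 50, 62, 60), (1, 0, 11, 10)]
print(match_tracks(prev, cur))  # [(0, 1), (1, 0)]: both pids carried over
```

Detections left unmatched by this step would receive fresh pids, which corresponds to new people entering the scene.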

Let’s see it in action:

$ python process_video.py -i assets/videos/walk.small.mp4 -p -d --config-file configs/COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml -tp
PoseFlow in action

The final result depends on a lot of factors, like:

  • person detection model accuracy,
  • pose estimation model accuracy,
  • the quality of the image,
  • different pose tracking options (see process_video.py, lines 51–60).

The Detectron2 person keypoint detection model is not the best choice for robust pose tracking. There are specialized frameworks, like the already mentioned AlphaPose or OpenPose. You should experiment with both when creating your pipeline, replacing Detectron2 with the chosen model.

Happy coding!



Deep Learning in Computer Vision

Thanks to Sławomir Gilewski

Written by Jarosław Gilewski

I’m a senior software engineer involved in software development for more than 20 years. Currently, I’m focused on computer vision and deep learning.
