This story presents PoseFlow, a method for multi-person articulated pose tracking in video sequences, and its adaptation to the Detectron2 COCO Person Keypoint Detection Baseline.
Detectron2 is a robust framework for object detection and segmentation (see the model zoo). It allows us to detect person keypoints (eyes, ears, and main joints) and perform human pose estimation.
Person keypoint estimation is done on individual images; to fully understand human behaviour and analyse the whole scene, we need to track each person from frame to frame. Person tracking opens the door to action recognition, person re-identification, understanding human-object interaction, sports video analysis and much more.
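For reference, the COCO person keypoint annotation scheme that this baseline predicts consists of 17 named keypoints per person:

```python
# The 17 keypoints of the COCO Person Keypoint annotation scheme,
# in the order used by the dataset (and by Detectron2's predictions).
COCO_KEYPOINT_NAMES = [
    "nose",
    "left_eye", "right_eye",
    "left_ear", "right_ear",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
    "left_hip", "right_hip",
    "left_knee", "right_knee",
    "left_ankle", "right_ankle",
]

print(len(COCO_KEYPOINT_NAMES))  # 17 keypoints per person
```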
We will use the source code based on my previous story:
How to embed Detectron2 in your computer vision project
Use the power of the Detectron2 model zoo.
I encourage you to read it first! If you followed it just run the command below in the project directory:
$ git pull
$ git checkout 3acd755f03c16f3f3f0adeb6e12e18e721134808
$ conda env update -f environment.yml
or if you prefer to start from the beginning follow with:
$ git clone https://github.com/jagin/detectron2-pipeline.git
$ cd detectron2-pipeline
$ git checkout 3acd755f03c16f3f3f0adeb6e12e18e721134808
$ conda env create -f environment.yml
$ conda activate detectron2-pipeline
To check what the pose estimation is all about, run the command:
$ python process_video.py -i assets/videos/walk.small.mp4 -p -d --config-file configs/COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml
We will get the following results on the screen:
As you can see, Detectron2 gives us the bounding box of each person and their keypoint estimations, thanks to the available COCO Person Keypoint Detection model with Keypoint R-CNN.
This model is based on Mask R-CNN, which is flexible enough to be extended to human pose estimation. Each keypoint’s location is modelled as a one-hot mask, and Mask R-CNN is adopted to predict K masks, one for each of K keypoint types (e.g., left shoulder, right elbow).
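To make the one-hot mask idea concrete, here is a minimal NumPy sketch (with a hypothetical heatmap, not Detectron2’s internal API) of how a keypoint location can be decoded from such a mask: the predicted position is simply the argmax of the per-keypoint heatmap.

```python
import numpy as np

def decode_keypoint(heatmap):
    """Return the (x, y) position of the hottest cell in a keypoint heatmap."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return x, y

# Hypothetical 56x56 heatmap with a single activated location,
# mimicking the one-hot training target for one keypoint type.
heatmap = np.zeros((56, 56), dtype=np.float32)
heatmap[30, 17] = 1.0  # row (y) = 30, column (x) = 17

print(decode_keypoint(heatmap))  # (17, 30)
```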
It’s a top-down method where we first detect human proposals and then estimate keypoints within each box independently.
There is also a bottom-up approach which directly infers the keypoints and the connection information between keypoints of all persons in the image without a human detector.
Other very popular alternative estimators are:
- AlphaPose, a top-down method based on the following paper, and
- OpenPose, a bottom-up method based on this paper.
There is a good article Human Pose Estimation with Deep Learning summarizing different approaches to human pose estimation.
If you are perceptive enough, you will notice in the result above that the human poses are already tracked, at least by the colour of the person’s bounding box. It’s a very naive heuristic that assigns the same colour to the same instance for visualization purposes, based on the intersection over union (IoU) of boxes or masks.
Using bounding box IoU to track pose instances will most likely fail when an instance moves fast, so that consecutive boxes do not overlap, and in crowded scenes where boxes may not have a clear correspondence with pose instances.
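A plain box IoU, the quantity this naive heuristic relies on, can be sketched as:

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Heavily overlapping boxes -> high IoU, probably the same person.
print(box_iou((0, 0, 10, 10), (1, 1, 11, 11)))
# A fast-moving person: no overlap between consecutive frames -> IoU 0.
print(box_iou((0, 0, 10, 10), (20, 0, 30, 10)))  # 0.0
```

The second case is exactly the failure mode described above: once the IoU drops to zero, the heuristic has no information left to link the two boxes.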
Multi-person articulated pose tracking in unconstrained videos is a very challenging problem, and there are a lot of solutions for that.
The solution I would like to present is PoseFlow described in the paper PoseFlow: Efficient Online Pose Tracking by Yuliang Xiu, Jiefeng Li, Haoyu Wang, Yinghong Fang, Cewu Lu. The source code for the solution is available on GitHub, and it is also included in the AlphaPose repository.
In my opinion, the paper is quite hard to read, and the correspondence to the implementation code is a little bit blurry. Even though the method is called online pose tracking, you are not able to use it directly on a video stream. It’s an academic research code presenting the idea and preparing results for PoseTrack Challenge validation set.
Looking at the PoseTrack 2017 Leaderboard, we can see that it holds 13th place in the multi-person pose tracking challenge.
With the idea of the processing pipeline in mind, I’ve moved the necessary parts of the PoseFlow code into the detectron2-pipeline repository (pipeline/libs/pose_tracker.py) and added a pipeline/track_pose.py pipeline step.
From the code perspective, the algorithm can be simplified to the following steps (see pipeline/libs/pose_tracker.py, lines 59–87):
- match ORB descriptor vectors for the current and previous frame using FLANN (Fast Library for Approximate Nearest Neighbors) based matcher from OpenCV (see OpenCV Feature Matching),
- stack together the info of all already tracked people from the last --track-link-len frames (default 100),
- resolve the assignment problem using the Hungarian matching algorithm with weighted nodes such as IoU for bounding boxes, IoU for poses, and bounding box scores,
- assign the person identification attributes.
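The assignment step can be sketched with SciPy’s Hungarian solver: build a similarity matrix between previously tracked instances and current detections (in PoseFlow the entries combine box IoU, pose IoU and box scores; here a hypothetical pre-computed matrix), and solve for the best one-to-one matching.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical similarity matrix: rows = tracked instances from previous
# frames, columns = detections in the current frame. In PoseFlow each
# entry would be a weighted combination of box IoU, pose IoU and scores.
similarity = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.0, 0.3, 0.7],
])

# linear_sum_assignment minimizes total cost, so negate to maximize
# the total similarity of the matching instead.
rows, cols = linear_sum_assignment(-similarity)
for r, c in zip(rows, cols):
    print(f"track {r} -> detection {c} (similarity {similarity[r, c]:.1f})")
```

Tracks whose best similarity falls below a threshold would then be treated as new persons and given fresh IDs.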
Let’s see it in action:
$ python process_video.py -i assets/videos/walk.small.mp4 -p -d --config-file configs/COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml -tp
The final result depends on a lot of factors like:
- person detection model accuracy,
- pose estimation model accuracy,
- the quality of the image,
- different pose tracking options (see process_video.py, lines 51–60).
The Detectron2 person keypoint detection model is not the best choice for robust pose tracking. There are specialized frameworks like the already mentioned AlphaPose or OpenPose. You should experiment with both while creating your pipeline, replacing Detectron2 with the chosen model.