PoseTrack Dataset: Summary
Reference: the paper "PoseTrack: A Benchmark for Human Pose Estimation and Tracking" and https://posetrack.net/
PoseTrack is a new large-scale benchmark for video-based human pose estimation and articulated tracking.
This new dataset focuses on three tasks:
- Single-frame multi-person pose estimation
- Multi-person pose estimation in videos
- Multi-person articulated tracking
A public centralized evaluation server is provided to allow the research community to evaluate on the held-out test set.
The proposed dataset contains over 150,000 annotated poses and over 22,000 labeled frames with various activities being performed, making it the largest and most diverse dataset for multi-person pose estimation and tracking.
The dataset contains a number of videos from the MPII Human Pose dataset. The video sequences chosen from that dataset represent crowded scenes with multiple articulated people engaging in various dynamic activities. The sequences were selected so that they contain a large amount of body motion as well as variation in body pose and appearance. They also contain severe body part occlusion and truncation, i.e., due to occlusions by other people or objects, persons often disappear partially or completely and re-appear again. The scale of the persons also varies across a video due to the movement of persons and/or camera zooming. Therefore, the number of visible persons and body parts also varies across the video.
The overall data set contains 550 video sequences which include 66,374 frames.
Annotating the data set
The video sequences were annotated with
- Locations
- Identities
- Body pose
- Ignore regions
First, ignore regions were labeled to mark people that are extremely difficult to annotate.
Afterward, head bounding boxes were annotated for each person across the videos, and a tracking ID was assigned to every person. The head bounding boxes provide an estimate of the absolute scale of the person, which is required for evaluation.
A unique track ID is assigned to each person appearing in the video until the person moves out of the camera field of view.
Note: each video in the dataset might contain several shots. Track IDs are not maintained between shots, and the same person might get a different ID if they appear in another shot.
Poses are then annotated for each person's track throughout the video. They annotated 15 body parts for each body pose: the head, nose, neck, shoulders, elbows, wrists, hips, knees, and ankles. All the pose annotations were performed using the VATIC tool.
An example of a labeled sample is shown below.
Training and validation/testing videos are annotated differently:
-> The length of the training videos ranges from 41 to 151 frames, and 30 frames from the center of each video are densely annotated.
-> The number of frames in validation/testing videos ranges from 65 to 298; in this case, 30 frames around the keyframe from the MPII Pose dataset are densely annotated, and afterward every fourth frame is annotated.
Challenges conducted with this dataset
The benchmark consists of the following challenges:
Single-frame pose estimation: This task is similar to the ones covered by existing datasets like MPII Pose and MS COCO Keypoints, but on this new large-scale dataset.
Pose estimation in videos: The evaluation of this challenge is performed on single frames; however, the data also includes video frames before and after the annotated ones, allowing methods to exploit video information for more robust single-frame pose estimation.
Pose tracking: This task requires providing temporally consistent poses for all people visible in the videos. The evaluation includes both individual pose accuracy and temporal consistency, measured by identity switches.
Evaluation Metrics
In order to evaluate whether a body part is correctly predicted, they use PCKh (head-normalized probability of correct keypoint): a body joint is considered correctly localized if the predicted location of the joint is within a certain threshold of the true location.
Due to the large variation in scale of people across videos and even within a frame, this threshold needs to be selected adaptively based on the person's size. For that, they use 50% of the head length, where the head length corresponds to 60% of the diagonal length of the ground-truth head bounding box.
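As a rough illustration, here is a minimal sketch of the thresholding rule described above; the function names and the (x, y, w, h) box format are my own assumptions, not part of the official evaluation code.

import math

def pckh_threshold(head_bbox, head_frac=0.6, pckh_frac=0.5):
    # head_bbox is assumed to be (x, y, w, h); the head length is taken as
    # 60% of the bbox diagonal and the match threshold as 50% of that length.
    _, _, w, h = head_bbox
    return pckh_frac * head_frac * math.hypot(w, h)

def joint_is_correct(pred_xy, gt_xy, head_bbox):
    # A predicted joint counts as correct if it lies within the threshold
    # of the ground-truth location.
    dist = math.hypot(pred_xy[0] - gt_xy[0], pred_xy[1] - gt_xy[1])
    return dist <= pckh_threshold(head_bbox)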
Given the joint localization threshold for each person, they compute two sets of evaluation metrics: one that is commonly used for multi-person pose estimation and one from the multi-target tracking literature.
Multi-person pose estimation: For measuring frame-wise multi-person pose accuracy, they use mean Average Precision (mAP). Unlike the usual evaluation protocol, which requires the location of a group of persons and their rough scale to be known during evaluation, they propose not to use any ground-truth information during testing and to evaluate the predictions without rescaling or selecting a specific group of people for evaluation.
Articulated multi-person pose tracking: To evaluate multi-person pose tracking, they use Multiple Object Tracking (MOT) metrics. The metrics used for evaluation include Multiple Object Tracking Accuracy (MOTA), Multiple Object Tracking Precision (MOTP), precision, and recall.
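For reference, the standard MOTA definition from the MOT literature combines misses, false positives, and identity switches into a single score. The tiny sketch below shows only this aggregation and assumes the per-frame counts have already been produced by some matching procedure, which is not reproduced here; that PoseTrack computes the score per joint and then averages is my reading, not something spelled out in this summary.

def mota(false_negatives, false_positives, id_switches, num_gt):
    # MOTA = 1 - (FN + FP + IDSW) / GT, aggregated over all frames.
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

# Example: 120 missed joints, 80 false positives, and 15 identity switches
# against 1,000 ground-truth joints give a MOTA of 0.785.
print(mota(120, 80, 15, 1000))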
Updates on PoseTrack 2018
The new release contains 1,356 annotated sequences (593 train, 170 validation, and 375 test) with 276,198 annotated body poses in total. This is more than twice the amount of data compared to PoseTrack 2017.
The Annotations
There are 17 key points annotated for each person.
The annotation file is a JSON file in the format of the MSCOCO dataset, and you can use pycocotools to read the annotations.
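Since the files follow the MSCOCO layout, a minimal loading sketch with pycocotools might look like the following; the file path is just a placeholder, and this is an illustration rather than the official PoseTrack tooling.

from pycocotools.coco import COCO

# Placeholder path; point this at one of the per-video annotation files.
coco = COCO("annotations/val/000342_mpii_test.json")

# Each "image" entry corresponds to one frame of the video.
img_ids = coco.getImgIds()
first_frame = coco.loadImgs(img_ids[0])[0]
print(first_frame["file_name"])

# Person annotations (keypoints, bboxes, track IDs) attached to that frame.
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_ids[0]))
print(len(anns), "annotated people in this frame")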
There is one annotation file per video, which contains entries for every frame in that video; however, not all frames are annotated. The following are the frame-level annotations:
{
"has_no_dense_pose": true,
"is_labeled": true, → whether or not this frame is labeled
"file_name": "images/val/000342_mpii_test/000000.jpg", → the path to this specific frame
"n_frames": 100, → the total number of frames in this video
"frame_id": 10003420000, → the frame ID of this frame within this video
"vid_id": "000342", → the video ID; it is the same for all frames of this video
"ignore_regions_y": [], → a list of lists giving the y coordinates of the ignore regions (regions that are not annotated); this list is empty if there is no region to ignore in this frame
"ignore_regions_x": [], → a list of lists giving the x coordinates of the ignore regions; this list is empty if there is no region to ignore in this frame
"id": 10003420000 → the same value as frame_id
}
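Since only a subset of frames carries pose labels, it can be handy to filter on the is_labeled flag. Below is a small sketch using plain json, assuming the usual MSCOCO top-level layout with an "images" list; the path is again a placeholder.

import json

# Placeholder path to one per-video annotation file.
with open("annotations/val/000342_mpii_test.json") as f:
    data = json.load(f)

# Keep only the frames that actually carry pose annotations.
labeled_frames = [img for img in data["images"] if img["is_labeled"]]
print(len(labeled_frames), "of", len(data["images"]), "frames are labeled")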
For the frames that are labeled, i.e., those whose "is_labeled" flag is true, we have the following annotations for each person:
"bbox_head": [] → a list of 4 elements denoting the bounding box of the head
"keypoints": [] → a list of 51 elements (17x3; (x, y, v) for each keypoint) representing the locations of the 17 keypoints mentioned above. The list is in the format [x1, y1, v1, x2, y2, v2, ...], where 1 is the first keypoint, 2 is the second keypoint, and so on. For each (x, y, v), v is either 0 or 1, where 0 denotes that the keypoint location is not available (see the parsing sketch after this list).
"track_id": 1 → the tracking ID of the person; this ID remains constant for that person across all the sequences of that video
"image_id": 10003420000 → this is the frame ID
"bbox": [] → a list of 4 elements denoting the bounding box location of that person
"category_id": 1 → category ID 1 stands for "person"; it is 1 across the entire dataset as only persons are labeled
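As referenced above, here is a minimal sketch of unpacking the flat keypoints list into (x, y, v) triplets; the variable names are mine, and ann stands for one annotation dict loaded from the JSON file (e.g. via pycocotools as shown earlier).

import numpy as np

def unpack_keypoints(ann):
    # Reshape the flat [x1, y1, v1, x2, y2, v2, ...] list into a (17, 3) array.
    kps = np.asarray(ann["keypoints"], dtype=float).reshape(-1, 3)
    # Keep only the keypoints whose location is actually annotated (v > 0).
    visible = kps[kps[:, 2] > 0]
    return kps, visible

# Example usage with the annotations loaded earlier:
# kps, visible = unpack_keypoints(anns[0])
# print(kps.shape, "of which annotated:", len(visible))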
That is it for now, but I will add more information in the future as I read more about this dataset. Meanwhile, refer to their paper and website: "PoseTrack: A Benchmark for Human Pose Estimation and Tracking" and https://posetrack.net/
If you find my articles helpful and wish to support them — Buy me a Coffee