How Facebook AI used the PoseWarper network to generate more data to train a pose-estimation model

Saiyam Bhatnagar · Published in Analytics Vidhya · Apr 17, 2020 · 4 min read

Hello readers. This blog is written in appreciation of a research paper I am presently implementing in TensorFlow 2.0. I personally found the aim and the findings of the paper worth sharing with those who have not read it yet. Therefore, before diving into the beauty of this paper, which proposes a network capable of generating more data to train itself and improve its accuracy, I would like to credit the authors: Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, and Lorenzo Torresani (Facebook AI, University of Pennsylvania).

Modern deep-neural-network approaches to pose estimation require large amounts of dense annotations. This means that if you want to train a neural network for pose estimation, you need a huge number of manually labeled video frames, each annotated with the coordinates of the points you want your model to estimate. This process is very expensive because it requires manual labor. This is where the network proposed in the paper steps in. Named PoseWarper, this network is capable of labeling the frames of a video with the aid of only a few labeled frames.

For training, the PoseWarper network takes as input two frames of the same video: a labeled frame A and an unlabeled frame B, separated by only a few time steps. Both frames are fed into a network (SimpleHRNet) that generates the heatmaps f(A) and f(B). For those who are not familiar with HRNet: it is a network that takes an image containing a human as input and outputs heatmaps (a labeled image that marks the person's joints). For details of SimpleHRNet click here. For greater understanding, refer to the figure given below; a small code sketch of this step follows it.

the input and corresponding output of the SimpleHRNet
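To make this step concrete, here is a minimal TensorFlow 2 sketch. It is not the authors' code: a single convolution plus pooling stands in for the pretrained SimpleHRNet backbone, and the shapes (17 COCO joints, heatmaps at a quarter of the input resolution) are purely illustrative.

```python
import tensorflow as tf

# Minimal sketch, not the authors' code: this tiny "backbone" stands in
# for the pretrained, frozen SimpleHRNet that maps an image to per-joint
# heatmaps. Shapes are illustrative (17 joints, 1/4 input resolution).
backbone = tf.keras.Sequential([
    tf.keras.layers.Conv2D(17, 3, padding="same"),
    tf.keras.layers.AveragePooling2D(pool_size=4),
])
backbone.trainable = False  # the heatmap extractor stays frozen

frame_a = tf.random.uniform((1, 384, 288, 3))  # labeled frame A
frame_b = tf.random.uniform((1, 384, 288, 3))  # unlabeled frame B

f_a = backbone(frame_a, training=False)  # f(A): shape (1, 96, 72, 17)
f_b = backbone(frame_b, training=False)  # f(B): shape (1, 96, 72, 17)
```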

Now, after the two heatmaps f(A) and f(B) have been obtained from the pretrained HRNet, the trainable region of the PoseWarper network begins. In brief, the network computes the difference between the two heatmaps, phi = f(A) − f(B). This phi is then sent to the trainable region of the network, which uses it to warp f(B) and tries to match the result with the ground truth in labeled frame A. This forms the training part of the network and is the very aim of PoseWarper: training the weights so that the pose estimated from the unlabeled frame B matches the ground-truth pose in the labeled frame A. Please refer to this pictorial representation for greater clarity.

the training part of PoseWarper

A high-level overview of the approach for using sparsely labeled videos for pose detection: in each training video, pose annotations are available only every k frames. During training, the system considers a pair of frames (a labeled Frame A and an unlabeled Frame B) and aims to detect the pose in Frame A using the features from Frame B.
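Here is a rough TensorFlow 2 sketch of one such training step, under two loudly stated assumptions: TensorFlow has no built-in deformable convolution, so a plain convolutional module (warp_net) stands in for PoseWarper's offset-driven warping, and gt_heatmaps_a is a hypothetical tensor holding the ground-truth heatmaps rendered from frame A's annotation.

```python
import tensorflow as tf

# Hypothetical stand-in for PoseWarper's warping module: in the real
# model, offsets predicted from phi drive deformable convolutions over
# f(B); here a plain conv stack approximates that idea.
warp_net = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(17, 3, padding="same"),  # warped heatmaps
])
optimizer = tf.keras.optimizers.Adam(1e-4)
mse = tf.keras.losses.MeanSquaredError()

@tf.function
def train_step(f_a, f_b, gt_heatmaps_a):
    with tf.GradientTape() as tape:
        phi = f_a - f_b  # pose difference: phi = f(A) - f(B)
        # warp f(B) toward frame A, conditioned on phi
        warped = warp_net(tf.concat([phi, f_b], axis=-1))
        loss = mse(gt_heatmaps_a, warped)  # match ground truth of frame A
    grads = tape.gradient(loss, warp_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, warp_net.trainable_variables))
    return loss
```

Given the f_a and f_b tensors from the previous snippet and a (1, 96, 72, 17) ground-truth tensor, loss = train_step(f_a, f_b, gt_heatmaps_a) runs one gradient update.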

After training the model, we are interested in passing in unlabeled frames and getting them labeled as output. The very aim is to grow our training data set by continuously generating new data, using PoseWarper to label the unlabeled frames in the videos. The picture below shows how the model labels unlabeled frames.

labeling the unlabeled frame and generating more data simultaneously
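Below is a hypothetical sketch of this data-generation phase, reusing the warp_net trained in the previous snippet. It is an assumption about how the roles swap at inference: the nearest labeled frame's heatmaps are warped toward each unlabeled frame, and k mirrors the paper's sparse setting where only every k-th frame carries a manual annotation.

```python
import tensorflow as tf

def generate_pseudo_labels(heatmaps, warp_net, k=7):
    """Propagate labels: warp f(A) of the nearest labeled frame toward
    each unlabeled frame B to obtain a pseudo-label for B."""
    pseudo = {}
    for i, f_b in enumerate(heatmaps):
        if i % k == 0:
            continue  # this frame already has a manual label
        a = (i // k) * k           # index of the nearest labeled frame A
        phi = f_b - heatmaps[a]    # difference, now in the A -> B direction
        pseudo[i] = warp_net(tf.concat([phi, heatmaps[a]], axis=-1))
    return pseudo  # heatmap labels for previously unlabeled frames
```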

The very purpose of this article is to bring out the idea presented in the paper, not the implementation details. Still, we can ponder over the model architecture for greater clarity. Below is an image of the model architecture.

The model architecture
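As a hedged sketch of the multi-branch idea visible in the architecture figure: phi is processed by parallel 3x3 branches with different dilation rates, so the warp can cover both small and large motions, and the branch outputs are summed. In the actual model each branch predicts offsets for a deformable convolution over f(B); since TensorFlow has no built-in deformable convolution, plain dilated convolutions stand in here, and the dilation values below are illustrative.

```python
import tensorflow as tf

class MultiDilationWarp(tf.keras.Model):
    """Illustrative stand-in for PoseWarper's multi-dilation warping."""

    def __init__(self, joints=17, dilations=(3, 6, 12, 18, 24)):
        super().__init__()
        self.branches = [
            tf.keras.layers.Conv2D(joints, 3, padding="same",
                                   dilation_rate=d)
            for d in dilations
        ]

    def call(self, phi, f_b):
        x = tf.concat([phi, f_b], axis=-1)  # condition the warp on phi
        # sum the branch outputs to aggregate the different motion scales
        return tf.add_n([branch(x) for branch in self.branches])
```

With the tensors from the first snippet, warped = MultiDilationWarp()(f_a - f_b, f_b) produces heatmaps of the same shape as f_b.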

Readers can always read the research work here if they need to know more about the architecture, implementation details, experiments, and the model's capabilities.
