Week 2 — Dance like a professional, with AI.
Hello everyone again!
This is the second in a series of blog posts we are writing for our project “Dance like a professional, with ML”. Here you can find our first blog post, where we briefly talk about who we are, what the project is about, and what we will try to do.
In this blog post, we take a more detailed look at the paper we are interested in. Let’s dive into it.
Method
The method mainly consists of three stages: the first is pose detection, the second is global pose normalization, and the third is pose-to-video translation. Let’s dig into these three stages.
Encoding the Body Poses
To extract only the body poses from the given target video y, the authors used a pre-trained pose detector (OpenPose).
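To make this step concrete, here is a minimal sketch of what the pose encoding could look like. Note that `detect_keypoints` and the limb list below are placeholders we made up for illustration; a real implementation would call a pre-trained detector such as OpenPose on each frame.

```python
# Sketch: read frames from the target video y and rasterize detected joints
# into "pose stick figure" images. The detector and skeleton topology here
# are dummy stand-ins, not the actual OpenPose output format.
import cv2
import numpy as np

LIMBS = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5)]  # toy skeleton topology

def detect_keypoints(frame):
    # Dummy stand-in: a real implementation would run a pre-trained detector
    # (e.g. OpenPose) and return (x, y) coordinates for each joint.
    h, w = frame.shape[:2]
    return [(np.random.uniform(0, w), np.random.uniform(0, h)) for _ in range(6)]

def to_stick_figure(keypoints, height, width):
    """Rasterize 2D keypoints into a stick-figure image the generator can consume."""
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    for a, b in LIMBS:
        pa, pb = keypoints[a], keypoints[b]
        if pa is not None and pb is not None:  # skip undetected joints
            cv2.line(canvas, tuple(map(int, pa)), tuple(map(int, pb)),
                     color=(255, 255, 255), thickness=4)
    return canvas

cap = cv2.VideoCapture("target_video.mp4")  # the target video y
stick_figures = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    kps = detect_keypoints(frame)
    stick_figures.append(to_stick_figure(kps, frame.shape[0], frame.shape[1]))
cap.release()
```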
Global Pose Normalization
Body shapes differ from person to person, and so do perspective effects across videos. This variance leads to different body proportions in the detected pose stick figures. To overcome this problem, the authors use a pose normalizer: given the two subjects’ pose stick figures, it measures the heights and ankle positions of both and then applies a simple scale and translation to the source subject’s pose stick figure so that it matches the target subject’s proportions.
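Here is a small sketch of how such a normalization could be implemented, assuming keypoints come as NumPy arrays of (x, y) coordinates and that we know which indices correspond to the head and the ankles. The indices are made up for illustration, and the paper’s statistics over close/far ankle positions are simplified to a single scale and vertical shift.

```python
# Sketch: scale and translate the source keypoints so their proportions
# match the target subject's, based on body height and ankle position.
import numpy as np

HEAD, L_ANKLE, R_ANKLE = 0, 10, 13   # assumed joint indices, for illustration

def normalize_pose(source_kps, target_kps):
    """Scale and translate source keypoints to match the target's proportions."""
    def ankle_y(kps):
        return max(kps[L_ANKLE][1], kps[R_ANKLE][1])  # lower ankle (closer to floor)

    def body_height(kps):
        return ankle_y(kps) - kps[HEAD][1]            # head-to-ankle distance

    scale = body_height(target_kps) / body_height(source_kps)
    scaled = source_kps * scale
    # Translate vertically so the source's ankles line up with the target's.
    scaled[:, 1] += ankle_y(target_kps) - ankle_y(scaled)
    return scaled
```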
Pose to Video Translation
This part consists of two separate generative models, a full-body generator and a face generator.
Full-Body Generation
For a given pose stick figure x, the model G generates a corresponding image G(x). To train such a generative model, the authors use a GAN architecture, which consists of a generator and a discriminator: the generator tries to produce “fake” images realistic enough to fool the discriminator, while the discriminator tries to tell fake images apart from real ones.

Also, since we are dealing with videos instead of single images, we need temporal coherence, meaning each frame should be consistent with the previous and the next one. Because of that, the authors apply two-frame generation. Instead of dealing with one frame at a time, the model first generates an image for the pose stick figure x_t and a zero image z, then generates the next frame from the pose stick figure x_(t+1) and the previously generated image G(x_t). Moreover, they feed the discriminator the pair (x_t, x_(t+1)) together with (G(x_t), G(x_(t+1))) as a “fake” example, and (x_t, x_(t+1)) together with (y_t, y_(t+1)) as a “real” example.
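Below is a toy PyTorch sketch of this two-frame setup. The tiny networks are placeholders (not the architecture from the paper); the point is only the conditioning logic: the generator consumes the pose stick figure plus the previous output, and the discriminator sees consecutive pose/frame pairs.

```python
# Toy sketch of the temporally smoothed, two-frame GAN setup.
import torch
import torch.nn as nn

class TinyGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # input: pose stick figure (3 ch) + previous output (3 ch)
        self.net = nn.Sequential(nn.Conv2d(6, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 3, 3, padding=1), nn.Tanh())

    def forward(self, pose, prev_frame):
        return self.net(torch.cat([pose, prev_frame], dim=1))

class TinyDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        # input: two pose stick figures + two (real or fake) frames = 12 ch
        self.net = nn.Sequential(nn.Conv2d(12, 16, 4, stride=2, padding=1),
                                 nn.LeakyReLU(0.2), nn.Conv2d(16, 1, 4))

    def forward(self, pose_t, pose_t1, frame_t, frame_t1):
        return self.net(torch.cat([pose_t, pose_t1, frame_t, frame_t1], dim=1))

G, D = TinyGenerator(), TinyDiscriminator()
x_t, x_t1 = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)  # pose stick figures
y_t, y_t1 = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)  # real video frames

fake_t  = G(x_t,  torch.zeros_like(x_t))   # first frame, conditioned on a zero image
fake_t1 = G(x_t1, fake_t)                  # next frame, conditioned on the previous output

real_score = D(x_t, x_t1, y_t, y_t1)        # "real" pair
fake_score = D(x_t, x_t1, fake_t, fake_t1)  # "fake" pair
```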
Face Generation
Since the full-body generator can’t produce detailed facial expressions, the authors use a separate generator specifically for faces. After generating the fake image for a given pose stick figure, the model crops the face region from the generated image and feeds it, together with the face region of the pose stick figure, into a face generator. That generator produces a face patch from these two inputs, and a face discriminator tries to distinguish the synthesized faces from the faces in the original input images. After generating the face patch, the model simply adds it to the generated image G(x).
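Again, a toy sketch of this face refinement step, with a made-up face bounding box and a tiny placeholder network; what matters is only the crop → residual → add-back flow.

```python
# Sketch: crop the face region from the full-body output and the pose stick
# figure, predict a residual patch with a separate face generator, and add it
# back into the full image.
import torch
import torch.nn as nn

face_G = nn.Sequential(nn.Conv2d(6, 16, 3, padding=1), nn.ReLU(),
                       nn.Conv2d(16, 3, 3, padding=1), nn.Tanh())

def refine_face(full_image, pose, box):
    """Add a predicted residual on top of the face region of the generated image."""
    y0, y1, x0, x1 = box                                  # face bounding box (assumed known)
    face_crop = full_image[:, :, y0:y1, x0:x1]
    pose_crop = pose[:, :, y0:y1, x0:x1]
    residual = face_G(torch.cat([face_crop, pose_crop], dim=1))
    refined = full_image.clone()
    refined[:, :, y0:y1, x0:x1] = face_crop + residual    # patch added to G(x)'s face region
    return refined

full_image = torch.randn(1, 3, 64, 64)   # output of the full-body generator G(x)
pose = torch.randn(1, 3, 64, 64)         # pose stick figure x
refined = refine_face(full_image, pose, box=(0, 16, 24, 40))
```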
Full Pipeline
The pipeline consists of two phases: training and transfer.
Training Phase
Here, we train the generator with a reconstruction task. First, for the given video frames y_t and y_(t+1), the pose detector creates the corresponding pose stick figures. Then the full-body and face GAN models update their parameters according to the objectives given above.
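As a rough illustration, here is what the plain adversarial part of those objectives looks like, assuming `real_score` and `fake_score` are the discriminator outputs for the real and fake pairs from the two-frame setup above. This is a simplification: the actual paper also uses additional reconstruction-style losses (e.g. feature matching and perceptual terms) that we omit here.

```python
# Sketch of the adversarial objectives for one training step.
import torch
import torch.nn.functional as F

real_score = torch.randn(1, 1, 29, 29)   # placeholder for D(x_t, x_(t+1), y_t, y_(t+1))
fake_score = torch.randn(1, 1, 29, 29)   # placeholder for D(x_t, x_(t+1), G(x_t), G(x_(t+1)))

# Discriminator: push real pairs towards 1 and fake pairs towards 0.
d_loss = (F.binary_cross_entropy_with_logits(real_score, torch.ones_like(real_score)) +
          F.binary_cross_entropy_with_logits(fake_score, torch.zeros_like(fake_score)))

# Generator: fool the discriminator, i.e. push fake pairs towards 1.
g_loss = F.binary_cross_entropy_with_logits(fake_score, torch.ones_like(fake_score))
```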
Even though the authors don’t state it explicitly, training should be done on a video that is not diverse, i.e. a single subject filmed in a fixed setting. In other words, since the generator only gets pose stick figures as input, it cannot be aware of contextual visual features (the subject’s appearance, the background) except through the error signal coming from the discriminator. So the generator should be trained in an environment that stays invariant across frames.
Transfer Phase
In this phase, we do the actual motion transfer. For the given source video, i.e. the video whose dance moves we want to transfer, the model first detects the pose stick figures. Afterward, it applies the pose normalization we mentioned above. Finally, we feed the normalized pose stick figures into the generator of the full-body GAN we trained in the training phase, together with the generator of the face GAN. The resulting sequence of generated frames is our final output for dance motion transfer.
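To wrap up, here is a high-level sketch tying the pieces together for the transfer phase. All the callables are stand-ins for the components sketched earlier (pose detector, pose normalizer, trained full-body generator, trained face generator); none of this is the authors’ actual code.

```python
# Sketch: run the trained generators over a source video's (normalized) poses.
import torch

def transfer(source_frames, detect_pose, normalize, full_body_G, refine_face, face_box):
    """Generate target-subject frames that follow the source subject's dance moves."""
    outputs = []
    prev = None
    for frame in source_frames:
        pose = normalize(detect_pose(frame))        # source pose, matched to target proportions
        if prev is None:
            prev = torch.zeros_like(pose)           # zero image for the first frame
        generated = full_body_G(pose, prev)         # full-body generation
        generated = refine_face(generated, pose, face_box)  # add the face patch
        outputs.append(generated)
        prev = generated                            # condition the next frame on this one
    return outputs
```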
That’s it for this week, see you next time.