Everybody Dance Faster
Making the 1-hour Motion Transfer Booth
In collaboration with Salma Mayorquin
The results behind “Everybody Dance Now” demonstrate the power of combining image-to-image translation with pose estimation to produce photo-realistic ‘do-as-I-do’ motion transfer.
The researchers acquired roughly 20 minutes of video, shot at 120 fps, while the transfer subject moved through a full range of body motion.
It is also important to frame the target scene from a perspective similar to that of the source motion video. Generally, this means a fixed camera at a third-person perspective, with the subject’s body filling most of the frame.
As sensational as the examples can be, delivering this as a live experience would be difficult from a practical standpoint:
- Requires the user to move through a full range of motion for 20 minutes
- Costly feature extraction and image rendering for training samples
- Involves training a custom GAN from scratch
We want to explore model and implementation reductions with the goal of quickly producing ‘reasonable quality’ motion transfer examples in a live demo.
Before framing this further, let’s pause to consider specific challenges to producing qualitatively satisfactory examples.
Gallery of GANs Gone Wrong
In each of the following experiments, we use no more than 3 minutes of sample target video shot at 30 fps.
The first example shows how errors in pose estimation, particularly false positives on shadows and humanoid figures, can cascade into an unrealistic backup dancer.
Next, the framing of this video was too tight with respect to that of the source.
Pose estimation models simply don’t perform well in some body positions. Specifically, occlusion of the head or a relatively low framing of the upper body can impact pose estimate quality. The next example demonstrates an attempt at motion transfer of a yoga flow.
The next two are more convincing but each highlights the challenges in reproducing complex scenes with asymmetry.
Finally, we reach something closer to an entertaining example of motion transfer content.
Motivated by a sense of how our experimental designs have impacted the quality of the renditions, we can constrain our demo to more consistently produce high-quality examples.
Setting the Scene
Simple, symmetric scenes will be easiest to generate. This helps us spend our practical compute budget refining models to produce high-fidelity renditions of the subject dancer.
The researchers emphasized slim fit clothing to limit the challenges of producing wrinkle textures. For our purposes, we assume participants will wear attire typical to a tech or business conference.
Additionally, we assume the scene is an adequately lit booth with space to frame a shot from a perspective similar to that of the source reference video.
The example above shows an idealized setting for our booth after training an image-to-image translation model on roughly 5,000 640x480 images.
Note the glitchy frames due to poor pose estimation at the feature extraction step on the source dance video.
This reference implementation ran for roughly 8 hours on a GTX 1080 GPU. We want to get training times down to one hour, so we will need something quite different.
Next, we discuss some implementation choices to expedite the production of motion transfer examples in a live demo setting.
Estimating Pose at the Edge
Motion transfer involves a costly feature extraction step of running pose estimates over source and target videos.
Reference source videos, however, are harder to come by, so we assume they are available in preprocessed form for our implementation.
Then, by performing fast pose estimation on the target video, we can spend the remaining time training the GANs.
To achieve the greatest time resolution with hi-speed cameras, we don’t want to block frame acquisition with inference and streaming; instead, we write frames to an mp4 file, which can then be queued for asynchronous processing and streaming.
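To make the hand-off concrete, here is a minimal sketch of the acquire-then-process pattern. The names (`record_clip`, the worker body) are hypothetical stand-ins for the real camera capture and pose estimation steps:

```python
import queue
import threading
import time

# Capture thread only writes clips and enqueues their paths; a worker
# drains the queue, so inference never blocks frame acquisition.
video_queue: "queue.Queue[str]" = queue.Queue()
processed: list = []

def record_clip(path: str) -> str:
    """Stand-in for hi-speed capture: real code would write an mp4 here."""
    time.sleep(0.01)  # represents ~15 s of blocking frame acquisition
    return path

def acquire(n_clips: int) -> None:
    """Producer: record clips back to back and hand them off."""
    for i in range(n_clips):
        video_queue.put(record_clip(f"clip_{i}.mp4"))

def worker() -> None:
    """Consumer: run pose estimation and stream results, asynchronously."""
    while True:
        path = video_queue.get()
        processed.append(path)  # real code: pose estimation + cloud upload
        video_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
acquire(3)
video_queue.join()  # wait until every queued clip has been processed
print(processed)    # ['clip_0.mp4', 'clip_1.mp4', 'clip_2.mp4']
```

The single FIFO queue keeps clips in capture order; adding more worker threads would trade that ordering for throughput.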
Assuming a realistic time budget for a user in our booth, say 15 seconds, we can scale up the number of edgeTPUs and hi-speed USB cameras until we acquire sufficiently many training samples for the GANs.
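A back-of-envelope check that this scaling reaches a useful corpus size; every number below is an illustrative assumption, not a measurement from the booth:

```python
# Rough training-sample budget for the booth (assumed numbers).
capture_seconds = 15   # time we ask of each participant
fps = 120              # per hi-speed camera
n_cameras = 3          # camera + edgeTPU pairs in the array

total_frames = capture_seconds * fps * n_cameras
print(total_frames)    # 5400 candidate frames, on par with the ~5,000
                       # images used to train the reference model
```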
We’ve also seen how pose estimate quality impacts the final result. We choose larger, more accurate models and apply simple heuristics to exploit continuity of motion.
More concretely, we impute missing keypoints and apply time smoothing to pose estimates enqueued into a circular buffer. This is especially sensible when performing hi-speed frame acquisition.
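A minimal sketch of these two heuristics, assuming poses arrive as lists of (x, y) keypoints with `None` marking a detector miss; the window size and shapes are illustrative:

```python
from collections import deque

WINDOW = 5                            # illustrative smoothing window
buffer: deque = deque(maxlen=WINDOW)  # circular buffer of recent poses

def impute(pose, previous):
    """Fill keypoints the detector missed with their last seen location."""
    if previous is None:
        return pose                   # first frame: nothing to impute from
    return [kp if kp is not None else prev_kp
            for kp, prev_kp in zip(pose, previous)]

def smooth(pose):
    """Enqueue a pose and return the average over the buffered window."""
    previous = buffer[-1] if buffer else None
    buffer.append(impute(pose, previous))
    n = len(buffer)
    return [(sum(p[i][0] for p in buffer) / n,
             sum(p[i][1] for p in buffer) / n)
            for i in range(len(pose))]
```

Averaging over a short window exploits continuity of motion: at 120 fps, adjacent frames are only about 8 ms apart, so smoothing suppresses jitter without visibly lagging the pose.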
The main impact on final quality comes from poor pose estimates generated from the source video. Since these valuable reference videos are processed ahead of time, their pose estimates should be corrected manually if necessary.
Streaming the inference results to the cloud, we generate a training corpus for our image-to-image translation models.
Then the main bottleneck to quickly producing one of these examples is training the GANs.
Yo Dawg, I heard you like to Transfer…
…So we’re gonna apply transfer learning to this motion transfer task.
In other words, having trained a motion transfer model for one target dancer, we can use this model as a warm starting point to fine tune models for other dancers in the same scene.
Our setup thus far takes a few seconds to acquire images before running inference at the edge and pushing the results to the cloud. This leaves one hour to fine-tune a model restored from a checkpoint trained over hours ahead of time on our demo setting from above.
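Why the warm start buys so much can be illustrated with a toy gradient-descent problem standing in for GAN training (all numbers made up): initializing near a solution for a similar scene converges in far fewer steps than a cold start.

```python
def steps_to_converge(w, target=5.0, lr=0.1, tol=1e-3):
    """Gradient descent on the toy loss (w - target)^2; count the steps."""
    steps = 0
    while abs(w - target) > tol:
        w -= lr * 2 * (w - target)   # gradient step on the squared error
        steps += 1
    return steps

cold = steps_to_converge(w=0.0)   # training from scratch
warm = steps_to_converge(w=4.5)   # restored from a similar-scene checkpoint
print(cold, warm)                 # the warm start needs far fewer steps
```

The same intuition carries over to fine-tuning: the checkpointed generator has already learned the shared scene, so the remaining budget goes to adapting to the new dancer.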
Since we use identical but flawed pose estimates from before, the following examples show the same ‘glitch’ behavior. This is easily corrected in the source video ahead of demo day.
The above examples used transfer learning from checkpoints already trained to produce reasonable motion transfer renditions in our demo and rooftop environments, respectively. The booth setting on the left trained in only one hour; the complex rendition on the right, however, took considerably longer.
This means we can invite users into our booth and let them move through a full range of motion in front of our array of cameras and edgeTPUs for a few seconds.
This setup will be acquiring thousands of photos and running inference in real-time before streaming results to the cloud.
In the cloud, we run a server to train the GAN for our one hour time budget before sending a user video links to hosted renditions.
Twisting the Task
BodyPix, a person segmentation model, was published after “Everybody Dance Now” but offers an alternative to pose estimation for the intermediary representation used in motion transfer.
We might expect the BodyPix alternative to provide:
- a smoother representation of body part location, as a region rather than a point
- more implicit information on orientation, since a 2D region encodes more than a line segment can
- greater pose resolution, with 24 regions compared to the 19 keypoints of pose estimation
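The orientation point can be made concrete with a toy sketch: from a single BodyPix-style part mask we can recover not only a centroid (the analogue of a keypoint) but also the region’s principal-axis orientation via its second moments. The mask below is a made-up example:

```python
import math

# Toy part mask: 1s mark pixels of a single body-part region.
mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]

pixels = [(x, y) for y, row in enumerate(mask)
                 for x, v in enumerate(row) if v]

# Centroid: the information a lone keypoint carries.
cx = sum(x for x, _ in pixels) / len(pixels)
cy = sum(y for _, y in pixels) / len(pixels)

# Second moments additionally recover the region's principal axis,
# i.e. an orientation no single keypoint encodes.
mxx = sum((x - cx) ** 2 for x, _ in pixels) / len(pixels)
myy = sum((y - cy) ** 2 for _, y in pixels) / len(pixels)
mxy = sum((x - cx) * (y - cy) for x, y in pixels) / len(pixels)
angle = 0.5 * math.atan2(2 * mxy, mxx - myy)

print((cx, cy), angle)  # vertical region -> principal axis at pi/2
```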
Unfortunately, the model has not been made publicly available. The Medium post describing the work outlined a mixed-training strategy incorporating real and simulated human figures within the training data and a multi-task loss.
For our proof-of-concept, we modify the TensorFlow.js demo to build a dataset that leverages person segmentation for motion transfer.
The quantized model trades some precision for faster inference in the browser, but we could add more accurate segmentation models to help process these masks further.
In a follow up post, we discuss the results of training our motion transfer model using person segmentation.
We have looked at ways to incorporate powerful hardware and specialized libraries for performing inference at the edge. We also applied standard techniques like transfer learning to expedite the delivery of motion transfer content.
Additionally, we devised constraints for the scene and showed ways to exploit continuity for higher quality renditions.
Finally, we considered advances to the state-of-the-art in analyzing pose to perform motion transfer with person segmentation.
Looking forward to demonstrating for you, hope you enjoyed the experiments!