Everybody Dance Now with BodyPix

Applying Person Segmentation to Motion Transfer

Terry Rodriguez

Published in SmellsLikeML

Nov 7, 2019

In collaboration with Salma Mayorquin

In our previous post, we discussed strategies to shorten the time taken to train a ‘do-as-I-do’ motion transfer model.

Recall that in order to perform motion transfer with pix2pix models, researchers first extracted pose estimates from source and target videos.

Now we consider BodyPix person segmentation as an alternative representation to pose estimates.

Ultimately, the tensorflow.js browser demo proved the most practical way to access the person segmentation model.

Snapping BodyPix

With the addition of some JavaScript to write images to cloud storage, we could acquire raw images and rainbow-colored segmentation masks at 20+ Hz from our browser.

Interestingly, after retrieving some of these images, we found the number of unique RGB values in an example segmentation map to be much greater than the expected 25 values.

Inspecting the image, we find RGB values like: [3, 207, 200], [4, 210, 202], [5, 210, 195], … It seems the colormap has undergone some small random perturbations!
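A minimal sketch of this check, assuming the mask has already been loaded as a flat list of (R, G, B) tuples. The helper name `count_unique_colors` is ours for illustration, and the sample pixels are just the three values quoted above plus one repeat:

```python
from collections import Counter

def count_unique_colors(pixels):
    """Map each distinct (R, G, B) tuple to the number of pixels that use it."""
    return Counter(tuple(p) for p in pixels)

# Toy stand-in for a decoded mask: three near-identical colors, one repeated.
sample = [(3, 207, 200), (4, 210, 202), (5, 210, 195), (3, 207, 200)]
counts = count_unique_colors(sample)

# A clean colormap would give one color per class; perturbation inflates this.
print(len(counts), "distinct colors in", len(sample), "pixels")
```

On a real mask, a count far above 25 is the symptom described here.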

Investigating further, we find the histogram of colors for a grayscale sample image is NOT sharply peaked around 25 discrete values.

Histogram of Pixel Values for Gray Scaled Sample

Rather, the histogram shows a spread that confirms our hypothesis. Since we started with a reference implementation that requires a segmentation map, we must map the millions of RGB combinations in our corpus of masks down to 25 shades of gray.
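The grayscale histogram check can be sketched as follows. We assume the common ITU-R 601 luma weights for the conversion (the post does not say which conversion was used), and `to_gray` and `gray_histogram` are illustrative names:

```python
from collections import Counter

def to_gray(rgb):
    """Convert an (R, G, B) tuple to one gray level (ITU-R 601 weights, an assumption)."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)

def gray_histogram(pixels):
    """Count how many pixels fall on each gray level."""
    return Counter(to_gray(p) for p in pixels)

# One class, clean colormap: every pixel lands in a single bin.
clean = [(3, 207, 200)] * 5
# Same class after small perturbations: the mass spreads over several bins.
perturbed = [(3, 207, 200), (4, 210, 202), (5, 210, 195)]

print(len(gray_histogram(clean)), len(gray_histogram(perturbed)))
```

A clean 25-class mask would occupy at most 25 bins. Many more occupied bins means the colors have drifted.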

RGB values from a sample image as points in 3D

Regarding each RGB value as a point in 3-dimensional space, we consider a clustering based on the need to encode each image pixel with one of 25 class ids.

Check out this Jupyter notebook to see how we used scikit-learn’s implementation of KMeans clustering to correct the colormap.
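To illustrate the idea without the notebook, here is a dependency-free sketch of k-means (Lloyd's algorithm) on perturbed RGB points. It is a stand-in for scikit-learn's KMeans, not the notebook's code. The three-color palette, the perturbation range, and the farthest-point initialisation are all assumptions chosen to keep the toy example small and deterministic:

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two RGB points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centers(points, k):
    """Farthest-point initialisation: deterministic, spreads centers apart."""
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(farthest)
    return centers

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def kmeans(points, k, n_iter=20):
    """Alternate assignment and mean-update steps, then return final labels."""
    centers = init_centers(points, k)
    for _ in range(n_iter):
        labels = assign(points, centers)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(ch) / len(members) for ch in zip(*members))
    return centers, assign(points, centers)

# Hypothetical 3-color palette with small random perturbations per channel.
rng = random.Random(0)
palette = [(200, 30, 30), (30, 200, 30), (30, 30, 200)]
truth, pixels = [], []
for class_id, color in enumerate(palette):
    for _ in range(50):
        pixels.append(tuple(ch + rng.randint(-4, 4) for ch in color))
        truth.append(class_id)

centers, labels = kmeans(pixels, k=len(palette))
```

Each cluster label then serves as the class id for its pixels. With scikit-learn installed, `KMeans(n_clusters=25).fit_predict(pixels)` from `sklearn.cluster` plays the same role for the full 25-class colormap.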

Ultimately, a more flexible implementation of pix2pix for image-to-image translation better serves our needs.

Motion Transfer with Body Regions, Not Key Points

Though we trained GANs on roughly 3–5K samples in our previous experiments, this experiment used only 1,600 images.

Here, we produce labeled data pairs using the tensorflow.js demo, which acquires samples at 20+ Hz. At that rate, 1,600 pairs take roughly 80 seconds to capture. We could have collected more, but we did not want to assume more patience than a typical user has.

Above, we see the GANs perform well in mapping person segmentation masks to renditions of our target dancer.

Finding Funny Faces

It’s noteworthy that the face renditions do not appear as realistic as those in similar experiments using pose estimation.

Considering the body part mapping, we find that person segmentation resolves 24 body parts, but the head comprises only 2 of them: the left and right face.

By comparison, pose estimation typically gives at least 5 facial keypoints in the head: the nose plus the left and right eyes and ears.

We consider blending both extraction methods for higher spatial resolution across the face AND body.
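One possible way to blend the two representations is to stamp facial keypoints onto the segmentation map as extra class ids. This is only a sketch under our own assumptions: the mask is a 2D list of class ids below 25, keypoints arrive as (row, col) positions, and the function name, id offsets, and square marker are all hypothetical choices:

```python
# Five facial keypoints named in the text; order fixes their extra class ids.
FACE_KEYPOINTS = ["nose", "left_eye", "right_eye", "left_ear", "right_ear"]
N_PART_CLASSES = 25  # class ids 0..24 are assumed taken by the segmentation map

def blend_keypoints(mask, keypoints, radius=1):
    """Return a copy of `mask` with a small square stamped at each keypoint.

    mask: 2D list of segmentation class ids.
    keypoints: dict mapping a keypoint name to its (row, col) position.
    """
    out = [row[:] for row in mask]  # copy so the input mask is untouched
    height, width = len(out), len(out[0])
    for i, name in enumerate(FACE_KEYPOINTS):
        if name not in keypoints:
            continue  # pose estimators may miss occluded keypoints
        r0, c0 = keypoints[name]
        for r in range(max(0, r0 - radius), min(height, r0 + radius + 1)):
            for c in range(max(0, c0 - radius), min(width, c0 + radius + 1)):
                out[r][c] = N_PART_CLASSES + i
    return out

# Toy 5x5 mask of background with a nose and a right ear detected.
mask = [[0] * 5 for _ in range(5)]
blended = blend_keypoints(mask, {"nose": (2, 2), "right_ear": (0, 0)}, radius=0)
```

The generator would then see 30 classes instead of 25, with the extra five concentrated on the face.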

Transfer Complete

For our source video, we simply pointed a web camera at a monitor playing the video. This re-capture lowers image quality. Nonetheless, we expect robust segmentation, considering the use of generated data in model training.

Running the trained model on our source video, we have the following motion transfer:

Applying Person Segmentation to Motion Transfer

The pix2pix model appears to overfit the small training corpus. Even so, hacking the demo to produce aligned image pairs for our GANs worked well as a proof of concept.

Person Segmentation in TFLite

With the proof of concept above, we show that person segmentation makes an interesting alternative representation for our motion transfer.

In the follow-up to this article, we consider two approaches to getting more person segmentation data:

First, we consider hacking the demo to process the video assets with JavaScript rather than the webcam feed.
Alternatively, we fine-tune a similar segmentation model by applying pose estimation and segmentation models to generate training samples.

Thanks for checking out the experiment!