Everybody Dance Now with BodyPix

Applying Person Segmentation to Motion Transfer

Terry Rodriguez
Nov 7

In collaboration with Salma Mayorquin

In our previous post, we discussed strategies to shorten the time taken to train a ‘do-as-I-do’ motion transfer model.

Recall that in order to perform motion transfer with pix2pix models, researchers first extracted pose estimates from source and target videos.

Now we want to consider BodyPix person segmentation as an alternative representation to pose estimates.

Ultimately, the official browser demo was the most practical way to access the person segmentation model.

Snapping BodyPix

With the addition of some JavaScript to write images to cloud storage, we could acquire raw images and rainbow-colored segmentation masks at 20+ Hz using our browser.

Sample Person Segmentation Mask

Interestingly, after retrieving some of these images, we found the number of unique RGB values in an example segmentation map to be much greater than the expected 25 values.

Inspecting the image, we find RGB values like: [3, 207, 200], [4, 210, 202], [5, 210, 195], … It seems the colormap has undergone some small random perturbations!
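A quick way to verify this kind of drift is to count the distinct RGB triples in a mask. The sketch below uses a tiny synthetic mask as a stand-in for our captures: one nominal part color plus small random perturbations, mimicking what we observed.

```python
import numpy as np

# Synthetic stand-in for a captured mask: one nominal part color
# jittered per pixel, mimicking the perturbed colormap we observed.
rng = np.random.default_rng(0)
nominal = np.array([3, 207, 200])
mask = nominal + rng.integers(-3, 4, size=(16, 16, 3))

# Flatten to RGB triples and count the distinct colors.
pixels = mask.reshape(-1, 3)
unique_colors = np.unique(pixels, axis=0)
print(unique_colors.shape[0])  # many distinct colors where we expected one
```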

Investigating further, we find that the histogram of colors for a grayscaled sample image is NOT sharply peaked around 25 discrete values.

Histogram of Pixel Values for Grayscale Sample

Rather, the histogram shows a spread that confirms our hypothesis, meaning we need to map the millions of RGB combinations in our corpus of masks down to 25 shades of gray.
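The smearing effect is easy to reproduce in grayscale. This sketch uses a random stand-in palette (not the actual BodyPix colormap) to show how small perturbations push the number of distinct gray levels well past 25:

```python
import numpy as np

# Stand-in 25-color palette and per-pixel class ids (not the real colormap).
rng = np.random.default_rng(1)
palette = rng.integers(0, 256, size=(25, 3))
labels = rng.integers(0, 25, size=(64, 64))
clean = palette[labels]
perturbed = np.clip(clean + rng.integers(-4, 5, size=clean.shape), 0, 255)

def gray_levels(img):
    # Simple grayscale: per-pixel channel mean, rounded to an integer.
    gray = img.mean(axis=-1).round().astype(int)
    return np.unique(gray).size

# A clean palette yields at most 25 gray levels; perturbation smears them.
print(gray_levels(clean), gray_levels(perturbed))
```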

This is because the reference implementation we first experimented with was quite specialized to motion transfer: it encoded each pixel as one of 25 shades of gray, as you would for a label map derived from pose estimation.

RGB values from a sample image as points in 3D

Regarding each RGB value as a point in 3-dimensional space, we can cluster the colors so that each pixel is encoded with one of 25 class ids.

Check out this Jupyter notebook to see how we used scikit-learn’s implementation of KMeans clustering to correct the colormap.

Another, more general implementation of pix2pix for image-to-image translation did not suffer this shortcoming. Ultimately, this reference provides more flexibility for our needs.

Motion Transfer with Body Regions, Not Key Points

Though in our previous experiments, we trained GANs on roughly 3–5K samples, this experiment used 1,600 images.

That is because we produce labeled data pairs using the TensorFlow.js-based demo. Although it acquires samples at 20+ Hz, so 1,600 frames takes only about 80 seconds to capture, we did not want to relax our assumptions about a typical user’s patience.

Renderings Generated during Training

Above, we see the GANs perform well in mapping person segmentation masks to renditions of our target dancer.

Though not a controlled experiment, it’s noteworthy that the face renditions do not appear as realistic as similar experiments using pose estimation.

Considering the body part mapping, we find that although person segmentation resolves 24 body parts, the head comprises only 2 of them: the left and right face.

By comparison, pose estimation typically provides at least 5 keypoints in the head: the nose plus the left/right eyes and ears.

We may consider blending both extraction methods for higher spatial resolution across the face AND body if improvements in rendition quality justify the additional processing.

For our source video, we simply pointed a web camera at a monitor playing the video. Nonetheless, we expect the segmentation to remain robust, considering the use of generated data in model training.

Running the trained model on our source video, we have the following motion transfer:

Applying Person Segmentation to Motion Transfer

The pix2pix model appears overfit to the small training corpus. Hacking the demo to produce aligned image pairs for our GANs worked well as a proof of concept. But we want to get more data, and faster, so we explore ways to perform person segmentation outside of the browser.

To overcome the limitations of the demo, we will focus on developing our own model to perform person segmentation.

Person Segmentation in TFLite

With the proof of concept above, we show that person segmentation makes an interesting alternative representation for our motion transfer.

To go further in this analysis, we will want a version of the model running locally using tflite so we can leverage the hardware acceleration we described in our previous post using the Coral DevBoard.

In the follow up to this article, we consider 2 approaches to approximating the BodyPix person segmentation model:

  • First, we might attempt to record many input/output pairs to distill the hosted models, as the researchers did with the larger, original model. To avoid overfitting to biases imposed by our webcam capture setup, we consider using WebRTC to process downloaded mp4s.
  • Alternatively, we can leverage open datasets framing human figures and apply pose estimation and segmentation models to generate training samples to fine-tune a segmentation model.
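The first approach amounts to a simple data-collection loop. In the sketch below, `read_frames` and `teacher_segment` are hypothetical stand-ins for decoding an mp4 and querying the hosted BodyPix model; the point is only the shape of the distillation pipeline, one (frame, mask) training pair per frame.

```python
import numpy as np

def read_frames(n=4, h=8, w=8):
    # Hypothetical stand-in for decoding frames from a downloaded mp4.
    rng = np.random.default_rng(3)
    for _ in range(n):
        yield rng.integers(0, 256, size=(h, w, 3), dtype=np.uint8)

def teacher_segment(frame):
    # Hypothetical stand-in for the hosted teacher model: a brightness
    # threshold pretending to be a person mask.
    return (frame.mean(axis=-1) > 128).astype(np.uint8)

# Pair each frame with the teacher's mask to build a distillation corpus.
pairs = [(frame, teacher_segment(frame)) for frame in read_frames()]
print(len(pairs))  # one (frame, mask) training pair per frame
```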

Thanks for checking out the experiment!

Follow us on Twitter @smellslikeml or check out our blog for more updates.


Follow your nose, which knows the smell of ML. All you can sniff experiments and demos.
