Everybody Dance Now with BodyPix
Applying Person Segmentation to Motion Transfer
In collaboration with Salma Mayorquin
In our previous post, we discussed strategies to shorten the time taken to train a ‘do-as-I-do’ motion transfer model.
Recall that in order to perform motion transfer with pix2pix models, researchers first extracted pose estimates from source and target videos.
Now we want to consider BodyPix person segmentation as an alternative representation to pose estimates.
Ultimately, the tensorflow.js based demo was the best way to access the person segmentation model.
Interestingly, after retrieving some segmentation maps from the demo, we found the number of unique RGB values in an example map to be much greater than the expected 25.
Inspecting the image, we find RGB values like: [3, 207, 200], [4, 210, 202], [5, 210, 195], … It seems the colormap has undergone some small random perturbations!
Investigating further, we find the histogram of colors for a grayscaled sample image is NOT sharply peaked around 25 discrete values.
Rather the histogram shows a spread that confirms our hypothesis, meaning we need to map the millions of RGB combinations in our corpus of masks down to 25 shades of gray.
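The effect is easy to confirm numerically. Below is a minimal sketch using a synthetic two-color mask in place of a downloaded one; the colormap entry [3, 207, 200] is one we observed above, and the ±2 jitter per channel is our stand-in for whatever perturbation the demo applies:

```python
import numpy as np

def count_unique_colors(rgb: np.ndarray) -> int:
    """Count distinct RGB triples in an H x W x 3 segmentation map."""
    return len(np.unique(rgb.reshape(-1, 3), axis=0))

# Simulate the perturbation on a synthetic two-color mask; [3, 207, 200]
# is one of the colormap entries we observed above.
rng = np.random.default_rng(0)
clean = np.zeros((64, 64, 3), dtype=np.int16)
clean[:, 32:] = [3, 207, 200]
noisy = np.clip(clean + rng.integers(-2, 3, size=clean.shape), 0, 255)

print(count_unique_colors(clean))  # 2
print(count_unique_colors(noisy))  # far more than 2
```

Even this tiny amount of per-channel noise multiplies two colors into dozens, which is exactly the spread we see in the histogram.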
This is because the reference implementation we first experimented with was quite specialized to motion transfer: it encoded each pixel as one of 25 shades of gray, as you would for a segmentation mask derived from pose estimation.
Regarding each RGB value as a point in 3-dimensional space, we cluster the colors so that each image pixel can be encoded with one of 25 class ids.
Check out this Jupyter notebook to see how we used scikit-learn’s implementation of KMeans clustering to correct the colormap.
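The correction can be sketched as follows, assuming the masks are stacked into a single array (the function name and array layout here are our own illustration, not taken from the notebook):

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_colormap(masks: np.ndarray, n_classes: int = 25) -> np.ndarray:
    """Collapse perturbed RGB values in segmentation maps to class ids.

    masks: (N, H, W, 3) uint8 array of RGB segmentation maps.
    Returns an (N, H, W) array of class ids in [0, n_classes), consistent
    across all N maps because the pixels are clustered jointly.
    """
    pixels = masks.reshape(-1, 3).astype(np.float32)
    labels = KMeans(n_clusters=n_classes, n_init=10,
                    random_state=0).fit_predict(pixels)
    return labels.reshape(masks.shape[:-1])
```

The class ids can then be rendered as evenly spaced shades of gray, e.g. `(ids * 255 // (n_classes - 1)).astype(np.uint8)`, which is the encoding the motion transfer reference implementation expects.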
Another, more general implementation of pix2pix for image-to-image translation did not suffer this shortcoming. Ultimately, this reference provides more flexibility for our needs.
Motion Transfer with Body Regions, Not Key Points
Though in our previous experiments we trained GANs on roughly 3–5K samples, this experiment used 1,600 images.
That is because we produce labeled data pairs using the tensorflow.js based demo, which acquires samples at 20+ Hz; we did not want to relax our assumptions about a typical user’s patience, and at that rate 1,600 frames amount to under a minute and a half of recording.
Above, we see the GANs perform well in mapping person segmentation masks to renditions of our target dancer.
Finding Funny Faces
Though not a controlled experiment, it’s noteworthy that the face renditions do not appear as realistic as similar experiments using pose estimation.
By comparison, pose estimation typically finds at least 5 keypoints on the head: the nose, plus the left/right eyes and ears.
We may consider blending both extraction methods for higher spatial resolution across the face AND body if improvements in rendition quality justify the additional processing.
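If we do pursue that blending, one cheap way to combine the two representations is to stamp head keypoints onto the class-id mask as additional classes, giving the GAN sharper landmarks around the face. A sketch, where the function and the class-id convention are our own assumptions:

```python
import numpy as np

def overlay_keypoints(mask: np.ndarray, keypoints, radius: int = 2,
                      base_classes: int = 25) -> np.ndarray:
    """Stamp pose keypoints onto a class-id segmentation mask.

    mask: (H, W) array of class ids in [0, base_classes).
    keypoints: iterable of (row, col) pixel coordinates, one per landmark.
    Each keypoint gets its own class id above base_classes.
    """
    out = mask.copy()
    h, w = mask.shape
    for i, (r, c) in enumerate(keypoints):
        rows = slice(max(r - radius, 0), min(r + radius + 1, h))
        cols = slice(max(c - radius, 0), min(c + radius + 1, w))
        out[rows, cols] = base_classes + i  # one new class per keypoint
    return out
```

Giving each landmark a distinct class keeps the representation a single label map, so the pix2pix pipeline needs no changes downstream.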
For our source video, we simply pointed a web camera at a monitor playing the video. Nonetheless, we expect the segmentation to remain robust, considering the use of generated data in training the model.
Running the trained model on our source video, we have the following motion transfer:
The pix2pix model appears overfit to the small training corpus. Hacking the demo to produce aligned image pairs for our GANs worked well as a proof of concept. But we want to get more data, and faster, so we explore ways to perform person segmentation outside of the browser.
To overcome the limitations of the demo, we will focus on developing our own model to perform person segmentation.
Person Segmentation in TFLite
With the proof of concept above, we show that person segmentation makes an interesting alternative representation for our motion transfer.
To go further in this analysis, we will want a version of the model running locally using tflite so we can leverage the hardware acceleration we described in our previous post using the Coral DevBoard.
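As a preview of what running locally looks like, here is a sketch of single-frame inference with the tflite_runtime Interpreter; the function name is ours, and float32 NHWC input is an assumption that depends on how the model is converted:

```python
import numpy as np

def run_tflite_segmenter(model_path: str, frame: np.ndarray) -> np.ndarray:
    """Run one RGB frame through a converted .tflite segmentation model.

    frame: (H, W, 3) array; float32 NHWC model input is assumed here.
    """
    # Deferred import: tflite_runtime is only installed on the device.
    from tflite_runtime.interpreter import Interpreter

    interpreter = Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    interpreter.set_tensor(inp["index"], frame[np.newaxis].astype(np.float32))
    interpreter.invoke()
    return interpreter.get_tensor(out["index"])[0]
```

The same pattern works with `tf.lite.Interpreter` on a development machine, which makes it easy to prototype before deploying to the DevBoard.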
In the follow up to this article, we consider 2 approaches to approximating the BodyPix person segmentation model:
- First, we might attempt to record many input/output pairs to distill the hosted models, much as the researchers distilled the larger, original model. To avoid overfitting to biases our webcam imposes on the training data, we consider using WebRTC to process downloaded mp4s.
- Alternatively, we can leverage open datasets framing human figures, applying pose estimation and segmentation models to generate training samples for fine-tuning a segmentation model.
Thanks for checking out the experiment!