Introducing BodyPix: Real-time Person Segmentation in the Browser with TensorFlow.js
Posted by: Dan Oved, graduate student and researcher at ITP, NYU. Tyler Zhu, engineer at Google Research. Editing: Irene Alvarado, creative technologist at Google Creative Lab
We are excited to announce the release of BodyPix, an open-source machine learning model which allows for person and body-part segmentation in the browser with TensorFlow.js. With default settings, it estimates and renders person and body-part segmentation at 25 fps on a 2018 15-inch MacBook Pro, and 21 fps on an iPhone X.
What exactly is person segmentation? In computer vision, image segmentation refers to the technique of grouping pixels in an image into semantic areas typically to locate objects and boundaries. The BodyPix model is trained to do this for a person and twenty-four body parts (parts such as the left hand, front right lower leg, or back torso). In other words, BodyPix can classify the pixels of an image into two categories: 1) pixels that represent a person and 2) pixels that represent background. It can further classify pixels representing a person into any one of twenty-four body parts.
This might all make more sense if you try a live demo here.
What can be done with person segmentation? This technology can be a tool for a wide range of applications from augmented reality to photography editing to artistic effects on images or videos. It’s up to you! When we released PoseNet last year (the first model ever that allowed for body estimation (what a Kinect does) in the browser using simple webcams), people came up with a wide variety of use cases for the technology. We hope the same kind of creative experimentation happens with BodyPix.
Why would you want to do this in the browser? Similar to the case of PoseNet, real-time person segmentation was only possible to do before with specialized hardware or hard-to-install software with steep system requirements. Instead both BodyPix and PoseNet can be used without installation and just a few lines of code. You don’t need any specialized lenses to use these models — they work with any basic webcam or mobile camera. And finally users can access these applications by just opening a url. Since all computing is done on device, the data stays private. For all these reasons, we think BodyPix is easily accessible as a tool for artists, creative coders, and those new to programming.
Before we do a deep dive of BodyPix, we want to thank Tyler Zhu from Google Research who’s behind the model and working on human body pose estimation [1, 2], Nikhil Thorat and Daniel Smilkov, the engineers on the Google Brain team behind the TensorFlow.js library, and Daniel Shiffman and the Google Faculty Research Award for helping to fund Dan Oved’s work on this project, and Per Karlsson from Google Research for rendering the simulation dataset.
Getting Started with BodyPix
Let’s dive into the technical details of using this model. BodyPix can be used to segment an image into pixels that are and are not part of a person. Those pixels that are part of a person can be further classified into one of twenty-four body parts. Importantly, the model works for a single person so make sure your input data does not contain multiple people.
Part 1: Importing the TensorFlow.js and BodyPix Libraries
Let’s go over the basics of how to set up a BodyPix project.
The library can be installed with: npm install @tensorflow-models/body-pix and then imported using es6 modules:
Or via a bundle in the page, with nothing to install:
Part 2a: Person Segmentation
At a basic level, person segmentation segments an image into pixels that are part of a person and those that are not. Under the hood, after an image is fed through the model, it gets converted into a two-dimensional image with float values between 0 and 1 at each pixel indicating the probability that the person exists in that pixel. A value called the “segmentation threshold” represents the minimum value a pixel’s score must have to be considered part of a person. Using the segmentation threshold, those 0–1 float values become binary 0s or 1s (ie. if the threshold is 0.5, any values over 0.5 are converted to 1s and any values below 0.5 are converted to 0s).
We call the API method estimatePersonSegmentation to perform person segmentation on an image or video; this short code block shows how to use it:
An example output person segmentation looks like:
For a full explanation of this method and its arguments, refer to the github readme.
Drawing the Person Segmentation Output
Another benefit of BodyPix being in the browser is that we have access to web APIs such as Canvas Compositing. We use this to mask or replace parts of images using the outputs from BodyPix. We’ve provided utility functions that wrap this functionality to get you started:
toMaskImageData
takes the output from estimating person segmentation and generates a transparent image that, depending on the argument maskBackground, will be opaque either where the person or background is. This can then be drawn as a mask on top of the original image using the method drawMask
:
drawMask
draws an image onto a canvas and draws with a specified opacity an ImageData containing a mask on top of it.
Part 2b: Body Part segmentation
Beyond the simple person/no-person classification, BodyPix can segment an image into pixels that represent any of twenty-four body parts. After an image is fed through the model, it gets converted into a two-dimensional image with integer values between 0 and 23 at each pixel indicating which of the 24 body parts that pixel belongs to. Pixels that are outside of the body have a value of -1.
We call the API method estimatePartSegmentation
to perform body part segmentation on an image or video; this short code block shows how to use it:
An example output body part segmentation looks like:
For a full explanation of this method and its arguments, refer to the github readme.
Drawing the Body Part Segmentation Output
With the output of estimatePartSegmentation, and an array of colors indexed by part id, toColoredPartImageData
generates an image with the corresponding color for each part at each pixel. This can then be drawn as a mask on top of the original image onto a canvas using the method drawMask
:
Refer to the github readme for detailed explanations of these methods and how to use them.
How to make BodyPix run faster or more accurately
The model size and output stride have the biggest effects on performance and accuracy — they can either be set to values that make BodyPix run faster but less accurately or more accurately but slower.
- The model size is set when loading the model, with the argument
mobileNetMultiplier
, and can have values of 0.25, 0.50, 0.75, or 1.00. The value corresponds to a MobileNet architecture and checkpoint. The larger the value, the larger the size of the layers, and more accurate the model at the cost of speed. Set this to a smaller value to increase speed at the cost of accuracy. - The output stride is an argument that is set when running segmentation, and can have values of 8, 16, or 32. At a high level, it affects the accuracy and speed of the pose estimation. The lower the value of the output stride the higher the accuracy but slower the speed, the higher the value the faster the prediction time but lower the accuracy.
Additionally, the size of the original image will also have an effect on the performance. Since after estimating segmentation the results are scaled to the original image size, the larger the original image the more computation needs to happen when scaling and drawing the results. To make things faster, try scaling down the image before feeding it to the api.
This is a great place to stop if you’re looking to start tinkering with BodyPix. For those curious to learn how we created the model, the next section goes into more technical detail.
The Making of BodyPix
BodyPix uses the convolutional neural network algorithm. We trained both a ResNet model and a MobileNet model. Although the ResNet based model is more accurate, this blog post concerns the MobileNet one which has been open-sourced for its ability to run efficiently on mobile devices and the standard consumer computers. For the MobileNet model, the traditional classification model’s last pooling and fully connected layer has been replaced by a 1x1 convolution layer in order to predict a dense 2D segmentation map. Below is an illustration of what happens when MobileNet is used to process an input image:
Person Segmentation
At the core of BodyPix is an algorithm that performs body segmentation — or, in other words, performs a binary decision for each pixel of an input image to estimate whether that pixel belongs to a person or not. Let’s go through how this works:
An image is fed through the MobileNet network, and the sigmoid activation function is used to transform the output to a value between 0 and 1, which can be interpreted as the probability of whether a pixel belongs to a person or not. A segmentation threshold (e.g. 0.5) converts this to a binary segmentation by determining the minimum value a pixel’s score must have to be considered part of a person.
Body Part Segmentation
To estimate the body part segmentation, we use the same MobileNet representation, but this time repeat the above process by predicting an additional twenty-four channel output tensor P where twenty-four is the number of body parts. Each channel encodes the probability of a body part being present or not.
Since for each image position, there are twenty-four channels in the output tensor P, we need to find the optimal part among the twenty-four channels. During inference, for each pixel position (u, v) of the output body part tensor P, we select the best body_part_id
with the highest probability using the following formula:
This results in a two-dimensional image (the same size as the original image), with each pixel containing an integer indicating which body part the pixel belongs to. The person segmentation output is used to crop the person from this full image by setting the values to -1 where the corresponding person segmentation output is less than the segmentation threshold.
Training with Real and Simulation Data
It is time-consuming to manually annotate a large amount of training data for the task of segmenting pixels in an image into twenty four body part regions. Instead, we internally use computer graphics to render images with ground truth body part segmentation annotations. To train our model, we mixed the rendered images and real COCO images (with 2D keypoints and instance segmentation annotations). Using the mixing training strategy and a multi-task loss, our ResNet model is able to learn the twenty four body part prediction capability purely from simulated annotations. Finally, we distilled the teacher ResNet model’s prediction on COCO images into our student MobileNet model for BodyPix.
Along with PoseNet, we think BodyPix will be another small step toward motion capture in-the-wild, locally, on a consumer device. There are still quite a few research problems yet to be fully solved: for instance capturing 3D body shape, high frequency soft tissue muscle motion, and detailed clothing appearance and its deformation. Although looking forward, we optimistically envision motion capture technology to be more accessible and useful in a broad range of applications and industries.
We’ve provided just a few examples and utility methods to help you get started with the BodyPix model and hope that it inspires you to tinker with the model. What will you create? We’d love to hear about it, so don’t forget to share your projects using #tensorflowjs, #bodypix, and #PoweredByTF.