Shadow Art: How TensorFlow powered the AI experiment for Lunar New Year 2019
Posted by Miguel de Andres-Clavera, Creative Technology Lead at Google
In an earlier blog post, we introduced Shadow Art, an AI experiment that celebrates the ancient art of Chinese shadow puppetry. The experiment uses TensorFlow.js to transform users’ hand shadows into digital animals in an interactive game.
In this blog post, we will discuss how we built Shadow Art using TensorFlow.js. All the code for the experience is open source and available on GitHub.
Introduction to Shadow Art
Shadow Art lets you try your hand (no pun intended!) at forming shadow puppets of the twelve zodiac animals of the lunar calendar in front of your laptop or phone camera. If your shadow puppet matches, it transforms into an animated image of the animal.
In September last year, we built an interactive real-world installation that used TensorFlow to help people explore shadow puppetry. For Lunar New Year, we decided to bring it online so that everyone could play. To accomplish this, we turned to TensorFlow.js.
Bringing Shadow Art online thanks to TensorFlow.js
Bringing this experience to the web required changes to the original offline Shadow Art experience.
First, the offline experience captured users’ hand data and processed it on a server: images were sent to be processed and stored server-side. With TensorFlow.js, the web experience loads everything into the browser once and runs the entire pipeline there: capturing hand data, processing it, performing inference, and displaying the results, along with the application’s other dependencies. The hardest part is the ML that classifies the hand data; after all, that’s where the magic lies, and with TensorFlow.js we can bring it to life.
The model compares an input image (user’s hand image) with a given set of classes as templates, to see which one is most similar. With this, we can freely add or remove image templates for each class, or even introduce new classes without retraining the model.
Effectively, the ML model learns to compare two images efficiently, using a residual network to transform a fixed-length contour into a fixed-dimension feature vector.
We compared the features extracted from the user’s hand images to those of the class examples using this metric: loss = -exp(-(x-y)·(x-y)), where x and y are the feature vectors obtained from the network. During training, we extracted the shadow contour from the image data, normalized it, and rotated each contour randomly before feeding it into the training pipeline.
For the comparison metric, we use the negative Gaussian because its bounded range prevents exploding gradients. Comparison is a matter of computing distance, and the first thought is to use the sum of squared differences. That function is not bounded at the limits, which can cause exploding gradients, but its exponential is bounded. That is why we use the exponential of the negative sum of squared differences, multiplied by another negative to pose it as a minimization problem.
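The metric above can be sketched in a few lines of plain JavaScript. This is an illustrative stand-alone version; in the real model the same computation runs over tensors inside the training graph.

```javascript
// Negative Gaussian comparison metric: loss = -exp(-||x - y||^2),
// where x and y are feature vectors from the residual network.
function comparisonLoss(x, y) {
  let sq = 0;
  for (let i = 0; i < x.length; i++) {
    const d = x[i] - y[i];
    sq += d * d;
  }
  // Bounded in (-1, 0]: identical vectors give -1 (the minimum),
  // distant vectors approach 0, so gradients cannot explode.
  return -Math.exp(-sq);
}
```

For identical vectors the loss is exactly -1, and it rises smoothly toward 0 as the vectors move apart, which is what makes minimizing it equivalent to pulling matching features together.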
The initial dataset contains many binary shadow projection images collected from our team members, which was a lot of fun. These images served both as training data for the ML model to learn how to compare and as templates to match against.
The images in the dataset varied in resolution. Because of this, either a dynamic model such as an RNN or a data preprocessing step is needed to transform each image into a fixed-dimension feature vector for direct comparison. High-variance models like RNNs, however, require more data, or we risk overfitting.
Data preprocessing, in our case contour extraction, transforms the image into a fixed-dimension feature vector; in other words, it prepares the data to feed into the residual network.
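The key step in this preprocessing is resampling the traced contour to a fixed number of points, so every shadow becomes a fixed-dimension input regardless of image resolution. A minimal sketch, with illustrative names rather than the actual Shadow Art code:

```javascript
// Resample a contour (an array of [x, y] points) to n points spaced
// evenly by arc length, yielding a fixed-dimension representation.
function resampleContour(points, n) {
  // Cumulative arc length along the polyline.
  const cum = [0];
  for (let i = 1; i < points.length; i++) {
    const dx = points[i][0] - points[i - 1][0];
    const dy = points[i][1] - points[i - 1][1];
    cum.push(cum[i - 1] + Math.hypot(dx, dy));
  }
  const total = cum[cum.length - 1];
  const out = [];
  let seg = 1;
  for (let k = 0; k < n; k++) {
    const target = (total * k) / n;
    // Advance to the segment containing the target arc length.
    while (seg < points.length - 1 && cum[seg] < target) seg++;
    const t = (target - cum[seg - 1]) / (cum[seg] - cum[seg - 1] || 1);
    out.push([
      points[seg - 1][0] + t * (points[seg][0] - points[seg - 1][0]),
      points[seg - 1][1] + t * (points[seg][1] - points[seg - 1][1]),
    ]);
  }
  return out;
}
```

Because every contour ends up with the same number of points, two shadows can be compared directly regardless of how large or detailed the original images were.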
We used TensorFlow with TPU support to train the model, then converted it for use on the web using TensorFlow.js.
Initially, we did server-side classification in our app; with a large number of users, heavy server load was expected. We ran several experiments to address the problem and found:
- The model could be used directly; no modification was needed.
- The ported model size was 10.7MB, which is acceptable.
- We run classification every time we detect that a user’s hand is still, which takes around one second, and the classification time is barely noticeable.
To classify users’ hand data, we perform one-shot classification (using only a small number of samples per class) with a modified version of the residual network that takes a fixed-length hand outline and infers an animal class from it.
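At inference time, one-shot classification against templates amounts to comparing the feature vector of the user’s hand with the stored template vectors and reporting the most similar class. A hedged sketch, with function and field names chosen for illustration rather than taken from the Shadow Art source:

```javascript
// Compare a feature vector against class templates using the
// Gaussian similarity exp(-||x - y||^2): 1 means identical,
// values near 0 mean very different.
function classify(feature, templates) {
  // templates: [{ label: "rabbit", vector: [...] }, ...]
  let best = { label: null, confidence: -Infinity };
  for (const t of templates) {
    let sq = 0;
    for (let i = 0; i < feature.length; i++) {
      const d = feature[i] - t.vector[i];
      sq += d * d;
    }
    const confidence = Math.exp(-sq);
    if (confidence > best.confidence) best = { label: t.label, confidence };
  }
  return best;
}
```

Because classification is just a comparison against this template list, adding or removing a class is a matter of editing the list, with no retraining involved.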
Although it is possible to train models in the browser with TensorFlow.js, we trained ours on a dedicated backend using TPUs. The pretrained model was then deployed in the browser with TensorFlow.js by storing the weights after training and loading them in the web app; no further training happens in the browser. TensorFlow.js uses a dynamic (eager) programming paradigm by default, which lets ideas be implemented and tested effortlessly in the browser.
To allow maximum control, we defined our own protocol to transfer the weights in one go, including how to encode and compress both the weights and the templates used for learning to compare, together with the extra data for the web app. The benefit is that we can build the training pipeline on any existing tensor library, not just TensorFlow: as long as we save the weights in the same format, we can download them into our web application thanks to TensorFlow.js.
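One common shape for such a single-file transfer format is a JSON header describing each tensor, followed by the raw Float32 payloads concatenated in order. The sketch below is an assumption about that general approach, not the project’s actual encoding (and it omits compression):

```javascript
// Pack named tensors into one blob: [4-byte header length][JSON header][payloads].
function packWeights(entries) {
  // entries: [{ name, shape, data: Float32Array }, ...]
  const header = JSON.stringify(
    entries.map((e) => ({ name: e.name, shape: e.shape, length: e.data.length }))
  );
  const headerBytes = Buffer.from(header, "utf8");
  const parts = [Buffer.alloc(4), headerBytes];
  parts[0].writeUInt32LE(headerBytes.length, 0);
  for (const e of entries) {
    parts.push(Buffer.from(e.data.buffer, e.data.byteOffset, e.data.byteLength));
  }
  return Buffer.concat(parts);
}

// Reverse the packing: read the header, then slice out each payload.
function unpackWeights(buf) {
  const headerLen = buf.readUInt32LE(0);
  const meta = JSON.parse(buf.slice(4, 4 + headerLen).toString("utf8"));
  let offset = 4 + headerLen;
  return meta.map((m) => {
    const data = new Float32Array(m.length);
    for (let i = 0; i < m.length; i++) data[i] = buf.readFloatLE(offset + 4 * i);
    offset += 4 * m.length;
    return { name: m.name, shape: m.shape, data };
  });
}
```

Keeping the format this simple is what makes the training side swappable: any library that can dump named float arrays can produce the blob.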
TensorFlow.js advantages over server-based method
- Responsiveness: Shadow classification is done client-side, giving users feedback in real time. With a server-based method, images have to be sent to the cloud, introducing delays in the classification results.
- Reduced bandwidth usage and dependency: Users’ hand images are not sent to a server, significantly reducing bandwidth usage. Furthermore, once the page is loaded, the app runs smoothly regardless of the user’s internet bandwidth.
- Reduced server load: Hand shadow classifications happen at a rapid rate, giving users real-time feedback on how well their hand matches the shadow template. Moving this task to the user’s device massively reduces server load.
- Simpler web hosting requirements: no need to set up a GPU-based cloud service for serving the model; a simple web hosting service suffices.
- Easier to scale: web hosting services are easy to set up and scale, since the hard parts are handled by the provider.
In-browser data processing
To get the hand contours in the browser, we use OpenCV.js to capture users’ hand images from the HTML5 <video> tag via the webcam and process them individually. For each image, a simple background subtraction is performed to separate foreground objects (including hands and some noise) from the background. While calibrating, the app collects images from the webcam to construct a base background for future subtraction.
After the subtraction, we process the image of the hands to clear out noise, including contour normalization and resampling, before drawing it back onto the app as a shadow.
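The background subtraction step reduces to a per-pixel comparison against the calibration background. The real app does this with OpenCV.js on webcam frames; this plain-array sketch just shows the idea, with an illustrative threshold:

```javascript
// Mark pixels that differ from the calibration background by more
// than a threshold as foreground (1), everything else as background (0).
// frame and background are arrays of grayscale values in [0, 255].
function subtractBackground(frame, background, threshold) {
  return frame.map((v, i) => (Math.abs(v - background[i]) > threshold ? 1 : 0));
}
```

The resulting binary mask is what gets cleaned up and traced into the contour used everywhere else in the pipeline.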
Based on preliminary tests, the entire inference pipeline including the preprocessing and classification takes less than a second to finish on an Android phone. This is remarkable considering the limited amount of resources utilized.
Integrating results into the AI experiment
Given the classification results, we determine whether a hand posture matches an animal shadow by applying a threshold to the confidence value returned from the model, rather than simply picking the class with the highest confidence. This is more intuitive: a human decides whether a hand shadow looks like a rabbit by gauging how much it resembles a rabbit, not by checking whether it looks more like a rabbit than like the other animals. This also lets us easily fine-tune the difficulty of the app so that it suits users around the world.
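The thresholding described above can be sketched as follows; the threshold value here is illustrative, and lowering it is exactly how the game could be made easier:

```javascript
// Accept the best class only if its confidence clears the threshold;
// returning null means "no animal matched well enough", unlike a
// plain argmax which would always pick something.
function matchAnimal(confidences, threshold = 0.6) {
  // confidences: { rabbit: 0.82, dragon: 0.41, ... }
  let best = null;
  for (const [animal, c] of Object.entries(confidences)) {
    if (c >= threshold && (!best || c > best.confidence)) {
      best = { animal, confidence: c };
    }
  }
  return best;
}
```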
Now that we know which animal each user is attempting, the next step is to link the user’s input with the result and transform the hand shadow into an animal figure. A pre-recorded animation of the shadow transforming into a live animal is then played to complete a single trial.
Morphing from the user’s hand shadow to the target animal is an awesome part of this AI experiment. To ensure the shape captured morphs seamlessly into the animal, we extract the contours from both shadows: the input hand’s and the target animal’s alike.
Then we optimize to find a proper match between each point in the hand shadow’s contour (source) and the points in the animal shadow’s contour (destination), and perform stepwise interpolation to transform the source contour into the destination contour.
We use dynamic time warping for this matching, so that distinctive features like the ears, as well as smaller details, are paired correctly.
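Dynamic time warping finds a monotonic alignment between source points (the hand) and destination points (the animal) that minimizes total distance, so a feature like an ear gets paired with its counterpart before interpolation begins. A simplified open-contour sketch, not the project’s actual implementation:

```javascript
// Classic DTW over two point sequences; returns the matched index pairs.
function dtwAlign(src, dst) {
  const n = src.length, m = dst.length;
  const cost = Array.from({ length: n + 1 }, () => new Array(m + 1).fill(Infinity));
  cost[0][0] = 0;
  const dist = (a, b) => Math.hypot(a[0] - b[0], a[1] - b[1]);
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      cost[i][j] =
        dist(src[i - 1], dst[j - 1]) +
        Math.min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]);
    }
  }
  // Backtrack along the cheapest predecessors to recover the alignment.
  const pairs = [];
  let i = n, j = m;
  while (i > 0 && j > 0) {
    pairs.push([i - 1, j - 1]);
    const moves = [
      [cost[i - 1][j - 1], i - 1, j - 1],
      [cost[i - 1][j], i - 1, j],
      [cost[i][j - 1], i, j - 1],
    ];
    moves.sort((a, b) => a[0] - b[0]);
    [, i, j] = moves[0];
  }
  return pairs.reverse();
}
```

Once the pairs are known, interpolating each source point toward its matched destination point produces the morph frames.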
Optimizations for the website
Fine control of animations
Videos played on websites are usually in .mp4 format. However, playing .mp4 on the web does not allow fine control of the animation, which we need in order to play animations right after the shadow morph for a seamless experience.
We converted the animations into PNG sequences. In each frame, a certain portion is selected and drawn on a canvas. Combined with hand shadow morphing, this lets us know precisely when to draw the synthetic morph and when to draw the premade frames: users see the hand morph into a shadow, which in turn transforms into a colored animal.
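Driving a PNG sequence by hand boils down to mapping elapsed time to a frame index. A tiny sketch with illustrative frame rate and count (the actual animations may differ):

```javascript
// Pick the frame to draw for the time elapsed since the morph finished,
// clamping at the last frame so the animation holds its final pose.
function frameForTime(elapsedMs, fps = 24, frameCount = 48) {
  const idx = Math.floor((elapsedMs / 1000) * fps);
  return Math.min(idx, frameCount - 1);
}
```

Because the app, not the browser's video pipeline, owns this mapping, the handoff from the morph to the animation can land on an exact frame.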
Pre-loading all animations enhances the user experience by reducing glitches when downloading data.
Download size is one of the most crucial aspects of making our web application accessible. The initial version required users to download around 200MB of data, so we performed various optimizations to reduce it.
- PNG size optimization
We first looked at which parts of the app consume the most bandwidth and found that the PNG sequences together totaled over 180MB. Converting the RGB PNGs to palette-based PNGs, which restrict colors to a defined set, reduced the file size by over 70%.
- Model template optimization
Our algorithm requires hand shadow templates for each animal to match the user’s hands against. As in the feature extraction process, each hand shadow is transformed into a contour, which is then extracted into a feature vector. Therefore, instead of storing hand templates as images, we directly store the extracted feature vector for each template, which also saves a lot of space.
Final thoughts and acknowledgments
Although the current model is limited to contour-like object classification, its applications extend beyond that. The core benefit of learning to compare is the ability to change classification objectives without retraining the model.
With TensorFlow.js, anyone can now build web applications that make this portable, bringing supervised learning to a more personal level. Having a ready-to-use object classification tool that anyone can use, or teach by simply providing a few class examples, is really exciting, and it makes it very easy for anyone to customize for a specific task.
We’d like to thank Kiattiyot Panichprecha, Isarun Chamveha, Phatchara Pongsakorntorn, Chatavut Viriyasuthee and Pittayathorn Nomrak for all their help while building this experiment, and we look forward to many more useful, creative and fun use cases built with the model and TensorFlow.js!