Mark Cuban would be the first to tell you that the Dallas Mavericks are not just in the business of sports but also in the business of entertainment. Fans may come out to see Luka Dončić hit three-pointers, but what they often remember most are the times their son dove for a T-shirt or their friend embarrassed themselves on the “dance cam”. Xpire’s new AI technology offers the Dallas Mavericks a new way to entertain fans.
The team at Xpire AI recently debuted our new video synthesis technology at the Dallas Mavericks game on March 13th. By learning the relationship between pose and real-world images, we are able to transfer motion between two people in a realistic way. Our algorithm, with the help of deep learning, can generate lifelike videos of Dallas Mavericks’ point guard, Jalen Brunson, doing a variety of popular dances for in-game entertainment. Given a short video of a person performing a range of motions, the algorithm learns how that person moves and can then generate footage of them performing motions of your choice. And yes: the same process could help you synthesize videos of your grandmother performing a perfectly choreographed “Thriller” dance routine!
We can think of the solution to this problem as a function: given a pose (a skeleton image), it should output a realistic image of our target person in that same pose. This turns out to be a challenging problem, because we cannot explicitly define rules for the task. For example, there is no inherent way to tell our program “if the stick figure has its right hand up, then output a realistic image where the target person also has their right hand up”. We would be writing rules at the pixel level, which would not generalize well from image to image, let alone person to person. Because of this, we turn to deep learning, a subset of machine learning.
Deep learning allows us to create an algorithm that learns by example. So instead of writing abstract rules, we can simply provide examples of what we want the algorithm to do and have it learn the rules for us. But in order to teach our network to generate images, we must first find a way to extract pose from the source video.
When training a deep learning algorithm (also called a neural network), it helps to remove unnecessary data from your inputs. In this case, most of what the source video contains is unimportant: extraneous information such as what the person looks like, where they are standing, or what they are wearing only makes it harder for the network to learn. All we care about is the position the person is in, so we need a way to extract only the pose from the video. To do this, we rely on a popular task in deep learning called Pose Estimation.
Pose Estimation is simply the task of locating specific body parts of a person in an image. For instance, in each frame of the video we can use our Pose Estimation neural network to determine exactly where the person’s wrists, elbows, shoulders, knees, etc. are located in the image. We can then connect these “joints” together to create a stick figure skeleton image, which can be used as the input to our network.
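Connecting joints into a skeleton image can be sketched in a few lines. The joint names and coordinates below are illustrative placeholders, not output from a real pose estimation model:

```python
# Sketch: render estimated 2-D joint locations as a stick-figure
# "skeleton image" for use as the network input. The keypoints here
# are hypothetical, standing in for real pose-estimation output.
from PIL import Image, ImageDraw

# Hypothetical keypoints (pixel coordinates) for one video frame.
KEYPOINTS = {
    "head": (64, 10), "neck": (64, 30),
    "r_shoulder": (44, 32), "r_elbow": (36, 55), "r_wrist": (30, 78),
    "l_shoulder": (84, 32), "l_elbow": (92, 55), "l_wrist": (98, 78),
    "hip": (64, 70),
    "r_knee": (52, 100), "r_ankle": (50, 125),
    "l_knee": (76, 100), "l_ankle": (78, 125),
}

# Pairs of joints to connect into "bones".
BONES = [
    ("head", "neck"), ("neck", "r_shoulder"), ("neck", "l_shoulder"),
    ("r_shoulder", "r_elbow"), ("r_elbow", "r_wrist"),
    ("l_shoulder", "l_elbow"), ("l_elbow", "l_wrist"),
    ("neck", "hip"), ("hip", "r_knee"), ("r_knee", "r_ankle"),
    ("hip", "l_knee"), ("l_knee", "l_ankle"),
]

def draw_skeleton(keypoints, size=(128, 128)):
    """Return a black image with the skeleton drawn as white lines."""
    img = Image.new("L", size, 0)
    draw = ImageDraw.Draw(img)
    for a, b in BONES:
        draw.line([keypoints[a], keypoints[b]], fill=255, width=2)
    return img

skeleton = draw_skeleton(KEYPOINTS)
```

Repeating this for every frame of a video yields the sequence of skeleton images the network will consume.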
Training the Model
The model used in this project is a special combination of neural networks called a Deep Convolutional Generative Adversarial Network (DCGAN). This network is conditioned on an input image, allowing it to learn the mapping between an input image and an associated output image. While the exact architecture of the model is a bit out of scope for this article, just know that a GAN is made up of two different networks. While one network generates images (the generator), the other learns to tell the difference between real images and generated images (the discriminator). These networks go back and forth generating and discriminating until the generator can create images realistic enough to fool the discriminator, or at least come close.
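The adversarial back-and-forth can be illustrated with a deliberately tiny example: a one-parameter “generator” that shifts noise toward real data, versus a logistic “discriminator” that tries to tell the two apart. This is not our DCGAN (which operates on images); it is a minimal sketch of the alternating updates only:

```python
# Toy GAN on 1-D data: the generator G(z) = z + theta learns to shift
# noise so it matches real samples centered at REAL_MEAN, while the
# discriminator D(x) = sigmoid(w*x + b) learns to tell them apart.
import numpy as np

rng = np.random.default_rng(0)

REAL_MEAN = 3.0
theta = 0.0          # generator parameter
w, b = 0.0, 0.0      # discriminator parameters

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.05
for step in range(3000):
    real = rng.normal(REAL_MEAN, 0.5, size=32)
    z = rng.normal(0.0, 0.5, size=32)
    fake = z + theta

    # Discriminator: gradient ascent on log D(real) + log(1 - D(fake)).
    s_r = sigmoid(w * real + b)
    s_f = sigmoid(w * fake + b)
    w += lr * np.mean((1 - s_r) * real - s_f * fake)
    b += lr * np.mean((1 - s_r) - s_f)

    # Generator: gradient ascent on log D(fake) (non-saturating loss).
    s_f = sigmoid(w * (z + theta) + b)
    theta += lr * np.mean((1 - s_f) * w)
```

After training, `theta` has been pushed close to `REAL_MEAN`: the generator's fakes have become statistically hard for the discriminator to reject, the same dynamic that drives image-generating GANs.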
As mentioned earlier, we first need to feed our network many examples of (skeleton image, realistic image) pairs so that it can learn how to generate a realistic frame when given an input pose. To do this, we can record a short training video of our target person (in this example, Jalen Brunson) performing a variety of basic body movements. Next, we can extract the pose from every frame of our training video using our Pose Estimation algorithm to generate a skeleton image of each pose. Now that we have pairs of (skeleton image, realistic image), we can train our network to map the input images to the output images. After training for many hours on a modern GPU, we are left with a model that can effectively map input pose images to realistic output images!
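The data-preparation step above amounts to pairing each real frame with its extracted skeleton. In this sketch, `estimate_pose_skeleton` is a hypothetical placeholder for a real pose estimation model:

```python
# Sketch of assembling (skeleton image, real frame) training pairs.
import numpy as np

def estimate_pose_skeleton(frame):
    """Placeholder: a real implementation would run pose estimation
    and render the detected joints as a stick-figure image."""
    return np.zeros_like(frame)  # blank skeleton of the same shape

def build_training_pairs(frames):
    """Pair each extracted skeleton image with its real frame."""
    return [(estimate_pose_skeleton(f), f) for f in frames]

# Stand-in for a decoded training video: 100 tiny grayscale frames.
video_frames = [np.random.rand(64, 64) for _ in range(100)]
pairs = build_training_pairs(video_frames)
```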
Now that we have a working generator, we can generate videos of Jalen doing anything! To generate a new video, we first must locate a source video. For instance, if we want to make Jalen dance like Bruno Mars we can upload a Bruno Mars video and then run each frame through our Pose Estimation network to generate stick frames. Then we can simply pass these stick frames as an input to our trained generator to create realistic frames of Jalen in the same position! All that’s left to do is stitch the frames together into a video and we’re done. The beauty is that the neural network can generate never-before-seen frames! This means that while the training video may have only had examples of Jalen with his arms up or arms down, the network is able to generate frames of him waving his arms around, doing “jazz hands”, etc.
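The full generation pipeline follows the same shape: extract a skeleton from each source frame, map it through the trained generator, and stitch the results into a clip. Both models are hypothetical stubs in this sketch; only the structure of the loop is the point:

```python
# Sketch of the inference pipeline: source frames -> skeletons ->
# generated frames -> stitched output clip.
import numpy as np

def estimate_pose_skeleton(frame):
    """Placeholder for the pose-estimation step."""
    return np.zeros_like(frame)

def generator(skeleton):
    """Placeholder for the trained generator network."""
    return np.ones_like(skeleton)  # pretend "realistic" output frame

def synthesize_video(source_frames):
    output = []
    for frame in source_frames:
        skeleton = estimate_pose_skeleton(frame)
        output.append(generator(skeleton))
    return np.stack(output)  # (num_frames, H, W) array, ready to encode

source = [np.random.rand(64, 64) for _ in range(30)]
clip = synthesize_video(source)
```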
Though some fuzziness and inconsistencies exist in the videos, we have been able to increase performance by implementing pose normalization, among other techniques. Also, the more training data we use, the better the output becomes! In this example we used only a three-minute training video, but you can imagine the improvements we can get when using a 20+ minute training video.
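Pose normalization matters because the dancer in the source video and the target person rarely have the same height or stand in the same part of the frame. One common heuristic is to rescale and translate the source keypoints so the head and ankle lines match the target's; the sketch below uses that heuristic, which may differ from the exact scheme used in production:

```python
# Sketch of pose normalization: map source keypoints into the target's
# scale and vertical position by matching head and ankle heights.
import numpy as np

def normalize_pose(src_pts, src_head_y, src_ankle_y,
                   tgt_head_y, tgt_ankle_y):
    """Rescale/translate source keypoints (N, 2) to the target's frame."""
    scale = (tgt_ankle_y - tgt_head_y) / (src_ankle_y - src_head_y)
    pts = src_pts.astype(float).copy()
    # Scale vertically about the source's ankle line, then shift so the
    # ankles land on the target's ankle line.
    pts[:, 1] = (pts[:, 1] - src_ankle_y) * scale + tgt_ankle_y
    pts[:, 0] *= scale  # keep body proportions horizontally too
    return pts

src = np.array([[50, 20], [50, 120]])        # head at y=20, ankle at y=120
out = normalize_pose(src, 20, 120, 60, 110)  # shorter target, lower in frame
```

Here the source skeleton is scaled to half its height and shifted down so its ankles sit on the target's ankle line, keeping the generated person's proportions plausible.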
Thanks for checking out this article and as always, please email firstname.lastname@example.org with any questions or comments you may have!
Jesse Stauffer (Co-Founder / CEO)