Remixing Motion Capture Data With Conditional Variational Autoencoders

Sam Snider-Held
Jul 6, 2020


One of our ML Assisted Animations, SIMULATION: PURE THOUGHT

Nowadays, if you start mentioning machine learning based creativity, you’ll immediately hear something like “It’s only a matter of time before creatives will be out of work.”

But our creative teams at MediaMonks are made up of world-class talent, so why would we want to automate them away? Instead, we are interested in augmenting their creative powers.

While advancements in creative machine learning are being made all the time, practical tools seem few and far between. How do we integrate this technology into our creative process now? How would it change the way we create, and what would this new creative pipeline look like?

These are some of the questions we set out to answer on our R&D path towards machine learning based animation tools. Our goal was to create a simple end-to-end ML based animation pipeline. We wanted to see how it would work, and how much creative control we would have over the output. Our first ML based animations demonstrate successful attempts at using neural networks to procedurally generate dancing animations given an audio track.

The pipeline is built around a conditional variational autoencoder (CVAE). The input data is depth data of individual dance poses, and the conditional data is the centroid of each depth pose.

The trained CVAE allows us to interpolate through the trained continuous latent space, approximating the effect of animation. The power of this approach is using the neural network to generate and remix animation data given novel music data. An animation dataset with a large enough sample size can be used to “dance” to any music.

The same trained CVAE model used to generate another animation based off a Chopin piece.

The neural network architecture and output aesthetic are purely functional. Our architecture is not the only way or even the best way to do this. Instead it fulfills the generative requirement of our pipeline.

Likewise, the hi-bit pixel art aesthetic is something that works well with the output of the neural network, and is easy to create in our rendering environment, Unity. Technically it’s not even pixel art, but downsampled renders of the 3D environment.

You can find a basic implementation at this GitHub link.

Step 1: Generating Data With Unity

The interface for our custom Unity data generation tool.

Deep Neural Network approaches live and die by the datasets supplied for a given task. But when it comes to generative ML tasks, the availability of diverse and clean datasets is generally lacking.

Therefore, we needed to figure out how to generate the type of data we were looking for: animation data.

There are plenty of animated 3D models and motion capture clips out there, but as of this writing there is more ML research on 2D images than on 3D rigs and models. So we wanted to figure out how we could get a large 2D animation dataset.

This is where a 3D Game Engine like Unity came in. With a relatively small set of motion capture animations (57 clips) purchased off the Unity Asset Store, we were able to generate around 300,000 pose samples.

Furthermore, we used Unity’s depth buffer to output depth maps instead of rendered images. The depth shader implementation is simple and was found on this site: http://www.shaderslab.com/demo-50---grayscale-depending-zbuffer.html

This allowed us to convert the 2D images back into 3D space once animations were generated by the neural network. This Unity-based approach also gave us fine control over our data so that we didn't generate poor samples.
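In our pipeline that reconversion happens inside Unity, but the idea is simple enough to sketch with NumPy and OpenCV. The depth_scale factor and the background threshold below are illustrative assumptions, not the values we actually used:

import numpy as np
import cv2

def depth_image_to_points(path, depth_scale=10.0):
    # Convert a grayscale depth render into a rough 3D point cloud.
    # A pixel keeps its screen position (x, y); its brightness becomes depth (z).
    depth = cv2.imread(path, cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255.0
    h, w = depth.shape
    ys, xs = np.nonzero(depth > 0.01)  # drop empty background pixels (assumed threshold)
    zs = depth[ys, xs] * depth_scale   # depth_scale maps gray values to world units
    # Center the figure at the origin and flip y so up is positive.
    points = np.stack([(xs - w / 2) / w, (h / 2 - ys) / h, zs], axis=1)
    return points  # (N, 3) array, ready for a point cloud renderer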

The overall process is as follows:

  1. Find motion capture data or animation data that resembles the type of dancing you are looking for, e.g. hip hop, ballet, or both.
  2. Create a depth shader so that the Unity Camera’s Depth Buffer is rendered to screen.
  3. Throw all your animation clips into a Unity Animator Controller and render out frames according to the number of samples you want to generate.
  4. Process the samples and create the conditional centroid data using Python, OpenCV, and NumPy (a sketch of the centroid step follows below).
Samples generated from Unity
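The centroid step in particular is only a few lines. Here is a minimal sketch of how it could look, assuming the depth renders are saved as grayscale PNGs over a black background; the threshold value is a placeholder:

import cv2
import numpy as np

def pose_centroid(depth_png, threshold=10):
    # Normalized (x, y) centroid of the figure in one depth pose sample.
    img = cv2.imread(depth_png, cv2.IMREAD_GRAYSCALE)
    _, mask = cv2.threshold(img, threshold, 255, cv2.THRESH_BINARY)
    m = cv2.moments(mask, binaryImage=True)
    if m["m00"] == 0:                  # empty frame: no figure visible
        return np.array([0.5, 0.5], dtype=np.float32)
    cx = m["m10"] / m["m00"]
    cy = m["m01"] / m["m00"]
    h, w = img.shape
    return np.array([cx / w, cy / h], dtype=np.float32)  # scaled to [0, 1]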

For this project we used motion capture only. For our input animations we used the Dance MoCap Collection from Morro Motion. To demonstrate generalization we also used motion capture from Mixamo.

Once we generated all the samples, we processed them and got them ready for our neural network.

Step 2: The Conditional Variational Autoencoder

As far as generative neural networks go, variational autoencoders are pretty simple. You can easily find basic examples in whatever framework you are building your neural networks in.

Our model definition is here

The conditional data in our neural network architecture is the centroid of each pose sample, a two-dimensional (x, y) coordinate for the pose's center of mass. It is concatenated with the input sample going into the encoder, and with the latent vector going into the decoder network.

The conditional data primarily helps with reconstruction of animations across the x axis of the image. A vanilla VAE will work, but without the conditional data, we found that the dancers would often teleport instead of animating across the screen.
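To make the conditioning concrete, here is a minimal sketch of such a model in PyTorch. The framework, layer sizes, and input resolution are placeholders rather than what's in our repo, but the concatenation pattern is the one described above:

import torch
import torch.nn as nn
import torch.nn.functional as F

IMG_DIM = 64 * 64     # flattened depth pose (placeholder resolution)
COND_DIM = 2          # (x, y) centroid condition
LATENT_DIM = 32       # placeholder latent size

class CVAE(nn.Module):
    def __init__(self):
        super().__init__()
        # The encoder sees the flattened pose concatenated with its centroid.
        self.enc = nn.Sequential(nn.Linear(IMG_DIM + COND_DIM, 512), nn.ReLU())
        self.mu = nn.Linear(512, LATENT_DIM)
        self.log_var = nn.Linear(512, LATENT_DIM)
        # The decoder sees the latent vector concatenated with the same centroid.
        self.dec = nn.Sequential(
            nn.Linear(LATENT_DIM + COND_DIM, 512), nn.ReLU(),
            nn.Linear(512, IMG_DIM), nn.Sigmoid())

    def encode(self, x, c):
        h = self.enc(torch.cat([x, c], dim=-1))
        return self.mu(h), self.log_var(h)

    def decode(self, z, c):
        return self.dec(torch.cat([z, c], dim=-1))

    def forward(self, x, c):
        mu, log_var = self.encode(x, c)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterization
        return self.decode(z, c), mu, log_var

def cvae_loss(recon, x, mu, log_var):
    # Reconstruction term plus KL divergence against the unit Gaussian prior.
    bce = F.binary_cross_entropy(recon, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return bce + kld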

It’s important to note that pure interpolations between latent space vectors lead to some trippy and unrealistic animations, but this worked within the creative scope of the research project. Our dances also copy segments from the initial dataset, with the neural network animating between two different clips.

In this sense the neural network remixes the dataset as well as generates dances. About 20%-30% of our final animation is copied reconstructions from the dataset, and the rest is fully generative or stitches the two types together.

We trained the neural network using a Google Compute Engine n1-standard-8 instance (8 vCPUs, 30 GB memory) and an NVIDIA Tesla P100. Training with this setup and dataset size took roughly three days.

We also trained a model on 100,000 samples taken from 10 animation clips from Mixamo. The training time was about a day, at the cost of worse pose reconstructions and a less continuous latent space. This test does demonstrate the ability to generate different dance styles given different datasets.

Animation generated from Unity Asset Store mocap.
Animation generated from Mixamo animation.

Step 3: Generating Dances With a CVAE

The VAE approximates animation by linearly interpolating between two points in a continuous latent space representation. Our animations feature interpolations that “dance” to the music.

But we didn’t attempt to train a network that would learn how to dance to music. While this has been demonstrated in various papers, we wanted to go for a simpler approach and use traditional audio analysis tools.

In a sense, it’s a music visualizer with a neural network attached to the output. The basic approach is as follows:

  1. Analyze the track for beats.
  2. Take the timestamps of two beats and calculate the number of frames you need between them given your desired framerate (our framerate was 30 fps).
  3. Pick two random samples from the dataset and encode them into latent vectors.
  4. Interpolate between those vectors over the calculated number of frames.
  5. Repeat this process for every beat pair in the track.

We used Librosa as our audio analysis tool and onset detection for our beat tracking. To generate different moods of dances, we set a threshold on onset strength. For a faster, more aggressive dance we tracked onsets with a low strength threshold, 0 or .15 for example. For slower, more balletic dances we tracked onsets above a strength of .25 or .4.

Onset Analysis of a Track
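Putting those steps together is mostly bookkeeping around Librosa. In the sketch below, encode_random_pose is a hypothetical helper that picks a random dataset sample and runs it through the trained encoder, and the threshold values echo the ones mentioned above:

import librosa
import numpy as np

FPS = 30  # our render framerate

def beat_times(audio_path, strength_threshold=0.15):
    # Detect onsets and keep those whose normalized strength clears the threshold.
    # Lower thresholds keep more beats (faster dances); higher thresholds keep fewer.
    y, sr = librosa.load(audio_path)
    env = librosa.onset.onset_strength(y=y, sr=sr)
    onsets = librosa.onset.onset_detect(onset_envelope=env, sr=sr, units="frames")
    times = librosa.frames_to_time(onsets, sr=sr)
    strengths = env[onsets] / env.max()
    return times[strengths >= strength_threshold]

def latent_path(times, encode_random_pose):
    # Build one latent vector per output frame, hitting a new random pose on every beat.
    frames, z_prev = [], encode_random_pose()
    for t0, t1 in zip(times[:-1], times[1:]):
        z_next = encode_random_pose()
        n = max(1, int(round((t1 - t0) * FPS)))   # frames between the two beats
        for i in range(n):
            frames.append(z_prev + (z_next - z_prev) * (i / n))
        z_prev = z_next
    return np.stack(frames)   # decode each row to get the pose for that frame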

This is the basic approach required to generate animations. Some of the effects in our animations build on this with additional interpolation tricks and image layering in OpenCV.

Two figures with similar but different paths through the latent space.

For instance, to create two figures with slightly different choreography, you essentially just draw similar but slightly different paths through the latent space.
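One hedged way to sketch that: take the latent path built for the first figure and nudge every keyframe by the same small random offset before decoding. The jitter scale here is an arbitrary assumption:

import numpy as np

def offset_latent_path(latent_path, jitter=0.1, seed=0):
    # Shift every keyframe of an existing latent path by one small random offset,
    # producing a second dancer with similar but not identical poses.
    rng = np.random.default_rng(seed)
    offset = rng.normal(scale=jitter, size=latent_path.shape[-1])
    return latent_path + offset

# figure_a = decode(latent_path)                       # original choreography
# figure_b = decode(offset_latent_path(latent_path))   # slightly different twin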

Step 4: Composing the Final Dance, Rendering the VAE Output in Unity, and Editing in Adobe Premiere

To make the final animations, many clips are rendered per song with various parameters governing the output. These clips are then edited together via a multi-camera sequence in Adobe Premiere.

This step is where a lot of the creative synthesis between the neural network and the creative takes place. The neural network can generate a large number of clips relatively quickly: a single clip with basic parameters can be generated in under 2 minutes, depending on the length of the song.

Our creative process begins with the neural network generating 20–30 clips. The artist looks at these and combines the clips they like. Since each clip dances exactly to the beat, no advanced editing is required, just switching between cameras in the multi-camera sequence.

At any point the artist can generate new clips to replace dance sequences that they are not happy with. No traditional animation skills are required.

Once all the clips are edited into a sequence we are happy with, we export it to a video.

Mixing together 9 neural network generated animation clips.

This video is then loaded into Unity as a video file. A simple controller script plays the video frame by frame as a point cloud. Having the neural network generate input for a 3D game engine gives us total control over the animation camera, instead of being stuck with the camera angle of the input data.

We created two types of cameras to render the final output. The first was a dynamic camera that automatically followed the figure. To do this, the centroid of each frame is calculated, and the camera follows that centroid plus a user-given offset each frame.

The other camera was a VR based system in which we loaded the generated animation into a VR space and then captured the position data of our camera for each frame, giving the experience a hand-held feel.

A basic VR camera capture tool we made to give the animation more of an intimate feel.

Each frame is rendered in Unity and then saved as a PNG. After all frames are rendered, they are combined into a video via FFmpeg. Numerous versions of this clip are rendered with different camera work and different scenes.

A single rendered video.
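Wrapped in Python, the FFmpeg step might look like the following; the frame naming pattern and output settings are assumptions rather than our exact command:

import subprocess

def frames_to_video(frame_pattern="renders/frame_%04d.png",
                    out_path="dance.mp4", fps=30):
    # Stitch the rendered PNG sequence into an H.264 video.
    subprocess.run([
        "ffmpeg", "-y",
        "-framerate", str(fps),
        "-i", frame_pattern,
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",   # for broad player compatibility
        out_path,
    ], check=True)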

Once a number of animations are rendered, they are imported into Premiere and edited together via a multi-camera sequence.

Editing multiple renderings of the same neural network generated dance together.

The power of this setup is in the ability to generate new animations on the fly to our liking. At any point we can generate a new dance clip and re-render it as we wish.

This gives us a lot of control over the final animation: we can generate new animations by tweaking parameters, change the number of figures, change the art direction, and change the camera work. If we want different dance styles, we just have to train the neural network on a different dataset.

It’s all possible because the neural network acts as a procedural system that builds itself, pointing towards a future where creatives can easily teach their software what type of content they want to create.

How to Improve This Process

We set out to explore how an ML assisted creative experience would be built, how it would be used, and to make something cool with it.

But there are so many things that can be done to improve the process. For this type of tool to make sense, there has to be better access to training data. Perhaps a future tool could have access to a large database of animation data. Alternatively, an ML based approach could be used to convert public videos into motion capture data.

The neural network architecture is perhaps the simplest architecture that achieves our goals. Improving the VAE as well as incorporating a transformer network could create better pose reconstructions as well as more accurate animation sequences.

Finally, the tooling could be written to exist in one single software suite instead of multiple. This would be software where traditional 3D creation pipelines are integrated with machine learning tools that learn how to generate the type of content you are looking to create.

Many of the components for an ML based creative pipeline exist already, and it’s easy to create smaller tools that work for just our team. But creating user experiences that intuitively work for everyday users would require a large amount of user research as well as education about what ML can and cannot do.

In the meantime, smaller teams can currently use ML based techniques to enhance their creative work and make things they traditionally just couldn't make. By integrating an ML engineer into their technical art team, new ways of creating digital art are immediately possible.


Sam Snider-Held

I’m a creative technologist working at MediaMonks, focusing on the intersection of AR, VR, AI, UX, and Creativity.