Dance x Machine Learning: First Steps

Creating new datasets and exploring new algorithms in the context of the dance performance “discrete figures”

In February 2018 Daito Manabe wrote me an email with the subject “Dance x Math (ML)”, asking if I’d be interested in working on a new project. I have some background working with dance, including 3d scanning with Lisa Parra in 2010, Reactor for Awareness in Motion with YCAM in 2013, and Transcranial with Daito and Klaus Obermaier in 2014–2015.

[Video description: screen recording of software showing a motion captured dancer rendered with basic shapes, and augmented by additional geometric and abstract extensions.]

I was very excited about the possibility of working with Mikiko and Elevenplay, and again with Daito. Daito and Mikiko shared some initial ideas and inspiration for the piece, especially ideas emerging from the evolution of mathematics: starting with the way bodies have been used for counting since prehistory, continuing through Alan Turing’s impressions of computers as an extension of human flesh, and arriving at modern attempts to categorize and measure the body with algorithms in the context of surveillance and computer vision.

A body counting system from the Oksapmin people in Papua New Guinea. [Image description: sketch of an upper body with numbers indicating 27 points, including fingers, wrist, arm, elbow, shoulder, neck, ear, eyes, nose, etc.]

After long conversations around these themes, we picked the name discrete figures to play on the multiple interpretations of both words. I focused on two consecutive scenes toward the end of the performance, which we informally called the “debug” scene and “AI dancer” scene. The goal for these two scenes was to explore the possibilities of training a machine learning system to generate dances in a style similar to Elevenplay’s improvisation. For the performance in Tokyo, we also added a new element to the debug scene that includes generated dance sequences based on videos captured of the audience before the performance. In this writeup I’ll provide a teardown of the process that went into creating these scenes.

Background

There is a long history of interactive and generative systems in the context of dance. Some of the earliest examples I know come from the “9 Evenings” series in 1966. For example, Yvonne Rainer with “Carriage Discreteness”, where dancers interacted with lighting design, projection, and even automated mechanical elements.

Excerpt from “Carriage Discreteness” by Yvonne Rainer. Still from 16 mm film by Alfons Schilling. [Image description: Black and white still, showing a large plastic sphere suspended from wires and rollers attached to a rope, raised high off the ground, the rafters visible in the background.]

More recently there are artist-engineers who have built entire toolkits or communities around dance. For example, Mark Coniglio developed Isadora starting with tools he created in 1989.

Screenshot from Isadora by Mark Coniglio. [Image description: screenshot of patching environment with rectilinear wires connecting a dozen nodes of varying sizes with icons for microphone input, wavetables, video assets, camera input, and other features.]

Or Kalypso and EyeCon by Frieder Weisse, starting around 1993.

Screenshot of EyeCon by Frieder Weisse. [Image description: screenshot of software running on Windows 2000 with multiple panels for live camera input with some debug overlays, debug text, and parameters. Each panel is a window inside the larger application window titled “EyeCon”.]

I’ve personally been very inspired by OpenEnded Group, who have been working on groundbreaking methods for visualization and augmentation of dancers since the late 1990s.

After Ghostcatching (2010) by OpenEnded Group. [Video description: Computer generated graphics, sketches with a hand-drawn appearance take the form of moving dancers against a black background, leaving trails of light and interacting with virtual environments.]

Many pieces by OpenEnded Group were developed in their custom environment called “Field”, which combines elements of node-based patching, text-based programming, and graphical editing.

Field by OpenEnded Group. [Image description: screenshot of multiple windows running inside early version of Mac OS X including a window with Python code, a visualization/render window, a graphical timeline of events, and icons for running and debugging the software.]

AI Dancer Scene

For my work on discrete figures I was inspired by a few recent research projects that apply techniques from deep learning to human movement. The biggest inspiration is called “chor-rnn” by Luka and Louise Crnkovic-Friis.

chor-rnn (2016) by Luka and Louise Crnkovic-Friis [Image description: human stick figure with a few dozen joints dancing abstract human-like movement.]

In chor-rnn, they first collect five hours of data from a single contemporary dancer using a Kinect v2 for motion capture. Then they process the data with a common neural network architecture called LSTM. This is a recurrent neural network (RNN) architecture, which means it is designed for processing sequential data, as opposed to static data like an image. The RNN processes the motion capture data one frame at a time, and can be applied to problems like dance-style classification or dance generation. “chor-rnn” is a pun on “char-rnn”, a popular architecture that is used for analyzing and generating text one character at a time.

For discrete figures, we collected around 2.5 hours of data with a Vicon motion capture system at 60fps in 40 separate recording sessions. Each session is composed of one of eight dancers improvising in a different style: “robot”, “sad”, “cute”, etc. The dancers were given a 120bpm beat to keep consistent timing.
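As a quick sanity check on the scale of this dataset, the numbers above imply roughly the following. This is only back-of-the-envelope arithmetic from the figures quoted in the text; the variable names are mine, and the real sessions were not all the same length.

```python
# Back-of-the-envelope numbers derived from the figures quoted in the text.
HOURS = 2.5      # approximate total recording time
FPS = 60         # Vicon capture rate
SESSIONS = 40    # number of recording sessions
BPM = 120        # tempo of the beat given to the dancers

total_frames = int(HOURS * 3600 * FPS)
avg_session_seconds = HOURS * 3600 / SESSIONS
frames_per_beat = FPS * 60 / BPM

print(total_frames)          # 540000
print(avg_session_seconds)   # 225.0
print(frames_per_beat)       # 30.0
```

So the network sees on the order of half a million frames, with a new beat every 30 frames.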

Example motion capture session: rotation data as quaternions, plotted over time. [Image description: dozens of line plots stacked as rows, the y-axis annotated “Joint” 0–20+ and the x-axis annotated “Seconds” 0–250+]

This data is very different from most existing motion capture datasets, which are typically designed for making video games and animation. Other research on generative human motion is also typically designed toward this end. For example, “Phase-Functioned Neural Networks for Character Control” from the University of Edinburgh takes input from a joystick, 3D terrain, and walk cycle phase, and outputs character motion.

[Video description: 3d-rendered character walking across a checkerboard terrain, following a line of arrows on the ground indicating the target path for the next few steps and previous few steps.]

We are more interested in things like the differences between dancers and styles, and how rhythm in music is connected to improvised dance. For our first exploration in this direction, we worked with Parag Mital to craft and train a network called dance2dance. This network is based on the seq2seq architecture from Google, which is similar to char-rnn in that it is a neural network architecture that can be used for sequential modeling.

Typically, seq2seq is used for modeling and generating language. We modified it to handle motion capture data. Based on results from chor-rnn, we also used a technique called “mixture density networks” (MDN). MDN allows us to predict a probability distribution across multiple outcomes at each time step. When predicting discrete data like words, characters, or categories, it’s standard to predict a probability distribution across the possibilities. But when you are predicting a continuous value, like rotations or positions, the default is to predict a single value. MDNs give us the ability to predict multiple values, and the likelihood of each, which allows the neural network to learn a more complex structure to the data. Without MDNs the neural network either overfits and copies the training data, or it generates “average” outputs.
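To make the “average” output problem concrete, here is a minimal toy sketch in plain Python. It is not the dance2dance code: the data, the mixture parameters, and the helper name sample_mixture are all made up for illustration, and I assume the single-value network is trained with a squared-error loss.

```python
import random

# Hypothetical toy data: at some time step the dancer turns either left (-1.0)
# or right (+1.0) with equal probability. This is not from the real dataset.
targets = [-1.0, 1.0] * 50

# A network that predicts a single value under squared-error loss is pulled
# toward the mean of the targets.
single_prediction = sum(targets) / len(targets)

def sample_mixture(weights, means, stds, rng):
    """Draw one value from a 1D Gaussian mixture, the kind of output an MDN describes."""
    component = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return rng.gauss(means[component], stds[component])

rng = random.Random(0)
# Mixture parameters an MDN might output for this time step (assumed values).
weights, means, stds = [0.5, 0.5], [-1.0, 1.0], [0.05, 0.05]
samples = [sample_mixture(weights, means, stds, rng) for _ in range(1000)]

print(single_prediction)                           # 0.0, a pose nobody danced
print(all(abs(abs(s) - 1.0) < 0.5 for s in samples))  # samples stay near a mode
```

The single prediction lands between the two modes, while every sample from the mixture stays close to something the dancer actually did.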

One big technical question we had to address while working on this was how to represent dance. By default, the data from the motion capture system is stored in a format called BVH, which provides a skeletal structure with fixed length limbs and a set of position and rotation offsets for every frame. The data is mostly encoded using rotations, with the exception of the hip position, which is used for representing the overall position of the dancer in the world. If we were able to generate rotation data with the neural net, then we could generate new BVH files and use them to transform a rigged 3D model of a virtual dancer.

Example BVH file with skeletal structure on left and data frames on right. See here for an example BVH file. [Image description: screenshot of text editor with two columns showing nested data on left side reading “OFFSET, CHANNELS, JOINT” and on the right side showing lists of numbers.]
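For readers who have not seen BVH before, the sketch below parses a deliberately tiny, hand-written file. The two-joint skeleton and the parse_bvh helper are hypothetical and only handle the simple layout shown here, not every BVH variant.

```python
# A deliberately tiny, hand-written BVH file (not from the real dataset).
EXAMPLE_BVH = """HIERARCHY
ROOT Hips
{
  OFFSET 0.0 0.0 0.0
  CHANNELS 6 Xposition Yposition Zposition Zrotation Xrotation Yrotation
  JOINT Spine
  {
    OFFSET 0.0 10.0 0.0
    CHANNELS 3 Zrotation Xrotation Yrotation
    End Site
    {
      OFFSET 0.0 10.0 0.0
    }
  }
}
MOTION
Frames: 2
Frame Time: 0.0166667
0.0 90.0 0.0 0.0 0.0 0.0 5.0 0.0 0.0
1.0 90.0 0.0 0.0 0.0 2.0 6.0 0.0 0.0
"""

def parse_bvh(text):
    """Return (joints, frame_time, frames) from BVH text.

    joints is a list of (name, channel_names); frames is a list of float lists.
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    motion_at = lines.index("MOTION")
    joints = []
    for line in lines[:motion_at]:
        parts = line.split()
        if parts[0] in ("ROOT", "JOINT"):
            joints.append((parts[1], []))
        elif parts[0] == "CHANNELS":
            # parts[1] is the channel count, the names follow it.
            joints[-1][1].extend(parts[2:])
    frame_time = float(lines[motion_at + 2].split(":")[1])
    frames = [[float(v) for v in line.split()] for line in lines[motion_at + 3:]]
    return joints, frame_time, frames

joints, frame_time, frames = parse_bvh(EXAMPLE_BVH)
channels_per_frame = sum(len(channels) for _, channels in joints)
print([name for name, _ in joints])   # ['Hips', 'Spine']
print(channels_per_frame)             # 9
```

Only the root joint carries position channels; every other joint contributes rotations, which is why the data is mostly rotations plus a single hip position.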

chor-rnn uses 3D position data, which means that it is impossible to distinguish between something like an outstretched hand that is facing palm-up vs palm-down, or whether the dancer’s head is facing left vs right.
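A small sketch of that ambiguity, using a made-up forearm with the elbow at the origin. The helper functions are my own minimal quaternion utilities, not part of any library: rolling the forearm 180 degrees about its own axis leaves the wrist position untouched while the palm flips over.

```python
import math

def quat_from_axis_angle(axis, degrees):
    """Unit quaternion (w, x, y, z) for a rotation about a unit-length axis."""
    half = math.radians(degrees) / 2.0
    s = math.sin(half)
    return (math.cos(half), axis[0] * s, axis[1] * s, axis[2] * s)

def quat_multiply(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)

def rotate(q, v):
    """Rotate vector v by unit quaternion q using q * v * conj(q)."""
    w, x, y, z = q
    p = quat_multiply(quat_multiply(q, (0.0, *v)), (w, -x, -y, -z))
    return p[1:]

forearm = (1.0, 0.0, 0.0)       # bone pointing along +x, elbow at the origin
palm_normal = (0.0, 1.0, 0.0)   # palm facing up

twist = quat_from_axis_angle((1.0, 0.0, 0.0), 180)  # roll the forearm over

wrist_after = rotate(twist, forearm)
palm_after = rotate(twist, palm_normal)
print([round(c, 6) for c in wrist_after])  # wrist position is unchanged
print([round(c, 6) for c in palm_after])   # palm normal now points down
```

A dataset of joint positions records the same wrist location in both cases, so the palm direction is simply lost.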

There are some other decisions to make about how to represent human motion.

  • The data: position or rotation.
  • The representation: for position, there are cartesian and spherical coordinates. For rotation, there are rotation matrices, quaternions, Euler angles, and axis-angle.
  • The temporal relationship: temporally absolute data, or difference relative to previous frame.
  • The spatial relationship: spatially absolute data, or difference relative to parent joint.

Each of these has different benefits and drawbacks. For example, using temporally relative data “centers” the data, making it easier to model (this approach is used by David Ha for sketch-rnn), but during generation the absolute position can slowly drift.
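The drift is easy to reproduce with a toy example. In this sketch the bias and noise values are invented, and the “generator” is just a list of slightly wrong deltas rather than a real network.

```python
import random

rng = random.Random(1)

# Hypothetical ground truth: a hip that stays still, so every true delta is 0.
true_deltas = [0.0] * 3600   # one minute at 60 fps

# Assume the generator's predicted deltas carry a tiny bias plus small noise.
predicted = [d + 0.001 + rng.gauss(0.0, 0.0005) for d in true_deltas]

# Integrating deltas back into absolute positions accumulates the error.
position = 0.0
trajectory = []
for delta in predicted:
    position += delta
    trajectory.append(position)

print(max(abs(d) for d in predicted))   # every per-frame error is tiny
print(trajectory[-1])                   # but the final position has drifted
```

Each per-frame error is a fraction of a percent of a unit, yet after one minute the dancer has wandered several units away from where they should be.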

Using Euler angles can help decrease the number of variables to model, but angles wrap around in a way that is hard to model with neural networks. A similar problem is encountered when using neural networks to model the phase of audio signals.
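Here is a minimal illustration of the wraparound problem, with hypothetical function names. Two angles that are almost identical on the circle look very far apart to a loss that only sees the raw numbers.

```python
def naive_error(a_deg, b_deg):
    """Plain numeric difference, which is what a squared-error loss sees."""
    return abs(a_deg - b_deg)

def angular_error(a_deg, b_deg):
    """Shortest distance between two angles on the circle, in degrees."""
    diff = (a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)

print(naive_error(359.0, 1.0))    # 358.0
print(angular_error(359.0, 1.0))  # 2.0
```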

In our case, we decided to use temporally and spatially absolute quaternions. Initially we had some problems with wraparound and quaternion flipping, because quaternions have two equivalent representations for any orientation, but it is possible to constrain quaternions to a single representation.
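One common way to pick a single representation is to always keep the quaternion with a non-negative w component. The sketch below shows that convention on made-up values; it is an illustration of the idea, not necessarily the exact constraint we used.

```python
def canonicalize(q):
    """Return the representative of {q, -q} with a non-negative w component.

    q is a unit quaternion (w, x, y, z). q and -q encode the same orientation,
    so flipping the sign does not change the pose. Ties at w == 0 are left
    alone in this sketch.
    """
    w, x, y, z = q
    if w < 0.0:
        return (-w, -x, -y, -z)
    return q

def max_component_jump(a, b):
    """Largest absolute change in any component between two quaternions."""
    return max(abs(p - q) for p, q in zip(a, b))

# Two consecutive frames (made-up values) describing almost the same
# orientation, but stored with opposite signs.
frame_a = (0.7071, 0.7071, 0.0, 0.0)
frame_b = (-0.7070, -0.7072, 0.0, 0.0)

raw_jump = max_component_jump(frame_a, frame_b)
fixed_jump = max_component_jump(canonicalize(frame_a), canonicalize(frame_b))
print(raw_jump > 1.0)      # True: looks like a huge discontinuity
print(fixed_jump < 0.01)   # True: the motion is actually smooth
```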

Before training the dance2dance network, I tried some other experiments on the data. For example, training a variational autoencoder (VAE) to “compress” each frame of data.

Original data on left, VAE “compressed” data on right. [Video description: two 3d-rendered stick figures with cubes for joints. The left figure moves smoothly and the right copies the left but with less enthusiasm.]

In theory, if it’s possible to compress each frame then it is possible to generate in that compressed space instead of worrying about modeling the original space. When I tried to generate using a 3-layer LSTM trained on the VAE-processed data, the results were incredibly “shaky”. (I assume this is because I did not incorporate any requirement of temporal smoothness, and the VAE learned a very piecemeal latent space capable of reconstructing individual frames instead of learning how to expressively interpolate.)

Motion capture sequence generated by an RNN trained on VAE-processed data. First two seconds show the seed data. [Video description: one 3d-rendered stick figure with cubes for joints, moving smoothly for two seconds, then moving in a very jittery fashion for nine seconds.]

After training the dance2dance network for a few days, we started to get output that looked similar to some of our input data. The biggest difference across all these experiments is that the hips are fixed in place, making it look sort of like the generated dancer is flailing around on a bicycle seat. The hips are fixed because we were only modeling the rotations and didn’t model the hip position offset.

As the deadline for the performance drew close, we decided to stop the training and work with the model we had. The network was generating a sort of not-quite-human movement that was still somehow reminiscent of the original motion, and it felt appropriate for the feeling we were trying to create in the performance.

Raw output from dance2dance network when we stopped training. [Video description: one 3d-rendered stick figure with cubes for joints moving in smooth somewhat human-like motions that no human would accidentally produce.]

During the performance, the real dancer from Elevenplay 丸山未那子 (MARUYAMA Masako, or Maru) starts the scene by exploring the space around the AI dancer, keeping her distance with a mixture of curiosity and suspicion. Eventually, Maru attempts to imitate the dancer. For me, this is one of the most exciting moments as it represents the transformation of human movement passed through a neural network once again embodied by a dancer. The generated motion is then interpolated with the choreography to produce a slowly-evolving duet between Maru and the AI dancer. During this process, the rigged 3D model “comes to life” and changes from a silvery 3D blob to a textured dancer. For me, this represents the way that life emerges when creative expression is shared between people; the way that sharing can complete something otherwise unfinished. As the scene ends, the AI dancer attempts to exit the stage, but Maru backs up in the same direction with palm outstretched towards the AI dancer. The AI dancer transforms back into the silvery blob and it is left writhing alone in its unfinished state, without a real body or any human spirit to complete it.

[Video description: recording from on-stage camera as it moves around the stage, augmented by a 3d-rendered dancer that appears to float next to the real dancer. The camera feed is projected behind the dancer creating some intermittent video feedback effects.]

Debug Scene

The debug scene precedes the AI dancer scene and acts as an abstract introduction. For the debug scene I compiled a variety of different data from the training and generation process and presented it as a kind of landscape for exploring.

[Video description: excerpts from the debug scene, including a 3d-rendered wireframe of a dancer with a stick-figure overlay with a grid of cubes in the background, flying through a point cloud, a grid of dance stills, both generated and real, flying by like a zoetrope.]

There are four main elements to the debug scene, and it is followed by a collection of data captured from the audience before the performance.

In the center is the generated dancer, including the skeleton and rigged 3D model. Covering the generated dancer is a set of rotating cubes, representing the rotations of most of the joints in the model. On the left and right are 3D point clouds based on the generated data.

[Image description: still from debug scene showing point cloud with a lot of long tendrils of similar colors jutting out in every direction, winding in and out of each other and collapsing back toward the center of the space.]

Each point in the point clouds corresponds to a single frame of generated data. One point cloud represents the raw rotation data, and the other point cloud represents the state of the neural network at that point in time. The point clouds are generated using a technique called UMAP by Leland McInnes.

Example UMAP point cloud generated from the Fashion MNIST dataset. [Image description: point cloud showing clusters separated spatially and by color, with a key for the colors reading “Ankle boot, Bag, Sneaker, Shirt, Sandal…” etc.]

Plotting 2D or 3D points is easy, but when you have more than 3 dimensions it can be hard to visualize. UMAP helps with this problem by taking a large number of dimensions (like all the rotation values of a single frame of dance data) and creating a set of 3D points that has a similar structure. This means points that are close in the high-dimensional space should be close in the low-dimensional 3D space. Another popular algorithm for this is t-SNE.

The final element is the large rotating cube in the background made of black and white squares.

[Image description: dramatic wide-angle view of a cube made of small numbers and black and white rectangles of various sizes. The cube encloses the entire debug scene, rendered very small in the center of the space.]

This is a reference to a traditional technique for visualizing the state of neural networks, called a Hinton Diagram.

Excerpt from a 1991 paper by Hinton et al. [Image description: five separate rows of dense black and white horizontal lines with black and white rectangles covering portions of the horizontal lines.]

In these diagrams black squares represent negative numbers and white squares represent positive numbers, and the size corresponds to the value. Historically, these diagrams were helpful for quickly checking and comparing the internal state of a neural network by hand. In this case, we are visualizing the state of the dance2dance network that is generating the motion.
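The mapping from a value to a square is simple enough to sketch in a few lines. The weights below are made up, and hinton_cell is a hypothetical helper rather than the rendering code used in the scene.

```python
def hinton_cell(value, max_abs, max_size=4):
    """Map one weight to (color, size) following the Hinton diagram convention.

    Color encodes the sign, size encodes the magnitude relative to max_abs.
    A value of exactly zero gets size 0 and is effectively invisible.
    """
    color = "white" if value > 0 else "black"
    size = round(abs(value) / max_abs * max_size)
    return color, size

# Made-up weights standing in for a slice of network state.
weights = [[0.9, -0.2, 0.0],
           [-0.6, 0.4, -1.0]]
max_abs = max(abs(v) for row in weights for v in row)

cells = [[hinton_cell(v, max_abs) for v in row] for row in weights]
for row in cells:
    print(row)
```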

The ending sequence of the debug scene is based on data collected just before each performance. The audience is asked to dance for one minute in front of a black background, one person at a time. We show an example dance for inspiration and show realtime pose tracking results to help the audience understand what is being collected. This capture booth was built by 浅井裕太 (ASAI Yuta) and 毛利恭平 (MŌRI Kyōhei) and the example dance features a rigged model of Maru rendered by Rhizomatiks.

[Image description: performance venue lobby with monitor on the right and young kid on the left. The kid is standing on a black carpet with black background with lights illuminating them, looking at the monitor and interpreting a dancer on the screen. Other kids and family watch the screen from the side.]

With each audience member, we upload their dance video to a remote machine that analyzes their motion using OpenPose. On performance days we kept 16 p2.xlarge AWS instances alive and ready to ingest this data, automated by 2bit.

[Video description: video with tracked-skeleton stick figure overlay of two researchers wearing matching jeans and collared shirts, appearing in front of a large hexagonal dome with many wires inside a research facility, holding up their hands and carrying accessories to demonstrate the accuracy of the tracking.]

After analyzing their motion, we train an architecture called pix2pixHD to generate images from the corresponding poses. While pix2pixHD is typically available under a non-commercial license, NVIDIA granted us an exception for this performance.

[Video description: colorful semantic map of a scene from inside a car looking out fade-wipes to reveal an uncanny photorealistic image generated from the semantic map.]

Once pix2pixHD is trained, we can synthesize “fake” dance videos featuring the same person. This process is heavily inspired by “Everybody Dance Now” by Caroline Chan et al.

[Video description: research demo video showing non-dancers moving randomly, and photorealistic video of the same non-dancers generated to follow the movement of real dancers (inset top left) based on their detected pose (inset bottom left).]

In our case, we synthesize the dance during the training process. This means the first images in the sequence look blurry and unclear, but by the end of the scene they start to resolve into more recognizable features. During the first half of this section we show an intermittent overlay of the generated dancer mesh, and during the second half we show brief overlays of the best-matching frame from the original video recording. The pose-matching code was developed by Asai.
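I can’t reproduce Asai’s pose-matching code here, but the simplest version of the idea is a nearest-neighbor search over tracked keypoints. The sketch below assumes 2D keypoints in normalized image coordinates and uses invented poses and function names.

```python
def pose_distance(pose_a, pose_b):
    """Sum of squared distances between corresponding 2D keypoints."""
    return sum((ax - bx) ** 2 + (ay - by) ** 2
               for (ax, ay), (bx, by) in zip(pose_a, pose_b))

def best_matching_frame(target_pose, recorded_poses):
    """Index of the recorded frame whose pose is closest to target_pose."""
    return min(range(len(recorded_poses)),
               key=lambda i: pose_distance(target_pose, recorded_poses[i]))

# Made-up poses with three keypoints each (head, left hand, right hand).
recorded = [
    [(0.5, 0.2), (0.3, 0.5), (0.7, 0.5)],   # arms down
    [(0.5, 0.2), (0.1, 0.2), (0.9, 0.2)],   # arms out
    [(0.5, 0.2), (0.4, 0.0), (0.6, 0.0)],   # arms up
]
target = [(0.5, 0.2), (0.15, 0.25), (0.85, 0.25)]

print(best_matching_frame(target, recorded))  # 1
```

A real implementation would likely also need to handle missing keypoints and differences in body scale, which this sketch ignores.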

While most of discrete figures runs in realtime, the debug scene is pre-rendered in openFrameworks and exported as a video file to reduce the chance of something going wrong in the middle of the show. Because the video is re-rendered for every show, a unique kind of time management was required:

  • The doors for the show open one hour before each performance.
  • Each audience member records for one minute with some time before and after for stepping up and walking away.
  • We train and render using pix2pixHD for 15 minutes per person (including the video generation and file transfer from AWS to the venue).
  • It takes 12 minutes to render the video from the generated videos.
  • We must hand off the video for final checks 15 minutes before the lights dim (as the audience is being seated).

This allowed us to include up to 15 audience members in each performance.
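The schedule can be checked with some quick arithmetic. I’m assuming here that training for each person runs in parallel on its own machine and starts as soon as their recording is uploaded, so only the last person’s 15 minutes sits on the critical path.

```python
# All durations in minutes, taken from the schedule in the text.
DOORS_OPEN_BEFORE_SHOW = 60
HANDOFF_BEFORE_SHOW = 15
FINAL_RENDER = 12
TRAIN_AND_TRANSFER = 15   # per person, assumed to run in parallel
MAX_PEOPLE = 15

# The last recording must finish early enough to leave room for training,
# the final render, and the handoff.
recording_window = (DOORS_OPEN_BEFORE_SHOW - HANDOFF_BEFORE_SHOW
                    - FINAL_RENDER - TRAIN_AND_TRANSFER)
minutes_per_person = recording_window / MAX_PEOPLE

print(recording_window)     # 18
print(minutes_per_person)   # 1.2
```

Under those assumptions there are 18 minutes available for recording, or about 72 seconds per person: one minute of dancing plus a few seconds for stepping up and walking away.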

Generated images based on video of audience using pix2pixHD. [Image description: grid of images of blurry human-like forms against a black background with some clear legs, jackets, arms, heads, and lots of oversaturated fixed-pattern noise in the medium-brightness areas.]
Best match images from video of audience. [Image description: grid of photos of audience members dancing against black background, with all their arms mostly outstretched to the left.]

While Maru has a chance to experience the process of her movement data being reimagined by machine, this final section of the debug scene gives the audience a chance to have the same feeling. It follows a recurring theme throughout the entire performance: seeing your humanity reflected in the machine, and vice versa.

Future

Next we will be exploring other data representations, other network architectures, and the possibility of conditional generation (generating dances in a specific style, from a specific dancer, or to a specific beat) and classification (determining each of these attributes from input data, for example, following the rhythm of a dancer with generative music). While the training process for these architectures can take a long time, once they are trained the evaluation can happen in realtime, opening up the possibility of using them in interactive contexts.


Credits

discrete figures is almost an hour long, and this article only describes a small piece of the performance. Credits for the entire piece are below, and also available on the project website.

Cast KOHMEN (ELEVENPLAY) / ERISA (ELEVENPLAY) / SAYA (ELEVENPLAY) / KAORI (ELEVENPLAY) / MARU (ELEVENPLAY) / EMMY (ELEVENPLAY) / YU (ELEVENPLAY)
Stage Direction | Choreography MIKIKO
Artistic Direction | Music Daito Manabe (Rhizomatiks Research)
Technical Direction | Hardware Engineering Motoi Ishibashi (Rhizomatiks Research)
Machine learning Direction Kyle McDonald
Machine learning Yuta Asai (Rhizomatiks Research)
Network programming 2bit
Projection System | Software Engineering Yuya Hanai (Rhizomatiks Research)
Visualization Satoshi Horii (Rhizomatiks Research) / You Tanaka (Rhizomatiks Research) / Futa Kera (Rhizomatiks Research)
CG Direction Tetsuka Niiyama (+Ring / TaiyoKikaku Co.,Ltd.)
CG Producer Toshihiko Sakata (+Ring / TaiyoKikaku Co.,Ltd.)
Music Daito Manabe / Hopebox / Kotringo / Krakaur / Setsuya Kurotaki / Seiho
Videographer Muryo Homma (Rhizomatiks Research)
Stage Engineering Momoko Nishimoto (Rhizomatiks Research)
Motion Capture Tatsuya Ishii (Rhizomatiks Research) / Saki Ishikawa (Rhizomatiks Research)
4D-VIEWS Crescent,inc.
Technical Support Shintaro Kamijo (Rhizomatiks Research)
Craft Tomoaki Yanagisawa (Rhizomatiks Research) / Toshitaka Mochizuki (Rhizomatiks Research) / Kyohei Mouri (Rhizomatiks Research)
Promotional Designer Hiroyasu Kimura (Rhizomatiks Design) / Hirofumi Tsukamoto (Rhizomatiks Design) / Kaori Fujii (Rhizomatiks Design)
Production Management Yoko Shiraiwa (ELEVENPLAY) / Nozomi Yamaguchi (Rhizomatiks Research) / Ayumi Ota (Rhizomatiks Research) / Rina Watanabe (Rhizomatiks Research)
Producer Takao Inoue (Rhizomatiks Research)
Production Rhizomatiks co., ltd.