Duet ChorAIgraphy: Dance Meets AI (Again)
Greetings everyone! I’m Luis Zerkowski and, for the past couple of months, I’ve spent a lot of my time working on a super exciting Google Summer of Code 2024 project for HumanAI Foundation and I’m very happy to finally be able to share a bit of my progress so far! Welcome to Duet ChorAIgraphy, where we try to explore the beauty of the connections between two dancers in a duet through the lens of machine learning!
This project is heavily inspired by the work of my amazing supervisors (choreography and choreo-graph), who have previously explored the connections between AI and dance for solo performances. During the selection process, I also reproduced one of these pipelines.
Now, we’re taking it a step further and exploring how AI can provide insights into choreographies and inspire creativity for partnering, mostly by visualizing connection subtleties in a duet and generating new dance phrases. On my GitHub repo, you’ll find everything you need to dive into this fascinating world — from code to results, along with very detailed documentation.
Since I’ve mentioned them already, I’d like to give a special thanks to my supervisors, Mariel Pettee and Ilya Vidrin, for all their guidance and support. I also want to thank my work partner, Zixuan Wang, for developing her pipeline alongside mine.
So whether you’re a dancer, a tech enthusiast, or just curious about the intersection of art and AI, I invite you to join me in this adventure. Let’s see how technology can push the boundaries of what’s possible in the world of dance 😁
OK, But What Exactly Is Duet ChorAIgraphy?
It’s a project that aims to implement two pipelines using Graph Neural Networks (GNNs) to study dance duets. The work focuses on two main aspects:
- Interpretability of Movements: A pipeline that learns about the connection between the dancers’ bodies in different dance sequences.
- Generation of New Sequences: A pipeline that uses these learned connections to generate new dance sequences.
Below, I discuss the preparatory work necessary before effectively implementing the GNNs, and the state of each model together with some results. For specifics on how to set up your own version of the project, I encourage you to check my GitHub repo, which contains thorough documentation of all the steps.
Pose Extraction Pipeline
A key part of any artificial intelligence project is gathering data. To find the best open-source pose extraction tool for the project’s needs, several tools were tested over a few weeks. This section highlights the main tools tested, providing descriptions and comparisons that led to the final choice. It also includes step-by-step instructions on how to install and use the selected tool, making it easy to set up.
Before diving into the technical details, I want to extend special thanks to Ilya Vidrin and his dance team for providing the choreographies used in this project. The images shown here come directly from their performances.
2D Model Exploration
The initial exploration focused on 2D pose extraction models, primarily using AlphaPose. This software, known for its modern models and better pose estimation results compared to most other open-source systems, was thoroughly tested.
All models within the AlphaPose model zoo were evaluated, and four final options were selected, as shown in the figure below. These models represent some of the best average results from the repository and pretty much cover all the possible outputs one can get.
To more accurately represent the body, with clear markers for the pelvis and torso, the Halpe models were chosen. To evaluate their quality without any preprocessing, the animations below were created.
It is clear, however, that both models require further processing for actual usage. They struggle with some frames, losing track of one or both dancers, or repeating poses and placing the dancers in the same position. Additionally, the joints appear unstable, showing significant local vibration and resulting in very shaky poses. While the models promise better results when a tracking pipeline is added, the improvements were found to be minimal, as shown in the GIFs below. Some form of filtering between frames is still needed to achieve more stable pose extraction.
The 2D processing pipelines, however, were not fully implemented and won’t be discussed further here, as the decision was made to use 3D models instead. 3D pose extraction adds crucial information for analyzing and generating choreography, especially for pairs of dancers. It captures the richness of each dancer’s movements and the subtleties of their interactions in the full space, which would be lost if the dimensionality were reduced.
3D Model Exploration
With this in mind, two main 3D pose extraction models were explored: VIBE and HybrIK (the latter being part of the AlphaPose pipeline). From the experiments, it became clear that the model integrated with AlphaPose performs much better than VIBE, excelling both in identifying the instance and in accurately extracting poses. The GIFs below show the pose extraction and mesh reconstruction performed by HybrIK.
The rest of this section, therefore, focuses on using AlphaPose for 3D pose extraction. The examples shown from now on also better represent the actual scenario, as AlphaPose’s version of HybrIK supports multi-instance pose extraction, allowing poses for both dancers to be extracted simultaneously.
Once the pipeline is chosen, the next crucial step is data processing. Selecting a method doesn’t mean the data is already clean and ready to use. The GIF below shows common issues, and this section explains the solutions implemented to prepare the data for the models.
The problems addressed were:
- Missing frames: When a frame was lost because no pose was identified, we replicated the poses from the previous frame. This solution worked well due to the small number of missed frames and the high sampling rate (approximately 30 FPS), which prevented noticeable impacts on movement.
- Frames with only one person: When the model captured only one person in a frame, we compared the sum of Euclidean distances between corresponding joints for the identified person and the two people in the previous frame. We then added the person from the previous frame with the greater distance to the current frame (assuming this was the non-captured person).
- Frames with more than two people: When the model identified more than two people in a frame, we retained the two people with the highest confidence scores and removed the rest, as we know in advance that our data only contains two people per frame.
- Index matching: When the model lost track of people, swapping their indices back and forth over time, we scanned all frames and used the aforementioned sum of Euclidean distances between corresponding joints to correct the inversions.
- Vertex jitter: When the model caused local inaccuracies that varied the positions of vertices beyond the actual movement, creating a jitter effect, we tested a few different methods and ended up using a 3D DCT low-pass filter (25% threshold) to smooth the data (see the sketch after this list).
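For the two least standard of these steps, here is a minimal sketch of how the index matching and the DCT smoothing can be implemented. The array shapes and helper names (match_indices, dct_lowpass) are illustrative assumptions rather than the project’s exact code:

```python
import numpy as np
from scipy.fft import dct, idct

def match_indices(prev_pair, curr_pair):
    """Keep dancer identities consistent by comparing summed joint distances."""
    # prev_pair, curr_pair: arrays of shape (2, num_joints, 3)
    keep = np.linalg.norm(prev_pair - curr_pair, axis=-1).sum()
    swap = np.linalg.norm(prev_pair - curr_pair[::-1], axis=-1).sum()
    return curr_pair if keep <= swap else curr_pair[::-1]

def dct_lowpass(track, keep_ratio=0.25):
    """Smooth per-joint trajectories with a DCT low-pass filter (25% threshold)."""
    # track: array of shape (num_frames, num_joints, 3)
    coeffs = dct(track, axis=0, norm="ortho")
    cutoff = int(keep_ratio * track.shape[0])
    coeffs[cutoff:] = 0.0                      # drop high-frequency components
    return idct(coeffs, axis=0, norm="ortho")  # back to the time domain
```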
The GIFs below show the original video together with the final results for both dancers after applying the full pipeline.
Models
After extensive data processing, attention was turned to developing the models themselves. The use of AI in dance allows for a wide range of creative possibilities. Many different tasks can be explored, but two stand out in this project: dance interpretability and the generation of new phrases.
The first task involves studying the hidden movements of a duet, extracting information that is not visually evident. For example, it might not be immediately clear that the movement of one dancer’s right foot is directly connected to the movement of the other dancer’s hip or left hand. The objective is to model and tackle a Graph Structure Learning problem, uncovering the connections (or different types of connections) between the dancers’ joints. More specific technical details are described in a dedicated section below.
The second task is a natural continuation of the first. Using the connections learned from the first model, or perhaps even a graph structure defined by a user, the goal is to Generate New Dance Sequences guided by these connections. In other words, this pipeline aims to create new movements that follow a suggested line. The technical details of this model are also provided in its respective section below, although it remains in the conceptual stage at this point in the project.
Loading Data
The next step in the project involves loading the dance data for AI modeling. First, the preprocessed data is read in an interleaved manner to separately extract data from both dancers. Adjacencies are then created by initializing a default skeleton with 29 joints for each dancer and connecting every joint of one dancer to all joints of the other.
The idea behind mapping the skeletons of both dancers in this way is to ensure the model focuses on the connections between them, rather than on the connections within each individual dancer. It’s natural that much of a dancer’s joint movement could be more easily predicted by inspecting their other joints. However, the goal here is to focus on the influences between the dancers, identifying which joints of one dancer influence the other and vice versa. It is also worth noting that, to simplify the initial modeling, this graph is undirected. This approach will be adjusted in the future to evaluate the direction of the influences between each joint of both dancers.
These connections are what the graph structure learning model will classify, initially as existing or non-existing edges for simplicity, but later categorically by the degree of influence in the movement. The data is then prepared for model training by creating batches with PyTorch tensors. The tensors are structured with dimensions representing the total number of sequences, the sequence length, the number of joints from both dancers, and 3D coordinates. Finally, a training-validation split is created to allow for proper model hyperparameter tuning.
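To make that layout concrete, here is a rough sketch of how the inter-dancer edges and the batched tensors described above might be built. The helper name build_duet_edges and the example sequence length are assumptions for illustration:

```python
import torch

NUM_JOINTS = 29  # per dancer

def build_duet_edges(num_joints=NUM_JOINTS):
    """Connect every joint of dancer A to every joint of dancer B (undirected)."""
    a = torch.arange(num_joints)                  # dancer A: 0..28
    b = torch.arange(num_joints) + num_joints     # dancer B: 29..57
    src, dst = torch.cartesian_prod(a, b).t()     # all A-B joint pairs
    # Store both directions so the graph behaves as undirected in edge_index form.
    return torch.stack([torch.cat([src, dst]), torch.cat([dst, src])])

# Batched sequences: (num_sequences, seq_len, joints of both dancers, xyz),
# e.g. sequences = torch.randn(62_000, 16, 2 * NUM_JOINTS, 3)
```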
Neural Relational Inference Variant
As the title suggests, this model is a variant of the Neural Relational Inference (NRI) model, which itself is an extension of the traditional Variational Autoencoder (VAE). The primary objective of the original model is to study particles moving together in a system without prior knowledge of their underlying relationships. By analyzing their movement (position and velocity), the model estimates a graph structure that connects these particles, aiming to reveal which particles exert force on the others.
In the context of this project, the particles are represented by the joints of dancers. While the physical connections between joints within a dancer’s body are known, this information alone is insufficient to understand the partnering relationships between two dancers.
Since a target graph structure correctly identifying which joints are virtually connected during a dance performance is unavailable, and considering that this graph can change over time even within a performance, self-supervised techniques are employed — one of the reasons for choosing the VAE framework.
The model consists of an encoder and a decoder, both playing around with transforming node representations into edge representations and vice versa. This approach emphasizes the dynamics of movements rather than fixed node embeddings. Not only that, but the encoder specifically outputs edges, sampling these from the generated latent space, making it essential to switch between representations.
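For readers unfamiliar with NRI, this is roughly what that switching between node and edge representations looks like in NRI-style code, assuming one-hot receiver/sender matrices (rel_rec, rel_send) built from the edge list; the exact implementation in this project may differ:

```python
import torch

def node2edge(node_feats, rel_rec, rel_send):
    """Build an edge representation by concatenating its two endpoint features."""
    receivers = torch.matmul(rel_rec, node_feats)   # (num_edges, feat_dim)
    senders = torch.matmul(rel_send, node_feats)    # (num_edges, feat_dim)
    return torch.cat([senders, receivers], dim=-1)  # (num_edges, 2 * feat_dim)

def edge2node(edge_feats, rel_rec):
    """Aggregate incoming edge features back onto each node (mean pooling)."""
    incoming = torch.matmul(rel_rec.t(), edge_feats)  # (num_nodes, feat_dim)
    counts = rel_rec.sum(dim=0, keepdim=True).t().clamp(min=1)
    return incoming / counts
```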
This project’s implementation, even though very similar to the NRI MLP-Encoder MLP-Decoder model, includes a few important modifications:
- Graph Convolutional Network (GCN): Some MLP layers are replaced with GCN layers to leverage the graph structure, improving the model’s ability to capture relationships between joints. This change focuses on a subset of edges connecting both dancers rather than studying all particle relationships as in the original implementation. Additionally, GCNs provide local feature aggregation and parameter sharing, important inductive biases for this context, resulting in enhanced generalization in a scenario with dynamic (and unknown) graph structures.
- Predicting Sequences: Since the data only includes noisy 3D positions of the joints (and not their velocities), the Markovian property explored by NRI for reconstructions does not hold. Therefore, to predict movement, the model reconstructs entire (small) sequences.
- Use of Modern Library: PyTorch Geometric is utilized for its advanced features and ease of use.
By incorporating these modifications, the model maintains the core principles of the original NRI model while theoretically enhancing its ability to generalize and adapt to the dynamic nature of dance performances.
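As a hedged illustration of the first modification, a GCN block over the inter-dancer edges could look like the following with PyTorch Geometric; the layer sizes and class name are placeholders, not the project’s actual architecture:

```python
import torch
from torch_geometric.nn import GCNConv

class JointEncoderBlock(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)

    def forward(self, x, edge_index):
        # x: (num_joints_both, in_dim) node features, e.g. xyz flattened over a window
        # edge_index: (2, num_edges) inter-dancer connections
        h = torch.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)
```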
Having said that, the model’s results have not met expectations yet, as the network optimization is still in progress and several challenges are being addressed. One primary issue encountered is a classic problem in VAE training: the model tends to optimize almost exclusively for KL-Divergence, neglecting the reconstruction loss. To mitigate this, the beta coefficient technique was employed, cyclically adjusting the weight associated with the KL-Divergence loss. This approach has shown promising results, enabling a more balanced optimization of both KL-Divergence and reconstruction loss, even though the initial epochs still focus heavily on the former.
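A minimal sketch of that cyclical beta schedule, with illustrative constants: the KL weight ramps from 0 to 1 within each cycle and then resets, so the early steps of every cycle favor the reconstruction term.

```python
def cyclical_beta(step, cycle_len=1000, ramp_fraction=0.5):
    """Return the KL weight for a given training step."""
    pos = (step % cycle_len) / cycle_len
    return min(pos / ramp_fraction, 1.0)

# loss = reconstruction_loss + cyclical_beta(step) * kl_divergence
```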
Overfitting is another significant challenge. Although hyperparameter space exploration has helped mitigate this issue to some extent, it remains a problem, probably because of the limited amount of training data. With around 600,000 trainable parameters and only approximately 62,000 training sequences, data scarcity is a concern. To address this, data augmentation is being implemented by adding small Gaussian noise to the joint points of both dancers in each frame. However, even if the data is doubled, acquiring additional training sequences is likely necessary to achieve good results.
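The augmentation itself can be as simple as the sketch below; the noise scale shown is an arbitrary placeholder, not the tuned value used in the project.

```python
import torch

def augment_with_noise(sequences, sigma=0.01):
    """sequences: tensor of shape (num_sequences, seq_len, num_joints, 3)."""
    noisy = sequences + sigma * torch.randn_like(sequences)
    return torch.cat([sequences, noisy], dim=0)  # doubles the training data
```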
Currently, the reconstruction loss decreases significantly for the training split. For the validation split though, it initially decreases during the first few epochs and then begins to increase. The challenge is that even when employing early stopping to prevent overfitting, the visual results remain pretty bad. This suggests that while the model is overfitting to the training data (evidenced by the rising validation loss), it is also underfitting to the problem overall, as the performance remains poor.
Regularization techniques such as dropout and batch normalization are already in place, and various learning rates and learning rate schedules have been tested. These efforts reinforce the need to acquire more data to adequately address the model’s complexity and improve overall performance, together with adaptations to the architecture to better handle sequences.
Temporal Model
The goal of this model is to return to a more classical configuration, previously used in the study of individual choreography, employing a temporal architecture to better manage sequences. The key difference is the integration of GNNs into the model. This allows us to leverage the graph structure, either generated by the previous model or suggested by the user, for predicting new sequences.
There are two main ideas yet to be tested: using GNNs as a preprocessing step to enrich the data before applying a VAE with LSTMs, or directly using a graph recurrent neural network to adapt the Symmetric Convolutional Autoencoder architecture.
Due to the extensive amount of work still required on the previously presented model, these temporal models have not been implemented yet.
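Purely for illustration, and keeping in mind that nothing here is implemented yet, the first idea could take a shape like the following: a GCN pass enriches each frame’s joint features, and an LSTM then consumes the per-frame summaries. All names and dimensions are assumptions.

```python
import torch
from torch_geometric.nn import GCNConv

class GraphThenLSTM(torch.nn.Module):
    def __init__(self, in_dim, graph_dim, seq_dim):
        super().__init__()
        self.gcn = GCNConv(in_dim, graph_dim)
        self.lstm = torch.nn.LSTM(graph_dim, seq_dim, batch_first=True)

    def forward(self, frames, edge_index):
        # frames: (seq_len, num_joints_both, in_dim)
        enriched = torch.stack(
            [torch.relu(self.gcn(f, edge_index)).mean(dim=0) for f in frames]
        )                                          # (seq_len, graph_dim)
        out, _ = self.lstm(enriched.unsqueeze(0))  # add a batch dimension
        return out                                 # (1, seq_len, seq_dim)
```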
Author’s Note
And that was a comprehensive overview of everything I developed over the past two months. I hope it was clear and that you enjoyed reading about it as much as I enjoyed working on it! Stay tuned for the next post in two months, where I’ll share the final updates on the project! In the meantime, have a great day and, perhaps, enjoy working on implementing your own version of the pipeline or enhancing mine!