Duet ChorAIgraphy: Dance Meets AI (Again) Part 2
Greetings everyone! I’m Luis Zerkowski and, for the past four months, I’ve spent much of my time working on a super exciting Google Summer of Code 2024 project for the HumanAI Foundation, and I’m very happy to come back with my final updates! Welcome once again to Duet ChorAIgraphy, where we explore the beauty of the connections between two dancers in a duet through the lens of machine learning! If you missed the last post and want to catch up on the project’s timeline, check it out here!
This project is heavily inspired by work done by my amazing supervisors (choreography and choreo-graph), who have previously explored the connections between AI and dance for solo performances. During the selection process, I also reproduced one of these pipelines.
Now, we’re taking it a step further and exploring how AI can provide insights into choreographies and inspire creativity for partnering, mostly by visualizing connection subtleties in a duet. On my GitHub repo, you’ll find everything you need to dive into this fascinating world — from code to results, with very detailed documentation.
Having mentioned them before, I owe special thanks to my supervisors, Mariel Pettee and Ilya Vidrin, for all their guidance and support. I also want to thank my work partner, Zixuan Wang, for developing her pipeline alongside mine.
So whether you’re a dancer, a tech enthusiast, or just curious about the intersection of art and AI, I invite you to join me in this adventure. Let’s see how technology can push the boundaries of what’s possible in the world of dance 😁
OK, But What Exactly Is Duet ChorAIgraphy?
It’s a project that aims to implement a pipeline using Graph Neural Networks (GNNs) to study dance duets. The work focuses mostly on Interpretability of Movements: learning about the connection between the dancers’ bodies in different dance sequences.
Below, I discuss the preparatory work needed before implementing the GNN, as well as the state of the model together with some results. For specifics on how to set up your own version of the project, I encourage you to check my GitHub repo, which contains thorough documentation of all steps.
Pose Extraction Pipeline
A key part of any artificial intelligence project is gathering data. To find the best open-source pose extraction tool for the project’s needs, several tools were tested over a few weeks. This section highlights the main tools tested, providing descriptions and comparisons that led to the final choice. It also includes step-by-step instructions on how to install and use the selected tool, making it easy to set up.
Before diving into the technical details, I want to extend special thanks to Ilya Vidrin and his dance team for providing the choreographies used in this project. The images shown here come directly from their performances.
2D Model Exploration
The initial exploration focused on 2D pose extraction models, primarily using AlphaPose. This software, known for its modern models and better pose estimation results compared to most other open-source systems, was thoroughly tested.
All models within the AlphaPose model zoo were evaluated, and four final options were selected, as shown in the figure below. These models represent some of the best average results from the repository and pretty much cover all the possible outputs one can get.
To more accurately represent the body, with clear markers for the pelvis and torso, the Halpe models were chosen. To evaluate their quality without any preprocessing, the animations below were created.
It is clear, however, that both models require further processing for actual usage. They struggle with some frames, losing track of one or both dancers, or repeating poses and placing the dancers in the same position. Additionally, the joints appear unstable, showing significant local vibration and resulting in very shaky poses. While the models promise better results when a tracking pipeline is added, the improvements were found to be minimal, as shown in the GIFs below. Some form of filtering between frames is still needed to achieve more stable pose extraction.
The 2D processing pipelines, however, were not fully implemented and won’t be discussed further here, as the decision was made to use 3D models instead. 3D pose extraction adds crucial information for analyzing and generating choreography, especially for pairs of dancers. It captures the richness of each dancer’s movements and the subtleties of their interactions in the full space, which would be lost if the dimensionality were reduced.
3D Model Exploration
With this in mind, two main 3D pose extraction models were explored: VIBE and HybrIK (the latter being part of the AlphaPose pipeline). From the experiments, it became clear that the model integrated with AlphaPose performs much better than VIBE, excelling both in identifying the instance and in accurately extracting poses. The GIFs below show the pose extraction and mesh reconstruction performed by HybrIK.
The rest of this section, therefore, focuses on using AlphaPose for 3D pose extraction. The examples shown from now on also better represent an actual scenario, as AlphaPose’s version of HybrIK supports multi-instance pose extraction, allowing poses to be extracted for both dancers simultaneously.
Once the pipeline is chosen, the next crucial step is data processing. Selecting a method doesn’t mean the data is already clean and ready to use. The GIF below shows common issues, and this section explains the solutions implemented to prepare the data for the models.
The problems addressed were:
- Missing frames: When a frame was lost because no pose was identified, we replicated the poses from the previous frame. This solution worked well due to the small number of missed frames and the high sampling rate (approximately 30 FPS), which prevented noticeable impacts on movement.
- Frames with only one person: When the model captured only one person in a frame, we compared the sum of Euclidean distances between corresponding joints for the identified person and the two people in the previous frame. We then added the person from the previous frame with the greater distance to the current frame (assuming this was the non-captured person).
- Frames with more than two people: When the model identified more than two people in a frame, we retained the two people with the highest confidence scores and removed the rest, as we know in advance that our data only contains two people per frame.
- Index matching: When the model lost track of people, swapping their indices back and forth over time, we scanned all frames and used the aforementioned sum of Euclidean distances between corresponding joints to correct the inversions.
- Vertex jitter: When the model introduced local inaccuracies that varied the positions of vertices beyond the actual movement, creating a jitter effect, we tested a few different methods and ended up using a 3D DCT low-pass filter (25% threshold) to smooth the data (a sketch of the index matching and smoothing steps follows this list).
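To make the last two fixes more concrete, here is a minimal sketch of how index matching and DCT smoothing could look, assuming the poses are stored in a NumPy array of shape (frames, 2 dancers, joints, 3). This is an illustration of the idea, not the exact code from the repo.

```python
import numpy as np
from scipy.fft import dct, idct

def match_indices(poses):
    """Keep dancer identities consistent by comparing each frame to the previous one."""
    for t in range(1, len(poses)):
        keep = np.linalg.norm(poses[t] - poses[t - 1], axis=-1).sum()        # cost of keeping the current order
        swap = np.linalg.norm(poses[t, ::-1] - poses[t - 1], axis=-1).sum()  # cost of swapping the two dancers
        if swap < keep:
            poses[t] = poses[t, ::-1].copy()  # undo an index swap introduced by the tracker
    return poses

def dct_lowpass(poses, keep_ratio=0.25):
    """Smooth joint trajectories over time by keeping only the lowest 25% of frequencies."""
    coeffs = dct(poses, axis=0, norm="ortho")
    cutoff = int(len(poses) * keep_ratio)
    coeffs[cutoff:] = 0.0
    return idct(coeffs, axis=0, norm="ortho")
```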
The GIFs below show the original video together with the final results for both dancers after applying the full pipeline.
Model
After extensive data processing, attention was turned to developing the model itself. The use of AI in dance allows for a wide range of creative possibilities. Many different tasks can be explored, but one stands out in this project: dance interpretability to help in the generation of new phrases.
The task involves studying the hidden movements of a duet, extracting information that is not visually evident. For example, it might not be immediately clear that the movement of one dancer’s right foot is directly connected to the movement of the other dancer’s hip or left hand. The objective is thus to model and tackle a Graph Structure Learning problem, uncovering the connections (or different types of connections) between the dancers’ joints. More specific technical details are described in a dedicated section below.
Loading Data
The next step in the project involves loading the dance data for AI modeling. First, the preprocessed data is read in an interleaved manner to separately extract data from both dancers. Adjacencies are then created by initializing a default skeleton with 29 joints for each dancer and connecting every joint of one dancer to all joints of the other.
The purpose of mapping both dancers’ skeletons in this way is to ensure the model focuses on the connections between them, rather than the connections within each individual dancer. While a dancer’s joint movements can often be predicted by analyzing their other joints, the aim here is to highlight the interactions between the dancers, identifying which joints of one dancer influence the other and vice versa. Additionally, both directions of the edges are considered, allowing the model to assess the direction of influence between each joint of the two dancers.
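As a rough illustration of this adjacency, the sketch below builds a bidirectional, fully connected bipartite edge list between the two 29-joint skeletons using PyTorch Geometric’s edge_index convention; the exact implementation in the repo may differ.

```python
import torch

NUM_JOINTS = 29
dancer_a = torch.arange(NUM_JOINTS)               # joints 0..28 belong to dancer A
dancer_b = torch.arange(NUM_JOINTS) + NUM_JOINTS  # joints 29..57 belong to dancer B

# Connect every joint of dancer A to every joint of dancer B, in both directions,
# so the model can later weigh the direction of influence between the dancers.
src, dst = torch.meshgrid(dancer_a, dancer_b, indexing="ij")
a_to_b = torch.stack([src.flatten(), dst.flatten()])
b_to_a = torch.stack([dst.flatten(), src.flatten()])
edge_index = torch.cat([a_to_b, b_to_a], dim=1)   # shape (2, 2 * 29 * 29) = (2, 1682)
```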
These connections are what the graph structure learning model will classify categorically by their degree of influence on the movement, ranging from non-existent to core connections. The data is then prepared for model training by creating batches with PyTorch tensors. The tensors are structured with dimensions representing the total number of sequences, the sequence length, the number of joints from both dancers, and 3D coordinates plus 3D velocity estimates. Finally, a training-validation split is created to allow for proper model hyperparameter tuning.
To include data augmentation and improve model generalization, the training pipeline incorporates a data processing step that involves rotating batches of data. Each batch is rotated along the Z-axis by a randomly selected angle while maintaining the original X and Y-axis orientations for physical consistency. This approach helps prevent the model from overfitting to the dancers’ absolute positions.
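A minimal sketch of this augmentation step, assuming batches shaped (batch, frames, joints, 6) with xyz positions followed by xyz velocity estimates (the exact tensor layout in the repo may differ):

```python
import math
import torch

def rotate_batch_z(batch):
    """Rotate a whole batch around the Z-axis by a random angle."""
    theta = torch.rand(1).item() * 2 * math.pi
    c, s = math.cos(theta), math.sin(theta)
    rot = torch.tensor([[c, -s, 0.0],
                        [s,  c, 0.0],
                        [0.0, 0.0, 1.0]])
    pos, vel = batch[..., :3], batch[..., 3:]
    # Positions and velocities rotate with the same matrix; the Z-axis stays untouched.
    return torch.cat([pos @ rot.T, vel @ rot.T], dim=-1)
```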
Due to the high complexity of the problem, both in the number of moving particles and the number of edges in the graph, data simplification was implemented to achieve reasonable reconstruction performance and improve reliability during the sampling of influential edges. Random sampling of joints is applied, and only the subproblem of connecting the sampled joints is studied. This approach led to interesting results and a proof of concept that can be further explored to understand how to scale it.
Neural Relational Inference Variant
As the title suggests, this model is a variant of the Neural Relational Inference (NRI) model, which itself is an extension of the traditional Variational Autoencoder (VAE). The primary objective of the original model is to study particles moving together in a system without prior knowledge of their underlying relationships. By analyzing their movement (position and velocity), the model estimates a graph structure that connects these particles, aiming to reveal which particles exert force on the others.
For a dedicated study on the architecture’s efficacy, some experiments discussing results based on a charged particles simulation dataset from the original paper were conducted. This is because, during the development of the model for the dance scenario, challenges arose in reconstructing frames and predicting edges. These issues, along with extensive model iterations, data augmentations, and various experiments, led to questioning the architecture’s ability to learn the desired dynamics, prompting further testing.
Addressing a Simpler Problem
To better assess the model’s capabilities, the decision was made to revisit the original NRI paper and use their charged particles dataset. The original problem was much simpler — featuring ten times fewer particles, hundreds of times fewer edges, a 2D environment and a greater availability of data — providing a clearer benchmark for evaluating the NRI variant’s performance on a less complex task.
The experiment provided valuable insights. Even in a simplified version with significantly reduced hidden dimensions, the model showed the ability to produce reasonable reconstructions and edge predictions. Although the results are not perfect, the model’s potential to address the graph dynamics problem became clear.
The data generation process involved using the data/generate_dataset.py script with the default settings from the original project repository, but with the option to simulate charged particles in motion. This resulted in a training dataset of 50000 simulations, each simulation including 5 particles with 2D coordinates and 2D velocities, moving over 49 frames. The interactions between charges are defined by a random undirected graph, which controls the particles’ accelerations, affecting their speed and position in each frame.
To prepare the model input, the original sequence of 49 frames is first reduced to 48 frames to simplify testing different sequence splits. Subdivisions are then tested using sub-trajectories of 8, 12, 16, or 24 frames. To keep the training process within a reasonable time frame (under 24 hours), non-overlapping sequences are used, meaning each simulation is divided into 48/sequence_length non-overlapping segments.
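As an illustration, and assuming the simulations are stored as a tensor of shape (simulations, 49, particles, 4), the split could be sketched like this:

```python
import torch

def split_trajectories(trajs, seq_len):
    """Split each simulation into non-overlapping sub-trajectories of seq_len frames."""
    trajs = trajs[:, :48]                  # drop the last frame: 48 is divisible by 8, 12, 16 and 24
    num_sims, total, particles, feats = trajs.shape
    segments = total // seq_len            # e.g. 48 // 12 = 4 segments per simulation
    return trajs.reshape(num_sims * segments, seq_len, particles, feats)
```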
In addition, the input needs to be prepared from the graph’s perspective. The approach, similar to the original choreography model, involves using a fully connected undirected graph. This allows the model to later sample edges based on the input sequence and retain only those relevant for reconstructing the trajectory. The image below illustrates an example of the input graph from a random frame in a random sequence.
Using samples from the data described above, different architectures were tested to assess the model’s ability to reconstruct sequences. In an autoencoder and for a self-supervised task, it’s important not only to look at loss curves but also to observe the model’s outputs. In this type of architecture, and particularly in a subjective context like duet choreography, the sampled edges in the latent space may seem reasonable even when the model is completely off track.
Since the reconstruction heavily relies on the sampled edges, evaluating the quality of the reconstructions helps to better judge the quality of the interactions suggested by those edges. This effect is evident in some experimental results. Usually, when a particle has no sampled interactions, it stays still in the center. Also, each particle’s movement is naturally linked to the particles it interacts with, so good edge sampling is crucial for accurate reconstructions.
To avoid making this subsection any longer, only one of the top-performing models is presented here: a compact architecture, a simplified version of the originally implemented encoder, trained on 6-frame sequences but with more edge types (4 instead of binary). The goal was to reduce the complexity of the architecture to see if the model could still capture the relationships between particles and to determine if the problem could be solved with fewer data transformations.
In the end, it becomes clearer that the implemented architecture shows potential for addressing such a task effectively. In a simplified scenario with more abundant data, it was observed that after just a few iterations (between 10 and 20 epochs, depending on the model’s training time), the reconstruction loss curves displayed typical and more controlled behavior.
Back to the Original Problem
Going back to the original context of the project, the particles are represented by the joints of dancers. While the physical connections between joints within a dancer’s body are known, this information alone is insufficient to understand the partnering relationships between two dancers.
Since a target graph structure correctly identifying which joints are virtually connected during a dance performance is unavailable, and considering that this graph can change over time even within a performance, self-supervising techniques are employed — one of the reasons for choosing an autoencoder framework.
The model consists of an encoder and a decoder, both of which revolve around transforming node representations into edge representations and vice versa. This approach emphasizes the dynamics of movements rather than fixed node embeddings. Moreover, the encoder specifically outputs edges, sampling these from the generated latent space, making it essential to switch between representations.
This project’s implementation, even though very similar to the NRI MLP-Encoder MLP-Decoder model, includes a few important modifications:
- Graph Convolutional Network (GCN): Some Linear layers are replaced with GCN layers to leverage the graph structure, improving the model’s ability to capture relationships between joints. This change focuses on a subset of edges connecting both dancers rather than studying all particle relationships as in the original implementation. Additionally, GCNs provide local feature aggregation and parameter sharing, important inductive biases for this context, resulting in enhanced generalization in a scenario with dynamic (and unknown) graph structures.
- Graph Recurrent Neural Network (GRNN) Decoder: To make better use of sequential information and achieve a more suitable final embedding for predicting (or reconstructing) the next frame, beyond just spatial information from the graphs, it is essential to use a recurrent network. The decoder therefore runs LSTM cells over the original sequence, while also using the graph structure sampled from the latent space generated by the encoder.
- Custom GCN-LSTM Cells: To use the recurrent structure crucial for sequence processing while maintaining graph information and a GNN architecture, the classic LSTM cell has been reimplemented with GCN layers (a sketch of such a cell follows below). In the final version of the architecture, only the decoder incorporates the recurrent component, which generates a final sequence embedding that the model uses to reconstruct the next frame.
By incorporating these modifications, the model maintains the core principles of the original NRI model while theoretically enhancing its ability to generalize and adapt to the dynamic nature of dance performances.
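To make the last modification more tangible, here is a minimal sketch of an LSTM cell whose gates are computed with graph convolutions instead of Linear layers, using PyTorch Geometric’s GCNConv. This illustrates the general idea rather than the project’s exact cell.

```python
import torch
from torch_geometric.nn import GCNConv

class GCNLSTMCell(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        # One graph convolution per gate, applied to [input, hidden] concatenated.
        self.gates = torch.nn.ModuleList(
            [GCNConv(in_dim + hidden_dim, hidden_dim) for _ in range(4)]
        )

    def forward(self, x, edge_index, h, c):
        z = torch.cat([x, h], dim=-1)
        i = torch.sigmoid(self.gates[0](z, edge_index))  # input gate
        f = torch.sigmoid(self.gates[1](z, edge_index))  # forget gate
        o = torch.sigmoid(self.gates[2](z, edge_index))  # output gate
        g = torch.tanh(self.gates[3](z, edge_index))     # candidate state
        c = f * c + i * g
        return o * torch.tanh(c), c                      # new hidden and cell states
```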
Encoder
- The encoder begins with a GCN layer, which transforms node representations into edge representations (a sketch of the node-to-edge and edge-to-node switches follows this list).
- This is followed by a Linear layer, batch normalization, and dropout.
- Next, the edge representations are converted back into nodes, and another GCN layer is applied.
- The nodes are then transformed back into edges, followed by another Linear layer with a skip connection from the previous dropout layer.
- Finally, a Linear layer generates logits that represent edge types, ranging from non-existent edges to those most critical for the movement being analyzed.
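For reference, the node-to-edge and edge-to-node switches mentioned above can be sketched roughly as follows, assuming node features of shape (num_nodes, dim) and an edge_index of shape (2, num_edges); the repo’s implementation may aggregate differently.

```python
import torch

def node2edge(x, edge_index):
    src, dst = edge_index
    # Each edge representation concatenates the features of its two endpoint nodes.
    return torch.cat([x[src], x[dst]], dim=-1)          # (num_edges, 2 * dim)

def edge2node(edge_feats, edge_index, num_nodes):
    dst = edge_index[1]
    out = edge_feats.new_zeros(num_nodes, edge_feats.size(-1))
    # Each node representation averages the representations of its incoming edges.
    out.index_add_(0, dst, edge_feats)
    counts = torch.bincount(dst, minlength=num_nodes).clamp(min=1).unsqueeze(-1)
    return out / counts
```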
Decoder
- The decoder starts by sampling from a Gumbel-Softmax distribution using the logits generated by the encoder (see the sketch after this list). This approach approximates sampling from a discrete distribution with a continuous relaxation and employs Softmax to handle the reparameterization trick, ensuring the pipeline remains fully differentiable.
- With the newly sampled edge index, the decoder processes the data through a GRNN composed of modified LSTM nodes with GCN layers, followed by a transformation of the final sequence embedding into edge representations.
- This is followed by a Linear layer, batch normalization, and dropout.
- Finally, the edge representations are converted back into nodes, and a GCN layer is applied to predict (or reconstruct) the next frame.
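The sampling step itself is available out of the box in PyTorch; here is a minimal sketch, with the temperature value chosen arbitrarily for illustration:

```python
import torch.nn.functional as F

def sample_edge_types(logits, tau=0.5, hard=False):
    # Soft, differentiable samples during training; hard one-hot samples at evaluation.
    return F.gumbel_softmax(logits, tau=tau, hard=hard, dim=-1)
```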
Results
Getting this model to work was quite a challenge. The inherent complexity of Variational Autoencoders alone introduces numerous training difficulties. When combined with the complexity of the problem itself, the nuances of working on graph neural networks with dynamic graphs, and the task of generating a latent space that approximates a discrete distribution, it becomes a recipe for confusion. Understanding each part that needed adjustment, up to the challenge of training the neural network itself, was a long and involved journey. Still, in the end, some interesting results were achieved.
Given the subjective nature of the problem, it’s hard to definitively evaluate the quality of the sampled edges. As such, the best analyses focus on the loss curves and the reconstructions obtained. The reasonableness of these two elements serves as a proxy for evaluating the sampled edges. In addition, some observations about patterns in edge sampling and a personal evaluation of the predicted relationships are included.
Due to the slow training process, the models discussed here were trained for 20 epochs. The resulting loss curves show healthy reconstruction error, and when looking at the validation dataset, there is no evidence of overfitting. However, the plateauing of the validation loss makes it uncertain whether further training would yield significant improvement; this could only be explored by continuing training for more epochs.
On the other hand, the KL-divergence loss behaves less well than hoped. It decreases sharply in the first few epochs and then plateaus. This is an unwanted effect, which can affect the quality of the latent space. Although the phenomenon is reduced by training with beta coefficients that anneal the weight of the KL term, this is less effective in short training runs, since the schedule depends on the number of epochs.
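For context, a beta schedule of this kind can be sketched as below; the warm-up fraction and maximum weight are illustrative assumptions, not the project’s actual values.

```python
def beta_schedule(epoch, num_epochs, warmup_fraction=0.5, beta_max=1.0):
    # Ramp beta linearly from 0 to beta_max over the first half of training,
    # so the reconstruction term dominates early on and the KL term grows later.
    warmup = max(1, int(num_epochs * warmup_fraction))
    return beta_max * min(1.0, epoch / warmup)

# loss = reconstruction_loss + beta_schedule(epoch, num_epochs) * kl_divergence
```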
Regarding the reconstructions, there is significant variability. They are extremely sensitive to the sampled edges, which makes sense because, in a graph network, information spreads through neighboring nodes. If a joint is not connected to others by an edge, its reconstruction degenerates and the particle stays stationary at the origin of the coordinate frame. It’s common to find some of these in the reconstructions because the focus is on a specific category of edges among all edge types — those considered essential to the dancers’ interactions. This limits the sampling, especially due to the prior distributions, which assume the interaction being targeted is rare.
For the better reconstructions, it’s clear that the particles are well positioned. The sampled dancers’ particles are placed accurately in space, particularly in relation to the rest of the body. Additionally, the particles move in the same direction as the overall body movement, supporting good relative positioning. Given that the sampled connections only link particles between the dancers, it’s impressive to see the valuable interaction captured in predicting their movement. It’s evident, however, that the best reconstructions occur when a particle has more than one connection, allowing information to propagate from one dancer to the other and back within two hops.
Still, noticeable shaking in these particles is present and likely caused by two main factors: the inaccuracy in particle location and velocity in the original sequence, which already shows significant jitter, and the fact that the reconstructions are generated independently, although with the same sampled graph. This is because the best model version only predicts the next frame of a sequence. The impact of using this reconstructed frame as part of the input sequence to predict the next future frames in a chain was not explored.
Bad reconstructions unfortunately occur in several cases. First, as mentioned before, the reconstructions are highly sensitive to the sampled edges, and depending on the graph, they can be very poor, with the sampled joints barely moving. The model also struggles with sequences where the dancers switch sides, causing the reconstructed joints to remain in the initial position where the dancers started.
In more extreme cases, when the dancers move away from the center of the coordinate frame, the reconstructions tend to stay near the center. This stems from a normalization problem in the data processing pipeline that was identified only later: the normalization of both dancers was removed to capture their relative movements, but a layer that normalized their combined movement was overlooked. As a result, the model has more difficulty learning and dealing with these corner cases.
Despite the challenges, the results achieved are still quite interesting and much better than before. The loss curve shows much more typical behavior, and the reconstruction results, while still facing significant issues, are becoming more reasonable. Given that each frame is reconstructed individually, making it difficult to maintain consistency between sequential frames, the particle positioning finally moved in the right direction. This made it possible to examine the learned edge distribution and the sampled edges across different examples to better understand how the model perceives the connections between particles, which, in turn, aids in reconstructing the frames.
It’s also evident that the number of edges sampled with a confidence above 80% is consistently small, roughly matching the percentage of core edges in the prior distribution (for both 3 and 4 edge types). This indicates that the learned latent space indeed reflects the initially suggested distribution.
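As an illustration of this check (under the assumption that “confidence” here means the sampled probability assigned to the core edge type), one could measure the fraction of high-confidence core edges like this:

```python
def high_confidence_fraction(edge_probs, core_type=-1, threshold=0.8):
    # edge_probs: (num_edges, num_edge_types) Softmax/Gumbel-Softmax output.
    return (edge_probs[:, core_type] > threshold).float().mean()
```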
Now regarding the edges specifically, a few clear patterns emerge. First, among the core edges, most have the same confidence, with normally only one or two edges standing out less. This suggests a low hierarchy among the sampled edges, meaning that once an edge is part of the core group, it has a significant role in information propagation.
Additionally, it’s common to find a particle connected by multiple edges. This likely relates to the earlier observation that information spreads more effectively within a dancer when there is a path that leaves and returns to the same particle within two hops.
Another important pattern, though more subjective, is the connection between particles that are leaning in opposition, as if a string between them were being stretched or pulled. It seems that connections are more easily formed when particles are in this state of tension. This is intriguing because it reflects a key element of many movements in the choreography dataset, aligning with the project’s core goal: to help dancers recognize the subtle dynamics in their partnering relationships.
Future of the Project
Despite the interesting results, there is clearly room for improvement. The main issues are the reconstructions, which remain unreliable and overly sensitive to the sampled edges, and the fact that only sampled joints are analyzed, not the dancers as a whole. However, there are many other challenges still to be addressed. Considering all of this, several next steps are suggested for further development of this project:
- Data collection: Despite the use of data augmentation through duet rotation, the small dataset size made it difficult to train a complex network, especially given the difficulty of the problem.
- Data quality: The pipeline used to extract 3D poses from video, though functional, has room for improvement. Even in the original sequences, the dancers’ particles are shaky and often poorly approximated, leading to extreme, random, and unrealistic movements. Moreover, normalization of both dancers was removed to preserve relative movement, but it was realized too late that a new layer normalizing their combined movement should have been added. Having the dancers in different parts of the space confused the model.
- Architecture exploration: While many versions of the NRI variant were implemented and tested, the final version is still far from being ideal, and there hasn’t been a true breakthrough moment. There is still significant potential in this architecture, especially given the strong results achieved by the original version.
- Processing speed: Several parts of the final architecture are custom implementations, so they are far from optimized. Batches are replaced with sequential operations at several points in the pipeline: in transformations between node and edge representations, in the decoder, since a new set of edges is sampled for each sequence in a batch, and in the GRNN due to its sequential nature. As a result, a training cycle with just a few dozen epochs can take an entire day. Optimizing this process is crucial for the project to scale.
- Interaction with dancers: Since this project sits at the intersection of art and technology, direct interaction with artists is essential. In a more refined version of the model with consistent results, it would be ideal to present the tool to the dance community, share its concept, demonstrate the results, and observe how dancers use the tool in their own partnering studies.
Author’s Note
And that was a comprehensive overview of everything I developed over the past four months. I hope it was clear and that you enjoyed reading about it as much as I enjoyed working on it! I sincerely thank you, reader, for taking a few minutes to learn about my work. I’m also grateful to GSoC and HumanAI for the chance to work on such a fascinating and challenging project. I hope you have a great day and, perhaps, enjoy implementing your own version of the pipeline or enhancing mine!