About a year ago, we got engaged and started planning our wedding. As part of the planning process we were going through the typical traditional wedding agenda and choosing which things we wanted to do — a toast (yup); bouquet toss (nope); a wedding party (yup). We weren’t sure at first if we wanted to do a first dance, but then we had an idea:
A first dance is a chance to celebrate something that reflects our relationship for the first time as a married couple.
So we decided that instead of a dance, we would share a first dance project. We knew it would be a little cringy, but then again so is a first dance if you can’t dance.
Ok, now let’s get technical.
A neural network is an artificial intelligence algorithm that allows you to teach your computer how to do things based on a bunch of examples. A classic example, shown above, is “teaching” (or as it’s more commonly called, “training”) a neural network to distinguish a picture of a dog from a picture of a cat. This is done by feeding a bunch of cat and dog pictures into the network, telling it which is which, and allowing it to learn which features of a picture correspond to which animal. There are lots of great descriptions online of how neural networks actually work, at every level of mathematical background.
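To make “learning from examples” concrete, here is a minimal sketch in plain NumPy: a single-neuron “network” trained by gradient descent to separate two invented feature clusters standing in for cat and dog pictures. The features and data are made up for illustration; a real image classifier is far bigger, but the train-on-labeled-examples loop has the same shape.

```python
import numpy as np

# Invented 2-D "features" standing in for cat pictures (label 0)
# and dog pictures (label 1). Real networks learn features from
# raw pixels; here we hand them two separable clusters.
rng = np.random.default_rng(0)
cats = rng.normal(loc=[-1.0, -1.0], scale=0.3, size=(50, 2))
dogs = rng.normal(loc=[+1.0, +1.0], scale=0.3, size=(50, 2))
X = np.vstack([cats, dogs])
y = np.array([0] * 50 + [1] * 50)

w = np.zeros(2)
b = 0.0
for _ in range(500):                       # gradient-descent training loop
    p = 1 / (1 + np.exp(-(X @ w + b)))     # network's current predictions
    grad = p - y                           # error signal: prediction minus label
    w -= 0.1 * (X.T @ grad) / len(y)       # nudge weights to reduce the error
    b -= 0.1 * grad.mean()

p = 1 / (1 + np.exp(-(X @ w + b)))         # predictions with final weights
accuracy = ((p > 0.5) == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

The network never sees the labels’ meaning; it only learns which direction in feature space separates the two piles of examples.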
Rather than having a first dance, our plan was to train a neural network to dance for us!
Plan A: Choreography
Our wedding was on March 2, 2019. Around Christmas of 2018, we decided that we really needed to work on this project in order to have it done in time. So, we sat down at Starbucks and mapped out our first attempt.
Based on a recent paper, we wanted to use a recurrent neural network (RNN) to learn to choreograph a dance for the two of us. The idea here is that the RNN is fed many frames of a choreographed dance. It is trained by learning to predict the next frame of the dance from the frames that came before it (hence the name “recurrent”).
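For the curious, the recurrence can be sketched in a few lines of NumPy. This is not our actual training code: the weights below are random and untrained, and the 18-joint stick-figure size is an assumption borrowed from common pose-estimation formats. It only shows how a hidden state carries the history of the dance forward from frame to frame:

```python
import numpy as np

# Each dance frame is a vector of (x, y) joint coordinates for one
# stick figure; 18 joints is a common pose-estimation convention.
rng = np.random.default_rng(1)
n_joints, hidden = 18 * 2, 32

# Random (untrained) weights, purely to illustrate the data flow.
Wxh = rng.normal(0, 0.1, (hidden, n_joints))   # input -> hidden
Whh = rng.normal(0, 0.1, (hidden, hidden))     # hidden -> hidden (the recurrence)
Why = rng.normal(0, 0.1, (n_joints, hidden))   # hidden -> predicted next frame

def predict_dance(frames):
    """Roll the RNN over a sequence; return a predicted next frame per step."""
    h = np.zeros(hidden)
    preds = []
    for x in frames:
        h = np.tanh(Wxh @ x + Whh @ h)   # hidden state summarizes all frames so far
        preds.append(Why @ h)            # guess at the next frame of the dance
    return np.array(preds)

frames = rng.normal(size=(10, n_joints))   # ten fake pose frames
next_frames = predict_dance(frames)
print(next_frames.shape)                   # → (10, 36)
```

Training would compare each prediction against the frame that actually came next and adjust the three weight matrices to shrink the difference.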
We grabbed some data from a wonderful “cha-cha” training video and extracted stick figures using a pre-trained neural network. We trained the RNN for ~1000 iterations on our laptops (about 10 minutes). Below is one of our favorite results:
Obviously, this method didn’t work according to plan. But we didn’t give up!
Plan B: Steal Code!
At this point, it was February and our wedding was coming up fast. So we did what we know best: stealing better code from other people.
Luckily, a recent paper that garnered a lot of media attention had done a very similar project to what we wanted. “Everybody Dance Now” uses a special neural network called a generative adversarial network (GAN), which basically pits two neural networks against one another: one which generates fake images, and one which learns to identify real vs simulated images.
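As a toy illustration of “pitting two networks against one another,” here is a one-dimensional GAN sketched in NumPy. Both networks are single linear units with hand-derived gradients, and the “real images” are just samples from a Gaussian, so this is a cartoon of the adversarial objective, nothing like the real pix2pixHD-scale model:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(2)
g_a, g_b = 0.1, 0.0    # generator: fake = g_a * z + g_b
d_w, d_c = 0.1, 0.0    # discriminator: score = sigmoid(d_w * x + d_c)

for _ in range(2000):
    real = rng.normal(3.0, 0.5, 64)        # "real images": samples near 3.0
    z = rng.normal(0.0, 1.0, 64)           # random noise fed to the generator
    fake = g_a * z + g_b                   # generator's fake samples

    # Discriminator step: push real scores toward 1, fake scores toward 0.
    dr = sigmoid(d_w * real + d_c)
    df = sigmoid(d_w * fake + d_c)
    d_w += 0.05 * ((1 - dr) * real - df * fake).mean()
    d_c += 0.05 * ((1 - dr) - df).mean()

    # Generator step: push the discriminator's fake scores toward 1,
    # i.e. learn to fool it.
    df = sigmoid(d_w * fake + d_c)
    g_grad = (1 - df) * d_w                # gradient of log D(fake) w.r.t. fake
    g_a += 0.05 * (g_grad * z).mean()
    g_b += 0.05 * g_grad.mean()

print(f"generator offset after training: {g_b:.2f} (real data centers at 3.0)")
```

The two updates pull in opposite directions, and at equilibrium the generator’s fakes are drawn toward where the real data lives.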
The idea is actually pretty simple. We (well, the people we stole the code from) train one neural network system to take an image of a person and extract a human figure from it. They then train a different neural network system to do the opposite (translate a human figure into a real image). So we could extract a human figure from a video of a professional dancer with the first neural network, and have the other neural network turn those stick figures into videos of us dancing well!
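Here is that pipeline in outline form, with both stages replaced by trivial stand-ins. The real project used Realtime Multi-Person Pose Estimation for stage one and pix2pixHD for stage two; the function bodies below are placeholders just to show how the two stages chain together, frame by frame:

```python
import numpy as np

def extract_pose(frame):
    """Stage 1 stand-in: image of a dancer -> stick-figure map.
    (The real version is a pose-estimation network.)"""
    return (frame > 0.5).astype(np.float32)

def pose_to_person(pose, appearance):
    """Stage 2 stand-in: stick figure -> image of the target person.
    (The real version is a pix2pix-style image translator.)"""
    return pose * appearance

source_video = rng = np.random.rand(30, 64, 64)   # fake "professional dancer" frames
our_appearance = np.full((64, 64), 0.8)           # fake "what we look like" image

output_video = np.stack([
    pose_to_person(extract_pose(f), our_appearance)   # dancer's moves, our body
    for f in source_video
])
print(output_video.shape)   # → (30, 64, 64): same frame count and size as the source
```

The key point is that the stick figure is the only thing passed between the stages, so the moves come from the source dancer while the appearance comes from whoever stage two was trained on.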
Their paper is very snazzy and has some fun bells and whistles to make their dance videos really smooth. Unfortunately, their code wasn’t public. However, someone released a simplified version (based on the PyTorch framework) on GitHub. This code was built using two really awesome open-source projects: Realtime Multi-Person Pose Estimation (which does the figure extraction) and pix2pixHD (which translates one image into another).
So we were in business!
Well, almost. There was one small hangup. We could no longer use our laptops to train the model, because they did not have accessible graphics processing units (GPUs). Luckily, Ashley has access to Harvard’s supercomputer, known as Odyssey. Odyssey has over 78,000 cores and 40 Petabytes (or 40 million GBs) of storage — and lots of GPUs to spare! We honestly don’t remember how many computational hours we used on Odyssey — but we estimate something like 24+ hours.
Below is the proof of concept result, using the great Napoleon Dynamite. The top left corner is the “Source Video”, i.e. the dance moves we want to transfer to ourselves. The top right corner is the “Pose Estimation”, which is similar to the limb extraction we did in our first attempt. The bottom row is the video output of Alex (on the left) and Ashley (on the right). You can see that it’s not ~quite~ as crisp as the “Everybody Dance Now” results, but we’re pretty happy given that it actually ran, that we didn’t spend a thesis amount of time doing it, and that we had just learned how to use GPUs.
So without further ado, here is our first dance (set to music):