Week 1 — Dance like a professional, with ML

Çağatay Yiğit
BBM406 Spring 2021 Projects
3 min read · Apr 11, 2021

Would you like to be able to dance like a professional dancer without years of practice?

At the end of this blog series, we will be able to generate a video of an ordinary person dancing like a professional dancer, using just two short clips: one of the ordinary person's random movements and one of the professional dancer's moves. So, without years of practice, you will be able to dance like a pro in seconds.

What is this blog post, and who are we?

We are three students from the Computer Science Department of Hacettepe University. In the following weeks, we are going to share our progress on the BBM406 Fundamentals of Machine Learning course project, under the supervision of Erkut Erdem.

We would love to hear your comments on the posts we share, and you can reach out to us at any time.

Now, without further ado, let’s see in detail what our project will be about.

Motivation

Aside from “Because why not?”, video-to-video synthesis is an active research area because of its wide applicability across computer science. Human motion transfer, in particular, is especially valuable because it can be used for content creation in the entertainment industry and to reduce production costs in the film and music industries.

With these motivations, we will investigate human motion transfer using dance videos in this project and attempt to improve an existing model for this specific task.

Investigating Human Motion Transfer on Dance Videos

State-of-the-art approaches (that we know of) to this problem currently work best for a single person [1, 2]. One such example is Chan et al. [1], which combines several simple ideas. As a starting point, we are going to use their architecture, which consists of a pre-trained pose estimator [3], a pose-to-image GAN [4], and a GAN for facial expressions [5]. Even though this method gives promising results, it still has some limitations and weaknesses. One of them is its lack of generalization: when the target video context varies, for example different subjects, backgrounds, or lighting conditions, the model's performance degrades. Using a slightly modified approach, we will attempt to improve both the quality of the generated videos and the generalization capability of the model. Although we have some ideas, we are still investigating potential solutions, so keep an eye out!

Here is an illustration of the method introduced by Chan et al. [1]
And here is a video of the method in action [1]
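
To make the data flow concrete, here is a minimal sketch of the three-stage pipeline in PyTorch. The PoseEstimator, Pose2ImageGenerator, and FaceGAN classes below are hypothetical toy placeholders for the real components (the pose estimator [3], the pose-to-image GAN [4], and the face GAN [5]); only the wiring between the stages is meant to mirror the method of Chan et al. [1].

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the three components used by Chan et al. [1].
# A real system would plug in a pre-trained pose estimator [3], a
# pose-to-image generator [4], and a face GAN [5]; the modules here are toys.

class PoseEstimator(nn.Module):
    """Placeholder for the pre-trained 2D pose estimator."""
    def forward(self, frame):                    # frame: (B, 3, H, W)
        batch = frame.shape[0]
        return torch.zeros(batch, 25, 2)         # 25 (x, y) keypoints per person

def render_pose(keypoints, size=(256, 256)):
    """Rasterize keypoints into a stick-figure image the generator conditions on."""
    batch = keypoints.shape[0]
    return torch.zeros(batch, 3, *size)          # toy rendering: a blank canvas

class Pose2ImageGenerator(nn.Module):
    """Placeholder for the pose-to-image GAN generator."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, kernel_size=3, padding=1)
    def forward(self, pose_img):
        return torch.tanh(self.net(pose_img))

class FaceGAN(nn.Module):
    """Placeholder for the face GAN that refines the face region."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, kernel_size=3, padding=1)
    def forward(self, face_crop):
        return face_crop + 0.1 * torch.tanh(self.net(face_crop))

def transfer_frame(source_frame, pose_net, generator, face_gan):
    """Map one frame of the source (professional) dancer onto the target subject."""
    keypoints = pose_net(source_frame)     # 1) extract the dancer's pose
    pose_img = render_pose(keypoints)      # 2) draw it as a pose image
    fake = generator(pose_img)             # 3) synthesize the target person in that pose
    out = fake.clone()                     # 4) refine the face region (toy crop coordinates)
    out[:, :, :64, :64] = face_gan(fake[:, :, :64, :64])
    return out

# Toy run on a random "frame", just to show the data flow.
with torch.no_grad():
    result = transfer_frame(torch.rand(1, 3, 256, 256),
                            PoseEstimator(), Pose2ImageGenerator(), FaceGAN())
print(result.shape)  # torch.Size([1, 3, 256, 256])
```

At training time, the generator would instead be conditioned on poses extracted from the target person's own video, so that it learns to map that person's poses back to realistic frames of them.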

Dataset

For training purposes, we are going to use the dataset collected by Chan et al. [1]. It consists of 1920x1080 and 1280x720 videos of up to 17 minutes in length, which will serve as the target videos. For the source of the professional dance moves to be transferred onto the target videos, we will also use dance videos collected from YouTube.
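
Since the model is trained on individual frames, the videos first need to be split into images. Below is a minimal preprocessing sketch using OpenCV; the file names, output directory, and frame size are our own illustrative choices, not values prescribed by the dataset.

```python
import os
import cv2  # OpenCV, used here only for video decoding and resizing

def extract_frames(video_path, out_dir, target_size=(1024, 512)):
    """Decode a video into individual PNG frames for training."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    count = 0
    while True:
        ok, frame = cap.read()
        if not ok:                       # end of video (or read error)
            break
        frame = cv2.resize(frame, target_size)
        cv2.imwrite(os.path.join(out_dir, f"frame_{count:06d}.png"), frame)
        count += 1
    cap.release()
    return count

# Example usage with a hypothetical target video:
# n_frames = extract_frames("target_subject.mp4", "frames/target_subject")
```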

Evaluation Metrics

We are planning to use three different metrics. The first is a human perceptual study to measure how realistic the results look. The other two are Structural Similarity (SSIM) and Learned Perceptual Image Patch Similarity (LPIPS).
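
For the two automatic metrics, here is a minimal sketch of how a single generated frame could be scored against its ground-truth frame, assuming the scikit-image and lpips Python packages. The AlexNet backbone for LPIPS and the random toy frames are our own illustrative choices.

```python
import numpy as np
import torch
import lpips                                        # pip install lpips
from skimage.metrics import structural_similarity   # pip install scikit-image

def evaluate_pair(real_frame, fake_frame, lpips_model):
    """Score one generated frame against its ground-truth frame.

    Both frames are expected as uint8 RGB arrays of shape (H, W, 3).
    """
    # SSIM: higher is better (1.0 means identical images).
    ssim = structural_similarity(real_frame, fake_frame,
                                 channel_axis=-1, data_range=255)

    # LPIPS expects (N, 3, H, W) float tensors scaled to [-1, 1]; lower is better.
    to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    with torch.no_grad():
        dist = lpips_model(to_tensor(real_frame), to_tensor(fake_frame)).item()
    return ssim, dist

# Toy example with random frames, just to show the call pattern.
lpips_model = lpips.LPIPS(net="alex")   # AlexNet backbone: our choice, not fixed by the metric
real = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
fake = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
print(evaluate_pair(real, fake, lpips_model))
```

In practice we would average both scores over all frames of a generated video and complement them with the human perceptual study.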

Work Plan

We want to give you (and ourselves) an overview of the work plan that we will follow over the coming weeks.

In the first weeks, we will implement the method developed by Chan et al. [1]. By doing so, we will gain a deeper understanding of the details of the existing method and of the problem itself. After that, we will experiment with different training approaches and report our results.

See you next week!

References

[1] Chan et al., “Everybody Dance Now”: https://carolineec.github.io/everybody_dance_now/

[2] Wang et al., “Video-to-Video Synthesis”: https://tcwang0509.github.io/vid2vid/

[3] Cao et al., “OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields”: https://arxiv.org/abs/1812.08008

[4] Wang et al., “High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs”: https://arxiv.org/abs/1711.11585

[5] Isola et al., “Image-to-Image Translation with Conditional Adversarial Networks”: https://arxiv.org/abs/1611.07004
