Conditional WaveGAN Explained

Chae Young Lee
5 min read · Feb 6, 2019


Conditional WaveGAN presented at WiML Workshop

A lot of things happened after my participation in Deep Learning Camp Jeju last summer. First and foremost, I graduated high school and started receiving acceptance letters from colleges. But even without the familiar “the high schooler who does deep learning” title, my passion for deep learning research is still on fire. In fact, these days I am working as an intern at Naver’s Clova AI Research. Surrounded by Naver’s mind-boggling resources and experts, I couldn’t be happier.

Meanwhile, I was able to round off my work with Anoop Toffy from DL Camp. Our paper, “Conditional WaveGAN,” was presented at the Women in Machine Learning (WiML) Workshop and received a bit of attention from the public. Granted, the fact that one of its authors was a high schooler was surely a major attraction point. But I believe that my earlier TPU tutorial on Medium and the ICLR acceptance of our paper’s baseline work WaveGAN (Congrats, Chris!) played a bigger role. So I thought this would be a nice time to do a little recap of our paper. Here is the link to our full paper.

Table of Contents

1. Introduction
2. Architecture
3. Training
4. Reflection of DL Camp Jeju 2018

1. Introduction

Generative adversarial networks (GANs) are widely used for synthesizing realistic images, but comparatively little has been explored in audio generation. Only a few works have studied unsupervised generative models for audio. One of them is WaveGAN, which trains a generative model on raw audio in an unsupervised setting, and we use it as our baseline. Audio samples generated by WaveGAN are human-recognizable and achieve relatively high inception scores, but the generation is completely unconditioned: there is no way to choose which class of sound the model produces.

In this work, we explore a way to generate audio samples conditioned on class labels. That is, given a class label, the generator of the GAN should synthesize an audio waveform of that particular class. In the history of GANs, this type of conditioning was first explored for image synthesis: Conditional Generative Adversarial Nets, introduced in 2014 by Mehdi Mirza et al., was the first to try concatenation-based conditioning. These conditional GANs were able to synthesize realistic images on the MNIST and MIR Flickr datasets.

2. Architecture

Baseline System

In recent work on speech synthesis, Chris Donahue et al. (2018) introduce two GAN models, WaveGAN and SpecGAN: WaveGAN works in the time domain, while SpecGAN works in the frequency domain. Trained in an unsupervised setting, WaveGAN can not only produce intelligible words from a small vocabulary of human speech but also synthesize audio from other domains such as bird vocalizations, drums, and piano. Its architecture is based on DCGAN, which became famous for its use in image synthesis, with the 2-D convolutions flattened into longer 1-D ones, as illustrated below. We build on this work, making WaveGAN conditioned on class labels by introducing the conditioning techniques discussed in the following sections.

DCGAN vs. WaveGAN (source: WaveGAN paper)
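To make the flattening concrete, here is a minimal PyTorch sketch of a WaveGAN-style generator (my own illustrative reimplementation, not the authors’ released TensorFlow code). It follows the paper’s pattern of 1-D transposed convolutions with kernel length 25 and stride 4, upsampling a 100-dimensional noise vector into 16,384 audio samples:

```python
import torch
import torch.nn as nn

class WaveGANGenerator(nn.Module):
    """Illustrative WaveGAN-style generator: 1-D transposed convs
    with kernel length 25 and stride 4, as in Donahue et al. (2018)."""
    def __init__(self, z_dim=100, model_dim=64):
        super().__init__()
        self.model_dim = model_dim
        # project noise to a (16 * model_dim channels, length 16) tensor
        self.fc = nn.Linear(z_dim, 16 * model_dim * 16)

        def up(c_in, c_out, last=False):
            # each block upsamples the time axis exactly 4x
            conv = nn.ConvTranspose1d(c_in, c_out, kernel_size=25,
                                      stride=4, padding=11, output_padding=1)
            return nn.Sequential(conv, nn.Tanh() if last else nn.ReLU())

        d = model_dim
        self.net = nn.Sequential(
            up(16 * d, 8 * d),    # length 16   -> 64
            up(8 * d, 4 * d),     # length 64   -> 256
            up(4 * d, 2 * d),     # length 256  -> 1024
            up(2 * d, d),         # length 1024 -> 4096
            up(d, 1, last=True),  # length 4096 -> 16384 samples
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 16 * self.model_dim, 16)
        return self.net(x)  # (batch, 1, 16384) waveform in [-1, 1]
```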

Proposed System

In this section, we explore a few conditioning mechanisms discussed in the Dumoulin et al. article on feature-wise transformations. We are interested in generating raw audio of a given class label from the generator. That is, the model takes as input a class label and a source of random noise (e.g. a vector sampled from a normal distribution) and outputs a raw audio sample of the requested class, as shown below.

The concept of conditioning

The first approach is to embed the class information into the input feature vector. That is, we concatenate a representation of the conditioning information (e.g. a label embedding) to the noise vector and use the result as the model’s input, as shown below.

The concatenation based conditioning
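As a rough sketch of this approach (the sizes and names below are illustrative, not taken from our released code), the label is turned into a learned embedding and appended to the noise vector before it enters the generator:

```python
import torch
import torch.nn as nn

n_classes, z_dim, embed_dim = 10, 100, 10   # illustrative sizes

# learned vector representation for each class label
label_embedding = nn.Embedding(n_classes, embed_dim)

def conditioned_input(z, labels):
    """Append a label embedding to the noise vector."""
    y = label_embedding(labels)          # (batch, embed_dim)
    return torch.cat([z, y], dim=1)      # (batch, z_dim + embed_dim)

z = torch.randn(8, z_dim)                    # noise for a batch of 8
labels = torch.randint(0, n_classes, (8,))   # one class label per sample
g_input = conditioned_input(z, labels)       # becomes the generator's input
```

The only architectural change is that the generator’s first layer now takes a vector of length z_dim + embed_dim; the discriminator’s input can be conditioned analogously.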

In the other type of conditioning we implemented, the hidden layers are scaled based on the conditioning representation: a class-dependent scaling vector is element-wise multiplied with the activations, in both the generator and the discriminator. Note that the scaling is applied to each layer of the convolutional model.

The scaling based conditioning
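Here is a minimal sketch of the scaling idea, which is a feature-wise transformation in the sense of the Dumoulin et al. article (again, the module below is illustrative rather than our exact implementation). Each class learns one scale value per channel, and a layer’s activations are multiplied by that vector:

```python
import torch
import torch.nn as nn

class ConditionalScale(nn.Module):
    """Per-layer, class-dependent feature-wise scaling."""
    def __init__(self, n_classes, n_channels):
        super().__init__()
        # one scale vector per class, one entry per channel
        self.scale = nn.Embedding(n_classes, n_channels)
        nn.init.ones_(self.scale.weight)  # start as the identity

    def forward(self, h, labels):
        # h: (batch, channels, time); broadcast the scale over time
        s = self.scale(labels).unsqueeze(-1)  # (batch, channels, 1)
        return h * s

# usage inside a generator/discriminator block:
scale = ConditionalScale(n_classes=10, n_channels=256)
h = torch.randn(8, 256, 1024)               # hidden activations
labels = torch.randint(0, 10, (8,))
h = scale(h, labels)                        # scaled, same shape
```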

3. Training

We overcome the known instabilities of training conditional GANs by applying the hyperparameter tuning methods introduced by Kurach et al. (2018). Using the DCGAN loss with batch normalization, we successfully obtained human-recognizable samples on the Speech Commands dataset (Warden, 2018). Of the two proposed conditioning approaches, only the scaling-based method achieved both accurate conditioning and high fidelity of the audio samples. This is because, in the concatenation-based method, the concatenated vector that carries the class information is far too short to convey meaningful information when compared with the audio input, which is roughly six thousand times longer. We further investigate how the hyperparameters and the choice of conditioning method affect the performance of our model.
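For reference, here is a hedged sketch of one conditional training step under this DCGAN loss setting; `G`, `D`, and the optimizers stand for any conditional generator/discriminator pair such as the ones sketched above, and this is a simplification rather than our actual training code:

```python
import torch
import torch.nn.functional as F

def train_step(G, D, opt_g, opt_d, real_audio, labels, z_dim=100):
    """One conditional GAN step with the non-saturating DCGAN loss."""
    batch = real_audio.size(0)
    z = torch.randn(batch, z_dim)

    # discriminator step: real vs. fake, both conditioned on the labels
    fake_audio = G(z, labels).detach()
    d_real, d_fake = D(real_audio, labels), D(fake_audio, labels)
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # generator step: try to fool the discriminator for the requested class
    d_fake = D(G(z, labels), labels)
    loss_g = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```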

Here are some samples of generated outputs.

4. Reflection of DL Camp Jeju 2018

The first day of DL Camp

This month-long camp in Jeju was truly an invigorating experience for everyone who participated. For me, it was the first time I had ever worked as an independent researcher. Here, we were not forced to listen to lectures and take notes. Instead, we were pushed to pursue our own research project in the most suitable environment possible, and the knowledge came naturally along the way. If next year’s round of DL Camp opens up (hopefully!), do not hesitate to apply, because the camp will be one of the most rewarding research experiences of your lifetime.

References

Donahue, C., McAuley, J., and Puckette, M. “Adversarial Audio Synthesis.” arXiv:1802.04208, 2018.

Mirza, M., and Osindero, S. “Conditional Generative Adversarial Nets.” arXiv:1411.1784, 2014.

Dumoulin, V., Perez, E., Schucher, N., Strub, F., de Vries, H., Courville, A., and Bengio, Y. “Feature-wise transformations.” Distill, 2018.

Kurach, K., Lucic, M., Zhai, X., Michalski, M., and Gelly, S. “The GAN Landscape: Losses, Architectures, Regularization, and Normalization.” arXiv:1807.04720, 2018.

Warden, P. “Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition.” arXiv:1804.03209, 2018.

Radford, A., Metz, L., and Chintala, S. “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks.” arXiv:1511.06434, 2015.
