Adversarial Audio Synthesis
GANs have long been used to generate high-quality images and videos.
Their applicability to auditory data has recently become a focal point of research, with one state-of-the-art model following another.
While these models require extensive training, the results are noteworthy.
All of these approaches are unsupervised and can be used to generate human-like speech or music.
I will be explaining the following three papers.
- (WaveGAN) Adversarial Audio Synthesis — ICLR 2019
- GANSynth — ICLR 2019
- MelGAN — NeurIPS 2019
WaveGAN
- The paper is based on the DCGAN architecture, which was originally designed to generate images.
- It adapts the parts of the model that catered to image data so that they work on audio waveforms.
- Since images can be thought of as 2D matrices and audio as a 1D array, all operations were changed accordingly.
- All 5x5 2D convolution operations were changed to length-25 1D convolutions (see the layer sketch further below).
- Phase Shuffle: The transposed convolutions in the generator produce artifacts at fixed phases, which lets the discriminator learn a trivial policy of rejecting generated samples by detecting those artifact frequencies, making optimization harder. Phase shuffle was introduced to counter this.
Phase shuffle randomly perturbs the phase of each of the discriminator's layer activations by −n to n samples before they are passed to the next layer. This makes the discriminator's job harder by requiring invariance to the phase of the input waveform.
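A minimal sketch of the phase-shuffle idea (my own PyTorch illustration, not the authors' code; the reflection-padding choice and the helper name are assumptions):

```python
import torch
import torch.nn.functional as F

def phase_shuffle(x, n=2):
    """Randomly shift activations by -n..n samples along the time axis,
    filling the exposed boundary by reflection (illustrative sketch).
    x is assumed to have shape (batch, channels, time)."""
    shift = int(torch.randint(-n, n + 1, (1,)).item())
    if shift == 0:
        return x
    t = x.shape[-1]
    if shift > 0:
        # pad on the left, then drop the overflow on the right
        return F.pad(x, (shift, 0), mode="reflect")[..., :t]
    # shift < 0: pad on the right, then drop the overflow on the left
    return F.pad(x, (0, -shift), mode="reflect")[..., -t:]
```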
Each layer up-samples (in the generator) or down-samples (in the discriminator) by a factor of 4; the paper also experiments with nearest-neighbour up-sampling as an alternative to transposed convolution.
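To make the layer swap concrete, here is a minimal sketch (my own illustration; the channel counts are placeholders, not the paper's exact configuration):

```python
import torch.nn as nn

# DCGAN's 5x5 2D convolutions become length-25 1D convolutions,
# and each layer moves by a stride of 4 instead of 2.
disc_layer = nn.Conv1d(64, 128, kernel_size=25, stride=4, padding=11)    # 4x down-sampling
gen_layer = nn.ConvTranspose1d(128, 64, kernel_size=25, stride=4,
                               padding=11, output_padding=1)             # 4x up-sampling
```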
I personally trained the model on a Tesla K80. It required at least 20,000 steps to produce recognizable results; however, some noise was still present, as the accent variation in the data was too high for the model to converge cleanly.
GANSynth
- The architecture is used to generate musical notes and similar sounds.
- GANSynth generates an entire sequence of audio in parallel, approximately 50,000 times faster than a standard WaveNet.
- GANSynth generates the entire audio clip from a single latent vector, allowing for easier disentanglement of global features such as pitch and timbre.
- GANSynth uses a Progressive GAN architecture to incrementally up-sample with convolution from a single vector to the full sound.
- The paper notes that the standard method of convolution up-sampling struggles because it does not align the phases well.
- The paper instead uses instantaneous frequencies.
The instantaneous frequency is the time derivative of the unwrapped phase; it captures the constant phase precession that arises because the frame stride and the signal's periodicity generally do not line up.
High-fidelity audio was produced as a result.
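As a rough illustration of the instantaneous-frequency representation described above (my own sketch using librosa/numpy; the function name and FFT parameters are placeholders, not GANSynth's exact pipeline):

```python
import numpy as np
import librosa

def magnitude_and_if(audio, n_fft=2048, hop_length=512):
    """Compute log-magnitude and instantaneous frequency from an STFT:
    unwrap the phase along the time axis, then take the frame-to-frame
    difference (a finite-difference approximation of the phase derivative)."""
    stft = librosa.stft(audio, n_fft=n_fft, hop_length=hop_length)
    log_mag = np.log(np.abs(stft) + 1e-6)
    phase = np.angle(stft)
    unwrapped = np.unwrap(phase, axis=1)       # remove 2*pi discontinuities over time
    inst_freq = np.diff(unwrapped, axis=1)     # phase change per frame ~ instantaneous frequency
    return log_mag, inst_freq
```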
MelGAN
- A non-autoregressive model produced by Lyrebird AI.
- The model is extremely lightweight and fast.
- It is perhaps the closest model to being deployable in text-to-speech systems.
- It is the first paper to successfully convert spectrograms to speech with a GAN, without additional distillation or perceptual loss functions.
- They break the problem down into two steps:
- Modelling a lower-resolution representation such as a mel-spectrogram sequence conditional on text
- Modelling raw audio waveforms conditional on that mel-spectrogram sequence (or another intermediate representation)
MelGAN being non-autoregressive means it is not limited to producing one sample of audio at a time and has no causal dependency on previously generated blocks of audio.
- They use stacks of dilated convolution blocks, which enlarge the receptive field while keeping generation completely parallelizable (see the sketch after this list).
- They use a scheme of three window-based discriminators operating on the audio at different scales.
A window-based discriminator learns to classify between distributions of small audio chunks.
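A minimal sketch of these two ideas (my own PyTorch illustration; the layer sizes, dilation schedule, and pooling factors are assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class ResidualStack(nn.Module):
    """Stack of dilated 1D convolutions: growing dilation enlarges the
    receptive field while every output sample is computed in parallel."""
    def __init__(self, channels=64, dilations=(1, 3, 9)):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.LeakyReLU(0.2),
                nn.Conv1d(channels, channels, kernel_size=3,
                          dilation=d, padding=d),
                nn.LeakyReLU(0.2),
                nn.Conv1d(channels, channels, kernel_size=1),
            )
            for d in dilations
        ])

    def forward(self, x):
        for block in self.blocks:
            x = x + block(x)          # residual connection
        return x

class WindowDiscriminator(nn.Module):
    """Small convolutional classifier producing real/fake scores
    over short chunks (windows) of the input audio."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=15, stride=1, padding=7),
            nn.LeakyReLU(0.2),
            nn.Conv1d(16, 64, kernel_size=41, stride=4, padding=20, groups=4),
            nn.LeakyReLU(0.2),
            nn.Conv1d(64, 1, kernel_size=3, padding=1),   # per-window scores
        )

    def forward(self, audio):
        return self.net(audio)

# Three discriminators, each looking at the audio at a different scale.
discriminators = nn.ModuleList([WindowDiscriminator() for _ in range(3)])
pool = nn.AvgPool1d(kernel_size=4, stride=2, padding=1)

def discriminate(audio):
    """audio: (batch, 1, time). Returns one score map per scale."""
    scores = []
    for d in discriminators:
        scores.append(d(audio))
        audio = pool(audio)           # down-sample before the next discriminator
    return scores
```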
MelGAN has clearly redefined the state of the art.
Hoping to see more papers in 2020!

