Audio Generation with GANs

Note: a Portuguese version of this article is available at https://medium.com/@rafalencar/gerando-%C3%A1udio-com-gans-c8ce104e546a

Nowadays, one of the biggest applications of Deep Learning is generative models, which can learn the main characteristics of a dataset and then create new data that closely resembles the originals. Keeping that in mind, what could these models create using audio samples?


Machine Learning and Audio

When talking about machine learning for audio, the first thing that comes to mind is speech recognition, which is closely tied to Natural Language Processing. We can see it working in our pockets whenever we use personal assistants like Siri, Cortana and Alexa on our smartphones. There are other applications beyond speech, in audio processing in general, such as source separation, which can identify and separate the different musical instruments in a song. However, our focus in this article is on creating new sounds.


Creating Sounds with GANs

GANs

The main point of this article is to run a small audio experiment using a famous tool, Generative Adversarial Networks (GANs). Every GAN works by training two different models:

  • Generator: given random noise, it turns it into an audio sample;
  • Discriminator: given an audio sample, it tells whether it is real or fake.

When training the GAN, each epoch is a round of a game between the discriminator and the generator. On the discriminator's turn, it trains to better distinguish fake audio from real audio. On the generator's turn, it trains to better fool the discriminator into believing that the synthetic audio is original. The image below shows this idea:

GAN model for image generation
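
To make the two turns of this game more concrete, here is a minimal sketch of the loss each player tries to minimise. It assumes the standard binary cross-entropy formulation of GANs; the article does not state which loss was actually used. The helper names (`bce`, `discriminator_loss`, `generator_loss`) and the discriminator scores are made up for illustration only.

```python
import math

def bce(prediction, target, eps=1e-7):
    """Binary cross-entropy for one prediction in (0, 1) and a 0/1 target."""
    p = min(max(prediction, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def discriminator_loss(real_scores, fake_scores):
    """The discriminator wants to answer 1 on real audio and 0 on fake audio."""
    real_loss = sum(bce(s, 1) for s in real_scores) / len(real_scores)
    fake_loss = sum(bce(s, 0) for s in fake_scores) / len(fake_scores)
    return real_loss + fake_loss

def generator_loss(fake_scores):
    """The generator wants the discriminator to answer 1 on fake audio."""
    return sum(bce(s, 1) for s in fake_scores) / len(fake_scores)

# Made-up discriminator outputs for a batch of real and generated clips.
real_scores = [0.9, 0.8, 0.95]
fake_scores = [0.2, 0.1, 0.3]

print("discriminator loss:", round(discriminator_loss(real_scores, fake_scores), 3))
print("generator loss:", round(generator_loss(fake_scores), 3))
```

With these scores the discriminator is winning the round: its loss is low because it separates the two batches well, while the generator's loss is high because its clips are being recognised as fake.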

Architecture

Both networks took Deep Convolutional GANs (DCGANs) as inspiration. Each has four one-dimensional convolutional layers followed by ReLU activation functions, and a fully connected layer with a LeakyReLU activation function. In addition, there are Batch Normalization and Dropout layers between the convolutions.

The main differences between the networks are their inputs and outputs. The generator takes an input of shape 500 and produces an output of shape 64000 (16 kHz x 4 seconds), while the discriminator takes an input of shape 64000 and produces a single output in the range 0 to 1, where 0 means the audio is fake. For more details on the implementation, the code, written in Python using TensorFlow and Keras, can be found in the link below:

Dataset

To train these networks, we used the Freesound-Audio-Tagging database, with more than 10 thousand audio samples. It consists of uncompressed Pulse Code Modulation (PCM) mono audio files with a bit depth of 16 and a sampling rate of 44.1 kHz, each with a different duration. All samples are divided into 41 categories such as Trumpet, Applause, Violin or fiddle, etc. You can listen to a Saxophone sample from this dataset below:

Saxophone audio sample
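
As an illustration of what "16-bit mono PCM at 44.1 kHz" means in practice, the sketch below writes a short synthetic tone in that format and reads it back using only Python's standard library. The tone stands in for a dataset file, and `write_test_tone` and `read_pcm16_mono` are hypothetical helper names; this is not the loading code used in the experiment.

```python
import io
import math
import struct
import wave

def write_test_tone(target, sample_rate=44_100, n_samples=4_410, freq=440.0):
    """Write a mono 16-bit PCM sine tone, standing in for a dataset file."""
    values = [
        int(0.5 * 32767 * math.sin(2 * math.pi * freq * i / sample_rate))
        for i in range(n_samples)
    ]
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit depth
        wav.setframerate(sample_rate)  # 44.1 kHz, as in the dataset
        wav.writeframes(struct.pack(f"<{n_samples}h", *values))

def read_pcm16_mono(source):
    """Return (sample_rate, samples scaled to [-1, 1]) from a 16-bit mono WAV."""
    with wave.open(source, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM audio")
        sample_rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    count = len(raw) // 2
    ints = struct.unpack(f"<{count}h", raw)  # little-endian signed 16-bit
    return sample_rate, [v / 32768.0 for v in ints]

# Round trip through an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
write_test_tone(buffer)
buffer.seek(0)
rate, samples = read_pcm16_mono(buffer)
print(rate, len(samples))
```

Dividing by 32768 maps the signed 16-bit integer range onto floating-point values between -1 and 1, which is a convenient starting point for the later processing steps.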

Each category contains between 94 and 300 samples. The main categories used for this experiment were “Saxophone” and “Violin or Fiddle”, because most of their samples were manually classified. All audio files were then resampled to 16 kHz, set to a duration of four seconds and normalised.
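
The preprocessing steps can be sketched in a few lines of dependency-free Python. Several details here are assumptions, since the article does not specify them: resampling is done by crude linear interpolation (a real pipeline would likely use a dedicated audio library with proper filtering), the four-second duration is reached by trimming or zero-padding, and normalisation is peak normalisation. All function names are hypothetical.

```python
import math

TARGET_RATE = 16_000                          # Hz, as stated in the article
TARGET_SECONDS = 4                            # clip length, as stated in the article
TARGET_LENGTH = TARGET_RATE * TARGET_SECONDS  # 64000 samples per clip

def resample_linear(samples, source_rate, target_rate):
    """Crude resampling by linear interpolation (no anti-aliasing filter)."""
    if not samples:
        return []
    n_out = int(len(samples) * target_rate / source_rate)
    out = []
    for i in range(n_out):
        pos = i * source_rate / target_rate      # fractional index in the source
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append((1 - frac) * samples[left] + frac * samples[right])
    return out

def fix_length(samples, length):
    """Assumption: trim long clips and zero-pad short ones to `length` samples."""
    return samples[:length] + [0.0] * max(0, length - len(samples))

def peak_normalise(samples):
    """Assumption: scale so the largest absolute value is 1 (silence is left as-is)."""
    peak = max((abs(s) for s in samples), default=0.0)
    return list(samples) if peak == 0 else [s / peak for s in samples]

def preprocess(samples, source_rate):
    resampled = resample_linear(samples, source_rate, TARGET_RATE)
    return peak_normalise(fix_length(resampled, TARGET_LENGTH))

# A 1-second, 44.1 kHz test tone standing in for a dataset clip.
clip = [0.3 * math.sin(2 * math.pi * 440 * i / 44_100) for i in range(44_100)]
processed = preprocess(clip, 44_100)
print(len(processed), max(abs(s) for s in processed))
```

Whatever the original duration, every clip leaves this function with exactly 16 000 x 4 = 64 000 samples, which matches the input size expected by the discriminator.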


Results

We trained the GAN for 50 epochs, which took 4 minutes. You can see both models' losses over the epochs in the plot below:

Discriminator loss function in red and Generator loss function in blue

And now, you can listen to two generated audio samples based on the "Violin or Fiddle" category, from the first and last epochs respectively:

Generated audio from the first epoch
Generated audio from the last epoch

It does sound more like bagpipes than a fiddle. Does that mean the experiment failed? Not necessarily. If you pay attention, you will notice that the first sound is much noisier than the second one, which gives the impression that the sound became cleaner as training progressed. Now, let's try to figure out why the generated sounds are so different from the originals by looking at their waveforms:

Original audio samples
Generated audio samples

It is really hard to find the differences between the generated and original sounds just by looking at these images. One thing to notice is that the silent parts in the original samples could not be reproduced by the generator. The main difference can be seen if we zoom in on these images and compare them.

To make a fair comparison, we will use an autoencoder based on the discriminator and generator networks to encode a sample and then try to decode that same sample. After training the autoencoder and zooming in on the original and the generated sounds, we get the image below:

Original audio on the left and recreated audio on the right

Looking at these plots, it becomes clear that the original wave looks much better than the generated one. This "wave shape" defines the timbre, which distinguishes different types of sound production. The generated wave looks like a lot of random noise with no pattern, although it shows a better shape when we zoom out. Therefore, to improve this model, we need to pay attention to the timbre of these sounds and try to improve the quality of its generation.
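
One possible way to put a number on "a repeating pattern versus random noise" is the normalised autocorrelation of the waveform, sketched below. This is only an illustration and was not part of the original experiment: the sine tone stands in for a clean instrument note, the uniform noise stands in for a noisy generated clip, and the lag range is an arbitrary choice.

```python
import math
import random

def autocorrelation_peak(samples, min_lag, max_lag):
    """Largest normalised autocorrelation over lags in [min_lag, max_lag]."""
    mean = sum(samples) / len(samples)
    centred = [s - mean for s in samples]
    energy = sum(c * c for c in centred)
    if energy == 0:
        return 0.0
    best = -1.0
    for lag in range(min_lag, max_lag + 1):
        corr = sum(centred[i] * centred[i + lag] for i in range(len(centred) - lag))
        best = max(best, corr / energy)
    return best

rate = 16_000
n = 2_000

# Stand-in for a clean, periodic instrument note.
tone = [math.sin(2 * math.pi * 440 * i / rate) for i in range(n)]

# Stand-in for a waveform with no repeating pattern.
rng = random.Random(0)
noise = [rng.uniform(-1, 1) for _ in range(n)]

tone_peak = autocorrelation_peak(tone, 20, 200)
noise_peak = autocorrelation_peak(noise, 20, 200)
print("tone :", round(tone_peak, 2))
print("noise:", round(noise_peak, 2))
```

A periodic wave correlates strongly with a copy of itself shifted by one period, so its peak is close to 1, while noise stays close to 0 at every lag. A score like this could be tracked across epochs alongside the losses.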


Conclusion

This was a first experiment in audio generation with GANs. It showed us how to use this tool for audio and where to improve for better results. As next steps, we should try different ways of improving the networks' architecture so that the synthetic wave comes closer to a real one. Some ideas are using recurrent neural networks (RNNs) or applying frequency analysis techniques to preprocess the data. Stay tuned for the next articles.