Generating Music with a Generative Adversarial Network
Written by Charles Robert Misasi Jr, David Zehden, Thomas Wei, Liangcheng Zhang, Antonio Perez, & Sam Kaeser
Introduction
Generative Adversarial Networks (GANs) have become extraordinarily popular in recent years due to their success with image generation. Websites like thispersondoesnotexist.com showcase their ability to generate strikingly realistic human faces. Given these positive results with image generation, we sought to find out whether GANs could be applied to musical composition with similar success.
Background
For some context, let’s briefly examine what a GAN actually is. A GAN consists of two neural networks with conflicting goals: a discriminator and a generator. The discriminator’s task is to determine whether the input it is given is “real” or “fake”. The generator is challenged with creating authentic-looking content that fools the discriminator into believing it is real. The idea is that as one of these networks gets better at its job, the other has to learn to better counteract its adversary. This feedback loop yields increasingly convincing generated content.
Time-Frequency Data Representation
With this brief introduction to GANs out of the way, let’s look at how we applied this concept to music. With images, data representation is relatively straightforward: an image is just a 2-dimensional array with some number of color channels (e.g. 1 channel for greyscale or 3 channels for red-green-blue). Music, however, is structured differently from images. A single song can have multiple instruments, each playing its own part at any time. This introduces a significant amount of variability that must be captured, certainly too much for a single 2D array. To make this task more feasible, we used only single-track songs from the classical music genre, and we fixed the amount of data that we used from each song.
To use convolution effectively within a GAN, the data must exhibit translational invariance. To facilitate this, every musical training input was extracted as a 16-beat segment of a song, with each beat divided into 24 time slices. Each slice contained a vector of size 128 holding the volume of each possible note that could be played. This resulted in our discriminator input matrix (and generator output matrix) being of size 384 x 128. Songs could be sampled multiple times to produce additional training samples, with some risk of overlap between samples. These transformation steps, combined with the data filtering discussed above, reduced our original input dataset of about 113,000 MIDI files to roughly 6,000.
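Below is a minimal sketch of this slicing step, assuming each song’s pianoroll has already been loaded as a NumPy array of shape (T, 128) at 24 time slices per beat; the function name and the random offset are illustrative rather than our exact preprocessing code.

```python
import numpy as np

SLICES_PER_BEAT = 24                               # time resolution per beat
BEATS_PER_SAMPLE = 16                              # beats per training sample
SAMPLE_LEN = SLICES_PER_BEAT * BEATS_PER_SAMPLE    # 384 time slices

def sample_segment(pianoroll, rng=None):
    """Cut one (384, 128) training sample out of a single-track pianoroll.

    `pianoroll` is assumed to be an array of shape (T, 128), where each row
    holds the volume of each of the 128 possible notes at one time slice.
    """
    if rng is None:
        rng = np.random.default_rng()
    total_steps = pianoroll.shape[0]
    if total_steps < SAMPLE_LEN:
        raise ValueError("song is shorter than one 16-beat segment")
    start = int(rng.integers(0, total_steps - SAMPLE_LEN + 1))
    return pianoroll[start:start + SAMPLE_LEN]     # shape (384, 128)
```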
Our GAN
Convolutional models can be notoriously tricky to train, so we studied examples such as those provided here for guidance. We needed to balance the additional complexity introduced by deeper layers and larger filters against the model’s potential to underfit the data and produce noise. We also needed a training heuristic that would help avoid problems like non-convergence and mode collapse. Keeping these ideas in mind, we arrived at the following architectures for our generator and discriminator networks.
The generator in our GAN takes a vector of 100 random real numbers as input and feeds it through 5 hidden layers to produce the output song. Each input value is drawn from a normal distribution with mean 0 and variance 1. The hidden layers are organized as follows: the first is a fully-connected layer whose output is reshaped to (6, 8, 256). This is fed to a transposed convolutional layer using a (5, 5) filter, followed by a third convolutional layer using a (4, 4) filter. Layers 4 and 5 both use (4, 2) filters to ultimately output a pianoroll matrix of shape (384, 128). The last layer in the generator is a ReLU activation, clipped so that each cell of the output stays between 0 and 2. Every convolutional layer uses a modified form of ReLU activation that passes positive outputs through unchanged but reduces negative values by a factor of 3 (in effect, a leaky ReLU with slope 1/3). We also use batch normalization between layers to help control the magnitudes of the weights.
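As a concrete (and hedged) illustration, here is a Keras-style sketch of a generator with this shape progression. The choice of TensorFlow/Keras, the strides, the per-layer channel counts, and the interpretation of the final activation as a ReLU clipped at 2 are assumptions chosen so that (6, 8, 256) upsamples cleanly to (384, 128); they are not necessarily the exact hyperparameters we used.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_generator():
    return tf.keras.Sequential([
        tf.keras.Input(shape=(100,)),                # latent vector ~ N(0, 1)
        layers.Dense(6 * 8 * 256, use_bias=False),   # layer 1: fully connected
        layers.Reshape((6, 8, 256)),
        layers.BatchNormalization(),
        layers.LeakyReLU(1 / 3),                     # negative values reduced by a factor of 3
        layers.Conv2DTranspose(128, (5, 5), strides=(4, 2), padding="same", use_bias=False),
        layers.BatchNormalization(),
        layers.LeakyReLU(1 / 3),
        layers.Conv2DTranspose(64, (4, 4), strides=(4, 2), padding="same", use_bias=False),
        layers.BatchNormalization(),
        layers.LeakyReLU(1 / 3),
        layers.Conv2DTranspose(32, (4, 2), strides=(2, 2), padding="same", use_bias=False),
        layers.BatchNormalization(),
        layers.LeakyReLU(1 / 3),
        layers.Conv2DTranspose(1, (4, 2), strides=(2, 2), padding="same"),
        layers.ReLU(max_value=2.0),                  # clip output volumes to [0, 2]
        layers.Reshape((384, 128)),                  # final pianoroll
    ])
```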
The discriminator works in reverse. It takes the (384, 128) song as input and feeds it through a network of 3 convolutional layers to output a single scalar representing the probability that the input is real. The first layer uses a (4, 2) filter to create an output of shape (96, 64, 32). This is fed to a second convolutional layer with a (4, 4) filter and then to a third layer, also with a (4, 4) filter. The final convolutional output is fed to a fully-connected layer with a single output: the class estimate. The same modified ReLU activation seen in the generator is used for each convolutional layer, and a sigmoid activation is applied to the final fully-connected layer. We also use dropout between the convolutional layers, randomly setting 30% of the inputs to 0 to help the model generalize to new data.
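Continuing the sketch above, a corresponding discriminator might look like the following. The strides and channel counts of the second and third convolutions are assumptions, while the first layer’s (4, 2) stride reproduces the stated (96, 64, 32) output shape.

```python
def build_discriminator():
    return tf.keras.Sequential([
        tf.keras.Input(shape=(384, 128, 1)),                          # one (384, 128) song
        layers.Conv2D(32, (4, 2), strides=(4, 2), padding="same"),    # -> (96, 64, 32)
        layers.LeakyReLU(1 / 3),
        layers.Dropout(0.3),
        layers.Conv2D(64, (4, 4), strides=(2, 2), padding="same"),
        layers.LeakyReLU(1 / 3),
        layers.Dropout(0.3),
        layers.Conv2D(128, (4, 4), strides=(2, 2), padding="same"),
        layers.LeakyReLU(1 / 3),
        layers.Flatten(),
        layers.Dense(1, activation="sigmoid"),                        # probability the input is real
    ])
```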
We used cross-entropy for both loss functions: the generator’s loss measures how well it tricks the discriminator into identifying a fake song as real, while the discriminator’s loss is the sum of how well it identifies real and fake songs as their respective classes. Both models use the Adam optimizer, with a learning rate of 1e-6 for the discriminator and 1e-4 for the generator.
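In code, this corresponds to the standard cross-entropy GAN losses; a sketch continuing the TensorFlow/Keras example above might look like this.

```python
# Standard cross-entropy GAN losses. The discriminator already outputs a
# probability (sigmoid), so from_logits is left at its default of False.
cross_entropy = tf.keras.losses.BinaryCrossentropy()

def discriminator_loss(real_output, fake_output):
    # Real songs should be classified as 1, generated songs as 0.
    real_loss = cross_entropy(tf.ones_like(real_output), real_output)
    fake_loss = cross_entropy(tf.zeros_like(fake_output), fake_output)
    return real_loss + fake_loss

def generator_loss(fake_output):
    # The generator scores well when the discriminator labels its songs real.
    return cross_entropy(tf.ones_like(fake_output), fake_output)

generator_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
discriminator_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-6)
```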
Training, Results, & Evaluation
For each of 10,000 epochs, we trained our model on a randomly selected batch of 200 of the 6,000 input samples. Every 250 epochs we saved a generated song for later analysis and visualization. By the conclusion of training, the model had shown it could pick up some structural details early on, but it failed to generalize and ultimately was unable to produce compelling results.
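For reference, a sketch of this training loop, reusing the builders, losses, and optimizers from the sketches above, could look like the following; the `dataset` array is assumed to hold the preprocessed (384, 128) samples as a float32 NumPy array.

```python
generator = build_generator()
discriminator = build_discriminator()

BATCH_SIZE = 200
EPOCHS = 10_000
NOISE_DIM = 100

@tf.function
def train_step(real_batch):
    noise = tf.random.normal([BATCH_SIZE, NOISE_DIM])
    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        fake_batch = generator(noise, training=True)
        real_output = discriminator(real_batch[..., tf.newaxis], training=True)
        fake_output = discriminator(fake_batch[..., tf.newaxis], training=True)
        gen_loss = generator_loss(fake_output)
        disc_loss = discriminator_loss(real_output, fake_output)
    gen_grads = gen_tape.gradient(gen_loss, generator.trainable_variables)
    disc_grads = disc_tape.gradient(disc_loss, discriminator.trainable_variables)
    generator_optimizer.apply_gradients(zip(gen_grads, generator.trainable_variables))
    discriminator_optimizer.apply_gradients(zip(disc_grads, discriminator.trainable_variables))

for epoch in range(EPOCHS):
    # Draw a random batch of 200 of the preprocessed samples.
    idx = np.random.choice(len(dataset), BATCH_SIZE, replace=False)
    train_step(tf.constant(dataset[idx], dtype=tf.float32))
    if (epoch + 1) % 250 == 0:
        # Save one generated song every 250 epochs for later inspection.
        sample = generator(tf.random.normal([1, NOISE_DIM]), training=False)
        np.save(f"generated_epoch_{epoch + 1}.npy", sample.numpy())
```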
The generator produced random noise after the first training epoch, which matched our expectations. After the first 250 epochs, some musical structure became apparent: notes began to be played at specific times, vertically aligned with other notes. However, this pattern did not persist, and as the remaining training epochs progressed the output increasingly resembled random notes played at random times, albeit at a lower density than at the start.
Given the time constraints of this project, we were unable to obtain more conclusive results; however, there are several key ideas we would use to improve our model if we were to revisit it in the future. First, the filter sizes used by the discriminator are likely much too small to capture any large structural patterns in the music; these would need to be scaled up significantly to find patterns that persist throughout each sample. Second, we may not be using enough hidden layers in either model to capture and reproduce the complex structure of music, so adding more convolutional layers could improve performance. Third, our data could be selected with stricter criteria, such as enforcing a 4/4 time signature, removing samples that contain key changes, and starting each sample on the first beat of a measure.
Final Thoughts on GANs for Music
Synthesizing music with a GAN is a challenging task, but we found that even a model as simple as ours was able to pick up on some of the structure of music and propagate that structure into artificial music. Even though our music was easily distinguishable from its professionally composed counterparts, a model with more complexity and a stronger learning ability could eventually produce a beautiful piece of music.
Discrete Token Data Representation
Another direction that we explored was representing the notes in monophonic music as sequences of discrete tokens, rather than as part of a continuous frequency and time space. This alternative representation allowed us to reason about our data in different ways, such as building discrete empirical distributions for events conditioned on features of previous notes in the sequence. It is also closer to the way that the MIDI file format stores information, and consequently, it was easier to convert between the two.
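As a small illustration, one way to form such tokens is to fold each note’s pitch and quantized duration into a single symbol; the exact token scheme shown here is an assumption for illustration, not necessarily the one we used.

```python
def tokenize(notes):
    """Map a monophonic melody, given as (pitch, duration_in_slices) pairs,
    to a sequence of discrete string tokens."""
    return [f"p{pitch}_d{duration}" for pitch, duration in notes]

# e.g. tokenize([(60, 24), (62, 12), (64, 12)]) -> ['p60_d24', 'p62_d12', 'p64_d12']
```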
N-gram
The first type of generation that we tried with this representation was an n-gram style approach. Since notes have better-defined and more closely related features than words, it was possible to condition distributions not only on entire notes but also on individual note features such as pitch, duration, and volume. This was useful because, as n grew larger, the empirical distributions we collected for each (n - 1)-note context were built from far fewer examples, raising the risk of memorizing portions of songs. Building separate distributions for the pitch and duration patterns of the previous (n - 1) notes helped mitigate this problem, but the outputs of these models were quite chaotic, even at larger values of n (n = 4, 5). Overall, although the melodies produced by our n-gram models were chaotic and quite random, the experience and tools we developed in the process were useful down the road.
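A minimal sketch of the underlying n-gram machinery, counting how often each token follows each (n - 1)-token context and sampling from the resulting empirical distribution, is shown below; it is a simplified stand-in for our actual tooling.

```python
import random
from collections import Counter, defaultdict

def build_ngram_counts(token_sequences, n=3):
    """Count how often each token follows each (n - 1)-token context."""
    counts = defaultdict(Counter)
    for seq in token_sequences:
        for i in range(len(seq) - n + 1):
            context = tuple(seq[i:i + n - 1])
            counts[context][seq[i + n - 1]] += 1
    return counts

def sample_next(counts, context):
    """Sample the next token from the empirical distribution for `context`.
    (A real generator would also need a fallback for unseen contexts.)"""
    options = counts[tuple(context)]
    tokens, weights = zip(*options.items())
    return random.choices(tokens, weights=weights, k=1)[0]
```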
SeqGAN
We were able to produce melodies with far more logical sequences of pitches by training SeqGAN on our token representation of notes. Traditional GANs struggle to generate discrete sequences: because the output space is discrete, there may not exist a “more realistic” output in the direction of the negative gradient of the loss [5]. It is also hard to train a classifier to accurately distinguish between real and fake sequences at various stages of completion. SeqGAN remedies both problems by reinterpreting the generator as an agent and the classifier as the source of a reward signal [5]. The reward is then propagated back to the generator through a combination of policy gradients and Monte Carlo search [5].
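As a rough sketch of that reward signal, the quality of a partial sequence can be estimated by completing it several times with a rollout policy and averaging the discriminator’s score over the completions; the callables below are placeholders for the trained SeqGAN components, not our actual implementation.

```python
import numpy as np

def mc_rollout_reward(partial_seq, rollout_policy, discriminator, seq_len, n_rollouts=16):
    """Estimate the reward for a partial token sequence by completing it
    several times with a rollout policy and averaging the discriminator's
    probability that each completed sequence is real."""
    rewards = []
    for _ in range(n_rollouts):
        full_seq = rollout_policy(partial_seq, seq_len)   # complete to seq_len tokens
        rewards.append(discriminator(full_seq))           # returns P(sequence is real)
    return float(np.mean(rewards))
```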
We first applied SeqGAN to a reduced-complexity problem in which every note in the sequence had the same duration. On this reduced problem, SeqGAN was able to learn musical structures such as key signature and partial scales within 10 epochs of adversarial training. The original SeqGAN paper included similar experiments with music generation; however, the authors focused only on the fitness of the relative order of pitches in the music they produced. Our next course of action was therefore to expand the scope of the problem to include notes of varying duration.
We tried two main approaches to integrating variable note duration into the output of our model. The first was to apply SeqGAN to an expanded vocabulary in which each note, played for a specific duration, was a distinct token. From listening to sample output, it seemed that with the expanded vocabulary SeqGAN took far longer to learn the same musical structures as before. This motivated our second approach, which combined the fixed-duration pitch information from our previous SeqGAN model with a method of assigning a duration to each pitch, similar to the n-gram approach. To do this, we first generated fixed-duration pitch sequences as before, and then assigned durations sampled from the distribution:
P( length(i) | pitch(i), (length(i-1), pitch(i-1)), ..., (length(i-n), pitch(i-n)) )
With the second method, we were able to create realistic note sequences in a fraction of the time it took to train SeqGAN with an expanded vocabulary.
We also considered extending our melody generation by adding chords or harmonies to the generated melody. We attempted to use the melody as the input to a many-to-many model such as an RNN or LSTM. To create training data for a model of this type, we annotated melodies with the chords played alongside each note in the melody. Unfortunately, we did not have time to create and train a model to learn how to annotate melodies with chords before the due date.
Conclusion and Future Work
We have explored and evaluated music generation using a Generative Adversarial Network as well as an alternative method in the form of an n-gram model. Our GAN is able to capture some of the structure of single-track music, and we accomplished our goal of identifying structural similarities shared across musical compositions. However, the music we created lacks coherent melodies and needs improvement. Future steps for our GAN model include trying various filter sizes to optimize results and cleaning our dataset so that it contains only songs with the same tempo and time signature. Additionally, we would like to explore other models for music generation, such as LSTMs, since the notes being played may be determined in part by previous notes, and a notion of ‘memory’ could help when learning to generate music.
References
[1] “GAN: From Zero to Hero Part 1,” cican, 2019. [Online]. Available: http://cican17.com/gan-from-zero-to-hero-part-1/. [Accessed: 18-May-2019].
[2] “Papers,” MuseGAN, 2019. [Online]. Available: https://salu133445.github.io/musegan/papers. [Accessed: 18-May-2019].
[3] “GAN — Why it is so hard to train Generative Adversarial Networks!,” Medium, 2019. [Online]. Available: https://medium.com/@jonathan_hui/gan-why-it-is-so-hard-to-train-generative-advisory-networks-819a86b3750b. [Accessed: 20-May-2019].
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Networks,” 2014. [Online]. Available: https://arxiv.org/abs/1406.2661. [Accessed: 13-May-2019].
[5] L. Yu, W. Zhang, J. Wang, and Y. Yu, “SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient,” in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI’17), AAAI Press, 2017, pp. 2852-2858.