Generating Original Classical Music with an LSTM Neural Network and Attention

Alex Issa
May 21, 2019


By: Anushree Biradar, Michael Herrington, Alex Issa, Jake Nimergood, Isabelle Rogers, Arjun Singh

Overview

In this article, we will show our approach to generating classical music with repeated melodic structures using a Long Short-Term Memory (LSTM) Neural Network with Attention. First, we will discuss our motivation for generating classical music. Then, we will outline where we collected our data from, as well as the data preprocessing. After this, we will give a high-level background on RNNs, LSTMs, and Attention, and use that to explain our reasoning behind the model we chose. Finally, we will present the results of our models in various configurations and outline how we would continue the project given more time and resources. We built our model on top of an existing project detailed here, and the source code for our project can be found here.

Motivation

There’s a lot of incredible progress being made in the data science industry with different kinds of generative models for NLP, image creation, and much more. After seeing popular YouTuber carykh’s video (embedded below) detailing how to generate original classical music, we decided we wanted to try to improve on what he and others have done with newer technology. After talking with our professors about how current models could be improved and reading articles showing the power of recent text-generation models that use Attention, we saw a clear path for building on existing projects. Our goal was to create a model capable of generating unique classical music based on pieces composed by Mozart and Beethoven. We aimed to build on past projects by addressing the lack of repeating melodic structure through the addition of Attention to an LSTM network. However, we wanted to achieve this without simply overfitting to our data.

carykh’s video that inspired us

Data Collection and Preprocessing

Data

For this project, we gathered a dataset of Mozart and Beethoven piano concertos, as their music is easily recognizable, which makes it easier to check for overfitting. Furthermore, most classical music is in the public domain, so finding a quality collection of MIDI files to use was relatively simple. We wanted two composers so that the model would have some variation in the style of music it’s trained on, but we felt that having many composers would lead to either a lack of style or clashing styles for the network to pick up on. Our group decided to only use music with a single instrument (piano) to reduce the complexity of our project. We collected the MIDI files from here and here, for a total of approximately 75 compositions.

Preprocessing

In order to train the model, the MIDI files needed to be converted into a structure that we could easily encode into numeric data to feed the LSTM.

Our data pipeline was built on top of the code from the original article linked here (and in the Overview), which uses a package called Music21. We essentially added the concepts of Rests and Duration (rhythm) to the existing code, which only handled Notes and Chords. The article breaks down the code line by line and visualizes it, but the basic idea is this: we used the Music21 library to take our MIDI files and convert each one to a Stream object comprised of tempos, different voices, and note/chord/rest objects, each with an associated instrument and duration. Our script then read through each MIDI file and appended each note/chord/rest-duration combo to an array. That array was then split into 100-note samples. We also mapped each note/chord/rest-duration combo to a unique numeric value, since LSTMs work with numeric rather than categorical data. Those 100-note samples were then fed into the LSTM, which trained by predicting the next note and comparing its answer to the actual next note.
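To make that concrete, here is a minimal sketch of the tokenization and windowing steps, assuming the same Music21-based approach as the source article. The function name, token format, and file name below are our own illustrative choices and may differ from the exact code in our repo.

from music21 import converter, note, chord

def midi_to_tokens(path):
    # Parse one MIDI file into a list of note/chord/rest + duration tokens
    midi = converter.parse(path)
    tokens = []
    for el in midi.flat.notesAndRests:  # notes, chords, and rests in time order
        dur = el.duration.quarterLength
        if isinstance(el, note.Note):
            tokens.append(f"{el.pitch}_{dur}")
        elif isinstance(el, chord.Chord):
            tokens.append(".".join(str(p) for p in el.normalOrder) + f"_{dur}")
        elif isinstance(el, note.Rest):
            tokens.append(f"rest_{dur}")
    return tokens

# Map every unique note/chord/rest-duration combo to an integer,
# then slice the token stream into 100-note training samples.
tokens = midi_to_tokens("mozart_sample.mid")  # hypothetical file name
vocab = sorted(set(tokens))
token_to_int = {t: i for i, t in enumerate(vocab)}

sequence_length = 100
network_input, network_output = [], []
for i in range(len(tokens) - sequence_length):
    window = tokens[i:i + sequence_length]
    network_input.append([token_to_int[t] for t in window])
    network_output.append(token_to_int[tokens[i + sequence_length]])

Each 100-token input is paired with the token that actually follows it, which is what the network learns to predict during training.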

Technical Background

To accomplish our goal of classical music generation with repeated melodic structure, we had to decide between an LSTM and a Generative Adversarial Network (GAN) as the base structure. After doing research, we decided the best approach would be to use an LSTM model and incorporate Attention. There are a few reasons for this decision. We were more familiar with LSTMs than GANs, and both GANs and Self-Attention GANs are generally more difficult to train successfully. We also found much more success online with LSTMs for music generation, such as Project Magenta (TensorFlow/Google’s work on AI music generation), which uses LSTMs. To further break this down, let’s define what an RNN, LSTM, and Attention are in, hopefully, reader-friendly terms. We will talk about these models in terms of their ability to generate a contextual paragraph given a prompt (which is not a great use case for bare RNNs or LSTMs); the idea is that being able to do that well should translate, at least in part, to being able to generate structured music well.

Recurrent Neural Networks (RNNs) are essentially models that have the ability to remember their most recent calculations and use them, in addition to a new input, to generate a new output. They use a feedback loop to add a small memory structure to the model. That’s great for a basic problem: say I give the input “The water is,” and it would ideally say “blue.” But if I give it a prompt and want it to continue writing a contextualized paragraph, an RNN will fail. This is essentially because of a mathematical problem called the “vanishing gradient”: when the gradient of the loss function is propagated back through many time steps, it shrinks exponentially. To generalize the math, following the gradient of the loss function (which lets a model move toward a better answer with each update) is one of the main ideas behind a massive variety of machine learning models. The problem here is that in an RNN, the gradient signal from earlier time steps becomes vanishingly small, so the network forgets older information faster and faster and can only learn short-range patterns.
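For the mathematically curious, here is a rough sketch of why this happens; it’s the standard backpropagation-through-time argument, not anything specific to our model. The gradient that reaches time step k from the loss at step t contains a product of per-step Jacobians, and if each factor has norm below 1, the product shrinks exponentially with the distance t - k:

\frac{\partial L_t}{\partial h_k} = \frac{\partial L_t}{\partial h_t} \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}, \qquad \left\lVert \frac{\partial h_i}{\partial h_{i-1}} \right\rVert \le \gamma < 1 \;\Rightarrow\; \left\lVert \frac{\partial L_t}{\partial h_k} \right\rVert \le \left\lVert \frac{\partial L_t}{\partial h_t} \right\rVert \gamma^{\,t-k}

In plain terms, the further back an input is, the weaker its influence on the weight updates, which is exactly why a bare RNN struggles to carry a melody across a long passage.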

Unrolled sequential processing for RNN, from: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

The Long Short-Term Memory Neural Network was developed to allow prior inputs to propagate further into the network, allowing for the concept of longer-term memory. This is accomplished by adding a specific “memory cell” to the structure, with gates that decide when information enters memory, is output, and is forgotten. You can think of it like this: whenever the network is calculating what output to give for an input, it can decide which aspects of the previous results and the new input are more or less important, and it bases its new answer on the more important ones. Once the calculations are done, there’s a final fixed-length output vector that ideally contains all the information we need to finish that paragraph mentioned earlier. Problem solved, right? Nope. The two big issues here are the fixed length, which restricts how much understanding the memory can hold, and the fact that the vector is a single final summary, meaning we can’t look at any portion of our input from a perspective other than the very end of it.
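For reference, the standard LSTM cell formalizes those gates roughly as follows (this is the textbook formulation, not anything unique to our model): a forget gate f_t decides what to erase from the memory cell, an input gate i_t decides what new information to write, and an output gate o_t decides what to expose as the hidden state.

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)
\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)

Because c_t is carried forward mostly untouched when f_t is close to 1, gradients can survive over many more time steps than in a plain RNN.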

Sequential processing in LSTM, from: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Attention is a more recent development that actually helps solve our core problem. There are all these intermediate calculations an LSTM does to get to a final output, so what if we were able to remember those intermediate calculations? That would allow us to attend to certain components of the input at any given instant and use those components to help generate parts of the output rather than just the final calculation. So in something like generating a paragraph from a prompt, we’d have a lot of success if we used Attention, but let’s give a more contextual example.
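Concretely, one common formulation (additive attention, in the spirit of the diagram referenced below; the Keras layer we used may parameterize it slightly differently) scores every stored state h_i against the current state s_t, turns those scores into weights with a softmax, and builds a context vector as the weighted sum of all the stored states:

e_{t,i} = v^\top \tanh(W h_i + U s_t)
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}
c_t = \sum_i \alpha_{t,i} h_i

The weights \alpha_{t,i} are what let the model “look back” at, say, the opening phrase of a melody while deciding the next note, instead of relying only on the final summary vector.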

Attention Mechanism Mathematical Breakdown, from: https://www.slideshare.net/KeonKim/attention-mechanisms-with-tensorflow

I’m sure we’ve all seen this joke before, or maybe a more inappropriate version of it. You see, in something like reading, humans don’t actually read word by word (i.e., sequentially). We take shortcuts in how we process what we read by attending to the context of different words and making assumptions based on that. We group things together. That’s the idea of Attention, while the RNN and LSTM assume that things work strictly sequentially. Ironically, the LSTM and RNN wouldn’t have misread the image above, but we are trying to emulate human thought to make music, so I guess we want to make that mistake!

Hopefully, you can see how these concepts would translate into another complex language: music. There are many complex structures in music, like melodies, counter-melodies, harmonies, chord progressions, phrasing, voicing, repeated patterns, etc. So, we need the ability of Attention to contextually understand the music being written and then help guide the LSTM in what it needs to remember or forget as time goes on.

To finally explain our best model: we fed 100-note sequences (context from the data during training, a noise seed during generation) into a Bidirectional 512-node LSTM, then into an Attention layer, then another 512-node LSTM, then a 256-node Dense layer, and finally a ~3,400-node Dense layer with softmax predictions. The ~3,400 comes from how many unique note/chord/rest-duration combinations the input data had. Softmax means that for each of those ~3,400 options, the model assigns a probability that the next note-duration to be played should be that specific combination, and the highest-probability option is played. To wrap up the technical background, we used categorical cross-entropy for our loss function and RMSprop for optimization, because that’s what most RNN-based models use. We also used Dropout at multiple points in our model to help prevent overfitting. Please check our GitHub for more formal coding details, the packages we used, and so on.
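As a rough sketch of what that stack looks like in Keras, here is one way to wire it up. We’re using the keras-self-attention package’s SeqSelfAttention layer as a stand-in for our Attention layer, and the dropout rates and input shape shown are illustrative assumptions; see our GitHub for the actual configuration.

from keras.models import Sequential
from keras.layers import LSTM, Bidirectional, Dense, Dropout
from keras_self_attention import SeqSelfAttention

n_vocab = 3400  # roughly how many unique note/chord/rest-duration combos we had
seq_len = 100   # notes of context fed to the network

model = Sequential()
# Bidirectional LSTM reads the 100-note context forwards and backwards
model.add(Bidirectional(LSTM(512, return_sequences=True), input_shape=(seq_len, 1)))
# Attention lets later layers focus on the most relevant earlier notes
model.add(SeqSelfAttention(attention_activation='sigmoid'))
model.add(Dropout(0.3))
# Second LSTM, intentionally not bidirectional, condenses the attended sequence
model.add(LSTM(512))
model.add(Dropout(0.3))
model.add(Dense(256))
model.add(Dropout(0.3))
# One softmax probability per possible note/chord/rest-duration combo
model.add(Dense(n_vocab, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

At generation time, the combination with the highest softmax probability is appended to the output, the 100-note window slides forward by one, and the process repeats for as many notes as we want.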

Results

Different Model Breakdowns

Before arriving at the final result discussed above, we tried multiple other designs that weren’t as successful. All of our models varied in their initial layers, but each ended with Dropout layers, a 256-node Dense layer, and a ~3,400-node Dense layer with softmax activation.

The first model we tried had layers in the following order: Bidirectional LSTM → Attention. We trained it on a subset of the data for time purposes and used it as our starting point to gain an initial perspective on the effects and expectations of an LSTM layer followed by an Attention layer. Even on the reduced dataset, it began showing promising results after training for around 40 epochs, so we decided that adding more LSTM and Attention layers would further improve the quality of the music.

First model output after 1 epoch, loss around 6.1392. Clearly, it hasn’t learned much at all.
First model output after 21 epochs, loss around 3.6568. We start to hear small things like different notes and rhythms played. Still simplistic overall.
First model output after 40 epochs, loss around 1.7284. More progression on the notes and rhythms, but still no developed melodic structures. It’s a start!

The next model had the following layers: Bidirectional LSTM → Attention → Bidirectional LSTM → Attention → Bidirectional LSTM → Attention. The models that we had seen in the past used this stacked structure without Attention to successfully generate music (both carykh and the source article we linked above), so we decided it was worth trying on our dataset. We trained it on the full dataset and observed that after 15 epochs, the loss had barely dropped and was starting to rise again. Each epoch took around an hour on Paperspace’s second-best server instance, so we decided to abandon that model.

The next model we tried had one fewer layer of Bidirectional LSTM and Attention and was also trained on the full dataset. This configuration faced the same lack of progress and long training times as the previous model, so we abandoned it as well.

After this, we went back to the single layer of Bidirectional LSTM and Attention, but this time with the full dataset. The loss for this model went down consistently for the duration of the run. After about 60 epochs, the rate at which the loss decreased started to slow, and by 85 epochs we could see that the loss was going to stagnate at around 1.3, so we stopped running this model.

For the final model, we used the following layers: Bidirectional LSTM → Attention → LSTM. The second LSTM layer was intentionally not Bidirectional. We chose these layers because we thought that after the first LSTM and Attention layers made initial connections in the data, the model needed another layer to further develop the ideas it discovered before reaching the Dense layers. The loss for this model went down the fastest of any of the models. Please listen to some of the samples from this model!

Final model output after 30 epochs, loss around 0.5336. Already drastically better than any previous model output, with some ideas of melodic structures shown.
Final model output after 40 epochs, loss around 0.2932. We can really hear the repeated melodic patterns we wanted, in addition to the chord progressions, key signatures, and so on.
Final model output after 60 epochs, loss around 0.1511. This is arguably the best sounding output we generated, but with the loss as low as it is, there’s a good chance this is somewhat overfitted.
Final model output after 70 epochs, loss around 0.1172. If this sounds familiar, it’s because it is an overfitted slight variation of Beethoven’s Moonlight Sonata, but that’s unsurprising given how low our loss is.

As you can hear, our model captures not only the “sound” we were looking for from Beethoven and Mozart but also the repeated melodic structures that plain LSTM models can’t achieve. That being said, it’s clear that we overfit in the later epochs, as discussed in the comments for the videos. We speculate on how to reduce this in future training attempts in our Conclusion.

Please download our GitHub repo to hear more MIDI output samples.

Restrictions and Comparison

Due to the amount of data needed for the model to generate something unique with less overfitting, it was necessary to train our model on Paperspace’s second-best server option. Because of this computational cost, some of the models we trained had to be scaled back in order for us to see the effects of various configurations on the output: training on a smaller subset of the data, training for fewer epochs, and so on. Training each model on our full dataset would require a minimum of 24 hours per model, which is prohibitive. As a result, we only did a full run on our final two models, after first seeing the effects of each configuration on the output.

Because of this limitation, it is difficult to compare the models quantitatively, as they were trained on different subsets of the data for different numbers of epochs. Thus, while loss values are included, comparing the different model outputs is best done by ear.

Our most successful model had one Bidirectional LSTM layer, followed by a layer of Attention, followed by one more LSTM layer. We think that adding many more layers did not work as well because the complexity of those models exceeded the complexity of the elements we wanted our model to learn. If we had kept our larger models running, we think the loss would eventually have gone down, but the models would have overfitted. We can also see this by listening to the outputs in the source article, where it seems pretty clear that their model overfit the data.

Conclusion and Future Ideas

In conclusion, Attention helped our models learn and generate repeated melodic structures.

If we were to take this project further, one of the things we would like to address is the overfitting present in the “best” version of our model at its later epochs, where we let the loss get extremely low. We suspect that adding more classical music from different composers to the dataset would help reduce this overfitting. At the same time, this could make it difficult for the model to emulate a specific composer’s unique style. Adding training data also requires more time and significantly more money, which makes expanding the project difficult in that regard. On the other hand, the additional data may give the model deeper insight into the artists’ composition techniques, resulting in more authentic-sounding music.

Furthermore, we would like to explore training both new model designs and our less successful models with more computing power and time to see how they perform. It’s possible that those models would perform significantly better with more data and time to develop, and this would let us do a true comparison of which model design is most effective. There are also many hyperparameters we could tune throughout the model, so there are loads of possibilities we were restricted from trying. On top of that, comparing the effectiveness of these models to Self-Attention GANs (SAGANs) would be an interesting project.

Other slight adjustments we could have made include saving weight files more frequently to detect overfitting, progress, or lack of progress sooner. We also should have filtered out unreasonable durations from our output options (things like rests lasting 2,000 beats). Additionally, we wanted to add start/stop encodings to see if the model could learn to generate complete, self-contained pieces rather than a preset number of notes. Finally, we could have changed the model’s input size to see the effect of giving it more or less context and to figure out what an ideal amount of context is for the music we have. As you can see, there are many things we could do to expand on this project to see whether it can reach the level of an excellent human composer without overfitting. Beyond that, other original projects could also tackle more complex instrumentation, different music genres, and so on.

