Music Generation Using Deep Learning
(A Deep Learning Case Study)
1. Real World Problem
This case study focuses on generating music automatically using a Recurrent Neural Network (RNN).
We do not have to be music experts to generate music: even a non-expert can generate decent quality music using an RNN.
We all like to listen to interesting music, and if there were a way to generate decent quality music automatically, it would be a big leap for the music industry.
Task: Our task is to take some existing music data and train a model on it. The model has to learn the patterns in music that we humans enjoy. Once it has learnt them, the model should be able to generate new music for us. It cannot simply copy-paste from the training data; it has to understand the patterns of music to generate new music. We do not expect our model to produce music of professional quality, but we do want it to generate decent quality music that is melodious and pleasant to hear.
Now, what is music? In short, music is nothing but a sequence of musical notes. Our input to the model is a sequence of musical events/notes, and our output will be a new sequence of musical events/notes. In this case study we have limited ourselves to single-instrument music, as this is our first-cut model; in future we will extend it to multi-instrument music.
2. Representation of Music
Our key task is to represent music as a sequence of events, since the RNN we will use takes a sequence as its input.
The above image is a representation of music known as sheet music. Here, music is represented by a sequence of musical notes, with each note separated by a space. This representation can be used for both single-instrument and multi-instrument music.
ABC notation is another, textual representation of music. It has two parts.
Part 1 contains metadata. Lines in this part begin with a letter followed by a colon and indicate various aspects of the tune, such as the index when there is more than one tune in a file (X:), the title (T:), the time signature (M:), the default note length (L:), the type of tune (R:) and the key (K:).
Part 2 contains the tune itself: a sequence of characters in which each character represents a musical note.
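The split between the two parts can be illustrated with a short Python sketch. The tune below is hypothetical, written purely for illustration:

```python
# A hypothetical ABC tune (illustrative, not taken from the training data)
abc_tune = """X:1
T:Example Jig
M:6/8
L:1/8
K:G
GFG BAB | gfg gab | GFG BAB | d2A AFD"""

lines = abc_tune.split("\n")
# Part 1: metadata lines of the form "<letter>:<value>"
meta = [l for l in lines if len(l) > 1 and l[1] == ":"]
# Part 2: the tune body, i.e. every remaining line
body = [l for l in lines if not (len(l) > 1 and l[1] == ":")]
```

Here `meta` collects the X:, T:, M:, L: and K: lines, while `body` is the character sequence the model will actually learn from.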
MIDI (Musical Instrument Digital Interface)
MIDI itself does not make sound; it is just a series of messages such as "note on," "note off," "note/pitch," "pitch-bend," and many more. These messages are interpreted by a MIDI instrument to produce sound. A MIDI instrument can be a piece of hardware (an electronic keyboard, a synthesizer) or part of a software environment (Ableton, GarageBand, Digital Performer, Logic…).
The above image shows a representation of music produced using Music21, a Python library for working with MIDI-format music. Here, Event-1 is Note B, Event-2 is the Chord E3 A3, Event-3 is Note A, and so on.
In our case study we will focus on ABC notation, because it is easy to understand and represents music as just a sequence of characters.
3. Char-RNN Model (High Level Overview)
Since our music is a sequence of characters, the obvious choice is an RNN, or one of its variants such as LSTMs or GRUs, which process sequential information well by learning the patterns in the input.
There is a special type of RNN called a char-RNN. Our music is a sequence of characters, so we feed the characters of the sequence one after another into the RNN, and the output at each step is the next character in the sequence. The number of outputs therefore equals the number of inputs, so we use a Many-to-Many RNN.
In the image above, the many-to-many RNN with an equal number of inputs and outputs is shown on the right side, inside the red box. Each green rectangle (middle) is an RNN unit, a repeating structure. A more detailed image of an RNN is shown below.
In the above image, X_t is the single character at time t, given as input to the RNN unit. O_(t-1), the output for the previous character at time t-1, is also given as input to the RNN at time t. The unit then generates the output h_t, which is fed back into the RNN as the previous output at the next time step.
This output h_t should be the next character in the sequence. Say our music is represented as [a, b, c, a, d, f, e, …]. We give the first character 'a' as input and expect the RNN to output 'b'. At the next time step we give 'b' as input and expect 'c'; then 'c' as input and expect 'a'; and so on. We train our RNN to output the next character in the sequence at every step. This is how it learns the whole sequence and can later generate a new sequence on its own.
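This input/target shifting can be sketched in a few lines of Python, using the toy sequence from the text:

```python
# Toy sequence standing in for a tune; the target at each time step
# is simply the next character in the sequence.
sequence = "abcadfe"
inputs  = list(sequence[:-1])   # a, b, c, a, d, f
targets = list(sequence[1:])    # b, c, a, d, f, e
pairs = list(zip(inputs, targets))
```

Each `(input, target)` pair corresponds to one time step of the char-RNN during training.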
This continues until all of our inputs have been fed in. Since our music is a combination of many characters and each output is one of those characters, this can be treated as a multi-class classification problem. We use categorical cross-entropy, also known as multi-class log-loss, as the loss function, and softmax activations in the last layer. The number of softmax units in the last layer equals the number of unique characters across all of the music in the training data. Each RNN unit can be an LSTM, whose gates use differentiable activations such as tanh, so the structure can be trained with back-propagation; we iterate with the Adam optimizer until convergence. At the end, our RNN will have learnt the sequences and patterns of all the musical notes given to it during training.
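As a toy illustration of what the loss computes at a single time step, here is a NumPy sketch of softmax followed by categorical cross-entropy (the logits and the tiny 4-character vocabulary are hypothetical):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical raw scores over a tiny 4-character vocabulary
logits = np.array([2.0, 0.5, 0.1, -1.0])
probs = softmax(logits)

# One-hot target: the true next character is at index 0
target = np.array([1.0, 0.0, 0.0, 0.0])

# Categorical cross-entropy (multi-class log-loss) for this time step
loss = -np.sum(target * np.log(probs))
```

In the real model this computation happens at every time step, over 87 classes instead of 4, and Keras performs it internally.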
Now, you must be wondering: how does our char-RNN generate new music? Once the model is trained, we give it any one random character from the set of unique characters it saw during training. It then generates characters automatically, based on the sequences and patterns it learnt during the training phase.
In the image above, we give "C1" as input to our trained RNN. Note that "C1" must be one of the characters the char-RNN saw during training. The trained char-RNN generates the output "C2", which is fed back in as input, producing "C3"; "C3" is fed back in turn, and so on. We thus obtain a new sequence of music [C1, C2, C3, …]. This new sequence of characters is new music, generated by our trained char-RNN from the sequences and patterns it learnt during training.
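The feedback loop can be illustrated with a toy stand-in for the trained model. The lookup table below is purely hypothetical; in reality the next character comes from the RNN's softmax output:

```python
# A toy stand-in for the trained char-RNN: a lookup table mapping the
# current character to a "predicted" next character (hypothetical).
next_char = {"C1": "C2", "C2": "C3", "C3": "C1"}

def generate(seed, length):
    out = [seed]
    for _ in range(length - 1):
        out.append(next_char[out[-1]])   # feed the output back as input
    return out

generate("C1", 5)  # -> ['C1', 'C2', 'C3', 'C1', 'C2']
```

The key idea is only the loop structure: each output becomes the next input, so an arbitrarily long sequence can be grown from a single seed character.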
4. Data Preparation
From the first data source, we downloaded the first two files:
Jigs (340 tunes)
Hornpipes (65 tunes)
We feed the data in batches, giving a batch of sequences to our RNN model at once. First we have to construct these batches.
We have set following parameters:
Batch Size = 16
Sequence Length = 64
We found that there are a total of 155,222 characters in our data, of which 87 are unique.
We assigned a numerical index to each unique character and created a dictionary mapping each character to its index, together with the reverse dictionary mapping each index back to its character.
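A minimal sketch of these two dictionaries (the corpus string here is a short stand-in, not the real data):

```python
text = "abcabd"   # stand-in for the full 155,222-character corpus
unique_chars = sorted(set(text))
char_to_index = {ch: i for i, ch in enumerate(unique_chars)}
index_to_char = {i: ch for ch, i in char_to_index.items()}
```

`char_to_index` is used to encode the corpus as integers before batching, and `index_to_char` decodes the model's predicted indices back into characters at generation time.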
How batches are constructed:
import numpy as np

def read_batches(all_chars, unique_chars):
    length = all_chars.shape[0]          # total number of characters in the data
    batch_chars = length // BATCH_SIZE   # characters available per batch row
    for start in range(0, batch_chars - SEQ_LENGTH, SEQ_LENGTH):
        X = np.zeros((BATCH_SIZE, SEQ_LENGTH))                # input characters (as indices)
        Y = np.zeros((BATCH_SIZE, SEQ_LENGTH, unique_chars))  # one-hot targets
        for batch_index in range(0, BATCH_SIZE):
            for i in range(0, SEQ_LENGTH):
                X[batch_index, i] = all_chars[batch_index * batch_chars + start + i]
                # the target is the next character in the sequence, one-hot encoded
                Y[batch_index, i, all_chars[batch_index * batch_chars + start + i + 1]] = 1
        yield X, Y
The code snippet above is the function that creates the batches. It contains three nested loops: the outer loop runs once for each new batch, the second loop iterates over the rows of a batch, and the third over its columns.
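To sanity-check the batching logic, the same scheme can be exercised on a small fake corpus. The sizes and data below are illustrative, and the function is re-declared here so the snippet runs on its own:

```python
import numpy as np

BATCH_SIZE, SEQ_LENGTH, VOCAB = 4, 8, 10   # small hypothetical values

def read_batches(all_chars, unique_chars):
    length = all_chars.shape[0]
    batch_chars = length // BATCH_SIZE
    for start in range(0, batch_chars - SEQ_LENGTH, SEQ_LENGTH):
        X = np.zeros((BATCH_SIZE, SEQ_LENGTH))
        Y = np.zeros((BATCH_SIZE, SEQ_LENGTH, unique_chars))
        for b in range(BATCH_SIZE):
            for i in range(SEQ_LENGTH):
                X[b, i] = all_chars[b * batch_chars + start + i]
                Y[b, i, all_chars[b * batch_chars + start + i + 1]] = 1
        yield X, Y

rng = np.random.default_rng(0)
data = rng.integers(0, VOCAB, size=200)    # fake index-encoded corpus
batches = list(read_batches(data, VOCAB))
```

With 200 characters split across 4 rows (50 characters each) and a step of 8, the generator yields 6 batches, each with inputs of shape (4, 8) and one-hot targets of shape (4, 8, 10).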
5. Many-to-Many RNN
The above image shows a Many-to-Many RNN. The yellow box is a single RNN unit, and C_i is the first character given to it as input. For the first step we also provide a zero input, a dummy value, because an RNN always takes both the current input and the previous output as input; since there is no previous output at the first iteration, we feed in zero, for that iteration only. We want the RNN to produce the next character as output, so the output for C_i should be C_(i+1), the next character in the sequence. At the next step, the input is that next character, and the previous output is also that next character; we feed both into the RNN unit again and it produces C_(i+2), the character after that. This is how a Many-to-Many RNN works. Note that the image shows the time-unrolling of the RNN: it is a single RNN unit repeating itself at every time step.
The above image shows 'n' RNN units in a single RNN layer. We have constructed our RNN layers exactly like this, with 256 LSTM units in each layer. At each time step, every RNN unit generates an output, which is passed as input to the next layer and also fed back as input to the same unit at the next step. In our project each RNN unit is an LSTM unit. In the Keras library, LSTM has a parameter called 'return_sequences', which is False by default. When it is True, each RNN unit generates an output for every character, i.e. at every time step, which is exactly what we want: the unit should output the next character when given the previous character in the sequence. We stack many LSTM units so that each unit learns a different aspect of the sequence, giving a more robust model overall.
The image shows just one RNN layer with 'n' units. We have three such layers, each with 256 LSTM units; the output of each LSTM unit is an input to all of the LSTM units in the next layer, and so on.
After these three RNN layers, we apply a TimeDistributed dense layer with softmax activations. This wrapper applies a layer to every temporal slice of its input. The shape of the output after the third LSTM layer is (16, 64, 256). We have 87 unique characters in our dataset, and we want the output at each time step to be the next character in the sequence, i.e. one of those 87 characters. So the time-distributed dense layer contains 87 softmax units and creates a dense connection at every time step, generating an 87-dimensional output (87 probability values) at each step. This maintains the Many-to-Many relationship.
There is one more LSTM parameter, 'stateful'. We set stateful = True, which means the last state for each sample at index 'i' in a batch is used as the initial state for the sample at index 'i' in the following batch. We do this because consecutive batches contain rows that continue one another, so feeding the last state of one batch in as the initial state of the next lets the model learn longer sequences.
6. Model Architecture
from keras.models import Sequential
from keras.layers import Embedding, LSTM, TimeDistributed, Dense

def built_model(batch_size, seq_length, unique_chars):
    model = Sequential()
    model.add(Embedding(input_dim = unique_chars, output_dim = 512, batch_input_shape = (batch_size, seq_length)))
    model.add(LSTM(256, return_sequences = True, stateful = True))
    model.add(LSTM(256, return_sequences = True, stateful = True))
    model.add(LSTM(256, return_sequences = True, stateful = True))
    # time-distributed dense layer with a softmax over the unique characters
    model.add(TimeDistributed(Dense(unique_chars, activation = "softmax")))
    model.compile(loss = "categorical_crossentropy", optimizer = "adam")
    return model
We want to predict the next character, which must be one of the 87 unique characters, so this is a multi-class classification problem. Therefore our last layer is a softmax layer with 87 units.
7. Music Generation(Results)
We have already trained our model and found the best weights, so it is ready to make predictions. To make a prediction, we give any of the 87 unique characters as input; the model produces 87 probability values through the softmax layer. From these, we choose the next character probabilistically rather than deterministically, feed the chosen character back into the model, and repeat, concatenating the output characters until we have generated music of some length. This is how the music is generated.
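The probabilistic choice can be sketched with NumPy. The probability vector below is hypothetical and much smaller than the real 87-way softmax output:

```python
import numpy as np

# Hypothetical softmax output over a tiny vocabulary (87-dimensional
# in the real model)
probs = np.array([0.1, 0.6, 0.2, 0.1])

rng = np.random.default_rng(0)
# Sample an index according to the probabilities, rather than always
# taking the argmax (the deterministic choice)
next_index = rng.choice(len(probs), p=probs)
```

Sampling instead of taking the argmax is what lets the model produce a different tune each time it is run from the same seed character.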
The music sequences above are some examples of music generated by our model.
8. Further Scope
We have generated decent quality music, but there is still plenty of room for improvement.
First, an opening and a closing passage can be added to every newly generated tune, giving each tune a better start and ending and making the generated music more melodious.
Second, the model can be trained on more tunes. Here we trained it on only 405 musical tunes. Training on more tunes would not only expose the model to a greater variety of music but also increase the number of classes, allowing it to generate more melodious and more varied music.
Third, the model can also be trained on multi-instrument tunes. As of now, the generated music uses only one instrument. It would be interesting to hear what the model produces when trained on multi-instrument music.
Finally, a method can be added to handle unknown notes in the music. By filtering out unknown notes and replacing them with known ones, the model can generate more robust music.
9. Code Link of this Project
The full code of this case study can be found here.