Reincarnating Mozart With Tensorflow

Yashashwee Chakrabarty
Developer Student Club, HIT
10 min read · Apr 18, 2020
Wolfgang Amadeus Mozart (https://www.classicfm.com)

The Inspiration

There are very few things that move and inspire people like the works of the Classical maestros (note: the term “classical” from here on refers to European classical art music from the 1750s to the early 1820s) such as Mozart, Beethoven, Bach, Debussy, Chopin and so many more, and since their demise the world has seen a decline in Classical music. So why not bring them back with the help of our friendly neighborhood neural networks? If this works out we’ll call it the first AI-aided renaissance, and maybe rewrite the script of Terminator so that Skynet ends up being a grumpy old classical musician.

As a musician I’ve always taken pride in the fact that while other professions “lose their importance” with the progress made in AI, art will always remain a man-made endeavor. Or will it?

Here is the GitHub link to all the code for this project in case you have trouble following along, and if you’re here just to enjoy some sweet AI music, skip to the end!

Background

Before we get into building neural networks, let’s first get the basics straight.

Data

We’ll be using the Mozart data from the Classical Music MIDI data set to train our network. MIDI, or Musical Instrument Digital Interface, is a standard for the digital transmission of musical instrument data. Technically, it’s a list of notes, when to play them and for how long. In our data there are 3 different kinds of data points: notes, chords and rests. Notes are the building blocks of music in Western music theory, and we have 12 of them, namely A, A#, B, C, C#, D, D#, E, F, F# G and G#. Each note corresponds to a specific frequency, doubling with every octave; for instance, middle C, or C4, on a piano has a frequency of about 261.6 Hz, while C5 is about 523.3 Hz. When we bunch a few notes together (in a musical sense) we form a chord. Lastly, rests are, as the name suggests, moments of silence in the piece. We omit the rests from the data to keep our problem simple.
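
If you’re curious where numbers like 261.6 Hz come from, here is a quick sketch (purely illustrative, not part of the project code) using the standard equal-temperament formula, which pins A4 at 440 Hz, MIDI note number 69:

import math

def midi_to_frequency(midi_number):
    # equal temperament: each semitone multiplies the frequency by 2^(1/12)
    return 440.0 * 2 ** ((midi_number - 69) / 12)

print(round(midi_to_frequency(60), 1))  # middle C (C4) -> 261.6
print(round(midi_to_frequency(72), 1))  # C5, one octave up -> 523.3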

Music21

Music21 is a Python toolkit for computer-aided musicology. It lets us teach the fundamentals of music theory, generate music examples and study music, and we’ll be using it to convert our MIDI data into something our network understands. Think of it as a translator between two unspoken languages.
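
To get a feel for it, here is a tiny illustrative example (separate from the project code) of creating a note and inspecting it with Music21:

from music21 import converter, note

# create a single note and look at its pitch and frequency
n = note.Note('C4')
print(n.pitch, n.pitch.frequency)  # C4 261.6255...

# parsing a MIDI file gives back a stream of Music21 objects
# ('some_file.mid' is a placeholder path):
# piece = converter.parse('some_file.mid')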

Recurrent Neural Networks

Recurrent Neural Networks, or RNNs, are a class of artificial neural networks with internal memory. Unlike plain feed-forward networks, the output for the current input depends on past computation (kind of like sequential electronic circuits if you come from an electronics background). After producing an output, the network’s state is fed back into it, so when making a decision it considers both the current input and what it has learned from previous inputs.
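
If the recurrence feels abstract, here is a minimal NumPy sketch of the idea (purely illustrative; Keras handles all of this for us). The hidden state h is the “memory”: each step mixes the current input with the previous state.

import numpy as np

hidden_size, input_size = 4, 3
W_xh = np.random.randn(hidden_size, input_size) * 0.1  # input -> hidden
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1  # hidden -> hidden
h = np.zeros(hidden_size)                               # initial state

for x in np.random.randn(5, input_size):  # a toy sequence of 5 inputs
    h = np.tanh(W_xh @ x + W_hh @ h)      # new state depends on input AND memory
print(h)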

Keras

Keras (for this project) is TensorFlow’s high-level API for building and training deep learning models. We’ll also be using its Tokenizer class to convert our list of notes into tokens that are easily “understood” by our network.
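
Here is a toy run of the Tokenizer (illustrative, separate from the project code) so you can see what it does with note and chord strings:

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
# each distinct "word" -- for us, a note or chord string -- gets an integer
tokenizer.fit_on_texts(['E3 A5 9z1z4 E3 E2 9z1z4'])
print(tokenizer.word_index)  # e.g. {'e3': 1, '9z1z4': 2, 'a5': 3, 'e2': 4}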

Preprocessing

Our MIDI data, once passed through the Music21 parser, will be a list of chord and note objects. Something like this:

...
<music21.chord.Chord F#4 D4>
<music21.note.Note F#>
<music21.note.Note E>
<music21.note.Note F#>
<music21.note.Note A>
<music21.chord.Chord E4 G4>
<music21.note.Note G>
...

While it somewhat makes sense to the human eye, our network won’t like it one bit, so to please it we’ll need some coding magic. First, let’s get rid of all the unnecessary words and turn this into a flat list of notes and chords with the following code:

import glob
from music21 import converter, instrument, note, chord

notes = []
count = 0
for file in glob.glob('classical-music-midi/mozart/*.mid'):
    count += 1
    print(count)
    midi = converter.parse(file)
    notes_in_song = None
    parts = instrument.partitionByInstrument(midi)

    # check if there are different parts for separate instruments
    if parts:
        notes_in_song = parts.parts[0].recurse()
    else:
        notes_in_song = midi.flat.notes
    for element in notes_in_song:
        if isinstance(element, note.Note):
            notes.append(str(element.pitch))
        elif isinstance(element, chord.Chord):
            notes.append('z'.join(str(n) for n in element.normalOrder))

We get all the MIDI files in the Mozart folder and parse them using Music21. If we encounter a note we put it in the list in its alphabetic representation, such as ‘C4’ or ‘A#3’. If we encounter a chord we put its pitch classes in numeric format separated by a ‘z’, such as ‘9z1z4’ or ‘1z3’. Notice that we are only looking for chords or notes, so all other classes like tempo and rest are filtered out; we can take those up in a future project to improve the musical sense of our network. In the end we get a list like this:

[...'A5', 'D3', '9z1z4', 'E2', 'E3', '9z1z4', 'E3', 'E3', '8z11z2z4', 'E2', 'E3', '9z1z4'...]
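
In case you’re wondering where a string like ‘9z1z4’ comes from: a chord’s normalOrder in Music21 is its pitch classes (0 = C, 1 = C#, …, 11 = B) arranged most compactly. Here is a quick illustration with an A-major chord:

from music21 import chord

c = chord.Chord(['A4', 'C#5', 'E5'])  # an A-major triad
print(c.normalOrder)                  # [9, 1, 4]  (A=9, C#=1, E=4)
print('z'.join(str(n) for n in c.normalOrder))  # '9z1z4'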

Now we tokenize our notes, i.e., assign an integer to each distinct note and chord in our data. We could do this manually, but Keras has the Tokenizer class to help us out. We first need to join our list into a single string of notes and chords, since the Tokenizer is meant to be used on sentences rather than lists of words.

from tensorflow.keras.preprocessing.text import Tokenizer

music_sentence = [" ".join(str(item) for item in notes)]
print(music_sentence)
# caveat: Tokenizer's default filters strip '#', so sharps like 'A#3' can
# merge with their naturals; pass filters='' to keep them distinct
tokenizer = Tokenizer()
tokenizer.fit_on_texts(music_sentence)
total_classes = len(tokenizer.word_index) + 1  # +1 for the reserved index 0
print(total_classes)
sequence = tokenizer.texts_to_sequences(music_sentence)[0]

We get a sequence of integers that represents our data in a form the network understands. The tokenizer’s word index shows there are 179 distinct notes and chords in our data (total_classes prints 180 because index 0 is reserved). Sounds like a lot, but our model can tackle it without breaking a sweat (NLP tasks usually have thousands of classes).

Now we split the data into inputs and labels. So, let’s pause for a second and think this through. Unlike traditional classification tasks, where separate inputs and labels are given, here we have one continuous stream of data. One way of tackling this is to take the first ’n’ objects from the list and make the network predict the (n+1)-th object, then slide the window one step and repeat. That’s our plan, so let’s write some code again:

import numpy as np
from tensorflow.keras import utils as ku

sequence_len = 100
network_in = []
network_out = []
# slide a window of 100 tokens over the sequence; the token right after
# each window becomes that window's label
for i in range(0, len(sequence) - sequence_len):
    sequence_in = sequence[i:i + sequence_len]
    sequence_out = sequence[i + sequence_len]
    network_in.append(sequence_in)
    network_out.append(sequence_out)
network_in = np.array(network_in)
label = ku.to_categorical(network_out, num_classes=total_classes)

We take a sequence of 100 tokens and make the 101st element our label. We convert the input to a NumPy array and then one-hot encode our labels, so each label becomes a vector of zeros with a single 1 at the index matching the output class. And with that we come to the end of our preprocessing job.
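
Here is a toy view (illustrative only) of what that one-hot encoding looks like:

import numpy as np
from tensorflow.keras import utils as ku

labels = [2, 0, 3]
print(ku.to_categorical(labels, num_classes=4))
# [[0. 0. 1. 0.]
#  [1. 0. 0. 0.]
#  [0. 0. 0. 1.]]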


Building and Training

Now we come to the part that is easy to code but difficult to grasp: building and training our neural network. I went through a bunch of models and ended up using this one, a pretty common NLP architecture, because it was just easier to understand and explain, and that’s really important when it comes to building neural networks. So here’s the star of the show:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense
from tensorflow.keras import regularizers

model = Sequential()
model.add(Embedding(total_classes, 100, input_length=sequence_len))
model.add(Bidirectional(LSTM(150, return_sequences=True)))
model.add(Dropout(0.2))
model.add(LSTM(100))
# Dense units must be an integer, hence the floor division
model.add(Dense(total_classes // 2, activation='relu',
                kernel_regularizer=regularizers.l2(0.01)))
model.add(Dense(total_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
print(model.summary())

################################################################
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 100, 100)          18000
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100, 300)          301200
_________________________________________________________________
dropout_1 (Dropout)          (None, 100, 300)          0
_________________________________________________________________
lstm_3 (LSTM)                (None, 100)               160400
_________________________________________________________________
dense_2 (Dense)              (None, 90)                9090
_________________________________________________________________
dense_3 (Dense)              (None, 180)               16380
=================================================================
Total params: 505,070
Trainable params: 505,070
Non-trainable params: 0
_________________________________________________________________
None

The Model

Embedding: This layer maps each input token (considering our notes and chords to be words) onto a dense vector. There are many popular methods of learning word embeddings, such as word2vec and GloVe.

LSTM: A recurrent neural network layer that takes a sequence as input and can return either the full sequence of outputs (return_sequences=True) or just the output for the last timestep.

Bidirectional LSTMs are an extension of traditional LSTMs that can improve model performance on sequence problems. When all timesteps of the input sequence are available, a Bidirectional LSTM trains two LSTMs instead of one: the first on the input sequence as-is and the second on a reversed copy of it. This can provide additional context to the network and result in faster and fuller learning on the problem.

Dropout: A simple and powerful regularization technique for neural networks and deep learning models. We use it here to prevent overfitting.

Dense: A fully connected layer, in which each input node is connected to each output node.

Ours is quite a humble model, more a proof of concept than state of the art. We use categorical cross-entropy since this is a multi-class problem (here the classes are all the notes and chords in our data).

Training

Let us first train our model for 100 epochs to see whether it’s even worth our time. We’ll train it, save the model so that we can reload the trained weights whenever needed, and then analyse it using a plot:

import matplotlib.pyplot as plt

history = model.fit(network_in, label, epochs=100, batch_size=512, verbose=1)
model.save('Models/epoch100Test')

acc = history.history['accuracy']
loss = history.history['loss']
epochs = range(len(acc))

plt.plot(epochs, acc, 'b', label='Training accuracy')
plt.title('Training accuracy')
plt.figure()
plt.plot(epochs, loss, 'b', label='Training Loss')
plt.title('Training loss')
plt.legend()
plt.show()

This gives us an accuracy of 0.7802 and a loss of 0.9167, which isn’t great, but once you look at the graph it looks promising, and graphs don’t lie:

Trained for 100 epochs with batch_size=512

You can see that neither the training accuracy nor the loss has flattened out yet; there is still room for improvement.

Our next step will be to train it for a larger number of epochs and maybe reduce the batch_size. Let’s go with 200 epochs and a batch_size of 256.

Trained for 200 epochs and batch_size of 256

This time we get an accuracy of 0.9285 and a loss of 0.3151. Quite a jump in accuracy, so now let’s get down to creating a new Mozart with it!

Generating Music

This is what we’ve been waiting for all along, the final test! But wait, there’s still one hurdle left: converting everything back to MIDI. We’ll need a dictionary to convert our tokens back to notes and chords, and then finally we can take all that and write it out as MIDI.

note_to_int = tokenizer.word_index
int_to_note = {v: k for k, v in note_to_int.items()}

start = np.random.randint(0, len(network_in) - 1)
pattern = network_in[start]  # a random 100-token window as the seed
prediction_output = []
for i in range(500):
    # the Embedding layer expects input of shape (batch, sequence_len)
    prediction_in = np.reshape(pattern, (1, len(pattern)))
    prediction = model.predict(prediction_in, verbose=0)
    pred_index = np.argmax(prediction)
    result = int_to_note[pred_index]
    prediction_output.append(result)
    # slide the window: append the prediction, drop the oldest token
    pattern = np.append(pattern, pred_index)
    pattern = pattern[1:len(pattern)]

We take a random starting point as a seed for generating the new sequence; this is our initial input to the predict function. We take the prediction and, using NumPy’s argmax function, find which index has the highest value: that is our predicted token, which we convert back to a note or chord using the dictionary we created. We then append this index to our pattern and slice off the first element, and that becomes our input for the next iteration. Neat!

from music21 import stream

offset = 0
output_notes = []
for pattern in prediction_output:
    if ('z' in pattern) or pattern.isdigit():
        # chord: rebuild it from its 'z'-separated pitch classes
        notes_in_chord = pattern.split('z')
        chord_notes = []
        for curr_note in notes_in_chord:
            new_note = note.Note(int(curr_note))
            new_note.storedInstrument = instrument.Piano()
            chord_notes.append(new_note)
        new_chord = chord.Chord(chord_notes)
        new_chord.offset = offset
        output_notes.append(new_chord)
    else:
        # single note, e.g. 'C4'
        new_note = note.Note(pattern)
        new_note.offset = offset
        new_note.storedInstrument = instrument.Piano()
        output_notes.append(new_note)
    offset += 0.5  # step forward so notes don't all sound at once
midi_stream = stream.Stream(output_notes)
midi_stream.write('midi', fp='MidiOut/epoch200Mozart2.mid')

Now that we have our predicted sequence of notes and chords, we just need to convert everything back to MIDI and we are done. We take each element in the predicted sequence and check whether it is a chord or a note. If it’s a chord, we get all its constituent notes, stack them together and build a Chord object; otherwise we convert it to a Note object, and both go into the same list. After every iteration we increase the offset by 0.5, or else everything would play at the same time. Then we just use the write function to put everything in a MIDI file and we’re done!!


Result

Before we break it down and analyse things, if you made it here, just take a break and pat yourself on the back. You’ve made it through a lot of coding and technical mumbo jumbo. You rock, champ! Here are our results:

This is what our 100-epoch test sounds like:

And here’s what the 200-epoch version sounds like:

Although to an untrained ear it sounds like someone falling down a staircase made of piano keys, there are some patterns in there, especially in the 200-epoch version, as is evident from the sheet music below. Around the 5th bar it starts ascending the scale as if building tension, and then falls back into the same pattern. But if you keep listening, the longer it goes the weirder it gets. This is because in models like this the quality of the predictions tends to fall the further we drift from the initial seed, as errors compound with every generated token.

Conclusion

Some of you may argue that the sounds the model produced are absolute garbage, and you’d be right. But it was never the purpose of this project to actually reproduce Mozart’s musical intellect. The purpose of this project was to learn and understand how RNNs work and how they interact with different types of data. While there is no structure to the music, no beginning or ending, I am still optimistic about what we can achieve with this method!

Here are some of the things we can do to improve our results:

  1. Try out different architectures
  2. Take rests and tempo changes into account
  3. Think of melody and harmony separately

If you followed along and built your own version of this, here’s a cat congratulating you:

(SendScraps.com)

Mail me your queries at yashashwee99@gmail.com
