Play Bach: Let a neural network play for you. Part 3.

pascal boudalier
Jun 13 · 6 min read

I do not know how to play music. But I can still play with music.

This is part of a series of articles exploring many aspects of this project, including static MIDI file generation, real-time streaming, TensorFlow/Keras sequential and functional models, LSTM, over- and underfitting, attention, embedding layers, multi-head models, probability distributions, conversion to TensorFlow Lite, use of TPU/hardware accelerators, running inferences on a Raspberry Pi, ….

See Part 1. Part 2. Part 4.

What could go wrong during training?

In Part 2 we looked at how validation loss evolves as the network gets trained.

But your first neural network training may very well look like the curve below. Why?

There is a balance to be found between network capacity and the number of training samples. An imbalance will produce the type of curve shown above.

Capacity is defined as the number of variables in the model (1.2 million in our case) and results from our choice of hyper-parameters. The more variables, the more ‘degrees of freedom’ the network has. If the capacity is too high compared to the number of training samples, the network may end up ‘memorizing’ the training set: it excels at predicting from the training set but fails when given an input outside of it. It cannot generalize. Instead of learning, it has simply memorized the training set.

This is called overfitting, and it is depicted in the diagram above: the error on the validation set stops following the downward progression of the error on the training set.

Preventing overfitting is a typical concern when training a model. This can be achieved by:

  • Making sure the network does not have too much capacity compared to the available training data, by tuning hyper-parameters.
  • Using a technique such as Dropout (remember, it is a layer in our model — see Part 2 of this series). Dropout randomly ‘neutralizes’ some of the model’s variables so the network does not get ‘lazy’ and end up ‘memorizing’ the training set (there is a lot of anthropomorphism there). A minimal sketch follows this list.
  • Getting more training data.
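Here is that sketch: a minimal Keras model with Dropout layers, assuming an LSTM architecture similar to the one described in Part 2. The layer sizes and the dropout rate are illustrative, not the exact values from the repo.

import tensorflow as tf

# Sketch only: layer sizes and dropout rate are illustrative
model = tf.keras.Sequential([
    tf.keras.Input(shape=(40, 1)),                       # seqlen=40 timesteps, 1 feature
    tf.keras.layers.LSTM(512, return_sequences=True),
    tf.keras.layers.Dropout(0.3),                        # randomly drops 30% of activations during training
    tf.keras.layers.LSTM(512),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(129, activation='softmax'),    # one probability per unique note/chord
])
print(model.count_params())                              # the 'capacity' discussed above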

On the other hand, if the capacity is too low, the network has no way to learn anything useful. This is called underfitting. In that case, accuracy may not be much better than a random guess.

Trial and error, experience, intuition and a good GPU are your friends to navigate between underfitting and overfitting.

Running the actual training

The application is written in Python and is available on GitHub. It uses two key libraries: TensorFlow/Keras for the neural network and music21 for MIDI and music manipulation.

Below are the application’s key files and directories, all in DEEP/music21/:

  • models: directory for trained models and generated MIDI files
  • training: directory containing the MIDI training files
  • play_bach.py: main Python script; performs both training and inference
  • my_model.py: TensorFlow model definition
  • my_midi.py: MIDI file creation
  • config_bach.py: configuration parameters
  • log_play_bach.log: log file, created at each run

Before running the training, let’s review some configuration parameters in config_bach.py:

seqlen=40 # length of input sequence
epochs=70 # training will stop earlier if accuracy not improving
batch=64 # power of 2
concatenate = False
normal = False
model_type=1
app='cello'

model_type=1 selects the model architecture we saw in Part 2. In future articles, we shall look at more complex architectures.

normal = False means that chords are encoded with octave information. When set to True, all notes in a chord are assumed to be in the same octave (#4). Normal mode reduces the size of the output layer, at the expense of losing octave information; it is useful when the model gets really big. For example, when normal=False the chord A4.G5 is considered different from the chord A5.G5, whereas when normal=True they are both encoded as A.G.
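As an illustration, here is a minimal music21 sketch of the two encodings. The token format is assumed to match the A4.G5 / A.G examples above; the repo may build its tokens differently.

from music21 import chord

c = chord.Chord(['A4', 'G5'])

with_octaves = '.'.join(p.nameWithOctave for p in c.pitches)   # 'A4.G5' -> normal = False
without_octaves = '.'.join(p.name for p in c.pitches)          # 'A.G'   -> normal = True
print(with_octaves, without_octaves)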

concatenate = False means we only predict the note’s pitch. When True, the note’s duration is also predicted. We will address duration prediction in future articles.

app is a name you choose for your corpus, e.g. if app='cello', cello MIDI files should be present in the training directory and the resulting trained model will be stored in the models directory as cello1_nc_mo (cello, model type 1, no concatenate, multiple octaves). Likewise, the generated MIDI file will be named cello1_nc_mo.mid.
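For reference, the name could be assembled along these lines. This is a hypothetical helper, not the repo’s code: play_bach.py may build the name differently, and the suffix used when normal=True is an assumption.

def model_name(app='cello', model_type=1, concatenate=False, normal=False):
    c = 'c' if concatenate else 'nc'    # concatenate -> 'c', otherwise 'nc'
    o = 'so' if normal else 'mo'        # assumed: single octave vs multiple octaves
    return f'{app}{model_type}_{c}_{o}'

print(model_name())   # cello1_nc_mo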

Sequence length, batch size, number of epochs are hyper-parameters we have already described in Part 2.

To start training, execute the following from the music21 directory:

python -m play_bach.py -l 0 -f -pm

  • -l 0: create a new model (we could also load an already partially trained model and resume training).
  • -f: fit the model, i.e. train it.
  • -pm: run predictions with the trained model and create a MIDI file.

You will see some TensorFlow traces, showing how the model accuracy evolves, epoch after epoch:

Epoch 00030: val_accuracy improved from 0.61535 to 0.61721, saving model to /content/drive/My Drive/DEEP/music21/checkpoint/cp-030.ckpt
Epoch 00031: val_accuracy improved from 0.61721 to 0.64238, saving model to /content/drive/My Drive/DEEP/music21/checkpoint/cp-031.ckpt
Epoch 00032: val_accuracy did not improve from 0.64238
Epoch 00033: val_accuracy improved from 0.64238 to 0.65458, saving model to /content/drive/My Drive/DEEP/music21/checkpoint/cp-033.ckpt

After a while, accuracy does not improve anymore and training stops at this plateau. Et voilà.

Epoch 00057: val_accuracy improved from 0.73717 to 0.74128, saving model to /content/drive/My Drive/DEEP/music21/checkpoint/cp-057.ckpt
Epoch 00058: val_accuracy did not improve from 0.74128
Epoch 00059: val_accuracy did not improve from 0.74128
Epoch 00059: early stopping
fit ended. model fitted in 709 sec for 70 configured epochs, and 59 actual epochs

As explained before, we monitor accuracy on the validation set, not on the training set. In the example above, we specified a maximum of 70 epochs, but validation accuracy stopped improving after 59 epochs, at about 74%.
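Traces like these come from standard Keras callbacks. Below is a sketch of the kind of configuration that produces them (checkpoint saving on improved validation accuracy, plus early stopping); the exact arguments used in play_bach.py may differ.

import tensorflow as tf

callbacks = [
    tf.keras.callbacks.ModelCheckpoint(
        filepath='checkpoint/cp-{epoch:03d}.ckpt',   # matches the cp-0xx.ckpt names in the traces
        monitor='val_accuracy',
        save_best_only=True,
        save_weights_only=True,
        verbose=1),
    tf.keras.callbacks.EarlyStopping(
        monitor='val_accuracy',
        patience=2,                                  # stop after this many epochs without improvement
        verbose=1),
]

# model.fit(x_train, y_train, validation_split=0.1,
#           epochs=70, batch_size=64, callbacks=callbacks)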

Playing the generated MIDI file

A quick internet search will return multiple applications that play MIDI files (e.g. VLC). You will also need a SoundFont.

A MIDI file contains only codes for notes. It does not say whether a note comes from a piano or a TinkerBell. A SoundFont is used to render MIDI codes with many instruments: mine can do applause, helicopter and gunshot, but more interestingly, many instruments I had never heard of.

The instrument used when generating the MIDI file is a configuration parameter, located in config_bach.py:

my_instrument = instrument.Clavichord()

Of course, for fun, you can experiment with different instruments than the one used in the original music.
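As an illustration (assumed, not the repo’s exact code), this is how an instrument is typically inserted into a music21 stream before the MIDI file is written:

from music21 import stream, note, instrument

s = stream.Stream()
s.insert(0, instrument.Clavichord())            # swap for instrument.Violin(), etc.
s.append(note.Note('G#2', quarterLength=1.0))
s.write('midi', fp='cello1_nc_mo.mid')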

In the next article we will have to decide where to go from here:

We could improve the user interface, explore different model architectures, look at the current performance and ways to improve it, run the application on a variety of platforms, use dedicated machine learning hardware, put the application on the web, or build a Docker container ….

Hopefully we will eventually do all of this.

In the meantime, stay tuned !!!

— — — — — Do not cross this line if you are not interested in details — — — — —

In Part 2, I mentioned that the output layer predicts a number between 1 and 129, because there are 129 unique notes/chords in our cello corpus.

This was a bit of an oversimplification.

To be more exact, the output layer is a Softmax: an array of 129 numbers whose sum is 1 by construction. It expresses the model’s prediction as a probability (or confidence level) for each of the 129 possible notes/chords.

For instance, in the example below, all the numbers are very small except one, whose value is 0.98. This is the model’s way of saying ‘I am pretty sure the next note is note number 102’ (102 being the index of the 0.98 value in the softmax array).

softmax_array([6.08105147e-05, 3.24967942e-07, 2.74516833e-06, 2.47346634e-05, 7.15374142e-08, 1.58419255e-06, 1.47027833e-06, 2.51121264e-07, 3.34257456e-05, 1.54934426e-07, 7.13374334e-07,
....
7.70791473e-07, 2.17805621e-07, 3.28565389e-03, 1.46220991e-08, 2.45054448e-08, 1.74178112e-08, 2.71401427e-06,
9.83197570e-01, 4.78140656e-08, 1.69292580e-08, 2.30138676e-06, 2.89644220e-07,
....
2.28157023e-06, 6.47835404e-05], dtype=float32)
softmax_array[102] = 0.98319757

Index 102 in the list of all unique notes/chords is ‘G#2’. This is the model’s prediction !!!

list_of_all_unique_note-chord: ['A2', 'A2.F#3', 'A2.F3', 'A3', 'A3.B3', 'A3.F#3', 'A4', 'A4.G#4', 'B-2', 'B-2.E3', 'B-2.G#3', 'B-2.G3', 'B-3', 'B-3.D3', …]
list_of_all_unique_note-chord[102]: 'G#2'

If the highest value in the softmax array is not significantly higher than all the other values, the model has no real clue (and therefore neither do you).
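A minimal sketch of that last step, turning the softmax output into a note/chord name. The arrays below are tiny stand-ins for the real 129-entry output and vocabulary, just to show the argmax-and-lookup logic.

import numpy as np

# Toy stand-ins for the real softmax output and note/chord vocabulary
softmax_array = np.array([0.010, 0.005, 0.980, 0.005], dtype=np.float32)
unique_notes = ['A2', 'A2.F#3', 'G#2', 'A3']

predicted_index = int(np.argmax(softmax_array))      # index of the highest probability
predicted_note = unique_notes[predicted_index]       # 'G#2'
confidence = float(softmax_array[predicted_index])   # 0.98

print(predicted_index, predicted_note, confidence)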

So, was crossing the line worthwhile?
