Using Machine Learning to create Music!

Shreyans Jain
Published in Analytics Vidhya
5 min read · Jan 19, 2020

Yes, you read that right! Machine learning can make music. This article will show you how to generate simple piano compositions using a neural network model in Keras. I would like to thank Sigurður Skúli, whose tutorial proved to be a great source of reference and understanding while writing this article.

THE BASIC IDEA

A piano composition is basically a sequence of two things:

  1. Notes
  2. Chords

The idea we will use in our machine learning model is this:

Given a sequence of notes and chords, our model should be able to suggest a suitable note/chord that should be played next.

So let’s say that our model takes as input a list of notes and chords that were played in sequence (and say that the number of items in this list is fixed at 100).

The output for this should be a single note/chord that our model suggests should be played next.

If our model is able to achieve this, then over several iterations we can build up our own composition. For the sake of simplicity, each note/chord in our composition will be played at a time difference of 0.5s (unlike a real piano composition, in which the time between two key presses may vary).
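To make this sliding-window idea concrete, here is a toy sketch in Python (hypothetical note names, and a window of 4 instead of 100):

song = ['C4', 'E4', 'G4', 'C5', 'B4', 'G4']
window = 4
for i in range(len(song) - window):
    print(song[i:i + window], '->', song[i + window])
# ['C4', 'E4', 'G4', 'C5'] -> B4
# ['E4', 'G4', 'C5', 'B4'] -> G4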

PARSING THE INPUT

The input to our training program will be a collection of MIDI files (.mid).

Using the music21 library available in Python, we can parse the MIDI files suitably. Each MIDI file is parsed into a list containing the notes and chords (in sequence) that represent that song.

A parsed song would look like this…
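(A hypothetical excerpt for illustration; the exact contents depend on the MIDI files parsed.)

['E3', 'B3', '4.7', 'G#2', '9.1.4', 'B2', 'F#3', '2.5.9', 'A3']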

Without getting into much detail about what each element of the list represents, it is enough to know that elements in which numbers are separated by dots represent chords, while the other elements represent notes played in different octaves (so ‘B2’ represents the note B played in the second octave).

The snippet below uses music21 to parse every MIDI file in the ‘MIDIS’ folder:

import glob
from music21 import converter, instrument, note, chord

notes = []
for file in glob.glob("MIDIS/*.mid"):
    my_midi = converter.parse(file)
    print("Converting... %s" % file)
    notesInSong = None
    try:
        # file has instrument parts
        s2 = instrument.partitionByInstrument(my_midi)
        notesInSong = s2.parts[0].recurse()
    except:
        # file has notes in a flat structure
        notesInSong = my_midi.flat.notes
    for element in notesInSong:
        if isinstance(element, note.Note):
            notes.append(str(element.pitch))
        elif isinstance(element, chord.Chord):
            notes.append('.'.join(str(n) for n in element.normalOrder))

The list ‘notes’ now contains a sequential ordering of the parsed forms of the songs present in the folder ‘MIDIS’. I have used 33 MIDI files for training, collected from http://www.piano-midi.de/midicoll.htm.

PREPARING TRAINING DATA

Now we will prepare the training data for our neural network model. Each input is a sequence of 100 notes/chords, and the corresponding output is the 101st note/chord played in the song. (The list ‘notes’ built above is used to create this training dataset.)

import numpy
from keras.utils import np_utils

length = 100

# get all distinct notes/chords
notes_chords = sorted(set(item for item in notes))
vocab_size = len(notes_chords)
# create a dictionary to map notes/chords to integers
note_encoded = dict((note, number) for number, note in enumerate(notes_chords))

model_input = []
model_output = []
# create input sequences and the corresponding outputs
for i in range(0, len(notes) - length, 1):
    input_sequence = notes[i:i + length]
    note_chord_out = notes[i + length]
    model_input.append([note_encoded[char] for char in input_sequence])
    model_output.append(note_encoded[note_chord_out])

N = len(model_input)
# keep the raw integer sequences; they are reused later as generation seeds
raw_sequences = list(model_input)
# reshape the input into a format compatible with LSTM layers
model_input = numpy.reshape(model_input, (N, length, 1))
# normalize input
model_input = model_input / float(vocab_size)
# one-hot encode the output
model_output = np_utils.to_categorical(model_output)

CREATING THE MODEL

from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Activation
from keras.layers import BatchNormalization as BatchNorm

model = Sequential()
model.add(LSTM(
    512,
    input_shape=(model_input.shape[1], model_input.shape[2]),
    recurrent_dropout=0.3,
    return_sequences=True
))
model.add(LSTM(512, return_sequences=True, recurrent_dropout=0.3))
model.add(LSTM(512))
model.add(BatchNorm())
model.add(Dropout(0.3))
model.add(Dense(256))
model.add(Activation('relu'))  # hidden layer; softmax belongs only on the output layer
model.add(BatchNorm())
model.add(Dropout(0.3))
model.add(Dense(vocab_size))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

The LSTM layer is a recurrent neural network (RNN) layer that takes a sequence as input and can return either the full sequence of outputs (return_sequences=True) or only the output of its final step.

Dropout layers randomly drop a fraction of their input units during training. This helps prevent overfitting.

Dense layers, or fully connected layers, are neural network layers in which each input node is connected to each output node.

The Activation layer determines which activation function the network uses to calculate the output of a node.
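To see how these layers fit together, here is a sketch (as comments; the shapes follow from the code above, with the batch dimension omitted, and BatchNorm/Dropout leave shapes unchanged) of how the input changes shape as it flows through the network:

# input:                            (100, 1)      100 timesteps, 1 feature each
# LSTM(512, return_sequences=True)  (100, 512)    full sequence returned
# LSTM(512, return_sequences=True)  (100, 512)
# LSTM(512)                         (512,)        only the final step returned
# Dense(256) + relu                 (256,)
# Dense(vocab_size) + softmax       (vocab_size,) probability of each note/chord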

TRAINING THE MODEL

from keras.callbacks import ModelCheckpoint

filepath = "values_weights.hdf5"
# save the weights whenever the training loss improves on its best value so far
checkpoint = ModelCheckpoint(
    filepath,
    monitor='loss',
    verbose=0,
    save_best_only=True,
    mode='min'
)
callbacks_list = [checkpoint]
model.fit(model_input, model_output, epochs=300, batch_size=128, callbacks=callbacks_list)

The model’s weights are written to the file “values_weights.hdf5” whenever the loss improves, so the best weights are available once training is complete.

USING THE MODEL TO COMPOSE MUSIC

Now we will use the model trained above to create our own piano composition.

LOADING THE MODEL

model = Sequential()
model.add(LSTM(
    512,
    input_shape=(model_input.shape[1], model_input.shape[2]),
    recurrent_dropout=0.3,
    return_sequences=True
))
model.add(LSTM(512, return_sequences=True, recurrent_dropout=0.3))
model.add(LSTM(512))
model.add(BatchNorm())
model.add(Dropout(0.3))
model.add(Dense(256))
model.add(Activation('relu'))
model.add(BatchNorm())
model.add(Dropout(0.3))
model.add(Dense(vocab_size))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
# the architecture must match training exactly for the saved weights to load
model.load_weights('values_weights.hdf5')

GENERATING A COMPOSITION

We will take a random sequence from the integer-encoded training sequences (‘raw_sequences’) built earlier. Let’s call this list ‘pattern’. This sequence has length 100 and will help us generate the first note/chord of our composition. We then append this note/chord to the last 99 items of ‘pattern’ to form a sequence of length 100 again, which is used to generate the second note/chord. This process is repeated 500 times to generate the full composition.

# pick a random seed from the raw (un-normalized) integer sequences built earlier
start = numpy.random.randint(0, len(raw_sequences) - 1)
# dictionary to map integers back to notes/chords
note_decoded = dict((number, note) for number, note in enumerate(notes_chords))
pattern = raw_sequences[start]
output = []
# generate 500 notes
for note_index in range(500):
    prediction_input = numpy.reshape(pattern, (1, len(pattern), 1))
    prediction_input = prediction_input / float(vocab_size)
    prediction = model.predict(prediction_input, verbose=0)
    index = numpy.argmax(prediction)
    result = note_decoded[index]
    output.append(result)
    # slide the window: append the prediction, then drop the oldest item
    pattern.append(index)
    pattern = pattern[1:len(pattern)]

The list ‘output’ now contains the sequential ordering of notes/chords present in our composition. We will now generate a MIDI file from this list.

from music21 import stream

offset = 0
output_notes = []
for pattern in output:
    # pattern is a chord
    if ('.' in pattern) or pattern.isdigit():
        notes_in_chord = pattern.split('.')
        chord_notes = []
        for current_note in notes_in_chord:
            new_note = note.Note(int(current_note))
            new_note.storedInstrument = instrument.Piano()
            chord_notes.append(new_note)
        new_chord = chord.Chord(chord_notes)
        new_chord.offset = offset
        output_notes.append(new_chord)
    # pattern is a note
    else:
        new_note = note.Note(pattern)
        new_note.offset = offset
        new_note.storedInstrument = instrument.Piano()
        output_notes.append(new_note)
    # each note/chord starts 0.5s after the previous one
    offset += 0.5

midi = stream.Stream(output_notes)
midi.write('midi', fp='output.mid')

The ‘output.mid’ file contains the final generated composition.
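To audition the result straight from Python, here is a minimal sketch using music21 (assuming a MIDI player is configured for your environment; otherwise, simply open ‘output.mid’ in any MIDI player):

from music21 import converter
generated = converter.parse('output.mid')
generated.show('midi')  # plays or embeds the MIDI where supported (e.g. in Jupyter)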

RESULTS

The results were surprisingly good, but with one noticeable flaw: the output may contain long patches of repeated notes/chords (for example, Output1 at 0:30–0:50). Apart from pattern repetition, the model also suffers from a lack of authenticity due to the limited vocabulary of notes and chords that we have used.
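One common way to reduce such repetition (not used in this article, so treat it as a hypothetical variation) is to sample the next note from the predicted probability distribution with a ‘temperature’, instead of always taking the argmax:

# hypothetical variation: temperature sampling in place of numpy.argmax(prediction)
def sample_with_temperature(prediction, temperature=1.0):
    # a higher temperature flattens the distribution, adding randomness
    logits = numpy.log(prediction + 1e-9) / temperature
    probs = numpy.exp(logits) / numpy.sum(numpy.exp(logits))
    return numpy.random.choice(len(probs), p=probs)

# usage inside the generation loop, replacing index = numpy.argmax(prediction):
index = sample_with_temperature(prediction[0], temperature=0.8)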
