Deep Learning and Music Generation

— Part 1. What I Learned from The Godfather

Using deep learning (specifically, recurrent neural networks) to compose music is a cool idea. This post presents my study of using different recurrent neural networks to generate/simulate a soundtrack from The Godfather. The program is based on supervised learning with a simple note-by-note prediction approach, and is implemented in Python together with libraries such as Keras, MIDI, NLTK, and Pygame.

Original Music and Generated Music

First things first, here are two tracks: the original soundtrack and the one generated/simulated by a recurrent neural network.

Some comments about these two songs:

  • The second track titled “Godfather Love Theme (LSTM)” is generated/simulated by a special type of recurrent neural network (Long Short-Term Memory a.k.a. LSTM).
  • The format of music processed by the program is actually MIDI. Because the website hosting the sound (i.e., SoundCloud) doesn’t support MIDI, I have to convert .mid files to .mp3 files for uploading/streaming.
  • The generated/simulated track creates music notes only in a single channel and with a single instrument. Consequently, it is not as rich as the original one.
  • The soundtracks generated/simulated by other recurrent neural networks are also stored on my SoundCloud account.

Processing the MIDI files and building/training the recurrent neural networks are explained in the following sections.

MIDI File Processing

The format of music managed by this project is MIDI. Reading, parsing, and creating MIDI files is handled by the Python MIDI library, while tokenizing MIDI data to extract notes/volumes/channels is handled by the Natural Language Toolkit (NLTK). A simple Pygame program that plays the notes Do-Re-Mi is shown in Fig.1.

Fig.1 a simple Pygame player
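A minimal sketch of the note handling in such a player might look like the following. This is my own illustration, not the code in Fig.1: the mapping table and function names are hypothetical, and actual playback would hand the resulting note numbers to `pygame.midi.Output`.

```python
# Minimal sketch (hypothetical): map solfege syllables to MIDI note
# numbers in the C-major octave starting at middle C (MIDI 60).
# Real playback would pass these numbers to pygame.midi.Output's
# note_on()/note_off() methods.

SOLFEGE_TO_MIDI = {
    "Do": 60,  # C4 (middle C)
    "Re": 62,  # D4
    "Mi": 64,  # E4
    "Fa": 65,
    "Sol": 67,
    "La": 69,
    "Ti": 71,
}

def to_midi_notes(syllables):
    """Translate a list of solfege syllables to MIDI note numbers."""
    return [SOLFEGE_TO_MIDI[s] for s in syllables]

print(to_midi_notes(["Do", "Re", "Mi"]))  # -> [60, 62, 64]
```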

The following figure shows 424 notes of the original soundtrack in The Godfather.

Fig.2 the notes in the original soundtrack

Converting a sequence of notes (Fig.3) into a set of training data (Fig.4) follows the idea of supervised learning with a simple note-by-note prediction approach: take 10 notes as the input and the next note as the label, then slide this window through all 424 notes to produce 414 input/label pairs.

Fig.3 the first 20 notes in the original soundtrack
Fig.4 the first 10 training data and labelled data
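The sliding-window construction above can be sketched as follows. This is a minimal illustration with my own variable names; `notes` stands in for the 424 parsed note tokens from the MIDI file.

```python
def make_training_pairs(notes, window=10):
    """Slide a fixed-size window over the note sequence: each run of
    `window` consecutive notes is one training sample, and the note
    immediately after it is the corresponding label."""
    inputs, labels = [], []
    for i in range(len(notes) - window):
        inputs.append(notes[i:i + window])
        labels.append(notes[i + window])
    return inputs, labels

# 424 notes with a 10-note window yield 414 input/label pairs.
notes = list(range(424))      # placeholder for the parsed note tokens
X, y = make_training_pairs(notes)
print(len(X), len(y))         # -> 414 414
```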

Recurrent Neural Network

Keras provides three kinds of recurrent layers for building a recurrent neural network: Simple RNN, Gated Recurrent Unit (GRU), and Long Short-Term Memory (LSTM).

Fig.5 Keras RNN/LSTM example

The code that builds and trains an LSTM recurrent neural network is shown in Fig.5. The network has only four layers: an input layer, a recurrent layer, a dense layer, and an activation layer. The loss function is categorical cross-entropy and the activation function is softmax, following the idea given in Chapter 6 of Deep Learning with Keras.
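The four-layer structure just described can be sketched in Keras like this. This is my own reconstruction, not the code in Fig.5: the unit count and vocabulary size are assumptions, and swapping `LSTM` for `SimpleRNN` or `GRU` builds the other two networks compared later.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Activation

SEQ_LEN = 10    # length of each input window of notes
N_VOCAB = 128   # assumption: one class per possible MIDI pitch

model = Sequential([
    # recurrent layer reading one-hot note windows of shape (10, 128)
    LSTM(128, input_shape=(SEQ_LEN, N_VOCAB)),
    # dense + softmax turn the LSTM state into a probability
    # distribution over the next note
    Dense(N_VOCAB),
    Activation('softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
model.summary()
```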

Some tricks in the code:

  • Line 23 — plot the network (See Fig. 6)
  • Line 26 — define early stopping
  • Line 30~41 — plot the loss history in 500 epochs with early stopping (See Fig. 7)
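Those tricks roughly correspond to the sketch below. The callback arguments and the plotting calls are my assumptions about what the code in Fig.5 does, not a copy of it; `model`, `X`, and `y` stand in for the compiled network and the training pairs.

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop training once the loss has not improved for 10 consecutive
# epochs (the patience value here is an assumption).
early_stopping = EarlyStopping(monitor='loss', patience=10)

# Plotting the network architecture (as in Fig.6) would use:
# from tensorflow.keras.utils import plot_model
# plot_model(model, to_file='model.png')

# Training for up to 500 epochs with early stopping, then plotting
# the loss history (as in Fig.7):
# history = model.fit(X, y, epochs=500, callbacks=[early_stopping])
# import matplotlib.pyplot as plt
# plt.plot(history.history['loss'])
# plt.xlabel('epoch'); plt.ylabel('loss'); plt.show()

print(early_stopping.monitor, early_stopping.patience)
```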
Fig.6 the architecture of three recurrent neural networks
Fig.7 the loss history under SimpleRNN, GRU, and LSTM