Practical Tips for Training a Music Model

Andrew Shaw
Towards Data Science
7 min read · Aug 13, 2019

This is Part II of the “Building An A.I. Music Generator” series. We’ll be taking a deeper dive into building the music model introduced in Part I.

Here’s a quick outline:

Data Encoding and how to handle:

  • Polyphony
  • Note pitch/duration

Training best practices:

  • Data Augmentation
  • Positional Encoding
  • Teacher Forcing
  • TransformerXL Architecture

Note: This is a more technical post and not critical for understanding Part III/IV. Feel free to skip ahead!

Quick Rehash.

In the previous post, I mentioned the 2 basic steps to training a music model:

Step 1. Convert music files into a sequence of tokens:

Step 2. Build and train the language model to predict the next token:

This post is divided into the same two steps. Only this time, we won’t be glossing over the details.

Step 1. Converting music to tokens

Start with the Raw Data

We’ll be using a dataset composed mostly of MIDI files. MIDI is one of the most popular digital music formats, and there are tons of these files on the internet. More data is deep learning’s best friend.

Raw MIDI is represented in bytes. Even when converted to text, it’s not very human readable.

Instead of displaying MIDI, I’ll be showing you something like this:

Even if you can’t read music, it makes more sense than this

Tokenization

It turns out there are a couple of gotchas to keep in mind when encoding music files to tokens. For text, it’s a pretty straightforward conversion.

Here’s how you would encode text to a sequence of tokens:

Vocabulary: { 
'a': 1,
'is': 2,
'language': 3,
'like': 4,
'model': 5,
'music': 6
}
Text: “a music model is like a language model”
Tokenized: [1, 6, 5, 2, 4, 1, 3, 5]

It’s a straight one-to-one mapping — word to token. You can do other types of encoding like splitting contractions or byte-pair encoding, but it’s still a sequential conversion.
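
Here’s that text case as a minimal Python sketch (the tiny vocabulary and the whitespace-only splitting are simplifications):

# Minimal sketch: text tokenization is a straight dictionary lookup.
vocab = {'a': 1, 'is': 2, 'language': 3, 'like': 4, 'model': 5, 'music': 6}

def tokenize(text):
    return [vocab[word] for word in text.split()]

tokenize("a music model is like a language model")
# [1, 6, 5, 2, 4, 1, 3, 5]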

Music however, is best represented in 2D:

Here’s another way of looking at it:

This is a plot of frequency across time (also known as a pianoroll). You’ll notice 2 things about this graph:

  1. A single music note is a collection of values (pitch+duration)
  2. Multiple notes can be played at a single point in time (polyphony)

The trick to training transformers with music is to figure out how to tokenize this 2D data to a single dimension.

Notes — One to Many

  1. A single music note represents a collection of values:
    - Pitch (C, C#, … A#, B)
    - Duration (quarter note, whole note)

Note: There are actually more attributes to a note such as instrument type (drums, violin), dynamics (loudness), and tempo (timing). These aren’t as important for generating pop melodies, but are helpful for generating performance pieces.

The easiest way to go about this is to encode a single note into a sequence of tokens:

Note: Another option is to combine the values into a single token [C:QTR,D:QTR,E:HLF], but that means a larger vocabulary size and less control over predictions.
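
As a rough sketch of that one-to-many encoding (the n&lt;pitch&gt;/d&lt;duration&gt; token names match the dumps later in this post; the real encoder is in the notebook linked below):

# Sketch: one note becomes a pair of tokens, a pitch token and a duration token.
# Pitch is the MIDI note number; duration is a count of time steps.
def note_to_tokens(midi_pitch, duration_steps):
    return [f"n{midi_pitch}", f"d{duration_steps}"]

note_to_tokens(60, 4)   # middle C held for 4 steps -> ['n60', 'd4']
note_to_tokens(52, 8)   # the E below it, held twice as long -> ['n52', 'd8']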

Polyphony — Many to One

When reading back a flat sequence of note tokens, how do you know whether to play them all at the same time or one after another?

Another music model called “BachBot” has a clever solution to this: notes are played sequentially if they’re separated by a special “SEP” token. If not, they’re all played at once.
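
Here’s a toy sketch of that rule. In the real encoding (see the token dumps below) the separator also carries a duration; this version only illustrates the ordering:

# Sketch: flatten 2D music into one token stream.
# Notes within the same time step are emitted back to back (played together);
# an "xxsep" token marks the jump to the next time step.
def flatten(time_steps):
    tokens = []
    for i, notes in enumerate(time_steps):
        if i > 0:
            tokens.append("xxsep")
        for pitch, dur in notes:
            tokens += [f"n{pitch}", f"d{dur}"]
    return tokens

flatten([[(60, 4), (52, 8)],   # a C and an E sound together...
         [(62, 4)]])           # ...then a D on the next step
# ['n60', 'd4', 'n52', 'd8', 'xxsep', 'n62', 'd4']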

Putting it all together

Put this all together and you have the example we showed at the very beginning:

If you prefer Python over English, try out this notebook.

Step 2. Training best practices

Now we’re ready to train on our tokenized data. This will be the same training code from Part I, but with a couple more features enabled.

Let’s go over the code line by line.
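
Here are the relevant lines, for reference (the rest of the snippet, including the learner and dataloader setup, is omitted here):

config['encode_position'] = True      # line 2: positional beat encoding
config['transpose_range'] = (0, 12)   # line 3: data augmentation (transposition)
config['mask_steps'] = 4              # line 4: teacher-forcing style attention mask
model = get_language_model(arch=MusicTransformerXL…)  # line 6: TransformerXL architecture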

Data Augmentation

Line 3: config['transpose_range'] = (0, 12)

Data augmentation is a great dataset multiplier. A single song is transformed into 12 songs of different keys!

item.to_text() # Key of C
Tokens: xxbos xxpad n60 d4 n52 d8 n45 d8 xxsep d4 n62 d4
item.transpose(4).to_text() # Key of E
Transpose: xxbos xxpad n64 d4 n56 d8 n49 d8 xxsep d4 n66 d4
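
Under the hood this is just a pitch shift. Here’s a toy sketch over the token strings (the real transpose works on the numeric encoding rather than text, but the effect is the same):

# Sketch: transpose by shifting every pitch token (n<midi>) up by `semitones`;
# duration (d*) and special (xx*) tokens are left untouched.
def transpose_tokens(tokens, semitones):
    out = []
    for t in tokens:
        if t.startswith('n') and t[1:].isdigit():
            out.append(f"n{int(t[1:]) + semitones}")
        else:
            out.append(t)
    return out

tokens = "xxbos xxpad n60 d4 n52 d8 n45 d8 xxsep d4 n62 d4".split()
' '.join(transpose_tokens(tokens, 4))
# 'xxbos xxpad n64 d4 n56 d8 n49 d8 xxsep d4 n66 d4'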

Data augmentation seems to help the model generalize across keys and beats. However, in the final few epochs of training, I remove the augmentation and keep everything in the key of C.

It’s much easier for both machines and humans to predict and play all white keys.

Positional Beat Encoding

Line 2: config['encode_position'] = True

Positional Beat Encoding is extra metadata we feed our model to give it a better sense of musical timing.

As we saw in the tokenization step, converting notes to tokens is not a 1-to-1 mapping. This means the position of the token doesn’t correspond to its actual position in time.

item.to_text()
Token: xxbos xxpad n60 d4 n52 d8 n45 d8 xxsep d4 n62 d4
list(range(len(item.data)))
Index: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
item.position//4
Beat: 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2

Token at index 7 is actually played on beat 1.

If we send our model the “beat” metadata along with the tokens, it will have a lot more contextual information. It no longer has to figure out the musical timing on its own.
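
Here’s a simplified sketch of the idea: keep a running clock that advances only at separators, and tag every token with the beat the clock is on. (This won’t reproduce the exact numbers in the dump above, since the real implementation handles the special tokens differently, and the 4 steps per beat is an assumption.)

# Sketch: derive a per-token beat index from the token stream.
# The clock only advances at "xxsep d<steps>" pairs.
def beat_positions(tokens, steps_per_beat=4):
    beats, clock = [], 0
    for i, tok in enumerate(tokens):
        beats.append(clock // steps_per_beat)
        if tok == 'xxsep':
            clock += int(tokens[i + 1][1:])   # the duration token right after the separator
    return beats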

Teacher Forcing

Line 4: config['mask_steps'] = 4

When training transformers, you usually apply an attention mask to keep the model from peeking at the next token it’s supposed to predict.

lm_mask(x_len=10, device=None)   # 10 = bptt

tensor([[[[0, 1, 1, 1, 1, 1, 1, 1, 1, 1],   # Can only see itself
          [0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
          [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
          [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
          [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
          [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
          [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
          [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
          [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
          [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]]])   # Can see everything

# 0 = Can see token
# 1 = Can't see token

Each row represents a time step and shows which tokens that step can and cannot see.
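
That triangular pattern is easy to build yourself. A minimal PyTorch sketch (the name is mine, not the library's lm_mask):

import torch

# Sketch: standard causal attention mask.
# mask[i, j] == 1 means position i is NOT allowed to attend to position j.
def causal_mask(seq_len):
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.uint8), diagonal=1)

causal_mask(5)
# tensor([[0, 1, 1, 1, 1],
#         [0, 0, 1, 1, 1],
#         [0, 0, 0, 1, 1],
#         [0, 0, 0, 0, 1],
#         [0, 0, 0, 0, 0]], dtype=torch.uint8)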

Instead of only masking the future tokens, you can additionally mask the few tokens before. This forces the model to predict several steps ahead, and ideally produces a more generalized model.

window_mask(10, None, size=(2,0))   # Window size of 2

tensor([[[[0, 1, 1, 1, 1, 1, 1, 1, 1, 1],   # Only sees itself
          [0, 1, 1, 1, 1, 1, 1, 1, 1, 1],   # Only sees the previous token
          [0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
          [0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
          [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
          [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
          [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
          [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
          [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
          [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]]]])   # Can't see the final 2

Think of this as a reversed “teacher forcing”.
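
If you want to play with it, here is a small sketch that reproduces the pattern above (the name and implementation are mine, not the library's window_mask):

import torch

# Sketch: like the causal mask, but positions are grouped into windows of
# `size` steps, and each position may only see tokens before its own window.
# Position 0 keeps itself visible so it has something to attend to.
def block_window_mask(seq_len, size=2):
    mask = torch.ones(seq_len, seq_len, dtype=torch.uint8)
    for i in range(seq_len):
        visible = max(1, size * (i // size))
        mask[i, :visible] = 0
    return mask

block_window_mask(10, size=2)   # matches the window_mask output above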

Transformer Architecture

Line 6: model = get_language_model(arch=MusicTransformerXL…)

TransformerXL is a specific flavor of the Transformer model. It features relative positional encoding and hidden state memory.

Transformer Memory enables super fast inference. Instead of having to re-evaluate the whole sequence on every prediction, you only need to evaluate the last predicted token. Previous tokens are already stored in memory.
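
Conceptually, generation with memory looks something like this. It’s a pseudocode-level sketch with an assumed model(tokens, mems) interface and an assumed sample_fn; musicautobot’s actual predict loop works differently:

# Sketch: incremental generation with cached hidden states ("mems").
# Without memory, every step would re-run the whole sequence through the model;
# with memory, each step only feeds the single newest token plus the cache.
def generate(model, sample_fn, seed_tokens, n_steps):
    tokens = list(seed_tokens)
    logits, mems = model(tokens, mems=None)          # prime the memory on the seed
    for _ in range(n_steps):
        next_tok = sample_fn(logits[-1])             # pick the next token from the last step's logits
        tokens.append(next_tok)
        logits, mems = model([next_tok], mems=mems)  # only the new token is evaluated
    return tokens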

Relative position — vanilla transformers use absolute position only. It’s very important for music models to know the position of each token relative to the others. This is in addition to our positional beat encoding.

Training Loop

learn.to_fp16(dynamic=True, clip=0.5);
learn.fit_one_cycle(4)

The training code comes for free with the fastai library. I won’t go into details, but Mixed Precision, One Cycle, and Multi-GPU training will save you a lot of time.

The End.

Whew! Now you know everything I know about MusicTransformers. We can build on that concept and create an even cooler music model — the MultitaskTransformer.

Part III. Building a Multitask Music Model: a beefed-up MusicTransformer that can harmonize, generate melodies, and remix songs.

Part IV. Using a Music Bot to Remix The Chainsmokers: remix an EDM drop in Ableton with musicautobot. Pure entertainment purposes only.

Thanks for reading!

Special Thanks to Jeroen Kerstens and Jeremy Howard for guidance, South Park Commons and PalapaVC for support.
