Music Generation using MusicTransformer with code

Training a Seq2Seq model on a drum MIDI dataset in Python

Mehul Gupta
Data Science in your pocket


In my past blogs on generative modeling, I mostly focused on image generation. Twisting the plot a bit, I will now be exploring music generation using generative modeling. Being a part-time musician myself (my Instagram stories are quite fun!!), I had been eager to explore this topic for a long time. So, let’s begin my 102nd blog:

My debut book “LangChain in your Pocket” is out now

Before we jump in, there are a few technicalities one must know:

Monophonic music: Only one melody going on in the song (say just drums, or a guitar, or a playback singer’s voice, but only one type of sound). I will be handling monophonic music for this post.

Polyphonic music: Multiple melodies going on simultaneously. Any song from Bollywood or Hollywood is an example of polyphonic music, where multiple instruments are played together.

Symbolic representation: Representing a piece of music using notes & timing but no sound. For example, you must have seen musicians in movies keeping sheets with symbols like ♪ ♫ ♬. Such a representation of music is an example of symbolic representation. We will soon talk about MIDI files, which are a type of symbolic representation of music.

Sub-symbolic representation: Representing a piece of music as actual sound, i.e. raw audio (waveforms or spectrograms), rather than as discrete notes and timings.

MIDI File: It’s the most commonly used format for representing music symbolically. Below is an example of what a MIDI file for different instruments looks like.

To understand a MIDI file, we need to know a few music-related concepts.

MUSIC 101s

Note: Any single musical sound is called a note. You must have heard of Sa, Re, Ga, Ma, Pa, Dha, and Ni. These are the 7 notes of music.

Pitch: The pitch of a sound is, in layman’s terms, how high or low the note is. So, though we have 7 notes, depending on the pitch, we can have many variants of these notes.

Duration: It's the length for which a particular note is played.

For a deeper explanation of music 101s, do follow this link. You won’t need any other information on music for this blog.
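If you want to see the note, pitch, and duration concepts concretely in code, here is a tiny sketch using the music21 library (which we install in the next section); the note ‘C4’ is just an arbitrary example:

from music21 import note

n = note.Note('C4')               # the note C in the 4th octave
print(n.pitch.nameWithOctave)     # 'C4'  -> which note it is, and how high/low (pitch)
print(n.duration.quarterLength)   # 1.0   -> how long it is held (a quarter note by default)

n.duration.quarterLength = 2.0    # stretch it to a half note
print(n.duration.type)            # 'half'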

What does a MIDI file look like? Let’s see.

I will be using this drum dataset for the rest of the post. Kindly download it (~3–4 GB) to stay in sync.

For this, you need to follow a sequence of steps:

  • Install MuseScore
  • pip install the music21 library for Python
  • Run the below code
import music21
music21.configure.run()
  • Once done, run the below code snippet in cmd after launching Python (it won’t work in a Jupyter notebook)
import music21
from music21 import *
us = environment.Environment()
#path where musescore software is installed
us['musescoreDirectPNGPath'] = 'C:/Program Files/MuseScore 4/bin/MuseScore4.exe'

#MIDI file to read. Change this according to your data
file = "groove/drummer1/session3/9_rock_135_beat_4-4.mid"

original_score = music21.converter.parse(file).chordify()
original_score.show()

You can expect to see a similar pop-up from the MuseScore software

Sample MIDI for some drum sound

I am not jumping into the technicalities of the diagram; it is just for reference. You can play around with it in your free time.

You might not be able to understand MIDI files straight away, but the textual version is comparatively easier to follow.

For that,

original_score.show('text')

A few things that we should know before moving ahead

  • The MIDI file has some metadata in the beginning, like the instrument being used, the tempo (speed of the beat), the clef, etc.
  • The actual sound starts at beat 4 (see 4,0 written before the 9th line in the 1st image)
  • Each note has a duration of 4 beats (as you can see 4, 8, 12, 16… before every note; the sketch below shows a programmatic way to check this)
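If you would rather check these offsets and durations programmatically than read the show('text') dump, here is a minimal sketch reusing the original_score object parsed above:

# print offset, element, and duration (in beats) for the first few notes/rests
for i, element in enumerate(original_score.flat.notesAndRests):
    print(element.offset, element, element.duration.quarterLength)
    if i >= 9:
        break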

We won’t be jumping into much more complexity, else we might need to attend music classes.

Now that we are done with the basics required for this post, it’s time to bring AI into the picture.

So, a MIDI file can be taken as a sequence of notes along with other information like duration, pitch, etc. It’s sequential data similar to a text dataset, but with a twist:

Here we have multiple sequences running together, one each for pitch, time, velocity, etc., and hence multiple outputs as well. Tricky? Let’s discuss how to frame this problem first.

I will be discussing three aspects of modeling music generation:

Preparing training dataset

How to handle multiple sequences during prediction (multi-sequence prediction)

Suitable models for multi-sequence prediction

Preparing Dataset

It mainly involves the following steps:

Parsing the MIDI files

Generating sequences from the parsed dataset

Parsing the MIDI files

MIDI files can’t be directly fed to any model, so we need to convert them into some apt format (say a DataFrame or tensors). This can be done using the music21 library, which can parse MIDI files. As you saw in the textual format, there exists some sequential data that will be extracted and used for training.

We will be extracting just the duration and pitch for each MIDI file for now, i.e. 2 sequences.

Here we are handling just monophonic sounds, hence just one sequence pair of pitch and duration; for polyphonic music, we would have one sequence pair for every instrument played.

So let’s first load all the required libraries

import music21
import time
from music21 import *
from tqdm.notebook import tqdm, trange
import tensorflow as tf
import pandas as pd
from simpletransformers.seq2seq import Seq2SeqModel, Seq2SeqArgs
import logging
import glob
import string
from sklearn.model_selection import train_test_split as tts

Reading the MIDI files (I hope you are following along with the Groove drum dataset link shared above):

# getting all MIDI file names
files = glob.glob('groove/*/*/*.mid')

training_notes = []
training_duration = []

for file in tqdm(files):
    notes = []
    durations = []

    # parsing MIDI files one by one
    original_score = music21.converter.parse(file).chordify()

    # the elements found depend on the instrument: for drums we mostly get 'rest' elements,
    # for guitar we would get chords
    for element in original_score.flat:
        note_name = None
        duration_name = None

        # metadata
        if isinstance(element, music21.key.Key):
            note_name = str(element.tonic.name) + ':' + str(element.mode)
            duration_name = "0.0"

        # metadata
        elif isinstance(element, music21.meter.TimeSignature):
            note_name = str(element.ratioString) + 'TS'
            duration_name = "0.0"

        elif isinstance(element, music21.chord.Chord):
            note_name = element.pitches[-1].nameWithOctave
            duration_name = str(element.duration.quarterLength)

        # as we are using drum data, the elements found would be 'rest'
        elif isinstance(element, music21.note.Rest):
            note_name = str(element.name)
            duration_name = str(element.duration.quarterLength)

        elif isinstance(element, music21.note.Note):
            note_name = str(element.nameWithOctave)
            duration_name = str(element.duration.quarterLength)

        if note_name and duration_name:
            notes.append(note_name)
            durations.append(duration_name)

    # notes and durations hold the sequences for one music piece
    training_notes.append(notes)
    training_duration.append(durations)

Let’s see a sample
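To peek at the parsed output on your own machine, a quick sketch:

# first 10 note/metadata tokens of the first MIDI file and their corresponding durations
print(training_notes[0][:10])
print(training_duration[0][:10])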

Generating sequences

This is a comparatively easy step. Using a sliding-window approach, we take the Mth to Nth notes as the input sequence and the (M+1)th to (N+1)th notes as the output sequence, where a fixed number of notes (the window size) lies between the Mth and the Nth note. We slide over the ‘duration’ sequence in the same way, creating input and output pairs just as for the notes.

How would we be able to handle multiple input sequences and multiple outputs?

For multiple inputs, we concatenate the note and duration sequences, for both the input and the output. To maintain a separation, we can introduce a ‘NA’ token between the two sequences before concatenating them (though this is not strictly necessary). So, the final input & output sequences would be:

[M_n … N_n] + [‘NA’] + [M_d … N_d] →

[M+1_n … N+1_n] + [‘NA’] + [M+1_d … N+1_d]

where

the LHS is the input and the RHS is the output sequence,

M_n … N_n represent tokens in the notes sequence, and

M_d … N_d represent tokens in the duration sequence.

Note: This is not the only way of mixing multiple sequences. You can try other methods as well, like interleaving (alternating one note token and one duration token), a 2D array with one row for notes and another for durations, etc. The final modeling might change accordingly.
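As a quick, purely illustrative sketch of the interleaving alternative mentioned above (the token values here are made up; we stick with concatenation for the rest of this post):

# alternate note and duration tokens instead of concatenating the two sequences
notes = ['C2', 'rest', 'E-2']
durations = ['0.25', '0.5', '0.25']

interleaved = [token for pair in zip(notes, durations) for token in pair]
print(interleaved)  # ['C2', '0.25', 'rest', '0.5', 'E-2', '0.25']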

Let’s generate input and output sequences using the above-parsed MIDI files

train = []
label = []
window_size = 20

for x, y in zip(training_notes, training_duration):
    if len(x) > window_size:
        for index in range(len(x) - 1 - window_size):
            # input: window of notes + 'NA' separator + window of durations
            in_1 = x[index:index+window_size] + ['NA'] + y[index:index+window_size]
            # output: the same windows shifted ahead by one step
            out_1 = x[index+1:index+1+window_size] + ['NA'] + y[index+1:index+1+window_size]
            train.append(in_1)
            label.append(out_1)

After doing this, we convert the tokens in the note and duration sequences to a standard token representation using ASCII characters and join each sequence into a single string (this is the format expected by the model I will be training).


train_df = pd.DataFrame({'input_text': [x for x in train], 'target_text': [x for x in label]})

# map every unique token (notes + durations combined) to an uppercase ASCII character
# note: this assumes there are at most 26 unique tokens (len(string.ascii_uppercase))
map_dict = {y: string.ascii_uppercase[x] for x, y in enumerate(sorted(train_df['input_text'].explode().unique()))}

# replace each token with its mapped character and join into a space-separated string
train_df = train_df.applymap(lambda x: ' '.join([map_dict[y] for y in x]))

# train-validation split and shuffling
training, validation = tts(train_df)
training = training.sample(len(training))

The logic followed is quite simple

  • Create a DataFrame using the input and output sequences generated above.
  • Get all unique tokens used (notes + durations combined) and assign an ASCII character to represent each unique token (the inverse of this mapping will come in handy later; see the sketch below).
  • Replace the tokens with their mapped characters and form a single space-separated string.
  • Do a train-test split and shuffle.
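One thing worth keeping handy: to turn predictions back into notes and durations later, we will need the inverse of map_dict. A minimal sketch (decode_tokens is a hypothetical helper, not part of the original pipeline):

# inverse mapping: ASCII character -> original note/duration token
inv_map_dict = {v: k for k, v in map_dict.items()}

def decode_tokens(token_string):
    """Convert a space-separated string of ASCII tokens back to note/duration tokens."""
    return [inv_map_dict.get(tok, tok) for tok in token_string.split()]

# example: decode the first encoded training input back to its original tokens
print(decode_tokens(train_df['input_text'].iloc[0]))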

Let’s observe the training data now

The input column, which is a string of tokens, is named input_text, while the output column, again a string of tokens, is named target_text, as this is the format expected by the Seq2Seq model from the simpletransformers library that we will use next. This would change if we change the model.
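A quick way to verify the format on your end:

# the columns expected by simpletransformers' Seq2SeqModel
print(training.columns.tolist())        # ['input_text', 'target_text']
print(training['input_text'].iloc[0])   # space-separated encoded input sequence
print(training['target_text'].iloc[0])  # space-separated encoded target sequence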

Suitable models

Similar to text problems, we can use models like LSTMs, Transformers, BERT, GPT, etc for next token/note prediction in music generation.

The specialized transformers for music generation are called Music Transformers. They can handle multiple input sequences and generate multiple output sequences, and are essentially the decoder part of the Transformer. The architecture is very similar to the Transformers we use for text problems. The only change is that, before feeding in the various sequences, they are concatenated and fed as one sequence, as shown above. Similarly, the output is broken back into the separate sequences and fed to different dense layers before the final output. Remember the ‘NA’ token we added? It can now be used to separate the predictions for notes and duration.
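As a hedged sketch of how that separator could be used once we have a prediction (split_prediction is a hypothetical helper, not part of any library):

# split a decoded token list back into note and duration sequences
# using the 'NA' separator we inserted while building the training data
def split_prediction(tokens, separator='NA'):
    if separator not in tokens:   # the model may not always emit the separator
        return tokens, []
    sep_index = tokens.index(separator)
    return tokens[:sep_index], tokens[sep_index + 1:]

notes_part, duration_part = split_prediction(['C2', 'rest', 'NA', '0.25', '0.5'])
print(notes_part)      # ['C2', 'rest']
print(duration_part)   # ['0.25', '0.5']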

For modeling, we will take the help of the simpletransformers library’s Seq2Seq class, though you can use any Seq2Seq model implementation. The implementation is quite easy, as follows:

# setting up logs
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

# model args to be used
model_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "max_seq_length": 41,
    "train_batch_size": 10,
    "num_train_epochs": 5,
    "save_eval_checkpoints": False,
    "save_model_every_epoch": False,
    "evaluate_generated_text": True,
    "evaluate_during_training_verbose": True,
    "use_multiprocessing": False,
    "manual_seed": 4,
}

encoder_type = "roberta"

# the first 3 parameters are the model type and the exact encoder and decoder model names to be used
model = Seq2SeqModel(
    encoder_type,
    "roberta-base",
    "bert-base-cased",
    args=model_args,
    use_cuda=False,
)

model.train_model(training)
training logs

Also, you might face some issues using the simpletransformers library if TensorFlow 2 is installed on your system. A few hacks I used to train my model:

Uninstall TensorFlow & PyTorch, conda install pytorch, and then install the simpletransformers library.

If you face issues like TensorFlow has no attribute ‘io’ or ‘tensor’, update the simpletransformers library files (very easy: use the traceback to check which files throw the error) and either comment out that code or change the logic.

To understand more about the Seq2Seq implementation, along with all the config parameters used, do refer to the link here. For the config params, there is an explanation here.

As training this model on my local machine would take days, I am stopping the training after 1 epoch, since the whole idea of the post is to demonstrate a working pipeline. Still, let’s look at the evaluation loss:

results = model.eval_model(validation)

This evaluation loss is nothing but cross-entropy
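Once the model is trained, generating a continuation is roughly: feed a seed window (already in the encoded, space-separated string format) and decode the prediction back into notes and durations. A rough sketch, assuming the hypothetical decode_tokens and split_prediction helpers from earlier:

# take a seed sequence from the validation set (already in the encoded string format)
seed = validation['input_text'].iloc[0]

# Seq2SeqModel.predict takes a list of strings and returns a list of generated strings
predicted = model.predict([seed])[0]

# ASCII characters -> note/duration tokens, then split at the 'NA' separator
decoded = decode_tokens(predicted)
next_notes, next_durations = split_prediction(decoded)
print(next_notes)
print(next_durations)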

So, this time you saw how we can use generative modeling for generating music as well. Though different models or different preprocessing methods can be used, the core idea of the pipeline remains the same. The way of mixing the sequences can be different as well. Try training the model for a longer duration and expect better results.

A big thanks to the author of this book for reference.

It’s a wrap till then
