Music & NLP: Domain Expansion in Transfer Learning

Exploring the extent to which existing natural language processing models can transfer to new domains.

--

Photo by blocks on Unsplash

Authors: Lech Brzozowski, Hugo Fernandez-Montenegro, Andrew Smith, and Xiaosong Xu

This article was produced as part of the final project for Harvard’s AC295 Fall 2020 course.

Introduction

The last decade has seen an incredible amount of resources put toward machine learning methods that can comprehend and/or generate human-written language. From embeddings and transformers to the BERT and GPT models, researchers have invested heavily in developing open-source tools for public use. More recently, researchers have explored techniques that allow these tools to be applied in new domains. However, those efforts have largely been limited to target domains that are either different fields of text or new languages. How far can the target domain deviate from the source domain? Understanding where that boundary lies will allow existing work to be reused in as many new domains as possible.

In an attempt to explore the boundaries of transfer learning, our group decided to test a series of existing NLP models using musical MIDI data as input. Music is often considered a language itself, and models that utilize attention and transformers might be able to extract the higher-level theory behind it.

Data Collection and Processing

The first step in our adventure was locating a dataset that included a large number of MIDI files. While there are a few available as part of TensorFlow’s Magenta project, we decided to proceed with the Lakh MIDI Dataset v0.1¹. This dataset includes over 176,000 unique MIDI files spanning all sorts of music genres.

If you have never worked with MIDI files, they are rather interesting. A single file can contain up to 128 different instruments, as well as information about notes, chords, tempo, velocity and more. The first challenge we encountered was deciding how much of this information we wanted to retain for our experiment.

The top image above shows a typical MIDI file containing multiple instruments. The bottom image shows the same section of the file with only the bassline instrument isolated. To keep the data as simple as possible, we decided to extract the most popular instrument in the dataset (string piano).
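As a rough illustration of that isolation step, the snippet below uses music21 to pull out a single instrument part from a multi-instrument file. It is a minimal sketch rather than our exact preprocessing script, and the file name and instrument check are placeholders.

from music21 import converter, instrument

# Load a MIDI file and group its notes by instrument (file name is a placeholder).
score = converter.parse("example_song.mid")
parts = instrument.partitionByInstrument(score)

# Keep only the part whose instrument name mentions "Piano", if one exists.
piano_part = None
if parts:
    for part in parts.parts:
        if "Piano" in (part.partName or ""):
            piano_part = part
            break

# Fall back to the flattened score when no piano part is labeled.
notes_source = piano_part.recurse().notes if piano_part else score.flat.notes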

The next item we had to determine was how best to translate this MIDI data into a representation that an NLP model could understand. While we first explored methods that produce embeddings, a simpler technique presented itself.

Utilizing MIT’s Music21 library, we were able to directly translate the information in a MIDI file to raw string output. Below is an example of the text produced using this library.

<music21.tempo.MetronomeMark animato Quarter=120.0><music21.meter.TimeSignature 4/4> <music21.note.Rest rest>
<music21.note.Note F> <music21.note.Note A> <music21.chord.Chord C4 A3>

With the help of a custom function, we were able to extract the key elements of each song that we needed, and using this method we generated a text representation for every song in our dataset. However, we did lose some of the fidelity of the music along the way. Below is an example from a Frédéric Chopin piece, starting with the original MIDI input.

A defining characteristic of this song is the duration between notes. It starts with a long pause after the first note, then quickly follows with a sequence of notes in rapid succession. Now, here is that same song after processing.

The processing removed the elements of the MIDI file related to tempo and the rests between notes. Despite this limitation, there was sufficient information available for our experiment. Finding the best way to incorporate time-related information from the songs remains an open area for further research.
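For reference, the custom extraction function mentioned above might look roughly like the sketch below. The function name and exact formatting are illustrative rather than our implementation, but the output mirrors the music21 strings shown earlier.

from music21 import converter, note, chord

def stream_to_text(midi_path):
    # Parse the MIDI file and walk through its notes, chords, and rests in order.
    score = converter.parse(midi_path)
    tokens = []
    for element in score.flat.notesAndRests:
        if isinstance(element, note.Note):
            tokens.append(f"Note {element.pitch.name}")
        elif isinstance(element, chord.Chord):
            pitches = " ".join(p.nameWithOctave for p in element.pitches)
            tokens.append(f"Chord {pitches}")
        elif isinstance(element, note.Rest):
            tokens.append("Rest rest")
    return ", ".join(tokens)

# Example: text = stream_to_text("chopin_example.mid")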

Base LSTM Model

Before starting the process of transfer learning, it was important to first train a base model for comparison. To do this, we constructed a basic LSTM model in TensorFlow consisting of multiple LSTM layers. This model was then trained on the piano-only MIDI data processed using the methods described earlier. From this model we were able to produce the following music. We recommend turning down your speakers first.

This “music piece” is rather poor, but there are hints of a melody in it. Something we discovered while training this model is that it is very easy to overfit. Any time the model was allowed to train for more than three epochs, the generated music consisted of a single chord repeating over and over.
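For reference, a baseline along the lines described above might look like the sketch below, written with the Keras API. The layer sizes, vocabulary size, and sequence length are placeholders rather than our exact configuration.

import tensorflow as tf

VOCAB_SIZE = 400   # number of distinct note/chord tokens (placeholder)
SEQ_LEN = 100      # length of each input note sequence (placeholder)

# Stacked LSTM layers predicting the next token in a sequence of notes.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 128, input_length=SEQ_LEN),
    tf.keras.layers.LSTM(256, return_sequences=True),
    tf.keras.layers.LSTM(256),
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
])

model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
# model.fit(X, y, epochs=3)  # more than ~3 epochs led to overfitting in our runs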

Transfer Learning — Version 1.0

Now that we had the data processing and base model set, we could turn our attention to the main focus of our experiment. However, we first had to decide which preexisting NLP model would be best for our process. The decision came down to Google’s BERT and OpenAI’s GPT-2 models. Ultimately we went with GPT-2, as it was better suited to predicting sequences and didn’t require the same tokenization process as BERT. This would hopefully prevent us from losing any additional fidelity in the songs.

Image by OpenAI for GPT-2

Fine-Tuning OpenAI’s GPT-2

There are several approaches we could have taken to fine-tune the GPT-2 model. The first would have been to use the model directly from OpenAI’s GitHub repository. Alternatively, we could have used Hugging Face’s Transformers library to achieve similar objectives.

But, while researching options, we found a Python library called gpt-2-simple. We decided to use this library because it is a nice wrapper around the pre-trained GPT-2 model that requires minimal setup and provides two convenient methods: finetune for fine-tuning, and generate to auto-generate text.

The fine-tuning process is actually quite straightforward. First, we downloaded the pre-trained model from Google Cloud using gpt2.download_gpt2, after which we could call gpt2.finetune. The finetune method takes the name of a text file along with the number of training steps. In our case, we used 1,000 steps.

After fine-tuning finishes, the model is saved to a checkpoint directory, and we can then call the gpt2.generate method to auto-generate music notes. By default, it starts generating without a prompt, but we can optionally specify a prefix, which sets the context for the music notes that are generated.
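Put together, the workflow looks roughly like the condensed sketch below. The text file name and prefix are placeholders for our processed note data, and the 124M model size is an assumption (gpt-2-simple’s default).

import gpt_2_simple as gpt2

# Download the pre-trained GPT-2 weights (124M is the library's default size).
gpt2.download_gpt2(model_name="124M")

# Fine-tune on the text representation of the songs.
sess = gpt2.start_tf_sess()
gpt2.finetune(sess, "piano_notes.txt", model_name="124M", steps=1000)

# Generate new notes, optionally seeded with a prefix that sets the context.
gpt2.generate(sess, prefix="Note E, Note A, Note C, Rest rest")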

Below is an example of the music generated using GPT-2 with an input structure similar to: Note E, Note A, Note C, Rest rest, Note E, Note A, Rest rest, Note B, Rest rest, Note B, Note G, Note D, Note G, Rest rest

This music piece is significantly better than the music generated using the base LSTM model. There are clear harmonic patterns and a repeating melodic structure. It is interesting to note that the output generated by GPT-2 is almost entirely in the same format as the input, which indicates that the fine-tuned GPT-2 model has learned the “grammar” of this quite different music domain.

Visualizing the Model’s Attention

Beyond simply hearing the output generated by this model, finding a technique to visualize the inner workings of the model could help us improve the fine-tuning process. To do this, we used BertViz, an open-source tool for visualizing self-attention, originally built around the BERT NLP model. BertViz works by displaying self-attention as a correlation graph, representing the contribution each token makes to a given prediction.

Even though BertViz is designed for BERT, as its name suggests, it can work with any attention-based NLP neural network. In fact, BertViz supports Hugging Face’s GPT-2 transformer via its GPT2Model and GPT2Tokenizer classes. However, the gpt-2-simple library we used does not expose the attention outputs required.

Therefore, we had to make some modifications to gpt-2-simple. At a very high level, the library contains two internal elements that hold information about the past and the present: the past is the input to the layer, and the present is the output from the layer. In a way, this is the attention matrix for the GPT-2 model. Using transposition and matrix multiplication, we were able to update the library so that it would output attention.

With this attention matrix and the token list, BertViz readily outputs an attention graph.
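As an illustration of the BertViz call itself, the sketch below shows the standard route through Hugging Face’s Transformers library; our own pipeline substituted the attention matrices extracted from the modified gpt-2-simple instead, and the note string here is just a placeholder.

from transformers import GPT2Tokenizer, GPT2Model
from bertviz import head_view

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True)

text = "E5 E-5 E5 B4 D5"                      # placeholder note tokens
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# One attention tensor per layer, plus the token strings for the graph axes.
attention = outputs.attentions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
head_view(attention, tokens)                  # renders the interactive attention graph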

Visualization Showing Multiple Layers of Attention

Above is an image showing the attention from three different layers of GPT-2 on the same piece of music. There are a few notable observations to be made here. The first is that our music attention results appear more scattered than those of a typical language model; there doesn’t seem to be a clear relationship between any one individual note and another. The second is that the later layers show more concentrated attention.

The key takeaway we got from using this visualization tool was the importance of input formatting. Frequently we found that the model placed a lot of attention on elements that weren’t significant to the music itself. After exploring several different formatting options, we discovered that the simpler the notation, the better the model performed.

Transfer Learning — Version 2.0

Using the information gained from BertViz, we decided to fine-tune GPT-2 again, but this time using a simpler notation. Instead of Note E, Note A, Note C, Rest rest, we would use E5, E-5, E5, E-5, E5, B4, D5. Below is a song generated using this new notation.

While it is hard to ever definitively say that one song is better than another, given that it is often in the ears of the listener, our team felt the music generated with this notation was the best. Compared with the first attempt at fine-tuning the model, this method had much more pitch variety. However, we did lose some flexibility, as this notation contains no rests. The song could have been even better had it introduced a pause or two along the way.
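For completeness, converting the processed songs into this simpler notation might look like the sketch below. The chord handling (joining pitches with dots) is illustrative rather than our exact choice.

from music21 import converter, note, chord

def stream_to_simple_tokens(midi_path):
    # Emit one bare pitch-with-octave token per event, e.g. "E5" or "E-5".
    score = converter.parse(midi_path)
    tokens = []
    for element in score.flat.notes:          # rests are dropped in this notation
        if isinstance(element, note.Note):
            tokens.append(element.pitch.nameWithOctave)
        elif isinstance(element, chord.Chord):
            tokens.append(".".join(p.nameWithOctave for p in element.pitches))
    return ", ".join(tokens)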

Conclusion and Future Work

Given the time and energy required to train the latest state-of-the-art models, it is important to find new ways to transfer these efforts toward as many new domains as possible. Serious questions about sustainability must be raised if every company in the world has to perform extensive training for every new business problem it faces.

Despite the playful nature of using music in our experiment, we were able to demonstrate that knowledge transfer with NLP models doesn’t have to be limited to moving between, say, legal documents and medical journals. Even though the generated music has room for improvement, it shows the potential of this approach for any kind of sequential data. With a little forethought, and the help of visualizations, we may be able to bend existing models further from their original purpose than one might think.

Photo by Joshua Woroniecki on Unsplash

Future Improvements

Below are a few areas for potential improvement and/or future research.

Music Encoding — Currently the model tokenizes each individual character of the music input. Having the model recognize a note as a single unit, for example ‘C#2’ instead of ‘C’, ‘#’, ‘2’, could result in better performance. This would be possible with modifications to the GPT-2 encoder.json and vocab.bpe files (a rough sketch of an equivalent approach appears after this list).

Music Embedding — During this process, we briefly explored ways to embed the music files using the GloVe encoding pattern. It is feasible that this technique could allow additional NLP models to be tested.

Genre Curation — The current dataset being used includes songs from many different genres of music. This might cause confusion while fine-tuning, as the structure of a baroque piano solo is significantly different from the sparse use of piano chords in a house or techno song. Separating the dataset according to genre could improve final output.

Additional MIDI Information — Our current modified training dataset contains a single instrument with only pitch and timing information. We have not included the rich data of velocity, duration, and other instruments. For future efforts, it would be interesting to devise an encoding mechanism that includes this additional information in order to get richer and more engaging music output.
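Returning to the music-encoding idea above: one way to get whole-note tokens without hand-editing encoder.json and vocab.bpe would be to add them through Hugging Face’s Transformers, as in the hedged sketch below. The token list is illustrative, and this route was not part of our experiment.

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Register each note name as a single token, then grow the embedding matrix to match.
note_tokens = ["C#2", "E-5", "B4", "D5"]      # would be the full note vocabulary
tokenizer.add_tokens(note_tokens)
model.resize_token_embeddings(len(tokenizer))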

[1]: Colin Raffel. “Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching”. PhD Thesis, 2016.
