GPT: Yes, and It Can (Kind of) Play Jazz Too

Marcos Acosta
Published in The Startup
Dec 8, 2020

Introduction

After a couple of months of exposure to GPT-mania, I’ve developed an involuntary mental eye-roll whenever I come across a post or article about something else the transformer can do. And yet, here I am, adding fuel to the fire. Oh, well.

This past semester I’ve had the privilege of leading a small team in a project I’d wanted to tackle for a while: developing a machine learning model with chops. As a jazz pianist myself, it’s a natural challenge to take on, especially since there’s been enough research put into it to be interesting without being overstudied. Throughout the project, we dedicated a good amount of time to developing an Encoder-Decoder LSTM to “translate” from one musical phrase in our jazz corpus to another, but another idea came up early in the semester: could a model trained on natural language like GPT learn jazz? It was an interesting question, and the short answer, somewhat unsurprisingly, was yes-ish.

Note: it’s a bit tricky to get access to GPT-3, but GPT-2 is freely available and particularly convenient to fine-tune thanks to Max Woolf’s Google Colab notebook.

Putting the cart before the horse, here’s some of the music we were able to generate with GPT-2.

The first second is the seed alone, which is fed as a prompt to the model.

I. Preprocessing

Unlike classical music, jazz doesn’t appear to have a 100-GB database of freely accessible MIDI. However, a good number of standards have been compiled by Doug McKenzie in this repository. We first downloaded all files from this dataset and extracted the piano part from each. Our first design challenge was choosing a resolution: the granularity with which we discretize the original MIDI into timesteps. To better understand this concept, consider two extremes:

Low resolution (high information density)
Let’s say we convert each quarter note in the MIDI to 2 timesteps in our final dataset. This would work just fine for quarter notes and eighth notes, but sixteenth notes won’t be accurately captured, nor will eighth note triplets. This might be okay for some types of classical music, but it won’t fly for jazz.

High resolution (low information density)
To avoid this problem, we instead convert each quarter note in the MIDI to 180 timesteps. This way, you can perfectly capture sixteenth notes, eighth note triplets, sixteenth note triplets, and the odd quintuplet. The problem? One quarter note is 180 timesteps. So, if you want your model to learn something about the structure of the music, it will have to look back by over a thousand timesteps just to see the past two measures, and the vast majority of it will be the same notes held down for hundreds of timesteps. This can make it incredibly difficult for a model to “get” how notes change from moment to moment.

As we’ve just seen, 2 is definitely too small and 180 is definitely too big. The Goldilocks spot we chose at the beginning was 12 ts/qn (timesteps/quarter note) to capture up to sixteenth notes and eighth note triplets, but we also experimented with an even lower resolution of 6 ts/qn that forgoes sixteenth notes for higher information density.
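
For concreteness, here’s a minimal sketch of how a MIDI file can be sampled into a binary piano roll at a chosen ts/qn using the pretty_midi library. The single tempo estimate and the piano-program filter are simplifying assumptions for illustration, not necessarily what our full pipeline does.

import numpy as np
import pretty_midi

def midi_to_piano_roll(path, ts_per_qn=12):
    # Sample a MIDI file into a binary (128 x T) piano roll at ts_per_qn
    # timesteps per quarter note. Assumes a roughly constant tempo.
    pm = pretty_midi.PrettyMIDI(path)
    tempo = pm.estimate_tempo()                 # quarter notes per minute
    fs = ts_per_qn * tempo / 60.0               # timesteps per second
    # Keep non-drum piano tracks (General MIDI programs 0-7 are pianos).
    pianos = [inst for inst in pm.instruments if not inst.is_drum and inst.program < 8]
    rolls = [inst.get_piano_roll(fs=fs) for inst in pianos]
    if not rolls:
        return None
    length = max(roll.shape[1] for roll in rolls)
    merged = np.zeros((128, length))
    for roll in rolls:
        merged[:, :roll.shape[1]] += roll
    return (merged > 0).astype(np.uint8)        # binarize velocities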

Timestep resolution accounts for rhythm, but what about harmony? Different tunes are performed in different keys. We could have left this alone and expected our model to learn that musical structure is independent of key, but we chose to standardize key as a starting point. To do this, we wrote an algorithm that finds the note distribution of an entire song, determines which idealized key distribution it most closely resembles by cosine similarity, and then transposes the entire piece accordingly. After performing this normalization, we could see that our algorithm worked roughly as expected, and that the most common keys in our dataset were C, F, and Bb.

The most common keys in our jazz piano dataset were C, Eb, F, G, and Bb, as expected.
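
The key-finding step boils down to comparing the piece’s pitch-class histogram against rotations of an “ideal” key profile. Here’s a sketch; the profile weights shown are the standard Krumhansl-Schmuckler major-key values, not necessarily the exact numbers we used.

import numpy as np

# Idealized major-key pitch-class weights (Krumhansl-Schmuckler), starting at C.
MAJOR_PROFILE = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                          2.52, 5.19, 2.39, 3.66, 2.29, 2.88])

def estimate_key(piano_roll):
    # Fold the (128 x T) roll into 12 pitch-class counts (MIDI pitch % 12, 0 = C).
    counts = piano_roll.sum(axis=1)
    histogram = np.array([counts[pc::12].sum() for pc in range(12)])
    # Pick the rotation of the profile with the highest cosine similarity.
    sims = []
    for key in range(12):
        profile = np.roll(MAJOR_PROFILE, key)
        sims.append(histogram @ profile /
                    (np.linalg.norm(histogram) * np.linalg.norm(profile) + 1e-9))
    return int(np.argmax(sims))    # 0 = C, 1 = Db/C#, and so on

# Transposing to C is then just shifting the piano roll by -estimate_key(roll)
# semitones along the pitch axis.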

After this normalization, we had successfully converted a MIDI dataset into key-standardized 2D “piano roll” arrays. Represented as a grayscale image, they look like this:

Note that while time reads from left to right, higher notes are near the bottom of the image (higher indexes).

It’s worth noting that some tracks were waltzes in 3/4, which may be undesirable for standardization. In the future, this should be accounted for.

II. Piano roll to text representation

A number of methods exist for transforming music into a “readable” form. For example, in this paper by Hilscher and Shahroudi (2018), the researchers assigned an individual character to each of the 88 piano keys and separated timesteps with a space, like so:

U IQU JNS JNS JNZ JNZ zIX zIX ziQ zIQ zGV [...] "Q 9Q 8 9Q 'P S

We chose to take a different approach, or two different approaches, to be precise. For one, we stuck with the actual names of the notes (e.g. c5 or g#6), where notes in a chord are separated by whitespace and timesteps are separated by newlines. We chose this representation because we saw value in relating the same note across different octaves. Rests are indicated with the letter w.

d2 c3 d5 f5 a5
c3 d5 f5 a5
d5 f5 a5
d5 f5 a5
d5 f5 a5
d5 f5 a5
w
w
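
Mechanically, producing this representation is just a walk over the piano-roll columns. A rough sketch (the octave-numbering convention here is an assumption; our actual code may differ in the details):

NOTE_NAMES = ['c', 'c#', 'd', 'd#', 'e', 'f', 'f#', 'g', 'g#', 'a', 'a#', 'b']

def pitch_to_token(midi_pitch):
    # Map a MIDI pitch number to a token such as 'c4' or 'g#6' (MIDI 60 -> c4).
    return NOTE_NAMES[midi_pitch % 12] + str(midi_pitch // 12 - 1)

def roll_to_text(piano_roll):
    # One line per timestep: the names of all sounding notes, or 'w' for a rest.
    lines = []
    for t in range(piano_roll.shape[1]):
        tokens = [pitch_to_token(p) for p in range(128) if piano_roll[p, t]]
        lines.append(' '.join(tokens) if tokens else 'w')
    return '\n'.join(lines)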

Our second approach was slightly more subtle: to indicate only when notes begin and end using the special characters < for “begin”, > for “end”, and ^ for “tap” (for notes that begin and end within the same timestep). In this encoding, w doesn’t necessarily mean there are no notes being played, but that nothing has changed.

^d2 <c3 <d5 <f5 <a5
>c3
w
w
w
>d5 >f5 >a5
w
w

The advantage of this representation is that it more closely aligns with how piano is actually played: a key is pressed, and then it is released. In the first encoding, there is no notion of this, only the notes present at a given timestep. However, this representation does leave more up to the model to “figure out”.
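
A sketch of this encoding, reusing the pitch_to_token helper from above (again, an illustration rather than our exact implementation; the full version also emits the bar markers described next):

def roll_to_onoff_text(piano_roll):
    # '<' = key pressed, '>' = last timestep the key is held, '^' = pressed and
    # released within a single timestep, 'w' = nothing changed.
    n_pitches, n_steps = piano_roll.shape
    lines = []
    for t in range(n_steps):
        events = []
        for p in range(n_pitches):
            now = piano_roll[p, t]
            before = piano_roll[p, t - 1] if t > 0 else 0
            after = piano_roll[p, t + 1] if t + 1 < n_steps else 0
            if now and not before and not after:
                events.append('^' + pitch_to_token(p))
            elif now and not before:
                events.append('<' + pitch_to_token(p))
            elif now and before and not after:
                events.append('>' + pitch_to_token(p))
        lines.append(' '.join(events) if events else 'w')
    return '\n'.join(lines)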

Additionally, I chose to include measure bars | to indicate when a bar’s worth of timesteps have passed. While this does not explicitly affect the music, I was interested in analyzing the structure of bars in output music.

^d2 <c3 <d5 <f5 <a5
>c3
... 21 more lines here, in the case of 6 ts/qn ==> 24 ts/bar in 4/4
w
|
w
...

III. Initial Results

After extensively training GPT-2 on our dataset in both text representations, we generated outputs.
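
For reference, Max Woolf’s notebook is built on his gpt-2-simple package, so the fine-tuning and sampling loop looks roughly like this. The file name, prompt, and hyperparameters below are illustrative, not necessarily the values we used.

import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="124M")          # smallest GPT-2 checkpoint

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="jazz_onoff.txt",        # our text-encoded corpus
              model_name="124M",
              steps=2000,
              run_name="jazz")

# Seed the model with a short prompt (the "first second" of a tune) and continue.
samples = gpt2.generate(sess,
                        run_name="jazz",
                        prefix="<c3 <e3 <g3 <c5\nw\nw",
                        length=500,
                        temperature=0.9,
                        return_as_list=True)
print(samples[0])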

A generated excerpt from the timestep-independent representation method.
A generated excerpt from the on/off representation method.

Both representations show some promise. The timestep-independent method tended to result in music that was more “keyboard-smashy” (as one of our team members dubbed it), likely because it was learning to play a lot of notes from the training data (recall that even notes in a sustained chord are “played” at each timestep). By contrast, the on/off method tended to result in music that was more “laid back”, but it would still play odd notes and seemed disinclined to play quick melodies. Out of curiosity, we explored two other ways of training GPT-2.

III.1. Classical pretraining

Classical music is more readily available and tends to be highly structured, both harmonically and rhythmically. We were curious to see whether training GPT-2 on classical music before introducing it to our jazz dataset would improve the structure of its output. Here’s an excerpt (taken from the beginning of the article):

It’s not entirely clear if this is an improvement, and if so, what kind of improvement it may be. Certain parts do sound more pleasant, but it seems to have difficulty sticking to one idea. What we could analyze, however, is the rhythmic content of the output compared to the input.

Here is a frequency graph showing where in the bar a note is activated, which serves as a proxy for rhythmic stress. The following is from our classical dataset:

Rhythmic content for classical music is concentrated on straight eighth notes.

As we would expect, stress is placed most heavily on eighth note downbeats and offbeats. By comparison, here is the same analysis for our jazz dataset:

Rhythmic content for jazz music is concentrated on swung eighth notes.

Straight-ahead jazz is swung, meaning that pairs of eighth notes are played with a long-short, triplet-like feel. For this reason, we see secondary peaks 2/3 of the way through each quarter note (1 qn = 6 timesteps in this representation). However, let’s take a look at the rhythmic analysis of a sample output from GPT-2:

It’s not entirely clear how rhythm is distributed in generated outputs.

If it’s not immediately apparent to you what’s going on here, it wasn’t for us either. We see a clear preference for the first downbeat, but the rest is muddled. Also recall that the model places its own measure bars, so some bars were longer than they should have been. This analysis corroborates our qualitative listening experience: the model needs a stronger sense of rhythm. Maybe it should spend some more time with the metronome.
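
The analysis itself is simple: walk through the encoded text, track the position within the current bar using the ‘|’ tokens, and count note-on events at each position. A sketch over the on/off representation:

import numpy as np

def onset_position_histogram(text, ts_per_bar=24):
    # Count '<' and '^' (note-on) events at each position within the bar,
    # resetting the position counter at every '|' bar marker.
    counts = np.zeros(ts_per_bar, dtype=int)
    position = 0
    for line in text.splitlines():
        line = line.strip()
        if line == '|':
            position = 0
            continue
        onsets = sum(token[0] in '<^' for token in line.split())
        if position < ts_per_bar:      # clip bars the model made too long
            counts[position] += onsets
        position += 1
    return counts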

III.2. Solo generation

We noticed that the generated samples tended to be somewhat lacking in melody, so we were naturally curious to see if GPT-2 could generate solos if explicitly trained on them. The solos were taken from this helpful jazz solo database. Since these solos are generated independently of any chord changes, we didn’t investigate this approach too rigorously, but the results were fun to listen to anyway. Here is a clip of me accompanying a generated solo with piano, bass, and drums!

IV. Things left to be desired: Challenges and next steps

To conclude, I thought it might be helpful to lay out some of the challenges involved in generating artificial jazz in particular, and to suggest some possible improvements.

Challenges

  • High-quality jazz MIDI is limited compared to available classical music datasets. For that reason, our dataset is relatively small compared to the ideal size for a deep learning project. This is partly what led us to experiment with a pretrained model like GPT-2.
  • Jazz is a very diverse genre, so our dataset included blues, bebop, ballads, bossa nova, etc. Ideally we would train our models on only one style to avoid “confusion”, but this would limit the size of our dataset even further.
  • Compared to its classical counterpart, jazz piano is rhythmically complex and varied even within the same tune. For that reason, we can’t afford low resolutions like the 4 ts/qn commonly used for classical music.

Improvements

We strongly felt that a good jazz generation model should understand the separation and relationship between harmony and melody. We developed a simple algorithm for separating melody from harmony in our training data, but we haven’t yet built the proper architecture to learn both in parallel. We believe that generated music with a clearly identifiable melody and harmony will be perceived as more musical.
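
As a point of reference, one simple baseline for this kind of split is a “top line” heuristic: treat the highest sounding note at each timestep as melody and everything else as harmony. This is a sketch of that idea, not necessarily the exact algorithm we used.

import numpy as np

def split_melody_harmony(piano_roll):
    # Naive baseline: the highest active pitch at each timestep is "melody",
    # everything else is "harmony".
    melody = np.zeros_like(piano_roll)
    harmony = piano_roll.copy()
    for t in range(piano_roll.shape[1]):
        active = np.nonzero(piano_roll[:, t])[0]
        if active.size:
            top = active.max()
            melody[top, t] = 1
            harmony[top, t] = 0
    return melody, harmony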

Additionally, with recent improvements in audio-to-MIDI transcription technology, we may be able to compensate for our small database by simply making more!

Finally, we felt that we needed finer-grained control over how exactly the model is trained and which loss it is trying to minimize. GPT-2 is impressive, but these details cannot be adjusted at will. What metrics could or should be used to measure the musicality of a piece? This question is worth a good deal of thought.

V. Conclusion

The short answer to whether GPT-2 can play jazz is yes. The long answer is that it simply wasn’t designed with music generation in mind; when OpenAI was designing this language model, it’s unlikely they were thinking about how it could be transferred to learn Bill Evans or Oscar Peterson. For that reason, there are surely model architectures better suited to this problem.

That being said, for a pretrained model with limited musical training data, it’s not bad! It’s good enough to listen to, at least. So, while I never thought I’d join in the hype, I’ll concede just this once to our GPT overlords.

Check out our code on GitHub, Max Woolf’s GPT-2 Colab notebook, and our SoundCloud to listen to more of our experiments. Thank you!

Acknowledgements

Very special thanks to my incredible team members Marcelo Almora Rios (HMC ’21), Sabrina Hartono (CMC ’21), Andrew Shannon (Pomona ’23), and Robin Yu (HMC ’24).

This article is a reflection on research conducted at the Claremont Colleges’ AI Incubator, P-ai. Learn more about us here!


Hi! I’m Marcos, a CS major studying ML and computational creativity at Harvey Mudd College. When I’m not at a computer, you’ll find me jamming with friends.