Automatic Music Transcription — where Bach meets Bezos


Recently AWS launched DeepComposer, a set of web-based tools designed to help people learn about AI. Along with this, they launched a MIDI keyboard to input melodies. Being the musician I am, this caught my eye. Music is often counted among the disciplines that are relatively safe from the all-ending AI takeover; after all, music is creative. How can a computer possibly exhibit creativity? Will AI be the new Stevie Wonder or Jacob Collier?

It turns out DeepComposer is definitely not targeted towards musicians, but there is another team working on products that are. Google’s Magenta team is making big strides in this area of research, and this article will be focusing on one of their products: Onsets and Frames.

Automatic Transcriptions

In February of 2018, Magenta released Onsets and Frames, a model built to convert raw solo piano recordings into a MIDI sequence. For anyone who has ever searched for sheet music, including me, this would be a life-saver. The problem of Automatic Music Transcription (AMT) is already notoriously difficult even for humans because of the polyphonic (having multiple notes at once) nature of piano music.

There are also numerous hurdles that a computer must clear that humans handle with ease. For example, notes played at the same time interact in a complex and intricate way because of the overlap of harmonics in the acoustic signal. When a note is played on a piano, what we hear is a composite of many different resonant frequencies related to the fundamental note. Our ears evaluate all the acoustic information and condense it into a single note, but for a computer, this generates variability in the input signal.
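To see why those overlapping harmonics are such a headache, here is a tiny illustrative snippet (not from Magenta's code) that lists the first few harmonics of a single piano note:

```python
# Illustrative only: the first few harmonics of A4 (440 Hz).
# A real piano adds inharmonicity and varying amplitudes on top of this.
fundamental = 440.0  # A4, in Hz

for n in range(1, 7):
    print(f"harmonic {n}: {n * fundamental:.1f} Hz")

# Output: 440, 880, 1320, 1760, 2200, 2640 Hz.
# Note that 880 Hz is also the fundamental of A5, and 1320 Hz is close to E6,
# so the harmonics of one note overlap with the fundamentals of other notes.
```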

The transcription problem is made even harder by the fact that there is an enormous number of possible outputs; in technical terms, the model has a very large output space. The AMT problem touches on many concepts that are key to a lot of research being done in AI music right now, and it is the focus of this article. How did Magenta go about solving this problem?

The key ideas behind AMT came from this paper published in 2016. It approaches the task much like speech recognition, pairing acoustic models with a music language model. In speech recognition, the acoustic signal alone is usually not enough to accurately resolve ambiguities, so it is used alongside a language model that provides the probability of a certain word given the previous words in the sentence. Music exhibits similar speech-like structure, so a similar combination is used.
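To make the speech-recognition analogy concrete, here is a toy sketch of how an acoustic score and a language-model score might be combined. This is my own simplification for illustration; the weighting scheme and numbers are made up and not taken from either paper:

```python
import math

def combined_score(p_acoustic: float, p_language: float, weight: float = 0.5) -> float:
    """Toy fusion of an acoustic probability ("what does this frame sound like?")
    with a language-model probability ("how likely is this note, given the notes
    that came before?"). Working in log space keeps the product numerically stable."""
    return math.log(p_acoustic) + weight * math.log(p_language)

# A note the acoustic model is unsure about (0.4) but the language model
# strongly expects (0.9) can outrank a note the acoustic model slightly
# prefers (0.5) but the language model finds unlikely (0.1).
print(combined_score(0.4, 0.9))  # ≈ -0.969
print(combined_score(0.5, 0.1))  # ≈ -1.844
```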

Magenta’s Approach:

Unlike the end-to-end model used in the first paper, Google’s Magenta takes a slightly different approach:

Magenta’s structure, taken from the Onsets and Frames blog post here

This model splits the task of transcription into two parts: note onset and framewise note activation. Previous models treated each frame of audio with equal importance; each frame was individual and independent. However, Magenta took a new approach by making the beginning frame of a note, called its onset, more important. The decay of a piano note starts immediately after the note is pressed, so its onset is both the easiest to identify and the most significant in our perception of music.
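To make "onset" and "framewise activation" concrete, here is a small illustrative sketch of how a single note could be turned into the two kinds of training targets. The frame rate and helper function are my own placeholders, not Magenta's actual data pipeline:

```python
import numpy as np

FRAME_RATE = 31.25   # frames per second; a common choice, not necessarily Magenta's exact value
NUM_PITCHES = 88     # piano keys A0..C8

def note_to_targets(pitch_index, start_sec, end_sec, num_frames):
    """Build the two label matrices for a single note:
    - `frames` is 1 for every frame in which the note is sounding,
    - `onsets` is 1 only at the frame where the note begins."""
    frames = np.zeros((num_frames, NUM_PITCHES), dtype=np.float32)
    onsets = np.zeros((num_frames, NUM_PITCHES), dtype=np.float32)

    start = int(start_sec * FRAME_RATE)
    end = int(end_sec * FRAME_RATE)
    frames[start:end, pitch_index] = 1.0
    onsets[start, pitch_index] = 1.0   # only the very first frame counts as the onset
    return onsets, frames

# pitch_index 39 is middle C if index 0 is the lowest key (A0).
onsets, frames = note_to_targets(pitch_index=39, start_sec=0.5, end_sec=1.5, num_frames=100)
print(onsets.sum(), frames.sum())  # 1 onset frame vs. ~31 active frames
```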

It’s all too convoluted

The network works by first using a convolutional neural network (CNN) to analyze the raw acoustic data. A CNN is a special type of neural network that uses alternating convolutional and pooling layers to gain what we humans think of as “context”, and they are primarily used for image classification.

The general structure of a convolutional network, taken from this article about CNNs

Essentially, CNNs take an input and analyze certain sections of it at a time. Their purpose is to make multi-dimensional data like images (or spectrograms) easier to handle while still preserving the important features. They do this by alternating convolutional and pooling layers. Convolutional layers analyze little bits and pieces at a time and generate “feature maps” of the data. The pooling layers aim to further reduce the spatial size of the data by only keeping dominant features. If you want a more detailed understanding of CNNs, check out this article!
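As a rough illustration, a small convolutional acoustic model over a spectrogram might look something like the sketch below in Keras. The layer counts and sizes are placeholders of my own choosing, not the ones Magenta uses:

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_FRAMES = 625      # time steps in one training excerpt (placeholder)
NUM_MEL_BINS = 229    # spectrogram frequency bins (a value seen in the AMT literature; treat as a placeholder)

spectrogram = layers.Input(shape=(NUM_FRAMES, NUM_MEL_BINS, 1))

# Convolutional layers scan small time/frequency patches and build feature maps...
x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(spectrogram)
x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(x)
# ...and pooling layers shrink the frequency axis, keeping only the dominant features.
x = layers.MaxPooling2D(pool_size=(1, 2))(x)
x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
x = layers.MaxPooling2D(pool_size=(1, 2))(x)

# Flatten the frequency and channel axes so each time frame becomes one feature vector.
acoustic_features = layers.Reshape((NUM_FRAMES, -1))(x)
acoustic_model = tf.keras.Model(spectrogram, acoustic_features, name="acoustic_model")
acoustic_model.summary()
```

Note that the pooling only shrinks the frequency axis here, so every time frame still gets its own feature vector for the later stages to work with.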

Forgetful Neural Networks

The next step in Magenta’s process is the BiLSTM or Bidirectional Long Short-Term Memory network. That’s a lot of big words! LSTMs are a special type of recurrent neural network (RNN) that aims to solve one problem: neural networks are forgetful!

A standard RNN contains loops that allow the network to better process sequential information. However, a problem arises when the gap between the needed information and the point where the network needs it grows too long. If you’ve ever read a long novel or series without your full attention, you will be familiar with this issue: the event Harry’s talking about happened all the way back in the Philosopher’s Stone! A key example would be as follows (stolen from here, an interesting read if you want to know more about LSTMs): towards the beginning of a paragraph it says:

“I’ve lived in France nearly all my life.”

At the end of the paragraph it says:

“I speak fluent ___”

If an RNN were to try and predict what word should be used there, it may not be able to do it! This is because the crucial piece of information is separated from the place where it is needed. In theory, an RNN should be able to handle these long-term dependencies, but in practice it struggles badly. That is, without an LSTM. An LSTM solves this by essentially allowing the neural network to learn what to keep and what to throw away.

A diagram of a typical repeating module in an RNN. It contains only one layer.
Diagram of how an LSTM works. Both diagrams are taken, again, from colah’s blog.

The top line is called the cell state and it’s the key to why LSTMs work well. The cell state is like a conveyor belt and it allows information to flow unchanged across the neural network. The LSTM can change the information in the cell state, but only through carefully controlled gates, represented in the diagram by the yellow squares. For example, this cell state could carry the current subject pronouns in use in a piece of writing. When the piece starts referring to a new subject, we would want the neural net to “forget” the old subject pronouns; this is represented by the first sigmoid. We would then want to replace them with the new subject pronouns, represented by the next two yellow squares.

LSTMs work similarly to our own short-term memory. Our short-term memory is like “working” memory in that it only contains information that is key to the task at hand. For example, the current subject pronouns are kept in our short-term memory while we read. In a normal RNN, the concept of short-term memory only lasts for a very short time. The cell state in an LSTM is like our own short-term memory. This is why LSTMs are named “long short-term memory”: they elongate the range of the neural network’s short-term memory so that it can handle long-term dependencies!
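For the curious, here is a bare-bones sketch of a single LSTM step in plain NumPy, just to show where the "forget" and "input" gates from the diagram live in the math. It is a minimal illustration, not production code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One time step of an LSTM cell.
    x: current input, h_prev: previous output, c_prev: previous cell state.
    W and b hold the weights/biases of all four gates stacked together."""
    z = np.concatenate([x, h_prev]) @ W + b
    f, i, o, g = np.split(z, 4)

    f = sigmoid(f)            # forget gate: what to erase from the cell state
    i = sigmoid(i)            # input gate: what new information to let in
    o = sigmoid(o)            # output gate: what part of the cell state to expose
    g = np.tanh(g)            # candidate values to write

    c = f * c_prev + i * g    # the "conveyor belt": update the cell state
    h = o * np.tanh(c)        # the cell's output for this time step
    return h, c

# Tiny usage example with random weights (sizes are arbitrary):
rng = np.random.default_rng(0)
d_x, d_h = 4, 3
W = rng.normal(size=(d_x + d_h, 4 * d_h))
b = np.zeros(4 * d_h)
h, c = lstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), W, b)
print(h.shape, c.shape)  # (3,) (3,)
```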

LSTMs are used on highly sequential data like essays, movies and, of course, music. They are especially important in music since music is thematically and idiomatically bound together. A BiLSTM, like the one used by Magenta, is an LSTM that runs in both directions, meaning the algorithm is fed the data both front-to-back and back-to-front, so every prediction can draw on context from both the past and the future.
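In a framework like Keras, wrapping an LSTM so that it runs in both directions is a one-liner; the layer size here is just a placeholder:

```python
from tensorflow.keras import layers

# Processes the sequence forwards and backwards and concatenates both results,
# so every time step sees context from its past and its future.
bilstm = layers.Bidirectional(layers.LSTM(128, return_sequences=True))
```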

The Output

In honour of the cell states, let’s refresh our working memory on Magenta’s process.

Right now we just got past the BiLSTM. The next step is the FC Sigmoid layer, which takes the neural net’s output and converts it into a probability for each of the 88 possible piano keys; this gives us our note onset predictions!
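In code, a layer like that is simply a fully connected (dense) layer with 88 sigmoid outputs, one per piano key. The Keras version below is my own illustration, not Magenta's code:

```python
from tensorflow.keras import layers

# One independent probability per piano key, per frame.
# A sigmoid (rather than a softmax) is used because several keys
# can be pressed at the same time.
onset_probs = layers.Dense(88, activation="sigmoid", name="onset_probs")
```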

The other part of this network is the framewise note activation. While it sounds intimidating, it’s just figuring out which notes are sounding at any given point in time, hence the term framewise activation. This works similarly to the note onset detector, but it receives additional input from the results of the note onset detector.

This is what makes Magenta’s approach truly unique. Note onsets are the most perceptually important part of music, so it makes sense to give them special heed. In addition, this approach means the neural net doesn’t report any sounding notes that didn’t have an onset! This helps immensely with the “harmonics” problem mentioned earlier. The upper harmonics of a note are much softer than the fundamental, so the note onset detector usually ignores them. This means the framewise note detector is forced to ignore them as well, effectively solving the harmonics problem!

The framewise detector has an additional FC Sigmoid layer after its own acoustic model, which produces an output that is concatenated (adjoined) to the onset output. Otherwise, it operates using a similar procedure to the onset detector.
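Putting the pieces together, a simplified sketch of this two-branch structure might look like the following. This is my own approximation of the diagram above, not Magenta's actual implementation, and every layer size is a placeholder:

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_FRAMES, NUM_MEL_BINS, NUM_KEYS = 625, 229, 88

def conv_stack(x):
    """Rough shape of the convolutional acoustic model used by both branches."""
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((1, 2))(x)
    x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((1, 2))(x)
    return layers.Reshape((NUM_FRAMES, -1))(x)

spec = layers.Input(shape=(NUM_FRAMES, NUM_MEL_BINS, 1))

# Onset branch: acoustic model -> BiLSTM -> 88 sigmoid outputs.
onset_x = conv_stack(spec)
onset_x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(onset_x)
onset_probs = layers.Dense(NUM_KEYS, activation="sigmoid", name="onsets")(onset_x)

# Frame branch: its own acoustic model and sigmoid layer, then the onset
# predictions are concatenated in before the final BiLSTM and output.
frame_x = conv_stack(spec)
frame_x = layers.Dense(NUM_KEYS, activation="sigmoid")(frame_x)
frame_x = layers.Concatenate()([frame_x, onset_probs])
frame_x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(frame_x)
frame_probs = layers.Dense(NUM_KEYS, activation="sigmoid", name="frames")(frame_x)

model = tf.keras.Model(spec, [onset_probs, frame_probs])
model.summary()
```

Even in this simplified form you can see the key idea: the frame branch literally receives the onset predictions as part of its input, so it learns to lean on them.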

Results

Let’s look at some transcriptions done by Magenta:

It did very well overall! Fugues are dense and contrapuntal in nature, which makes them especially difficult to transcribe, so the result is impressive. For more examples visit this website. My personal favourites are under the header “Examples of Current Metric Limitations” because they show all the progress that has yet to be made!

It’s easy to question the usefulness of this project, but it shows really promising research in many different areas. Not only that, it’s one of the first AI applications I can see myself using as a musician! While a tool like this is easily dismissed by many, it is immensely helpful to any classically-oriented musician, and I think many others would agree with me!

LSTMs, CNNs and the concepts used in building this neural network are representative of much of the new research being carried out in the field of AI music! For example, Magenta used a dataset of transcriptions done by Onsets and Frames in order to train their piano performance AI! This shows huge promise in making AI-driven tools for musicians. If you are interested in more about Magenta or about this project, check them out here.

Some applications of AI, like image classification, are all about getting computers to do activities that we can do easily. It’s really interesting that piano transcription is already a fairly difficult task but is made even more difficult by the limitations of a computer. The field of AI music is burgeoning and I’m really excited to dive deeper!

Key Takeaways

  • AI music is an active field of research and Google’s Magenta is a huge part of it
  • Transcribing music is hard…
  • Using many different neural networks, Magenta’s product Onsets and Frames can automatically transcribe piano pieces with a high degree of accuracy
  • Using CNNs for acoustic models in combination with LSTMs for a music language model ensures effective transcription
  • Separating the task of transcription into note onset and framewise activation prioritizes more important musical moments, ensuring a musically useful transcription
  • Many concepts can be used in a variety of other music-related projects!

Will AI take over the music industry? I think this tweet from NYC jazz musician and content creator Adam Neely sums it up:

“I don’t see this AI thing, even though this is clearly in its infant stages, replacing musicians anytime soon, but I do see it potentially, over the next 5–10 years, as a really powerful tool.”

I think that AI is nowhere close to outright replacing musicians, but I believe it’s shaping up to be an extremely powerful partner to us musicians everywhere.

Thanks for reading! Feel free to reach out to me at dron.h.to@gmail.com for feedback or questions!
