
Machine learning in music transcription

Karol Burdziński
Dec 3, 2018

Machine learning is a hot topic. With all of the available frameworks (TensorFlow, Caffe, Keras, etc.) it’s easier than ever to start building your own software on top of it. Thanks to the open-source community, we also have plenty of pre-trained models that can be used in various ways.

Recently I came across the Magenta framework, which is built on TensorFlow. It’s a research project exploring the role of machine learning as a tool in the creative process. Looking through the demos it provides, I found a model named “Onsets and Frames”: a pre-trained neural network that converts raw audio data, such as MP3/WAV, to MIDI.

One may ask: how is an MP3 different from a MIDI file? To put it simply, an MP3 is a compressed audio file. (If you’re interested in how the MP3 file structure actually works, there’s a pretty thorough explanation on Wikipedia.) MIDI, on the other hand, contains no raw audio data at all; it describes components such as pitch, volume, modulation and so on. Because the original MIDI format was developed for keyboard instruments, its range covers roughly five octaves above and below middle C (C4), including all semitones. (Organs may have more octaves than a standard piano; their lowest notes sit about an octave below the threshold of human hearing, although their overtones are audible.)

OK, but what does that difference mean in practice? The sound you hear when playing a MIDI file is synthesized digitally by the device or program you play it with; the file’s data is used to perform the piece, much like a musician playing from sheet music. An MP3, by contrast, is an audio recording.
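To make that concrete, here is a minimal sketch that prints the note events stored in a MIDI file; there is no audio inside, just pitches, velocities and timings. It assumes the third-party pretty_midi package (not used anywhere in this article) and a placeholder file name:

# Minimal sketch: inspect the event data stored in a MIDI file.
# Assumes: pip install pretty_midi; "example.mid" is a placeholder path.
import pretty_midi

midi = pretty_midi.PrettyMIDI("example.mid")
for instrument in midi.instruments:
    for note in instrument.notes:
        # Each note is pure metadata: pitch, velocity (volume) and timing.
        print(pretty_midi.note_number_to_name(note.pitch),
              "velocity:", note.velocity,
              "start: %.2fs" % note.start,
              "end: %.2fs" % note.end)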

Let’s get back to Onsets and Frames. It’s a CNN-based (convolutional neural network) model that uses two detectors (stacks of neural networks) for the overall prediction. One is trained to find onset frames, which are basically the beginning frames of a note; these are easier to detect because the note’s amplitude is at its peak there (marked with the red rectangle). The other detector is trained to find every frame during which a note is active (yellow rectangle). The final output combines the predictions of both stacks.

Middle C sound waveform

Of course, underneath it’s far more complicated. If you’re interested in the details, there’s a great paper written by the authors of the Onsets and Frames model. You can find it here.
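To give a rough feel for how the two detector stacks are combined, below is a simplified NumPy sketch of the decoding idea: a note may only start where the onset detector fires, and it is sustained for as long as the frame detector stays above its threshold. This is my own toy illustration of the concept, not the model’s actual inference code.

import numpy as np

def decode_notes(onset_probs, frame_probs,
                 onset_threshold=0.5, frame_threshold=0.5):
    # onset_probs and frame_probs have shape (num_frames, num_pitches).
    onsets = onset_probs >= onset_threshold
    frames = frame_probs >= frame_threshold
    notes = []  # (pitch, start_frame, end_frame)
    num_frames, num_pitches = frame_probs.shape
    for pitch in range(num_pitches):
        t = 0
        while t < num_frames:
            if onsets[t, pitch]:                # a note may only begin at an onset
                start = t
                while t < num_frames and frames[t, pitch]:
                    t += 1                      # sustain while the frame is active
                if t > start:
                    notes.append((pitch, start, t))
                else:
                    t += 1                      # onset without an active frame: skip
            else:
                t += 1
    return notes

# Tiny fake example: 6 frames, 1 pitch -> one note lasting 3 frames.
onset_probs = np.array([[0.9], [0.1], [0.1], [0.2], [0.1], [0.1]])
frame_probs = np.array([[0.8], [0.7], [0.6], [0.2], [0.1], [0.1]])
print(decode_notes(onset_probs, frame_probs))   # [(0, 0, 3)]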

There are two distributions of Magenta: Python and JS.

The Python version’s installation guide is available on GitHub. Unfortunately, I had a few problems with the automatic installation on macOS (High Sierra). Manual installation went pretty smoothly, but trying to transcribe a simple 5-second WAV file gave me a bus error: 10. While figuring out what had happened, I decided to try the Magenta Docker image, and it worked fine. Just remember that by default Docker on macOS has 2 GB of RAM available, so the process might be killed if it exceeds that limit; you can change it in the Docker preferences. Also, if you choose an MP3 file instead of a WAV as the source, you might see something like NoBackendError. To fix that, just install the FFmpeg library.

Docker settings

But hey, if the Docker image works, why shouldn’t it work locally? I checked the Python, Magenta and TensorFlow versions installed in the Docker container and compared them to my local environment. It turns out that the newest TensorFlow, 1.12.0, which is picked up automatically via pip install tensorflow, doesn’t work with the Magenta version I used (0.3.14). The Docker image uses TensorFlow 1.10.0, so after downgrading my local version everything started to work as intended.

Edit: at the time I’m publishing this article, it seems this has been fixed :)

So, after installing Magenta and all its dependencies, just type the following commands in the terminal:

source activate magenta

and then

onsets_frames_transcription_transcribe --acoustic_run_dir="path to checkpoints" "path to wav/mp3 file"

Just don’t forget to download the latest checkpoints from here.

launching transcription

You’ll probably see a couple of warnings, but that’s fine. The first two warnings say that one of the functions Magenta uses is marked as deprecated, so it will probably be fixed in future updates. The last one tells you that your CPU supports extensions to the x86 instruction set that the binary isn’t using. In my case it’s AVX (Advanced Vector Extensions), which speeds up linear algebra operations such as dot products and matrix multiplication; the default TensorFlow distribution is built without these extensions. You can check this article if you want to build TensorFlow from source, which may greatly speed up the whole training or inference process. You can also use the GPU distribution (tensorflow-gpu and magenta-gpu); unfortunately, I don’t own a CUDA-capable GPU, so I didn’t test it.

Python code usage

Initializing the session (lines 11–12), transcribing (line 13), and saving to a MIDI file (line 14).

When initializing the session, you need to pass the path to the checkpoint (including its full name) and the hyperparameters.
Transcribing the audio (line 13) requires the session object, the path to the file to be transcribed, and the onset and frame thresholds. Those last two parameters strongly affect the number of notes in the output: the higher the threshold, the stricter the matching. You can pick any value from 0 up to 1; the default for both is 0.5. A rough sketch of this flow is below.
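The helper names initialize_session and transcribe_audio below simply mirror the gist described above; they are placeholders rather than a documented Magenta API, and the paths, hparams object and threshold values are made up for illustration.

# Sketch of the transcription flow described above.
# initialize_session / transcribe_audio are placeholder names mirroring the gist,
# not a documented Magenta API; paths and values are illustrative only.
from magenta.music import midi_io  # Magenta 0.3.x utility for writing MIDI files

checkpoint_path = "/path/to/checkpoints/model.ckpt"  # full checkpoint name
audio_path = "/path/to/recording.wav"
output_path = "/path/to/transcription.mid"

# Initializing the session: pass the checkpoint path and the hyperparameters
# (hparams is assumed to be set up earlier, as in the gist).
session = initialize_session(checkpoint_path, hparams)

# Transcribing: the onset/frame thresholds (0 to 1, default 0.5) control how
# strict the matching is; higher values yield fewer notes.
sequence = transcribe_audio(session, audio_path,
                            onset_threshold=0.7, frame_threshold=0.7)

# Saving the predicted NoteSequence to a MIDI file.
midi_io.sequence_proto_to_midi_file(sequence, output_path)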

JS Version

You can check out the JS version on Glitch. There’s also code that you can fetch and simply run locally. It runs slower than the Python version, and it also gave me slightly different results. After a little investigation, it seems the checkpoint for the Python model was trained only on the MAPS dataset (~18 h of recordings), whereas the JS model’s checkpoint was trained on the MAESTRO dataset (~172 h of recordings). This blog post shows how the F1 score differs between the two.

To capture the MIDI data I used a MIDI-to-USB converter, and for the raw audio recording I connected my piano’s headphone output to my computer’s input and used Audacity.

For a fairly short piece I recorded (5 seconds), the original MIDI file has 44 notes, while the one created by the neural network (onset and frame thresholds set to 0.7) has 49 notes. Opening them both in the online sequencer shows some similarities and differences: both tracks are in the G major scale and the melody is pretty similar, but there are clearly some false positives.

original composition
MIDI created from the MP3 file (a few false positives marked)

You can also check the differences between the original source and the Python/JS transcriptions in this video.

In my opinion, the JS version sounds a little better (the Python version has some additional false positives around the middle and end), but that may just be a matter of time, as there’s an open issue for the Python version to create a new checkpoint based on the MAESTRO dataset.

Conclusion
Even though a skilled musician can still produce better transcriptions than this technology right now, I’m impressed by the results, as it could make it much easier to create learning materials for an amateur pianist like myself. You can jump into learning your favourite songs almost immediately.

Feel free to share your thoughts and comments on this!
