Music Transcription using a Convolutional Neural Network

Dhruv Verma
8 min read · Dec 11, 2017

Sheet Music for “On the Nature of Daylight” by Max Richter transcribed using CNNs.

Background

This project was built by Kevin Tai, Rohan Kondetimmanahalli, Kevin Chau, and Dhruv Verma. The GitHub repository for this project can be found here.

For our project, we developed a convolutional neural network that automatically transcribes piano music. Music transcription is a difficult task that, for humans, requires specialized expertise. We automated the process using an online library of MIDI files to synthesize audio and provide ground-truth labels for our network to learn from.

The idea for our project was inspired by last year’s music genre prediction and music synthesis projects. Music transcription is a related task that could build on their prior work. Further research led us to Lunaverus, a commercial program capable of music transcription, so we tried to adapt its approach.

Our project has several major components:

  1. Data Collection
  2. Preprocessing
  3. Model Creation
  4. Model Training

Data

To start off, we needed to find a large amount of data to work with. Most online datasets for music and sound consist of raw audio files, which contain a lot of noise, so we tried to avoid them as a starting point. Instead, we looked for MIDI files, which record exactly when each note is pressed and how long it is held. We focused on piano music to narrow our scope to a single instrument. Our search turned up a handful of MIDI files scattered across a few websites, including classical, video game, Christmas, and even International e-Piano Junior Competition music. However, our best dataset turned out to be a public-domain collection of “live” MIDI performances recorded on the Yamaha Disklavier, an electronic piano. This collection contained over 10,000 piano pieces in MIDI format, which, together with the other MIDI files we found, gave us more than enough data to work with. Our preprocessing step turned these 10,000 files into more than 1 million training examples to feed to our network.

Pre-processing

Our goal was to convert MIDI files into raw audio, then create a spectrogram for the convolutional neural network (CNN) to use as input. A spectrogram is an image showing the power of each frequency in the song over time, and CNNs are well suited to learning from image data. To create the spectrograms, we used two techniques: a short-time Fourier transform and a constant-Q transform.

We first split our spectrograms and MIDI files into one-second windows to feed as inputs into our CNN. We ended up with 1 million one-second sound clips and translated 18,137 of them into spectrograms.

While working with the CNN, we realized that one second was too wide a window for the output to be stitched back together into the original audio. The network only predicts which notes occur somewhere within each window, not where they fall inside it, so it effectively plays every detected note at the start of the window. For example, if four notes were played consecutively within one second, the output would sound all four at once instead of one after another.

To fix this issue, we chose a finer time slice for both the MIDI file and the spectrogram. An ⅛-second window (0.125 s per slice, a sixteenth note at 120 beats per minute) gives more than enough granularity to resolve eighth notes, which seemed reasonable for recreating a song, so we split both the spectrogram and the MIDI file into ⅛-second windows. We ended up with 2 million ⅛-second sound clips, 10,517 of which we turned into spectrograms, and we fed these new inputs into our CNN and retrained it. Since MIDI files are a stream of events tagged with delays, it can be tricky to slice them to exactly ⅛ of a second apiece. As a heuristic, we took slices that were at least ⅛ of a second long, although they were allowed to run longer.
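To make that heuristic concrete, here is a minimal sketch of how an event stream could be cut into windows of at least ⅛ of a second. This is an illustration rather than our exact code; the mido library and the file name are assumptions.

```python
import mido

def slice_midi(path, min_len=0.125):
    """Cut a MIDI event stream into chunks that each span at least
    `min_len` seconds (chunks may run longer, never shorter)."""
    chunks, current, elapsed = [], [], 0.0
    # Iterating a MidiFile yields messages whose .time is the delay,
    # in seconds, since the previous message.
    for msg in mido.MidiFile(path):
        current.append(msg)
        elapsed += msg.time
        if elapsed >= min_len:
            chunks.append(current)
            current, elapsed = [], 0.0
    if current:  # keep any trailing partial slice
        chunks.append(current)
    return chunks

slices = slice_midi("performance.mid")  # hypothetical file name
```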

Example of what a MIDI file event stream looks like

We used a program called fluidsynth to generate audio waveforms from our split MIDI files and then turned the audio into spectrograms. There are many variants of spectrograms, however. One variant uses the short-time Fourier transform (STFT), which decomposes small sections of an audio signal into its component frequencies. The result is an image where time increases along the x-axis and frequency along the y-axis. We discovered some downsides to using the STFT on music, though. Since each piano key generates power at multiples of its fundamental frequency, a linearly spaced frequency axis squishes most of the information together and leaves a lot of wasted space in the top half of the chart.

Example of the STFT from a 1-second segment of a piano cover of “On the Nature of Daylight” by Max Richter

An alternative to the STFT is the constant-Q transform. Like the STFT, it turns snippets of audio into spectrograms, but it uses logarithmically spaced filters to decompose the signal, which spreads the notes more evenly across the image. We tried feeding both kinds of image to our network and found that the constant-Q transform outperformed the STFT for our purposes.

Example of the constant-Q transform from a ⅛ second segment of a piano cover of “On the Nature of Daylight” by Max Richter
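For reference, both transforms take only a few lines with an audio library. The snippet below is a minimal sketch assuming librosa and typical parameter values; the file name and the fluidsynth command in the comment are examples rather than our exact pipeline.

```python
import numpy as np
import librosa

# Load a piece that was rendered from MIDI beforehand, e.g. with:
#   fluidsynth -ni piano.sf2 song.mid -F song.wav -r 22050
y, sr = librosa.load("song.wav", sr=22050)

# Short-time Fourier transform: linearly spaced frequency bins, so the
# closely packed low piano notes get squeezed into a few bins while the
# top of the image is mostly empty.
stft_db = librosa.amplitude_to_db(
    np.abs(librosa.stft(y, n_fft=2048, hop_length=512)), ref=np.max)

# Constant-Q transform: logarithmically spaced bins aligned with musical
# pitch (88 bins up from A0 covers the piano range), so the notes are
# spread more evenly across the image.
cqt_db = librosa.amplitude_to_db(
    np.abs(librosa.cqt(y, sr=sr, hop_length=512,
                       fmin=librosa.note_to_hz("A0"),
                       n_bins=88, bins_per_octave=12)), ref=np.max)
```

One option from here is to cut the resulting matrices along the time axis to get one image per window, though short low-frequency windows are easier to handle when the whole piece is transformed first.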

Our full preprocessing pipeline generated an absurd amount of data. All told, we generated 2 million ⅛-second segments from our songs. Of those 2 million, we were only able to turn roughly 20,000 into audio files and spectrograms due to time constraints.

At training/testing time, we convert the segmented MIDI files into “piano rolls” on the fly. Each piano roll is a 0/1 vector with 128 entries, one per MIDI pitch (roughly, one per key on the piano), and an entry is 1 if that note was being played at that moment in time.
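A minimal sketch of building one such label vector, assuming the pretty_midi library (the library choice, function name, and file path are illustrative, not our exact code):

```python
import numpy as np
import pretty_midi

def piano_roll_label(midi_path, start, end):
    """Build the 128-entry 0/1 target vector for one time slice:
    entry i is 1 if MIDI pitch i sounds anywhere in [start, end)."""
    pm = pretty_midi.PrettyMIDI(midi_path)
    label = np.zeros(128, dtype=np.float32)
    for inst in pm.instruments:
        for note in inst.notes:
            # a note counts if it overlaps the window at all
            if note.start < end and note.end > start:
                label[note.pitch] = 1.0
    return label

label = piano_roll_label("slice_0001.mid", start=0.0, end=0.125)
```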

Model Creation

We chose a CNN for our model because of its high accuracy on image tasks. The other music projects from the previous semester both used Long Short-Term Memory networks (LSTMs), but since we transcribe each time slice independently of the ones around it, a plain CNN was enough. We based our initial model on a CNN tutorial that used MNIST as its dataset. Since MNIST is a single-label classification problem, we had to make some modifications.

We wanted to detect multiple notes at the same time, so we omitted the softmax layer usually used for classification at the end of a CNN. That also meant our loss function could not be categorical cross-entropy, since our output is not a single category. Instead, we used a sigmoid output layer with a binary cross-entropy loss. Unlike softmax, sigmoid gives each note an independent probability, which is exactly what we wanted.

The exact network architecture we used was similar to what you might find for the MNIST challenge, as mentioned previously. The layers we used were (a rough Keras sketch follows the list):

  • Conv2D-tanh (5x5)
  • Dropout (0.5)
  • MaxPooling2D (2x2)
  • Conv2D-tanh (3x3)
  • Dropout (0.5)
  • MaxPooling2D (2x2)
  • Sigmoid
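Putting the list above together, a minimal Keras sketch might look like the following. The post only fixes the kernel sizes, dropout rates, pooling, and the sigmoid output, so the filter counts, input shape, Flatten/Dense pairing, and optimizer here are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Input shape (frequency bins x time frames x 1 channel) and filter
# counts are assumed, not taken from the original post.
model = keras.Sequential([
    keras.Input(shape=(88, 16, 1)),
    layers.Conv2D(32, (5, 5), activation="tanh", padding="same"),
    layers.Dropout(0.5),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="tanh", padding="same"),
    layers.Dropout(0.5),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    # 128 independent on/off probabilities, one per MIDI pitch
    layers.Dense(128, activation="sigmoid"),
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["binary_accuracy"])
```

Binary cross-entropy treats each of the 128 outputs as its own yes/no decision, which is what lets the network predict several notes (a chord) at once.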

Post-processing

Post-processing the network output was pretty straightforward. Each spectrogram represented ⅛ of a second of audio, and the network turned it into a piano roll, which we converted back to MIDI. We then stitched all of the MIDI slices together into a song and rendered it using fluidsynth or GarageBand.
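A minimal sketch of that stitching step, again assuming pretty_midi. The threshold, velocity, and the one-note-per-slice simplification are illustrative assumptions; merging a note held across consecutive slices is left out.

```python
import pretty_midi

def rolls_to_midi(rolls, slice_len=0.125, threshold=0.5):
    """Turn a sequence of 128-entry network outputs (one per slice)
    into a single MIDI file, holding each active pitch for the
    length of its slice."""
    pm = pretty_midi.PrettyMIDI()
    piano = pretty_midi.Instrument(program=0)  # acoustic grand piano
    for i, roll in enumerate(rolls):
        start = i * slice_len
        for pitch, p in enumerate(roll):
            if p >= threshold:
                piano.notes.append(pretty_midi.Note(
                    velocity=80, pitch=pitch,
                    start=start, end=start + slice_len))
    pm.instruments.append(piano)
    pm.write("reconstruction.mid")
    return pm
```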

Model Training Results

Our model achieved high accuracy, though part of that is because most entries in the label vector are 0 at any given time (most keys on the piano aren’t being pressed). With 128 keys, if a song only plays one note per time step, a model that predicts all zeros is already right on 127 of 128 entries, roughly 99.2%, so high accuracy is relatively trivial. For more complicated songs, however, accuracy can be a decent metric. We also observed our training and test cross-entropy decrease over time, which was an encouraging sign.

Accuracy of our network after training for 100 epochs

Qualitatively, the results of our network were pretty decent! Testing on the song “On the Nature of Daylight” gave us the following results:

Original Audio Clip for “On the Nature of Daylight”
Reconstructed Audio Clip for “On the Nature of Daylight”

The network failed to transcribe the left hand, but it appears to have transcribed the melody pretty accurately. There may have been some data issues that we hadn’t realized. More complicated transcriptions gave mixed results and were messier.

Conclusion

We were quite pleased that our network was able to transcribe something meaningful, even for very simple pieces. Most of the performance gain came from switching from STFT spectrograms to constant-Q spectrograms, which appear to have been significantly easier for the network to learn.

There are many things we could have done to improve performance but didn’t get to try due to time constraints. One would be to add an LSTM on top of our CNN. The CNN would still do most of the work of deciding whether a note is present at a given time, but the LSTM could break ties and reduce noise: by modeling trends in the music over time, it could raise the probability of a particular note when the CNN is split between several candidates.

We could also try changing our network architecture to a resnet. Resnets use residual skip connections between layers, which let subsequent layers learn only the residual between the output of everything earlier in the network and the true labels. In practice, this allows for much deeper networks and better performance on some image-processing tasks.
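For illustration, a residual block in Keras is only a few lines. This is a generic sketch of the idea, not a design we tested, and the filter count is a placeholder.

```python
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """Two conv layers plus a skip connection: the convs only need to
    learn the residual that gets added back onto x. Assumes x already
    has `filters` channels; otherwise project the shortcut with a
    1x1 convolution first."""
    shortcut = x
    y = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)
    y = layers.Add()([shortcut, y])  # the skip connection
    return layers.Activation("relu")(y)
```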

Another way to improve the model would be better data preprocessing. Since a lot of the gains we observed came from cleaner data, we suspect even cleaner data would give the biggest performance boost. For example, the constant-Q transform helped a lot, but there are other things we could do to help the network distinguish notes. First, we could try different scalings of the constant-Q transform. The network was unable to accurately transcribe the left hand, and the reason could be that the lower frequencies are still too scrunched together; a custom scaling might provide more resolution in the lower notes, where we need it.

The last thing that could improve our qualitative results is note onset detection, which figures out where each note begins within an audio clip. Most musical instruments produce signals with high power at the onset that dies down over time, and an algorithm can look for these signatures to locate each note. This would also allow our transcription software to detect note lengths.
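With librosa, a basic onset detector is readily available. This is a sketch with an assumed file name and default parameters, not something we integrated into the pipeline.

```python
import librosa

# Load any rendered or recorded clip (file name is just an example).
y, sr = librosa.load("performance.wav", sr=22050)

# Peaks in onset strength mark where a note's energy suddenly rises;
# the returned frame indices are converted to times in seconds.
onset_frames = librosa.onset.onset_detect(y=y, sr=sr, hop_length=512)
onset_times = librosa.frames_to_time(onset_frames, sr=sr, hop_length=512)
```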

Overall, we had a good time with the project and there’s plenty of future work that can still be done.
