NSynth Reconstruction Pitch Analysis Replication

Created as part of the final project for CS 4824 Machine Learning at Virginia Tech

David Thames
May 9, 2019

In the paper “Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders,” which describes a system for reconstructing sounds using the WaveNet autoencoder and the NSynth dataset, the authors perform a pitch and quality comparison of reconstructed sounds in order to quantitatively judge WaveNet [1]. To perform this analysis, they trained classifiers to label pitch and quality based on the annotations in the NSynth dataset. I set out to reproduce these results.

The WaveNet model is very expensive to train, “10 days on 32 K40 GPUs,” so I used the provided pretrained weights. The classifiers trained to label pitch and quality were similarly expensive, but their weights were not provided, so I had to find another way to classify these attributes. While quality is hard to define quantitatively (see Appendix B of the WaveNet paper), pitch is a quantitative measurement [1]. Instead of using a neural-net classifier, I calculated the fundamental frequency using autocorrelation and converted it to the MIDI pitches used by the dataset. I was unable to use the Baseline model due to errors in running its save-embeddings function.

I found much lower accuracies for both the original and reconstructed audio when compared to the annotations, but a much smaller gap between the original and reconstructed accuracies. This suggests a bias between the annotations and the measured fundamental frequency, but it still supports the high-quality reconstructions of the WaveNet model.

Original Results

Methodology

The main quantitative comparison used in the WaveNet paper is the classification accuracy of pitch and quality for the original and reconstructed audio files [1]. The authors used “a multi-task classification model” to label the pitch and quality of a given audio file [1]. These classifiers were trained using the annotations in the NSynth dataset, which contain a MIDI pitch value and a set of qualities for each audio file.

This test is based on the inception score method commonly used with generative models for images, adapted to work in the context of audio. It is meant to judge the quality of the samples produced by the generative model.

Results

Their results show some degradation in the WaveNet reconstructions, but significantly better accuracy than the Baseline reconstructions. The original results are shown in Figure 1. The authors use this to confirm that “the WaveNet reconstructions are of superior quality.”

Figure 1: Table from the WaveNet paper showing the pitch and quality classification accuracy of the original audio compared to the WaveNet and Baseline reconstructions, based on the annotation classifiers [1].

Reproduction Methods

Testing Dataset

The audio test set used was the test split of the NSynth dataset. This provides a large (4,096 audio files), high-quality testing set that was not used in training the WaveNet model.
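
For reference, here is a minimal sketch of how the test-set annotations can be read, assuming the “jsonwav” distribution of nsynth-test (an examples.json file of per-note metadata next to an audio/ directory of WAV files); the directory path is a placeholder.

import json
import os

TEST_DIR = "nsynth-test"  # placeholder path to the extracted test set

# examples.json maps each note name to its metadata, including the annotated MIDI pitch.
with open(os.path.join(TEST_DIR, "examples.json")) as f:
    annotations = json.load(f)

annotated_pitches = {name: meta["pitch"] for name, meta in annotations.items()}
wav_paths = {name: os.path.join(TEST_DIR, "audio", name + ".wav")
             for name in annotations}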

WaveNet Model

I looked into retraining the WaveNet model myself but found that it would be impractical: the GitHub page for the model lists training as taking “10 days on 32 K40 GPUs.” Because of this, the authors provide pretrained weights, and I decided to use those.

After cloning the Magenta GitHub repo and installing TensorFlow GPU and Magenta GPU, I was able to run the nsynth_generate script to reconstruct files using WaveNet on my GTX 1070 GPU. I left this running for a few days in order to reconstruct the testing dataset.

Baseline Model

I intended to also create and compare reconstructions from the Baseline model, as was done in the paper; unfortunately, I was unable to collect results for it. The Baseline model produced the following error when run on both Windows and Linux, with and without a GPU:

File "H:\Libraries\Documents\GitHub\magenta\env_magenta_win\lib\site-packages\magenta\models\nsynth\baseline\models\ae.py", line 69, in get_hparams
    hparams.update(config_hparams)
AttributeError: 'HParams' object has no attribute 'update'

I eventually found that the offending call should be tf.contrib.training.HParams.override_from_dict rather than tf.contrib.training.HParams.update, but making that change led to further errors related to GPU memory allocation with no clear solution.
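
For reference, the fix amounts to a one-line change in get_hparams (a sketch based on the traceback above):

# In magenta/models/nsynth/baseline/models/ae.py, inside get_hparams():
# hparams.update(config_hparams)            # original call that raises the AttributeError
hparams.override_from_dict(config_hparams)  # replacement that gets past it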

E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 6.24G (6700197888 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

There is a currently open GitHub issue related to this, linked here. The problem appears to be related to library versions, but I was unable to find any combination of versions that resolved it.

Pitch Calculation

I used code from the endolith/waveform_analysis GitHub repository, linked at the bottom, to calculate the fundamental frequency [2]. I tried several estimation methods, including zero crossings, FFT, autocorrelation, and harmonic product spectrum, and found that autocorrelation was the only one that produced consistently reasonable results.
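
As an illustration, here is a minimal sketch of the autocorrelation approach (not the exact code from the repository): autocorrelate the signal, skip the initial lobe, and take the lag of the highest remaining peak as the pitch period.

import numpy as np

def freq_from_autocorr(signal, fs):
    """Estimate the fundamental frequency (Hz) of signal sampled at fs Hz."""
    signal = signal - np.mean(signal)
    corr = np.correlate(signal, signal, mode="full")  # O(N^2); an FFT-based correlation is faster for long clips
    corr = corr[len(corr) // 2:]                      # keep non-negative lags only

    # Skip the initial lobe: find the first lag where the slope turns positive.
    d = np.diff(corr)
    start = np.nonzero(d > 0)[0][0]

    # The highest peak after that point corresponds to the pitch period.
    period = start + np.argmax(corr[start:])
    return fs / float(period)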

I then converted this frequency to a MIDI pitch by stepping through MIDI numbers from 0, computing each number's frequency, and keeping the one closest to the measured fundamental. This calculation is based on the formulas provided by UNSW [3].
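
A sketch of that conversion, using the UNSW relation f(m) = 440 * 2^((m - 69)/12) for MIDI number m:

def freq_to_midi(freq, max_midi=127):
    """Return the MIDI number whose nominal frequency is closest to freq (Hz)."""
    best_midi, best_err = 0, float("inf")
    for m in range(max_midi + 1):
        f_m = 440.0 * 2.0 ** ((m - 69) / 12.0)
        err = abs(f_m - freq)
        if err < best_err:
            best_midi, best_err = m, err
    return best_midi

The closed form m = round(69 + 12 * log2(f / 440)) gives essentially the same answer; the small difference is that the search above measures distance in hertz rather than in semitones.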

Accuracy Calculation

Each calculated MIDI pitch is compared to the annotated pitch from the NSynth dataset. Each comparison is marked true or false depending on whether they match, and the percentage of true values is then calculated. This is done for both the original and the reconstructed audio.
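
A sketch of this step, assuming annotated and estimated are dictionaries mapping note names to MIDI pitches (the dataset annotation and my estimate, respectively):

def pitch_accuracy(annotated, estimated):
    """Fraction of notes whose estimated MIDI pitch matches the annotation."""
    matches = [estimated[name] == pitch for name, pitch in annotated.items()]
    return sum(matches) / float(len(matches))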

Finally, I compare the accuracies from the original and reproduced results using a two-proportion z-test on the binomial data to determine whether the difference between the two results is statistically significant at the 95% confidence level.
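
A sketch of that test, where p1 and p2 are the two observed accuracies and n1 and n2 are the number of audio files behind each (scipy is assumed for the normal CDF):

import math
from scipy.stats import norm

def two_proportion_z_test(p1, n1, p2, n2):
    """Two-tailed z-test for the difference between two proportions."""
    pooled = (p1 * n1 + p2 * n2) / float(n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value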

Reproduced Results

Figure 2: Table showing the reproduced results: 51.6% pitch accuracy for the original audio and 51.3% pitch accuracy for the WaveNet reconstruction.

Clearly, there is a statistically significant difference between my results and the original paper's, each with p < .00001. Although the actual percentages are far lower than in the original paper, the original audio and the WaveNet reconstructions scored very similarly: there was only a 0.6% decrease in accuracy from the original audio, compared to the 13.1% decrease in the paper's results, as seen in Figure 3 [1]. This is likely because the annotated MIDI pitch does not line up exactly with the fundamental frequency over the whole audio clip, or because of similar biases in the different ways the pitch is determined.

Figure 3: Graphs of the accuracies for the original and replicated results. The left graph shows the raw accuracies; the right graph is normalized to remove the error on the original audio, in order to better visualize the difference between the original and reconstructed audio.

References

[1] Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and Karen Simonyan. 2017. Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning (ICML'17), Doina Precup and Yee Whye Teh (Eds.), Vol. 70. JMLR.org, 1068–1077.

[2] endolith. 2019. waveform_analysis: Functions and scripts for analyzing waveforms, primarily audio. GitHub repository: endolith/waveform_analysis.

[3] Joe Wolfe. Note names, MIDI numbers and frequencies. Retrieved May 8, 2019 from https://newt.phys.unsw.edu.au/jw/notes.html
