In-depth Summary of Facebook AI’s Music Translation Model
This isn’t the first time we are addressing music translation, where you try to convert, say, a piano track into a guitar track. There have been many attempts to do so. What makes the recent paper [1] by Facebook AI unique is that they approach it as an unsupervised learning problem and achieve remarkable domain generalization. So the model might not have observed a single flute or whistle sample during training, yet it can still translate that flute rendition into a piano track.
Now, let’s say we want to convert the audio of a Mozart symphony performed by an orchestra into audio in the style of a pianist playing Beethoven. I will walk through four different ways of approaching this.
Level-0 : How a novice would approach it
A novice like me would be eager to see it as a transcription problem. I would simply find the notes and chords using a short-time Fourier transform (STFT) and then play them on the new instrument. Traditional speech and music processing offers many approaches, such as eigen-instrument-based instrument generalization [6] or polyphonic transcription [3] followed by decoding. Each instrument has unique note onsets and transients in the time domain. The difficulty is that even a single instrument’s spectral envelope does not follow the same peak pattern at different pitches in the frequency domain. And let’s not forget sub-harmonics in polyphony. All of these factors make the problem very difficult.
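To make the naive idea concrete, here is a minimal sketch of frame-by-frame pitch picking with an STFT using librosa. The file name and the simple peak-picking rule are my own illustrative assumptions, not the method of any of the cited papers.

```python
import numpy as np
import librosa

# Load a short mono clip (the file name is just a placeholder).
y, sr = librosa.load("piano_clip.wav", sr=22050, mono=True)

# Short-time Fourier transform: rows are frequency bins, columns are time frames.
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))
freqs = librosa.fft_frequencies(sr=sr, n_fft=2048)

# Naive per-frame "transcription": pick the strongest bin and convert it to a note.
for frame in range(0, S.shape[1], 50):          # roughly every 1.2 seconds
    peak_bin = S[1:, frame].argmax() + 1        # skip the DC bin
    if S[peak_bin, frame] > 1e-3:               # ignore near-silent frames
        print(f"frame {frame}: dominant note ~ {librosa.hz_to_note(freqs[peak_bin])}")
```

This works tolerably on clean monophonic audio, but chords, overlapping harmonics, and instrument-dependent spectral envelopes break it quickly, which is exactly the difficulty described above.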
Level-1 : How a deep learning scientist would approach it
If I were slightly better versed in music theory, I might learn music transcription using a CNN that generates labels in MIDI format. MIDI is the standard format in which most synthesizers record digital music: every key press and release is an event. Datasets like MAPS [7] can be used for the polyphonic piano transcription problem.
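As a small illustration of what those MIDI events look like, here is a sketch that reads a file with the mido library (the file name is a placeholder; python-midi, linked below, exposes a similar event stream).

```python
import mido

# Open a MIDI file and list its note events (the file name is just a placeholder).
mid = mido.MidiFile("example.mid")

for track in mid.tracks:
    for msg in track:
        # Every key press/release is a note_on / note_off event carrying
        # a note number (pitch), a velocity (loudness) and a delta time.
        if msg.type in ("note_on", "note_off"):
            print(msg.type, "note:", msg.note,
                  "velocity:", msg.velocity, "delta ticks:", msg.time)
```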
Level-2 : How an NLP guy would approach it
If I were an NLP guy, I might use a sequence-to-sequence model [5], but then I would need matching tracks for both the source and target instruments.
Level-3 : Learning direct translation and domain generalization
If you are a really good musician, then you understand that the nuances of each instrument cannot be captured in MIDI. This is where the novelty of the paper lies.
Mor et al. [1] borrow the auto-regressive architecture of WaveNet and use it to turn the task into a “what’s the next sample?” kind of problem. This is what makes it unsupervised.
You can read about WaveNets on the official website (https://deepmind.com/blog/wavenet-generative-model-raw-audio/). But essentially, WaveNets work so well because dilated convolutions combined with learned gates lead to a large receptive field, and therefore better prediction and a richer latent space of hidden features. These features capture the essence of human voice or music, just like feature maps in CNNs (with a different architecture).
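To see why dilation helps, here is a tiny back-of-the-envelope sketch (my own illustration, not from the paper) computing the receptive field of a stack of causal convolutions whose dilation doubles at every layer, as in WaveNet.

```python
# Receptive field of a stack of causal convolutions with kernel size 2
# and dilation doubling at each layer (1, 2, 4, ..., 512), as in WaveNet.
kernel_size = 2
dilations = [2 ** i for i in range(10)]        # 10 layers: 1 .. 512

receptive_field = 1
for d in dilations:
    receptive_field += (kernel_size - 1) * d   # each layer adds (k - 1) * dilation

print(receptive_field)  # 1024 samples of context from only 10 layers
# A plain (non-dilated) stack of the same kernel size would need ~1023 layers.
```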
Now, if you want to learn an auto-regressive model for predicting the piano’s next samples, you can simply learn a WaveNet encoder and decoder. The encoder projects the previous sequence into a latent space. The decoder then tries to make sense of the hidden value in latent space in order to decode the next value in the sequence.
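A minimal PyTorch-style skeleton of such an auto-regressive encoder-decoder might look like the sketch below. The layer sizes, padding, and module names are my own simplifications, not the architecture from the paper or its official code.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Stack of dilated 1-D convolutions mapping raw audio to a latent sequence."""
    def __init__(self, channels=64, latent_dim=32):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=2, dilation=1, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(channels, latent_dim, kernel_size=1),
        )

    def forward(self, audio):            # audio: (batch, 1, time)
        return self.convs(audio)         # latent: (batch, latent_dim, time')

class Decoder(nn.Module):
    """Predicts a distribution over the next quantized audio sample per timestep."""
    def __init__(self, latent_dim=32, channels=64, n_classes=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(latent_dim, channels, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(channels, n_classes, kernel_size=1),   # logits per timestep
        )

    def forward(self, latent):
        return self.net(latent)          # (batch, n_classes, time')
```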
Wouldn’t it be wonderful if the model could encode for piano and decode for some other instrument? A class-conditional auto-regressive model?
Well, this is what the Facebook AI group did. Mor et al. [1] train multiple instrument domains against the same encoder, with a different decoder for each class. You might be wondering: how does that help?
The shared encoder is forced to learn common features, but we still need to tell the model that a given track was a piano track and not an orchestra. For this there is a domain confusion network, which tries to recognize the class/domain from the encoded representation. The paper describes this as an adversary: the shared latent space tries to capture common features, losing the uniqueness of each domain, while the confusion network tries to segregate the common representation and make it more class-specific. The fight is between commonness and specificity. This also calls for a careful choice of the regularization coefficient, so that we get the best of both worlds.
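Schematically, the setup is one shared encoder, one decoder per instrument domain, and a domain classifier acting as the adversary. The sketch below reuses the toy Encoder and Decoder from the earlier sketch and implements the adversarial term with a gradient-reversal layer, which is one common way to realize a domain confusion objective; it is my own simplification, not the paper’s code.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) the gradient on the way back."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class MusicTranslator(nn.Module):
    def __init__(self, n_domains=6, latent_dim=32, lam=0.01):
        super().__init__()
        self.lam = lam
        self.encoder = Encoder(latent_dim=latent_dim)              # shared across domains
        self.decoders = nn.ModuleList(
            [Decoder(latent_dim=latent_dim) for _ in range(n_domains)]
        )
        self.domain_clf = nn.Conv1d(latent_dim, n_domains, kernel_size=1)

    def forward(self, audio, domain):
        z = self.encoder(audio)
        sample_logits = self.decoders[domain](z)                   # next-sample logits
        # Domain confusion branch: the classifier tries to find the domain,
        # while the reversed gradient pushes the encoder to hide it.
        domain_logits = self.domain_clf(GradReverse.apply(z, self.lam))
        return sample_logits, domain_logits
```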
Let’s have a look at the objective function. We pick a sample sj from domain j and apply a random pitch shift to prevent no-brainer memorization of the data by the model. The paper mentions that a random pitch shift of -0.5 to +0.5 half-steps is applied to segments of 0.25 to 0.5 seconds. This is represented as O(sj, r), where r is the random seed. You might wonder what the big deal is here, but those who have worked with Google Magenta or any temporal generative model surely know about the curse of imitation: sometimes the model starts to behave like a parrot, plainly memorizing the sequence (blatant over-fitting). That is why this augmentation and distortion step is crucial here, and why it is important to train the encoder on multiple domains.
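Here is a minimal sketch of this kind of augmentation with librosa. The segment length and the uniform draws follow my reading of the description above; the paper’s exact implementation may differ.

```python
import numpy as np
import librosa

def augment(y, sr, rng):
    """Pitch-shift a random short segment of the clip by a small random amount."""
    seg_len = int(rng.uniform(0.25, 0.5) * sr)           # 0.25-0.5 second segment
    start = rng.integers(0, max(1, len(y) - seg_len))
    n_steps = rng.uniform(-0.5, 0.5)                      # in half-steps (semitones)
    y_aug = y.copy()
    y_aug[start:start + seg_len] = librosa.effects.pitch_shift(
        y[start:start + seg_len], sr=sr, n_steps=n_steps
    )
    return y_aug

rng = np.random.default_rng(0)                            # r: the random seed
y, sr = librosa.load("piano_clip.wav", sr=16000, mono=True)
y_aug = augment(y, sr, rng)                               # O(sj, r)
```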
We then pass this augmented input through the dilated convolutions of the WaveNet encoder into the latent space, and then back to the original space via the domain-specific decoder Dj to obtain predictions for the next values. We compare the predicted next values with the actual next values using a cross-entropy loss. As an adversary, or counter-objective, we also have a supervised regularization term that tries to predict the domain from the feature vector obtained after encoding. They call this the domain confusion network.
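Putting the pieces together, a single training step could look roughly like the sketch below. It reuses the toy MusicTranslator from above, and preparing the targets as one-step-ahead quantized samples is a simplification on my part.

```python
import torch
import torch.nn.functional as F

def training_step(model, audio, targets, domain):
    """
    audio:   (batch, 1, T)  augmented waveform from domain `domain`
    targets: (batch, T')    next quantized samples (e.g. 8-bit mu-law bins)
    """
    sample_logits, domain_logits = model(audio, domain)

    # Reconstruction: cross-entropy between predicted and actual next samples.
    recon_loss = F.cross_entropy(sample_logits, targets)

    # Domain confusion: the classifier learns to predict the domain, while the
    # reversed gradient pushes the encoder toward domain-agnostic features.
    domain_target = torch.full(
        (audio.size(0), domain_logits.size(-1)), domain, dtype=torch.long
    )
    confusion_loss = F.cross_entropy(domain_logits, domain_target)

    # The regularization coefficient λ is applied inside the gradient reversal,
    # so here the two terms are simply added.
    return recon_loss + confusion_loss
```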
Inference is pretty straightforward. We just fix the decoder Dj for our target domain j. So if you feed in an orchestra track, it gives you back the translation in target domain j (piano, etc.). But here’s the coolest part.
If you feed an unseen instrument to the model and follow the same auto-encoding process with the decoder of instrument j, it still works approximately, which is super awesome! It indicates that the encoder has truly generalized the latent representation across seen and unseen domains. This idea is at the core of many generative models such as GANs and variational auto-encoders, and I highly recommend reading up on the references.
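In code, translation at inference time is just: encode whatever you are given, then decode with the target domain’s decoder. The snippet below again reuses the toy model; real WaveNet decoding is sample-by-sample and far slower than this one-shot sketch suggests, and PIANO is a placeholder domain index.

```python
import torch

@torch.no_grad()
def translate(model, audio, target_domain):
    """Encode any input (even an unseen instrument) and decode as `target_domain`."""
    z = model.encoder(audio)                      # shared, domain-agnostic encoding
    logits = model.decoders[target_domain](z)     # decode with the target instrument
    samples = logits.argmax(dim=1)                # greedy pick of quantized samples
    return samples                                # (batch, T') quantized waveform

# e.g. feed a whistle recording and decode it with the piano decoder:
# piano_version = translate(model, whistle_audio, target_domain=PIANO)
```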
That’s all, my friends. I have made a super-awesome YouTube video on the same topic. Do watch it, and show your support for the blog by giving it a clap on Medium :)
Subscribe to my YouTube channel. I publish videos every week on the math intuition behind recent topics in AI. The link is below: http://youtube.com/c/crazymuse
Youtube Video (based on this blog):
Useful Links
[1] In-depth explanation on the topic : https://www.youtube.com/watch?v=QL_joojCzvs
[2] Demo by Facebook Team : https://www.youtube.com/watch?v=vdxCqNWTpUs
[3] Google Magenta : https://magenta.tensorflow.org/
[4] MAPS dataset : http://www.tsi.telecom-paristech.fr/aao/en/2010/07/08/maps-database-a-piano-database-for-multipitch-estimation-and-automatic-transcription-of-music/
[5] NSynth dataset : https://magenta.tensorflow.org/nsynth
[6] Useful Blog on Midi-python : https://github.com/vishnubob/python-midi
References
[1] Mor, N., Wolf, L., Polyak, A., & Taigman, Y. (2018). A Universal Music Translation Network. arXiv preprint arXiv:1805.07848.
[2] Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., … & Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
[3] Sigtia, S., Benetos, E., & Dixon, S. (2016). An end-to-end neural network for polyphonic piano music transcription. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 24(5), 927–939.
[4] Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104–3112).
[5] Roberts, A., Engel, J., Raffel, C., Hawthorne, C., & Eck, D. (2018). A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music. arXiv preprint arXiv:1803.05428.
[6] Benetos, E., & Dixon, S. (2013). Multiple-instrument polyphonic music transcription using a temporally constrained shift-invariant model. The Journal of the Acoustical Society of America, 133(3), 1727–1741.
[7] Emiya, V. (2008). Transcription automatique de la musique de piano [Automatic transcription of piano music]. Ph.D. thesis, TELECOM ParisTech, France.