Mozart ➡ Beatles

Understanding how Facebook’s new AI translates between music genres — in 7 minutes

Arya Vohra
The Artificial Intelligence Journal

--

Imagine this: your friend’s been bugging you to listen to a song for weeks, even though you told them you don’t like Ed Sheeran, damn it! They continue to pester you, claiming that the “melody is what makes it great”. If only you could listen to that melody in the form of something civilised, like a Bach Organ Concerto.

Wait no more.

Facebook’s AI research team proposed a music domain transfer system, which claims to enable translation across “musical instruments, genres, and styles”. You can see the results below.

check it out!

I was shocked. This is pretty damn impressive stuff.

The paper improves on previous work in two spaces: domain transfer and audio synthesis. Recent progress in the domain transfer space has consistently relied on cycle-consistency (hehe), as in StarGAN (Choi et al., 2017), CycleGAN (Zhu et al., 2017), DiscoGAN (Kim et al., 2017), and NounGAN (doesn’t really exist, but these authors need to be more adventurous with their network names!). The key aim of using a cycle-consistency loss is to encourage the network to keep all content-related information and focus on changing only domain-related information.

Hmm, domain, content, GAN, tricycle… all a bit of a jumble of words. Let’s break some of this down. Cycle-consistency captures the following statement: F(G(x)) ≈ x, i.e. a function G should have a corresponding inverse F that approximately returns the original input x. This can be encouraged by introducing a cycle-consistency loss, as can be seen below:

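(This is the standard formulation from the CycleGAN paper: the L1 error of the round trip in each direction.)

```latex
\mathcal{L}_{\mathrm{cyc}}(G, F) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\, \lVert F(G(x)) - x \rVert_1 \,\big]
  + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\big[\, \lVert G(F(y)) - y \rVert_1 \,\big]
```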

Basically, we take the error across the forward cycle x → G(x) → F(G(x)) ≈ x and across the backward cycle y → F(y) → G(F(y)) ≈ y. Great!

Now for domain-related information vs content-related information… This one is a bit of a toughie. In the context of GANs, domain information is everything in a given input that determines its fit into its domain, while content information is everything else about the input. For example, say we have an image of a red car, and a set of domains that includes {red cars, blue cars, green cars}. The only domain-related information in the image is the redness of the car, while things such as the shape of the car, the number of headlamps, the backdrop, etc. are all content-related info.

Seeing as the FAIR team’s model was not cycle-consistent, we may have just wasted some time looking into that, but hey, at least we learned something.

The model isn’t cycle-consistent because of its use of teacher forcing, so let’s take a bit of a detour to see what that means in practice.

Teacher forcing is a training technique for autoregressive models (it has nothing to do with reinforcement learning, despite how it might sound). During training, the model’s inputs consist of the previous timestep’s ground truth outputs rather than its own predictions. The sequences seen during training are therefore always accurate, but that may not be the case for generated samples: at inference time the model feeds on its own outputs, so the sequences it generates can drift far from anything it saw during training.
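To make that concrete, here’s a toy sketch I made up (a tiny GRU next-token model, nothing to do with the FAIR architecture) showing teacher forcing at training time versus free-running generation:

```python
# Toy illustration of teacher forcing vs free-running generation.
import torch
import torch.nn as nn

vocab_size, hidden = 128, 64
embed = nn.Embedding(vocab_size, hidden)
rnn = nn.GRU(hidden, hidden, batch_first=True)
head = nn.Linear(hidden, vocab_size)

def train_step(seq):
    """Teacher forcing: the input at step t is the ground-truth token at t-1."""
    inputs, targets = seq[:, :-1], seq[:, 1:]
    out, _ = rnn(embed(inputs))                  # the model only ever sees real history
    logits = head(out)
    return nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), targets.reshape(-1)
    )

@torch.no_grad()
def generate(first_token, steps):
    """Free-running generation: the model feeds on its own (possibly wrong) outputs."""
    tok, state, out = first_token, None, []
    for _ in range(steps):
        h, state = rnn(embed(tok), state)
        tok = head(h).argmax(dim=-1)             # feed back the model's own guess
        out.append(tok)
    return torch.cat(out, dim=1)
```

The gap between those two regimes is exactly the train/test mismatch described above, and it’s why bolting a cycle-consistency loss onto a teacher-forced model is awkward.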

Although, if they really wanted to, they could probably implement a cycle-consistency loss term, as in Kaneko et al. Kaneko et al. didn’t use an autoregressive model, which has some pretty interesting implications that I’ll touch on later.

The team also employed one decoder for each output domain, as a single decoder apparently failed to perform convincingly on the range of output domains.

Coming to the debatably more interesting section (I still can’t get over those pretty dilation diagrams in the paper, let me be!), the FAIR team used WaveNet. Specifically, an adaptation of the WaveNet autoencoder variant from the team behind the NSynth dataset. The differences in the FAIR system are: the use of multiple decoders, a domain confusion network that disentangles domain from content, and the use of pitch augmentation to stop the network from lazily memorising the data.

Fig 1. The WaveNet adaptation from “Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders” by Engel et al. Pretty, no?
Fig 2. The actual model that the FAIR team used.
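Before digging into the loss, here’s a toy sketch of the overall shape of the model: one shared encoder, one decoder per output domain, and a domain confusion classifier that looks at the latent code. The layer sizes and modules here are placeholders I made up (the real encoder and decoders are dilated WaveNet stacks, and the real latent is heavily downsampled in time), so read it as a diagram in code rather than an implementation:

```python
# Toy stand-ins for the shared encoder, per-domain decoders, and domain classifier.
import torch
import torch.nn as nn

LATENT_DIM = 64
N_DOMAINS = 6  # the paper trains on six musical domains

class SharedEncoder(nn.Module):
    """Stand-in for the dilated-convolution WaveNet encoder."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=3, dilation=1, padding=1), nn.ReLU(),
            nn.Conv1d(32, LATENT_DIM, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
        )

    def forward(self, wav):          # wav: (batch, 1, samples) in [-1, 1]
        return self.net(wav)         # latent: (batch, LATENT_DIM, samples)

class DomainDecoder(nn.Module):
    """Stand-in for one decoder. The real decoder is an autoregressive WaveNet
    conditioned on the latent and on previously generated samples; this toy
    version just maps the latent to 8-bit sample logits."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv1d(LATENT_DIM, 256, kernel_size=1)

    def forward(self, z):
        return self.net(z)           # logits: (batch, 256, samples)

class DomainClassifier(nn.Module):
    """The domain confusion network: tries to guess which domain a latent came from."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(LATENT_DIM, N_DOMAINS)

    def forward(self, z):
        return self.net(z.mean(dim=-1))  # average over time, then classify

encoder = SharedEncoder()
decoders = nn.ModuleList([DomainDecoder() for _ in range(N_DOMAINS)])
domain_clf = DomainClassifier()
```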

Let’s take a look at that domain confusion.

“Domain-Adversarial Training of Neural Networks” (Ganin et. al 2016) described highly effective domain transfer — they achieved state of the art performance at the time — based on the following principle: “for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains”.
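The classic way to act on that principle, from the Ganin et al. paper rather than the FAIR one, is a gradient reversal layer: identity on the forward pass, flipped gradient on the backward pass, so that training the domain classifier normally pushes the feature extractor towards domain-invariant features. A minimal PyTorch sketch (sizes are made up):

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass, flips (and scales) the gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # gradient w.r.t. x is reversed; lambd gets no gradient
        return grad_output.neg() * ctx.lambd, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# usage: push features through the reversal before the domain classifier, so
# training the classifier normally makes the features *less* domain-revealing
domain_classifier = nn.Linear(64, 6)          # made-up sizes
features = torch.randn(8, 64, requires_grad=True)
domain_logits = domain_classifier(grad_reverse(features, lambd=0.1))
```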

The team used adversarial training to do this. The WaveNet autoencoder was the generator, and a domain classification network was the discriminator. Adding the adversarial term to the loss of the autoencoder (check out the equation below) encourages it to learn domain-invariant latent representations. This is what enables a single shared encoder to serve every domain, which is one of the key things that makes this paper cool.
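I’m paraphrasing the paper’s notation here, but the autoencoder loss has roughly this shape: a reconstruction term for each domain’s decoder, minus a λ-weighted term that rewards confusing the domain classifier C (which is itself trained, separately, to get the domain right):

```latex
\sum_{j} \sum_{s^j \in S^j}
  \mathcal{L}\!\left( D^j\!\big( E(O(s^j, r)) \big),\; s^j \right)
  \;-\; \lambda\, \mathcal{L}\!\left( C\big( E(O(s^j, r)) \big),\; j \right)
```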

There’s quite a lot to look at in the equation above — let’s break it down quickly.

  • L(ŷ, y) is the cross-entropy loss, applied element-wise to each prediction ŷ and its target y.
  • Decoder D^j is an autoregressive model that is conditioned on the output of E, the shared encoder.
  • O(s^j, r) is the augmentation function applied to the sample s^j with a random seed r.
  • C is the domain confusion network, which is trained to minimise a classification loss.
  • λ (lambda) weights the domain confusion term and is responsible for disentanglement. Disentanglement ensures that the different parts of the latent representation learn different things about the input data. This is a key feature of disentangled variational autoencoders, which are explained well in Arxiv Insights’ video on variational autoencoders.
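Putting those pieces together, one training step looks roughly like the sketch below, reusing the toy encoder, decoders, and domain classifier from the earlier snippet. The pitch_augment stand-in and the crude 8-bit targets are my own placeholders (the real system uses proper random pitch shifting and teacher-forced WaveNet decoders over mu-law audio), so this is the shape of the two-player game rather than the FAIR code:

```python
import torch
import torch.nn.functional as F

def pitch_augment(wav):
    """Placeholder for the paper's random pitch-shift augmentation O(s, r):
    here it just adds a little noise so the sketch runs end to end."""
    return wav + 0.01 * torch.randn_like(wav)

def training_step(wav, j, encoder, decoders, domain_clf, clf_opt, ae_opt, lam=1e-2):
    """One step on a batch `wav` of shape (batch, 1, samples), all drawn from domain j."""
    aug = pitch_augment(wav)
    # crude 8-bit quantisation of the waveform as reconstruction targets
    target = ((wav.squeeze(1) + 1.0) * 127.5).long().clamp(0, 255)
    domain = torch.full((wav.size(0),), j, dtype=torch.long)

    # (1) the confusion network C learns to *recognise* the domain of the latent
    clf_loss = F.cross_entropy(domain_clf(encoder(aug).detach()), domain)
    clf_opt.zero_grad()
    clf_loss.backward()
    clf_opt.step()

    # (2) encoder + domain-j decoder reconstruct the input while *fooling* C,
    #     hence the minus sign on the lambda-weighted confusion term
    z = encoder(aug)
    recon_loss = F.cross_entropy(decoders[j](z), target)
    confusion = F.cross_entropy(domain_clf(z), domain)
    ae_loss = recon_loss - lam * confusion
    ae_opt.zero_grad()
    ae_loss.backward()
    ae_opt.step()
    return recon_loss.item(), clf_loss.item()
```

Note the alternation: C is updated to recognise the domain, then the encoder and decoder are updated to reconstruct well while fooling C.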

Phew, I think this is starting to come together now. To close, let’s take a look at how they trained the thing.

The domains they trained on represented a spread of 6 different timbres (timbre: the characteristic sound of a particular instrument) and textures (texture: the number of instruments and notes being played simultaneously) in classical music. One of the results that particularly stood out to me was the correlation between autoencoder-trained embeddings and pitch: the cosine similarity between instrument pairs at the same pitch was in the range of 0.90–0.95, which is pretty remarkable.
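If you wanted to run that sanity check yourself, it boils down to cosine similarity between time-averaged latent codes, something like this (with made-up embeddings standing in for real encoder outputs):

```python
# Cosine similarity between (hypothetical) time-averaged encoder embeddings of
# the same pitch played on two different instruments.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# made-up stand-ins for real encoder outputs, similar by construction
emb_instrument_a = np.random.randn(64)
emb_instrument_b = emb_instrument_a + 0.3 * np.random.randn(64)

print(cosine(emb_instrument_a, emb_instrument_b))  # the paper reports ~0.90-0.95 for instrument pairs
```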

So there it is! A damn good step forward in the music space for deep learning. I am really looking forward to seeing where this drives the community. I hope you enjoyed this first article; follow us here at the AI Journal!
