Autoregressive Networks: the Obscure Kind of Generative Network

Another shocker from the insane world of Deep Learning:

Source: https://avdnoord.github.io/homepage/vqvae/

So it takes a snippet of speech and then translates the snippet of speech using the voice style of another person. The surprising point though of this research is that its able to encode an internal representation of a speech absent the speaking style. Of course, that sounds like voice to text translation. Now it somehow is able to take out a speech style and transpose it elsewhere. Quoted from the above web page:

This behaviour arises naturally because the decoder gets the speaker-id for free so the limited bandwith of latent codes gets used for other speaker-independent, phonetic information.

The approach in the paper uses auto-regressive networks, one of those curiously strange thingamajigs that DeepMind seems to be enamored with. It is the same kind of network as WaveNet.

DeepMind seems to like using a very peculiar kind of network that goes by the name PixelRNN, PixelCNN, WaveNet and ByteNet. These networks are a radical departure from more traditional CNN networks and have characteristics that give them behavior that is remarkably different from other approaches. DeepMind has a paper that contrasts GANs with PixelCNNs that was submitted to ICLR 2017. In this paper the authors argue that the features and behavior of PixelCNNs are different enough that they should be evaluated differently than GANs.

The question then are, what exactly are these ‘Autoregressive Networks’. In a map that I created of DL supervised learning, these networks are a new species entirely:

Source: Deep Learning Playbook

Autoregressive networks behave differently enough that it is sometimes worthwhile to combine in an ensemble these kinds of networks with more conventional ConvNets or Feed Forward (Dense) networks.

These networks were all the rage late 2016 (around the time WaveNet was introduced), but for some reason, the research community hasn’t been too enamored by them. Unlike GANs, it is much harder to find other research groups working on this. I suspect the popularity of the GANs reduces the interest in an alternative generative technique. This is unfortunate since there is value in using a technique that has wildly different characteristics.

The beauty of Autoregressive networks is that the same formulation applies to one dimensional, two dimensional and higher dimensional domains:

What’s very odd about them however is that conventional networks always calculate a kind of similarity between the weights of the network and the input data. So for the more classical network, you have a sum of products. For a ConvNet you have a generalized form of similarity. However, with autoregressive networks, its just a multiplication of probabilities. Specifically, you try to predict the next pixel by multiplying all the probabilities of the pixels that came before it.

The manner in that it works with images just looks contrived, unnatural and thus plain weird. It scans images like an old fashion cathode ray tube:

https://syncedreview.com/2017/07/30/pixelgan-autoencoders/

So start from the top and then predict the current pixel based on the pixels that are scanned before it. If you read the papers, the results for generative models is that they all seem very blocky and digital.

This network is so much in the fringes that it isn’t even mentioned in the Deep Learning book by Goodfellow et. al. Although the idea of autoregressive networks existed prior to Deep Learning book, it wasn’t as prominently used as it is today. There’s also not much tutorials on this, I have found a few though: http://sergeiturukin.com/2017/02/22/pixelcnn.html and https://github.com/tensorflow/magenta/blob/master/magenta/reviews/pixelrnn.md. But, its still difficult to get a good handle around and appreciate this approach.

To summarize though this latest development from DeepMind. The details of this paper are impressive in that underneath the covers its able to create a discrete representation. It may not be obvious to many, but this is indeed a significant step towards bridging the semantic gap between intuition and rationality.

Here’s an interesting paper “Towards Learning to Learn Distributions” that explores the use of PixelCNN in meta-learning.

Update: Even more interesting https://deepmind.com/blog/high-fidelity-speech-synthesis-wavenet/ Probability Density Distillation

Exploit Deep Learning: The Deep Learning AI Playbook