Generating small images by adding attention to variational autoencoders

Notes on “DRAW: Deep Recurrent Attentive Writer”

Jason Benn
Paper Club
6 min read · Sep 16, 2017

--

⁉️ Big Question

“What problem is this entire field trying to solve?”

Generative models may not seem like a useful technology in and of themselves, but building a representation that can generate data IS super useful, and meaningfully different from a representation that solves supervised problems: the former can produce a new image (and possibly a label), while the latter only outputs a label given an input image. Generative models are perhaps the key to learning from data that has never been seen before (like a car that can imagine car crash situations in order to avoid them), which means that maybe they're key to AGI. Images are a good target for generative modeling experiments because they're structurally complex and somewhat easier to debug than text or sequences of decisions.

🏙 Background Summary

What work has been done before in this field to answer the big question? What are the limitations of that work? What, according to the authors, needs to be done next?

Well, DRAW is a twist on variational autoencoders. Past attempts at image generation work in a single pass (Dayan et al., 1995; Hinton & Salakhutdinov, 2006; Larochelle & Murray, 2011), but one-shot approaches don't scale well to large images.
In response, a body of research emerged suggesting that a series of partial glimpses is better for learning image structure than one high-level pass (Larochelle & Hinton, 2010; Denil et al., 2012; Tang et al., 2013; Ranzato, 2014; Zheng et al., 2014; Mnih et al., 2014; Ba et al., 2014; Sermanet et al., 2014). DRAW draws (heh) on this by working iteratively instead of in a single pass.
One other important distinction is that DRAW also incorporates a fully differentiable attention mechanism, which makes it resemble the selective read and write operations of Neural Turing Machines.

What does it mean to have a non-differentiable attention mechanism?

❓ Specific question(s)

What exactly are the authors trying to answer with their research? There may be multiple questions, or just one. Write them down. If it’s the kind of research that tests one or more null hypotheses, identify it/them.

Can we improve image generation tasks by taking variational autoencoders and adding a differentiable attention mechanism?

⚗️ Methods

What exactly did the authors do?

We didn’t learn about variational autoencoders before reading this paper, so I’ll build up an explanation of the net piece by piece.

A standard autoencoder has two ends: an encoder and a decoder (in DRAW, both are RNNs). They communicate via a “code”: a short vector of numbers that represents a training image. One valid coding scheme might be a one-hot encoding. In practice, autoencoder codes are denser and use continuous numbers to represent training images. For example, an image of a dog might be encoded by the encoder as [4, 5, 1, 0, 1] (instead of [0, 1, 0, 0, …]). The two networks are trained jointly by a loss function that computes the distance between the input image and the output image generated by the decoder. Because the code is lossy, the net learns to map similar features to similar codes — in the above example, perhaps a basset hound would be encoded as [4.5, 5, 1, 0, 1] and a golden retriever as [5, 4.8, 1, 0, 1]. Dissimilar images will have very different codes — perhaps a cat would be [4, 3, 0, 0, 1] and a submarine [0, 0, 3, 5, 6].
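Here's a minimal sketch of that setup in PyTorch (my own toy example with fully connected layers and a 5-number code; DRAW itself uses LSTM encoders and decoders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoEncoder(nn.Module):
    def __init__(self, image_dim=784, code_dim=5):
        super().__init__()
        # The encoder squeezes an image down to a short code like [4, 5, 1, 0, 1].
        self.encoder = nn.Sequential(
            nn.Linear(image_dim, 256), nn.ReLU(), nn.Linear(256, code_dim))
        # The decoder tries to rebuild the image from that code alone.
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 256), nn.ReLU(), nn.Linear(256, image_dim), nn.Sigmoid())

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

model = AutoEncoder()
x = torch.rand(32, 784)               # a batch of flattened 28x28 images
reconstruction, code = model(x)
loss = F.mse_loss(reconstruction, x)  # distance between input and output image
loss.backward()
```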

Now, the decoder (in this explanation a stack of deconvolutional layers; in DRAW it's an RNN) has learned to transform these codes into images. So you could pull a code out of thin air, say [0, 0, 1, 2, 3], and see what image the decoder generates, but if your decoder has no points of reference near that code, it'll probably generate garbage. To generate a code similar enough that your decoder will create interesting results, you'd have to pass every training image through your encoder, collect those codes, compute a mean and standard deviation for every dimension of the code, and sample from that multivariate probability distribution — but that's a lot of work. It's much easier to add a term to your loss function that penalizes the encoder by the KL divergence of its codes from a unit Gaussian probability distribution. That way, your encoder will learn to output codes that approximately match a unit Gaussian distribution and your decoder will learn to decode them, so you'll be able to just generate your own codes all day long! No need to do all that analysis on the variation within your training set. This constraint (together with sampling the code from a distribution the encoder outputs, rather than taking it as a fixed point) is the main difference between variational autoencoders and regular autoencoders.
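That KL penalty has a simple closed form when the encoder outputs a mean and log-variance for each code dimension. A sketch (the names `mu`, `logvar`, and `vae_loss` are mine, not the paper's):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    # Reconstruction term: how far the decoded image is from the input.
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # KL term: how far the encoder's code distribution is from a unit Gaussian.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# The code fed to the decoder is sampled with the "reparameterization trick",
# so gradients can still flow through the sampling step:
def sample_code(mu, logvar):
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
```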

A neat trick for introspecting on your network that I learned from kvfrans: keep the generative code constant except for one variable, generate 100 slightly different images, and try to identify a pattern in the results.
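A sketch of that trick, using the hypothetical trained decoder and 5-dimensional code from the example above:

```python
import torch

base_code = torch.zeros(1, 5)           # hold the code fixed...
images = []
for value in torch.linspace(-3, 3, steps=100):
    code = base_code.clone()
    code[0, 2] = value                   # ...except for dimension 2
    images.append(model.decoder(code))   # then eyeball the 100 outputs for a pattern
```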

DRAW introduces two innovations to VAEs:

One: it learns attentional parameters, emitted by the decoder net, that constrain which part of the canvas the net reads from and writes to.

[Figure 3 from the paper illustrates these attention parameters.]

Two: it iteratively refines its output over several steps. This is achieved by letting the encoder see what the decoder has produced so far, and by accumulating the decoder's writes onto a single canvas that parameterizes the final output distribution.
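To make the iterative part concrete, here's a rough sketch of the generation loop. The `decoder_rnn` and `write` modules are placeholders standing in for the paper's decoder LSTM and attentive write operation (which uses a grid of learned Gaussian filters), and the sizes are made up:

```python
import torch
import torch.nn as nn

T, batch, z_dim, hidden, canvas_dim = 10, 32, 100, 256, 784

# Stand-ins for the paper's decoder LSTM and attentive write operation.
decoder_rnn = nn.GRUCell(z_dim, hidden)
write = nn.Linear(hidden, canvas_dim)

canvas = torch.zeros(batch, canvas_dim)
h_dec = torch.zeros(batch, hidden)

for t in range(T):
    z = torch.randn(batch, z_dim)   # at generation time, each code is sampled from the prior
    h_dec = decoder_rnn(z, h_dec)   # the decoder's state carries context across steps
    canvas = canvas + write(h_dec)  # each step's "strokes" are added to a shared canvas

# The accumulated canvas, squashed into (0, 1), parameterizes the output distribution.
image_probs = torch.sigmoid(canvas)
```

At training time the codes come from the encoder rather than the prior, and an attentive read operation feeds the input image and an error image back into the encoder at every step.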

This part I don’t understand very well. How does the net decide when it’s done iterating? And why do the attentional parameters always seem to draw smaller and smaller boxes with every timestep?

🤠 Conclusion

What do the authors think the results mean? Do you agree with them? Can you come up with any alternative way of interpreting them? Do the authors identify any weaknesses in their own study? Do you see any that the authors missed? (Don’t assume they’re infallible!) What do they propose to do as a next step? Do you agree with that?

On the two-digit MNIST generation task, the authors concluded that “the network typically generates one digit and then the other, suggesting an ability to recreate composite scenes from simple pieces.” I'm not so sure that I agree. If you watch the video of their generation results, the attention seems to meander across the screen, and when two digits are touching, the attention is more likely to wander into the second digit before finishing the first. How can we then conclude that the network understands the separation between the two digits?
The authors did not achieve great results on the CIFAR-10 dataset, and blamed these results on CIFAR-10’s small size (50k images) and diverse makeup. I think that’s fair.

👂 Questions

Drop any questions you have or would like to discuss here

  • Are there VAE architectures whose encoder/decoder nets aren’t RNNs?
  • Are there variational autoencoders that _wouldn’t_ be improved by incorporating them into a GAN? Why?
  • What’s the difference between (1) learning a sampler function (GAN), (2) learning a directed Bayes net (DARN), and (3) learning the variational parameters (VAE)?

⏩ Viability as a Project

Is the data available? How much computation? Can the problem be scaled down? How much code development is necessary? How much work to turn this paper into a concrete and useful application? How much will we learn? How do we prove success? What are the results of success?

Very viable. Because DRAW on MNIST is just a pair of small RNNs, I can train the net on my own laptop in ~10 minutes. Additionally, there are tons of existing implementations out there.

🔁 Abstract

Does it match what the authors said in the paper? Does it fit with your interpretation of the paper?

More or less — it describes the architecture simply and states the results fairly.

📚 Other Resources

List any helpful references or citations that helped you understand the paper

🤷‍ Words I don’t know

List and define any and all words you didn’t previously know.

  • Kullback-Leibler (KL) divergence: a measure of how different one probability distribution is from another. It's 0 when the two distributions are identical, grows without bound as they diverge (it isn't capped at 1), and isn't symmetric.
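A tiny made-up example with two discrete distributions:

```python
import numpy as np

p = np.array([0.4, 0.4, 0.2])
q = np.array([0.3, 0.3, 0.4])

kl_pq = np.sum(p * np.log(p / q))  # D_KL(p || q) ≈ 0.09 nats
kl_qp = np.sum(q * np.log(q / p))  # D_KL(q || p) ≈ 0.10 nats: not symmetric
kl_pp = np.sum(p * np.log(p / p))  # exactly 0 for identical distributions
```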
