English, Please: Self-Attention Generative Adversarial Networks (SAGAN)

Miko Planas
Published in The Startup · Aug 14, 2020

Introduction

In my effort to better understand the concept of self-attention, I tried dissecting one of its particular use cases in one of my current deep learning interests: Generative Adversarial Networks (GANs). As I delved into the Self-Attention GAN (or “SAGAN”) research paper, while following similar implementations in PyTorch and TensorFlow in parallel, I noticed how exhausting it can get to power through the formality and the mathematically intense blocks to arrive at a clear intuition of the paper’s contents. Although I get that formal papers are written that way for precision of language, I do think there’s a need for bite-sized versions that define the prerequisite knowledge needed and lay out the advantages and disadvantages candidly.

In this article, I’m going to try to give a computationally efficient interpretation of SAGAN, without reducing too much of the accuracy, for the “hacky” people out there who just want to get started (wow, so witty).

So, here’s how I’m going to do it:

  • What do I need to know?
  • What is it? Who made it?
  • What does it solve? Advantages and Disadvantages?
  • Possible further studies?
  • Source/s

What do I need to know?

  • Basic Machine Learning and Deep Learning concepts (Dense Layers, Activation Functions, Optimizers, Backpropagation, Normalization, etc.)
  • Vanilla GAN
  • Other GANs: Deep Convolutional GAN (DCGAN), Wasserstein GANs (WGAN)
  • Convolutional Neural Networks — Intuition, Limitations and Relational Inductive Biases (Just think of this as assumptions)
  • Spectral Norms and the Power Iteration Method
  • Two Time-Scale Update Rule (TTUR)
  • Self-Attention

First and foremost, basic concepts are always necessary. Let’s just leave it at that, haha. Moving on, a working understanding of the game mechanics of classical GAN training would be quite handy. In practice, most GANs these days are trained with convolutional layers and a non-saturating or Wasserstein loss, so learning about DCGANs and WGANs is very useful. Also, understanding that CNNs make a locality assumption is key to seeing why self-attention is useful in SAGANs (or in general). For the people who get restless without the proof (a.k.a. math nerds), it would be helpful to check out spectral norms and the power iteration method, an eigenvector approximation algorithm, beforehand. As for TTUR, honestly this is just having two separate learning rates for your generator and discriminator models. Feel free to check out the paper on Attention too, even though I’ll only be going through it briefly.
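If it helps to see that in code, here’s a minimal NumPy sketch of the power iteration idea (a toy example of mine, not taken from any of the papers) that approximates the spectral norm, i.e. the largest singular value, of a weight matrix:

```python
import numpy as np

def estimate_spectral_norm(W, n_iters=20):
    """Approximate the largest singular value of W via power iteration."""
    u = np.random.randn(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v) + 1e-12   # normalize so the vectors don't blow up
        u = W @ v
        u /= np.linalg.norm(u) + 1e-12
    return float(u @ W @ v)              # sigma ≈ u^T W v

W = np.random.randn(64, 128)
print(estimate_spectral_norm(W))                # power iteration estimate
print(np.linalg.svd(W, compute_uv=False)[0])    # exact value, for comparison
```

Spectral normalization (more on it below) runs essentially this loop, one step at a time, to keep each layer’s largest singular value close to 1.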

What is it? Who made it?

Essentially, SAGAN is a convolutional GAN that uses self-attention layers/blocks in both the generator and the discriminator, applies spectral normalization to both networks, and trains via the two time-scale update rule (TTUR) and the hinge version of the adversarial loss. Everything else is common GAN practice: using a tanh activation at the end of the generator model, using leaky ReLU in the discriminator, and just generally using Adam as your optimizer. This architecture was created by Han Zhang, Ian Goodfellow, Dimitris Metaxas and Augustus Odena.

If you looked through the prerequisites, this definition would be pretty straightforward.

Hinge Version of Adversarial Loss Used in the Paper
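Written out (my reconstruction in plain LaTeX; the paper additionally conditions the discriminator on a class label y, which I’m dropping for readability), the two losses are:

```latex
L_D = - \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\min(0,\,-1 + D(x))\big]
      - \mathbb{E}_{z \sim p_z}\big[\min(0,\,-1 - D(G(z)))\big]

L_G = - \mathbb{E}_{z \sim p_z}\big[D(G(z))\big]
```

The same thing as a hedged PyTorch snippet (d_real and d_fake stand for the discriminator’s raw scores on a real and a generated batch; dummy tensors here so it runs):

```python
import torch
import torch.nn.functional as F

d_real = torch.randn(8)  # stand-in discriminator scores on real images
d_fake = torch.randn(8)  # stand-in discriminator scores on generated images

d_loss = F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()  # discriminator hinge loss
g_loss = -d_fake.mean()                                             # generator hinge loss
```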

What does it solve? Advantages and Disadvantages?

To start, an attention module is something you incorporate into your model so that the output can use all of your input’s information (global access) without being too computationally expensive. Self-attention is just the specific version wherein your query, key and value are all computed from the same input. In the figure below, these are the f, g and h functions. Primarily used in NLP, attention has found its way into CNNs and GANs because of the locality assumption that CNNs make. Since CNNs and previous convolution-based GANs compute each layer from a small local window, outputs with complex geometry (e.g. dogs, full-body photos) are harder to generate than pictures of oceans, skies and other backgrounds. I’ve also read that previous GANs had a harder time generating images in multi-class situations, but I need to read up more on that. Now, self-attention makes it possible to have global access to input information, giving the generator the ability to learn from all feature locations.

⊗ just means matrix multiplication. The first part just shows how the previous layer’s feature map is projected into three pieces (query, key and value) using 1x1 convolutions.
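If you prefer code to diagrams, here’s a minimal PyTorch sketch of a self-attention block in the spirit of the paper (the channel reduction by 8 and the learnable gamma initialized to zero follow common open-source implementations; the names are mine, not the authors’):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        # f, g, h: 1x1 convolutions, all applied to the same feature map
        self.f = nn.Conv2d(in_channels, in_channels // 8, kernel_size=1)  # query
        self.g = nn.Conv2d(in_channels, in_channels // 8, kernel_size=1)  # key
        self.h = nn.Conv2d(in_channels, in_channels, kernel_size=1)       # value
        self.gamma = nn.Parameter(torch.zeros(1))  # learned scale, starts at 0

    def forward(self, x):
        b, c, height, width = x.size()
        n = height * width                                   # number of feature locations
        q = self.f(x).view(b, -1, n)                         # b x c/8 x n
        k = self.g(x).view(b, -1, n)                         # b x c/8 x n
        v = self.h(x).view(b, c, n)                          # b x c   x n
        attn = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)  # b x n x n attention map
        out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, height, width)
        return self.gamma * out + x                          # residual connection

x = torch.randn(4, 64, 32, 32)
print(SelfAttention(64)(x).shape)  # torch.Size([4, 64, 32, 32])
```

Because gamma starts at zero, the block initially acts like an identity and the network behaves like a plain convolutional GAN; it only leans on the attention map as training progresses.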

Another thing about SAGAN is that it uses spectral normalization on both the generator and the discriminator for better conditioning. What spectral normalization does is allow fewer discriminator updates per generator update by limiting the spectral norm of the weight matrices, which constrains the Lipschitz constant of the network function. That’s a mouthful, but you can just imagine it to be a more powerful normalization technique. Lastly, SAGANs use the two time-scale update rule to address slow-learning discriminators. Typically, the discriminator starts with a higher learning rate to avoid mode collapse.
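Both tricks are close to one-liners in PyTorch. Here’s a hedged sketch (the tiny generator and discriminator are placeholders of mine, but the 0.0001/0.0004 learning rates and betas of (0, 0.9) are the settings reported in the paper):

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Spectral normalization: wrap a layer so its weight's spectral norm stays near 1.
# Under the hood, PyTorch runs one power iteration step per forward pass.
sn_conv = spectral_norm(nn.Conv2d(64, 128, kernel_size=3, padding=1))

# TTUR: just two Adam optimizers with different learning rates.
generator = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.1), nn.Linear(256, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.9))
d_opt = torch.optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.0, 0.9))
```

In SAGAN, the layers of both the generator and the discriminator get the spectral_norm wrapper.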

Possible further studies?

At the moment, I’m personally having a difficult time generating 256x256 images, due either to computational expense or to something I don’t fully understand about the capacity or nuances of the model. Has anyone tried progressively growing a SAGAN?

Thanks for reading! I hope you enjoyed! I would love to do more of these so feedback is very much welcome. :)

Source/s

Zhang, H., Goodfellow, I., Metaxas, D., & Odena, A. (2019). Self-Attention Generative Adversarial Networks. arXiv:1805.08318

