Attention in GANs

Jiasheng Tang
AI2 Labs
May 26, 2020

In 2017, the paper “Attention Is All You Need” shook the land of NLP (Natural Language Processing). It was shocking not only because of its catchy title, but also because it introduced a new model architecture called the “Transformer”, which proved to perform much better than traditional RNN-style networks and paved the way for the state-of-the-art NLP model “BERT”.

This stone cast in the pond of NLP has created ripples in the pond of GANs (Generative Adversarial Networks). Many have been inspired by it and have attempted to harness the power of attention. But when I first started reading papers about using attention in GANs, it seemed to me that there were many different meanings behind the same word “attention”. In case you are as confused as I was, let me be at your service and shed some light on what people really mean when they say they use “attention” in GANs.

Meaning 1: Self-attention

Self-attention in GANs is very similar to the mechanism in the NLP Transformer model. Basically, it addresses the difficulty AI models have in capturing long-range dependencies.

In NLP, the problem arises with long sentences. Take this Oscar Wilde quote: “To live is the rarest thing in the world. Most people exist, that is all.” The two words “live” and “exist” are related, but they are placed far apart from each other, which makes it hard for RNN-style models to capture the relationship.

It is almost the same in GANs. Most GANs use a CNN structure, which is good at capturing local features but may overlook long-range dependencies that fall outside its receptive field. As a result, it is easy for a GAN to generate realistic-looking fur on a dog, yet still make the mistake of generating a dog with five legs.

Self-Attention Generative Adversarial Networks (SAGAN) adds a self-attention module to guide the model to look at features in distant portions of the image. In the pictures below, the image on the left is the generated image, with some sample locations labeled with color dots. The other images show the corresponding attention maps for those locations. I find the most interesting one to be the fifth image, with the cyan dot: it shows that when the model generates the left ear of the dog, it not only looks at the local region around the left ear but also at the right ear.

Visualization of the attention map for the color labeled locations. source

If you are interested in the technical details of SAGAN, besides reading the paper, I also recommend this post.
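To make the idea concrete, here is a minimal sketch of a SAGAN-style self-attention layer in PyTorch. The layer names and the channel-reduction factor of 8 follow the paper’s description, but this is my own illustrative re-implementation under those assumptions, not the authors’ code.

# A minimal sketch of a SAGAN-style self-attention layer (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 1x1 convolutions produce query, key, and value feature maps.
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        # Learnable scale, initialized to 0 so the block starts as an identity.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w  # number of spatial locations
        q = self.query(x).view(b, -1, n).permute(0, 2, 1)  # (b, n, c//8)
        k = self.key(x).view(b, -1, n)                      # (b, c//8, n)
        v = self.value(x).view(b, -1, n)                    # (b, c, n)
        # Every spatial position attends to every other position,
        # regardless of how far apart they are in the image.
        attn = F.softmax(torch.bmm(q, k), dim=-1)           # (b, n, n)
        out = torch.bmm(v, attn.permute(0, 2, 1)).view(b, c, h, w)
        return self.gamma * out + x

# Usage: drop the layer between convolutional blocks of a generator.
feat = torch.randn(2, 64, 32, 32)
print(SelfAttention(64)(feat).shape)  # torch.Size([2, 64, 32, 32])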

Meaning 2: Attention in the discriminator

GANs consist of a generator and a discriminator. In the GANs world, they are like two gods eternally at war, where the generator god tirelessly creates, and the discriminator god stands at the side and criticizes how bad these creations are. It may sound like the discriminator is the bad god, but that is not true. It is through these criticisms that the generator god knows how to improve.

If these “criticisms” from the discriminator are so helpful, why don’t we pay more attention to them? Let’s see how the paper “U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation” does this.

The U-GAT-IT project tackles a difficult task: converting a human photo into a Japanese anime image. It is difficult because an anime character’s face is vastly different from a real person’s face. Take the pair of images below as an example. The anime character on the right is deemed a good conversion of the person on the left. But if we put ourselves in the computer’s shoes for a moment, we will see that the eyes, nose, and mouth in the two images are very different, and the structure and proportions of the face also change a lot. It is very hard for a computer to know which features to preserve and which to modify.

a human/anime character pair. source

U-GAT-IT deals with this difficulty in a smart way: it consults the discriminator. Since the discriminator tells a good anime image from a bad one, it must know where to look. Below is an image showing a heatmap of where the discriminator pays attention.

A heatmap of discriminator’s attention. source

U-GAT-IT uses this discriminator attention heatmap as an attention guide for the generator. The generator then knows that the areas the discriminator focuses on (e.g. the big eyes) are the ones it needs to change more towards the anime style, and the areas the discriminator ignores (e.g. the hair and the clothes) are the ones where it can preserve more of the source photo.
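To give a flavor of how such a heatmap can be read out of a discriminator, here is a small PyTorch sketch of a CAM-style attention map, in the spirit of U-GAT-IT’s auxiliary classifier. The encoder, layer sizes, and real/fake head are simplified stand-ins of my own, not the paper’s exact architecture.

# A sketch of a CAM-style attention heatmap from a discriminator (illustrative only).
import torch
import torch.nn as nn

class CAMDiscriminator(nn.Module):
    def __init__(self, in_channels=3, feat_channels=64):
        super().__init__()
        # Convolutional encoder producing spatial feature maps.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(feat_channels, feat_channels, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
        )
        # Auxiliary classifier: one weight per feature channel (CAM weights).
        self.cam_logit = nn.Linear(feat_channels, 1, bias=False)
        # Real/fake head operating on the attended features.
        self.head = nn.Conv2d(feat_channels, 1, kernel_size=4, padding=1)

    def forward(self, x):
        feat = self.encoder(x)                              # (b, c, h, w)
        # Global-average-pool the features and classify; the classifier's
        # per-channel weights tell us which feature channels matter.
        pooled = feat.mean(dim=[2, 3])                      # (b, c)
        cam_logit = self.cam_logit(pooled)                  # (b, 1)
        weights = self.cam_logit.weight.view(1, -1, 1, 1)   # (1, c, 1, 1)
        # Heatmap: channel-weighted sum of the feature maps.
        heatmap = (feat * weights).sum(dim=1, keepdim=True) # (b, 1, h, w)
        out = self.head(feat * weights)                     # attended features
        return out, cam_logit, heatmap

# The heatmap can be visualized, or fed back to guide the generator.
img = torch.randn(1, 3, 128, 128)
out, cam_logit, heatmap = CAMDiscriminator()(img)
print(heatmap.shape)  # torch.Size([1, 1, 32, 32])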

Meaning 3: Attention in the generator

We can also leverage information inside the generator itself to provide the attention guide. An example is “GANimation: Anatomically-aware Facial Animation from a Single Image”. It aims to drive facial animation through Action Units (AU), which describe anatomical facial movements. The image below shows how a facial animation is created by changing the AU value that corresponds to smiling.

α is the AU for smiling. source

Although the word “attention” does not appear in the paper title, the abstract mentions that they “exploit attention mechanisms that make the network robust”. They achieve this by letting the generator produce two outputs: an image and a mask. The mask is actually an attention map indicating which parts of the image need to be modified by the AU and which parts should remain intact. The mask is then used to combine the source image and the generated image into the final result.

The generator produces an image and a mask, which are combined into the final result. source

In this way, they guide the model to focus on performing well on the parts affected by AUs. The parts not affected by AUs are allowed to have defects in the generated image, but these won’t damage the quality of the final result because the mask/attention map filters those parts out.
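The blending step itself is simple. Below is a minimal sketch of this kind of mask-based composition. The convention that a mask value near 1 means “keep the source pixel” is my reading of the paper; some re-implementations flip it, and the tensors here are just placeholder data.

# A sketch of blending a generated image with the source via an attention mask.
import torch

def compose(source, color_output, attention_mask):
    # attention_mask is in [0, 1]; values near 1 keep the source pixel,
    # values near 0 take the generated pixel (assumed convention).
    return attention_mask * source + (1.0 - attention_mask) * color_output

source = torch.rand(1, 3, 128, 128)        # input photo
color_output = torch.rand(1, 3, 128, 128)  # generator's color output
mask = torch.rand(1, 1, 128, 128)          # generator's attention mask
final = compose(source, color_output, mask)
print(final.shape)  # torch.Size([1, 3, 128, 128])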

Meaning 4: Attention for sparse input

There is a GitHub project, AttentionedDeepPaint, which colorizes a line-art by following a reference image. As its name suggests, it uses attention mechanisms in its network structure, and it pays attention to the sparse information in its inputs.

some results from AttentionedDeepPaint. source

What I mean by “sparse information” is that, if you compare a line-art to a grey-scale image, it is obvious that the line-art contains much less content, or in other words, very sparse information. This is why an AI model can colorize a grey-scale image much better than a line-art.

a grey-scale image and its corresponding line-art

AttentionedDeepPaint uses a typical U-Net model structure. On top of that, at the end of the downsampling part of the U-Net, the information extracted from the line-art is used as an attention map to guide the model to pay attention to the important lines in the input image rather than the empty space, as sketched below. For more technical details, please refer to this paper.
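As a rough illustration of what applying an attention map at the U-Net bottleneck can look like, here is a small PyTorch sketch. The way the map is produced and the layer sizes are my own assumptions for illustration, not the project’s actual architecture.

# An illustrative sketch of spatial attention at a U-Net bottleneck (assumed design).
import torch
import torch.nn as nn

class BottleneckAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Produce a single-channel spatial attention map from the encoded
        # line-art features, then use it to reweight those features.
        self.to_map = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, encoded):
        attn_map = self.to_map(encoded)  # (b, 1, h, w), values in [0, 1]
        # Emphasize locations on and around the sketch lines; de-emphasize
        # empty space before the decoder upsamples the features.
        return encoded * attn_map, attn_map

bottleneck = torch.randn(2, 256, 16, 16)  # encoder output for a line-art
attended, attn_map = BottleneckAttention(256)(bottleneck)
print(attended.shape, attn_map.shape)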

Conclusion

Many papers have shown that attention mechanisms can help GANs to improve generation quality and robustness.

However, attention mechanisms are not a secret ingredient that you can throw into your GAN soup to immediately improve its taste. To apply them well, it is necessary to be clear about what problem we need to solve and where we need to pay attention. And do not forget that adding an attention module comes with a cost: it makes the model bigger, makes training slower, and requires more computational resources during inference.

This article is by no means a thorough survey, but hopefully it gives you a head start on your journey of exploring attention in GANs.
