Image generation using a text description: StackGAN

Teach a computer to be an artist

Donnaphat Trakulwaranont
7 min read · Aug 4, 2020
A painting of a robin bird (https://youtu.be/E05Ei6kbaJA)

Introduction

In recent years, neural networks have come to perform many tasks remarkably well. Two of the most active fields are computer vision and natural language processing, and interesting problems arise when we combine them. One well-known challenge at this intersection is "image captioning": generating a text description of a given image, as shown in the figure below.

Example of Image captioning

Our topic is "text-to-image generation", which generates an image from a given text description. It is the reverse of the image captioning problem.

Example of text to image generation

Text-to-image generation has many applications: generating a scene from an adventure book, photo editing driven by natural language, and computer-aided content creation. It can also be applied commercially, for example to design products such as furniture and clothes from natural-language descriptions.

Outline

In the first section, I will introduce some important background topics: conditional generative adversarial networks (conditional GANs) and conditioning augmentation. Next, I will introduce StackGAN, which combines conditional GANs and conditioning augmentation to generate images from text descriptions, along with the Caltech-UCSD Birds dataset. Furthermore, I will show an implementation of StackGAN using Python and the PyTorch library. Finally, I will discuss the most important parts of StackGAN and related work.

Conditional Generative Adversarial Networks (Conditional GANs)

Basic Flow of Conditional Generative Adversarial Networks

This network is an extended version of generative adversarial networks (GANs). In a typical GAN, the generator G takes only a noise vector z as input and produces fake data G(z) whose distribution should be close to the real data distribution, with the goal of fooling the discriminator; the discriminator D tries to distinguish generated data D(G(z)) from real data D(x). A conditional GAN adds extra information y, such as a class label or data from another modality. This condition variable is combined with the noise and fed to the generator, which generates data conditioned on it, G(z, y); likewise, the discriminator separates generated data D(G(z, y), y) from real data D(x, y) based on the same condition. The structure of a conditional GAN is shown in the figure above. To obtain a conditional GAN, we modify the original GAN objective by adding the condition variable to the min-max function, as shown in the equation below:
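$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x \mid y)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z \mid y))\right)\right]$$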

The objective function of conditional GANs

Conditioning augmentation

When training neural networks on extra conditioning information, we can face a lack of conditions in some cases, and this problem is sometimes a cause of overfitting. Conditioning augmentation can be used to solve this lack-of-data problem: it uses the statistics of the embedding to form a new conditioning distribution, as explained by the equation below:
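$$\hat{c}_0 = \mu_0 + \sigma_0 \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$$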

where ĉ₀ is the conditioning latent variable, μ₀ is the mean of the embedding vector, σ₀ is the diagonal of the covariance matrix of the embedding vector, and ε is sampled from the standard normal distribution N(0, I). This equation can also be illustrated as shown below.

Conditioning Augmentation

Another component of this method is the Kullback-Leibler divergence, often called KLDiv. It is used as a regularization term in the training loss to keep the new approximate conditioning distribution close to the original conditioning distribution. Put simply, KLDiv measures the logarithmic difference between the original and approximate condition distributions, and it can be written as:
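$$D_{KL}\left(\mathcal{N}(\mu(\phi_t), \Sigma(\phi_t)) \,\middle\|\, \mathcal{N}(0, I)\right)$$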

where μ(ϕt) and Σ(ϕt) are the mean and covariance of the embedding vector. In our case, the embedding vector is a text embedding.

Caltech-UCSD Birds (CUB) Dataset

The CUB dataset contains 11,788 images covering 200 categories of North American birds, and each image comes with 10 captions describing what the bird in the image looks like.

Caltech-UCSD-Bird dataset (http://www.vision.caltech.edu/visipedia/CUB-200.html)

StackGAN

The architecture of StackGAN consists of two stages of generative networks: the first generates an initial low-resolution image, and the second refines it into a high-resolution image.

StackGAN can thus be divided into two main parts, Stage-I and Stage-II, which generate the low-resolution image and refine it to high resolution, respectively, as shown in the figure above.

Stage-I StackGAN

This part generates a rough image from the text description and noise drawn from the latent space. First, we embed the text description into a text embedding vector φt and feed it to conditioning augmentation to produce the conditioning latent variable ĉ₀, which is concatenated with a noise vector z sampled from the latent space to form the generator input. The result is a low-resolution image that does not look good yet but already contains some of the bird's structure. To optimize Stage-I, we train the discriminator and the generator to reduce their respective losses, LD₀ and LG₀, shown below:
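$$\mathcal{L}_{D_0} = \mathbb{E}_{(I_0, t) \sim p_{\mathrm{data}}}\left[\log D_0(I_0, \phi_t)\right] + \mathbb{E}_{z \sim p_z,\, t \sim p_{\mathrm{data}}}\left[\log\left(1 - D_0(G_0(z, \hat{c}_0), \phi_t)\right)\right]$$

$$\mathcal{L}_{G_0} = \mathbb{E}_{z \sim p_z,\, t \sim p_{\mathrm{data}}}\left[\log\left(1 - D_0(G_0(z, \hat{c}_0), \phi_t)\right)\right] + \lambda\, D_{KL}\left(\mathcal{N}(\mu_0(\phi_t), \Sigma_0(\phi_t)) \,\middle\|\, \mathcal{N}(0, I)\right)$$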

The loss function for the stage-I discriminator and the stage-I generator

Stage-II StackGAN

After Stage-I finishes generating a low-resolution image, we use that image as the input to the Stage-II generator to produce a high-resolution image. To sample the generator input, we draw the generated low-resolution image s₀ from pG₀ and use conditioning augmentation to sample a conditioning latent variable, just as in Stage-I. The discriminator again separates generated high-resolution images from real high-resolution images based on the text description, as in the previous stage. Stage-II uses the loss functions defined below:
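$$\mathcal{L}_D = \mathbb{E}_{(I, t) \sim p_{\mathrm{data}}}\left[\log D(I, \phi_t)\right] + \mathbb{E}_{s_0 \sim p_{G_0},\, t \sim p_{\mathrm{data}}}\left[\log\left(1 - D(G(s_0, \hat{c}), \phi_t)\right)\right]$$

$$\mathcal{L}_G = \mathbb{E}_{s_0 \sim p_{G_0},\, t \sim p_{\mathrm{data}}}\left[\log\left(1 - D(G(s_0, \hat{c}), \phi_t)\right)\right] + \lambda\, D_{KL}\left(\mathcal{N}(\mu(\phi_t), \Sigma(\phi_t)) \,\middle\|\, \mathcal{N}(0, I)\right)$$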

The loss function for the stage-II discriminator and the stage-II generator

StackGAN implementation

First, download this dataset:
https://drive.google.com/file/d/17pAhiNfpk67brvdVm505IkA0XaCLMTd4/view?usp=sharing
I have already included the encoded captions and the image dataset in this zip file; just extract it into the project directory.

Dataset

Next, we implement the dataset class used to train StackGAN, sketched below.
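A minimal PyTorch Dataset might look like the following. The embeddings.pickle file name and its layout (a dict mapping image paths to an array of 10 caption embeddings per image) are my assumptions about the archive, so adjust them to match the extracted files:

```python
import pickle
from pathlib import Path

import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms


class CUBDataset(Dataset):
    """Pairs each bird image with one of its precomputed caption embeddings.

    Assumes the extracted archive contains an `images/` folder and an
    `embeddings.pickle` file mapping relative image paths to a
    (10, embedding_dim) array of caption embeddings -- adjust the paths
    and keys to match the actual archive layout.
    """

    def __init__(self, root, image_size=64):
        self.root = Path(root)
        with open(self.root / "embeddings.pickle", "rb") as f:
            self.embeddings = pickle.load(f)  # {rel_path: (10, dim) array}
        self.filenames = sorted(self.embeddings.keys())
        self.transform = transforms.Compose([
            transforms.Resize(int(image_size * 76 / 64)),
            transforms.RandomCrop(image_size),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize((0.5,) * 3, (0.5,) * 3),  # scale to [-1, 1]
        ])

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, idx):
        name = self.filenames[idx]
        image = Image.open(self.root / "images" / name).convert("RGB")
        # Pick one of the 10 captions at random for this image.
        captions = torch.as_tensor(self.embeddings[name], dtype=torch.float32)
        embedding = captions[torch.randint(len(captions), (1,))].squeeze(0)
        return self.transform(image), embedding
```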

Stage-I

Next, we implement conditioning augmentation and the Stage-I StackGAN architecture, including the generator and discriminator. The generator is composed of upsampling blocks and convolution layers for generating images; its input is the concatenation of the noise with the conditioning latent variable. The discriminator is composed of convolution, batch-norm, and leaky-ReLU layers; it outputs an encoded image, which is then classified together with the condition in the condition classifier.
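A condensed sketch of these modules is below. The widths (ngf, ndf, the 128-dimensional conditioning variable, the 100-dimensional noise) follow the paper's setup, but the exact block layout is my simplification rather than the official implementation:

```python
import torch
import torch.nn as nn


class ConditioningAugmentation(nn.Module):
    """Maps a text embedding to (mu, logvar) and samples c0 = mu + sigma * eps."""

    def __init__(self, embed_dim=1024, c_dim=128):
        super().__init__()
        self.fc = nn.Linear(embed_dim, c_dim * 2)

    def forward(self, text_embedding):
        mu, logvar = self.fc(text_embedding).chunk(2, dim=1)
        eps = torch.randn_like(mu)
        c0 = mu + torch.exp(0.5 * logvar) * eps  # reparameterization trick
        return c0, mu, logvar


def up_block(in_ch, out_ch):
    """Nearest-neighbour upsampling followed by a 3x3 convolution."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class StageIGenerator(nn.Module):
    """Turns [noise ++ conditioning variable] into a 64x64 image."""

    def __init__(self, z_dim=100, c_dim=128, ngf=128):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(z_dim + c_dim, ngf * 8 * 4 * 4, bias=False),
            nn.BatchNorm1d(ngf * 8 * 4 * 4),
            nn.ReLU(inplace=True),
        )
        self.upsample = nn.Sequential(
            up_block(ngf * 8, ngf * 4),  # 4 -> 8
            up_block(ngf * 4, ngf * 2),  # 8 -> 16
            up_block(ngf * 2, ngf),      # 16 -> 32
            up_block(ngf, ngf // 2),     # 32 -> 64
        )
        self.to_rgb = nn.Sequential(nn.Conv2d(ngf // 2, 3, 3, padding=1), nn.Tanh())

    def forward(self, z, c0):
        h = self.fc(torch.cat([z, c0], dim=1)).view(z.size(0), -1, 4, 4)
        return self.to_rgb(self.upsample(h))


class StageIDiscriminator(nn.Module):
    """Downsamples a 64x64 image, then scores it jointly with the condition."""

    def __init__(self, ndf=64, c_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(  # 64 -> 4 spatially
            nn.Conv2d(3, ndf, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1), nn.BatchNorm2d(ndf * 2), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1), nn.BatchNorm2d(ndf * 4), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1), nn.BatchNorm2d(ndf * 8), nn.LeakyReLU(0.2, inplace=True),
        )
        self.classifier = nn.Sequential(
            nn.Conv2d(ndf * 8 + c_dim, ndf * 8, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf * 8, 1, 4), nn.Sigmoid(),
        )

    def forward(self, image, c0):
        h = self.encoder(image)  # (B, ndf*8, 4, 4)
        # Replicate the condition spatially and concatenate with image features.
        c = c0.view(c0.size(0), -1, 1, 1).expand(-1, -1, 4, 4)
        return self.classifier(torch.cat([h, c], dim=1)).view(-1)
```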

Next, we train both the generator and the discriminator on the 64 x 64 bird images, as below. I set up the configuration following the paper: batch size 64, the Adam optimizer with an initial learning rate of 0.0002, 600 training epochs, and a learning-rate decay every 100 epochs.
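A sketch of that training loop, assuming the modules and dataset defined above; the KL weight is an assumption I took from the reference implementation, so treat it as a tunable hyperparameter:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"

ca = ConditioningAugmentation().to(device)
net_g = StageIGenerator().to(device)
net_d = StageIDiscriminator().to(device)

opt_g = torch.optim.Adam(list(net_g.parameters()) + list(ca.parameters()),
                         lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(net_d.parameters(), lr=2e-4, betas=(0.5, 0.999))
# Halve the learning rate every 100 epochs.
sched_g = torch.optim.lr_scheduler.StepLR(opt_g, step_size=100, gamma=0.5)
sched_d = torch.optim.lr_scheduler.StepLR(opt_d, step_size=100, gamma=0.5)

KL_COEFF = 2.0  # KL weight (assumption based on the reference implementation)
loader = DataLoader(CUBDataset("birds", image_size=64), batch_size=64, shuffle=True)

for epoch in range(600):
    for real, embedding in loader:
        real, embedding = real.to(device), embedding.to(device)
        ones = torch.ones(real.size(0), device=device)
        zeros = torch.zeros(real.size(0), device=device)

        z = torch.randn(real.size(0), 100, device=device)
        c0, mu, logvar = ca(embedding)
        fake = net_g(z, c0)

        # Discriminator: push real (image, condition) pairs to 1, fakes to 0.
        d_loss = (F.binary_cross_entropy(net_d(real, c0.detach()), ones) +
                  F.binary_cross_entropy(net_d(fake.detach(), c0.detach()), zeros))
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

        # Generator: fool the discriminator, plus the KL regularization term.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        g_loss = F.binary_cross_entropy(net_d(fake, c0), ones) + KL_COEFF * kl
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()
    sched_g.step()
    sched_d.step()
```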

Stage-II

For the stage-II generator, I define the model following the paper. First, a convolutional image encoder encodes the image from the stage-I generator. Next, a few residual layers are added; in this code I add 4 residual blocks, before the upsampling layers create the new higher-resolution image.
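A sketch of that generator, reusing the up_block helper from the Stage-I snippet; the channel widths are my assumptions in the spirit of the paper:

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with a skip connection."""

    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, bias=False), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1, bias=False), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return torch.relu(x + self.block(x))


class StageIIGenerator(nn.Module):
    """Refines a 64x64 Stage-I image into a 256x256 image, conditioned on text."""

    def __init__(self, c_dim=128, ngf=64, n_residual=4):
        super().__init__()
        # Image encoder: 64x64x3 -> 16x16x(ngf*4).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, ngf, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ngf, ngf * 2, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf * 2), nn.ReLU(inplace=True),
            nn.Conv2d(ngf * 2, ngf * 4, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf * 4), nn.ReLU(inplace=True),
        )
        # Fuse the spatially replicated conditioning variable with image features.
        self.fuse = nn.Sequential(
            nn.Conv2d(ngf * 4 + c_dim, ngf * 4, 3, padding=1, bias=False),
            nn.BatchNorm2d(ngf * 4), nn.ReLU(inplace=True),
        )
        self.residual = nn.Sequential(*[ResidualBlock(ngf * 4) for _ in range(n_residual)])
        # 16 -> 256 through four upsampling blocks.
        self.upsample = nn.Sequential(
            up_block(ngf * 4, ngf * 2),
            up_block(ngf * 2, ngf),
            up_block(ngf, ngf // 2),
            up_block(ngf // 2, ngf // 4),
        )
        self.to_rgb = nn.Sequential(nn.Conv2d(ngf // 4, 3, 3, padding=1), nn.Tanh())

    def forward(self, stage1_image, c):
        h = self.encoder(stage1_image)  # (B, ngf*4, 16, 16)
        c = c.view(c.size(0), -1, 1, 1).expand(-1, -1, h.size(2), h.size(3))
        h = self.fuse(torch.cat([h, c], dim=1))
        return self.to_rgb(self.upsample(self.residual(h)))
```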

Next is the stage-II model code, trained on the 256 x 256 bird images. I again follow the paper's configuration, except that I set the batch size to 16 (I don't have enough GPU memory for a batch size of 64); I use the Adam optimizer with an initial learning rate of 0.0002, train for 600 epochs, and decay the learning rate every 100 epochs, the same as stage-I.
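The loop itself mirrors Stage-I, so below is only the part that differs: feeding a frozen Stage-I sample s₀ to the generator instead of raw noise. The checkpoint file names are my assumptions:

```python
# Stage-II training reuses the Stage-I loop structure; the key difference is
# that the generator consumes a fixed Stage-I image s0 rather than raw noise.
net_g1 = StageIGenerator().to(device)
ca1 = ConditioningAugmentation().to(device)
net_g1.load_state_dict(torch.load("stage1_generator.pth"))  # assumed names
ca1.load_state_dict(torch.load("stage1_ca.pth"))
net_g1.eval()
ca1.eval()

net_g2 = StageIIGenerator().to(device)
ca2 = ConditioningAugmentation().to(device)  # Stage-II learns its own CA
opt_g2 = torch.optim.Adam(list(net_g2.parameters()) + list(ca2.parameters()),
                          lr=2e-4, betas=(0.5, 0.999))

# Batch size 16 instead of 64 so 256x256 training fits in GPU memory.
loader = DataLoader(CUBDataset("birds", image_size=256), batch_size=16, shuffle=True)

for real_hr, embedding in loader:
    embedding = embedding.to(device)
    with torch.no_grad():  # Stage-I stays fixed while training Stage-II
        z = torch.randn(embedding.size(0), 100, device=device)
        s0 = net_g1(z, ca1(embedding)[0])
    c, mu, logvar = ca2(embedding)
    fake_hr = net_g2(s0, c)
    # ...discriminator and generator updates proceed exactly as in Stage-I,
    # with a Stage-II discriminator that operates on 256x256 images.
```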

Below is an example of the generated images at each epoch.

Conclusion

Finally, we obtain bird images generated from text descriptions. In my view, the most important part of this model is the conditioning augmentation module, because it forms the conditioning distribution. Once we have a conditioning distribution, we effectively have many more image-caption pairs, so the model can learn from varied inputs and succeed at the image generation task. The other key technique is the two-stage generative adversarial network that refines a low-resolution image into a high-resolution one. It is like sketching an image first, then refining and painting it to create the best result.

GitHub code:

Reference

Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, Dimitris Metaxas. "StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks." ICCV 2017.
