Implementing StackGAN using Keras — Text to Photo-Realistic Image Synthesis

Rajat Garg
7 min read · Feb 18, 2019


# Replicating StackGAN results in Keras

Results from StackGAN Paper

“Generative Adversarial Networks (GAN) is the most interesting idea in the last ten years in machine learning.” — Yann LeCun, Director, Facebook AI

Recent developments in the field of deep learning often make me believe that we are indeed living in an exciting time. One such research paper I came across is "StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks", which proposes a deep learning architecture capable of creating photo-realistic images from given text. In this article, we will replicate the results of this wonderful research paper using Keras.

# Introduction

First, let's state some basic preliminary terms that will help us understand the concept of StackGAN.

Generative Adversarial Networks (GANs) are composed of two models, a Generator and a Discriminator, that are trained alternately to compete with one another. The Generator creates new instances of an object, while the Discriminator determines whether a given instance is real (from the actual dataset) or generated.
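
To make the two-player game concrete, one training step usually looks like the following in Keras. This is a minimal sketch, assuming `generator`, `discriminator`, and a combined `gan` model have already been built and compiled, and `real_images` is a batch sampled from the dataset:

```python
import numpy as np

batch_size, z_dim = 64, 100
z = np.random.normal(0, 1, (batch_size, z_dim))   # random noise
fake_images = generator.predict(z)                # new, generated instances

# The Discriminator learns to label real instances 1 and generated ones 0.
discriminator.train_on_batch(real_images, np.ones((batch_size, 1)))
discriminator.train_on_batch(fake_images, np.zeros((batch_size, 1)))

# The Generator is trained through the combined model to fool the Discriminator.
gan.train_on_batch(z, np.ones((batch_size, 1)))
```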

For those who want to know more about GANs, I recommend this great article from Derrick Mwiti.

A Conditional GAN is an extension of a GAN in which both the Generator and the Discriminator receive an additional conditioning variable c, allowing the Generator to generate images conditioned on c.
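
In Keras terms, conditioning usually just means both networks take c as an extra input, for example concatenated with the noise vector. Here is a toy sketch (the layer sizes and names are made up for illustration; this is not the StackGAN code):

```python
from tensorflow.keras.layers import Concatenate, Dense, Input
from tensorflow.keras.models import Model

z_dim, c_dim = 100, 128          # noise and conditioning-vector sizes

z = Input(shape=(z_dim,))        # random noise
c = Input(shape=(c_dim,))        # conditioning variable, e.g. a text embedding
x = Concatenate()([z, c])        # the generator sees noise *and* condition
x = Dense(256, activation='relu')(x)
image = Dense(64 * 64 * 3, activation='tanh')(x)
conditional_generator = Model([z, c], image)
```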

In case you want to know more about Conditional GANs, I recommend this great short article from Connor Shorten.

That's enough small talk. Let's dive deep into StackGAN. This article is divided into the following subsections for better readability:

  • StackGAN: Text to Photo-Realistic Image Synthesis
  • Model Architecture of StackGAN
  • Results from StackGAN research paper
  • Preparation of Dataset
  • Implementation of Stage I of StackGAN
  • Implementation of Stage II of StackGAN
  • Conclusion

By the end of this article, you will have a working model replicating the results of the StackGAN research paper, generating photo-realistic images from text. Moreover, you will know how to train StackGAN on your own dataset or problem statement.

All the code will be shared on my GitHub repository.

This article focuses more on implementation details than on the conceptual details of the StackGAN algorithm. However, I will link to relevant sources for a better understanding of the algorithm.

# StackGAN: Text to Photo-Realistic Image Synthesis

How about an interesting challenge: design an algorithm that generates photo-realistic images from a given text description.

Example:

(Text Input): The bird is black with green and has a very short beak

(Output — Generated photo-realistic images):

Input text: The bird is black with green and has a very short beak

The bird images above were not captured by a camera; they were created by a computer using a deep learning architecture called StackGAN. StackGAN is not the first attempt at creating photo-realistic images from text. There were multiple earlier attempts, but they were unable to generate high-quality images from text descriptions, and their outputs failed to contain the necessary details and vivid object parts. Stacked Generative Adversarial Networks (StackGAN) can generate 256×256 photo-realistic images conditioned on text descriptions.

This raises some important questions: "Why is StackGAN able to create such high-resolution photo-realistic images?" and "What's different about StackGAN?"

StackGAN decomposes the hard problem into more manageable sub-problems through a sketch-refinement process. The Stage-I GAN sketches the primitive shape and colors of the object based on the given text description, yielding Stage-I low-resolution images. The Stage-II GAN takes Stage-I results and text descriptions as inputs and generates high-resolution images with photo-realistic details. It is able to rectify defects in Stage-I results and add compelling details with the refinement process.
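
Conceptually, the pipeline looks like this (shapes only; the actual Keras code follows later in the article):

```python
# Two-stage sketch-refinement pipeline (pseudocode):
low_res = stage1_generator([text_embedding, noise])     # 64x64 primitive sketch
high_res = stage2_generator([text_embedding, low_res])  # 256x256 refined image
```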

That's enough theory for this article, but I would highly suggest you go through the StackGAN research paper. Here is also a great article on StackGAN by Nick Barone.

# Model Architecture of StackGAN

Model Architecture of StackGAN

The model architecture of StackGAN consists of mainly the following components:

  • Embedding: converts the variable-length input text into a fixed-length vector. We will be using a pre-trained character-level embedding.
  • Conditioning Augmentation (CA): samples the conditioning vector from a Gaussian around the text embedding (see the sketch after this list).
  • Stage I Generator: generates low-resolution (64×64) images.
  • Stage I Discriminator
  • Residual Blocks
  • Stage II Generator: generates high-resolution (256×256) images.
  • Stage II Discriminator
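
Since Conditioning Augmentation is the least familiar piece, here is a minimal sketch of it in Keras: a fully connected layer maps the text embedding to a mean and a log standard deviation, and the conditioning vector is sampled with the reparameterization trick. The dimensions (1024 for the char-CNN-RNN embedding, 128 for the conditioning vector) follow the paper; the function name is mine:

```python
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Dense, Input, Lambda, LeakyReLU
from tensorflow.keras.models import Model

def build_ca_network(embedding_dim=1024, condition_dim=128):
    """Conditioning Augmentation: map a text embedding to (mu, log_sigma),
    then sample c = mu + exp(log_sigma) * eps (reparameterization trick)."""
    embedding = Input(shape=(embedding_dim,))
    x = Dense(condition_dim * 2)(embedding)
    x = LeakyReLU(alpha=0.2)(x)
    mu = Lambda(lambda t: t[:, :condition_dim])(x)
    log_sigma = Lambda(lambda t: t[:, condition_dim:])(x)

    def sample(args):
        mu, log_sigma = args
        eps = K.random_normal(shape=K.shape(mu))
        return mu + K.exp(log_sigma) * eps

    c = Lambda(sample)([mu, log_sigma])
    return Model(embedding, [c, mu, log_sigma])
```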

Describing each component in full would make this article very long, so I would highly suggest you read the StackGAN research paper for more details about each of them. The aim of this article is to implement StackGAN using Keras.

# Results from StackGAN Research Paper

Example results from StackGAN, GAWWN, and GAN-INT-CLS, conditioned on text descriptions from the CUB test set (figure from the paper)

# Preparation of Dataset

We train our model on the CUB dataset, which contains 200 bird species with 11,788 images. Since 80% of the birds in this dataset have object-image size ratios of less than 0.5, as a pre-processing step we crop all images so that the birds' bounding boxes have object-image size ratios greater than 0.75.
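
A cropping helper along these lines can implement that pre-processing step. This is a sketch that roughly follows the reference implementation; `bbox` is the `(x, y, width, height)` entry from CUB's bounding_boxes.txt:

```python
import numpy as np
from PIL import Image

def load_and_crop(image_path, bbox, target_size=(64, 64)):
    """Crop around the bird's bounding box so the bird fills most of the
    frame, then resize to the target resolution."""
    img = Image.open(image_path).convert('RGB')
    width, height = img.size
    if bbox is not None:
        r = int(np.maximum(bbox[2], bbox[3]) * 0.75)
        center_x = int((2 * bbox[0] + bbox[2]) / 2)
        center_y = int((2 * bbox[1] + bbox[3]) / 2)
        x1, x2 = max(0, center_x - r), min(width, center_x + r)
        y1, y2 = max(0, center_y - r), min(height, center_y + r)
        img = img.crop((x1, y1, x2, y2))
    return img.resize(target_size, Image.BILINEAR)
```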

Let’s first download the dataset:

https://drive.google.com/open?id=0B3y_msrWZaXLT1BZdVdycDY5TEE — Download the birds.zip file and extract it inside the root directory.

Download the CUB dataset using the command below, or directly from the link, and extract it inside the root directory —

wget http://www.vision.caltech.edu/visipedia-data/CUB-200-2011/CUB_200_2011.tgz

Also, create two empty folders: logs (your model logs will be saved here) and results (images generated by our Stage I StackGAN will be saved here).

Now, your directory should look like this:

Root Directory

  • CUB_200_2011 — the images; more information about the dataset can be found here: http://www.vision.caltech.edu/visipedia/CUB-200.html
  • char-CNN-RNN-embeddings.pickle — pickle file with the pre-trained embeddings of the text descriptions.
  • filenames.pickle — pickle file with the filenames of the images.
  • class_info.pickle — pickle file with the class information for each image.
  • logs — empty folder for the model logs.
  • results — empty folder for the generated images.

During training, the discriminator takes real images with their corresponding text descriptions as positive sample pairs. Real images paired with mismatched text embeddings, and synthetic images paired with their corresponding text embeddings, are treated as negative sample pairs.
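
Mismatched pairs are easy to construct: shift the embeddings by one position relative to the images, so every image is paired with some other image's description. A minimal sketch with stand-in arrays:

```python
import numpy as np

images = np.random.uniform(-1, 1, (8, 64, 64, 3))  # stand-in image batch
embeddings = np.random.uniform(-1, 1, (8, 1024))   # stand-in text embeddings

# Positive pairs:   (images[i], embeddings[i])        -> label 1
# Mismatched pairs: (images[i], wrong_embeddings[i])  -> label 0
wrong_embeddings = np.roll(embeddings, 1, axis=0)
```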

# Implementation of Stage I of StackGAN

Let's first implement Stage I of the StackGAN. As shown in the model architecture, Stage I takes text as input; we convert the text to an embedding using our pre-trained character-level embedding model. The embedding goes through Conditioning Augmentation (CA) and then into the Stage I Generator, which produces low-resolution 64×64 images. To train the Stage I Discriminator, we build a feature representation of the generated (or real) image and concatenate it with the embedding. This is a very rough description of the Stage I GAN; I would highly suggest you read the research paper for a better understanding of the architecture.

Let's first import some libraries:
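
A plausible set for the sketches that follow (I use the tf.keras API throughout; adjust the imports if you are on standalone Keras):

```python
import os
import pickle

import numpy as np
import pandas as pd
from PIL import Image

import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.layers import (BatchNormalization, Concatenate, Conv2D,
                                     Dense, Flatten, Input, Lambda, LeakyReLU,
                                     ReLU, Reshape, UpSampling2D, add)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
```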

Now, let's write some functions to load our dataset:
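
A sketch of these loaders, assuming the pickle files and metadata described in the directory listing above:

```python
def load_class_ids(class_info_path):
    """Load the class id for every image from class_info.pickle."""
    with open(class_info_path, 'rb') as f:
        return pickle.load(f, encoding='latin1')

def load_filenames(filenames_path):
    """Load the list of image filenames from filenames.pickle."""
    with open(filenames_path, 'rb') as f:
        return pickle.load(f, encoding='latin1')

def load_embeddings(embeddings_path):
    """Load the pre-trained char-CNN-RNN text embeddings."""
    with open(embeddings_path, 'rb') as f:
        embeddings = pickle.load(f, encoding='latin1')
    return np.array(embeddings)

def load_bounding_boxes(dataset_dir):
    """Map each filename to its bounding box using CUB's metadata files."""
    bboxes = pd.read_csv(os.path.join(dataset_dir, 'bounding_boxes.txt'),
                         sep=' ', header=None).astype(int)
    names = pd.read_csv(os.path.join(dataset_dir, 'images.txt'),
                        sep=' ', header=None)
    filename_bbox = {}
    for i, fname in enumerate(names[1]):
        filename_bbox[fname[:-4]] = bboxes.iloc[i][1:].tolist()
    return filename_bbox
```

These combine with the `load_and_crop` helper from the dataset section to produce the image arrays.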

Now, we build our Stage I architecture:
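
Below is a compact sketch of the three Stage I models: the generator (which reuses `build_ca_network` from earlier and also outputs `(mu, log_sigma)` for the KL loss), the discriminator (which tiles a compressed text embedding over the image features), and the combined adversarial model. The layer sizes follow the paper's 64×64 design, but treat this as an outline rather than the exact reference implementation:

```python
def build_stage1_generator(z_dim=100, embedding_dim=1024, condition_dim=128):
    """Text embedding -> CA -> concat with noise -> 64x64 RGB image."""
    embedding = Input(shape=(embedding_dim,))
    z = Input(shape=(z_dim,))

    # Conditioning Augmentation network from the earlier sketch.
    c, mu, log_sigma = build_ca_network(embedding_dim, condition_dim)(embedding)
    mean_logsigma = Concatenate()([mu, log_sigma])

    x = Concatenate()([c, z])
    x = Dense(128 * 8 * 4 * 4, use_bias=False)(x)
    x = ReLU()(x)
    x = Reshape((4, 4, 128 * 8))(x)

    for filters in (512, 256, 128, 64):            # 4 -> 8 -> 16 -> 32 -> 64
        x = UpSampling2D(size=(2, 2))(x)
        x = Conv2D(filters, kernel_size=3, padding='same', use_bias=False)(x)
        x = BatchNormalization()(x)
        x = ReLU()(x)

    image = Conv2D(3, kernel_size=3, padding='same', activation='tanh')(x)
    # Second output carries (mu, log_sigma) so the KL loss can use them.
    return Model([embedding, z], [image, mean_logsigma])


def build_stage1_discriminator(embedding_dim=1024, compressed_dim=128):
    """Downsample the 64x64 image, tile a compressed text embedding over the
    4x4 feature map, and classify the (image, text) pair."""
    image = Input(shape=(64, 64, 3))
    embedding = Input(shape=(embedding_dim,))

    x = Conv2D(64, 4, strides=2, padding='same')(image)        # 64 -> 32
    x = LeakyReLU(alpha=0.2)(x)
    for filters in (128, 256, 512):                            # 32 -> 4
        x = Conv2D(filters, 4, strides=2, padding='same', use_bias=False)(x)
        x = BatchNormalization()(x)
        x = LeakyReLU(alpha=0.2)(x)

    # Compress the text embedding and replicate it spatially.
    e = Dense(compressed_dim)(embedding)
    e = LeakyReLU(alpha=0.2)(e)
    e = Reshape((1, 1, compressed_dim))(e)
    e = Lambda(lambda t: K.tile(t, (1, 4, 4, 1)))(e)

    x = Concatenate()([x, e])
    x = Conv2D(512, 1, use_bias=False)(x)
    x = BatchNormalization()(x)
    x = LeakyReLU(alpha=0.2)(x)
    x = Flatten()(x)
    out = Dense(1, activation='sigmoid')(x)
    return Model([image, embedding], out)


def build_adversarial_model(generator, discriminator, z_dim=100,
                            embedding_dim=1024):
    """Generator + (frozen) discriminator, used to train the generator."""
    embedding = Input(shape=(embedding_dim,))
    z = Input(shape=(z_dim,))
    image, mean_logsigma = generator([embedding, z])
    valid = discriminator([image, embedding])
    return Model([embedding, z], [valid, mean_logsigma])
```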

Now, let’s write a function for our KL loss:
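
The KL term regularizes the CA Gaussian toward the standard normal, KL(N(mu, sigma) || N(0, I)). Keras losses take (y_true, y_pred), so we pass the generator's (mu, log_sigma) output as y_pred and feed a dummy y_true:

```python
def KL_loss(y_true, y_pred):
    """KL divergence between the CA Gaussian and N(0, I). y_pred is the
    (mu, log_sigma) tensor from the generator; y_true is an ignored dummy."""
    mu = y_pred[:, :128]
    log_sigma = y_pred[:, 128:]
    loss = -log_sigma + 0.5 * (-1 + K.exp(2.0 * log_sigma) + K.square(mu))
    return K.mean(loss)
```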

Now, we write some functions for saving a generated image every 2 epochs and for saving the logs:
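
Two small helpers cover this. The logs and results folders are the ones created earlier; using a tf.summary writer here is my choice, not necessarily what the repository does:

```python
import matplotlib.pyplot as plt

summary_writer = tf.summary.create_file_writer('logs')

def save_rgb_image(image, path):
    """De-normalize an image from [-1, 1] back to [0, 255] and save it."""
    image = ((image + 1.0) * 127.5).astype('uint8')
    plt.imsave(path, image)

def write_log(name, value, step):
    """Log a scalar (e.g. a loss) to TensorBoard."""
    with summary_writer.as_default():
        tf.summary.scalar(name, value, step=step)
```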

Let's write the main function to initialize the hyper-parameters and train our Stage I StackGAN:
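
A minimal training loop tying the pieces together might look like this. `load_dataset` is a hypothetical helper combining the loaders and cropping above, returning images scaled to [-1, 1] with one text embedding per image; the hyper-parameter values are illustrative, not tuned:

```python
def train_stage1(epochs=600, batch_size=64, z_dim=100, condition_dim=128):
    dis = build_stage1_discriminator()
    dis.compile(loss='binary_crossentropy', optimizer=Adam(0.0002, 0.5))

    gen = build_stage1_generator()
    # Standard Keras GAN recipe: freeze the discriminator only inside the
    # combined model (on newer TF versions you may need to toggle
    # dis.trainable around the two train_on_batch calls instead).
    dis.trainable = False
    gan = build_adversarial_model(gen, dis)
    gan.compile(loss=['binary_crossentropy', KL_loss],
                loss_weights=[1.0, 2.0], optimizer=Adam(0.0002, 0.5))

    images, embeddings = load_dataset()        # hypothetical helper, see above
    real = np.ones((batch_size, 1)) * 0.9      # one-sided label smoothing
    fake = np.zeros((batch_size, 1))
    dummy = np.zeros((batch_size, 2 * condition_dim))   # target for KL_loss

    for epoch in range(epochs):
        for i in range(images.shape[0] // batch_size):
            s = slice(i * batch_size, (i + 1) * batch_size)
            img_batch, emb_batch = images[s], embeddings[s]
            z = np.random.normal(0, 1, (batch_size, z_dim))
            fake_imgs, _ = gen.predict([emb_batch, z])

            # Discriminator: real/matching, fake/matching, real/mismatched.
            d_loss = (dis.train_on_batch([img_batch, emb_batch], real) +
                      dis.train_on_batch([fake_imgs, emb_batch], fake) +
                      dis.train_on_batch(
                          [img_batch, np.roll(emb_batch, 1, axis=0)], fake))

            # Generator, trained through the frozen discriminator.
            g_loss = gan.train_on_batch([emb_batch, z], [real, dummy])

        write_log('d_loss', float(d_loss), epoch)
        write_log('g_loss', float(g_loss[0]), epoch)

        if epoch % 2 == 0:                     # sample an image every 2 epochs
            sample, _ = gen.predict([embeddings[:1],
                                     np.random.normal(0, 1, (1, z_dim))])
            save_rgb_image(sample[0], 'results/gen_{}.png'.format(epoch))

    gen.save_weights('stage1_gen.h5')
    dis.save_weights('stage1_dis.h5')
```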

Hopefully, your model will now start training. If not, feel free to comment with your errors and I will be happy to help.

Training of model

After the training of Stage I is complete, two new files will be created in your root directory, stage1_gen.h5 and stage1_dis.h5, containing the trained weights of the Stage I generator and the Stage I discriminator respectively.

You can also see the images generated by your Stage I StackGAN in the results folder and the logs in the logs folder.

Once your Stage I StackGAN is trained, let’s now train our Stage II StackGAN.

NOTE: Training Stage I of StackGAN will take a long time. If you don't have a GPU, I suggest you use Google Colab, which provides a Tesla K80 GPU and 12 GB of RAM for free.

# Implementation of Stage II of StackGAN

Let's implement our Stage II StackGAN. The code below is very similar to the Stage I code, so it should be easy to follow if you understood the Stage I GAN. If you still have doubts, leave a comment and I will be happy to help.
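
Here is a sketch of the two genuinely new pieces: the residual block, and the Stage II generator, which downsamples the Stage I image, fuses it with the spatially replicated conditioning vector, refines it with residual blocks, and upsamples to 256×256. As before, this reuses `build_ca_network`, and the filter counts are an outline of the paper's design rather than a verified reproduction:

```python
def residual_block(x, filters=512):
    """Residual block used in the Stage II generator."""
    shortcut = x
    x = Conv2D(filters, 3, padding='same', use_bias=False)(x)
    x = BatchNormalization()(x)
    x = ReLU()(x)
    x = Conv2D(filters, 3, padding='same', use_bias=False)(x)
    x = BatchNormalization()(x)
    x = add([shortcut, x])
    return ReLU()(x)

def build_stage2_generator(embedding_dim=1024, condition_dim=128):
    """Encode the 64x64 Stage I image, fuse it with the conditioning vector,
    refine with residual blocks, then upsample to 256x256."""
    embedding = Input(shape=(embedding_dim,))
    stage1_image = Input(shape=(64, 64, 3))

    # Conditioning Augmentation, exactly as in Stage I.
    c, mu, log_sigma = build_ca_network(embedding_dim, condition_dim)(embedding)
    mean_logsigma = Concatenate()([mu, log_sigma])

    # Downsample the Stage I result to a 16x16 feature map.
    y = Conv2D(128, 3, padding='same')(stage1_image)
    y = ReLU()(y)
    for filters in (256, 512):                       # 64 -> 32 -> 16
        y = Conv2D(filters, 4, strides=2, padding='same', use_bias=False)(y)
        y = BatchNormalization()(y)
        y = ReLU()(y)

    # Replicate c over the spatial grid and fuse it with the image features.
    c_tiled = Reshape((1, 1, condition_dim))(c)
    c_tiled = Lambda(lambda t: K.tile(t, (1, 16, 16, 1)))(c_tiled)
    y = Concatenate()([y, c_tiled])
    y = Conv2D(512, 3, padding='same', use_bias=False)(y)
    y = BatchNormalization()(y)
    y = ReLU()(y)

    for _ in range(4):                               # refinement
        y = residual_block(y)

    for filters in (512, 256, 128, 64):              # 16 -> 256
        y = UpSampling2D(size=(2, 2))(y)
        y = Conv2D(filters, 3, padding='same', use_bias=False)(y)
        y = BatchNormalization()(y)
        y = ReLU()(y)

    image = Conv2D(3, 3, padding='same', activation='tanh')(y)
    return Model([embedding, stage1_image], [image, mean_logsigma])
```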

Let's now define our hyper-parameters and train our model:
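
The wiring differs from Stage I in one important way: the low-resolution inputs come from the trained, frozen Stage I generator. A sketch of that wiring, where `build_stage2_discriminator` is a hypothetical helper following the Stage I discriminator pattern at 256×256 and `emb_batch` stands for a batch of text embeddings:

```python
batch_size = 64                    # reduce this if you hit OOM errors
z_dim = 100

stage1_gen = build_stage1_generator()
stage1_gen.load_weights('stage1_gen.h5')
stage1_gen.trainable = False       # Stage I stays frozen during Stage II

stage2_gen = build_stage2_generator()
stage2_dis = build_stage2_discriminator()   # hypothetical, see lead-in above
stage2_dis.compile(loss='binary_crossentropy', optimizer=Adam(0.0002, 0.5))

# One Stage II step: Stage I produces the low-resolution inputs.
z = np.random.normal(0, 1, (batch_size, z_dim))
low_res, _ = stage1_gen.predict([emb_batch, z])
high_res, _ = stage2_gen.predict([emb_batch, low_res])
```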

Hopefully, your model will now start training. If not, feel free to comment with your problems and errors, and I will be happy to help.

Results of Stage II StackGAN

If you get an OOM (out-of-memory) error, reduce the batch_size, as your system doesn't have enough memory for the operations.

# Conclusion

I hope that you are now able to train your own StackGAN model. If there are still errors, feel free to list them in the comments section and I will try my best to resolve them.

StackGAN is a powerful concept, and the fact that it can create such photo-realistic images from text is simply incredible.

GitHub Code: The code repository for this post was written on Google Colab.

LinkedIn Profile: You can follow me on LinkedIn as well.

In case you love this article, show your love by clicking the Clap icon.

