Text-to-Image Synthesis

Synthesizing high-quality images from text descriptions is one of the most challenging problems in Computer Vision. The task is both interesting and useful, but current AI systems are still far from solving it. In recent years, powerful neural network architectures such as GANs (Generative Adversarial Networks) have produced promising results.[1] Samples generated by existing text-to-image approaches roughly reflect the meaning of the given descriptions, but they often lack necessary details and vivid object parts.[2] Through this project, we wanted to explore architectures that could help us generate images from text descriptions. Generating photo-realistic images from text has many applications, including photo editing and computer-aided design.


Text description: This white and yellow flower has thin white petals and a round yellow stamen.

Generated Images:

Figure 1: Example of generated images from the text description

Dataset Description


Our results are presented on the Oxford-102 dataset of flower images, which contains 8,189 images of flowers from 102 categories. The flowers were chosen from species that commonly occur in the United Kingdom. The images have large variations in scale, pose and lighting. In addition, some categories vary widely within the category, and several categories are very similar to each other. The dataset is visualized using isomap with shape and color features. [3]

Each image has ten text captions that describe the flower in different ways. Each class contains between 40 and 258 images. The details of the categories and the number of images for each class can be found here: DATASET INFO

Link for Flowers Dataset: FLOWERS IMAGES LINK

Five captions were used for each image. The captions can be downloaded from the following link: FLOWERS TEXT LINK

Example Images from the Dataset

Figure 2: Example images from different classes

Examples of Text Descriptions for a given Image

Figure 3: Sample Text Descriptions for one flower

Architecture Details

The main idea behind generative adversarial networks is to learn two networks: a Generator network G, which tries to generate images, and a Discriminator network D, which tries to distinguish between ‘real’ images and ‘fake’ generated images. One can train these networks against each other in a min-max game where the generator seeks to maximally fool the discriminator while the discriminator simultaneously seeks to detect which examples are fake [4]:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

where z is a latent “code” that is often sampled from a simple distribution (such as a normal distribution). Conditional GAN is an extension of GAN where both the generator and the discriminator receive additional conditioning variables c, yielding G(z, c) and D(x, c). This formulation allows G to generate images conditioned on the variables c.
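To make the min-max game concrete, the sketch below estimates the standard GAN value function from [4], that is, the average of log D(x) on real images plus the average of log(1 - D(G(z))) on generated images. The function name `gan_value` and the example scores are our own illustrative assumptions, not part of any implementation referenced in this post.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of the GAN value function V(D, G).

    d_real: discriminator outputs D(x) on real images, values in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated images, values in (0, 1).
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
confused = gan_value([0.5, 0.5], [0.5, 0.5])

# A confident and correct discriminator pushes the value toward 0, its upper bound.
confident = gan_value([0.99, 0.98], [0.01, 0.02])
```

The discriminator tries to increase this value, while the generator tries to decrease it by making D(G(z)) larger.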

  • Generative Adversarial Text-To-Image Synthesis [1]

Figure 4 shows the network architecture proposed by the authors of this paper. The paper trains a deep convolutional generative adversarial network (DC-GAN) conditioned on text features. These text features are encoded by a hybrid character-level convolutional-recurrent neural network. Both the generator network G and the discriminator network D perform feed-forward inference conditioned on the text features. In the generator G, the encoded text embedding is first compressed to a small dimension by a fully-connected layer followed by a leaky ReLU, and then concatenated with the sampled noise vector z. The remaining steps are the same as in a vanilla GAN generator: the concatenated vector is fed forward through the deconvolutional network, which produces a synthetic image conditioned on the text query and the noise sample.

Figure 4: Network Architecture
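The conditioning step in the generator can be sketched with plain NumPy as follows. This is a minimal illustration, not the actual network: the dimensions (`EMBED_DIM`, `PROJ_DIM`, `NOISE_DIM`), the leaky-ReLU slope and the random weights are assumptions chosen for the example, and the deconvolutional layers that follow are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
EMBED_DIM = 1024   # size of the text encoder output
PROJ_DIM = 128     # size after the fully-connected compression
NOISE_DIM = 100    # size of the noise vector z

def leaky_relu(x, slope=0.2):
    """Leaky ReLU: identity for positive inputs, small slope for negative ones."""
    return np.where(x > 0, x, slope * x)

def generator_input(text_embedding, z, weight, bias):
    """Compress the text embedding and concatenate it with the noise vector."""
    projected = leaky_relu(text_embedding @ weight + bias)
    return np.concatenate([z, projected], axis=1)

batch = 4
text_embedding = rng.standard_normal((batch, EMBED_DIM))
z = rng.standard_normal((batch, NOISE_DIM))
weight = rng.standard_normal((EMBED_DIM, PROJ_DIM)) * 0.02  # untrained stand-in weights
bias = np.zeros(PROJ_DIM)

g_in = generator_input(text_embedding, z, weight, bias)
print(g_in.shape)  # (4, 228)
```

The resulting vector is what the deconvolutional part of the generator would consume; in a trained model `weight` and `bias` are learned together with the rest of G.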


The first tweak proposed by the authors is the matching-aware discriminator (GAN-CLS). The most straightforward way to train a conditional GAN is to view (text, image) pairs as joint observations and train the discriminator to judge pairs as real or fake. With this approach, the discriminator has no explicit notion of whether real training images match the text embedding context. To account for this, GAN-CLS adds a third type of input during training, alongside the real and fake inputs to the discriminator: real images with mismatched text, which the discriminator must learn to score as fake. By learning to optimize image/text matching in addition to image realism, the discriminator can provide an additional signal to the generator.

Figure 5: GAN-CLS Algorithm
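The three kinds of discriminator input can be combined into training objectives roughly as follows. This is a sketch of how we read the GAN-CLS algorithm in [1], not code from the paper; the function name `gan_cls_objectives` and the example scores are hypothetical.

```python
import numpy as np

def gan_cls_objectives(s_real, s_wrong, s_fake):
    """Batch-averaged GAN-CLS objectives.

    s_real:  D(real image, matching text)
    s_wrong: D(real image, mismatched text)
    s_fake:  D(generated image, matching text)
    All scores are probabilities in (0, 1).
    """
    s_real = np.asarray(s_real, dtype=float)
    s_wrong = np.asarray(s_wrong, dtype=float)
    s_fake = np.asarray(s_fake, dtype=float)
    # The discriminator wants real pairs scored high and both "fake" cases scored low.
    d_objective = np.mean(
        np.log(s_real) + 0.5 * (np.log(1.0 - s_wrong) + np.log(1.0 - s_fake))
    )
    # The generator wants its samples scored high by the discriminator.
    g_objective = np.mean(np.log(s_fake))
    return d_objective, g_objective

# A discriminator that checks the text scores mismatched pairs low ...
d_aware, g_aware = gan_cls_objectives([0.9], [0.1], [0.1])
# ... while one that ignores the text scores them as high as real pairs.
d_blind, g_blind = gan_cls_objectives([0.9], [0.9], [0.1])
```

Here `d_aware` is larger than `d_blind`, which is the pressure that pushes the discriminator to pay attention to whether the text matches the image.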


The second tweak is learning with manifold interpolation (GAN-INT). Deep networks have been shown to learn representations in which interpolations between embedding pairs tend to be near the data manifold. Building on that observation, the authors generated a large number of additional text embeddings by simply interpolating between embeddings of training set captions. As the interpolated embeddings are synthetic, the discriminator D does not have corresponding “real” image and text pairs to train on. However, D learns to predict whether image and text pairs match or not, so it can still provide a training signal to G on the interpolated embeddings.
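Interpolating between caption embeddings amounts to a weighted average of two embedding vectors. The sketch below assumes a mixing weight `beta` of 0.5 and uses random vectors as stand-ins for real caption embeddings; the function name and the sizes are our own illustrative choices.

```python
import numpy as np

def interpolate_embeddings(t1, t2, beta=0.5):
    """Return beta * t1 + (1 - beta) * t2 for two batches of caption embeddings."""
    return beta * t1 + (1.0 - beta) * t2

rng = np.random.default_rng(0)

# Eight hypothetical caption embeddings of size 16 (stand-ins for encoder outputs).
captions = rng.standard_normal((8, 16))

# Pair each caption with a randomly chosen caption from the same batch (possibly itself).
partners = captions[rng.permutation(8)]

# Synthetic embeddings that lie between pairs of training captions.
synthetic = interpolate_embeddings(captions, partners)
```

The generator can then be trained on `synthetic` in addition to the original embeddings, even though no real image corresponds to these interpolated captions.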

Some other architectures explored are as follows:

  • StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks [2]

The aim here was to generate high-resolution images with photo-realistic details. The authors proposed an architecture that decomposes the process of generating images from text into two stages, as shown in Figure 6. The two stages are as follows:

Stage-I GAN: Sketches the primitive shape and basic colors of the object (conditioned on the given text description) and draws the background layout from a random noise vector, yielding a low-resolution image.

Stage-II GAN: Corrects the defects in the low-resolution image from Stage-I and adds finer details to the object by reading the text description again, producing a high-resolution photo-realistic image.

Figure 6: Network Architecture of StackGAN

  • StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks [5]

This is an extended version of the StackGAN discussed above. It is an advanced multi-stage generative adversarial network architecture consisting of multiple generators and multiple discriminators arranged in a tree-like structure. The architecture generates images at multiple scales for the same scene. The authors report experiments in which this architecture significantly outperforms other state-of-the-art methods in generating photo-realistic images. Figure 7 shows the architecture.

Figure 7: Network Architecture of StackGAN++


Results

In this section, we describe the results, i.e., the images generated from the test data. A few examples of text descriptions and the corresponding outputs generated by our GAN-CLS can be seen in Figure 8. The generated flower images (16 images in each picture) correspond closely to the text descriptions. The clearest observation is that GAN-CLS consistently gets the colours right, not only for the flowers but also for the leaves, anthers and stems. The model also produces images that follow the orientation of the petals mentioned in the text descriptions. For example, the third description in Figure 8 states that ‘petals are curved upward’, and the generated flowers reflect this.

Figure 8: Flower images generated using GAN-CLS

This method of evaluation is inspired by [1], and we understand that it depends on the judgment of the viewer. We have tried to keep our observations as objective as possible. The complete directory of the generated snapshots can be viewed at the following link: SNAPSHOTS.

Comments and Conclusions

This project was an attempt to explore techniques and architectures for automatically synthesizing images from text descriptions. We implemented a simple architecture, GAN-CLS, and experimented with it to draw our own conclusions from the results. The results we obtained for this problem were produced on a very basic hardware configuration. Better results can be expected with more powerful resources such as GPUs or TPUs. Though AI is catching up in quite a few domains, text-to-image synthesis probably still needs a few more years of extensive work before it is ready for production use.

Important Links

Link to Dataset: DATASET
Link to Additional Information on Data: DATA INFO
Link to Implementation: CODE
Link to Results: SNAPSHOTS
Link to my LinkedIn Profile: LINKEDIN
Link to my GitHub Profile: GITHUB


References

  1. Reed, Scott, et al. "Generative adversarial text to image synthesis." arXiv preprint arXiv:1605.05396 (2016).
  2. Zhang, Han, et al. "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks." arXiv preprint (2017).
  3. Nilsback, Maria-Elena, and Andrew Zisserman. "Automated flower classification over a large number of classes." Computer Vision, Graphics & Image Processing, 2008. ICVGIP'08. Sixth Indian Conference on. IEEE, 2008.
  4. Goodfellow, Ian, et al. "Generative adversarial nets." Advances in neural information processing systems. 2014.
  5. Zhang, Han, et al. "StackGAN++: Realistic image synthesis with stacked generative adversarial networks." arXiv preprint arXiv:1710.10916 (2017).