T2F: Text to Face generation using Deep Learning

Animesh Karnewar
Jun 30, 2018 · 6 min read
Example images generated by T2F for the accompanying descriptions

The code for the project is available in my repository: https://github.com/akanimax/T2F


This problem inspired me to find a solution, so I began searching the deep learning research literature for something similar. Fortunately, there is abundant research on synthesizing images from text. The following are some of the papers I referred to.

  1. https://arxiv.org/abs/1605.05396 “Generative Adversarial Text to Image Synthesis”
  2. https://arxiv.org/abs/1612.03242 “StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks”
  3. https://arxiv.org/abs/1710.10916 “StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks”

After the literature study, I came up with an architecture that is simpler than StackGAN++ and quite apt for the problem being solved. In the subsequent sections, I will explain the work done and share the preliminary results obtained so far. I will also mention some of the coding and training details that took me a while to figure out.

Data Analysis:

Meanwhile, some time passed and this research came forward: Face2Text: Collecting an Annotated Image Description Corpus for the Generation of Rich Face Descriptions. It was just what I wanted. Special thanks to Albert Gatt and Marc Tanti for providing v1.0 of the Face2Text dataset.

The Face2Text v1.0 dataset contains natural language descriptions for 400 randomly selected images from the LFW (Labelled Faces in the Wild) dataset. The descriptions were cleaned to remove redundant and irrelevant captions provided for the people in the images. Some of the descriptions not only describe the facial features, but also convey information implied by the pictures. For instance, one of the captions for a face reads: “The man in the picture is probably a criminal.” Due to all these factors and the relatively small size of the dataset, I decided to use it as a proof of concept for my architecture. Eventually, we could scale the model to incorporate a bigger and more varied dataset as well.


T2F architecture for generating face from textual descriptions

The architecture used for T2F combines two prior architectures: StackGAN (mentioned earlier) for text encoding with conditioning augmentation, and ProGAN (Progressive Growing of GANs) for the synthesis of facial images. The original StackGAN++ architecture uses multiple GANs at different spatial resolutions, which I found to be overkill for any given distribution-matching problem. ProGAN, on the other hand, uses only one GAN, which is trained progressively, step by step, over increasingly refined (larger) resolutions. So, I decided to combine these two parts.
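For illustration, ProGAN's progressive refinement relies on the fade-in technique, which can be sketched as a simple alpha blend (a minimal sketch, not the exact ProGAN code): the output of a newly added block is mixed with the upsampled output of the previous stable block, with alpha ramping from 0 to 1 over training.

```python
import torch

def fade_in(alpha: float, upsampled_old: torch.Tensor,
            new_out: torch.Tensor) -> torch.Tensor:
    """Blend the upsampled output of the previous (stable) resolution with
    the output of the newly introduced layer. alpha ramps from 0 to 1 during
    training, so the new layer is phased in without destroying old learning."""
    return alpha * new_out + (1.0 - alpha) * upsampled_old
```

At alpha = 0 the network behaves exactly like the previous, smaller network; at alpha = 1 the new layer has fully taken over.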

In order to explain the flow of data through the network, here are a few points:

  1. The textual description is encoded into a summary vector using an LSTM network; this embedding (psi_t) is shown in the diagram.
  2. The embedding is passed through the Conditioning Augmentation block (a single linear layer) to obtain the textual part of the latent vector for the GAN, using a VAE-like reparameterization technique.
  3. The second part of the latent vector is random Gaussian noise.
  4. The latent vector so produced is fed to the generator part of the GAN, while the embedding is fed to the final layer of the discriminator for conditional distribution matching.
  5. Training of the GAN progresses exactly as described in the ProGAN paper, i.e. layer by layer at increasing spatial resolutions.
  6. Each new layer is introduced using the fade-in technique to avoid destroying previous learning.

Implementation and other details:

Many parts of the architecture are reusable, especially the ProGAN (conditional as well as unconditional). Hence, I coded it separately as a PyTorch module extension: https://github.com/akanimax/pro_gan_pytorch, which can be used for other datasets as well. You only need to specify the depth and the latent/feature size for the GAN, and the model spawns the appropriate architecture. The GAN can then be progressively trained on any dataset you desire.
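To illustrate what the depth parameter controls, here is a quick sketch of how depth maps to output resolution in progressive growing, assuming the usual 4x4 starting resolution that doubles per stage (an illustration, not the library's actual API):

```python
def resolution_at(depth: int) -> int:
    """Progressive growing starts at 4x4 and doubles the spatial resolution
    with each added stage, so a GAN of the given depth outputs images of
    4 * 2**(depth - 1) pixels per side."""
    return 4 * 2 ** (depth - 1)

# e.g. a depth of 5 reaches the 64 x 64 images used in this project
sizes = [resolution_at(d) for d in range(1, 6)]  # [4, 8, 16, 32, 64]
```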

Training details:

  1. Since there are no batch-norm or layer-norm operations in the discriminator, the WGAN-GP loss (used here for training) can explode. To counter this, I used the drift penalty with lambda = 0.001.
  2. For controlling the latent manifold created from the encoded text, we need a KL divergence term (between the CA's output and the standard normal distribution) in the Generator's loss.
  3. To make the generated images conform better to the input textual distribution, the WGAN variant of the Matching-Aware discriminator is helpful.
  4. The fade-in time for higher layers needs to be longer than the fade-in time for lower layers. To handle this, I used a percentage (85, to be precise) of each resolution's training time for fading in new layers.
  5. I found that the generated samples at higher resolutions (32 x 32 and 64 x 64) have more background noise than the samples generated at lower resolutions. I attribute this to the insufficient amount of data (only 400 images).
  6. For progressive training, spend more time (more epochs) at the lower resolutions and reduce the time appropriately for the higher resolutions.
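Point 1 above can be sketched as follows: the drift penalty adds a small term proportional to the squared critic scores on real samples, which keeps the unnormalized critic's outputs from drifting away from zero (a sketch using the stated lambda = 0.001; the function names are illustrative):

```python
import torch

def wgan_gp_critic_loss(real_scores: torch.Tensor,
                        fake_scores: torch.Tensor,
                        gradient_penalty: torch.Tensor,
                        drift_lambda: float = 0.001) -> torch.Tensor:
    """WGAN-GP critic loss plus a drift penalty. Without batch/layer norm in
    the critic, its scores can grow without bound; penalizing the mean squared
    score on real samples keeps them anchored near zero."""
    wasserstein = fake_scores.mean() - real_scores.mean()
    drift = drift_lambda * (real_scores ** 2).mean()
    return wasserstein + gradient_penalty + drift
```

The gradient penalty term itself is the standard WGAN-GP penalty on interpolates between real and fake samples; only the drift term is the addition discussed here.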

The following video shows the training time-lapse for the Generator. The video is created using the images generated at different spatial resolutions during the training of the GAN.

Training time-lapse for T2F


Progressive Growing of GANs is a phenomenal technique for training GANs faster and more stably. It can be coupled with various novel contributions from other papers, and together with the tips and tricks available for stabilizing GAN training, it can be applied in many areas.
