I have always been curious, while reading novels, about how the characters would look in reality. Imagining an overall persona is feasible, but filling in the finest details is challenging, and interpretations vary from person to person. Often I end up imagining a very blurry face for a character until the very end of the story. It is only when the book is adapted into a movie that the blurry face gets filled in with details. For instance, I could never imagine the exact face of Rachel from the book ‘The Girl on the Train’. But when the movie came out, I could relate to Emily Blunt’s face as the face of Rachel. Casting professionals must put a lot of effort into getting the characters from the script right.
This problem inspired me to look for a solution, so I began searching the deep learning research literature for something similar. Fortunately, there is abundant research on synthesizing images from text. The following are some of the papers I referred to.
- https://arxiv.org/abs/1605.05396 “Generative Adversarial Text to Image Synthesis”
- https://arxiv.org/abs/1612.03242 “StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks”
- https://arxiv.org/abs/1710.10916 “StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks”
After the literature study, I came up with an architecture that is simpler than StackGAN++ and well suited to the problem at hand. In the subsequent sections, I will explain the work done and share the preliminary results obtained so far. I will also mention some of the coding and training details that took me some time to figure out.
The data used for creating a deep learning model is undoubtedly its most fundamental ingredient: as mentioned by Prof. Andrew Ng in his deeplearning.ai courses, “The one who succeeds in machine learning is not someone who has the best algorithm, but the one with the best data”. Thus began my search for a dataset of faces with rich and varied textual descriptions. I stumbled upon numerous datasets with either just faces, faces with ids (for recognition), or faces accompanied by structured information such as eye-colour: blue, shape: oval, hair: blonde, etc., but not the kind I was after. My last resort was to use an earlier project of mine, natural-language-summary-generation-from-structured-data, to generate natural language descriptions from the structured data. But this would have added noise to an already noisy dataset.
Some time passed, and then the paper “Face2Text: Collecting an Annotated Image Description Corpus for the Generation of Rich Face Descriptions” came out: just what I wanted. Special thanks to Albert Gatt and Marc Tanti for providing v1.0 of the Face2Text dataset.
The Face2Text v1.0 dataset contains natural language descriptions for 400 randomly selected images from the LFW (Labeled Faces in the Wild) dataset. The descriptions have been cleaned to remove reluctant and irrelevant captions provided for the people in the images. Some of the descriptions not only describe the facial features but also provide information implied by the pictures. For instance, one of the captions for a face reads: “The man in the picture is probably a criminal”. Because of these factors and the relatively small size of the dataset, I decided to use it as a proof of concept for my architecture. Eventually, the model could be scaled to accommodate a bigger and more varied dataset.
The architecture used for T2F combines two architectures: StackGAN (mentioned earlier), for text encoding with conditioning augmentation, and ProGAN (Progressive Growing of GANs), for the synthesis of facial images. The original StackGAN++ architecture uses multiple GANs at different spatial resolutions, which I found to be overkill for a distribution matching problem. ProGAN, on the other hand, uses only one GAN, which is trained progressively, step by step, over increasingly refined (larger) resolutions. So, I decided to combine these two parts.
The flow of data through the network is as follows:
- The textual description is encoded into a summary vector using an LSTM network. This is the Embedding (psy_t) shown in the diagram.
- The embedding is passed through the Conditioning Augmentation block (a single linear layer) to obtain the textual part of the latent vector for the GAN input. This step uses a VAE-like reparameterization technique.
- The second part of the latent vector is random Gaussian noise.
- The resulting latent vector is fed to the generator of the GAN, while the embedding is fed to the final layer of the discriminator for conditional distribution matching.
- The training of the GAN progresses exactly as described in the ProGAN paper, i.e. layer by layer at increasing spatial resolutions. Each new layer is introduced using the fade-in technique to avoid destroying previous learning.
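The conditioning augmentation step and the assembly of the latent vector can be illustrated with a minimal sketch. It is written in plain Python rather than PyTorch so that it runs without dependencies, and it is not the code from the T2F repository. The names `conditioning_augmentation` and `build_latent` are hypothetical, and `mu` and `log_sigma` stand in for the two halves of the linear layer's output, which is an assumption about how that output is split.

```python
import math
import random


def conditioning_augmentation(mu, log_sigma, rng):
    """Reparameterization: c = mu + sigma * eps, with eps ~ N(0, I).

    mu and log_sigma are assumed to come from the single linear layer
    applied to the text embedding (hypothetical split of its output).
    """
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0)
            for m, ls in zip(mu, log_sigma)]


def build_latent(mu, log_sigma, noise_size, rng):
    """Concatenate the sampled text code with random Gaussian noise."""
    text_code = conditioning_augmentation(mu, log_sigma, rng)
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_size)]
    return text_code + noise


rng = random.Random(0)
mu = [0.5, -1.0, 2.0]               # hypothetical means for one caption
log_sigma = [-50.0, -50.0, -50.0]   # tiny sigma, so the sample stays near mu
latent = build_latent(mu, log_sigma, noise_size=4, rng=rng)
print(len(latent))                  # text part (3) plus noise part (4)
```

Because the sampling noise enters only through `eps`, the sampled text code remains a differentiable function of `mu` and `log_sigma`. In a framework with automatic differentiation, that is what lets gradients reach the text encoder.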
Implementation and other details:
The architecture was implemented in Python using the PyTorch framework. I had worked with TensorFlow and Keras earlier, so I felt like trying PyTorch for once. I really liked being able to use a native Python debugger for debugging the network architecture, courtesy of the eager execution strategy. TensorFlow has recently included an eager execution mode too. Anyway, this is not a debate on which framework is better; I just wanted to highlight that the code for this architecture has been written in PyTorch. You can find the implementation and notes on how to run the code in my GitHub repo https://github.com/akanimax/T2F.
I find many parts of the architecture reusable, especially the ProGAN (conditional as well as unconditional). Hence, I coded them separately as a PyTorch Module extension: https://github.com/akanimax/pro_gan_pytorch, which can be used for other datasets as well. You only need to specify the depth and the latent/feature size for the GAN, and the model spawns the appropriate architecture. The GAN can then be trained progressively on any dataset you like.
I trained quite a few versions using different hyperparameters. As alluded to earlier, the details related to training are as follows:
- Since there are no batch-norm or layer-norm operations in the discriminator, the WGAN-GP loss (used here for training) can explode. To counter this, I used the drift penalty with lambda = 0.001.
- To control the latent manifold created from the encoded text, we need a KL divergence term (between the CA’s output and the standard normal distribution) in the generator’s loss.
- To make the generated images conform better to the input textual distribution, a WGAN variant of the matching-aware discriminator is helpful.
- The fade-in time for higher layers needs to be longer than the fade-in time for lower layers. To handle this, I specified the fade-in of new layers as a percentage (85, to be precise) during training.
- I found that the samples generated at higher resolutions (32 x 32 and 64 x 64) have more background noise than the samples generated at lower resolutions. I attribute this to the insufficient amount of data (only 400 images).
- For the progressive training, spend more time (more epochs) at the lower resolutions and reduce the time appropriately for the higher resolutions.
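Three of the training details above can be sketched as small helper functions: the drift penalty, the KL term, and the percentage-based fade-in. This is a dependency-free illustration in plain Python, not the code from the T2F repository, and all function names are hypothetical. The fade-in schedule in particular is only one plausible reading of the 85 percent figure: it assumes the new layer's blend weight ramps linearly from 0 to 1 over the first 85 percent of the steps at a resolution and then stays at 1.

```python
import math


def drift_penalty(real_scores, eps_drift=0.001):
    """eps_drift * mean(D(x)^2) over real samples, as in the ProGAN paper.

    Added to the critic loss to keep its outputs from drifting far from zero.
    """
    return eps_drift * sum(s * s for s in real_scores) / len(real_scores)


def kl_to_standard_normal(mu, log_sigma):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), summed over dimensions."""
    return sum(0.5 * (math.exp(2.0 * ls) + m * m - 1.0 - 2.0 * ls)
               for m, ls in zip(mu, log_sigma))


def fade_in_alpha(step, total_steps, fade_in_percentage=85.0):
    """Blend weight of a new layer (assumed linear ramp, then constant 1)."""
    fade_steps = max(1, int(total_steps * fade_in_percentage / 100.0))
    return min(1.0, step / fade_steps)


def fade_in_blend(upsampled_old, new, alpha):
    """Mix the upsampled output of the old layers with the new layer's output."""
    return [(1.0 - alpha) * o + alpha * n for o, n in zip(upsampled_old, new)]


print(drift_penalty([1.0, -3.0]))
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
print(fade_in_alpha(425, 1000))
print(fade_in_blend([0.0, 2.0], [1.0, 4.0], alpha=0.5))
```

With these helpers, the generator loss would be the adversarial term plus a weighted `kl_to_standard_normal` term, and the critic loss would include `drift_penalty` alongside the gradient penalty. The relative weights are hyperparameters that this sketch does not fix.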
The following video shows the training time-lapse for the Generator. The video is created using the images generated at different spatial resolutions during the training of the GAN.
From the preliminary results, I can assert that T2F is a viable project with some very interesting applications. For instance, T2F could help law enforcement agencies identify perpetrators or victims from their descriptions. More generally, it could help in any application where we need a head start to jog our imagination. I will be working on scaling this project and benchmarking it on the Flickr8k dataset, the COCO Captions dataset, etc. Suggestions and contributions are most welcome.
Progressive Growing of GANs is a phenomenal technique for training GANs faster and in a more stable manner. It can be coupled with various novel contributions from other papers. Combined with the tips and tricks available for constraining the training of GANs, these techniques can be applied in many areas.