Image-to-Image Translation Using Conditional DCGANs !!!

Udith Haputhanthri
the-ai.team
Jun 5, 2020 · 8 min read
Fig 1: From the paper “Image-to-Image Translation with Conditional Adversarial Networks”¹

Introduction

Image-to-Image translation is one of the most exciting areas in image processing and machine vision. Many deep learning architectures and loss functions have been designed to target specific image-to-image translation tasks, such as:

  1. Image Segmentation
  2. Color Synthesis from Edges
  3. Grayscale to Color Conversion
  4. Season Translation
  5. Motion Transfer
  6. Deep Fake Generation

But the main drawback of these methods is that each of them targets only a single, specific task. This is where Generative Adversarial Networks come into play.

In the paper “Image-to-Image Translation with Conditional Adversarial Networks”¹, the authors propose a general-purpose solution for image-to-image translation using conditional GANs. This article is mainly based on the concepts introduced in that paper, as well as the knowledge I gained through the implementation of “Image to Image Translation using cGAN”.

This article covers the following areas.

  1. A gentle introduction to conditional GAN
  2. Advancements of the general cGAN for Image-to-Image Translation (from the paper)
  3. A simple “Image to Image Translation” implementation using the pix2pix dataset

More details about Generative Adversarial Networks (GANs) can be found in my previous article here.

Background

You may have heard about the pix2pix platform (if not, check it out here), where anyone can draw a shape and turn it into a realistic cat. The pix2pix software was released along with the above-mentioned paper. It has been one of the most exciting papers because of the concept of general image-to-image translation. The architecture and methods proposed by the authors can be used for most inter-domain translations, and they have performed significantly better than the baseline methods introduced for each particular task.

GANs to Conditional GANs (cGANs)

When it comes to Generative Adversarial Networks, the main drawback is that the features of the generated entity cannot be controlled. The main goal of Conditional Generative Adversarial Networks is to overcome this issue.

Deeper Into cGANs

In order to control the generated entity, the GAN architecture should be capable of being conditioned. To achieve this, the architecture and the objective function of the conventional GAN are changed as follows.

Architecture :

Fig 2: Conditional adversarial net from the paper “Conditional Generative Adversarial Nets”⁴

In the cGAN architecture, the Generator and the Discriminator networks remain the same as in a GAN; the only difference between the two architectures is the introduction of the condition term “y”, which imposes specific conditions on the entity being generated.
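As a rough sketch of the idea (the shapes and names below are my own, not from the paper), conditioning is usually implemented by simply concatenating y with the usual inputs of both networks:

```python
import torch

# Hypothetical shapes: a batch of noise vectors z and one-hot condition vectors y.
z = torch.randn(16, 100)    # random noise fed to the Generator
y = torch.zeros(16, 10)     # condition, e.g. a one-hot class label
y[:, 3] = 1.0

g_input = torch.cat([z, y], dim=1)   # the Generator now models G(z|y)
# The Discriminator similarly receives the sample (real or generated)
# concatenated with the same condition y, so it models D(x|y).
```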

Objective Function :

Fig 3: GAN Objective Function from paper⁴
Fig 4: Conditional GAN Objective Function from paper⁴
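Since the two figures above are images of the equations, here they are written out as they appear in the GAN³ and cGAN⁴ papers:

```latex
% GAN objective (unconditional)
\min_G \max_D \; V(D,G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]

% cGAN objective: the same two terms, now conditioned on y
\min_G \max_D \; V(D,G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x \mid y)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z \mid y))\big)\big]
```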

Comparing the objective functions of the GAN and the cGAN, it is clear that the only difference in the cGAN is the condition term y. For the image-generation task, the objective function of the cGAN can be studied from two opposing perspectives.

From the Generator Perspective :

The image produced by the Generator conditioned on y is G(z|y) (z is the noise that provides randomness). The Generator tries to generate realistic images under the given condition y while fooling the Discriminator at the same time. Therefore the Generator is trained to maximize the term D(G(z|y)), which minimizes the objective function.

From the Discriminator Perspective :

The Discriminator is trained to distinguish generated images from real images. Note that, in this case, generating realistic images alone is not sufficient: the Discriminator must also take the condition (y) into account when classifying real and generated images.

So the Discriminator is trained in such a way that it will,

  1. Maximize the term D(x|y) when real images with their corresponding conditions are fed,
  2. Minimize the term D(G(z|y)) when generated images, together with the conditions used to generate them, are fed.

Image-to-Image Translation — Using PyTorch

How Exactly?

The image-to-image translation task can be achieved by using a conditional Deep Convolutional GAN with the condition (y) being the base image, i.e., the image in the first domain.

In the paper, the authors use several further modifications of the cGAN architecture to get better results. They can be summarized as follows.

  1. U-Net Architecture for the Generator

The generator uses a U-Net architecture, which is essentially an encoder-decoder architecture with skip connections.

An encoder-decoder architecture can be used to compress the image into a lower-dimensional space and then generate it back with the desired properties. The main drawback of such an encoder-decoder architecture is that, in the deeper layers, the low-level features of the base image are diminished. Adding skip connections between the early and late layers of the generator improves its ability to generate images using the low-level features of the base image.

Fig 5: U-Net architecture⁵

2. Patch Discriminator

They use the concept of a patch discriminator (PatchGAN) instead of the conventional discriminator of a cGAN. What a PatchGAN basically does is classify patches of the image as fake or real instead of classifying the whole image at once, and then average all the patch responses to produce the final fake/real label for the image.

By using a patch discriminator, high-frequency correctness of the generated image can be achieved, because attention is restricted to the structure of local image patches.
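As a minimal illustration (the layer sizes are mine, not the exact 70×70 configuration from the paper), a patch discriminator never flattens the image; its output is a grid of real/fake scores, one per local patch, which can then be averaged:

```python
import torch
import torch.nn as nn

# Illustrative patch discriminator: input is the condition image and the
# real/generated image concatenated channel-wise (3 + 3 = 6 channels).
patch_d = nn.Sequential(
    nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(256, 1, 4, stride=1, padding=1),   # 1-channel score map
)

pair = torch.randn(1, 6, 256, 256)
scores = patch_d(pair)       # shape (1, 1, 31, 31): one score per patch
image_score = scores.mean()  # averaged to a single real/fake response
```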

3. L1 loss in the objective Function

New Objective Function :

Fig 6: Objective Function Used in “Image-to-Image Translation with Conditional Adversarial Networks”¹

where,

Fig 7: L1 Loss between y and G(x, z)
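Written out in the paper's¹ notation (x is the input image, y the target image, and z the noise), the combined objective is:

```latex
G^{*} = \arg\min_{G}\max_{D}\;
        \mathcal{L}_{cGAN}(G, D) + \lambda \, \mathcal{L}_{L1}(G)

% where the L1 term penalises the distance between the target image y
% and the generated image G(x, z):
\mathcal{L}_{L1}(G) = \mathbb{E}_{x, y, z}\big[\,\lVert y - G(x, z) \rVert_{1}\,\big]
```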

Adding the L1 loss to the objective function changes the objective of the generator. According to the equation, the generator should now produce a realistic image conditioned on the input image (x), while also minimizing the L1 distance between the generated image G(x, z) and the target image y.

This trains the generator in such a way that the generated images stay close to the target images. Therefore, low-frequency correctness of the generated image is achieved by adding the L1 loss to the objective function.

4. Dropout layers instead of random noise (z)

They drop the input noise to the generator and instead introduce dropout layers to provide randomness for the generator during both training and testing.
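A small practical note, sketched under my own assumptions about where the dropout sits: because z is removed, the dropout has to stay active even at test time, e.g. by using the functional form with training=True (or simply keeping the generator in train mode):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 512, 2, 2)   # a hypothetical decoder activation

# Applied inside the first few decoder blocks; training=True keeps dropout
# stochastic even at test time, which supplies the randomness that the
# removed noise vector z used to provide.
x = F.dropout(x, p=0.5, training=True)
```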

Task

In this article, I have mainly focused on the task of “Image Colorization”. Besides the main task, I have also tried the same procedure for “CityScapes Segmentation” using the “pix2pix-cityscapes” dataset and for “Map generation from Satellite images” using the “pix2pix-maps” dataset.

In the same way, any translation between two domains can be achieved using the same procedure with an appropriate dataset.

Fig 8: The objective is to colorize the edge image of the shoe

Dataset

The code snippet below shows the downloading and preparation of the dataset. The “pix2pix” dataset from Kaggle is used.

Code 1: Creating DataLoader Objects
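The original gist is embedded in the Medium post; below is only a minimal sketch of what such a data pipeline could look like, assuming the pix2pix-style file layout where each image stores the edge drawing and the photo side by side (the paths, sizes, and class names are placeholders of mine):

```python
import glob

from PIL import Image
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as T

class Edges2ShoesDataset(Dataset):
    """Each file holds both domains side by side; split it into (edge, photo)."""
    def __init__(self, root, image_size=256):
        self.files = sorted(glob.glob(f"{root}/*.jpg"))
        self.transform = T.Compose([
            T.Resize((image_size, 2 * image_size)),          # (H, W) of the combined image
            T.ToTensor(),
            T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),   # scale to [-1, 1]
        ])

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        combined = self.transform(Image.open(self.files[idx]).convert("RGB"))
        w = combined.shape[-1] // 2
        edge, photo = combined[..., :w], combined[..., w:]   # left half / right half
        return edge, photo

# Hypothetical folder names; the Kaggle "pix2pix" dataset provides train/val splits.
train_loader = DataLoader(Edges2ShoesDataset("edges2shoes/train"),
                          batch_size=16, shuffle=True, num_workers=2)
val_loader = DataLoader(Edges2ShoesDataset("edges2shoes/val"),
                        batch_size=16, shuffle=False, num_workers=2)
```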

Model Architecture

The Discriminator and Generator classes are implemented using the helper classes conv_block and transconv_block.

Code 2: Defining Generator and Discriminator
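The actual snippet lives in the embedded gist; the following is a hedged sketch of how such a Generator (a shortened U-Net with skip connections) and a PatchGAN-style Discriminator could look, assuming the conv_block and transconv_block helpers defined in the next snippet. All layer sizes are illustrative, not the author's exact configuration.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Shortened U-Net sketch: 4 downsampling / 4 upsampling blocks with
    skip connections (the paper's generator is deeper)."""
    def __init__(self, in_ch=3, out_ch=3):
        super().__init__()
        self.d1 = conv_block(in_ch, 64, normalize=False)    # 256 -> 128
        self.d2 = conv_block(64, 128)                        # 128 -> 64
        self.d3 = conv_block(128, 256)                       # 64  -> 32
        self.d4 = conv_block(256, 512)                       # 32  -> 16
        self.u1 = transconv_block(512, 256, dropout=True)    # 16  -> 32
        self.u2 = transconv_block(512, 128, dropout=True)    # 32  -> 64
        self.u3 = transconv_block(256, 64)                   # 64  -> 128
        self.final = nn.Sequential(
            nn.ConvTranspose2d(128, out_ch, 4, stride=2, padding=1),  # 128 -> 256
            nn.Tanh(),
        )

    def forward(self, x):
        e1 = self.d1(x)
        e2 = self.d2(e1)
        e3 = self.d3(e2)
        e4 = self.d4(e3)
        u1 = self.u1(e4)
        u2 = self.u2(torch.cat([u1, e3], dim=1))   # skip connection
        u3 = self.u3(torch.cat([u2, e2], dim=1))   # skip connection
        return self.final(torch.cat([u3, e1], dim=1))

class Discriminator(nn.Module):
    """PatchGAN sketch: scores local patches of the (condition, image) pair."""
    def __init__(self, in_ch=6):
        super().__init__()
        self.model = nn.Sequential(
            conv_block(in_ch, 64, normalize=False),     # 256 -> 128
            conv_block(64, 128),                        # 128 -> 64
            conv_block(128, 256),                       # 64  -> 32
            nn.Conv2d(256, 1, 4, stride=1, padding=1),  # 31x31 patch score map (logits)
        )

    def forward(self, condition, image):
        return self.model(torch.cat([condition, image], dim=1))
```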

The helper classes are defined as follows.

Code 3: Defining the helper classes (conv_block and transconv_block)
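Again only a sketch, not the original gist: plausible conv_block and transconv_block helpers that downsample and upsample by a factor of two, matching how they are used above (written here as functions for brevity, although the author describes them as classes):

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, normalize=True):
    """Downsampling block: Conv(4x4, stride 2) -> optional BatchNorm -> LeakyReLU."""
    layers = [nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1, bias=not normalize)]
    if normalize:
        layers.append(nn.BatchNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2))
    return nn.Sequential(*layers)

def transconv_block(in_ch, out_ch, dropout=False):
    """Upsampling block: ConvTranspose(4x4, stride 2) -> BatchNorm -> optional Dropout -> ReLU."""
    layers = [nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1, bias=False),
              nn.BatchNorm2d(out_ch)]
    if dropout:
        layers.append(nn.Dropout(0.5))
    layers.append(nn.ReLU())
    return nn.Sequential(*layers)
```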

Discriminator and Generator model architectures can be visualized as follows.

Fig 9: Designed Generator and Discriminator architectures visualized using Tensorboard
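For completeness, a minimal way to produce such a TensorBoard visualization (assuming the Generator and Discriminator sketches above, and hypothetical log directories) could look like this:

```python
import torch
from torch.utils.tensorboard import SummaryWriter

dummy = torch.randn(1, 3, 256, 256)

# One writer (i.e. one run directory) per model so each graph shows up separately.
with SummaryWriter("runs/generator") as writer:
    writer.add_graph(Generator(), dummy)
with SummaryWriter("runs/discriminator") as writer:
    writer.add_graph(Discriminator(), (dummy, dummy))   # takes (condition, image)
```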

Training

Training is done similarly to conventional GANs.

Code 5: Training
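The author's training gist is not reproduced here; below is a minimal sketch of a pix2pix-style training loop under the assumptions of the earlier snippets (Generator, Discriminator, train_loader), using BCE-with-logits for the adversarial term and λ = 100 for the L1 term as in the paper:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
G, D = Generator().to(device), Discriminator().to(device)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
adv_loss, l1_loss = nn.BCEWithLogitsLoss(), nn.L1Loss()
lambda_l1 = 100   # L1 weighting used in the paper

for epoch in range(20):                      # number of epochs is arbitrary here
    for edge, photo in train_loader:
        edge, photo = edge.to(device), photo.to(device)
        fake = G(edge)

        # --- Discriminator step: real pairs -> 1, generated pairs -> 0 ---
        pred_real = D(edge, photo)
        pred_fake = D(edge, fake.detach())
        loss_D = 0.5 * (adv_loss(pred_real, torch.ones_like(pred_real)) +
                        adv_loss(pred_fake, torch.zeros_like(pred_fake)))
        opt_D.zero_grad()
        loss_D.backward()
        opt_D.step()

        # --- Generator step: fool D while staying close to the target (L1) ---
        pred_fake = D(edge, fake)
        loss_G = adv_loss(pred_fake, torch.ones_like(pred_fake)) + \
                 lambda_l1 * l1_loss(fake, photo)
        opt_G.zero_grad()
        loss_G.backward()
        opt_G.step()
```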

Results — Edge2Shoes

For the evaluation, edge images of shoes from the test data are used to obtain colorized images.

Code 6: Evaluation of Generator on previously unseen Test edge-images
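Again only a sketch, assuming the val_loader, G, and device from the previous snippets: generate colorized shoes from unseen edge images and save the generated and real photos as a grid.

```python
import torch
from torchvision.utils import save_image

edge, photo = next(iter(val_loader))
edge = edge.to(device)

# Dropout is the only source of randomness, so (following the paper) the
# generator can be left in train mode instead of calling G.eval().
with torch.no_grad():
    fake = G(edge)

# Undo the [-1, 1] normalisation and save generated (top) vs. real (bottom) rows.
grid = torch.cat([fake.cpu(), photo], dim=0) * 0.5 + 0.5
save_image(grid, "edges2shoes_results.png", nrow=edge.size(0))
```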

These are the results that I obtained from the above architecture.

Fig 10: Colorized images generated by the above architecture: the bottom row contains the actual shoes and the top row shows the generated colorized shoes.

Considering the results, even though they are not perfect, it can be seen that details such as shoe lines, lighting conditions, and color changes of the shoe are preserved in the generated images. By tuning hyperparameters, changing the model architecture, and increasing the training time, much better results can be obtained. Experiments were done with both L1 and L2 losses in the main loss function.

Results — Map Generation and Cityscapes Segmentation

Results for the tasks “Map Generation” and “Cityscapes Segmentation” obtained from the train set are shown below.

Fig 11: Map generation from Satellite View
Fig 12: CityScapes Segmentation

Observing the above results, it is clear that changing the network architectures and hyperparameters, and using a patch discriminator instead of the conventional discriminator, would lead to much better results, as shown in the paper¹.

All the results, trained model weights, and notebooks can be found in the repository mentioned below.

Drawbacks

It can be seen that, in order to do image-to-image translation between two domains using this method, a one-to-one mapped dataset between the two domains is essential. That means we need labeled pairs of edge images and the corresponding colored images.

Therefore, this method cannot be used in the manner explained here when it is difficult to obtain a one-to-one mapped dataset between the two domains.

Conclusion

The conditional GAN architecture is very important for the generative aspects of deep learning because it overcomes the unconditioned nature of the conventional GAN. By using conditional GANs, the authors of the paper have achieved a landmark in the general image-to-image translation domain.

Numerous recent papers and architectures have been published that build on conditional GANs and address the drawbacks explained above. Advanced architectures such as StarGAN, CycleGAN, and StackGAN have been introduced based on conditional GANs.

Feel free to check out my GitHub repository. Any comments, suggestions, and advice are greatly appreciated ❤️

Reference

[1] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros. “Image-to-Image Translation with Conditional Adversarial Networks” (2017)

[2] Alec Radford & Luke Metz, Soumith Chintala. “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks” (2016 Jan)

[3] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio. “Generative Adversarial Networks” (2014 Jan)

[4] Mehdi Mirza, Simon Osindero. “Conditional Generative Adversarial Nets” (2014 Nov)

[5] https://lmb.informatik.uni-freiburg.de/research/funded_projects/bioss_deeplearning/unet.png
