Image to Image Translation using cGAN

Raghav Chugh
7 min read · Oct 15, 2021


A Deep Learning Case Study

Introduction

Image-to-image translation is the controlled conversion of a given source image to a target image. Examples include translating a photograph of a landscape from day to night or translating a segmented image to a photograph, synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks.

Conditional adversarial networks are general-purpose solutions to image-to-image translation problems. These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping. This makes it possible to apply the same generic approach to problems that traditionally would require very different loss formulations.

Just as GANs learn a generative model of data, conditional GANs (cGANs) learn a conditional generative model. This makes cGANs suitable for image-to-image translation tasks, where we condition on an input image and generate a corresponding output image.

Architecture

The GAN architecture consists of a generator model for outputting new plausible synthetic images, and a discriminator model that classifies images as real (from the dataset) or fake (generated). The discriminator model is updated directly, whereas the generator model is updated via the discriminator model. As such, the two models are trained simultaneously in an adversarial process where the generator seeks to better fool the discriminator and the discriminator seeks to better identify the counterfeit images.

For our generator we use a “U-Net”-based architecture, and for our discriminator we use a convolutional “PatchGAN” classifier, which only penalizes structure at the scale of image patches. Both the generator and the discriminator use modules of the form convolution-BatchNorm-ReLU.
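As a rough sketch of such a module in Keras (the helper name encoder_block is mine; in practice the encoder and discriminator variants of this block use a LeakyReLU):

```python
from tensorflow.keras.layers import Conv2D, BatchNormalization, LeakyReLU

def encoder_block(x, n_filters, batchnorm=True):
    # Convolution-BatchNorm-LeakyReLU module used in the generator's encoder
    # and in the PatchGAN discriminator; strides of 2 halve the spatial size
    x = Conv2D(n_filters, (4, 4), strides=(2, 2), padding='same')(x)
    if batchnorm:
        x = BatchNormalization()(x)
    return LeakyReLU(alpha=0.2)(x)
```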

Objective function

GANs are generative models that learn a mapping from a random noise vector z to an output image y, G : z → y.

In contrast, conditional GANs learn a mapping from an observed image x and a random noise vector z to an output image y, G : {x, z} → y.
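For reference, the conditional adversarial objective as given in the pix2pix paper is:

L_cGAN(G, D) = E_{x,y}[log D(x, y)] + E_{x,z}[log(1 - D(x, G(x, z)))]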

where G tries to minimize this objective against an adversarial D that tries to maximize it, i.e. G* = arg min_G max_D L_cGAN(G, D).

It is beneficial to mix the GAN objective with a more traditional loss, such as L2 distance. We’ll use L1 distance rather than L2, as L1 encourages less blurring.
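The L1 reconstruction term, as defined in the pix2pix paper, is:

L_L1(G) = E_{x,y,z}[ ||y - G(x, z)||_1 ]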

Our final objective is:
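G* = arg min_G max_D L_cGAN(G, D) + λ · L_L1(G)

where λ weights the L1 term; the paper uses λ = 100.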

Business Problem

Given a satellite image, we have to translate it into a corresponding aerial map image.

Success Metric

The model should generate clear, non-pixelated map images from the given satellite images, such that a person cannot distinguish the model-generated output from the manually created map.

Business constraints

  • It should be time-efficient. The solution should not take hours to run.

Dataset

Data Overview :

  • Source: http://efrosgans.eecs.berkeley.edu/pix2pix/datasets/
  • There are 2 folders in maps.tar.gz, i.e. train and val.
  • The train folder contains 1096 images, while val contains 1098.
  • Each image contains a satellite image and its corresponding map image.
  • Images were sampled from in and around New York City.

Exploratory Data Analysis

We are given train and val folders. The train folder contains 1096 images, while val contains 1098. We load the data and analyze it to get some insights.

Each image contains two fragments: a satellite image and its aerial map. For our purpose, the satellite fragment is the input and the aerial map is the output, so we’ll be splitting each given image into two.

Each image is 1200x600 (width x height) and contains both the satellite and the map image. After splitting, each half is 600x600.

We then resize each half from 600x600 to 256x256.
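A minimal loading sketch along these lines, assuming the composite images sit in a single folder (the helper name load_images and the folder path are my own):

```python
import os
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.image import load_img, img_to_array

def load_images(path):
    # Each file is a 1200x600 composite: satellite half on the left, map half on the right
    src_list, tar_list = [], []
    for filename in sorted(os.listdir(path)):
        pixels = img_to_array(load_img(os.path.join(path, filename)))  # shape (600, 1200, 3)
        sat, map_ = pixels[:, :600], pixels[:, 600:]                   # two 600x600 halves
        # Resize each half to 256x256
        src_list.append(tf.image.resize(sat, (256, 256)).numpy())
        tar_list.append(tf.image.resize(map_, (256, 256)).numpy())
    return np.asarray(src_list), np.asarray(tar_list)

# e.g. src_images, tar_images = load_images('maps/train')  # path assumed
```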

Data Distribution of satellite images

We’ll plot a histogram of the pixel values of the train satellite images. Since our images have 3 channels (RGB), there will be 3 lines in the plot, each corresponding to a channel.
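One way to produce such a plot with matplotlib, assuming src_images holds the satellite halves loaded above:

```python
import matplotlib.pyplot as plt

# src_images: satellite halves, shape (N, 256, 256, 3), pixel values in [0, 255]
for channel, color in enumerate(('red', 'green', 'blue')):
    plt.hist(src_images[..., channel].ravel(), bins=256, range=(0, 255),
             histtype='step', color=color, label=color)
plt.xlabel('Pixel value')
plt.ylabel('Count')
plt.legend()
plt.show()
```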

Each pixel value lies in the range [0, 255]. We observe that most satellite pixel values are concentrated in the region of 40 to 50, and the distribution is skewed to the right.

Next, we plot the same graph for the val dataset.

We can observe that both distributions (train and val) are similar in nature; the validation distribution is also skewed to the right.

Before we start modeling, we’ll normalize our data to the range [-1, 1].
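A small sketch of that scaling step (the helper name normalize is mine); it matches the tanh output range of the generator described below:

```python
def normalize(src_images, tar_images):
    # Scale pixel values from [0, 255] to [-1, 1]
    src_images = (src_images - 127.5) / 127.5
    tar_images = (tar_images - 127.5) / 127.5
    return src_images, tar_images
```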

Implementing cGAN

Discriminator

The discriminator is a deep convolutional neural network that performs image classification, specifically conditional image classification. It takes both the source image (e.g. a satellite photo) and the target image (e.g. the corresponding map image) as input and predicts the likelihood that the target image is a real or a fake translation of the source image.

It implements a PatchGAN model, which is designed around the effective receptive field of the network: the relationship between one output of the model and the region of pixels in the input image that influences it. The discriminator is designed such that each output prediction maps to a 70×70 square, or patch, of the input image.

The discriminator structure is as follows:

The model takes the two input images, concatenates them, and predicts a patch of real/fake predictions. It is optimized using binary cross-entropy, and a weighting of 0.5 is used so that updates to the discriminator have half the usual effect.
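A hedged Keras sketch of such a PatchGAN discriminator, following the layer layout of the original 70x70 variant (the helper name define_discriminator and the exact hyperparameters are assumptions, not necessarily the model used in this project):

```python
from tensorflow.keras.layers import Input, Conv2D, BatchNormalization, LeakyReLU, Concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.initializers import RandomNormal

def define_discriminator(image_shape=(256, 256, 3)):
    init = RandomNormal(stddev=0.02)
    in_src = Input(shape=image_shape)   # source satellite image
    in_tar = Input(shape=image_shape)   # target map image (real or generated)
    d = Concatenate()([in_src, in_tar])
    # C64-C128-C256 with stride 2, then C512 with stride 1; with 4x4 kernels this
    # gives each output unit a 70x70 receptive field on the input
    for filters, strides in ((64, 2), (128, 2), (256, 2), (512, 1)):
        d = Conv2D(filters, (4, 4), strides=strides, padding='same', kernel_initializer=init)(d)
        if filters != 64:               # no BatchNorm on the first layer
            d = BatchNormalization()(d)
        d = LeakyReLU(alpha=0.2)(d)
    # Patch output of real/fake predictions
    patch_out = Conv2D(1, (4, 4), padding='same', activation='sigmoid', kernel_initializer=init)(d)
    model = Model([in_src, in_tar], patch_out)
    # loss_weights=[0.5] halves the magnitude of the discriminator updates
    model.compile(loss='binary_crossentropy',
                  optimizer=Adam(learning_rate=0.0002, beta_1=0.5), loss_weights=[0.5])
    return model
```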

Generator

The generator uses a U-Net architecture. It comprises an encoder, a bottleneck layer and a decoder. The encoder first downsamples the input image to a bottleneck representation, and the decoder then upsamples that representation back to the size of the output image. There are skip connections between the encoding layers and the corresponding decoding layers, forming a U-shape.

The structure is as follows:

The tanh activation function is used in the output layer, meaning that pixel values in the generated image will be in the range [-1,1].
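A hedged Keras sketch of a U-Net generator along these lines (the helper name define_generator, the filter counts and the dropout placement are assumptions based on the pix2pix paper's U-Net, not necessarily the exact model trained here):

```python
from tensorflow.keras.layers import (Input, Conv2D, Conv2DTranspose, BatchNormalization,
                                     LeakyReLU, Activation, Dropout, Concatenate)
from tensorflow.keras.models import Model
from tensorflow.keras.initializers import RandomNormal

def define_generator(image_shape=(256, 256, 3)):
    init = RandomNormal(stddev=0.02)
    in_image = Input(shape=image_shape)
    # Encoder: Conv-BatchNorm-LeakyReLU blocks, each halving the spatial size
    skips, e = [], in_image
    for filters in (64, 128, 256, 512, 512, 512, 512):
        e = Conv2D(filters, (4, 4), strides=(2, 2), padding='same', kernel_initializer=init)(e)
        if filters != 64:               # no BatchNorm on the first encoder block
            e = BatchNormalization()(e)
        e = LeakyReLU(alpha=0.2)(e)
        skips.append(e)
    # Bottleneck at 1x1 spatial resolution
    b = Conv2D(512, (4, 4), strides=(2, 2), padding='same', activation='relu',
               kernel_initializer=init)(e)
    # Decoder: each block doubles the spatial size and concatenates the matching skip
    d = b
    for filters, skip, dropout in zip((512, 512, 512, 512, 256, 128, 64),
                                      reversed(skips),
                                      (True, True, True, False, False, False, False)):
        d = Conv2DTranspose(filters, (4, 4), strides=(2, 2), padding='same',
                            kernel_initializer=init)(d)
        d = BatchNormalization()(d)
        if dropout:
            d = Dropout(0.5)(d)
        d = Concatenate()([d, skip])
        d = Activation('relu')(d)
    # Output layer: tanh keeps generated pixel values in [-1, 1]
    out_image = Conv2DTranspose(3, (4, 4), strides=(2, 2), padding='same',
                                activation='tanh', kernel_initializer=init)(d)
    return Model(in_image, out_image)
```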

After initializing both the generator and the discriminator, the discriminator’s trainable attribute is set to False, since the generator is not trained directly: the two models are combined into a single model and trained together. Inside this combined model only the generator’s weights are updated, driven by the feedback of the frozen discriminator, which lets the generator learn to fool it more effectively.
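One way to sketch this combined model in Keras (the helper name define_gan and its loss_weights parameter are my own; the weighting between the two loss terms is discussed below):

```python
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

def define_gan(g_model, d_model, image_shape=(256, 256, 3), loss_weights=(1, 100)):
    # Freeze the discriminator inside the combined model so that only the
    # generator's weights are updated when this model is trained
    d_model.trainable = False
    in_src = Input(shape=image_shape)
    gen_out = g_model(in_src)               # translated (fake) map image
    dis_out = d_model([in_src, gen_out])    # patch of real/fake predictions
    model = Model(in_src, [dis_out, gen_out])
    # Adversarial term (binary cross-entropy on the patch output) plus an
    # L1 ('mae') term on the generated image, balanced by loss_weights
    model.compile(loss=['binary_crossentropy', 'mae'],
                  optimizer=Adam(learning_rate=0.0002, beta_1=0.5),
                  loss_weights=list(loss_weights))
    return model
```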

We also used the label smoothing technique on our positive labels.

Label smoothing, the act of replacing “hard” label values (i.e., 1 or 0) with “soft” values (e.g., 0.9 or 0.1), often helps the discriminator train by reducing sparse gradients.

This technique was proposed for GANs in Salimans et al. 2016.

Label smoothing is usually most effective when applied only to the positive (real) labels, in which case it is called “one-sided label smoothing”.
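In code this just changes how the label arrays are built; a small sketch, with batch and patch sizes matching the discriminator sketched above:

```python
import numpy as np

batch_size, patch_size = 1, 32   # patch_size matches the discriminator's output resolution

# One-sided label smoothing: real patches are labelled 0.9 instead of a hard 1.0,
# while fake patches keep the hard label 0.0
y_real = np.full((batch_size, patch_size, patch_size, 1), 0.9)
y_fake = np.zeros((batch_size, patch_size, patch_size, 1))
```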

We first train the discriminator on a batch of real images and then on a batch of fake images. Next, the generator model is updated (through the combined model) using the real source images as input, with class labels of 1 (real) and the real target images as the expected outputs.
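A condensed training-loop sketch tying these pieces together (the names are mine, and the real notebook additionally saves and evaluates the generator every few epochs):

```python
import numpy as np

def train(d_model, g_model, gan_model, src_images, tar_images,
          n_epochs=200, batch_size=1, patch_size=32):
    steps_per_epoch = len(src_images) // batch_size
    for step in range(n_epochs * steps_per_epoch):
        # Pick a random batch of real (satellite, map) pairs
        idx = np.random.randint(0, len(src_images), batch_size)
        X_src, X_tar = src_images[idx], tar_images[idx]
        y_real = np.full((batch_size, patch_size, patch_size, 1), 0.9)  # smoothed "real" labels
        # Let the generator translate the same satellite images
        X_gen = g_model.predict(X_src, verbose=0)
        y_fake = np.zeros((batch_size, patch_size, patch_size, 1))
        # 1) Update the discriminator on real pairs, then on generated pairs
        d_model.train_on_batch([X_src, X_tar], y_real)
        d_model.train_on_batch([X_src, X_gen], y_fake)
        # 2) Update the generator through the combined model: label its output as
        #    real and supply the real target maps as the expected second output
        y_ones = np.ones((batch_size, patch_size, patch_size, 1))
        gan_model.train_on_batch(X_src, [y_ones, X_tar])
```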

We’ll be implementing two models: a cGAN with adversarial loss only, and a cGAN with adversarial + L1 loss.
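Using the helpers sketched above, the two variants differ only in the loss weighting; expressing the adversarial-only model as a zero weight on the L1 term is my own shorthand:

```python
d_model = define_discriminator()
g_model = define_generator()

# Basic cGAN: effectively adversarial loss only (zero weight on the L1 term)
gan_basic = define_gan(g_model, d_model, loss_weights=(1, 0))

# cGAN + L1: L1 term weighted 100x the adversarial term, as in the pix2pix paper
gan_l1 = define_gan(g_model, d_model, loss_weights=(1, 100))

# In practice each variant is trained from its own freshly initialized models
```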

Basic cGAN

This generator is updated with the adversarial loss only. The model is trained for 200 epochs with a batch size of 1. We save and evaluate the generator model every 15 epochs. Here are the results:

As you can see, even after 150 epochs of training, the generator is producing blurry outputs. Let’s try adding L1 loss to it.

cGAN (with L1 loss)

This generator is updated with a weighted sum of the adversarial loss and the L1 loss, with the L1 term weighted 100 times the adversarial term (λ = 100), as suggested in the original paper. The model is trained for 200 epochs with a batch size of 1. We save and evaluate the generator model every 20 epochs. The results look as follows:

The result at epoch 20 is already much better than the basic cGAN’s result at epoch 150, though the image is a bit wavy.

Image quality is getting better.

The cGAN with the combination of adversarial and L1 loss gives good results.

Conclusion

  • The cGAN with adversarial loss alone gives blurry results.
  • The cGAN with the combination of adversarial and L1 loss produces clear images, and the generated images are as good as the target images.

References

  1. AppliedAI Course
  2. Generative Adversarial Networks
  3. Understanding GAN Loss Functions
  4. Implementing A GAN in Keras
  5. Image-to-Image Translation with Conditional Adversarial Networks
  6. How to Develop a Pix2Pix GAN for Image-to-Image Translation
  7. Pix2Pix: Image-to-Image Translation in PyTorch & TensorFlow

Note: This case study was done using Python in Jupyter and Kaggle notebooks. All the code is available on my GitHub.
If you have any queries or wish to connect, you can find me on LinkedIn.

The project is live on Heroku. Click here to try it out. (Note: it may take some time to load the website if you’re accessing it for the first time.)
