pix2pix GAN for Generating Maps given Satellite Images using PyTorch

Shashikant Prasad
8 min read · Jun 8, 2022


Introduction

Generative Adversarial Networks (GANs) were first introduced in 2014 by Ian Goodfellow et al., and since then the topic has opened up an entire area of research.

Within a few years, the research community came up with plenty of papers on this topic, some of which have very interesting names :). You have CycleGAN, followed by BiCycleGAN, followed by ReCycleGAN, and so on.

With the invention of GANs, generative models started showing promising results in generating realistic images. GANs have shown tremendous success in Computer Vision, and more recently they have started showing promising results in audio and text as well.

As we know, GANs are generative models. They are quite different from other deep learning models, which are meant for classification (i.e., classifying images into certain classes), detection (i.e., detecting different objects/types of things in a given image), etc. A GAN has two components: a Generator and a Discriminator. The Generator tries to generate realistic samples from a random noise input, while the Discriminator is trained to discriminate between real (good) and fake (generated) inputs. GAN training is essentially a min-max game: the Generator always tries to deceive the Discriminator into classifying its generated samples as real, while the Discriminator always tries to classify the generated samples as fake and the data from the training set as real.

As part of this game, the Discriminator is also trained on real data so it can learn the difference between actual and generated samples. So after the training of a GAN is over, we end up with a Generator that can generate synthetic images which are very close to real ones.
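For reference, this min-max game is formalized by the original GAN objective from Goodfellow et al.:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\!\left[\log\left(1 - D(G(z))\right)\right]
```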

Fig-1 :- Vanilla GAN Architecture

In GANs, it is important to have a Generator and Discriminator of comparable architectural capacity. If we build a GAN with a very powerful Discriminator and a very weak Generator, or vice-versa, we won’t be able to train a good Generator. So, in the end, it is crucial to choose the architectures of our Generator and Discriminator carefully.

Problem

So, we are trying to generate maps from satellite images. For this, we are using the Satellite-Map Dataset. It consists of satellite images and their respective maps. Each satellite image and its map have a pixel-to-pixel correlation, i.e., every pixel in the satellite image is represented in the map. This problem is a bit different from “Style Transfer,” which is another topic of discussion in GANs.

Fig — 2: Sample Image from the dataset, where Satellite Image(left) and Map(right) have a pixel to pixel correlation

For this type of problem, we use “Image-to-Image Translation (I2I)” GANs, and as the name suggests, we need a pixel-to-pixel correlation between the data for this type of problem. In layman’s terms, this pixel-to-pixel correlation simply means that an image of one kind is an exact, aligned representation of an image of a different kind.

So here we can say the satellite image and the map are from two different domains, but they are correlated at a pixel-to-pixel level. This pixel-to-pixel correlation makes it a case of image-to-image translation.

General Architecture for I2I GANs

Let’s first talk about the Generator.

In a “Vanilla GAN,” the Generator is usually a decoder-style model, generating a high-dimensional sample from a low-dimensional noise input. Here things change in two ways. Firstly, our input is no longer noise; instead, we give an image from domain 1 as input (a satellite image in our case). Secondly, our Generator is closer to an AutoEncoder model, i.e., it has both an encoder and a decoder. This Generator architecture makes sense because the encoder enables the Generator to extract the essential features from the domain-1 image, and using this feature space the decoder can recreate an image in domain 2.

Fig — 3: Basic GAN architecture for Image-to-Image Translation

Now for the Discriminator, we would first pass the real images of domain 1 and domain 2, concatenated one on top of the other (channel-wise), to the Discriminator. This helps the Discriminator understand the ground reality, i.e., it trains the Discriminator on real samples. Then we would concatenate the generated image of domain 2 with the real image of domain 1. Based on the Discriminator’s ability to distinguish this sample as fake, the loss is backpropagated.
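As a tiny illustration (the tensor shapes here are just examples), the two images are stacked along the channel dimension before being fed to the Discriminator:

```python
import torch

satellite = torch.randn(1, 3, 256, 256)   # image from domain 1
map_image = torch.randn(1, 3, 256, 256)   # real or generated image from domain 2
disc_input = torch.cat([satellite, map_image], dim=1)  # shape (1, 6, 256, 256)
```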

pix2pix is one such GAN that works very well for Image-to-Image translation.

Now we would start coding.

pix2pix’s Generator

As we know, the architecture of the Generator in pix2pix is an encoder-decoder one, much like a U-Net architecture. In the U-Net architecture, the encoder and decoder are mirror images of each other. Also, we add a bottleneck layer between the encoder and decoder.

Fig — 4: Example of a Basic U-Net Architecture (Ref:- https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/u-net-architecture.png)

In our case, we would have six convolution blocks in our encoder. Each of these blocks doubles the number of filters of its previous block while halving the spatial dimensions of the image. E.g., a 64x64 feature map becomes 32x32 after one such convolution block. These blocks consist of a convolutional layer followed by a batch-normalization layer and an activation function, specifically LeakyReLU.

After the encoder, we would have a bottleneck layer in which we won’t increase the number of filters. It acts as a channel to pass the encoded features to the decoder.

Now, since the decoder is the mirror image of the encoder, each decoder block doubles the spatial size of the feature map. But there is a slight change in the decoder’s input: for each decoder block, we concatenate the output of the previous decoder block with the output of its mirror encoder block (represented by the gray arrows in Figure 4).

Now, let’s see the code for the Generator.
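Below is a minimal sketch of such a U-Net-style Generator in PyTorch. The Block module, the layer names, and the 256x256 input size are assumptions for illustration, and, as in the original pix2pix implementation, the filter count is capped at 512 rather than doubling at every single block; the exact code in the repository may differ.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One convolution block: Conv (down) or ConvTranspose (up) + BatchNorm + activation."""
    def __init__(self, in_ch, out_ch, down=True, use_dropout=False):
        super().__init__()
        conv = (nn.Conv2d(in_ch, out_ch, 4, 2, 1, bias=False, padding_mode="reflect")
                if down else
                nn.ConvTranspose2d(in_ch, out_ch, 4, 2, 1, bias=False))
        self.conv = nn.Sequential(
            conv,
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2) if down else nn.ReLU(),
        )
        self.dropout = nn.Dropout(0.5) if use_dropout else nn.Identity()

    def forward(self, x):
        return self.dropout(self.conv(x))

class Generator(nn.Module):
    def __init__(self, in_channels=3, features=64):
        super().__init__()
        # Encoder: each block halves the spatial size and increases the filters (capped at 512).
        self.initial = nn.Sequential(
            nn.Conv2d(in_channels, features, 4, 2, 1, padding_mode="reflect"),
            nn.LeakyReLU(0.2),
        )                                                  # 256 -> 128
        self.down1 = Block(features,     features * 2)     # 128 -> 64
        self.down2 = Block(features * 2, features * 4)     # 64  -> 32
        self.down3 = Block(features * 4, features * 8)     # 32  -> 16
        self.down4 = Block(features * 8, features * 8)     # 16  -> 8
        self.down5 = Block(features * 8, features * 8)     # 8   -> 4
        # Bottleneck: the number of filters is not increased further.
        self.bottleneck = nn.Sequential(
            nn.Conv2d(features * 8, features * 8, 4, 2, 1), nn.ReLU()
        )                                                  # 4 -> 2
        # Decoder: mirror of the encoder; inputs are concatenated with the
        # matching encoder outputs (skip connections), hence the doubled channels.
        self.up1 = Block(features * 8,     features * 8, down=False, use_dropout=True)
        self.up2 = Block(features * 8 * 2, features * 8, down=False, use_dropout=True)
        self.up3 = Block(features * 8 * 2, features * 8, down=False)
        self.up4 = Block(features * 8 * 2, features * 4, down=False)
        self.up5 = Block(features * 4 * 2, features * 2, down=False)
        self.up6 = Block(features * 2 * 2, features,     down=False)
        self.final = nn.Sequential(
            nn.ConvTranspose2d(features * 2, in_channels, 4, 2, 1),
            nn.Tanh(),  # output image in [-1, 1]
        )

    def forward(self, x):
        d1 = self.initial(x)
        d2 = self.down1(d1)
        d3 = self.down2(d2)
        d4 = self.down3(d3)
        d5 = self.down4(d4)
        d6 = self.down5(d5)
        b  = self.bottleneck(d6)
        u1 = self.up1(b)
        u2 = self.up2(torch.cat([u1, d6], dim=1))
        u3 = self.up3(torch.cat([u2, d5], dim=1))
        u4 = self.up4(torch.cat([u3, d4], dim=1))
        u5 = self.up5(torch.cat([u4, d3], dim=1))
        u6 = self.up6(torch.cat([u5, d2], dim=1))
        return self.final(torch.cat([u6, d1], dim=1))
```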

pix2pix’s Discriminator

The architecture of the pix2pix Discriminator is pretty similar to a classifier, where we are trying to classify the given input as real or fake. While providing the input to the Discriminator, we need to concatenate the images from the two different domains (i.e., the satellite image and its respective map image).

The research paper suggests using four convolution blocks with specific numbers of filters, i.e., 64, 128, 256, and 512. Each “convolution block” in the Discriminator consists of one convolutional layer, one batch-normalization layer, and one LeakyReLU activation.
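A minimal sketch of this PatchGAN-style Discriminator is shown below; the CNNBlock name, the 256x256 input size, and the final single-channel patch output are assumptions based on the paper, and the repository code may differ slightly.

```python
import torch
import torch.nn as nn

class CNNBlock(nn.Module):
    """Conv + BatchNorm + LeakyReLU, as described above."""
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 4, stride, 1, bias=False, padding_mode="reflect"),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2),
        )

    def forward(self, x):
        return self.conv(x)

class Discriminator(nn.Module):
    def __init__(self, in_channels=3, features=(64, 128, 256, 512)):
        super().__init__()
        # The two domain images are concatenated channel-wise, hence in_channels * 2.
        self.initial = nn.Sequential(
            nn.Conv2d(in_channels * 2, features[0], 4, 2, 1, padding_mode="reflect"),
            nn.LeakyReLU(0.2),
        )
        layers = []
        in_ch = features[0]
        for out_ch in features[1:]:
            layers.append(CNNBlock(in_ch, out_ch, stride=1 if out_ch == features[-1] else 2))
            in_ch = out_ch
        # Final 1-channel map of real/fake scores (one score per patch).
        layers.append(nn.Conv2d(in_ch, 1, 4, 1, 1, padding_mode="reflect"))
        self.model = nn.Sequential(*layers)

    def forward(self, x, y):
        # x: image from domain 1 (satellite), y: real or generated map image.
        return self.model(self.initial(torch.cat([x, y], dim=1)))
```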

Dataset Preparation

Since we are working on a satellite-image-to-map generator, each image in the available dataset contains a satellite image and its respective map image side by side.

Each image in the dataset has shape (1200, 600, 3). So first, we need to split each image so that the data loader returns it as a (satellite_image, map_image) pair. We also apply some basic augmentation to the input to make our Generator more robust; the augmentations are entirely optional.

For this purpose, we would write a separate script, dataset.py.
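A minimal sketch of what dataset.py could look like is shown below. The folder layout, the 256x256 resize, and the horizontal-flip augmentation are assumptions; the repository may use a different layout and different augmentations.

```python
import os
import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset
import torchvision.transforms.functional as TF

class MapDataset(Dataset):
    def __init__(self, root_dir, image_size=256, augment=True):
        self.root_dir = root_dir
        self.files = sorted(os.listdir(root_dir))
        self.image_size = image_size
        self.augment = augment

    def __len__(self):
        return len(self.files)

    def __getitem__(self, index):
        img = Image.open(os.path.join(self.root_dir, self.files[index])).convert("RGB")
        arr = np.array(img)                               # satellite | map, side by side
        width = arr.shape[1] // 2
        satellite = Image.fromarray(arr[:, :width, :])    # left half: satellite image
        target_map = Image.fromarray(arr[:, width:, :])   # right half: map image

        satellite = TF.resize(satellite, [self.image_size, self.image_size])
        target_map = TF.resize(target_map, [self.image_size, self.image_size])

        # Optional augmentation: apply the same random flip to both images.
        if self.augment and torch.rand(1).item() < 0.5:
            satellite = TF.hflip(satellite)
            target_map = TF.hflip(target_map)

        # Convert to tensors in [-1, 1] to match the Generator's Tanh output.
        satellite = TF.to_tensor(satellite) * 2 - 1
        target_map = TF.to_tensor(target_map) * 2 - 1
        return satellite, target_map
```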

Training

Now we are all set to train our pix2pix GAN.

First, we need to create objects of our Generator and Discriminator classes and initialize an optimizer for each. We would be using the Adam optimizer for this. At the same time, we would initialize our loss functions. We would be using Binary Cross-Entropy loss and L1 loss at the appropriate places.
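A minimal sketch of this setup is shown below, assuming the Generator, Discriminator, and MapDataset classes sketched earlier; the learning rate, betas, batch size, and the L1 weight L1_LAMBDA are common pix2pix choices, not values taken from the repository.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
LEARNING_RATE = 2e-4
L1_LAMBDA = 100  # weight of the L1 term in the Generator loss (assumption, as in the paper)

gen = Generator(in_channels=3).to(DEVICE)
disc = Discriminator(in_channels=3).to(DEVICE)

# Adam optimizers, one for each network.
opt_gen = torch.optim.Adam(gen.parameters(), lr=LEARNING_RATE, betas=(0.5, 0.999))
opt_disc = torch.optim.Adam(disc.parameters(), lr=LEARNING_RATE, betas=(0.5, 0.999))

bce = nn.BCEWithLogitsLoss()  # adversarial loss (the Discriminator outputs raw logits)
l1_loss = nn.L1Loss()         # pixel-wise loss between generated and real maps

loader = DataLoader(MapDataset("maps/train"), batch_size=16, shuffle=True)
```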

Training pix2pix is very similar to training a vanilla GAN. So first, we would fetch a satellite image (an image from domain 1) and pass it to our Generator. The Generator would produce an image of domain 2 (a map image), i.e., the generated map image or y_fake.

Fig 5:- Image from domain 1 (Satellite Image) given to the Generator

Now, it’s time for our Discriminator to get a glimpse of the real world. We would pass the concatenated real satellite and map images to our Discriminator.

Fig 6:- Real data are given to Discriminator to classify them as real or 1.

Now that the Discriminator has seen some real data, we would calculate the Binary Cross-Entropy loss (BCE loss) for it. We want the Discriminator to classify all the real data as 1.

After getting the generated image from the Generator, we would concatenate the actual satellite image (image of domain 1) with its respective generated map image (image of domain 2). This concatenated sample would be fed to the Discriminator as a fake one, and we would calculate the BCE loss for the same. Here, our aim is for the Discriminator to classify it as 0, since these samples are fake.

Fig 7:- Discriminator is fed with satellite images and generated map images in a concatenated fashion.

The total loss for the Discriminator would equal the average of the losses on real and fake data. We would then backpropagate this loss and let the optimizer update all the Discriminator’s weights.

Now it’s time to train our Generator based on its ability to deceive our Discriminator. The Generator always wants the Discriminator to classify its generated images as real instead of fake. For this, we would again give the fake data to our Discriminator, but this time, for the sake of calculating the Generator’s loss, we want the Discriminator to classify it as real or 1. We would calculate the BCE loss for the same.

To give our Generator better clarity about the actual image of domain 2 (the map image), we would also backpropagate the L1 loss between the generated map image and the actual map image for the respective satellite image. To better understand L1 loss, you can refer to this page. So the total loss for the Generator would be a combination of the Discriminator’s performance on fake data (BCE loss) and the L1 loss (i.e., the dissimilarity between the generated map image and the actual map image for a satellite image). After that, we can backpropagate this loss and update the weights of our Generator.

So, finally, the whole training function would look as follows:-
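Here is a minimal sketch of one training epoch following the steps above, reusing the objects from the setup sketch. Note that the Generator’s L1 term is weighted by l1_lambda (the paper uses a weight of 100); the exact combination used in the repository may differ.

```python
import torch

def train_one_epoch(gen, disc, loader, opt_gen, opt_disc, bce, l1_loss,
                    l1_lambda=100, device="cuda"):
    for satellite, real_map in loader:
        satellite, real_map = satellite.to(device), real_map.to(device)

        # ---- Train the Discriminator ----
        y_fake = gen(satellite)
        d_real = disc(satellite, real_map)         # should be classified as real (1)
        d_fake = disc(satellite, y_fake.detach())  # should be classified as fake (0)
        d_real_loss = bce(d_real, torch.ones_like(d_real))
        d_fake_loss = bce(d_fake, torch.zeros_like(d_fake))
        d_loss = (d_real_loss + d_fake_loss) / 2   # average of real and fake losses

        opt_disc.zero_grad()
        d_loss.backward()
        opt_disc.step()

        # ---- Train the Generator ----
        d_fake = disc(satellite, y_fake)           # this time we want D to say "real" (1)
        g_adv_loss = bce(d_fake, torch.ones_like(d_fake))
        g_l1_loss = l1_loss(y_fake, real_map) * l1_lambda
        g_loss = g_adv_loss + g_l1_loss

        opt_gen.zero_grad()
        g_loss.backward()
        opt_gen.step()
```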

Training Results

So after our training is over, we would come up with a generator that can produce a map given a satellite image.

Here are the results of maps generated by our Generator while training.

Fig 8 :- Satellite Image(left), Real Map Image(middle) and Generated Map Image(right) after Epoch 1
Fig 9 :- Satellite Image(left), Real Map Image(middle) and Generated Map Image(right) after Epoch 100.
Fig 10 :- Satellite Image(left), Real Map Image(middle) and Generated Map Image(right) after Epoch 400.
Fig 11 :- Satellite Image(left), Real Map Image(middle) and Generated Map Image(right) after Epoch 800.

We could expect even better results, but I have only trained it for 800 epochs. Training it even further should give an even more powerful Generator.

Fig 12: Generator Loss Vs. Discriminator Loss during training

Summary

In this post, we covered pix2pix in detail along with its implementation in PyTorch.

For complete code, use this GitHub Repository.

If you have any questions, you can comment down below or reach out on LinkedIn or Twitter. I would love to help you out.

References

1. Image-to-Image Translation with Conditional Adversarial Networks

2. DCGANs using PyTorch
