# Artificial Colorization of Grayscale Satellite Imagery via GANs: Part 1

# Introduction

In this blog post we introduce a new method to artificially colorize grayscale satellite imagery.

Our approach uses Generative Adversarial Networks, and thus differs from most other recent deep learning colorization techniques [1,2,3,4,5]. We explore the use of GANs for colorizing grayscale satellite imagery because, unlike other methods, such an architecture generalizes to the similar problem of generating the additional bands found in 8-band multispectral satellite images from their underlying 3-band RGB color images.

We will first motivate why one may want to colorize satellite imagery, and then provide a high-level technical overview of our algorithm.

In an upcoming blog post, we will show that MNC, an object detection algorithm, is better at extracting building footprints in artificially colored images than in their underlying grayscale images. Specifically, we will show that MNC trained on real color satellite images and evaluated on artificially colored satellite images has a significantly higher F1 score for building footprints than MNC trained on grayscale satellite images and evaluated on grayscale satellite images.

# SpaceNet Data

The dataset used in this project is the second SpaceNet dataset, which provides satellite imagery for four different cities (Las Vegas, Paris, Shanghai, Khartoum) with attendant GeoJSON labels for building footprints. The imagery comprises 30 cm GSD grayscale imagery, as well as 30 cm 3-band RGB color imagery and 30 cm 8-band VNIR multispectral imagery.

# Why Colorize Grayscale Satellite Imagery?

Colorizing grayscale satellite images can provide value by:

- Increasing the performance of object detection algorithms like MNC in satellite imagery. Thus, if a grayscale satellite image is colorized, then its potential value for use in machine learning algorithms increases.
- Making digital maps created from grayscale images easier to see with the human eye.
- Augmenting small training data sets of color satellite images with additional artificially colored images.
- Suggesting that one can intelligently compress color satellite images with deep learning. Given a bandwidth-limited remote sensing situation, one could transmit grayscale images and recreate color images from the received grayscale images.
- Providing motivation for generating 8-band multispectral satellite imagery from 3-band RGB color satellite imagery by using a similar algorithm. We will refer to this process as “multispectralizing” color satellite imagery.

# Recent Research on Colorization with Deep Learning

Artificially coloring grayscale images using deep learning has produced several compelling results. *It is important to note that the goal of colorization is not to recover the actual ground truth color, but rather, to produce a plausible colorization that the user finds useful even if the colorization differs from the ground truth color.*

Colorization can seem like a daunting task because so much information is lost (two out of three color dimensions) in converting a color image to its underlying grayscale representation. Because grass is usually green and the sky is usually blue, the semantics of an image scene provide many clues for a plausible colorization. It also seems reasonable that deep learning has the potential to be a successful tool for colorization because it already takes advantage of scene semantics for image classification and object detection.
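The information loss is easy to make concrete. A standard luma conversion (the Rec. 601 weights below are one common convention, and an assumption here) collapses three color channels into one, so many distinct colors map to the same gray value:

```python
import numpy as np

# Rec. 601 luma weights: one common RGB -> grayscale convention.
LUMA = np.array([0.299, 0.587, 0.114])

def to_grayscale(rgb):
    # Collapse the last (channel) axis: three numbers per pixel become
    # one, so two of the three color dimensions are discarded.
    return rgb @ LUMA
```

Any colorization algorithm must invert this many-to-one mapping, which is why scene semantics (grass, sky, roads) are essential clues.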

We briefly describe three recent papers on colorization using deep learning:

- *Deep Colorization* by Cheng, Yang, and Sheng (2016) describes a neural network for colorization whose input is a grayscale image and whose output is a color image in YUV color space. This paper formulates colorization as a regression problem: the loss function is framed as a least squares minimization between the predicted and ground-truth color pixel values.
- *Colorful Image Colorization* by Zhang, Isola, and Efros (2016) describes a convolutional neural network (CNN) whose input is a lightness grayscale image (L) and whose output is a distribution over quantized color values in CIE Lab color space, thus framing colorization as a classification task. Lab color space is an alternative to standard RGB for representing pixel colors. It is useful here because the L channel is statistically independent from the pure color a-b channels.
- *Image-to-Image Translation with Conditional Adversarial Networks* by Isola, Zhu, Zhou, and Efros introduces conditional adversarial networks as a general-purpose solution to image-to-image translation problems like reconstructing objects from edge maps, converting a daytime photo to a nighttime photo, and even colorizing grayscale photos. Our approach to colorization has philosophical similarities to this paper, but we use somewhat different network architectures.

There are many other great papers on colorization. Many of them can be found in the references of the three papers above.

In the next section we describe our colorization algorithm. Our approach differs from the three colorization papers reviewed above because we seek a colorization algorithm whose architecture is easily generalizable to “multispectralization”: the related task of generating an 8-band multispectral image from a 3-band RGB image. For instance, we are unaware of multispectral analogs of YUV or Lab color spaces.

# Generative Adversarial Networks and Colorization

Generative Adversarial Networks (or GANs) are composed of two neural networks: a generative model G that tries to model the real data distribution, and a discriminative model D that estimates the probability that a data sample is real data rather than fake. By fake data we mean data that is the output of our generative model G.

In our case, think of the generative model G as a colorization algorithm: i.e., as trying to produce a 3-band RGB color image from a 1-band grayscale image. Think of the discriminative model D as trying to model the probability that a color image is real or the result of applying G to a grayscale image.

Given the underlying grayscale image *x* of a training RGB color image *y*, Deep Colorization would try to minimize the mean squared error between G(*x*) and *y*. Meanwhile, Colorful Image Colorization frames this as a classification problem to create even more vibrant colorizations. In fact, the Colorful Image Colorization algorithm is evaluated with a “colorization Turing test,” asking human participants to choose between a generated and a ground-truth color image. This method successfully fools humans on 32% of the trials.

A natural next question is whether this human Turing test can itself be performed by a neural network. The discriminative network D in a GAN is defined precisely for this task! It is a neural network whose output is the probability that the input data is real rather than fake. This gives a training goal for G that is aligned with the human Turing test: the generator network G tries to maximize the probability that D cannot distinguish between real and fake data.

A comparison of Deep Colorization and a GAN approach to colorization leads to a comparison of the following two training objectives:

- Training G to minimize the mean squared error of G(*x*) and *y*, where *x* is a grayscale image and *y* is a ground-truth color image.
- Training G to maximize the probability that the discriminator network D cannot tell the difference between real and fake color images.
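The contrast between these two objectives can be made concrete in a small sketch. The negative-log form of the adversarial objective below is one common choice, not necessarily the exact form used here:

```python
import numpy as np

def regression_loss(g_x, y):
    # Deep Colorization-style objective: mean squared error between
    # the generated colors G(x) and the ground-truth colors y.
    return float(np.mean((g_x - y) ** 2))

def adversarial_generator_objective(d_scores):
    # GAN-style objective: d_scores holds D's estimated probabilities
    # that G's outputs are real. G wants these probabilities near 1,
    # which drives this negative-log-likelihood quantity toward 0.
    eps = 1e-8  # numerical guard against log(0)
    return float(-np.mean(np.log(d_scores + eps)))
```

Note that the regression loss compares G(*x*) to a specific ground truth *y*, while the adversarial objective only asks whether G(*x*) looks real to D.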

We conjecture that training a GAN is correlated to minimizing the mean squared error between G(*x*) and *y*, but a detailed study of this correlation is beyond the scope of this blog post.

# Our Approach to Colorization

We now describe our approach to colorizing satellite imagery. Our generative network for colorization is similar to Patrick Hagerty’s super-resolution neural network [1,2,3].

## Our GAN Architecture

We first describe the architecture of our generative network G. Our training network takes as input *x* (a 64x64 pixel sliding window of a grayscale image) and consists of three outer layers. Each outer layer contains two convolutional layers, one deconvolutional layer, and one ReLU layer. The input to an outer layer is a convex combination of the outputs of the previous outer layer and the second-previous outer layer. A picture of the generative network is shown below:
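As a rough PyTorch sketch of this wiring, something like the following would realize the outer-layer structure; the channel width, 3x3 kernels, stride-1 deconvolution, and the fixed mixing weight `alpha` are all illustrative assumptions, not the exact configuration used here:

```python
import torch
import torch.nn as nn

class OuterLayer(nn.Module):
    # One "outer layer": two convolutions, one deconvolution, one ReLU.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        # Stride-1 deconvolution keeps the 64x64 spatial size.
        self.deconv = nn.ConvTranspose2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.deconv(self.conv2(self.conv1(x))))

class Generator(nn.Module):
    def __init__(self, channels=32, alpha=0.5):
        super().__init__()
        self.stem = nn.Conv2d(1, channels, 3, padding=1)   # grayscale in
        self.outer = nn.ModuleList([OuterLayer(channels) for _ in range(3)])
        self.head = nn.Conv2d(channels, 3, 3, padding=1)   # RGB out
        self.alpha = alpha  # convex-combination weight (assumed fixed)

    def forward(self, x):
        prev2 = prev1 = self.stem(x)
        for layer in self.outer:
            # Each outer layer sees a convex combination of the two
            # previous outer-layer outputs.
            mixed = self.alpha * prev1 + (1 - self.alpha) * prev2
            prev2, prev1 = prev1, layer(mixed)
        return self.head(prev1)
```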

Next, we describe the discriminator network D. Our discriminator network takes as input *y* (a 64x64 sliding window of a real or fake color image) and passes this color image through five convolutional layers with 64, 128, 256, 512, and 1024 filters, respectively. Each convolutional layer is followed by a 2x2 max pooling layer. The last layer of the discriminator network is a fully connected layer (*without* an activation function) whose output is a single number. This single number is interpreted as the probability that the input color image is real rather than generated (i.e., fake). A picture of the discriminator network is shown below:
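A PyTorch sketch of such a discriminator follows. The filter counts, 2x2 pooling, and activation-free final layer match the description above; the 3x3 kernels and the ReLU after each convolution are assumptions. The raw output is a logit, squashed to a probability (e.g., with a sigmoid) when computing the loss:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        layers, in_ch = [], 3
        for out_ch in (64, 128, 256, 512, 1024):
            layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1),
                       nn.ReLU(),          # assumed nonlinearity
                       nn.MaxPool2d(2)]    # 2x2 max pooling per conv
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        # A 64x64 input is halved five times -> 2x2 spatial grid.
        self.fc = nn.Linear(1024 * 2 * 2, 1)  # no activation; raw score

    def forward(self, y):
        h = self.features(y)
        return self.fc(h.flatten(1))
```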

## Training our GAN

We employ the following procedure to train the networks G and D.

Let *y* be a real color image and let *x* be a grayscale image, so that G(*x*) is a generated (i.e., fake) color image. Then the discriminator tries to maximize the following function:

log D(*y*) + log(1 − D(G(*x*)))
Note that trying to maximize this function is equivalent to trying to maximize the probability that D assigns the value 1 to real color images and the value 0 to fake color images.

The generator tries to minimize the following function:

log(1 − D(G(*x*)))
Minimizing this function is equivalent to trying to *maximize* the probability that D confuses a fake color image for a real color image; i.e., the generative model is trying to produce images that are as “realistic” as possible through the lens of D.

GAN training is implemented using alternating training steps (with stochastic gradient descent) between the discriminator and the generator.
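The alternating scheme can be sketched as a generic GAN training step. Using binary cross-entropy on D's raw scores is an assumption about the exact loss form, and the step below is a sketch rather than our exact training code:

```python
import math
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, x_gray, y_color):
    """One alternating training step; any G/D modules work.

    G maps a grayscale batch to RGB; D maps an RGB batch to one raw
    score per image, read as a logit for "this image is real".
    """
    # --- Discriminator update: push real -> 1, fake -> 0 ---
    opt_d.zero_grad()
    fake = G(x_gray).detach()  # do not backprop into G on D's step
    real_score = D(y_color)
    fake_score = D(fake)
    d_loss = (F.binary_cross_entropy_with_logits(
                  real_score, torch.ones_like(real_score))
              + F.binary_cross_entropy_with_logits(
                  fake_score, torch.zeros_like(fake_score)))
    d_loss.backward()
    opt_d.step()

    # --- Generator update: make D read fakes as real ---
    opt_g.zero_grad()
    fake_score = D(G(x_gray))
    g_loss = F.binary_cross_entropy_with_logits(
        fake_score, torch.ones_like(fake_score))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```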

Since GAN training can be unstable, we attempt to stabilize training by including a regularization term in both the discriminator loss function and the generator loss function. The goal of the regularization term in the discriminator loss function is to keep the discriminator’s weights small, while the goal of the regularization term in the generator loss function is to keep the generated colorizations somewhat close to the ground truth colorizations during early training iterations.
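As an illustration of these two regularizers, here is a minimal NumPy sketch; the L2 form, the linear decay schedule, and all coefficients are assumptions chosen for illustration, not our exact hyperparameters:

```python
import numpy as np

def d_regularizer(d_weights, lam=1e-4):
    # Keep the discriminator's weights small via an L2 penalty
    # (the penalty form and coefficient are assumptions).
    return lam * sum(float(np.sum(w ** 2)) for w in d_weights)

def g_regularizer(g_x, y, step, w0=1.0, decay_steps=10000):
    # Pull generated colors toward the ground truth early in training;
    # the pull fades to zero (linear schedule is an assumption).
    w = w0 * max(0.0, 1.0 - step / decay_steps)
    return w * float(np.mean((g_x - y) ** 2))
```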

We train four different colorization GANs on Las Vegas, Paris, Shanghai, and Khartoum. We now display the resulting colorizations of these four models on large GeoTIFF images. Each of the four images below is split down the middle, with the real RGB color satellite image on the left and the artificial colorization of a grayscale image on the right. If the difference is difficult to see, that means the GAN is producing a realistic colorization.

Note that if one were to *instead* show a real color image and an artificial color image over the same exact area side-by-side, the human eye would try to pick out the differences between the two images. Since the goal of colorization is to produce a plausible colorization, and not necessarily to recover the ground truth, this visualization would be counter-productive.

**Warning:** Although the four large images above look very compelling, some of the finer details of these images have non-naturalistic colorings. For example, below is an example image over Las Vegas whose artificial colorization is clearly non-naturalistic: some of the streets have a strange reddish color and the building rooftops are not uniformly colored.

In summary, our colorizations tend to have an airbrushed effect and tend to be somewhat biased towards grayish hues due to an averaging effect.

# Going Further

This blog post represents a first attempt to colorize grayscale satellite imagery via a method that is potentially generalizable to multispectralizing 3-band RGB satellite imagery.

Training a GAN is computationally expensive, and we have several possible directions for improvement:

- We currently train our colorization GAN on 64x64 pixel-sized sliding windows. Training on larger image windows would allow the network to see even more semantic image information (like an entire building rather than just a piece of a building). This should reduce the airbrushed effect shown above.
- We stopped training on each city at approximately 200,000 iterations (2–3 days of training on our Nvidia DevBox). We may see significant improvement if we allow our networks to train for millions of iterations.
- It is possible that applying a classification approach to colorization (as described in Colorful Image Colorization) would be helpful in creating more vibrant colorizations.
- The methods described here for generating 3-band color images from 1-band grayscale images should be directly related to the problem of generating 8-band multispectral images from 3-band color images. This will be explored in more detail in an upcoming blog post.

**Acknowledgements:** We thank Patrick Hagerty and Adam Van Etten for numerous helpful conversations.