
Building an Image Colorization Neural Network — Part 1: Generative Models and Autoencoders

George Kamtziridis
5 min read · Aug 28, 2022

--

During my postgraduate studies in Artificial Intelligence, I had the chance to work on Artificial Neural Networks in great detail. One of the assignments was based on Convolutional Neural Networks, where I had the freedom of choosing a problem and then providing a solution. I was, and still am, really interested in generative models, so I picked the image colorization task. In image colorization the goal is to build a model capable of applying realistic color to grayscale images. In this article, I will guide you through the process of creating a generative model that utilizes Convolutional Neural Networks to colorize images.

Before we begin describing the implementation, we need to lay its foundation. For this reason, the process is split into the following 4 parts:

  1. Part 1 (Current): Outlines the basics of generative models and Autoencoders.
  2. Part 2: Showcases the fundamental concepts around Artificial Neural Networks.
  3. Part 3: Presents the basic knowledge of Convolutional Neural Networks.
  4. Part 4: Describes the implementation of the actual model.

Disclaimer: This is not a tutorial in any way. It provides some rudimentary knowledge, but the main goal is to showcase how one can build such a model.

Generative Models

First of all, let’s make sure we have a good understanding of when a model is labeled as generative. Practically, there are two different types of models: generative and discriminative. A discriminative model is built with the aim of successfully discriminating between different kinds of data samples. The most common discriminative scenario is binary classification, where each data sample has a target value belonging to one of two distinct classes. Classifying whether an image contains a cat or a dog is a basic example. However, discriminative models can also make distinctions between more than two classes. Along the lines of the previous example, classifying whether an image contains a cat, a dog, a human, anything else you like, or even a combination of those (multi-label classification) is also considered a discrimination task. Putting it more formally, a discriminative model captures the conditional probability P(Y|X), which can be read as “given X (the features), what is the probability of Y (the target value)?”.

In contrast, generative models try to generate new data instances. For instance, a model that generates an image based on a written prompt (like DALL·E) or a model that generates summaries when provided with a full text (like GPT-3) are two popular cases of generative models. Basically, such models can create “new” data based on a prompt and, of course, all the data they have seen during training. A more formal definition is that generative models capture the joint probability P(X, Y), where X is the input and Y the label, or simply P(X) in cases where there are no labels. The model tries to learn the distribution of the actual data in order to get a “good” understanding of how the training dataset was generated.

There are many types of generative models, such as Generative Adversarial Networks (GANs) and Variational Autoencoders, to name a few. For the image colorization task I chose to build a generative model based on the simple Autoencoder approach.

Autoencoders

The basic principle of Autoencoders is that they consist of 2 distinct parts: the encoder and the decoder. The encoder receives the input and transforms it into a new space, known as the latent space, while the decoder takes this new representation and transforms it back to the original space. To express it mathematically, the encoder f transforms data from space X to space F and the decoder g transforms data from space F back to space X.

Illustration of the transformation of an Autoencoder
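In symbols, writing x for an input sample and x̂ for its reconstruction, the two mappings compose as:

f : X \to F, \qquad g : F \to X, \qquad \hat{x} = g(f(x)) \approx x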

Generally, the latent space is of lower dimension compared to the input, so the encoder performs dimensionality reduction. Through this process the autoencoder is forced to keep the most valuable information from the initial space and, thus, capture essential dependencies. But how do we measure the performance of an autoencoder? One way of doing that is by calculating the quadratic reconstruction loss. If q is the data distribution over X, we need to minimize the following term:

\mathbb{E}_{x \sim q}\left[\, \lVert x - g(f(x)) \rVert^{2} \,\right]

Reconstruction error of the Autoencoder

More specifically, if the mappings f and g have trainable weights w_f and w_g, then we need to find the weights that minimize the following loss:

\min_{w_f,\, w_g} \; \mathbb{E}_{x \sim q}\left[\, \lVert x - g(f(x;\, w_f);\, w_g) \rVert^{2} \,\right]

Deep Autoencoders

To solve the image colorization problem we must adjust the autoencoder to utilize Neural Networks, since Convolutional Neural Networks work extremely well on computer vision tasks. The result is the Deep Autoencoder, in which both the encoder and the decoder are separate Convolutional Neural Networks. The encoder receives the input image and passes it through a sequence of convolutional layers. During this process, the input gets smaller and smaller until we reach the final convolutional layer. At this stage, the image has been transformed into a lower dimensional space. On the other hand, the decoder takes the compressed input and aims to recreate the initial input by passing it through consecutive transposed convolutional (upsampling) layers. If the training set is large enough, the autoencoder can learn to generate new instances by sampling points from the latent space.

Deep Autoencoder with multiple layers.
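To make this more concrete, below is a minimal sketch of such a Deep Autoencoder in Keras. The input size (32x32 grayscale), the filter counts and the number of layers are illustrative assumptions for this sketch, not the exact architecture built later in the series.

```python
from tensorflow.keras import layers, models

# Encoder: strided convolutions progressively shrink the image
# into a lower dimensional (latent) representation.
inputs = layers.Input(shape=(32, 32, 1))
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)   # 16x16
x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)        # 8x8
latent = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)  # 4x4

# Decoder: transposed convolutions expand the latent representation
# back to the original spatial size.
x = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(latent)     # 8x8
x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)          # 16x16
outputs = layers.Conv2DTranspose(1, 3, strides=2, padding="same", activation="sigmoid")(x)  # 32x32

autoencoder = models.Model(inputs, outputs)
# Mean squared error corresponds to the quadratic reconstruction loss above.
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.summary()
```

Compiling with a mean squared error loss is simply the quadratic reconstruction error from the previous section, applied pixel by pixel.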

However, in the image colorization problem, where the input is a grayscale image, we do not want to strictly recreate the input, but to enhance it in a sense. The images have a shape of (channels, width, height). The input images contain only 1 channel, but their colorized versions contain 3 channels: red, green and blue. So, compared to its input, the autoencoder has to produce 2 additional channels of color information. To simplify this, instead of using the RGB format we can use the LAB format, where L indicates the lightness, a the balance between green and magenta and b the balance between blue and yellow. Although LAB uses the same number of values to represent a pixel, the L channel is essentially our input image. In other words, by using the LAB format the autoencoder only has to generate 2 channels, a and b, since the third one, L, comes directly from the input.
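As a rough illustration of the LAB idea, here is one way to split an RGB training image into the L input channel and the ab target channels with scikit-image. The helper names and the scaling constants are assumptions made for this sketch, not the exact preprocessing used later in the series.

```python
import numpy as np
from skimage.color import rgb2lab, lab2rgb

def split_lab(rgb_image):
    """rgb_image: float array in [0, 1] with shape (height, width, 3)."""
    lab = rgb2lab(rgb_image)       # L in [0, 100], a and b roughly in [-128, 127]
    L = lab[..., 0:1] / 100.0      # model input: 1 channel, scaled to [0, 1]
    ab = lab[..., 1:3] / 128.0     # model target: 2 channels, scaled to about [-1, 1]
    return L, ab

def merge_lab(L, ab):
    """Recombine the (predicted) ab channels with the input L channel."""
    lab = np.concatenate([L * 100.0, ab * 128.0], axis=-1)
    return lab2rgb(lab)            # back to RGB for visualization
```

During training the L channel is fed to the network and the two ab channels are the targets; at inference time something like merge_lab pastes the predicted colors back onto the original lightness.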

And that concludes the first part of this series. In the next one, we will take a look at Artificial Neural Networks. Stay tuned!


George Kamtziridis

Full Stack Software Engineer and Data Scientist at fromScratch Studio. BEng, MEng (Electrical Engineering/Computer Engineering) MSc (Artificial Intelligence)