Image generated by Stable Diffusion

Building an Image Colorization Neural Network — Part 3: Convolutional Neural Networks

George Kamtziridis
6 min read · Sep 12, 2022

Hello and welcome back to the third part of this series, where we are attempting to colorize black and white images with a neural network. If you haven’t already read the first 2 parts, in which we analyze the basics of autoencoders and artificial neural networks, make sure to do so before moving on (links below).

The entire series consists of the following 4 parts:

  1. Part 1: Outlines the basics of generative models and Autoencoders.
  2. Part 2: Showcases the fundamental concepts around Artificial Neural Networks.
  3. Part 3 (Current): Presents the basic knowledge of Convolutional Neural Networks.
  4. Part 4: Describes the implementation of the actual model.

Disclaimer: This is not a tutorial in any way. It provides some rudimentary knowledge, but the main goal is to showcase how one can build such a model.

In the preceding article we demonstrated what an artificial neural network is, how it works and how we can train it to solve a particular problem. In the current article, we will answer the exact same questions, but for convolutional neural networks.

Convolutional Neural Networks

The basic form of artificial neural networks works really well in cases where the input data are “structured” and have a relatively “small” number of dimensions. However, some input data, such as images, are considered “unstructured” and contain an excessive number of dimensions. Let’s consider the scenario where we have colored images of 256x256 pixels. The dimensionality of the input would be 256x256x3 = 196,608, because we have 3 channels: red, green and blue. If we created a simple linear neural network which receives these images as input and produces images of the same size, we would need 196,608² ≈ 38.7 billion parameters! It should be clear by now that such a network would need an extraordinary amount of memory, not to mention the excessive training times.
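To put that number in perspective, here is a quick back-of-the-envelope calculation in plain Python:

```python
# Back-of-the-envelope: weights of a dense layer mapping a 256x256 RGB
# image to an output of the same size (biases ignored).
in_features = 256 * 256 * 3               # 196,608 input dimensions
out_features = in_features                # same-sized output
weights = in_features * out_features      # one weight per input/output pair

print(f"{weights:,} weights")                      # 38,654,705,664
print(f"~{weights * 4 / 1e9:.0f} GB as float32")   # ~155 GB
```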

Although an image can be represented as a very high dimensional vector, it has some special characteristics. The first one is called locality and essentially means that objects in images have local spatial support. That is, neighboring pixels look pretty similar, or are correlated in some way. The second one is called translation invariance, which indicates that an object’s appearance is usually independent of its location. For example, a human face looks the same regardless of its position in an image (e.g. top, right, bottom, left).

How can one exploit these properties to build manageable neural networks? To incorporate the idea of locality in a neural network, we have to adjust the operations that take place between layers. More specifically, we need to shift away from computing a dot product between the entire input and each neuron’s weights, and instead slide small rectangular areas, also known as receptive fields or kernels, over the input while performing the operation of convolution. Do note that convolution between the input and a receptive field leads to a smaller image in terms of dimensions (unless padding is used, as we will see below). Since this might be quite hard to digest at first, the aforementioned paradigm shift is demonstrated in the following figures:

Dot product between neurons and input
Dot product formula
Convolution between receptive field and input
Convolution formula
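The snippet below contrasts the two operations on a tiny 8x8 input. PyTorch is used purely for illustration here, as the series has not tied itself to any specific framework, and the sizes are made up:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 8)                # (batch, channels, height, width)

# Dot product between the flattened input and explicit neurons:
# a 64-to-64 dense layer needs 64 * 64 = 4,096 weights.
fc = nn.Linear(8 * 8, 8 * 8, bias=False)
y_fc = fc(x.flatten(start_dim=1))          # shape (1, 64)

# Convolution with a single 3x3 receptive field: 9 shared weights,
# and without padding the output shrinks from 8x8 to 6x6.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, bias=False)
y_conv = conv(x)                           # shape (1, 1, 6, 6)

print(fc.weight.numel(), conv.weight.numel())   # 4096 9
print(y_conv.shape)                             # torch.Size([1, 1, 6, 6])
```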

Introducing the translation invariance assumption into the established process is straightforward: we force the weights wᵢ of a receptive field to be shared across the entire input.

With that being said, a Convolutional Neural Network or CNN is a specific type of ANN where the fundamental operation between layers is convolution instead of multiplication. The convolution occurs between receptive fields and the input, where the number and the size of the receptive fields are configurable and the weights are trainable. In CNNs, we usually use different kinds of receptive fields for different reasons. For instance, there are certain fields that, when applied to an image, detect edges, blur it or sharpen it. In the general case, however, the weights of the receptive fields are initialized randomly and then trained, just like the weights of the neurons in a conventional feed-forward neural network. To make this clearer, in a face recognition task some receptive fields may end up identifying eyes, while others detect noses, hair or anything else. The combination of all of these gives us a complete face recognition model.
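As a rough sketch of the difference between a hand-crafted receptive field and a trainable one (again PyTorch, with arbitrary sizes; the Laplacian-style kernel below is just one classic edge detector):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

img = torch.randn(1, 1, 28, 28)            # a dummy grayscale image

# A fixed, hand-crafted 3x3 receptive field (a Laplacian-style edge detector).
edge_kernel = torch.tensor([[[[ 0., -1.,  0.],
                              [-1.,  4., -1.],
                              [ 0., -1.,  0.]]]])
edges = F.conv2d(img, edge_kernel, padding=1)

# In a CNN the kernels start with random weights and are trained,
# just like the neuron weights of a feed-forward network.
learned = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1)
features = learned(img)                    # 16 feature maps of size 28x28

print(edges.shape, features.shape)
```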

Configurations

Now that we have explained the intuition behind receptive fields, let’s take a look at the different ways they can be applied to an input. In practice, there are 4 configuration choices that affect the convolution process.

Padding

Padding controls the size of a frame of zeros that can be added around the input. By using padding, we can better control the output resolution, because the receptive fields also cover the edges of the image. In a sense, padding helps preserve the input resolution.

Padding example
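For an n x n input, a k x k receptive field, stride s and padding p, the output size is ⌊(n + 2p − k) / s⌋ + 1. A minimal PyTorch sketch of the effect, with illustrative sizes:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)

no_pad = nn.Conv2d(1, 1, kernel_size=3, padding=0)
with_pad = nn.Conv2d(1, 1, kernel_size=3, padding=1)

print(no_pad(x).shape)     # torch.Size([1, 1, 30, 30]), border pixels are lost
print(with_pad(x).shape)   # torch.Size([1, 1, 32, 32]), resolution preserved
```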

Stride

The stride controls the step size of the receptive field when applying convolution. A stride of 1 means the receptive field moves 1 step at a time, while a stride of 2 indicates a 2 step movement over the input. A larger stride reduces the spatial resolution, which speeds up processing.

Stride example
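The same kind of sketch for stride, again with illustrative PyTorch layers:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)

stride_1 = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=1)
stride_2 = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)

print(stride_1(x).shape)   # torch.Size([1, 1, 32, 32])
print(stride_2(x).shape)   # torch.Size([1, 1, 16, 16]), half the resolution
```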

Pooling

Similarly to stride, there is another technique called pooling that serves the same purpose: speeding up processing without losing valuable details. There are many variations of pooling, such as average pooling and max pooling. In the first case, we compute the average of the input over a receptive field, which is the same as applying a k x k strided convolution with weights fixed to 1/k². In the second case, we keep the maximum of the input over the receptive field.
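The equivalence between average pooling and a fixed-weight strided convolution can be checked directly. The snippet below is a PyTorch sketch with arbitrary sizes:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 8)

avg_pool = nn.AvgPool2d(kernel_size=2)     # stride defaults to the kernel size
max_pool = nn.MaxPool2d(kernel_size=2)

# A 2x2 strided convolution with every weight fixed to 1/k^2 = 1/4
# computes exactly the same averages.
fixed = nn.Conv2d(1, 1, kernel_size=2, stride=2, bias=False)
with torch.no_grad():
    fixed.weight.fill_(1 / 4)

print(torch.allclose(avg_pool(x), fixed(x), atol=1e-6))  # True
print(max_pool(x).shape)                                 # torch.Size([1, 1, 4, 4])
```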

Dilation

The dilation controls the expansion of the receptive field. A dilation of 1 means that a 3x3 receptive field remains 3x3, while a dilation of 2 means that the same field is transformed into a 5x5 field, since it is expanded by adding “holes” between the weights. Dilation tries to emulate large receptive fields while maintaining fewer weights. For example, a 3x3 field with dilation 2 has 9 trainable weights (with 1 channel), while a 5x5 field with dilation 1 has 25 trainable weights, even though both cover the same 5x5 area.

Dilation example
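A small PyTorch sketch of this trade-off, with illustrative sizes:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)

dense   = nn.Conv2d(1, 1, kernel_size=3, dilation=1, bias=False)  # covers 3x3
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2, bias=False)  # covers 5x5
big     = nn.Conv2d(1, 1, kernel_size=5, dilation=1, bias=False)  # covers 5x5

print(dense.weight.numel(), dilated.weight.numel(), big.weight.numel())  # 9 9 25
print(dilated(x).shape)    # torch.Size([1, 1, 28, 28]), same as the 5x5 kernel
```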

Transposed Convolution

Convolutional layers are very helpful in discriminative tasks where the input is “unstructured”, like an image. However, in our task we have to build a generative model that creates new images. To put it more formally, in the image colorization task we have to increase the dimensions to achieve our goal. We can do that with transposed convolutions, which perform, loosely speaking, the opposite operation: a convolutional layer decreases the resolution, while a transposed one increases it. Their role is crucial for generative models and autoencoders, since the decoder is typically built from transposed convolutional layers.
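A minimal sketch of this down/up symmetry, once more using PyTorch and arbitrary layer settings:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 64, 64)              # e.g. a small RGB image

down = nn.Conv2d(3, 16, kernel_size=4, stride=2, padding=1)
up = nn.ConvTranspose2d(16, 3, kernel_size=4, stride=2, padding=1)

encoded = down(x)                          # torch.Size([1, 16, 32, 32])
decoded = up(encoded)                      # torch.Size([1, 3, 64, 64])
print(encoded.shape, decoded.shape)
```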

Conclusion

To sum up, in order to train a CNN, with or without transposed layers, one must specify the number of layers and, for each layer, determine the number of receptive fields and their size, together with the padding, stride and dilation options. Then, the network is trained just like a simple ANN, with the difference that in CNNs we adjust the weights of the receptive fields.
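To make the summary concrete, here is a toy convolutional autoencoder that chains strided convolutions with transposed ones. Every layer count, channel number and the two-channel output below are made up for illustration; the actual colorization model is described in Part 4.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    # Encoder: strided convolutions reduce the spatial resolution.
    nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    # Decoder: transposed convolutions bring the resolution back up.
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 2, kernel_size=4, stride=2, padding=1),
)

grayscale = torch.randn(1, 1, 256, 256)    # a single-channel input image
print(model(grayscale).shape)              # torch.Size([1, 2, 256, 256])
```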

After reading through this part, you will have all the basic knowledge you need to understand the image colorization attack plan. That’s what we are going to do in the next, and final, article of the series. So, stay tuned!

