Image generated by Stable Diffusion

Building an Image Colorization Neural Network — Part 4: Implementation

George Kamtziridis
8 min read · Sep 19, 2022

Welcome back to the fourth and final part of this series, where we will finally implement a neural network capable of applying color to black and white images. In the previous articles, we covered the basics of generative models and autoencoders, artificial neural networks, and convolutional neural networks. If these sound like gibberish to you, make sure to check the corresponding articles before you study the following part (links below).

The entire series consists of the following 4 parts:

  1. Part 1: Outlines the basics of generative models and Autoencoders.
  2. Part 2: Showcases the fundamental concepts around Artificial Neural Networks.
  3. Part 3: Presents the basic knowledge of Convolutional Neural Networks.
  4. Part 4 (Current): Describes the implementation of the actual model.

Disclaimer: This is not a tutorial in any way. It provides some rudimentary knowledge, but the main goal is to showcase how one can build such a model.

The entire model was built with PyTorch, and the image preprocessing was done with the help of the Scikit-Image library. All the code can be found at: https://github.com/gkamtzir/cnn-image-colorization

Data and Preprocessing

Before proceeding with the actual implementation, we need a fairly large dataset of colorized images. Keep in mind that our approach does not require the corresponding black and white images: as mentioned in the first article, we will use the LAB format, which means we can decompose each training image and obtain its black and white version. The dataset I chose is the Image Colorization Dataset, which contains 5,000 colorized images for training and 739 images for testing. The dimensions of every image are 400x400x3. Their content ranges from food, people, animals and vehicles to exterior and interior scenes.

The only preprocessing step was the conversion from RGB to the LAB format. For this purpose, I used the Scikit-Image library together with the `Dataset` class PyTorch provides to build the mechanism for reading and loading images. For more details, check the `Dataset.py` file.
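To make this concrete, here is a minimal sketch of what such a `Dataset` might look like. The class name, file layout, and normalization constants are illustrative assumptions, not the repository's actual code:

```python
import os
import torch
from torch.utils.data import Dataset
from skimage import io, color

class ColorizationDataset(Dataset):
    """Illustrative sketch: reads RGB images, converts them to LAB,
    and returns the L channel as input and the ab channels as target."""

    def __init__(self, image_dir):
        self.image_dir = image_dir
        self.files = sorted(os.listdir(image_dir))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, index):
        # Read the RGB image and convert it to the LAB color space.
        rgb = io.imread(os.path.join(self.image_dir, self.files[index]))
        lab = color.rgb2lab(rgb).astype("float32")

        # L is the grayscale input, a and b are the values to predict.
        # The scaling factors are assumptions for illustration.
        L = torch.from_numpy(lab[:, :, 0:1]).permute(2, 0, 1) / 100.0   # 1x400x400, roughly [0, 1]
        ab = torch.from_numpy(lab[:, :, 1:]).permute(2, 0, 1) / 128.0   # 2x400x400, roughly [-1, 1]
        return L, ab
```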

Architecture and Configuration

Regarding the architecture of the neural network, we have already mentioned that we will implement an autoencoder, where the encoder consists of convolutional layers and the decoder of transposed convolutional layers. The input is a 400x400 image with 1 channel, the L value (lightness). The output is a 400x400 image with 2 channels, the a and b values. The final colorized image is constructed by combining the predicted a and b with the input L.
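As a small, hedged sketch of that final recombination step (shapes and scaling follow the assumptions of the `Dataset` sketch above):

```python
import numpy as np
from skimage import color

def lab_to_rgb(L, ab):
    """Combine a 1x400x400 L tensor and a 2x400x400 ab tensor into an RGB image.
    Assumes L was scaled to [0, 1] and ab to roughly [-1, 1], as in the sketch above."""
    # Undo the normalization and stack back into an HxWx3 LAB array.
    L = L.squeeze(0).cpu().numpy() * 100.0
    ab = ab.cpu().numpy() * 128.0
    lab = np.concatenate([L[np.newaxis, :, :], ab], axis=0).transpose(1, 2, 0)
    return color.lab2rgb(lab)  # HxWx3 RGB array in [0, 1]
```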

The architecture of the network was built incrementally, from networks with only a few layers and parameters to networks with many layers and more sophisticated approaches to information flow. Essentially, I followed Occam's razor: by gradually building more complicated solutions, I tried to match the complexity of the model to the actual complexity of the task.

One thing we haven't talked about yet is batch normalization between layers. I plan to write a dedicated article on batch normalization, but for now you can think of it as a clever trick that speeds up training while keeping the weight values well behaved.

Batch normalization is much more than that. I will explain everything in a following article.

As the activation function, I used ReLU, since it's one of the safest options. The batch size was set to 32, which means the network was trained on batches of 32 images at a time. The prediction loss was measured with the Mean Squared Error (MSE), and in all experiments the network used the Adam optimizer.
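To make the setup concrete, here is a minimal, hedged sketch of an encoder-decoder of this kind together with the training configuration described above. The layer counts, channel sizes, and kernel choices are placeholders, not the exact architectures evaluated below:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

class SimpleColorizer(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: convolutions that halve the spatial resolution,
        # each followed by batch normalization and ReLU.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
        )
        # Decoder: transposed convolutions that restore the 400x400
        # resolution and output the 2 color channels (a and b).
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.ConvTranspose2d(32, 2, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training configuration as described in the text: batches of 32 images,
# MSE loss, and the Adam optimizer. The learning rate and the dataset
# directory are placeholders; ColorizationDataset is the sketch from above.
model = SimpleColorizer()
loader = DataLoader(ColorizationDataset("train"), batch_size=32, shuffle=True)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for L, ab in loader:  # one epoch
    optimizer.zero_grad()
    loss = criterion(model(L), ab)
    loss.backward()
    optimizer.step()
```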

In total, I experimented with 6 different networks. In the subsequent sections, I provide the settings and the results for all of them. Each was trained on a development set of 200 images for approximately 200 epochs. Development sets are used to test and compare different architectures before feeding the entire dataset to a model, and they can save a lot of time.

1st Architecture

The first, and most basic, architecture is the one showcased in the following image. I ran 3 separate experiments with 3 different learning rates, but all of them performed pretty much the same. Moreover, the networks weren't able to learn even the training instances, which is a red flag in terms of learning capability.

Architecture of 1st Network
Configuration of 1st network
Loss of 1st Architecture
Errors of 1st Architecture
Results of 1st Architecture (left: training set, right: testing set). The number underneath the picture indicates the MSE.

2nd Architecture

The second architecture is essentially an enhancement of the previous one, where I added 1 convolutional layer to the encoder and 1 transposed convolutional layer to the decoder. In this case, the loss was lower compared to the first architecture, and the experiment with a learning rate of 0.001 demonstrated the best results. Still, the network wasn't able to colorize images, although it did try to put some color in certain places.

Architecture of 2nd Network
Configuration of 2nd Architecture
Loss of 2nd Architecture
Errors of 2nd Architecture
Results of 2nd Architecture (left: training set, right: testing set)

3rd Architecture

The third architecture was not just an enhancement, but an entire redesign. I adjusted the network to implement the so-called U-Net [1]. The U-Net layout feeds previously computed encoder outputs into the corresponding parts of the decoder, via skip connections. In this way, we make sure the network does not lose important spatial information. The exact structure of a U-Net can be found below:

U-Net Layout (Source: Wikipedia https://commons.wikimedia.org/wiki/File:Example_architecture_of_U-Net_for_producing_k_256-by-256_image_masks_for_a_256-by-256_RGB_image.png)
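To illustrate the idea (this is a toy sketch, not one of the networks evaluated below), a skip connection in PyTorch amounts to concatenating an encoder feature map with the corresponding decoder input:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal two-level U-Net-style sketch: encoder outputs are concatenated
    with decoder inputs so spatial detail is not lost. Channel sizes are
    placeholders, not the configuration used in the experiments."""

    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU())
        # The decoder's last stage receives the upsampled features *and* enc1's output.
        self.dec2 = nn.ConvTranspose2d(32 + 32, 2, 4, stride=2, padding=1)

    def forward(self, x):
        e1 = self.enc1(x)                            # 32 x 200 x 200
        e2 = self.enc2(e1)                           # 64 x 100 x 100
        d1 = self.dec1(e2)                           # 32 x 200 x 200
        return self.dec2(torch.cat([d1, e1], dim=1))  # 2 x 400 x 400
```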

With this approach, the network converged faster, with lower errors on both the training and testing sets. Additionally, it was the first network able to consistently apply specific colors.

Architecture of 3rd Network
Configuration of 3rd Architecture
Loss of 3rd Architecture
Errors of 3rd Architecture
Results of 3rd Architecture (left: training set, right: testing set)

4th Architecture

This architecture was based on the previous one, with one extra layer added to both the encoder and the decoder. The end results showed even lower losses on the training set, and the outputs contained more colorized areas. Note that the testing loss increased. This means the model overfits the training set, but that is not a problem at this point: the first step when building a model is to make sure it is able to learn at all. This is done by letting the model overfit to some extent; then, by increasing the training set, the overfitting usually goes away.

Architecture of 4th Network
Configuration of 4th Architecture
Loss of 4th Architecture
Errors of 4th Architecture
Results of 4th Architecture (left: training set, right: testing set)

5th Architecture

At this stage, I chose to modify the network layout by introducing dilated convolutions, also known as "à trous" (atrous) convolutions. There is research indicating major improvements in prediction for cases like ours [2]. Again, the losses were reduced even further and the model was able to colorize images more precisely.
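For reference, a dilated (atrous) convolution in PyTorch is just a regular convolution with a `dilation` argument; with matching padding it enlarges the receptive field without shrinking the feature map. The channel sizes below are placeholders:

```python
import torch
import torch.nn as nn

# A 3x3 convolution with dilation=2 covers a 5x5 neighborhood; padding=2
# keeps the spatial size unchanged, so it can be dropped into the encoder.
dilated = nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2)

x = torch.randn(1, 64, 100, 100)
print(dilated(x).shape)  # torch.Size([1, 64, 100, 100])
```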

Architecture of 5th Network
Configuration of 5th Architecture
Loss of 5th Architecture
Errors of 5th Architecture
Results of 5th Architecture (left: training set, right: testing set)

6th Architecture

The last architecture is an augmented version of the fifth, with 2 extra layers added. The results did not differ much from the previous version, which makes this architecture a good stopping point. To be fair, in some respects architecture 5 was better than 6, but since the latter demonstrated less overfitting, I chose it as the final model.

Architecture of 6th Network
Configuration of 6th Architecture
Loss of 6th Architecture
Errors of 6th Architecture
Results of 6th Architecture (left: training set, right: testing set)

Final Results

Training the 6th architecture on the development set took about 40 minutes on Google Colab Pro GPUs and about 2.5 hours on an i5-4690K @ 3.9GHz CPU. Due to time limitations and GPU availability, I was restricted to training on the CPU only. That is why the final architecture was trained on 2,000 images rather than the entire dataset. So, I trained the model for 300 epochs, with a learning rate of 0.001, for 3 days. The final results were encouraging, since the model was able to colorize not only images it had come across during training, but also images it had never seen before!

Training Set

Final Results on Training Data I
Final Results on Training Data II

Testing Set

Final Results on Testing Data I
Final Results on Testing Data II

And this concludes the entire series on solving the image colorization problem with neural networks. I really hope you learned a lot in the process and had fun at the same time. This was my first machine learning-related series, so rest assured there are many more to come in the near future. Until then, keep learning!

References

[1] Wei Yao, Zhigang Zeng, Cheng Lian, Huiming Tang, "Pixel-wise regression using U-Net and its application on pansharpening", Neurocomputing, Volume 312, Pages 364-371, ISSN 0925-2312, 2018.

[2] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan Yuille, "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs", IEEE Transactions on Pattern Analysis and Machine Intelligence, doi:10.1109/TPAMI.2017.2699184, 2016.

