Building an Image Colorization Neural Network — Part 4: Implementation
Welcome back to the fourth and final part of this series, where we will finally implement a neural network capable of applying color to black and white images. In the previous articles, we covered the basics of generative models and autoencoders, artificial neural networks, and convolutional neural networks. If these sound like gibberish to you, make sure to check the corresponding articles before you study the following part (links below).
The entire series consists of the following 4 parts:
- Part 1: Outlines the basics of generative models and Autoencoders.
- Part 2: Showcases the fundamental concepts around Artificial Neural Networks.
- Part 3: Presents the basic knowledge of Convolutional Neural Networks.
- Part 4 (Current): Describes the implementation of the actual model.
Disclaimer: This is not a tutorial in any way. It provides some rudimentary knowledge, but the main goal is to showcase how one can build such a model.
The entire model was built with PyTorch, and the image preprocessing was done with the Scikit-Image library. All the code can be found at: https://github.com/gkamtzir/cnn-image-colorization
Data and Preprocessing
Before proceeding with the actual implementation, we need a fairly large dataset containing colorized images. Keep in mind that our approach does not require the corresponding black and white images because, as we mentioned in the first article, we will utilize the LAB format, which means we can decompose the images of the training set and obtain the black and white version of each one. The dataset I chose is the Image Colorization Dataset, which contains 5,000 colorized images for training and 739 images for testing. The dimensions of every image are 400x400x3. Their content ranges from food, people, animals, and vehicles to exterior and interior places.
The only preprocessing that took place was the conversion from RGB to LAB format. For this purpose, I used the Scikit-Image library together with the `Dataset` class that PyTorch provides to create the mechanism for reading and loading images. For more details, check the `Dataset.py` file.
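To give you an idea of what this looks like, here is a minimal sketch of such a `Dataset` (the class name, folder layout, and exact normalization are my own simplifications; the repository's `Dataset.py` is the reference):

```python
import os

import numpy as np
import torch
from skimage import io
from skimage.color import rgb2lab
from torch.utils.data import Dataset


class ColorizationDataset(Dataset):
    """Loads RGB images, converts them to LAB, and returns (L, ab) tensor pairs."""

    def __init__(self, image_directory):
        self.image_directory = image_directory
        self.image_names = sorted(os.listdir(image_directory))

    def __len__(self):
        return len(self.image_names)

    def __getitem__(self, index):
        path = os.path.join(self.image_directory, self.image_names[index])
        rgb = io.imread(path)                         # (400, 400, 3), uint8
        lab = rgb2lab(rgb).astype(np.float32)         # L in [0, 100], a/b roughly in [-128, 127]
        lab = torch.from_numpy(lab).permute(2, 0, 1)  # channels first: (3, 400, 400)
        L = lab[:1] / 100.0                           # lightness, scaled to [0, 1]
        ab = lab[1:] / 128.0                          # color channels, scaled to roughly [-1, 1]
        return L, ab
```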
Architecture and Configuration
Regarding the architecture of the neural network, we have already mentioned that we will implement an autoencoder, where the encoder consists of convolutional layers and the decoder contains transposed convolutional layers. The input will be a 400x400 image with 1 channel, the L value (lightness). The output will be a 400x400 image with 2 channels, the a and b values. The final colorized image will be constructed by combining the predicted a and b with the input L.
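To make the reconstruction step concrete, here is a rough sketch of how the predicted a and b channels can be merged back with the input L using Scikit-Image (the scaling factors assume the normalization used in the `Dataset` sketch above):

```python
import numpy as np
import torch
from skimage.color import lab2rgb


def reconstruct_rgb(L, predicted_ab):
    """Combine the input lightness channel with the predicted a/b channels.

    L:            tensor of shape (1, 400, 400), scaled to [0, 1]
    predicted_ab: tensor of shape (2, 400, 400), scaled to roughly [-1, 1]
    """
    lab = torch.cat([L * 100.0, predicted_ab * 128.0], dim=0)    # back to LAB ranges
    lab = lab.permute(1, 2, 0).cpu().numpy().astype(np.float64)  # (400, 400, 3) for Scikit-Image
    return lab2rgb(lab)                                          # RGB image with values in [0, 1]
```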
The architecture of the network was built incrementally, starting from networks with only a few layers and parameters and moving to networks with many layers and more sophisticated approaches to information flow. Basically, I followed Occam's razor: by gradually building more complicated solutions, I tried to match the complexity of the model to the actual complexity of the task.
One thing that we haven’t talked about is batch normalization between layers. I pledge to write an article on batch normalization, but for now you can think of batch normalization as a clever trick that allows us to speed up the training process while taking good care of the weight values.
Batch normalization is much more than that. I will explain everything in a following article.
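As a quick illustration, a batch normalization layer typically sits between a convolution and its activation, as in this generic PyTorch block (the channel counts here are arbitrary, not the exact ones used in the repository):

```python
from torch import nn

# A typical convolutional block: convolution -> batch normalization -> activation.
block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(64),  # normalizes activations across the batch, stabilizing and speeding up training
    nn.ReLU(),
)
```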
As the activation function, I used ReLU, since it's one of the safest options. The batch size was set to 32, which means the network is trained on batches of 32 images at a time. The prediction loss was calculated with the Mean Squared Error (MSE), and the network used the Adam optimizer in all experiments.
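Putting these settings together, a training loop would look roughly like the sketch below. The stand-in network and the data path are placeholders of mine; the actual training code lives in the repository.

```python
from torch import nn, optim
from torch.utils.data import DataLoader

# A trivial stand-in network so the snippet runs on its own; the real architectures follow below.
network = nn.Sequential(nn.Conv2d(1, 2, kernel_size=3, padding=1))

# `ColorizationDataset` is the sketch from the preprocessing section; the path is hypothetical.
train_loader = DataLoader(ColorizationDataset("data/train"), batch_size=32, shuffle=True)

criterion = nn.MSELoss()                                # Mean Squared Error on the a/b channels
optimizer = optim.Adam(network.parameters(), lr=0.001)  # Adam optimizer

for epoch in range(200):
    for L, ab in train_loader:
        optimizer.zero_grad()
        predicted_ab = network(L)           # (batch, 2, 400, 400)
        loss = criterion(predicted_ab, ab)
        loss.backward()
        optimizer.step()
```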
In total, I experimented with 6 different networks. In the subsequent sections, I will provide the settings and the results for each of them. Each network was trained on a development set of 200 images for approximately 200 epochs. Development sets are used to test and compare different architectures before feeding the entire dataset to a model, and they can save a lot of time.
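For reference, carving a development set out of the training data can be as simple as the following (whether the first 200 images or a random subset is used is an implementation detail; here I simply take the first 200):

```python
from torch.utils.data import Subset

# Keep only the first 200 training images as a development set.
development_set = Subset(ColorizationDataset("data/train"), range(200))
```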
1st Architecture
The first, and most basic, architecture is the one showcased in the following image. We ran 3 separate experiments with 3 different learning rates, but all of them performed pretty much the same. Moreover, they weren't able to learn even the training instances, which is a red flag in terms of learning capability.
2nd Architecture
The second architecture is essentially an enhancement of the previous one, where I added 1 convolutional layer to the encoder and 1 transposed convolutional layer to the decoder. In this case, the loss was reduced compared to the first architecture, and the experiment with a learning rate of 0.001 demonstrated the best results. Still, the network wasn't able to colorize images, although it did try to put some color in certain places.
3rd Architecture
The third architecture was not just an enhancement but an entire modification. I adjusted the network to implement the so-called U-Net [1]. The U-Net layout feeds previously computed encoder outputs into the corresponding parts of the decoder. In this way, we make sure the network does not lose important information. The exact structure of a U-Net can be found below:
With this approach, the network converged faster, with lower errors on both the training and testing sets. Additionally, it was the first network that was able to consistently apply specific colors.
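To make the skip-connection idea concrete, here is a minimal U-Net-style sketch in PyTorch. The layer sizes are illustrative only; the actual architectures use more stages and channels, but the concatenation step is the essence of the idea.

```python
import torch
from torch import nn


class TinyUNet(nn.Module):
    """A minimal U-Net-style model: encoder outputs are concatenated into the decoder."""

    def __init__(self):
        super().__init__()
        self.encoder1 = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU())
        self.encoder2 = nn.Sequential(
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU())
        self.decoder1 = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU())
        # The last stage receives 32 channels from below plus 32 skipped over from encoder1.
        self.decoder2 = nn.ConvTranspose2d(32 + 32, 2, 4, stride=2, padding=1)

    def forward(self, x):
        e1 = self.encoder1(x)    # (batch, 32, 200, 200)
        e2 = self.encoder2(e1)   # (batch, 64, 100, 100)
        d1 = self.decoder1(e2)   # (batch, 32, 200, 200)
        # Skip connection: reuse the encoder output alongside the decoder features.
        return self.decoder2(torch.cat([d1, e1], dim=1))  # (batch, 2, 400, 400)
```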
4th Architecture
This architecture was based on the previous one, with the addition of one extra layer to both the encoder and the decoder. The end results showed even lower losses on the training set, and the outputs contained more colorized areas. Note that the testing loss increased. This means that our model overfits the training set, but this is not a problem at this point, because the first step when building a model is to make sure it is able to learn at all. This is done by letting the model overfit to some extent; increasing the training set later usually makes the overfitting problem go away.
5th Architecture
During this stage, I chose to modify the network layout by introducing dilation layers, also known as "a trous" (atrous) layers. There is research indicating major improvements in prediction for cases like ours [2]. Again, the losses were reduced even further, and the model was able to colorize images more precisely.
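In PyTorch, a dilated convolution is just a regular `Conv2d` with a `dilation` argument; for example (the channel counts here are arbitrary):

```python
from torch import nn

# A dilated ("a trous") convolution: dilation=2 spreads the 3x3 kernel taps apart, enlarging the
# receptive field without adding parameters; padding=2 keeps the spatial resolution unchanged.
dilated = nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, padding=2, dilation=2)
```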
6th Architecture
The last architecture is an augmented version of the fifth one, with 2 extra layers added. The results did not differ much from the previous version, which makes this architecture a good stopping point. To be fair, in some respects architecture 5 was better than 6, but since the latter demonstrated less overfitting, I chose it as the final model.
Final Results
The training on the development set for the 6th architecture took about 40 minutes on Google Colab Pro GPUs and about 2.5 hours on an i5-4690K @ 3.9GHz CPU. Due to time limitations and GPU availability, I was restricted to using only the CPU for training. That is why the final architecture was trained on 2,000 images rather than the entire dataset. So, I trained the model for 300 epochs, with a learning rate of 0.001, for 3 days. The final results were encouraging, since the model was able to colorize not only images it had come across during training, but also images it had never seen before!
Training Set
Testing Set
And this concludes the entire series on solving the image colorization problem with neural networks. I really hope you learned a lot in the process and had fun at the same time. This was my first machine learning related series, so be sure there are many more to come in the near future. Until then, keep learning!
References
[1] Wei Yao, Zhigang Zeng, Cheng Lian, and Huiming Tang, "Pixel-wise regression using U-Net and its application on pansharpening," Neurocomputing, vol. 312, pp. 364-371, ISSN 0925-2312, 2018.
[2] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan Yuille, "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, DOI: 10.1109/TPAMI.2017.2699184, 2016.