Generic Model-Agnostic CNN (GMAN) For Single Image Dehazing

Sanchit Vijay · Published in Analytics Vidhya · Sep 30, 2020

Explanation of the paper “Generic Model-Agnostic Convolutional Neural Network for Single Image Dehazing” by Zheng Liu et al., and an implementation of it in TensorFlow (version 2+).

Introduction

Haze and smog are among the most common environmental factors impacting image quality and, therefore, image analysis. This paper proposes an end-to-end generative method for image dehazing. It is based on designing a fully convolutional neural network to recognize haze structures in input images and restore clear, haze-free images.

Generic data models are generalizations of conventional data models. They define standardized general relation types, together with the kinds of things that may be related by such a relation type.

Why is this method ‘agnostic’? Let’s answer that.

Until now, all the proposed state-of-the-art (SOTA) methods have relied on the Atmospheric Scattering Model (explained below). The GMAN network is agnostic in the sense that, without using the atmospheric scattering model, it produces better results than the previous methods that use it.

You can get the research paper from here.

Left: Hazy image, Right: Dehazed image using GMAN

Atmospheric Scattering Model

The equation used to represent ASM is:

I(x) = J(x) · t(x) + A · (1 − t(x))

where,
I(x): Observed hazy image
J(x): Original image
A: Global Atmospheric Lighting
t(x): Transmission Matrix

A refers to the natural light of the atmosphere across the entire scene.
t(x) represents the amount of light that reaches the camera from the object.
It is calculated as follows:

t(x) = e^(−β · d(x)), where β is the scattering coefficient of the atmosphere and d(x) is the distance (depth) of the scene point from the camera.
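To make these two formulas concrete, here is a small sketch (my own, not from the paper) that synthesizes a hazy image from a clear image and a depth map; the values of A and beta are purely illustrative:

```python
import numpy as np

def add_haze(clear, depth, A=0.85, beta=0.2):
    """Synthesize a hazy image I(x) from a clear image J(x) and a depth map d(x)
    using the atmospheric scattering model. `clear` is H x W x 3 in [0, 1]."""
    t = np.exp(-beta * depth)[..., None]   # transmission matrix t(x) = exp(-beta * d(x))
    return clear * t + A * (1.0 - t)       # I(x) = J(x) * t(x) + A * (1 - t(x))
```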

Before GMAN, all methods worked on estimating the parameters A and t(x) to restore the clear image from a hazy one. But it is unlikely that the problem of lossy reconstruction of the original image can be transformed equivalently into an estimation problem for the parameters A and t(x) (or their variants), at least when the two problems are subject to the same evaluation metric. Apart from that, the complex relationship between the original and hazy images cannot be fully captured by the atmospheric scattering model (ASM). Also, using the ASM may give good results on synthetically hazed images, but it fails to produce desirable results on naturally hazed images. The GMAN network proposes a solution to this.

GMAN + Parallel Network Architecture

Let’s start with GMAN

GMAN Architecture(without perceptual loss)
CNN with shortcut connection.

Functionally speaking, this architecture is an end-to-end generative method that uses an encoder-decoder approach to solve the dehazing problem. The first two layers the input image encounters are convolution blocks with 64 channels. Following them are two downsampling blocks (the encoder) with stride 2. The encoded image is then fed to a layer built from four residual blocks, each containing a shortcut connection (as in ResNets). This residual layer helps the network understand haze structures. After this come the upsampling, or deconvolutional (decoder), layers, which reconstruct the output of the residual layer. The last two layers are convolutional blocks that transform the upsampled feature map into a 3-channel RGB image, which is finally added to the input image (the global residual connection) and passed through a ReLU to give the dehazed output. This global residual connection helps capture the boundary details of objects at different depths in the scene. The encoder part of the architecture reduces the spatial dimensions of the image and feeds the downsampled image to the residual layer to extract image features, while the decoder part is expected to learn and regenerate the missing data of the haze-free image.
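A rough TensorFlow sketch of what this description implies is below; the helper names and kernel sizes are my own assumptions, and I keep the residual blocks at the encoder’s channel width so the shortcut additions line up:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Residual block with a shortcut connection, as in ResNets."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    return layers.ReLU()(layers.Add()([shortcut, y]))

def build_gman_branch(inputs):
    """Sketch of the GMAN encoder-decoder branch described above."""
    # Two 64-channel convolution blocks.
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(inputs)
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
    # Encoder: two stride-2 downsampling blocks (128 filters each).
    x = layers.Conv2D(128, 3, strides=2, padding='same', activation='relu')(x)
    x = layers.Conv2D(128, 3, strides=2, padding='same', activation='relu')(x)
    # Four residual blocks with shortcut connections to model haze structures.
    for _ in range(4):
        x = residual_block(x, 128)
    # Decoder: deconvolution blocks restore the spatial resolution.
    x = layers.Conv2DTranspose(64, 3, strides=2, padding='same', activation='relu')(x)
    x = layers.Conv2DTranspose(64, 3, strides=2, padding='same', activation='relu')(x)
    # Final convolutions produce a 3-channel RGB residual.
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
    x = layers.Conv2D(3, 3, padding='same')(x)
    # Global residual connection: add the input image and apply ReLU.
    return layers.ReLU()(layers.Add()([inputs, x]))
```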

In the original paper there was no parallel network; instead, to improve performance, a perceptual loss was used along with the mean-squared loss. The same author later added this parallel network, and it performed better than the original architecture with perceptual loss, so I’m not going to use a perceptual loss here.

Now let’s deal with the Parallel Network

Parallel Network. The dilation rates of these blocks are green: 4, blue: 2, and purple: 1
Dilated Convolution with kernel-size 3, dilation rate 2.

The PN architecture is shallower than GMAN but has the same encoder-decoder structure. The difference is that it uses dilated convolutions (convolutions with holes) instead of traditional convolutions. A traditional convolutional layer with a larger receptive field struggles to learn a generalizable feature map. Dilation allows an exponential increase in the receptive field without losing spatial resolution at the pixel level. In short, this parallel network helps capture details overlooked by the GMAN network. The first three layers are encoder blocks with dilation rates of 4, 2, and 2 respectively. The next three layers have a dilation rate of 1. These are followed by a deconvolution block with dilation rate 4 that brings the feature maps back to their original size. After this comes the final convolution layer, which transforms the output into a 3-channel RGB image. All layers have 64 channels except the last one. After this, as in GMAN, the output is added to the original input and passed through a ReLU unit.
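A sketch of this branch could look like the following; only the dilation rates and channel widths come from the description above, while the kernel sizes and strides are assumptions:

```python
from tensorflow.keras import layers

def build_parallel_branch(inputs):
    """Sketch of the parallel network (PN) built from dilated convolutions."""
    # Encoder: three dilated convolution blocks (dilation rates 4, 2, 2).
    x = layers.Conv2D(64, 3, dilation_rate=4, padding='same', activation='relu')(inputs)
    x = layers.Conv2D(64, 3, dilation_rate=2, padding='same', activation='relu')(x)
    x = layers.Conv2D(64, 3, dilation_rate=2, padding='same', activation='relu')(x)
    # Three blocks with dilation rate 1.
    x = layers.Conv2D(64, 3, dilation_rate=1, padding='same', activation='relu')(x)
    x = layers.Conv2D(64, 3, dilation_rate=1, padding='same', activation='relu')(x)
    x = layers.Conv2D(64, 3, dilation_rate=1, padding='same', activation='relu')(x)
    # Deconvolution block with dilation rate 4.
    x = layers.Conv2DTranspose(64, 3, dilation_rate=4, padding='same', activation='relu')(x)
    # Final convolution down to a 3-channel RGB residual.
    x = layers.Conv2D(3, 3, padding='same')(x)
    # Same as GMAN: global residual connection plus ReLU.
    return layers.ReLU()(layers.Add()([inputs, x]))
```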

Combined Network

Now, after all this, the outputs of both networks are combined to get the final dehazed image. The parameters alpha1 and alpha2 that weight the two outputs are learned by the network itself. Enough theory; now we’ll start with the code.
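One simple way to express those learnable weights is a small custom Keras layer; the initial values of alpha1 and alpha2 below are an assumption:

```python
import tensorflow as tf
from tensorflow.keras import layers

class WeightedSum(layers.Layer):
    """Combines the two branch outputs with learnable scalars alpha1 and alpha2."""
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.alpha1 = self.add_weight(name='alpha1', shape=(), initializer='ones', trainable=True)
        self.alpha2 = self.add_weight(name='alpha2', shape=(), initializer='ones', trainable=True)

    def call(self, inputs):
        gman_out, pn_out = inputs
        return self.alpha1 * gman_out + self.alpha2 * pn_out
```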

Code in TensorFlow v2.3

You can get the complete code from my GitHub.

Preprocessing and loading data

This function takes the path to an image, reads it, and decodes it into a uint8 tensor. Then we resize it to 412x548; the images in the dataset are 413x550, but the network function has issues with odd input dimensions. Finally, we normalize it and return the image as a normalized tensor.
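A minimal sketch of such a loading function, assuming JPEG files (swap in decode_png if the dataset uses PNGs):

```python
import tensorflow as tf

def load_image(path):
    """Read an image file, resize it to 412x548, and normalize to [0, 1]."""
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)   # uint8 tensor
    img = tf.image.resize(img, [412, 548])        # dataset images are 413x550; odd sizes cause issues
    img = tf.cast(img, tf.float32) / 255.0        # normalize to [0, 1]
    return img
```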

This function takes the paths to the original and hazy images, then makes a split dictionary with keys ‘train’ and ‘val’, holding 90% training data and 10% validation data. If you go through the dataset, the original images have names like ‘0011’ and the corresponding hazy images have names like ‘0011_0.85_0.2’, so every original image has more than one hazy image, where the numbers after the underscores represent how much haze was added to the original image. So the function below groups each original image with its corresponding hazy images and returns them.
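A possible sketch of this grouping and splitting step; the directory layout, file extension, and random seed are assumptions:

```python
import os, glob
import numpy as np

def make_split(orig_dir, hazy_dir, val_frac=0.1, seed=42):
    """Pair each original image with its hazy versions and split 90/10 into train/val."""
    orig_paths, hazy_paths = [], []
    for orig in sorted(glob.glob(os.path.join(orig_dir, '*.jpg'))):
        img_id = os.path.splitext(os.path.basename(orig))[0]            # e.g. '0011'
        for hazy in sorted(glob.glob(os.path.join(hazy_dir, img_id + '_*.jpg'))):  # e.g. '0011_0.85_0.2.jpg'
            orig_paths.append(orig)                                      # repeat the original for each hazy version
            hazy_paths.append(hazy)
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(orig_paths))
    n_val = int(val_frac * len(idx))
    return {
        'train': {'orig': [orig_paths[i] for i in idx[n_val:]],
                  'hazy': [hazy_paths[i] for i in idx[n_val:]]},
        'val':   {'orig': [orig_paths[i] for i in idx[:n_val]],
                  'hazy': [hazy_paths[i] for i in idx[:n_val]]},
    }
```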

The function below applies from_tensor_slices() to the original and hazy image paths of the training set, followed by a mapping function that loads the images (load_image). Then it zips the two training datasets together. The same is done for the validation set. Finally, both datasets are returned.
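Sketched with tf.data, assuming the split dictionary and load_image from above (batching is done later with the other hyperparameters):

```python
import tensorflow as tf

def make_datasets(split):
    """Build zipped (hazy, original) tf.data pipelines for training and validation."""
    def build(hazy_paths, orig_paths):
        hazy_ds = tf.data.Dataset.from_tensor_slices(hazy_paths).map(
            load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)
        orig_ds = tf.data.Dataset.from_tensor_slices(orig_paths).map(
            load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)
        return tf.data.Dataset.zip((hazy_ds, orig_ds))
    train_ds = build(split['train']['hazy'], split['train']['orig'])
    val_ds = build(split['val']['hazy'], split['val']['orig'])
    return train_ds, val_ds
```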

Hazy, original, and dehazed image.

We use the function below to display the output on validation data after each epoch of training. It takes the model, a hazy image, and the original image as arguments.
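A minimal version of such a display helper, using matplotlib (the layout is my own choice):

```python
import matplotlib.pyplot as plt
import tensorflow as tf

def display_img(model, hazy, original):
    """Show a hazy input, the ground truth, and the model's dehazed output side by side."""
    dehazed = model(hazy, training=False)
    images = [hazy[0], original[0], dehazed[0]]           # first image of the batch
    titles = ['Hazy', 'Original', 'Dehazed']
    plt.figure(figsize=(12, 4))
    for i, (img, title) in enumerate(zip(images, titles)):
        plt.subplot(1, 3, i + 1)
        plt.imshow(tf.clip_by_value(img, 0.0, 1.0))
        plt.title(title)
        plt.axis('off')
    plt.show()
```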

Network function

Below is the network function. We’ve used Conv2D and Conv2DTranspose to construct it. First is the GMAN network, in which all layers have 64 filters except the encoding layers, which have 128, and the final output layer, which has 3 channels (RGB). After GMAN, the Parallel Network (PN) uses dilated convolution layers, all with 64 filters except the last one with 3 channels (RGB). I’ve explained the architecture in detail above.
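A minimal way to assemble the branches sketched earlier into one Keras model might look like this (the helper function names are my own):

```python
import tensorflow as tf

def build_gman_net(input_shape=(412, 548, 3)):
    """Assemble the GMAN branch, the parallel branch, and the learnable combination."""
    inputs = tf.keras.Input(shape=input_shape)
    gman_out = build_gman_branch(inputs)         # encoder-decoder branch sketched earlier
    pn_out = build_parallel_branch(inputs)       # dilated-convolution branch sketched earlier
    outputs = WeightedSum()([gman_out, pn_out])  # alpha1 * GMAN + alpha2 * PN
    return tf.keras.Model(inputs, outputs, name='gman_pn')
```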

Training

The best thing I learned from this project is how to custom-train a model. Usually fit and predict look fascinating, but this is the real way of training. The function below trains the model. In each epoch, we have a training loop and a validation loop. In the training loop, we take the training data, compute the gradients, apply them to the weights, and accumulate the training loss. In the validation loop, we run the model, with the weights updated in that epoch, on the validation data to check the output (using the display_img function) and compute the validation loss. Finally, we save the model (weights, variables, etc.) for that epoch and reset the loss metrics.
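A sketch of such a custom training loop with tf.GradientTape, using mean-squared error and Adam (the optimizer, learning rate, and checkpoint path are assumptions):

```python
import tensorflow as tf

loss_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.Adam(1e-4)
train_loss = tf.keras.metrics.Mean(name='train_loss')
val_loss = tf.keras.metrics.Mean(name='val_loss')

@tf.function
def train_step(model, hazy, clear):
    """One optimization step: forward pass, gradients, weight update."""
    with tf.GradientTape() as tape:
        dehazed = model(hazy, training=True)
        loss = loss_fn(clear, dehazed)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    train_loss(loss)

def train(model, train_ds, val_ds, epochs):
    for epoch in range(epochs):
        for hazy, clear in train_ds:                        # training loop
            train_step(model, hazy, clear)
        for hazy, clear in val_ds:                          # validation loop: forward pass only
            val_loss(loss_fn(clear, model(hazy, training=False)))
        display_img(model, hazy, clear)                     # show the last validation batch
        print(f'Epoch {epoch + 1}: train loss {train_loss.result():.4f}, '
              f'val loss {val_loss.result():.4f}')
        model.save_weights(f'checkpoints/epoch_{epoch + 1}')
        train_loss.reset_states()
        val_loss.reset_states()
```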

Now, before we train our model, we’ll define some hyperparameters. I’m using a batch size of 8 because above that the GPU runs out of memory. We are not initializing kernel weights as zero; using random normal initialization gives better results. To reduce overfitting, an L2 regularizer with a weight decay of 1e-4 is used. Note that not every layer has the same kernel initialization; it follows the research paper. Finally, call the training function.
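Roughly, the setup could look like the following; the data paths, epoch count, and initializer standard deviation are assumptions, and in a full implementation the initializer and regularizer would be passed to each Conv2D/Conv2DTranspose via kernel_initializer and kernel_regularizer:

```python
import tensorflow as tf

BATCH_SIZE = 8     # larger batches run out of GPU memory
EPOCHS = 5

# Random-normal kernel initialization and L2 weight decay, per the description above.
initializer = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.02)
regularizer = tf.keras.regularizers.l2(1e-4)

split = make_split('data/clear', 'data/hazy')       # placeholder directories
train_ds, val_ds = make_datasets(split)
train_ds = train_ds.shuffle(256).batch(BATCH_SIZE)
val_ds = val_ds.batch(BATCH_SIZE)

model = build_gman_net()
train(model, train_ds, val_ds, EPOCHS)
```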

Evaluation (Testing)

Now I’ve taken some random foggy (naturally hazed) images from Google and tested the model on them; the function below is used for that.
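A small helper along those lines, reusing load_image so test images go through the same resizing and normalization as training data:

```python
import matplotlib.pyplot as plt
import tensorflow as tf

def dehaze_file(model, path):
    """Run the trained model on a single (naturally hazy) image file and show the result."""
    hazy = load_image(path)
    dehazed = model(hazy[tf.newaxis, ...], training=False)[0]   # add and drop the batch dimension
    plt.figure(figsize=(8, 4))
    for i, (img, title) in enumerate(zip([hazy, dehazed], ['Foggy input', 'Dehazed'])):
        plt.subplot(1, 2, i + 1)
        plt.imshow(tf.clip_by_value(img, 0.0, 1.0))
        plt.title(title)
        plt.axis('off')
    plt.show()
```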

One of the test outputs is shown above and the others are below. You can check the rest here. You can check the validation output here.

Voila!!! We are done.

End Thoughts

We have successfully designed an algorithm to dehaze(restore) a hazy image. Key points we’ve learned here are:

  • GMAN architecture and research behind the dehazing domain.
  • How to custom train a model in TensorFlow.
  • How to deal with a dataset in which both the features and the labels are images.

References:

You can check the Kaggle notebook of this project, with outputs for up to 5 epochs. Any suggestions are welcome. Check out my other projects on GitHub.

Please clap if you find my work useful.

Enjoy Learning!!!
