Single Image Super Resolution Using GANs — Keras

Image Super Resolution:

Image super resolution can be defined as increasing the size of small images while keeping the drop in quality to a minimum, or as restoring high resolution images with rich detail from low resolution images. The problem is ill-posed, since many different high resolution images can correspond to the same low resolution image. It has numerous applications, such as satellite and aerial image analysis, medical image processing, and compressed image/video enhancement.

Problem Statement:

The goal is to recover a high resolution image from a low resolution image. There are many forms of image enhancement, including noise reduction, up-scaling and color adjustment. This post discusses enhancing low resolution images by combining a deep network with an adversarial network (Generative Adversarial Network) to produce high resolution images.

Our main target is to reconstruct a super resolution (SR) image by up-scaling a low resolution image such that texture detail is not lost in the reconstructed SR image.

Why Deep Learning?

There are various ways of enhancing image quality. One of the most commonly used techniques is interpolation. It is easy to use, but it tends to distort the image or reduce its visual quality. The most common interpolation methods, e.g. bi-cubic interpolation, produce blurry images. More sophisticated methods exploit internal similarities of a given image, or use data sets of low-resolution images and their high-resolution counterparts to learn a mapping between them. Among example-based SR algorithms, the sparse-coding-based method is one of the most popular.

Deep learning offers a better way to learn this mapping. In recent years many deep learning methods have been proposed for image super resolution. This post discusses SRGAN. First, let's look at two other deep learning methods:

  • SRCNN : SRCNN was the first deep learning method to outperform traditional ones. It is a Convolutional Neural Network consisting of only 3 convolution layers: patch extraction and representation, non-linear mapping, and reconstruction. For more details you can go through the original paper.
  • VDSR : Very Deep Super Resolution employs a similar structure to SRCNN, but goes deeper to achieve higher accuracy. Both SRCNN and VDSR apply bi-cubic up-sampling at the input stage and process the feature maps at the same scale as the output. For more details you can go through the original paper.

SRGAN — Super Resolution Generative Adversarial Network

To understand it, let's first look at what GANs (Generative Adversarial Networks) are:

GANs : GANs are a class of AI algorithms used in unsupervised machine learning. They are deep neural network architectures composed of two networks (Generator and Discriminator) pitted against each other (thus the “adversarial”). GANs are about creating, like drawing a portrait or composing a symphony. Their main focus is to generate data from scratch.

To understand GANs, first we need to understand what a generative model is. In machine learning, the two main classes of models are generative and discriminative. A discriminative model is one that discriminates between two (or more) different classes of data — for example a convolutional neural network that is trained to output 1 given an image of a car and 0 otherwise. A generative model on the other hand doesn’t know anything about classes of data. Instead, its purpose is to generate new data which fits the distribution of the training data.

GANs consist of a Generator and a Discriminator. Think of it as a game in which the Generator tries to produce data from some probability distribution and the Discriminator acts as a judge. The Discriminator decides whether its input comes from the true training data set or is fake, generated data. The Generator tries to improve its output so that it matches the true training data. In other words, the Discriminator guides the Generator towards producing realistic data. Loosely speaking, the pair can be compared to an encoder and a decoder.

Let's make it easier with an example:

Let's say we want to generate animated character faces. We provide training data consisting of images of anime character faces, and the Generator starts from random noise. The Generator tries to produce an image from the noise, which is then judged by the Discriminator. Both keep training until the Generator can produce images that match the true training data. One interesting thing is that the generated images have features taken from the original training images but are not necessarily identical to any of them. In this way we can generate anime faces with a heterogeneous mix of features from the training data.

Figure 1: GANs basic architecture

The Discriminator and the Generator learn at the same time. Once the Generator is trained, it knows enough about the distribution of the training samples to generate new samples that share very similar properties.
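
The game between the two networks is commonly written as a minimax objective. The formula below is the standard GAN formulation rather than anything specific to this post, and the symbols (G for the generator, D for the discriminator, p_data for the training data distribution, p_z for the noise distribution) are the usual notation:

```latex
\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

D is rewarded for assigning high probability to real samples and low probability to generated ones, while G is rewarded whenever D is fooled.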


Idea Behind SRGAN : We have seen various approaches to single image super resolution. They are fast and accurate, but one problem remains unsolved: how can we recover finer texture details from a low resolution image so that the result is not distorted? Recent work has largely focused on minimizing the mean squared reconstruction error. The results have high peak signal-to-noise ratios (PSNR), which suggests good image quality by that measure, but they often lack high-frequency details and are perceptually unsatisfying, as they do not match the fidelity expected in high resolution images. Previous approaches measure similarity in pixel space, which leads to blurry or perceptually unsatisfying results. So we need a stable model that can capture the perceptual differences between the model’s output and the ground truth image.

To achieve this we will use a perceptual loss function, which consists of a content loss and an adversarial loss. Other than that, SRGAN uses residual blocks to build a deep network.
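
To make the PSNR measure concrete, here is a minimal NumPy sketch. The function name `psnr` and the default `max_value=255.0` (8-bit images) are my own choices for illustration and are not taken from the original code:

```python
import numpy as np

def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    # Mean squared error over every pixel and channel.
    mse = np.mean((reference - reconstructed) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)
```

Because PSNR is a direct function of the pixel-wise MSE, a model trained to minimize MSE also maximizes PSNR, even when the output looks overly smooth.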

SRGAN Architecture

Now let's go into more detail about SRGAN : Super-resolution GAN applies a deep network in combination with an adversary network to produce higher resolution images.

The training procedure consists of the following steps:

  • We process the HR (High Resolution) images to get down-sampled LR (Low Resolution) images. Now we have both HR and LR images for the training data set.
  • We pass the LR images through the Generator, which up-samples them and outputs SR (Super Resolution) images.
  • We use a discriminator to distinguish the HR images from the SR images, and back-propagate the GAN loss to train the discriminator and the generator.
Figure 3: Generator and Discriminator Network

Above is the network design for the generator and the discriminator. It is mostly composed of convolution layers, batch normalization and parametric ReLU (PReLU). The generator also implements skip connections similar to ResNet.

A few things to note from the network architecture:

  • Residual blocks: Deeper networks are more difficult to train. The residual learning framework eases the training of these networks and enables them to be substantially deeper, leading to improved performance. More about residual blocks and deep residual learning can be found in the paper listed below. 16 residual blocks are used in the Generator.
  • PixelShuffler x2: This is feature map up-scaling (up-scaling and up-sampling mean the same thing here). 2 sub-pixel CNN blocks are used in the Generator. There are various ways to do this; in the code a Keras built-in function has been used.
  • PReLU (Parametric ReLU): We are using PReLU in place of ReLU or LeakyReLU. It introduces a learnable parameter that makes it possible to adaptively learn the coefficient of the negative part.
  • k3n64s1: this means kernel size 3, 64 channels and stride 1.
  • Loss function: This is the most important part. As discussed, we will be using perceptual loss. It consists of content (reconstruction) loss and adversarial loss.
Perceptual Loss
  • Adversarial loss: This pushes our solution to the natural image manifold using a discriminator network that is trained to differentiate between the super-resolved images and original photo-realistic images.
Adversarial Loss
  • Content Loss: We use content loss to preserve perceptual similarity instead of pixel-wise similarity. This allows us to recover photo-realistic textures from heavily down-sampled images. Instead of relying on pixel-wise losses, we use a loss function that is closer to perceptual similarity. We define the VGG loss based on the ReLU activation layers of the pre-trained 19-layer VGG network. VGG loss is defined as the Euclidean distance between the feature representations of a reconstructed image and the reference image.
Content Loss
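
Two of the building blocks listed above, PReLU and PixelShuffler x2, can be illustrated with plain NumPy. This is only a sketch of the operations themselves: the function names are hypothetical, the channels-last layout is an assumption, and it is not the Keras built-in function that the actual code uses for up-scaling.

```python
import numpy as np

def prelu(x, alpha):
    """Parametric ReLU: identity for positive inputs, slope `alpha` for negative ones.

    In a network `alpha` is learned; here it is simply passed in.
    """
    return np.where(x > 0, x, alpha * x)

def pixel_shuffle(x, r=2):
    """Rearrange an (H, W, C*r*r) feature map into (H*r, W*r, C)."""
    h, w, c = x.shape
    assert c % (r * r) == 0, "channel count must be divisible by r*r"
    out_c = c // (r * r)
    # Split the channel axis into an r x r spatial offset plus the output channels.
    x = x.reshape(h, w, r, r, out_c)
    # Interleave the offsets with the spatial axes: (h, r, w, r, out_c).
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h * r, w * r, out_c)
```

With r=2, a 96 x 96 feature map with 256 channels becomes a 192 x 192 map with 64 channels, so two such blocks give the overall 4x up-scaling.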

SRGAN uses a perceptual loss measuring the MSE of features extracted by a VGG-19 network. For a specific layer within VGG-19, we want the features of the generated image and of the reference image to match (minimum MSE between features). More about perceptual loss can be found in the original paper.

For more details, here are links to the original papers:
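
For reference, the three losses can be written out as follows. This is my transcription of the formulas in the SRGAN paper, using its notation: G is the generator with parameters θ_G, D the discriminator with parameters θ_D, φ_{i,j} the VGG-19 feature map taken after the j-th convolution before the i-th max-pooling layer, and W_{i,j}, H_{i,j} the dimensions of that feature map.

```latex
% Perceptual loss: content loss plus weighted adversarial loss
l^{SR} = l^{SR}_{X} + 10^{-3}\, l^{SR}_{Gen}

% Adversarial loss over N training samples
l^{SR}_{Gen} = \sum_{n=1}^{N} -\log D_{\theta_D}\big(G_{\theta_G}(I^{LR})\big)

% Content (VGG) loss
l^{SR}_{VGG/i,j} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}}
  \Big(\phi_{i,j}(I^{HR})_{x,y} - \phi_{i,j}\big(G_{\theta_G}(I^{LR})\big)_{x,y}\Big)^2
```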

Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network :
Perceptual loss:
Deep Residual Learning for Image Recognition :

Implementation Details:

Complete code can be found at:

Data set :

  • Used the COCO 2017 data set. It is around 18GB and contains images of different dimensions.
  • Used 800 images for training. This is very few; you can use more (approx. 350 thousand according to the original paper) if you can collect them and have a very good GPU. Preprocessing includes cropping the images so that they all have the same dimensions. Square images (same width and height) are preferred. I used images of size 384 for high resolution.
  • After the above step you have the high resolution images. Now we need low resolution images, which we get by down-scaling the HR images. I used a down-scale factor of 4, so the low resolution images are of size 96.
  • Make sure that the images are normalized. In the original paper LR images are scaled to [0,1] and HR images are scaled to [-1,1]. In this implementation both LR and HR images are scaled to [-1,1].
  • The implementation below is in Keras.

The functions below return HR and LR images as numpy arrays, given a list of images (each image itself a numpy array). LR images are down-scaled by a factor of 4.

Code for getting HR and LR images
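
A minimal NumPy sketch of this step is given here. All function names are hypothetical, and the down-scaling uses simple 4 x 4 block averaging as a stand-in for whatever interpolation the original code applies, so treat it as an illustration of the data flow rather than the exact preprocessing:

```python
import numpy as np

def downscale(image, factor=4):
    """Down-scale an (H, W, C) image by averaging factor x factor blocks."""
    h, w, c = image.shape
    h, w = h - h % factor, w - w % factor  # crop so both sides divide evenly
    blocks = image[:h, :w].reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3))

def normalize(images):
    """Map pixel values in [0, 255] to floats in [-1, 1]."""
    return (np.asarray(images, dtype=np.float32) - 127.5) / 127.5

def denormalize(images):
    """Map floats in [-1, 1] back to uint8 pixel values."""
    pixels = np.rint(np.asarray(images) * 127.5 + 127.5)
    return np.clip(pixels, 0, 255).astype(np.uint8)

def hr_lr_arrays(images, factor=4):
    """Return normalized (HR, LR) arrays from a list of equally sized uint8 images."""
    hr = np.stack(images)
    lr = np.stack([downscale(img.astype(np.float32), factor) for img in images])
    return normalize(hr), normalize(lr)
```

For 384 x 384 HR crops and a factor of 4 this yields LR arrays of size 96 x 96, matching the sizes quoted above.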

Generator Network: 16 residual blocks and 2 up-sampling blocks are used. The network follows the architecture given in Figure 3.

Generator Network
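
A sketch of such a generator in Keras could look like the following. It is written from the description above (16 residual blocks, 2 up-sampling blocks, PReLU, skip connections), not copied from the original code. Several details are assumptions on my part: the 9 x 9 kernels in the first and last convolution and the 256 filters in the up-sampling blocks follow my reading of the SRGAN paper, the `tanh` output matches the [-1,1] scaling of the HR images, and `UpSampling2D` is my guess at the "Keras built-in function" used for up-scaling.

```python
from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x):
    """conv(k3n64s1) -> BN -> PReLU -> conv(k3n64s1) -> BN, plus a skip connection."""
    skip = x
    x = layers.Conv2D(64, 3, strides=1, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.PReLU(shared_axes=[1, 2])(x)
    x = layers.Conv2D(64, 3, strides=1, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.Add()([skip, x])

def upsample_block(x):
    """conv(k3n256s1) followed by 2x up-sampling and PReLU."""
    x = layers.Conv2D(256, 3, strides=1, padding="same")(x)
    x = layers.UpSampling2D(size=2)(x)  # assumed stand-in for PixelShuffler x2
    return layers.PReLU(shared_axes=[1, 2])(x)

def build_generator(lr_size=96, n_res_blocks=16, n_upsample_blocks=2):
    """Map an (lr_size, lr_size, 3) LR image to an SR image 4x larger per side."""
    inputs = keras.Input(shape=(lr_size, lr_size, 3))
    x = layers.Conv2D(64, 9, strides=1, padding="same")(inputs)
    x = layers.PReLU(shared_axes=[1, 2])(x)
    head = x  # kept for the long skip connection around all residual blocks

    for _ in range(n_res_blocks):
        x = residual_block(x)

    x = layers.Conv2D(64, 3, strides=1, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Add()([head, x])

    for _ in range(n_upsample_blocks):
        x = upsample_block(x)

    # tanh keeps the output in [-1, 1], the range used for the HR images.
    outputs = layers.Conv2D(3, 9, strides=1, padding="same", activation="tanh")(x)
    return keras.Model(inputs, outputs, name="generator")
```

Calling `build_generator()` with the defaults gives a model that maps 96 x 96 inputs to 384 x 384 outputs.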

Discriminator Network : Same network architecture is followed as given in Figure 3.

Discriminator Network

Content Loss function : It is computed directly on the generator’s outputs. This first loss steers the GAN model towards the deblurring task. It compares the VGG feature maps (the outputs of its early convolutions) of the generated image with those of the reference image.

Content Loss function

GAN : The Generator and the Discriminator combined. Two losses are used here: the content loss defined above and the adversarial loss (binary cross-entropy loss).

GAN (Generator and Discriminator combined)
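
To show how the two losses combine, here is a small NumPy sketch that works on arrays that are already computed: VGG feature maps for the HR and SR images, and the discriminator's probabilities for the SR images. The function names are hypothetical, and the default weights of 1 and 1e-3 are the ratio quoted in the training settings of this post.

```python
import numpy as np

def content_loss(hr_features, sr_features):
    """Mean squared error between feature maps of the HR and SR images."""
    diff = np.asarray(hr_features, dtype=np.float64) - np.asarray(sr_features, dtype=np.float64)
    return float(np.mean(diff ** 2))

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy averaged over the batch."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)))

def perceptual_loss(hr_features, sr_features, d_on_sr, weights=(1.0, 1e-3)):
    """Weighted sum of the content loss and the adversarial loss."""
    d_on_sr = np.asarray(d_on_sr, dtype=np.float64)
    # The generator is trained so that the discriminator labels SR images as real (1).
    adversarial = binary_crossentropy(np.ones_like(d_on_sr), d_on_sr)
    return weights[0] * content_loss(hr_features, sr_features) + weights[1] * adversarial
```

The small adversarial weight means the content loss dominates, with the adversarial term nudging the output towards realistic textures.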

Optimizer function:


Training :

  • Optimizer: Used Adam optimizer with β1 = 0.9 and learning rate 0.0001
  • Number of iterations : 3000 (you can train for more if you have a better GPU). For me training took 3–4 days on an NVIDIA Tesla P100.
  • Compiled the GAN network with VGG loss and binary cross-entropy loss, weighted [1., 1e-3], with Adam as the optimizer.
  • Used batch size as 64

Below is complete train function:

Training function
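
The core of one training iteration can be sketched as follows. This is a simplified outline under several assumptions, not the original train function: `generator`, `discriminator` and `gan` are assumed to be compiled Keras-style models exposing `predict` and `train_on_batch`, `gan` is assumed to have been compiled with the discriminator frozen and with the two losses above (so its targets are the HR images and a vector of ones), and `discriminator.train_on_batch` is assumed to return a scalar loss. The helper names are hypothetical.

```python
import numpy as np

def sample_batch(hr_images, lr_images, batch_size=64, rng=None):
    """Pick the same random indices from the HR and LR arrays."""
    rng = rng or np.random.default_rng()
    idx = rng.integers(0, len(hr_images), size=batch_size)
    return hr_images[idx], lr_images[idx]

def train_step(generator, discriminator, gan, hr_batch, lr_batch):
    """Run one discriminator update and one generator update; return both losses."""
    batch_size = len(hr_batch)
    sr_batch = generator.predict(lr_batch)

    # Discriminator update: real HR images are labelled 1, generated SR images 0.
    d_loss_real = discriminator.train_on_batch(hr_batch, np.ones(batch_size))
    d_loss_fake = discriminator.train_on_batch(sr_batch, np.zeros(batch_size))
    d_loss = 0.5 * (d_loss_real + d_loss_fake)

    # Generator update through the combined model: the content loss compares
    # against the HR images, the adversarial loss asks for the label "real".
    g_loss = gan.train_on_batch(lr_batch, [hr_batch, np.ones(batch_size)])
    return d_loss, g_loss
```

A full training run would repeat `sample_batch` and `train_step` for the chosen number of iterations and periodically save the generator.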

You can find complete Code at Github:

Results : Here are a few results after training:

Image 1: Left- LR image, Middle- Generated image, Right- HR image
Image 2: Left- LR image, Middle- Generated image, Right- HR image
Image 3: Left- LR image, Middle- Generated image, Right- HR image

The images used are from the COCO data set and are used purely for non-commercial and experimental purposes.