SRCNN Paper Summary & Implementation

Sieun Park
Published in Analytics Vidhya · Mar 14, 2021

Summary

Paper: arxiv.org/abs/1501.00092

SRCNN [1] proposes a 3-layer CNN for image super-resolution. It is one of the first papers to apply deep neural networks to the task of image super-resolution. The SRCNN architecture is composed of three components: feature extraction, non-linear mapping, and reconstruction. The model is trained to minimize the pixel-wise MSE between the reconstructed image and the ground-truth image. The paper tests a variety of model architectures and hyper-parameters, trading off performance against speed.

Model Architecture

The proposed architecture conceptually consists of three components: a feature extractor, a non-linear mapping, and a reconstruction stage, responsible respectively for extracting low-resolution features, mapping them into high-resolution features, and reconstructing the output image. The low-resolution image is bicubic-interpolated into Y, which has the same size as the high-resolution image X. The model aims to learn a mapping F: Y -> X.

In the end, each component turns out to be represented by a single convolution layer, resulting in a 3-layer convolutional neural network with kernel sizes 9–1–5. According to the figure below, the intermediate outputs of each layer seem to contain the information they are expected to compute.
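For reference, the three stages as formulated in the paper, where $W_i$ and $B_i$ are the filters and biases of the $i$-th layer, $*$ denotes convolution, and ReLU is applied after the first two layers only:

$$F_1(Y) = \max(0,\ W_1 * Y + B_1)$$
$$F_2(Y) = \max(0,\ W_2 * F_1(Y) + B_2)$$
$$F(Y) = W_3 * F_2(Y) + B_3$$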

Loss

The loss function is defined as the pixel-wise MSE (mean squared error) between the reconstructed image F(Y) and the ground-truth image X. Minimizing this loss is equivalent to maximizing the PSNR measure.
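Written out, with $\theta$ denoting the network weights (filters and biases) and $n$ the number of training pairs:

$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \lVert F(Y_i; \theta) - X_i \rVert^2$$

Minimizing this MSE maximizes $\mathrm{PSNR} = 10 \cdot \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, where MAX is 1 for images scaled to [0, 1].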

Experiments

The paper experiments with various hyper-parameter settings to improve performance. The figure below shows how the 9–1–5 setting outperforms other settings, and that deeper networks, which increase the capacity of the non-linear mapping stage in the paper's 3-stage methodology, turned out to be unnecessary. Although some of the assumptions behind the proposed model have since been shown to be false by further deep learning research, the experiments demonstrate the strong performance of the proposed 3-stage methodology for SR.

Implementation

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import requests
import tensorflow_datasets as tfds
from tqdm import tqdm
import os
import shutil
data=tfds.load('tf_flowers')

Import the necessary libraries. We will use the tf_flowers dataset, consisting of 3,670 images of flowers, as a small toy dataset for training.
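If you want to double-check the dataset size, tfds can also return the dataset metadata alongside the data (an optional check, not needed for training):

# Optional: load the dataset together with its metadata to inspect the number of examples.
data, info = tfds.load('tf_flowers', with_info=True)
print(info.splits['train'].num_examples)  # 3670 - all images are in the single 'train' split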

train_data = data['train'].skip(600)
test_data = data['train'].take(600)

@tf.function
def build_data(data):
    # Take a random (128, 128) crop and scale pixel values to [0, 1].
    cropped = tf.dtypes.cast(tf.image.random_crop(data['image'] / 255, (128, 128, 3)), tf.float32)
    # Downsample by 2x, then upscale back with bicubic interpolation to create the LR input.
    lr = tf.image.resize(cropped, (64, 64))
    lr = tf.image.resize(lr, (128, 128), method=tf.image.ResizeMethod.BICUBIC)
    return (lr, cropped)

def downsample_image(image, scale):
    # Same idea for arbitrary-size images: downsample by `scale`, then bicubic-upscale back.
    lr = tf.image.resize(image / 255, (image.shape[0] // scale, image.shape[1] // scale))
    lr = tf.image.resize(lr, (image.shape[0], image.shape[1]), method=tf.image.ResizeMethod.BICUBIC)
    return lr

We hold out the first 600 images of the dataset as test data, and define a function build_data that takes a random (128, 128) crop of a given image and returns a low-resolution and a high-resolution copy of it. The low-resolution copy is generated through bicubic interpolation, as proposed in the paper.
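As a quick sanity check (a minimal sketch, not part of the original script), each element produced by build_data should be a pair of float32 tensors of shape (128, 128, 3):

# Inspect one (low-resolution, high-resolution) pair produced by build_data.
lr, hr = next(iter(train_data.map(build_data).take(1)))
print(lr.shape, lr.dtype)  # (128, 128, 3) float32
print(hr.shape, hr.dtype)  # (128, 128, 3) float32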

for x in train_data.take(1):
    plt.imshow(x['image'])
    plt.show()

train_dataset_mapped = train_data.map(build_data, num_parallel_calls=tf.data.AUTOTUNE)
for x in train_dataset_mapped.take(1):
    plt.imshow(x[0].numpy())
    plt.show()
    plt.imshow(x[1].numpy())
    plt.show()

SRCNN_915 = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(64, 9, padding='same', activation='relu'),
    tf.keras.layers.Conv2D(64, 1, padding='same', activation='relu'),
    tf.keras.layers.Conv2D(3, 5, padding='same', activation='relu')
])
def pixel_mse_loss(y_true, y_pred):
    # Pixel-wise mean squared error between the reconstruction and the ground truth.
    return tf.reduce_mean((y_true - y_pred) ** 2)

def log10(x):
    numerator = tf.math.log(x)
    denominator = tf.math.log(tf.constant(10, dtype=numerator.dtype))
    return numerator / denominator

def PSNR(y_true, y_pred):
    # PSNR in dB, assuming pixel values in [0, 1] (MAX = 1).
    mse = tf.reduce_mean((y_true - y_pred) ** 2)
    return 20 * log10(1 / (mse ** 0.5))

SRCNN_915.compile(optimizer=tf.keras.optimizers.Adam(0.001), loss=pixel_mse_loss)

We define the training loss pixel_mse_loss as proposed in the paper, and also define a PSNR function to evaluate the PSNR of the model. The model architecture is a 3-layer CNN with kernel sizes 9–1–5, compiled with the Adam optimizer.
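If you also want Keras to report PSNR during training, the PSNR function can be passed as a metric when compiling. This is an optional variation, not something the original script does:

# Optional: track PSNR alongside the MSE loss during training.
SRCNN_915.compile(optimizer=tf.keras.optimizers.Adam(0.001),
                  loss=pixel_mse_loss,
                  metrics=[PSNR])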

for x in range(50):
    train_dataset_mapped = train_data.map(build_data, num_parallel_calls=tf.data.AUTOTUNE).batch(128)
    val_dataset_mapped = test_data.map(build_data, num_parallel_calls=tf.data.AUTOTUNE).batch(128)
    SRCNN_915.fit(train_dataset_mapped, epochs=1, validation_data=val_dataset_mapped)

24/24 [==============================] - 6s 215ms/step - loss: 0.0413 - val_loss: 0.0138

24/24 [==============================] - 5s 209ms/step - loss: 0.0116 - val_loss: 0.0094

24/24 [==============================] - 5s 210ms/step - loss: 0.0084 - val_loss: 0.0073

Every epoch, the images are re-cropped so that new samples are generated from each image. The model can be trained for more iterations to improve performance, but the loss generally didn't go below 0.0032.
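For a rough sense of scale, an MSE of 0.0032 on images in [0, 1] corresponds to a PSNR of about 25 dB:

import math
# PSNR = 10 * log10(MAX^2 / MSE), with MAX = 1 for images scaled to [0, 1].
print(10 * math.log10(1.0 / 0.0032))  # ~24.9 dB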

train_dataset_mapped = train_data.map(build_data, num_parallel_calls=tf.data.AUTOTUNE)
for x in train_data.take(10):
    fig = plt.figure(figsize=(12, 4))

    # Original HR image.
    plt.subplot(1, 3, 1)
    plt.imshow(x['image'].numpy())
    plt.axis('off')

    # Bicubic-interpolated image (4x downsampled, then upscaled).
    plt.subplot(1, 3, 2)
    lr = downsample_image(x['image'].numpy(), 4)
    plt.imshow(lr.numpy())
    plt.axis('off')

    # SRCNN reconstruction.
    plt.subplot(1, 3, 3)
    pred = SRCNN_915(np.array([lr]))
    plt.imshow(pred[0].numpy())
    plt.axis('off')
    plt.show()

The first image is the original HR image, the second is the bicubic-interpolated image, and the last is the super-resolved image. Because the only weights in the network are convolution filters, which do not depend on the image size, the network can take inputs of a different size than the (128, 128) crops used for batch training.

Perceptually, the trained SRCNN doesn't show a radical improvement over the bicubic-interpolated image. I plan to write tutorials on more recent, advanced SR papers. Finally, we visualize the outputs of the intermediate layers.

layers = SRCNN_915.layers
train_dataset_mapped = train_data.map(build_data, num_parallel_calls=tf.data.AUTOTUNE)
for x in train_dataset_mapped.take(1):
    image = x[0].numpy().reshape(1, 128, 128, 3)

input_image_layer = layers[0].input
for idx, l in enumerate(layers):
    print("Output of layer", idx)
    # Build a sub-model that stops at this layer and run the LR image through it.
    intermediate_model = tf.keras.models.Model(input_image_layer, l.output)
    out = intermediate_model(image)
    # Show up to 20 of the layer's feature maps.
    fig = plt.figure(figsize=(20, 4))
    for i in range(min(out.shape[-1], 20)):
        plt.subplot(2, 10, i + 1)
        plt.imshow(out[0, :, :, i] * 127.5 + 127.5, cmap='gray')
        plt.axis('off')
    plt.show()

Output of layer 0

Output of layer 1

Output of layer 2

My Opinions

  • I believe the 9–1–5 structure of the proposed CNN is not the optimal model architecture for SR, although the authors conducted multiple experiments with various hyper-parameter settings.
  • Pixel-wise MSE is also not the best loss function for capturing perceptual distance, as discussed in later work that proposes more perceptual losses, such as GAN losses and VGG losses [2].
  • Although this is one of the first papers to adopt neural networks for SR, it does not incorporate many of the later advances that improve general DL performance (batch normalization, 3x3 kernels, modern optimizers, ...).

[1] Dong, Chao, et al. “Image super-resolution using deep convolutional networks.” IEEE transactions on pattern analysis and machine intelligence 38.2 (2015): 295–307.

[2] Ledig, Christian, et al. “Photo-realistic single image super-resolution using a generative adversarial network.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
