EnhanceNet: Single Image Super-Resolution Through Automated Texture Synthesis

Shivani Rapole · Published in The Startup · Aug 13, 2020

In this article, I discuss the highlights of the EnhanceNet paper for single image super-resolution and the performance of its various versions. I mainly want to emphasize its loss functions, which are responsible for a perceptual quality close to that of the original high-resolution image.

Background

In single image super-resolution, CNN models trained on objective metrics (like mean squared error) might give good PSNR values, but they typically produce over-smoothed images and thereby fail to capture the high-frequency features of an image. EnhanceNet is a Generative Adversarial Network that focuses on generating realistic textures and higher perceptual quality rather than just improving PSNR values.

Network Architecture

Network architecture of EnhanceNet for 4x upsampling
  • EnhanceNet has a generic feed-forward, fully convolutional CNN architecture comprising 10 residual blocks, which help the model converge faster.
  • Instead of transposed convolution layers, nearest-neighbor upsampling followed by a convolution layer is used in the upsampling part of the network to avoid checkerboard artifacts.
  • Finally, a bicubic interpolation of the low-resolution image is added to the reconstructed output to avoid color shifts and to ensure training stability (a minimal sketch of this layout follows below).
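To make the layout concrete, here is a minimal PyTorch sketch of such a generator. The channel width (64), kernel sizes, and activation placement are illustrative assumptions on my part, not the exact values from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        # Skip connection: the block only learns a residual, which
        # helps the model converge faster.
        return x + self.conv2(F.relu(self.conv1(x)))

class EnhanceNetGenerator(nn.Module):
    def __init__(self, channels=64, num_blocks=10):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        self.blocks = nn.Sequential(
            *[ResidualBlock(channels) for _ in range(num_blocks)])
        # Two 2x nearest-neighbor upsampling stages, each followed by a
        # convolution, instead of transposed convolutions.
        self.up_conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.up_conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, lr):
        x = F.relu(self.head(lr))
        x = self.blocks(x)
        x = F.relu(self.up_conv1(F.interpolate(x, scale_factor=2, mode="nearest")))
        x = F.relu(self.up_conv2(F.interpolate(x, scale_factor=2, mode="nearest")))
        residual = self.tail(x)
        # Add the bicubic upsampling of the LR input: the network only
        # learns the residual, avoiding color shifts and stabilizing training.
        bicubic = F.interpolate(lr, scale_factor=4, mode="bicubic",
                                align_corners=False)
        return bicubic + residual
```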

Loss Functions

Versions of EnhanceNet based on combinations of loss functions

These losses are the pillars of EnhanceNet; the performance of various combinations of them was studied.

(i) Pixel-wise MSE loss in the image domain (Le)

$\mathcal{L}_E = \lVert I_{est} - I_{HR} \rVert_2^2$ (est: network-estimated image; HR: high-resolution image)

This is the baseline loss: the mean squared error between the estimated image and the ground truth.
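As a quick PyTorch sketch, assuming `est` and `hr` are image tensors of the same shape and value range:

```python
import torch.nn.functional as F

def pixel_loss(est, hr):
    # Pixel-wise mean squared error between the network estimate (est)
    # and the high-resolution ground truth (hr).
    return F.mse_loss(est, hr)
```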

(ii) Perceptual Loss in feature space (Lp)

$\mathcal{L}_P = \lVert \phi(I_{est}) - \phi(I_{HR}) \rVert_2^2$ (est: network-estimated image; HR: high-resolution image)

Perceptual losses are generally computed by mapping both the super-resolved image and the ground truth through different layers of the famous VGG network (the feature map φ). In a pre-trained VGG-19 network, the feature spaces of the early convolutional layers capture high-frequency features (detailed pixel content), while the later layers emphasize the primary structure of an image. So, a combination of the second and fifth max-pooling layers of VGG-19 is used to calculate the perceptual loss.
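A sketch of this loss in PyTorch could look as follows. In torchvision's `vgg19`, the second and fifth max-pooling layers sit at indices 9 and 36 of the `features` module; treat these indices and the weight-loading argument as assumptions to verify against your torchvision version.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19

class PerceptualLoss(nn.Module):
    """MSE in the feature space of a frozen, pre-trained VGG-19."""

    def __init__(self, layer_ids=(9, 36)):
        super().__init__()
        # Older torchvision versions use vgg19(pretrained=True) instead.
        features = vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in features.parameters():
            p.requires_grad = False  # the VGG network is never trained
        self.features = features
        self.layer_ids = set(layer_ids)

    def _extract(self, x):
        # Collect feature maps at the chosen pooling layers.
        feats = []
        for i, layer in enumerate(self.features):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

    def forward(self, est, hr):
        loss = 0.0
        for f_est, f_hr in zip(self._extract(est), self._extract(hr)):
            loss = loss + F.mse_loss(f_est, f_hr)
        return loss
```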

(iii) Texture Matching Loss (Lt)

$\mathcal{L}_T = \lVert G(\phi(I_{est})) - G(\phi(I_{HR})) \rVert_2^2$ (est: network-estimated image; HR: high-resolution image)

This is based on the style-transfer paper, which transfers the texture style of one painting onto another. Patches of the image (16×16 pixels) are used to compute this loss so that it concentrates on matching local texture with the high-resolution image. φ represents a feature map generated from the VGG-19 network, and G(F) = FFᵀ is the Gram matrix, the product of the feature matrix with its transpose.
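A possible PyTorch sketch is below. It assumes `feat_est` and `feat_hr` are VGG feature maps whose spatial size is divisible by the patch size, and it applies the 16×16 patching to the feature maps for simplicity; this is my simplification, not necessarily the paper's exact procedure.

```python
import torch.nn.functional as F

def gram_matrix(f):
    # f: (N, C, H, W) feature maps -> (N, C, C) Gram matrices F F^T,
    # normalized by the number of spatial positions.
    n, c, h, w = f.shape
    f = f.reshape(n, c, h * w)
    return f @ f.transpose(1, 2) / (h * w)

def texture_loss(feat_est, feat_hr, patch_size=16):
    """Patch-wise Gram-matrix matching between estimated and HR features.

    The maps are split into non-overlapping patch_size x patch_size tiles
    so the Gram statistics stay local to each patch.
    """
    n, c, _, _ = feat_est.shape

    def to_patches(f):
        # unfold: (N, C, H, W) -> (N, C*p*p, L), L = number of patches
        patches = F.unfold(f, kernel_size=patch_size, stride=patch_size)
        l = patches.shape[-1]
        # -> (N*L, C, p, p), one "image" per patch
        return patches.transpose(1, 2).reshape(n * l, c, patch_size, patch_size)

    g_est = gram_matrix(to_patches(feat_est))
    g_hr = gram_matrix(to_patches(feat_hr))
    return F.mse_loss(g_est, g_hr)
```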

(iv) Adversarial Loss (La)

This is the typical GAN minimax loss, which jointly optimizes the generator and the discriminator: the discriminator learns to distinguish super-resolved images from real high-resolution ones, while the generator learns to fool it.
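In practice this is usually implemented with binary cross-entropy on the discriminator's logits. The sketch below shows the common non-saturating variant of the generator loss rather than the literal minimax form; whether EnhanceNet uses exactly this variant is an assumption here.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real, d_fake):
    # Classify HR images as real and super-resolved images as fake.
    # d_real / d_fake are the discriminator's raw logits.
    real = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
    fake = F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    return real + fake

def generator_adversarial_loss(d_fake):
    # Non-saturating generator loss: push the discriminator to call the
    # super-resolved images real.
    return F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
```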

Evaluation

4x super-resolved images using different variants of EnhanceNet
PSNR values for the super-resolved images on different datasets
  • From the above two figures, it can be observed that ENet-PAT looks perceptually closest to the HR image, while the ENet-E baseline image looks very blurry even though it produces the highest PSNR values.
  • ENet-P tends to create sharper edges than ENet-E but produces checkered artifacts instead of generating new texture.
  • ENet-PA produces better details but also introduces undesired high-frequency noise.
  • The texture loss in ENet-PAT helps it create meaningful local textures and reduces the noise and artifacts to a large extent.

Reference

[2017 ICCV] Sajjadi, M. S. M., Schölkopf, B., & Hirsch, M. EnhanceNet: Single Image Super-Resolution Through Automated Texture Synthesis. ICCV 2017.

PS:

This is my first Medium story. Please feel free to add any comments or suggestions :)
