SuperResolution by Unet and customized Resnet style loss

Nguyễn Văn Lĩnh
Published in datatype · Jan 13, 2020

Unet is one of those universal deep learning architectures that can handle multiple tasks well enough, needing only a change of the input-output setup. Beyond its intensive use in image segmentation, people nowadays use it for image sharpening, image colorization, even 3D segmentation …

The basic idea of this DIY deep learning is:

  • Using Unet for SuperResolution of low-quality images.
  • A customized loss inspired by the VGG style-transfer loss from the fastai course.
  • My customized loss is defined on a Resnet pretrained on ImageNet. My simple argument: VGG is old and Resnet is clearly the better architecture for image classification, so Resnet should have learned better image filters/convolutions that are useful for image representation in further tasks.

The outline is:

  • Unet introduction and where we plug in.
  • Style transfer with VGG.
  • VGG vs Resnet, and which Resnet layers to take features from, since Resnet has no max-pool layers.
  • Feature losses inspired by style transfer.
  • Experiment detail and results.

Motivation results

Using the same number of training loops:

SuperResolution by Unet + VGG style-transfer loss. Model 1

Input image size: 96x96, jpg quality = 60 when saving the resized image from the raw image.

My proposal: SuperResolution by Unet + Resnet style transfer loss

Input image size: 128x128, jpg quality = 40. The prediction image

SuperResolution by Unet + VGG style-transfer loss. Model 2

Continued training from the previous model 1 (96x96 size), with double the training time.

Input image size: 256x256, jpg quality = 60. The prediction image

Brief conclusion

With an appropriate setup + a better loss: half the training time and ‘good enough to use’ quality. There is only a bit less detail in the cat’s eyes, but that can also be an over-fitting indicator: some regions are too smooth.

My preference is fast prototyping: get the pipeline done and obtain ‘good enough’ results as quickly as possible. What matters is getting the idea implemented, which is especially fast with the fastai library.

Unet

The name says it all: a U-shaped network architecture built from common deep-network elements: convolution, max-pool, up-convolution (UpSampling or Conv2dTranspose), and 1x1 convolution (thanks to Resnet).

Other networks usually shrink the output size after max-pool layers down to a small spatial size with a larger number of features, then up-convolve from that middle layer back to the desired output size. These bottleneck, encode-decode networks gradually transform the input information into features and then try to learn the image back again. However, this is a harder equation to solve if you do not have enough training data, because the input size/information keeps shrinking layer after layer through the network. In general it is a data compression problem, and a computer is just a computer: we need to give it more information or an easier problem.

The breakthrough of Unet is reusing the data from the middle layers in the up-convolution steps. Simply put, more information is better, and in particular the middle layers contain information from different perspectives/feature spaces.

People call the encoding path the contracting path and the decoding path the expansive path, and ‘crop and copy’ connects layer outputs from the two symmetric sides of the network. There is more information to decode now, which is more useful and more data-efficient.
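
To make the ‘crop and copy’ idea concrete, here is a toy PyTorch sketch (illustrative only, not the actual Unet code): a decoder step that upsamples, concatenates the matching encoder activation, then convolves.

import torch
import torch.nn as nn
import torch.nn.functional as F

class UpBlock(nn.Module):
    """Toy decoder step: upsample, then concatenate the matching
    encoder activation (the 'crop and copy' skip connection)."""
    def __init__(self, up_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(up_ch, up_ch, kernel_size=2, stride=2)
        self.conv = nn.Conv2d(up_ch + skip_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x, skip):
        x = self.up(x)                     # double the spatial size
        x = torch.cat([x, skip], dim=1)    # reuse the encoder features
        return F.relu(self.conv(x))

# Shape check: decoder activation 256x16x16 + encoder activation 128x32x32
block = UpBlock(up_ch=256, skip_ch=128, out_ch=128)
out = block(torch.randn(1, 256, 16, 16), torch.randn(1, 128, 32, 32))
print(out.shape)  # torch.Size([1, 128, 32, 32])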

Even though Unet has been public since 2015, people have only started using it widely in the last two years, after Kaggle competition winners adopted it. I admire the way the authors looked at the whole limited-data segmentation problem. You should read the paper if you have not yet.

Illustration of U-Net architecture. Olaf Ronneberger, Philipp Fischer, Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. May 2015

Unet for SuperResolution

Technically, we can do SuperResolution using a plain Unet directly:

  • Downsample the training images.
  • Train the Unet with its default loss.
  • Training data: the low-resolution images.
  • Training target: the corresponding high-resolution images. Of course, you need to change the output size to image width x height x 3 (a minimal data/learner sketch follows this list).
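
As a rough illustration, here is a minimal sketch of such a setup in fastai v1, in the spirit of the lesson7-superres notebook; path_hr and path_lr are assumed folders of high-resolution originals and downsized jpgs, and the exact arguments are illustrative rather than the article’s code.

from fastai.vision import *

# Assumed folders (hypothetical paths): path_lr holds the downsized/compressed jpgs,
# path_hr the original high-resolution images.
path_hr = Path('data/images')
path_lr = Path('data/images_lr_96')

# Image-to-image item list: input = low-res file, target = high-res file with the same name
src = ImageImageList.from_folder(path_lr).split_by_rand_pct(0.1, seed=42)
data = (src.label_from_func(lambda x: path_hr/x.name)
        .transform(get_transforms(), size=(96, 96), tfm_y=True)
        .databunch(bs=16)
        .normalize(imagenet_stats, do_y=True))
data.c = 3  # the output is a 3-channel image, not class scores

# Plain Unet with a simple pixel loss; the feature loss built later in this
# post can be swapped in via loss_func=...
learn = unet_learner(data, models.resnet34, loss_func=F.l1_loss,
                     blur=True, norm_type=NormType.Weight)
learn.fit_one_cycle(10, 1e-3)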

However, getting good results for the SuperResolution task is not easy; we need something more than a numerical loss to compare the upsampled image against the real one. At the semantic level, humans compare two images by:

  • Color/style difference.
  • Detail difference.
  • Conceptual difference (maybe).

So what makes sense here is a more comprehensive loss. Remember that people have done style transfer, combining two images into a new one that includes the style/color/detail of both inputs. It is a somewhat similar topic, even though the application differs in these points:

  • Here I need to keep the input’s style (i.e. transfer it unchanged to the output).
  • Style transfer did this with more intuitive layer-statistics losses.

Style transfer with VGG

Style transfer was quite a hot algorithm around 2015 due to its practical and fun-to-use aspect. The problem is formulated as:

  • Input: content image and style image.
  • Output: a generated image that keeps the content pixels but mixes in the style of the ‘style image’.
Style transfer example. https://hackernoon.com/how-do-neural-style-transfers-work-7bedaee0559a
  • To do: optimize over a randomly initialized generated image, passing all 3 images through a pretrained neural network to obtain the inputs for the loss. They need to be represented in the same feature space before measuring differences.
Style transfer diagram. https://hackernoon.com/how-do-neural-style-transfers-work-7bedaee0559a

As we only care about the loss:

  • Content loss: L1 numerical loss among pixels (nothing new).
  • Style loss: the content loss of the “feature correlation”.

Feature correlation and style loss

At a middle-layer output (an activation layer), the tensor has the shape:

  • [n, c, h, w] : batch size x feature_channels x feature_height x feature_width
  • The correlation here is the correlation among the feature maps of size [h, w]. Mathematically this is the Gram matrix: the matrix dot product of (flattened) feature maps followed by a sum, returning one number per pair as the correlation measurement.
  • The output will be a feature-correlation tensor of shape [n, c, c].
  • It covers all pairwise combinations of correlations among the feature channels (a quick shape check follows this list).
  • The style loss is the content loss between the feature-correlation tensors of the generated and style images.
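
As an illustration (not the article’s code), a quick shape check of that correlation tensor with plain torch; the numbers are arbitrary.

import torch

x = torch.randn(2, 64, 32, 32)                 # [n, c, h, w] toy activation
flat = x.view(2, 64, -1)                       # [n, c, h*w]: each channel flattened to a vector
corr = torch.bmm(flat, flat.transpose(1, 2))   # [n, c, c]: channel-vs-channel dot products
print(corr.shape)                              # torch.Size([2, 64, 64])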

Remember that as the input image size shrinks, there are no pixel-level details left to fit, only a general description of the image.

Where to get the VGG middle-layer outputs?

The locations where VGG changes its output size are the interesting places to grab the activation/middle-layer outputs, since we cannot compute the style loss on every layer output.

Clearly, those are where the max-pool layers are located. But Resnet has no max-pool layers.
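
For VGG, one way to find those layers, sketched here with torchvision directly (the fastai lesson takes a similar approach), is to grab the layer index right before each max-pool:

import torch.nn as nn
from torchvision.models import vgg16_bn

vgg = vgg16_bn(pretrained=True).features.eval()

# Indices of the layers right before each max-pool: the activations whose
# spatial size is about to be halved.
block_ids = [i - 1 for i, layer in enumerate(vgg)
             if isinstance(layer, nn.MaxPool2d)]
print(block_ids)  # [5, 12, 22, 32, 42] for vgg16_bn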

VGG vs Resnet

In short, Resnet’s advantages come from:

  • Residual approximation: instead of learning the full mapping f(x) from x to y, learn only the residual, so the block just has to transform x into f(x) + x.
  • Strided 1x1 convolution: still halves the spatial size, but no max-pool is needed (a toy sketch follows the figure link below).
https://cv-tricks.com/cnn/understand-resnet-alexnet-vgg-inception/
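
A toy sketch of those two points together (illustrative only, not Resnet’s actual implementation): a residual block whose shortcut is a strided 1x1 convolution, so the spatial size is halved without any max-pool.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DownResBlock(nn.Module):
    """Toy residual block that halves the spatial size; the shortcut is a
    strided 1x1 conv instead of a max-pool (batch norm omitted for brevity)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, stride=1, padding=1)
        self.shortcut = nn.Conv2d(c_in, c_out, 1, stride=2)   # 1x1 conv, stride 2

    def forward(self, x):
        residual = self.conv2(F.relu(self.conv1(x)))
        return F.relu(residual + self.shortcut(x))   # learn f(x), output f(x) + x

block = DownResBlock(64, 128)
print(block(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 128, 28, 28])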

To stay independent of the Resnet implementation details, just find the layer locations where the output size changes.

Resnet feature loss

Identifying Resnet middle-layer outputs in fastai

The Resnet model is for evaluation purposes only:

# Pretrained Resnet34 body, frozen and kept in eval mode (feature extractor only)
body = create_body(models.resnet34, pretrained=True).cuda().eval()
requires_grad(body, False)

Send a dummy batch through and use the model_sizes function; it is not mentioned in the documentation, but you will find it in the fastai source code.

# Dummy batch to get the output size after each level
m_sizes = model_sizes(body, (224,224))

Get the locations:

# Get the list where output size changes
list_idx = [i for i,x in enumerate(m_sizes) if i<len(m_sizes)-1 and not(x == m_sizes[i+1])]
list_idx.append(len(m_sizes)-1)

The final list of interesting layers is: [2, 4, 5, 6, 7]

A hook is the fastai/PyTorch way to get the output of a middle layer.
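
Continuing from the snippets above (body and list_idx), a minimal sketch using fastai v1’s hook_outputs; the dummy forward pass is only there to show where the hooked activations end up.

# Hook the chosen Resnet body layers; their outputs are captured on every forward pass
feat_layers = [body[i] for i in list_idx]        # layers [2, 4, 5, 6, 7]
hooks = hook_outputs(feat_layers, detach=False)

_ = body(torch.randn(1, 3, 224, 224).cuda())     # any forward pass
feats = [h.stored for h in hooks]                # activation of each hooked layer
print([f.shape for f in feats])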

The layer weights for the style losses are human input, based on quick experiments: run the training a few times and print all the losses. Rule of thumb: try to make the magnitudes of these losses close to each other.

Implementation

My github notebook:

https://github.com/nguyenv7/ipynb/blob/master/superres-featureLoss-resnet-preTrain_Imagenet.ipynb

One more detail about the Gram matrix function.

Gram matrix function
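
A sketch of such a gram_matrix function, in the spirit of the fastai lesson (normalization by c*h*w included so the loss magnitude does not depend on the layer size):

def gram_matrix(x):
    # x: [n, c, h, w] activation from a hooked layer
    n, c, h, w = x.size()
    x = x.view(n, c, -1)                        # flatten each channel to a vector
    return (x @ x.transpose(1, 2)) / (c*h*w)    # [n, c, c] normalized feature correlations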

The @ is the matmul operator; when the inputs are 3-D tensors, it performs batch matrix multiplication. Basically, it keeps the first dimension as the batch dimension and multiplies the rest.

https://kite.com/python/docs/torch.matmul

Conclusion:

  • With Resnet: less training (half the training time/steps) and still a good result, as expected.
  • Use fastai if you have not yet. Some of our work uses fastai elements and runs well on AWS Lambda + OpenCV.

REFERENCE:

As a fastai follower since Jeremy’s first practical machine learning and deep learning courses, I have learned a lot and drawn much inspiration from him. This DIY deep learning experiment is based on his latest online course and took about 3 hours of coding and experimenting in summer 2019. You may need to modify the code a bit to run it on the new fastai version. The reference source code is here:

https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson7-superres.ipynb
