Video restoration and visual quality enhancement using U-Net networks

Matteo Marulli
6 min read · May 16, 2022

--

Many TV companies are interested in restoring their old material, such as old documentaries or old movies.

This material might be stored on analog media such as VHS or Betamax; let's look at an example in this video.

From the video we can detect some problems:

· The image colors are faded;

· A high level of noise;

· Low resolution (caused by old standards);

· High-frequency details (such as edges and corners) are poorly preserved.

Let's now see how we can overcome all these problems.

In computer vision there is an interesting research field called image enhancement, in which researchers develop algorithms that take a low-resolution or low-quality image as input and produce an equivalent image in high resolution and high quality. Look at this picture, for example.

Today I’m going to show you some interesting Deep learning techniques for Image enhancement.

All these techniques are based on the GAN framework, and for those who don't know what a GAN is, no worries: I will briefly explain it for you.

In simple terms, GANs are neural network architectures that involve training two neural networks: the generator network G and the discriminator network D. The goal of a GAN is to generate credible data, in this case credible images.

Training a GAN is not simple, because many hard problems can arise. One of these strange phenomena is called "mode collapse": the generator network keeps producing the same data over and over. There would be a lot more to say about GANs, but for today's purpose you now know everything you need.
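The adversarial game between G and D can be sketched in a few lines. The following is a toy numerical sketch, not a real image GAN: the "generator" is a single affine map on scalar noise, the "discriminator" is a logistic regression, and all names and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D GAN: real data ~ N(3, 1); the generator maps noise z ~ N(0, 1)
# through a single affine map x = a*z + b; the discriminator is a logistic
# regression D(x) = sigmoid(w*x + c). All values here are illustrative.
a, b = 1.0, 0.0        # generator parameters
w, c = 0.1, 0.0        # discriminator parameters
lr = 0.05

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

for step in range(2000):
    z = rng.normal(0.0, 1.0, 64)
    real = rng.normal(3.0, 1.0, 64)
    fake = a * z + b

    # Discriminator ascent step: maximize log D(real) + log(1 - D(fake))
    d_real, d_fake = sigmoid(w * real + c), sigmoid(w * fake + c)
    w += lr * (np.mean((1 - d_real) * real) - np.mean(d_fake * fake))
    c += lr * (np.mean(1 - d_real) - np.mean(d_fake))

    # Generator ascent step: maximize log D(fake) (non-saturating loss)
    d_fake = sigmoid(w * (a * z + b) + c)
    a += lr * np.mean((1 - d_fake) * w * z)
    b += lr * np.mean((1 - d_fake) * w)

print(float(b))  # the generator's offset has drifted toward the real mean
```

In this toy setup the generator's offset b tends toward the real data mean; mode collapse would show up here as the generator ignoring z entirely (a shrinking toward 0, so every sample is the same).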

As I said, many image enhancement deep learning techniques are based on GANs; here are a few of them:

· SR-GAN

· ESR-GAN

· SR-UNET

SRGAN is a generative adversarial network for single-image super-resolution. It uses a perceptual loss function consisting of an adversarial loss and a content loss. The adversarial loss pushes the solution towards the natural image manifold using a discriminator network trained to differentiate between super-resolved images and original photo-realistic images. In addition, the authors used a content loss motivated by perceptual similarity instead of similarity in pixel space. The actual networks are depicted in the following picture and consist mainly of residual blocks for feature extraction.

ESR-GAN is an enhanced version of SR-GAN developed by Xintao Wang et al. in 2018. This version greatly improves performance on image enhancement tasks, and it is considered the state of the art.

SR-UNET (F. Vaccaro and M. Bertini) is a generative network for a GAN framework, based on the U-Net architecture. I will explain the SR-UNET technique in detail because I worked with it for my Master's thesis.

SR-UNET is a U-Net designed for real-time quality enhancement of streaming videos compressed with the MPEG H.264/H.265 codecs.

Because of compression algorithms such as MPEG, streaming videos suffer from visual artefacts such as mosquito noise, blocking and posterization.

SR-UNET can quickly improve these videos, removing these artefacts and performing super-resolution on streaming video and old footage. Let's see how.

In the following picture I report the network schema. As you can see, it looks like the classic U-Net schema by Olaf Ronneberger.

If you don't know what a U-Net is, here it is in a nutshell: a U-Net is a neural network based on an encoder-decoder architecture, originally designed for image segmentation.

Between SR-UNET and the classic U-Net there are, of course, some differences. One of these is the number of filters in the convolutional layers: in SR-UNET the number of filters is constant.

The number of filters is kept constant because SR models do not require a large number of filters, unlike convolutional classification models. The only layers that do not follow this rule are the first layer of the encoder and the last layer of the decoder, which use half the number of filters.

This design choice yields a neural network with fewer learnable parameters than the classic U-Net, since the latter progressively increases the number of filters with the depth of the network.
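A quick back-of-the-envelope count makes the saving concrete. The widths below are illustrative, not taken from the paper: the classic U-Net doubles its filters at every level, while a constant width stands in for the SR-UNET choice.

```python
def conv_params(c_in, c_out, k=3):
    """Weight count of a k x k convolution (biases ignored for simplicity)."""
    return c_in * c_out * k * k

def encoder_params(widths, c_in=3):
    """Total weights of a chain of 3x3 convolutions with the given widths."""
    total = 0
    for width in widths:
        total += conv_params(c_in, width)
        c_in = width
    return total

doubling = [64, 128, 256, 512]   # classic U-Net style: width doubles per level
constant = [64, 64, 64, 64]      # constant-width style (illustrative numbers)

print(encoder_params(doubling))  # 1550016
print(encoder_params(constant))  # 112320
```

With these illustrative widths, the constant-width chain needs roughly 14x fewer weights, which is the point of the design choice above.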

Another difference between SR-UNET and U-Net is the use of the Pixel Shuffle layer, which is used to upscale the frame quickly, since it is the fastest upsampling layer available: the output tensor is first compressed into 12 channels via a convolution, and these features are then reshuffled into an RGB image at double the resolution.
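Pixel Shuffle itself is just a deterministic rearrangement of channels into space (depth-to-space). A minimal NumPy sketch of the 12-channels-to-RGB-at-2x case described above:

```python
import numpy as np

def pixel_shuffle(x, r=2):
    """Rearrange a (C*r*r, H, W) tensor into (C, H*r, W*r) (depth-to-space)."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)      # split the channel axis into (C, r, r)
    x = x.transpose(0, 3, 1, 4, 2)    # interleave: (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

# 12 feature channels -> an RGB image at double resolution, as in the article.
features = np.arange(12 * 4 * 4, dtype=np.float32).reshape(12, 4, 4)
rgb = pixel_shuffle(features, r=2)
print(rgb.shape)  # (3, 8, 8)
```

Each output pixel is simply read from one of the r*r sub-channels of its colour plane, so the layer has no parameters and costs only a memory reshuffle.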

The image generated by SR-UNET is defined as follows:

x^SR = Hardtanh(U(x^LR) + upsample(x^LR))

where x^SR is the super-resolved output, x^LR is the low-resolution input, U is the convolutional SR network, upsample is an upsampling filter such as bilinear or bicubic interpolation, and Hardtanh clips the output to the interval [−1, 1].

Modelling the problem as producing a residual on top of the upsampled image is particularly convenient: it forces the model to focus on the high-frequency patterns, sharpening edges and increasing texture detail, since the low-frequency patterns already come from the upsampled image.
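The residual formulation is easy to sketch numerically. Below, a small random tensor stands in for the network output U(x_LR), and nearest-neighbour repetition stands in for the bilinear/bicubic upsampling; both are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def upsample_nearest(x, r=2):
    """Cheap stand-in for bilinear/bicubic upsampling (nearest neighbour)."""
    return x.repeat(r, axis=-2).repeat(r, axis=-1)

def hardtanh(x):
    """Clip values to [-1, 1], as in the formula above."""
    return np.clip(x, -1.0, 1.0)

x_lr = rng.uniform(-1, 1, (3, 4, 4))             # low-res input in [-1, 1]
residual = 0.1 * rng.standard_normal((3, 8, 8))  # stand-in for U(x_lr)

# x_sr = Hardtanh(U(x_lr) + upsample(x_lr)): the network only adds
# high-frequency detail on top of a cheap upsampled base image.
x_sr = hardtanh(residual + upsample_nearest(x_lr))
print(x_sr.shape)  # (3, 8, 8)
```

If the network output were all zeros, x_sr would just be the interpolated frame, so the model only has to learn the correction, not the whole image.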

One further modification, aimed at increasing model capacity without adding computational cost, is to modify the skip connections as follows:

x′ = f(W_{3×3} ∗ x + W_{1×1} ∗ x + x)

where x′ is the output tensor, x is the input, f is the non-linear activation, and W_{3×3} and W_{1×1} are the weights of convolutional layers with kernel sizes 3×3 and 1×1 respectively. For simplicity, biases are omitted. The arguments of the non-linear function can easily be refactored into a single 3×3 layer, whose filters are computed as follows:

W′_{3×3} = W_{3×3} + pad(W_{1×1}) + I

where W_{3×3} are the 3×3 filters as before. To transform a layer with 1×1 kernels into a 3×3 one, it is enough to add zero-padding around the filters, and the identity skip connection can be modelled as a convolutional layer containing a diagonal identity.
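This structural reparameterization can be checked numerically. The sketch below uses a naive loop-based convolution for clarity, folds the 1x1 branch and the identity skip into a single 3x3 kernel, and verifies that both forms produce the same output:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, w):
    """'Same' convolution (cross-correlation). x: (Cin,H,W), w: (Cout,Cin,k,k)."""
    cout, cin, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    _, h, wd = x.shape
    out = np.zeros((cout, h, wd))
    for o in range(cout):
        for i in range(h):
            for j in range(wd):
                out[o, i, j] = np.sum(w[o] * xp[:, i:i + k, j:j + k])
    return out

c = 4
x = rng.standard_normal((c, 5, 5))
w3 = rng.standard_normal((c, c, 3, 3))
w1 = rng.standard_normal((c, c, 1, 1))

# Branched form: W3x3 * x + W1x1 * x + x
branched = conv2d(x, w3) + conv2d(x, w1) + x

# Merged form: fold the zero-padded 1x1 filters and the identity skip
# into a single 3x3 kernel.
merged = w3.copy()
merged[:, :, 1, 1] += w1[:, :, 0, 0]   # the 1x1 weight lands at the centre
for ch in range(c):
    merged[ch, ch, 1, 1] += 1.0        # identity as a diagonal 3x3 kernel
fused = conv2d(x, merged)

print(np.allclose(branched, fused))  # True
```

Since the two forms are exactly equivalent, the extra branches are free at inference time: the network trains with three paths but deploys as one 3x3 convolution.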

To train SR-UNET, we have to minimize the following loss function:

where y is the reference image, x̂ is the output of SR-UNET, the weighting parameter balances the importance of the discriminator loss term, and LPIPS and SSIM are perceptual metrics used to evaluate image quality.

As you can see from the loss formulation, you will need a training set made up of image pairs (x_LR, y), where y is x_LR at the best possible quality.
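Such pairs are usually built by degrading high-quality frames. The sketch below uses a hypothetical degradation (average-pool downsampling plus mild noise) purely as a stand-in; the real pairs would come from encoding the footage with the actual H.264/H.265 codec:

```python
import numpy as np

rng = np.random.default_rng(0)

def degrade(y, r=2, noise=0.02):
    """Hypothetical degradation: average-pool downsampling plus mild noise.
    A stand-in for a real MPEG/H.264 compression pipeline."""
    c, h, w = y.shape
    x = y.reshape(c, h // r, r, w // r, r).mean(axis=(2, 4))
    return np.clip(x + noise * rng.standard_normal(x.shape), -1.0, 1.0)

y = rng.uniform(-1, 1, (3, 8, 8))   # high-quality reference frame
x_lr = degrade(y)                   # paired low-quality network input
print(y.shape, x_lr.shape)          # (3, 8, 8) (3, 4, 4)
```

The key property is that the degradation used at training time should match the artefacts you expect at inference time, otherwise the network learns to undo the wrong corruption.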

To improve performance, I made several changes to the architecture. These modifications increase performance without dramatically increasing the number of SR-UNET weights or the training time.

The principal modifications are:

  • Atrous Spatial Pyramid Pooling (ASPP);
  • Squeeze-and-Excitation blocks.

Atrous Spatial Pyramid Pooling is a pyramidal block of atrous convolutions.
The ASPP block is used to enlarge the receptive field of the convolutions in order to capture more global context; I placed it between the encoder and the decoder. The scheme of an ASPP block is shown in the following picture:
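The core idea of an atrous (dilated) convolution is that spacing the kernel taps apart enlarges the receptive field at no extra parameter cost. A 1-D illustration (not the full ASPP block) applying the same 3-tap kernel at several dilation rates, as an ASPP-style bank would:

```python
import numpy as np

def dilated_conv1d(x, w, d):
    """'Same' 1-D dilated convolution; the kernel taps are spaced d apart,
    so a k-tap kernel covers (k - 1) * d + 1 input positions."""
    k = len(w)
    p = (k - 1) * d // 2
    xp = np.pad(x, (p, p))
    return np.array([sum(w[t] * xp[i + t * d] for t in range(k))
                     for i in range(len(x))])

x = np.zeros(15)
x[7] = 1.0          # unit impulse to visualise the receptive field
w = np.ones(3)      # the same 3-tap kernel for every branch

# An ASPP-style bank: identical kernel, growing dilation rates.
for d in (1, 2, 4):
    y = dilated_conv1d(x, w, d)
    print(d, np.flatnonzero(y).tolist())
# 1 [6, 7, 8]
# 2 [5, 7, 9]
# 4 [3, 7, 11]
```

The three taps always cost three multiplications, but with dilation 4 they span 9 input positions instead of 3, which is exactly how ASPP captures more global context cheaply.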

Squeeze-and-Excitation blocks are attention blocks used to measure the importance of the features detected by the different channels of the output feature map; I used this block to weight the output feature map of each encoder block.

Here is a schema of the Squeeze-and-Excitation block:
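In code, squeeze-and-excitation amounts to a global average pool followed by two small fully-connected layers that produce one gate per channel. A minimal NumPy sketch with illustrative shapes and random weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def squeeze_excite(x, w1, w2):
    """x: (C, H, W). Squeeze: global average pool to one value per channel;
    excite: two small FC layers producing a (0, 1) gate per channel."""
    s = x.mean(axis=(1, 2))        # squeeze -> (C,)
    z = np.maximum(w1 @ s, 0.0)    # reduce with ReLU -> (C // r,)
    g = sigmoid(w2 @ z)            # expand to per-channel gates -> (C,)
    return x * g[:, None, None]    # rescale each channel by its gate

c, r = 8, 4                        # channels and reduction ratio (illustrative)
x = rng.standard_normal((c, 6, 6))
w1 = 0.1 * rng.standard_normal((c // r, c))
w2 = 0.1 * rng.standard_normal((c, c // r))
y = squeeze_excite(x, w1, w2)
print(y.shape)  # (8, 6, 6)
```

Because the gates lie in (0, 1), the block can only attenuate channels, letting the network emphasise informative feature maps relative to the rest at negligible cost.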

If you are interested in SR-UNET, I leave you this video, where I show a series of videos restored with this network.
