Guided Super-Resolution as Pixel-to-Pixel Transformation

Riccardo de Lutio
Published in EcoVisionETH
Jan 10, 2020
Low-res depth map (source) + High-res RGB image (guide) = High-res depth map (target)

What is Guided Super-Resolution?

Guided super-resolution is a unifying framework for several computer vision tasks. The inputs are a low-resolution source image of some target quantity (e.g., perspective depth acquired with a time-of-flight camera) and a high-resolution guide image from a different domain (e.g., an RGB image from a conventional camera). The target output is a high-resolution version of the source (in our example, a high-res depth map).

Why is it useful?

In the computer vision community, one of the most important applications of guided super-resolution is the super-resolution of depth maps guided by the corresponding RGB images. For instance, many robots are equipped with a conventional camera as well as a time-of-flight camera (or a laser scanner). The latter acquires depth maps of low spatial resolution, i.e., with a large pixel footprint in object space, and it is natural to ask whether one can enhance their resolution by transferring details from the camera image. Another example is environmental mapping, where maps of parameters like tree height or biomass are available at a mapping resolution that is significantly lower than the ground sampling distance of modern earth observation satellites.

The standard way of looking at this problem is to formulate it as a super-resolution task, i.e., the source image is upsampled to the target resolution, while transferring the missing high-frequency details from the guide.

Standard View of Guided Super-Resolution

Here, we propose to turn that interpretation on its head and instead see it as a pixel-to-pixel mapping of the guide image to the domain of the source image. The pixel-wise mapping is parametrised as a multi-layer perceptron, whose weights are learned by minimising the discrepancies between the source image and the downsampled target image.
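Concretely, the mapping is applied independently at every high-resolution pixel i. Writing g_i for the guide value at pixel i and t_i for the predicted target value (our notation, not necessarily the paper's):

```latex
t_i = f_\theta(g_i), \qquad i = 1, \dots, N,
```

where N is the number of high-resolution pixels and f_θ is the multi-layer perceptron; its inputs are extended with the pixel coordinates below.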

Our View of Guided Super-Resolution

The intuition behind our choice of a simple pixel-to-pixel mapping from the guide to the source domain is that the guide already contains the sharp details we wish to recover; by using a smooth pixel-wise transformation, these details are preserved in the output.

There is a trick to make this simple idea work

However, a purely pixel-wise mapping would mean a one-to-one correspondence from one domain to the other, which is of course not what we want: a specific colour in the RGB image would always be mapped to the same depth value. This is why we add the x and y coordinates of each pixel as additional inputs to the mapping function. By doing so we make the function location-dependent: the same colour in different locations of the guide image can be mapped to different output values if needed.
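As a minimal sketch of that trick (assuming PyTorch and coordinates normalised to [0, 1], which are our choices for illustration), the per-pixel inputs can be built like this:

```python
import torch

def build_inputs(guide):
    """Stack each guide pixel's colour with its normalised (x, y) coordinates.

    guide: tensor of shape (C, H, W), e.g. an RGB guide with C = 3.
    Returns a (H*W, C + 2) matrix with one row per high-resolution pixel.
    """
    C, H, W = guide.shape
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij"
    )
    coords = torch.stack([xs, ys], dim=0)       # (2, H, W) coordinate planes
    feats = torch.cat([guide, coords], dim=0)   # (C + 2, H, W) colour + location
    return feats.reshape(C + 2, -1).t()         # (H*W, C + 2), one row per pixel
```

Each row is then fed to the mapping function, so the predicted value can depend on both colour and location.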

The proposed method is unsupervised, using only the specific source and guide images to fit the mapping. For each new pair of images, we solve a new optimisation problem, where we look for the parameters that minimise the following loss:
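The loss appears as an image in the original post; written out in our own notation (with T(θ) the high-resolution prediction whose i-th pixel is f_θ(g_i, x_i, y_i), S the low-resolution source, and D(·) the downsampling operator that brings T(θ) back to the source resolution), the data-fitting term reads roughly:

```latex
\mathcal{L}_{\mathrm{data}}(\theta) = \bigl\lVert D\bigl(T(\theta)\bigr) - S \bigr\rVert^{2}
```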

This loss means that we are looking for parameters that make the downsampled version of our method's output match the original low-resolution source as closely as possible.

This problem is extremely ill-posed: in fact, there are infinitely many potential output images that look exactly like the source image once downsampled. To make the problem well-posed, we add an L2-regulariser on the parameters of the mapping function. By doing so we also obtain sharp yet smooth results, without enforcing a direct smoothness constraint on the output values, which would induce blur.
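With the regulariser, a plausible way to write the complete objective (λ is a regularisation weight we introduce here for illustration; see the paper for the exact formulation) is:

```latex
\mathcal{L}(\theta) = \bigl\lVert D\bigl(T(\theta)\bigr) - S \bigr\rVert^{2} + \lambda \lVert \theta \rVert^{2}
```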

Our network: g is the pixel value of the guide image, x the spatial coordinates of the pixel and t the output

By having separate branches of the network for the pixel values (green) and the spatial coordinates (blue), we can regularise these parts separately, making the function smoother in the colour domain or in the spatial domain as needed.
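The sketch below puts the pieces together in PyTorch. It is a schematic re-implementation under our own assumptions: the layer widths, activations, fusion of the two branches, weight-decay values and number of optimisation steps are illustrative choices, not the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelMapping(nn.Module):
    """Two-branch MLP: one branch for guide colours, one for pixel coordinates."""
    def __init__(self, colour_dim=3, coord_dim=2, hidden=32):
        super().__init__()
        self.colour_branch = nn.Sequential(
            nn.Linear(colour_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.coord_branch = nn.Sequential(
            nn.Linear(coord_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(2 * hidden, 1))

    def forward(self, colour, coords):
        h = torch.cat([self.colour_branch(colour), self.coord_branch(coords)], dim=-1)
        return self.head(h)

# Toy data, only to make the sketch runnable end to end.
scale, H, W = 8, 64, 64
guide = torch.rand(3, H, W)                          # high-res RGB guide
source = torch.rand(1, 1, H // scale, W // scale)    # low-res source (e.g. depth)

# Per-pixel inputs (same construction as the build_inputs sketch above).
ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
colours = guide.reshape(3, -1).t()                   # (H*W, 3)
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)  # (H*W, 2)

model = PixelMapping()
# Separate weight decay per branch: the colour and spatial parts of the mapping
# are regularised independently (the values are placeholders, not tuned).
optimiser = torch.optim.Adam([
    {"params": model.colour_branch.parameters(), "weight_decay": 1e-4},
    {"params": model.coord_branch.parameters(), "weight_decay": 1e-3},
    {"params": model.head.parameters(), "weight_decay": 1e-4},
], lr=1e-3)

# One optimisation run per image pair: predict the high-res map, pool it back
# to the source resolution and match it to the low-res source.
for step in range(2000):
    pred = model(colours, coords).reshape(1, 1, H, W)
    loss = F.mse_loss(F.avg_pool2d(pred, kernel_size=scale), source)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```

Using weight decay as the L2-regulariser, with a different strength per parameter group, is one simple way to regularise the colour and spatial branches independently.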

Experimental results

Here are some examples of results achieved with our method compared to competing methods (guided image filtering, the fast bilateral solver and deep multi-scale guidance; see the references below). We present experiments on two tasks: super-resolution of depth maps and super-resolution of tree height maps. Our formulation clearly outperforms competing super-resolution methods at high upsampling factors (8× to 32×).

To learn more about the details of our method or to see a more in-depth evaluation of our results, you can have a look at our publication:

R. de Lutio, S. D’Aronco, J. D. Wegner, K. Schindler: “Guided Super-Resolution as Pixel-to-Pixel Transformation”, ICCV, 2019.

Also, if you want to use our method for your own data, the code is available here.

References

K. He, J. Sun, X. Tang. “Guided image filtering”, TPAMI, 2013.

J. T. Barron, B. Poole. “The fast bilateral solver”, ECCV, 2016.

T.-W. Hui, C. C. Loy, X. Tang. “Depth map super-resolution by deep multi-scale guidance”, ECCV, 2016.
