
Understand Google’s cutting-edge HDRnet in 10 minutes

Khush Jammu
The Artificial Intelligence Journal
9 min read · Aug 18, 2018


HDR and mobile image enhancement have long been areas of interest within computer vision. With the growing popularity of deep learning, more and more innovative pipelines and solutions are being applied to this problem, with progressively greater success.

Many solutions have been proposed to tackle image enhancement on desktop devices. In fact, consumer products such as Adobe Photoshop CC already put cutting-edge technology in the hands of consumers.

Before vs. After images taken from HDRnet paper

However, the mobile field is a bit more disappointing. The problem is that although there are loads of advances in image processing as a whole, the majority aren’t very practical on mobile since they require a lot of computing power. Since a huge amount of photo-taking and editing happens on phones, this has become a very pressing field of research. Out of this, HDRnet was born: an image adjustment pipeline that can run in real time on mobile devices.

(Please note, this is a How-It-Works article, not a How-To-Implement article. I’ll try to include implementation details where I can, but if you’re really serious about implementing it, make sure you read the actual paper; it has a lot more detail than this article.)

To get an idea of just how good HDRnet is, take a look at the following demo:

Video Demo

Pretty good right?

The network is broken down into two streams: a high-res stream and a low-res stream. This is shown in the architecture diagram below.

Before I go on, there are a few basic prerequisites. Firstly, you should have a working understanding of CNNs, as well as the terms ‘fully-connected’ and ‘(strided) convolutional layers’. If you don’t, I highly recommend checking out the freely-available MIT Press deep learning textbook; the website can be found here, and a quick Google search reveals a PDF version can be found here (I recommend the bookmarked version as it’s much easier to navigate). If you’re lazy or don’t have the time, a quick yet intuitive guide to convolutional neural networks and stride can be found here, where both concepts are explained with easy-to-understand graphics. A one-sentence definition of stride, taken from that last link, is that it is the “size of the step the convolution filter moves each time”.
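If you want to see what stride does in practice, here’s a tiny PyTorch snippet (my own toy example, nothing to do with HDRnet’s actual code) showing that a 3×3 convolution with stride 2 roughly halves the spatial resolution:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 256, 256)   # a dummy batch with one 256x256 RGB image
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=2, padding=1)
print(conv(x).shape)              # torch.Size([1, 8, 128, 128]): stride 2 halves height and width
```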

Anyways, back to the pipeline.

Low-Res Coefficient Processing

This module of the pipeline takes the full-res input I and scales it down to a lower, fixed resolution of 256×256, forming Ĩ.

As mentioned, Ĩ has a fixed size of 256×256. It is first processed by a stack of strided convolutional layers (S^i), i = 1…n_S, with stride s, to extract low-level features (edges, focal points etc.) and reduce the spatial resolution. Following that, the module splits into two streams: one that extracts local features (L^i), i = 1…n_L, and one that extracts global features (G^i), i = 1…n_G, in a design inspired by “[Iizuka et al. 2016]”. Don’t worry if this doesn’t make much sense yet; we’ll take a look at local vs. global later.

The first path, L^i, is fully convolutional (the paper cites “[Long et al. 2015]” for inspiration) and focuses on learning local features that propagate image data while retaining relevant spatial information.

The second path, G^i, uses a combination of fully-connected (FC) and convolutional layers to learn a fixed-size vector of relevant global features (such as scene category, indoor/outdoor, colour distribution: features that describe the image as a whole, hence ‘global’) that help prevent localised colour distortions and other undesired artefacts.

The outputs of these two streams, G^(n_G) and L^(n_L), are then fused into a single set of features F by a learnable linear function rather than simple concatenation, presumably to reduce the inclusion of irrelevant information and increase the accuracy of the network. Following that, a “pointwise linear layer” outputs a final array A from the fused features, which is then used by the slicing layer in the full-res processing stream. The paper interprets the array A as a “bilateral grid of affine coefficients”.

This jargony sentence makes a lot more sense if you know what a bilateral grid and an affine transformation are, so if you don’t, you should probably at least do a quick Google before proceeding. Intuitively, a bilateral grid allows you to form a 3D representation of a 2D image by using pixel intensity as the third axis. Check out the following diagram, taken from a presentation about bilateral grids:

As you can see, the pixel intensity is used in place of a Z axis. One result of this is that edges are preserved even under aggressive downsampling (i.e. reducing resolution), which makes the representation very useful for edge-aware processing. For more information, I would actually recommend looking through this presentation from MIT on the topic.
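To make that more concrete, here’s a deliberately simplified NumPy sketch of “splatting” a grayscale image into a bilateral grid, with pixel intensity as the third axis (the grid sizes here are arbitrary choices for illustration, not values from the paper):

```python
import numpy as np

def splat_bilateral_grid(img, grid_hw=16, grid_d=8):
    """img: 2-D grayscale array with values in [0, 1]."""
    h, w = img.shape
    grid = np.zeros((grid_hw, grid_hw, grid_d))   # (y, x, intensity) cells
    counts = np.zeros_like(grid)
    ys, xs = np.mgrid[0:h, 0:w]
    gy = ys * grid_hw // h                        # coarse spatial cell (y)
    gx = xs * grid_hw // w                        # coarse spatial cell (x)
    gz = np.clip((img * grid_d).astype(int), 0, grid_d - 1)   # intensity cell
    np.add.at(grid, (gy, gx, gz), img)
    np.add.at(counts, (gy, gx, gz), 1)
    return grid / np.maximum(counts, 1)           # average intensity per cell

# Pixels with very different intensities land in different depth cells, which is
# why edges survive even this aggressive spatial downsampling.
```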

An affine transformation isn’t really that complicated: it’s essentially just a type of transformation that preserves points, straight lines and planes. It’s typically used for correcting geometric distortions, but in this case it isn’t really important to know exactly how it applies; knowing that it’s a form of image transformation and that it can be used for colour correction is enough.
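As a rough illustration (my own numbers, not the paper’s), an affine colour transform for a single pixel is just a 3×4 matrix applied to its RGB values with a constant 1 appended:

```python
import numpy as np

# A hypothetical 3x4 affine colour transform: a 3x3 mixing matrix plus a 3x1 offset.
A = np.array([[1.1, 0.0, 0.0, 0.02],    # new red   = 1.1*r + 0.02
              [0.0, 1.0, 0.1, 0.00],    # new green = g + 0.1*b
              [0.0, 0.0, 0.9, 0.05]])   # new blue  = 0.9*b + 0.05

rgb = np.array([0.2, 0.5, 0.7])          # one pixel's colour
new_rgb = A @ np.append(rgb, 1.0)        # append a constant 1 for the offset term
print(new_rgb)                           # [0.24, 0.57, 0.68]
```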

Anyways, let’s break down the pipeline more.

3.1.1 — Low-level Features

The low-resolution image is first processed by a series of strided convolutional layers with s=2. At this point it’s worth covering a few technical notes about the authors’ implementation that will help if you read the paper or try to implement it yourself. Their sentence is about as succinct as it can get, so I’ll just quote it here: “Where i = 1, . . . ,n_S indexes the layers, c and c’ index the layers’ channels, w_i is an array of weights for the convolutions, b_i is a vector of biases, and the summation is over −1 ≤ x ′ ,y ′ ≤ 1 (i.e., the convolution kernels have 3 × 3 spatial extent)”. This basically just tells us that w_i is an array of weights and b_i is a vector of biases (in accordance with convention), and that the convolution kernels have a 3 × 3 window.

These layers have the cumulative effect of reducing the spatial dimensions of the image by a factor of 2^n_S, which means n_S has two effects: (1) the greater its value, the coarser (less sharp) the final grid, and (2) it controls the complexity of the prediction. The more layers the network has, the more complex the function it can model, and thus the more complex the patterns it can extract from the image. The authors chose the sweet spot of n_S = 4 for this paper.

The convolutional layers in this section take the following form (Eq. 1 in the paper):

S^i_c[x, y] = σ( b^i_c + Σ_{x′, y′, c′} w^i_{c,c′}[x′, y′] · S^{i−1}_{c′}[s·x + x′, s·y + y′] )

where σ is the ReLU activation function and S^0 is the low-res input Ĩ.
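In code, a minimal sketch of this low-level stack could look like the following (my own rendition, not the authors’ implementation; the channel widths are illustrative):

```python
import torch
import torch.nn as nn

# Sketch of the low-level feature extractor: n_S = 4 strided 3x3 convolutions
# with stride 2 and ReLU, taking the 256x256 low-res input down to 16x16.
low_level = nn.Sequential(
    nn.Conv2d(3,  8,  kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(8,  16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
)

x = torch.randn(1, 3, 256, 256)   # the low-res input Ĩ
s = low_level(x)
print(s.shape)                    # torch.Size([1, 64, 16, 16]): 256 / 2^4 = 16
```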

3.1.2 — Local Features path

This section of the pipeline deals with local features: features that are relevant only to specific regions of the image (e.g. ‘there’s a face here’, ‘this top bit is sky’). The final low-level feature layer S^(n_S) is processed by a stack of n_L = 2 convolutional layers L^i that take the same form as the equation in Sec. 3.1.1, but with s = 1 (a stride of 1).

An important function of the local features path, one that should definitely be kept in mind, is that it gives the predicted affine coefficients a sense of spatial location. This means that when the affine transformation is actually applied, the result is less prone to localised distortions like ugly blotches or discolourations in random places.
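Under the same assumptions as the previous sketch (channel widths and activation placement are my guesses, not the paper’s exact spec), the local path might look something like this:

```python
import torch
import torch.nn as nn

# Sketch of the local-features path: n_L = 2 convolutions with s = 1, so the
# 16x16 spatial resolution of the low-level features is preserved.
local_path = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
)

s_out = torch.randn(1, 64, 16, 16)   # stand-in for the low-level features S^(n_S)
print(local_path(s_out).shape)       # torch.Size([1, 64, 16, 16])
```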

3.1.3 — Global Features path

The global branch starts out from the same place the local branch did: S^(n_S). It consists of two strided conv layers (same form as the equation from 3.1.1, with s = 2) followed by three fully-connected (FC) layers. This stream produces a 64-dimensional vector that summarises global information about the image (colour distribution, indoor/outdoor, day/night etc.) and has a kind of regularising effect on the transformations made by the local path. Without this branch to encode global information (and the regularising effect it has), the network is prone to producing unwanted localised artefacts like the one below (hint: check out the sky in (b), see the blotching?):
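Again as a rough sketch (the structure is what the paper describes, but the layer widths here are my own choices):

```python
import torch
import torch.nn as nn

# Sketch of the global-features path: two more stride-2 convolutions
# (16x16 -> 8x8 -> 4x4), then three fully-connected layers ending in a
# 64-dimensional global summary vector.
global_path = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Flatten(),                       # 64 * 4 * 4 = 1024 features
    nn.Linear(1024, 256), nn.ReLU(),
    nn.Linear(256, 128),  nn.ReLU(),
    nn.Linear(128, 64),
)

s_out = torch.randn(1, 64, 16, 16)      # stand-in for S^(n_S)
print(global_path(s_out).shape)         # torch.Size([1, 64]): one descriptor per image
```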

3.1.4 — Fusion and Linear Prediction

The authors say that the two streams are fused with “a pointwise affine mixing”, which is basically just a fancy way of saying a learned linear combination of the two.

This yields a 16×16×64 array of features, which is then passed through a final 1×1 linear prediction to produce a 16×16 map with 96 channels. What this means in English is that the resulting output A contains the predicted affine coefficients that are later applied in the colour transformation step (the final step in the pipeline; if you check out the diagram, you can see it in bright green).
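A toy version of the fusion and prediction steps might look like this (my own modelling of the “pointwise affine mixing”, under the shape assumptions above):

```python
import torch
import torch.nn as nn

# Sketch of fusion + linear prediction, assuming local features of shape
# (N, 64, 16, 16) and a 64-d global vector. The "pointwise affine mixing" is
# modelled as two learned linear maps summed at every (x, y) location, then a ReLU.
mix_local  = nn.Conv2d(64, 64, kernel_size=1)
mix_global = nn.Linear(64, 64, bias=False)
predict    = nn.Conv2d(64, 96, kernel_size=1)   # final pointwise linear layer

def fuse_and_predict(local_feats, global_vec):
    fused = torch.relu(mix_local(local_feats) + mix_global(global_vec)[:, :, None, None])
    return predict(fused)                        # (N, 96, 16, 16): this is A

l = torch.randn(1, 64, 16, 16)   # stand-in local features L^(n_L)
g = torch.randn(1, 64)           # stand-in global vector G^(n_G)
print(fuse_and_predict(l, g).shape)              # torch.Size([1, 96, 16, 16])
```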

3.2–3.4

We can summarise the rest of the paper pretty easily. Essentially, the output A from the fusion stage is treated as a bilateral grid whose third dimension has been unrolled into its channels. Ignoring some of the more tedious details of the paper, the key point is that by simply treating the final output A as a bilateral grid, the network itself learns (and is responsible for) making the switch from 2D to 3D.
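In code, that “unrolling” is just a reshape. Assuming a split of 8 intensity bins × 12 affine coefficients per grid cell (a 3×4 colour transform, which matches the 96 channels from earlier), it looks like this:

```python
import torch

# Stand-in for the network's output A: 96 channels over a 16x16 spatial grid.
A = torch.randn(1, 96, 16, 16)

# Unroll the channel axis into 12 affine coefficients x 8 intensity (depth) bins:
# 12 * 8 = 96. The exact split is an assumption on my part.
grid = A.view(1, 12, 8, 16, 16)   # (batch, coefficients, depth, y, x)
```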

The segment labeled 3.3 is responsible for upscaling the low-resolution bilateral grid for use with the higher-resolution original input image I. The authors describe this layer as a purpose-built bilateral-grid ‘slicing’ operation, which essentially upscales the smaller (read: low-res) bilateral grid of affine coefficients to a larger (read: high-res) one without degrading quality. It takes as input a guidance map g, which is learned in stage 3.4.1 (which, funnily enough, actually comes before 3.3 in the pipeline; the authors just decided to explain it later in the paper, so the numbering is a bit odd), as well as A (again viewed as a bilateral grid). Also, note that the slicing operation itself is parameter-free; it’s the guidance map that is learned.

By performing all of the slicing operations within a bilateral grid, the model is forced to follow the edges in g, making the pipeline as a whole more edge-aware.

Also, just note that the guidance map g is a learned grayscale mapping of the input image I. Its primary purpose is to help the slicing layer approximate the desired enhancements with much higher fidelity.
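Here’s a rough sketch of what slicing does (my approximation using PyTorch’s grid_sample for the trilinear lookup; the authors implement it as a custom GPU op, so treat this purely as an illustration):

```python
import torch
import torch.nn.functional as F

def slice_coeffs(coeff_grid, guide):
    """coeff_grid: (N, 12, 8, 16, 16) bilateral grid of affine coefficients
    guide:      (N, H, W) learned grayscale guidance map, values in [0, 1]
    returns:    (N, 12, H, W) per-pixel coefficients at full resolution
    """
    n = coeff_grid.shape[0]
    h, w = guide.shape[1:]
    # Normalised (x, y, z) sampling coordinates in [-1, 1] for every full-res
    # pixel; the z coordinate comes from the guidance map's intensity.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    xs = xs.expand(n, h, w)
    ys = ys.expand(n, h, w)
    zs = guide * 2 - 1
    coords = torch.stack([xs, ys, zs], dim=-1).unsqueeze(1)   # (N, 1, H, W, 3)
    sliced = F.grid_sample(coeff_grid, coords, mode="bilinear",
                           align_corners=True)                # (N, 12, 1, H, W)
    return sliced.squeeze(2)
```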

Finally, the actual assembly of the final output is fairly straightforward: each output pixel is an affine combination of the full-resolution input’s colour channels, using that pixel’s sliced coefficients (a 3×4 colour transform per pixel).
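A sketch of that assembly step, under the same 3×4-coefficients-per-pixel assumption as before:

```python
import torch

def apply_coeffs(sliced, image):
    """sliced: (N, 12, H, W) per-pixel affine coefficients from the slicing step
    image:  (N, 3, H, W) full-resolution input
    Each output channel is a weighted sum of the three input channels plus an
    offset, i.e. a per-pixel 3x4 affine colour transform."""
    n, _, h, w = image.shape
    coeffs = sliced.view(n, 3, 4, h, w)   # 3 output channels x 4 coefficients each
    out = coeffs[:, :, 3] + (coeffs[:, :, :3] * image.unsqueeze(1)).sum(dim=2)
    return out                            # (N, 3, H, W)
```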

All in all, this is a very impressive paper. HDRnet produces outputs comparable to other cutting-edge solutions while running much faster, without the overhead of a client-server architecture, and while still being able to learn almost any type of edit. If only I had an NVIDIA Titan X….

https://groups.csail.mit.edu/graphics/hdrnet/data/hdrnet.pdf
