FAST AI JOURNEY: COURSE V3. PART 2. LESSON 10.

Documenting my fast.ai journey: PAPER REVIEW. INSTANCE NORMALIZATION: THE MISSING INGREDIENT FOR FAST STYLIZATION.

SUREN HARUTYUNYAN
Apr 6, 2019

For the Lesson 10 Project, I decided to dive into the 2016 paper Instance Normalization: The Missing Ingredient for Fast Stylization, by Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. The authors have made the code available here.

As always, our objective is to go through all the sections one by one, understanding and summarizing them.

1. Introduction.

The authors start by citing the style transfer research performed by Gatys et al. (2016).

Figure 1. Source: https://arxiv.org/pdf/1607.08022.pdf.

Remember that when we are doing style transfer, our objective is to obtain an image (the stylized image) that mixes the patterns of the style image and of the content image. In other words, it must simultaneously match the statistics of both images.

These statistics are obtained from a deep CNN that was pre-trained for image classification. That is:

  1. Style Statistics: obtained from the shallower layers, and “averaged across spatial locations”.
  2. Content Statistics: obtained from the deeper layers, and “preserve spatial information”.

Put another way, the authors say that the style statistics capture the texture (e.g. colours, shapes) of the style image, while the content statistics capture the structure (e.g. the layout of objects) of the content image.
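To make this concrete, here is a minimal sketch (my own illustration, not code from the paper) of how these two kinds of statistics are typically computed from the feature maps of a pre-trained CNN: style statistics as channel correlations (a Gram matrix) averaged across spatial locations, and content statistics as the feature maps themselves:

    import torch

    def style_statistics(features: torch.Tensor) -> torch.Tensor:
        # Gram matrix of a (C, H, W) feature map: channel-by-channel
        # correlations, averaged across all spatial locations.
        c, h, w = features.shape
        flat = features.view(c, h * w)
        return (flat @ flat.t()) / (h * w)

    def content_statistics(features: torch.Tensor) -> torch.Tensor:
        # Content statistics are the feature maps themselves,
        # so spatial information is preserved.
        return features

    # Hypothetical feature maps from a shallow and a deep layer.
    shallow = torch.randn(64, 128, 128)
    deep = torch.randn(512, 16, 16)
    print(style_statistics(shallow).shape)   # torch.Size([64, 64]): spatial dims averaged out
    print(content_statistics(deep).shape)    # torch.Size([512, 16, 16]): spatial dims kept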

The method described by Gatys et al. (2016) produces very good results, but it is computationally heavy: it runs an iterative optimization until the necessary statistics are matched.

The later works of Ulyanov et al. (2016) and Johnson et al. (2016) sought to remedy this problem by learning equivalent feed-forward generator networks. Although these similar methods produced images that came close in quality, they failed to reach the high quality of the stylized images obtained with the Gatys et al. (2016) method.

Here the authors revisit the method of the Ulyanov et al. (2016) paper, introducing a change in the generator architecture. With this change, they are able to obtain high-quality stylized images in real time.

The main innovation here is that the authors:

  1. Replace the batch normalization layers in the generator architecture with instance normalization layers.
  2. Keep these layers at test time. Recall that with batch normalization we would normally freeze and simplify these layers for inference.

This normalization removes instance-specific contrast information from the content image, which translates into higher-quality stylized images.
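As a sketch of what this swap looks like in practice, assuming a PyTorch-style generator (my own illustration, not the authors' code):

    import torch.nn as nn

    # A hypothetical convolutional block from a feed-forward stylization generator.
    def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            # Before: nn.BatchNorm2d(out_ch), with statistics over the whole batch,
            # frozen into running averages at test time.
            # After: instance normalization, with statistics per image and per channel.
            # With track_running_stats=False it keeps normalizing with the current
            # instance's statistics at test time too, as the paper prescribes.
            nn.InstanceNorm2d(out_ch, affine=True, track_running_stats=False),
            nn.ReLU(inplace=True),
        )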

2. Method.

In this section the authors review the work performed by Ulyanov et al. (2016), which showed that it is possible to train a generator network of the form:

Generator Network. Source: https://arxiv.org/pdf/1607.08022.pdf.

that can apply the style of a fixed image x_0 to any input image x. This reproduced the Gatys et al. (2016) results, albeit with lower-quality images.

In this case we have:

  1. A fixed style image x_0.
  2. An input image x.
  3. A random seed z, used to obtain a sample of stylization results.
  4. A generator g that is trained to apply the style to any input image x.

Note that our generator g is a CNN, trained to minimize the following loss function:

Loss Function. Source: https://arxiv.org/pdf/1607.08022.pdf.
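Reconstructing this formula from the paper in LaTeX notation, the training objective is:

    \min_{g} \frac{1}{n} \sum_{t=1}^{n} \mathcal{L}\left( x_0, x_t, g(x_t, z_t) \right)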

As we said before:

  1. x_t are the example content images, indexed by t = 1, …, n.
  2. z_t are independent and identically distributed (i.i.d.) samples drawn from a Gaussian (normal) distribution with mean 0 and variance 1.
  3. L uses a pre-trained CNN to extract and compare the statistics of the style image x_0, the content image x_t, and the stylized image g(x_t, z_t).

The main drawback of this method is that learning from a large set of examples yields qualitatively worse results: the authors state that training with only 16 examples produced better outcomes than training with 1000 examples.

For example, in Figure 3 we can see artifacts along the border of the image, a result of the zero padding added before the convolutions. Other padding techniques could not solve this issue.

In Ulyanov et al. (2016), using a small number of images and stopping training early produced better results; the authors' hypothesis is that this training objective was hard for a standard CNN architecture to learn.

The authors observe that the stylized image should not depend on the contrast of the content image; rather, it should take the contrast of the style image, discarding the content image's contrast. The question is whether contrast normalization should be implemented:

  1. Using standard CNN building blocks or,
  2. Directly in the architecture.

The previous efforts, namely Ulyanov et al. (2016) and Johnson et al. (2016), used a combination of convolution, pooling, upsampling, and batch normalization layers. This could make learning a contrast normalization function really hard. The reason is the following. We have an input tensor:

Input Tensor. Source: https://arxiv.org/pdf/1607.08022.pdf.

which contains a batch of T images. An element of this tensor:

Element tijk. Source: https://arxiv.org/pdf/1607.08022.pdf.

is a single element, where the subscripts denote:

  1. k and j span the spatial dimensions.
  2. i is the feature channel, or the colour channel if the input is an RGB image.
  3. t is the index of the image in the batch.
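In the paper's LaTeX notation, the input tensor and one of its elements are written as:

    x \in \mathbb{R}^{T \times C \times W \times H}, \qquad x_{tijk}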

So contrast normalization could have the form:

Contrast Normalization. Source: https://arxiv.org/pdf/1607.08022.pdf.
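Reconstructing it from the paper, this is equation (1):

    y_{tijk} = \frac{x_{tijk}}{\sum_{l=1}^{W} \sum_{m=1}^{H} x_{tilm}}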

Ultimately, the authors doubt that such a function could be implemented as a combination of standard CNN building blocks, such as a convolution and a ReLU.

It is noted that the generator network used in the work by Ulyanov et al. (2016) uses batch normalization layers, which have the form:

Batch Normalization. Source: https://arxiv.org/pdf/1607.08022.pdf.
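Reconstructing it from the paper, this is equation (2):

    y_{tijk} = \frac{x_{tijk} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}, \qquad
    \mu_i = \frac{1}{HWT} \sum_{t=1}^{T} \sum_{l=1}^{W} \sum_{m=1}^{H} x_{tilm}, \qquad
    \sigma_i^2 = \frac{1}{HWT} \sum_{t=1}^{T} \sum_{l=1}^{W} \sum_{m=1}^{H} (x_{tilm} - \mu_i)^2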

The difference between (1) and (2) (batch normalization) is that batch normalization applies the normalization to a whole batch of images rather than to a single instance.

So the authors propose to apply the normalization to every single image instead, calling it instance normalization (or contrast normalization), which takes the following form:

Instance Normalization. Source: https://arxiv.org/pdf/1607.08022.pdf.
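Reconstructing it from the paper, this is equation (3); note that the mean and variance now carry both a t and an i subscript, and the sums no longer run over the batch:

    y_{tijk} = \frac{x_{tijk} - \mu_{ti}}{\sqrt{\sigma_{ti}^2 + \epsilon}}, \qquad
    \mu_{ti} = \frac{1}{HW} \sum_{l=1}^{W} \sum_{m=1}^{H} x_{tilm}, \qquad
    \sigma_{ti}^2 = \frac{1}{HW} \sum_{l=1}^{W} \sum_{m=1}^{H} (x_{tilm} - \mu_{ti})^2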

Replacing batch normalization with instance normalization prevents instance-specific mean and covariance shift, which makes training easier. The authors apply this not only at training time, but also at test time.
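The whole difference boils down to which axes the mean and variance are computed over. A minimal sketch with raw tensors (my own illustration):

    import torch

    x = torch.randn(8, 3, 32, 32)  # (T, C, H, W): a batch of 8 images
    eps = 1e-5

    # Batch normalization: one mean/variance per channel, shared across the batch.
    mu_bn = x.mean(dim=(0, 2, 3), keepdim=True)                   # shape (1, C, 1, 1)
    var_bn = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
    y_bn = (x - mu_bn) / torch.sqrt(var_bn + eps)

    # Instance normalization: one mean/variance per channel, per image.
    mu_in = x.mean(dim=(2, 3), keepdim=True)                      # shape (T, C, 1, 1)
    var_in = x.var(dim=(2, 3), keepdim=True, unbiased=False)
    y_in = (x - mu_in) / torch.sqrt(var_in + eps)

    # After instance normalization every image is normalized individually,
    # so its original contrast no longer influences the result.
    print(y_in.mean(dim=(2, 3)).abs().max())  # close to 0 for every image/channel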

3. Experiments.

The authors first used batch normalization in the architectures proposed by Ulyanov et al. (2016) and Johnson et al. (2016), observing that the two produce similar results, as shown in the first row of Figure 5:

Figure 5. Source: https://arxiv.org/pdf/1607.08022.pdf.

They found that by replacing batch normalization with instance normalization they could achieve superior results, as shown in the second row of Figure 5.

The authors also observed that the generators have similar quality, but they state that:

we found the residuals architecture of Johnson et al. (2016) to be somewhat more efficient and easy to use…

These results are shown in Figure 4:

Figure 4. Source: https://arxiv.org/pdf/1607.08022.pdf.

Finally, they show the same stylized image at different resolutions:

Figure 6. Source: https://arxiv.org/pdf/1607.08022.pdf.

4. Conclusion.

The authors show that by applying the normalization to every single image (instance normalization) instead of to the whole batch, they can achieve better results for certain image generation architectures.
