What did I learn by implementing neural style transfer?

Alvaro Durán Tovar
Deep Learning made easy
3 min read · Aug 16, 2019

First of all, let’s look at an example of what neural style transfer is. This technique aims to take the content from one image and the style from another, and mix both to produce something cool (if you are lucky).

Paper: https://arxiv.org/abs/1508.06576

My own implementation, based on fast.ai lecture, https://github.com/hermesdt/machine_learning/blob/master/style-transfer.ipynb

From these images:

Content from the left, style from the right.

This is what I got:

Not state of the art, but not bad!!

Intermediate layers are useful

This is the most interesting thing I learned from this paper. In order to classify images, neural networks must somehow be able to understand the content of the image; they must contain structures that detect edges, shapes, etc.

First we need a white noise image like the following:

The trick is that we are going to change this image itself, instead of the network. The gradients will be propagated all the way up to the input image; to do so, the image must be a tensor created with “requires_grad=True” if you are using PyTorch.
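For instance, a minimal sketch in PyTorch (the 224×224 size is just an assumption to match a typical VGG input):

```python
import torch

# White noise image we will optimize directly; shape is (batch, channels, height, width).
# 224x224 is an assumed size, chosen to match a typical VGG input.
noise_img = torch.rand(1, 3, 224, 224, requires_grad=True)

# The optimizer updates the image pixels, not the network weights.
optimizer = torch.optim.Adam([noise_img], lr=0.05)
```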

Then the algorithm is as follows:

  1. Call the NN with the content image. Pick the output of some conv layer and store it for later. This is done only once at the beginning, as the content image doesn’t change.
  2. Call the NN with the style image. Pick the output of some conv layer and store it for later. This is done only once at the beginning, as the style image doesn’t change.
  3. Call the NN with the white noise image. Pick the output of the same layer we used for the content and the same layer we used for the style.
  4. Calculate the loss of the content, let’s call it l1.
  5. Calculate the loss of the style, let’s call it l2.
  6. Calculate the total loss as K*l1 + l2. K helps to tune how much importance we want to give to the content.
  7. Propagate gradients as usual, update the white noise image, and check how it has changed. Repeat steps 3–7 until you are happy (a rough sketch of this loop follows below).
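Here is a rough sketch of steps 3–7 in PyTorch, assuming content_features and style_features were stored in steps 1–2, and that get_features, content_loss and style_loss are hypothetical helpers (the last two are sketched in the next sections):

```python
K = 1e-3  # assumed weight; tune it to give more or less importance to the content

for step in range(500):
    optimizer.zero_grad()

    # Step 3: run the noise image through the network and grab the same layers.
    noise_content_feat, noise_style_feat = get_features(noise_img)

    # Steps 4-6: content loss, style loss and the weighted total.
    l1 = content_loss(noise_content_feat, content_features)
    l2 = style_loss(noise_style_feat, style_features)
    loss = K * l1 + l2

    # Step 7: backpropagate into the image pixels and update them.
    loss.backward()
    optimizer.step()
```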

Content loss

This is calculated simply as the mean squared error between the captured outputs. Imagine we used the 4th conv layer of a VGG network and called it conv4.
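A minimal sketch, assuming conv4_content is the stored conv4 output for the content image and conv4_noise is the conv4 output for the current noise image (both are hypothetical names):

```python
import torch.nn.functional as F

def content_loss(conv4_noise, conv4_content):
    # Mean squared error between the conv4 feature maps of the two images.
    return F.mse_loss(conv4_noise, conv4_content)
```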

Style loss

This is a bit more complicated. Here we have to use what is called a gram matrix, and the way I think of it is “multiply every filter by every filter”. As this video points out, the gram matrix captures the relation between filters. Having a layer output of shape (16 filters, 32 width, 32 height), here is what we have to do to calculate the gram matrix:

  1. Flatten the layer output to shape (num_filters, N), in our case (16, 32*32) = (16, 1024), and call it matrix F.
  2. Then do a matrix multiplication of F with F transposed, torch.mm(F, F.t()), or, looking at the shapes, (16, 1024) * (1024, 16). We end up with a matrix of shape (16, 16).
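A possible sketch of those two steps in PyTorch, assuming the layer output comes without the batch dimension:

```python
import torch

def gram_matrix(features):
    # features: the layer output, e.g. (16 filters, 32 height, 32 width).
    n_filters = features.shape[0]
    # Step 1: flatten to (num_filters, N), e.g. (16, 1024); this is matrix F.
    flat = features.reshape(n_filters, -1)
    # Step 2: F times F transposed -> (num_filters, num_filters), e.g. (16, 16).
    return torch.mm(flat, flat.t())
```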

So, we calculate the gram matrix of the output of the style image and call it G, calculate the gram matrix of the output of the white noise image and call it A, then calculate l2 = F.mse_loss(G, A).
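Putting it together with the gram_matrix sketch above (again just a sketch, under the same assumptions):

```python
import torch.nn.functional as F

def style_loss(noise_features, style_features):
    # G: gram matrix of the style image's layer output.
    # A: gram matrix of the same layer for the white noise image.
    G = gram_matrix(style_features)
    A = gram_matrix(noise_features)
    return F.mse_loss(G, A)
```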

Bonus

Another pretty cool use of intermediate layers is finding duplicated images, as explained in this article: https://towardsdatascience.com/how-to-build-an-image-duplicate-finder-f8714ddca9d2

Credits

As with almost everything I’ve learned about deep learning, the idea and the code in this article come from the great fast.ai courses.
