Johnson et al Style Transfer in TensorFlow 2.0

Imran us Salam · Red Buffer · Nov 13, 2019

This post is about the paper Perceptual Losses for Real-Time Style Transfer and Super-Resolution by Justin Johnson, Alexandre Alahi, and Fei-Fei Li.
It is a much faster method of generating stylized images than the approach in the original style transfer paper by Leon A. Gatys.

Style image (taken from https://github.com/pytorch/examples/blob/master/fast_neural_style/images/style-images/mosaic.jpg)
Content Image (taken from https://github.com/pytorch/examples/blob/master/fast_neural_style/images/content-images/amber.jpg)
Output Image (the content image with the style applied)

The idea behind style transfer is to impose the style of one image on the content of another. Gatys et al. did this by directly optimizing a single image.

It worked like this: take a single content image and a single style image, and pass both through a pre-trained deep convolutional neural network. At certain layers we compute losses.

These "certain" layers matter, and so does the loss function. The initial layers of a deep neural network are believed to capture more general features, so they are more style oriented and not specific to the particular input. The deeper layers contribute more to the content, i.e. the specialized features of the input example.

Hence we take the style loss from the initial layers and the content loss from a deeper layer (or layers). We will go into detail about what style loss and content loss are.
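To make the layer choice concrete, here is a minimal sketch in TensorFlow 2.0 of how such a frozen "loss network" can be built on top of a pre-trained VGG16. The specific layer names below are just one reasonable choice, not necessarily the exact ones used in the papers:

    import tensorflow as tf

    # Pre-trained VGG16, frozen; used only to extract feature activations.
    vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
    vgg.trainable = False

    # Example choice: earlier layers for style, a deeper layer for content.
    style_layer_names = ["block1_conv2", "block2_conv2", "block3_conv3", "block4_conv3"]
    content_layer_name = "block4_conv2"

    outputs = [vgg.get_layer(name).output
               for name in style_layer_names + [content_layer_name]]
    loss_network = tf.keras.Model(vgg.input, outputs)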

In the earlier paper, we freeze the network and optimize either the content image or white (uniform) noise. This is what it looks like:

The picture shows the optimization process, which works as follows:

  1. Take a pre-trained Deep Convolutional Neural Network (VGG16, for example)
  2. Initialize white noise
  3. Pass the style image, the content image and the white noise through the network
  4. Take the content loss of the white-noise activations at a deeper layer (conv4), and the style loss across several initial and some higher-level layers
  5. Add both losses, weighted by some hyperparameters
  6. DO NOT OPTIMIZE NETWORK WEIGHTS
  7. Optimize the white noise instead

If you follow these 7 steps in your training loop, you will notice that after some iterations the white noise starts to look like the content image rendered in the style of the style image.
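A minimal sketch of such a loop in TensorFlow 2.0, reusing the loss_network built above. The content_loss and style_loss helpers are sketched in the next two sections, and the alpha and beta weights are only example values, not the ones from the paper:

    # content_image and style_image are assumed to be preprocessed
    # float tensors of shape (1, H, W, 3).
    generated = tf.Variable(tf.random.uniform(content_image.shape))  # step 2: white noise
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.02)
    alpha, beta = 1.0, 1e-2  # step 5: example weighting hyperparameters

    style_targets = loss_network(style_image)[:-1]    # fixed style activations
    content_target = loss_network(content_image)[-1]  # fixed content activation

    for step in range(1000):
        with tf.GradientTape() as tape:
            feats = loss_network(generated)
            loss = alpha * content_loss(feats[-1], content_target) \
                 + beta * style_loss(feats[:-1], style_targets)
        # Steps 6-7: the network weights stay frozen, only the image is updated.
        grads = tape.gradient(loss, generated)
        optimizer.apply_gradients([(grads, generated)])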

Content Loss

Let's first understand what content loss is.

Content loss is simply the mean squared error between two tensors. Here those tensors are the feature activations of the content image and of the white noise at the chosen content layer, so we are asking for the activations (rather than the raw pixels) to match.

Content Loss
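As a small sketch, assuming the two tensors are activations taken from the loss network above:

    def content_loss(generated_features, content_features):
        # Mean squared error between the feature activations at the content layer.
        return tf.reduce_mean(tf.square(generated_features - content_features))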

Style Loss

Style loss is computed on the outputs of the initial layers, comparing the white noise to the style image.

Style loss is exactly the Mean Squared Error between the “Gram Matrix” outputs of the layers.

A Gram matrix is the dot product of a layer's output (with the spatial dimensions flattened) with its transpose.
Intuitively, its (i, j) entry measures the correlation between channels i and j of the feature tensor.

Style loss
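A sketch of both, assuming the same activation format as above (batch, height, width, channels); normalising by the number of spatial locations is one common convention:

    def gram_matrix(features):
        # Channel-by-channel inner product of the flattened feature maps.
        result = tf.einsum("bijc,bijd->bcd", features, features)
        shape = tf.shape(features)
        num_locations = tf.cast(shape[1] * shape[2], tf.float32)
        return result / num_locations

    def style_loss(generated_features, style_features):
        # Mean squared error between Gram matrices, summed over the style layers.
        loss = 0.0
        for g, s in zip(generated_features, style_features):
            loss += tf.reduce_mean(tf.square(gram_matrix(g) - gram_matrix(s)))
        return loss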

But this whole process is very slow, because the white noise has to be optimized from scratch for every content image and style image. Nothing can be reused, since we never optimized any weight parameters.

An implementation is given at https://github.com/imransalam/style-transfer-tensorflow-2.0.
It also adds a histogram loss, which was not part of the original paper and can easily be removed.

Fast Style Transfer (Justin Johnson)

Now imagine the same architecture, but reusable. Johnson et al. train a network which, once trained, can be reused for the same style it was trained on.

Justin Johnson Style Transfer

This is the architecture of Fast Style Transfer. The second part is similar to Gatys et al., but the first part is different.

It involves another deep convolutional neural network, the image transformation network. This network outputs another image, and its weights are learnt during training.

The flow of the whole architecture is as follows:

  1. Make a transformation network (we will go into the details of this network soon)
  2. Pass the input image through this transformation network
  3. It will produce an output image called y_hat
  4. Treat y_hat the way we treated the white noise in the previous approach
  5. Pass this y_hat, the style image and the content image all through the Loss Network (pre-trained VGG16)
  6. Compute the style loss at the style layers and the content loss at the content layer, and add them with some hyperparameters
  7. Use this loss to optimize the weights of the Image Transform Net

Once this image transformation network is trained, you can use it to generate new images with the style imposed on them, in a single forward pass.
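Here is a rough sketch of one training step in TensorFlow 2.0, reusing the loss_network, content_loss and style_loss helpers from above. The build_transform_net function is the transformation network sketched in the next section, and the weighting values are again only examples:

    transform_net = build_transform_net()   # sketched in the next section
    optimizer = tf.keras.optimizers.Adam(1e-3)
    alpha, beta = 1.0, 1e-2                 # example weighting hyperparameters

    # style_targets = loss_network(style_image)[:-1], computed once before training.

    @tf.function
    def train_step(content_batch, style_targets):
        with tf.GradientTape() as tape:
            y_hat = transform_net(content_batch)                     # steps 2-3
            feats = loss_network(y_hat)
            content_target = loss_network(content_batch)[-1]
            loss = alpha * content_loss(feats[-1], content_target) \
                 + beta * style_loss(feats[:-1], style_targets)      # steps 5-6
        # Step 7: this time the gradients update the transformation network's weights.
        grads = tape.gradient(loss, transform_net.trainable_variables)
        optimizer.apply_gradients(zip(grads, transform_net.trainable_variables))
        return loss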

Image Transformation Network

The image transformation network in the architecture diagram looks a lot like an autoencoder, i.e. it has two parts: an encoder and a decoder. The architecture has downsampling convolution layers, followed by some residual blocks, followed by upsampling layers.
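A rough Keras sketch of such a network; the layer sizes and the number of residual blocks here are simplified placeholders, not the exact configuration from the paper:

    def residual_block(x, filters=128):
        # Two convolutions with a skip connection.
        y = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
        return tf.keras.layers.Add()([x, y])

    def build_transform_net():
        inputs = tf.keras.Input(shape=(256, 256, 3))
        # Encoder: downsampling convolutions.
        x = tf.keras.layers.Conv2D(32, 9, strides=1, padding="same", activation="relu")(inputs)
        x = tf.keras.layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
        x = tf.keras.layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)
        # Residual blocks in the middle.
        for _ in range(5):
            x = residual_block(x)
        # Decoder: upsampling back to the input resolution.
        x = tf.keras.layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)
        x = tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
        # Final convolution maps back to 3 channels, with pixels in [0, 1].
        outputs = tf.keras.layers.Conv2D(3, 9, strides=1, padding="same", activation="sigmoid")(x)
        return tf.keras.Model(inputs, outputs)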

I hope this blog has made the architecture and workings of these style transfer networks clear.

Here is a link to the Justin Johnson style transfer written in TensorFlow 2.0:
https://github.com/imransalam/JJ-style-transfer-TensorFlow2.0

Hope you enjoyed it.
