Content and style loss using VGG network

Explained and visualized in simple terms with TensorFlow

Oleksandr Savsunenko


We all know the Prisma app, which introduced style transfer to the public; the past year has shown a plethora of advances in this technology. For a great review and code, please check the fast-style-transfer Git repo.

If you are looking for a simple explanation of VGG-based style and content losses and Gram matrices, along with tips on implementation, continue reading.

VGG network features and implementation

The VGG19 network was initially developed for the ImageNet ILSVRC-2014 image classification competition, where it scored a 7.3% top-5 error rate. That was a breakthrough of the year. VGG19 is simple, fast, and easy to understand. It has 5 stacks of convolutional layers, named conv1 to conv5, each containing from 2 to 4 layers, named conv1_1…conv5_4. They are followed by fully connected layers, which do the actual classification.
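For reference, the layer naming used throughout this post follows the standard VGG19 configuration; here is a short sketch of the convolutional stacks and their channel counts (from the original VGG paper):

# VGG19 convolutional stacks: (stack name, number of conv layers, channels)
VGG19_CONV_STACKS = [
    ("conv1", 2, 64),   # conv1_1, conv1_2
    ("conv2", 2, 128),  # conv2_1, conv2_2
    ("conv3", 4, 256),  # conv3_1 ... conv3_4
    ("conv4", 4, 512),  # conv4_1 ... conv4_4
    ("conv5", 4, 512),  # conv5_1 ... conv5_4
]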

The idea behind it is simple: the convolutional layers do “feature extraction”, acting as receptive fields that find patterns and geometrical shapes of progressing complexity, while the fully connected layers act as a classical perceptron, classifying objects based on which shapes were present in the image.

As a by-product, a pre-trained VGG network with the fully connected layers detached acts as a pure feature and pattern extractor.

To implement the VGG network in TensorFlow, one can use a number of pre-crafted implementations from TensorFlow Slim or from independent GitHub repos. I prefer working with this repo, which makes network building as simple as:

vgg = vgg19.Vgg19()  # loads the pre-trained weights
vgg.build(images)    # builds the graph on top of an input tensor
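Here `images` is any RGB image tensor of shape [batch, height, width, 3], with values scaled to [0, 1] (the range this repo's models expect). Once built, every intermediate activation is exposed as an attribute of the object, so extracting features from any layer is a one-liner:

features = vgg.conv4_2  # feature maps from the second conv layer of the 4th stack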

Content loss

To use VGG19 for a content loss (also called perceptual loss) in applications like super-resolution, one extracts feature activations from different layers of VGG19. The deeper the layer, the more the network focuses on general details and patterns. To demonstrate this, let's take a simple use case: we will try to reconstruct an image from white noise using gradient descent and see what kind of result we obtain using different layers of VGG19. Our test image is 400x600 pixels. Here's basic pseudocode in TensorFlow:

import tensorflow as tf
import tensorlayer as tl
from vgg19 import Vgg19

def model(input, reuse=False):
    with tf.variable_scope('model', reuse=reuse):
        tf.identity(input, name="input_image")
        # trainable per-pixel coefficients, optimized to reproduce the target
        coefs = tf.get_variable("my_variable", [1, 400, 600, 3],
                                initializer=tf.random_normal_initializer())
        output = input * coefs
        tf.identity(output, name="output_image")
        return tf.nn.tanh(output)

input_image = tf.placeholder(tf.float32, [1, 400, 600, 3])   # white noise input
target_image = tf.placeholder(tf.float32, [1, 400, 600, 3])  # image to reconstruct
network = model(input_image)

# Two VGG19 instances: one sees the target, one sees the reconstruction
vgg = Vgg19()
vgg1 = Vgg19()
vgg.build((target_image + 1) / 2)  # rescale from [-1, 1] to the [0, 1] VGG expects
vgg1.build((network + 1) / 2)

# Content loss: MSE between the two images' feature maps at the same layer
loss = tl.cost.mean_squared_error(vgg.conv1_1, vgg1.conv1_1, is_mean=True)
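Minimizing this loss with a standard optimizer drives the noise toward the target; here is a minimal training-loop sketch (the optimizer, learning rate, and step count are assumptions, not settings from the original experiment):

train_op = tf.train.AdamOptimizer(learning_rate=0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # `noise` and `target` are NumPy arrays of shape [1, 400, 600, 3] in [-1, 1]
    for step in range(2000):
        _, current_loss = sess.run([train_op, loss],
                                   feed_dict={input_image: noise, target_image: target})
        if step % 100 == 0:
            print(step, current_loss)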

Here’s what happens.

Using the content loss, we can reconstruct the image almost perfectly from the initial convolutional layers (conv1-conv3); quality starts to degrade slowly at the 4th and 5th layers, as the network begins to capture only broader details.

Style loss

For Prisma-style effects, you need to extract only the texture information from an image. There are a number of ways to do that, including Markov Random Fields and the Gram matrix, both computed on top of VGG features. The simplest one is the Gram matrix, which is basically (in an inner product space) the Hermitian matrix of inner products (Wiki).

While this might look complicated, the actual implementation is pretty simple. To capture the style of an image, you compute the Gram matrix of a VGG layer: first vectorize (flatten) the layer's feature maps, then multiply the resulting matrix by its own transpose (an inner product). The obtained matrix can be used directly to compute a style loss for training.
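As a minimal sketch of that computation in plain NumPy (the shapes mirror a mid-level VGG layer; the names here are illustrative):

import numpy as np

h, w, c = 50, 75, 256               # feature map: height, width, channels
features = np.random.rand(h, w, c)  # stand-in for one VGG layer's activations
flat = features.reshape(h * w, c)   # vectorize: one column per channel
gram = flat.T @ flat / (h * w * c)  # (c, c) matrix of channel inner products, normalized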

With the same initial image, we experiment with losses computed from different layers:

hw = tf.shape(vgg.conv5_1)[1] * tf.shape(vgg.conv5_1)[2]  # flatten spatial dims of conv5_1
vector_target2 = tf.reshape(vgg.conv5_1, [-1, hw, 512])
vector_output2 = tf.reshape(vgg1.conv5_1, [-1, hw, 512])
norm = tf.cast(hw * 512, tf.float32)  # normalize Gram matrices by feature-map size
style_loss = tl.cost.mean_squared_error(tf.matmul(vector_target2, vector_target2, adjoint_a=True) / norm,
                                        tf.matmul(vector_output2, vector_output2, adjoint_a=True) / norm, is_mean=True)
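In a complete style-transfer objective, this style term is usually combined with the content loss from the previous section into a weighted sum, as in Gatys et al.; the weights below are purely illustrative, not values from this post:

content_weight, style_weight = 1.0, 100.0  # illustrative ratio, tuned per application
total_loss = content_weight * loss + style_weight * style_loss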

The result is as follows.

As you can see, a style loss based on the initial layers keeps the spatial layout of the target image, while with deeper layers it starts to hallucinate heavily. Layers 3 to 5 are typically used in style transfer applications.

What’s next?

We are just starting our Medium blog, so follow us and leave a comment if you like what you read and want to go deeper into image processing and generation with TensorFlow.

We use the approaches described here in our product, Letsenhance.io.

Alex, CEO of Let’s Enhance
