Image Style Transfer

Ting-Hao Chen
Machine Learning Notes
6 min read · Jan 22, 2018

In this blog, we are going to transfer the style from a famous painting to any photo using a convolutional neural network.

If you are interested, the code for this post (a Jupyter notebook with a tutorial) is available below.

We are going to use VGG19, a pre-trained model, to build our algorithm.

VGG19 architecture. Photo credit: https://www.slideshare.net/ckmarkohchang/applied-deep-learning-1103-convolutional-neural-networks

The idea was first presented by Gatys et al. Basically, you have two input images: a content image and a style image. We want to produce a mixed image that contains the style (such as texture and colour) of the style image and the content of the content image. In a convolutional neural network, the feature maps in the lower layers capture low-level features (i.e. content) and, vice versa, the feature maps in the higher layers capture high-level features (i.e. style). The image shown below illustrates the idea. During the style reconstructions, the higher feature levels capture the style of the painting. On the other hand, during the content reconstructions, the lower feature levels capture the content.

The idea proposed by Gatys et al.

Measuring only low-level, per-pixel differences cannot capture perceptual differences between the output and ground-truth images. Gatys et al. propose using high-level features to obtain the style of the image and thereby reduce these perceptual differences. Johnson et al. improve on Gatys et al.'s work by training a feed-forward transformation network ahead of time, so that stylization only needs a single forward pass at test time. The new, faster architecture proposed by Johnson et al. is shown below.

The architecture presented by Johnson et al.
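To make the figure a bit more concrete, here is a rough TensorFlow sketch of such a feed-forward transform network. This is an illustration only, not the exact network from the paper: it omits instance normalization and uses fewer residual blocks than Johnson et al.

import tensorflow as tf

def transform_net(image):
    # Downsampling convolutions.
    x = tf.layers.conv2d(image, 32, 9, strides=1, padding='same', activation=tf.nn.relu)
    x = tf.layers.conv2d(x, 64, 3, strides=2, padding='same', activation=tf.nn.relu)
    x = tf.layers.conv2d(x, 128, 3, strides=2, padding='same', activation=tf.nn.relu)

    # Residual blocks keep the image structure while re-rendering texture.
    for _ in range(5):
        y = tf.layers.conv2d(x, 128, 3, padding='same', activation=tf.nn.relu)
        y = tf.layers.conv2d(y, 128, 3, padding='same')
        x = x + y

    # Upsampling back to the original resolution.
    x = tf.layers.conv2d_transpose(x, 64, 3, strides=2, padding='same', activation=tf.nn.relu)
    x = tf.layers.conv2d_transpose(x, 32, 3, strides=2, padding='same', activation=tf.nn.relu)
    x = tf.layers.conv2d(x, 3, 9, strides=1, padding='same')

    # Map the output to the 0-255 pixel range.
    return tf.nn.tanh(x) * 127.5 + 127.5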

So, let’s build an algorithm to do style transfer! First, we define the layers from which we are going to extract features.

STYLE_LAYERS = ['relu1_1', 'relu2_1', 'relu3_1', 'relu4_1', 'relu5_1']
CONTENT_LAYER = 'relu4_2'

When calculating the loss function for the style features, we want to measure which features in the style layers activate simultaneously for the style image, and then copy this activation pattern to the mixed image.

One way of doing this is to calculate the so-called Gram matrix for the tensors output by the style layers. The Gram matrix is essentially just a matrix of dot products between the feature-activation vectors of a style layer.

If an entry in the Gram matrix has a value close to zero, it means the two features in the given layer do not activate simultaneously for the given style image. And vice versa, if an entry in the Gram matrix has a large value, it means the two features do activate simultaneously for the given style image. We will then try to create a mixed image that replicates this activation pattern of the style image.
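As a tiny, self-contained illustration (toy numbers, not taken from the model): channels that fire at the same spatial positions produce a large off-diagonal Gram entry, while channels that never co-activate produce an entry of zero.

import numpy as np

# Toy feature maps: 4 spatial positions, 3 channels.
# Channels 0 and 1 activate at the same positions; channel 2 does not.
features = np.array([[1.0, 0.9, 0.0],
                     [0.8, 1.0, 0.0],
                     [0.0, 0.0, 1.0],
                     [0.0, 0.1, 0.9]])

# Gram matrix: dot products between channel activation vectors.
gram = np.matmul(features.T, features) / features.size
print(gram)
# Large gram[0, 1] -> channels 0 and 1 fire together (part of the "style").
# gram[0, 2] is zero -> channels 0 and 2 never co-activate.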

import numpy as np
import tensorflow as tf
import vgg19  # the author's pre-trained VGG19 wrapper module

# Dict for storing the Gram matrices of the style target.
style_features = {}

# We are going to call the VGG19 graph from outside, so we'll
# have to set tf.Graph() as default.
with tf.Graph().as_default(), tf.Session() as sess:
    # Expand a new dim at axis=0 for the batch size.
    style_target = np.array([style_target])

    # Create a 4-D, float32 placeholder for the style target.
    style_image = tf.placeholder(tf.float32, style_target.shape)

    # Style target minus mean pixel.
    # mean pixel = np.array([123.68, 116.779, 103.939])
    style_image_pre = vgg19.preprocess(style_image)

    # Let the style target flow through the VGG19 model.
    # vgg_path points to the pre-trained VGG19 weights file.
    net = vgg19.net(vgg_path, style_image_pre)

    for layer in STYLE_LAYERS:
        # Feed the style target into the style image placeholder.
        feed_dict = {style_image: style_target}
        features = net[layer].eval(feed_dict)

        # Reshape [batch_size, height, width, channel] to
        # [batch_size * height * width, channel].
        features = np.reshape(features, (-1, features.shape[3]))

        # Calculate the Gram matrix and normalize it.
        gram = np.matmul(features.T, features) / features.size

        # Store the Gram matrices for the loss function.
        style_features[layer] = gram

Above, the style target flows through the VGG19 net and its Gram matrices are stored in style_features. We also need the input image to flow through the image transform net and the VGG19 net in order to calculate the style loss; that is what produces the output_features dictionary used below, and the code that builds it appears in the content-loss section that follows.

# Calculate the style loss.
style_losses = []
for style_layer in STYLE_LAYERS:
    # Output features in each style layer.
    layer = output_features[style_layer]

    # Get the batch_size, height, width, channel of each style layer.
    bs, height, width, channel = map(lambda i: i.value, layer.get_shape())

    # Calculate the size.
    size = height * width * channel

    # Calculate the Gram matrix for the output features in each style layer.
    # Reshape the output features in each style layer.
    output_f_style = tf.reshape(layer, (bs, height * width, channel))

    # Transpose the output features.
    output_f_style_T = tf.transpose(output_f_style, perm=[0, 2, 1])

    # Calculate the Gram matrix for the output features.
    output_gram = tf.matmul(output_f_style_T, output_f_style) / size

    # Gram matrix of the style target, computed earlier.
    style_gram = style_features[style_layer]

    # L2 loss between the output Gram matrix and the style Gram matrix
    # (L2 loss without half the norm), collected per layer.
    style_losses.append(2 * tf.nn.l2_loss(output_gram - style_gram))
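The per-layer losses collected in style_losses then need to be combined into a single style_loss term for the total loss below. A minimal sketch (averaging over the batch is a choice here; an extra style weight, analogous to denoise_weight below, could also be folded in):

# Sum the per-layer style losses and average over the batch.
style_loss = tf.add_n(style_losses) / batch_size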

Next, calculate the content loss.

# Create a 4-D, float32 placeholder for the content target.
# batch_shape is the shape of a batch of content images:
# [batch_size, height, width, channel].
X_content = tf.placeholder(tf.float32, batch_shape)

# Content image minus mean pixel.
content_pre = vgg19.preprocess(X_content)

# Create a dict for storing the content features.
content_features = {}

# Let the content target flow through the VGG19 model.
content_net = vgg19.net(vgg_path, content_pre)

# Store the content features for the loss function.
content_features[CONTENT_LAYER] = content_net[CONTENT_LAYER]

# Let the input image flow through the image transform net.
output_image = image_transform_net.net(X_content / 255.0)

# Output image minus mean pixel.
output_pre = vgg19.preprocess(output_image)

# Let the output image flow through the VGG19 model in order to
# calculate the features of the output image.
output_features = vgg19.net(vgg_path, output_pre)

# Calculate the size of the content features.
content_size = tensor_size(content_features[CONTENT_LAYER]) * batch_size

# L2 loss between the output features and the content features
# (L2 loss without half the norm), normalized by the content size.
content_loss = 2 * tf.nn.l2_loss(
    output_features[CONTENT_LAYER] - content_features[CONTENT_LAYER]) / content_size
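The tensor_size helper used above is not defined in this post; a plausible implementation, judging from how it is used here (the number of elements per example, excluding the batch dimension), would be:

from functools import reduce
from operator import mul

def tensor_size(tensor):
    # Product of all dimensions except the batch dimension.
    return reduce(mul, (d.value for d in tensor.get_shape()[1:]), 1)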

Moreover, let’s create a loss function for denoising the mixed image. The algorithm is called Total Variation Denoising. It essentially just shifts the image one pixel along the x and y axes, calculates the difference from the original image, squares it (so the difference is always positive), and sums over all the pixels in the image. This creates a loss function that can be minimized so as to suppress some of the noise in the image.

# Total variation: squared differences between neighbouring pixels
# along the y and x axes (L2 loss without half the norm).
Y_denoise = 2 * tf.nn.l2_loss(output_image[:, 1:, :, :] - output_image[:, :-1, :, :])
X_denoise = 2 * tf.nn.l2_loss(output_image[:, :, 1:, :] - output_image[:, :, :-1, :])
# Calculate the size of Y_denoise and X_denoise.
Y_size = tensor_size(output_image[:,1:,:,:])
X_size = tensor_size(output_image[:,:,1:,:])
# Normalize total variation denoising and multiply its weight (a.k.a. lambda).
denoise_loss = \
denoise_weight * (X_denoise / X_size + Y_denoise / Y_size) / batch_size

The total loss is then simply the sum of all three terms.

# Total loss.
loss = content_loss + style_loss + denoise_loss

The optimization for style transfer is basically just gradient descent on the total loss. We use Adam as our optimizer.

# Use Adam as our optimizer.
optimizer = tf.train.AdamOptimizer(learning_rate)

# Minimize the loss function.
train_op = optimizer.minimize(loss)
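For completeness, here is a minimal training-loop sketch. The names content_batches (an iterator over [batch_size, height, width, 3] arrays of content photos), num_epochs and checkpoint_path are assumptions for illustration, not from the original code:

# Hypothetical names: content_batches yields [batch_size, height, width, 3]
# numpy arrays; num_epochs and checkpoint_path are hyperparameters.
saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(num_epochs):
        for batch in content_batches:
            # One gradient step on the total loss.
            _, loss_value = sess.run([train_op, loss],
                                     feed_dict={X_content: batch})
        print('epoch %d, loss %.2f' % (epoch, loss_value))
    # Save the trained weights as a ckpt file for evaluation.
    saver.save(sess, checkpoint_path)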

After training, we save the session as a ckpt file for evaluation. Now let’s play with it! In this demo, I am going to use the Taipei 101 building as the content image.
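A minimal evaluation sketch (again with assumed names: checkpoint_path is the ckpt saved during training, and content_image is the Taipei 101 photo as a [height, width, 3] NumPy array):

import numpy as np
import tensorflow as tf
from PIL import Image

with tf.Graph().as_default(), tf.Session() as sess:
    # Rebuild the image transform net for a single image.
    X_content = tf.placeholder(tf.float32, (1,) + content_image.shape)
    output_image = image_transform_net.net(X_content / 255.0)

    # Restore the trained weights from the ckpt file.
    saver = tf.train.Saver()
    saver.restore(sess, checkpoint_path)

    # Run the content photo through the network and save the result.
    out = sess.run(output_image, feed_dict={X_content: np.array([content_image])})
    stylized = np.clip(out[0], 0, 255).astype(np.uint8)
    Image.fromarray(stylized).save('taipei101_stylized.jpg')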

Use la muse painting as the style image.
Use rain princess as the style image.
Use scream as the style image.
Use udnie as the style image.
Use wave as the style image.
Use wreck as the style image.

Learn more on my Github repository and don’t forget to give me a clap! Ciao!
