Deep Learning Application I : Style Transfer

Rajat Kanti Bhattacharjee
csmadeeasy
Dec 9, 2018

In this post I will try to guide you through Neural Style Transfer. Yes, it is a pretty well-worn topic; there are already some great articles and experiments about it, and I will point to those first in case you want a deeper exploratory analysis of the method.

I would recommend reading the original Gatys paper, which basically kick-started everything in style transfer. You may not understand it on the first pass, but do give it a read.

Just so that you don't have to leave the page, I will try to have you follow the paper along with the accompanying code, so that both become easier to understand.

The idea is that we pass a content target, a style target and a random noise image (in practice we initialise the noise with the content image) through a VGG-16 network. The activations from intermediate layers of VGG for the content target, the style target and the noise image are used in loss equations that are meant to capture what the content of an image is and what its style means. These losses are backpropagated, so the gradients also reach the noise image, and we update the noise image with some optimization method. In this case we will be using Adam.
PS: The style part is still kind of black magic for me, at least until we find a formal definition of what the style of an image is.
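To keep the big picture in mind, here is a rough outline of the steps the rest of the post implements:

# 1. Load the content and style images and preprocess them for VGG-16.
# 2. Build a VGG-16 feature extractor that exposes a few intermediate layers.
# 3. Create a trainable image variable, initialised from the content image.
# 4. Content loss: squared difference of content-layer activations.
#    Style loss: squared difference of Gram matrices at the style layers.
#    Total loss: weighted sum of both, plus a total-variation smoothness term.
# 5. Repeatedly apply Adam updates to the image variable to minimise the total loss.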

This is what the internet will show you when you ask for a VGG network

In case you are unfamiliar with it, VGG-16 is a classification network. The core assumption is that VGG, or any CNN classification network, captures image features in a representative manner in its hidden layers. Those hidden layers contain information about the image's content, which is why they can be used for classification, and their statistics also capture the image's style.
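If you want to see for yourself which layers are available, a minimal sketch like the one below (assuming Keras with a TensorFlow backend and the downloadable ImageNet weights) just prints the layer names; this is where names like block4_conv2 and block1_conv1 come from.

from keras.applications.vgg16 import VGG16
probe = VGG16(weights = 'imagenet', include_top = False, input_shape = (256, 256, 3))
for layer in probe.layers:
    print(layer.name, layer.output_shape)   # e.g. block1_conv1 (None, 256, 256, 64)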

This is what you will work with when programming (Keras variant)
This is what we will be building for style transfer.

Now we need to build a model in Keras that exposes the intermediate layers as outputs; this will let us work with the activations easily.

# imports used throughout the post
from keras.applications.vgg16 import VGG16
from keras.applications import vgg16
from keras.preprocessing.image import load_img, img_to_array
from keras.models import Model
from keras.optimizers import Adam
from keras import backend as K
import numpy as np
import matplotlib.pyplot as plt

dimension = 256   # working image size; this post uses (256, 256) images

content_weight = 1
style_weight = 1e3
style_weights = [0.2, 0.2, 0.2, 0.2, 0.2]
content_layers = [
    'block4_conv2',
]
style_layers = [
    'block1_conv1',
    'block2_conv1',
    'block3_conv1',
    'block4_conv1',
    'block5_conv1',
]
vgg_model = VGG16(weights = 'imagenet', include_top = False, input_shape = (dimension, dimension, 3))
vgg_model.trainable = False
for layer in vgg_model.layers:
    layer.trainable = False
content_layer_tensors = [vgg_model.get_layer(layer).output for layer in content_layers]
style_layer_tensors = [vgg_model.get_layer(layer).output for layer in style_layers]
vgg_feature = Model(inputs = vgg_model.input, outputs = content_layer_tensors + style_layer_tensors)
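As a quick sanity check (a sketch with made-up names), you can run a dummy batch through vgg_feature and confirm that it returns one activation tensor per requested layer, content layers first and style layers after:

dummy = np.zeros((1, dimension, dimension, 3), dtype = 'float32')
outs = vgg_feature.predict(dummy)
print([o.shape for o in outs])   # 1 content activation followed by 5 style activations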

Now that we have a feature extractor model, we can move on to extracting the features of the images. But before we do that, we need a bit of preprocessing.

def preprocess_image(image_path):
    # load, resize, and apply the VGG-16 preprocessing (RGB -> BGR, mean subtraction)
    img = load_img(image_path, target_size=(dimension, dimension))
    img = img_to_array(img)
    # img = np.expand_dims(img, axis=0)
    img = vgg16.preprocess_input(img)
    return img

# util function to convert a tensor into a valid image
def deprocess_image(x):
    # x = x.reshape((dimension, dimension, 3))
    # Remove zero-center by mean pixel
    x = np.copy(x)
    x[:, :, 0] += 103.939
    x[:, :, 1] += 116.779
    x[:, :, 2] += 123.68
    # 'BGR'->'RGB'
    x = x[:, :, ::-1]
    x = np.clip(x, 0, 255).astype('uint8')
    return x

The VGG network was trained on images that were zero-centred channel-wise, i.e. each channel had mean 0, which was achieved by subtracting the per-channel mean computed over the whole dataset. Also, unlike the PyTorch implementation, the Keras weights were not trained on normalised (0-1 scaled) images.
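For intuition, here is roughly what vgg16.preprocess_input does to a single (H, W, 3) RGB array in its default 'caffe' mode; this is only an illustration of the idea, not a drop-in replacement:

def manual_vgg_preprocess(img_rgb):
    # illustrative helper, not part of the notebook
    x = img_rgb[..., ::-1].astype('float64')   # RGB -> BGR
    x[..., 0] -= 103.939                       # subtract the ImageNet channel means (B, G, R)
    x[..., 1] -= 116.779
    x[..., 2] -= 123.68
    return x

Notice that deprocess_image above simply undoes these two steps.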

content_img = preprocess_image(CONTENT)
style_img = preprocess_image(STYLE)
The images we intend to combine
content_tensor = K.constant(content_img)
style_tensor = K.constant(style_img)
resultant_tensor = K.variable(content_img)

The usual trick is to initialise the resultant "noise" with the original content image. In my experience this makes the style transfer both faster and more stable.
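If you want to start from actual random noise instead (as in the original Gatys setup), a sketch like the following also works, though it typically takes longer to settle:

# alternative initialisation from uniform noise instead of the content image
resultant_tensor = K.variable(np.random.uniform(-128.0, 128.0, content_img.shape).astype('float32'))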

inputs = K.concatenate([K.expand_dims(resultant_tensor, axis = 0),
                        K.expand_dims(content_tensor, axis = 0),
                        K.expand_dims(style_tensor, axis = 0)], axis = 0)
features = vgg_feature(inputs)
content_layers_len = len(content_layers) ## Just taking the length
style_layers_len = len(style_layers) ## Will come handy

We pass all these tensors at once. Note that they are now being passed as a batch, so the outputs have to be accessed by their batch indices. Now let's start building the losses, beginning with the content loss.

The content loss is the sum of squared differences between the individual feature activations of the result image and the content image.
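For reference, the equation this implements, from the Gatys paper, is (F and P are the activations of the generated image and the content image at layer l):

\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left( F^{l}_{ij} - P^{l}_{ij} \right)^{2}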
## Creating the content loss for the input
content_loss = 0
for i in range(content_layers_len):
    print(content_layers[i])
    loss = K.sum(K.square(features[i][0] - features[i][1])) / 2
    content_loss += loss

Note the indices: we take the activation for the resultant tensor (index 0) and the activation for the content tensor (index 1) and compute the loss between them. Now we define the style loss.

First we need the Gram matrix of a tensor.

def get_gram_matrix(tensor):
    shape = K.int_shape(tensor)
    N = shape[2]               # number of feature maps (channels)
    M = shape[0] * shape[1]    # number of spatial positions
    features = K.reshape(tensor, shape = (shape[0] * shape[1], shape[2]))
    return K.dot(K.transpose(features), features) / (2 * N * M)

That is how you compute the Gram matrix. The extra division only absorbs the coefficient that appears in the loss equation below.

Then we take the same sum of squared differences:

style_loss = 0
for i in range(content_layers_len, content_layers_len + style_layers_len):
    print(style_layers[i - content_layers_len])
    gram_result = get_gram_matrix(features[i][0])
    gram_style = get_gram_matrix(features[i][2])   ## index 2 = the style image in the batch
    loss = style_weights[i - content_layers_len] * K.sum(K.square(gram_result - gram_style))
    style_loss += loss

Keep following the code alongside the formulas. The weighting below is actually already done in the second-to-last line of the code above; see how style_weights is multiplied in for each individual layer. That is what the formula below expresses.

We also weight each layer's contribution.
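For reference, the corresponding equations from the Gatys paper are (G and A are the unnormalised Gram matrices of the generated and style images at layer l, N_l the number of filters and M_l the number of spatial positions; the 1/(4 N_l^2 M_l^2) coefficient is exactly what get_gram_matrix folds in by dividing by 2NM):

E_{l} = \frac{1}{4 N_{l}^{2} M_{l}^{2}} \sum_{i,j} \left( G^{l}_{ij} - A^{l}_{ij} \right)^{2}

\mathcal{L}_{style} = \sum_{l} w_{l} \, E_{l}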

You can find a much more detailed explanation of why this particular method is used in the authors' paper, since the Gram matrix in this case only gives the internal feature covariance. The way I understand it is:

“How much of a feature exists for some other feature”
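A tiny made-up NumPy example makes that interpretation concrete: two feature maps that fire at the same spatial positions produce a large off-diagonal Gram entry, while a feature that never fires contributes nothing.

import numpy as np
fmap = np.array([[1.0, 1.1, 0.0],    # rows = spatial positions, columns = three features
                 [0.9, 1.0, 0.0],
                 [1.2, 0.8, 0.0]])
gram = fmap.T @ fmap                 # 3x3 feature-by-feature co-occurrence matrix
print(gram)                          # entry [0, 1] is large -> features 0 and 1 co-occur; feature 2 is absent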

You can always visualise what features a particular layer is looking for, but that is already done in the links below. Before heading to them, note that by now you have already built the computation graph in TensorFlow; all that is left is to run the backprop operation on it.

Now head over to the links below to understand how the layers affect the result. Note that the code in those articles is in PyTorch, so make sure you don't get confused.

Hey, you are back. So let us move on to the optimization process. Before we start, we need one more loss, which works only on the resultant tensor and makes the image smoother. Though in my personal experience, since I was working with very small images of size (256, 256), it did not noticeably change the result.

def total_variation_loss(x):
    a = K.square(
        x[:dimension - 1, :dimension - 1, :] - x[1:, :dimension - 1, :])
    b = K.square(
        x[:dimension - 1, :dimension - 1, :] - x[:dimension - 1, 1:, :])
    return K.sum(K.pow(a + b, 1.25))

tv_loss = total_variation_loss(resultant_tensor)

Total variation loss. Code courtesy of the Keras documentation and the official GitHub repository.
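Written out, the penalty being summed is roughly (over spatial positions i, j and the colour channels):

\sum_{i,j} \left[ (x_{i,j} - x_{i+1,j})^{2} + (x_{i,j} - x_{i,j+1})^{2} \right]^{1.25}

i.e. neighbouring pixels are encouraged to take similar values, which is what smooths the image.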

total_loss = content_weight * content_loss + style_weight * style_loss + tv_loss   ## Adding all losses
optim = Adam(lr = 1)
update = optim.get_updates(loss = total_loss, params = [resultant_tensor])
iterate = K.function([], [total_loss, content_loss, style_loss],
                     updates = update)

People who are used to working with Keras by building a model, compiling it and calling fit may find this new. First we take an Adam() object, which has a get_updates method. get_updates returns the update operations that apply the optimizer step to the given params so as to reduce the loss: think of total_loss as the function, and get_updates as differentiating it with respect to the parameters and wrapping the resulting gradient step into update ops.
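To make that concrete, a plain gradient-descent analogue of what get_updates produces would look roughly like the sketch below (Adam additionally keeps per-parameter moment variables, so this is only the idea, not literally what it returns):

grads = K.gradients(total_loss, [resultant_tensor])[0]
sgd_update = [K.update_sub(resultant_tensor, 1.0 * grads)]   # lr = 1, like the Adam(lr = 1) above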

iterate holds a Keras function, which is built from a list of inputs; in this case that list is empty. We do have a few inputs, but they are constants and do not need to change between calls, so if you know TensorFlow, that means there are no placeholders involved here. The updates parameter takes the list of update operations, which are run every time the function is called.
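For contrast, here is a toy K.function that does take inputs (purely illustrative, not part of the style-transfer code):

xin = K.placeholder(shape = (3,))
f = K.function([xin], [K.sum(K.square(xin))])
print(f([np.array([1.0, 2.0, 3.0])]))   # prints something like [14.0]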

epochs = 10000
for epoch in range(epochs):
    loss = iterate(inputs = [])
    # print(loss)
    if epoch % 100 == 0:
        print("epoch:", epoch, "total loss:", loss[0], "style loss:", loss[2], "content loss:", loss[1])
    if epoch % 500 == 0:
        plt.imshow(deprocess_image(K.get_value(resultant_tensor)))
        plt.show()

All we do now is iterate. Note that the inputs parameter is empty; you still have to pass an empty list, otherwise it throws an error.
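Once you are happy with the output, you can pull the value out of the Keras variable and save it to disk; a minimal sketch (the file name is just an example):

final = deprocess_image(K.get_value(resultant_tensor))
plt.imsave('stylized.png', final)   # example file name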

I definitely encourage you to go through the code plus the math behind it. Figuring out neural style transfer is not hard; getting it to work and controlling the transfer is where the engineering comes in. 😀

In the next post we will visualise the layers of VGG and look at what the Gram matrix is actually capturing among the features.

Github link for Notebook : https://github.com/rajatkb/Neural-Style-Transfer-Keras
