A Complete Step-wise Guide on Neural Style Transfer
______________________________________________________________
Neural style transfer (NST) is an image stylization technique with Deep Learning at its core. With NST you can create artistic pieces using machine learning. NST takes two images, called the content image and the style image, and renders the former in the style of the latter. This is achieved by transferring the style of the style image onto the content image. But what does the STYLE of an image mean?
The style of an image refers to its texture, brush strokes, geometrical shapes and spatial colour distribution.
Let’s see the result of neural style transfer.
In Figure 1, the left-most image (a) is the content image, the middle image (b) is the style image and the right-most image (c) is the corresponding stylized image, the result of NST.
From the above results, you can see that this is not just a simple image overlaying process. Image overlaying and image stylization are two completely different processes. Image overlaying is the process of overlapping one image on another image, a simple superposition. NST, on the other hand, presents one image in the style of another. Refer to the images below.
In Figure 2, (a) is the result of image overlaying and (b) is the result of image stylization. I hope these images make the difference between the two processes clear.
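To make the contrast concrete, image overlaying is essentially nothing more than a pixel-wise blend of the two pictures. Here is a minimal sketch, using the same example paths as later in this post and assuming Pillow is available:

from PIL import Image

content = Image.open("/content/1.jpg")
style = Image.open("/content/2.jpg").resize(content.size)   # make the two images the same size
overlay = Image.blend(content, style, alpha=0.5)             # a plain 50/50 pixel-wise mix, no "style" involved
overlay.save("overlay.jpg")

NST, as we will see, never mixes pixels directly; it optimizes a new image against feature representations of both inputs.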
Let’s move on.
After understanding what NST is and how it differs from simple image overlaying, let's look at the steps involved in NST:
STEP 1: Choose the Content and Style Image.
STEP 2: Preprocess the image.
STEP 3: Generate a random image of the same size as content and style image.
STEP 4: Design the model.
STEP 5: Calculate Loss.
STEP 6: Optimize until converged.
That’s all!!!!! Pretty easy right??
Let's understand each of the steps in detail and get our hands dirty with some code. For the implementation, I have used Google Colab.
We are going to use TensorFlow 2.0, so first of all we need to install it. Use the following code:
!pip install -q tensorflow-gpu==2.0.0-beta1

import tensorflow as tf
Next, we need to import important libraries.
import IPython.display as display
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import time
STEP 1 & STEP 2
In the first step, you will select two photos: one as the content image and the other as the style image.
content_path = "/content/1.jpg"
style_path = "/content/2.jpg"
# Here you need to change the image paths to your own paths.
In the second step, some preprocessing of the images is required. First we will resize them, scaling each image based on its dimensions.
def load_img(path_to_img):
    max_dim = 512
    img = tf.io.read_file(path_to_img)                    # reads the image from its path
    img = tf.image.decode_image(img, channels=3)          # detects the image type and converts it to a tensor
    img = tf.image.convert_image_dtype(img, tf.float32)   # converts the pixel values to floats in [0, 1]
    shape = tf.cast(tf.shape(img)[:-1], tf.float32)       # takes height and width as floats
    long_dim = max(shape)
    scale = max_dim / long_dim
    new_shape = tf.cast(shape * scale, tf.int32)
    img = tf.image.resize(img, new_shape)
    img = img[tf.newaxis, :]                              # adds a batch dimension
    return img
The load_img method above is commented so you can follow what is happening at each step.
Next is a small method to display the image.
# to display an image
def imshow(image, title=None):
    if len(image.shape) > 3:
        image = tf.squeeze(image, axis=0)
    plt.imshow(image)
    if title:
        plt.title(title)
We had to write our own imshow method because load_img adds an extra batch dimension to the input image. Because of this extra dimension, plt.imshow cannot display the image directly, so we squeeze it out first.
Now let’s use these defined methods to read and process our images. Use the following code:
content_image = load_img(content_path)
style_image = load_img(style_path)
plt.subplot(1, 2, 1)
imshow(content_image, 'Content Image')
plt.subplot(1, 2, 2)
imshow(style_image, 'Style Image')
You will get something like this as the result.
STEP 3
Now we need to generate a random image. This image will later be modified so that its content is similar to the content image and its style is similar to the style image. It will have the same size as the content and style images. We will see the code for it in a while.
STEP 4
At this stage, the data is ready, and we will now send it to our neural network. For this project we will use Transfer Learning. For those who don't know what transfer learning is, let me explain. Transfer learning is a technique where we take the knowledge learned while solving one problem and apply it to a different but related problem. For example, the knowledge gained in recognizing cars can also be applied to recognizing trucks. There are many models pre-trained on large image datasets. These models have learnt to extract features from images, and this learning is stored in the form of the model's weights. We will simply use them to extract the style and content of our images. Here, we will be using VGG19. Below is the architecture of VGG19.
VGG19 is a state-of-the-art image classifier. Since we don't want to classify images in our case, we will not use its final layers.
Using the following code, you will load the model VGG19.
# Take the VGG19 model without the later output layers.
# include_top=False means the final classification layers are not included.
vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet')
Now, to see how the layers are named, execute this.
for layer in vgg.layers:
    print(layer.name)
It will give output something like this.
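The exact list depends on your TensorFlow version, but the convolutional and pooling layers follow VGG19's block naming scheme, roughly like this (truncated):

block1_conv1
block1_conv2
block1_pool
block2_conv1
block2_conv2
block2_pool
...
block5_conv1
block5_conv2
block5_conv3
block5_conv4
block5_pool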
To extract the content and style information of an image, we will use different intermediate layers of the model. Let’s see how and why.
To extract the content:
VGG19 is a convolutional neural network (CNN). A CNN takes the raw image pixels as input and generates an internal representation that converts those pixels into a complex understanding of the features present within the image. Simply speaking, a CNN can extract features from the pixel values and represent them numerically. As we move deeper into the network these features get more complex, and near the final convolutional layers the best feature representation of the image is found. Taking this into consideration, we will take a layer from the final block of the model for content extraction, as it represents the features of the image well. In our case, we are taking the block5_conv2 layer.
Execute the following code to do that.
# Content layer from which we will pull our feature maps
content_layers = ['block5_conv2']

num_content_layers = len(content_layers)
To extract the style:
At each layer, the CNN learns a set of features. The number of channels in a layer's output equals the number of filters in that layer, and each filter learns to detect a different feature. In the early layers the network learns simple features, such as detecting horizontal, vertical or diagonal lines in the first layer, corners in the second layer, and so on. Since each layer detects a different set of patterns, we will use multiple layers (one from each block) to extract the style information. In our case, we will be using ['block1_conv1', 'block2_conv1', 'block3_conv1', 'block4_conv1', 'block5_conv1'].
Execute the following code to do that.
# Style layers of interest
style_layers = ['block1_conv1',
                'block2_conv1',
                'block3_conv1',
                'block4_conv1',
                'block5_conv1']

num_style_layers = len(style_layers)
Instead of the above-mentioned layers, you can experiment with any other layer. Just keep the facts mentioned above in mind.
Now we are going to add one more method to this project which is given below.
def vgg_layers(layer_names):
    # Load our model: a pretrained VGG19, trained on ImageNet data
    vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet')
    vgg.trainable = False
    outputs = [vgg.get_layer(name).output for name in layer_names]
    model = tf.keras.Model([vgg.input], outputs)
    return model
This method, vgg_layers, creates a VGG model that returns a list of intermediate layer outputs.
Once the model is ready we will send the input to the model and get the output. We will see the code for this part in just a moment.
STEP 5
The output obtained in STEP 4 will now be used to calculate the loss. Let's see how the loss is defined in NST. The total_loss is divided into three parts.
1). Content Loss: measures the difference between the content (feature representations) of the content image and the generated image.
2). Style Loss: measures the difference between the style of the style image and the generated image.
3). Variation Loss: measures the variation among neighbouring pixel values of the generated image.
Let’s look at all of them in detail.
i). Content Loss: As mentioned earlier, it is used to check how similar the generated image is to our content image. The information about the content of an image is given by the activations of the neurons in different layers; the deeper the layer, the more abstract the information it stores. The content loss is simply the Euclidean distance between the activations (feature representations) of the content and generated images at the chosen layers. The formula for content loss is given below.
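For reference, in the standard formulation of Gatys et al., with F^l and P^l denoting the feature representations of the generated and content images at layer l, the content loss is

L_{content} = \frac{1}{2} \sum_{i,j} \left( F^{l}_{ij} - P^{l}_{ij} \right)^{2}

(The implementation later in this post uses the mean of the squared differences instead of half the sum, which only changes the scale.)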
ii). Style Loss: Similar to content loss, style loss is used to check how much the style of the generated image differs from the style of the style image. The difference is that the style of an image does not have as direct a representation as its content. So how can we get the style representation of an image? The answer is the Gram Matrix.
Now what is that and how does it represent the style???? Let’s see.
The above figure shows the different channels of the feature map at a particular layer (say l). At this layer, each channel of the feature map represents a different feature present in the image. Now, if we can somehow find the correlation between these features, we can get an idea of the style, since correlation is nothing but the co-occurrence of features. But what about the mathematics behind this? Here it is.
Let’s understand this using two vectors a and b. Correlation between these two vectors can be calculated by their dot products. Refer to the image below.
In the above figure, a and b are represented in a plane. The more correlated a and b are, the closer the vectors are to each other, the smaller the angle theta between them, the larger the cosine of theta, and hence the larger the dot product between them.
i.e. the more correlated a and b are, the larger the dot product between them.
The dot product of a and b will be calculated like this:
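In symbols, for two vectors a and b:

a \cdot b = \lVert a \rVert \, \lVert b \rVert \cos\theta = \sum_{i} a_{i} b_{i}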
Now let us come back to the feature map shown in Figure 7. Here, if we can find the correlation between the channels, we can get to know which features co-occur. This will give us an idea of the style of an image.
Let's see an example. Refer to Figure 7. Suppose the red channel represents the feature of black stripes, the yellow channel represents the presence of yellow colour and the green channel represents the presence of white stripes. If the red and yellow channels fire together with high activation values, i.e. they co-occur, then we can say the image was of a tiger. These two channels will have a higher correlation than the red and green channels. This co-occurrence can be measured by calculating the correlation, and the correlation of all these channels with respect to each other is given by the Gram Matrix of the image. We will use the Gram Matrix to measure the degree of correlation between channels, which will later act as a measure of the style itself. The formula of the Gram Matrix is given as:
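With F^l_{ijc} denoting the activation of channel c at spatial position (i, j) in layer l, the Gram Matrix entry for a pair of channels c and d is the inner product of those two channels over all spatial positions:

G^{l}_{cd} = \sum_{i,j} F^{l}_{ijc} \, F^{l}_{ijd}

(The gram_matrix method defined below additionally divides by the number of spatial locations, i.e. it takes the mean rather than the sum.)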
Wait!!!!!
But how does this look with our input, which is the activations of a layer of the neural network, or simply speaking, a 3D array? Let's see.
First of all, these 3D arrays are unrolled and reshaped in the following order.
After this, the Gram Matrix formula shown above is applied. Here is one more example for you.
Suppose we take the activation values from the 10th layer, which generates a feature representation of size 28*28*512. The activation will look similar to this.
In order to find the Gram Matrix of it, the following steps will be followed:
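A minimal sketch of those steps (the sizes and variable names here are just for the illustration above, not part of the NST code): the spatial dimensions are flattened so each of the 512 channels becomes a vector of 28*28 = 784 values, and the Gram Matrix is the matrix of dot products between every pair of these channel vectors.

# Illustrative only: Gram Matrix of a single 28x28x512 activation map
activations = tf.random.uniform((28, 28, 512))          # stand-in for a real feature map
F = tf.reshape(activations, (28 * 28, 512))             # flatten the spatial dims: shape (784, 512)
gram = tf.matmul(F, F, transpose_a=True)                # (512, 512) matrix of channel dot products
gram = gram / tf.cast(28 * 28, tf.float32)              # average over spatial locations, as in gram_matrix below
print(gram.shape)                                       # (512, 512)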
I hope it is now clear that the Gram Matrix of an image represents its style.
Now we will define the following method to find the Gram Matrix.
def gram_matrix(input_tensor):
    result = tf.linalg.einsum('bijc,bijd->bcd', input_tensor, input_tensor)  # forms the Gram Matrix
    input_shape = tf.shape(input_tensor)
    num_locations = tf.cast(input_shape[1] * input_shape[2], tf.float32)
    return result / num_locations
To form the Gram Matrix, TensorFlow's einsum function is used, which is inspired by the Einstein summation convention. To learn about this function in detail, go to this link.
Let's come back to the style loss. We are comparing the styles of two images here. After the long discussion on style representation, you will probably agree that for the style loss we just need to find the difference between the Gram Matrices of the two images. And that is exactly what is done. Below is the formula of style loss.
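For reference, in the Gatys et al. formulation, with G^l and A^l the Gram Matrices of the generated and style images at layer l, N_l the number of channels and M_l the number of spatial locations, the style loss is

E_{l} = \frac{1}{4 N_{l}^{2} M_{l}^{2}} \sum_{c,d} \left( G^{l}_{cd} - A^{l}_{cd} \right)^{2}, \qquad L_{style} = \sum_{l} w_{l} \, E_{l}

(The code below simply takes the mean of the squared Gram Matrix differences and divides by the number of style layers, which again differs only in scaling.)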
And this is the method for calculating both style and content losses.
def style_content_loss(outputs):
    style_outputs = outputs['style']
    content_outputs = outputs['content']
    # reduce_mean computes the mean of elements across dimensions of a tensor.
    # add_n adds all input tensors element-wise.
    style_loss = tf.add_n([tf.reduce_mean((style_outputs[name] - style_targets[name])**2)
                           for name in style_outputs.keys()])
    style_loss *= style_weight / num_style_layers      # division gives the style loss per layer
    content_loss = tf.add_n([tf.reduce_mean((content_outputs[name] - content_targets[name])**2)
                             for name in content_outputs.keys()])
    content_loss *= content_weight / num_content_layers
    loss = style_loss + content_loss
    return loss
iii). Variation Loss: This loss was not included in the original paper. It was added after noticing that minimizing only the style and content losses leads to highly noisy outputs. The variation loss ensures spatial continuity and smoothness in the generated image, avoiding noisy and overly pixelated results. This is done by penalizing the differences between neighbouring pixels.
The method for this is here:
def high_pass_x_y(image):
    x_var = image[:, :, 1:, :] - image[:, :, :-1, :]
    y_var = image[:, 1:, :, :] - image[:, :-1, :, :]
    return x_var, y_var

def total_variation_loss(image):
    x_deltas, y_deltas = high_pass_x_y(image)
    return tf.reduce_mean(x_deltas**2) + tf.reduce_mean(y_deltas**2)
Here the differences between neighbouring pixels are computed along each spatial dimension (horizontal and vertical).
The total loss is the weighted sum of these three losses. The weights decide how much impact the style or the content has on the total loss. Simply speaking, if you want more style than content in the resulting image, put a higher weight on the style loss; the total_loss then depends more on the style loss, and optimizing total_loss focuses more on optimizing the style loss, making the result look more like the style image than the content image. So let's assign the weights for all the losses.
style_weight=1e-2
content_weight=1e4
total_variation_weight=1e8
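Putting it together, the quantity being minimized is the weighted sum, where alpha, beta and gamma correspond to the content_weight, style_weight and total_variation_weight assigned above:

L_{total} = \alpha \, L_{content} + \beta \, L_{style} + \gamma \, L_{TV}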
STEP 6
In this step, we need an optimizer. We will use the Adam optimizer here. An optimizer tries to reduce the loss by finding the optimum values of the parameters; here the parameters are the pixel values of the generated image. Initially, these values are random noise. They are compared with the content image and the style image, the loss is calculated, and the Adam optimizer reduces the loss by updating the pixel values of the generated image.
Execute this for assigning the optimizer:
opt = tf.optimizers.Adam(learning_rate=0.02, beta_1=0.99, epsilon=1e-1)
Now let us code the remaining things.
Below is my main class.
class StyleContentModel(tf.keras.models.Model):
    def __init__(self, style_layers, content_layers):
        super(StyleContentModel, self).__init__()
        self.vgg = vgg_layers(style_layers + content_layers)
        self.style_layers = style_layers
        self.content_layers = content_layers
        self.num_style_layers = len(style_layers)
        self.vgg.trainable = False

    def call(self, inputs):
        "Expects float input in [0,1]"
        inputs = inputs * 255.0
        # preprocess_input does the remaining preprocessing (e.g. mean subtraction)
        # before the batch of images is sent to the VGG19 model.
        preprocessed_input = tf.keras.applications.vgg19.preprocess_input(inputs)
        outputs = self.vgg(preprocessed_input)
        style_outputs, content_outputs = (outputs[:self.num_style_layers],
                                          outputs[self.num_style_layers:])
        style_outputs = [gram_matrix(style_output) for style_output in style_outputs]
        content_dict = {content_name: value for content_name, value in zip(self.content_layers, content_outputs)}
        style_dict = {style_name: value for style_name, value in zip(self.style_layers, style_outputs)}
        return {'content': content_dict, 'style': style_dict}
The __init__ method here is the usual initialization method. The call method, when called on an image, returns the Gram Matrices (style) of the style_layers and the content of the content_layers.
Next, I am going to create an object of this class.
extractor = StyleContentModel(style_layers, content_layers)
Now I can use this object. Let’s do that.
style_targets = extractor(style_image)['style']
content_targets = extractor(content_image)['content']
Here I am sending my style image and content image to the model and storing the results in style_targets and content_targets respectively. That means style_targets holds the style of the style image and content_targets holds the content of the content image.
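If you want to peek at what is stored there, a quick optional check prints the Gram Matrix shapes, which follow the channel counts of the chosen VGG19 layers (for example, 64 channels for block1_conv1):

for name, output in sorted(style_targets.items()):
    print(name, output.numpy().shape)   # e.g. block1_conv1 (1, 64, 64)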
Remember that we haven't created the generated image yet. It's time to do that.
image = np.random.rand(*content_image.shape)                           # random noise in [0, 1], same shape as the content image
image = tf.Variable(tf.convert_to_tensor(image, dtype=tf.float32))     # a Variable, so the optimizer can update it in the training step
And now, one last method before the actual training step.
def clip_0_1(image):
    return tf.clip_by_value(image, clip_value_min=0.0, clip_value_max=1.0)
Since the model expects pixel values between 0 and 1, we will use this method to clip the values after every update.
Now I am going to define my training step.
@tf.function()
def train_step(image):
    with tf.GradientTape() as tape:
        outputs = extractor(image)
        loss = style_content_loss(outputs)
        loss += total_variation_weight * total_variation_loss(image)
    grad = tape.gradient(loss, image)
    opt.apply_gradients([(grad, image)])
    image.assign(clip_0_1(image))
Notice how we are finding the gradient and how it is being sent to our optimizer.
Finally, we will call this method.
start = time.time()

epochs = 1000
steps_per_epoch = 50

step = 0
for n in range(epochs):
    for m in range(steps_per_epoch):
        step += 1
        train_step(image)
        print(".", end='')
    display.clear_output(wait=True)
    imshow(image.read_value())
    plt.title("Train step: {}".format(step))
    plt.show()

end = time.time()
print("Total time: {:.1f}".format(end - start))
After executing this cell, you will see how the generated image changes as training progresses; the displayed image is refreshed at the end of every epoch.
After the complete process, the generated image will be the required content + style image.
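If you want to keep the result, here is a minimal sketch for saving it to disk (it assumes Pillow is available in your Colab environment, and 'stylized.png' is just a hypothetical output name):

from PIL import Image

final = np.array(image.read_value()[0] * 255.0, dtype=np.uint8)   # drop the batch dimension and scale to [0, 255]
Image.fromarray(final).save('stylized.png')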
So this was all about Neural Style Transfer. I hope you learned something new and that everything is now clear. I tried my best to convey all the information about NST through this blog, but if I still missed something, please mention it in the comments.
Don’t forget to clap if this was helpful.
Happy Learning!!