Neural Style Transfer (Part 1)
Convolutional Neural Networks (CNNs) have played a huge role in solving image-related problems with deep learning. Popular use cases include image recognition/classification, object detection, image generation, and many more.
A simple (and very common) structure of a deep learning architecture for image problems can be seen in the image below.
The diagram represents two sections. The first consists of convolutional layers (hence the name CNN), which are used for extracting features from an input image, while the second consists of fully connected layers, where the actual classification happens at the output.
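To make the two sections concrete, here is a minimal, purely illustrative PyTorch sketch (the layer sizes are arbitrary, and this is not the network used later in this post):

```python
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Illustrative CNN: convolutional feature extractor + fully connected classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        # Section 1: convolutional layers extract features from the input image.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Section 2: fully connected layers perform the classification.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 56 * 56, num_classes),  # assumes 224x224 RGB inputs
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```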
What is Neural Style Transfer?
According to TensorFlow’s official documentation,
Neural style transfer is an optimization technique used to take two images — a content image and a style reference image (such as an artwork by a famous painter) — and blend them together so the output image looks like the content image, but “painted” in the style of the style reference image.
Furthermore, they explain:
This is implemented by optimizing the output image to match the content statistics of the content image and the style statistics of the style reference image. These statistics are extracted from the images using a convolutional network.
Now we can see where the Convolutional Neural Network we discussed comes in :).
From the explanation above, it is obvious that in the case of Neural Style Transfer, we are NOT building a classifier, so we can modify the diagram of the CNN architecture to something like this.
Understanding the Architecture
Before we dive into the code and build our own style transfer network, it is very important to understand the architecture used.
From the paper Image Style Transfer Using Convolutional Neural Networks, style transfer uses the features of the 19-layer VGG Network. The network comprises a series of convolutional and pooling layers, plus some fully connected layers (which we do not need).
The image above, taken from a Udacity deep learning notebook on style transfer, explains how the VGG network is structured. The convolutional layers are named by stack and their order within the stack: conv1_1 is the first convolutional layer in the first stack (the first layer an image is passed through), conv2_1 is the first convolutional layer in the second stack, and so on. The deepest convolutional layer in the network is conv5_4.
Understanding the method (Plan of Attack)
First, we all have to agree that images have both content and style. See the image below.
In simple terms, we want to pass in two images, extract the content of the first image, extract the style of the second image and merge them together to form a target image.
Content Extraction
According to the paper,
We reconstruct the input image from layers ‘conv1_2’ (a), ‘conv2_2’ (b), ‘conv3_2’ (c), ‘conv4_2’ (d) and ‘conv5_2’ (e) of the original VGG-Network. We find that reconstruction from lower layers is almost perfect. In higher layers of the network, detailed pixel information is lost while the high-level content of the image is preserved.
This makes sense because the essence of a convolutional network is to extract features so that the fully connected layers can identify what is in the image. The “high-level content” is exactly that: as the input image passes through the network, the earlier layers pick up low-level features, while the deeper layers capture the high-level content, which is what we want.
In our case, we will make use of the conv4_2 layer.
Style Extraction
According to the paper,
The style representation computes correlations between the different features in different layers of the CNN. We reconstruct the style of the input image from a style representation built on different subsets of CNN layers: ‘conv1_1’ (a), ‘conv1_1’ and ‘conv2_1’ (b), ‘conv1_1’, ‘conv2_1’ and ‘conv3_1’ (c), ‘conv1_1’, ‘conv2_1’, ‘conv3_1’ and ‘conv4_1’ (d), ‘conv1_1’, ‘conv2_1’, ‘conv3_1’, ‘conv4_1’ and ‘conv5_1’ (e).
Unlike content extraction, style extraction involves different layers of the CNN architecture, as well as finding the correlations between the features in each layer.
The question is: how do we find the correlation between these features?
Gram Matrix to the rescue!
A Gram matrix is used to determine whether two matrices (in this case, filters) are correlated. It is obtained by calculating the dot product between the flattened vectors of the two filters; the matrix of these dot products is called the Gram matrix.
If the dot product between the two filters is large, the two are said to be correlated; if it is small, they are uncorrelated.
Since we are finding the correlations between the different features in different layers, we calculate the gram matrix by multiplying the output of each style layer by its transpose.
The CoDe!
Probably what we’ve been waiting for :)
The first step is to load the VGG 19 model. My implementation is in PyTorch (sorry Tf.keras guys), and from torchvision we can import the VGG 19 model.
Next, we want to take only the feature portion of the VGG 19 model and freeze all its parameters, since we will not be updating the network's weights.
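A sketch of how this might look (newer torchvision releases replace the pretrained argument with weights=, but the idea is the same):

```python
import torch
from torchvision import models

# Load the pretrained VGG 19 model and keep only its convolutional "features"
# portion; the fully connected classifier part is not needed.
vgg = models.vgg19(pretrained=True).features

# Freeze all parameters -- we optimize the target image, not the network.
for param in vgg.parameters():
    param.requires_grad_(False)

# Move the model to the GPU if one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
vgg.to(device)

# print(vgg)  # shows each layer of the architecture, indexed by number
```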
If you print the variable vgg, you will see a summary of the VGG 19 architecture, with each layer represented by a number.
The next thing we want to do is get the features from each of the layers mentioned in the paper.
The layers dictionary is used to store the numeric index of each layer we will use. Note that key 21 (conv4_2) is the content representation.
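Here is a sketch of that dictionary, using the numeric indices torchvision assigns to these convolutional layers inside vgg:

```python
# Map the numeric layer indices in `vgg` to the layer names used in the paper.
layers = {'0':  'conv1_1',
          '5':  'conv2_1',
          '10': 'conv3_1',
          '19': 'conv4_1',
          '21': 'conv4_2',   # content representation
          '28': 'conv5_1'}
```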
Now we can loop through and get the output (features) at each layer for both the content and the style image.
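A sketch of a helper that does this; it assumes content and style are already loaded, preprocessed image tensors on the same device as vgg (image loading and normalization are not shown here):

```python
def get_features(image, model, layers):
    """Run an image through the model and collect the outputs of the layers
    listed in the `layers` dictionary, keyed by their paper-style names."""
    features = {}
    x = image
    for name, layer in model.named_children():  # iterates layers in order: '0', '1', ...
        x = layer(x)
        if name in layers:
            features[layers[name]] = x
    return features

# Hypothetical usage, assuming `content` and `style` are preprocessed image tensors:
content_features = get_features(content, vgg, layers)
style_features = get_features(style, vgg, layers)
```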
That is everything we need for the content according to the paper, but not for the style. We need to write a function that takes in the features of each layer and computes the Gram matrix.
The Gram matrix of a convolutional layer can be calculated as follows:
- Get the depth, height, and width of the tensor using batch_size, d, h, w = tensor.size().
- Reshape that tensor so that the spatial dimensions are flattened.
- Calculate the Gram matrix by multiplying the reshaped tensor by its transpose, as shown in the sketch below.
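Putting those steps together, the function might look like this (it assumes a batch size of 1, which is all we need here):

```python
def gram_matrix(tensor):
    """Calculate the Gram matrix of a convolutional layer's output."""
    # Get the batch size, depth (number of feature maps), height and width.
    batch_size, d, h, w = tensor.size()
    # Flatten the spatial dimensions: one row per feature map (batch size of 1 assumed).
    tensor = tensor.view(d, h * w)
    # Multiply the reshaped tensor by its transpose.
    gram = torch.mm(tensor, tensor.t())
    return gram
```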
Content and Style Weight
From the paper, we define an alpha (content_weight) and a beta (style_weight). This ratio affects how stylized your final image is: a larger content weight means more of the content image shows up in the final image, and vice versa.
It is best to use a larger weight value for the style. Udacity even recommended that you leave the content_weight = 1 and set the style_weight to achieve the ratio you want.
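For example (the style_weight value here is just an illustrative starting point):

```python
content_weight = 1    # alpha
style_weight = 1e6    # beta -- illustrative value; tune the ratio to taste
```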
Also, we can define weights for the different layers used in getting the style from the style image. We would want larger weights for the earlier layers, since pixel information gets lost as we go deeper into the network.
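A sketch of such per-layer weights; the exact values are illustrative and worth experimenting with:

```python
# Larger weights for earlier layers, smaller weights for deeper ones.
style_weights = {'conv1_1': 1.0,
                 'conv2_1': 0.75,
                 'conv3_1': 0.2,
                 'conv4_1': 0.2,
                 'conv5_1': 0.2}
```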
Calculating the Losses
Content Loss
The content loss is the mean squared difference between the target and content features at layer conv4_2.
content_loss = torch.mean((target_features['conv4_2'] - content_features['conv4_2'])**2)
Style Loss
We calculate the Gram matrix for the target image and the style image at each of the layers represented in the style_weights dictionary and compare the Gram matrices, calculating the layer_style_loss. We then sum the losses at these layers to obtain the total style loss.
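A sketch of this computation, wrapped in a helper function (the function name is my own) so it can be reused in the optimization loop later:

```python
def compute_style_loss(target_features, style_features, style_weights):
    """Sum the weighted mean squared differences between the Gram matrices
    of the target and style features at each style layer."""
    style_loss = 0
    for layer in style_weights:
        target_feature = target_features[layer]
        _, d, h, w = target_feature.shape
        target_gram = gram_matrix(target_feature)
        style_gram = gram_matrix(style_features[layer])
        layer_style_loss = style_weights[layer] * torch.mean((target_gram - style_gram) ** 2)
        # Normalize by the size of the feature map at this layer.
        style_loss += layer_style_loss / (d * h * w)
    return style_loss
```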
Total Loss
Finally, we compute the total loss by adding up the style and content losses, weighted by the specified alpha and beta.
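In code, this is just the weighted sum:

```python
total_loss = content_weight * content_loss + style_weight * style_loss
```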
Optimization
Since we are interested in modifying the target image so that it has the content of the content image and the style of the style image, we optimize the target image itself.
Creating a target image
When creating the target image, it is best to start from a copy of the content image and then adjust its style iteratively:
target = content.clone().requires_grad_(True).to(device)
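A sketch of the optimization loop, assuming the Adam optimizer with illustrative values for the learning rate and number of steps:

```python
import torch.optim as optim

# We optimize the pixels of the target image, not the network weights.
optimizer = optim.Adam([target], lr=0.003)   # illustrative learning rate
steps = 2000                                 # illustrative number of iterations

for step in range(1, steps + 1):
    # Features of the current target image at the chosen layers.
    target_features = get_features(target, vgg, layers)

    # Content loss at conv4_2, style loss over the style layers (helper above),
    # and the weighted total loss.
    content_loss = torch.mean(
        (target_features['conv4_2'] - content_features['conv4_2']) ** 2)
    style_loss = compute_style_loss(target_features, style_features, style_weights)
    total_loss = content_weight * content_loss + style_weight * style_loss

    # Backpropagate into the target image and update it.
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
```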
Sample Output
After 2000 iterations, I was able to achieve this!
If you have any questions or concerns, let me know in the comments section. Access the full code here.
Thanks!
References