Basic Intuition And Guide to Neural Style Transfer

Sushant Gautam
7 min read · Jan 7, 2024

A simple explanation of the idea behind neural style transfer, with an implementation in PyTorch.

Photo from Unsplash

Introduction

Neural Style Transfer (NST for short) is an interesting idea in which a neural network learns to transfer style, i.e. it learns how to "paint" and generate a new image with a distinctive look.

The concept of style transfer is to transfer a style or texture onto an input image; when this is done with a deep neural network, the method is called Neural Style Transfer. In other words, NST takes two images, a content image and a style reference image (which can be an artwork by a famous painter or simply a new texture), and blends them together so that the output looks like the content image "painted" in the style of the reference image. The picture below shows the output of NST.

Neural Style Transfer Result

Let’s dive deep into this!

Actually, this is slightly different from how neural networks are usually trained. Normally, the weights of the network are updated through backpropagation: the image is the input, and the network learns to recognize it (in image recognition with a CNN, for example).

In NST, however, we update the pixels of an image (the pixels of the result image play the role of the weights to be updated) while the weights of the pre-trained network stay frozen. We keep updating the pixels of the final image according to the loss until we reach the desired output, i.e. minimum loss.

Update Image Pixels and freeze pre-trained Network weights

A pre-trained network (VGG, ResNet) is used to extract features from the content image and the style image, and an optimization mechanism is used to update the pixels of the style-transferred image, i.e. the final output.

That is also why NST works on a single content image and a single style image at a time rather than on a dataset: there is a one-to-one correspondence between the style and content image, as sketched below.
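As a minimal PyTorch sketch of this idea (the tensor shape and learning rate below are illustrative, not from the article), the generated image is the only thing the optimizer touches:

```python
import torch
import torchvision.models as models

# The pre-trained network's weights are frozen; the generated image's
# pixels are the only "parameters" handed to the optimizer.
vgg = models.vgg19(pretrained=True).features.eval()
for param in vgg.parameters():
    param.requires_grad_(False)              # freeze the pre-trained weights

content_img = torch.rand(1, 3, 224, 224)     # placeholder for a real content image
generated = content_img.clone().requires_grad_(True)  # the pixels we will optimize

optimizer = torch.optim.Adam([generated], lr=0.01)    # optimizes pixels, not weights
```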

Components:

  • Pre-trained CNN on images, such as VGG or ResNet.
VGG architecture
  • Content Image: the image onto which we want to transfer the style.
Content Image
  • Style Image: the image whose style is to be transferred.
Style Image
  • Generated Image: the image that contains the final result (its pixels act as the "weights" being optimized).
Generated Image

How does it work?

We pass each of the three images through a pre-trained network, say VGG (with frozen weights): the content image, the style image, and the generated image. The main goal is to preserve details of both the content image and the style image in the generated image.

  1. Copy the content-image details to the new image by minimizing the content distance from the content image.
  2. Copy the style-image details to the new image by minimizing the style distance from the style image.
  3. Minimizing a distance means minimizing a loss between the respective image and the generated image.

Important Steps involved:

  1. Compute features with a pre-trained model such as VGG or ResNet for each of the content, style, and generated images.
  2. Compute the content loss and the style loss.
  3. Compute the total combined loss.
  4. Backpropagate the gradient to update the generated image's pixels (its "weights"), while the pre-trained model's weights stay frozen.

STEP 1: Compute Features
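The article's code for this step is not reproduced here; below is a hedged sketch that collects activations from a handful of VGG-19 layers. The helper name `extract_features` and the layer indices are assumptions, chosen to roughly match conv1_1 through conv5_1 (for style) and conv4_2 (for content).

```python
# Assumed layer indices into vgg19().features: roughly conv1_1 ... conv5_1 and conv4_2.
STYLE_LAYERS = [0, 5, 10, 19, 28]
CONTENT_LAYER = 21

def extract_features(image, model):
    """Run the image through the frozen network and keep the feature maps we need."""
    features = {}
    x = image
    for idx, layer in enumerate(model):
        x = layer(x)
        if idx in STYLE_LAYERS or idx == CONTENT_LAYER:
            features[idx] = x
    return features

content_features = extract_features(content_img, vgg)
```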

STEP 2: Compute Content And Style Loss

Content Loss:

Content Loss

We can copy the details of the content image to the generated image through a loss function: the Mean Squared Error (MSE) between the deep (last-layer) feature maps of the content image and of the generated image. Minimizing this distance means that the details of the content image are being transferred to the generated image.

MSE: Content Loss
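A minimal sketch of that loss in PyTorch, reusing the hypothetical `extract_features` output and `CONTENT_LAYER` index assumed in the previous step:

```python
import torch.nn.functional as F

def content_loss(content_features, generated_features, layer=CONTENT_LAYER):
    """MSE between the deep feature maps of the content and generated images."""
    return F.mse_loss(generated_features[layer], content_features[layer])
```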

Why do we compare only the last layer's feature maps of the content image?

In a deep convolutional network, different layers learn different features. The first convolutional layers learn features such as edges and simple textures. Later convolutional layers learn more complex textures and patterns. The last convolutional layers learn features such as objects or parts of objects. Since we care about the important object-level characteristics of the content, we compare the deep feature maps when generating the image.

Convolutional outputs of five layers: Source

Style Loss:

Its main objective is to enforce the details of the style image in the generated image. For that, the correlations of activations in the generated image should be similar to those in the style image; these correlations are measured by a correlation matrix called the Gram matrix.

Gram Matrix (Style Matrix) Concept

Process: Gram Matrix Computation

Compute Gram Matrix

The Gram matrix is used to capture the "distribution of features" of a set of feature maps; it is also called the style matrix. The picture above depicts that concept.

The Gram matrix is a correlation operation, i.e. a dot product of the feature maps at a layer, that summarizes which activations co-occur. Texture (style) has strong locality, and when we capture activations that co-occur a lot, we capture that locality. What is being measured is whether, at a particular pixel position, feature #F1 tends to co-occur with feature #F2.

By matching this matrix, we push the correlated activations of the generated image closer to those of the target (the style we want to capture) and thus retrieve the style. The Gram matrix is also position invariant: it is based on statistics accumulated over all positions in the feature maps.
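As a small sketch, assuming the usual flatten-and-multiply formulation (the normalization constant is a common choice, not necessarily the article's):

```python
import torch

def gram_matrix(feature_map):
    """Channel-to-channel correlations of a (batch, channels, height, width) feature map."""
    b, c, h, w = feature_map.size()
    flat = feature_map.view(b, c, h * w)          # flatten the spatial dimensions
    gram = torch.bmm(flat, flat.transpose(1, 2))  # dot products between channel activations
    return gram / (c * h * w)                     # normalize by the number of elements
```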

Steps used in Style Loss:

  1. First, compute the Gram matrix of the feature map at layer l of the style image and the Gram matrix of the feature map at layer l of the generated image.
  2. Find the Mean Squared Error between the Gram matrix of the style image and the Gram matrix of the generated image; this is the style loss at that layer.
  3. Compute the style loss over all chosen layers of the style-image and generated-image feature maps, as shown in the figure below.
MSE between generated and style image gram matrix.

Similarly, compute the style loss in all layers so that every level of stylistic detail, as explained above, is preserved in the generated image.

WHY?

We need to transfer every level of detail of the style image to the generated image, so we compute the style loss from each layer of the encoder (the pre-trained CNN model), where different kinds of details are found.

Style Loss in all layers

Code:
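The article's original code block is not reproduced here; the following is a sketch that reuses the hypothetical `gram_matrix` helper and `STYLE_LAYERS` indices defined above:

```python
import torch.nn.functional as F

def style_loss(style_features, generated_features, layers=STYLE_LAYERS):
    """Sum of MSEs between Gram matrices of style and generated feature maps over all chosen layers."""
    loss = 0.0
    for layer in layers:
        gram_style = gram_matrix(style_features[layer])
        gram_generated = gram_matrix(generated_features[layer])
        loss = loss + F.mse_loss(gram_generated, gram_style)
    return loss
```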

STEP 3: Compute total combined loss.

Finally, the total loss is calculated as a weighted sum: total loss = α × content loss + β × style loss. The coefficients α and β control how much content and style appear in the generated image. You can also see a nice visualization of the effects of different α and β values in the paper. Our main job is to minimize this total loss with an optimizer such as Adam, so that the generated image carries details of both the content image and the style image.

Combine Total Loss
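A sketch of the weighted combination, reusing the content and style loss helpers above; the values of α (`ALPHA`) and β (`BETA`) below are illustrative, not the article's:

```python
ALPHA = 1.0   # content weight (illustrative)
BETA = 1e6    # style weight (illustrative)

def total_loss(content_features, style_features, generated_features):
    """Weighted sum: alpha * content loss + beta * style loss."""
    return (ALPHA * content_loss(content_features, generated_features)
            + BETA * style_loss(style_features, generated_features))
```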

STEP 4: Backpropagate the gradient to update the generated image's pixels, while the pre-trained model's weights stay frozen.

The key point is that backpropagation of the gradient updates the "weights" of the generated image, i.e. its pixel values. The following picture depicts the mechanism.

Total Loss with Backpropagation

Code:
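The article's training code is not reproduced here; the loop below is a hedged sketch that reuses the hypothetical helpers from the earlier snippets (`style_img`, the iteration count, and the learning rate are assumptions):

```python
import torch

style_img = torch.rand(1, 3, 224, 224)   # placeholder for a real style image
generated = content_img.clone().requires_grad_(True)
optimizer = torch.optim.Adam([generated], lr=0.01)

content_features = extract_features(content_img, vgg)
style_features = extract_features(style_img, vgg)

for step in range(10_000):
    optimizer.zero_grad()
    generated_features = extract_features(generated, vgg)
    loss = total_loss(content_features, style_features, generated_features)
    loss.backward()              # gradients flow into the image pixels, not the network
    optimizer.step()             # update the generated image's pixels
    with torch.no_grad():
        generated.clamp_(0, 1)   # keep pixel values in a valid range
```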

After training for, say, 10,000 iterations, the network nicely transfers the style into the generated image. When you run the above code for a single image pair, the result looks something like this:

The output of NST after training

That’s how Neural Style Transfer works.

Conclusion

In this tutorial, you learned how Neural Style Transfer works and how to implement it with PyTorch. Neural style transfer combines features of a content image and a style reference image into a new piece of art. It differs from the usual neural-network training in that the pixels of the generated image get updated while the weights of the pre-trained network stay frozen. First, features are extracted from the content, style, and generated images. Initially, the generated image is a clone of the content image, and during training its pixels are updated to include the style image's details. You also saw that two losses, the content loss and the style loss, combined into a total loss, help us achieve exactly what we want: the combined loss is a weighted sum of the content and style losses, and during training, gradient backpropagation updates the pixels of the generated image.

The code for this tutorial is available here.

Further Reading

[1] Original Paper: A Neural Algorithm of Artistic Style

[2] TensorFlow Tutorial on Neural Style Transfer

[3] Fast Neural Style Transfer
