Neural Style Transfer VGG19

Nelson Punch
Software-Dev-Explore
9 min read · Jan 16, 2024

Introduction

Two techniques come up when it comes to styling an image: GAN and Neural Style Transfer. GAN stands for Generative Adversarial Network, a generative model that uses 2 networks competing with each other to achieve the style transfer outcome. Neural Style Transfer, on the other hand, uses a pre-trained model and an algorithm to produce the desired result.

Both approaches are quite different, but they belong to the same domain. GAN is the modern approach to generating styled images and it is much faster, but here I am going to take a look at Neural Style Transfer and learn how it works under the hood.

There is a paper about Neural Style Transfer, A Neural Algorithm of Artistic Style by Gatys et al., that introduces the technique.

Objective

The objective is this: given 2 images, one a content (original) image and the other a style image, apply the style image to the content image. The final produced image should keep the content image's content while carrying the particular style of the style image.

Style image here refers to an image painted by a painter (Picasso, Van Gogh…) with a certain unique style, whereas content image refers to any kind of image.

I will use the painting The Scream by Edvard Munch as the style image.

Style image

For the content image.

Content image

Final styled image.

Content image with style

This is the final styled image, and producing it is my primary objective.

VGG19

VGG19 is a computer vision model capable of classifying 1000 object categories. The model has been trained on millions of images. Like most classic computer vision models, VGG19 is based on a Convolutional Neural Network (CNN).

Convolutional Neural Network

A Convolutional Neural Network is able to learn features (patterns) from an image dataset in order to classify a number of different images.

A generic CNN consists of an input layer, a number of blocks, fully-connected layers and an output layer. Each block has 1 or more Convolutional layers and ReLU layers, and a Pooling layer always sits at the end of the block.

A typical Convolutional Neural Network

In the image, a Convolution layer, a ReLU layer and a Pooling layer together form a block. An image is passed through the blocks and then the fully connected layers in order to classify it. The Convolution layer is the key to the whole network because it can learn features from images. Here is a CNN explainer that explains CNNs in an interactive way.

  • Convolution layer: a layer that can learn features from an image
  • ReLU layer: an activation layer. There are many other activation functions, but ReLU is the most common one
  • Pooling layer: a layer that condenses the image into a smaller size while keeping as much information as possible from the original image.

A Convolution layer is always followed by a ReLU layer, and there is only 1 Pooling layer at the end of each block.
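To make the block structure concrete, here is a minimal PyTorch sketch of one such block (the channel counts and image size are arbitrary, just for illustration):

```python
import torch
import torch.nn as nn

# One generic CNN block: Convolution -> ReLU -> Convolution -> ReLU -> Pooling.
block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),  # the Pooling layer closes the block and halves the spatial size
)

x = torch.randn(1, 3, 224, 224)   # a dummy RGB image
print(block(x).shape)             # torch.Size([1, 64, 112, 112])
```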

VGG19 Network

VGG19

The difference between VGG19 and a typical CNN is that VGG19 includes more blocks, and each block contains at least 2 Convolutional layers. In addition it has 3 Fully-connected layers.

From the image I can see 5 blocks (green + blue as 1 block) in the entire network. The minimum number of Convolutional layers in a block is 2, for example the leftmost green square with conv1_1 and conv1_2. In conv1_1, the first number means the block and the second number means the layer. For example conv4_3 means Convolutional layer 3 in block 4.

ReLU layers are not shown in the image because a ReLU layer always follows a Convolutional layer.

Another thing is depth. Depth is the number of feature maps, so if the depth is 64 then the Convolutional layer will produce 64 feature maps.

  • Feature maps (depth): similar to the color channels of an image, but each one contains features (patterns) learned from the image. Each feature map is different from the others.

maxpool in the image is a type of Pooling layer.

Here is a snapshot of the feature maps from conv1_1 (the first layer in block 1) of VGG19 for a polar bear image.

And for conv5_1 (the first layer in block 5).

Features from the image become more aggregated and clearer toward the end of the network. This is how a CNN learns features to classify images.

The feature maps are an important element in Neural Style Transfer.

Counting the Convolutional layers from conv1_1 to conv5_4 gives 16, and adding the 3 Fully-connected layers makes 19. That is why it is named VGG19.

Neural Style Transfer

Here I am using VGG19 for Neural Style Transfer. 3 images will be passed through the VGG19 model, then the feature maps output by selected Convolutional layers are used to calculate the style and content loss. Finally, the generated image is updated using the style and content loss.

Generated image here means the final produced image with style, which is neither the style image nor the content image.

Neural Style Transfer

3 input images

These images will be passed into VGG19 model.

  • Style image: the image that provides the style
  • Content image: the image that provides the original content
  • Generated image: the combination of the style and content images

In the beginning, the generated image can either be a white noise image or a copy of the content image. It is updated toward the final image, a combination of style and content, throughout the training process.

Content loss

Get the feature maps from conv5_2 (layer 2 in block 5) for both the content image and the generated image, then calculate the content loss with MSE (Mean Square Error) loss.

Mean Square Error

Style loss

Compute the correlation across the different feature maps with a gram matrix, which encodes that correlation information.

First, get the feature maps from conv1_1, conv2_1, conv3_1, conv4_1 and conv5_1 for both the style image and the generated image. Second, calculate the gram matrix for each of them, then calculate the loss between the corresponding gram matrices with MSE. Finally, add up these losses. That sum is the style loss.

Gram Matrix

With the gram matrix, we are able to find the correlation between the different feature maps. Remember that when calculating the gram matrix across the different feature maps, every feature map needs to be flattened. For example, in order to calculate the gram matrix of feature maps of shape [32, 224, 224] (features, height, width), they need to be turned into shape [32, 50176] (features, height*width).
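A minimal sketch of this computation in PyTorch, using the shapes from the example above (normalizing by the total number of elements is one common convention, not the only one):

```python
import torch

def gram_matrix(feature_maps: torch.Tensor) -> torch.Tensor:
    # feature_maps: [features, height, width], e.g. [32, 224, 224]
    c, h, w = feature_maps.shape
    flattened = feature_maps.view(c, h * w)   # [32, 50176]
    gram = flattened @ flattened.t()          # [32, 32], correlation between feature maps
    return gram / (c * h * w)                 # normalize by the number of elements

features = torch.randn(32, 224, 224)
print(gram_matrix(features).shape)            # torch.Size([32, 32])
# The style loss for one layer is then the MSE between the gram matrix of the
# generated image's feature maps and that of the style image's feature maps.
```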

Implementation

I am using PyTorch for this task. Here is the code written in Colab.

The implementation is based on this article.

Setup

Import the necessary libraries, download the content and style images, and set up the device to work with.

Load and visualize image

Load and prepare our content and style images.

Define a function load_image. This function uses PyTorch's read_image utility to read images from file paths, transforms them, and returns a list of transformed images. In addition, the function removes the alpha color channel (line 35) from the input image, since we expect images in RGB, and adds 1 dimension (line 36) to the image; this extra dimension is the batch dimension.
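The exact function lives in the Colab notebook; a rough sketch of the same idea could look like the following (the file names and target size are placeholders of my own):

```python
import torch
from torchvision import transforms
from torchvision.io import read_image

def load_image(paths, size=(512, 512)):
    """Read images from file paths, resize them, drop any alpha channel
    and add a batch dimension."""
    transform = transforms.Compose([
        transforms.Resize(size),
        transforms.ConvertImageDtype(torch.float),  # uint8 [0, 255] -> float [0, 1]
    ])
    images = []
    for path in paths:
        img = read_image(path)           # [channels, height, width]
        img = img[:3, :, :]              # keep RGB only, drop the alpha channel if present
        img = transform(img)
        images.append(img.unsqueeze(0))  # add the batch dimension -> [1, 3, H, W]
    return images

content_img, style_img = load_image(["content.jpg", "style.jpg"])
```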

In order to visualize our prepared images.

Define a function show_images to display a set of PyTorch images. Line 31 is where this function swaps the dimensions of the image, because PyTorch's read_image utility returns images as [channel, height, width]; in order to display them properly we need to convert them to [height, width, channel].
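A simplified sketch of such a helper (the notebook's version handles its own figure layout; the titles and figure size here are arbitrary):

```python
import matplotlib.pyplot as plt

def show_images(images, titles=None):
    """Display a list of image tensors shaped [1, channels, height, width]."""
    fig, axes = plt.subplots(1, len(images), figsize=(5 * len(images), 5))
    for i, (ax, img) in enumerate(zip(axes, images)):
        # Drop the batch dimension and move channels last: [height, width, channels]
        ax.imshow(img.squeeze(0).permute(1, 2, 0).cpu().numpy())
        ax.axis("off")
        if titles:
            ax.set_title(titles[i])
    plt.show()

show_images([content_img, style_img], titles=["Content", "Style"])
```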

VGG19 model

We need to create our own model from VGG19 because some layers of VGG19 are not needed for this style transfer task. Besides, we need to create style loss and content loss layers and attach them after the chosen convolutional layers.

Load VGG19 model and freeze its layers.

Load the VGG19 model with the default pre-trained weights, keep only the convolutional part and drop the fully-connected layers with .features, switch it to evaluation mode with .eval(), then make sure the layers' gradients are disabled.
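A sketch of this step, assuming a recent torchvision with the weights enum API:

```python
import torch
from torchvision.models import vgg19, VGG19_Weights

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Keep only the convolutional part (.features), drop the classifier head,
# and switch to evaluation mode.
cnn = vgg19(weights=VGG19_Weights.DEFAULT).features.eval().to(device)

# Freeze the weights: only the generated image will be optimized, never the model.
for param in cnn.parameters():
    param.requires_grad_(False)
```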

Gram Matrix, Content loss layer, Style loss layer and Normalization layer

Gram Matrix

At line 12 we reshape the input tensor into a 2-dimensional matrix where the first dimension is the number of feature maps and the second dimension is the flattened features. At line 13 we calculate the gram matrix with matrix multiplication. Finally, at line 14 we normalize the gram matrix and return the result.

Content loss

We need to create a content loss layer to track the loss value.

In the initialization, we capture the target features, which are the output of the convolutional layer after the content image has been fed to the model. They will be used to calculate the loss.

In forward, we calculate the loss between the input and the target features with the MSE loss function.

input here means the feature maps from the convolutional layer after the input image has been fed to the model.
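A sketch of such a content loss layer, modeled on the PyTorch Neural Style Transfer tutorial this article follows (the class name and details are assumptions, not the notebook's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentLoss(nn.Module):
    """A transparent layer that records the content loss during the forward pass."""

    def __init__(self, target: torch.Tensor):
        super().__init__()
        # Feature maps of the content image at this layer; detached so the
        # target is treated as a constant, not as part of the graph.
        self.target = target.detach()

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        self.loss = F.mse_loss(input, self.target)
        return input  # pass the feature maps through unchanged
```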

Style loss

Similar to the content loss layer.

Instead of using the features directly, we calculate the gram matrix of the features and then compute the loss.
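A matching sketch of the style loss layer, reusing the imports and the gram_matrix helper from the earlier sketches (again an assumption of how the notebook's version looks):

```python
class StyleLoss(nn.Module):
    """Like ContentLoss, but compares gram matrices instead of raw feature maps."""

    def __init__(self, target_features: torch.Tensor):
        super().__init__()
        # target_features: [1, channels, height, width]; the batch size is always 1
        # here, so drop the batch dimension before computing the gram matrix.
        self.target_gram = gram_matrix(target_features.squeeze(0)).detach()

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        self.loss = F.mse_loss(gram_matrix(input.squeeze(0)), self.target_gram)
        return input
```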

Normalization

According to PyTorch’s Neural Style Transfer.

VGG networks are trained on images with each channel normalized by mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225]. We will use them to normalize the image before sending it into the network.
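A sketch of such a normalization layer using those values (buffers are used so the means and stds move to the GPU together with the layer):

```python
class Normalization(nn.Module):
    """Normalize an input image with the channel means and stds VGG19 was trained on."""

    def __init__(self, mean, std):
        super().__init__()
        # Shape [channels, 1, 1] so they broadcast over [batch, channels, height, width].
        self.register_buffer("mean", torch.tensor(mean).view(-1, 1, 1))
        self.register_buffer("std", torch.tensor(std).view(-1, 1, 1))

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        return (img - self.mean) / self.std

normalization = Normalization(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]).to(device)
```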

Create the model

Put everything together to create our model.

At lines 30 ~ 49 we extract each layer from the VGG19 model and add it to our new model, tracking their names along the way. At lines 52 ~ 63 we add a content loss layer or a style loss layer whenever the current layer's name matches one of the specified names in a list.

Be aware of line 39. The ReLU layer is recreated with inplace=False to prevent a runtime error, because VGG19's ReLU layers have inplace=True.
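A sketch of the whole assembly step. The helper name build_model and the layer-naming scheme are my own, chosen to match the conv names used earlier in the article; the notebook's code differs in its exact lines:

```python
import torch.nn as nn

# Layers named in the article: content loss after conv5_2, style loss after the
# first conv layer of every block.
content_layers = ["conv5_2"]
style_layers = ["conv1_1", "conv2_1", "conv3_1", "conv4_1", "conv5_1"]

def build_model(cnn, normalization, content_img, style_img):
    """Rebuild VGG19 layer by layer and insert loss layers after the chosen convs."""
    content_losses, style_losses = [], []
    model = nn.Sequential(normalization)
    block, conv = 1, 0

    for layer in cnn.children():
        if isinstance(layer, nn.Conv2d):
            conv += 1
            name = f"conv{block}_{conv}"
        elif isinstance(layer, nn.ReLU):
            name = f"relu{block}_{conv}"
            layer = nn.ReLU(inplace=False)  # VGG19 uses inplace=True, which breaks the loss layers
        elif isinstance(layer, nn.MaxPool2d):
            name = f"pool{block}"
            block, conv = block + 1, 0
        else:
            name = layer.__class__.__name__
        model.add_module(name, layer)

        if name in content_layers:
            # Target = feature maps of the content image at this point in the network.
            target = model(content_img).detach()
            content_loss = ContentLoss(target)
            model.add_module(f"content_loss_{name}", content_loss)
            content_losses.append(content_loss)
        if name in style_layers:
            target = model(style_img).detach()
            style_loss = StyleLoss(target)
            model.add_module(f"style_loss_{name}", style_loss)
            style_losses.append(style_loss)

    return model, content_losses, style_losses

model, content_losses, style_losses = build_model(
    cnn, normalization, content_img.to(device), style_img.to(device)
)
```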

Visualize feature maps

This is for visualizing the output feature maps of a particular convolutional layer, as mentioned earlier in the article.

Prepare model and forward hook.

A forward hook is a function that can be attached to a particular layer in a model and will be called after the layer’s forward method is called.

The get_feature_hook function returns a hook function. In the function get_feature_maps we register the forward hook on any specified layer of a model.
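A sketch of what these two functions might look like; their exact bodies live in the notebook, and this version simply stores the layer output in a plain dictionary:

```python
import torch

feature_maps = {}

def get_feature_hook(name):
    """Return a hook that stores the layer's output under the given name."""
    def hook(module, inputs, output):
        feature_maps[name] = output.detach()
    return hook

def get_feature_maps(model, layer_name, image):
    """Register a forward hook on one layer, run the image through the model,
    then return that layer's feature maps."""
    layer = dict(model.named_modules())[layer_name]
    handle = layer.register_forward_hook(get_feature_hook(layer_name))
    with torch.no_grad():
        model(image)
    handle.remove()  # detach the hook so it does not fire on later forward passes
    return feature_maps[layer_name]

conv1_1_maps = get_feature_maps(model, "conv1_1", content_img.to(device))
print(conv1_1_maps.shape)  # e.g. torch.Size([1, 64, 512, 512])
```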

Visualize Conv1_1

conv1_1 feature maps

Visualize Conv5_1

conv5_1 feature maps

Training

For Neural Style Transfer, we use the optimizer to optimize the input image, not our model. The only thing we do in training is pass the input image into the model, calculate the content and style losses, then update the input image.
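A sketch of one training loop with Adam, assuming the model and loss layers from the earlier sketches; the weights, learning rate and number of steps here are illustrative only:

```python
import torch
import torch.optim as optim

# The optimizer updates the pixels of the input image, not the model weights.
input_img = content_img.clone().to(device).requires_grad_(True)
optimizer = optim.Adam([input_img], lr=0.01)

style_weight, content_weight = 1_000_000, 1

for step in range(300):
    optimizer.zero_grad()
    model(input_img)  # the forward pass fills in the .loss of every loss layer

    style_loss = style_weight * sum(sl.loss for sl in style_losses)
    content_loss = content_weight * sum(cl.loss for cl in content_losses)
    total_loss = style_loss + content_loss

    total_loss.backward()
    optimizer.step()

    with torch.no_grad():
        input_img.clamp_(0, 1)  # keep pixel values in a valid image range
```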

We will train with 2 different optimizers, Adam and LBFGS. Besides, we will save the image at intervals during training.

Setting

We declare some variables for training.

Save image at interval

We define a function to save the image during training.

Train with Adam optimizer

Adam training
Adam final

Train with LBFGS optimizer

At line 87 the input image must be contiguous, otherwise it will cause a runtime error.
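A rough sketch of the LBFGS loop, reusing the variables from the Adam sketch; unlike Adam, LBFGS needs a closure that re-evaluates the loss, and the step count is again illustrative:

```python
# LBFGS may evaluate the loss several times per step, so it needs a closure.
input_img = content_img.clone().to(device).contiguous().requires_grad_(True)  # must be contiguous
optimizer = optim.LBFGS([input_img])

def closure():
    with torch.no_grad():
        input_img.clamp_(0, 1)
    optimizer.zero_grad()
    model(input_img)
    loss = style_weight * sum(sl.loss for sl in style_losses) \
         + content_weight * sum(cl.loss for cl in content_losses)
    loss.backward()
    return loss

for step in range(20):
    optimizer.step(closure)
```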

LBFGS training
LBFGS final

Code

Source code in Colab.

Conclusion

This is Neural Style Transfer.

The key to applying style to an image is the Gram Matrix. We can observe the content loss going up while the style loss goes down during training. In addition, we can select different convolutional layers from different blocks for experiments.

The Neural Style Transfer approach is quite different from GAN. It needs a pair of images, whereas GAN doesn't. The weights in Neural Style Transfer can be adjusted to lean more toward either the content or the style.

If you have a small set of images and need more control, then Neural Style Transfer can be an option. GAN is more general and is used with large sets of images.
