Style Transfer using Neural Nets

Shiv Vignesh
Published in Analytics Vidhya · 11 min read · Jul 28, 2020

Out of the plethora of applications that implement AI & the numerous day-to-day problems that have been solved by applying it, style transfer is one of the most fascinating & innovative ideas anyone has come up with. Legendary artists like Vincent van Gogh, Leonardo da Vinci, Pablo Picasso & Michelangelo produced masterpieces that still live with us to this day. The works of these artists depict artistic versions of everyday life, historic events & religious themes from the perspective of each artist. Their works are unique & have a definitive, characteristic style of painting, such as the use of colour, tone, strokes, texture, shapes & the level of abstraction, which makes their style difficult for ordinary people to replicate. Now imagine applying that artistic style to a picture of New York’s skyline, an aerial photograph of Marine Drive in Mumbai or a photograph of a person.

One of the earliest works, A Neural Algorithm of Artistic Style by Gatys et al., successfully transferred style from one image to another. Given two images, the artistic image & the image to be styled, the neural network can detect the style of the artistic image & apply it to the other image.

Neural style transfer was so groundbreaking that a group of people made a fortune by selling a painting generated by a GAN (a type of deep neural network) for nearly half a million dollars ($432,500) at Christie’s in New York, one of the most prestigious auction houses in the world.

What is Neural Style Transfer?

Neural style transfer algorithms generate an artistic version of an image by manipulating it to adopt the style & appearance of another image. The algorithm applies an optimization technique that deals with three different images:

  1. Content image.
  2. Style image.
  3. Generated output image.

Content Image: The input image we want to extract the content from & transfer the style to. As the name indicates, this image holds the base content, such as the buildings or the person that has to be styled.

Style Image: The image from which the DNN captures the style & applies it to the content.

Generated image: The output image, a styled version of the content image. It incorporates the content (buildings/person) from the content image & captures the style from the style image.

Unlike classical deep learning tasks, the deep neural network isn’t trained at all; instead, the algorithm optimizes the generated output image so that it looks more & more like an artistic version of the content image. The weights & biases of the DNN stay frozen while the error between the generated image & the content image, & between the generated image & the style image, is reduced.

Getting Started

As discussed, the deep neural network is not trained to rectify the artistic output image. The convolutional neural network is only used to extract the content features & the artistic style (feature maps) from the two input images. A pre-trained state-of-the-art network like VGG19, trained on the ImageNet dataset, is used for this purpose, & the network weights/kernels remain fixed throughout the process.

The idea behind using a pre-trained network is that its kernels can capture the content features & style features at some level of the network. A convolutional neural network extracts features from an input image as it is propagated forward, starting with simple primitive features at the lower layers & mapping complex patterns & designs at the deeper layers. If you don’t have a clear understanding of how CNNs work, feel free to read The world through the eyes of CNN.

By feeding in the content image & style image to the pre-trained VGG network, we obtain the feature maps of content & style images. The output feature maps of the content image represent the entity of the image (building, car, person etc) & the output feature maps of style image represent the artistic elements of the image such as the brush strokes, the mixture of colours, texture, tone etc.

Content image feature maps are extracted from the deeper layers of the network as they replicate complex patterns, objects in foreground & background or the entity in the image itself.

Style image feature maps are obtained from the intermediate & lower layers of the network, which capture the texture & patterns in the image; in our case, that would be the brush strokes, the combination of colours etc.

As discussed, in neural style transfer the convolutional neural network is not trained. Training the network is basically calculating the loss or error presented by the network & applying an optimization algorithm to tweak the network weights & biases in order to reduce the error of the network.

The output image is initialized with random pixel values, i.e. noise. Its feature maps are compared with the content & style feature maps in a loss function to formulate the loss/error.

The pixel values of the output image are modified by using an optimization algorithm which minimizes the loss/error wrt output image pixels at each iteration.

In simple terms, neural style transfer keeps the weights & biases of the network fixed &, instead, modifies the output image iteratively by minimizing the loss or cost function at each iteration.
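Here is a minimal PyTorch sketch of that idea (the names below are placeholders for illustration, not the ones used in the full implementation later in this post): the network parameters are frozen & the optimizer is handed only the pixels of the generated image.

import torch
from torchvision import models

sketch_vgg = models.vgg19()                     # pre-trained ImageNet weights would be loaded as in step 6 below
for p in sketch_vgg.parameters():
    p.requires_grad_(False)                     # the network weights & biases stay frozen

content_img = torch.rand(1, 3, 400, 600)        # placeholder for a real content image tensor
generated = torch.rand_like(content_img).requires_grad_(True)   # the image being optimized
optimizer = torch.optim.Adam([generated], lr=0.01)              # gradients flow to the pixels, not the weights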

Note: The output image is also propagated through the network to obtain its corresponding target/output content features & target/output style features.
These target/output feature maps are compared with input content & style feature maps representation in the loss function.

How is the artistic output generated?

The loss function is composed of a content loss term & a style loss term, which determine how far off the output artistic image is in terms of incorporating the content & the style elements from the respective content & style feature maps.

As discussed previously, we obtained feature maps for the content image & style image from different stages of the network. These feature maps help determine content loss & style loss. The total loss is the sum of content & style loss.

Content Loss

It is based on the intuition that images with similar content will have similar feature map representation at the deeper layers of the network. For this reason, we propagate the output image (initially with random noise) through the VGG19 network & obtain target/output image feature maps at the same depth from which content image feature maps are produced.

Content loss estimates the mean squared error between the feature maps of content image & the output target image.

  • P^l — feature map of the original image.
  • F^l — feature map of the generated image.
  • l — layer at which content features are extracted.
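Putting these symbols together, the content loss at layer l can be written as follows (this is the formulation from Gatys et al.; the implementation below uses the mean of the squared differences instead of the halved sum, which only changes the scale of the loss):

\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left( F^{l}_{ij} - P^{l}_{ij} \right)^{2}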

If the content image & generated image have similar feature representation at deeper layers, then the content loss value will be minimum.

Content loss calculates the pixel-by-pixel difference between output & content input image feature maps.

Style Loss

Style loss is similar to content loss, with a twist in the way the error is estimated. The process of calculating style loss is:

  • Unlike the content features, the style (artistic) input image feature maps are obtained from several layers of the CNN (the specified style layers). Thus, the corresponding target-style features for the output image are also obtained from the same layers.
  • We obtain the gram matrix representation of style feature maps at each layer.

What is Gram Matrix representation?

The paper describes style information as the amount of correlation between style feature maps, & the Gram matrix captures precisely this: it measures the correlation between the feature maps of a layer l.

This correlation between feature maps is calculated for all the specified style layers for input style image & output image.
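Concretely, if F^l_{ik} is the activation of the i-th feature map of layer l, flattened into a vector & indexed by position k, the Gram matrix entries are the inner products between pairs of feature maps:

G^{l}_{ij} = \sum_{k} F^{l}_{ik} \, F^{l}_{jk}

A large value of G^l_{ij} means feature maps i & j tend to activate together; these co-activation patterns are what the paper treats as style, independent of where in the image they occur. This is exactly what the gram_matrix function in the code below computes.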

Calculating the style loss

The style loss is defined as the squared difference between the Gram matrix representations of the generated image’s feature maps & the input style image’s feature maps.

A^l — gram matrix representation of input style image.

G^l — gram matrix of the generated image at a layer l.

N — number of feature maps or the number of filters at layer l.

M — product of the height & width of the feature maps at layer l.
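With these symbols, the style error contributed by a single layer l is (again following the paper; the code below normalizes slightly differently, using a mean & dividing by d*h*w, which only changes the scale):

E_{l} = \frac{1}{4 N^{2} M^{2}} \sum_{i,j} \left( G^{l}_{ij} - A^{l}_{ij} \right)^{2}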

The total style loss is the linear weighted combination of error at each style layer.

E_l — style error at layer l.

w_l — weight contribution of each layer. This controls the extent to which a particular layer contributes to the loss.
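So the total style loss is:

\mathcal{L}_{style} = \sum_{l} w_{l} \, E_{l}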

Total Loss

The total loss is the sum of content loss & style loss. How does this loss function describe Neural Style Transfer?

The content of the generated image must be similar to the content of the content image, & the style of the generated image must be similar to the style of the style image. Alpha & beta are weights that control how much the content representation & the style representation contribute to the final generated image.
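In equation form:

\mathcal{L}_{total} = \alpha \, \mathcal{L}_{content} + \beta \, \mathcal{L}_{style}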

  • if alpha > beta the final image will incorporate more of the content features & less of style.
  • if alpha < beta the final image will be more artistic & will capture less of the content.

Run an optimization algorithm like SGD or Adam to reduce the total loss wrt the output image.

At each iteration, the optimization algorithm modifies the pixel values of the output image in the direction of minimum error.
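In other words, the pixels of the output image play the role of the trainable parameters; a plain gradient-descent step with learning rate \eta would look like this (Adam follows the same idea with adaptive step sizes):

\vec{x} \leftarrow \vec{x} - \eta \, \frac{\partial \mathcal{L}_{total}}{\partial \vec{x}}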

Summarizing the process

  1. Load the input content image c & style image s.
  2. Load a pre-trained network like VGG19.
  3. Specify the layers at which content feature maps & style feature maps are extracted.
  4. Obtain the content image ‘c’ feature maps & style image ‘s’ feature maps.
  5. Obtain the Gram matrix representation of the style features. Also, specify the style weights for each layer (constant values).
  6. Initialize the output/generated image with random pixel values.
  7. Initialize an optimizer/optimization algorithm like SGD/Adam/Adagrad to minimize the error wrt the output image. Lastly, initialize alpha & beta as constant values.
  8. Propagate the output image through the VGG19 network & obtain its feature maps at the content layer, aka the target content feature maps.
  9. Calculate content loss between target content feature map & the feature map of input content image c.
  10. Obtain the style feature maps of the output image at every style layer aka target-style feature maps.
  11. Calculate the Gram Matrix of target-style feature maps.
  12. Calculate the error between Gram matrices of target-style feature maps & input style image feature maps.
  13. Repeat this for every style layer to obtain overall style loss.
  14. Calculate the total loss (content loss + style loss).
  15. Run the optimizer/optimization algorithm to reduce total loss/error wrt the output image.
  16. Repeat steps 8–15 until total loss reaches a minimum value.

Code

The following code to implement neural style transfer is written in PyTorch, a popular deep learning library.

  1. Import the necessary modules
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.optim as optim
from torchvision import transforms, models

2. Define a function to load the input images (content & style). Open the image using the PIL library, which returns a PIL Image object, then convert it into a torch tensor by applying transformations.

def load_image(image_path, max_size=400, shape=None):
    image = Image.open(image_path).convert('RGB')
    # Cap the larger dimension at max_size to keep memory usage manageable
    if max(image.size) > max_size:
        size = max_size
    else:
        size = max(image.size)
    if shape is not None:
        size = shape
    in_transforms = transforms.Compose([
        transforms.Resize((size, int(1.5 * size))),
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))])
    # Keep only the RGB channels & add a batch dimension
    image = in_transforms(image)[:3, :, :].unsqueeze(0)
    return image

3. Create a method to convert the torch tensor back into NumPy representation so that it can be displayed using Matplotlib.

def im_convert(tensor):
    # Move to CPU, drop the batch dimension & undo the normalization
    image = tensor.to("cpu").clone().detach()
    image = image.numpy().squeeze()
    image = image.transpose(1, 2, 0)
    image = image * np.array((0.229, 0.224, 0.225)) + np.array((0.485, 0.456, 0.406))
    image = image.clip(0, 1)
    return image

4. Define a method to extract feature maps from the specified layers of the network/model & store them in a feature-map dictionary. This method is applied to both the content image & the style image.

def get_features(image, model, layers=None):
    # Default layer mapping: conv4_2 holds the content representation,
    # the conv*_1 layers hold the style representations
    if layers is None:
        layers = {'0': 'conv1_1',
                  '5': 'conv2_1',
                  '10': 'conv3_1',
                  '19': 'conv4_1',
                  '21': 'conv4_2',  ## content layer
                  '28': 'conv5_1'}
    features = {}
    x = image

    for name, layer in enumerate(model.features):
        x = layer(x)
        if str(name) in layers:
            features[layers[str(name)]] = x

    return features

5. Define a method to calculate the gram matrix representation of all the style feature maps.

def gram_matrix(tensor):
    # Flatten each feature map into a row vector, then take all pairwise inner products
    _, n_filters, h, w = tensor.size()
    tensor = tensor.view(n_filters, h * w)
    gram = torch.mm(tensor, tensor.t())
    return gram

6. Create an instance of the VGG19 network & initialize it with pre-trained ImageNet weights. Disable gradients for all the network parameters, as there is no need to train the network & the weights remain fixed. You can download the weights from here.

vgg = models.vgg19()
vgg.load_state_dict(torch.load('vgg19-dcbb9e9d.pth'))
for param in vgg.parameters():
    param.requires_grad_(False)

7. Replace the max-pooling layers with average pooling for better results. The reason is that max-pooling discards information, whereas average pooling retains it by averaging the pixel values instead.

for i, layer in enumerate(vgg.features):
    if isinstance(layer, torch.nn.MaxPool2d):
        vgg.features[i] = torch.nn.AvgPool2d(kernel_size=2, stride=2, padding=0)

8. Check whether your machine has a GPU so that PyTorch can use it to speed up processing.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
vgg.to(device).eval()

9. Load the input content & style images.

content = load_image('content image path').to(device)
style = load_image('style image path').to(device)

10. Obtain the corresponding content & style feature maps from their respective input images.

content_features = get_features(content,vgg)
style_features = get_features(style, vgg)

11. Obtain the gram matrix representation of style feature maps.

style_grams = {
    layer: gram_matrix(style_features[layer]) for layer in style_features
}

12. Create an output image tensor with random values, of the same size as the content image.

target = torch.rand_like(content).requires_grad_(True).to(device)
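A common alternative, used in some tutorials, is to initialize the target from a copy of the content image instead of pure noise; it usually converges to a recognisable result in fewer iterations:

target = content.clone().requires_grad_(True).to(device)   # optional: start from the content image instead of noise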

13. Initialize style weight values for every layer from which style feature maps are extracted.

style_weights = {'conv1_1': 0.75,
                 'conv2_1': 0.5,
                 'conv3_1': 0.2,
                 'conv4_1': 0.2,
                 'conv5_1': 0.2}

14. Create an instance of the Adam optimizer to optimize the error wrt the output target tensor.

optimizer = optim.Adam([target], lr=0.01)

15. Specify the alpha (content weight) & beta (style weight) values for the total loss function.

content_weight = 1e4
style_weight = 1e2

16. Run the style transfer loop for a specified number of iterations. Each iteration calculates the content loss & the style loss, then applies the Adam optimizer to reduce the overall loss wrt the target.

for i in range(2000):
    optimizer.zero_grad()
    target_features = get_features(target, vgg)

    # Content loss: mean squared error at the content layer conv4_2
    content_loss = torch.mean((target_features['conv4_2'] - content_features['conv4_2'])**2)

    # Style loss: weighted Gram-matrix differences accumulated over the style layers
    style_loss = 0
    for layer in style_weights:
        target_feature = target_features[layer]
        target_gram = gram_matrix(target_feature)
        _, d, h, w = target_feature.shape
        style_gram = style_grams[layer]
        layer_style_loss = style_weights[layer] * torch.mean((target_gram - style_gram)**2)
        style_loss += layer_style_loss / (d * h * w)

    total_loss = content_weight * content_loss + style_weight * style_loss
    total_loss.backward(retain_graph=True)
    optimizer.step()

    if i % 10 == 0:
        total_loss_rounded = round(total_loss.item(), 2)
        content_fraction = round(content_weight * content_loss.item() / total_loss.item(), 2)
        style_fraction = round(style_weight * style_loss.item() / total_loss.item(), 2)
        print('Iteration {}, Total loss: {} - (content: {}, style {})'.format(
            i, total_loss_rounded, content_fraction, style_fraction))

17. Convert the output tensor back into a NumPy array to display it & see the result.

final_img = im_convert(target)
plt.imshow(final_img)
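Optionally, the styled image can also be written to disk with Matplotlib (the filename here is just an example):

plt.imsave('stylized_output.jpg', final_img)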
