Neural Image Style Transfer

Arjun Mahes
Published in Geek Culture · 8 min read · Jan 4, 2022
Claude Monet Painting

You pick a photo from your camera roll and select a design. You get back a new image that has the content of your photo and the style of the design. We know this from Prisma, but how does the technique work? In machine learning, this algorithm is called “Style Transfer”.

Style transfer combines the style of one image with the content of another.

Style transfer uses a pre-trained CNN (convolutional neural network) to extract the content features of one image — the objects — and the style features of another — brush strokes, texture, colour — and mix them.

I spent minutes on end absorbing wave cat | Source: Intro to Deep Learning with PyTorch on Udacity

The best style transfer models can keep the features of an image while changing the stylistic attributes of it altogether.

In this article, we will explore the technical processes behind this algorithm.

TL;DR

  • Content & Style Representation: How does the machine understand and store aspects from images? What is the VGG 19 CNN? What is the gram matrix?
  • Loss Functions: What do we optimize? How do we combine the style and content image? What is style transfer?

The Style Transfer Process (to be further explained)

  • The content image is passed through the VGG-19 network where the content representation is extracted.
  • The style image is passed through the VGG-19 network where the style representation is extracted and stored using a gram matrix.
  • A copy of the content image — the target image — is then iteratively optimized with gradient descent, using the previously calculated content and style representations.

Capturing the Content and Style Representation

The primary two tasks to address for Style Transfer are capturing the content representation of one image and the style of another to combine them.

We capture these representations using a CNN — the VGG-19 in our case — and the Gram matrix, respectively.

An Overview of CNNs

In this section, I review the basics of CNNs and how they work. If you know the basics, feel free to skip ahead 🙂. If you aren’t familiar with them, please read this in-depth source to learn more about convolutions, layers, and filters.

A Convolutional Neural Network (ConvNet/CNN) is a Deep Learning algorithm that can take in an input image, assign importance (learnable weights and biases) to various aspects/objects in the image and be able to differentiate one from the other.

Although the architecture of a CNN varies across its sub-types — VGG-19, AlexNet, and others — the working process is the same. The convolutional part of every CNN is built from two kinds of layers:

  • Convolutional Layers
  • Max Pooling Layers

Convolutional layers (blue) hold the “feature maps”: they store the features detected — the convolved image — in the input. The deeper the layer, the more feature maps, and so the more stored features.

Pooling layers (orange) reduce the dimensions of the feature maps and simplify the convolutional layers, so the amount of data the later layers have to process is reduced.

As the input travels through the convolutional and pooling layers, a progressively more comprehensive and detailed representation of the content is generated.
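To make the two layer types concrete, here is a tiny PyTorch sketch of one convolutional layer followed by one max-pooling layer. The channel count, kernel size, and input size are illustrative choices, not a full VGG-19 configuration:

```python
import torch
import torch.nn as nn

# Illustrative sizes only: one convolutional layer producing 64 feature maps,
# followed by a max-pooling layer that halves the height and width.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)

image = torch.randn(1, 3, 224, 224)   # dummy RGB image: (batch, channels, H, W)
feature_maps = conv(image)            # -> (1, 64, 224, 224): 64 detected-feature maps
downsampled = pool(feature_maps)      # -> (1, 64, 112, 112): smaller, simpler maps

print(feature_maps.shape, downsampled.shape)
```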

Notice that, besides the convolution and pooling layers, there are also fully connected layers; since style transfer does not require any prediction, these fully connected layers are not included in our architecture later in the blog. | VGG-19 CNN 👆

In our style transfer model, we use the VGG-19 CNN architecture. This CNN has 19 weighted layers, including 3 dense (fully connected) layers, which we remove.

Using a VGG-19 model is crucial in our style transfer program to understand the style and content images thoroughly.
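In PyTorch, a minimal sketch of this step keeps only the convolutional part of torchvision’s pre-trained VGG-19 (its .features module) and freezes its weights, since we never train the network itself (newer torchvision versions prefer the weights= argument over pretrained=True):

```python
import torch
from torchvision import models

# Load the pre-trained VGG-19 and keep only its convolutional part (.features);
# the 3 dense classifier layers are dropped because we never classify anything.
vgg = models.vgg19(pretrained=True).features

# Freeze every weight: VGG-19 is used purely as a fixed feature extractor.
for param in vgg.parameters():
    param.requires_grad_(False)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
vgg.to(device).eval()
```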

Content Representation

As you know, the later layers of the VGG-19 CNN capture the content of the image better — their output is hence called the content representation. The content representation identifies the lines and intensities that outline the objects of an image.

content representation being created | notice the three input channels stand for the colour channels (R, G, B)

Notice that the content representation discards all irrelevant details including the style to preserve the primary objects in an image.

In our VGG-19 CNN, we extract the content representation from the second convolution of the fourth convolutional stack (conv4_2) — the layer used by Leon Gatys in his paper. This ensures a comprehensive but not overly condensed representation.

the purple rectangle represents the content representation of the image, as described in the ‘Neural Style Transfer’ paper
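A sketch of how that extraction might look with torchvision’s VGG-19. The index-to-name mapping below follows the common convention from the Udacity course this article draws on (index 21 is conv4_2), and content_image is an assumed, already-preprocessed image tensor on the same device as vgg — treat both as assumptions to verify against your own setup:

```python
def get_features(image, model, layers=None):
    """Pass an image through VGG-19 and keep the outputs of selected layers."""
    if layers is None:
        # Assumed index -> name mapping for torchvision's vgg19.features;
        # '21' (conv4_2) is the content representation layer.
        layers = {'0': 'conv1_1', '5': 'conv2_1', '10': 'conv3_1',
                  '19': 'conv4_1', '21': 'conv4_2', '28': 'conv5_1'}
    features = {}
    x = image
    for name, layer in model._modules.items():
        x = layer(x)
        if name in layers:
            features[layers[name]] = x
    return features

# content_image is assumed to be a preprocessed (1, 3, H, W) tensor on `device`.
content_features = get_features(content_image, vgg)
content_representation = content_features['conv4_2']
```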

Gram Matrix

As we know, to get the content representation, we pass the content image through a CNN — the VGG19 in this case — in which the later layers have better content representations.

Extracting style is surprisingly simple.

The ‘style’ of an image is the overarching theme of its colours, textures, and emotion. Or, in a more theoretical view: how does the texture of the cypress tree in ‘Starry Night’ relate to the texture of the moon? What are their similarities and differences?

Style is how the different aspects of a painting relate to each other

how does the texture in the cypress tree of starry night relate with the texture of the moon?

To find the style, we begin by passing the style image through the VGG-19 CNN and observing the correlations between feature maps in its convolutional layers. In other words, we are looking at the similarity between features in an image. These similarities are the style of the image.

We can store these similarities as a Style Representation with a Gram Matrix.

The Gram matrix is calculated by taking the dot product of the unrolled intermediate representation of a convolutional layer and its transpose. In simpler — programmers’ — words, we dot a list of flattened feature maps in a convolutional layer with its transpose.

and thus the Gram Matrix storing the style features in one convolutional layer is made
a flattened feature map, the random numbers represent pixel values
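As a sketch, that “flatten, then dot with the transpose” step looks like this in PyTorch (assuming, as in this article, a batch of a single image):

```python
import torch

def gram_matrix(tensor):
    """Gram matrix of a feature-map tensor of shape (batch, depth, height, width)."""
    _, d, h, w = tensor.size()
    # Unroll: one row per feature map, each row being that map flattened.
    flattened = tensor.view(d, h * w)
    # Dot the flattened feature maps with their transpose: entry (i, j) of the
    # resulting (d, d) matrix measures how strongly maps i and j co-occur.
    return torch.mm(flattened, flattened.t())
```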

As recommended in Leon Gatys’ paper, we calculate the Gram matrix at the first convolutional layer of every convolutional stack in the VGG-19 network (conv1_1, conv2_1, conv3_1, conv4_1, conv5_1). The style representation of an image is this list of Gram matrices.

calculating the gram matrix at every first layer of each stack
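Building the full style representation is then just one Gram matrix per chosen layer. A sketch reusing the get_features and gram_matrix helpers from above, with style_image assumed to be another preprocessed image tensor:

```python
# Style representation: one Gram matrix for the first convolutional layer of
# each stack. style_image is an assumed, preprocessed (1, 3, H, W) tensor.
style_features = get_features(style_image, vgg)
style_grams = {layer: gram_matrix(style_features[layer])
               for layer in ['conv1_1', 'conv2_1', 'conv3_1', 'conv4_1', 'conv5_1']}
```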

— and thus we know how to calculate the content and style representations of an image!

Loss Functions

After obtaining content and style representations, our next step is to create the target image — the combination of calculated representations. We achieve this using loss functions and gradient descent.

Loss functions measure how far a model is from its goal. The lower the loss, the better the result.

The loss functions used are the style loss and the content loss. Simply put, they measure the error between the target image and the style and content images, respectively. The target image is initialized as a copy of the content image and is iteratively optimized into a mixed image with the lowest possible style and content loss.
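In code, that initialization is a one-liner; content_image and device are the assumed tensor and device from the earlier sketches:

```python
# Start the target image as a copy of the content image; it is the only
# tensor we optimize, so it alone needs gradients.
target = content_image.clone().to(device).requires_grad_(True)
```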

Content Loss

As we form our target image, we compare the content representations of the target and content image. A better style transfer model makes these two representations as close as possible even as our target image changes its style.

We define a content loss that measures the difference between these two representations — the content representation of the content image and that of the target image. Here, we use the mean squared error (MSE) as the loss function.

Content loss: L_content = ½ Σ (T_c − C_c)² | T_c is the target image’s content representation and C_c is the content image’s content representation
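Using the conv4_2 features from the earlier sketches (target, vgg, get_features, and content_features are the assumed names defined above), the content loss is just a couple of lines:

```python
# Mean squared difference between the target's and the content image's
# conv4_2 (content) representations.
target_features = get_features(target, vgg)
content_loss = torch.mean((target_features['conv4_2'] -
                           content_features['conv4_2']) ** 2)
```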

To progressively generate a better target image, we aim to minimize this loss. Although this process is similar to how loss is used in optimization to determine the weights of a CNN, here our aim is not to minimize classification error.

Our goal is to change only the target image, updating its appearance until its content representation matches that of the content image. We are not using the VGG 19 network in its traditional sense, but only for feature extraction — gradient descent is used to lower the loss between our target and content images.

We are not using the VGG-19 for classification but only for feature extraction

Style Loss

In the same way, we calculate the difference between the target image and the style image with a style loss. This function, like the content loss, attempts to make the style representations of the style image and the target image as close as possible.

Style loss: L_style = a · Σ_i w_i (T_s,i − S_s,i)² | T_s is the target image’s style representation, S_s is the style image’s style representation, w_i are the style weights, and a is a constant that accounts for the number of values in each layer

We find the mean squared distance between the style and target image style representations. Recall that both style representations contain five Gram matrices, computed at the first layer of each convolutional stack in the VGG-19 network.

In the above equation, S_s and T_s are the style representations of the style and target images, and a is a constant that accounts for the number of values in each layer. We multiply each of the five calculated distances by a style weight w that we specify, and then sum them.

The style weights are values that vary the effect of the style representation on the target image. The larger the weight, the stronger the effect.

We only change the target image’s style representations as we minimize this loss over several iterations.
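Putting the last few paragraphs together, here is a sketch of the style loss that reuses target_features, style_grams, and gram_matrix from the earlier snippets. The specific weight values are an illustrative choice (similar to the Udacity example), not values prescribed by the paper:

```python
# Illustrative style weights: earlier layers are weighted more heavily so that
# fine textures have a stronger effect on the target image.
style_weights = {'conv1_1': 1.0, 'conv2_1': 0.75, 'conv3_1': 0.2,
                 'conv4_1': 0.2, 'conv5_1': 0.2}

style_loss = 0
for layer, weight in style_weights.items():
    target_feature = target_features[layer]
    _, d, h, w = target_feature.shape
    target_gram = gram_matrix(target_feature)
    # Mean squared distance between the two Gram matrices, scaled by the layer
    # weight and normalized by the number of values in the layer (the "a" above).
    style_loss += weight * torch.mean((target_gram - style_grams[layer]) ** 2) / (d * h * w)
```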

Optimization and Next Steps

Now that we understand representations and losses, the next step is to decrease the total loss.

Total loss is simply the sum of the style loss and the content loss (in practice, each term is usually scaled by its own weight).

We decrease total loss using typical gradient descent and backpropagation by iteratively changing the target image to match our desired content and style.
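Here is a sketch of that loop with the Adam optimizer, reusing the helpers and tensors from the earlier snippets. The content/style weights (often called alpha and beta) and the step count are illustrative hyperparameters, not values from the paper:

```python
# Illustrative hyperparameters: how much the content and style terms count.
content_weight = 1      # often called alpha
style_weight = 1e6      # often called beta

optimizer = torch.optim.Adam([target], lr=0.003)

for step in range(2000):
    optimizer.zero_grad()
    target_features = get_features(target, vgg)

    # Content loss (conv4_2) and style loss (Gram matrices), as defined above.
    content_loss = torch.mean((target_features['conv4_2'] -
                               content_features['conv4_2']) ** 2)
    style_loss = 0
    for layer, weight in style_weights.items():
        target_feature = target_features[layer]
        _, d, h, w = target_feature.shape
        target_gram = gram_matrix(target_feature)
        style_loss += weight * torch.mean(
            (target_gram - style_grams[layer]) ** 2) / (d * h * w)

    total_loss = content_weight * content_loss + style_weight * style_loss
    total_loss.backward()   # backpropagate into the target image's pixels
    optimizer.step()        # one gradient descent step on the target image
```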

Read this article to learn more about how gradient descent and optimization algorithms work, and this one to learn how to code your own using TensorFlow.

Hope you have a blast coding your own Style Transfer algorithm, learned something new, and are excited to build this model on your own. Cheers 🍻!

Some Inspiration 🙌 | Made by Thushan Ganegedara

Before you go…

Through this article, you have learned the fundamentals of how style transfer works! 🎉🎊 We explored feature extraction, loss functions, and how the optimization works. Good luck on your machine learning journey!

Machine learning has always been a hotspot for me. It is used everywhere, whether in Snapchat filters or spam classifiers. Today, it’s more of a lifestyle than a buzzword.

That is why I got into the field of data science. Since the beginning, I have been hooked, and I hope I always will be.

If you enjoyed reading through this article, feel free to connect with me on my socials 🤗
LinkedIn | Newsletter | Twitter
