Neural Image Style Transfer
You pick a photo from your camera roll and select a design. Out comes a new image with the content of your photo and the style of the design. Apps like Prisma made this famous, but how does the technique actually work? In machine learning, this algorithm is called “style transfer”.
Style transfer combines the style of one image with the content of another.
Style transfer uses a pre-trained CNN (convolutional neural network) to find the content features of one image (the objects) and the style features of another (brushstrokes, texture, colour), and mixes them.
The best style transfer models preserve the content of an image while completely transforming its stylistic attributes.
In this article, we will explore the technical processes behind this algorithm.
TL;DR
- Content & Style Representation: How does the machine understand and store aspects of images? What is the VGG-19 CNN? What is the gram matrix?
- Loss Functions: What do we optimize? How do we combine the style and content images into the target image?
The Style Transfer Process (explained in detail below)
- The content image is passed through the VGG-19 network where the content representation is extracted.
- The style image is passed through the VGG-19 network where the style representation is extracted and stored using a gram matrix.
- A copy of the content image (the target image) is then iteratively optimized with gradient descent against the previously calculated content and style representations.
Capturing the Content and Style Representation
The primary two tasks to address for Style Transfer are capturing the content representation of one image and the style of another to combine them.
We capture these representations using a CNN (the VGG-19 in our case) and the gram matrix, respectively.
An Overview of CNNs
In this section I review the basics of CNNs and how they work. If you know the basics, feel free to skip ahead 🙂. If you aren’t familiar with them, please read this in-depth source to learn more about convolutions, layers, and filters.
A Convolutional Neural Network (ConvNet/CNN) is a Deep Learning algorithm that can take in an input image, assign importance (learnable weights and biases) to various aspects/objects in the image and be able to differentiate one from the other.
Although the architecture varies across CNN variants (VGG-19, AlexNet, and others), they all follow the same working process. All CNNs share two building blocks:
- Convolutional Layers
- Max Pooling Layers
Convolutional layers (blue) hold the “feature maps”: they store the features detected in the input, i.e. the convolved image. The deeper the layer, the more feature maps, and so the more stored features.
Pooling layers (orange) reduce the dimensions of the feature maps and simplify the output of the convolutional layers. Hence, the number of parameters to learn is reduced.
As the input travels through the convolutional and pooling layers, a progressively more comprehensive and detailed representation of the content is generated.
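To make the two building blocks concrete, here is a minimal NumPy sketch (the function names and the toy edge kernel are mine, not from any library): a convolution slides a small kernel over the image to produce a feature map, and max pooling then shrinks that map.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a kernel over the image (no padding) and record responses."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(fmap, size=2):
    """Keep the strongest response in each size x size window."""
    h, w = fmap.shape
    out = np.zeros((h // size, w // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i * size:(i + 1) * size, j * size:(j + 1) * size].max()
    return out

image = np.random.rand(8, 8)                     # a tiny grayscale "photo"
edge_kernel = np.array([[1., -1.], [1., -1.]])   # responds to vertical edges
fmap = conv2d_valid(image, edge_kernel)          # feature map: where edges are
pooled = max_pool2d(fmap)                        # smaller summary of the map
print(fmap.shape, pooled.shape)                  # (7, 7) (3, 3)
```

Notice how pooling shrinks the 7×7 feature map down to 3×3 while keeping only the strongest responses.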
In our style transfer model, we use the VGG-19 CNN architecture. This network has 19 weighted layers: 16 convolutional layers, plus 3 dense layers that we remove since we only need feature extraction.
A pre-trained VGG-19 is crucial to our style transfer program: it supplies the rich features needed to understand the style and content images thoroughly.
Content Representation
The later layers of the VGG-19 capture the content of the image best; their activations are therefore called the content representation. The content representation captures lines and intensities that outline the objects in an image.
Notice that the content representation discards all irrelevant details including the style to preserve the primary objects in an image.
In our VGG-19 CNN, we extract the content representation from the second convolution of the fourth convolutional stack (conv4_2), as used by Gatys et al. in their paper. This layer is deep enough to be comprehensive without being overly condensed.
Gram Matrix
As we know, to get the content representation, we pass the content image through a CNN (the VGG-19 in this case), whose later layers hold better content representations.
Extracting style is unexpectedly simpler.
The ‘style’ of an image is the overarching theme of its colours, textures, and mood. Or, put more concretely: how does the texture of the cypress tree in ‘Starry Night’ relate to the texture of the moon? What are their similarities and differences?
Style is how the aspects of a painting relate to one another
To find the style, we begin by passing the style image through the VGG-19 CNN to observe the correlations between feature maps in the convolutional layers. In other words, we are looking at the similarity between features in an image. These similarities are the style of the image.
We can store these similarities as a Style Representation with a Gram Matrix.
The gram matrix is calculated by taking the dot product of the unrolled intermediate representation of a convolutional layer and its transpose. In programmers’ terms: we flatten each feature map in a convolutional layer into a vector, stack the vectors into a matrix, and multiply that matrix by its transpose.
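In code, the gram matrix for a single layer is only a couple of lines; a NumPy sketch (the function name and toy shapes are mine):

```python
import numpy as np

def gram_matrix(feature_maps):
    """feature_maps: activations of one conv layer, shape (C, H, W)."""
    c, h, w = feature_maps.shape
    flat = feature_maps.reshape(c, h * w)  # unroll each feature map into a row
    return flat @ flat.T                   # (C, C): dot with its own transpose

layer = np.random.rand(64, 32, 32)  # e.g. 64 feature maps of size 32x32
g = gram_matrix(layer)
print(g.shape)  # (64, 64)
```

Entry (i, j) of the result measures how strongly feature maps i and j co-activate, which is exactly the “similarity between features” described above; the matrix is symmetric by construction.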
As recommended in Leon Gatys’ paper, we calculate the gram matrix of the first convolutional layer in each of the five convolutional stacks of the VGG-19 network (conv1_1, conv2_1, conv3_1, conv4_1, and conv5_1). The style representation of an image is this list of five gram matrices.
— and thus we know how to calculate the content and style representations of an image!
Loss Functions
After obtaining the content and style representations, our next step is to create the target image: the combination of the two. We achieve this using loss functions and gradient descent.
Loss functions measure the accuracy of a model: the lower the loss, the higher the accuracy.
The loss functions used here are the style loss and the content loss. Simply put, they measure the error between the target image and the style and content images, respectively. The target image is first initialized as the original content image and is then iteratively optimized into a mixed image with the least possible style and content loss.
Content Loss
As we form our target image, we compare the content representations of the target and content image. A better style transfer model makes these two representations as close as possible even as our target image changes its style.
We define a content loss that measures the difference between the content representations of the content image and the target image. Here, we use the mean squared error (MSE) as the loss function.
To progressively generate a better target image, we aim to minimize this loss. Although this process is similar to how loss is used in optimization to determine the weights of a CNN, here our aim is not to minimize classification error.
Our goal is to change only the target image, updating its appearance until its content representation matches that of the content image. We are not using the VGG 19 network in its traditional sense, but only for feature extraction — gradient descent is used to lower the loss between our target and content images.
We are not using the VGG-19 for classification but only for feature extraction
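As a sketch (NumPy, with names and toy shapes of my choosing), the content loss is just the MSE between two activation tensors:

```python
import numpy as np

def content_loss(target_features, content_features):
    """Mean squared error between the two content representations."""
    return np.mean((target_features - content_features) ** 2)

content = np.random.rand(512, 28, 28)  # e.g. conv4_2 activations of the photo
target = content + 0.1 * np.random.rand(512, 28, 28)  # slightly perturbed copy

print(content_loss(content, content))  # 0.0 for a perfect match
print(content_loss(target, content))   # small but nonzero
```

Minimizing this value with gradient descent pulls the target image’s content representation toward the photo’s, regardless of what happens to its style.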
Style Loss
In the same way, we calculate the difference between the target image and the style image with a style loss. Like the content loss, this function attempts to make the style representations of the style and target images as close as possible.
We find the mean squared distance between the style representations of the style and target images. Recall that each style representation contains five gram matrices, computed at the first layer of each convolutional stack in the VGG-19 network.
Writing Ss,i and Ts,i for the gram matrices of the style and target images at layer i, the style loss takes the form L_style = a · Σᵢ wᵢ (Ts,i − Ss,i)², where a is a constant that accounts for the number of values in each layer. We multiply each of the five calculated distances by a style weight wᵢ that we specify, then sum the results.
The style weights are values that vary the effect of the style representation on the target image. The more the weight the higher the effect.
We only change the target image’s style representations as we minimize this loss over several iterations.
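Putting these pieces together, here is a hedged NumPy sketch of the style loss (function names, toy shapes, and the exact normalizing constant are my choices; the structure follows the weighted gram-matrix distances described above):

```python
import numpy as np

def gram_matrix(fmaps):
    c, h, w = fmaps.shape
    flat = fmaps.reshape(c, h * w)
    return flat @ flat.T

def style_loss(target_layers, style_layers, weights):
    """Weighted sum of squared gram-matrix distances, layer by layer."""
    loss = 0.0
    for t, s, w in zip(target_layers, style_layers, weights):
        c, h, wd = t.shape
        a = 1.0 / (c * h * wd) ** 2  # constant accounting for layer size
        loss += w * a * np.sum((gram_matrix(t) - gram_matrix(s)) ** 2)
    return loss

# five layers, standing in for conv1_1 ... conv5_1 (toy shapes)
style = [np.random.rand(8, 16, 16) for _ in range(5)]
target = [s + 0.05 * np.random.rand(*s.shape) for s in style]
weights = [1.0, 0.8, 0.5, 0.3, 0.1]  # larger weight = stronger style influence

print(style_loss(style, style, weights))   # 0.0: identical representations
print(style_loss(target, style, weights))  # positive: styles differ
```

Raising the weight of an earlier layer emphasizes fine textures; raising a later layer’s weight emphasizes larger-scale style patterns.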
Optimization and Next Steps
Now that we comprehend the ideas of representation and loss, the next step is to decrease the total loss.
Total loss is simply the sum of the style loss and the content loss.
We decrease total loss using typical gradient descent and backpropagation by iteratively changing the target image to match our desired content and style.
Read this article to learn more about how gradient descent and optimization algorithms work, and this one to learn how to code your own using TensorFlow.
Hope you have a blast coding your own style transfer model, and that you learned something new along the way. Cheers 🍻!
Before you go…
Through this article, you have learned the fundamentals of how style transfer works! 🎉🎊 We explored the feature extraction, loss functions, and how style transfer optimization works! Good luck on your machine learning journey!
Machine learning has always been a hotspot for me. It is used everywhere, whether in Snapchat filters or spam classifiers. Today, it’s more of a lifestyle than a buzzword.
That is why I got into the field of data science. Since the beginning, I have been hooked, and I hope I always will be.
If you enjoyed reading this article, feel free to connect with me on my socials 🤗
LinkedIn | Newsletter | Twitter