Spring Quarter of my freshman year, I took Stanford’s CS 231n course on Convolutional Neural Networks. My final project for the course dealt with a super cool concept called neural style transfer, in which the style of a piece of artwork is transferred onto a picture. Here’s a classic example — a picture of Hoover Tower at Stanford, in the style of The Starry Night:
The first published paper on neural style transfer used an optimization technique — that is, starting off with a random noise image and making it more and more desirable with every “training” iteration of the neural network. This technique has been covered in quite a few tutorials around the web. However, the technique of a subsequent paper, which is what really made neural style transfer blow up, has not been covered as thoroughly. This technique is feedforward — train a network to do the stylizations for a given painting beforehand so that it can produce stylized images instantly.
In this tutorial, we will cover both techniques, the intuitions and math behind how they work, and how the second builds off the first. We will also cover the next step: cutting-edge research as explored by me in my CS 231n project, as well as research teams at Cornell and Google. By the end, you will be able to train your own networks to produce beautiful pieces of art!
In this method, we do not use a neural network in a true sense. That is, we aren’t training a network to do anything. We are simply taking advantage of backpropagation to minimize two defined loss values. The tensor which we backpropagate into is the stylized image we wish to achieve — which we call the pastiche from here on out. We also have as inputs the artwork whose style we want to transfer, known as the style image, and the picture that we want to transfer the style onto, known as the content image.
The pastiche is initialized to be random noise. It, along with the content and style images, are then passed through several layers of a network that is pretrained on image classification. We use the outputs of various intermediate layers to compute two types of losses: style loss and content loss — that is, how close is the pastiche to the style image in style, and how close is the pastiche to the content image in content. Those losses are then minimized by directly changing our pastiche image. By the end of a few iterations, the pastiche image now has the style of the style image and the content of the content image — or, said differently, it is a stylized version of the original content image.
Before we dive into the math and intuition behind the losses, let’s address a concern you may have. You may be wondering why we use the outputs of intermediate layers of a pretrained image classification network to compute our style and content losses. This is because, for a network to be able to do image classification, it has to understand the image. So, between taking the image as input and outputting its guess at what it is, it’s doing transformations to turn the image pixels into an internal understanding of the content of the image.
We can interpret these internal understandings as intermediate semantic representations of the initial image and use those representations to “compare” the content of two images. As an example: if we pass two images of cats through an image classification network, even if the initial images look very different, after being passed through many internal layers, their representations will be very close in raw value. This is the content loss — pass both the pastiche image and the content image through some layers of an image classification network and find the Euclidean distance between the intermediate representations of those images. Here’s the equation for content loss:
The summation notation makes the concept look harder than it really is. Basically, we make a list of layers at which we want to compute content loss. We pass the content and pastiches images through the network until a particular layer in the list, take the output of that layer, square the difference between each corresponding value in the output, and sum them all up. We do this for every layer in the list, and sum those up. One thing to note, though: we multiply each of the representations by some value alpha (called the content weight) before finding their differences and squaring it, whereas the original equation calls for the value to be multiplied after squaring it. I found, in practice, the former to work much better than the latter, as it produces appealing stylizations much more quickly.
The style loss is very similar, except instead of comparing the raw outputs of the style and pastiche images at various layers, we compare the Gram matrices of the outputs. A Gram matrix results from multiplying a matrix with the transpose of itself:
Because every column is multiplied with every row in the matrix, we can think of the spatial information that was contained in the original representations to have been “distributed”. The Gram matrix instead contains non-localized information about the image, such as texture, shapes, and weights — style!
Now that we have defined the Gram matrix as having information about style, we can find the Euclidean distance between the Gram matrices of the intermediate representations of the pastiche and style image to find how similar they are in style:
Similar to our style loss computation, we find the Euclidean distances between each corresponding pair of values in the Gram matrices computed at each layer in a predefined list of layers, multiplied by some value beta (known as the style weight).
We have the content loss — which contains information on how close the pastiche is in content to the content image — and the style loss — which contains information on how close the pastiche is in style to the style image. We can now add them together to get the total loss. We then backpropagate through the network to reduce this loss by getting a gradient on the pastiche image and iteratively changing it to make it look more and more like a stylized content image. This is all described in more rigorous detail in the original paper on the topic by Gatys et al.
Now that we know how it works, let’s build it. We’ll be using the PyTorch library for all our code, so make sure you have it installed.
First, let’s implement the Gram matrix layer:
Now, let’s define our network. Let’s create a class with an initializer that sets some important variables:
The variables which we initialize are: the style, content, and pastiches images; the layers at which we compute content and style loss, as well as the alpha and beta weights we multiply the representations by; the pretrained network that we use to get the intermediate representations (we use VGG-19); the Gram matrix computation; the loss computation we do (as MSE is the same as Euclidean distance); and the optimizer we use. We also want to make use of a GPU if we have on on our machine.
Now for the network’s “training” regime. We pass the images through the network one layer at a time. We check to see if it is a layer at which we do a content or style loss computation. If it is, we compute the appropriate loss at that layer. Finally, we add the content and style losses together and call backward on that loss, and take an update step. Here’s how all that looks:
(To use the LBGFS optimizer, it is necessary to pass into the step function a closure function which “reevaluates the model and returns the loss”; we don’t need to do that with any other optimizer…go figure)
One more step before we can start transferring some style — we need to write up a couple of convenience functions:
The image_loader function opens an image at a path and loads it as a PyTorch variable of size imsize; the save_image function turns the pastiches image, which is a PyTorch variable, into the appropriate PIL format to save it to file.
Now, to produce beautiful art. First, let’s import some stuff, load our images, and put them onto the GPU if we have one. Then, let’s define and call our training loop to turn our initially random pastiche into an incredible piece of art. Here’s how that all looks:
And that’s it! We have used an optimization method to generate a stylized version of a content image. You can now render any image into the style of any painting — albeit slowly, as the optimization process is iterative. Here’s an example that I made from our code — a dancer in the style of Figure Dans un Fauteuil by Picasso:
Why does the previous method take so much time? Why do we have to wait for an optimization loop to generate a pretty picture? Can’t we tell a network, “Hey, learn the style of Starry Night so well that I can give you any picture and you’ll turn it into a Starry Night-ified version of the picture instantly”? As a matter of fact, we can!
Essentially, we create an untrained Image Transformation Network which transforms the content image into its best guess at an appealing pastiche image. We then use this as the pastiche image which, along with the content and style images, is passed through the pretrained image classification network (now called the Loss Network) to compute our content and style losses. Finally, to minimize the loss, we backpropagate into the parameters of the Image Transormation Network, not directly into the pastiche image. We do this with a ton of random content image examples, thereby training the Image Transformation Network to transform any given picture into the style of some predefined artwork.
We now need to add an Image Transformation Network to our multi-stage network architecture, and make sure that the optimizer is set to optimize the parameters of the ITN. We will be using the architecture from Johnson et al. for our ITN. Replace the last 8 lines of the StyleCNN initializer with this code:
Also, since we now want our training regime to accept a content image batch as training examples, we can remove the content image as an input to the initializer. Our training regime accepts a content image batch, transforms it into pastiche images, computes losses as done above, and calls backward on the final loss to update the ITN parameters. Here’s all that in code:
Awesome! Let’s change our save_image function to be able to save not just one image, but a batch of images:
def save_images(input, paths):
N = input.size()
images = input.data.clone().cpu()
for n in range(N):
image = images[n]
image = image.view(3, imsize, imsize)
image = unloader(image)
One more order of business. We need to have available to us a dataset of content images as training examples to be passed into our network at every training iteration. We will use the Microsoft COCO dataset for this. You have to install and set up the COCO API before you can do anything with it in PyTorch. The API can be download from source here. After downloading it, install it by running setup.py.
Once you’ve completed those steps, add these imports:
import torchvision.datasets as datasets
Finally, we replace our main function with this code:
Woo! Now we can train a network to transform any image into a stylized version of it, based on the style of our prepicked image. That is awesome! In case you were wondering, this technology is exactly how apps like Prisma work. Here are some example outputs from three feedforward networks trained on three different paintings:
Arbitrary Style Transfer
You may have noticed that, although our feedforward implementation can produce stylized pastiches instantly, it can only do so for one given style image. Would it be possible to train a network which can take, in addition to any content image, any style image and produce a pastiche from those two images? In other words, can we make a truly arbitrary neural style transfer network?
As it so happens, a very cool finding in this problem suggested that it is possible. A few years ago, it was found that the Instance Normalization layers in the Image Transformation Networks were the only important layers to represent style. In other words, if we keep all Convolutional parameters the same and learn only new Instance Normalization parameters, we can represent completely different styles in just one network. Here’s the paper with that finding. This suggests that we can take advantage of Instance Norm layers by using, in those layers, parameters that are specific to each style.
The first attempt to do this was from a team at Cornell. Their solution was using Adaptive Instance Normalization, in which an encoder-decoder architecture was used to generate the Instance Norm parameters from the style images. Here is the paper on that. This had fairly high success.
The next attempt at arbitrary style transfer was my own project, in which I use a secondary Normalization Network to transform the style image into Instance Normalization parameters to be used by the Image Transformation Network, learning the parameters in both the Normalization Network and the ITN all at once. Here is my paper on that. This method was able to demonstrate success on a small set of random styles and therefore was not truly arbitrary; however, it was theoretically successful.
The latest attempt at achieving arbitrary style transfer was from a team at Google Brain. They used a method which happens to be almost identical to the my method, as described above, except that they use a pre-trained Inception network to transform the style images into Instance Norm parameters instead of using an untrained network and training it, along with the ITN, end-to-end. This method saw wild success and produced beautiful results, achieving true arbitrary neural style transfer. Here is the awesome paper on that.
While I was unable to fully solve the problem, I felt honored to have been able to participate in the research conversation and be on the cutting-edge of solving such an intriguing problem. It was thrilling to see Google finally solve the problem with such an elegant and beautiful solution.
There’s quite a bit of literature on this problem, so some great next steps would be to skim through them. I linked the papers throughout the article, but I thought I’d post them all in one place too:
- Here is the original paper on neural style transfer, which proposed the optimization process
- Here is the paper on feedforward style transfer; its supplementary materials contain the architecture that we used in our code.
- If you were wondering what Instance Normalization even is, here is the paper describing it.
- Here is the Cornell paper on Adaptive Instance Norm
- Here is my paper on end-to-end arbitrary style transfer network training
- Here is the Google paper which achieved arbitrary style transfer
Hope you found this exciting application of neural networks to be as fascinating and fun as I do. Happy learning!