NeuroNuggets: Style Transfer

Neuromation · Apr 11, 2018

In the fourth installment of the NeuroNuggets series, we continue our study of basic problems in computer vision. Recall that in the NeuroNuggets series we discuss the demos available on the recently released NeuroPlatform, concentrating not so much on the demos themselves as on the ideas behind each deep learning model.

In the first installment, we talked about the age and gender detection model, which is basically image classification. In the second, we presented object detection, a more complex computer vision problem where you also have to find where an object is located. In the third, we continued with segmentation and the Mask R-CNN model. Today, we turn to something even more exciting: a popular and very beautiful application of deep learning, style transfer. We will see how to make a model that can draw your portrait in the style of Monet — or Yves Klein if you’re not careful with training! I am also very happy to present the co-author of this post, one of our lead researchers, Kyryl Truskovskyi:

Style Transfer: the Problem

Imagine being able to create a true artwork yourself, turning your own photo or a familiar landscape into a painting done the way Van Gogh or Picasso would have done it. Sounds like a pipe dream? With the help of deep neural networks, this dream has now become reality. Neural style transfer has become a trending topic both in academic literature and in industrial applications; we all know and use popular mobile apps for style transfer and image enhancement. Starting from 2014, style transfer apps have been a huge PR point for deep neural networks, and today almost every smartphone user has tried some style transfer app for photos and/or video. By now, these methods work in real time and run on mobile devices, and anyone can create artwork with a simple app, stylizing their own images and videos… but how do these apps work their magic?

From the deep learning perspective, ideas for style transfer stem from attempts to interpret the features that a deep neural network learns and to understand how exactly it works. Recall that a convolutional neural network for image processing gradually learns more and more convoluted features (see, e.g., our first post in the NeuroNuggets series), starting from basic local filters and getting all the way to semantic features like “dog” or “house”… or, quite possibly, “Monet style”! The basic idea of style transfer is to disentangle these features, separating the semantic features of “what is in the picture” from “how the picture looks”. Once you do that, you can replace the style while leaving the content in place, and the style can come from a completely different painting.

For the reader interested in a more detailed view, we recommend “Neural Style Transfer: A Review”, a detailed survey of neural style transfer methods. For instance, here is an example from that paper where a photo of the Great Wall of China has been redone in classical Chinese painting style:

And here is an example from DeepArt, one of the first style transfer services:

You can even do it with videos:

But let’s see how it works!

Neural Style Transfer by Optimizing the Image

There are two main approaches to neural style transfer: optimization and feed-forward networks. Let us begin with optimization.

In this approach, we optimize not the network but the resulting image. This is a very simple but powerful idea: the image itself is also just a matrix (tensor) of numbers, just like the weights of the network. This means that we can take derivatives with respect to the pixels of the image, extending backpropagation to them too and optimizing an image for a network rather than the other way around; we have already discussed this idea in “Can a Neural Network Read Your Mind?”.

For style transfer, it works as follows: you have a content image, a style image, and a trained neural network (trained on ImageNet, for example). You create a new generated image, initialized completely at random. Then the content image and the style image are passed through the early and intermediate layers of the network to compute two loss functions: the style loss and the content loss (see below). Next, we minimize these losses by changing the generated image, and after a few iterations we get a beautiful stylized image. The structure looks a bit intimidating:

But do not worry — let’s talk about the losses. After we pass an image through the network, we get feature maps from its intermediate layers; these feature maps capture the semantic representation of the image. We definitely want the new image to be similar to the content image, so for content image feature maps C and generated image feature maps P we get the following content loss:
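In this notation, the content loss from the original paper is simply (up to a constant) the squared Euclidean distance between the two feature maps computed at the chosen layer:

\mathcal{L}_{\text{content}} = \frac{1}{2} \sum_{i,j} \left( C_{ij} - P_{ij} \right)^2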

The style loss is slightly different: for the style loss, we compute the Gram matrices of the intermediate representations of the generated image and the style image:

The style loss is the Euclidean distance between Gram matrices
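Concretely, if F_{ik} denotes the activation of channel i at spatial position k in some layer’s feature map, then the Gram matrix of that layer and the corresponding style loss look as follows (the original paper additionally sums over several layers with weights and normalizes by the feature map size):

G_{ij} = \sum_{k} F_{ik} F_{jk}, \qquad \mathcal{L}_{\text{style}} = \sum_{i,j} \left( G^{\text{generated}}_{ij} - G^{\text{style}}_{ij} \right)^2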

By directly minimizing the sum of these losses by gradient descent on our generated image, we make it more and more similar to the style image in terms of style while still keeping the content due to the content loss. We refer to the original paper, “A Neural Algorithm of Artistic Style”, for details. This approach works great, but its main disadvantage is that it takes a lot of computational effort.
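To make this concrete, here is a minimal sketch of the optimization approach in PyTorch. It is an illustration rather than the original paper’s exact implementation: it assumes a pretrained VGG-16 as the feature extractor, content_img and style_img given as (1, 3, H, W) tensors already preprocessed for VGG, and illustrative layer choices and loss weights.

# A minimal sketch of optimization-based style transfer:
# the optimization variable is the generated image itself.
import torch
import torch.nn.functional as F
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"

# Fixed, pretrained feature extractor; its weights are never updated.
vgg = models.vgg16(pretrained=True).features.to(device).eval()
for p in vgg.parameters():
    p.requires_grad_(False)

STYLE_LAYERS = {3, 8, 15, 22}   # relu1_2, relu2_2, relu3_3, relu4_3
CONTENT_LAYERS = {15}           # relu3_3

def extract_features(x):
    """Run x through VGG and collect feature maps from the chosen layers."""
    feats = {}
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in STYLE_LAYERS | CONTENT_LAYERS:
            feats[i] = x
    return feats

def gram(feat):
    """Gram matrix of a (1, C, H, W) feature map, normalized by its size."""
    _, c, h, w = feat.shape
    f = feat.view(c, h * w)
    return f @ f.t() / (c * h * w)

def style_transfer(content_img, style_img, steps=300, style_weight=1e6):
    content_feats = extract_features(content_img)
    style_grams = {i: gram(f) for i, f in extract_features(style_img).items()}

    # The generated image is initialized at random and optimized directly.
    generated = torch.rand_like(content_img).requires_grad_(True)
    optimizer = torch.optim.Adam([generated], lr=0.02)

    for _ in range(steps):
        feats = extract_features(generated)
        content_loss = sum(F.mse_loss(feats[i], content_feats[i])
                           for i in CONTENT_LAYERS)
        style_loss = sum(F.mse_loss(gram(feats[i]), style_grams[i])
                         for i in STYLE_LAYERS)
        loss = content_loss + style_weight * style_loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return generated.detach()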

Neural Style Transfer Via Image Transformation Networks

The basic idea of the next approach is to use feed-forward networks for image transformation tasks. Basically, we want to create an image transformation network that would directly produce beautiful stylized images, with no complicated “image training” process.

The algorithm is simple. Suppose that, again, you have a content image and a style image. You feed the content image through the image transformation network and get a new generated image. Then a loss network is used to compute the style and content losses, just like in the optimization method above, but now we optimize not the image itself but the image transformation network. This way, we get a trained image transformation network for style transfer, and stylization then requires only a single forward pass, without any optimization at all; a rough sketch of the training loop follows below.
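Here is what that training loop might look like in PyTorch, reusing extract_features, gram, CONTENT_LAYERS, STYLE_LAYERS, and device from the sketch above. TransformNet, data_loader, and style_img are placeholders rather than the paper’s exact architecture and data, and batches of a single image are assumed for brevity.

# A rough sketch of training a feed-forward image transformation network.
import torch
import torch.nn.functional as F

style_weight = 1e6                                     # illustrative value
transform_net = TransformNet().to(device)              # hypothetical encoder-decoder
optimizer = torch.optim.Adam(transform_net.parameters(), lr=1e-3)

# The style image is fixed: one trained network corresponds to one style.
style_grams = {i: gram(f) for i, f in extract_features(style_img).items()}

for content_img in data_loader:                        # e.g. photos from a large dataset
    content_img = content_img.to(device)
    generated = transform_net(content_img)             # a single forward pass

    feats = extract_features(generated)
    content_feats = extract_features(content_img)
    content_loss = sum(F.mse_loss(feats[i], content_feats[i])
                       for i in CONTENT_LAYERS)
    style_loss = sum(F.mse_loss(gram(feats[i]), style_grams[i])
                     for i in STYLE_LAYERS)
    loss = content_loss + style_weight * style_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                   # update the network, not the image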

Here is how the entire process looks, with the image transformation network and the loss network:

The original paper on this approach, “Perceptual Losses for Real-Time Style Transfer and Super-Resolution”, led to the first truly popular real-time implementations of style transfer and super-resolution algorithms; you can find sample code for this approach here. The image transformation network works fast, but its main disadvantage is that we need to train a completely separate new network for every style image. This is not a problem for preset Instagram-like filters but does not solve the general problem.

To fix this, the authors of “A Learned Representation for Artistic Style” introduced an approach called conditional instance normalization. The basic idea is to train one network for several different styles. It turns out that normalization plays a huge role in how a style network models a style, and it is sufficient to specialize the scaling and shifting parameters applied after normalization to each specific style. In simpler words, it is enough to tune, for each style, the parameters of a simple affine transformation applied after each normalization layer inside the image transformation network, while all other weights are shared between styles.

Here is a picture of how it works; for the mathematical details, we refer to the original paper:

You can find an example code for this model here.
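As a rough sketch of the idea in PyTorch (an illustration, not the authors’ exact implementation), a conditional instance normalization layer keeps one pair of scale and shift parameters per style:

# A rough sketch of conditional instance normalization: each style gets its
# own scale (gamma) and shift (beta), applied after a shared, parameter-free
# instance normalization; the rest of the network is common to all styles.
import torch
import torch.nn as nn

class ConditionalInstanceNorm2d(nn.Module):
    def __init__(self, num_features, num_styles):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        self.gamma = nn.Parameter(torch.ones(num_styles, num_features))
        self.beta = nn.Parameter(torch.zeros(num_styles, num_features))

    def forward(self, x, style_id):
        out = self.norm(x)
        gamma = self.gamma[style_id].view(1, -1, 1, 1)
        beta = self.beta[style_id].view(1, -1, 1, 1)
        return gamma * out + beta

# Usage: the same network stylizes with different styles simply by switching
# style_id at every normalization layer.
cin = ConditionalInstanceNorm2d(num_features=64, num_styles=10)
x = torch.randn(1, 64, 128, 128)
stylized_features = cin(x, style_id=3)   # illustrative style index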

In the platform demo, we use the Python framework PyTorch for training the neural networks, Flask + Gunicorn to serve the model on our platform, and Docker + Mesos for deployment and scaling.
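As an illustration of how such a model might be exposed as a web service (a minimal sketch; the stylize function is a hypothetical wrapper around the trained network, and this is not the platform’s actual serving code):

# A minimal sketch of serving a style transfer model with Flask.
import io

from flask import Flask, request, send_file
from PIL import Image

app = Flask(__name__)

@app.route("/stylize", methods=["POST"])
def stylize_endpoint():
    # Read the uploaded image, run the model, and return the stylized result.
    image = Image.open(request.files["image"].stream).convert("RGB")
    result = stylize(image)                  # hypothetical model wrapper returning a PIL image
    buf = io.BytesIO()
    result.save(buf, format="JPEG")
    buf.seek(0)
    return send_file(buf, mimetype="image/jpeg")

# In production such an app would run behind Gunicorn, e.g.:
#   gunicorn -w 4 app:app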

Try it out on the platform

Now that we know how style transfer works, it is even better to see the results for ourselves. We follow the usual steps to get the model working.

1. Log in at https://mvp.neuromation.io
2. Go to “AI Models”:

3. Click “Add more” and “Buy on market”:

4. Select and buy the Image Style Transfer demo model:

5. Launch it with the “New Task” button:

6. Try the demo! You can upload your own photo for style transfer. We chose this image:

Neuromation Team in Singapore

7. And here you go!

Stylized and enhanced Neuromation Team in Singapore

And here we are:

Sergey Nikolenko
Chief Research Officer, Neuromation

Kyryl Truskovskyi
Lead Researcher, Neuromation
