Paper Summary: A Neural Algorithm of Artistic Style

Mike Plotz
Nov 17, 2018


Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/01.

A Neural Algorithm of Artistic Style (2015) https://arxiv.org/abs/1508.06576 Leon A. Gatys, Alexander S. Ecker, Matthias Bethge

This one is short and sweet as far as methods go. It’s based on the idea that both biological vision systems and convolutional neural networks (CNNs), when trained on natural images, learn to be invariant to changes in how objects appear (lighting, rotation, texture, etc.). Empirically this manifests as a separation between content and style: the former shows up directly in the “feature responses” (activations) within the network, while the latter shows up in the correlations between different feature channels. The content representation preserves spatial information, whereas the style representation discards it by averaging over the spatial dimensions.

The intuitive interpretation of the style representation is something like, in Van Gogh’s The Starry Night, when there’s a border between dark and light, there’s a dark brush stroke with a particular thickness parallel to that border. These are the kinds of features that CNNs tend to pick up easily.

Here’s the basic method:

  1. Start with a pretrained VGG network, using average pooling instead of max pooling (the authors saw better gradient flow that way)
  2. Take a content image, a style image, and a target image initialized to random pixel values (note: Jeremy Howard says that he had to blur the random image to get it to train)
  3. Define a content loss (MSE of the feature activations for the content image and the target image)
  4. And a style loss (MSE of the Gram matrices of the style image and the target image)
  5. Run gradient descent, optimizing a linear combination of the two losses by modifying the pixel values of the target image (a rough code sketch of this loop follows the list)
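
Concretely, here’s what that loop can look like. This is a minimal PyTorch sketch of my own, not the authors’ code: `vgg_features` (activations from a frozen VGG), `content_loss`, `style_loss`, and the preloaded image tensors `content_img` and `style_img` are all assumed, and the two losses are sketched further down. The paper just says gradient descent; Adam here is a convenient stand-in.

```python
import torch

# Assumed to exist elsewhere: vgg_features(img) -> activations from a frozen VGG,
# content_loss(...), style_loss(...), and image tensors content_img / style_img.
alpha, beta = 1.0, 1e3  # loss weights; alpha/beta ~ 1e-3, as in the paper

target = torch.rand_like(content_img, requires_grad=True)  # random-pixel start
optimizer = torch.optim.Adam([target], lr=0.02)

for step in range(500):
    optimizer.zero_grad()
    loss = (alpha * content_loss(vgg_features(target), vgg_features(content_img))
            + beta * style_loss(vgg_features(target), vgg_features(style_img)))
    loss.backward()    # gradients flow into the pixels; VGG's weights stay fixed
    optimizer.step()
```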

The neat thing about this approach is that it’s easy to see the effect of each loss function separately (the paper includes figures illustrating this).

There’s a detail I glossed over above, which is that you have to choose which layer (or layers) of the network to use for each loss. Basically, earlier / lower layers are closer to the pixel values, and later / higher layers capture more complex, semantic structure. To get good results you have to try out different layers and weightings of layers, as well as different weights in the linear combination of the losses (the authors used content-to-style ratios of 1e-3 to 1e-4).
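
As a concrete example of those choices, one configuration the paper uses for its figures takes content from a single mid-level layer and style from five layers with equal weights. In the usual VGG-19 layer naming, that’s roughly:

```python
# Layer choices as reported in the paper; other combinations give different looks.
content_layers = ["conv4_2"]
style_layers = ["conv1_1", "conv2_1", "conv3_1", "conv4_1", "conv5_1"]
style_weights = {name: 1.0 / len(style_layers) for name in style_layers}  # w_l = 1/5
```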

Some math. The Gram matrix is

$$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$$

where l is the layer, the first subscript of the Fs is the filter channel, and the second subscript (the index k being summed over) is the spatial location. As I noted above, this throws away all spatial information.
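
In code, the Gram matrix for one layer is just the feature map flattened over space and multiplied by its own transpose. A sketch, assuming a PyTorch tensor of shape (channels, height, width):

```python
import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    # features: activations F^l for one layer, shape (channels, height, width)
    c, h, w = features.shape
    f = features.reshape(c, h * w)  # one row per filter; columns are spatial positions
    return f @ f.t()                # G^l_ij = sum_k F^l_ik * F^l_jk
```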

The losses are

$$\mathcal{L}_{\mathrm{content}} = \frac{1}{2} \sum_{i,j} \left( F^l_{ij} - P^l_{ij} \right)^2$$

and

$$\mathcal{L}_{\mathrm{style}} = \sum_l \frac{w_l}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - A^l_{ij} \right)^2$$

Here F is the features of the target image and P is the features of the original content image; similarly, G is the Gram matrix of the target image and A is the Gram matrix of the style image. N_l and M_l are the number of filters and the number of spatial positions in layer l, and the w_l weight each style layer’s contribution.
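
Here’s a sketch of both losses at a single layer, reusing `gram_matrix` from above. The `style_loss` used in the earlier loop would sum these per-layer style terms over the chosen layers, weighted by w_l:

```python
import torch

def content_loss(target_feats: torch.Tensor, content_feats: torch.Tensor) -> torch.Tensor:
    # (1/2) * sum_ij (F_ij - P_ij)^2
    return 0.5 * ((target_feats - content_feats) ** 2).sum()

def style_loss_layer(target_feats: torch.Tensor, style_feats: torch.Tensor) -> torch.Tensor:
    # 1 / (4 N^2 M^2) * sum_ij (G_ij - A_ij)^2, with N filters and M spatial positions
    n, h, w = target_feats.shape
    m = h * w
    g_target = gram_matrix(target_feats)  # G: Gram matrix of the target image
    g_style = gram_matrix(style_feats)    # A: Gram matrix of the style image
    return ((g_target - g_style) ** 2).sum() / (4 * n**2 * m**2)
```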

I’d say this paper is emblematic of the promise of deep learning. You start with a simple idea, define some loss functions, and a fair amount of fiddling later, you get some pretty amazing results!
