Comparing Artificial Artists

Kyle McDonald
Sep 1, 2015 · 6 min read

Last Wednesday, “A Neural Algorithm of Artistic Style” was posted to ArXiv, featuring some of the most compelling imagery generated by deep convolutional neural networks (DCNNs) since Google Research’s “DeepDream” post.

Excerpt of figure 2 from “A Neural Algorithm of Artistic Style”

On Sunday, Kai Sheng Tai posted the first public implementation. I immediately stopped working on my implementation and started playing with his. Unfortunately, his results don’t quite match the paper, and it’s unclear why. I’m just getting started with this topic, so as I learn I want to share my understanding of the algorithm here, along with some results I got from testing his code.


In two parts, the paper describes an algorithm for rendering a photo in the style of a given painting:

  1. Run an image through a DCNN trained for image classification. Stop at one of the convolutional layers, and extract the activations of every filter in that layer. Now run an image of noise through the net, and check its activations at that layer. Make small changes to the noisy input image until the activations match, and you will eventually construct a similar image. They call this “content reconstruction”, and depending on what layer you do it at you get varying accuracy.
Content reconstruction, excerpt from figure 1.

  2. Instead of trying to match the activations exactly, try to match the correlation of the activations. They call this “style reconstruction”, and depending on the layer you reconstruct you get varying levels of abstraction. The correlation feature they use is called a Gram matrix: the dot product between the vectorized feature activation matrix and its transpose. If this sounds confusing, see the footnotes.

Style reconstruction, excerpt from figure 1.

Finally, instead of optimizing for just one of these things, they optimize for both simultaneously: the style of one image, and the content of another image.
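In rough numpy, the combined objective looks something like this. This is only a sketch on bare arrays standing in for per-layer activations; the single-layer simplification and the alpha/beta weights are mine for illustration, not the paper's:

```python
import numpy as np

def gram(F):
    # Gram matrix of vectorized feature maps F, shape (N filters, M positions)
    return F @ F.T

def content_loss(F, P):
    # match generated activations F to the content image's activations P
    return 0.5 * np.sum((F - P) ** 2)

def style_loss(F, A):
    # match Gram matrices of generated (F) and style (A) activations
    N, M = F.shape
    return np.sum((gram(F) - gram(A)) ** 2) / (4.0 * N ** 2 * M ** 2)

def total_loss(F, P, A, alpha=1.0, beta=1e3):
    # optimize both simultaneously: the content of one image, the style of another
    return alpha * content_loss(F, P) + beta * style_loss(F, A)
```

The generated image is then obtained by adjusting the input until this total loss is small.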

Here is an attempt to recreate the results from the paper using Kai’s implementation:

In the style of “Composition VII” by Kandinsky
In the style of “The Scream” by Munch
In the style of “Seated Nude” by Picasso
In the style of “The Shipwreck of the Minotaur” by Turner.
In the style of “The Starry Night” by Gogh

Not quite the same, and possibly explained by a few differences between Kai’s implementation and the original paper:

  • Kai uses SGD, while the original paper does not specify what optimization technique is used. In an earlier texture synthesis paper the authors use L-BFGS.
Tübingen in the style of Kandinsky, in an attempt to recreate figure 3.
  • As I was writing this, Kai added total variation smoothing. This certainly helps with the high-frequency noise, but the fact that the original paper does not mention any similar regularization makes me wonder if they achieve this another way.
Comparison of Kai’s implementation without smoothing (left) and with.
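For reference, a total variation penalty of this kind can be sketched in a few lines of numpy; the weight here is illustrative, not Kai's actual value:

```python
import numpy as np

def tv_loss(img, weight=1e-3):
    # anisotropic total variation: penalize squared differences
    # between horizontally and vertically adjacent pixels
    dx = img[:, 1:] - img[:, :-1]
    dy = img[1:, :] - img[:-1, :]
    return weight * (np.sum(dx ** 2) + np.sum(dy ** 2))
```

Adding this term to the objective discourages the pixel-level speckle visible in the unsmoothed result.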

As a final comparison, consider the images Andrej Karpathy posted from his own implementation.

Gandalf in the style of Picasso. Left image produced by Andrej Karpathy.

The same large-scale, high-level features are missing here, just like in the style reconstruction of “Seated Nude” above.


Besides Kai’s, I’ve seen one more implementation, from a PhD student named Satoshi: a brief example in Python with Chainer. I haven’t spent as much time with it, as I had to adapt it to run on my CPU due to lack of memory. But I did notice:

  • It uses content-to-style ratios in a range closer to the original paper’s. Changing this by an order of magnitude doesn’t seem to have as big an effect.

After running Tübingen in the style of The Starry Night with a 1:10e3 ratio and 100 iterations, it seems to converge on something matching the general structure but lacking the overall palette:

Generated using Satoshi’s example.

I’d like to understand this algorithm well enough to generalize it to other media (mainly thinking about sound right now), so if you have any insights or other implementations, please share them in the comments!

Update

I’ve started testing another implementation that popped up this morning from Justin Johnson. It follows the original paper very closely, except for using unequal weights when balancing the different layers used for style reconstruction. All the following examples were run for 100 iterations with the default ratio of 1:10e0.

In the style of “Composition VII” by Kandinsky.
In the style of “The Scream” by Munch.
In the style of “Seated Nude” by Picasso
In the style of “The Shipwreck of the Minotaur” by Turner
In the style of “The Starry Night” by Gogh
Gandalf in the style of “A Muse” by Picasso

Final Update

Justin switched his implementation to use L-BFGS and equally weighted layers, and to my eyes this matches the results in the original paper. Here are his results for one of the harder content/style pairs:
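For intuition about why L-BFGS suits this kind of reconstruction, here is a toy sketch using scipy: a random linear map stands in for the network's activations, and we recover an input whose activations match a target's. Everything here (the map W, the shapes) is invented for illustration, not drawn from any of the implementations:

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

# Toy reconstruction problem: recover an input whose "activations" under a
# fixed random linear map W match a target's. A real implementation instead
# backpropagates through the DCNN to get the gradient.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
target = rng.standard_normal(16)
target_act = W @ target

def loss_and_grad(x):
    # squared activation-matching loss and its analytic gradient
    diff = W @ x - target_act
    return 0.5 * diff @ diff, W.T @ diff

x_opt, f_opt, info = fmin_l_bfgs_b(loss_and_grad, np.zeros(16))
```

L-BFGS converges on this kind of smooth, deterministic objective in far fewer iterations than plain SGD, with no learning rate to tune.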

In the style of “Seated Nude” by Picasso

Other implementations that look great, but I haven’t tested enough:

  • neural_artistic_style by Anders Boesen Lindbo Larsen, in Python using DeepPy and cuDNN. Example images look great, and are very high resolution.

Footnotes

The definition of the Gram matrix confused me at first, so I wrote it out as code. Using a literal translation of equation 3 in the paper, you would write in Python, with numpy:

import numpy as np

def gram(layer):
    # layer: feature activations with shape (1, N filters, height, width)
    N = layer.shape[1]
    F = layer.reshape(N, -1)  # vectorize each filter's activations
    M = F.shape[1]
    G = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            for k in range(M):
                G[i, j] += F[i, k] * F[j, k]
    return G

It turns out the Gram matrix can be computed much more efficiently than this literal translation, as a single matrix product. For example, Kai writes in Lua, with Torch:

function gram(input)
    local k = input:size(2)
    local flat = input:view(k, -1)
    local gram = torch.mm(flat, flat:t())
    return gram
end

Satoshi computes it for all the layers simultaneously in Python with Chainer:

conv1_1F, conv2_1F, conv3_1F, conv4_1F, conv5_1F = [reshape2(x) for x in [conv1_1, conv2_1, conv3_1, conv4_1, conv5_1]]
conv1_1G, conv2_1G, conv3_1G, conv4_1G, conv5_1G = [Fu.matmul(x, x, transa=False, transb=True) for x in [conv1_1F, conv2_1F, conv3_1F, conv4_1F, conv5_1F]]

Or again in Python, with numpy and Caffe layers:

def gram(layer):
    # layer: feature activations with shape (1, N filters, height, width)
    F = layer.reshape(layer.shape[1], -1)
    return np.dot(F, F.T)
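As a quick sanity check, the literal triple loop and the single matrix product agree on random data (the shapes here are invented for illustration):

```python
import numpy as np

# random stand-in for a layer's activations: (batch, filters, height, width)
rng = np.random.default_rng(1)
layer = rng.standard_normal((1, 4, 5, 5))

N = layer.shape[1]
F = layer.reshape(N, -1)
G_fast = F @ F.T  # vectorized Gram matrix

# literal translation of equation 3: G[i, j] = sum_k F[i, k] * F[j, k]
G_slow = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        G_slow[i, j] = np.dot(F[i], F[j])

assert np.allclose(G_fast, G_slow)
```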
