Comparing Artificial Artists

Kyle McDonald
Sep 1, 2015

Last Wednesday, “A Neural Algorithm of Artistic Style” was posted to arXiv, featuring some of the most compelling imagery generated by deep convolutional neural networks (DCNNs) since Google Research’s “DeepDream” post.

Excerpt of figure 2 from “A Neural Algorithm of Artistic Style”

On Sunday, Kai Sheng Tai posted the first public implementation. I immediately stopped working on my implementation and started playing with his. Unfortunately, his results don’t quite match the paper, and it’s unclear why. I’m just getting started with this topic, so as I learn I want to share my understanding of the algorithm here, along with some results I got from testing his code.


In two parts, the paper describes an algorithm for rendering a photo in the style of a given painting:

  1. Run an image through a DCNN trained for image classification. Stop at one of the convolutional layers, and extract the activations of every filter in that layer. Now run an image of noise through the net, and check its activations at that layer. Make small changes to the noisy input image until the activations match, and you will eventually construct a similar image. They call this “content reconstruction”, and depending on which layer you do it at, you get varying accuracy.
Content reconstruction, excerpt from figure 1.

  2. Instead of trying to match the activations exactly, try to match the correlation of the activations. They call this “style reconstruction”, and depending on the layer you reconstruct, you get varying levels of abstraction. The correlation feature they use is called a Gram matrix: the dot product between the vectorized feature activation matrix and its transpose. If this sounds confusing, see the footnotes.

Style reconstruction, excerpt from figure 1.

Finally, instead of optimizing for just one of these things, they optimize for both simultaneously: the style of one image, and the content of another image.
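
To make the combined objective concrete, here is a minimal numpy sketch of the losses as I understand them from the paper. The function names, array shapes, and default weights are my own assumptions, and in practice the activations would come from a DCNN such as VGG-19.

import numpy as np

def content_loss(F, P):
    # F, P: (filters x positions) activations of the generated and
    # content images at a single layer
    return 0.5 * np.sum((F - P) ** 2)

def gram(F):
    # correlations between filter responses: the "style" feature
    return np.dot(F, F.T)

def style_loss(F, A):
    # F, A: activations of the generated and style images at a single layer
    N, M = F.shape
    return np.sum((gram(F) - gram(A)) ** 2) / (4.0 * N ** 2 * M ** 2)

def total_loss(content_pairs, style_pairs, alpha=1.0, beta=1e3):
    # content_pairs, style_pairs: lists of (generated, target) activation
    # matrices; alpha:beta is the content:style weighting discussed below
    content = sum(content_loss(F, P) for F, P in content_pairs)
    style = sum(style_loss(F, A) for F, A in style_pairs) / len(style_pairs)
    return alpha * content + beta * style

Starting from noise (or from the content image), you then adjust the input pixels to minimize this total loss.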

Here is an attempt to recreate the results from the paper using Kai’s implementation:

In the style of “Composition VII” by Kandinsky
In the style of “The Scream” by Munch
In the style of “Seated Nude” by Picasso
In the style of “The Shipwreck of the Minotaur” by Turner
In the style of “The Starry Night” by van Gogh

The results are not quite the same, which might be explained by a few differences between Kai’s implementation and the original paper:

  • Using SGD, while the original paper does not specify which optimization technique is used. In an earlier texture synthesis paper, the authors use L-BFGS.
  • Initializing with the content image rather than noise.
  • Using the Inception network instead of VGG-19.
  • To balance the content reconstruction with the style reconstruction, the paper uses a weighting of 1:10e1 or 1:10e2, while Kai uses 1:5e9, which is a huge and unexplained difference. Running even slightly lower, around 1:10e8, it converges mainly on the content reconstruction, only vaguely matching the palette of the style image:
Tübingen in the style of Kandinsky, in an attempt to recreate figure 3.
  • As I was writing this, Kai added total variation smoothing (a rough sketch of the penalty follows this list). This certainly helps with the high-frequency noise, but the fact that the original paper does not mention any similar regularization makes me wonder if they achieve this another way.
Comparison of Kai’s implementation without smoothing (left) and with.
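
For reference, total variation smoothing is just a penalty on differences between neighboring pixels, added to the objective to discourage high-frequency noise. A minimal numpy sketch, with an array layout and weight that are my own assumptions rather than Kai’s code:

import numpy as np

def tv_loss(img, weight=1e-3):
    # img: (height, width, channels) float image
    # penalize squared differences between horizontal and vertical neighbors
    dx = img[:, 1:, :] - img[:, :-1, :]
    dy = img[1:, :, :] - img[:-1, :, :]
    return weight * (np.sum(dx ** 2) + np.sum(dy ** 2))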

As a final comparison, consider the images Andrej Karpathy posted from his own implementation.

Gandalf in the style of Picasso. Left image produced by Andrej Karpathy.

The same large-scale, high-level features are missing here, just like in the style reconstruction of “Seated Nude” above.


Besides Kai’s, I’ve seen one more implementation, from a PhD student named Satoshi: a brief example in Python with Chainer. I haven’t spent as much time with it, as I had to adapt it to run on my CPU due to lack of memory. But I did notice:

  • It uses content-to-style ratios in a range closer to the original paper’s. Changing this by an order of magnitude doesn’t seem to have as big an effect.
  • It tries to initialize with noise.
  • It uses VGG-19.

After running Tübingen in the style of The Starry Night with a 1:10e3 ratio and 100 iterations, it seems to converge on something matching the general structure but lacking the overall palette:

Generated using Satoshi’s example.

I’d like to understand this algorithm well enough to generalize it to other media (mainly thinking about sound right now), so if you have any insights or other implementations, please share them in the comments!

Update

I’ve started testing another implementation that popped up this morning from Justin Johnson. His follows the original paper very closely, except for using unequal weights when balancing the different layers used for style reconstruction. All the following examples were run for 100 iterations with the default ratio of 1:10e0.

In the style of “Composition VII” by Kandinsky.
In the style of “The Scream” by Munch.
In the style of “Seated Nude” by Picasso
In the style of “The Shipwreck of the Minotaur” by Turner
In the style of “The Starry Night” by van Gogh
Gandalf in the style of “A Muse” by Picasso

Final Update

Justin switched his implementation to use L-BFGS and equally weighted layers, and to my eyes this matches the results in the original paper. Here are his results for one of the harder content/style pairs:

In the style of “Seated Nude” by Picasso
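
If the optimize-the-input idea still feels abstract, here is a toy, self-contained numpy/scipy sketch of the same mechanics: start from noise and let L-BFGS adjust the input until its activations under a filter bank match a target. Everything here (the random linear “layer”, the shapes, the names) is purely illustrative and not taken from any of the implementations above; a real implementation swaps the matrix for a forward pass through VGG-19 and gets the gradient by backpropagation.

import numpy as np
from scipy.optimize import fmin_l_bfgs_b

rng = np.random.RandomState(0)
n_pixels, n_filters = 64, 32
W = rng.randn(n_filters, n_pixels)   # stand-in for one convolutional layer
content = rng.rand(n_pixels)         # stand-in for the content image
P = W.dot(content)                   # target activations to reconstruct

def loss_and_grad(x):
    # content loss and its gradient with respect to the input pixels
    diff = W.dot(x) - P
    return 0.5 * np.sum(diff ** 2), W.T.dot(diff)

x0 = rng.randn(n_pixels)             # initialize with noise
x_opt, final_loss, info = fmin_l_bfgs_b(loss_and_grad, x0, maxfun=100)
print(final_loss)                    # close to zero: activations matched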

Other implementations that look great, but I haven’t tested enough:

  • neural_artistic_style by Anders Boesen Lindbo Larsen, in Python using DeepPy and cuDNN. Example images look great, and are very high resolution.
  • styletransfer by Eben Olson, in Python using Lasagne. Replicates both style transfer and style reconstruction from noise.

Footnotes

The definition of the Gram matrix confused me at first, so I wrote it out as code. Using a literal translation of equation 3 in the paper, you would write in Python, with numpy:

import numpy as np

def gram(layer):
    # layer: activations with shape (1, filters, height, width)
    N = layer.shape[1]          # number of filters
    F = layer.reshape(N, -1)    # one row per filter, flattened over positions
    M = F.shape[1]              # number of positions
    G = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            for k in range(M):
                G[i, j] += F[i, k] * F[j, k]
    return G

It turns out the same matrix can be computed much more efficiently than this literal translation. For example, Kai writes in Lua, with Torch:

function gram(input)
    local k = input:size(2)
    local flat = input:view(k, -1)
    local gram = torch.mm(flat, flat:t())
    return gram
end

Satoshi computes it for all the layers simultaneously in Python with Chainer:

conv1_1F, conv2_1F, conv3_1F, conv4_1F, conv5_1F = [
    reshape2(x) for x in [conv1_1, conv2_1, conv3_1, conv4_1, conv5_1]]
conv1_1G, conv2_1G, conv3_1G, conv4_1G, conv5_1G = [
    Fu.matmul(x, x, transa=False, transb=True)
    for x in [conv1_1F, conv2_1F, conv3_1F, conv4_1F, conv5_1F]]

Or again in Python, with numpy and Caffe layers:

def gram(layer):
    # layer: blob data with shape (1, filters, height, width)
    F = layer.reshape(layer.shape[1], -1)
    return np.dot(F, F.T)
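
For concreteness, a quick usage sketch: calling the vectorized version just above on a fake activation volume (the shape is illustrative, not tied to any particular network):

layer = np.random.rand(1, 64, 56, 56)   # (batch, filters, height, width)
G = gram(layer)                          # filter-to-filter correlations
print(G.shape)                           # (64, 64)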
