Last Wednesday, “A Neural Algorithm of Artistic Style” was posted to ArXiv, featuring some of the most compelling imagery generated by deep convolutional neural networks (DCNNs) since Google Research’s “DeepDream” post.
On Sunday, Kai Sheng Tai posted the first public implementation. I immediately stopped working on my implementation and started playing with his. Unfortunately, his results don’t quite match the paper, and it’s unclear why. I’m just getting started with this topic, so as I learn I want to share my understanding of the algorithm here, along with some results I got from testing his code.
In two parts, the paper describes an algorithm for rendering a photo in the style of a given painting:
1. Run an image through a DCNN trained for image classification. Stop at one of the convolutional layers, and extract the activations of every filter in that layer. Now run an image of noise through the net, and check its activations at that layer. Make small changes to the noisy input image until the activations match, and you will eventually construct a similar image. They call this “content reconstruction”, and depending on what layer you do it at you get varying accuracy.
2. Instead of trying to match the activations exactly, try to match the correlation of the activations. They call this “style reconstruction”, and depending on the layer you reconstruct you get varying levels of abstraction. The correlation feature they use is called a Gram matrix: the dot product between the vectorized feature activation matrix and its transpose. If this sounds confusing, see the footnotes.
Finally, instead of optimizing for just one of these things, they optimize for both simultaneously: the style of one image, and the content of another image.
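To make the combined objective concrete, here is a minimal numpy sketch of my reading of the paper's loss at a single layer. The activation arrays, shapes, and default weights here are hypothetical, chosen just for illustration; the real algorithm sums the style term over several layers and backpropagates through the network to update the input image.

```python
import numpy as np

def gram(F):
    # Gram matrix: dot product between the vectorized feature
    # activation matrix (one row per filter) and its transpose
    return F @ F.T

def combined_loss(content_acts, style_acts, gen_acts, alpha=1.0, beta=1e3):
    # content term: squared error between the generated image's
    # activations and the content image's activations
    content_loss = 0.5 * np.sum((gen_acts - content_acts) ** 2)
    # style term: squared error between Gram matrices, normalized
    # by the number of filters N and feature map positions M
    N, M = gen_acts.shape
    style_loss = np.sum((gram(gen_acts) - gram(style_acts)) ** 2) / (4 * N**2 * M**2)
    # the alpha:beta ratio is the content/style weighting discussed below
    return alpha * content_loss + beta * style_loss
```

Optimization then amounts to nudging the input image to reduce this value, so the generated image's activations match the content image while its Gram matrices match the style image.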
Here is an attempt to recreate the results from the paper using Kai’s implementation:
Not quite the same, and possibly explained by a few differences between Kai’s implementation and the original paper:
- Using SGD, while the original paper does not specify what optimization technique is used. In an earlier texture synthesis paper the authors use L-BFGS.
- Initializing with the content image rather than noise.
- Using the Inception network instead of VGG-19.
- To balance the content reconstruction with the style reconstruction, the paper uses a weighting of 1:10e1 or 1:10e2, while Kai uses 1:5e9, which is a huge and unexplained difference. Running it even slightly lower, around 1:10e8, it converges mainly on the content reconstruction, only vaguely matching the palette of the style image:
- As I was writing this, Kai added total variational smoothing. This certainly helps with the high frequency noise, but the fact that the original paper does not mention any similar regularization makes me wonder if they achieve this another way.
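For reference, total variation smoothing penalizes differences between neighboring pixels, which discourages high frequency noise in the generated image. A minimal numpy sketch for a single-channel image (the weight here is hypothetical, and implementations differ in whether they square the differences or take absolute values):

```python
import numpy as np

def tv_loss(img, weight=1e-4):
    # squared differences between vertically adjacent pixels
    dh = img[1:, :] - img[:-1, :]
    # squared differences between horizontally adjacent pixels
    dw = img[:, 1:] - img[:, :-1]
    return weight * (np.sum(dh ** 2) + np.sum(dw ** 2))
```

A perfectly flat image scores zero, and speckled noise scores high, so adding this term to the objective trades a little fidelity for smoothness.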
As a final comparison, consider the images Andrej Karpathy posted from his own implementation.
The same large-scale, high-level features are missing here, just like in the style reconstruction of “Seated Nude” above.
Besides Kai’s, I’ve seen one more implementation from a PhD student named Satoshi: a brief example in Python with Chainer. I haven’t spent as much time with it, as I had to adapt it to run on my CPU due to lack of memory. But I did notice:
- It uses content to style ratios in a more similar range to the original paper. Changing this by an order of magnitude doesn’t seem to have as big an effect.
- It tries to initialize with noise.
- It uses VGG-19.
After running Tübingen in the style of The Starry Night with a 1:10e3 ratio and 100 iterations, it seems to converge on something matching the general structure but lacking the overall palette:
I’d like to understand this algorithm well enough to generalize it to other media (mainly thinking about sound right now), so if you have any insights or other implementations please share them in the comments!
I’ve started testing another implementation that popped up this morning from Justin Johnson. His follows the original paper very closely, except for using unequal weights when balancing the different layers used for style reconstruction. All the following examples were run for 100 iterations with the default ratio of 1:10e0.
Justin switched his implementation to use L-BFGS and equally weighted layers, and to my eyes this matches the results in the original paper. Here are his results for one of the harder content/style pairs:
Other implementations that look great, but I haven’t tested enough:
- neural_artistic_style by Anders Boesen Lindbo Larsen, in Python using DeepPy and cuDNN. Example images look great, and are very high resolution.
- styletransfer by Eben Olson, in Python using Lasagne. Replicates both style transfer and style reconstruction from noise.
The definition of the Gram matrix confused me at first, so I wrote it out as code. Using a literal translation of equation 3 in the paper, you would write in Python, with numpy:
N = layer.shape[0]
F = layer.reshape(N, -1)
M = F.shape[1]
G = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        for k in range(M):
            G[i, j] += F[i, k] * F[j, k]
It turns out the Gram matrix can be computed much more efficiently than this literal translation. For example, Kai writes in Lua, with Torch:
local k = input:size(2)
local flat = input:view(k, -1)
local gram = torch.mm(flat, flat:t())
Satoshi computes it for all the layers simultaneously in Python with Chainer:
conv1_1F,conv2_1F,conv3_1F,conv4_1F,conv5_1F, = [ reshape2(x) for x in [conv1_1,conv2_1, conv3_1, conv4_1,conv5_1]]
conv1_1G,conv2_1G,conv3_1G,conv4_1G,conv5_1G, = [ Fu.matmul(x, x, transa=False, transb=True) for x in [conv1_1F,conv2_1F, conv3_1F, conv4_1F,conv5_1F]]
Or again in Python, with numpy and Caffe layers:
F = layer.reshape(layer.shape[0], -1)
return np.dot(F, F.T)
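As a sanity check on my understanding, the literal triple loop and the vectorized version should agree on random activations. The shapes here are made up just for the test:

```python
import numpy as np

np.random.seed(0)
# pretend activations: 4 filters over a 3x3 feature map
layer = np.random.randn(4, 3, 3)

# literal translation of equation 3
N = layer.shape[0]
F = layer.reshape(N, -1)
M = F.shape[1]
G_loop = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        for k in range(M):
            G_loop[i, j] += F[i, k] * F[j, k]

# vectorized version
G_vec = np.dot(F, F.T)

print(np.allclose(G_loop, G_vec))  # prints True
```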