Neural Style Transfer with Swift for TensorFlow

10 min readMay 2, 2019

**Left:** The Painted Ladies by King of Hearts / Wikimedia Commons / CC-BY-SA-3.0 | **Middle:** A Starry Night by Vincent van Gogh, c. June 1889 | **Right:** Image generated by employing the Neural Style Transfer algorithm, Gatys et al.

What is Neural Style Transfer?

Simply put, Neural Style Transfer [Gatys et al.] is a process by which we take the style of one image, the content of another image and generate a new image that exhibits the same stylistic features as the style image while preserving the high-level structure of the content image.

The neural part, as you might expect, is where the real magic happens. If you’ve read about or taken a course on Convolutional Neural Networks, you’ve probably encountered the wonderful paper Visualizing and Understanding Convolutional Networks by Zeiler and Fergus. This paper keenly illustrates the fact that each kernel in a conv net acts as a sort of feature extractor. The features start out primitive and local but as the net goes deeper, the features become more abstract and global. Neural Style Transfer exploits this fact, extracting style information from the activations of various layers. We also extract content information from the higher layers of the network. We use this information to compute what is referred to in the literature as the “Perceptual Loss”.

Given an input image, our goal is to minimize Perceptual Loss with respect to that image. In order to compute the Perceptual Loss, we need the following:

A style image.
A content image.
A styled image. This is the output. We can initialize it a number of ways (which we’ll explore below).
A way of extracting style information from layer activations.
A pre-trained conv net (we’ll use VGG-19 [Karen Simonyan, Andrew Zisserman]).

You’ll need the latest Swift for TensorFlow toolchain. See https://github.com/tensorflow/swift for installation instructions, etc. Also, Jeremy over at fast.ai wrote up a great install guide. Note: You’ll probably need a relatively modern GPU to run this yourself. It might run on your CPU, albeit very slowly.

The code for this blog post is available as a jupyter notebook, along with all of the other resources, on GitHub. Assuming you have a working Swift for TensorFlow (and swift-jupyter) installation on Ubuntu, it might be more ideal to follow along there.

Install dependencies via SPM

Installing packages: .package(url: "https://github.com/mxcl/Path.swift", from: "0.16.1") Path With SwiftPM flags: [] Working in: /tmp/tmp7lgo13l0/swift-install Compile Swift Module 'jupyterInstalledPackages' (1 sources) Initializing Swift... Installation complete!

Import dependencies

This is a utility for loading VGG-19 weights from a checkpoint file. See the GitHub repo for details.

Set up an extension on String that allows us to conveniently run shell commands via string literals! This is from the fast.ai swift notebooks.

Grab the VGG-19 checkpoint

Make sure you have wget installed on your system. If you’d rather not use it, you can try using curl instead.

Add a gram matrix method to `Tensor`

We use the gram matrix to extract the correlation structure between the filters of a given layer. This effectively allows us to decouple texture information from the global arrangement of the scene. We’ll use this later when we calculate the perceptual loss. See the paper for more details.

Define a layer for the input image

Disclaimer: This is probably not the best way to handle this, but after trying a few alternatives, I landed on this. The idea is that we want to freeze the parameters of the Conv Net and only update the input image during back prop. At the time of writing this, I couldn’t find a straightforward way to A) freeze a layer while still computing all of the gradients and B) have the input image be a tune-able parameter. I have some ideas on how I might be able to better handle this, but this works well enough for now.

Define a struct to store the output activations

Swift for TensorFlow also doesn’t currently support differentiable control flow which leads to more repetitious code, though I’ve been told that this feature is coming very soon! It also doesn’t support pulling activations out of a net after a forward pass. There are definitely work arounds for both of these issues but they involve writing writing some less-than-straightforward workarounds. At some point this functionality will be built into the Swift for TensorFlow deep learning framework.

Define a concrete pooling type

We’ll use this to easily swap which type of pooling we’re using. Again, once differentiable control flow lands, this sort of thing shouldn’t be necessary. I tried getting this to work nicely with generics but to no avail.

Implement the VGG19 architecture without the classification head

This code should look fairly straightforward. Again, I could probably reduce some of the repition but this is just a first pass on this to get something up and running. I may do an update where I go in and refactor some of this (especially once differentiable control flow is supported).

Tie the two layers together

I chose to compose the two layers into one here. I had written this prior to S4TF 0.3 which I believe introduced a better way to sequence layers together. I’ll probably revisit this also.

Define an optimizer

This is probably the hackiest part of this. I’m sure there’s a way to use the existing Adam optimizer and freeze all the parameters in the network, updating only the image, however I couldn’t figure it out and this was the best I could come up with. It works, but I’ll definitely try to revisit this on my next pass.

This code is identical to the Adam definition in the S4TF Deep Learning library except for the fact that it breaks out of the for loop after updating the first differentiable variable. The first differentiable variable happens the image :)

Total variation loss

Total variation loss is a regularization technique that is commonly used to denoise images. Unfortunately this currently causes the GPU memory to grow unbounded and triggers an OOM error. There’s probably a clever way to avoid this, but the results look pretty good without it. I’ve kept it here for reference. Perhaps someone can spot what I’m doing wrong and let me know ;)

Perceptual Loss

This is the loss function we’ll use. We compute the mean squared error (MSE) between the target content activations and the styled image. We then do the same for the style activation layers but instead of computing the MSE of the raw activations, we compute the MSE of the gram matrix of the activations. Each style and content layer loss is then scaled by its own weight. Lastly, every thing is summed up and returned as our final loss.

A clamping utility

This utility is used to keep the color values in the range that VGG19 expects to see, i.e. +/- the imagenet mean pixel values (in BGR). Without this, regions of the image would over excite the network causing clipped regions and aberrant noise. Note: It might be an interesting experiment to clamp these values with a smooth function instead of min/max. It might reduce the noise a bit.

Use Python interoperability to show images via matplotlib

Image processing utilities

These functions are used to load up the images into Float tensors and pre/post process them. This pre-trained VGG-19 model was trained on images in BGR channel order that were normalized by the imagenet mean. Note that they did not divide by the standard deviation so values be in the range of [-mean, mean].

Set up matplotlib for inline display

('inline', 'module://ipykernel.pylab.backend_inline')

A utility to display an image tensor

Define a struct to hold the training results

This is where the results of training get stored so we can see how the training progressed over time. It also has a function that will plot the output images in a grid. This is nifty when trying out and comparing different hyper-parameters.

Peek at the style and content images

We’ll use the images from the graphic at the beginning of the post. Note: If you run into memory issues, you can drop the size down to 256.

TensorShape(dimensions: [1, 512, 512, 3]) 
TensorShape(dimensions: [1, 512, 512, 3])

Define the training method

Right now I’ve only really needed to change the style weights, content weight, iteration count and learning rate, so that’s what’s exposed.

Tie things up into a function we can experiment with

This will allow us to play around with the weights, learning rate, etc.

Let’s finally perform style transfer to make an image

We’ll just use most of the default params for the style transfer function.

[Iteration 0 - Perceptual Loss: 2.778563e+08] 
[Iteration 50 - Perceptual Loss: 4116344.5] 
[Iteration 100 - Perceptual Loss: 2275590.2] 
[Iteration 150 - Perceptual Loss: 1557501.5] 
[Iteration 200 - Perceptual Loss: 1184193.0] 
[Iteration 250 - Perceptual Loss: 1057808.2] 
[Iteration 300 - Perceptual Loss: 863599.44] 
[Iteration 350 - Perceptual Loss: 778174.4] 
[Iteration 400 - Perceptual Loss: 737367.2] 
[Iteration 450 - Perceptual Loss: 676374.0] 
[Iteration 500 - Perceptual Loss: 946441.3]

Pretty cool, huh? The original paper uses L-BFGS which is a pretty memory intensive optimization algorithm (at least compared to standard NN optimzation algorithms like Adam). L-BFGS tends to produce better results, but I think this looks pretty good. Of couse we’d need to try it on more than one style/content pair to make a fair assessment.

Let’s look at how the image progressed over time

Notice how it actually converges to something pretty reasonable after about 100 iterations

Now let’s try initializing with the style image instead

We’ll use higher content weights this time around. As one might imagine starting with the style image will bias towards the global structure of the style image. We could alternatively tune the style weights down a bit, but let’s see where this gets us.

[Iteration 0 - Perceptual Loss: 16007397.0] 
[Iteration 50 - Perceptual Loss: 10292353.0] 
[Iteration 100 - Perceptual Loss: 9144543.0] 
[Iteration 150 - Perceptual Loss: 7475930.0] 
[Iteration 200 - Perceptual Loss: 7086185.5] 
[Iteration 250 - Perceptual Loss: 6741186.5] 
[Iteration 300 - Perceptual Loss: 6937329.0] 
[Iteration 350 - Perceptual Loss: 6654474.0] 
[Iteration 400 - Perceptual Loss: 6679740.5] 
[Iteration 450 - Perceptual Loss: 11030902.0] 
[Iteration 500 - Perceptual Loss: 6526403.5]

So as you can tell, the global structure of the style image is still very present. This is definitely a more abstract result!

Let’s try to tune the style weights

[Iteration 0 - Perceptual Loss: 16007397.0] 
[Iteration 50 - Perceptual Loss: 6788499.0] 
[Iteration 100 - Perceptual Loss: 5649164.0] 
[Iteration 150 - Perceptual Loss: 5271180.0] 
[Iteration 200 - Perceptual Loss: 5076626.5] 
[Iteration 250 - Perceptual Loss: 4990565.5] 
[Iteration 300 - Perceptual Loss: 4901208.0] 
[Iteration 350 - Perceptual Loss: 4857651.0] 
[Iteration 400 - Perceptual Loss: 4806595.0] 
[Iteration 450 - Perceptual Loss: 4777933.0] 
[Iteration 500 - Perceptual Loss: 4765858.5]

This definitely produces a clearer result. It’s worth showing the content image again for comparison.

Let’s try using average instead of max pooling

[Iteration 0 - Perceptual Loss: 279485.72] 
[Iteration 50 - Perceptual Loss: 202219.42] 
[Iteration 100 - Perceptual Loss: 169860.62] 
[Iteration 150 - Perceptual Loss: 156222.81] 
[Iteration 200 - Perceptual Loss: 148178.36] 
[Iteration 250 - Perceptual Loss: 143206.12] 
[Iteration 300 - Perceptual Loss: 140567.17] 
[Iteration 350 - Perceptual Loss: 137404.88] 
[Iteration 400 - Perceptual Loss: 135017.0] 
[Iteration 450 - Perceptual Loss: 133670.3] 
[Iteration 500 - Perceptual Loss: 132263.81]

**Left** Max Pooling — **Right** Average Pooling

It’s not necessarily an apples to apples comparison, but I think average pooling might be less noisy.

To close out, let’s test on some new images

Here’s our style image.

Starry Night Over the Rhône by Vincent van Gogh

We’ll apply it to this image of the Oakland Bay Bridge.

Dllu [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)]

I’ve made some tweaks to the weights and learning rate. One thing to point out, and it’s something others have observed as well, is that using Adam for style transfer requires a bit more tuning than L-BGFS

[Iteration 0 - Perceptual Loss: 2.8845638e+07]
[Iteration 50 - Perceptual Loss: 15046891.0]
[Iteration 100 - Perceptual Loss: 12473184.0]
[Iteration 150 - Perceptual Loss: 12153522.0]
[Iteration 200 - Perceptual Loss: 11189293.0]
[Iteration 250 - Perceptual Loss: 10998533.0]
[Iteration 300 - Perceptual Loss: 11554881.0]
[Iteration 350 - Perceptual Loss: 10926209.0]
[Iteration 400 - Perceptual Loss: 12236732.0]
[Iteration 450 - Perceptual Loss: 11646822.0]
[Iteration 500 - Perceptual Loss: 10367661.0]

Closing thoughts

I’d like to take another pass at this and refactor it to be more “Swifty”. I’d also like to take a stab at implementing L-BFGS in Swift for TensorFlow. Hopefully I can contribute this stuff back to the Swift for TensorFlow community.

Given the fact that Swift for TensorFlow is so new, I honestly wasn’t sure that I’d be able to get to this point. That’s why some of the code is kind of non-idiomatic — I was just trying to see if this was possible.This isn’t your run-of-the-mill classification problem. There’s still a lot of room for improvement, but I look forward to taking the time to get things running more smoothly. There are some cool ideas in this follow paper by Gatys et al. that I also want to try to implement.

Thanks for following along!

Here’s the repo again: https://github.com/regrettable-username/style-transfer