FADL2 L8: ‘Neural’ Artistic Style Transfer

I recently finished Lesson 8 of Jeremy Howard’s Practical Deep Learning (part 2, in beta). Here I’m going to go through some of the work and results from recreating the lesson’s Jupyter Notebook.

[NOTE: think part 2’s been renamed to ‘Cutting Edge Deep Learning’ — should be launched publicly soon. Also tl;dr: scroll to bottom for videos]

My rendition of the notebook can be followed along here: https://github.com/WNoxchi/Kaukasos/blob/master/FAI02/L8_NeuralArtTrsfr_CodeAlong.ipynb

After reviewing the library imports and the need to specify limit_mem() so TensorFlow doesn’t eat everything your GPU has to offer, I had to learn what ‘pickling’ was in Python. Turns out, courtesy of sentdex on YouTube, it’s just Python’s serialization. And what’s serialization? Writing a data structure out as a stream of bytes, one piece after another. It’s useful when you’re doing a lot of disk access (like reading data for ML), although it provides no security, so beware of using it on network streams.

I experimented by writing a mini dictionary of Latin:Mxedruli letters:

Learning to Pickle in Python (data serialization)
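In case it helps, here’s roughly what that experiment looked like (the specific letter pairs and the filename are just illustrative):

```python
import pickle

# a small Latin -> Georgian (Mxedruli) letter mapping
letters = {'a': 'ა', 'b': 'ბ', 'g': 'გ', 'd': 'დ'}

# serialize ("pickle") the dict to disk...
with open('letters.pkl', 'wb') as f:
    pickle.dump(letters, f)

# ...and deserialize it back
with open('letters.pkl', 'rb') as f:
    restored = pickle.load(f)

print(restored['a'])  # → ა
```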

So that’s that. I haven’t looked at sentdex’s newer ‘Pickling and Scaling — Practical Machine Learning with Python p.6’ video, but I may if I feel I need to.


Neural Style Transfer

NST creates a new image that retains the ‘style’ of one image and the form of another. For example: a picture of a cat painted in Vincent van Gogh’s style, made from two input pictures, one of a cat and another representing van Gogh’s painting style.

A few things are important here. The algorithm isn’t applying van Gogh’s style to the cat image itself: it’s starting with a fresh image (like a random bunch of pixels) and building it into the desired composite.

It’s going to need a model that’s already good enough at discerning features of both input images. So: “where are the eyes”, “this is a face”, “this brush-stroke is a thing”, etc.

The first step is initializing the model (here a modified VGG16 CNN, using Average Pooling and dropping the top block of fully-connected layers):

instantiating a VGG16 model using Average-Pooling instead of Max, and excluding the end-block of Fully-Connected layers (we want convolutional features, not classification here)

Next we define a computation graph in Keras. A computation graph, as I understand it so far, defines a series of operations on data stand-ins: no data has been passed in yet, but when it is, the program will know what to do. We use this to define the target image.

`targ` isn’t any ‘thing’ yet, but more of an operational placeholder for when this bit of computation occurs.

As you can see, targ is the output (prediction) of layer_model on img_arr. layer_model is a Model (keras.models.Model, the Keras Functional API) made up of the layers of model from its input up to layer, which we define to be the 1st convolutional layer of the 5th conv-block of the VGG16 model. img_arr is the original image, preprocessed the same way the VGG authors preprocessed their ImageNet data when training the model.

NOTE: layer is the output-activations from the end of our convolutional submodel. layer_model is the submodel defined from the beginning of the original model to layer. targ is the target-activations. The loss-function attempts to bring layer and targ as close together as possible.

That wording can be confusing, but it makes more sense when you go through it in code. We define several such graphs using Keras throughout this notebook.

In order to gauge how well it’s doing, the algorithm needs to quantify its progress somehow: a Loss / Objective Function.

To do this, we define the loss as the Mean-Squared Error between the layer’s output and the target activations.
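In framework-agnostic terms (plain numpy here, rather than the notebook’s Keras backend ops), the content loss is just:

```python
import numpy as np

def content_loss(layer_activations, target_activations):
    """Mean-squared error between the submodel's activations for the
    generated image and for the original (target) image."""
    return np.mean((layer_activations - target_activations) ** 2)

# toy activations: batch x height x width x channels
a = np.zeros((1, 4, 4, 3))
b = np.ones((1, 4, 4, 3))
content_loss(a, b)  # → 1.0
```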

evaluator is a custom-defined Evaluator class that lets the program access the loss function and gradients separately, since Keras returns them together but the optimizer needs them separately.
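The pattern looks roughly like this. The class mirrors the notebook’s, but the compiled Keras function is stood in for by a toy quadratic whose loss and gradients I can compute by hand:

```python
import numpy as np

class Evaluator:
    """scipy's optimizer wants loss and gradients as two separate
    callables, but our function returns them together: cache the
    gradients on the loss call, hand them back on the grads call."""
    def __init__(self, f, shp):
        self.f, self.shp = f, shp

    def loss(self, x):
        loss_, self.grad_values = self.f(x.reshape(self.shp))
        return loss_.astype(np.float64)

    def grads(self, x):
        return self.grad_values.flatten().astype(np.float64)

# stand-in for the compiled Keras function: loss and grads of sum((x-3)^2)
def f(x):
    return np.sum((x - 3.0) ** 2), 2.0 * (x - 3.0)

evaluator = Evaluator(f, shp=(2, 2))
evaluator.loss(np.zeros(4))  # → 36.0  (4 pixels, each (0-3)^2 = 9)
```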

This evaluator, along with the number of iterations and the random image x, goes into solve_image(•), which does the job of optimizing the loss.
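A stripped-down solve_image(•) might look something like this (again with a toy quadratic loss standing in for the real conv-activation loss; the actual notebook also de-processes and saves x at each iteration):

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b
from types import SimpleNamespace

def solve_image(eval_obj, niter, x):
    """Run a few L-BFGS-B steps per iteration, feeding the flattened
    pixels back in each time, and report the loss as it falls."""
    for i in range(niter):
        x, min_val, info = fmin_l_bfgs_b(eval_obj.loss, x.flatten(),
                                         fprime=eval_obj.grads, maxfun=20)
        print(f'Iteration {i}: loss {min_val:.6f}')
    return x

# toy objective: drive every pixel toward 3.0
toy = SimpleNamespace(loss=lambda x: np.sum((x - 3.0) ** 2),
                      grads=lambda x: 2.0 * (x - 3.0))

x = np.random.uniform(0, 1, 8)
x = solve_image(toy, niter=3, x=x)
```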

All this is to recreate an image from an original by applying its conv-features to a randomly-generated image and iteratively minimizing the loss.

The basic idea is: we run the original and a new random image through the submodel, and use our loss function to determine how close a match they are. We use the L-BFGS-B algorithm, via scipy.optimize.fmin_l_bfgs_b, as the optimizer that minimizes the loss. When we next work on incorporating style from a style image, we’ll use weights to balance how much we want to focus on recreating the original image vs. applying the style of the style image.


So what we eventually get, using my cat Terek as the original image:


And generating an image of random pixels:

And putting that through solve_image(•) for 10 iterations:

We get this unGodly monstrosity:

How cats actually see
Here’s an animation of it (the videos in this post were all first taken on Instagram as I was coding)

As bad as this may look, it’s actually really interesting. Remember this algorithm is taking a mosaic of random pixels and recreating a picture of a cat by feeding them through the CNN we defined above, and optimizing the pixels of the new (target) image based on the mean-squared error between its activations and the original’s activations in the network (the network’s own weights stay frozen). What this means is that the target image will be constructed according to what features are expressed in the convolutional network. So the model has to actually be able to tell eyes apart from the background and so on. Lower convolutional blocks give us finer features, and later ones the more macro features. So we can get an output that looks more like the original image by using earlier convolutional layers.


For recreating style, we’ll have to do something more. We’re changing the loss function by calculating it for multiple layers. Quoting the original notebook:

Whereas before we were calculating MSE of the raw convolutional outputs, here we transform them into the ‘Gramian Matrix’ of their channels (that is, the product of a matrix and its transpose) before taking their MSE. It’s unclear why this helps us achieve our goal, but it works. One thought is that the Gramian shows how our features at that convolutional layer correlate, and completely removes all location information. So matching the Gram Matrix of channels can only match some type of texture information, not location information.
style_arr is the preprocessed style image. shp is the shape of style_arr. Note the major difference from before: we’re building arrays from multiple layers. Specifically 3 layers: the first convolutional layer of each of the first three conv-blocks (conv-layer outputs, not the image’s 3 color-channels).
`loss` is the sum of `style_loss`-es over all layers & targets. `style_loss` is the mean-squared error between the Gram matrix of the layer and the Gram matrix of the target. The Gram matrix is the dot-product of a matrix with its own transpose. I think I remember Howard saying in lecture that no one really knows why this works … but it works.
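As a numpy sketch (the notebook does the same with backend ops; the normalization by the array’s size is one common convention, not necessarily the notebook’s exact one):

```python
import numpy as np

def gram_matrix(x):
    """Flatten H x W x C activations to C x (H*W), then dot with its
    own transpose: channel-by-channel correlations, with all spatial
    (location) information averaged away."""
    features = x.reshape(-1, x.shape[-1]).T   # C x (H*W)
    return features @ features.T / x.size

def style_loss(layer, target):
    return np.mean((gram_matrix(layer) - gram_matrix(target)) ** 2)
```

One way to see the “no location information” point: shuffle an activation map’s spatial positions and its Gram matrix doesn’t change.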

Doing that, we can take the style of this van Gogh painting:

And apply that style to this randomly generated target image:

To get an image of random pixels, in van Gogh’s style after 10 iterations:


And finally, Style Transfer. We’ve got: 1. content recreation; 2. style recreation. Now to put them together:

This is done in the (eventually) intuitive way: combine the two approaches by weighting and adding the two loss functions. As before, we grab a sequence of layer outputs to compute the style loss; only one layer output is needed to compute the content loss. The lower the layer, the more exact the content reconstruction. When we merge content reconstruction with style, a later-layer (looser) reconstruction leaves more room for style, and vice versa.

Here we have code to make the style_layers be the 2nd convolutional layer of blocks 1 through 5; and the content_layer the outputs (activations) of block 4 conv-layer 2:

The style and content models become submodels of the full model, from its input to the respective style and content outputs. NOTE: the style target outputs/activations correspond to their multiple layers. The style targets ultimately take in the style array (the preprocessed style image), and likewise the content target takes in the preprocessed source image (src).

This creates three separate output types: for the original image, the style image, and the random image whose pixels we’re training.

Finally, an array of weights to apply to the style output activations is created. This tunes the mix, at those layers, of image-reconstruction vs. style-transfer. NOTE: this works as a factor on the content loss: the smaller the factor (the greater the denominator), the greater the effect style has on the final image, until it completely obscures any original content.
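Putting it together, the total loss is a weighted sum. A numpy sketch (the per-layer style weights and the 1/10 content factor here are illustrative defaults, not the notebook’s exact values):

```python
import numpy as np

def gram_matrix(x):
    features = x.reshape(-1, x.shape[-1]).T   # C x (H*W)
    return features @ features.T / x.size

def total_loss(content_pair, style_pairs, style_wgts, content_wgt=1/10.):
    """Weighted sum of per-layer style losses plus a down-weighted
    content loss: shrink content_wgt and style dominates, grow it and
    the content reconstruction sharpens."""
    c_layer, c_targ = content_pair
    loss = content_wgt * np.mean((c_layer - c_targ) ** 2)
    for (layer, targ), w in zip(style_pairs, style_wgts):
        loss += w * np.mean((gram_matrix(layer) - gram_matrix(targ)) ** 2)
    return loss
```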

Gotta find that sweet spot between all-style and all-business.

This does the thing

And here are some results:

I accidentally got an impressionist painting of a power plug:

Convolutional Neural Networks need a fixed input size, so the content, style, and output images all have to have the same dimensions. While cropping my cat’s picture to match, I forgot to adjust the crop, and got just the top-left corner, at the same dimensions as the tiny bird-painting I was using for style.

Artistic inspiration brought to you by Snapchat…

And finally, just making sure I was doing things right: seeing if I could get the same results on the same input images as Jeremy Howard:

Definitely some differences, but for sure similarities too; I like this for a first run. I’m definitely excited about doing good style transfer like what you’ll find here: https://github.com/titu1994/Neural-Style-Transfer

Luckily enough, it looks like techniques such as ‘masking’ and others are going to be covered in Lesson 9, so I may not even need to learn that on the side.

So, final note: this wasn’t a post for learning from scratch. For that you’ll have to go through the lesson and the original notebook. But if you’re working on this and want some extra thoughts, hopefully this was helpful.

Another thing I love about AI is all the opportunities for sketchy-but-accurate folder names:

A big thanks to Jeremy Howard and Rachel Thomas for putting together fast.ai.