Picking an optimizer for Style Transfer

Have you ever woken up in the middle of the night and wondered whether Gradient Descent, Adam or Limited-memory Broyden–Fletcher–Goldfarb–Shanno will optimize your style transfer neural network faster and better?

Me neither, and I didn’t know what half of these words meant until last week. But as I’m participating in part 2 of the excellent Practical Deep Learning For Coders, and nudged by the man behind the course, Jeremy Howard, I figured I’d explore which optimizer is best for this task.

NB: Some knowledge of Convolutional Neural Networks is assumed. I highly recommend part 1 of the course, which is available online for free. It’s the best way to get your feet wet in machine learning (ML).

What is Style transfer and how does it work?

Let’s start with some of the basics, partly because I was a little unclear on them prior to writing this. If you are familiar with style transfer, you might skim or skip this section.

Q: Um, what is style transfer?

A: It’s what apps like Prisma and Lucid are doing. Basically, it extracts the style of an image (usually a famous painting) and applies it to the contents of another image (usually supplied by the user).

The style of a painting is: the way the painter used brush strokes; how these strokes form objects; the texture of objects; the color palette used.

The content of the image is what objects are present in this image (person, face, dog, eyes, etc.) and their relationships in space.

Here is an example of style transfer:

Landscape (content) + Scream (style)

Q: But how do I separate the style and content of an image?

A: Using convolutional neural networks (CNNs). Since AlexNet successfully applied CNNs to object recognition (figuring out what object is in an image) and dominated the most popular computer vision competition in 2012, CNNs have been the most popular and effective method for object recognition. They recognize objects by learning layers of filters that build on previous layers. The first layers learn to recognize simple patterns, for example an edge or a corner. Intermediate layers might recognize more complex patterns like an eye or a car tire, and so on. Jason Yosinski shows CNNs in action in this fun video.

It turns out that the filters in the first layers in CNNs correspond to the style of the painter — brush strokes, textures, etc. Filters in later layers happen to locate and recognize major objects in the image, such as a dog, a building or a mountain.

By passing a Picasso painting through a CNN, and noticing how much filters in the first layers (style layers) are activated, we can obtain a representation of the style Picasso used in it. Same thing for the content image, but this time with the filters in the last layers (content layers).

Q: Ok, then how do I combine the style and content?

A: Now it gets interesting. We can compute the difference between the styles of two images (style loss) as the difference between the activations of the style filters for each image. Same thing for the difference between content of two images (content loss): it’s the difference between the activations of the filters in the content layers of each image.

Let’s say we want to combine the style of a Picasso painting with a picture of me. The combination image starts off as random noise. As the combination image goes through the CNN, it excites some filters in the style layers and some in the content layers. By summing the style loss between the combination image and the Picasso painting, and the content loss between the combination image and my picture, we get the total loss.

Content, Style and the initial combination image

If we could change the combination image so as to minimize the total loss, we would have an image that is as close as possible to both the Picasso painting and that picture of me. We can do this with an optimization algorithm.
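The losses described above could be sketched roughly as follows. This is a minimal sketch, not the code behind the experiments: random arrays stand in for real CNN activations, the `content_weight` and `style_weight` parameters are hypothetical, and the style loss compares Gram matrices (correlations between filters), which is the common formulation from the Gatys et al. style transfer paper rather than the simplified "difference between activations" used in the explanation here.

```python
import numpy as np

def gram_matrix(features):
    """Correlations between filters; features has shape (height, width, channels)."""
    h, w, c = features.shape
    flat = features.reshape(h * w, c)       # one row per spatial position
    return flat.T @ flat / (h * w)          # shape (channels, channels)

def content_loss(content_feats, combo_feats):
    # Mean squared difference between content-layer activations.
    return np.mean((combo_feats - content_feats) ** 2)

def style_loss(style_feats, combo_feats):
    # Mean squared difference between Gram matrices of style-layer activations.
    return np.mean((gram_matrix(combo_feats) - gram_matrix(style_feats)) ** 2)

def total_loss(content_feats, style_feats, combo_content_feats, combo_style_feats,
               content_weight=1.0, style_weight=1.0):
    # Weighted sum of the two losses (the weights are assumed knobs).
    return (content_weight * content_loss(content_feats, combo_content_feats)
            + style_weight * style_loss(style_feats, combo_style_feats))

# Random arrays stand in for real CNN activations.
rng = np.random.default_rng(0)
content_feats = rng.random((8, 8, 4))     # "content layer" output for my picture
style_feats = rng.random((16, 16, 3))     # "style layer" output for the painting
noise_content = rng.random((8, 8, 4))     # same layers, for the noisy combination image
noise_style = rng.random((16, 16, 3))

loss_noise = total_loss(content_feats, style_feats, noise_content, noise_style)
loss_perfect = total_loss(content_feats, style_feats, content_feats, style_feats)
```

A combination image whose activations matched both targets exactly would give a total loss of zero (`loss_perfect`), while random noise gives a positive loss (`loss_noise`) that the optimizer then has to drive down.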

Q: An optimization algorithm?

A: It’s a way to minimize (or maximize) a function. Since we have a total loss function that is dependent on the combination image, the optimization algorithm will tell us how to change the combination image to make the loss a bit smaller.
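To make this concrete, here is a toy sketch of such a loop. It assumes a made-up loss (squared pixel distance to two small random "images") in place of the real CNN-based loss, and the step size of 0.1 is picked only for this toy problem.

```python
import numpy as np

rng = np.random.default_rng(0)
content = rng.random((4, 4))   # toy "content image"
style = rng.random((4, 4))     # toy "style image"

def loss(x):
    # Toy stand-in for the total loss: squared pixel distance to both images.
    return np.sum((x - content) ** 2) + np.sum((x - style) ** 2)

def grad(x):
    # Gradient of the toy loss with respect to every pixel of x.
    return 2 * (x - content) + 2 * (x - style)

x = rng.random((4, 4))         # the combination image starts off as random noise
history = [loss(x)]
for _ in range(100):
    x = x - 0.1 * grad(x)      # nudge every pixel in the downhill direction
    history.append(loss(x))
```

For this toy loss the best possible image is simply the pixel-wise average of the two inputs, so `x` ends up there. With a real CNN the gradient comes from backpropagation instead of a hand-written formula, but the loop has the same shape.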

Q: What optimization algorithms are there?

A: The ones I’ve encountered so far fall into two camps: first- and second-order methods.

First-order methods minimize or maximize the function (in our case the loss function) using its gradient. The most widely used first-order method is Gradient Descent and its variants, as illustrated here and explained in Excel(!).
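As a rough illustration of what "Gradient Descent and its variants" means in code, the sketch below puts a plain gradient descent step next to an Adam step. The toy loss, the starting point and the learning rate of 0.1 are assumptions made for the example; the `beta1`, `beta2` and `eps` defaults are the ones suggested in the Adam paper.

```python
import numpy as np

def gd_step(x, g, lr):
    # Plain gradient descent: step against the gradient.
    return x - lr * g

def adam_step(x, g, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam keeps running averages of the gradient (m) and its square (v).
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])   # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    # Each parameter gets its own effective step size via sqrt(v_hat).
    return x - lr * m_hat / (np.sqrt(v_hat) + eps)

def grad(x):
    # Gradient of the toy loss f(x) = sum(x ** 2), which is minimized at x = 0.
    return 2 * x

x_gd = np.array([3.0, -2.0])
x_adam = x_gd.copy()
state = {"t": 0, "m": np.zeros(2), "v": np.zeros(2)}
for _ in range(500):
    x_gd = gd_step(x_gd, grad(x_gd), lr=0.1)
    x_adam = adam_step(x_adam, grad(x_adam), state, lr=0.1)
```

Both only ever look at the gradient, which is what makes them first-order. The difference is that Adam rescales each parameter's step using its gradient history, while plain gradient descent uses one fixed step size for everything.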

Second-order methods use the second derivative (the Hessian) to minimize or maximize the function. Since the second derivative is costly to compute, the second-order method in question, L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno), uses an approximation of the Hessian.
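For a feel of how L-BFGS is typically called, here is a small sketch using SciPy's `scipy.optimize.minimize` with `method="L-BFGS-B"`. The toy loss and the `target` array are assumptions for illustration, and this is not necessarily how the experiments below invoke it.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
target = rng.random(16)          # toy stand-in for the ideal combination image

def loss_and_grad(x):
    # Return the loss and its gradient together. L-BFGS never sees the true
    # Hessian; it approximates it from the recent history of gradients.
    diff = x - target
    return np.sum(diff ** 2), 2 * diff

x0 = rng.random(16)              # start from random noise
result = minimize(loss_and_grad, x0, jac=True, method="L-BFGS-B",
                  options={"maxiter": 100})
```

Note that there is no learning rate to choose here: the step length is picked by a line search at each iteration. `result.x` holds the optimized values and `result.fun` the final loss.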

Which is the best optimization algorithm?

It depends on the task, but I mostly use Adam. Let’s see which one is fastest in style transfer.

Setup: The learning rate in the following experiments is set to 10, which might seem high but works out fine because we are dealing with color intensities between 0 and 255. The rest of the hyperparameters of the optimizers are left at their default values. The tests were performed on a single K80 GPU on an Amazon P2 instance.

Experiment 1: 100 iterations, 300 x 300 pixels

We’ll start off with Picasso and a picture of a beautiful girl. They are both sized 300 by 300 pixels.

We’ll run the optimizer for 100 steps. That’s not sufficient to get a good combination image, but it will allow us to see which optimizer minimizes the error faster.

It seems that Gradient Descent and Adadelta have a hard time minimizing the loss as indicated by their oscillation. RMSProp is similar to Adadelta, so it’s behaving in the same way. This is caused by the large learning rate.
What is surprising is that Adam and L-BFGS converge fast and generally have the same error.
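The link between a large learning rate and oscillation can be seen on a much simpler problem. The sketch below is an assumed toy example (a one-dimensional quadratic, not the style transfer loss), and the three learning rates are chosen only to show the three regimes.

```python
def gradient_descent(lr, steps=20, x=1.0):
    # Minimize f(x) = x ** 2, whose gradient is 2 * x.
    path = [x]
    for _ in range(steps):
        x = x - lr * 2 * x
        path.append(x)
    return path

small = gradient_descent(lr=0.1)      # each step multiplies x by 0.8: smooth decay
large = gradient_descent(lr=0.95)     # multiplies x by -0.9: overshoots, sign flips every step
too_large = gradient_descent(lr=1.1)  # multiplies x by -1.2: oscillates and diverges
```

With the small rate, `x` slides steadily towards the minimum at 0. With the large rate it jumps back and forth across the minimum, and with the largest one each jump lands farther away than the last.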

Just for reference, here is the output for each optimizer:

Experiment 2: 100 iterations, 600 x 600 images

L-BFGS should work better when there are a large number of parameters. To test this, we increase the image size. We’ll also switch the images to Van Gogh’s “Starry night” and a cute retriever.

And the results are in:

Gradient Descent and Adadelta show less variation this time, even with the large learning rate, but RMSProp is unstable.

Adam converges faster but over the long term L-BFGS achieves lower loss. I wonder how that will play out over a longer period of time. Let’s try.

Experiment 3: 1000 iterations, 300 x 300 images

To test this, we are going to run for 1000 iterations on the same 300 by 300 images from Experiment 1.

The results look pretty bad for Gradient Descent, Adadelta, and RMSProp: they are unable to converge with this learning rate, as can be seen in the actual combined images (further down the page).

Let’s focus on the best performing optimizers then:

L-BFGS seems to be a clear winner here as it achieved a 50% smaller loss than Adam (and faster too).

The difference can be seen in the actual combination images:

Clearly Adam, Adagrad and L-BFGS look better.

Experiment 4: Different Learning Rate, 100 iterations, 300 x 300

As was pointed out in this Reddit thread, the large learning rate is preventing GD, Adadelta and RMSProp from converging. So let’s reduce the learning rate for these 3:

Okay, this is much more interesting. Gradient Descent is performing better than Adadelta (which might benefit from an even lower learning rate). It seems that the large learning rate of “Adam LR:10” helps it lower the loss quickly in the beginning, but it has trouble converging towards the end. Adam with a learning rate of 1 is slowly but steadily decreasing the loss. Will it perform better than Adam with LR 10 given enough time? The next experiment will show us.

Experiment 5: Different Learning Rates, 500 iterations, 300 x 300 images

Gradient Descent and Adadelta begin oscillating towards the end, and they would benefit from a further reduced learning rate at this point. Same for RMSProp.

Interestingly, Adam with LR of 1 overtakes Adam with LR 10 given enough time, and might eventually perform better than L-BFGS (in the next test).

Experiment 6: 1000 iterations, 300 x 300 images

Adam is still unable to achieve a lower loss than L-BFGS. Maybe learning rate annealing will help it, which I intend to review in a follow-up post.

The code used to generate these is available at: https://github.com/slavivanov/Style-Tranfer