Over the past several years, Convolutional Neural Networks (CNNs) have established themselves as a state-of-the-art computer vision tool in both industry and academia. Used in applications ranging from facial recognition to self-driving cars, they have become incredibly popular among deep learning developers. In my work at Galaxy.AI, I’ve implemented CNNs for some of the more “traditional” computer vision tasks such as image classification and object localization.
In addition to these sorts of tasks, however, CNNs have been shown to be particularly good at recognizing artistic style. Specifically, in a 2015 paper (“A Neural Algorithm of Artistic Style,” by Gatys et al.), the authors discuss how deep convolutional neural networks can distinguish between “content” and “style” in images. By writing separate loss functions for each, the authors demonstrate how CNNs can combine the style from one image with the content from the other to create new, visually appealing images. One impressive aspect of this technique is that no new network training is required — pre-trained weights, such as those learned on ImageNet, work quite well.
Style transfer is a fun and interesting way to showcase the capabilities of neural networks. I wanted to take a stab at creating a bare-bones working example using the popular `keras` library. In this post I’ll walk you through my approach, mimicking as closely as possible the methods from the paper. The full code from this post can be found at https://github.com/walid0925/AI_Artistry .
Using only two base images at a time, we’ll be able to create AI artwork that looks something like this:
The problem at hand is that we have two base images that we want to “blend” together. One of these has the content that we wish to keep, and we will call this image p. In my case, I’ll be using this cat photograph taken from Google:
The other base image will have the style that we wish to keep, which we call a. For this, I’ve chosen a digital image of the well-known Braque piece, Violin on Palette:
Lastly, we will have our generated image x, which we initialize with random color values. This image will evolve as we minimize the content and style loss functions.
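As a minimal sketch, the random initialization might look like the following (the 512×512 size is an assumption for illustration; in practice x is shaped to match the content image):

```python
import numpy as np

# Hypothetical canvas size; in practice this matches the content image p.
height, width = 512, 512

# Initialize the generated image x with random color values,
# roughly zero-centered the way VGG-style preprocessing expects.
x = np.random.uniform(0, 255, (1, height, width, 3)) - 128.0
```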
The purpose of the content loss is to make sure that the generated image x retains some of the “global” characteristics of the content image, p. For example, in our case, we want to make sure that the generated image looks like the cat in p. This means that shapes such as the cat’s face, ears, and eyes ought to be recognizable. To achieve this, the content loss function is defined as the mean squared error between the feature representations of p and x, respectively, at a given layer l.
- F and P are N × M matrices
- N is the number of filters in layer l, and M is the number of spatial elements in the feature map (height × width) at layer l
- F contains the feature representation of x at layer l
- P contains the feature representation of p at layer l
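A minimal numpy sketch of this content loss (using the paper’s half-sum-of-squares form, which is just a scaled mean squared error; the matrices here are stand-ins for real layer activations):

```python
import numpy as np

def content_loss(P, F):
    # P: feature representation of the content image p at layer l (N x M)
    # F: feature representation of the generated image x at layer l (N x M)
    # Half the sum of squared differences, as in the paper.
    return 0.5 * np.sum((F - P) ** 2)
```

In the real pipeline, P and F come from a chosen VGG layer’s activations, flattened to N filters × M spatial positions.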
On the other hand, the style loss is designed to preserve stylistic characteristics of the style image, a. Rather than using the difference between feature representations, the authors use the difference between Gram matrices from selected layers, where the Gram matrix is defined as:
The Gram matrix is a square matrix containing the dot products between each pair of vectorized filter maps in layer l. It can therefore be thought of as a non-normalized correlation matrix for the filters in layer l.
Then, we can define the loss contribution from layer l as
where A is the Gram matrix for the style image a and G is the Gram matrix for the generated image x.
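Under the same matrix conventions as above (N filters × M spatial elements per layer), a numpy sketch of the Gram matrix and the per-layer style loss might look like:

```python
import numpy as np

def gram_matrix(F):
    # F: (N, M) matrix of vectorized filter activations at layer l.
    # Entry (i, j) is the dot product between filters i and j.
    return F @ F.T

def style_layer_loss(A, G, N, M):
    # A: Gram matrix of the style image a at layer l
    # G: Gram matrix of the generated image x at layer l
    # The 4 * N^2 * M^2 normalization follows the paper.
    return np.sum((G - A) ** 2) / (4.0 * N ** 2 * M ** 2)
```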
Ascending layers in most convolutional networks such as VGG have increasingly larger receptive fields. As this receptive field grows, more large-scale characteristics of the input image are preserved. Because of this, multiple layers should be selected for “style” to incorporate both local and global stylistic qualities. To create a smooth blending between these different layers, we can assign a weight w to each layer, and define the total style loss as:
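A numpy sketch of that weighted sum over the selected layers (Gram matrices computed inline; the feature matrices are stand-ins for real activations):

```python
import numpy as np

def total_style_loss(style_feats, gen_feats, weights):
    # style_feats, gen_feats: lists of (N, M) feature matrices, one per
    # selected style layer; weights: one w per layer (uniform in this post).
    loss = 0.0
    for Fa, Fx, w in zip(style_feats, gen_feats, weights):
        N, M = Fa.shape
        A = Fa @ Fa.T  # Gram matrix of the style image at this layer
        G = Fx @ Fx.T  # Gram matrix of the generated image at this layer
        loss += w * np.sum((G - A) ** 2) / (4.0 * N ** 2 * M ** 2)
    return loss
```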
Putting It All Together
Lastly, we just need to assign weighting coefficients to each of the content and style loss respectively, and we’re done!
This is a nice, clean formulation that allows us to tune the relative influence of both the content and style images on the generated image, using α and β. Based on the authors’ recommendations and my own experience, α = 1 and β = 10,000 works fairly well.
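The combined objective is then just a weighted sum; as a sketch (using the weighting values reported above):

```python
ALPHA = 1.0      # content weight
BETA = 10000.0   # style weight

def total_loss(L_content, L_style, alpha=ALPHA, beta=BETA):
    # Overall objective minimized with respect to the generated image x.
    return alpha * L_content + beta * L_style
```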
In order to start changing our generated image to minimize the loss function, we have to define two more functions using `scipy` and the `keras` backend: first, a function to calculate the total loss, and second, a function to calculate the gradient. Both of these get fed as input to a scipy optimization function as the objective and gradient functions, respectively. Here, we use the limited-memory BFGS algorithm.
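The wiring into scipy can be sketched with a toy objective. In the real pipeline, `loss_fn` and `grad_fn` would evaluate the Keras-backend loss and its gradient with respect to the flattened generated image; the quadratic target below is purely illustrative:

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

# Hypothetical stand-in objective: a simple quadratic with a known minimum.
target = np.array([1.0, -2.0, 3.0])

def loss_fn(x_flat):
    # Objective function handed to the optimizer.
    return float(np.sum((x_flat - target) ** 2))

def grad_fn(x_flat):
    # Gradient of the objective, also handed to the optimizer.
    return 2.0 * (x_flat - target)

x0 = np.zeros(3)  # the style-transfer analogue is the flattened random image x
x_opt, final_loss, info = fmin_l_bfgs_b(loss_fn, x0, fprime=grad_fn, maxiter=50)
```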
For each of the content and style images, we extract the feature representations to construct P and A (for each selected style layer), and weight the style layers uniformly. In practice, using > 500 iterations of L-BFGS-B typically creates convincing visualizations.
Slowly but surely …
… we start to see something that looks like a cubist-painted version of a cat! Letting it run for enough iterations:
We can reshape this to the size of the original cat picture and look at them side by side. It’s fairly easy to see that the cat’s main features, such as its eyes, nose, ears, and paw have stayed consistent in terms of their placement. They have, however, been flattened and made angular to match the style of the painting — exactly what we were hoping for!
We can try this with other content/style image combinations as well. Here is another experiment run using another photograph from Google and Van Gogh’s Starry Night Over the Rhone:
In this post we walked through one approach for implementing style transfer using `keras`. But there are many things you can do to create even more compelling pieces:
- Try using different weightings: Different image combinations may require tweaking the style loss weights, w, or adapting the values for α and β. In some cases, a β/α ratio of 10⁵ may work better, for example.
- Try using more style layers: This will come at a computational cost, but will transfer style more smoothly at different scales. You could try using VGG19 rather than VGG16, or a different network architecture altogether (e.g. ResNet, Inception).
- Try using multiple content layers: Multiple content layers may keep the content more coherent, but at the expense of style adaptation.
- Try using multiple content/style images: You can try adding more style images to the loss function to incorporate style from multiple artists or images. Adding multiple content images may create some especially interesting (or freaky) combinations.
- Add total variation denoising: If you look closely at the images I’ve included, you can see that there is some graininess — little swirls of color swimming around. This is fairly typical for neural net visualizations, one of the reasons being the lossy compression of images into feature maps. Adding a total variation loss (as they’ve done here) can help alleviate this in the final generated image.
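As one illustration of that last suggestion, a total variation penalty on the generated image might be sketched as follows (the 1.25 exponent is a common convention in neural-style implementations, an assumption rather than something from the paper):

```python
import numpy as np

def total_variation_loss(x):
    # x: (height, width, 3) generated image.
    # Penalizes differences between neighboring pixels, smoothing out
    # the grainy swirls mentioned above.
    dh = x[1:, :-1, :] - x[:-1, :-1, :]   # vertical neighbor differences
    dw = x[:-1, 1:, :] - x[:-1, :-1, :]   # horizontal neighbor differences
    return np.sum((dh ** 2 + dw ** 2) ** 1.25)
```

This term would simply be added to the combined content/style loss with its own small weight.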
Hope you enjoyed following along, and good luck!