Using Neural Networks to Create Paintings

An overview of style transfer and CNNs for content and style representation

The painting generated by an AI that sold for $432,500

You may have heard about the painting that was sold a few months ago for over 400 thousand dollars. It was created by an AI. If you think about that for a sec, it’s crazy!

Deep learning has advanced so much in the past few decades. We’ve gone from having computers not powerful enough to do backprop to ones that are able to generate art. Art! This was one of the things many of us thought computers could never replicate.

AI is becoming more capable than ever, and one of the ways we’re generating artistic images is through a technique called style transfer.

Neural Style Transfer

So what exactly is it? Style transfer takes the content of one image and combines it with the style of another. For example, you could take a photograph of a building and an image of a painting, then merge the two so that the building looks like it was painted in the style of that painting.

A diagram from the original style transfer paper by Gatys et al. Source.

Maybe you remember the app that went viral a few years ago, Prisma, which allowed users to make their photos look like cool artworks. Conceptually, it uses the same technique. You upload a photo and then an algorithm is applied to combine the content of your photo and the style of a filter you choose.

Now style transfer is pretty cool, but you might be wondering how exactly you can get a computer to do this. There are obviously a ton of challenges. First you need to get the computer to tell the difference between content and style. Even after that, how do you get it to combine the two? If you’re anything like me, you have difficulty painting a stick figure, much less making a photo look like it was painted by Van Gogh. So how the heck can we expect a computer to do it?

What we can do is think of the situation as an optimization problem. We tell the computer to minimize two things at once: a content loss and a style loss. Basically, it’s going to make a new image and get it as close as possible to the content of the photo while still being as close as possible to the style of the painting.
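To make the optimization idea concrete, here is a minimal sketch in plain Python. It is a toy under loud assumptions, not the method from the paper: the "image" is just four numbers, the content loss is the squared distance to the content pixels, and the "style" is a made-up stand-in (the average brightness of a style image). All names (`total_loss`, `gradient`, `stylize`, `style_mean`) and the values of `alpha`, `beta`, `lr` and `steps` are hypothetical.

```python
def total_loss(image, content, style_mean, alpha, beta):
    """Weighted sum of a content term and a (stand-in) style term."""
    content_loss = sum((x - c) ** 2 for x, c in zip(image, content))
    mean = sum(image) / len(image)
    style_loss = (mean - style_mean) ** 2  # stand-in: match average brightness
    return alpha * content_loss + beta * style_loss


def gradient(image, content, style_mean, alpha, beta):
    """Derivative of total_loss with respect to every pixel."""
    n = len(image)
    mean = sum(image) / n
    style_grad = 2.0 * (mean - style_mean) / n  # identical for every pixel
    return [alpha * 2.0 * (x - c) + beta * style_grad
            for x, c in zip(image, content)]


def stylize(content, style_mean, alpha=1.0, beta=4.0, lr=0.1, steps=200):
    """Start from the content image and nudge the pixels downhill."""
    image = list(content)
    for _ in range(steps):
        grad = gradient(image, content, style_mean, alpha, beta)
        image = [x - lr * g for x, g in zip(image, grad)]
    return image


content = [0.1, 0.2, 0.3, 0.4]   # toy "photo"
style_mean = 0.75                # toy "style": a brighter picture
result = stylize(content, style_mean)

print("loss before:", total_loss(content, content, style_mean, 1.0, 4.0))
print("loss after: ", total_loss(result, content, style_mean, 1.0, 4.0))
print("result:     ", [round(x, 3) for x in result])
```

The thing to notice is that nothing is being "trained" here. The pixels of the new image are the variables, and the result is a compromise: it keeps the shape of the content while drifting toward the style statistic.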

Here are the loss functions that are defined for both the content and the style.

Content loss (left) and style loss (right), source.
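For reference, this is how the two losses are written in the paper by Gatys et al., as best I can restate them. Here F^l and P^l are the feature maps of the generated image and the content image at layer l, G^l and A^l are the Gram matrices of the generated image and the style image, N_l is the number of filters in the layer, M_l is the size of each feature map, and w_l is a per-layer weight.

```latex
\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left( F^{l}_{ij} - P^{l}_{ij} \right)^{2}

G^{l}_{ij} = \sum_{k} F^{l}_{ik} F^{l}_{jk}

E_{l} = \frac{1}{4 N_{l}^{2} M_{l}^{2}} \sum_{i,j} \left( G^{l}_{ij} - A^{l}_{ij} \right)^{2}

\mathcal{L}_{style}(\vec{a}, \vec{x}) = \sum_{l} w_{l} E_{l}
```

The content loss compares feature maps directly, while the style loss only compares Gram matrices, which record which features tend to fire together and throw away where in the image they fire.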

Combining these two loss functions, we get:

The total loss function given by the sum of the content and style loss functions, source.
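Written out in the notation of Gatys et al., with p the content photo, a the style artwork, x the generated image, and alpha and beta the two weighting coefficients, the total loss looks like this:

```latex
\mathcal{L}_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha \, \mathcal{L}_{content}(\vec{p}, \vec{x}) + \beta \, \mathcal{L}_{style}(\vec{a}, \vec{x})
```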

We’ll need two coefficients, one for the content loss and one for the style loss, since the two need to be weighted differently. Generally, the style loss will be greater, meaning that we should make the coefficient for the style greater as well to encourage the optimization to reduce that error. Different coefficients will lead to different results, as the diagram of weight ratios below shows.

A diagram of the ratio of the loss weights from the Udacity PyTorch course.
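Here is a small sketch of how the weighted loss could be computed, in plain Python so it can be traced by hand. It follows the loss definitions from Gatys et al. for a single layer, but the tiny feature maps, the function names, and the default weights (`alpha=1.0`, `beta=1000.0`) are assumptions for illustration only. A real implementation would pull the feature maps from a CNN and use a tensor library.

```python
def gram_matrix(features):
    """features: list of channels, each a flattened list of activations.
    Entry (i, j) is the dot product of channel i with channel j."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features]
            for fi in features]


def content_loss(generated, content):
    """Half the squared difference between the two feature maps."""
    return 0.5 * sum((g - c) ** 2
                     for g_row, c_row in zip(generated, content)
                     for g, c in zip(g_row, c_row))


def style_loss(generated, style):
    """Squared difference between Gram matrices, normalized by layer size."""
    n = len(generated)      # number of channels (filters)
    m = len(generated[0])   # activations per channel
    g, a = gram_matrix(generated), gram_matrix(style)
    squared = sum((gij - aij) ** 2
                  for g_row, a_row in zip(g, a)
                  for gij, aij in zip(g_row, a_row))
    return squared / (4 * n ** 2 * m ** 2)


def total_loss(generated, content, style, alpha=1.0, beta=1000.0):
    """alpha weights the content term, beta weights the style term."""
    return (alpha * content_loss(generated, content)
            + beta * style_loss(generated, style))


# Toy feature maps: 2 channels with 2 activations each (hypothetical values).
generated = [[1, 2], [3, 4]]
content = [[1, 2], [3, 5]]
style = [[1, 0], [0, 1]]

print(content_loss(generated, content))       # 0.5
print(style_loss(generated, style))           # 13.03125
print(total_loss(generated, content, style))  # 13031.75
```

Changing the ratio between `alpha` and `beta` is the knob from the diagram: a larger `beta` makes the result more painterly, a larger `alpha` keeps it closer to the original photo.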

So this sounds pretty logical, but how do we even get the content and style in the first place? How do we get a computer to understand these abstract concepts? How do we convey that we want to keep the features of the building, like the roof and windows, while getting the color and texture of another picture? This is where Convolutional Neural Networks can help us out.

Convolutional Neural Networks for Feature Representation

Convolutional Neural Networks are a type of neural network most commonly used for image recognition. You can check out my previous article on it here. CNNs are made up of convolutional layers, which can be thought of as feature extractors that look for certain features in an image. The early layers look for simple features such as lines and edges, and the later layers look for larger shapes that are present in the image.

Visualizations of filters in different convolutional layers. Source.
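To see what "a filter that looks for edges" means, here is a minimal sketch of a single convolution in plain Python. The 3x3 kernel is a hand-written Sobel-style vertical edge detector, not a filter learned by a network, and `conv2d` is a hypothetical helper (no padding, stride 1) rather than any library's API.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise products
    at each position (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


# Responds strongly where brightness changes from left to right.
VERTICAL_EDGE = [[-1, 0, 1],
                 [-2, 0, 2],
                 [-1, 0, 1]]

# A tiny image: dark on the left half, bright on the right half.
image = [[0, 0, 0, 1, 1, 1],
         [0, 0, 0, 1, 1, 1],
         [0, 0, 0, 1, 1, 1]]

# A flat image with no edges at all.
flat = [[1] * 6 for _ in range(3)]

print(conv2d(image, VERTICAL_EDGE))  # [[0, 4, 4, 0]]
print(conv2d(flat, VERTICAL_EDGE))   # [[0, 0, 0, 0]]
```

The output is large only where the edge sits and zero everywhere else. A CNN learns many kernels like this, and stacking layers lets later ones respond to combinations of the simpler features found earlier.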

Convolutional filters can actually do some pretty wacky stuff. You may have heard of Google’s DeepDream before or have seen one of those pictures that make it look like you’re hallucinating. It’s likely that someone used intermediate layers of a CNN to generate those images.

A “DeepDream” style filter applied to Van Gogh’s Starry Night. Source.

Layers in a CNN essentially scan for different features in an image, and later layers look for larger features such as eyes and mouths, which helps create the trippy effect.

Using CNNs to extract content and style representations

In the original paper by Gatys et al., certain layers of the VGG network were selected to act as content and style reconstructions. The input image is reconstructed from layers conv1_2, conv2_2, conv3_2, conv4_2 and conv5_2, with the first number being the block of convolutional layers and the second being the layer inside the block. Similarly, the style is represented by subsets of different layers in the VGG network.
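As a rough sketch, the layer choice can be written down as plain configuration. The specific choice below (content from conv4_2, style from the first convolution in each of the five blocks, equal style weights) is what I recall the paper using for its main results, so treat it as an assumption. `parse_layer` and the constant names are hypothetical helpers, not part of any library.

```python
# Assumed configuration: one deep layer for content, one layer per block for style.
CONTENT_LAYERS = ["conv4_2"]
STYLE_LAYERS = ["conv1_1", "conv2_1", "conv3_1", "conv4_1", "conv5_1"]

# Equal weight for every style layer, so the weights sum to 1.
STYLE_WEIGHTS = {name: 1.0 / len(STYLE_LAYERS) for name in STYLE_LAYERS}


def parse_layer(name):
    """Split a name like 'conv4_2' into (block, layer_in_block)."""
    block, layer = name[len("conv"):].split("_")
    return int(block), int(layer)


for name in CONTENT_LAYERS + STYLE_LAYERS:
    block, layer = parse_layer(name)
    print(name, "-> block", block, ", layer", layer)
```

Using a deep layer for content keeps the overall arrangement of objects without pinning down exact pixels, while spreading the style layers across blocks captures texture at several scales.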

Alternate Methods for Style Transfer

Although the method proposed by Gatys et al. delivers high quality results, it still faces one major issue: speed.

A paper released by Johnson et al. addresses this primary disadvantage of the method proposed in the original paper. Although it produces high quality results, the original method is computationally expensive, since a forward and backward pass is required for each step of the optimization process.

Their approach trains a feed-forward image transformation network using perceptual loss functions, and the authors report that it runs up to three orders of magnitude faster than the optimization-based method. Instead of running an optimization to minimize the content and style differences for every new image, a separate neural network is trained once and then used to apply a style filter to any other image.

A diagram showing the architecture that uses an Image Transformation Network.
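A back-of-the-envelope count shows where the speedup comes from. This is only rough arithmetic: the step count of 500 is an assumed, illustrative number, the function name is hypothetical, and the two approaches run different networks, so the passes are not equally expensive.

```python
def optimization_passes(steps):
    """The original method needs a forward and a backward pass per step."""
    return 2 * steps


# Once trained, the transformation network stylizes an image in one forward pass.
FEED_FORWARD_PASSES = 1

steps = 500  # assumed number of optimization steps per image
ratio = optimization_passes(steps) / FEED_FORWARD_PASSES

print("passes per image (optimization):", optimization_passes(steps))
print("passes per image (feed-forward):", FEED_FORWARD_PASSES)
print("rough ratio:", ratio)
```

The expensive part has not disappeared. It has moved into training the transformation network, which happens once per style rather than once per image.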

This style transfer method is super cool because it lets us do something that wouldn’t have been possible otherwise: real-time style transfer!

The problem with optimizing the difference between two images is that it takes super long, up to several minutes or even hours on a regular CPU. But the best part is, since art is highly subjective, changing the model architecture won’t necessarily make anything look worse.

Here’s a demonstration of an app I created using the TensorFlow for Poets code for Android Studio.

This app takes in real-time input from the phone camera and shows the stylized version.

Key Takeaways

  1. We can perform style transfer by generating a new image that minimizes a weighted combination of a content loss and a style loss.
  2. We get abstract representations for content and style from different layers in the CNN.
  3. Using a different architecture for style transfer, we can train a separate neural network for image transformation and perform real-time style transfer with an app!


Written by

High school senior passionate about tech, business, and making cool things.
