(Style Transfer) How to create a “Thai Modern Artwork” without artistic skills

Where you can be an artist with some help from AI

Nongnooch Roongpiboonsopit
12 min read · Sep 3, 2019
The Grand Palace, Bangkok, Thailand (This image was generated via the Style Transfer algorithm from this original image)

Many people (including me) dream of creating a unique artwork of their own but do not have enough artistic skills to make one. Recent advances in deep learning help someone like me fulfill that wish of creating my own van Gogh with a little help from AI.

“The world always seems brighter when you’ve just made something that wasn’t there before.” Neil Gaiman

Specifically, we can apply a Deep Learning method called "Style Transfer", where we utilize a pre-trained convolutional neural network (vgg19) to extract content from one image and style from another image and combine them to create a brand new artwork. Sounds very interesting, doesn't it?

The approach I will show in this post is inspired by the paper "Image Style Transfer Using Convolutional Neural Networks" and the "Style Transfer" module of Udacity's Deep Learning Nanodegree Program, from which I borrowed some of the lab content to complete this work and to which I must give credit.

You can visit my github project here to see the detailed analysis and the original source used in this experiment.

Big Picture of the Image Style Transfer

Image Style Transfer is a deep learning algorithm that allows us to produce a new image combining the content of an arbitrary photograph with the appearance of a piece of artwork.

As described in the paper "Image Style Transfer Using Convolutional Neural Networks", the Style Transfer process can be done by utilizing the pre-trained vgg19 network and performing the following steps:

Illustration of the network architecture of VGG-19 model: conv means convolution, FC means fully connected (This image was uploaded by Clifford K. Yang to https://www.researchgate.net and it was downloaded from here)
  1. Collect the Content Representation of the content image — This can be done by retrieving the output of the conv4_2 layer after performing a forward pass through the vgg19 network with the content image
  2. Collect the Style Representations of the style image — This can be done by using the style image to perform a forward pass. Then, calculate a "Gram Matrix" from the outputs of the convolutional layers conv1_1, conv2_1, conv3_1, conv4_1, and conv5_1. (A Gram Matrix is a mathematical way of representing the similarity of prominent styles between feature maps in a particular convolutional layer)
  3. Initialize a target image — This can be done either by cloning the content image (this approach preserves the original content of the image, makes the contents of the final target image more deterministic, and also saves learning time) or by creating a white noise image (this approach requires more parameter tuning)
  4. Perform a forward pass using the target image.
  5. Compute a Total Loss, which is a combination of the Content Loss (the difference between the Content Representations of the target and content images) and the Style Loss (the difference between the Style Representations of the target and style images) — Note that the way the Total Loss is computed in Style Transfer is quite different from how we traditionally compute the total loss of a ConvNet: instead of computing a loss from the output of the last Fully Connected Layer (which we don't care about in this case), we compute the Content Loss and Style Loss between the input images and the target image. You can find more details about how to compute it in the Medium post "Neural Style Transfer Tutorial - Part 1" (by Vamshik Shetty)
  6. Use Gradient Descent to minimize the Total Loss by adjusting the target image — To emphasize this again, during the Style Transfer process the weights of the network are frozen. We minimize the Total Loss during the learning process by updating the pixel values of the target image, which is the input to the trained network, so that its ContentRepresentation_target_image and StyleRepresentations_target_image move closer to the original ContentRepresentation_content_image and StyleRepresentations_style_image computed before the training process (this is super tricky, I know... see the sketch after this list)
  7. Repeat steps 4 to 6 above for a fixed number of iterations (or until you get a desired target image)
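
If you prefer code to prose, here is a minimal PyTorch sketch of this loop. It assumes PyTorch and torchvision; image loading and preprocessing are omitted, and the helpers get_features and gram_matrix are hypothetical and sketched later in this post.

```python
import torch
import torch.optim as optim
from torchvision import models

# Load the pre-trained VGG-19 feature extractor and freeze its weights;
# in Style Transfer we never update the network, only the target image.
vgg = models.vgg19(pretrained=True).features.eval()
for param in vgg.parameters():
    param.requires_grad_(False)

# `content` and `style` are assumed to be preprocessed image tensors of shape (1, 3, H, W).
content_features = get_features(content, vgg)                       # Step 1
style_grams = {layer: gram_matrix(feat)                             # Step 2
               for layer, feat in get_features(style, vgg).items()}

target = content.clone().requires_grad_(True)                       # Step 3: clone the content image
optimizer = optim.Adam([target], lr=0.003)                          # we optimize pixels, not weights

alpha, beta = 1, 1e6                                                 # content / style loss weights
style_weights = {'conv1_1': 1.0, 'conv2_1': 0.8, 'conv3_1': 0.5,
                 'conv4_1': 0.3, 'conv5_1': 0.1}

for step in range(5000):                                             # Step 7: fixed number of iterations
    target_features = get_features(target, vgg)                      # Step 4
    # Step 5: content loss on conv4_2, style loss on the five conv*_1 layers
    content_loss = torch.mean((target_features['conv4_2'] - content_features['conv4_2']) ** 2)
    style_loss = 0
    for layer, weight in style_weights.items():
        target_gram = gram_matrix(target_features[layer])
        _, d, h, w = target_features[layer].shape
        style_loss += weight * torch.mean((target_gram - style_grams[layer]) ** 2) / (d * h * w)
    total_loss = alpha * content_loss + beta * style_loss
    optimizer.zero_grad()                                             # Step 6: adjust the target image only
    total_loss.backward()
    optimizer.step()
```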

That is a wrap! Let’s get back to our main goal to create a brand new Thai artwork! :)

Experiment Setup

In this experiment, we will create a new artwork from:

  • Content Image: An image of a God Statue in the Grand Palace, Bangkok, Thailand (This image was downloaded from here)
  • Style Image: An image of a Thai-style artwork. Sorry, I could not find the name of this drawing :( (This image was downloaded from here)
(Left) Content Image — An image of a God Statue in the Grand Palace, Bangkok, Thailand, (Right) Style Image — An image of a Thai-style artwork

Before starting the learning process (Steps 3–6 described in the previous section), let's take a look at the "Content Representation" and "Style Representations" retrieved from our images to get a better understanding of our setup first!

Content Representation

As described in the previous section, the Content Representation can be obtained by retrieving an output of the “conv4_2” layer after performing a forward pass with the content image:

Retrieving the Content Representation from the output of ‘conv4_2’ (This image was downloaded from here)

Below is the "Content Representation" of our content image (the God Statue). As you can see in the images below, each filter in the conv4_2 layer detects a different object and shape arrangement in the content image.

(See more details about how these images are extracted from the network in the "Analysis_Content_Style_Representation" notebook in this github project)

Content Representation: Output of filters in the “conv4_2” layer
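
For reference, here is a minimal sketch of the hypothetical get_features helper used in the loop above. It assumes torchvision's vgg19.features, where the layer names from the paper correspond to the module indices in the mapping below.

```python
def get_features(image, vgg):
    """Forward pass that collects the outputs of the layers we care about."""
    # Indices of torchvision's vgg19.features mapped to the paper's layer names
    layers = {'0': 'conv1_1', '5': 'conv2_1', '10': 'conv3_1',
              '19': 'conv4_1', '21': 'conv4_2', '28': 'conv5_1'}
    features = {}
    x = image
    for name, module in vgg._modules.items():
        x = module(x)
        if name in layers:
            features[layers[name]] = x
    return features
```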

Style Representations

As mentioned earlier, to collect the Style Representations, we need to perform a forward pass with the style image and then calculate a "Gram Matrix" from the outputs of the convolutional layers conv1_1, conv2_1, conv3_1, conv4_1, and conv5_1.

Retrieving the Style Representations by computing Gram Matrices using the output of the ‘conv1_1’, ‘conv2_1’, ‘conv3_1’, ‘conv4_1’, ‘conv5_1’ layers(This image was downloaded from here)
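
Computing the Gram Matrix itself is short: flatten each feature map of the layer into a row vector and multiply the resulting matrix by its own transpose. A minimal sketch of the hypothetical gram_matrix helper used earlier:

```python
import torch

def gram_matrix(tensor):
    """Flatten each feature map and multiply the result by its own transpose."""
    _, d, h, w = tensor.shape        # batch of 1, d feature maps of size h x w
    flat = tensor.view(d, h * w)     # one row per feature map
    return torch.mm(flat, flat.t())  # (d x d) matrix of feature-map similarities
```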

As the old saying goes, "a picture is worth a thousand words", so below is the lecture note that I made as part of my study for Udacity's Deep Learning Nanodegree Program. I hope it will help you get a better picture of how this process is done :)

Below are heatmaps of the computed Gram Matrices. At a quick glance, we can see that the cells on the diagonal have higher values (bright colors), which makes sense because each is the value of a feature map multiplied by its own transpose!

However, more interestingly, notice that there are also cells that are *NOT* on the diagonal but still have a high value (bright color). Those indicate two feature maps that are very similar.
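
A tiny worked example with made-up numbers (not real VGG features) shows why: each diagonal entry is a flattened feature map's dot product with itself, while a large off-diagonal entry means two different feature maps contain very similar values.

```python
import torch

# Three flattened "feature maps" of four values each (made-up numbers)
F = torch.tensor([[1., 2., 3., 4.],   # feature map A
                  [1., 2., 3., 4.],   # feature map B, identical to A
                  [4., 3., 2., 1.]])  # feature map C, a different pattern
G = F @ F.t()
print(G)
# tensor([[30., 30., 20.],
#         [30., 30., 20.],
#         [20., 20., 30.]])
# The diagonal holds each map's dot product with itself, and the large
# off-diagonal entry G[0, 1] reflects that A and B are identical.
```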

Heatmaps of the Style Representations (Computed Gram Matrices)

To support my claim above, let's take a look at the output of the filters in 'conv1_1' and examine the pairs of feature maps that produce the highest values in the Gram Matrix (excluding those on the diagonal).

Output of filters in the ‘conv1_1’
Top #1 pair of feature maps by Gram Matrix value (excluding those on the diagonal)
Top #2 pair of feature maps by Gram Matrix value (excluding those on the diagonal)
Top #3 pair of feature maps by Gram Matrix value (excluding those on the diagonal)

As you can see above, the feature maps that produce a high value in the Gram Matrix yield very similar output images. So we can use the Gram Matrix to tell us the similarities between the features in a layer, and it should give us some clues about the texture and color information found in an image!
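
If you want to reproduce this lookup, here is a rough sketch of how to find the most similar off-diagonal pair, assuming gram holds the conv1_1 Gram Matrix computed with the helper above:

```python
# Zero out the diagonal so self-similarity does not dominate, then take the argmax
gram_off_diag = gram.clone()
gram_off_diag.fill_diagonal_(0)
idx = torch.argmax(gram_off_diag)
i, j = divmod(idx.item(), gram.shape[0])
print(f"Most similar pair of feature maps in conv1_1: {i} and {j}")
```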

In addition, I also ran the same experiment with the conv2_1, conv3_1, conv4_1, and conv5_1 layers; below are the top feature maps that produce a high value in the Gram Matrix in those layers.

As you can see in the images below, as we go deeper into the network, the convolutional layers detect and emphasize SMALLER features!

Top #1 pair of feature maps by Gram Matrix value in the "conv2_1" layer (excluding those on the diagonal)
Top #1 pair of feature maps by Gram Matrix value in the "conv3_1" layer (excluding those on the diagonal)
Top #1 pair of feature maps by Gram Matrix value in the "conv4_1" layer (excluding those on the diagonal)
Top #1 pair of feature maps by Gram Matrix value in the "conv5_1" layer (excluding those on the diagonal)

To see the complete experiment results, visit the "Analysis_Content_Style_Representation" notebook in this github project.

It’s time to create my own artwork!

After getting a good understanding of our setup, let’s start creating a new artwork.

I will show the resultant image after performing steps 3–6 (described in the "Big Picture of the Image Style Transfer" section of this post) with a variety of tuning parameters. I will not go into the details of how to code this up, but you can visit the "StyleTransfer_Experiment_Results" notebook in this github project for more details.

To recap, below are the content and style images that will be used in this experiment; the initial target image is a clone of the content image:

(Left) Content Image — An image of a God Statue in the Grand Palace, Bangkok, Thailand, (Right) Style Image — An image of a Thai-style artwork

Below are the parameters that were varied in this experiment (a hypothetical configuration sketch follows the list):

  • Learning Rate
  • Number of learning steps
  • Content Loss Weight (alpha)
  • Style Loss Weight (beta)
  • Style weight in each layer (w_i)
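
As a purely hypothetical illustration (the names are mine, not from the original notebook), one run's settings could be packaged like this and plugged into the loop sketched earlier:

```python
# Settings for a single run; swap these values to reproduce the other experiments below
experiment_1 = {
    'lr': 0.003,            # learning rate
    'steps': 5000,          # number of learning steps
    'alpha': 1,             # content loss weight
    'beta': 1e6,            # style loss weight
    'style_weights': {'conv1_1': 1.0, 'conv2_1': 0.8, 'conv3_1': 0.5,
                      'conv4_1': 0.3, 'conv5_1': 0.1},
}
```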

Experiment#1: Higher Style weight at the earlier layers

Experiment#1: Resultant target image
  • Learning Rate: 0.003
  • Number of learning steps: 5,000
  • Content Loss Weight (alpha): 1
  • Style Loss Weight (beta): 1e6
  • Style weight in each layer: 'conv1_1': 1.0, 'conv2_1': 0.8, 'conv3_1': 0.5, 'conv4_1': 0.3, 'conv5_1': 0.1

Comments:

The resultant picture has the original contents preserved with the appearance of the style image.

Experiment#2: Higher Style weight at the deeper layers

  • Learning Rate: 0.003
  • Number of learning steps: 5,000
  • Content Loss Weight (alpha): 1
  • Style Loss Weight (beta): 1e6
  • Style weight in each layer: 'conv1_1': 0.1, 'conv2_1': 0.3, 'conv3_1': 0.5, 'conv4_1': 0.8, 'conv5_1': 1.0

Comments:

This resultant image looks almost identical to the one from the previous experiment, although I expected smaller style artifacts to show up more in the resultant image. Maybe we need more learning steps to see those effects?

Experiment#3: Reduce overall style weights

  • Learning Rate: 0.003
  • Number of learning steps: 5,000
  • Content Loss Weight (alpha): 1
  • Style Loss Weight (beta): 1e3
  • Style weight in each layer: 'conv1_1': 0.2, 'conv2_1': 0.2, 'conv3_1': 0.2, 'conv4_1': 0.2, 'conv5_1': 0.2

Comments:

This resultant image also looks almost identical to the previous experiments. It looks like we got very strong style artifacts from the style image, and the chosen parameters have very little impact on them!

Experiment#4: Reduce Learning rate and Increase Steps

  • Learning Rate: 0.001
  • Number of learning steps: 7,500
  • Content Loss Weight (alpha): 1
  • Style Loss Weight (beta): 1e6
  • Style weight in each layer: 'conv1_1': 1.0, 'conv2_1': 0.8, 'conv3_1': 0.5, 'conv4_1': 0.3, 'conv5_1': 0.1

Comments:

This resultant image also looks almost identical to the previous experiments but seems to have crisper contents, which is likely due to the smaller learning rate.

Initializing the target image with a white noise image

As you can see in the resultant images above, all of them yield a good result in a short training time (less than 20 minutes on my Windows 10 machine with a GPU) with only minor parameter tweaks.

What if we initialize the target image with white noise and use the parameter settings that yielded the best result in our previous experiments? What will the resultant target image look like?
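
In the sketch from earlier, this is a one-line change to the target initialization (again assuming content is the preprocessed content tensor):

```python
# Step 3 variant: start from white noise instead of a clone of the content image
target = torch.randn_like(content).requires_grad_(True)
```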

Experiment#wn_1: Init with white noise and use the same parameters

  • Learning Rate: 0.003
  • Number of learning steps: 20,000
  • Content Loss Weight (alpha): 1
  • Style Loss Weight (beta): 1e6
  • Style weight in each layer: 'conv1_1': 1.0, 'conv2_1': 0.8, 'conv3_1': 0.5, 'conv4_1': 0.3, 'conv5_1': 0.1

Comments:

I have to admit that I really love this artwork (I will probably frame it :P ), but this target image does not look like our content image at all! This indicates that our Style Loss Weight (beta) is way too high in this case.

Experiment#wn_2: Init with white noise and Larger Content Loss Weight (alpha)

  • Learning Rate: 0.003
  • Number of learning steps: 5,000
  • Content Loss Weight (alpha): 100
  • Style Loss Weight (beta): 1
  • Style weight in each layer: 'conv1_1': 1.0, 'conv2_1': 0.8, 'conv3_1': 0.5, 'conv4_1': 0.3, 'conv5_1': 0.1

Comments:

I have to confess that I ran so many experiments before getting these parameter settings — Parameter tuning is definitely not a trivial task :(

You can see that by increasing the Content Loss Weight (alpha) and reducing the Style Loss Weight (beta) by several orders of magnitude, we can generate a new image that keeps most of the contents of the content image when initializing the target image with white noise. However, to be honest with you, I do not like this resultant target that much compared to our previous experiments. It looks like we need to do more parameter tuning in order to get a desirable image in this case.

Conclusion

In this article, we saw how we could create a new artwork with some help from AI via the Style Transfer algorithm in Deep Learning.

Additionally, we walked through the main concepts of the Style Transfer algorithm and some gotchas including:

  • To extract the Content Representation, we can use a content image to perform a forward pass through the vgg19 network and retrieve the output of the conv4_2 layer
  • To extract the Style Representations, we can use a style image to perform a forward pass through the vgg19 network and retrieve the output of the conv1_1, conv2_1, conv3_1, conv4_1, conv5_1 layers to compute Gram Matrices (We will use the gram matrices as our style representation)
  • A Gram Matrix is a mathematical way of representing shared prominent styles (similarities) between feature maps
  • As we go deeper into the network, the convolutional layers emphasize detecting SMALLER features compared to the earlier layers, and we can adjust the weight values for those layers depending on how much we would like the larger or smaller style artifacts to be applied to the target image

The resultant target images obtained in this experiment are very desirable: we were able to create a new artwork with the appearance of our style image that still preserves the contents of the content image, in a short learning time (around 20 minutes), when initializing the target image by cloning the content image.

However, when initializing the target image with white noise, more parameter tuning is required to produce a new image with the right balance of content and style from the input content and style images.

That is it! I really enjoyed working on this experiment and am very happy to share what I have learnt during this process with you. I hope you guys enjoy it too! :)

To see more about this work, see the link to my Github available here.
