Getting Inception Architectures to Work with Style Transfer

Style transfer typically requires a loss network with a well-developed hierarchy of features for calculating the losses. The good old vgg-16 and vgg-19 architectures work very well for this purpose, but inception architectures (unlike resnet architectures) also have the same property.

I wanted to see how inception architectures could be used for style transfer. Getting them to work required some tweaks; this blog post describes the tweaks I had to make.

The following content images were used for these experiments:

left: Brad Pitt, right: MIT Stata Center

The style image used was “The Great Wave off Kanagawa” by Katsushika Hokusai, referred to below simply as “wave”:

“The Great Wave off Kanagawa” by Katsushika Hokusai, called simply “wave” in this post

All pretrained loss networks used in these experiments were downloaded from the tensorflow slim-models repository. The inception-v3 model trained on openimages was obtained from this script.

The code used in these experiments is available on github.

To find out which layers I mean by Conv2d_2c_3x3, Mixed_3b, etc. for inception-v1, run

in the repo, and similarly for inception-v2, inception-v3, inception-v4, vgg-16 and vgg-19.
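(The exact command lives in the repo; as a rough sketch of what listing the layer names looks like, assuming TensorFlow 1.x and the slim models code on the Python path, not the repo's actual script:)

    # Sketch only: list the inception-v1 endpoint names with TF-Slim.
    # Assumes TensorFlow 1.x and the tensorflow/models "slim" directory on PYTHONPATH.
    import tensorflow as tf
    from nets import inception  # provided by the slim models repo

    slim = tf.contrib.slim
    inputs = tf.placeholder(tf.float32, shape=[1, 224, 224, 3])
    with slim.arg_scope(inception.inception_v1_arg_scope()):
        _, end_points = inception.inception_v1(inputs, num_classes=1001, is_training=False)

    for name in end_points:
        print(name)  # prints Conv2d_2c_3x3, Mixed_3b, Mixed_3c, ...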

Tweak #1: Removing checkerboard artifacts

Checkerboard artifacts can occur in images generated from neural networks. They are typically caused by transposed 2d convolutions whose kernel size is not divisible by their stride. For a more in-depth discussion of checkerboard artifacts, read this post.

Backpropagation through a convolution is a transposed convolution. Thus, when training an image against a loss network, checkerboard artifacts can occur whenever the loss network has a convolution layer whose kernel size is not divisible by its stride.
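To see why, here is a tiny 1-D illustration (my own sketch, not from the repo): count how many kernel taps of a stride-2, kernel-3 convolution touch each input pixel. The uneven coverage is exactly what shows up as a checkerboard in the gradient.

    # 1-D sketch: coverage of input pixels by a kernel-3, stride-2 convolution.
    # Uneven counts mean the gradient w.r.t. the input has an uneven,
    # checkerboard-like magnitude pattern.
    import numpy as np

    kernel, stride, size = 3, 2, 16
    coverage = np.zeros(size, dtype=int)
    for out in range((size - kernel) // stride + 1):
        coverage[out * stride : out * stride + kernel] += 1

    print(coverage)  # [1 1 2 1 2 1 2 1 2 1 2 1 2 1 1 0] -- not uniform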

In the inception-v1, v2, v3 and v4 architectures, the first layer has stride 2 and kernel size 7 (in v1 and v2) or 3 (in v3 and v4). Neither 7 nor 3 is divisible by 2, so there was a possibility that checkerboard artifacts would be created here.

To check whether this was the case, I trained a noise image on the content loss alone, using Conv2d_1a_7x7 from the inception-v1 architecture. The resulting image looks normal, but on zooming in, checkerboard artifacts become visible.

Image generated using original inception-v1 network (left image has checkerboard artifacts that are visible on zooming in)

Solution: Replace stride=2 in the first convolution layer with stride=1. The following image is generated in this case:

Image generated using modified inception-v1 network (checkerboard artifacts reduced on zooming in)

As observed, checkerboard artifacts are removed completely.
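For reference, the tweak boils down to something like the sketch below (illustrative only, not the repo's exact code): when building the loss network, the first convolution is given stride 1 instead of stride 2.

    # Illustrative sketch of tweak #1: build the first inception layer with
    # stride 1 instead of the original stride 2, so the kernel size (7) no
    # longer clashes with the stride.
    import tensorflow as tf
    slim = tf.contrib.slim

    def first_conv(inputs, stride=1):
        # the original inception-v1 definition uses stride=2 here
        return slim.conv2d(inputs, 64, [7, 7], stride=stride,
                           scope='Conv2d_1a_7x7')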

Tweak #2: Getting average pooling to work

When the vgg-16 and vgg-19 networks are used as loss networks, the max pooling layers are replaced with average pooling. This improves the gradient flow through the loss network and makes the image converge faster.

In inception networks, the downsampling pooling layers have stride 2 and kernel size 3. There are also pooling layers inside the inception blocks that don’t downsample (stride 1), again with kernel size 3. I replaced all max pooling layers with avg pooling.

My intuition was that, because of the larger kernel size, the distribution of activations produced by avg pooling differs significantly from that produced by max pooling.

To test this hypothesis, I first tried recreating the original content image using max pooling and avg pooling. The content loss was calculated from the layer ‘Conv2d_2c_3x3’ of the inception-v1 network.

The following images were generated:

images generated by training: left (avg pooling), right (max pooling)

Clearly, the image generated using avg pooling is no good.

Comparing content loss of image with avg and max pooling

Next, I plotted the content loss (with max and average pooling) to see what was going on.

As visible, the content loss with avg pooling fluctuates, while the content loss with max pooling converges to a small value.

Next, I tried generating the image using an average pooling layer with kernel size 2 instead of 3, and it worked wonderfully well. The images below show the difference:

left (image generated using pooling layer of kernel size 2), right (image generated using kernel size 3)
Plot of content loss for avg pooling with ksize 2 and ksize 3

The plot compares the content loss for avg pooling with kernel size 2 and avg pooling with kernel size 3. As is visible, the loss with average pooling of kernel size 2 converges to a small value, while the loss with kernel size 3 fluctuates.
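A minimal sketch of this pooling tweak (function and parameter names are mine, not the repo's):

    # Sketch of tweak #2: use average pooling with kernel size 2 (not 3)
    # wherever the loss network downsamples with max pooling.
    import tensorflow as tf
    slim = tf.contrib.slim

    def downsample(net, pooling='avg'):
        if pooling == 'avg':
            # kernel 2, stride 2: converged well in the experiments above
            return slim.avg_pool2d(net, [2, 2], stride=2, scope='pool')
        # original inception downsampling: kernel 3, stride 2 max pooling
        return slim.max_pool2d(net, [3, 3], stride=2, scope='pool')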

Experiments on inception networks

All the following experiments can be recreated by running python slow_style.py with the command line arguments specified in the repo.

Default values of all parameters were used unless specified otherwise.

Experiment #1: Reconstructing content images from different layers of inception-v1 and comparing with vgg-16

I tried reconstructing the image of Brad Pitt using different layers of inception-v1 and vgg-16 and then comparing the results. The following layers were used for inception-v1: Conv2d_2c_3x3, Mixed_3b, Mixed_3c, Mixed_4b. For reconstruction from vgg-16, the following layers were used: conv2_2, conv3_1, conv3_2, conv4_1. The rationale behind choosing these layers was their relative distance from the respective pooling layers: conv2_2 and Conv2d_2c_3x3 are the last layers before the second pooling in vgg-16 and inception-v1 respectively, conv3_1 and Mixed_3b are the first layers after the second pooling, and so on.

The style weight and tv weight were set to zero.
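For context, the content-reconstruction objective assumed throughout is just a mean squared error between feature maps at the chosen layer (a standard formulation; this is a sketch, not the repo's exact code):

    # Standard content loss: MSE between the feature maps of the generated
    # image and of the content image at one chosen layer (e.g. Conv2d_2c_3x3).
    import tensorflow as tf

    def content_loss(gen_features, content_features):
        return tf.reduce_mean(tf.square(gen_features - content_features))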

Image reconstructed from Conv2d_2c_3x3, Mixed_3b (left to right)
Image reconstructed from Mixed_3c and Mixed_4b (left to right)

As observed, the original image can be reconstructed from the earlier layers (Conv2d_2c_3x3 and Mixed_3b).

Below are the results when using vgg-16:

Image reconstructed from conv2_2, conv3_1 (left to right)
Image reconstructed from conv3_2, conv4_1 (left to right)

In both inception-v1 and vgg-16, content is captured very well by the first few layers; the original image can be completely reconstructed up to conv3_2/Mixed_3c. After that, the drop in the quality of the reconstructed image is quite significant for inception-v1.

Experiment #2: Reconstructing style images from different layers of inception-v1 and comparing with vgg-16

I tried reconstructing pastiches of the style image using different layers of inception-v1 and vgg-16 and then comparing. The following layers were used for inception-v1: Mixed_3b, Mixed_3c, Mixed_4b, Mixed_4c, Mixed_5b. The content weight and tv weight were set to zero.

For vgg-16, the layers used were conv3_1, conv3_2, conv4_1, conv4_2 and conv5_1. The layers were again chosen with reference to their distance from the respective pooling layers: conv3_1 and Mixed_3b are the outputs of the first convolutions after the second pooling in vgg-16 and inception-v1 respectively, conv3_2 and Mixed_3c of the second convolutions after the second pooling, and so on.
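The style (pastiche) reconstructions assume the standard Gram-matrix style loss; roughly (again a sketch, not the repo's exact code):

    # Standard Gram-matrix style loss at one layer (Gatys et al. formulation).
    import tensorflow as tf

    def gram_matrix(features):
        # features: tensor of shape (1, H, W, C) with a statically known shape
        _, h, w, c = features.get_shape().as_list()
        flat = tf.reshape(features, [h * w, c])
        return tf.matmul(flat, flat, transpose_a=True) / float(h * w * c)

    def style_loss(gen_features, style_features):
        # mean squared difference between the Gram matrices of the generated
        # image and the style image at the chosen layer
        return tf.reduce_mean(tf.square(gram_matrix(gen_features) -
                                        gram_matrix(style_features)))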

Below are the results when using inception-v1:

Style outputs from Mixed_3b, Mixed_3c, Mixed_4b (left to right)
Style outputs from Mixed_4c and Mixed_5b (left to right)

Below are the results when using vgg-16:

Pastiches generated using conv3_1, conv3_2, conv4_1 (left to right)
Pastiches generated using conv4_2 (left) and conv5_1 (right)

The pastiches generated by vgg-16 are much richer than the ones generated by inception-v1. Moreover, the pastiches from inception-v1 look more like crayon drawings, while those from vgg-16 look like oil paintings.

Experiment #3: Train using different layers of inception-v1 network and compare with vgg-16 outputs

For inception, I used Conv2d_4a_3x3 for calculating the content loss. For the style loss, I used the Mixed_3b, Mixed_3c and Mixed_4b layers, one at a time.

For vgg-16, I used conv2_2 for calculating the content loss. For the style loss, I used the conv3_1, conv3_2 and conv4_1 layers, again one at a time.

The content weight was 8, the style weight was 3200 and the tv weight was 10 for both networks.
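These weights enter the objective as a simple weighted sum of the three terms (sketch; the default weight values match the ones above, function names are illustrative):

    # Weighted combination of the three losses used for the stylized images.
    import tensorflow as tf

    def full_loss(image, c_loss, s_loss,
                  content_weight=8.0, style_weight=3200.0, tv_weight=10.0):
        tv = tf.reduce_sum(tf.image.total_variation(image))  # smoothness term
        return content_weight * c_loss + style_weight * s_loss + tv_weight * tv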

stylized images generated using Mixed_3b (left) and Mixed_3c (right)
stylized images generated using Mixed_4b

As can be seen, the stylized images look like crayon paintings, much like the pastiches did, and inception-v1 captures the style of the painter much more poorly than vgg does. For comparison, here are the vgg-16 outputs (using the corresponding layers given in the previous experiment).

stylized images generated using conv3_1 (left) and conv3_2 (right)
stylized images generated using conv4_1

And these results do look like oil paintings, like the pastiches did in the previous experiment :).

Experiment #4: Train using inception-v3 networks trained on openimages and imagenet

Next, to check the difference between the images generated by the inception-v3 architecture trained on imagenet and on openimages, I did another experiment. For the content loss, I used the layer Mixed_5b. For the style loss, I used Mixed_5b, Mixed_5c and Mixed_6a, one by one.

Again, as before, I first generated the pastiches by setting the content weight and tv weight to zero, for inception-v3 trained on both openimages and imagenet. The following images were generated:

Pastiches generated from Mixed_5b, Mixed_5c, Mixed_6a (left to right) for inception-v3 trained on imagenet
Pastiches generated from Mixed_5b, Mixed_5c, Mixed_6a (left to right) for inception-v3 trained on openimages

Somehow, a greenish shade is visible in the pastiches generated using inception-v3 trained on openimages, even though it is not present in the wave image.

For generating the stylized images, the content weight was 8, the tv weight was 10, and the style weights are given below the stylized images.

The following stylized images were generated for different layers:

layer Mixed_5b, style weight 6400: inception-v3 trained on imagenet (left), inception-v3 trained on openimages (right)
layer Mixed_5c, style weight 6400: inception-v3 trained on imagenet (left), inception-v3 trained on openimages (right)
layer Mixed_6a, style weight 64000: inception-v3 trained on imagenet (left), inception-v3 trained on openimages (right)

As can be seen, the same architecture captures the style of the painter better when trained on more images.

Experiment #5: Train using inception-v2 with style loss from different layers

I first generated the pastiches using Mixed_3b, Mixed_3c, Mixed_4a, Mixed_4b and Mixed_5a as the style layers.

images generated using style layers: Mixed_3b, Mixed_3c, Mixed_4a (left to right)
images generated using style layers: Mixed_4b and Mixed_5a (left to right)

Then, for generating the stylized images, I used Mixed_3b for calculating the content loss:

stylized images generated: using Mixed_3b, style weight-15200 (left) and Mixed_3c, style weight-43400 (right)
stylized images generated: using Mixed_4a, style weight-29700 (left) and Mixed_4b, style weight-140700 (right)
stylized images generated: using Mixed_5a, style weight-234700

Experiment #6: Train using inception-v4 with style loss from different layers

I first generated the pastiches using Mixed_5a, Mixed_5b, Mixed_5c, Mixed_6a and Mixed_6b as the style layers (with content weight and tv weight set to zero).

Pastiches generated using Mixed_5a, Mixed_5b, Mixed_5c (left to right)
Pastiches generated using Mixed_6a, Mixed_6b (left to right)

Next, for generating the stylized images, I used Mixed_5a as the content layer. The following images were generated with content weight 8 and tv weight 10:

stylized images generated: using Mixed_5a, style weight-21300 (left) and Mixed_5b, style weight- 41600 (right)
stylized images generated: using Mixed_5c, style weight-24100 (left) and Mixed_6a, style weight-85000 (right)
stylized images generated: using Mixed_6b, style weight-127000

The images generated using different layers are much more similar to each other for inception-v1, inception-v2, inception-v3 (trained on both openimages and imagenet) and inception-v4. This is in complete contrast to the observations for vgg-16/vgg-19.

Experiment #7: Train using different types of pooling (max/avg)

I used inception-v2; the content layer was Mixed_3b and the style layer was Mixed_3c. First, as before, I generated the pastiches.

Pastiche generated using avg pooling (left) and max pooling (right)

Quite surprisingly, these pastiches don’t resemble the ones generated using vgg-16 and don’t even resemble the brushstrokes of the style image. Next, I generated the stylized image. The content weight was 8, the style weight was 43400 and the tv weight was 10.

stylized image generated using avg pooling
stylized image generated using max pooling

Quite in contrast to the avg vs max pooling comparison for vgg-16, the image with max pooling looks better, and the weird dotted artifacts appear in the image with avg pooling. Next, I decided to plot the content loss, style loss and total loss.
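The loss curves below were produced by simply logging each term per iteration; a minimal sketch of that kind of plotting (assumed helper, not the repo's code):

    # Assumed logging/plotting helper: record each loss term per iteration
    # during training, then plot the three curves on one axis.
    import matplotlib.pyplot as plt

    def plot_losses(history):
        # history: dict like {'content': [...], 'style': [...], 'total': [...]}
        for name, values in history.items():
            plt.plot(values, label=name)
        plt.xlabel('iteration')
        plt.ylabel('loss')
        plt.legend()
        plt.show()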

content loss, style loss and total loss of image as it gets trained

The trend here is quite similar to the trend when using the vgg-16 network: the content loss converges to a higher value with max pooling than with avg pooling.

Experiment #8: Train using different initialization (‘noise’/’content’)

Again, as in the previous experiment, I used inception-v2 with content layer Mixed_3b and style layer Mixed_3c. The content weight was 8, the style weight was 43400, the tv weight was 10, and the pooling used was ‘avg’. I chose such a high style weight because with a lower style weight, none of the style was visible.
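The two initialization modes compared here amount to something like this (sketch with illustrative names):

    # 'content' initialization starts optimization from the content image;
    # 'noise' initialization starts from random noise of the same shape.
    import numpy as np
    import tensorflow as tf

    def initial_image(content_image, init='content'):
        # content_image: numpy array of shape (1, H, W, 3)
        if init == 'content':
            start = content_image.astype(np.float32)
        else:
            start = np.random.normal(scale=0.256,
                                     size=content_image.shape).astype(np.float32)
        return tf.Variable(start)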

image generated using initial ‘content’ (left) and ‘noise’ (right)
content loss, style loss and total loss of image as it gets trained

As can be observed, the content loss converges to a much higher value with noise initialization than with content initialization. More importantly, none of the content is visible in the generated image.

Next, I tried gradually increasing the content weight from 8 to 80, 160, 320 and 640 to see how the stylized image generated using ‘noise’ initialization changes. The following images were generated:

stylized image generated with content weight 80 (left) and 160 (right)
stylized image generated with content weight 320 (left) and 640 (right)

As expected, more and more of the content is preserved as the content weight increases, and the images do match the pastiches generated in the previous experiment. But none of the stylized images look as good as the ones with ‘content’ initialization. Next, I tried generating the stylized image with ‘content’ initialization and zero content weight and tv weight.

image generated with zero content weight and tv weight

This looks much better than any of the images generated with noise initialization.

What I can conclude from this is that initialization makes a big difference in generating good images with inception-v2.
