Style transfer typically requires a loss network with a well-developed hierarchy of features for calculating the loss. For this purpose, the good old vgg-16 and vgg-19 architectures work very well. But inception architectures (unlike resnet architectures) also have this property.
I wanted to see how inception architectures could be used for style transfer. Getting these architectures to work with style transfer required some tweaks. Here is a blog post describing the tweaks I had to make.
The following content images were used for these experiments:
The style image used was “The Great Wave off Kanagawa”, simply called “wave” by Katsushika Hokusai:
The code used in these experiments is available on github.
To find out which layers I mean by Conv2d_2c_3x3, Mixed_3b, etc. for inception-v1, run
in the repo. The same applies to inception-v2, inception-v3, inception-v4, vgg-16 and vgg-19.
Tweak #1: Removing checkerboard artifacts
Checkerboard artifacts can occur in images generated from neural networks. They are typically caused when a transposed 2d convolution has a kernel size that is not divisible by its stride. For a more in-depth discussion of checkerboard artifacts, read this post.
The backward pass through a convolution is a transposed convolution. Thus, when training an image using a loss network, checkerboard artifacts can occur if the loss network has a convolution layer whose kernel size is not divisible by its stride.
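This effect is easy to reproduce in isolation. The sketch below (a minimal PyTorch demonstration of my own, not code from the repo) backpropagates through a single 3x3 convolution with stride 2; each input pixel's gradient equals the number of conv windows covering it, so neighbouring pixels receive unequal gradients in a checkerboard-like pattern:

```python
import torch
import torch.nn as nn

# A 3x3 convolution with stride 2: kernel size not divisible by stride.
conv = nn.Conv2d(1, 1, kernel_size=3, stride=2, bias=False)
nn.init.constant_(conv.weight, 1.0)

x = torch.ones(1, 1, 8, 8, requires_grad=True)
conv(x).sum().backward()

# With kernel 3 and stride 2, window-coverage counts along each axis
# alternate 1, 1, 2, 1, 2, ... -- an uneven, checkerboard-like gradient.
g = x.grad[0, 0]
print(g)
```

Making the kernel size divisible by the stride (or removing the stride, as below) equalizes these counts and the pattern disappears.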
In the inception-v1, v2, v3 and v4 architectures, the first layer has stride 2 and kernel size 7 (in v1 and v2) or 3 (in v3 and v4). Neither 7 nor 3 is divisible by 2, so there was a possibility that checkerboard artifacts would be created here.
To check whether this was the case, I trained a noise image on only the content loss from Conv2d_1a_7x7 (in the inception-v1 architecture). The generated image looks normal, but on zooming in, checkerboard artifacts become visible.
Solution: Replace stride=2 in the first convolution layer with stride=1. The following image is generated in this case:
As observed, checkerboard artifacts are removed completely.
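In code, this is a one-line surgery on the loss network. Below is a hypothetical sketch assuming a PyTorch-style model (the repo's actual layer names may differ) that finds the first Conv2d and sets its stride to 1, leaving the pretrained weights untouched:

```python
import torch
import torch.nn as nn

def remove_first_stride(model: nn.Module) -> None:
    # Set the stride of the first Conv2d encountered to 1,
    # without touching its weights.
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            m.stride = (1, 1)
            return

# Toy stand-in for an inception stem: 7x7 conv with stride 2.
stem = nn.Sequential(nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3))
remove_first_stride(stem)

x = torch.randn(1, 3, 32, 32)
assert stem(x).shape[-2:] == (32, 32)  # no more downsampling in the stem
```

Since the stride only affects where the kernel is applied, the pretrained filter weights remain valid; only the output resolution changes.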
Tweak #2: Getting average pooling to work
When vgg-16 and vgg-19 networks are used as loss networks, max pooling layer is replaced with average pooling. This improves the gradient flow through the loss network and causes the image to converge faster.
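The swap itself can be done generically. Below is a minimal sketch (assuming a PyTorch model; not the repo's exact code) that recursively replaces every MaxPool2d with an AvgPool2d of the same geometry:

```python
import torch.nn as nn

def maxpool_to_avgpool(model: nn.Module) -> None:
    # Recursively replace MaxPool2d layers with AvgPool2d layers
    # that keep the same kernel size, stride, padding and ceil_mode.
    for name, child in model.named_children():
        if isinstance(child, nn.MaxPool2d):
            setattr(model, name, nn.AvgPool2d(
                kernel_size=child.kernel_size,
                stride=child.stride,
                padding=child.padding,
                ceil_mode=child.ceil_mode))
        else:
            maxpool_to_avgpool(child)

vgg_like = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2))
maxpool_to_avgpool(vgg_like)
assert isinstance(vgg_like[2], nn.AvgPool2d)
```

Because the pooling geometry is preserved, all downstream layers see feature maps of the same shape as before.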
In inception networks, the downsampling pooling layers have stride 2 and kernel size 3. Inception blocks also contain pooling layers that don't downsample (stride 1, kernel size 3). I replaced all max pooling layers with avg pooling.
My intuition was that because of the larger kernel size, the distribution of activations produced by avg pooling differs significantly from that of max pooling.
To test this hypothesis, I first tried recreating the original content image using max pool and avg pool. The content loss was calculated from the layer ‘Conv2d_2c_3x3’ of inception-v1 network.
The following were the images generated:
Clearly, the image generated using avg pooling is no good.
Next, I plotted the content loss (with max and average pooling), to see what was going on.
As visible, the content loss for avg pooling fluctuates, while the content loss for max pooling converges to a small value.
Next, I tried generating the image using an average pooling layer with kernel size 2 instead of 3, and it worked wonderfully well. The images below show the difference:
On the left is a plot of the content loss for avg pooling with kernel size 2 versus kernel size 3. As visible, the loss network with average pooling of kernel size 2 converges to a small value, while the one with kernel size 3 fluctuates.
Experiments on inception networks
All the following experiments can be recreated by running
python slow_style.py
with the command line arguments specified in this repo. Default values of all parameters were used unless specified otherwise.
Experiment #1: Reconstructing content images from different layers of inception-v1 and comparing with vgg-16
I tried reconstructing the image of Brad Pitt using different layers of inception-v1 and vgg-16, and then comparing them. The following layers were used for inception-v1: Conv2d_2c_3x3, Mixed_3b, Mixed_3c, Mixed_4b. For reconstruction from vgg-16, the following layers were used: conv2_2, conv3_1, conv3_2, conv4_1. The rationale behind choosing these layers was their relative distance from the respective pooling layers: conv2_2 and Conv2d_2c_3x3 are the last layers before the second pooling in vgg-16 and inception-v1 respectively; similarly, conv3_1 and Mixed_3b are the first layers after the second pooling, and so on.
The style weight and tv weight were set to zero.
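With the style and tv weights at zero, the reconstruction is plain optimization of the image pixels against a feature-matching content loss. A minimal sketch of that loop, using a random fixed conv as a stand-in for the real loss network (the repo's slow_style.py presumably does the equivalent with inception/vgg features):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
loss_net = torch.nn.Conv2d(3, 16, 3, padding=1)  # stand-in for a feature layer
for p in loss_net.parameters():
    p.requires_grad_(False)  # the loss network stays frozen

content = torch.rand(1, 3, 32, 32)
target_feat = loss_net(content)

image = torch.rand(1, 3, 32, 32, requires_grad=True)  # start from noise
opt = torch.optim.Adam([image], lr=0.05)

for _ in range(200):
    opt.zero_grad()
    loss = F.mse_loss(loss_net(image), target_feat)  # content loss only
    loss.backward()
    opt.step()
# After optimization, the image's features match the content image's.
```

The gradients flow only into the image; the network weights never change.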
As observed, the original image can be reconstructed from the earlier layers (Conv2d_2c_3x3 and Mixed_3b).
Below were the results when using vgg-16:
In both inception-v1 and vgg-16, the content is captured very well by the first few layers; the original image can be completely reconstructed up to conv3_2/Mixed_3c. After that, the drop in quality of the reconstructed image is quite significant for inception-v1.
Experiment #2: Reconstructing style images from different layers of inception-v1 and comparing with vgg-16
I tried reconstructing the pastiches of the style image using different layers of inception-v1 and vgg-16, and then comparing them. The following layers were used in the experiment for inception-v1: Mixed_3b, Mixed_3c, Mixed_4b, Mixed_4c, Mixed_5b. The content weight and tv weight were set to zero.
For vgg-16, the layers used were conv3_1, conv3_2, conv4_1, conv4_2 and conv5_1. The layers were again chosen with reference to their distance from the respective pooling layers: conv3_1 and Mixed_3b are the outputs of the first convolution layers after the second pooling in vgg-16 and inception-v1 respectively, conv3_2 and Mixed_3c of the second convolution after the second pooling, and so on.
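For style reconstruction, the loss compares Gram matrices of feature maps instead of the feature maps themselves. Below is a minimal sketch of the usual Gatys-style formulation (my own version; the normalization constants in the repo may differ):

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    # feat: (B, C, H, W) -> (B, C, C) matrix of channel correlations,
    # which discards spatial layout and keeps texture statistics.
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_loss(gen_feat: torch.Tensor, style_feat: torch.Tensor) -> torch.Tensor:
    return F.mse_loss(gram_matrix(gen_feat), gram_matrix(style_feat))

feat = torch.randn(1, 64, 28, 28)
G = gram_matrix(feat)
assert G.shape == (1, 64, 64)
assert torch.allclose(G, G.transpose(1, 2))  # Gram matrices are symmetric
assert style_loss(feat, feat).item() == 0.0
```

Because the Gram matrix throws away spatial positions, matching it reproduces the brushstrokes and textures of the style image without copying its layout.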
Below were the results when using inception-v1:
Below were the results when using vgg-16:
The pastiches generated by vgg-16 are much richer than the ones generated by inception-v1. Moreover, the pastiches of inception-v1 look more like crayons, while those of vgg-16 look like oil paintings.
Experiment #3: Train using different layers of inception-v1 network and compare with vgg-16 outputs
For inception, I used Conv2d_4a_3x3 for calculating the content loss. For style loss, I used Mixed_3b, Mixed_3c and Mixed_4b layers.
For vgg-16, I used conv2_2 for calculating the content loss. For style loss, I used conv3_1, conv3_2 and conv4_1 layers.
The content weight was 8, the style weight was 3200 and the tv weight was 10 for both networks.
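The three weights combine the losses into a single objective. The tv (total variation) term penalizes differences between neighbouring pixels and smooths out noise; a sketch of a minimal version (my own, the repo's normalization may differ):

```python
import torch

def tv_loss(img: torch.Tensor) -> torch.Tensor:
    # Total variation: mean absolute difference between neighbouring
    # pixels, vertically and horizontally.
    dh = (img[..., 1:, :] - img[..., :-1, :]).abs().mean()
    dw = (img[..., :, 1:] - img[..., :, :-1]).abs().mean()
    return dh + dw

# Combined objective with the weights used in this experiment:
# total = 8 * content_loss + 3200 * style_loss + 10 * tv_loss(image)
flat = torch.ones(1, 3, 16, 16)
assert tv_loss(flat).item() == 0.0  # a constant image has zero tv loss
```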
As can be seen, the stylized images look like crayon paintings, quite like how the pastiches looked, and they capture the style of the painter much more poorly than vgg does. Here are the vgg-16 outputs, by the way (using the corresponding layers given in the previous experiment).
And these results do look like oil paintings, like the pastiches did in the previous experiment :).
Experiment #4: Train using inception-v3 networks trained on openimages and imagenet
Next, to check what difference there is between the images generated by the inception-v3 architecture trained on imagenet and on openimages, I did another experiment. For content loss, I used the layer Mixed_5b. For style loss, I used Mixed_5b, Mixed_5c and Mixed_6a one by one.
Again, as before, I first tried generating the pastiches by setting the content weight and tv weight to zero for inception-v3 trained on both openimages and imagenet. The following images were generated:
Somehow, a greenish shade is visible in the pastiches generated using inception-v3 trained on openimages, even though it is not present in the wave image.
For generating the stylized images, the content weight was 8, the tv weight was 10, and the style weights were as given below the stylized images.
The following stylized images were generated for different layers:
As can be seen, the same architecture captures the style of the painter better when trained on more images.
Experiment #5: Train using inception-v2 with style loss from different layers
I used Mixed_3b, Mixed_3c, Mixed_4a, Mixed_4b, Mixed_5a for first generating the pastiches.
Then, for generating the stylized images, I used Mixed_3b for calculating the content loss:
Experiment #6: Train using inception-v4 with style loss from different layers
I used Mixed_5a, Mixed_5b, Mixed_5c, Mixed_6a, Mixed_6b for first generating the pastiches (content weight and tv weight set to zero).
Next, for generating the stylized images, I used Mixed_5a as the content layer. The following images were generated; the content weight was 8 and the tv weight was 10:
Images generated using different layers are much more similar to each other for inception-v1, inception-v2, inception-v3 (trained on both openimages and imagenet) and inception-v4. This is in complete contrast to the observations for vgg-16/vgg-19.
Experiment #7: Train using different types of pooling (max/avg)
I used inception-v2; the content layer was Mixed_3b and the style layer was Mixed_3c. First, as before, I generated the pastiches.
Quite surprisingly, the pastiches don't resemble the pastiches generated using vgg-16, and don't even resemble the brushstrokes of the style image. Next, I generated the stylized image. The content weight was 8, the style weight was 43400 and the tv weight was 10.
Quite in contrast to the comparison between avg and max pooling for vgg-16, the image with max pooling looks better, and the weird dotted artifacts appear in the image with avg pooling. Next, I decided to plot the content loss, style loss and total loss.
The trend here is quite similar to the trend when using the vgg-16 network: the content loss converges to a higher value with max pooling than with avg pooling.
Experiment #8: Train using different initialization (‘noise’/’content’)
Again, like the previous experiment, I used inception-v2; the content layer was Mixed_3b, the style layer was Mixed_3c, the content weight was 8, the style weight was 43400, the tv weight was 10, and the pooling used was 'avg'. I chose such a high style weight because with a lower style weight, none of the style was visible.
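The two initialization modes differ only in where the optimized image starts from. A sketch of what 'noise' vs 'content' initialization typically means (my assumption of the repo's behaviour, not its exact code):

```python
import torch

content_img = torch.rand(1, 3, 224, 224)

# 'noise': start from random pixels; all content must be rebuilt
# from scratch by the content loss.
image_noise = torch.rand_like(content_img).requires_grad_(True)

# 'content': start from the content image itself, so the content is
# already present and only needs to be preserved while style is added.
image_content = content_img.clone().requires_grad_(True)

assert not torch.equal(image_noise, content_img)
assert torch.equal(image_content.detach(), content_img)
```

With 'content' initialization, the content loss starts at zero and the optimizer only has to trade it off against the style term, which may explain the gap observed below.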
As can be observed, the content loss converges to a much higher value for noise initialization than for content initialization. More importantly, none of the content is visible in the generated image.
Next, I tried gradually increasing the content weight from 8 to 80, 160, 320 and 640, to see how the stylized image generated using 'noise' initialization changes. The following images were generated:
As expected, more and more of the content is preserved as the content weight increases, and the images do match the pastiches generated in the previous experiment. But none of the stylized images look as good as the ones with 'content' initialization. Next, I tried generating the stylized image with 'content' initialization and zero content weight and zero tv weight.
This looks much better than any of the images generated with noise initialization.
What I can conclude from this is that initialization makes a big difference when generating images using inception-v2.