All you need for Photorealistic Style Transfer in PyTorch
This post follows the paper High-Resolution Network for Photorealistic Style Transfer. I discuss the details of the paper and the PyTorch code. My implementation can be found in this repo. The official code release for the paper can be found here.
You can use this model as your default choice for photorealistic style transfer.
- What is Style Transfer?
- So why another paper?
- Gram Matrix
- High-Resolution Models
- Style Transfer Details
- Hi_Res Generation Network
- Loss Functions
- Difficult Part
What is Style Transfer?
We have two images as input: one is the content image and the other is the style image.
Our aim is to transfer the style from the style image to the content image. The result looks something like this.
So why another paper?
Earlier work on style transfer, although successful, was not able to maintain the structure of the content image. For instance, compare Fig2 with the original content image in Fig1. As you can see, the curves and structure of the content image are not distorted, and the output image has the same structure as the content image.
Gram Matrix
The main idea behind the paper is to use the Gram matrix for style transfer. The following two papers showed that the Gram matrix of the feature maps of a convolutional neural network (CNN) can represent the style of an image, and they proposed the neural style transfer algorithm for image stylization.
- Texture Synthesis Using Convolution Neural Networks by Gatys et al. 2015
- Image Style Transfer Using Convolutional Neural Networks by Gatys et al. 2016
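Details about the Gram matrix can be found on Wikipedia. Mathematically, given a matrix V whose columns are vectors, the Gram matrix is computed as $G = V^{\top} V$, so each entry of G is the inner product of two of those vectors.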
High-Resolution Models
High-resolution models come from a recent research paper accepted at CVPR 2019. In a typical CNN, we first decrease the image size while increasing the number of filters, and then increase the size of the image back to the original size.
This forces the model to generate output images from a very small resolution, which results in a loss of finer details and structure. The high-resolution model was introduced to counter this.
A high-resolution network is designed to maintain high-resolution representations through the whole process while continuously receiving information from the low-resolution branches. So we train our models on the original resolution.
An example of this model is covered below. You can refer to the original papers for more details. I will cover this topic in detail in next week's blog post.
Style Transfer Details
The general architecture of modern deep learning style transfer algorithms looks something like this.
A style transfer model needs three things:
- Generating model: It generates the output images. In Fig4 this is the ‘Hi-Res Generation Network’.
- Loss functions: The correct choice of loss functions is very important if you want to achieve good results.
- Loss network: You need a pretrained CNN that can extract good features from the images. In our case, it is VGG19 pretrained on ImageNet.
So we load the VGG model. The complete code is available in my GitHub repo.
Next we load our images from disk.
My images are stored as src/imgs/content.png and src/imgs/style.png.
Detail: When we load our images, what sizes should we use? The height and width of your content image should be divisible by 4, as our model downsamples images 2 times. Do not resize the style image. Use its original resolution.
For the images I am using, the size of the content image is (500x500x3) and the size of the style image is (800x800x3).
Hi_Res Generation Network
The model is quite simple. We start with 500x500x3 images and maintain this resolution through the complete model. In parallel, we downsample to 250x250 and 125x125 and then fuse these branches back together with the 500x500 features.
- No pooling is used (as pooling causes loss of information). Instead, strided convolutions (i.e. stride=2) are used.
- No dropout is used. But if you need regularization you can use weight decay.
- 3x3 conv kernels are used everywhere with padding=1.
- Only zero padding is used. Reflection padding was tested, but the results were not good.
- For upsampling, ‘bilinear’ mode is used.
- For downsampling, conv layers are used.
- InstanceNorm is used.
Residual connections are used between every block. We use the bottleneck layer from the ResNet architecture. (In Fig5 all the horizontal arrows are bottleneck layers.)
Refresher on bottleneck layer.
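As a reminder of the general idea, here is a minimal sketch of a ResNet-style bottleneck block: a 1x1 conv reduces the channels, a 3x3 conv does the spatial work, a 1x1 conv expands the channels again, and the input is added back through a skip connection. The class name, the `reduction` parameter and the use of InstanceNorm inside the block are assumptions for illustration; the exact layer layout used in the paper and the repo may differ.

```python
import torch
from torch import nn


class Bottleneck(nn.Module):
    """ResNet-style bottleneck: 1x1 reduce -> 3x3 -> 1x1 expand, plus a skip."""

    def __init__(self, in_channels, out_channels, reduction=4):
        super().__init__()
        mid = max(out_channels // reduction, 1)
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, mid, kernel_size=1),
            nn.InstanceNorm2d(mid, affine=True),
            nn.ReLU(),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1),  # zero padding
            nn.InstanceNorm2d(mid, affine=True),
            nn.ReLU(),
            nn.Conv2d(mid, out_channels, kernel_size=1),
            nn.InstanceNorm2d(out_channels, affine=True),
        )
        # Project the input with a 1x1 conv only when channel counts differ.
        if in_channels == out_channels:
            self.skip = nn.Identity()
        else:
            self.skip = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        # Residual connection: add the (possibly projected) input back.
        return self.relu(self.body(x) + self.skip(x))
```

The spatial size is unchanged by this block, so it can be stacked freely inside a branch that keeps a fixed resolution.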
Now we are ready to implement our style transfer model, which we call HRNet (based on the paper). Use Fig5 as a reference.
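To make the multi-resolution idea concrete, here is a heavily simplified sketch, not the architecture from Fig5. It keeps one full-resolution branch, creates half- and quarter-resolution branches with strided convolutions, upsamples them with bilinear interpolation, and fuses everything by concatenation. The names `conv_block` and `TinyHRNet`, the channel widths, and the single fusion step are all hypothetical choices; plain conv blocks stand in for the bottleneck layers.

```python
import torch
from torch import nn
import torch.nn.functional as F


def conv_block(in_ch, out_ch, stride=1):
    """3x3 conv (zero padding) -> InstanceNorm -> ReLU; stride=2 halves H and W."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.InstanceNorm2d(out_ch, affine=True),
        nn.ReLU(),
    )


class TinyHRNet(nn.Module):
    """Three branches at full, 1/2 and 1/4 resolution, fused once at the end."""

    def __init__(self, width=16):
        super().__init__()
        self.stem = conv_block(3, width)
        self.high = conv_block(width, width)                      # full resolution
        self.down2 = conv_block(width, 2 * width, stride=2)       # 1/2 resolution
        self.mid = conv_block(2 * width, 2 * width)
        self.down4 = conv_block(2 * width, 4 * width, stride=2)   # 1/4 resolution
        self.low = conv_block(4 * width, 4 * width)
        self.fuse = conv_block(7 * width, width)                  # 1 + 2 + 4 = 7
        self.head = nn.Conv2d(width, 3, kernel_size=3, padding=1)

    def forward(self, x):
        size = x.shape[-2:]
        x = self.stem(x)
        high = self.high(x)
        half = self.down2(x)
        mid = self.mid(half)
        low = self.low(self.down4(half))
        # Bring the low-resolution branches back to full size, then concatenate.
        mid_up = F.interpolate(mid, size=size, mode="bilinear", align_corners=False)
        low_up = F.interpolate(low, size=size, mode="bilinear", align_corners=False)
        fused = torch.cat([high, mid_up, low_up], dim=1)
        return self.head(self.fuse(fused))
```

With a 500x500 input, the three branches in this sketch work at 500x500, 250x250 and 125x125, and the output is again 500x500x3.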
Loss Functions
In style transfer, we use feature extraction to calculate the losses. Put simply, feature extraction means you take a model pretrained on ImageNet, pass your images through it, and store the outputs of the intermediate layers. Generally, a VGG model is used for such tasks.
So you take the outputs from the conv layers. For the figure above, you can take the output from the second 3x3 conv 64 layer and then from 3x3 conv 128.
To extract features from VGG we use the following code.
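A minimal sketch of such a feature extractor is shown here. It walks through the layers of a sequential model and stores the outputs at chosen indices. The function name `get_features` and the index-to-name mapping are assumptions: the indices match the layout of torchvision's `vgg19().features`, but the particular set of layers is an illustrative choice and may not be the one used in the repo.

```python
import torch
from torch import nn

# Assumed mapping from indices in torchvision's `vgg19().features` to layer
# names. Which layers to pick is an illustrative choice, not the author's list.
DEFAULT_LAYERS = {
    "0": "conv1_1",
    "5": "conv2_1",
    "10": "conv3_1",
    "19": "conv4_1",
    "21": "conv4_2",
}


def get_features(image, model, layers=None):
    """Run `image` through `model` layer by layer and keep selected outputs."""
    layers = DEFAULT_LAYERS if layers is None else layers
    features = {}
    x = image
    for index, layer in model.named_children():
        x = layer(x)
        if index in layers:
            # Note: torchvision's VGG uses in-place ReLU, so a stored conv
            # output ends up holding the post-ReLU values.
            features[layers[index]] = x
    return features
```

The same function is applied to the content image, the style image and the generated image, so the three feature dictionaries share the same keys.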
We use 5 layers in total for feature extraction. Only conv4_2 is used as the layer for the content loss.
As shown in Fig4, we pass the output image from HRNet and the original content and style images through VGG.
There are two losses:
- Content Loss
- Style Loss
Content Loss: The content image and the output image should have a similar feature representation as computed by the loss network VGG, because we are only changing the style without changing the structure of the image.
For the content loss, we use the squared Euclidean distance between the two feature representations, which up to a normalization constant is $\lVert \phi_j(\hat{y}) - \phi_j(y) \rVert_2^2$, where $\hat{y}$ is the output image and $y$ is the content image.
Here $\phi_j$ refers to the activations of the j-th layer of the loss network. In code it looks like this
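A minimal sketch of the content loss, assuming the features are stored in dictionaries keyed by layer name and that conv4_2 is the content layer as stated above. Taking the mean instead of the sum is one possible normalization and is an assumption here.

```python
import torch


def content_loss(output_features, content_features, layer="conv4_2"):
    """Mean squared difference between the activations of one layer."""
    diff = output_features[layer] - content_features[layer]
    return torch.mean(diff ** 2)
```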
Style Loss: We use the Gram matrix for this, since the style of an image is represented by the Gram matrix of its feature maps. Our aim is to make the styles of the two images close, so we compute the difference between the Gram matrices of the style image and the output image and then take its Frobenius norm.
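The style loss can be sketched as follows. The Gram matrix is computed by flattening each channel of a feature map into a vector and taking all pairwise inner products. The division by C*H*W, the per-layer weights, and the function names are assumptions for illustration, and the sketch assumes a batch size of 1.

```python
import torch


def gram_matrix(features):
    """Normalized Gram matrix of a (1, C, H, W) feature map, shape (C, C)."""
    _, c, h, w = features.shape
    flat = features.reshape(c, h * w)  # one row per channel
    return flat @ flat.t() / (c * h * w)


def style_loss(output_features, style_grams, layer_weights):
    """Weighted sum over layers of the squared Frobenius norm of the
    difference between the output and style Gram matrices."""
    loss = 0.0
    for layer, weight in layer_weights.items():
        diff = gram_matrix(output_features[layer]) - style_grams[layer]
        loss = loss + weight * torch.sum(diff ** 2)
    return loss
```

Because the Gram matrix has shape (C, C) regardless of H and W, the style image and the output image do not need to have the same resolution. The Gram matrices of the style image can be computed once before training, as they never change.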
Difficult Part
To compute our final losses, we multiply them by some weights.
content_loss = content_weight * content_loss
style_loss = style_weight * style_loss
The difficulty comes in setting these values. If you want a particular output, you will have to test different values before you get the desired result.
To build your own intuition, you can choose two images and try different ranges of values. I am working on a summary of this, which will be available in my repo README.
The paper recommends content_weight in the range [50, 100] and style_weight in the range [1, 10].
Well, congratulations, you made it to the end. You can now implement style transfer. Now read the paper for more details on style transfer.
Check out my repo README. It will contain complete instructions on how to use the code in the repo, along with the complete steps to train your model. I will be adding video support in the coming week, so you can transfer your style to all the frames in a video. I am also experimenting with cyclic learning for style transfer, and I will add support for fastai as well.
My earlier posts:
- SPADE: State of the art in Image-to-Image Translation by Nvidia
- Weight Standardization: A new normalization in town
- Training AlexNet with tips and checks on how to train CNNs