pystiche: A Framework for Neural Style Transfer

Philip Meier · Published in PyTorch · Dec 10, 2020
content + artistic style = new image
Merge arbitrary content and artistic styles into one image with pystiche.

Introduction

pystiche is a framework for Neural Style Transfer (NST) built upon and fully compatible with PyTorch. An NST enables the fully automatic merging of arbitrary content with an arbitrary artistic style, as can be seen in the image above.

pystiche was peer-reviewed by pyOpenSci and is published in the Journal of Open Source Software (JOSS). The name of the project is a pun on pastiche, meaning:

A pastiche is a work of visual art […] that imitates the style or character of the work of one or more other artists. Unlike parody, pastiche celebrates, rather than mocks, the work it imitates.

Motivation

The seminal work of Gatys, Ecker, and Bethge gave birth to the field of Neural Style Transfer in 2015. Unlike for Deep Learning (DL) in general, there is still no library or framework for implementing NST. Thus, authors of new NST techniques either implement everything from scratch or base their implementation on existing ones of other authors. Both ways have their downsides: the former dampens innovation, since reusable parts have to be implemented over and over again, while with the latter the authors inherit the technical debt caused by the rapid development pace of DL hardware and software.

pystiche was developed to overcome this particular issue. Although its core audience is researchers, its easy-to-use API also opens up the field of NST for recreational use by laypersons.

Preliminaries

Before we dive into the implementation, let's take a step back and review how an NST is performed. It is posed as an optimization problem with two possible approaches: image-based and model-based optimization. While pystiche is well able to handle the latter, it is more elaborate, and thus we stick to the former for this post.

With an image-based approach, the pixels of an image are iteratively adapted, i.e. trained, to fit a loss function called the perceptual loss. The perceptual loss is the core part of an NST and is divided into the content loss and the style loss. These partial losses assess how well the output image matches the target images. In contrast to traditional style transfer methods, the perceptual loss comprises a multi-layer model called an encoder, which is the reason why pystiche is built upon PyTorch.

With this minimal description of NST we are now ready to look at the actual implementation. For more background information, you can head over to pystiche’s documentation, which explains these concepts in more detail.

Usage example

This example showcases how to use pystiche to generate the output image in the “equation” above. Let's start by importing everything we need.

Additionally, we select the device we will be working on. While pystiche is designed to be device-agnostic, the NST can be sped up by multiple orders of magnitude with a GPU.

Imports and device selection.
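A minimal sketch of this step, assuming the get_device and show_image helpers from pystiche 0.7; the line below the snippet shows the printed version.

import pystiche
from pystiche import demo, enc, loss, ops, optim
from pystiche.image import show_image
from pystiche.misc import get_device

# Print the pinned version of pystiche.
print(f"pystiche=={pystiche.__version__}")

# Select the device: pystiche is device-agnostic, but a GPU speeds
# up the optimization by multiple orders of magnitude.
device = get_device()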
pystiche==0.7.0

Multi-layer encoder

The content_loss and the style_loss, which we will define in a moment, operate on the encodings of an image rather than on the image itself. These encodings are generated by a pretrained encoder at various levels. pystiche defines the enc.MultiLayerEncoder class, which handles this efficiently in a single forward pass. In this example we use the vgg19_multi_layer_encoder, which is based on the VGG19 architecture. By default, it loads the weights provided by torchvision.

Multi-layer encoder.
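A sketch of how the encoder is created; printing it produces the summary below.

# Create the multi-layer encoder; by default this downloads and loads
# the VGG19 weights provided by torchvision.
multi_layer_encoder = enc.vgg19_multi_layer_encoder()
print(multi_layer_encoder)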
VGGMultiLayerEncoder(
  arch=vgg19, framework=torch, allow_inplace=True
  (preprocessing): TorchPreprocessing(
    (0): Normalize(
      mean=('0.485', '0.456', '0.406'),
      std=('0.229', '0.224', '0.225')
    )
  )
  (conv1_1): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu1_1): ReLU(inplace=True)
  (conv1_2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu1_2): ReLU(inplace=True)
  (pool1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv2_1): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu2_1): ReLU(inplace=True)
  (conv2_2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu2_2): ReLU(inplace=True)
  (pool2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv3_1): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu3_1): ReLU(inplace=True)
  (conv3_2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu3_2): ReLU(inplace=True)
  (conv3_3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu3_3): ReLU(inplace=True)
  (conv3_4): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu3_4): ReLU(inplace=True)
  (pool3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv4_1): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu4_1): ReLU(inplace=True)
  (conv4_2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu4_2): ReLU(inplace=True)
  (conv4_3): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu4_3): ReLU(inplace=True)
  (conv4_4): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu4_4): ReLU(inplace=True)
  (pool4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv5_1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu5_1): ReLU(inplace=True)
  (conv5_2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu5_2): ReLU(inplace=True)
  (conv5_3): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu5_3): ReLU(inplace=True)
  (conv5_4): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (relu5_4): ReLU(inplace=True)
  (pool5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)

Perceptual loss

The content and style losses are defined in pystiche as operators. For this tutorial we use the ops.FeatureReconstructionOperator as the content_loss, which compares the encodings directly. If the encoder was trained for a classification task, as is the case here, these encodings represent the content. As content_layer we choose one deep within the multi_layer_encoder to obtain an abstract content representation rather than one with many unnecessary details.

Content loss.
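A sketch of the content loss, assuming the extract_encoder helper; the layer (relu4_2) and weight follow from the perceptual loss printed further down.

# Compare the encodings of a single deep layer directly.
content_layer = "relu4_2"
content_encoder = multi_layer_encoder.extract_encoder(content_layer)
content_weight = 1e0
content_loss = ops.FeatureReconstructionOperator(
    content_encoder, score_weight=content_weight
)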

We use the ops.GramOperator as the basis for the style_loss. It discards spatial information by comparing the correlations of the individual channels of the encodings. This enables the synthesis of style elements everywhere in the output image rather than only where they are located in the style image. The ops.GramOperator performs best if it operates on both shallow and deep style_layers.

The style_weight enables us to control the focus on content or style in the output image.

For convenience, we wrap everything in an ops.MultiLayerEncodingOperator, which handles the case of operators of the same type operating on multiple layers of the same multi_layer_encoder.

Style loss.
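A sketch of the style loss; the five layers and their equal weights of 0.2 mirror the perceptual loss printed below.

# Match channel-wise correlations (Gram matrices) on shallow as well
# as deep layers; with equal layer weights each layer contributes 1/5 = 0.2.
style_layers = ("relu1_1", "relu2_1", "relu3_1", "relu4_1", "relu5_1")
style_weight = 1e3

def get_style_op(encoder, layer_weight):
    return ops.GramOperator(encoder, score_weight=layer_weight)

style_loss = ops.MultiLayerEncodingOperator(
    multi_layer_encoder, style_layers, get_style_op, score_weight=style_weight
)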

The loss.PerceptualLoss combines the content_loss and style_loss and will serve as the criterion for the optimization.

Perceptual loss.
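Combining both losses is a one-liner; moving the criterion to the device is our addition for this sketch. Printing it yields the summary below.

perceptual_loss = loss.PerceptualLoss(content_loss, style_loss).to(device)
print(perceptual_loss)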
PerceptualLoss(
  (content_loss): FeatureReconstructionOperator(
    score_weight=1,
    encoder=VGGMultiLayerEncoder(
      layer=relu4_2,
      arch=vgg19,
      framework=torch,
      allow_inplace=True
    )
  )
  (style_loss): MultiLayerEncodingOperator(
    encoder=VGGMultiLayerEncoder(
      arch=vgg19,
      framework=torch,
      allow_inplace=True
    ),
    score_weight=1000
    (relu1_1): GramOperator(score_weight=0.2)
    (relu2_1): GramOperator(score_weight=0.2)
    (relu3_1): GramOperator(score_weight=0.2)
    (relu4_1): GramOperator(score_weight=0.2)
    (relu5_1): GramOperator(score_weight=0.2)
  )
)

Images

We now load and register the target images that will be used in the NST. We resize them to 500 pixels, since an NST is quite memory-intensive.
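
The images from the “equation” above are not bundled with pystiche, so this sketch substitutes two of its demo images ("bird1" and "paint"); registering the targets happens via set_content_image and set_style_image.

# Load stand-in demo images, resized to 500 pixels.
images = demo.images()
size = 500
content_image = images["bird1"].read(size=size, device=device)
style_image = images["paint"].read(size=size, device=device)

# Register the target images with the criterion.
perceptual_loss.set_content_image(content_image)
perceptual_loss.set_style_image(style_image)

show_image(content_image, title="Content image")
show_image(style_image, title="Style image")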

Content image.
Style image.

Neural Style Transfer

As a last preliminary step we create the input_image. We start the NST from the content_image, since this way it converges quickly.

The image_optimization function is used for convenience, but it can be replaced by a manual optimization loop without restriction. If not specified otherwise, as is the case here, torch.optim.LBFGS is used as the optimizer.

Neural Style Transfer.
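A sketch of the final step, assuming the image_optimization signature (input image, criterion, number of steps); num_steps=500 matches pystiche's default.

# Start from the content image for quick convergence.
input_image = content_image.clone()

# Run the image optimization; torch.optim.LBFGS is used by default.
output_image = optim.image_optimization(input_image, perceptual_loss, num_steps=500)

show_image(output_image, title="Output image")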
Output image.

Conclusion

The usage example hopefully showed that pystiche makes the definition of the perceptual loss, i.e. the core concept of an NST, easy. Combined with the provided periphery, we hope that pystiche establishes itself as the backbone for upcoming NST research. If this story piqued your interest, you can head over to our other usage examples.

That being said, as of now the built-in functionality is limited to only the most common techniques. In the future we plan to expand this, preferably with the support of the community. Additionally, we are currently working on pystiche_papers: a package that provides reference implementations of existing NST papers to ease the comparison with new techniques.
