Stabilizing neural style-transfer for video
Using noise-resilience for temporal stability in style transfer
by Jeffrey Rainy and Archy de Berker
Style transfer — using one image to stylize another — is one of the applications of Deep Learning that made a big impact in 2017. In this post we discuss the challenges of taking style transfer from still images to real-time video. In the companion piece, we give an overview of Element AI’s video style transfer system, Mur.AI.
The original paper, A Neural Algorithm of Artistic Style (Gatys, Ecker, and Bethge; 2015), presented a technique for learning a style and applying it to other images. Briefly, they use gradient descent from white noise to synthesize an image that matches the content and style of the target and source images respectively. The content and style representations they seek to match are derived from the features of the VGG16 CNN from the University of Oxford. For more background on style transfer, see our piece on Mur.AI.
However, when used frame-by-frame on movies, the resulting stylized animations are of low quality. Subjectively, they suffer from extensive “popping”: inconsistent stylization from frame to frame. The stylized features (lines, strokes, colours) are present in one frame but gone in the next:
Style transfer for video
One solution to the problems with the original method is suggested in a subsequent paper by Manuel Ruder, Alexey Dosovitskiy, and Thomas Brox, Artistic style transfer for videos (2016). They present a method of obtaining stable video by penalizing departures from the optical flow of the input video. Style features remain present from frame to frame, following the movement of elements in the original video. However, the method is computationally far too heavy for real-time style transfer, taking minutes per frame.
[Removed previously embedded video to https://www.youtube.com/watch?v=Khuj4ASldmU&t=10s as it misrepresented our contribution. This video is from a different approach relying, among other things, on Optical Flow and long-term consistency]
We decided to start with a faster option, from Johnson, Alahi, and Fei-Fei’s 2016 paper Perceptual Losses for Real-Time Style Transfer and Super-Resolution. They train a second ConvNet to approximate the time-consuming pixel-level gradient descent performed by Gatys et al. The result is much quicker to run but, naively applied to videos, produces the same “popping” problems discussed above.
Our implementation combines the innovations of Johnson et al. and Ruder et al. to produce a fast style-transfer algorithm that significantly reduces popping in the learned style. The stabilization is done at training time, allowing for smooth style transfer of videos in real time. The stability difference is readily visible:
The central idea employed here is that temporal instability and popping result from the style changing radically when the input changes very little. In fact, the changes in pixel values from frame-to-frame are largely noise.
We can therefore impose a specific loss at training time: by manually adding a small amount of noise to our images during training and minimizing the difference between the stylized versions of our original and noisy images, we can train a network for stable style-transfer.
Modifications to the training
We started from the Chainer Fast Neural Style implementation, which we’ll refer to as CFNS from now on.
Our modifications to the training code are located in a fork of CFNS.
We call a model stable if adding noise to some pixels in a source image results in a similar stylization. The gist of our improvement to the training is to add a loss function that captures how unstable our model is. The vanilla CFNS loss function is computed as follows:
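The vanilla loss combines a content (feature) loss, a style loss based on Gram matrices, and a total-variation regularizer. The original listing isn’t reproduced here, so the following is a minimal NumPy sketch of those three terms, with illustrative weights and a single feature layer for brevity (CFNS itself sums the style loss over several VGG layers):

```python
import numpy as np

def gram_matrix(feat):
    # feat: (C, H, W) feature map -> (C, C) Gram matrix, normalized by size
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def total_variation(img):
    # Sum of squared differences between neighbouring pixels,
    # encouraging spatial smoothness in the output image
    return np.sum((img[:, 1:, :] - img[:, :-1, :]) ** 2) + \
           np.sum((img[:, :, 1:] - img[:, :, :-1]) ** 2)

def vanilla_loss(y_feats, content_feats, style_gram, y_img,
                 lambda_f=1.0, lambda_s=5.0, lambda_tv=1e-6):
    # Content loss: distance between feature maps at a chosen layer
    l_feat = np.mean((y_feats - content_feats) ** 2)
    # Style loss: distance between Gram matrices (one layer shown)
    l_style = np.mean((gram_matrix(y_feats) - style_gram) ** 2)
    # Total variation regularizer on the stylized image
    l_tv = total_variation(y_img)
    return lambda_f * l_feat + lambda_s * l_style + lambda_tv * l_tv
```

The weights `lambda_f`, `lambda_s`, and `lambda_tv` are placeholders here; CFNS exposes its own values as command-line options.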
We’ve added a fourth loss component at line 177:
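In essence, the fourth term penalizes any distance between the stylization of the clean image and the stylization of its noisy copy. A minimal NumPy sketch, assuming a mean-squared difference (the fork may use a sum instead):

```python
import numpy as np

def popping_loss(y, noisy_y, lambda_noise=1000.0):
    # y:       stylization of the clean source image
    # noisy_y: stylization of the same image with noise added
    # Penalize any difference between the two stylizations, so the
    # network learns to ignore small input perturbations.
    return lambda_noise * np.mean((noisy_y - y) ** 2)
```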
lambda_noise is a tuning parameter and noisy_y is the stylization of the source image, y, with noise added at line 146:
The noise image we add is zero everywhere except on noise_count pixels, where it is uniformly distributed in [−noise_range, noise_range], providing two more hyperparameters.
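Generating such a noise image can be sketched in NumPy as follows (single-channel for simplicity; the function name and details are illustrative, and repeated random indices may leave slightly fewer than noise_count pixels perturbed):

```python
import numpy as np

def make_noise_image(shape, noise_count=1000, noise_range=30, rng=None):
    # Zero everywhere except `noise_count` randomly chosen pixels,
    # each drawn uniformly from [-noise_range, noise_range].
    rng = np.random.default_rng() if rng is None else rng
    h, w = shape
    noise = np.zeros((h, w), dtype=np.float32)
    ys = rng.integers(0, h, size=noise_count)
    xs = rng.integers(0, w, size=noise_count)
    noise[ys, xs] = rng.uniform(-noise_range, noise_range, size=noise_count)
    return noise
```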
We explored the space of these three parameters: lambda_noise (the weight given to the noise term when computing the loss), noise_range (the maximum magnitude of the noise added to a pixel), and noise_count (the number of pixels that receive noise).
On training images of size 512x512, we found the values that worked best for our styles to be:
lambda_noise = 1000
noise_range = 30
noise_count = 1000
Interestingly, we found that it seems unnecessary to regenerate a different noise image for each mini-batch. Training with the same noise image works just as well, in addition to being faster.
If you’re training your own networks, note that:
noise_count should be scaled with the square of the image size: 0.3% of the pixels worked best for us.
lambda_noise was chosen empirically so that the loss contribution from L_pop was around 10% of the total loss. We wanted stability to be important, but not more important than the style itself.
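The scaling rule above can be expressed as a small helper (the name and default fraction are illustrative; note that 0.3% of a 512x512 image is about 786 pixels, in the same ballpark as the noise_count = 1000 we used):

```python
def scaled_noise_count(height, width, fraction=0.003):
    # Scale noise_count with the image area, i.e. with the square
    # of the image side; ~0.3% of the pixels worked best for us.
    return round(fraction * height * width)
```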
Overall, it is impressive that such a small change to the training (minimizing the style difference caused by noisy inputs) resulted in such a big improvement in the final quality of the stylized video. We had expected the stylization quality to degrade sharply as it traded off against stability. However, what we observed with reasonable parameters is that the style quality remains intact while the popping is greatly reduced.
The one case we found where the stabilization visibly affects the style is when large areas of the same colour are present. In the Tübingen sample image, for example, the large patch of blue sky cannot be stably stylized. If the network stylized the sky with patterns, they would necessarily change from frame to frame, as there is nothing for them to latch onto. As such, the stabilized version of the style learned to fill the sky with significantly less texture: