Real-Time Style Transfer

Artistic stylization on real-time video

Chimezie Iwuanyanwu
May 21, 2019
Style Transfer on Webcam Video Feed

Introduction

This post describes a system, built by David Bush, Chimezie Iwuanyanwu, Johnathon Love, Ashar Malik, Ejeh Okorafor, and Prawal Sharma, that applies style transfer to real-time video. We begin with relevant background on style transfer, then describe the model we use and how it is trained. After that, we cover the implementation details. Finally, we conclude with results and future steps: the results section showcases example style transfers, and the future steps section discusses possible improvements to the system. The GitHub repository for the code can be found here.

Background

Style transfer, more specifically neural style transfer (NST), is a neural network technique in which the style of one image is applied to the content of another. It was first described in the 2015 paper A Neural Algorithm of Artistic Style. In practice, the intermediate feature representations of a pre-trained convolutional neural network (CNN) are used to build separate representations of an image's content and of its style. With content and style isolated, we can then optimize the generation of a new image that combines the style of one image with the content of another.

An Example of Style Transfer

To apply style transfer to video, the simplest approach is to split the video into its successive frames and stylize each frame independently. The main challenge with this method is ensuring the style is applied consistently across consecutive frames. However, because of its simplicity, we chose it over more sophisticated approaches.
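
As a concrete illustration of this frame-by-frame approach, here is a minimal sketch using OpenCV; the `stylize` argument is a hypothetical placeholder for any per-image style-transfer function, not part of our actual code.

```python
import cv2

def stylize_video(in_path, out_path, stylize):
    """Split a video into frames, stylize each frame independently,
    and write the result back out. `stylize` is assumed to map a
    BGR uint8 frame (H x W x 3) to a stylized frame of the same shape."""
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
            int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    out = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        out.write(stylize(frame))  # style is applied to each frame in isolation
    cap.release()
    out.release()
```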

Another major challenge is the real-time constraint: the network must process frames quickly, keeping latency low enough to sustain a moderate frame rate (frames per second, FPS).

Model

The model we are using is based on the work done by Justin Johnson, Alexandre Alahi, and Li Fei-Fei in Perceptual Losses for Real-Time Style Transfer and Super-Resolution. The model is a deep neural network that consists of the following layers (a minimal code sketch follows the list):

  • Three convolutional layers
  • Five residual blocks
  • Three deconvolutional layers
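
For reference, a minimal chainer sketch of that layer layout is shown below. The channel widths, kernel sizes, and use of batch normalization here are illustrative assumptions rather than the exact configuration of our trained networks.

```python
import chainer
import chainer.functions as F
import chainer.links as L

class ResidualBlock(chainer.Chain):
    def __init__(self, ch=128):
        super().__init__()
        with self.init_scope():
            self.c1 = L.Convolution2D(ch, ch, ksize=3, stride=1, pad=1)
            self.c2 = L.Convolution2D(ch, ch, ksize=3, stride=1, pad=1)
            self.b1 = L.BatchNormalization(ch)
            self.b2 = L.BatchNormalization(ch)

    def forward(self, x):
        h = F.relu(self.b1(self.c1(x)))
        return x + self.b2(self.c2(h))  # residual (skip) connection

class StyleNet(chainer.Chain):
    """Three conv layers -> five residual blocks -> three deconv layers."""
    def __init__(self):
        super().__init__()
        with self.init_scope():
            self.c1 = L.Convolution2D(3, 32, ksize=9, stride=1, pad=4)
            self.c2 = L.Convolution2D(32, 64, ksize=4, stride=2, pad=1)    # downsample
            self.c3 = L.Convolution2D(64, 128, ksize=4, stride=2, pad=1)   # downsample
            self.res = chainer.ChainList(*[ResidualBlock(128) for _ in range(5)])
            self.d1 = L.Deconvolution2D(128, 64, ksize=4, stride=2, pad=1)  # upsample
            self.d2 = L.Deconvolution2D(64, 32, ksize=4, stride=2, pad=1)   # upsample
            self.d3 = L.Deconvolution2D(32, 3, ksize=9, stride=1, pad=4)

    def forward(self, x):
        h = F.relu(self.c1(x))
        h = F.relu(self.c2(h))
        h = F.relu(self.c3(h))
        for block in self.res:
            h = block(h)
        h = F.relu(self.d1(h))
        h = F.relu(self.d2(h))
        return (F.tanh(self.d3(h)) + 1) * 127.5  # map output back to [0, 255]
```

At prediction time, a single forward pass through a network like this produces the stylized image.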

Our system is based on the GitHub repository Chainer Fast Neural Style, which uses the Python module chainer to pipeline data during training and prediction. Additionally, we used VGG16 to derive content and style representations during training.
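
Below is a sketch of how content and style features might be pulled from chainer's bundled pre-trained VGG16; the specific layers chosen here are an assumption, not necessarily the ones used in the repository.

```python
from chainer.links import VGG16Layers

vgg = VGG16Layers()  # pre-trained ImageNet weights are downloaded on first use

def vgg_features(image, layers=("conv2_2", "conv3_3", "conv4_3")):
    """Return a dict mapping layer name -> feature map for one RGB image.
    Layer names follow chainer's conv{block}_{index} convention; note that
    extract() resizes inputs to 224x224 by default."""
    return vgg.extract([image], layers=list(layers))
```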

Training

Model Architecture and Training Process

The diagram above illustrates how training works. Two loss functions are computed: content loss and style loss. Content loss is the difference between the feature representations of the style network's output and those of the input (content) image. Style loss is the difference in style, captured by the Gram matrices of the feature maps, between the style network's output and the style image. A weighted sum of the content and style losses is minimized during training.
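
In code, the two terms look roughly like the sketch below: the Gram matrix captures the correlations between feature channels, and both losses are mean squared errors over feature maps or Gram matrices. The layer choices and loss weights are illustrative assumptions.

```python
import chainer.functions as F

def gram_matrix(feat):
    """Gram matrix of a feature map of shape (batch, channels, H, W)."""
    b, c, h, w = feat.shape
    flat = F.reshape(feat, (b, c, h * w))
    return F.batch_matmul(flat, flat, transb=True) / (c * h * w)

def perceptual_loss(feats_out, feats_content, feats_style,
                    lambda_content=1.0, lambda_style=5.0):
    """Weighted sum of content (feature) loss and style (Gram) loss.

    feats_* are dicts of VGG16 feature maps (same batch size) for the
    stylized output, the content image, and the style image."""
    content_loss = F.mean_squared_error(feats_out["conv3_3"],
                                        feats_content["conv3_3"])
    style_loss = 0
    for name in feats_style:
        style_loss += F.mean_squared_error(gram_matrix(feats_out[name]),
                                           gram_matrix(feats_style[name]))
    return lambda_content * content_loss + lambda_style * style_loss
```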

Each style requires its own trained network. Once training is complete, the style network can take any input image and generate a stylized version of it in a single forward pass. Due to the relatively large size of the network, training each style takes around 2 hours per epoch on an NVIDIA P4000 on Paperspace. A GitHub repository has compiled pre-trained networks for certain styles.

Implementation

The implementation is relatively straightforward. We first train a network for each style we are interested in. Next, we take input from the webcam in the form of a streaming video and sample the stream to acquire still images. Finally, we stylize the sampled frames and display them. The end result is a live reflection of what the webcam sees with a style applied on top.
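
A minimal sketch of that webcam loop with OpenCV is shown below; `stylize` again stands in for the trained style network's forward pass and is an assumption rather than our exact code.

```python
import time
import cv2

def run_webcam(stylize, camera_index=0):
    """Grab frames from the webcam, stylize each one, and display the result
    along with the measured frames per second. `stylize` is assumed to
    return a displayable BGR uint8 frame."""
    cap = cv2.VideoCapture(camera_index)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        start = time.time()
        styled = stylize(frame)
        fps = 1.0 / max(time.time() - start, 1e-6)
        cv2.putText(styled, f"{fps:.1f} FPS", (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2)
        cv2.imshow("Real-Time Style Transfer", styled)
        if cv2.waitKey(1) & 0xFF == ord("q"):  # press 'q' to quit
            break
    cap.release()
    cv2.destroyAllWindows()
```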

One main problem faced when stylizing streamed video is “popping”: inconsistencies in style from frame to frame. To reduce popping and create smoother transitions, we stabilize the network during training. Following the approach described in Element AI’s post, we add noise to each training image and minimize the difference between the stylized versions of the clean and noisy images. This results in a more stable stylization between video frames.
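
A sketch of that stabilization term, as we understand the approach, is shown below: perturb the input with a small amount of noise and penalize any change this causes in the stylized output. The noise scale and the weight given to this term are assumptions.

```python
import chainer.functions as F

def stability_loss(style_net, x, noise_scale=10.0):
    """Penalize differences between the stylized clean image and the
    stylized noisy image, encouraging frame-to-frame consistency."""
    xp = style_net.xp  # numpy or cupy, depending on where the net lives
    noise = xp.random.normal(0.0, noise_scale, size=x.shape).astype(x.dtype)
    return F.mean_squared_error(style_net(x), style_net(x + noise))

# During training, this term is simply added to the perceptual loss, e.g.:
# loss = perceptual_loss(...) + lambda_noise * stability_loss(style_net, x)
```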

Popping stream (left) and Fixed Stream (right)

Results

The results we achieved reflect what we expected. We successfully took multiple styles of our choosing, trained a separate neural network to learn each style, and applied those styles to a variety of images. Training took around two hours per style. After getting the baseline networks working, we hooked them up to a live webcam feed. Although we expected poor performance given that this was all running on a laptop, we managed to stylize 3 frames per second using the laptop's dedicated GPU.

Different Style Transfers on Webcam Video Feed

Future Steps

The main drawback of our results is simply the time taken to generate images. Currently, it takes about 0.3 seconds to apply the style to an input frame, which works out to roughly 3 FPS. This is acceptable as a proof of concept, but a true “live filter” would need at least 24 FPS, i.e., under roughly 42 ms per frame. Avenues that can be explored include decreasing the resolution of the video, experimenting with the architecture of the neural network, improving the underlying hardware, or offloading the computation to cloud infrastructure.
