by Jeffrey Rainy and Archy de Berker
Mur.AI is a system for real-time stylized video. It takes a reference style image, such as a painting, and a video stream, and renders that stream in the style of the reference image. The stylization happens in real time, so the output is itself a live video stream.
In a previous post, we described a technique to stabilize style transfer so that it works well for videos. We’ve subsequently deployed those techniques to produce a system which we’ve demoed at a variety of conferences, including C2, Art Basel and NIPS. You can check out some of the resulting selfies on Twitter.
In this post, we present some high-level explanations of how the system works. We’ll provide a primer on style transfer and convolution, and detail some of the engineering challenges we overcame to deploy our algorithm to the real world.
What is style transfer?
Style transfer is the application of the artistic style of one image to another image:
The technique was first introduced in the paper A Neural Algorithm of Artistic Style by Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Since then we’ve seen style transfer applied all over the place, along with a variety of improvements to the original algorithm. In particular, our implementation builds upon Perceptual Losses for Real-Time Style Transfer and Super-Resolution by Justin Johnson, Alexandre Alahi, and Li Fei-Fei, which provides a faster approach to style transfer that is more readily applicable to video.
How does it work?
Each frame of the video to be stylized goes through a Convolutional Neural Network (CNN) and is then displayed on screen. For a beautiful introduction to convolutional networks, see Chris Olah’s post, and for a visual guide to convolution, see Vincent Dumoulin’s GitHub repo.
Briefly, a CNN performs multiple convolution operations on an image to obtain another one. Each convolution is an operation on the pixels in a square region of your image. In a CNN, the same operation is repeated all over the image to compute each pixel. The diagram below illustrates the processing done by a single convolution:
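In code, a single convolution can be sketched as follows. This is a minimal, pure-Python illustration rather than anything from our actual implementation: the function name `convolve2d`, the toy 4 x 4 image and the edge-detecting kernel are all assumptions chosen for the example, and real CNNs learn their kernel values instead of hard-coding them.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (both lists of rows) with no padding.

    As in most CNN libraries, the kernel is not flipped, so strictly
    speaking this is a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            # The same weighted sum is applied to every square region.
            total = 0.0
            for dy in range(kh):
                for dx in range(kw):
                    total += image[y + dy][x + dx] * kernel[dy][dx]
            row.append(total)
        output.append(row)
    return output


# Toy grayscale image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hypothetical kernel that responds to left-to-right increases in brightness.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

edges = convolve2d(image, kernel)
print(edges)  # [[3.0, 3.0], [3.0, 3.0]]
```

Every output pixel is computed by the same nine weights, which is what makes the operation cheap to describe and well suited to running in parallel on a GPU.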
At deployment time, we use a single CNN, which we’ll call the stylization CNN, to produce our stylized image. When we talk about “learning a style,” we are talking about finding the parameters of the stylization network so that it produces the right style.
In order to find those parameters, we’re going to use a second network: a classification CNN. We use this as a feature extractor to provide representations of the style and the content of our input images.
A classification CNN takes an image and tries to identify what the image contains. For example, if you had some photos of cats and dogs, you might train a classification CNN to figure out which ones are which. The classifier CNN we use was trained to classify small images into one of 1000 categories, a hugely popular task known as ImageNet.
The classifier network performs multiple convolutions on an image to produce features useful for classification. As an image goes through the network, it becomes smaller and smaller in size (pixels) but it also grows in terms of components per pixel. From RGB input (3 components) at full resolution, it iteratively shrinks the image to a single pixel, but with many components: the probability of the image representing each of many categories.
Although the network is originally trained for classification, we’re not going to use it for that. An attractive feature of CNNs is that they naturally recapitulate the hierarchies that exist in natural images. Earlier in the network, we capture small-scale features such as edges. Later in the network, we capture larger-scale features, like whole objects. You can explore this phenomenon in your browser here.
Imagine feeding a picture of yourself in pajamas to the network. After the first layers, the information the network processes would map to local features (“thin vertical blue stripes”) whereas the last layers of the network would capture features that describe the picture as a whole (“someone standing”). Thus, early layers capture the style of an image, whilst the features learned by late layers capture the content.
This project relies on the pre-trained VGG16 network, from the University of Oxford, for classification. This provides one of the two CNNs we are going to need to perform style transfer.
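To make the earlier "smaller in pixels, bigger in components" description concrete, here is a rough sketch of how the feature-map shape evolves through VGG16. It assumes the standard published layout (a 224 x 224 RGB input and five convolutional blocks with 64, 128, 256, 512 and 512 channels, each followed by a 2 x 2 pooling step that halves the resolution). The helper name `feature_map_shapes` is ours, made up for illustration.

```python
# Assumed VGG16 layout: channels produced by each of the five conv blocks.
VGG16_BLOCK_CHANNELS = [64, 128, 256, 512, 512]


def feature_map_shapes(height, width, block_channels=VGG16_BLOCK_CHANNELS):
    """Return (height, width, channels) for the input and after each block."""
    shapes = [(height, width, 3)]  # RGB input: 3 components per pixel
    for channels in block_channels:
        # Each block ends with 2x2 pooling, halving height and width.
        height //= 2
        width //= 2
        shapes.append((height, width, channels))
    return shapes


shapes = feature_map_shapes(224, 224)
for shape in shapes:
    print(shape)
# (224, 224, 3)
# (112, 112, 64)
# (56, 56, 128)
# (28, 28, 256)
# (14, 14, 512)
# (7, 7, 512)
```

The final 7 x 7 x 512 block is then collapsed by fully connected layers into the 1000 class scores. For style transfer we ignore those scores and read the intermediate feature maps instead.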
Our stylization network does the job of actually producing the stylized images. We’re going to learn parameters which allow us to do this by using the classifier network as a training tool.
To do so, we take a large set of images for training and repeat the following steps:
- We feed the classifier CNN the Style image, the Source image, and the Stylized image produced by the current stylization CNN
- We extract representations of style and content from the classifier network
- We adjust the stylization CNN so that the stylized image has style that more closely resembles the style image, and content that resembles the source image
Capturing a style consists of reproducing the small-scale features of the Style image while keeping the large-scale features of the Source image. In a nutshell, we’re looking for the stylization CNN that reproduces the “blue and brown swirling colors” of the Style image while preserving the “two programmers, a column and a bunch of desks” of the Source image.
To capture this, we compute the difference between the Style and Stylized images early in the network, and the difference between the Source and Stylized images later in the network. Our loss function (the quantity we’re trying to minimize by changing the parameters of our stylization network) is the sum of these two terms. In fact, as detailed in our previous post, we incorporated an extra stability-loss term, which helps keep the style transfer stable from frame to frame. You can find the code for our updated implementation here.
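The two-term loss can be sketched roughly as below. This is an illustration under stated assumptions, not our training code: it follows the Gatys et al. convention of comparing styles through Gram matrices (channel-to-channel correlations that discard spatial layout), uses a plain mean squared difference for content, and leaves out the stability term. The function names, the weights and the tiny feature maps are all hypothetical. Features are represented as a list of channels, each a flat list of activations.

```python
def gram_matrix(features):
    """Channel-by-channel correlations, averaged over all pixel positions."""
    n_positions = len(features[0])
    return [
        [sum(a * b for a, b in zip(ch_i, ch_j)) / n_positions for ch_j in features]
        for ch_i in features
    ]


def mean_squared_difference(xs, ys):
    """Mean squared difference between two equally shaped lists of lists."""
    flat_x = [v for row in xs for v in row]
    flat_y = [v for row in ys for v in row]
    return sum((x - y) ** 2 for x, y in zip(flat_x, flat_y)) / len(flat_x)


def style_loss(style_early, stylized_early):
    # Compare Gram matrices of early-layer features (assumed, after Gatys et al.).
    return mean_squared_difference(
        gram_matrix(style_early), gram_matrix(stylized_early)
    )


def content_loss(source_late, stylized_late):
    # Compare late-layer features position by position.
    return mean_squared_difference(source_late, stylized_late)


def total_loss(style_early, source_late, stylized_early, stylized_late,
               style_weight=1.0, content_weight=1.0):
    # Hypothetical weights; the stability term from the previous post is omitted.
    return (style_weight * style_loss(style_early, stylized_early)
            + content_weight * content_loss(source_late, stylized_late))


# Two toy feature maps (2 channels, 3 pixel positions) holding the same
# activations in a different spatial order.
a = [[1, 0, 2], [0, 1, 0]]
b = [[2, 1, 0], [0, 0, 1]]

print(style_loss(a, b))    # 0.0: Gram matrices ignore where things are
print(content_loss(a, b))  # greater than zero: the layouts differ
```

The toy example shows why the split works: rearranging pixels leaves the style term untouched but is penalized by the content term, so the network is free to repaint textures as long as the scene stays recognizable.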
All told, our implementation now trains a new style from a 512 x 512 pixel image in ~6 hours, utilizing 4 GPUs.
Deploying our system
We faced several challenges in deploying the demo to run in real-time. The main issues were throughput and latency: how could we take the video, run it through our model, then render it again in near real-time?
H264 decoding on the GPU
The camera we use (Logitech C920) outputs pre-compressed H264 video. The naive approach would be to decode this video with FFmpeg on the GPU, bring the RGB pixels back to the CPU, and then upload them again as input for the CNN.
However, CPU-GPU transfers turned out to add significant latency, because each copy forces a synchronization point between the CPU and the GPU. The solution was to decode the H264 video directly with the onboard NVDEC engine (a hardware-accelerated decoder that is included in the Zotac EN1070). Decoded frames can then be passed directly as inputs to our CNN, which lets the GPU and CPU run fully asynchronously.
OpenGL rendering from the GPU
Having decoded our video and run our network on the GPU, we faced a similar bottleneck in passing the resulting matrix back to the CPU for conventional rendering. Again, the solution was to stay on the GPU. Using the CUDA/OpenGL Interop API, we can render the outputs to screen straight from the GPU, avoiding further I/O bottlenecks.
The demo booth integrates functionality to publish stylized images to Twitter. The feed can be seen at @ElementAIArt.
We trained stylization networks for a variety of our favourite murals around Montreal.
For each style, we cropped a section that had the characteristics we wished the network to learn. Below, you’ll find the whole work and the cropped sections we used for training. You can see some of these samples in action in the video at the start of this post.
This work was carried out by Jeffrey Rainy, Eric Robert, Jean Raby and Philippe Mathieu, with support from the team at Element AI. Thanks to Xavier Snelgrove for comments on the post.
It relied upon the open-source codebase of Faster Neural Style Transfer by Yusuketomoto, which is an implementation of Perceptual Losses for Real-Time Style Transfer and Super-Resolution by Justin Johnson, Alexandre Alahi, and Li Fei-Fei.