Introducing Mur.AI

Real-time neural style transfer for video

by Jeffrey Rainy and Archy de Berker

Mur.AI is a system for real-time video stylization. Given a reference style image, such as a painting, it processes a video stream so that each frame takes on the style of the reference, producing a stylized video stream in real time.

In a previous post, we described a technique to stabilize style transfer so that it works well for videos. We’ve subsequently deployed those techniques to produce a system which we’ve demoed at a variety of conferences, including C2, Art Basel and NIPS. You can check out some of the resulting selfies on Twitter.

In this post, we present some high-level explanations of how the system works. We’ll provide a primer on style transfer and convolution, and detail some of the engineering challenges we overcame to deploy our algorithm to the real world.

What is style transfer?

Style transfer is the application of the artistic style of one image to another image:

Here the source image is of two developers (Phil Mathieu and Jean Raby) and the style comes from Notre Dame de Grâce by A’Shop, a mural in Montreal.

The technique was first introduced in the paper A Neural Algorithm of Artistic Style by Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Since then we’ve seen style transfer applied all over the place, along with a variety of improvements to the original algorithm. In particular, our implementation builds upon Perceptual Losses for Real-Time Style Transfer and Super-Resolution by Justin Johnson, Alexandre Alahi, and Fei-Fei Li, which provides a faster approach to style transfer that is more readily applicable to video.

How does it work?

Each frame of the video that is stylized goes through a Convolutional Neural Network (CNN) and is then displayed on screen. For a beautiful introduction to convolutional networks, see Chris Olah’s post and for a visual guide to convolution, see Vincent Dumoulin’s GitHub repo.

Briefly, a CNN applies a series of convolution operations to an image to produce a new one. Each convolution combines the pixels in a square region of the image, and the same operation is repeated at every position to compute each output pixel. The diagram below illustrates the processing done by a single convolution:
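To make the operation concrete, here is a minimal NumPy sketch of a single convolution (no padding, stride 1) applied to a tiny grayscale image. The vertical-edge kernel is just an illustrative example of the kind of filter a CNN might learn:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a square kernel over the image: each output pixel is the
    weighted sum of the input pixels under the kernel (no padding, stride 1)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A tiny image with a vertical edge down the middle...
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

# ...and a kernel that responds to vertical edges.
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]], dtype=float)

out = conv2d(image, vertical_edge)  # every window straddles the edge
```

Note that the output is smaller than the input (2×2 from a 4×4 image here): with no padding, the kernel only visits positions where it fits entirely inside the image.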

At deployment time, we use a single CNN, which we’ll call the stylization CNN, to produce our stylized image. When we talk about “learning a style,” we are talking about finding the parameters of the stylization network so that it produces the right style.

In order to find those parameters, we’re going to use a second network: a classification CNN. We use this as a feature extractor to provide representations of the style and the content of our input images.

Classification Network

A classification CNN takes an image and tries to identify what the image contains. For example, if you had some photos of cats and dogs, you might train a classification CNN to figure out which ones are which. In the classifier CNN we use, the task is to classify small images into one of 1000 categories, a hugely popular benchmark known as ImageNet.

The classifier network performs multiple convolutions on an image to produce features useful for classification. As an image goes through the network, it becomes smaller and smaller in spatial size (pixels) but grows in components per pixel. From an RGB input (3 components) at full resolution, the network iteratively shrinks the image down to a single pixel with many components: the probability of the image belonging to each of the categories.
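This shrinking-and-deepening progression can be sketched in a few lines of Python. The channel counts below echo VGG-style networks but are illustrative, not the exact architecture:

```python
# Track how a feature map's spatial size shrinks while its channel count
# grows as it passes through a stack of downsampling stages.
shape = (224, 224, 3)  # height, width, channels of an RGB input
print(shape)
for out_channels in (64, 128, 256, 512):
    h, w, _ = shape
    # Each downsampling stage halves the spatial resolution, while the
    # convolution increases the number of components per pixel.
    shape = (h // 2, w // 2, out_channels)
    print(shape)
```

Starting from (224, 224, 3), the feature map passes through (112, 112, 64) and so on down to (14, 14, 512): ever fewer pixels, each carrying ever richer features.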

A typical schema for a convolutional neural network for classification.

Although the network is originally trained for classification, we’re not going to use it for that. An attractive feature of CNNs is that they naturally recapitulate the hierarchies that exist in natural images. Earlier in the network, we capture small-scale features such as edges. Later in the network, we capture larger-scale features, like whole objects. You can explore this phenomenon in your browser here.

Imagine feeding a picture of yourself in pajamas to the network. After the first layers, the information the network processes would map to local features (“thin vertical blue stripes”) whereas the last layers of the network would capture features that describe the picture as a whole (“someone standing”). Thus, early layers capture the style of an image, whilst the features learned by late layers capture the content.

Some examples of the kind of features that different layers in a CNN prefer (we move deeper in the network as we move left to right). From Olah, et al., “Feature Visualization”, Distill, 2017.

This project relies on the pre-trained VGG16 network, from the University of Oxford, for classification. This provides one of the two CNNs we are going to need to perform style transfer.

Stylization Network

Our stylization network does the job of actually producing the stylized images. We’re going to learn parameters which allow us to do this by using the classifier network as a training tool.

To do so, we take a large set of images for training and repeat the following steps:

  1. We feed the classifier CNN the style image, the source image, and the stylized image produced by the current stylization CNN
  2. We extract representations of style and content from the classifier network
  3. We adjust the stylization CNN so that the stylized image has style that more closely resembles the style image, and content that resembles the source image

In pseudo-python-code:
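(The sketch below is illustrative; the names are placeholders rather than our actual API.)

```python
for source in training_images:
    stylized = stylization_cnn(source)

    # Run all three images through the frozen classifier CNN
    style_features    = classifier_cnn.features(style_image)
    source_features   = classifier_cnn.features(source)
    stylized_features = classifier_cnn.features(stylized)

    # Early layers: does the stylized image match the style?
    style_loss   = distance(early(stylized_features), early(style_features))
    # Late layers: does the stylized image match the source's content?
    content_loss = distance(late(stylized_features), late(source_features))

    # Adjust only the stylization CNN's parameters
    minimize(style_loss + content_loss, parameters=stylization_cnn)
```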

And pictorially:

Capturing a style consists of adopting the small-scale features of the style image while preserving the large-scale features of the source image. In a nutshell, we’re looking for the stylization CNN that maintains the “blue and brown swirling colors” of the style while maintaining the “two programmers, a column and a bunch of desks” of the source.

To capture this, we compute the difference between Style and Stylized early in the network and Source and Stylized later in the network. Our loss function — the thing we’re trying to minimize by changing the parameters of our stylization network — is the sum of these two terms. In fact, as detailed in our previous post, we incorporated an extra stability-loss term which helped generate stable style transfer from frame-to-frame. You can find the code for our updated implementation here.

All told, our implementation now trains a new style from a 512 x 512 pixel image in ~6 hours, utilizing 4 GPUs.

Deploying our system

Our demo has been deployed on-screen around the world, and has even been projected onto buildings here in Montreal (as part of the Up375 project).

We faced several challenges in deploying the demo to run in real-time. The main issues were throughput and latency: how could we take the video, run it through our model, then render it again in near real-time?

The finished system runs on a Zotac EN1070 minicomputer with an NVIDIA GeForce 1070 GPU, and is easily portable.

H264 decoding on the GPU

The camera we use (Logitech C920) outputs pre-compressed H264 video. The naive approach would be to decode this video with FFmpeg on the GPU, bring the RGB pixels back to the CPU, and then upload them to the GPU again as input for the CNN.

However, CPU-GPU transfer turned out to add significant latency, by forcing synchronization points between CPU and GPU whenever a copy occurs. The solution was to decode the H264 video directly with the onboard NVDEC engine (a hardware-accelerated decoder included in the Zotac EN1070). This let decoded frames be passed directly as inputs to our CNN, allowing the CPU and GPU to run fully asynchronously.

OpenGL rendering from the GPU

Having decoded our video and run our network on the GPU, we faced a similar bottleneck in passing the resulting matrix back to the CPU for conventional rendering. Again, the solution was on-GPU computation. Using the CUDA/OpenGL Interop API, we can render the outputs to screen straight from the GPU, avoiding further I/O bottlenecks.

Twitter integration

The demo booth integrates functionality to publish stylized images to Twitter. The feed can be seen at @ElementAIArt.

Styles used

We trained stylization networks for a variety of our favourite murals around Montreal.

For each style, we cropped a section with the characteristics we wished the network to learn. Below, you’ll find each whole work and the cropped section we used for training. You can see some of these samples in action in the video at the start of this post.

Notre Dame de Grâce, by A’Shop. 6310 rue Sherbrooke Ouest. Spray paint on brick, 2011.
Sans titre, by Bicicleta Sem Freio. 3527 boulevard St-Laurent. Spray paint on brick, 2015.
Sans titre, by El Curiot. 265 rue Sherbrooke Ouest. Acrylic on brick, 2015.
Quai des Arts, by EN MASSE. 4890 boulevard St-Laurent. Spray paint on brick, 2011.
Galaktic Giant, by Chris Dyer. 3483 rue Coloniale. Spray paint on brick, 2013.
Sans titre, by David ‘Meggs’ Hook. 3527 boulevard St-Laurent. 2016.
Autumn Foliage Series #1, by Matt W. Moore. 4660 boulevard St-Laurent. 2014.
Mémoire du coeur, by Julian Palma. 4505 rue Notre-Dame Ouest. Spray paint on brick, 2016.
Sans titre, by SBuONe. 4243 boulevard St-Laurent. 2016.
Sans titre, by Zilon. 53 rue Marie-Anne. Spray paint on various media.

Acknowledgments

This work was carried out by Jeffrey Rainy, Eric Robert, Jean Raby and Philippe Mathieu, with support from the team at Element AI. Thanks to Xavier Snelgrove for comments on the post.

It relied upon the open-source codebase of Faster Neural Style Transfer by Yusuketomoto, which is an implementation of Perceptual Losses for Real-Time Style Transfer and Super-Resolution by Justin Johnson, Alexandre Alahi, and Li Fei-Fei.