NeuroNuggets: CVPR 2018 in Review, Part I

Here at Neuromation, we are always on the lookout for interesting new ideas that could help our research. And what better place to look for them than top conferences! We have already written about our success at the DeepGlobe workshop held at the CVPR (Computer Vision and Pattern Recognition) conference. This time, we will take a closer look at some of the most interesting papers from CVPR itself. These days, top conferences are very large affairs, so prepare for a multi-part post. The papers appear in no particular order and were chosen not only for standing out from the crowd but also for their relevance to our own work at Neuromation. This time, Aleksey Artamonov (whom you have met before) prepared the list, and I tried to supply some text around it. In this series we will be very brief, trying to extract at most one interesting point from each paper; in this format we cannot really do these works justice, so we wholeheartedly recommend reading the papers in full.

GANs and Computer Vision

In the first part, we concentrate on generative models, that is, machine learning models that can not only tell cats and dogs apart in a photo but can also produce new images of cats and dogs. In computer vision, the most successful class of generative models is generative adversarial networks (GANs), in which a separate discriminator network learns to distinguish generated objects from real ones, while the generator learns to fool the discriminator. We have already written about GANs several times (e.g., here and here), so let’s jump right in!
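To make the adversarial setup concrete, here is a minimal sketch of the two loss functions from the original GAN formulation, written in plain Python. Here `d_real` and `d_fake` stand for the discriminator’s outputs (probabilities of being real) on a real and on a generated image; the function names are ours, chosen for illustration.

```python
import math

def discriminator_loss(d_real, d_fake):
    # The discriminator wants to output 1 on real images
    # and 0 on generated (fake) ones.
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    # The generator wants the discriminator to output 1
    # on its fakes, i.e., to be fooled.
    return -math.log(d_fake)
```

Training alternates between minimizing these two objectives; at the theoretical equilibrium the discriminator outputs 0.5 everywhere, giving a generator loss of log 2 and a discriminator loss of 2 log 2.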

Finding Tiny Faces in the Wild

Y. Bai et al., Finding Tiny Faces in the Wild with Generative Adversarial Network

In this collaboration between Saudi and Chinese researchers, the authors use a GAN to detect and upscale very small faces in photographs of large crowds. Even just detecting small faces is an interesting problem that regular face detectors (such as those featured in our previous post) usually fail to solve. Here the authors propose an end-to-end pipeline that extracts faces and then applies a generative model to upscale them by a factor of up to 4 (a process known as superresolution). Here is the pipeline overview from the paper:

PairedCycleGAN for Makeup

H. Chang et al., PairedCycleGAN: Asymmetric Style Transfer for Applying and Removing Makeup

Conditional GANs are already widely used for image manipulation; we have mentioned superresolution, but GANs have also proven successful at style transfer. With GANs, one can learn salient features that correspond to specific image elements and then change them! In this work, researchers from Princeton, Berkeley, and Adobe present a framework for modifying makeup in photos. One interesting aspect of this work is that the authors train separate generators for different facial components (eyes, lips, skin) and apply them separately, extracting the facial components with yet another network:
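The asymmetric setup in PairedCycleGAN rests on a cycle: applying makeup and then removing it should return the original face. As a toy illustration (our simplification, with images as flat lists of numbers and the two generators as plain functions), the cycle-consistency penalty can be sketched like this:

```python
def cycle_consistency_loss(x, apply_style, remove_style):
    # Apply one generator, then its counterpart, and penalize
    # the L1 distance to the original input.
    reconstructed = remove_style(apply_style(x))
    return sum(abs(a - b) for a, b in zip(x, reconstructed))

# Toy stand-in "generators": brighten the image, then darken it back.
brighten = lambda img: [p + 0.3 for p in img]
darken = lambda img: [p - 0.3 for p in img]
```

With a perfect inverse pair the loss is zero; a generator that fails to undo its counterpart is penalized, which is what keeps the two directions of the translation consistent.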

GANerated Hands

F. Mueller et al., GANerated Hands for Real-Time 3D Hand Tracking from Monocular RGB

We have already written about pose estimation in the past. One very important subset of pose estimation, which usually requires separate models, is hand tracking. The sci-fi staple of controlling computers by waving your hands is yet to be fully realized and still requires specialized hardware such as the Kinect. As usual, one of the main problems is data: where can you find real video streams of hands labeled in 3D? In this work, the authors present a conditional GAN architecture that can convert synthetic 3D models of hands into photorealistic images, which are then used to train the hand tracking network. This work is very close to our hearts, as synthetic data is the main focus of our work at Neuromation, so we will likely consider it in more detail later. Meanwhile, here is the “synthetic-to-real” GAN architecture:

Person Transfer GAN

L. Wei et al., Person Transfer GAN to Bridge Domain Gap for Person Re-Identification

Person re-identification (ReID) is the problem of finding the same person in different photos taken under varying conditions and circumstances. This problem has naturally been the subject of many studies and is relatively well understood by now, but the domain gap problem remains: different datasets with images of people still have very different conditions (lighting, background, etc.), and networks trained on one dataset lose a lot of accuracy when transferred to another (and also to, say, a real-world application). The picture above shows what different datasets look like. To address this, the authors propose a GAN architecture able to transfer images from one “dataset style” to another, again using GANs to augment real data with complex transformations. It works like this:

Eye Image Synthesis with Generative Models

K. Wang et al., A Hierarchical Generative Model for Eye Image Synthesis and Eye Gaze Estimation

This work from the Rensselaer Polytechnic Institute attacks a very specific problem: generating images of human eyes. This is important not only for making beautiful eyes in generated images but also, again, for working backwards from generated eyes to solve the gaze estimation problem: what is a person looking at? This would pave the way to truly sci-fi interfaces… but that is still in the future, and at present even synthetic eye generation is a very hard problem. The authors present a complex probabilistic model of eye shape synthesis and propose a GAN architecture to generate eyes according to this model — with great success!

Image Inpainting: Fill in the Blanks

J. Yu et al., Generative Image Inpainting with Contextual Attention

This work from Adobe Research and the University of Illinois at Urbana-Champaign is devoted to the very challenging problem of filling in blanks in an image (see the examples above). Usually, inpainting requires an understanding of the underlying scene: in the top right of the picture above, you have to know what a face looks like and what kind of face is likely given the hair and neck that we can see. In this work, the authors propose a GAN-based approach that explicitly makes use of features from the surrounding image to improve generation. The architecture consists of two parts, first generating a coarse result and then refining it with a second network. And the results are, again, very good:
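The coarse-to-fine structure can be illustrated with a deliberately tiny stand-in (ours, not the paper’s networks): a “coarse” stage that fills masked pixels with the mean of the visible ones, followed by a “refinement” stage. In the paper both stages are convolutional networks, and the second one uses contextual attention to borrow features from the visible surroundings.

```python
def coarse_fill(pixels, mask):
    # Stage 1: fill each masked pixel with the mean of the
    # visible pixels -- a rough, blurry first guess.
    visible = [p for p, hidden in zip(pixels, mask) if not hidden]
    mean = sum(visible) / len(visible)
    return [mean if hidden else p for p, hidden in zip(pixels, mask)]

def inpaint(pixels, mask, refine):
    # Stage 2: hand the coarse result to a refinement step
    # (in the paper, a second network with contextual attention).
    return refine(coarse_fill(pixels, mask), mask)
```

The point of the split is that the refinement stage never has to invent content from nothing: it always starts from a plausible rough fill and only has to sharpen it.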

Well, that’s it for today. This is only part one, and we will certainly continue the CVPR 2018 review in our next installments. See you around!

Sergey Nikolenko
Chief Research Officer, Neuromation

Aleksey Artamonov
Senior Researcher, Neuromation