New Advances in Generative Adversarial Networks, or a Comment on (Karras et al., 2017)

by Chief Research Officer Sergey Nikolenko

A very recent paper by NVIDIA researchers has stirred up the field of deep learning a little. Generative adversarial networks, which we will talk about below, have already been successfully applied to a number of important problems, and image generation has always been at the forefront of these applications. However, the work by Karras et al. presents a fresh take on the old idea of generating an image step by step, gradually enhancing it (for example, increasing its resolution) as training progresses. To explain what is going on here, I will have to step back a little first.

Generative adversarial networks (GANs) are a class of neural networks that aim to learn to generate objects from a certain class, e.g., images of human faces or bedroom interiors (a popular choice in GAN papers thanks to the bedrooms category of the standard LSUN scene understanding dataset). To perform generation, GANs employ a very interesting and rather commonsense idea. They have two parts that are in competition with each other:

  • the generator aims to, well, generate new objects that are supposed to pass for “true” data points;
  • the discriminator aims to distinguish between real data points and the ones produced by the generator.

In other words, the discriminator learns to spot the generator’s counterfeit images, while the generator learns to fool the discriminator. I refer to, e.g., this post for a simple and fun introduction to GANs.
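This two-player game boils down to a pair of loss functions. Here is a minimal sketch in plain Python (the function names are my own, for illustration only): the discriminator is penalized for mistaking fakes for real data and vice versa, while the generator's (non-saturating) loss is small exactly when the discriminator is fooled.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy loss for the discriminator.

    d_real: discriminator's probability that a real sample is real.
    d_fake: discriminator's probability that a generated sample is real.
    The discriminator wants d_real -> 1 and d_fake -> 0.
    """
    return -math.log(d_real) - math.log(1.0 - d_fake)

def generator_loss(d_fake):
    """Non-saturating generator loss: the generator wants d_fake -> 1."""
    return -math.log(d_fake)
```

At the classical equilibrium, where the discriminator outputs 0.5 everywhere (it genuinely cannot tell real from fake), the discriminator's loss is 2·log 2 and the generator's is log 2; training alternates gradient steps on these two objectives.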

We at Neuromation are following GAN research with great interest due to many possible exciting applications. For example, conditional GANs have been used for image transformations with the explicit purpose of enhancing images; see, e.g., image de-raining recently implemented with GANs in this work. This ties in perfectly with our own ideas of using synthetic data for computer vision: with a proper conditional GAN for image enhancement, we might be able to improve synthetic (3D-rendered) images and make them more like real photos, especially in small details. We are already working on preliminary experiments in this direction.

This work by NVIDIA presents a natural idea: grow a large-scale GAN progressively. The authors begin with a small network able to produce only, e.g., 4x4 images, train it until it works well (on viciously downsampled data, of course), then add another set of layers to both generator and discriminator, moving from 4x4 to 8x8, train the new layers, and so on. In this way, they “grow” a GAN that generates very convincing 1024x1024 images, of much better quality than before.
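The growth schedule itself is easy to picture in code. Below is a tiny sketch (my own illustration, not NVIDIA's actual code) of the doubling schedule from 4x4 up to the target resolution; at each stage, a new pair of blocks would be added to the generator and the discriminator and trained on data downsampled to that size.

```python
def growth_schedule(start=4, target=1024):
    """Resolutions at which new layers are added: start, 2*start, ..., target."""
    resolutions = []
    size = start
    while size <= target:
        resolutions.append(size)
        size *= 2  # each stage doubles the output resolution
    return resolutions
```

For the defaults above, the schedule is 4, 8, 16, 32, 64, 128, 256, 512, 1024 — nine stages in total.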

The idea of progressively improving generation in GANs is not completely novel; for example,

  • Chen & Koltun present a cascaded refinement approach that aims to bring small generated images up to megapixel size step by step;
  • the well-known StackGAN model by Zhang et al. constructs an intermediate low-dimensional representation and then improves upon it in another GAN;
  • and the idea can be traced as far back as 2015, soon after the introduction of GANs themselves, when Denton et al. proposed a pyramid scheme for coarse-to-fine generation.

However, all previous approaches made their progressive improvements separately: the next level of refinement simply took the result of the previous layers (plus possibly some noise) as input. In Karras et al., the same idea is executed in a way reminiscent of unsupervised pretraining: they train a few layers, then add a few more to the same networks, and so on. This execution appears to be among the most straightforward and fastest to train, yet among the best in terms of results. See the samples in the paper for yourself.
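One detail worth highlighting: when a new, higher-resolution block is added, Karras et al. do not switch to it abruptly but fade it in, blending the upsampled output of the old layers with the output of the new block via a weight alpha that ramps from 0 to 1 during training. Here is a minimal 1-D sketch of that blending (my own simplification; the paper operates on 2-D feature maps):

```python
def upsample_nearest(signal):
    """Nearest-neighbor 2x upsampling of a 1-D 'image'."""
    return [v for v in signal for _ in range(2)]

def fade_in(old_output, new_output, alpha):
    """Blend the upsampled old output with the new block's output.

    alpha ramps from 0 (use only the old, upsampled output) to 1
    (use only the new, higher-resolution block) over the course of
    training the new stage, so the new layers never "shock" a
    well-trained smaller network.
    """
    up = upsample_nearest(old_output)
    assert len(up) == len(new_output)
    return [(1 - alpha) * u + alpha * n for u, n in zip(up, new_output)]
```

At alpha = 0 the network behaves exactly like the smaller, already-trained one (just upsampled), and at alpha = 1 the new block has fully taken over.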

Naturally, we are very excited about this advance: it brings image generation, previously restricted to small pictures (from 32x32 to 256x256 pixels), ever closer to sizes suitable for practical use. In my personal opinion, GANs (specifically conditional GANs) may be the exact architecture we need to make synthetic data in computer vision indistinguishable from real data.