Why Stylistic GANs are So Deceptive

Carlos E. Perez
Published in Intuition Machine
Dec 14, 2018
A Style-Based Generator Architecture for Generative Adversarial Networks

Nvidia shocked the world again with its release of A Style-Based Generator Architecture for Generative Adversarial Networks. A year earlier, the same team (Tero Karras, Samuli Laine, Timo Aila) released the equally shocking Progressive Growing of GANs. Here’s the video demonstrating their results:

The question we should all ask is, why does this work so well?

Let’s break this question up even further.

(1) How do we know it works so well? We know because, as humans, we can see the results, and it is deceptively difficult to tell that they are not real. Few problems have this property, where human visual inspection alone can tell you how well your solution works.

(2) What does the GAN use as a measure of what a deceivingly real image looks like? It actually doesn’t have an explicit measure defined by a programmer. It’s just too complex a job to explicitly figure out the objective function to use to render a realistic image of a person. So, the solution is to use a Discriminator neural network that is trained to distinguish real from fake images. Note: a GAN is composed of a Generator and a Discriminator neural network. The Discriminator network is used at training time and is usually discarded at generation time.
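The Discriminator-as-learned-objective idea can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the "images" are 1-D samples and the discriminator is a hypothetical logistic classifier, but the binary cross-entropy loss is the standard GAN discriminator objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: push D(real) toward 1 and D(fake) toward 0."""
    eps = 1e-12  # numerical safety for log(0)
    return -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))

# Toy stand-ins: "images" are scalar samples; D is a logistic classifier.
real = rng.normal(loc=2.0, scale=0.5, size=64)   # pretend real data
fake = rng.normal(loc=0.0, scale=1.0, size=64)   # pretend generator output
a, b = 1.0, -1.0                                  # discriminator parameters
loss = discriminator_loss(sigmoid(a * real + b), sigmoid(a * fake + b))
```

No programmer ever writes down "what a realistic face looks like"; the objective is just "fool this classifier", and the classifier itself is learned.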

(3) How does the GAN learn how to generate realistic images? The GAN’s Generator learns how to fool the GAN’s Discriminator through a competitive game, an arms race of gradually increasing its capability versus the Discriminator’s capability. This gameplay between two neural networks should also remind one of the self-play found in AlphaZero. What are the conditions for self-play to work? Aside from having an opponent that is just incrementally better than the current agent, an agent should be able to leverage its current knowledge to take the next step up. Said differently, the skills required to graduate to the next level must be reachable from the skills presently available. This doesn’t seem to be true in general.

(4) How do we know that the system is not perfect? We know because we see artifacts that don’t exist in real images. For example, we can see that the right side of the dress isn’t the same as the left side. An earring on the left isn’t the same as the one on the right. The rendering of the hair is a bit fuzzy. We are better at discrimination than the GAN’s Discriminator. Interestingly enough, we are not better generators than the GAN’s Generator! Only extremely skilled artists specializing in photorealistic recreation can match the fidelity of a GAN generator.

Let’s pause for a moment here to realize that the entire reasoning is circular; this is the kind of ‘inversion of reasoning’ that Daniel Dennett describes so often. This is what he would call “competence without comprehension”. Darwin’s evolution and Turing’s universal machines are the examples Dennett provides of this. In more nebulous terms, this is what is known as emergent phenomena: the interactions of many simpler participants lead to group behavior that is difficult to predict. GANs require a massive number of participants and a massive number of iterations (a week on an NVIDIA DGX-1 with 8 Tesla V100 GPUs).

Daniel Dennett describes information semantics as “the difference that makes a difference”. It is important to note that Shannon information only describes capacity. Here also we see the same circular reasoning. Semantics is what is important to the intentional agent that receives the information. So, we happen to know that GANs are very effective because we can see how effective they are. We are passing our own intuitive judgment on how good they are. We see the differences that do make a difference. We see that we are fooled by these images, and this is the difference that is significant. Humans are the ultimate objective function for these kinds of GANs. We don’t have the same luxury for other problems.

The reasoning is all circular, or rather self-referential. I’ve described these architectures previously as reflecting Douglas Hofstadter’s Strange Loop. It’s an interesting analogy, but it still doesn’t tell us how something without comprehension creates something that appears to require comprehension.

How does a GAN bootstrap its competence?

We find the idea of a bootstrap perplexing because we intuitively know that it takes work to learn anything. Work, according to physics, requires a directional constraint. So, for example, the work to push a block a fixed distance requires knowing a starting point and an ending point, and therefore a directional constraint. There exists an asymmetry in information discovery. How then can learning be bootstrapped if there is no directional information?

Biological life and artificial neural networks are both intentional agents. That is, they behave based on learned cognitive algorithms that have been forged by adapting to their respective ecosystems. Furthermore, the cognitive capabilities of an intentional agent are bounded and will always capture less information than what is intrinsically present in its ecosystem. The ecosystem maintains considerably more information than is available to an agent.

To illustrate this, how does an ant navigate rough terrain? An ant doesn’t need to memorize its terrain; rather, it reacts instinctively and impulsively to the shape of the terrain that it observes locally. The terrain contains all of the information, and it is up to the ant to filter out the difference that makes a difference. That is, focusing on the information from the terrain that tells it how to make the next step.

Like millions of ants that navigate a terrain, a Generator learns to navigate the space of real and fake images because the ecosystem (i.e., the Discriminator acting as its proxy) filters this information down to “the difference that makes a difference” (i.e., fake or real). There are other differences that may be important, and there are research papers that have explored exploiting richer information from a Discriminator.

However, one needs to strike a balance between the bounded rationality of an agent and the complexity of its ecosystem. In other words, an intentional agent can only thrive in an ecosystem that it can adapt to. Its capabilities must be compatible with the complexities introduced by its own ecosystem. Low-fidelity GANs and VAEs work well because sufficient computational power is available to explore and generate low-resolution images. Their capability matches the complexity of the task at hand.

The trick that Progressive GANs introduce is a curriculum: generating lower-resolution images is learned before generating higher-resolution ones. These GANs are capable of high-resolution image generation because they have been incrementally bootstrapped through lower-resolution image generation. This is Darwin’s gradualism in full effect. In the graphic below, the Generator and Discriminator networks start out producing 4x4 images and, over the course of training, progress to generating 1024x1024 images. Simpler problems are solved first, and these solutions are used as foundations for more difficult problems.
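The progressive curriculum hinges on how a new, higher-resolution block is blended in smoothly rather than bolted on. A minimal sketch of that fade-in, with made-up stand-in arrays for the two pathways (the function names here are illustrative, not from the paper's code):

```python
import numpy as np

def upsample2x(img):
    """Nearest-neighbour 2x upsampling of an HxW image."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def faded_output(low_res_out, high_res_out, alpha):
    """Blend the already-trained low-resolution pathway with the newly
    added high-resolution block; alpha ramps from 0 to 1 during training."""
    return (1.0 - alpha) * upsample2x(low_res_out) + alpha * high_res_out

low = np.ones((4, 4))    # output of the already-trained 4x4 stage
high = np.zeros((8, 8))  # output of the freshly added, untrained 8x8 block
out = faded_output(low, high, alpha=0.25)  # still mostly the trusted path
```

Early in the fade (small alpha) the network leans on the competence it already has, which is exactly the gradualism the paragraph above describes.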

Progressive Growing of GANs

Nvidia’s latest paper puts a new twist on the Progressive GAN approach. Recall that inanimate and animate systems abide by the principle of least action. Intentional systems will employ the minimum amount of effort to learn a task. As an example, CycleGANs can game their cycle-consistency measure by hiding compressed information about the original image inside their generated output. The tendency of these intentional systems is to find ways to game the achievement of their goals. If a GAN can cheat, then it will find a way to cheat.
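The measure being gamed here is easy to state. A toy sketch of CycleGAN's cycle-consistency penalty (the arrays and values below are made up for illustration):

```python
import numpy as np

def cycle_consistency_loss(x, reconstructed):
    """L1 penalty ||F(G(x)) - x||_1: translating an image to the other
    domain and back should recover the original."""
    return np.mean(np.abs(reconstructed - x))

# A generator can "cheat" this loss: if G smuggles a compressed copy of x
# into barely visible structure, F can recover x almost perfectly no
# matter how the forward translation actually looks.
x = np.linspace(0.0, 1.0, 16)        # a pretend source image
round_trip = x + 0.01                # a nearly lossless reconstruction
loss = cycle_consistency_loss(x, round_trip)
```

The loss only rewards getting back to the start; it says nothing about how honestly the intermediate translation was produced, which is the loophole the system exploits.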

What we seek is for a GAN to learn the kinds of abstractions that ‘make a difference’, that is, are semantically important. Nvidia’s stylistic GAN games the discovery of these abstractions by constraining generation to styles at many different levels of resolution. These styles are active at different scales and coincidentally map to the compositional semantics that humans expect.

These stylistic GANs are not taught abstractions about hair, eyes, smiles, skin color, race, pose, etc. Rather, they coincidentally learn these abstractions as styles. The innovative twist in the architecture is that the generator is trained to learn the codes to these styles. That is, it learns the DNA of how to generate an image. This is a very novel solution to the disentanglement problem. Rather than crafting a regularization that disentangles the latent space, learn codes that disentangle the space at different structural levels. One can think of this as analogous to regulatory genes.

A Style-Based Generator Architecture for Generative Adversarial Networks

In the above diagram, the Generator learns a new encoding (w) using an 8-layer fully connected (MLP) network. This encoding is used as input to learn different styles (A) at different levels of resolution. The AdaIN blocks are style transfer networks. So the difference from the Progressive GAN network is the inclusion of a coding component and a style component. One can thus foresee the invention of more sophisticated kinds of generative behavior that go beyond simply increasing the image size. Low-hanging-fruit examples would be pose generation, text-to-speech, volumetric modeling, fluid simulation, and building modeling. I predict that the methodology for generating models will continue to improve and will be extremely useful for scientific discovery. We’ve seen this already in DeepMind’s AlphaFold success.
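The two pieces described above, the mapping network and AdaIN, can be sketched in a few lines. This is a heavily simplified toy (random weights, a single scalar style instead of per-channel styles derived from w, no convolutions), but it shows the mechanism: normalize a feature map, then let the style dictate its new scale and bias.

```python
import numpy as np

rng = np.random.default_rng(0)

def mapping_network(z, weights):
    """Toy stand-in for the 8-layer MLP that maps latent z to style code w."""
    h = z
    for W in weights:
        h = np.maximum(W @ h, 0.0)  # linear layer followed by ReLU
    return h

def adain(x, y_scale, y_bias):
    """Adaptive instance normalization: normalize the feature map x, then
    rescale and shift it with the style's learned scale and bias."""
    mu, sigma = x.mean(), x.std()
    return y_scale * (x - mu) / (sigma + 1e-8) + y_bias

z = rng.normal(size=16)                               # latent code z
weights = [rng.normal(size=(16, 16)) * 0.1 for _ in range(8)]
w = mapping_network(z, weights)                       # intermediate code w
y_scale, y_bias = 2.0, 0.5                            # a style (A), in the
                                                      # real model an affine
                                                      # function of w
feat = rng.normal(size=(8, 8))                        # a synthesis feature map
styled = adain(feat, y_scale, y_bias)
```

After AdaIN the feature map's statistics are entirely set by the style, which is why injecting different styles at different resolutions controls coarse versus fine attributes.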

In biology, we understand the importance of DNA in evolution. DNA preserves beneficial behavior beyond an agent’s lifespan. The double helix of nucleic acids provides an extremely robust mechanism for preserving information. What is unclear is how the language of DNA was invented. Now that we have this new kind of neural network that learns codes, perhaps it opens an entire thread of exploration where we can understand the mechanisms for learning a language. The gift of Deep Learning methods is that their generative, perturbative techniques give us new tools to explore questions about life and mind that previously could not be explored. We live in incredible times!

Exploit Deep Learning: The Deep Learning AI Playbook
