Fifteen Minutes with FiftyOne: Structured Noise Injection

Controlling attributes of generated images

Eric Hofesmann
Voxel51
6 min read · Oct 2, 2020


Face generation through the use of generative adversarial networks (GANs) has made significant leaps in recent years. Works like StyleGAN [1] have shown high-quality generation of faces that have become difficult for human observers to distinguish from real faces. Check out thispersondoesnotexist.com for examples of StyleGAN outputs.

While modern networks are able to produce individual realistic images, there are still difficulties in modifying attributes of generated images. For example, it is currently difficult to modify a generated face to go from frowning to smiling.

In a new CVPR 2020 paper, Disentangled Image Generation Through Structured Noise Injection [2], the authors developed a method of disentangling the latent space of GANs to be able to influence specific parts of the generated image. The random inputs used to generate images are injected in a structured, grid-based fashion that leads to the ability to modify them in expected ways during inference time. The image below shows the three “codes” that can be used to inject random inputs into a model and the faces that result from the codes.

Faces generated by structured noise injection through 3 different “codes” shown at the top of the image (source)

Design and implementation

Background

In order to generate random faces, these GANs require some random noise input to work from. This is generally provided as a vector of random numbers, the input tensor. In previous works, this input noise is passed through a series of fully connected layers and then mapped to a tensor with spatial dimensions (generally 4x4xC, where C is some number of channels). This is then passed through the network, which has learned how to systematically upscale the input tensor in such a way as to generate realistic images.
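This mapping from a flat latent vector to a spatial tensor can be sketched in a few lines of NumPy. The dimensions below are hypothetical (chosen only for illustration, not taken from the paper), and the single linear projection stands in for the stack of fully connected mapping layers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration, not the paper's exact values
Z_DIM = 512      # length of the random input vector
C = 512          # channels of the initial spatial tensor
H = W = 4        # spatial size of the initial tensor

# Random latent vector, as in a conventional GAN
z = rng.standard_normal(Z_DIM)

# Stand-in for the fully connected mapping layers:
# a single linear projection from the latent vector to a 4x4xC tensor
W_fc = rng.standard_normal((H * W * C, Z_DIM)) * 0.01
spatial_input = (W_fc @ z).reshape(H, W, C)

print(spatial_input.shape)  # (4, 4, 512)
```

Because every entry of `z` contributes to every entry of `spatial_input`, perturbing any single input value can change the whole generated image, which is exactly the entanglement described above.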

In previous works, every value of the random input tensor can affect every portion of the generated image.

The image below provides a high-level overview of providing an input tensor (Latent) to a generator (G), which produces an image that is fed into the discriminator (D).

Generator and discriminator overview (source). The random input is fed in through the Latent vector

Structured Noise Injection Overview

The key insight behind Structured Noise Injection (SNI) is that the random noise used to generate images can be structured into more than just a single tensor. These different structures used to inject noise are called “codes”. Codes are simply random tensors of different sizes and shapes that together construct the input to the GAN. The grid-based structure of these noise codes defines how the network will use them to generate the final image. During inference time, these codes can be modified independently to control how the final image will be generated.

There are three different structural codes for the noise that can be controlled:

  • Local codes: Input noise shaped into 4x4 or 8x8 regions that can be individually manipulated. These allow you to modify localized regions of the generated image.
  • Global shared codes (global code scale 1 in the image below): 2x2 noise that encompasses quadrants of the local codes. These allow you to modify higher-level attributes of the image.
  • Global codes (global code scale 0 in the image below): a 1x1 scalar tiled and shared by every input pixel. This modifies attributes related to every portion of the image.
Overview of how input noise, local codes, global shared codes (1), and global codes (0) are combined to create the input tensor for a GAN (source)
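The grid structure of the three codes can be sketched with plain NumPy. The channel counts below are hypothetical, chosen only to show the tiling and concatenation pattern: the 2x2 shared code is tiled so each cell covers one quadrant of the 4x4 grid, and the global code is tiled across every position before all three are concatenated along the channel axis:

```python
import numpy as np

rng = np.random.default_rng(0)

C_LOCAL, C_SHARED, C_GLOBAL = 8, 4, 4   # hypothetical channel counts
H = W = 4                               # spatial grid of local codes

# Local codes: an independent random vector per grid cell
local = rng.standard_normal((H, W, C_LOCAL))

# Global shared codes: a 2x2 grid, each cell covering one quadrant
shared = rng.standard_normal((2, 2, C_SHARED))
shared_tiled = np.repeat(np.repeat(shared, 2, axis=0), 2, axis=1)  # -> 4x4

# Global code: a single vector tiled across every spatial position
global_code = rng.standard_normal(C_GLOBAL)
global_tiled = np.broadcast_to(global_code, (H, W, C_GLOBAL))

# Concatenate along channels to form the structured input tensor
input_tensor = np.concatenate([local, shared_tiled, global_tiled], axis=-1)
print(input_tensor.shape)  # (4, 4, 16)
```

Resampling `global_code` changes all 16 positions at once, resampling one cell of `shared` changes one quadrant, and resampling one cell of `local` changes a single grid position, which is what makes independent edits possible at inference time.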

These codes begin to disentangle the effects of the random input noise used to generate images. Particularly, the local codes allow specific locations of an image to be regenerated without affecting other portions of the image. The global codes at different scales provide more largely applicable modifications to attributes like pose and gender.

Digging in with FiftyOne

In this section, we use the dataset and model evaluation tool FiftyOne to visualize various generated faces and qualitatively compare results when changing global and local codes.

Changing global codes

The image below shows an example of the 1x1 global code being modified. As the authors mentioned in the paper, this global code seems to be tied primarily to the rotation of the person's head in the image.

Randomly changing the global code (0) for a generated face, visualized in FiftyOne

Global shared codes seem to modify high-level attributes like age and facial accessories like glasses. The image below shows an example of a generated face with the global shared code randomized.

Randomly changing the global shared code (1) for a generated face, visualized in FiftyOne

Changing local codes

A mask can be provided to indicate the portions of the image that are to be modified. In the example below, the mask is located around the mouth causing different mouth shapes to be generated while the rest of the image mostly stays the same.
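The mask-and-resample idea can be sketched as follows. This is a simplified illustration (the grid size and channel count are hypothetical): a boolean mask selects which local codes to redraw, and only those entries receive new noise while the rest of the grid is left untouched:

```python
import numpy as np

rng = np.random.default_rng(1)

H = W = 8            # grid of local codes (the paper uses 4x4 or 8x8 regions)
C = 8                # hypothetical channels per local code

local = rng.standard_normal((H, W, C))

# Boolean mask selecting the region to regenerate, e.g. around the mouth
mask = np.zeros((H, W), dtype=bool)
mask[5:7, 2:6] = True

# Resample only the masked local codes; everything else is untouched
resampled = local.copy()
resampled[mask] = rng.standard_normal((mask.sum(), C))
```

Feeding `resampled` through the generator in place of `local` regenerates only the masked region, which is why the mouth changes in the example below while the rest of the face mostly stays the same.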

Randomly changing the local code in a mask around the mouth for a generated face, visualized in FiftyOne

When positioning the mask around some other portions of the image, the results were less noticeable. Examples of less impactful local codes include a mask around the eyes and the background, neither of which resulted in a significant change in the generated image.

Randomly changing the local code in a mask around the eyes for a generated face, visualized in FiftyOne
Randomly changing the local code in a mask around the background for a generated face, visualized in FiftyOne

Conclusion

Structured noise injection is a novel method that begins to structure the latent space of image generation networks in a way that allows users to modify attributes of the images without sacrificing the quality of the final images. Through FiftyOne we see that certain portions of images, like mouths, are more susceptible to modification than other portions, like eyes. While the end goal is to know exactly how an attribute of a generated image is going to be modified, SNI is a great step in that direction by letting you choose the location of attributes that will be modified.

Check out this example to look at these results yourself: https://github.com/voxel51/fiftyone-examples/blob/master/examples/structured_noise_injection.ipynb

About Me

My name is Eric Hofesmann. I received my master’s in Computer Science, specializing in Computer Vision, at the University of Michigan. During my graduate studies, I realized that it was incredibly difficult to thoroughly analyze a new model or dataset without serious scripting to visualize and search through outputs and labels. Working at the computer vision startup, Voxel51, I helped develop the tool FiftyOne so that researchers can quickly load up and start looking through datasets and model results. This series of posts will go through state-of-the-art computer vision models and datasets and analyze them with FiftyOne.

References

[1] T. Karras, et al, A style-based generator architecture for generative adversarial networks, CVPR (2019)

[2] Y. Alharbi and P. Wonka, Disentangled Image Generation Through Structured Noise Injection, CVPR (2020)

[3] T. Karras, et al, Progressive Growing of GANs for Improved Quality, Stability, and Variation, ICLR (2018)

[4] Voxel51, FiftyOne: Explore, Analyze and Curate Visual Datasets, (2020)
