Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space

Attempt to explain PPGN in 15 minutes or less

Jan Maděra
knowledge-engineering-seminar
12 min read · Apr 28, 2020


The primary goal of this blog post is to explain the article [1] by A. Nguyen et al.

Introduction

The goal of a Plug & Play Generative Network is to generate an output of some type under constraints given by a conditional component. This output could be anything (text, images, or something more abstract), but the article [1], like other state-of-the-art methods, focuses on image generation.

Let’s start by asking a question.

What motivated the authors to write this paper? They were not satisfied with the images generated by Deep Generator Network-based Activation Maximization (DGN-AM) [2], which often closely matched the pictures that most highly activate a class output neuron in a pre-trained image classifier (see figure 1). Simply put, DGN-AM lacks diversity in its generated samples. Because of that, the authors of [1] improved DGN-AM by adding a prior (and other features) that “pushes” the optimization toward more realistic-looking images. They explain how this works through a probabilistic framework described in the next part of this blog post. The authors also point out open challenges that other state-of-the-art methods have yet to solve. These challenges are:

  1. Creating an image generator that produces photo-realistic images at high resolution.
  2. Creating an image generator that produces a wide variety of image types.
  3. Creating an image generator that produces a diversity of samples comparable to the diversity of images in the dataset.

Plug & Play Generative Networks (PPGN) are an attempt to overcome these challenges. Just by looking at fig. 1, we can see that the attempt was at least partly successful.

Figure 1: [1 p.2] (a) The 9 images that most highly activate the “cardoon” class output neuron in a pre-trained ImageNet classifier. (b) DGN-AM synthesized images, which often converge to the mode of the 9 highest-activating images. (c) 9 images of cardoons randomly picked from the training set. (d) Images produced by PPGN, with high quality and significantly larger diversity, which represent the diversity of the cardoon class (compare with the random real images in (c)) better than DGN-AM does.

Probabilistic framework

What are the parts of the probabilistic framework? The authors use their own variant of the Metropolis-adjusted Langevin algorithm (MALA) to generate images (more details are in [3, 4] and in sections S6 and S7 of [1]). MALA uses the following transition operator:

equation 1: [1 p.3]

Assume we have a joint model p(x, y), where X is the space of images and Y is the space of class labels for images in X. We wish to use this model to sample images:

equation 2: [1 p.3]

The authors describe this model as a “product of experts,” which is a very efficient way to model high-dimensional data that must simultaneously satisfy many different low-dimensional constraints. The “expert” p(y|x) enforces the condition on image generation (for example, that the image must be classified as “cardoon”). The prior expert p(x) ensures that the samples are not unrecognizable “fooling” images that achieve high p(y|x) but bear no resemblance to training-set images of the same class.

Instead of writing a sampler for the full joint model, the authors fix y to a chosen class y_c. With a few adjustments described in the article [1 p.3] and with y fixed to y_c, equation 1 turns into this update rule:

equation 3: [1 p. 3] Update rule of iterative sampling used in the image generator

In this update rule, there are three terms weighted by the step sizes ε1, ε2, and ε3. Each term pushes the sample xₜ to “take a step toward another image” (as the authors put it) in the space of images. The three terms can be interpreted as follows:

  • ε1 term: Take a step from the current image xₜ toward an image that looks more like the images in the training set. This term ensures that we do not end up with a “fooling” image that makes the classifier output high confidence for the chosen class but looks like a mess to a human observer.
  • ε2 term: Take a step from the current image xₜ toward an image that makes the classifier output higher confidence for the chosen class.
  • ε3 term: Take a step from the current image xₜ in a random direction. This random step encourages diversity among the generated images.

Remember that the effect of each term can be regulated by the size of the corresponding epsilon. For example, the epsilons used when sampling with PPGN-x (explained later in this blog post) had these values (a minimal code sketch of the update rule follows below):

  • (ε1, ε2, ε3) = (1, 10⁵, 25.6)
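To make this concrete, here is a minimal numpy sketch of one sampling step as I read eq. 3, i.e. xₜ₊₁ = xₜ + ε1·∂log p(xₜ)/∂xₜ + ε2·∂log p(y = y_c|xₜ)/∂xₜ + noise with standard deviation ε3. The gradient callables, step sizes, and the toy usage are placeholders of mine, not the authors’ code:

```python
import numpy as np

def sampling_step(x, grad_log_prior, grad_log_class, eps1, eps2, eps3, rng):
    """One update in the spirit of eq. 3: prior step + condition step + noise.

    grad_log_prior(x) ~ d log p(x)       / dx   (epsilon-1 term)
    grad_log_class(x) ~ d log p(y=y_c|x) / dx   (epsilon-2 term)
    Both callables are placeholders for whatever networks are plugged in.
    """
    noise = rng.normal(0.0, eps3, size=x.shape)   # epsilon-3 term
    return x + eps1 * grad_log_prior(x) + eps2 * grad_log_class(x) + noise

# Toy usage: a tiny 4-pixel "image" pulled toward 0 by a Gaussian prior
# and toward +3 by a stand-in "classifier" term.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
for _ in range(100):
    x = sampling_step(x,
                      grad_log_prior=lambda v: -v,          # N(0, I) prior
                      grad_log_class=lambda v: (3.0 - v),   # stand-in condition
                      eps1=1e-2, eps2=1e-2, eps3=1e-3, rng=rng)
```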

So what does Plug & Play in the title mean? The three epsilons can be changed (played with) to find optimal values, and it is possible to “plug in” different generator network priors p(x) and condition networks p(y = y_c|xₜ). Simply put, there are parameters to play with and generative and conditional networks to plug in.

Now that we have explained the basic framework of PPGN, let’s plug something specific into the abstract terms of eq. 3.

Plug & Play Generative Networks

I will start with the ε1 term in eq. 3. The authors state in [1 p.4] that “Previous models are often limited in that they use hand-engineered priors when sampling in either image space or the latent space of a generator network.” They overcome the need for hand-engineered priors by using a denoising autoencoder (DAE).

Autoencoders are neural networks with the same number of inputs and outputs, used for feature selection and extraction. One known problem: when the hidden layer has at least as many units as the input, or the hidden units are otherwise given enough capacity, the autoencoder tends to learn the identity function (outputs equal inputs) and becomes useless. A denoising autoencoder (DAE) avoids this by corrupting the input with noise and training the network to reconstruct the clean input. More about DAEs can be found in [5].
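As a quick illustration of the corrupt-then-reconstruct idea, here is a tiny PyTorch DAE; the layer sizes, noise level, and MSE objective are minimal choices of mine, not the architecture from [1]:

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Tiny illustrative DAE: corrupt the input, reconstruct the clean input."""
    def __init__(self, dim_in=784, dim_hidden=128, sigma=0.1):
        super().__init__()
        self.sigma = sigma  # std of the Gaussian corruption noise
        self.encoder = nn.Sequential(nn.Linear(dim_in, dim_hidden), nn.ReLU())
        self.decoder = nn.Linear(dim_hidden, dim_in)

    def forward(self, x):
        x_noisy = x + self.sigma * torch.randn_like(x)   # corrupt
        return self.decoder(self.encoder(x_noisy))       # denoise

# Training objective: reconstruct the *clean* x from the corrupted input.
dae = DenoisingAutoencoder()
x = torch.rand(32, 784)
loss = nn.functional.mse_loss(dae(x), x)
loss.backward()
```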

Equation 4: [1 p. 4] With the help of the DAE, we can approximate the ε1 term in equation 3; Rₓ(x) denotes the output of the denoising autoencoder (DAE)

The DAE lets us approximate the ε1 term indirectly: if the DAE is trained with Gaussian noise of variance σ², its reconstruction error approximates the gradient of the log probability, as explained in [6 p. 5]. The sampler can then use this approximation to step from image x toward an image that looks more like the images in the training set, exactly as the ε1 term in equation 3 requires. The updated equation 3 looks like this:

equation 5: [1 p. 5] Equation 3 with the ε1 term replaced by its DAE approximation
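Combining eq. 4 and eq. 5, the ε1 gradient is replaced by the DAE’s reconstruction difference, ∂log p(x)/∂x ≈ (Rₓ(x) − x)/σ². Below is a sketch of the resulting sampling step, with the same kind of placeholder callables as before (keeping the 1/σ² factor explicit here is just a choice of parameterization):

```python
import numpy as np

def dae_prior_grad(x, dae_reconstruct, sigma):
    """Eq. 4: for a DAE trained with Gaussian noise of variance sigma**2,
    (R(x) - x) / sigma**2 approximates d log p(x) / dx."""
    return (dae_reconstruct(x) - x) / sigma**2

def sampling_step_with_dae_prior(x, dae_reconstruct, grad_log_class,
                                 eps1, eps2, eps3, sigma, rng):
    """Eq. 5: eq. 3 with the epsilon-1 term computed from the DAE output."""
    return (x
            + eps1 * dae_prior_grad(x, dae_reconstruct, sigma)   # prior step
            + eps2 * grad_log_class(x)                           # condition step
            + eps3 * rng.normal(size=x.shape))                   # noise step
```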

Now that the DAE is part of our update rule, let’s look at the PPGN variants with different priors that the authors of [1] describe.

Figure 2: [1 p.4] Overview of all the PPGN variants with different priors that are described in [1]

PPGN-x

Figure 3: [1 p.4] PPGN-x

PPGN-x is the basic PPGN, which uses a DAE to model p(x) directly. PPGN-x has two problems:

  1. The DAE models the data distribution poorly, which shows up in fig. 4 as images that become blurry over time.
  2. The sampling chain mixes too slowly, so the images change only slightly over hundreds of steps, as seen in fig. 4.

Images in this dataset are high-dimensional: 227×227 = 51,529 pixels per color channel, so the DAE has tens of thousands of inputs and outputs. Poor mixing in such a high-dimensional space (pixel space in our case) is expected, and sampling in the representation of a deeper layer (rather than directly in pixel space) has the potential to explore the x (image) space faster. So we can lower the dimensionality and perform sampling in a lower-dimensional space, which the authors call h-space (see the hypotheses in [7]).

Figure 4: [1 p.24] Samples generated from a single sampling chain using PPGN-x, starting from a real image (leftmost). The first-row sampling chain is conditioned on the “planetarium” class and the second on the “kite” class (a type of bird).
Figure 5: [1 p.25] Same as fig. 4, but starting from random images. The sampling chain of PPGN-x in the image space mixes poorly.

Deep image generator network activation maximization (DGN-AM)

Figure 6: [1 p.4] DGN-AM

DGN-AM samples without a learned prior. It searches for a code h such that the image generated by the generator network G from h highly activates the output neuron of the DNN classifier that corresponds to the chosen class.

Figure 7: [2 p.3] Visualization of the DNN and the deep image generator network on which activation maximization (DGN-AM) is performed. In this example, “candle” is the class (a neuron in the output layer of the DNN) that we want the generated image (green tile) to activate highly. Both the DGN and the DNN have fixed parameters; optimization only changes the input code h of the DGN (red layer).
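Here is a sketch of the optimization loop this describes, assuming G and the classifier are already-trained, frozen PyTorch modules; names, shapes, and hyperparameters are illustrative, not the authors’ setup:

```python
import torch

def dgn_am(G, classifier, class_idx, h_dim=4096, steps=200, lr=1e-2):
    """Sketch of DGN-AM: gradient ascent on the code h so that the image G(h)
    highly activates one class neuron. G and classifier stay frozen;
    only h changes."""
    for p in list(G.parameters()) + list(classifier.parameters()):
        p.requires_grad_(False)                   # keep both networks fixed

    h = torch.zeros(1, h_dim, requires_grad=True)
    opt = torch.optim.SGD([h], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        image = G(h)                              # generator: code -> image
        score = classifier(image)[0, class_idx]   # activation of target neuron
        (-score).backward()                       # ascend the activation
        opt.step()
    return G(h).detach()
```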

As can be seen in fig. 8 and 9, DGN-AM at least converges to something, unlike PPGN-x, but it still mixes poorly (slowly): it tends to keep producing nearly the same image after many sampling steps.

Figure 8: [1 p.24] Samples generated from a single sampling chain using DGN-AM starting with a real image from the training set. The first-row sampling chain is conditioned on the “planetarium” class and the second on the “kite” class (a type of bird).
Figure 9: [1 p.25] Same as fig. 8, but starting from a random code h, which produces the leftmost images after being pushed through generator network G.

PPGN-h

Figure 10: [1 p.4] PPGN-h

PPGN-h somewhat solves the poor mixing speed of DGN-AM by adding a DAE that is used to learn the prior p(h). In this paper, the authors used a DAE with seven fully connected layers of sizes 4096–2048–1024–500–1024–2048–4096. As expected, the PPGN-h chain mixes faster than PPGN-x, but quality and diversity are still only comparable to DGN-AM, which the authors attribute to the DAE learning a poor model of the p(h) prior.
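For illustration, a PyTorch sketch of a fully connected stack with the layer widths quoted above; the ReLU activations, and whether the first 4096 counts as a layer or as the input, are my own reading rather than the authors’ exact definition:

```python
import torch.nn as nn

# Sketch of the fully connected DAE used to model p(h) in PPGN-h, following the
# layer widths quoted above (4096-2048-1024-500-1024-2048-4096). Activations and
# the layer-count interpretation are assumptions for illustration only.
dae_h = nn.Sequential(
    nn.Linear(4096, 2048), nn.ReLU(),
    nn.Linear(2048, 1024), nn.ReLU(),
    nn.Linear(1024, 500), nn.ReLU(),
    nn.Linear(500, 1024), nn.ReLU(),
    nn.Linear(1024, 2048), nn.ReLU(),
    nn.Linear(2048, 4096),
)
```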

Figure 11: [1 p.24] Samples generated from a single sampling chain using PPGN-h starting with a real image from the training set. The first-row sampling chain is conditioned on the “planetarium” class and the second on the “kite” class (a type of bird).
Figure 12: [1 p.25] Same as fig. 11, but starting from a random code h, which produces the leftmost images after being pushed through generator network G.

Joint PPGN-h

Figure 13: [1 p.4] Joint PPGN-h

The poor modeling of the h feature space in PPGN-h can be addressed by modeling h not with a separate DAE but through the generator G itself. G generates a realistic-looking image x from the features h, and two encoder networks then encode this image back to h (via an intermediate feature layer h₁). To obtain a true joint denoising autoencoder, the authors also add noise to h, to the image x, and to h₁.
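A sketch of that reconstruction path, with G, E1, and E2 assumed to be PyTorch modules and the noise levels chosen arbitrarily for illustration:

```python
import torch

def joint_dae_pass(h, G, E1, E2, noise_std=0.01):
    """Joint PPGN-h idea as I read it: reconstruct h by going
    h -> image -> h1 -> h through the generator G and two encoders E1, E2,
    adding noise along the way so the whole path acts as one joint
    denoising autoencoder. Names and noise levels are illustrative."""
    h_noisy = h + noise_std * torch.randn_like(h)
    x = G(h_noisy)                                    # code -> image
    x_noisy = x + noise_std * torch.randn_like(x)
    h1 = E1(x_noisy)                                  # image -> intermediate features h1
    h1_noisy = h1 + noise_std * torch.randn_like(h1)
    h_rec = E2(h1_noisy)                              # h1 -> reconstructed h
    return x, h1, h_rec
```

Training then penalizes reconstruction error along this path, so the whole chain behaves as a single joint denoising autoencoder.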

We can observe that the sampling chain of Joint PPGN-h mixes faster (more diverse images) than PPGN-h, and the authors also claim it produces samples of better quality than all previous PPGN variants, though it is not entirely clear what “quality” means here. In my opinion, the bird samples look less like the “kite” species than those of PPGN-h in fig. 11 and 12, and the planetarium samples still look as strange as the ones generated by PPGN-h.

Figure 14: [1 p.24] Samples generated from a single sampling chain using Joint PPGN-h starting with a real image from the training set. The first-row sampling chain is conditioned on the “planetarium” class and the second on the “kite” class (a type of bird).
Figure 15: [1 p.25] Same as fig. 14, but starting from a random code h, which produces the leftmost images after being pushed through generator network G.

The Noiseless Joint PPGN-h

Figure 16: [1 p.4] The Noiseless Joint PPGN-h

The authors also tested variants of Joint PPGN-h with different levels of added noise and empirically found that the variant with infinitesimally small noise (the so-called Noiseless Joint PPGN-h) produces better and more diverse images. Its chain mixes substantially faster than DGN-AM, though slightly slower than Joint PPGN-h; on the other hand, removing the noise leads to better image quality.

The authors also observed the following in [1 p.7]: “Sweeping across the noise levels during sampling, we noted that larger noise amounts often results in worse image quality, but not necessarily faster mixing speed. Also, as expected, a small ε1 (Note: see the ε1 term explanation in the Probabilistic framework part of this blog post) multiplier makes the chain mix faster, and a large one pulls the samples towards being generic instead of class-specific.”
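Putting the pieces together, here is my sketch of one sampling step for the Noiseless Joint PPGN-h, assuming G, an encoder E that maps an image back to h features, and the classifier are frozen PyTorch modules; details such as step sizes and any clipping are not taken from [1]:

```python
import torch

def noiseless_joint_ppgn_h_step(h, G, E, classifier, class_idx,
                                eps1, eps2, eps3=1e-17):
    """One h-space sampling step as I understand it: the prior term uses the
    reconstruction R(h) = E(G(h)), the condition term is the gradient of the
    chosen class score w.r.t. h, and the noise term is kept (near-)zero.
    Illustrative reading of the method, not the authors' exact code."""
    h = h.detach().requires_grad_(True)
    class_scores = torch.log_softmax(classifier(G(h)), dim=1)
    grad_class = torch.autograd.grad(class_scores[0, class_idx], h)[0]
    with torch.no_grad():
        r_h = E(G(h))                               # push h through G and back
        h_next = (h + eps1 * (r_h - h)              # "realism"/prior step
                    + eps2 * grad_class             # condition step
                    + eps3 * torch.randn_like(h))   # noise (kept tiny here)
    return h_next
```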

Figure 17: [1 p.24] Samples generated from a single sampling chain using Noiseless Joint PPGN-h starting with a real image from the training set. The first-row sampling chain is conditioned on the “planetarium” class and the second on the “kite” class (a type of bird).
Figure 18: [1 p.25] Same as fig. 17, but starting from a random code h, which produces the leftmost images after being pushed through generator network G.

Capabilities of Noiseless Joint PPGN-h

Let’s take the “plug and play” property of the Noiseless Joint PPGN-h to the next level.

Generating images conditioned on classes

We can plug in different condition components and challenge the generator to produce images it has never seen before. For example, what happens if we replace the image classifier with an AlexNet DNN trained to classify 205 categories of scene images (MIT Places) that the generator was never trained on? The result of conditioning the Noiseless Joint PPGN-h on places the generator was never taught to create can be seen in figure 19.

Figure 19: [1 p.7] Images synthesized conditioned on MIT Places classes instead of ImageNet classes.

Generating images conditioned on captions

Conditioning on captions is another thing the Noiseless Joint PPGN-h can do: we replace the image classifier with an image-captioning recurrent network trained on the MS COCO dataset [8] to predict a caption y given an image x. In many cases it generates reasonable images, but image quality is, of course, lower than with class conditioning because of the much wider variety of captions. Fooling images also appear occasionally; for those, the generator fails to produce a high-quality image.

Figure 20: [1 p.7] Images synthesized to match a text description using an image-captioning recurrent network.

Generating images conditioned on hidden neurons

If PPGN can generate images conditioned on classes, which correspond to neurons in the output layer of the classifier DNN, it can just as well create images conditioned on neurons in hidden layers. This is useful when we want to find out what exactly a specific neuron has learned to detect.

Figure 21: [1 p.8] Images synthesized to highly activate a hidden neuron identified as a “face detector neuron.” Created images can help us to find out what the neuron has learned to detect.

Inpainting

If part of an image is missing, PPGN can fill it in while staying aware of the surrounding context. The authors compared PPGN with Photoshop’s Content-Aware Fill feature. I think PPGN does the filling job well, even though it was never trained to do so.

Figure 22: [1 p.8] Comparison of inpainting by (b) PPGN, (c) PPGN with the code h additionally constrained to be context-aware, filling the region while respecting its surroundings, and (d) Photoshop’s Content-Aware Fill feature.
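One simple way to picture the context constraint on the code h is a masked reconstruction penalty on the known pixels; the sketch below is my illustration of that idea, not the authors’ exact formulation:

```python
import torch

def masked_context_grad(h, G, real_image, mask):
    """Penalize the squared difference between G(h) and the real image on the
    *known* pixels (mask == 1), and return the gradient of that penalty w.r.t.
    h. Illustrative only; G, real_image, and mask are assumed inputs."""
    h = h.detach().requires_grad_(True)
    diff = (G(h) - real_image) * mask          # only the observed region counts
    loss = (diff ** 2).mean()
    return torch.autograd.grad(loss, h)[0]
```

During sampling, this gradient (scaled by its own step size) would be subtracted from h alongside the prior and condition terms, so the generated image agrees with the visible pixels while the masked region is filled in freely.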

Other interesting materials

The authors of [1] also published code, trained networks, and videos of PPGN in action. Unfortunately, the link http://www.evolvingai.org/ppgn does not work at the moment, either because the authors changed the address or because they shut it down completely.

At least the code repository can still be found at https://github.com/Evolving-AI-Lab/ppgn. We can hope that a new link to the additional materials will be added there.

Video 1: Animation of the sampling chain of PPGN between 10 different classes and within single classes.

Conclusion

I have tried to simplify the explanation of PPGN from paper [1]: first explaining what led the authors to build PPGN, then describing the PPGN framework with simplified math. Furthermore, the main differences between the PPGN variants were described, starting with the simplest PPGN-x and gradually adding features until we reached the Noiseless Joint PPGN-h. Finally, some exciting capabilities of the Noiseless Joint PPGN-h were shown, such as inpainting missing parts of images or generating images from multi-word captions. There are also additional materials you can use to explore this topic further.

References

[1] A. Nguyen, J. Clune, Y. Bengio, A. Dosovitskiy, J. Yosinski. Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space. arXiv preprint arXiv:1612.00005v2, 2017.

[2] A. Nguyen, A. Dosovitskiy, J. Yosinski, T. Brox, J. Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. arXiv preprint arXiv:1605.09304, 2016.

[3] G. O. Roberts and J. S. Rosenthal. Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255–268, 1998.

[4] G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pages 341–363, 1996.

[5] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM, 2008.

[6] G. Alain and Y. Bengio. What regularized auto-encoders learn from the data-generating distribution. The Journal of Machine Learning Research, 15(1):3563–3593, 2014.

[7] Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai. Better mixing via deep representations. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 552–560, 2013.

[8] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, P. Dollár. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
