Intro to GAN

Jinzhen Fan
17 min read · Jun 27, 2019


Hello there, I am a pharmaceutical data scientist and a newcomer to GANs, writing this blog to share some learnings and useful resources with a few co-workers. It is said that in the world of GANs, new breakthroughs happen monthly. I will try my best to catch a few of the waves :)


Before reading:

To sync with my personal learning experience, readers of this article are expected to have:

  • a good understanding of neural nets and gradient descent.
  • familiarity with at least one deep learning framework (like TensorFlow, Keras, Theano, PyTorch, etc.)

If you are very familiar with GANs and their variants, this is not for you.

Summary comes first. The diagram below, made by myself, roughly shows the development of GANs from 2014 to 2018 across 4 areas of innovation. Mode collapse is the area where researchers seek to solve the gradient vanishing problem and stabilize GAN training. GANs pursuing higher resolution aim to generate better quality images, and may also contribute to improved training stability. Still, the hottest area, and the one that gains the most public attention, is style transfer, or image-to-image translation; it includes applications such as translating faces of celebrities or generating Monet paintings from real photos. Attribute control is where GANs make the heaviest use of statistical distributions. Style transfer borrows knowledge from attribute-control GANs.

Development of GAN in 4 directions (2014–2018). Arrow A -> B indicates B is built on top of A.

Original GAN (2014)

Pic from O’Reilly

The discriminator D is trained to assign the labels “real” and “fake” correctly. The generator G is trained to deceive the discriminator. z is the random noise used to generate fake data. We have a value function as follows:
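In the notation of the original paper:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]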

D and G play a two-player minimax game over this value function, which means D is trained to maximize it while G is trained to minimize it. The first term in the value function, log(D(x)), comes from real data, and the second term, log(1 − D(G(z))), comes from generated data. In every pass of gradient descent, D is trained to maximize D(x) and minimize D(G(z)). Meanwhile, the generator is trained to maximize D(G(z)).

Algorithm:

for each training iteration, do:

for k repeats, do:

  • Sample a minibatch of random noise vectors.
  • The generator takes in the random noise vectors and returns a minibatch of “generated” images.
  • The discriminator takes in the minibatch of generated images and a minibatch of real images, and returns probabilities, numbers between 0 and 1, with 1 representing a prediction of authenticity and 0 representing fake.
  • The discriminator is in a feedback loop with the real images; we update the discriminator D by ascending the stochastic gradient of V with respect to the parameters of the discriminator network.

end for

Update the generator G by descending the stochastic gradient of V with respect to the parameters of the generator network.

end for
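To make the loop above concrete, here is a minimal sketch of one training iteration in PyTorch. It assumes G and D are already-defined networks (D ending in a sigmoid), opt_G and opt_D are their optimizers, and real_batch is a minibatch of real images; all names are illustrative rather than taken from the paper.

import torch
import torch.nn.functional as F

def train_step(G, D, opt_G, opt_D, real_batch, z_dim=100):
    batch_size = real_batch.size(0)
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    # Discriminator step: ascend V, i.e. maximize log D(x) + log(1 - D(G(z)))
    z = torch.randn(batch_size, z_dim)      # minibatch of random noise vectors
    fake_batch = G(z).detach()              # generated images; no gradient flows to G here
    d_loss = F.binary_cross_entropy(D(real_batch), real_labels) + \
             F.binary_cross_entropy(D(fake_batch), fake_labels)
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Generator step: maximize log D(G(z)), the non-saturating variant of minimizing log(1 - D(G(z)))
    z = torch.randn(batch_size, z_dim)
    g_loss = F.binary_cross_entropy(D(G(z)), real_labels)
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()

In practice the discriminator step is repeated k times per generator step, exactly as in the pseudo-code above.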

GAN outputs:

Rightmost column shows the nearest training example of the neighboring sample, in order to demonstrate that the model has not memorized the training set.

Play with GAN: training process

Play with GAN: Keras code

How to choose the length of the latent code z?

The optimal length of z largely depends on the individual dataset and application, and how much ambiguity there is in the output. A very low-dimensional latent code may limit the amount of diversity that can be expressed. On the contrary, a very high-dimensional latent code can potentially encode more information about an output image, at the cost of making sampling difficult. The original GAN used a length of 100. Afterwards, practitioners have heuristically kept using 100, which produced great results. BicycleGAN studied lengths of z in {2, 8, 256}. The dimension of z is commonly tuned in the range of 2–512, depending on your computing power and the diversity of the dataset.

If you run into KL divergence

There is a high chance that you will run into a measure called Kullback–Leibler (KL) divergence, a concept introduced by Kullback and Leibler in 1951. Sounds similar to Klybeck.

Klybeck park in Basel, Switzerland

KL divergence is a measure of distance between two distributions, and it is frequently used to measure the distance between the distributions of real and fake data. KL divergence is often used as part of the GAN target function. For example, for a fixed G, the optimal D leads to a maximal V that is (up to a constant) the JS divergence, which is a sum of two KL divergences:
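V(G, D^*) = 2 \cdot \mathrm{JSD}(p_{\text{data}} \,\|\, p_g) - 2\log 2, \qquad \mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2}\,\mathrm{KL}\!\left(P \,\Big\|\, \tfrac{P+Q}{2}\right) + \tfrac{1}{2}\,\mathrm{KL}\!\left(Q \,\Big\|\, \tfrac{P+Q}{2}\right)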

For discrete probability distributions P and Q, KL divergence is defined as:
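\mathrm{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}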

It is equivalent to the cross entropy of P and Q minus the entropy of P.

A Kullback–Leibler divergence of 0 indicates that the two distributions in question are identical. The entropy H(P) sets a minimum value for the cross-entropy H(P,Q); thus KL divergence is always non-negative.

KL divergence is asymmetric, meaning
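\mathrm{KL}(P \,\|\, Q) \neq \mathrm{KL}(Q \,\|\, P)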

which can cause instability in training GANs. A few variants have combined forward and reverse KL divergence to get around this issue.

cGAN(2014)

The generated images from the original GAN could not be controlled. Conditional GAN (cGAN) was proposed in 2014; it conditions both the generator and the discriminator on auxiliary information y. For example, y could be one-hot encoded class labels for the MNIST dataset.
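The value function is the same minimax game as before, with both networks receiving y as an extra input:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))]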

An intuitive example is that cGAN can generate single-digit images (0–9) for a specified class label. The paper also showed that cGAN is capable of assigning tags to images, if tags were used as conditions during training.

LAPGAN(2015)

Laplacian Generative Adversarial Networks (LAPGAN) were proposed to generate high-resolution images by stacking multiple layers of generators, from low to high resolution (rightmost to leftmost). At that time, people had not yet found a good way to generate high-resolution pictures from a deep CNN.

Generator of LAPGAN (left to right)
Training of LAPGAN

DCGAN(2015)

To solve the instability and resolution issues, deep convolutional generative adversarial networks (DCGANs) incorporate some tricks from CNNs to stabilize the network and to view meaningful representations in intermediate layers. This was not the first attempt, but it is a milestone where CNNs achieved great results in GANs.

DCGAN proposed changes to CNN architecture

One key thing DCGAN did was to replace all pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator). Fractional-strided convolution, sometimes called deconvolution, is the reverse of strided convolution, returning a higher-resolution output from a lower-resolution input.

Generator in DCGAN
Fractionally-strided convolution
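As a rough illustration of the fractional-strided convolution, here is a PyTorch sketch in which a strided convolution halves the spatial resolution and a transposed (fractional-strided) convolution doubles it again; the layer sizes are illustrative and not the exact DCGAN configuration.

import torch
import torch.nn as nn

# Strided convolution: downsamples, as used in the DCGAN discriminator.
down = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=4, stride=2, padding=1)
# Fractional-strided (transposed) convolution: upsamples, as used in the generator.
up = nn.ConvTranspose2d(in_channels=128, out_channels=64, kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 64, 16, 16)   # a 16x16 feature map
print(down(x).shape)             # torch.Size([1, 128, 8, 8])
print(up(down(x)).shape)         # torch.Size([1, 64, 16, 16])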

The quality of a GAN can be examined by using it as a feature extractor. In this paper, the discriminator’s convolutional features from all layers were flattened and concatenated to form a vector, and a regularized linear classifier was trained on top of them to prove the robustness. Features extracted from DCGAN were shown to be meaningful representations of object structure; see the figure below. They were shown to reduce error rates on classification problems.

Left: Random filter. Right: Learned CNN features from last layer of discriminator.

VAE-GAN (2015)

VAE-GAN is a place where machine learning and statistics cooperate. You may find it a bit tricky, because it speaks the language of both neural networks and probability models. Let’s first start with the basic concepts of VAE, before proceeding to VAE-GAN.

An autoencoder (AE) is a data representation technique composed of an encoder and a decoder, connected by a bottleneck of vectors. AEs can be used for de-noising images, removing watermarks, and inpainting images. A variational autoencoder (VAE) replaces the bottleneck vector of the AE with two vectors: a mean vector and a standard deviation vector. After the bottleneck layer, there is a sampling step.

http://kvfrans.com/variational-autoencoders-explained/

The target of VAE is to minimize the KL divergence between the true posterior p(z|x) and the variational posterior q(z|x), which is intractable. Here z is the latent variable and x is the input data, such as images. This is proved to be equivalent to maximizing the ELBO (Evidence Lower Bound). [proof] Thus, the ELBO becomes the target function of VAE.
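Written out for a single data point x, the ELBO is

\mathrm{ELBO}(x) = \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - \mathrm{KL}\big(q(z \mid x) \,\|\, p(z)\big)

and since \log p(x) = \mathrm{ELBO}(x) + \mathrm{KL}(q(z \mid x) \,\|\, p(z \mid x)) with \log p(x) fixed, maximizing the ELBO minimizes the intractable KL term.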

Pic from Jaan Altosaar

If you want to dig deeper: the first term represents the reconstruction loss from samples of a distribution, and the second forces the learned distribution to be a normal (Gaussian) distribution. By forcing latent vectors into a normal distribution, the VAE makes it easy to generate unseen data by sampling in latent space. This is the constraint that separates a VAE from prior art. [why normal distribution] Note that the sampling step after the bottleneck cannot be back-propagated; there is a trick in VAE called re-parameterization that turns the mean and sd into trainable parameters by adding a stochastic variable ϵ. Thus the mean and sd can be trained. [explanation]
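A minimal sketch of the re-parameterization trick in PyTorch (shapes and names are illustrative): instead of sampling z directly from N(mu, sigma²), sample ϵ from a standard normal and shift/scale it, so gradients can still flow into mu and sigma.

import torch

def reparameterize(mu, log_var):
    # z = mu + sigma * eps with eps ~ N(0, I); mu and log_var stay differentiable.
    std = torch.exp(0.5 * log_var)   # sigma
    eps = torch.randn_like(std)      # stochastic part, carries no gradient of its own
    return mu + std * eps

mu = torch.zeros(4, 32, requires_grad=True)       # mean vectors from the encoder
log_var = torch.zeros(4, 32, requires_grad=True)  # log-variance vectors from the encoder
z = reparameterize(mu, log_var)                   # differentiable samples fed to the decoder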

In practice, the images a VAE generates tend to be blurry. VAE-GAN proposes a critic function of the ELBO plus the value function of GAN to improve resolution.

The critic becomes:
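In the paper’s notation, the combined loss is the VAE prior term, plus a reconstruction term measured in discriminator feature space, plus the GAN term:

\mathcal{L} = \mathcal{L}_{\text{prior}} + \mathcal{L}_{\text{llike}}^{\mathrm{Dis}_l} + \mathcal{L}_{\mathrm{GAN}}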

The structure of VAE-GAN essentially replaces the GAN generator with the VAE decoder.

A slight change to the VAE is that the first item in the VAE loss, the reconstruction loss between x and the reconstructed x̄, is replaced with the loss between the 𝑙th layer representations of x and x̄ in the discriminator network. The VAE-GAN training process goes like this:

Algorithm:

Initialize network parameters for the encoder, decoder, and discriminator networks.

for k repeats do:

  • Randomly sample a minibatch X from the real dataset.
  • Get encoded vectors Z from the encoder network with X as input.
  • Compute the KL divergence L1 between the posterior distribution q(Z|X) and the prior p(Z).
  • Generate fake data X̄ from the decoder network with Z as input.
  • Compute the reconstruction loss L2 between X and X̄ using the 𝑙th layer representation in the discriminator network.
  • Sample Zp from the prior normal distribution, and generate Xp from the decoder with Zp as input.
  • Compute the GAN value function L3 as log(Dis(X)) + log(1 − Dis(X̄)) + log(1 − Dis(Xp)).
  • Update the encoder network with the gradient of L1 + L2.
  • Update the decoder network with the gradient of γL2 − L3, where γ is a weighting parameter between VAE and GAN.
  • Update the discriminator network with the gradient of L3.

Compared to GAN, VAE-GAN can reconstruct dataset samples with visual attribute vectors added to their latent representations, like changing hairstyles and skin colors.

Code: https://github.com/andersbll/autoencoding_beyond_pixels

S²-GAN(2016)

SS-GAN, or S²-GAN as I’d like to call it, separates the structure generator from the style generator. This enables changes to the underlying structure, or lets the style generator be used as a renderer that creates textures for a sketch of surface-normal structure. There are multiple GAN networks in S²-GAN. The structure GAN and the style GAN were trained independently first, before the whole model was trained jointly. Within the style GAN, there are two discriminators trained together: one to tell true images from fake images, the other to tell whether surface normals can be restored from the generated images.

Generation process
style GAN

It is said to outperform DCGAN and LAPGAN in generated image quality.

pix2pix(2016)

If you haven’t tried the pix2pix software yet, here is the link. It is a very interesting app that generates textured, colorful images from sketches, or images in a totally different style. Compared to the style GAN from S²-GAN, the input can be a few lines of sketch rather than a perfectly structured surface-normal map. The following figures show some applications built with the pix2pix model. Pix2pix uses cGAN (conditional generative adversarial networks) as a general-purpose solution to image-to-image translation problems. This is not the first use case of cGAN; however, this framework differs in that nothing is application-specific. In each case the same architecture and objective are used, simply trained on different data.

Applications built with pix2pix

For the generator, pix2pix uses a “U-Net”-based architecture, and for the discriminator it uses a convolutional “PatchGAN” classifier, which only penalizes structure at the scale of image patches. Such a patch-level discriminator architecture has fewer parameters than a full-image discriminator and can work on arbitrarily sized images in a fully convolutional fashion. The authors also found it beneficial to mix the GAN objective with a more traditional loss, such as L1 distance.
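The final objective combines the conditional GAN loss with an L1 reconstruction term weighted by λ:

G^* = \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda \, \mathcal{L}_{L1}(G), \qquad \mathcal{L}_{L1}(G) = \mathbb{E}_{x, y, z}\big[\, \|y - G(x, z)\|_1 \,\big]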

LS-GAN (2017)

Viewing the discriminator as a classifier, regular GANs adopt the sigmoid cross-entropy loss function. When updating the generator, this loss function causes vanishing gradients for samples that are on the correct side of the decision boundary but still far from the real data. We will talk more about this in the WGAN part. To remedy this problem, Least Squares Generative Adversarial Networks (LSGANs) were proposed.
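With the 0-1 coding scheme for labels (one of the label choices discussed in the paper), the least-squares objectives are:

\min_D \; \tfrac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}}\big[(D(x) - 1)^2\big] + \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z}\big[D(G(z))^2\big], \qquad \min_G \; \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z}\big[(D(G(z)) - 1)^2\big]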

Even when batch normalization is removed from the networks (see the figure below), there is a chance for LSGANs to generate relatively good quality images.

Generated images when batch normalization was removed

WGAN (2017)

A known caveat of GANs is gradient vanishing. Minimizing the GAN objective function with an optimal discriminator is equivalent to minimizing the JS divergence, which is a combination of two KL divergences. However, if the generated images have a distribution q far away from the ground truth p, the gradient of the JS divergence is zero, and the generator barely learns anything. Wasserstein GAN (WGAN) was proposed to solve the gradient vanishing problem of GAN. It replaces the original GAN value function with the Wasserstein distance, which has a smoother gradient everywhere.

WGAN solved the gradient vanishing problem of GAN when real and fake distributions are far away

The Wasserstein distance is the minimum cost of transporting mass to convert the data distribution q into the data distribution p. It is highly intractable, but it can be simplified as the least upper bound of
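W(p, q) = \sup_{\|f\|_L \leq 1} \; \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{x \sim q}[f(x)]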

where f is a 1-Lipschitz function. f can be learned in a similar way to the discriminator network, except that it does not have an output sigmoid function, and weight clipping is required for the weights of f to follow the 1-Lipschitz constraint.

A side-by-side comparison of GAN and WGAN (source):

GAN:
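The discriminator ascends, and the generator descends, these stochastic gradients of the log-loss (as in Algorithm 1 of the original GAN paper):

\nabla_{\theta_d} \frac{1}{m}\sum_{i=1}^{m}\Big[\log D\big(x^{(i)}\big) + \log\big(1 - D(G(z^{(i)}))\big)\Big], \qquad \nabla_{\theta_g} \frac{1}{m}\sum_{i=1}^{m}\log\big(1 - D(G(z^{(i)}))\big)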

WGAN:
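The critic f_w (no output sigmoid, weights clipped to [−c, c] after each update) and the generator use the Wasserstein estimate directly (as in Algorithm 1 of the WGAN paper):

\nabla_{w} \frac{1}{m}\sum_{i=1}^{m}\Big[f_w\big(x^{(i)}\big) - f_w\big(G(z^{(i)})\big)\Big], \qquad \nabla_{\theta_g}\, \Big[-\frac{1}{m}\sum_{i=1}^{m} f_w\big(G(z^{(i)})\big)\Big]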

WGAN has alleviated the gradient vanishing problem of GAN, and showed great improvement when there are batch effects in the input datasets.

WGAN-GP (2017)

A problem with WGAN is that enforcing the Lipschitz constraint by weight clipping is too brute-force. The model performance is very sensitive to the weight clipping hyper-parameter c. Weight clipping also behaves like weight regularization: it reduces the capacity of the model f and limits its capability to model complex functions. So instead of applying clipping, WGAN-GP penalizes the model if the gradient norm moves away from its target value of 1. WGAN-GP was shown to achieve improved model stability even when the neural network design is not great.
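The critic loss with gradient penalty, where x̂ is sampled along straight lines between real and generated samples and λ is typically set to 10:

L = \mathbb{E}_{\tilde{x} \sim p_g}[D(\tilde{x})] - \mathbb{E}_{x \sim p_{\text{data}}}[D(x)] + \lambda \, \mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\big[\big(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\big)^2\big]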

comparing WGAN-GP with other models

CycleGAN(2017)

In many scenarios, many-to-one image mappings can occur, causing mode collapse. CycleGAN and UNIT were proposed to improve one-to-one mapping and prevent mode collapse by constructing an invertible mapping. In this sense, even though there is no specifically paired output, the model can learn by inverting the generated image, reconstructing the input, and minimizing the loss between the reconstructed and ground-truth input. CycleGAN and UNIT can make use of unpaired data. CycleGAN is built on top of pix2pix for style transfer. Recall from pix2pix that, during training, paired sketches X and textured images Y are required to train the GAN. What if X and Y are unpaired? For example, X could be a set of real photos, and Y a set of random oil paintings.

Left: paired input and output images as one-to-one mapping; Right: unpaired input and output images as two unrelated sets.

CycleGAN seeks an algorithm that can learn to translate between domains without paired input-output examples. CycleGAN exploits the property that the X -> Y and Y -> X mappings should be “cycle consistent”. A cycle consistency loss encourages F(G(x)) ≈ x and G(F(y)) ≈ y. Combining this loss with adversarial losses on domains X and Y yields the full objective for unpaired image-to-image translation.

Cycle consistency loss:
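\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\|F(G(x)) - x\|_1\big] + \mathbb{E}_{y \sim p_{\text{data}}(y)}\big[\|G(F(y)) - y\|_1\big]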

Forward GAN loss:
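\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{\text{data}}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log\big(1 - D_Y(G(x))\big)\big]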

where G tries to generate images G(x) that look similar to images from domain Y, while Dᵧ aims to distinguish between translated samples G(x) and real samples y.

The CycleGAN loss is a sum of the cycle consistency loss and the forward and backward GAN losses:
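\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda \, \mathcal{L}_{cyc}(G, F)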

70 × 70 PatchGANs were used in the discriminator networks, which aim to classify whether 70 × 70 overlapping image patches are real or fake.

CycleGAN

UNIT(2017)

UNIT, like CycleGAN, was also proposed for unpaired image-to-image translation. It assumes a latent space z shared by the two image style domains, X₁ and X₂. It couples two VAE-GANs together, X₁ -> X₂ and X₂ -> X₁.

The advantage of using VAE-GAN is that it is capable of attribute-based style translation.

BicycleGAN(2018)

A significant limitation of most existing image-to-image translation methods is the lack of diversity in the translated outputs. To tackle this problem, BicycleGAN and MUNIT were proposed, where one image is translated into many images in another style. They absorbed the latest advances in solving mode collapse and in conditional GANs.

BicycleGAN is designed to learn a mapping G : (A, z) -> B, where B is the mapped output image, z is the latent code, and A is the input image. Unlike its predecessor pix2pix, which only learns to reconstruct the output B with z as a noise vector, BicycleGAN learns to reconstruct both B and the latent code z, meaning z is explicitly enforced to capture useful information. BicycleGAN, true to its name, has two hybrid cycles in the architecture: B -> z -> B̂, and z -> B̂ -> ẑ. The first cycle is learned by training a cVAE-GAN, and the second by a cLR-GAN. Below is a comparison of pix2pix vs BicycleGAN.

pix2pix:

pix2pix as a baseline model

BicycleGAN:

BicycleGAN architecture

It is mentioned that the model was built on LSGAN, which, as we mentioned, uses a least-squares objective instead of a cross-entropy loss, stabilizing the training.

Since the latent code z is meaningful, it can be altered to generate output images with different attributes. Below is an example:

BicycleGAN example results

MUNIT(2018)

MUNIT was proposed to solve multimodal unsupervised image translation, like BicycleGAN. Compared to UNIT, it separates the style code from the content code. Given a random style code sampled from the style space of the target domain, it allows users to control the style of the translation outputs.

MUNIT

Future work: other GANs

There are many other variants of GAN I haven’t reviewed yet, including but not limited to InfoGAN (2016), BiGAN (2016), ALI (2016), EBGAN (2016), ACGAN (2017), DualGAN (2017), DiscoGAN (2017), PatchGAN, CoupledGAN, and DRIT (2018).

Applications in biomedical fields include, but are not limited to, biomedical imaging augmentation, de-noising/enhancing the quality of images, transferring from MRI to CT, organ segmentation, etc. We can do a deeper dive into this part later.

References

[1] Goodfellow, I., et al. “Generative adversarial nets.” Advances in Neural Information Processing Systems, 2014. https://arxiv.org/abs/1406.2661

[2] Oord, A. van den, Kalchbrenner, N., Kavukcuoglu, K. “Pixel recurrent neural networks.” arXiv preprint arXiv:1601.06759, 2016.

[3] Van den Oord, A., Kalchbrenner, N., Espeholt, L., et al. “Conditional image generation with PixelCNN decoders.” Advances in Neural Information Processing Systems, 2016: 4790–4798.

[4] Kingma, D. P., Welling, M. “Auto-encoding variational Bayes.” arXiv preprint arXiv:1312.6114, 2013. https://arxiv.org/pdf/1312.6114.pdf

[5] Kingma, D. P., Dhariwal, P. “Glow: Generative flow with invertible 1x1 convolutions.” Advances in Neural Information Processing Systems, 2018: 10215–10224.

[6] Arjovsky, M., Chintala, S., Bottou, L. “Wasserstein GAN.” arXiv preprint arXiv:1701.07875, 2017. https://arxiv.org/abs/1701.07875

[7] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A. “Improved training of Wasserstein GANs.” arXiv preprint arXiv:1704.00028, 2017. https://arxiv.org/abs/1704.00028

[8] Mao, X., Li, Q., Xie, H., et al. “Least squares generative adversarial networks.” Proceedings of the IEEE International Conference on Computer Vision, 2017: 2794–2802.

[9] Radford, A., Metz, L., Chintala, S. “Unsupervised representation learning with deep convolutional generative adversarial networks.” arXiv preprint arXiv:1511.06434, 2015.

[10] Donahue, J., Krähenbühl, P., Darrell, T. “Adversarial feature learning.” arXiv preprint arXiv:1605.09782, 2016. https://arxiv.org/abs/1605.09782

[11] Dumoulin, V., Belghazi, I., Poole, B., Lamb, A., Arjovsky, M., Mastropietro, O., Courville, A. “Adversarially learned inference.” arXiv preprint arXiv:1606.00704, 2016. https://arxiv.org/abs/1606.00704

[12] Larsen, A. B. L., Sønderby, S. K., Larochelle, H., Winther, O. “Autoencoding beyond pixels using a learned similarity metric.” arXiv preprint arXiv:1512.09300, 2015. https://arxiv.org/abs/1512.09300

[13] Odena, A., Olah, C., Shlens, J. “Conditional image synthesis with auxiliary classifier GANs.” Proceedings of the 34th International Conference on Machine Learning, 2017: 2642–2651.

[14] Mirza, M., Osindero, S. “Conditional generative adversarial nets.” arXiv preprint arXiv:1411.1784, 2014. https://arxiv.org/abs/1411.1784

[15] Chen, X., Duan, Y., Houthooft, R., et al. “InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets.” Advances in Neural Information Processing Systems, 2016: 2172–2180.

[16] Zhao, J., Mathieu, M., LeCun, Y. “Energy-based generative adversarial network.” arXiv preprint arXiv:1609.03126, 2016.

[17] Huang, H., Yu, P. S., Wang, C. “An introduction to image synthesis with generative adversarial nets.” arXiv preprint arXiv:1803.04469, 2018.

[18] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X. “Improved techniques for training GANs.” Advances in Neural Information Processing Systems, 2016: 2226–2234. https://arxiv.org/abs/1606.03498

[19] Nowozin, S., Cseke, B., Tomioka, R. “f-GAN: Training generative neural samplers using variational divergence minimization.” arXiv preprint arXiv:1606.00709, 2016. https://arxiv.org/abs/1606.00709

[20] Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H. “Generative adversarial text to image synthesis.” arXiv preprint arXiv:1605.05396, 2016. https://arxiv.org/abs/1605.05396

[21] Wang, X., Gupta, A. “Generative image modeling using style and structure adversarial networks.” arXiv preprint arXiv:1603.05631, 2016. https://arxiv.org/abs/1603.05631

[22] Denton, E. L., Chintala, S., Szlam, A., Fergus, R. “Deep generative image models using a Laplacian pyramid of adversarial networks.” Advances in Neural Information Processing Systems, 2015: 1486–1494. https://arxiv.org/abs/1506.05751

[23] Isola, P., Zhu, J.-Y., Zhou, T., Efros, A. A. “Image-to-image translation with conditional adversarial networks.” arXiv preprint arXiv:1611.07004, 2016. https://arxiv.org/abs/1611.07004

[24] Zhu, J.-Y., Park, T., Isola, P., Efros, A. A. “Unpaired image-to-image translation using cycle-consistent adversarial networks.” arXiv preprint arXiv:1703.10593, 2017. https://arxiv.org/abs/1703.10593

[25] Yi, Z., Zhang, H., Tan, P., Gong, M. “DualGAN: Unsupervised dual learning for image-to-image translation.” arXiv preprint arXiv:1704.02510, 2017. https://arxiv.org/abs/1704.02510

[26] Kim, T., Cha, M., Kim, H., Lee, J., Kim, J. “Learning to discover cross-domain relations with generative adversarial networks.” arXiv preprint arXiv:1703.05192, 2017. https://arxiv.org/abs/1703.05192

[27] Liu, M.-Y., Breuel, T., Kautz, J. “Unsupervised image-to-image translation networks.” Advances in Neural Information Processing Systems, 2017: 700–708.

[28] Zhu, J.-Y., Zhang, R., Pathak, D., et al. “Toward multimodal image-to-image translation.” Advances in Neural Information Processing Systems, 2017: 465–476.

[29] Huang, X., Liu, M.-Y., Belongie, S., et al. “Multimodal unsupervised image-to-image translation.” Proceedings of the European Conference on Computer Vision (ECCV), 2018: 172–189.

[30] Lee, H.-Y., Tseng, H.-Y., Huang, J.-B., et al. “Diverse image-to-image translation via disentangled representations.” Proceedings of the European Conference on Computer Vision (ECCV), 2018: 35–51.

[31] https://mp.weixin.qq.com/s/LJfsadhp3WGi0tLXN7jCUA

[32] https://medium.com/@jonathan_hui/gan-wasserstein-gan-wgan-gp-6a1a2aa1b490

[33] http://kvfrans.com/variational-autoencoders-explained/
