Review — AAE: Adversarial Autoencoders (GAN)

GAN Combined With Autoencoder

Sik-Ho Tsang
Published in Nerd For Tech
6 min read · Apr 24, 2021


In this story, Adversarial Autoencoders (AAE), by University of Toronto, Google Brain, and OpenAI, is briefly reviewed. Only the AAE variants are described. Ian Goodfellow, the first author of GAN, is one of the co-authors of this paper. In this paper:

  • AAE is a probabilistic autoencoder that uses GAN to match the aggregated posterior of the hidden code with an arbitrary prior distribution.
  • The decoder of the adversarial autoencoder learns a deep generative model that maps the imposed prior to the data distribution.

This is a paper in 2016 ICLR with over 1600 citations. (Sik-Ho Tsang @ Medium)


  1. AAE: Network Architecture
  2. AAE vs VAE
  3. Supervised AAE
  4. Semi-Supervised AAE
  5. Unsupervised AAE
  6. Dimension Reduction for Data Visualization Using AAE

1. AAE: Network Architecture

AAE: Network Architecture (+: Positive Samples, -: Negative Samples)
  • The top row is a standard autoencoder that reconstructs an image x from a latent code z.
  • The bottom row diagrams a second network trained to discriminatively predict whether a sample arises from the hidden code of the autoencoder or from a sampled distribution specified by the user.
  • Let p(z) be the prior distribution we want to impose on the codes, q(z|x) be an encoding distribution and p(x|z) be the decoding distribution.
  • Also let pd(x) be the data distribution, and p(x) be the model distribution. The encoding function of the autoencoder q(z|x) defines an aggregated posterior distribution q(z) on the hidden code vector of the autoencoder as:
      q(z) = ∫ q(z|x) pd(x) dx
  • It is the adversarial network that guides q(z) to match p(z).
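To make the aggregated posterior more concrete, here is a minimal sketch of drawing samples from q(z): first pick x from the data, then pass it through the encoder. The toy `encode` function (a Gaussian with mean 0.5*x and standard deviation 0.1) and the tiny scalar dataset are assumptions for illustration only, not the encoder used in the paper.

```python
import random

def encode(x, rng):
    """Hypothetical stochastic encoder q(z|x): Gaussian with mean 0.5*x, std 0.1."""
    return rng.gauss(0.5 * x, 0.1)

def sample_aggregated_posterior(data, n_samples, rng):
    """Draw z ~ q(z) by ancestral sampling: x ~ pd(x), then z ~ q(z|x)."""
    samples = []
    for _ in range(n_samples):
        x = rng.choice(data)          # x ~ pd(x), approximated by the empirical data
        samples.append(encode(x, rng))
    return samples

rng = random.Random(0)
data = [-2.0, -1.0, 0.0, 1.0, 2.0]    # toy scalar "dataset" (assumption)
z_samples = sample_aggregated_posterior(data, 5000, rng)
z_mean = sum(z_samples) / len(z_samples)
```

These z samples are what the discriminator sees as negative samples, while draws from the prior p(z) serve as positive samples.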

The autoencoder attempts to minimize the reconstruction error.

The generator of the adversarial network is also the encoder of the autoencoder q(z|x). The encoder ensures the aggregated posterior distribution can fool the discriminative adversarial network into thinking that the hidden code q(z) comes from the true prior distribution p(z).

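  • Both the adversarial network and the autoencoder are trained jointly with SGD in two phases, executed on each mini-batch: the reconstruction phase and the regularization phase.
  • In the reconstruction phase, the autoencoder updates the encoder and the decoder to minimize the reconstruction error of the inputs.
  • In the regularization phase, the adversarial network first updates its discriminator to tell apart the true samples (drawn from the prior) from the generated samples (the hidden codes), and then updates its generator, which is the encoder, to confuse the discriminator.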
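The two phases above can be sketched with a deliberately tiny model. This is a toy under stated assumptions, not the paper's implementation: the data are scalars, the encoder and decoder are single linear maps, the discriminator is a logistic regression on the features z and z^2, the prior is N(0, 1), and the gradients are derived by hand. The names `train_toy_aae` and `recon_loss` are hypothetical.

```python
import math
import random

def sigmoid(t):
    t = max(-30.0, min(30.0, t))      # clamp so exp() stays well-behaved
    return 1.0 / (1.0 + math.exp(-t))

def recon_loss(p, data):
    """Mean squared reconstruction error of the toy autoencoder."""
    return sum((p["c"] * (p["a"] * x + p["b"]) + p["d"] - x) ** 2
               for x in data) / len(data)

def train_toy_aae(data, steps=3000, lr=0.01, seed=0):
    rng = random.Random(seed)
    # Encoder z = a*x + b, decoder x_hat = c*z + d,
    # discriminator logit = u1*z + u2*z^2 + v (all toy assumptions).
    p = {"a": 0.1, "b": 0.0, "c": 0.1, "d": 0.0, "u1": 0.0, "u2": 0.0, "v": 0.0}

    def logit(z):
        return p["u1"] * z + p["u2"] * z * z + p["v"]

    for _ in range(steps):
        x = rng.choice(data)

        # Reconstruction phase: encoder and decoder descend the squared error.
        z = p["a"] * x + p["b"]
        err = p["c"] * z + p["d"] - x
        grads = {"a": 2 * err * p["c"] * x, "b": 2 * err * p["c"],
                 "c": 2 * err * z, "d": 2 * err}
        for name, g in grads.items():
            p[name] -= lr * g

        # Regularization phase, step 1: update the discriminator.
        z_fake = p["a"] * x + p["b"]      # negative sample: hidden code
        z_real = rng.gauss(0.0, 1.0)      # positive sample: drawn from p(z)
        for zz, target in ((z_real, 1.0), (z_fake, 0.0)):
            g = sigmoid(logit(zz)) - target   # d(cross-entropy)/d(logit)
            p["u1"] -= lr * g * zz
            p["u2"] -= lr * g * zz * zz
            p["v"] -= lr * g

        # Regularization phase, step 2: update the generator (the encoder)
        # so that its code looks like a prior sample to the discriminator.
        g = sigmoid(logit(z_fake)) - 1.0      # d(-log D(z))/d(logit)
        dz = g * (p["u1"] + 2 * p["u2"] * z_fake)
        p["a"] -= lr * dz * x
        p["b"] -= lr * dz
    return p

data = [1.0, 1.5, 2.0, 2.5, 3.0]
loss_before = recon_loss({"a": 0.1, "b": 0.0, "c": 0.1, "d": 0.0}, data)
params = train_toy_aae(data)
loss_after = recon_loss(params, data)
```

Note that the encoder parameters are touched in both phases: once to reconstruct, and once as the GAN generator.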

2. AAE vs VAE

  • The hidden code z of the hold-out images is fit to (A/C) a 2-D Gaussian and (B/D) a mixture of 10 2-D Gaussians.
  • A: The manifold learned by AAE exhibits sharp transitions, indicating that the coding space is filled and exhibits no “holes”.
  • C: VAE roughly matches the shape of a 2-D Gaussian distribution. However, no data points map to several local regions of the coding space, indicating that the VAE may not have captured the data manifold as well as the AAE.
  • B: AAE successfully matches the aggregated posterior with the prior distribution.
  • D: In contrast, the VAE exhibits systematic differences from the mixture of 10 Gaussians.

3. Supervised AAE

Supervised AAE
  • Before going into semi-supervised AAE, a supervised AAE is tried first, where the architecture separates the class label information from the image style information.
  • The decoder utilizes both the one-hot vector identifying the label and the hidden code z to reconstruct the image.

This architecture forces the network to retain all information independent of the label in the hidden code z.
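As a small illustration of how the decoder input could be formed, the sketch below concatenates a one-hot label with the hidden code z. The helper names `one_hot` and `decoder_input`, the 10 classes, and the 2-D style code are assumptions for illustration; the paper's actual layer sizes are not given here.

```python
def one_hot(label, num_classes):
    """One-hot vector identifying the class label."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

def decoder_input(label, z, num_classes=10):
    """Decoder sees the label explicitly, so z only needs to carry the style."""
    return one_hot(label, num_classes) + list(z)

style = [0.3, -1.2]                    # hypothetical 2-D style code z
inp_3 = decoder_input(3, style)        # same style, rendered as a "3"
inp_7 = decoder_input(7, style)        # same style, rendered as a "7"
```

Keeping `style` fixed while changing the label is how one would, under this setup, generate different digits in the same writing style.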

4. Semi-Supervised AAE

Semi-Supervised AAE
  • The supervised AAE is further modified into the semi-supervised AAE shown above.

The inference network of the AAE predicts both the discrete class variable y and the continuous latent variable z using the encoder q(z,y|x).

  • The first adversarial network imposes a Categorical distribution on the label representation. This adversarial network ensures that the latent class variable y does not carry any style information.
  • The second adversarial network imposes a Gaussian distribution on the style representation which ensures the latent variable z is a continuous Gaussian variable.
  • Both of the adversarial networks as well as the autoencoder are trained jointly with SGD in three phases: the reconstruction phase, the regularization phase and the semi-supervised classification phase.
  • In the reconstruction phase, the autoencoder updates the encoder q(z,y|x) and the decoder to minimize the reconstruction error of the inputs on an unlabeled mini-batch.
  • In the regularization phase, each of the adversarial networks first updates its discriminative network to tell apart the true samples from the generated samples.
  • The adversarial networks then update their generators to confuse their discriminative networks.
  • In the semi-supervised classification phase, the autoencoder updates q(y|x) to minimize the cross-entropy cost on a labeled mini-batch.
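For the label branch, the pieces described above can be sketched as follows: a one-hot draw from the Categorical prior (a positive sample for the label discriminator), the softmax output of the encoder q(y|x) (a negative sample), and the cross-entropy cost used in the classification phase. The uniform Categorical prior, the 3 classes, and the example logits are assumptions for illustration.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax, giving q(y|x) from encoder logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_categorical_prior(num_classes, rng):
    """One-hot sample from a uniform Categorical prior (assumed uniform)."""
    k = rng.randrange(num_classes)
    return [1.0 if i == k else 0.0 for i in range(num_classes)]

def cross_entropy(probs, label):
    """Classification-phase cost for one labeled example."""
    return -math.log(probs[label])

rng = random.Random(0)
logits = [2.0, 0.5, -1.0]                   # hypothetical encoder outputs
q_y = softmax(logits)                       # negative sample for the label discriminator
prior_y = sample_categorical_prior(3, rng)  # positive sample for the label discriminator
loss = cross_entropy(q_y, 0)                # cost if the true label is class 0
```

Because the prior samples are exactly one-hot, the discriminator pushes q(y|x) toward confident, one-hot-like predictions.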
Semi-supervised classification performance (error-rate) on MNIST and SVHN
  • It is worth mentioning that all the AAE models are trained end-to-end, whereas the semi-supervised VAE models have to be trained one layer at a time.

On the MNIST dataset with 100 and 1000 labels, the performance of AAEs is significantly better than VAEs.

5. Unsupervised AAE

Unsupervised clustering of MNIST using the AAE with 16 clusters

The architecture is that of the semi-supervised AAE, except that the semi-supervised classification stage is removed, so the network is no longer trained on any labeled mini-batch.

  • As seen above, the tilted 1s and 6s (clusters 16 and 11) are put in separate clusters from the straight 1s and 6s (clusters 15 and 10).
Unsupervised clustering performance (error-rate) of the AAE on MNIST
  • Once the training is done, for each cluster i, the major correct label is assigned to all the points in the cluster i. Then the test error can be estimated based on the assigned class labels to each cluster.
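The evaluation described above can be sketched as below. This reads "major correct label" as a majority vote over the true labels inside each cluster, which is an assumption; the paper's exact assignment rule may differ. The function name `cluster_error_rate` and the toy cluster assignments are hypothetical.

```python
from collections import Counter

def cluster_error_rate(cluster_ids, true_labels):
    """Assign each cluster its majority label, then measure the error rate."""
    by_cluster = {}
    for c, y in zip(cluster_ids, true_labels):
        by_cluster.setdefault(c, []).append(y)
    # Majority vote per cluster (assumed reading of the assignment rule).
    cluster_to_label = {c: Counter(ys).most_common(1)[0][0]
                        for c, ys in by_cluster.items()}
    errors = sum(cluster_to_label[c] != y
                 for c, y in zip(cluster_ids, true_labels))
    return cluster_to_label, errors / len(true_labels)

clusters = [0, 0, 0, 1, 1, 2, 2, 2]    # toy cluster index per sample
labels   = [1, 1, 7, 7, 7, 6, 6, 6]    # toy true digit per sample
mapping, err = cluster_error_rate(clusters, labels)
```

With more clusters than classes, several clusters can map to the same digit, which is why tilted and straight 1s can both count as correct.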

As shown in the above table, the AAE achieves classification error rates of 9.55% and 4.10% with 16 and 30 total labels (clusters) respectively.

6. Dimension Reduction for Data Visualization Using AAE

Dimensionality reduction with adversarial autoencoders

The final n-dimensional representation is constructed by first mapping the one-hot label representation to an n-dimensional cluster head representation and then adding the result to an n-dimensional style representation.

  • n=2 or 3 for data visualization.
  • The cluster heads are learned by SGD with an additional cost function that penalizes the Euclidean distance between every two of them.
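A minimal sketch of the construction above: the one-hot label selects a cluster head, the style vector is added to it, and a pairwise cost keeps the heads apart. The hinge-style form of the cost (zero once two heads are farther apart than a threshold) and the threshold name `eta` with its value are assumptions for illustration, as are the function names and the toy cluster heads.

```python
import math

def final_representation(one_hot_label, cluster_heads, style):
    """n-D representation = cluster head picked by the one-hot label + style."""
    n = len(style)
    head = [sum(y * h[j] for y, h in zip(one_hot_label, cluster_heads))
            for j in range(n)]
    return [hj + sj for hj, sj in zip(head, style)]

def head_separation_penalty(cluster_heads, eta):
    """Assumed hinge-style cost: pairs of heads closer than eta are penalized."""
    cost = 0.0
    for i in range(len(cluster_heads)):
        for j in range(i + 1, len(cluster_heads)):
            dist = math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(cluster_heads[i], cluster_heads[j])))
            cost += max(0.0, eta - dist)
    return cost

heads = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]   # toy 2-D cluster heads (n=2)
rep = final_representation([0.0, 1.0, 0.0], heads, [0.1, -0.2])
penalty = head_separation_penalty(heads, eta=2.0)
```

In this toy setup only the first and third heads are closer than `eta`, so only that pair contributes to the cost.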
Semi-Supervised and Unsupervised Dimensionality Reduction with AAE on MNIST.
  • (There are more details for this part; please feel free to read the paper directly.)
  • Overall, we can see that AAE can achieve a clean separation of the digit clusters.

This is an early paper combining GAN with autoencoders. The main goal of using AAE in this paper is semi-supervised or unsupervised learning, rather than purely image-to-image translation or synthesizing images from latent vectors.


[2016 ICLR] [AAE]
Adversarial Autoencoders

Generative Adversarial Network (GAN)

Image Synthesis [GAN] [CGAN] [LAPGAN] [AAE] [DCGAN] [CoGAN] [SimGAN]
Image-to-image Translation [Pix2Pix] [UNIT]
Super Resolution [SRGAN & SRResNet] [EnhanceNet] [ESRGAN]
Blur Detection [DMENet]
Camera Tampering Detection [Mantini’s VISAPP’19]
Video Coding [VC-LAPGAN] [Zhu TMM’20] [Zhong ELECGJ’21]
