# Adversarial Autoencoders on MNIST: A Python/Keras Implementation

You can find the source code of this post at https://github.com/alimirzaei/adverserial-autoencoder-keras

In this post, I implemented three parts of the Adversarial Autoencoder paper [1]. The AAE can be seen as a combination of the Generative Adversarial Network (GAN) idea and the Variational Autoencoder idea. Variational Autoencoders are generative autoencoders: in addition to the reconstruction error, they minimize the KL-divergence between the distribution of the latent codes and a desired distribution (in most cases Gaussian). After the training phase, a new sample can be generated by sampling from the desired distribution and feeding the sample to the decoder.

Generative Adversarial Networks (GANs) are deep neural network architectures comprised of two networks pitted against each other (hence "adversarial"). The generator tries to generate fake images that fool the discriminator, while the discriminator tries to correctly tell fake and real images apart. GANs were introduced in a 2014 paper by Ian Goodfellow and other researchers at the University of Montreal, including Yoshua Bengio. Under this scheme, the generator learns to generate samples that follow the training data distribution.

An Adversarial Autoencoder (AAE) works like a Variational Autoencoder, but instead of minimizing the KL-divergence between the latent code distribution and the desired distribution, it uses a discriminator to tell latent codes apart from samples drawn from the desired distribution. Under this scheme, the encoder learns to generate codes that follow the desired distribution. To generate a new sample, you only need to sample from the desired distribution and feed the sample to the decoder. The scheme of the AAE is shown in the following figure:

# Adversarial Autoencoder

In this section, I implemented the scheme of the above figure. The desired distribution for the latent space is assumed to be Gaussian. In all implementations in this post, I used **Python** as the programming language and **Keras** as the deep learning framework.

I implemented the AAE scheme to generate MNIST images. The MNIST training set contains *60,000* handwritten digit images, each of dimension *28x28*, so the number of input features is *28x28 = 784*.

## The Encoder

As the paper suggests, I used two fully-connected hidden layers (1000 neurons each) followed by an *8*-neuron fully-connected output layer for the encoder. The hidden layers use the *ReLU* activation function, and the output layer has no activation function (linear). The table below shows the details of the encoder.

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
flatten_1 (Flatten)          (None, 784)               0
_________________________________________________________________
dense_1 (Dense)              (None, 1000)              785000
_________________________________________________________________
dense_2 (Dense)              (None, 1000)              1001000
_________________________________________________________________
dense_3 (Dense)              (None, 8)                 8008
=================================================================
Total params: 1,794,008
Trainable params: 1,794,008
Non-trainable params: 0
_________________________________________________________________
```
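The encoder summarized in the table above can be sketched in a few lines of Keras. This is a minimal reconstruction from the table, not necessarily the repository's exact code; the function and variable names are assumptions.

```python
# Minimal sketch of the encoder: 784 -> 1000 -> 1000 -> 8, matching the
# summary table above. Names are illustrative, not the repository's code.
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Sequential

def build_encoder(latent_dim=8):
    return Sequential([
        Input(shape=(28, 28)),
        Flatten(),                        # (None, 784)
        Dense(1000, activation='relu'),   # hidden layer 1
        Dense(1000, activation='relu'),   # hidden layer 2
        Dense(latent_dim),                # linear output: the latent code
    ])

encoder = build_encoder()
```

Calling `encoder.summary()` reproduces the parameter counts in the table (1,794,008 in total).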

## The Decoder

For the decoder, I used the same architecture as the encoder, with the sigmoid activation function on the output layer. The following table shows the details of the decoder.

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_4 (Dense)              (None, 1000)              9000
_________________________________________________________________
dense_5 (Dense)              (None, 1000)              1001000
_________________________________________________________________
dense_6 (Dense)              (None, 784)               784784
_________________________________________________________________
reshape_1 (Reshape)          (None, 28, 28)            0
=================================================================
Total params: 1,794,784
Trainable params: 1,794,784
Non-trainable params: 0
_________________________________________________________________
```
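A matching sketch of the decoder, again reconstructed from the table rather than taken from the repository (names are assumptions):

```python
# Minimal sketch of the decoder: 8 -> 1000 -> 1000 -> 784 -> (28, 28),
# with a sigmoid output so pixel intensities land in [0, 1].
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense, Reshape
from tensorflow.keras.models import Sequential

def build_decoder(latent_dim=8):
    return Sequential([
        Input(shape=(latent_dim,)),
        Dense(1000, activation='relu'),
        Dense(1000, activation='relu'),
        Dense(784, activation='sigmoid'),  # pixel intensities in [0, 1]
        Reshape((28, 28)),
    ])

decoder = build_decoder()
```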

## The Discriminator

The discriminator's role is to classify latent codes as real or fake, so its output is a single neuron. The detailed architecture of the discriminator is shown in the following table. The activation function is *ReLU* for the two hidden layers and sigmoid for the output layer.

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_7 (Dense)              (None, 1000)              9000
_________________________________________________________________
dense_8 (Dense)              (None, 1000)              1001000
_________________________________________________________________
dense_9 (Dense)              (None, 1)                 1001
=================================================================
Total params: 1,011,001
Trainable params: 1,011,001
Non-trainable params: 0
_________________________________________________________________
```
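The discriminator can be sketched the same way (a reconstruction from the table, with assumed names):

```python
# Minimal sketch of the discriminator: 8 -> 1000 -> 1000 -> 1.
# The sigmoid output estimates P(code was sampled from the prior).
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

def build_discriminator(latent_dim=8):
    return Sequential([
        Input(shape=(latent_dim,)),
        Dense(1000, activation='relu'),
        Dense(1000, activation='relu'),
        Dense(1, activation='sigmoid'),   # real (1) vs. fake (0)
    ])

discriminator = build_discriminator()
```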

## Training

I trained the network with a batch size of *100*. For each batch, the following steps are performed:

1- **Train Discriminator**: We feed 50 training images to the encoder and label the obtained latent codes as fake (label=0). We also draw 50 samples from the desired distribution, an 8-D Gaussian, and label them as real (label=1). We then train the discriminator on these latent codes and their corresponding labels, using the classification error.

2- **Train Autoencoder for Reconstruction Error**: The 100 training images are fed to the autoencoder (encoder and decoder), and the autoencoder is trained on the reconstruction error (MSE).

3- **Train Generator (Encoder)**: In this phase, we train the generator (encoder) to produce latent codes indistinguishable from the sampled ones. In other words, the encoder should be trained to fool the discriminator. To this end, we freeze the discriminator weights and train the encoder and discriminator together so that the discriminator classifies the latent codes of the fed images as real (label=1).
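The three steps above can be sketched as one training step in Keras. Random data stands in for an MNIST batch here, and the model-building details, variable names, and freezing pattern are assumptions (the standard Keras GAN idiom), not the repository's exact code:

```python
# One training step of the three-phase AAE procedure, on random stand-in data.
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense, Flatten, Reshape
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import SGD, Adam

latent_dim, batch = 8, 100

encoder = Sequential([Input(shape=(28, 28)), Flatten(),
                      Dense(1000, activation='relu'),
                      Dense(1000, activation='relu'),
                      Dense(latent_dim)])
decoder = Sequential([Input(shape=(latent_dim,)),
                      Dense(1000, activation='relu'),
                      Dense(1000, activation='relu'),
                      Dense(784, activation='sigmoid'),
                      Reshape((28, 28))])
discriminator = Sequential([Input(shape=(latent_dim,)),
                            Dense(1000, activation='relu'),
                            Dense(1000, activation='relu'),
                            Dense(1, activation='sigmoid')])
discriminator.compile(optimizer=SGD(0.01), loss='binary_crossentropy')

autoencoder = Sequential([encoder, decoder])
autoencoder.compile(optimizer=Adam(0.001), loss='mse')

# Freeze the discriminator inside the adversarial model (standard GAN idiom:
# the flag is captured at compile time, so discriminator.train_on_batch
# above still updates its weights).
discriminator.trainable = False
adversarial = Sequential([encoder, discriminator])
adversarial.compile(optimizer=SGD(0.01), loss='binary_crossentropy')

x = np.random.rand(batch, 28, 28)  # stand-in for an MNIST batch

# 1) Train discriminator: encoder codes are fake (0), prior samples real (1).
fake = encoder.predict(x[:50], verbose=0)
real = np.random.normal(size=(50, latent_dim))
d_loss = discriminator.train_on_batch(np.vstack([fake, real]),
                                      np.array([0.0] * 50 + [1.0] * 50))

# 2) Train the autoencoder on reconstruction error (MSE).
r_loss = autoencoder.train_on_batch(x, x)

# 3) Train the encoder to fool the frozen discriminator.
g_loss = adversarial.train_on_batch(x, np.ones(batch))
```

In a full run, these three calls would be repeated for every batch of the training set.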

## Results

The following figures show the generated images after 1000 and 4000 epochs. As shown, the images are sharp, not blurry like those of a Variational Autoencoder. SGD with a learning rate of *0.01* is used for the discriminator and generator, and Adam with a learning rate of *0.001* for the reconstruction phase.

# Incorporating Label Information in the Adversarial Regularization

The previous section is completely unsupervised. In scenarios where the data is labeled, we can incorporate the label information in the adversarial training stage to better shape the distribution of the hidden code. The proposed scheme is shown in the following figure. This scheme tries to map the latent codes of each digit to a specific Gaussian distribution. In addition, the one-hot code of the label is fed to the discriminator. In our implementation, we used a mixture of 10 Gaussian distributions. We trained this scheme in a semi-supervised manner: an extra dimension is added to the one-hot encoding (11 dimensions in total). If the label of a sample is not provided, the 11th element of the code is set to one, and the corresponding prior sample is drawn from the whole mixture of Gaussians.
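The prior sampling and the 11-dimensional label code can be sketched as follows. The 2-D mixture with components arranged on a circle follows the AAE paper's figures; the exact means, variances, and latent dimension used in this post are assumptions:

```python
# Hedged sketch of the semi-supervised prior: a mixture of 10 2-D Gaussians
# on a circle, one component per digit class. Parameters are assumptions.
import numpy as np

def sample_prior(label=None, n_classes=10, std=0.5, radius=4.0, rng=np.random):
    """Sample one latent code; label=None draws from the whole mixture."""
    k = rng.randint(n_classes) if label is None else label
    angle = 2 * np.pi * k / n_classes
    mean = radius * np.array([np.cos(angle), np.sin(angle)])
    return mean + std * rng.randn(2)

def one_hot(label, n_classes=11):
    """11-dim label code fed to the discriminator; index 10 marks 'unlabeled'."""
    v = np.zeros(n_classes)
    v[10 if label is None else label] = 1.0
    return v
```

For a labeled batch, `sample_prior(label)` draws from that digit's component; for an unlabeled one, `sample_prior(None)` draws from the full mixture and `one_hot(None)` activates the extra 11th dimension.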

## Implementation & Results

I trained the semi-supervised AAE using 40,000 labeled samples and 20,000 unlabeled samples. The architecture of the network is the same as the previous one. The conditionally generated samples are shown in the following image:

and the latent codes of some test images are plotted in the following figure. The details of the implementation are available in the source code.

# Supervised Adversarial Autoencoder

This section focuses on the fully supervised scenario and discusses an adversarial autoencoder architecture that can separate the class label information from the image style information.

In order to incorporate the label information, the paper alters the network architecture of the previous section to provide a one-hot encoding of the label to the decoder (see the following figure). The decoder uses both the one-hot vector identifying the label and the hidden code z to reconstruct the image. This architecture forces the network to retain all information independent of the label in the hidden code z.
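The label-conditioned decoder can be sketched with the Keras functional API. The layer sizes mirror the earlier decoder; the exact wiring in this post's code is an assumption:

```python
# Sketch of a decoder that receives both the style code z and the label
# one-hot, concatenated before the fully-connected stack.
from tensorflow.keras.layers import Concatenate, Dense, Input, Reshape
from tensorflow.keras.models import Model

def build_supervised_decoder(latent_dim=8, n_classes=10):
    z = Input(shape=(latent_dim,), name='style_code')
    y = Input(shape=(n_classes,), name='label_one_hot')
    h = Concatenate()([z, y])                  # (None, latent_dim + n_classes)
    h = Dense(1000, activation='relu')(h)
    h = Dense(1000, activation='relu')(h)
    x = Dense(784, activation='sigmoid')(h)
    x = Reshape((28, 28))(x)
    return Model([z, y], x)

supervised_decoder = build_supervised_decoder()
```

Fixing `z` while varying the one-hot input generates the same style across all ten digits, which is exactly the row/column structure shown in the results below.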

## Implementation & Results

I implemented the same architecture as in the above figure. The output of epoch 1000 is shown in the following figure. In this picture, each row corresponds to the same digit and each column to the same style.

## Conclusion

In this project, I implemented three schemes from the AAE paper: the original AAE, the semi-supervised AAE, and the supervised AAE. The details of the implementation are given in the source code. The optimization algorithms and their learning rates are chosen such that the networks converge correctly.

## Source Code

https://github.com/alimirzaei/adverserial-autoencoder-keras

## Bibliography

[1] Makhzani, Alireza, et al. "Adversarial Autoencoders." arXiv preprint arXiv:1511.05644 (2015).