A Survey On Autoencoders (Part 1)

Şafak Bilici
Yıldız Technical University - Sky Lab
5 min read · May 16, 2021

Autoencoders are an unsupervised learning architecture in neural networks. They are commonly used in deep learning tasks such as generative modeling, anomaly detection, and dimensionality reduction. In this article, we will go over the theory behind autoencoders and look at their extensions.

Introduction

Autoencoders are an unsupervised learning method. They map the input data into a lower-dimensional space with an encoder E, and then map that representation back into a space with the same dimension as the input with a decoder D.

The main idea behind autoencoders is to attempt to copy the input to the output. The input layer is fed with an input vector x, and the loss is calculated at the output layer between x and D(E(x)); in other words, the loss is L(x, D(E(x))). It measures the difference between the original input and its reconstruction. We call the middle layer, the connection between the encoder E and the decoder D, the “bottleneck”. We denote the output of the bottleneck as h = E(x) and the output of the network as x̂ = D(h) = D(E(x)). We can also define the encoder and the decoder as conditional probability distributions p_{encoder}(h|x) and p_{decoder}(x̂|h). The loss function L(x, x̂) is called the reconstruction loss. We can treat the whole process as a feedforward network: the loss can be minimized over mini-batches by following gradients computed with the backpropagation algorithm.

The bottleneck is the key to the effectiveness of autoencoders. We map the input vector to the bottleneck, which keeps the “latent information” of the input x: the network represents the input, but in lower dimensions. In other words, it behaves like an approximate compression algorithm whose encoding parameters are learned during training. The decoder then maps the bottleneck representation h back into the same dimension as the input x, so the whole procedure can be seen as approximately extracting (decompressing) the compressed latent information.
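To make this concrete, here is a minimal sketch of such an encoder/decoder pair in PyTorch. The layer sizes, the ReLU activations, and the use of mean squared error as the reconstruction loss L are illustrative assumptions, not choices fixed by the text:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, bottleneck_dim=32):
        super().__init__()
        # Encoder E: maps x to the lower-dimensional bottleneck h = E(x)
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, bottleneck_dim),
        )
        # Decoder D: maps h back to the input space, x_hat = D(h)
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 128),
            nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        h = self.encoder(x)        # bottleneck representation
        x_hat = self.decoder(h)    # reconstruction
        return x_hat

# Reconstruction loss L(x, D(E(x))), here taken to be mean squared error
model = Autoencoder()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(64, 784)           # dummy mini-batch standing in for real data
x_hat = model(x)
loss = criterion(x_hat, x)
loss.backward()                    # gradients via backpropagation
optimizer.step()
```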

Undercomplete Autoencoders

The simplest idea behind autoencoders is to decrease the number of nodes through the hidden layers before the bottleneck. An autoencoder whose bottleneck has a lower dimension than the input x is called an undercomplete autoencoder. When we minimize the reconstruction error, the autoencoder learns to represent the latent attributes of the input data in fewer dimensions than x has. This procedure is similar to Principal Component Analysis (PCA), but in a non-linear way. When the decoder is linear and the loss L(x, x̂) is the L² error, an undercomplete autoencoder learns to span the same subspace as PCA. When the autoencoder has non-linear activations it becomes more powerful and generalizes better for dimensionality reduction; it becomes a non-linear version of PCA.
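For reference, the linear undercomplete case can be written out explicitly; the weight matrices W and V below are my notation for the encoder and decoder, not symbols from the original text:

```latex
% Linear encoder h = W x with W in R^{k x d} (k < d),
% linear decoder \hat{x} = V h with V in R^{d x k}.
\min_{W, V} \; \sum_{n=1}^{N} \left\lVert x_n - V W x_n \right\rVert_2^2
% At the optimum, V W projects the data onto the subspace spanned by the
% top-k principal components, i.e. the same subspace recovered by PCA;
% non-linear activations generalize this beyond a linear projection.
```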

Problem Of Autoencoders

When we say that the main idea behind autoencoders is to copy the input to the output, the key point is not to copy without extracting useful information about the distribution of the data. If autoencoders are allowed too much capacity, they can easily be trained to perform the copying task without learning anything useful about the dataset. So we need to penalize such autoencoders.

Regularizations

As we said, autoencoders can be allowed too much capacity. Regularized autoencoders are pushed towards finding the latent features of the input instead of merely copying it. There are many regularization methods that prevent the pure copying task, such as sparse autoencoders and denoising autoencoders.

Denoising Autoencoders

We can achieve the goal of learning useful information about the data by adding some noise to the input.

To perform the denoising, the input x is corrupted into x̃ through a stochastic mapping x̃ ~ p_N(x̃|x). The noisy (corrupted) input x̃ is then used for the encoding and decoding parts.

The minimization task is then updated to minimizing L(x, D(E(x̃))): the reconstruction is computed from the corrupted input, but compared against the clean input.

Denoising can be seen as forcing the model to learn the latent features of the input by adding noise to it and then penalizing it with the reconstruction loss L(x, D(E(x̃))).
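As a sketch of one training step, reusing the Autoencoder class, criterion, and optimizer from the earlier snippet, and assuming additive Gaussian noise as the corruption p_N(x̃|x) (masking noise, as in Vincent et al., would work just as well):

```python
import torch

def corrupt(x, noise_std=0.3):
    """Sample a corrupted input x_tilde ~ p_N(x_tilde | x) via additive Gaussian noise."""
    return x + noise_std * torch.randn_like(x)

x = torch.randn(64, 784)       # dummy mini-batch of clean inputs
x_tilde = corrupt(x)           # corrupted version fed to the network

x_hat = model(x_tilde)         # reconstruct from the noisy input
loss = criterion(x_hat, x)     # L(x, D(E(x_tilde))): compare against the CLEAN x
loss.backward()
optimizer.step()
```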

Sparse Autoencoders

Sparsity simply comes from adding a shrinkage penalty to the reconstruction loss, as in other machine learning tasks.

The minimization then becomes L(x, D(E(x))) + Ω(h), where Ω(h) is a sparsity penalty on the bottleneck code h, for example the L¹ penalty λ Σ_i |h_i|.

This procedure can be interpreted as Bayesian inference in terms of a posterior, a likelihood, and a prior: posterior ∝ likelihood × prior. From this point of view, minimizing the penalized reconstruction loss corresponds to maximizing the posterior over the code h given x (MAP estimation), rather than the likelihood alone.

Now we can rewrite this distribution in terms of the prior over h and the likelihood of x given h: p(h|x) ∝ p(x|h) p(h).

For simplicity, let us consider a zero-mean Laplace prior, p(h_i) = (λ/2) exp(−λ|h_i|).

Taking the negative logarithm, the negative log-posterior becomes −log p(x|h) + λ Σ_i |h_i| + const: the first term plays the role of the reconstruction loss, and the second term is exactly the L¹ sparsity penalty Ω(h).

Other prior distributions can be used as well: a Student-t prior similarly encourages sparsity, while a Gaussian prior corresponds to an L² penalty rather than a sparse one.
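A sketch of the corresponding training step, again reusing the Autoencoder class, criterion, and optimizer from the first snippet; the L¹ term on the bottleneck activations stands in for the negative log of the Laplace prior, and the weight lam (λ) is an assumed hyperparameter:

```python
import torch

lam = 1e-3                                     # assumed sparsity weight (lambda)

x = torch.randn(64, 784)                       # dummy mini-batch

h = model.encoder(x)                           # bottleneck code h = E(x)
x_hat = model.decoder(h)                       # reconstruction D(E(x))

reconstruction = criterion(x_hat, x)           # L(x, D(E(x)))
sparsity = lam * h.abs().sum(dim=1).mean()     # L1 penalty from the Laplace prior
loss = reconstruction + sparsity               # penalized objective

loss.backward()
optimizer.step()
```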

Part 2

In part 2 of this post, we will discuss the theoretical side of Variational Autoencoders (VAEs) and try to implement them.

References

  • Charu C. Aggarwal. Neural Networks and Deep Learning. Berlin, Heidelberg: Springer International Publishing, 2018. ISBN: 978-3-319-94463-0.
  • Jaan Altosaar. Understanding Variational Autoencoders from two perspectives: deep learning and graphical models. https://jaan.io/what-is-variational-autoencoder-vae-tutorial/.
  • Carl Doersch. Tutorial on Variational Autoencoders. 2016. arXiv: 1606.05908 [stat.ML].
  • Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
  • Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. 2014. arXiv: 1312.6114 [stat.ML].
  • The Variational Auto-Encoder. https://ermongroup.github.io/cs228-notes/extras/vae/.
  • Pascal Vincent et al. “Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion”. In: J. Mach. Learn. Res. 11 (Dec. 2010), pp. 3371–3408. ISSN: 1532-4435.
