Autoencoders (Part 1): What are Autoencoders?

DataOil St.
Jun 27, 2022


An autoencoder is a feed-forward neural network that encodes its input xᵢ into a hidden representation h and then decodes that hidden representation back into a reconstruction x̂ᵢ of the input.
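A minimal sketch of this encode/decode structure in PyTorch (the layer sizes, activations, and names here are illustrative assumptions, not prescribed by the definition above):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Feed-forward autoencoder: encode x into h, decode h back into x_hat."""
    def __init__(self, input_dim=784, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Linear(input_dim, hidden_dim)  # W, b
        self.decoder = nn.Linear(hidden_dim, input_dim)  # W*, c

    def forward(self, x):
        h = torch.sigmoid(self.encoder(x))      # hidden representation h = g(W x + b)
        x_hat = torch.sigmoid(self.decoder(h))  # reconstruction x_hat = f(W* h + c)
        return x_hat

x = torch.rand(32, 784)   # a batch of 32 flattened 28x28 images (illustrative)
x_hat = Autoencoder()(x)  # reconstruction has the same shape as the input
```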

Well, why Autoencoders?

  • To learn the important features of the input data xᵢ.
  • To reduce the dimension of the input data xᵢ by effectively keeping only its important features.
  • In some cases more powerful than PCA, since an autoencoder can learn nonlinear encodings.
  • Used for image denoising, compression and generation, anomaly detection, domain adaptation, determining system dynamics…

Deeper understanding of Autoencoders -

1. Dimensions of input data xᵢ and hidden representation h:

a) dim(h) < dim(xᵢ)

  • An undercomplete autoencoder.
  • If we are able to reconstruct x̂ᵢ perfectly from h, then h is a lossless encoding of xᵢ that captures all of its important characteristics.
  • Similar to PCA: a linear undercomplete autoencoder trained with squared-error loss learns the same subspace as PCA.

b) dim(h) > dim(xᵢ)

  • An overcomplete autoencoder.
  • The model could simply copy xᵢ into h and then copy h into x̂ᵢ, learning nothing beyond the identity mapping.
  • Without additional constraints it is of little use in practice (both regimes are contrasted in the sketch below).
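To make the two regimes concrete, here is how only the hidden size differs between them (the input size of 784 and the hidden sizes are illustrative assumptions):

```python
import torch.nn as nn

input_dim = 784  # illustrative input size, e.g. a flattened 28x28 image

# Undercomplete: dim(h) < dim(x) -- the bottleneck forces the network
# to keep only the most important features of x.
undercomplete_encoder = nn.Linear(input_dim, 32)

# Overcomplete: dim(h) > dim(x) -- with this much capacity the network
# can simply copy x into h, so it needs regularization to be useful.
overcomplete_encoder = nn.Linear(input_dim, 1024)
```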

2. Choosing the activation functions:

  • The decoder's output function is chosen to match the input type: if all the inputs are binary (xᵢⱼ ∈ {0, 1}), it is chosen as the sigmoid/logistic function.
  • If all the inputs are real numbers (xᵢⱼ ∈ ℝ), it is chosen as a linear function.
  • The encoder's activation g is in general chosen as either the logistic function or the tanh function (the decoder-side choice is sketched below).
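A sketch of the decoder-side choice (the function and flag names are hypothetical, introduced only for illustration):

```python
import torch

def decoder_output(pre_activation, inputs_are_binary):
    """Pick the decoder's output function to match the input type."""
    if inputs_are_binary:
        # Binary x_ij in {0, 1}: squash outputs into (0, 1) with the logistic function
        return torch.sigmoid(pre_activation)
    # Real-valued x_ij: leave the output linear (unbounded)
    return pre_activation

pre_activation = torch.randn(32, 784)  # stand-in for W* h + c
x_hat = decoder_output(pre_activation, inputs_are_binary=True)
```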

3. Choosing Loss Function:

a) When xᵢⱼ ∈ ℝ:

  • Since the decoder's output function is linear, a mean squared error loss function works well.
  • Objective function (mean squared error):

minimize (1/m) Σᵢ Σⱼ (x̂ᵢⱼ − xᵢⱼ)²  over the parameters θ = (W, W*, b, c)

where m is the number of training examples and j runs over the input dimensions.
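In PyTorch this objective is available directly (the tensors below are random stand-ins for real data and decoder outputs):

```python
import torch
import torch.nn.functional as F

x = torch.randn(32, 784)      # real-valued inputs
x_hat = torch.randn(32, 784)  # stand-in for the linear decoder's reconstruction

loss = F.mse_loss(x_hat, x)   # mean of (x_hat_ij - x_ij)^2 over all i, j
```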

b) When xᵢⱼ ∈ {0, 1}:

  • Since we use a logistic output function, the decoder produces outputs between 0 and 1, which can be interpreted as probabilities.
  • Thus, for binary inputs, the cross-entropy loss is a natural fit.
  • Objective function (binary cross-entropy):

minimize −(1/m) Σᵢ Σⱼ [ xᵢⱼ log x̂ᵢⱼ + (1 − xᵢⱼ) log(1 − x̂ᵢⱼ) ]  over the parameters θ = (W, W*, b, c)
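Again a PyTorch sketch with stand-in tensors:

```python
import torch
import torch.nn.functional as F

x = torch.randint(0, 2, (32, 784)).float()  # binary inputs in {0, 1}
x_hat = torch.rand(32, 784)                 # sigmoid decoder outputs in (0, 1)

# Binary cross-entropy: -[x log(x_hat) + (1 - x) log(1 - x_hat)], averaged over i, j
loss = F.binary_cross_entropy(x_hat, x)
```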

4. Regularization:

To avoid overfitting and poor generalization to unseen data, we need to regularize the model. Two common options:

  • Add an L2 regularization (weight decay) term λ‖θ‖² to the objective function.
  • Tie the weights of the encoder and decoder, i.e. set W* = Wᵀ, which halves the number of weight parameters (both options are sketched below).
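A sketch of both ideas, assuming PyTorch: the optimizer's weight_decay supplies the L2 penalty, and the decoder reuses the transposed encoder weights (all sizes and values are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedAutoencoder(nn.Module):
    """Autoencoder with tied weights: the decoder reuses W* = W^T."""
    def __init__(self, input_dim=784, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Linear(input_dim, hidden_dim)  # W, b
        self.c = nn.Parameter(torch.zeros(input_dim))    # decoder bias c

    def forward(self, x):
        h = torch.sigmoid(self.encoder(x))
        # Decode with the transpose of the encoder's weight matrix
        return torch.sigmoid(F.linear(h, self.encoder.weight.t(), self.c))

model = TiedAutoencoder()
# L2 regularization: weight_decay adds a lambda * ||theta||^2 penalty to the objective
optimizer = torch.optim.Adam(model.parameters(), weight_decay=1e-4)
```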

Denoising Autoencoders

To make the model robust, the inputs can be corrupted before being fed into the network. A denoising autoencoder corrupts the input data using a probabilistic process P(x̃ᵢⱼ | xᵢⱼ) and feeds the corrupted x̃ᵢⱼ to the network.

Example: a common choice is masking noise, where P(x̃ᵢⱼ = 0 | xᵢⱼ) = q and P(x̃ᵢⱼ = xᵢⱼ | xᵢⱼ) = 1 − q, i.e. each input dimension is set to 0 with probability q.

Notice that the reconstruction loss still compares x̂ᵢⱼ against the clean input xᵢⱼ, not the corrupted input x̃ᵢⱼ: the model must learn a representation robust enough to recover the original data even when it is fed corrupted inputs (see the sketch below).
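A minimal sketch of this training step, assuming masking noise with corruption probability q and a tiny stand-in model (all sizes and values are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def corrupt(x, q=0.3):
    """Masking noise: set each x_ij to 0 with probability q."""
    mask = (torch.rand_like(x) > q).float()
    return x * mask

model = nn.Sequential(   # a tiny stand-in autoencoder
    nn.Linear(784, 64), nn.Sigmoid(),
    nn.Linear(64, 784), nn.Sigmoid(),
)

x = torch.rand(32, 784)      # clean inputs
x_tilde = corrupt(x)         # corrupted inputs are what the network sees
x_hat = model(x_tilde)
loss = F.mse_loss(x_hat, x)  # loss compares against the CLEAN x, not x_tilde
```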
