Autoencoders, Variational Autoencoders (VAE) and β-VAE

Rushikesh Shende
7 min read · Apr 19, 2023


Autoencoders (AE), Variational Autoencoders (VAE), and β-VAE are unsupervised learning models. Regardless of the architecture, all of these models consist of so-called encoder and decoder structures. AE is a deterministic model, while VAE and β-VAE are probabilistic, generative models trained by maximizing a variational lower bound, an objective closely related to the generalized EM (Expectation-Maximization) algorithm.

As an analogy, imagine an art gallery run by Mr. E (Encoder) and Mr. D (Decoder), who want to showcase a large number of paintings even though they only have a single wall (Latent Space). When an artist wants to showcase a painting, Mr. E assigns it a location on the wall (embedding) and then throws the original painting away. When a customer asks to see the painting, Mr. D attempts to recreate it from its location on the wall (reconstruction).

Understanding the Encoder

The encoder consists of a series of neural network layers that extract features from an image and embed, or encode, them into a low-dimensional latent space. If the encoder is built from convolution layers, the resulting architecture is sometimes called a CNN-VAE. Since CNN-VAEs generally outperform standard Multi-Layer Perceptron (MLP) VAEs on image data, the convolutional variant is the default choice for images, and the two terms are often used interchangeably. The architecture of the encoder is generally converging, as the latent space is lower-dimensional than the input space.
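As a minimal sketch, assuming PyTorch and hypothetical layer sizes for 28×28 grayscale inputs (neither of which is prescribed by the architecture itself), a converging convolutional encoder might look like this:

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Maps a 1x28x28 image to a low-dimensional latent vector (hypothetical sizes)."""
    def __init__(self, latent_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),   # 28x28 -> 14x14
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),  # 14x14 -> 7x7
            nn.ReLU(),
            nn.Flatten(),                                           # 64 * 7 * 7 = 3136 features
            nn.Linear(64 * 7 * 7, latent_dim),                      # converge to the latent space
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# A batch of 8 images becomes a batch of 8 latent vectors.
z = ConvEncoder()(torch.randn(8, 1, 28, 28))
print(z.shape)  # torch.Size([8, 16])
```

Each strided convolution halves the spatial resolution, so the representation shrinks toward the latent dimension, which is what "converging" means here.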

Understanding the Decoder

The decoder consists of a series of neural network layers that attempt to recreate the original image from the low-dimensional latent space. The architecture of the decoder is generally diverging. The encoder and decoder do not have to mirror each other, but in practice they usually do. In the case of a CNN-VAE, the decoder consists of transposed convolution layers, which upsample the latent representation back to the input resolution.
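Continuing the same hypothetical sketch, a diverging decoder mirrors the encoder with ConvTranspose2d layers that upsample from the latent vector back to the 28×28 input resolution:

```python
import torch
import torch.nn as nn

class ConvDecoder(nn.Module):
    """Maps a latent vector back to a 1x28x28 image (hypothetical sizes, mirroring the encoder)."""
    def __init__(self, latent_dim: int = 16):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 64 * 7 * 7)  # diverge from the latent space
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=3, stride=2,
                               padding=1, output_padding=1),  # 7x7 -> 14x14
            nn.ReLU(),
            nn.ConvTranspose2d(32, 1, kernel_size=3, stride=2,
                               padding=1, output_padding=1),  # 14x14 -> 28x28
            nn.Sigmoid(),                                      # pixel values in [0, 1]
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.fc(z).view(-1, 64, 7, 7)
        return self.net(h)

# Decode a batch of 8 latent vectors back into images.
x_hat = ConvDecoder()(torch.randn(8, 16))
print(x_hat.shape)  # torch.Size([8, 1, 28, 28])
```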

Autoencoder (AE)

In the most basic form of an autoencoder, the encoder and decoder are composed of fully connected (MLP) or convolutional (CNN) layers. The objective of training an autoencoder is to minimize the difference between the input data and its reconstructed output, typically measured with a loss function such as mean squared error or binary cross-entropy.
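A minimal training step for such an autoencoder might look like the following sketch; the fully connected layer sizes, the Adam optimizer, and the MSE loss are illustrative choices rather than requirements:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A small fully connected autoencoder for flattened 28x28 images (hypothetical sizes).
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 16))
decoder = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 784), nn.Sigmoid())
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def train_step(x: torch.Tensor) -> float:
    """One update: encode, decode, and minimize the reconstruction error."""
    x_hat = decoder(encoder(x))
    loss = F.mse_loss(x_hat, x)  # binary cross-entropy is another common choice
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss = train_step(torch.rand(8, 784))  # a batch of 8 flattened images with values in [0, 1]
```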

Autoencoders can be used for a variety of tasks, such as denoising, image super-resolution, anomaly detection, and clustering. They can also be stacked to create deeper architectures, such as deep autoencoders or convolutional autoencoders, that are capable of capturing more complex features and patterns in the input data.

One of the main advantages of autoencoders is their ability to perform unsupervised learning, meaning they can learn to extract meaningful features from raw data without requiring labeled data. This makes them useful for tasks where labeled data is scarce or expensive to obtain. Additionally, autoencoders can be trained using a variety of optimization algorithms, such as stochastic gradient descent and its variants, which can scale to large datasets and high-dimensional input spaces.

However, autoencoders also have some limitations. They are susceptible to overfitting, where the model learns to simply memorize the training data rather than learning to generalize to new data. This can be mitigated by adding regularization techniques such as dropout or early stopping. Additionally, autoencoders can be limited by the size of the compressed representation, as the model needs to strike a balance between preserving the most relevant information in the input and minimizing the reconstruction error.

Variational Autoencoder (VAE)

Variational Autoencoders (VAEs) are a type of autoencoder that was introduced to overcome some limitations of traditional AE. VAEs extend the traditional AE architecture by introducing a probabilistic framework for generating the compressed representation of the input data.

In VAEs, the encoder still maps the input data to a lower-dimensional latent space, but instead of producing a single point, it outputs a probability distribution over the latent space. A latent vector is then sampled from this distribution and passed through the decoder to generate a data point. This probabilistic approach to encoding allows VAEs to learn a more structured and continuous latent space, which is useful for generative modeling and data synthesis.

To go from a traditional autoencoder to a VAE, we need to make two key modifications. First, we need to replace the encoder’s output with a probability distribution. Instead of the encoder outputting a point in the latent space, it outputs the parameters of a probability distribution, such as mean and variance. This distribution is typically a multivariate Gaussian distribution but can be some other distribution as well (e.g., Bernoulli).
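A sketch of this first modification, again in PyTorch with hypothetical sizes: the encoder outputs a mean and a log-variance, and a latent vector is drawn from the resulting Gaussian via the reparameterization trick (z = μ + σ·ε with ε ~ N(0, I)) so that gradients can flow through the sampling step:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Encoder that outputs the parameters (mean, log-variance) of a diagonal Gaussian q(z|x)."""
    def __init__(self, in_dim: int = 784, hidden: int = 128, latent_dim: int = 16):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)      # mean of q(z|x)
        self.logvar = nn.Linear(hidden, latent_dim)  # log-variance of q(z|x)

    def forward(self, x: torch.Tensor):
        h = self.body(x)
        return self.mu(h), self.logvar(h)

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping the sampling differentiable."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

mu, logvar = GaussianEncoder()(torch.rand(8, 784))
z = reparameterize(mu, logvar)  # one stochastic latent code per input
```

The decoder then maps z back to a reconstruction exactly as before; only the encoder's output and the sampling step change.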

Second, we introduce a new term in the loss function called the Kullback-Leibler (KL) divergence. This term measures the difference between the learned probability distribution over the latent space and a predefined prior distribution (usually a standard normal distribution). The KL divergence term ensures that the learned distribution over the latent space is close to the prior distribution, which helps regularize the model and ensures that the latent space has a meaningful structure.

The quantity optimized during training is called the ELBO (Evidence Lower BOund), usually written as L. The loss function for a VAE (the negative ELBO) is composed of two parts: the reconstruction loss (similar to the traditional autoencoder loss) and the KL divergence loss. The reconstruction loss measures the difference between the original input and the output generated by the decoder. The KL divergence loss measures the difference between the learned probability distribution and the predefined prior distribution.
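In the notation that is standard in the VAE literature, with encoder q_φ(z|x), decoder p_θ(x|z), and prior p(z), the ELBO is commonly written as:

```latex
\mathcal{L}(\theta, \phi; x)
  = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right]}_{\text{reconstruction term}}
  - \underbrace{D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right)}_{\text{regularization term}}
```

For a diagonal Gaussian encoder and a standard normal prior, the KL term has a simple closed form over the d latent dimensions:

```latex
D_{\mathrm{KL}}\left(\mathcal{N}(\mu, \sigma^2 I) \,\|\, \mathcal{N}(0, I)\right)
  = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right)
```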

VAEs have several advantages over traditional autoencoders. They allow for generative modeling, meaning they can generate new data points from the learned latent space distribution. They also allow for continuous latent space representations, which means that we can interpolate between different points in the latent space to generate novel data points. Finally, VAEs are less susceptible to overfitting than traditional autoencoders since the probabilistic nature of the encoding forces the model to learn a more robust representation of the data.

However, VAEs can also be more difficult to train and require more computational resources. Additionally, the learned latent space representation can be difficult to interpret, and the quality of generated data can be limited by the model architecture and training data.

Beta Variational Autoencoder (β-VAE)

β-VAE is a type of VAE that introduces an additional hyperparameter called β, which controls the trade-off between the reconstruction error and the KL divergence term in the loss function.

In traditional VAEs, the KL divergence term pushes the learned latent distribution toward the prior with a fixed strength. Depending on the data, this pressure can leave the latent space too constrained and less expressive, limiting the model’s ability to capture complex patterns, or too weak to yield interpretable latent factors. β-VAE addresses this by making the strength of that regularization an explicit, tunable trade-off, while still keeping the learned distribution close to the prior.

To go from a VAE to a β-VAE, we simply modify the loss function to include an additional scaling factor, β, in front of the KL divergence term. This factor controls the strength of the regularization: higher values of β result in a more constrained latent space representation, while lower values of β result in a more flexible one. Setting β = 1 recovers the standard VAE, and values of β greater than 1 are typically used to encourage disentangled latent factors.

The loss function for a β-VAE is therefore composed of two parts: the reconstruction loss and the KL divergence loss scaled by β, which acts as the regularization term penalizing how far the learned latent distribution strays from the prior.
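Putting the pieces together, here is a minimal sketch of the β-VAE objective, reusing the closed-form Gaussian KL shown earlier (the value β = 4 is only an example; β = 1 recovers the standard VAE loss):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_hat, mu, logvar, beta: float = 4.0) -> torch.Tensor:
    """Reconstruction term plus beta-scaled KL term, averaged over the batch."""
    recon = F.mse_loss(x_hat, x, reduction="sum") / x.size(0)
    # Closed-form KL between N(mu, sigma^2 I) and the standard normal prior N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return recon + beta * kl

# Example with dummy tensors; in practice x_hat, mu, and logvar come from the decoder and encoder.
x = torch.rand(8, 784)
loss = beta_vae_loss(x, torch.rand(8, 784), torch.zeros(8, 16), torch.zeros(8, 16))
```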

β-VAEs have several advantages over traditional VAEs. The β hyperparameter provides a convenient knob to control the trade-off between reconstruction error and latent-space regularization, which can help fine-tune the model for specific applications. Lower values of β give a more flexible latent space that can capture complex patterns in the data more faithfully, while higher values encourage disentangled, more interpretable latent factors.

However, β-VAEs can also be more difficult to train than traditional VAEs. Tuning the beta hyperparameter can be challenging and requires careful experimentation to find the optimal value. Additionally, the regularization term in the loss function can lead to a more complex optimization landscape, which can make the training process more difficult and time-consuming.

Summary

Which model is better depends on the task at hand. AEs are better suited for tasks like dimensionality reduction and feature extraction. VAEs are better suited for generative tasks like image and text generation, where we want to generate new data points. β-VAE is better suited for tasks where we want to disentangle the underlying factors of variation in the input.

In terms of weaknesses, autoencoders may suffer from overfitting and cannot, on their own, generate new data points. VAEs may suffer from posterior collapse, where the decoder ignores the latent code and produces similar outputs for different inputs, and their generated samples tend to be blurry. β-VAE requires tuning of the β parameter, and a higher value of β may lead to a loss of fidelity in the generated outputs.
