Machine Learning Zuihitsu — VII

Published in Nerd For Tech · Jul 1, 2022

Practical Perspectives on Variational Autoencoders

Dr. Eren Unlu, Data Scientist and Machine Learning Engineer @Datategy, Paris

Variational Autoencoders (VAEs) can be considered one of the relatively underrated architectures in the Deep Learning literature. Since their introduction in the seminal paper [Kingma et al., 2013], various extensions have been developed on top of the vanilla model, improving its performance and diversifying its use cases. However, a quick GitHub search shows that practical implementations of VAEs are somewhat redundant, if not outright limited.

In this article, I would like to share my personal experience with VAEs, which have given me a competitive edge through their robust and accurate high-dimensional embedding abilities in both unsupervised and supervised tasks. Unfortunately, as just noted, I found it particularly hard for a newcomer to the subject to find an accurate, practical plug-and-play solution.

We can attribute the relative unpopularity of VAEs to their unfortunate timing: they were overshadowed by the then-emerging Generative Adversarial Networks (GANs) [Goodfellow et al., 2014] (and now by Generative Diffusion Models [Sohl-Dickstein et al., 2015]). It also stems from the fact that VAEs' generative capabilities were highlighted from day one much more than their efficient representation learning potential. So VAEs missed the opportunity to be venerated by hot-shot AI gurus and to receive accolades like 'the most interesting idea in deep learning in the last decade', as GANs did. And that assessment was partly just: a vanilla GAN did provide a significant boost in deep generative quality over a vanilla VAE at the time.

However, during this eclipse we may have slightly underestimated the embedding potential of VAEs, especially their value as a 'staple model' for data scientists. This might seem like an exaggeration, but I have had the chance to work on numerous and diverse representation learning use cases in my career, and it is hard not to mark VAEs as the 'go-to models' for the task. For example, full Bayesian networks (Deep Belief Networks) [Hinton, 2009] are hard to train to plausible results due to the intricate training regimes they require. Pseudo-Bayesian networks practically fail to learn an accurate probabilistic manifold, since the variation introduced by keeping dropout active at inference is a false one. Not to mention old-school linear decomposition methods, which cannot capture non-linearity, or relatively recent algorithms such as t-SNE and UMAP, which fail to preserve true global similarity and semantic covariation. And a regular autoencoder, like any non-variational deep learning architecture, tends to learn an overfit and narrow manifold. Hence, in my experience, a vanilla VAE (or one of its extensions such as CVAE, RTVAE etc.) sits in the sweet spot. I am sure that a data scientist who plays around with these models a little would appreciate their efficiency apart from pure generative capabilities.

In addition, VAEs deserve attention simply for their incorporation of simple yet powerful and innovative ideas, such as the reparameterization trick, the log-variance parameterization and distributional regularization. Hence, in this document I would like to reiterate very quickly, in consecutive points, the basics of vanilla variational autoencoders; address certain problems in several available open source implementations; and mention niche, subtle manipulations used when building and training VAEs, through examples. As you might have guessed, the focus will be on latent compression of tabular data rather than on the generative context. Interestingly, I have noticed similar observations made by other data scientists in the community, e.g. https://towardsdatascience.com/variational-autoencoder-demystified-with-pytorch-implementation-3a06bee395ed

1. Vanilla VAE

As I just mentioned, a vanilla VAE architecture is based on several very simple but beautiful and performant ideas. It is hard for a machine learning engineer not to fall in love with it after examining it. Here I assume the reader has basic knowledge of deterministic deep auto-encoding and the theory of latent data embedding. Unlike a regular deterministic autoencoder, a vanilla VAE has a bottleneck layer where all the probabilistic magic happens. Apart from this layer, every other component is deterministic, as in a regular AE. Straightforwardly, the deterministic sub-network mapping the high-dimensional input to this variational layer is referred to as the 'encoder', and the deterministic sub-network reconstructing the data from the probabilistic latent space is called the 'decoder'.

Simplified architecture of a vanilla VAE with two dimensional latent space.

The figure shows a simplified vanilla VAE architecture with only 2 latent dimensions. Let us recapitulate VAE theory through this specific example. Just as in regular AEs, the deterministic encoder compresses the data to a lower dimension. At this point, in the variational latent layer (2 dimensions in this specific case), we change the context and the objective a little. Now what we want to learn is a probability distribution over two variables, ideally a Gaussian distribution (which we will talk about a little later on).

As a Gaussian distribution is fully defined by its two moments, the mean and the variance, the central idea is to learn a 'mean' and a 'variance' parameter per latent dimension (mean and variance neurons). The latent representation of the data is thus modeled as a multivariate Gaussian distribution. Certain interesting attributes of the VAE architecture start from this point. Due to the intractability and complexity of the exact posterior, we follow a Monte Carlo approach: at each iteration, we sample points from this learned distribution, which is at the root of the VAE's probabilistic nature.

However, there is a catch here. The whole objective is to train all parameters of this network in an end-to-end fashion, yet sampling inside the network totally disrupts the notion of gradient descent; we need to somehow transform it into something differentiable. The vanilla VAE paper offers a simple and elegant strategy at this point, called the 'reparameterization trick'. The idea is to generate a vector of unit-normal random variables and define the sampled latent vector as the learned mean vector plus this random vector scaled by the learned deviation magnitude. With this simple trick, the latent vector becomes differentiable for the backpropagation algorithm, as the stochastic component is linearly detached.
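As a minimal sketch of the trick (in PyTorch; the function name is mine), assuming the encoder outputs a mean vector and a log-variance vector:

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    All the randomness lives in eps, so gradients flow through mu and logvar."""
    std = torch.exp(0.5 * logvar)  # sigma = exp(log(sigma^2) / 2)
    eps = torch.randn_like(std)    # unit-normal noise, outside the computation graph
    return mu + std * eps
```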

The reparameterization trick is a simple yet powerful idea that makes the objective differentiable and thus compatible with backpropagation. Using the logarithm of the variance to please the gradients is another effortless, brilliant idea among the many in the vanilla VAE proposal.

There is another elegant touch at this very step. As variance is a strictly positive quantity, it would be difficult to learn such a parameter directly in a deep neural context, since zero-centered values conform better to the learning procedure. So rather than learning the deviations directly, we learn the logarithm of the variance, which can simply be inverted by exponentiation in the sampling step.

Another interesting approach in the original VAE paper is the proposition of a 'regularization term' in the loss. The central idea is to force the latent space distribution (the mean-μ values in the case of a VAE) towards a unit normal distribution. In theory, this projects the data onto a much 'smoother', probabilistic, generalizable manifold, which in turn also yields better generative properties. To do this, we add a regularization loss to the overall loss definition: the Kullback-Leibler (KL) divergence between the instantaneous latent space distribution and a multivariate unit Gaussian, which measures their similarity.
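For a diagonal Gaussian posterior, this KL term has a well-known closed form. A minimal PyTorch sketch (function name mine), summing over latent dimensions and averaging over the batch:

```python
import torch

def kl_to_unit_gaussian(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian
    kl_per_sample = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
    return kl_per_sample.mean()
```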

In theory, a true Gaussian distribution of the latent samples might be necessary to achieve ideal VAE conditions, given the model's pre-assumptions. In practice, however, we never seek true multivariate normality; the idea is just to 'force the network enough' towards a smooth representation.

The original paper proposes to optimize the ELBO (Evidence Lower Bound), a lower bound on the data log-likelihood, which in theory yields a proper optimization of the reconstruction of the data under certain a priori assumptions on the input data distribution.

ELBO loss on VAEs. [https://jaan.io/what-is-variational-autoencoder-vae-tutorial/]
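For reference, the ELBO to be maximized (its negative serves as the training loss) is the expected reconstruction log-likelihood minus the KL regularizer discussed above:

```latex
\mathcal{L}_{\mathrm{ELBO}}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
  - \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)
```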

As said just previously, the overall objective of a VAE can always be thought of as a linear sum of the reconstruction loss and the latent space regularization loss (KL loss). Even though, from a theoretical perspective, this loss definition is not a zero-sum game, in practice forcing the latent space to be smooth always decreases reconstruction quality (at least on seen data). Therefore, it is common in the literature to treat these two components of the overall loss function as contenders.

Actually, the real challenge of taming VAEs starts here. Open source VAE implementations are flooded with the MNIST example, which at the time of Kingma's paper was the 'go-to dataset' for generative demonstrations. As you know, the original MNIST images can be considered binary, as the digits are written in white on a dark background (even though in reality they are continuous grayscale). Therefore, VAE implementations for this example use binary cross-entropy (BCE) for the reconstruction error. Unfortunately, most open source implementations, whether in PyTorch or TensorFlow, follow the same example with the same loss definitions. I remember seeing this BCE loss term (along with a sigmoid activation at the reconstruction) used for color image examples, and even for tabular data. It looks to me as if the Keras tutorial on VAEs has somehow dominated the open source implementations, which is very dangerous.

For certain tabular data examples, I have seen many implementations using the MSE loss, which in theory, as you know, corresponds to maximum likelihood estimation if the data is assumed Gaussian. Using MSE is perfectly fine, at least in practice, but one needs to understand how to properly balance the very different magnitudes of the reconstruction error and the latent space regularization (KL divergence) term.

For tabular data, one of the latest and most interesting papers out there is [Akrami, 2022]. It deliberately focuses on tabular data, assuming a Gaussian distribution for continuous variables and a Bernoulli distribution for categorical features, and properly builds an ELBO loss for reconstruction. In addition, the authors provide a beta-divergence mechanism to handle outliers in the training data, further improving latent space generalizability.

2. Balancing Reconstruction Error and Regularization

Even without a literature survey, just by experimenting on your own with VAEs, you will understand that the most important challenge is the proper definition of the reconstruction error and the careful balancing of the overall loss function once the regularization term, which pushes the latent distribution towards a unit multivariate Gaussian, is included. But as you can see, there is an 'apples and oranges' issue here: the reconstruction error can be on the scale of 1000s whereas the KL loss stays below 1, so the reconstruction term dominates the overall objective. Unfortunately, as with most things in deep learning, there is no exact answer to this discrepancy; we need to go empirical.

If you check the literature, you will see that there are really interesting papers on the issue. The first thing that comes to mind is to add a simple linear coefficient to tune down the dominance of one term. Models trained this way are generally referred to as β-VAEs [Higgins, 2016], β being the weighting factor. However, one can see that there is no theoretical consensus on a valid balancing mechanism.
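In code, the balancing is a one-line change. A hedged sketch (names mine): note that [Higgins, 2016] formally weights the KL term by β, while the experiments below instead weight the reconstruction term by 0.1, which is equivalent up to an overall scaling of the loss:

```python
import torch
import torch.nn.functional as F

def vae_loss(x_hat, x, mu, logvar, beta: float = 1.0) -> torch.Tensor:
    # Reconstruction: MSE summed over features, averaged over the batch
    recon = F.mse_loss(x_hat, x, reduction="sum") / x.size(0)
    # KL to the unit Gaussian, weighted by beta as in the beta-VAE objective
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
    return recon + beta * kl
```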

An interesting paper that scrutinizes the issue is [Asperti, 2020], which I found particularly compelling. It builds on the theoretical balancing-term derivation of a previous paper, [Dai, 2019], which proposes to define the reconstruction error as a function of the MSE while learning a gamma parameter end-to-end to balance the regularization.

Loss definitions proposed in [Dai, 2019] and [Asperti, 2020].

[Asperti, 2020] observes that learning gamma may be unnecessary: it can instead be inferred deterministically after a suitable re-manipulation of the suggested overall loss definition.

Other interesting proposals in the literature for VAE loss balancing include [Chen, 2018], which proposes a log-cosh reconstruction error, and [Zhao, 2017], which uses mutual information (I will use this InfoVAE loss in one of the examples in the next section).
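In practice, the InfoVAE objective of [Zhao, 2017] is commonly implemented with a maximum mean discrepancy (MMD) between encoded latent samples and samples from the unit Gaussian prior. A hedged sketch of that common implementation (function names and kernel bandwidth are my assumptions):

```python
import torch

def rbf_kernel(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Pairwise RBF kernel between the rows of a and b; bandwidth tied to latent dim
    sq_dist = (a.unsqueeze(1) - b.unsqueeze(0)).pow(2).sum(-1)
    return torch.exp(-sq_dist / a.size(1))

def mmd_loss(z: torch.Tensor) -> torch.Tensor:
    # MMD between encoded samples z and samples from the N(0, I) prior
    prior = torch.randn_like(z)
    return (rbf_kernel(z, z).mean()
            + rbf_kernel(prior, prior).mean()
            - 2.0 * rbf_kernel(z, prior).mean())
```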

3. Some Practice

Enough with the theory; let's get our hands dirty. Due to the introductory nature of the article, I will go with a very simple dataset and VAE architecture: the popular wine dataset with 13 features, all numerical, and a very simple vanilla VAE. You can find the code to reproduce this experiment in the annex. I use a symmetric 2-layer VAE, the encoding layers consisting of 8 and 4 neurons respectively, with ReLU activations and batch normalization layers, mirrored in the decoder. The latent representation layer is 2-dimensional (thus 2 learnable mean and 2 learnable log-variance components).
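A hedged PyTorch sketch of this architecture (a minimal module; naming and the exact layer ordering are my assumptions):

```python
import torch
import torch.nn as nn

class VanillaVAE(nn.Module):
    # Symmetric 13 -> 8 -> 4 -> 2 VAE with ReLU and batch norm, as described above
    def __init__(self, n_features: int = 13, latent_dim: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 8), nn.BatchNorm1d(8), nn.ReLU(),
            nn.Linear(8, 4), nn.BatchNorm1d(4), nn.ReLU(),
        )
        self.fc_mu = nn.Linear(4, latent_dim)      # mean neurons
        self.fc_logvar = nn.Linear(4, latent_dim)  # log-variance neurons
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 4), nn.BatchNorm1d(4), nn.ReLU(),
            nn.Linear(4, 8), nn.BatchNorm1d(8), nn.ReLU(),
            nn.Linear(8, n_features),
        )

    def forward(self, x: torch.Tensor):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.decoder(z), mu, logvar
```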

The idea of this experiment is to show the importance of selecting a proper reconstruction error and balancing it against the KL divergence loss. First, we use a simple MSE loss summed directly with the regularization loss, without re-weighting. For each experiment, we will have a GIF, updated every epoch, showing four graphs: the distribution of data points in the 2-dimensional latent space, the reconstruction loss, the KL divergence loss, and the overall loss.

Note that the wine dataset is by nature a classification dataset, each wine sample belonging to one of 3 classes. I do not use this target class for auto-encoding the features; it only colors the data points in the latent scatter plot.

Evolution of the latent space and the losses without a coefficient on the reconstruction error. Even though the overall loss is totally dominated by the reconstruction loss due to the discrepancy of magnitudes, the KL loss still evolves in a somewhat acceptable fashion.
The same experiment with the reconstruction loss weighted by a factor (beta) of 0.1. As you can see, the center of mass of the overall loss is now on the KL regularization.
Using the mutual-information-based overall loss proposed in [Zhao, 2017].

4. Conclusion

4.1 The VAE idea is excellent in its nature, built on very simple yet brilliant touches such as the reparameterization trick and the enforcement of an a priori latent distribution to increase manifold smoothness.

4.2 With proper design and training strategies, VAEs look like the 'go-to method' for low-dimensional data embedding. However, their generative capabilities made the headlines and were then overshadowed by newer, more complex architectures, which has hidden their manifold learning abilities from view.

4.3 There are not enough resources on VAEs. A newcomer data scientist can easily fall into the traps of wrong and/or insufficient tutorials and open source implementations, and there is considerable redundancy in these resources, most of them replicating the MNIST experiment.

4.4 The Achilles' heel of VAEs is the proper balancing of the reconstruction error and the KL divergence loss that regularizes the latent space. Without a proper understanding, and with insufficient resources, a data scientist can easily end up delivering an implausible implementation.

4.5 Even though VAEs rest on solid theoretical foundations with strong a priori assumptions on the data, as in deep learning in general, the real deal is in empirical, practical understanding and manipulation. And for that, a data scientist or machine learning engineer should get acquainted with the intricacies of VAE theory.

4.6 In my experience, one of the most important aspects omitted in VAE design is the proper choice of activations. Following similar vanilla examples on the internet, I went with ReLU. The wine dataset is rather simple in its correlations and small in dimension, yet with ReLU activations, even with a stronger latent space regularization, we end up with entangled manifolds, as seen in our experiments. Generally, I have observed that for tabular low-dimensional embedding tasks it is preferable to use no activations, at least in the one layer before and after the bottleneck (see the sketch below), and to compensate the relative loss of reconstruction performance by increasing the number of layers.
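A hedged illustration of this recommendation (layer sizes follow the experiment above; whether to also drop the batch norm around the bottleneck is a judgment call):

```python
import torch.nn as nn

# Drop the non-linearities immediately around the 2-dimensional bottleneck
encoder_tail = nn.Sequential(
    nn.Linear(8, 4), nn.BatchNorm1d(4), nn.ReLU(),
    nn.Linear(4, 4),            # no activation right before the bottleneck
)
decoder_head = nn.Linear(2, 4)  # no activation right after the bottleneck
```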

5. References

[Kingma et al., 2013] Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).

[Goodfellow et al., 2014] Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in neural information processing systems 27 (2014).

[Sohl-Dickstein et al., 2015] Sohl-Dickstein, Jascha, et al. “Deep unsupervised learning using nonequilibrium thermodynamics.” International Conference on Machine Learning. PMLR, 2015.

[Hinton, 2009] Hinton, Geoffrey E. “Deep belief networks.” Scholarpedia 4.5 (2009): 5947.

[Higgins, 2016] Higgins, Irina, et al. “beta-VAE: Learning basic visual concepts with a constrained variational framework.” (2016).

[Akrami, 2022] Akrami, Haleh, et al. “A robust variational autoencoder using beta divergence.” Knowledge-Based Systems 238 (2022): 107886.

[Asperti, 2020] Asperti, Andrea, and Matteo Trentin. “Balancing reconstruction error and Kullback-Leibler divergence in Variational Autoencoders.” IEEE Access 8 (2020): 199440–199448.

[Dai, 2019] Dai, Bin, and David Wipf. “Diagnosing and enhancing VAE models.” arXiv preprint arXiv:1903.05789 (2019).

[Chen, 2018] Chen, Pengfei, Guangyong Chen, and Shengyu Zhang. “Log hyperbolic cosine loss improves variational auto-encoder.” (2018).

[Zhao, 2017] Zhao, Shengjia, Jiaming Song, and Stefano Ermon. “InfoVAE: Information maximizing variational autoencoders.” arXiv preprint arXiv:1706.02262 (2017).

6. Annex
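A minimal, hedged sketch to reproduce the first two experiments of Section 3 (scikit-learn's wine dataset, standardized features, the VanillaVAE module sketched there, and an MSE + KL loss); the learning rate and epoch count are my assumptions, not the original values:

```python
import torch
import torch.nn.functional as F
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Load and standardize the 13 numerical wine features; the 3 class labels are
# used only to color the latent scatter plots, never for training.
data = load_wine()
x = torch.tensor(StandardScaler().fit_transform(data.data), dtype=torch.float32)

model = VanillaVAE(n_features=13, latent_dim=2)  # module sketched in Section 3
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is an assumption

beta = 0.1  # weight on the reconstruction term; 1.0 reproduces the first experiment
for epoch in range(200):  # epoch count is an assumption
    optimizer.zero_grad()
    x_hat, mu, logvar = model(x)
    recon = F.mse_loss(x_hat, x, reduction="sum") / x.size(0)
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
    loss = beta * recon + kl
    loss.backward()
    optimizer.step()
```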
