How to find good hyperparameters for Soft-IntroVAE — A 2D toy data set parameter search ablation

Sebastian Vater
Fourthline Tech
Mar 5, 2024 · 9 min read

Everybody who has ever worked with VAE- or GAN-like algorithms knows this: the struggle of finding proper hyperparameters. When reading papers, this hyperparameter search seems to be as doable a job as the implementation of the algorithm itself; hands-on practitioners know from experience that it is not.

This blog aims to provide a more visually intuitive understanding of the role that the hyperparameters play with respect to the different loss terms of a Soft-IntroVAE model.

Note: The article makes use of mathematical writing. For the best reading experience, we recommend using Chrome and installing the TeX to Unicode extension if you don’t have it yet.

Visualizations of the results of hyperparameter searches conducted for different approaches: beta-VAE, EBGAN, Soft-IntroVAE (from left to right; pictures are screenshots from the respective papers).

Though a lot of advances in the field (Generative Adversarial Networks: EBGAN, BEGAN, Stabilized GAN training, WGAN, ProGAN, MSG-GAN; Variational Autoencoder-like: beta-VAE, IntroVAE, Soft-IntroVAE) have been made towards more robust training, this hyperparameter search is tedious in the best case and a nightmare in a more realistic one. In this context, we understand robust training to be a series of successful and convergent model trainings that are not sensitive to slight changes in the hyperparameter values.

Advances being made regarding training stability. (Screenshot taken from the MSG-GAN paper.)

Following the taxonomy above and subdividing modern high-resolution image generation algorithms into VAE- and GAN-like algorithms, we focus here on the VAE-like algorithm from Soft-IntroVAE: Analyzing and Improving the Introspective Variational Autoencoder.

The reason for this choice is two-fold: on the one hand, we want to give credit to the original authors of the paper, as this blog post was clearly inspired by the 2D toy data set studies conducted in their work. Given their very valuable published findings, it is worth conducting an even more comprehensive experiment.

The second reason is that we at Fourthline indeed found the Soft-IntroVAE to be more robust to train than the other approaches mentioned above (all of which we worked with and tested).

Extensively searching for good hyperparameters (or even for any hyperparameters that converge) is not just tedious but, given the training times for reasonable data set sizes, often an unfeasible job. Although different values are found and reported for different datasets in the original paper, this blog post aims at deducing some general rules about the relationships between hyperparameters for training Soft-IntroVAE models, rules that can ideally be applied agnostically to the data domain. We hope this helps developers better understand the interplay between the different loss terms (of which Soft-IntroVAE has a lot) and their corresponding hyperparameters.

Precap: This blog assumes a thorough understanding of Variational Autoencoders and their popular derivatives, in particular the Soft-IntroVAE.
If you are not familiar with them, you might want to study them before continuing with this blog.

2D toy data set from the original paper

In the original paper, Table 2 compares the best results in terms of ELBO scores for different VAE-based approaches (VAE, IntroVAE, Soft-IntroVAE), resulting from an extensive grid search. The authors conducted 81, 1260 and 210 parameter sweeps, respectively.

They visualize the best results on four different, well-known image-space distributions (which are 2D points here):

Visualizations of best fits of known training data distributions with the best found hyperparameters for different models as published in the Soft-IntroVAE paper.

Table 2 shows the best results obtained on the toy data. While the paper indeed reports different optimal hyperparameter settings for different datasets, interesting questions still to be raised are:

  • What does the ELBO (or any other objective) manifold look like for a vast number of hyperparameter combinations?
  • How does this manifold behave in terms of smoothness and value ranges?

The following paragraphs should give an answer to these questions, followed by a discussion.

Recap: Soft-IntroVAE objective

We do not want to retrace the entire lineage of derivations of the loss functions of the different VAE-based objectives that brilliant research has brought forward. Building upon the VAE and IntroVAE propositions, we will focus directly on the Soft-IntroVAE objective; its pseudo code is found below.

Pseudo code of the Soft-IntroVAE algorithm. Besides the traditional hyperparameters in VAE-like models such as latent space size, KL loss weight (β_kl) and reconstruction loss weight (β_rec), we additionally have β_neg and γ_r. (Here we treat the latent space dimension as a hyperparameter rather than a network architecture choice.)
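To make the role of these weights concrete, here is a minimal PyTorch-style sketch of how they enter the encoder and decoder losses. The `encoder`/`decoder` callables, the helper names and the scaling constants are illustrative assumptions, and the gradient routing (what is detached for which update) is omitted, so treat this as a sketch in the spirit of Algorithm 1, not as the authors' reference implementation.

```python
import torch

def kl_divergence(mu, logvar):
    # Analytic KL between N(mu, diag(exp(logvar))) and a standard normal prior,
    # summed over latent dimensions, averaged over the batch.
    return (-0.5 * (1.0 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)).mean()

def mse_rec(x, x_rec):
    # Squared-error reconstruction loss, summed per sample, averaged over the batch.
    return ((x - x_rec) ** 2).sum(dim=1).mean()

def soft_intro_loss_terms(encoder, decoder, x, z_prior,
                          beta_rec, beta_kl, beta_neg, gamma_r):
    """Sketch of how the Soft-IntroVAE hyperparameters weight the loss terms."""
    mu, logvar = encoder(x)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
    x_rec = decoder(z)                                      # reconstructions of real data
    x_fake = decoder(z_prior)                               # decodings of prior samples

    # Encoder: standard ELBO on the real data ...
    loss_E_real = beta_rec * mse_rec(x, x_rec) + beta_kl * kl_divergence(mu, logvar)

    # ... plus "soft" exp-ELBO terms on reconstructions and generated samples,
    # where beta_neg weights the KL of the fake encodings.
    mu_r, logvar_r = encoder(x_rec)
    mu_f, logvar_f = encoder(x_fake)
    exp_elbo_rec = torch.exp(-(beta_rec * mse_rec(x_rec, decoder(mu_r))
                               + beta_neg * kl_divergence(mu_r, logvar_r)))
    exp_elbo_fake = torch.exp(-(beta_rec * mse_rec(x_fake, decoder(mu_f))
                                + beta_neg * kl_divergence(mu_f, logvar_f)))
    loss_E = loss_E_real + 0.25 * (exp_elbo_rec + exp_elbo_fake)

    # Decoder: ELBO on the real data plus gamma_r-weighted reconstruction of the fakes.
    loss_D = (beta_rec * mse_rec(x, x_rec)
              + 0.5 * beta_kl * (kl_divergence(mu_r, logvar_r)
                                 + kl_divergence(mu_f, logvar_f))
              + gamma_r * 0.5 * beta_rec * (mse_rec(x_rec, decoder(mu_r))
                                            + mse_rec(x_fake, decoder(mu_f))))
    return loss_E, loss_D
```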

As you see, there are at least two additional hyperparameters (compared to conventional VAEs) that the authors of the Soft-IntroVAE paper put on the table: β_neg and γ_r.

For that reason, the graphs below are projections of the higher-dimensional space (4-dimensional: β_rec, β_kl, β_neg, latent_space) onto a visualizable 2-dimensional space. Here, the projections are simple statistics, such as the minimum value of a specific loss term over all values of the remaining parameters. This is discussed further below.
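As a concrete illustration of such a projection, the following numpy sketch reduces a 3D grid of loss values over (β_rec, β_kl, β_neg) to the 2D surface plotted in the figures below. The array names, shapes and random placeholder values are assumptions for illustration, not our actual sweep results.

```python
import numpy as np

# Placeholder sweep results: one final loss per (beta_rec, beta_kl, beta_neg) tuple,
# with NaN marking collapsed/unsuccessful trainings (random values, for illustration).
rng = np.random.default_rng(0)
losses = rng.random((10, 10, 15))
losses[losses > 0.95] = np.nan

# Projection onto the (beta_rec, beta_kl) plane: for every pair, keep the minimum
# loss over the beta_neg sweep. The figures plot the logarithm of this surface.
proj_min = np.nanmin(losses, axis=2)
log_surface = np.log(proj_min)
```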

Experimental setup

The following are the results of 1489 trainings, in which we fit the model to a mixture of eight 2D Gaussians:

Test data for this blog post: a mixture of eight 2D Gaussian distributions.
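For readers who want to reproduce the toy data, a minimal sketch of such an 8-Gaussians mixture is given below. The radius, standard deviation and sample count are illustrative choices, not the exact values used in the paper or in our runs.

```python
import numpy as np

def sample_eight_gaussians(n_samples, radius=2.0, std=0.05, seed=0):
    """Draw points from a mixture of eight 2D Gaussians placed evenly on a circle."""
    rng = np.random.default_rng(seed)
    angles = 2.0 * np.pi * np.arange(8) / 8.0
    centers = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (8, 2)
    component = rng.integers(0, 8, size=n_samples)      # pick one mode per sample
    return centers[component] + std * rng.standard_normal((n_samples, 2))

data = sample_eight_gaussians(4096)
```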

We vary β_rec, β_kl and β_neg, while keeping the latent dimension equal to 2 and γ_r fixed to the value mentioned in the paper.

Here we perform a grid search over the parameter space. Note that we want to gain intuition about how parameter sweeps influence the loss manifolds of the different loss terms. For non-brute-force hyperparameter optimizations, we refer to this source for an overview.
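A brute-force sweep of this kind boils down to a nested loop over the grids. The sketch below shows the structure we have in mind; the grid values, the `train_soft_intro_vae` placeholder and the monitored loss are assumptions for illustration, not our exact setup.

```python
import itertools
import numpy as np

# Illustrative grids; not the exact values behind the 1489 trainings reported here.
beta_rec_grid = np.round(np.linspace(0.1, 1.0, 10), 2)
beta_kl_grid = np.round(np.linspace(0.1, 1.0, 10), 2)
beta_neg_grid = np.logspace(-3, 1, 15)

def train_soft_intro_vae(beta_rec, beta_kl, beta_neg, latent_dim=2):
    # Placeholder: run a full training on the toy data and return the final value of
    # the monitored loss (e.g. the real-image reconstruction loss), or NaN if the
    # training collapsed or diverged.
    return float("nan")

results = np.empty((len(beta_rec_grid), len(beta_kl_grid), len(beta_neg_grid)))
for (i, b_rec), (j, b_kl), (k, b_neg) in itertools.product(
        enumerate(beta_rec_grid), enumerate(beta_kl_grid), enumerate(beta_neg_grid)):
    results[i, j, k] = train_soft_intro_vae(b_rec, b_kl, b_neg, latent_dim=2)
```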

In the original paper, the authors wanted to explore and compare the different approaches on a toy dataset. Since we want to deduce general rules for the training of actually relevant datasets, we do not alter the proposed Soft-IntroVAE architecture. This is in contrast to the paper's investigation, where a 3-layer MLP is used on the 8-Gaussians dataset.

Reconstruction loss: β_kl vs. β_rec

We start by looking at the two classic VAE losses: the KL loss and the reconstruction loss. As we are now, in Soft-IntroVAE, in a higher (here 4-dimensional) parameter space, we obtain multiple values for each (β_kl, β_rec) pair, given the sweeps through the β_neg space, as compared to a 2-dimensional optimization.

The first figure depicts the minimum, over different values of β_neg, of the logarithm of the reconstruction loss of the real images computed through the encoder (line 8 in Algorithm 1), plotted against sweeps of β_rec and β_kl between 0.0 and 1.0.

Reconstruction loss of the real images, computed by backpropagating through the encoder, for sweeps of β_rec and β_kl. The binary map at the floor indicates successful (white) and collapsed or unsuccessful (black) trainings.

At the bottom of the figure we show the regions where combinations of the hyperparameter tuple (β_rec, β_kl, β_neg) lead to NaN values for the loss, i.e. unsuccessful training:

Unsuccessful training with β_rec = 0.1 and β_kl = 0.1, corresponding to the red star in the figure above. The *.gif shows the training with the lowest reconstruction loss L_r_E over a β_neg sweep.

What do we learn from this graph? Lower β_kl values lead to the possibility of lower losses: some β_neg values for the same (β_kl, β_rec) pair result in very good reconstruction losses, while other β_neg values yield unstable training. Note that each plotted point corresponds to the minimum over the whole β_neg sweep.
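One plausible way to derive the binary floor map from the sweep results is sketched below, under the assumption that a (β_rec, β_kl) cell counts as unsuccessful only when every β_neg value in its sweep produced a NaN loss; the array and its values are placeholders.

```python
import numpy as np

# Placeholder sweep results, shape (n_beta_rec, n_beta_kl, n_beta_neg).
rng = np.random.default_rng(1)
losses = rng.random((10, 10, 15))
losses[:2, :2, :] = np.nan   # fake a collapsed corner of the grid, for illustration

# Assumption: a cell is "successful" (white) if at least one beta_neg value in the
# sweep finished without NaN; otherwise it is marked "unsuccessful" (black).
success_map = ~np.all(np.isnan(losses), axis=2)
```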

The Soft-IntroVAE objective computes two more interesting reconstruction losses, as described in line 10 of the algorithm: the loss of the reconstruction of the reconstructed images, L_rf_E (E for encoder), and the loss of the reconstruction of the generated images, L_ff_E. They are depicted below in the left and right figure, respectively. As above, we show the losses measured over the backpropagation through the encoder.

Reconstruction loss of the reconstructed real images (left) and reconstruction of generated images (right). Again, computed by backpropagating through the encoder.

We can see that the losses behave very similarly. In fact, we found this behavior in all our experiments across different datasets, including several image domains. For that reason, for the remainder of this blog, we always mean L_r_E when we talk about the reconstruction loss.

Choosing a ten times higher reconstruction loss weight β_rec compared to the *.GIF depicting the unsuccessful training above, and thereby moving towards the white region of L_rf_E, the model fits the data successfully:

Successfully fitting the data with β_rec = 1.0, β_kl = 0.1 and β_neg = 0.01. Note that this is not the best possible training, but it shows what happens when changing β_rec towards hyperparameter regions of successful training, indicated by the white regions in the figures above.

Kullback-Leibler loss: β_kl vs. β_rec

What happens to the KL loss at the same time? A low reconstruction loss is desirable, but an ML practitioner training VAEs knows that this does not necessarily coincide with a proper training: a low reconstruction loss can mean you reached a trivial solution, i.e. posterior collapse (only one image or a few similar images are reconstructed from any embedding). Therefore we also need to look at the KL loss: we do not want a too high KL loss, in which case the embeddings follow a density that is far away from the prior. If you embed too far from your prior and then, in the process of generating new images, draw a sample from that known prior and run it through your decoder, the decoder cannot properly handle this input and the generated images will be of insufficient quality (if they show anything meaningful at all). Depicted below is the minimum of the log of KL_z_E.

KL loss of the encodings of the real (training) images.

Comparing the KL loss to the reconstruction loss above, we find the following. Firstly, the NaN regions are of course the same: an unsuccessful training results in unstable losses for both the reconstruction and the KL term.

Secondly, we look at the shape and slope of this loss manifold. While we had good regions for a low reconstruction loss particularly at low β_kl values, we see a conflicting behavior here: low β_kl values result, as expected, in high KL losses.

Combining both loss surfaces, good training regions seem to lie in rather high β_kl values and moderate β_rec values. That said, there exist values for β_neg that yield low total loss values (remember: the plot shows the minimum KL loss among different β_neg values).

So let’s look at the β_neg hyperparameter.

Reconstruction loss: β_rec vs. β_neg

The minimum (over β_kl) of the reconstruction loss, plotted over the 2-dimensional (β_rec, β_neg) space, is shown below.

Apparent is the ridge at a low value of β_neg across the entire β_rec domain. It clearly indicates that, leaving β_kl aside, there exists a sweet spot for β_neg (for this data set). Increasing β_neg beyond it leads to worse reconstruction errors, but training remains stable, as the surface stays smooth; decreasing it below the sweet spot rapidly leads to unstable training, also indicated by the NaN regions at very low β_neg values.
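Given the full sweep results, such a sweet spot can also be read off programmatically. The sketch below (placeholder data, assumed grids) picks, for each β_rec, the β_neg value that minimizes the reconstruction loss after first minimizing over β_kl.

```python
import numpy as np

# Placeholder sweep results, shape (n_beta_rec, n_beta_kl, n_beta_neg).
rng = np.random.default_rng(2)
losses = rng.random((10, 10, 15))
beta_neg_grid = np.logspace(-3, 1, 15)   # assumed beta_neg grid

# Reduce over beta_kl first, then locate the best beta_neg for every beta_rec value.
loss_rec_vs_neg = np.nanmin(losses, axis=1)             # shape (n_beta_rec, n_beta_neg)
sweet_spot_idx = np.nanargmin(loss_rec_vs_neg, axis=1)  # best beta_neg index per beta_rec
sweet_spots = beta_neg_grid[sweet_spot_idx]
```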

Kullback-Leibler loss: β_rec vs. β_neg

The KL loss appears stable for larger β_kl and β_neg values. A very strong decrease of the KL loss can be seen when going towards the unstable regions of low β_kl values (see above).

Two views of the same Kullback-Leibler loss surface of the encodings of the real (training) images, plotted over β_rec and β_neg.

The area of higher β_neg values and moderate β_rec values promises stable training with a tendency towards lower loss values. This is a behavior we could also confirm in extensive experiments with other data sets and hyperparameter combinations.

Discussion

Of course one can argue: why such an ablation, why deduce “intuition” when we have an analytical description of the problem? Why not use non-brute-force hyperparameter optimization methods?

Well, there are at least two reasons:

  1. In Soft-IntroVAE, we have at least 4 different hyperparameters. Due to the curse of dimensionality and the human limitation of imagining anything beyond 3D, visualization and simplification by reducing dimensionality can give insights that help understand the interaction between the losses and hyperparameters.
  2. The ELBO has non-linear, in particular exponential, terms. Humans cannot naturally deal with exponential behavior when it comes to prediction, inter- or extrapolation.

Next steps

Further relationships that can be explored are the behavior of the reconstruction loss as well as the Kullback-Leibler loss KL_z_E, plotted over β_kl and β_neg.
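As a starting point, a minimal matplotlib sketch for such a surface over β_kl and β_neg could look like this; the data, grids and labels are placeholders, not results from our sweeps.

```python
import numpy as np
import matplotlib.pyplot as plt

beta_kl_grid = np.linspace(0.1, 1.0, 10)   # assumed grids, for illustration
beta_neg_grid = np.logspace(-3, 1, 15)
losses = np.random.rand(10, 10, 15)        # placeholder sweep results

# Project onto the (beta_kl, beta_neg) plane by minimizing over beta_rec (axis 0).
surface = np.log(np.nanmin(losses, axis=0))

K, N = np.meshgrid(beta_kl_grid, np.log10(beta_neg_grid), indexing="ij")
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(K, N, surface, cmap="viridis")
ax.set_xlabel("beta_kl")
ax.set_ylabel("log10(beta_neg)")
ax.set_zlabel("log min loss")
plt.show()
```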

And the most important and final question: what do these figures look like for other data sets? We hope that the insights gained on the 2D Gaussian dataset indeed generalize to other datasets, so that we can apply them to our real-world datasets! See you soon!
