Summary: Beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework

Robert L. Logan IV
Published in UCI NLP
4 min read · Nov 5, 2018

Authors: Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, Alexander Lerchner

Paper Link

Variational autoencoders (VAEs) are a popular framework for learning generative models of data [1]. The model is composed of two parts: an inference network, which maps samples from a dataset into a latent variable z, and a generative model, which decodes the latent variable back into the original data space. One interesting question we can ask is: what generative factors of the data (e.g. position, size, and rotation of a shape in an image) are being captured by the latent variable? In general, this can be difficult to determine, as these factors may be encoded in multiple interdependent components of z. One proposal for addressing this issue is to force z to learn a disentangled representation of the data (i.e. to force the components of z to be independent). This paper demonstrates that disentanglement can be achieved with a simple modification to the VAE learning objective, and it establishes a protocol for measuring the degree of disentanglement learnt by a model.

The main idea is quite simple. In the standard VAE, an isotropic Gaussian prior p(z) ∼ 𝓝(0, I) is typically assumed for z. Note that under this distribution the components of z are independent (i.e. disentangled), which is exactly the property we would like our approximate posterior distribution q(z|x) to have. Thus, to encourage independence, we upweight the KL-divergence term in the ELBO by a factor β:
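
𝓛 = 𝔼_{q(z|x)}[log p(x|z)] − β · D_KL(q(z|x) ‖ p(z))

Setting β = 1 recovers the standard VAE objective, while β > 1 pushes the approximate posterior more strongly toward the factorized prior.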

Training is then performed exactly the same as for the standard VAE.
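
As a concrete illustration, here is a minimal sketch of this loss in PyTorch. The Bernoulli reconstruction term (binary cross-entropy) and the default β value are my own illustrative choices, not prescriptions from the paper:

    import torch
    import torch.nn.functional as F

    def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
        """Negative ELBO with the KL term upweighted by beta."""
        # Reconstruction term: Bernoulli log-likelihood of the input
        # under the decoder output (summed over pixels and batch).
        recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
        # Closed-form KL between q(z|x) = N(mu, diag(exp(logvar)))
        # and the isotropic Gaussian prior N(0, I).
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + beta * kl

With beta=1 this reduces to the usual VAE loss; everything else about training (reparameterization trick, optimizer, etc.) is unchanged.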

The following figure compares the outputs of a β-VAE (with β = 250), a regular VAE, and an InfoGAN model [2], all trained on the CelebA dataset [3].

Figure 1 from the paper.

As can be seen, the β-VAE captures interpretable factors such as rotation and smile better than the standard VAE and a state-of-the-art InfoGAN model. However, the images are also considerably blurrier, which is understandable: upweighting the KL-divergence pulls the approximate posterior toward the prior, limiting how much information the latent code can carry about each image.

While examples like the one above are useful for illustrating qualitative differences between generative models, it can be hard to discern precisely how much better a given model is at capturing the latent factors in the data. To quantify the level of disentanglement more rigorously, the paper introduces the disentanglement metric score, which is computed as follows (a code sketch appears after the list):

  • Start with a known generative process with a set of independent, interpretable factors (e.g. scale, color, etc.) that can be used to simulate data.
  • Create a dataset composed of pairs of generated data points for which a single factor is held constant (e.g. a pair of images whose objects have the same color).
  • Use the inference network to map each pair of images to a pair of latent variables, and average the absolute differences between the paired latents over a batch of pairs.
  • Train a linear classifier to predict which factor was held constant from these averaged differences. The accuracy of this classifier is the disentanglement metric score.
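
Here is a minimal sketch of this procedure, assuming two hypothetical helpers that stand in for the paper's setup: generate(factors), the ground-truth simulator, and encode(x), which returns the latent mean for an image. Scikit-learn's logistic regression stands in for the linear classifier:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def disentanglement_score(generate, encode, n_factors, sample_factors,
                              n_examples=5000, n_pairs=64):
        """Accuracy of a linear classifier at guessing the fixed factor."""
        X, y = [], []
        for _ in range(n_examples):
            k = np.random.randint(n_factors)      # factor to hold constant
            diffs = []
            for _ in range(n_pairs):
                f1, f2 = sample_factors(), sample_factors()
                f2[k] = f1[k]                     # force factor k to match
                z1, z2 = encode(generate(f1)), encode(generate(f2))
                diffs.append(np.abs(z1 - z2))     # per-pair latent difference
            X.append(np.mean(diffs, axis=0))      # average to reduce variance
            y.append(k)
        X_tr, X_te, y_tr, y_te = train_test_split(np.array(X), np.array(y))
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        return clf.score(X_te, y_te)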

The table below reports this score for a number of different models on a dataset of 2D shapes:

From Figure 6 in the paper.

As can be seen, the disentanglement metric score is higher for the β-VAE than for most other baselines, with the exception of DC-IGN (another VAE-based model, which requires the latent factors to be known a priori during training to encourage disentanglement) [4].

Overall, I think this work does a really good job of establishing rigorous foundations for the problem of learning disentangled latent representations of data. While the β-VAE will probably not be used in practice for generating images (modern GANs produce much more realistic outputs), it provides an effective and, more importantly, simple baseline for the task. Meanwhile, the disentanglement metric score is a reasonable approach for comparing different models, and it leaves open a number of interesting problems for future work, such as combining the score with quality-based evaluation metrics like the inception score [5], and constructing more challenging datasets that can better compare models' effectiveness at learning disentangled representations.

[1] Kingma, Diederik P., and Max Welling. "Auto-Encoding Variational Bayes." arXiv preprint arXiv:1312.6114 (2013).

[2] Chen, Xi, et al. "InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets." Advances in Neural Information Processing Systems. 2016.

[3] Liu, Ziwei, et al. "Deep Learning Face Attributes in the Wild." Proceedings of the IEEE International Conference on Computer Vision. 2015.

[4] Kulkarni, Tejas D., et al. "Deep Convolutional Inverse Graphics Network." Advances in Neural Information Processing Systems. 2015.

[5] Salimans, Tim, et al. "Improved Techniques for Training GANs." Advances in Neural Information Processing Systems. 2016.
