Topology of a latent space: What can go wrong with the representation in a deep latent variable model like VAE or Normalizing Flow?

Written by Piotr Tempczyk

Acta Schola Automata Polonica
Aug 9, 2020

Before we start

This blog post requires some knowledge of Bayesian probability theory, machine learning, and topology. I will give references to Wikipedia articles or other online sources whenever I use terminology from outside the machine learning domain. I assume the reader has some basic knowledge of machine learning, neural networks, and generative models such as GANs or VAEs.

This is the first in a series of blog posts about the topology of the latent space. In this blog post I describe and define potential problems with the latent space topology. In the next blog posts I am going to present experiments verifying how these problems affect models and representations trained on real-world and synthetic datasets. All experiments were conducted using genlib, a Python library for generative models.

Why even care about the representation?

During my work on a real estate property valuation model, our research team wanted to gain some insight into the model structure and the dependencies between the input and the output of our model. My idea was to:

  1. create a generative model for the data (my first try was a Variational Autoencoder, VAE),
  2. use this generative model to encode into the latent space a data point for which we wanted to explain the model's predictions,
  3. sample some points around it in the latent space,
  4. transform them back into the data space…
  5. …and see how our predictions change if we move in a direction where one of our features in data space changes the most (a rough sketch of this procedure in code follows the list).
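A minimal sketch of the procedure above; `encoder`, `decoder` and `valuation_model` are hypothetical placeholders standing in for the trained generative model and the valuation model, not the API of any particular library:

```python
import numpy as np

def explain_prediction(x, encoder, decoder, valuation_model,
                       n_samples=100, radius=0.1, rng=None):
    """Probe how the valuation changes when we move around x in the latent space."""
    rng = np.random.default_rng() if rng is None else rng
    z = encoder(x)                                   # 2. encode the data point into Z
    noise = rng.normal(scale=radius, size=(n_samples, z.shape[-1]))
    z_neighbours = z + noise                         # 3. sample points around it in Z
    x_neighbours = decoder(z_neighbours)             # 4. transform them back to data space
    y_neighbours = valuation_model(x_neighbours)     # 5. see how the predictions change
    return x_neighbours, y_neighbours
```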

My intuition was that we needed such a model because of the high correlation between some of the variables describing a flat. For example, flat size and number of rooms are strongly correlated, as larger flats usually have more rooms. In our analysis we wanted to exclude flats that can't exist (that are unlikely or don't appear in the data distribution), e.g. a flat of 20 square meters with 10 rooms.

And then it struck me: is it possible to represent a discrete feature from the data space in a Euclidean latent space and, at the same time, be able to sample points from that continuous space without drawing points that are out of distribution in the data space?

Desired properties of a representation in latent variable deep generative models

In my case, and in many other situations, we are interested in both generating good samples from the model and having a convenient representation in the latent space. Expressing model "goodness" more formally, it is useful in many situations to have a generative model with all of the following properties (the list is inspired by Chapter 3 of [1]):

G.1 We can sample new data from the model.

G.2 We can encode any point x from the data distribution into the latent space Z using our model (e.g. we cannot do this with a GAN, which is an example of a latent variable model without an encoder).

G.3 Representation in the latent space is disentangled [1], which means that each factor of variation in the data space X is represented by one coordinate in the latent space Z.

G.4 We want to be as close as possible to the situation where each point z in the latent space represents one point x in the data space, so that we have a 1-to-1 correspondence between X and Z (this is also called cycle-consistency). It cannot be achieved exactly if we also want to fulfill requirement G.7.

G.5 A point sampled from Z should not represent a point in X that does not belong to the distribution of X.

G.6 Latent dimensions should be smooth. This means that if we, for example, encode hair color from light to dark blond, we want the distance in the latent space representation to be proportional to the distance in the data space for every point in the data distribution.

G.7 We also want small, meaningless noise in the data not to be encoded in the latent space (e.g. noise from an image sensor in photographs). This requirement prevents us from fulfilling requirement G.4.

G.8 Generated data covers all possible data points (lack of a mode collapse effect, as observed in [13] and [2]).

Latent variable models

Figure 1: Latent variable model.

In this blog post, we use the term latent variable model for a probabilistic graphical model in which a latent variable z is drawn from a distribution p(z) and the observation x is then drawn from p(x|z). The Bayesian network of this model is shown in Figure 1.

If we allow p(x|z) to be a deterministic function of z (i.e. there is only one value of x for each value of z, with probability 1), then all three main types of latent variable deep generative models (VAE [8][12], Normalizing Flows (NF) [11], and GAN [5]) fit into this framework.
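In code, drawing samples from such a model is just ancestral sampling: draw z from the prior and push it through the decoder. A minimal sketch, assuming a standard normal prior and a hypothetical `decoder` network:

```python
import torch

def sample_from_model(decoder, n_samples, latent_dim, noise_scale=0.0):
    """Ancestral sampling: draw z from the prior p(z), then x from p(x|z)."""
    z = torch.randn(n_samples, latent_dim)      # z ~ p(z), here a standard normal prior
    x = decoder(z)                              # deterministic part of p(x|z)
    if noise_scale > 0:                         # optional observation noise
        x = x + noise_scale * torch.randn_like(x)
    return x
```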

Properties of functions approximated by neural networks

To better understand how the topology of a latent space can affect the representation and the sampling process of a generative model, we have to take a look at the functions we use to approximate the relation between the variables x and z. Deep generative models use neural networks, and because of that there are some limitations on the class of functions f they can approximate. These limitations arise from the network structure and from the properties of the functions used as its building blocks:

N.1 In many practical cases f is bounded or can be bounded without loss of generality.

N.2 f is smooth almost everywhere.

N.3 The derivative of f is bounded almost everywhere (f is Lipschitz continuous), so you cannot approximate functions like 1/x or a step function with arbitrary precision.

These properties impose restrictions on the transformations between the X and Z spaces that we can approximate using neural networks. For example, we cannot use a neural network to approximate an arbitrary bijection between two manifolds, because a neural network is a continuous function; we are restricted to homeomorphisms between manifolds.
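As a small illustration of N.3, here is a sketch that computes a crude upper bound on the Lipschitz constant of a ReLU MLP as the product of the spectral norms of its weight matrices (the architecture below is an arbitrary example, not from the original post):

```python
import torch
import torch.nn as nn

# An arbitrary small ReLU MLP standing in for an encoder or decoder.
mlp = nn.Sequential(nn.Linear(2, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, 2))

# Because ReLU is 1-Lipschitz, the product of the spectral norms of the
# weight matrices upper-bounds the Lipschitz constant of the whole network.
bound = 1.0
for module in mlp:
    if isinstance(module, nn.Linear):
        bound *= torch.linalg.matrix_norm(module.weight, ord=2).item()

print(f"Upper bound on the Lipschitz constant: {bound:.2f}")
```

However large this bound is for a trained network, it is always finite, which is exactly why functions with unbounded derivatives (like 1/x near zero) are out of reach.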

I encourage the reader to read more about this topic in a blog post written by Christopher Olah about Neural Networks, Manifolds and Topology.

Manifold hypothesis

Many dimensionality reduction algorithms (e.g. PCA) assume that the data is sampled from a low-dimensional manifold of dimension M embedded in a high-dimensional space of dimension K. In machine learning this is known as the manifold hypothesis [4]. But it holds exactly only for data collected without any noise from the collecting device (for example a camera) or the collection process, and I am going to show why in this section.

In many cases we can assume that the data can be perfectly described by a vector of real numbers. For example, what we see in a photograph of a non-symmetrical rigid body with a static light source on a white background can be summarized as a latent vector of length 6 (two 3-dimensional vectors for the body position and orientation; the camera position and orientation are fixed). If the image is generated on a computer, it will always be the same for the same latent vector. If we use a real camera, a real object, and a real light source, a small amount of noise will be added to each pixel value, and two photographs taken for the same latent vector will no longer be exactly the same.

But what is the dimensionality of this manifold in the computer-generated rigid body example? The answer is simple: it is the same as the dimensionality of the latent vector. This is because the computer generating this 2D view uses a deterministic smooth function to render the photograph of the body, and we can construct an inverse function from a photograph back to the latent space. These two functions form a homeomorphism between the manifold of all possible object positions and orientations and the manifold of all possible photographs, and because of that the two manifolds must have the same dimensionality.

It is the same as when you create a parametric curve: you can write 1000 equations describing your manifold, but with one parameter and a continuous function you can only create 1D manifolds embedded in this 1000-dimensional space.

Figure 2: (left) 1D manifold embedded in 2D space, (right) the same manifold with Gaussian isotropic noise added.

The case with noise added to each pixel is different. When you add K-dimensional isotropic Gaussian noise to each data vector, the dimensionality of the data manifold changes from M to K. This is visualized for a 1D manifold in 2D space in Figure 2. Because in generative models with an encoder-decoder architecture we try to learn an invertible continuous mapping between the latent space and the data space, we cannot create mappings between certain types of sets and manifolds.
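To make this concrete, here is a minimal sketch that generates a 1D manifold embedded in 2D and then adds isotropic Gaussian noise, in the spirit of Figure 2 (the exact curve used in the figure is an assumption here):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
t = rng.uniform(0.0, 2.0 * np.pi, size=2000)      # the single latent parameter

# A 1D manifold embedded in 2D space.
clean = np.stack([t, np.sin(t)], axis=1)

# Adding 2D isotropic Gaussian noise makes the support of the data 2-dimensional.
noisy = clean + rng.normal(scale=0.1, size=clean.shape)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(clean[:, 0], clean[:, 1], s=2)
axes[0].set_title("1D manifold, no noise")
axes[1].scatter(noisy[:, 0], noisy[:, 1], s=2)
axes[1].set_title("with isotropic Gaussian noise")
plt.show()
```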

For example, we cannot create a continuous invertible mapping that transforms a circle into a line, or one that transforms a set of 2 points into a set of 5 points.

What is the true latent space distribution?

Just one more digression about generative models before we come back to the main topic of this blog post. When thinking about a latent variable model we often say that we want to recreate the "true" latent space structure, but it is not obvious what this really means. Let's assume that we have two data generating processes, described by Equations (1) and (2).

The first process, described by Equations (1), is the original data generating process, and the second one, described by Equations (2), is our model of this real process. If we sample from either of them, we obtain a set of points that are exponentially distributed along the line x₁ = x₂.

If we want to create a generative model that recreates the original data generating process, we have to assume something about the latent space distribution and then fit a function from t to x. And because we can only observe the data space, each of these models describes the data equally well. This holds for every model as long as there exists an invertible function, realizable by a neural network, that transforms the original latent space distribution into ours. This is why we cannot say which model represents the original data generating process: both of them fit the data perfectly.
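As a concrete illustration, here is one pair of processes with exactly this property; the specific functional forms below are my own assumption for illustration and need not match Equations (1) and (2):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 10_000

# Process A: the latent variable t is drawn directly from an exponential distribution.
t1 = rng.exponential(scale=1.0, size=n)
x_a = np.stack([t1, t1], axis=1)

# Process B: the latent variable z is drawn from a standard normal distribution and
# pushed through a smooth invertible map that turns it into the same exponential law.
z = rng.normal(size=n)
t2 = -np.log1p(-norm.cdf(z))     # Phi(z) is uniform on (0, 1), so t2 ~ Exp(1)
x_b = np.stack([t2, t2], axis=1)

# Both samples are exponentially distributed along the line x1 = x2, so looking only at
# the data we cannot tell which latent distribution is the "true" one.
print(x_a.mean(axis=0), x_b.mean(axis=0))
```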

But if such a mapping between the original latent space distribution and ours cannot be created, we may experience problems with some aspects of our generative model.

Problems with generative models

The issues described in the previous sections can lead to problems with latent variable deep generative models (like VAE or NF). These issues originate in the restrictions imposed by the topology of the data manifold, the topology of the latent space manifold, and the restrictions on the homeomorphisms that can be learned by neural networks. The problems are:

P.1 If the original latent space contains variables that are cyclic in nature (e.g. a rotation angle), they cannot be mapped onto a real line on which a normal distribution (or any other distribution defined on the real line) is defined. This problem was addressed, for example, in [3] (see the small numerical illustration after this list).

P.2 We also cannot represent a discrete distribution (e.g. the digit class in MNIST) on a real line, so you cannot fulfill property G.4 when using a real domain for all of your latent variables. Moreover, if you want to represent a discrete distribution in your model, you need exactly the same number of discrete values in your latent space.

P.3 Even when all N of your generating factors are defined on a real line, if your latent space is not a vector of length N you cannot fulfill properties G.4 and G.5. If your latent vector is too small, you cannot reconstruct all the data points from it, so your reconstruction error grows. And if your latent space is too big, it can lead to generating out-of-distribution samples, or many points in Z will represent the same data point. We are going to see both behaviors in my experiments.

P.4 The problem can be even worse when you want to create a generative model on a dataset with a varying number of generating factors between data points. This may cause problems when trying to train a model with a fixed-size latent space.
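A small numerical illustration of P.1 (the setup is purely illustrative): two points that are close on a circle end up far apart when the angle is represented directly on the real line, while the standard 2D embedding of the circle keeps them close.

```python
import numpy as np

# Two points that are very close on the circle but sit on opposite sides of the wrap-around.
theta_a, theta_b = 0.05, 2.0 * np.pi - 0.05

# Naive 1D latent: represent the angle directly on the real line.
dist_on_line = abs(theta_a - theta_b)                    # large, although the points are close

# 2D embedding of the circle, which a continuous encoder can represent without a jump.
embed = lambda t: np.array([np.cos(t), np.sin(t)])
dist_on_circle = np.linalg.norm(embed(theta_a) - embed(theta_b))   # small, as expected

print(f"real line: {dist_on_line:.3f}, circle embedding: {dist_on_circle:.3f}")
```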

Many of these problems arise when you try to train an encoder-decoder architecture, and this may be one of the reasons why GANs are so good at generating samples. They only have to train a decoder, so they don't have to learn a homeomorphism between the data space and the latent space; they only have to learn a surjection from the latent space to the data space. NF models may be good at generating faces because the number of data generating factors is roughly the same for each face, and almost all of them can be mapped to distributions defined on the real line (for example, length and color of the hair).

What next?

In the next blog post we will take a look at how a VAE behaves when trained on low-dimensional synthetic manifolds in data space and what problems we can run into when playing with them.

If you enjoyed this post, please hit the clap button below and follow our publication for more interesting articles about ML & AI.

References

[1] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.

[2] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136, 2016.

[3] Tim R Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M Tomczak. Hyperspherical variational auto-encoders. arXiv preprint arXiv:1804.00891, 2018.

[4] Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049, 2016.

[5] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.

[6] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2(5):6, 2017.

[7] Jiseob Kim, Seungjae Jung, Hyundo Lee, and Byoung-Tak Zhang. Encoder-powered generative adversarial networks. arXiv preprint arXiv:1906.00541, 2019.

[8] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[9] Soheil Kolouri, Phillip E Pope, Charles E Martin, and Gustavo K Rohde. Sliced-Wasserstein autoencoder: An embarrassingly simple generative model. arXiv preprint arXiv:1804.01947, 2018.

[10] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

[11] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

[12] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

[13] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.

[14] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.

[15] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.
