To Infinity and Beyond: Making Sense of High-Dimensional Data with Scalable Deep Generative Models

The promise of Variational Autoencoders that flexibly scale and integrate multiple data types to understand complex phenomena

Adú Matory, M.Sc.
17 min read · Jul 17, 2023

As a neuroscientist, I’m fascinated by how we can better understand brain states to improve the way people heal. Healing is a process that takes many forms — healing is found through medicine and through community, through therapy and through addressing systemic inequities. But no matter the form, healing is ultimately an embodied process. Health problems (and their solutions) undeniably make their mark on the brain, changing how we think and feel.

There’s a lot of incredible work being done that looks toward the brain to tell us how we can develop more effective therapies to better heal people in need. One of the biggest obstacles to this approach is that the brain is highly multidimensional. It does so many things that we struggle to put all the puzzle pieces together. It can be incredibly challenging to identify unhealthy brain patterns, understand their causes, and then pinpoint what can and needs to change. Fortunately, we live in the age of big data. Data analysis tools have propelled our understanding of brain mechanisms to new heights and revolutionized how we interpret what is going on in the brain.

One tool that has substantial potential to catalyze progress in this mission is the variational autoencoder (VAE), capable of learning vital information from datasets and generating new data. Combining two types of VAEs — multimodal and infinite-dimensional — promises the efficient capture of critical information about a wide range of brain states, or, more generally, any complex phenomena. By adaptively scaling and learning from datasets that include different types of information, infinite-dimensional multimodal VAEs have the potential to advance development not only in brain-focused therapies but in a number of technical fields as well.

But what are infinite-dimensional, multimodal VAEs? And how can they be useful? Assuming basic familiarity with machine learning concepts, we’ll examine their building blocks and dive into the architecture and mathematics behind them. We’ll explore how they can be applied to solve problems related to brain health and ultimately discuss the challenges involved in successfully implementing this new VAE variant at scale.

Photo by Matthew Henry on Unsplash

What’s so great about VAEs?

Variational Autoencoders (VAEs) are powerful machine learning models that can learn to compress high-dimensional data, which is useful for tasks like classification, segmentation, and prediction.

VAEs are useful across a number of fields, like biomedical engineering, natural language processing, and robotics, to name a few. If we want to compress some data, detect anomalies, synthesize data, or make inferences about processes that generated the data, VAEs can be the state-of-the-art solution. Whether it’s for disease detection and diagnosis, generating synthetic medical images or patient data, image and video generation, restoration, and enhancement, document modeling, or state estimation and control of autonomous agents, VAEs are there for us ❤

What’s their aim?

Simply put, the aim of a VAE is to approximate a probabilistic data generation process.

Autoencoders in general aim to compress and decompress data by mapping higher-dimensional input data to a lower-dimensional but fine-grained representation that exists in latent space. Autoencoders are latent variable models, which assume that observable data is generated by unobservable (aka latent) processes. Variational autoencoders are additionally generative models, which means they are designed to generate new data that is similar, but not identical, to the original data. While autoencoders only learn a deterministic mapping between the input and the output (i.e. the same input yields the same output), VAEs learn a probabilistic mapping, such that they can generate new data by sampling from the latent space.

For example, a VAE that is trained on a dataset of images of dogs can generate images of dogs that have never existed in our world and sometimes feel uncanny. VAEs’ probabilistic nature also allows them to “fill in the blanks”, remove noise, and impute missing data.

Photo (or is it?) by andrew welch on Unsplash (…it’s actually a photo.)

VAEs accomplish their aim by estimating (1) the probability of the latent variables given the dataset’s observable features (known as the posterior, which the encoder distribution approximates), and (2) the probability that those latent variables generate the corresponding observed features (known as the likelihood, which the decoder distribution models). Here, probability is represented as a distribution (not a single percentage). It’s also useful to define the probability of variables that exist in latent space (known as the prior distribution).
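In the notation used later in this article (a notational sketch, where x denotes an observed data point and z its latent representation):

```latex
q_{\phi}(z \mid x) \;\; \text{(encoder, approximates the posterior)} \qquad
p_{\theta}(x \mid z) \;\; \text{(decoder, the likelihood)} \qquad
p(z) \;\; \text{(prior over the latent space)}
```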

Data flow through a variational autoencoder. The original input data exists in x-space. Z-space (aka latent space) is where compressed data lives. (credit: Kingma & Welling, 2019)

We say the encoder and decoder distributions are approximations because the true distributions are unknowable. In practice, a dataset almost never contains all the instances of the phenomenon it captures, and we cannot be 100% certain that the latent variables estimated by the encoder actually generated our observed data, so we can only estimate the data generation process.

How do VAEs learn?

To understand more about how a VAE learns, we have to understand its three main components: the encoder, the decoder, and the evidence lower bound.

  1. The encoder maps input data to a latent space representation. This is achieved by the approximation of the data’s encoder distribution q_ϕ. The encoder consists of multiple layers of a neural network with weights ϕ that transform the input data into latent space. The mapping the neural network learns is the encoder distribution. A latent space representation of a data point consists of a set of variables, or vectors of sufficient statistics that parameterize the distribution used for the latent space. This distribution is set a priori, before the VAE is trained. With traditional VAEs, a Gaussian distribution is often used, which yields an output of a mean vector μ and a standard deviation vector σ.
  2. The decoder maps latent space representations back to the form of the input data. This is achieved by the approximation of the decoder distribution p_θ. Like the encoder, the decoder consists of multiple layers of a neural network with weights θ that transform the latent space representations back to the input space. The output of the decoder is a reconstruction of the original input data.
  3. The evidence lower bound (ELBO) is a number that summarizes the generalizability and accuracy of what a VAE has learned. More technically, its negative serves as the loss function during training, measuring the quality of reconstructions and the regularization of the latent space. The loss includes two parts: (1) a reconstruction loss, which measures the difference between the original input data and its reconstruction, and (2) a regularization loss, which discourages overfitting of the latent space to the input data.
The architecture of a variational autoencoder. Here, hidden denotes the latent space. (credit: blog.fastforward.com)

Training a VAE involves gradient descent, where weights in the neural network layers of the encoder and decoder are iteratively adjusted to maximize the ELBO (in practice, to minimize the negative ELBO used as the loss). Using the ELBO to adjust the weights ϕ and θ of the encoder and decoder effectively ensures that the approximated encoder distribution is as close as possible to the true (but unknowable) posterior distribution of the actual latent variables that generated the observed data.

VAEs learn to distill important features from input data by adjusting the weights of their encoder and decoder, with the objective of maximizing the evidence lower bound over noisy data.
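To make these components concrete, here is a minimal sketch of a Gaussian VAE’s encoder and decoder in PyTorch. The layer sizes, the flattened 784-dimensional input (e.g. a 28x28 image), and the module names are illustrative assumptions rather than a reference implementation; the sampling step and loss follow in the next section.

```python
import torch
import torch.nn as nn

class GaussianVAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=20):
        super().__init__()
        # Encoder: maps input data x to the parameters of q_phi(z|x)
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.to_mu = nn.Linear(hidden_dim, latent_dim)      # mean vector mu
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)  # log-variance, for numerical stability
        # Decoder: maps a latent vector z back to the input space
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.to_mu(h), self.to_logvar(h)

    def decode(self, z):
        return self.decoder(z)
```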

The math behind the ELBO

Let’s talk a bit about the math that supports how the VAE learns. In the equation for the ELBO below, we see that the regularization loss is minimized by minimizing the Kullback-Leibler divergence (D_KL; a measure of how different one probability distribution is from another) between our data’s approximated encoder distribution and the prior distribution; this term depends only on the encoder parameters ϕ. The reconstruction loss, or the negative sum of the log-likelihood (a measure of how well parameters in the model fit the dataset) over S random data points, is minimized when a latent variable z^_i can perfectly reconstruct its input data point x_i; this term depends on the decoder parameters θ.

The ELBO, given encoder parameters φ, decoder parameters θ, and an input data point x_i
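As a sketch of how this loss can be computed for the Gaussian VAE above: the KL divergence against a standard normal prior has a closed form, and the reconstruction term is shown here as binary cross-entropy (an assumption corresponding to a Bernoulli decoder over data scaled to [0, 1]).

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_recon, mu, logvar):
    # Reconstruction loss: how poorly the decoder reproduces the input
    recon_loss = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # Regularization loss: D_KL between q_phi(z|x) = N(mu, sigma^2) and the prior N(0, I)
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # Minimizing this sum maximizes the ELBO
    return recon_loss + kl_loss
```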

VAEs shine when it comes to regularization of their latent space, which enables the generation of new data that looks real. A well-regularized latent space should be (a) continuous, such that two points close together in the latent space (according to some distance measure, like Euclidean distance) give similar output once decoded, and (b) complete, such that a point sampled from the latent distribution gives meaningful content once decoded. Sampling from that latent space is made trainable by the reparameterization trick, where a noise term ϵ is drawn from a fixed noise distribution and combined with the encoder’s outputs to produce the latent variables passed to the decoder.

In the case of a Gaussian VAE, we can reparameterize our mean vector μ and standard deviation vector σ with our noise term ϵ to obtain the sampled latent vector z, which is input to our decoder network.

Sampling a latent vector from a Gaussian VAE using the reparameterization trick
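A minimal sketch of that sampling step in PyTorch, continuing with the log-variance convention used above:

```python
import torch

def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)  # standard deviation sigma
    eps = torch.randn_like(std)    # noise term epsilon drawn from N(0, I)
    return mu + std * eps          # sampled latent vector z
```

Because the randomness is isolated in ϵ, gradients can flow through μ and σ back into the encoder during training.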

Now that we’ve covered how traditional VAEs learn meaningful, low-dimensional representations of data, let’s talk about some interesting VAE variants — multimodal and infinite-dimensional.

The Multimodal Variational Autoencoder

The multimodal variational autoencoder (MVAE) makes inferences from multiple data modalities (i.e. types of input data). A modality can be an image, a text, an audio recording, a brain signal time series, etc. With MVAEs, the latent space comprises a joint distribution over the multiple data modalities. Its architecture enables data fusion (integrating information from multiple data sources).

Data flow through a multimodal VAE capable of reconstructing the missing pixels and description of an image (Geng et al., 2022)

The Infinite-Dimensional Variational Autoencoder

Infinite-dimensional models can flexibly scale to represent complex, high-dimensional data. Stick-breaking variational autoencoders (SB-VAEs) are an example of an infinite-dimensional VAE, able to learn a flexible and scalable distribution over high-dimensional data. What’s key here is the use of a stick-breaking process (the constructive procedure behind the Dirichlet process), which generates a probability distribution over an infinite-dimensional latent space.

The mathematics of infinite dimensions

To explain in more depth: the stick-breaking process generates a sequence of weights that define a probability distribution over an infinite-dimensional latent space.

In this process, a stick of length one is broken into an infinite number of pieces (aka stick-breaking weights) π by repeatedly breaking off a random fraction of what remains. The fraction broken off in each step is decided by drawing a random sample v from a beta distribution Beta(1, α) with concentration parameter α. Each piece π is calculated as the fraction v drawn in that step multiplied by the length of stick still remaining, i.e. the product of (1 - v) over all previous steps. This results in a probability distribution that tends to assign more weight to the lower dimensions of the latent space.

A stick-breaking process, which can generate infinite stick-breaking weights π.
Constraints for stick-breaking weights
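Here is a sketch of the stick-breaking construction in PyTorch. Truncating at a finite number of dimensions K and the choice of α are illustrative assumptions; the process itself is defined over infinitely many pieces.

```python
import torch
from torch.distributions import Beta

def stick_breaking_weights(alpha=5.0, K=50):
    # Fractions broken off at each step: v_k ~ Beta(1, alpha)
    v = Beta(torch.ones(K), torch.full((K,), alpha)).sample()
    # Length of stick remaining before each break: prod_{j<k} (1 - v_j)
    remaining = torch.cat([torch.ones(1), torch.cumprod(1 - v, dim=0)[:-1]])
    # Stick-breaking weights: earlier dimensions tend to receive more weight
    return v * remaining

print(stick_breaking_weights().sum())  # sums to just under 1
```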

This process enables the latent space to adapt to the complexity of the input data while capturing vital information about its underlying structure. Further, the weights generated by the stick-breaking process can be used to partition the latent space into subspaces, where each subspace captures a different aspect of the data.

So how does an infinite-dimensional VAE perform?

Let’s look at an example of an SB-VAE trained on the MNIST dataset, which consists of images of handwritten digits (something all of us are familiar with). We can see how well an SB-VAE captures subspaces in the data and how it scales its dimensions to capture important features. Sampling from Dirichlet distributions of increasing dimensionality, we see how an SB-VAE finds structure in the digits. The sixth dimension seems to capture “1”, the second and third dimensions seem to capture “7” and “9”, the eighth dimension seems to capture the “3” class, and the seventh dimension models notably thick digits. Sampling from all 50 dimensions in the SB-VAE, we see output comparable to sampling from a traditional (Gaussian) VAE.

MNIST digits generated by sampling from different dimensions in an SB-VAE’s latent space (credit: the author)

Applying clustering methods to the latent space of an SB-VAE demonstrates how these digits occupy separable subspaces in latent space. The image below shows the results of a tSNE projection (a way to visualize high-dimensional data in two dimensions) on a traditional VAE whose latent space uses a Gaussian distribution (left) and an SB-VAE (right), with each data point colored according to the digit it represents. Even in the resulting clouds of data points, it’s clear that the digits form tighter clusters in the SB-VAE than in the traditional VAE.

tSNE projections of the latent spaces of a traditional Gaussian VAE (left) and an SB-VAE (right). Each data point is color-coded according to the digit it represents in the MNIST dataset (credit: the author)
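For readers who want to reproduce this kind of plot, here is a sketch using scikit-learn’s tSNE; latent_codes and labels are hypothetical placeholders standing in for encoded MNIST data points and their digit classes.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholders: in practice, latent_codes would come from the trained encoder
latent_codes = np.random.randn(1000, 50)      # (N, latent_dim) latent vectors
labels = np.random.randint(0, 10, size=1000)  # digit class of each data point

embedding = TSNE(n_components=2).fit_transform(latent_codes)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=5)
plt.colorbar(label="digit")
plt.show()
```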

In a nutshell, multimodal VAEs model the relationships between different types of data and infinite-dimensional VAEs flexibly scale to capture underlying structure in complex data.

But how can these be used to understand the brain?

These VAE variants both have strengths that could provide hints toward understanding the hidden mechanics of the brain, a major struggle for us in the neuroscience community! So far, there hasn’t been any published brain research using infinite-dimensional VAEs, and there’s been very little using MVAEs. That said, what we’ve seen in this under-explored avenue of research gives hope for an exciting future.

If we consider MVAEs alone for a second, we can see their potential to deepen our understanding of what goes on in the brain. MVAEs can be used to jointly analyze neuroimaging modalities like magnetic resonance imaging (MRI), positron emission tomography (PET), and electroencephalography (EEG) data and model the complex relationships between them. MVAEs have already shown promise in the analysis of neuroimaging data for mental health and psychiatry. Some successful (and potentially successful) applications include:

A sequence of structural MRI brain scans, which could be used as input data for a VAE (credit: the author)
  1. Accuracy of diagnosis and classification of brain disorders can be improved through learning joint representations of several modalities vs. just one. A recent study used an MVAE to learn a joint representation of structural and functional MRI data to improve the diagnostic accuracy of disease staging in Alzheimer’s disease. This technology, further developed, could help doctors around the world better estimate and explain to patients how a disease will progress or remit.
  2. Personalized treatments might be better planned based on an individual’s unique characteristics, such as their genetic profile, imaging data, and clinical measures. For example, joint representation of clinical data and structural MRI data might enable predictions about which patients with depression will most likely benefit from cognitive-behavioral therapy versus medication, or optimal brain stimulation targets for Parkinson’s disease. If successful, this could help clinicians deliver treatments with the highest chances of success, saving time, money, and grief for both patients and insurers.
  3. Neural mechanisms that underlie mental disorders can be better elucidated by investigating meaningful representations of the underlying structure of neuroimaging data. A recent study used an MVAE to learn a joint representation of resting-state functional MRI and diffusion MRI data to identify disrupted brain connectivity patterns in patients with schizophrenia. This hints toward the use of VAEs to uncover underlying neural mechanisms of neuropsychiatric disorders and inform the development of new interventions.
  4. Generating synthetic imaging data can be used to share and learn from brain data in a way that protects the identities of the people who were scanned. I’ve previously discussed health data privacy as a growing concern for an increasingly data-driven world. A recent study generated realistic but synthetic imaging trajectories for Alzheimer’s disease using an MVAE trained on both imaging and non-imaging data. The development of methods that enable researchers to share data that contain relevant features for health and wellness and pose no possibility of being used to re-identify subjects would do a lot to prevent non-consensual use of sensitive data.

An infinite-dimensional multimodal VAE might further help identify patterns of brain activity associated with (un)healthy cognitive processes, altered states of consciousness, or future response to a therapy.

The architecture of the VAE also enables unsupervised and semi-supervised learning, where a model learns to represent data even when explicit labels are missing for all or some of the data. This could be particularly useful in the analysis of neuroimaging data where precise labels of brain states are often limited or difficult to obtain.

When it comes to understanding the brain, the most complex thing we’ve discovered in our universe, we need new perspectives and tools to understand how it organizes itself and structures knowledge to generate richness (and sickness) in our lived experiences. These few examples point to possibilities for progress. A beautiful thing is that the application of MVAEs doesn’t stop with brain imaging data — they can be used to identify core attributes, categorize different phenomena, and make predictions from practically any data.

The Challenge: Calculating the ELBO

It’s clear that using an infinite-dimensional latent distribution with an MVAE could have some very powerful applications. So what’s stopping us from implementing one? In reality, combining MVAEs and infinite-dimensional latent spaces might not be as easy as it seems. Things get pretty technical from here, so hold onto your seats…

The biggest challenge is finding a parameterization for an infinite-dimensional distribution whose multimodal ELBO has an efficient analytic solution.

Considering MVAEs alone, there are a few approaches to calculate the ELBO. Every approach mentioned below involves training unimodal encoders to learn an encoder distribution for each modality, but they differ in how they define the joint encoder distribution q(z|X), how evidence from each modality is weighted in their ELBO equations, and how computationally expensive they are (here, ordered from least to most).

  • In the Product of Experts (PoE) approach, the joint encoder distribution q(z|X) is calculated as a product of a ‘prior expert’ p(z) and ‘unimodal experts’ q(z|x_i), which are assumed to be conditionally independent of each other. To reduce computational cost during training, a sub-sampling method can be used, where the full ELBO can be calculated as the sum of smaller ELBO terms derived from observations that are both whole (all modalities present) and partial (some modalities missing). In practice, we can sum (1) an ELBO over the full joint encoder distribution, (2) ELBOs over each unimodal term, and (3) ELBOs over k joint encoder distributions calculated from k randomly chosen subsets of modalities to get our final loss function. (A code sketch of the Gaussian product behind PoE appears after this list.)
The Product of Experts (PoE) ELBO (credit: Wu & Goodman, 2018)
  • The Mixture of Experts (MoE) approach alternatively formulates the joint encoder distribution as a mixture, or a sum, of unimodal encoder distributions. In the ELBO, this breaks down into the sum of two terms calculated from a sampled latent data point z_i from every modality. The first term is the log of the joint decoder distribution p_θ(z, x_1:M) divided by the approximated joint encoder distribution q_ϕ(z|x_1:M). The second term can be understood as an importance-weighted estimate of the log-likelihood of the latent variables given the weights of that modality’s encoder distribution q_ϕ(z|x_i).
The Mixture of Experts (MoE) ELBO (credit: Shi et al., 2019)
  • The Mixture of Product of Experts (MoPoE) fuses concepts from both these methods, first combining unimodal encoder distributions as a PoE, then computing an MoE over all subsets of those distributions. The joint encoder distribution in the regularization term is then the mixture of products (of subsets) of experts q_ϕ(z|X_k). The other ELBO term can be decomposed as a sum of marginal log-likelihoods of individual sets of experts.
The Mixture of Product of Experts (MoPoE) ELBO (credit: Sutter et al., 2021)

Roughly speaking, when we use a PoE ELBO, we teach our MVAE about what is mostly unanimously agreed upon by ‘experts’ that have each learned the latent space distributions of one specific type of data. With MoE, we take a more inclusive tally based on our experts’ individual knowledge. With MoPoE, we do something in between.
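To make the PoE idea concrete, here is a sketch of how Gaussian experts are combined: the product of Gaussians is itself a Gaussian whose precision (inverse variance) is the sum of the experts’ precisions, with a precision-weighted mean. The variable names and the inclusion of a standard normal prior expert are assumptions for illustration.

```python
import torch

def product_of_experts(mus, logvars):
    # mus, logvars: lists of (batch, latent_dim) tensors, one per observed modality
    # Prepend the prior expert N(0, I): mean 0, log-variance 0
    mus = [torch.zeros_like(mus[0])] + list(mus)
    logvars = [torch.zeros_like(logvars[0])] + list(logvars)

    precisions = [torch.exp(-lv) for lv in logvars]  # 1 / sigma_i^2
    joint_precision = sum(precisions)
    joint_var = 1.0 / joint_precision
    # Precision-weighted combination of the experts' means
    joint_mu = joint_var * sum(m * p for m, p in zip(mus, precisions))
    return joint_mu, torch.log(joint_var)
```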

Sounds great! So what’s the issue?

Because all of these approaches require taking products (or quotients) of the encoder distributions and the prior, the ELBO only has a tractable solution when those distributions are Gaussian. The product or quotient of two Gaussians is itself a Gaussian whose parameters can be easily calculated. However…

When building an infinite-dimensional VAE, we cannot use Gaussian distributions alone. We need to use differentiable non-centered distributions that can approximate the Beta distribution used to generate the length of each broken piece of stick in the stick-breaking process. Finding a tractable product or quotient over an eligible distribution (like a Logit-Normal or Kumaraswamy distribution) is not trivial, but it is key to calculating the multimodal ELBO.
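One workaround used in the stick-breaking VAE literature is the Kumaraswamy distribution, which closely resembles the Beta distribution but has a closed-form inverse CDF, so samples can be drawn by transforming uniform noise in a fully differentiable way. A minimal sketch (a and b are positive tensors playing roughly the role of the Beta shape parameters):

```python
import torch

def sample_kumaraswamy(a, b, eps=1e-7):
    # Inverse-CDF sampling: u ~ Uniform(0, 1), x = (1 - (1 - u)^(1/b))^(1/a)
    u = torch.rand_like(a).clamp(eps, 1 - eps)
    return (1 - (1 - u).pow(1.0 / b)).pow(1.0 / a)
```

This keeps the sampling step differentiable, but it does not by itself solve the problem of taking tractable products or quotients of these distributions for the multimodal ELBO.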

Further, the PoE approach also suffers from issues when handling missing modalities — sub-sampling from unimodal log-likelihoods to generate multimodal data does not guarantee a valid lower bound on the joint log-likelihood. With PoE, each expert has strong veto power — if even one of the unimodal encoder distributions has low density for a given set of observations, the entire joint distribution will have low density. In other words, the approach is more prone to fall apart with the inclusion of any modality that is uncertain or noisy.

On the other hand, MoE only uses unimodal encoder distributions during training and essentially takes a vote among experts, spreading the density of their joint distribution over all the individual experts. However, it does not have a strong capacity to generate multimodal data. It can effectively learn unimodal encoder distributions and translate one modality into another, but compared to the other models, a higher log-likelihood is only achieved when there is one input modality present. It’s also a bit more computationally expensive than PoE, requiring M² passes over the model to calculate the joint generative model.

The MoPoE serves to circumvent the weaknesses of both, but calculating a joint encoder distribution for every subset of modalities requires 2^M passes over the model and can become prohibitively expensive when dealing with an increasing number of modalities.

Each method to calculate the multimodal ELBO comes with trade-offs between how well joint information is integrated, how well new data is generated, and how computationally efficient the calculation is.

Without a way to easily calculate the ELBO over an infinite-dimensional multimodal VAE, the widespread adoption of this seductive variant of VAEs remains out of reach. But finding a solution offers the promise of a powerful, scalable inferential tool for anyone who’s ready to take on the challenge.

Photo by Dennis Kummer on Unsplash

The takeaway

In the age of big data, we’re often faced with more data than we know what to do with. Neuroscience and many other technical fields that rely on data analysis can benefit from new methods to make useful inferences from a plethora of data sources. An exciting direction of research is the VAE, a deep learning model that can learn meaningful, low-dimensional representations of data and approximate data generation processes. Variants of the VAE offer unique advantages — multimodal VAEs can learn relationships between different types of data and infinite-dimensional VAEs can flexibly scale to capture crucial information from data. Combining these variants, infinite-dimensional multimodal VAEs promise improved inferential power when we look for parsimonious explanations for our data.

While infinite-dimensional multimodal VAEs have great potential to deepen our understanding of what generates phenomena of unknown complexity (like what goes on in someone’s head), progress in this research direction will require the further development of computationally efficient ways to formulate how they learn.

Generally, research into deep learning methods that can leverage joint information across multiple modalities and scale to accommodate data with an unknowably complex underlying structure could lead to progress in many fields. With regard to mental health and psychiatry, it may advance our understanding of mental disorders, improve diagnostic accuracy, and inform clinical decision-making, ultimately improving patient care.

But let’s not forget the determinants of mental health don’t just exist in the brain — they exist in the fabric of society as well. Health disorders arise from systemic factors like structural poverty, discrimination based on sexual orientation, and racism. And the improvement of clinical treatments can only do so much to change structural inequity. Still, I’m hopeful that the incorporation of systemic factors in models of brain health can provide evidence of where interventions need to happen outside of the body, in order to most effectively treat the root causes of brain illness. A future where we can make more sense of health data with powerful models like VAEs and use those findings to promote health equity is a hopeful one.

Hi, I’m Adú, a data scientist who’s passionate about improving mental health awareness, accessibility, and treatments. Have questions or want to connect? Feel free to contact me at adu@adumatory.com.
