Variational Recurrent Neural Networks — VRNNs

Naman Rastogi · AIGuys · Sep 3, 2021

If you want to model reality, then uncertainty is what you can trust the most to achieve that.


In this blog, we are going to explore an insightful merger of two significant stars in deep learning — Recurrent Neural Networks (RNNs) and Variational Autoencoders (VAEs).

The topic requires a good level of intuition, and a fundamental understanding of sequential modeling and generative learning will really help you visualize what is actually happening under the hood. This blog assumes that readers are familiar with random variables, latent random variables, conditional and joint probability distributions, variational inference, etc. In short, an understanding of RNNs and VAEs is required.

Before moving ahead, let’s quickly go over the notation used in this blog: h_t denotes the hidden state of the RNN at timestep t, h_(t-1) denotes the hidden state at timestep t-1, x_t denotes the input at timestep t, and z_t denotes the latent random variable at timestep t.

So, let’s get started.

The Need And The Background

This blog is based on the seminal work of Chung et al. — A Recurrent Latent Variable Model for Sequential Data [1].

First of all, why VRNN? — It is the result of an attempt to introduce latent random variables into the hidden state of the RNN by combining it with the elements of a variational autoencoder.

Why ‘RNN’ for ‘VRNN’?

Learning generative models for sequences is a very challenging task. Significant work in this direction has come from Dynamic Bayesian Networks (DBNs) such as Hidden Markov Models (HMMs) and Kalman filters, but the dominance of DBN-based approaches has recently been overturned by interest in recurrent neural network-based approaches. RNNs are special in the sense that they can handle both variable-length inputs and outputs, and by training an RNN to predict the next output in a sequence, given all the previous outputs, it can be used to model the joint probability distribution over sequences.

RNNs possess both a richly distributed internal state representation and flexible non-linear transition functions (which determine the evolution of the internal hidden state). This gives them high expressive power, and as a consequence, RNNs have gained significant popularity as generative models for highly structured sequential data such as natural speech.

By highly structured data, the authors mean that the data is characterized by two properties. Firstly, there is a relatively high signal-to-noise ratio, meaning that the vast majority of the variability observed in the data is due to the signal itself and cannot reasonably be considered as noise. Secondly, there exists a complex relationship between the underlying factors of variation and the observed data. For example, in speech, the vocal qualities of the speaker have a strong but complicated influence on the audio waveform, affecting the waveform in a consistent manner across frames.

A very fundamental breakdown of the RNN for modeling purposes is given as:

1. Transition function that determines the evolution of the internal hidden state:

h_t = f_θ(x_t, h_(t-1))

where the subscript θ is the parameter set of the transition function f.

2. Sequential modeling by parameterizing a factorization of the joint sequence probability distribution as a product of conditional probabilities:

p(x_1, x_2, …, x_T) = ∏_{t=1}^{T} p(x_t | x_1, …, x_(t-1)), with p(x_t | x_1, …, x_(t-1)) = g_τ(h_(t-1))

where g is a function that maps the RNN hidden state h_(t-1) to a probability distribution over possible outputs, and the subscript τ is the parameter set of g.
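To make this concrete, here is a minimal sketch (in PyTorch, not the authors’ code) of this standard generative RNN setup: a GRU cell plays the role of the transition function f_θ, and a linear layer plays the role of g_τ, mapping h_(t-1) to the parameters of a Gaussian over x_t. The layer sizes and the Gaussian output model are illustrative assumptions.

```python
# A minimal sketch of a standard generative RNN:
# f_theta updates the hidden state, g_tau maps h_(t-1) to the parameters of
# the output distribution p(x_t | x_<t).
import torch
import torch.nn as nn

class GenerativeRNN(nn.Module):
    def __init__(self, x_dim=1, h_dim=64):
        super().__init__()
        self.f_theta = nn.GRUCell(x_dim, h_dim)       # transition function f_theta
        self.g_tau = nn.Linear(h_dim, 2 * x_dim)      # h_(t-1) -> (mu, log_var) of a Gaussian over x_t

    def forward(self, x):                             # x: (batch, T, x_dim)
        batch, T, x_dim = x.shape
        h = torch.zeros(batch, self.f_theta.hidden_size)
        nll = 0.0
        for t in range(T):
            mu, log_var = self.g_tau(h).chunk(2, dim=-1)   # p(x_t | x_<t) = g_tau(h_(t-1))
            dist = torch.distributions.Normal(mu, torch.exp(0.5 * log_var))
            nll = nll - dist.log_prob(x[:, t]).sum(-1).mean()
            h = self.f_theta(x[:, t], h)                   # h_t = f_theta(x_t, h_(t-1))
        return nll
```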

The Problem With Regular RNN

Non-determinism is very important if you want to capture the variability or randomness in the data, which in turn lets you generate the underlying distribution that characterizes your data (here, sequential data).

From VAEs, we know that sampling from the distribution over the latent random variable z is what introduces non-determinism into the VAE, which in turn lets us capture the variability in the data and leads to the generation of the desired output distribution describing that variation.

Basic architecture of a Variational Autoencoder
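As a quick reminder of where that non-determinism enters, here is a minimal sketch of the reparameterized sampling step of a VAE, assuming a diagonal Gaussian posterior whose mean and log-variance come from the encoder network:

```python
# A minimal sketch of where the non-determinism in a VAE comes from:
# the reparameterized sample z = mu + sigma * eps, with eps ~ N(0, I).
import torch

def sample_latent(mu, log_var):
    eps = torch.randn_like(mu)            # the only stochastic ingredient
    return mu + torch.exp(0.5 * log_var) * eps
```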

But the internal transition structure of the RNN is entirely deterministic. The only source of randomness or variability in a regular RNN is found in the conditional output probability model (either a unimodal distribution or a mixture of unimodal distributions), which falls short when modeling the kind of variability observed in highly structured data, such as natural speech, which is characterized by strong and complex dependencies among the output variables at different timesteps. Experiments also show that a regular RNN alone is not very good at modeling the variation in such data.

Therefore, there is a need to incorporate randomness, or a non-deterministic factor, into the hidden state of the RNN so that it can hopefully capture the required variability in the sequential data along with the temporal dependencies.

Why Variational Autoencoder — The ‘V’ in ‘VRNN’

A variational autoencoder (VAE) is a very good example of a deep generative probabilistic graphical model that does a good job of capturing the variability in the input data and generating a distribution over it that summarizes that variability. It offers an interesting combination of a highly flexible non-linear mapping between the latent random state and the observed output, and effective approximate inference. It can also model complex multimodal distributions, which helps when the underlying true data distribution consists of multimodal conditional distributions.

Remember that a few lines above, I emphasized the ‘need to incorporate randomness, or a non-deterministic factor, into the hidden state of the RNN’ — the VAE is the way through which this non-determinism can be gifted to the hidden state of the RNN.

Let’s see how.

The VRNN — Variational Recurrent Neural Network

From the discussion above, we have an overview of what VRNN is fundamentally about — it is an extension of the VAE into a recurrent framework for the purpose of modeling high-dimensional sequences.

The proposal is to incorporate the latent random variables in the hidden state of the RNN to model the variability observed in the sequential data. This is also facilitated by modeling the dependencies between the latent random variables across the timesteps.

“The VRNN contains a VAE at every timestep, and these VAEs are conditioned on the state variable h_(t-1) of an RNN.” This helps the VAE take into account the temporal structure of the sequential data.

A high-level structure of VRNN
A detailed view of a cell at timestep t of VRNN

Prior Of The Latent Random Variable:

Recall that in a standard VAE, the prior distribution we assume for the latent random variables is a standard Gaussian, and the posterior is approximated by another distribution whose parameters are learned through a neural network. But here, in the case of VRNN, there is a twist: the prior distribution of the latent random variables is no longer a standard Gaussian; rather, it follows a distribution given as:

z_t ~ N(μ_(0,t), diag(σ²_(0,t))), where [μ_(0,t), σ_(0,t)] = φ_τ^prior(h_(t-1)) … (A)

that is, the mean and variance are the parameters of the distribution, produced by the above-mentioned function φ_τ^prior of h_(t-1), which is usually a neural network.


Therefore, by depending on the hidden state h_(t-1), the prior distribution of the latent random variable at timestep t transitively depends on all the preceding inputs, which improves the representational power of the model.
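A minimal sketch of such a prior network — a small MLP producing the mean and log-variance of z_t from h_(t-1); the layer sizes and ReLU activations are illustrative assumptions:

```python
# A sketch of the prior network phi_prior: h_(t-1) -> parameters of p(z_t | h_(t-1)).
import torch
import torch.nn as nn

class PriorNet(nn.Module):
    def __init__(self, h_dim=64, z_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(h_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.log_var = nn.Linear(h_dim, z_dim)

    def forward(self, h_prev):                 # h_prev: (batch, h_dim)
        hidden = self.net(h_prev)
        return self.mu(hidden), self.log_var(hidden)   # parameters of N(mu_0t, diag(sigma_0t^2))
```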

Generation — Observed Output Distribution:

From the standard RNN, we know that the generation of x_t depends only on h_(t-1), but in VRNN it is made to depend on both z_t and h_(t-1).

i.e.,

x_t | z_t ~ N(μ_(x,t), diag(σ²_(x,t))), where [μ_(x,t), σ_(x,t)] = φ_τ^dec(φ_τ^z(z_t), h_(t-1)) … (B)

where the inner function φ_τ^z(z_t) works as a feature extractor that helps in learning complex sequences, and the outer function φ_τ^dec is the generating (decoder) function that jointly takes into account z_t and h_(t-1). Both of these functions are neural networks.
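A minimal sketch of equation (B), with a hypothetical phi_z feature extractor feeding a decoder MLP; again, the layer sizes and Gaussian output model are illustrative assumptions:

```python
# A sketch of the generation path: phi_z extracts features from z_t, and the
# decoder phi_dec maps (phi_z(z_t), h_(t-1)) to the parameters of p(x_t | z_t, h_(t-1)).
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, x_dim=1, h_dim=64, z_dim=16):
        super().__init__()
        self.phi_z = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU())   # feature extractor for z_t
        self.dec = nn.Sequential(nn.Linear(2 * h_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, x_dim)
        self.log_var = nn.Linear(h_dim, x_dim)

    def forward(self, z_t, h_prev):
        hidden = self.dec(torch.cat([self.phi_z(z_t), h_prev], dim=-1))
        return self.mu(hidden), self.log_var(hidden)   # parameters of N(mu_xt, diag(sigma_xt^2))
```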

Updating The Hidden State:

Unlike standard RNNs, where the hidden state at timestep t, i.e., h_t, depends only on the previous hidden state h_(t-1) and x_t, in VRNNs the hidden state additionally depends on z_t.

Formally,

h_t = f_θ(φ_τ^x(x_t), φ_τ^z(z_t), h_(t-1)) … (C)

This equation ensures that both the variability of the data and the temporal dependence between the latent random variables across timesteps are captured. The inner function φ_τ^x(x_t) works as a feature extractor, just as φ_τ^z does for z_t.

From equation C, we observe that h_t is a function of all previous inputs and latent random variables up to the current timestep.
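A minimal sketch of the recurrence (C), where a GRU cell stands in for f_θ and consumes the concatenated features of x_t and z_t (the GRU choice and sizes are illustrative assumptions):

```python
# A sketch of the recurrence (C): the RNN cell consumes the extracted features of
# x_t and z_t along with h_(t-1). phi_x and phi_z are feature extractors as above.
import torch
import torch.nn as nn

x_dim, h_dim, z_dim = 1, 64, 16
phi_x = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
phi_z = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU())
f_theta = nn.GRUCell(2 * h_dim, h_dim)     # transition function over concatenated features

x_t = torch.randn(8, x_dim)
z_t = torch.randn(8, z_dim)
h_prev = torch.zeros(8, h_dim)
h_t = f_theta(torch.cat([phi_x(x_t), phi_z(z_t)], dim=-1), h_prev)   # h_t = f_theta(phi_x(x_t), phi_z(z_t), h_(t-1))
```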

Therefore, (A) defines the distribution p(z_t | x_(<t), z_(<t)), and (B) defines the distribution p(x_t | z_(≤t), x_(<t)).

Joint Distribution Governing The Overall Model:

Let’s first consider the joint probability distribution when we model data using a standard VAE. There, z is the latent random variable and x is the variable whose distribution we want to generate, so the joint distribution factorizes as:

p(x, z) = p(x | z) p(z)

Now in VRNN, this won’t work, as we deal here with sequential data and the temporal dependence needs to be captured. Therefore the joint distribution of interest here is p(x_(≤T), z_(≤T)), and its factorization is given as:

p(x_(≤T), z_(≤T)) = ∏_{t=1}^{T} p(x_t | z_(≤t), x_(<t)) p(z_t | x_(<t), z_(<t))

where the right-hand side contains the product of the distributions defined by (A) and (B) (as mentioned in the preceding section) across all timesteps.
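Putting the pieces together, here is a sketch of ancestral sampling from this factorization, reusing the hypothetical PriorNet, Decoder, phi_x, phi_z, and f_theta modules sketched above:

```python
# A sketch of sampling a sequence from the VRNN generative model:
# at each timestep, sample z_t from the prior (A), sample x_t from (B),
# then update the hidden state with the recurrence (C).
import torch

def sample_sequence(prior_net, decoder, phi_x, phi_z, f_theta, T=50, batch=1, h_dim=64):
    h = torch.zeros(batch, h_dim)
    xs = []
    for _ in range(T):
        mu_0, log_var_0 = prior_net(h)                                # (A): p(z_t | h_(t-1))
        z_t = mu_0 + torch.exp(0.5 * log_var_0) * torch.randn_like(mu_0)
        mu_x, log_var_x = decoder(z_t, h)                             # (B): p(x_t | z_t, h_(t-1))
        x_t = mu_x + torch.exp(0.5 * log_var_x) * torch.randn_like(mu_x)
        h = f_theta(torch.cat([phi_x(x_t), phi_z(z_t)], dim=-1), h)   # (C)
        xs.append(x_t)
    return torch.stack(xs, dim=1)                                     # (batch, T, x_dim)
```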

Inference:

The approximate posterior in VRNN is a function of both h_(t-1) and x_t (as opposed to the standard VAE, where it is conditioned only on x_t), and is given as:

z_t | x_t ~ N(μ_(z,t), diag(σ²_(z,t))), where [μ_(z,t), σ_(z,t)] = φ_τ^enc(φ_τ^x(x_t), h_(t-1))

The resulting factorization of the approximate posterior is therefore:

q(z_(≤T) | x_(≤T)) = ∏_{t=1}^{T} q(z_t | x_(≤t), z_(<t))

In summary, we have the prior (A), the generating distribution (B), the recurrence (C), and the approximate posterior above.
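A minimal sketch of this inference network, with the same hypothetical layer sizes as before:

```python
# A sketch of the inference network phi_enc: (phi_x(x_t), h_(t-1)) -> parameters of
# the approximate posterior q(z_t | x_(<=t), z_(<t)).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, x_dim=1, h_dim=64, z_dim=16):
        super().__init__()
        self.phi_x = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())  # feature extractor for x_t
        self.enc = nn.Sequential(nn.Linear(2 * h_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.log_var = nn.Linear(h_dim, z_dim)

    def forward(self, x_t, h_prev):
        hidden = self.enc(torch.cat([self.phi_x(x_t), h_prev], dim=-1))
        return self.mu(hidden), self.log_var(hidden)   # parameters of N(mu_zt, diag(sigma_zt^2))
```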

The Objective Function:

The objective function is a timestep-wise variational lower bound:

E_{q(z_(≤T) | x_(≤T))} [ Σ_{t=1}^{T} ( −KL( q(z_t | x_(≤t), z_(<t)) || p(z_t | x_(<t), z_(<t)) ) + log p(x_t | z_(≤t), x_(<t)) ) ]

(compare it with the variational lower bound of the standard VAE).
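Finally, here is a sketch of how this timestep-wise lower bound could be computed for one batch of sequences, reusing the hypothetical PriorNet, Encoder, Decoder, phi_x, phi_z, and f_theta modules sketched above; the KL term uses the closed form between two diagonal Gaussians:

```python
# A sketch of the per-sequence negative ELBO for a VRNN:
# at each timestep, encode q(z_t | x_t, h_(t-1)), sample z_t, decode p(x_t | z_t, h_(t-1)),
# accumulate KL(q || prior) minus the reconstruction log-likelihood, then update h_t.
import torch

def gaussian_kl(mu_q, log_var_q, mu_p, log_var_p):
    # KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) with diagonal covariances
    return 0.5 * (log_var_p - log_var_q
                  + (torch.exp(log_var_q) + (mu_q - mu_p) ** 2) / torch.exp(log_var_p)
                  - 1.0).sum(-1)

def neg_elbo(x, prior_net, encoder, decoder, phi_x, phi_z, f_theta, h_dim=64):
    batch, T, _ = x.shape
    h = torch.zeros(batch, h_dim)
    loss = 0.0
    for t in range(T):
        x_t = x[:, t]
        mu_0, lv_0 = prior_net(h)                      # prior (A)
        mu_z, lv_z = encoder(x_t, h)                   # approximate posterior
        z_t = mu_z + torch.exp(0.5 * lv_z) * torch.randn_like(mu_z)   # reparameterized sample
        mu_x, lv_x = decoder(z_t, h)                   # generation (B)
        recon = torch.distributions.Normal(mu_x, torch.exp(0.5 * lv_x)).log_prob(x_t).sum(-1)
        loss = loss + gaussian_kl(mu_z, lv_z, mu_0, lv_0) - recon
        h = f_theta(torch.cat([phi_x(x_t), phi_z(z_t)], dim=-1), h)   # recurrence (C)
    return loss.mean()
```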

Conclusion

So we looked at an amazing sequential generative model, the Variational Recurrent Neural Network, which models sequential data by introducing non-determinism into the hidden state of the recurrent neural network by incorporating a variational autoencoder at every timestep of the RNN. We saw that this non-determinism is very important in order to capture the desired variability of the input sequence, which enables the model to generate a more robust output distribution that is a good representative of the input data.

The model was tested on four speech datasets and one handwriting dataset, where it showed promising results compared to other models along the same line.

I recommend going through the research paper to find out more and to explore the other sections, such as experiments, results, and analysis.

References

[1] Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A., & Bengio, Y. (2015). A Recurrent Latent Variable Model for Sequential Data. Advances in Neural Information Processing Systems (NIPS).

EndNote

If you liked my work, do clap.
