Nifty NAFs: Universal Density Estimation with Neural Autoregressive Flows

Nathan Schucher
Element AI Lab
Sep 27, 2018

Probability is a fundamental building block for intelligent learning systems. It provides tools for dealing with uncertainty or variation in our measurements, our data, and the predictions our algorithms produce. Existing methods often sacrifice expressivity for scalability, limiting the shapes of the distributions they can represent. Inflexible models struggle to capture important variation in data, making it difficult to reason about uncertainty in the world. These problems become increasingly relevant as the dimensionality of the data grows or as we move to richer modalities (e.g. multi-class image, audio, or video data).

In July 2018 our research lab published a paper at ICML titled Neural Autoregressive Flows (NAF). The paper describes a unifying framework for a family of recent techniques known as autoregressive flows, and proves that (with enough compute) NAF can approximate any probability distribution! It builds on recent work that tackles high-dimensional data and generalizes it to handle multimodal and more complex distributions.

Code to reproduce experiments from the paper is available here. To jump in and start playing with NAFs right now, check out these examples on GitHub. Keep reading below for a more technical introduction to the topic, and take a look at the paper for all the details.

Further instructions available on GitHub: https://github.com/CW-Huang/naf_examples

Neural Autoregressive Flows

Normalizing flows, as a method for modelling probability distributions, have a wide range of applications in generative modelling, variational inference, density estimation, hierarchical reinforcement learning, and probability density distillation. Prior work on improving variational inference with Inverse Autoregressive Flows (IAF) is referred to in our paper as Affine-Autoregressive Flows (AAF). These methods have more recently achieved state-of-the-art results in density estimation (MAF) and speech synthesis (Parallel WaveNet).

Left: Target Distribution, Right: Modeled by NAF.
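To make the mechanics concrete, here is a minimal sketch (not code from the paper) of a single affine-autoregressive step in the spirit of IAF/MAF. The name `masked_net` is a placeholder for any network that respects the autoregressive property, i.e. its outputs for dimension i depend only on dimensions 1 through i-1:

```python
import torch

def affine_autoregressive_step(x, masked_net):
    # Conditioner: a per-dimension shift mu_i and log-scale s_i, each computed
    # from the preceding dimensions only (e.g. via a MADE-style masked network).
    mu, log_s = masked_net(x)            # both shaped like x
    y = x * torch.exp(log_s) + mu        # elementwise affine transformer
    # The Jacobian of an autoregressive map is triangular, so its
    # log-determinant is just the sum of the diagonal (log-scale) terms:
    #   log p(x) = log p(y) + sum_i s_i
    log_det_jacobian = log_s.sum(dim=-1)
    return y, log_det_jacobian
```

This triangular structure is what keeps the change of variables cheap to evaluate; the limitation is that each dimension is only ever shifted and rescaled, which is exactly what NAF relaxes.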

One of the key challenges in generative probabilistic modeling and density estimation is the trade-off between expressivity and tractability. Neural Autoregressive Flows provide a way to combine expressive transformations with tractable changes to probability distributions.

In order to achieve this balance, NAF decomposes autoregressive flows into two components: an autoregressive conditioner and an invertible transformer. The transformer (pictured below in red) can be any invertible function; in this work we propose two monotonic neural networks: Deep Sigmoidal Flows (DSF) and Deep Dense Sigmoidal Flows (DDSF). The conditioner (in blue) is another neural network that outputs parameters for the transformer at each step, as a function of the preceding input dimensions (x_1, …, x_{i-1}). In principle, any neural architecture that satisfies the autoregressive property can be used as the conditioner, including MADE (which we use in the paper), PixelCNN, PixelSNAIL, etc. This can be seen as a hypernetwork structure where the conditioner outputs the parameters of the nested transformers.

The conditioner (blue) and transformer (red) combine to implement a Neural Autoregressive Flow
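As a rough sketch of what a single DSF transformer step computes for one dimension, assuming the conditioner has already produced unconstrained pseudo-parameters (the names `pre_a`, `pre_b`, `pre_w` are illustrative, not the paper's API):

```python
import torch
import torch.nn.functional as F

def dsf_transform(x_t, pre_a, pre_b, pre_w, eps=1e-6):
    # Constrain the pseudo-parameters so the map stays strictly monotonic:
    a = F.softplus(pre_a)               # positive slopes
    b = pre_b                           # unconstrained biases
    w = torch.softmax(pre_w, dim=-1)    # convex combination weights
    # Monotonic mixture of sigmoids of x_t, mapped back to the real line with
    # the logit (inverse sigmoid): y = sigmoid^{-1}(w^T sigmoid(a * x + b)).
    s = (w * torch.sigmoid(a * x_t.unsqueeze(-1) + b)).sum(dim=-1)
    s = s.clamp(eps, 1 - eps)
    return torch.log(s) - torch.log1p(-s)
```

Monotonicity (positive slopes and convex weights) is what keeps the transformer invertible even though its inverse has no closed form; DDSF generalizes this idea by stacking such layers with dense connections.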

To understand how NAF works, one can think of it as a way to do inverse sampling: a universal method for sampling from a (multivariate) distribution. To simulate random samples from a distribution, one simply passes samples drawn uniformly at random through the inverse cumulative distribution function (CDF) of that distribution. Since, given an arbitrary ordering of the dimensions, the joint probability can be factorized into a product of conditionals, NAF can be used to approximate the inverse CDF of the conditional for each dimension. Take the two-dimensional distribution below, for example. Conditioned on different values of x1, NAF approximates how the distribution of x2 changes shape as x1 varies, according to p(x2|x1): the conditioner supplies the conditioning, and the transformer fits the target inverse conditional CDF.

NAF approximates the joint distribution (left) by learning to fit the conditional distribution (right)
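As a toy, one-dimensional illustration of the inverse-sampling picture (not an experiment from the paper), here is what happens when uniform noise is pushed through a known inverse CDF; in NAF, the learned transformer plays this role for each conditional:

```python
import torch

# Toy 1-D target with a closed-form inverse CDF; NAF learns this map instead.
target = torch.distributions.Normal(loc=2.0, scale=0.5)

u = torch.rand(10_000).clamp(1e-6, 1 - 1e-6)   # u ~ Uniform(0, 1)
samples = target.icdf(u)                        # x = F^{-1}(u)

print(samples.mean(), samples.std())            # close to 2.0 and 0.5
```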

Based on this intuition, we also prove that Neural Autoregressive Flows are universal density approximators. As far as we know, this is the first proof of universal density estimation for finite normalizing flows.

The video below shows how NAF (instantiated by DSF and DDSF), with its neural transformer, outperforms an autoregressive flow with an affine transformer (AAF) at fitting energy functions. In most cases, while AAF struggles to warp the shape of the resulting distribution, both DSF and DDSF converge much faster and with higher fidelity.

Comparison of existing methods (AAF), with NAF methods (DSF, DDSF) in capturing multiple modes

In terms of quantitative performance, we achieve state-of-the-art results on a suite of benchmark density estimation tasks.

We also have results on improving the approximate posterior distributions of Variational Autoencoders (VAE) and a toy Bayesian model. For these and more qualitative analyses, see Section 6 of our paper.

Future Work

This work builds on a large body of prior work on normalizing flows, but there are still many open questions remaining, and lots of room for future research. Some directions that interest us include:

  1. Exploring other families of transformer functions (e.g. ones with closed-form inverses)
  2. Exploring other choices of conditioners (PixelCNN(++), TAN, PixelSNAIL)
  3. Understanding the relationship between expressivity and trainability
  4. Understanding how composing multiple flow layers improves expressivity
  5. Further applications, such as probability density estimation, copula methods, and hierarchical RL

Again, you can find the paper here, and the code is available on GitHub.
