A Paper in 5 mins — FactorVAE

Nicola Bernini
Discussing Deep Learning
4 min read · May 29, 2020

The goal of this article is to explain the core of this paper in just 5 minutes of your time.

1. Introduction

We are going to explore the following paper and understand its core contribution

Learning a disentangled representation is more or less like learning a “code” where each digit represents a specific “factor of variation” of the dataset: for example, if we were talking about faces, then one digit could control the color of the skin, another the color of the hair, another the type of hair, and so on.
If this representation is attached to a generative process, then these “digits” not only represent some semantic aspect of the input, they can also control it in the generative process itself.

Fig.1 from Beta VAE Paper

Generative processes are well suited for representation learning because it is possible to measure, with appropriate metrics, both the quality of the reconstruction (reconstruction loss) and, in the case of disentangled representation learning, the quality of the disentanglement.

The VAE is a type of generative model relying on a probabilistic representation of the latent code, and it is a commonly used tool for learning such a disentangled representation.

VAE Scheme from Tutorial on Variational Autoencoders, on the left implemented in its standard form and on the right implemented with the reparametrization trick
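The reparametrization trick shown on the right of the figure can be sketched in a few lines of NumPy (a minimal illustration, not any particular paper's implementation): writing the sample as z = mu + sigma * eps isolates all the randomness in eps, so gradients can flow through mu and log_var.

```python
import numpy as np

def reparametrize(mu, log_var, rng):
    # Sample z = mu + sigma * eps with eps ~ N(0, I).
    # Because the randomness is confined to eps, gradients can
    # flow through mu and log_var (the reparametrization trick).
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```

With log_var = 0 the standard deviation is 1, so the sample is just mu plus standard Gaussian noise.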

2. The Paper Core Idea

This paper is essentially an improvement of Beta VAE, which addressed this kind of learning by focusing on the objective function of the standard VAE, which consists of two terms

  • a reconstruction loss
  • a KL divergence term between the learned representation and a prior which is disentangled, so that, in probabilistic terms, its joint distribution is factorized

The Beta VAE paper modifies the standard VAE objective by adding a beta factor which tunes the weight of the KL divergence term, and hence the strength of the pull towards the prior.

Beta VAE Objective Function (from the paper)

Empirically, this is observed to come at the cost of degraded reconstruction performance: more disentanglement means worse reconstruction.
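The Beta VAE objective above can be sketched in NumPy as reconstruction loss plus beta times the closed-form Gaussian KL (a minimal sketch; the squared-error reconstruction term is an illustrative assumption, the paper uses a likelihood term):

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    # Reconstruction term: here a simple squared error over the input.
    recon = np.sum((x - x_recon) ** 2)
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior.
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    # Beta VAE scales the KL term by beta; beta = 1 recovers the standard VAE.
    return recon + beta * kl
```

Increasing beta pushes the posterior harder towards the factorized prior, which is exactly the knob whose side effect the rest of the article discusses.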

FactorVAE’s key insight consists of understanding the loss function at a deeper level: the KL divergence term is actually a sum of

  • the mutual information between the input and the code
  • the KL divergence between the aggregate posterior and the prior

Mutual Information + KL Divergence (from the paper)
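Written out, the decomposition used in the paper reads as follows, where $I_q(x;z)$ denotes the mutual information under the encoder distribution and $q(z)$ is the aggregate posterior:

```latex
\mathbb{E}_{p(x)}\!\left[\, \mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big) \right]
  = I_q(x; z) \;+\; \mathrm{KL}\big(q(z)\,\|\,p(z)\big),
\qquad
q(z) = \mathbb{E}_{p(x)}\!\left[\, q(z \mid x) \right]
```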

Penalizing the KL divergence term, as Beta VAE does, means penalizing both the mutual information and the divergence from the prior, and this has both a positive and a negative effect

  • the positive effect is related to penalizing the divergence with respect to the prior, as we want our representation to be as disentangled as possible
  • the negative effect is related to penalizing the mutual information: we want our code to preserve as much information about the input as possible, and reducing it leads to the degraded reconstruction performance that is actually observed

So FactorVAE essentially proposes a new loss fixing this issue: it penalizes only the total correlation of the aggregate posterior, the part of the divergence that actually measures entanglement between latent dimensions, while leaving the mutual information term untouched.

FactorVAE Objective Function (from the paper)
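The new objective keeps the standard VAE terms and adds a penalty on the total correlation, which the paper estimates with a discriminator via the density-ratio trick. A minimal NumPy sketch of how the pieces combine (the function and argument names are illustrative assumptions, and gamma is a tunable weight as in the paper):

```python
import numpy as np

def factor_vae_loss(recon_error, kl_term, disc_logits, gamma=10.0):
    # disc_logits approximates log( D(z) / (1 - D(z)) ) for codes z ~ q(z),
    # where a discriminator D is trained to tell samples from q(z) apart
    # from samples of the product of its marginals; the mean of these
    # logits estimates the total correlation TC(q(z)).
    tc_estimate = np.mean(disc_logits)
    # FactorVAE objective = standard VAE objective + gamma * TC penalty.
    return recon_error + kl_term + gamma * tc_estimate
```

Because only the total correlation is penalized, the mutual information between input and code is left alone, which is why the reconstruction quality does not degrade the way it does in Beta VAE.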

3. Experiments

In the second part of the paper the authors run experiments and comparisons using a new disentanglement metric

New Metric
Comparison: almost same reconstruction error with much better disentanglement in the representation according to the new metric (from the paper)

Please do not forget to clap if you liked this


Machine Learning PhD, Physicist. Mainly interested in Deep Learning, Functional Programming. https://github.com/NicolaBernini