Notes on the Cramer GAN

The following discussion relates to the recent paper:

“The Cramer Distance as a Solution to Biased Wasserstein Gradients”
https://deepmind.com/research/publications/cramer-distance-solution-biased-wasserstein-gradients/

This essay is going to be a bit lengthy, so I’ll start with a TL;DR. First of all, in dimensions greater than 1, the paper doesn’t use the Cramer distance! In the GAN experiments of Section 5, it uses the energy distance [1]. The model is therefore a generative moment matching network like [3,4], using a particular kernel. Since the energy distance is an integral probability metric, the scheme proposed by [5] can be used to train input features to the kernel, which improves results over [3]. I explain this in Section 1 below.

Overall, this is a good idea, and seems to give nice generator samples. Unfortunately, the Cramer GAN paper makes a problematic approximation, which means that the critic in this algorithm does not correctly compare the generator and reference (target) sample distributions: you can have a zero critic loss even with very different generator and reference distributions. I explain this problem in Section 2 below.

The paper [6], which proposes a similar idea and appeared at around the same time, also achieves good results, using a different (Gaussian) kernel with deep features. The paper [7] is also related; it does not use the optimisation of [5], but instead uses a variance control scheme, again getting good results.

Now on to the details.

Section 1: Energy distance, witness function, gradient penalisation, kernel choice

The distance used by the critic in the paper is the energy distance [1] (on p. 5 of their paper, the authors state that when generating samples in more than one dimension, d>1, they use the energy distance, not the Cramer distance). It was shown in [2] that this distance belongs to the family of integral probability metrics (IPMs), just like the Wasserstein distance (by contrast, the KL and reverse KL divergences are f-divergences, not IPMs). Specifically, the energy distance is a maximum mean discrepancy (MMD), as first used for training generative models in [3,4] (these were called Generative Moment Matching Networks). If you're unfamiliar with the MMD, see Appendix B below for a very short introduction. The IPM formulation is the essential property which allows us to apply the method of [5] to train a GAN.

Integral probability metrics measure the distance between probability distributions by looking for a smooth function (the “witness function”) which maximises the difference in expectation under the two distributions. So if P is the distribution of your generator samples, and Q is your distribution of reference samples, then you’ll look for a smooth function f for which E_X f(X) - E_Y f(Y) is large, where X is drawn from P and Y is drawn from Q, and E_X denotes the expectation under P. The method [5] introduces a penalty to keep the gradient of the critic’s witness function close to 1 during training, at points between the generator and reference samples. So as soon as we have the witness function, we can use [5].
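
To make the penalty of [5] concrete, here is a minimal PyTorch sketch, assuming a generic critic function that maps a batch of row-vector samples to scalar witness values; the names (critic, lam) and the data layout are my own choices, not the exact training code of the Cramer GAN paper.

    import torch

    def gradient_penalty(critic, x_gen, x_ref, lam=10.0):
        # Interpolate between generator and reference samples, then push the
        # norm of the critic's gradient at those points towards 1, as in [5].
        eps = torch.rand(x_gen.size(0), 1, device=x_gen.device)
        x_hat = (eps * x_ref + (1.0 - eps) * x_gen).requires_grad_(True)
        f_hat = critic(x_hat)  # witness values at the interpolated points
        grads, = torch.autograd.grad(f_hat.sum(), x_hat, create_graph=True)
        return lam * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()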

What is the witness function for the energy distance? To figure this out, we use the fact that it is a Maximum Mean Discrepancy with kernel:

k(x,x’) = d(x,0) + d(x’,0) - d(x,x’), (1)

where d(x,x’) is the distance between x and x’ (see [2, Definition 13], ignoring the factor of 0.5), and “0” is the origin (this is actually a zero in the Medium typeface…) Using this kernel, and the expression for the witness function in [8, Section 2.3], we can derive the expression for the witness function f*(x) above eq. 5 in the Cramer GAN paper (which otherwise seems to appear out of nowhere!). I prove this result in Appendix A below.
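
For intuition: as derived in Appendix A, this witness is the average distance of x to the reference samples minus its average distance to the generator samples, up to an additive constant. A small NumPy sketch (function and variable names are mine) estimates it from finite samples:

    import numpy as np

    def energy_witness(x, gen_samples, ref_samples):
        # f*(x) = E_Y d(x, Y) - E_X d(x, X), up to an additive constant,
        # estimated by Monte Carlo with X ~ P (generator) and Y ~ Q (reference).
        d_gen = np.linalg.norm(gen_samples - x, axis=1).mean()
        d_ref = np.linalg.norm(ref_samples - x, axis=1).mean()
        return d_ref - d_gen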

You can penalise the gradient of the witness function in the critic loss, just as proposed by [5]; or it’s possible to use the earlier “clipping” method of [9], which is done by [6], and also works fine.

The authors claim the energy distance kernel might be a better choice for GAN training. I am not convinced by the argument. On the theory side:

  • It is not clear why sum invariance and scale invariance are essential for GAN training. Sum invariance also holds for any translation-invariant kernel, including the Gaussian kernel (the proof is trivial)! Sum and scale invariance for the energy distance were in fact already proved by Székely and Rizzo, so it’s not clear why the authors reproduce these earlier proofs in Proposition 3.
  • The unbiased gradient estimate holds much more broadly, not just for the energy distance: subject to certain conditions, the Leibniz rule means that an unbiased estimator of the distance yields unbiased gradients, and the MMD has an unbiased estimator for any kernel [8].

Experimentally, I’m also not convinced that this kernel choice is the main reason the method works, since [6,7] use different kernels and also get good results ([6] uses the good old Gaussian kernel). On the other hand, the authors in their Appendix D claim more stable training than with the Gaussian and Laplace kernels. It may be that the method of [5] is the key to training the MMD GAN, but note that [6] uses clipping, and [7] uses variance control rather than penalising the gradient, so [5] is not the only way to make it work.

Section 2: The critic is not correct

Unfortunately, the paper makes a problematic approximation, which causes the critic not to correctly compare and match the generator and reference distributions. This can be fixed: [6] and [7] already propose approaches which do not have this issue, and which work well.

To understand the problem, think of x and x’ as being independent samples from the generator P, and y and y’ as being independent samples from the reference distribution Q. The energy distance is:

D(P,Q) = E_{X,Y’} d(X,Y’) + E_{X’,Y} d(X’,Y) - E_{X,X’} d(X,X’) - E_{Y,Y’} d(Y,Y’)

where E_{X,X’} d(X,X’) is the expected distance between two independent samples from the generator P (likewise, Y and Y’ are independent samples from the reference Q). Let’s understand what this means: the first two terms both give the average distance between a generator sample and a reference sample. The third term is the average distance between two samples from the generator, and the fourth is the average distance between two reference samples. In other words, if the samples from the generator have the same distribution as the reference samples, then all terms will be the same and will cancel, giving an energy distance of zero when P=Q.
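
A plug-in estimate of this distance simply replaces each expectation with a sample average. Here is a minimal NumPy sketch (function names are mine; this is a simple plug-in estimate, not the unbiased estimator of [8]):

    import numpy as np

    def pairwise_dist(a, b):
        # Euclidean distances between all rows of a and all rows of b.
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

    def energy_distance(gen, ref):
        # D(P,Q) ~= 2 E d(X,Y) - E d(X,X') - E d(Y,Y'), with each expectation
        # replaced by an average over sample pairs.
        d_xy = pairwise_dist(gen, ref).mean()
        d_xx = pairwise_dist(gen, gen).mean()
        d_yy = pairwise_dist(ref, ref).mean()
        return 2.0 * d_xy - d_xx - d_yy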

We now turn to the distance implemented by the critic in Algorithm 1 of the Cramer GAN paper. From the definition of the critic witness f(x) in Algorithm 1, we see that the expected “surrogate loss” used for the critic is:

D_c(P,Q) = E_{X} d(X,0) + E_{X’,Y} d(X’,Y) - E_{X,X’} d(X,X’) - E_{Y} d(Y,0)

In brief, the problematic approximation is to replace y’ by the origin 0, at which point the interpretation of the critic as a distance between P and Q is lost. It is easy to come up with P and Q which are completely different, yet have an expected critic loss of zero. For a simple 1-D example, let’s say P puts all its samples at the origin, and Q puts all its samples a distance t from the origin. Clearly P and Q are not the same, and yet

E_{X} d(X,0) = 0
E_{X’,Y} d(X’,Y) = t
E_{X,X’} d(X,X’) = 0
E_{Y} d(Y,0) = t

so D_c(P,Q) = 0 + t - 0 - t = 0, and the critic mistakenly thinks P and Q are identical, even though the true energy distance is D(P,Q) = 2t > 0. It’s difficult to predict the problems this issue will cause in complicated examples like Figures 9 and 10 of the paper, but for simpler toy examples in fewer dimensions, it quickly becomes clear that P and Q are not being matched correctly. Finally, note that [6] and [7] do not make this approximation, and are unaffected.
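
This failure is easy to check numerically. Here is a small NumPy sketch of the 1-D example above (the value t = 3 and the sample size are arbitrary choices of mine):

    import numpy as np

    t, n = 3.0, 500
    x = np.zeros(n)      # generator samples: all mass at the origin
    y = np.full(n, t)    # reference samples: all mass at distance t

    d_xx = np.abs(x[:, None] - x[None, :]).mean()   # E d(X, X') = 0
    d_yy = np.abs(y[:, None] - y[None, :]).mean()   # E d(Y, Y') = 0
    d_xy = np.abs(x[:, None] - y[None, :]).mean()   # E d(X', Y) = t
    d_x0 = np.abs(x).mean()                         # E d(X, 0)  = 0
    d_y0 = np.abs(y).mean()                         # E d(Y, 0)  = t

    print(d_x0 + d_xy - d_xx - d_y0)   # critic surrogate D_c: prints 0.0
    print(2.0 * d_xy - d_xx - d_yy)    # true energy distance: prints 2t = 6.0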

References

[1] The energy distance corresponds to the Cramer distance in one dimension, but we don’t use GANs to generate samples in 1-D. See Székely and Rizzo (2004), “Testing for Equal Distributions in High Dimension”.

[2] Sejdinovic, D., Sriperumbudur, B., Gretton, A., and Fukumizu, K., “Equivalence of distance-based and RKHS-based statistics in hypothesis testing”, Annals of Statistics, (2013)

[3] Dziugaite, G. K., Roy, D. M., and Ghahramani, Z. (2015). Training generative neural networks via maximum mean discrepancy optimization. UAI

[4] Li, Y., Swersky, K., and Zemel, R. (2015). Generative moment matching networks. ICML

[5] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. (2017). Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028.

[6] Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., and Póczos, B. (2017). MMD GAN: Towards deeper understanding of moment matching network. arXiv preprint arXiv:1705.08584.

[7] Mroueh, Y. and Sercu, T. (2017). Fisher GAN. arXiv preprint arXiv:1705.09675.

[8] Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. JMLR.

[9] Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein GAN. arXiv preprint arXiv:1701.07875.

Appendix A: Derivation of witness function

From [8, Section 2.3], the witness function is (up to a constant of proportionality):

f*(x) = E_X k(x,X) - E_Y k(x,Y), (2)

where E_X k(x,X) denotes the expectation of the kernel with respect to one of its arguments, with X drawn from P and Y drawn from Q. Substituting the kernel (1) into the first term, we get:

E_X k(x,X) = d(x,0) + E_X d(X,0) - E_X d(x,X)

The second term of the above expression is constant. Substituting into (2),

f*(x) = d(x,0) - E_X d(x,X) - d(x,0) + E_Y d(x,Y) + C
= E_Y d(x,Y) - E_X d(x,X) + C

where C is a constant and can be ignored. This gives the desired witness function.
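
The derivation can also be checked numerically: evaluating the witness via (2) with kernel (1), and via the derived form E_Y d(x,Y) - E_X d(x,X), gives values that differ only by the constant C. A NumPy sketch on toy Gaussian samples (the variable names and the choice of distributions are mine):

    import numpy as np

    rng = np.random.default_rng(0)
    x_eval = rng.normal(size=(5, 2))             # points at which to evaluate the witness
    gen = rng.normal(size=(500, 2))              # samples X ~ P (generator)
    ref = rng.normal(loc=1.0, size=(500, 2))     # samples Y ~ Q (reference)

    d = lambda a, b: np.linalg.norm(a - b, axis=-1)
    k = lambda a, b: d(a, 0 * a) + d(b, 0 * b) - d(a, b)   # kernel (1)

    # Witness via (2): E_X k(x, X) - E_Y k(x, Y)
    f_kernel = np.array([k(x, gen).mean() - k(x, ref).mean() for x in x_eval])
    # Witness via the derived form: E_Y d(x, Y) - E_X d(x, X)
    f_direct = np.array([d(x, ref).mean() - d(x, gen).mean() for x in x_eval])

    print(f_kernel - f_direct)   # the same constant C at every evaluation point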

Appendix B: quick overview of the MMD

The Maximum Mean Discrepancy (MMD) is a simple measure of distance between two probability distributions [8]. In the GAN case, define P as the generator distribution, and Q as the reference distribution. Then the (squared) MMD is:

MMD²(P,Q) = E_{X,X’} k(X,X’) + E_{Y,Y’} k(Y,Y’) - E_{X,Y’} k(X,Y’) - E_{X’,Y} k(X’,Y)

where k(x,x’) is the “similarity” of x and x’, and E_{X,X’} k(X,X’) is the expected “similarity” of two independent samples from the generator P.

How do we interpret this distance? We use the fact that k(x,x’) is a kernel, i.e. a dot product of features of x and of x’. It is large when x and x’ are similar, and small when they are different. Going back to our MMD expression, the first term is the average similarity between two samples from the generator. The second is the average similarity between two reference samples. The final two terms both give the average similarity between a generator and reference sample. In other words, if the samples from the generator have the same distribution as the reference samples, then all terms will be the same and will cancel, giving an MMD of zero.
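
As a concrete illustration, here is a sketch of the unbiased estimator of MMD² from [8]; the argument k is assumed to be a function returning the full matrix of kernel values between two sets of samples (this interface is my own choice):

    import numpy as np

    def mmd2_unbiased(x, y, k):
        # Within-sample kernel averages (diagonal excluded, giving the
        # unbiased estimator of [8]) minus twice the cross-sample average.
        kxx, kyy, kxy = k(x, x), k(y, y), k(x, y)
        n, m = len(x), len(y)
        term_xx = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
        term_yy = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
        return term_xx + term_yy - 2.0 * kxy.mean()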

The MMD is an integral probability metric, just like the Wasserstein distance. The witness function for this metric is given in Appendix A above, in equation (2).

Regarding what kernel to use: this is a long and storied question! But very briefly, one well known kernel is the “Gaussian” (exponentiated quadratic) kernel,

k(x,x’) = exp(-a d²(x,x’))

where d²(x,x’) is the squared Euclidean distance between x and x’, and a is a width parameter. Another kernel is eq. (1) earlier in the document, which is the kernel used to get the Energy Distance. Both kernels give valid integral probability metrics measuring the distance between P and Q. There are many other options, and there’s not yet a clear consensus as to what’s best for GANs.
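
For concreteness, here are NumPy versions of both kernels (function names are mine); either can be passed to the MMD estimator sketched in the previous code snippet.

    import numpy as np

    def sq_dists(a, b):
        # Squared Euclidean distances between all rows of a and all rows of b.
        return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

    def gaussian_kernel(x, y, a=1.0):
        # k(x, x') = exp(-a d^2(x, x')), with a > 0 the width parameter.
        return np.exp(-a * sq_dists(x, y))

    def energy_kernel(x, y):
        # Kernel (1): k(x, x') = d(x, 0) + d(x', 0) - d(x, x'),
        # which makes the squared MMD equal to the energy distance.
        dists = np.sqrt(sq_dists(x, y))
        return np.linalg.norm(x, axis=1)[:, None] + np.linalg.norm(y, axis=1)[None, :] - dists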