Optimal Transport Theory: The New Math for Deep Learning


So there’s this mathematician who also happens to be a member of the French parliament. Cédric Villani, Fields Medalist in 2010, is also the proponent of the recent French strategy for AI.

How a mathematician got himself so involved in AI is a fascinating story in itself.

Although Villani has a unique fashion style, what I find most intriguing about him is how his particular branch of mathematics seems to have many promising connections with Deep Learning. Here is one very insightful lecture, “Triangles, gases, prices, and men”, where he uncannily relates four different fields of mathematics: Ricci curvature, the Boltzmann distribution, gradient flow, and optimal transport theory.

What’s interesting here is that we also encounter these fields in the study of DL. Riemannian curvature can be found in Information Geometry, where the Fisher Information Matrix defines a metric whose curvature is relevant to the natural gradient. The Boltzmann distribution is commonly found in the statistical models we use. Gradient flow is related to our optimization algorithms. We see all these mathematical tools in our studies, but we can’t quite put a finger on why they are related. Villani’s “grand” unification explains mathematically why these seemingly disparate mathematical notions are all related.
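To make the natural-gradient connection concrete, here is a minimal sketch (my own toy illustration, not from Villani’s lecture): the natural gradient preconditions the ordinary gradient with the inverse of the Fisher Information Matrix.

```python
# Toy natural-gradient step: theta <- theta - lr * F^{-1} g, where F is an
# empirical Fisher Information Matrix. The per-sample gradients are random
# stand-ins, purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
per_sample_grads = rng.normal(size=(256, 4))          # stand-in for d log p(x|theta)/d theta
g = per_sample_grads.mean(axis=0)                     # ordinary gradient estimate
F = per_sample_grads.T @ per_sample_grads / 256       # empirical Fisher: average outer product
natural_g = np.linalg.solve(F + 1e-3 * np.eye(4), g)  # small damping keeps F invertible

theta = np.zeros(4)
theta -= 0.1 * natural_g                              # natural-gradient update
```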

Then there’s this idea of Optimal Transport (OT) theory, which we haven’t really encountered until very recently. OT appears to be an approach with a long mathematical tradition and a set of mathematical tools rich enough to have an outsized impact on future theoretical work in DL.

A first encounter with OT for many practitioners is the Wasserstein distance, introduced as an alternative to the more conventional Kullback–Leibler (KL) divergence in the loss function used to address the mode collapse problem encountered when training Generative Adversarial Networks (GANs). The rationale for why GANs train better with a Wasserstein loss is that it has no discontinuities: it remains well behaved, and provides a useful gradient, even when the two distributions barely overlap. This makes intuitive sense; when moving material from one arrangement to another, the cost changes incrementally.
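A minimal sketch of this point (my own toy example, assuming numpy and scipy are available): as two discrete distributions drift apart, the Wasserstein distance grows smoothly while KL blows up as soon as the supports stop overlapping.

```python
# Compare KL divergence and the 1-D Wasserstein distance as a distribution
# is shifted away from a fixed reference distribution.
import numpy as np
from scipy.stats import wasserstein_distance

support = np.arange(10)
p = np.zeros(10)
p[0:3] = 1 / 3                          # reference: uniform mass on bins {0, 1, 2}

def kl(p, q, eps=1e-12):
    """KL(p || q); explodes wherever q has (near-)zero mass but p does not."""
    return np.sum(p * np.log((p + eps) / (q + eps)))

for shift in range(7):
    q = np.roll(p, shift)               # same shape, translated by `shift` bins
    w = wasserstein_distance(support, support, p, q)
    print(f"shift={shift}  Wasserstein={w:.2f}  KL={kl(p, q):.2f}")
# The Wasserstein distance grows linearly with the shift (a useful training
# signal), while KL saturates to a huge value once the supports are disjoint.
```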

It is important to recognize the conceptual difference here: rather than comparing probability distributions pointwise, as KL does, we measure the cost of redistributing mass from one distribution to another. It is as if a DL network were a logistics network, except that instead of distributing resources it distributes information. This is a very appealing conceptual model as well as a very elegant mathematical one.
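To make the “cost of redistribution” idea concrete, here is a small sketch (my own toy example, using only numpy and scipy) that solves a discrete OT problem as a linear program and recovers the optimal transport plan:

```python
# Discrete optimal transport as a linear program: minimize sum_ij C[i,j]*P[i,j]
# subject to the plan P having row sums a and column sums b, with P >= 0.
import numpy as np
from scipy.optimize import linprog

a = np.array([0.5, 0.3, 0.2])              # source distribution
b = np.array([0.2, 0.2, 0.6])              # target distribution
bins = np.arange(3, dtype=float)
C = np.abs(bins[:, None] - bins[None, :])  # cost = distance between bins

n, m = C.shape
A_rows = np.kron(np.eye(n), np.ones(m))    # each source bin i ships out exactly a[i]
A_cols = np.kron(np.ones(n), np.eye(m))    # each target bin j receives exactly b[j]
res = linprog(C.ravel(),
              A_eq=np.vstack([A_rows, A_cols]),
              b_eq=np.concatenate([a, b]),
              bounds=(0, None))

plan = res.x.reshape(n, m)
print("optimal transport plan:\n", plan.round(3))
print("total transport cost:", round(res.fun, 3))   # 0.7 for this toy example
```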

One apparent weakness of a Bayesian inference model is that it does not prescribe how to evolve from one distribution to another. All it expresses is that the prior and posterior distributions are related, and even worse, they are related only by axiom, without proof! In contrast, OT lends an understanding of how information can be optimally redistributed (see: the Principle of Least Action), so it actually expresses the mechanism of evolution.

The key to understanding DL networks is to treat them as generative models and not as descriptive models. The difference is that generative models are bottom up: simple machinery gives rise to emergent behavior. In contrast, descriptive models are top down: they collect statistics and conjure up explanations for the cause of those statistics. The latter approach (which is prevalent among machine learning practitioners) isn’t really as useful as many have been indoctrinated to believe.

The intuition for using something like OT is that it appeals to a universal principle of nature known as the principle of least action. DL systems and the biological brain are generative models, and so we should let them employ the local mechanisms they have at their disposal. In short, let us avoid any action-at-a-distance voodoo. When we jerry-rig a ‘faster than light’ capability, we diverge from what is physically possible, and that behavior can unnecessarily obfuscate the true mechanisms of cognition.

Another nice concept from OT is Lipschitz continuity. It is a more restrictive kind of continuity that leads to consequences such as ODEs having unique solutions and curves having finite length. It’s the kind of math that injects a level of realism into analytic equations.
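As a quick illustration of the definition (a sketch of my own, not tied to any particular DL library): a function f is K-Lipschitz if |f(x) − f(y)| ≤ K·|x − y| for all x and y, and we can probe K empirically by sampling pairs of points.

```python
# Empirically estimate a (lower bound on the) Lipschitz constant of a 1-D
# function by sampling random pairs of points and taking the largest slope.
import numpy as np

def empirical_lipschitz(f, low, high, n_pairs=100_000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(low, high, n_pairs)
    y = rng.uniform(low, high, n_pairs)
    ratios = np.abs(f(x) - f(y)) / (np.abs(x - y) + 1e-12)
    return ratios.max()                       # lower bound on the true constant

print(empirical_lipschitz(np.sin, -5, 5))            # ~1.0: sin is 1-Lipschitz
print(empirical_lipschitz(lambda x: x ** 2, -5, 5))  # ~10: slope |x + y| peaks near 10
print(empirical_lipschitz(np.sqrt, 0, 1))            # keeps growing with more samples:
                                                     # sqrt is not Lipschitz near 0
```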

A good primer on OT theory can be found in this video. There is also a book, “Computational Optimal Transport”, that is freely available. 2018 is the year you begin to find OT employed in many DL papers. The problem for practitioners, however, is that it takes time to become competent in this new math. The literature is vast and can be very intimidating: Cédric Villani’s “Optimal Transport: Old and New” is nearly 1,000 pages long! Even the names of the pioneers of this field may be unfamiliar (e.g., Monge, Kantorovich). The Russians and the French appear to have developed, over the centuries, a rich body of mathematics that was previously unknown to the broader machine learning community.

The downside of this lack of familiarity is that many DL submissions using OT vocabulary get rejected from publication simply because reviewers are unfamiliar with it. Unfortunately, it isn’t just an issue of vocabulary; it is a difference in a fundamental viewpoint of what exactly a DL network is. It boils down to a religious debate, ultimately degenerating into a lot of politics, one where the orthodoxy of Bayesian thinking again rears its ugly head.

The high priests of machine learning have always been Bayesian thinkers; god forbid they be replaced by alternative thinkers like Optimal Transport proponents or, even worse, Complexity Scientists!

Further Reading

Exploit Deep Learning: The Deep Learning AI Playbook