Should Deep Learning use Complex Numbers?

Mandelbrot Set: https://en.wikipedia.org/wiki/Mandelbrot_set

Is it not odd to anyone that Deep Learning uses only real numbers? Or perhaps it would be even odder if Deep Learning used complex numbers (note: the kind with imaginary parts). One viable argument is that it is highly unlikely that the brain uses complex numbers in its computation. However, you can also argue that the brain doesn't perform matrix multiplication or chain-rule differentiation. Besides, Artificial Neural Networks (ANNs) are a cartoonish model of actual neurons. We've long since traded biological plausibility for real analysis (i.e. the theory of functions of real variables). Deep Learning researchers have been patting themselves on the back ever since they discovered that linear algebra and a sprinkling of basic calculus (i.e. the chain rule) were more than enough math to show groundbreaking results.

However, why should we stop at real analysis? We've already bet the kitchen sink on linear algebra and differentiable functions; we might as well go all in and bet the farm on complex analysis. Perhaps the weirder world of complex analysis will endow us with more powerful methods. After all, if it worked for Quantum Mechanics, then perhaps it may just work for Deep Learning. Besides, Deep Learning and Quantum Mechanics are both about information processing; they could even turn out to be the same thing!

So for argument's sake, let's shelve any thought about the need for biological plausibility. That's an old argument we left behind in 1957, when the first ANN was proposed by Frank Rosenblatt. Let the Numenta, Neuromorphic and Connectome folks worry about that hard problem. Deep Learning has bigger fish to fry. So the question then is: what can complex numbers provide that real numbers cannot?

In the last couple of years, a few papers have explored the use of complex numbers in Deep Learning. Surprisingly enough, a majority of them have never been accepted into a peer-reviewed venue. Deep Learning orthodoxy is simply that prevalent in the discipline. Nevertheless, let's review some of the interesting papers.

DeepMind has a paper, “Associative Long Short-Term Memory” (Ivo Danihelka, Greg Wayne, Benigno Uria, Nal Kalchbrenner, Alex Graves), that explores the use of complex values for an associative memory. The system is used to augment the memory of an LSTM. The conclusion of the work is that the use of complex numbers yields networks with higher memory capacity. On the mathematical side, complex numbers get by with smaller matrices than real numbers alone would require. The following graph shows that there is a measurable difference (as compared to a traditional LSTM) in memory costs:
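The mechanism behind such complex associative memories is a binding trick in the style of Holographic Reduced Representations: keys are unit-modulus complex vectors, binding is element-wise complex multiplication, and retrieval multiplies the trace by the conjugate key. A minimal NumPy sketch (my own illustration, not the paper's code; the dimension and number of pairs are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024  # memory dimension; larger d means less retrieval noise

def random_phase_vector(d, rng):
    # A unit-modulus complex vector: its conjugate is its exact
    # inverse, which makes it an ideal binding key.
    return np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, d))

# Bind three key-value pairs into ONE fixed-size trace by
# element-wise complex multiplication, then superpose by addition.
keys = [random_phase_vector(d, rng) for _ in range(3)]
values = [rng.standard_normal(d) for _ in range(3)]
trace = sum(k * v for k, v in zip(keys, values))

# Retrieve value 0 by multiplying with the conjugate of key 0;
# the other two pairs contribute only zero-mean noise.
retrieved = (np.conj(keys[0]) * trace).real

similarity = np.dot(retrieved, values[0]) / (
    np.linalg.norm(retrieved) * np.linalg.norm(values[0]))
print(similarity)  # well above chance, approaching 1 as d grows
```

Three values live in a single vector of fixed size, which is where the higher memory capacity per parameter comes from.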

Yoshua Bengio and his team in Montreal have explored another aspect of the use of complex values. In a paper titled “Unitary Evolution Recurrent Neural Networks” (Martin Arjovsky, Amar Shah, Yoshua Bengio), the researchers explore unitary matrices. They argue that there may be real benefits in terms of reducing vanishing gradients if the eigenvalues of the recurrent matrix have absolute values close to 1. In this research, they use complex values as the weights of the RNN. The conclusion of this work is:

Empirical evidence suggests that our uRNN is better able to pass gradient information through long sequences and does not suffer from saturating hidden states as much as LSTMs.
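The intuition is easy to check numerically: a unitary matrix has all eigenvalue moduli exactly 1, so repeatedly applying it (as backpropagation through time effectively does) preserves a vector's norm, while moduli even slightly below 1 make the signal vanish exponentially. A small sketch, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
d, steps = 64, 100

# Random unitary matrix via QR of a complex Gaussian: every
# eigenvalue has modulus exactly 1, so repeated multiplication
# preserves the norm of a back-propagated vector.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)) +
                    1j * rng.standard_normal((d, d)))

# The same matrix shrunk slightly: eigenvalue moduli of 0.9
# make the signal vanish exponentially, as in a plain RNN.
W = 0.9 * Q

g = rng.standard_normal(d).astype(complex)
g_unitary, g_shrunk = g.copy(), g.copy()
for _ in range(steps):
    g_unitary = Q @ g_unitary
    g_shrunk = W @ g_shrunk

print(np.linalg.norm(g))          # starting norm
print(np.linalg.norm(g_unitary))  # identical: no vanishing/exploding
print(np.linalg.norm(g_shrunk))   # ~0.9**100 of the start: vanished
```

After 100 steps the shrunk matrix has wiped out the gradient, while the unitary one has kept it intact to machine precision.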

They take several measurements to quantify this behavior against more traditional RNNs:

A system using complex values clearly has more robust and stable behavior.

A paper also involving Bengio's group and folks at MIT (Li Jing, Caglar Gulcehre, John Peurifoy, Yichen Shen, Max Tegmark, Marin Soljačić, Yoshua Bengio) extends the approach with a gating mechanism. The paper, “Gated Orthogonal Recurrent Units: On Learning to Forget” (aka GORU), explores whether long-term dependencies can be better captured, which can lead to a more robust forgetting mechanism. In the following graph, they show that other RNN-based systems fail in the copying task:

A team at FAIR and EPFL (Cijo Jose, Moustapha Cisse and Francois Fleuret) has a similar paper, “Kronecker Recurrent Units”, where they also use unitary matrices to show viability in the copying task. They present a method of matrix factorization that greatly reduces the number of parameters required. The paper describes their motivation for using complex values:

Since the determinant is a continuous function the unitary set in real space is disconnected. Consequently, with the real-valued networks we cannot span the full unitary set using the standard continuous optimization procedures. On the contrary, the unitary set is connected in the complex space as its determinants are the points on the unit circle and we do not have this issue.
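This connectivity argument can be seen numerically: real orthogonal matrices have determinant exactly +1 or -1 (two islands no continuous optimization path can bridge), while complex unitary determinants sweep the whole unit circle. A rough illustration (my own, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8

# Determinants of real orthogonal matrices are exactly +1 or -1:
# two disconnected islands in the space of orthogonal matrices.
real_dets = []
for _ in range(5):
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    real_dets.append(np.linalg.det(Q))
print([round(x, 3) for x in real_dets])  # each is +1.0 or -1.0

# Determinants of complex unitary matrices are arbitrary points on
# the unit circle, so the unitary group is one connected piece.
cplx_dets = []
for _ in range(5):
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)) +
                        1j * rng.standard_normal((d, d)))
    cplx_dets.append(np.linalg.det(Q))
print([round(abs(x), 3) for x in cplx_dets])       # all 1.0 in modulus
print([round(np.angle(x), 3) for x in cplx_dets])  # phases vary freely
```

A gradient step can slide a determinant's phase around the circle, but it cannot hop from +1 to -1 while staying orthogonal, which is why the authors work in complex space.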

One of the gems in this paper is this very insightful architectural idea:

the state should remain of high dimension to allow the use of high-capacity networks to encode the input into the internal state, and to extract the predicted value, but the recurrent dynamic itself can, and should, be implemented with a low-capacity model.

So far, these methods have explored the use of complex values in RNNs. A recent paper from MILA, “Deep Complex Networks” (Chiheb Trabelsi et al.), further explores the approach in convolutional networks. The authors test their network on vision tasks, with competitive results. Yann LeCun, the inventor of convolutional networks, also has a paper, “A mathematical motivation for complex-valued convolutional networks”, that explores the rationale for using complex numbers.
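In practice, a complex convolution layer is assembled from real-valued convolutions using the identity (a + ib) * (x + iy) = (a*x - b*y) + i(a*y + b*x). A 1-D NumPy sketch of that arithmetic (such papers use learned 2-D kernels; this only checks the decomposition):

```python
import numpy as np

def conv1d(signal, kernel):
    # Plain real-valued 1-D convolution ('valid' region only).
    return np.convolve(signal, kernel, mode="valid")

def complex_conv1d(x_re, x_im, k_re, k_im):
    # A complex convolution built from FOUR real convolutions,
    # which is how complex layers run on real-valued frameworks.
    out_re = conv1d(x_re, k_re) - conv1d(x_im, k_im)
    out_im = conv1d(x_re, k_im) + conv1d(x_im, k_re)
    return out_re, out_im

rng = np.random.default_rng(3)
x = rng.standard_normal(16) + 1j * rng.standard_normal(16)
k = rng.standard_normal(3) + 1j * rng.standard_normal(3)

re, im = complex_conv1d(x.real, x.imag, k.real, k.imag)
# Matches NumPy's native complex convolution exactly.
reference = np.convolve(x, k, mode="valid")
print(np.allclose(re + 1j * im, reference))  # True
```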

Finally, we have to mention something about its use in GANs. After all, this seems to be the hottest topic. A paper, “Numerics of GANs” (by Lars Mescheder, Sebastian Nowozin, Andreas Geiger), explores the troublesome convergence properties of GANs. They examine the characteristics of the Jacobian of the gradient vector field, whose eigenvalues are complex, and use this analysis to create a state-of-the-art approach to the problem of GAN equilibrium.
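The core observation is that near an equilibrium, the Jacobian of the combined generator/discriminator gradient field can have eigenvalues with large imaginary parts, so simultaneous gradient steps rotate around the equilibrium and, in discrete time, spiral away from it. A toy bilinear minimax game makes this visible (my sketch, not the paper's code):

```python
import numpy as np

# Toy minimax game f(x, y) = x * y: x minimizes, y maximizes.
# The simultaneous-gradient vector field is v(x, y) = (-y, x).
J = np.array([[0.0, -1.0],
              [1.0,  0.0]])  # Jacobian of v at the equilibrium (0, 0)
print(np.linalg.eigvals(J))  # purely imaginary: +1j and -1j

# Purely imaginary eigenvalues mean the continuous dynamics circle
# the equilibrium; discrete gradient steps overshoot and spiral out.
eta = 0.1
z = np.array([1.0, 1.0])
for _ in range(200):
    x, y = z
    z = z + eta * np.array([-y, x])
print(np.linalg.norm(z))  # larger than the start: iterates diverge
```

Each update is a rotation scaled by sqrt(1 + eta**2) > 1, so no step size fixes this; the paper's remedy is to modify the vector field so the Jacobian's eigenvalues move off the imaginary axis.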

In a post last year, I wrote about the relationship between the Holographic Principle and Deep Learning. The approach explored the similarity of Tensor Networks to Deep Learning architectures. Quantum mechanics can be thought of as a more generalized form of probability:

Quantum theory can be seen as a generalized probability theory, an abstract thing that can be studied detached from its application to physics.

The use of complex numbers permits additional capabilities that can't be found in ordinary probability: more specifically, superposition and interference. So to achieve holography, it's always nice to have complex numbers at your disposal.
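The difference is easy to demonstrate: classical probabilities of alternative routes simply add, while quantum amplitudes add before being squared, so opposite phases can cancel. A tiny NumPy illustration:

```python
import numpy as np

# Classical mixing: probabilities add, so combining two routes can
# never yield less probability than either route alone.
p1, p2 = 0.5, 0.5
print(p1 + p2)  # 1.0

# Quantum-style mixing: complex AMPLITUDES add first, then the sum
# is squared. Opposite phases cancel (destructive interference).
a1 = np.sqrt(0.5) * np.exp(1j * 0.0)
a2 = np.sqrt(0.5) * np.exp(1j * np.pi)  # same magnitude, opposite phase
print(abs(a1 + a2) ** 2)  # ~0.0: the two routes wipe each other out

# Aligned phases reinforce instead (constructive interference).
a3 = np.sqrt(0.5) * np.exp(1j * 0.0)
print(abs(a1 + a3) ** 2)  # ~2.0: more than the classical sum of 1.0
```

No assignment of ordinary (non-negative, additive) probabilities can reproduce that cancellation; the phase degree of freedom is exactly what real-valued probability lacks.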

A majority of the mathematical analysis performed in the machine learning and deep learning spaces tends to use Bayesian ideas as its arguments. Actually, most practitioners think it's Bayesian, but it really comes from statistical mechanics (despite the name, there's no mumbo-jumbo statistics speak in stat-mech). Yann LeCun has even made this point on tape.

But if Quantum Mechanics is a generalized form of probability, then what would happen if we used QM-inspired methods instead? It turns out that research has previously been done on this, and the results are worthy of note. In a paper written late last year, “Quantum Clustering and Gaussian Mixtures”, the authors (Mahajabin Rahman, Davi Geiger) explored its use in an unsupervised clustering (k-means) scenario. They report the following:

As a result, we observe the quantum class interference phenomena, not present in the Gaussian mixture model. We show that the quantum method outperforms the Gaussian mixture method in every aspect of the estimations.

Here’s the comparison in pictures:

What happened to the noise?!

So one has to wonder: why are people stuck with 18th-century Bayes' Theorem when there exists a 20th-century (i.e. Quantum Mechanics) theory of probability? (Note: it's just shocking that the cargo-cult science of statisticians has been running its farce since the 18th century.)

The research papers mentioned here show that there are indeed many “real” advantages to using complex values in deep learning architectures. The research indicates more robust transmission of gradient information across layers, higher memory capacity, more precise forgetting behavior, drastically reduced network sizes for sequence models and greater stability in GAN training. These are too many advantages to simply ignore. If we accept the present Deep Learning orthodoxy that any layer that is differentiable is fair game, then perhaps we should make use of complex analysis, where there is a lot more variety in the grocery store:

Perhaps one reason complex numbers aren't used more often is researchers' lack of familiarity with them. The mathematical heritage of the optimization community doesn't involve complex numbers; there's little need for them in Operations Research. Physicists, on the other hand, use them all the time. Those imaginary numbers keep popping up in quantum mechanics. It isn't weird; it just happens to reflect reality. We still have little understanding of why these DL systems work so well, so seeking out alternative formulations could lead to some unexpected breakthroughs. This is the game we play today: the team that accidentally stumbles on the AGI breakthrough wins the entire pot!

In the near future, the tables may turn. The use of complex values may become commonplace in SOTA architectures, and their absence may turn out to be what's odd. I guess when that happens, the 18th-century Bayesians will finally be out of business.

New papers: https://openreview.net/pdf?id=H1T2hmZAb, https://arxiv.org/abs/1710.09537v1, https://openreview.net/pdf?id=Bk4ELt10Z

Further Reading:

https://en.wikipedia.org/wiki/Wirtinger_derivatives

Explore more in this new book:


Explore Deep Learning: Artificial Intuition: The Unexpected Deep Learning Revolution
Exploit Deep Learning: The Deep Learning AI Playbook