How to Make Deep Learning Many Orders of Magnitude Faster

Deep Learning needs no long introduction. Deep Learning, i.e. deep neural networks, is a very useful family of machine learning algorithms that has become enormously popular thanks to its wide range of applications and to the AI breakthroughs it has produced.

However, training a deep neural network with many layers, many hidden neurons, and tons of data is computationally expensive. Is there a way to accelerate this process? The answer is yes. However, you would have to be a mathematical genius to solve the extremely complicated problem involved.

First, you need to read this seminal paper:

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
By Andrew M. Saxe, James L. McClelland, Surya Ganguli
https://arxiv.org/abs/1312.6120

I had the same idea many years before this paper was published, but I never published intermediate results, i.e. solutions to intermediate problems. I wanted to solve the whole problem. I was stuck then, and I am still stuck, because the whole problem is really hard.

So, what is this problem all about? First, we need to understand gradient descent. Gradient descent is a technique to minimize the error expressed as a loss function. In each iteration, we take a step scaled by ⍺ in the direction opposite to the loss function’s gradient, where ⍺ is the learning rate. It is much more intuitive to see it in a graphical way:

The gradient descent technique

There are multiple algorithms to adjust the learning rate (a minimal sketch of their update rules follows this list). Some of these techniques are:

  • Momentum accumulates a running average of past gradients, so weight updates build up speed along directions the gradient consistently points in; this helps avoid getting stuck in very suboptimal local minima. The changes in weights can be big at the beginning, but the accumulated velocity damps out as the gradients shrink, so the updates slow down and converge properly.
  • Another technique is simulated annealing, which gradually lowers a “temperature” that controls random perturbations of the weights. Its statistical properties push the neural network to escape poor local minima and try to converge to the global minimum.
  • The Adam optimizer uses a different learning rate for each parameter. It keeps running estimates of the first and second moments of each gradient component and scales each update accordingly, so the effective learning rates adapt to how much each parameter’s gradient varies.
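
Here is a minimal NumPy sketch of the momentum and Adam update rules, just to make the iterative nature of these optimizers concrete. The grad_fn callback, the variable names, and the hyperparameter defaults are illustrative assumptions, not any particular library's API.

    import numpy as np

    def momentum_step(theta, velocity, grad_fn, lr=0.01, beta=0.9):
        """One gradient descent step with momentum."""
        g = grad_fn(theta)                 # gradient of the loss at the current parameters
        velocity = beta * velocity + g     # accumulate past gradients
        return theta - lr * velocity, velocity

    def adam_step(theta, m, v, t, grad_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam step: per-parameter learning rates from gradient moment estimates."""
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g           # first-moment (mean) estimate
        v = beta2 * v + (1 - beta2) * g ** 2      # second-moment (uncentered variance) estimate
        m_hat = m / (1 - beta1 ** t)              # bias correction, step counter t >= 1
        v_hat = v / (1 - beta2 ** t)
        return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

Both functions only touch the current gradient plus a small amount of state, which is precisely why they must be iterated many times to converge.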

All these techniques are iterative numerical methods. But what about using symbolic methods to solve this problem?

Learning as a Differential Equation System

Let’s make a valid analogy. Imagine we have two negative electric charges. They create equipotential surfaces, and, according to Maxwell’s equations, electric field lines are perpendicular to those equipotential surfaces.

Equipotential surfaces are analogous to the level sets of the deep learning loss function, the surfaces on which the loss takes equal values. Electric field lines are analogous to the trajectories gradient descent would take if the learning rate ⍺ were infinitesimally small. In like manner, the trajectories of gradient descent are perpendicular to the level sets of the loss function.
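
In symbols, at any point the gradient is orthogonal to the level set of the loss passing through that point:

    \nabla J(\theta) \;\perp\; \{\theta' : J(\theta') = J(\theta)\}

so an infinitesimal gradient step crosses the level sets at right angles, exactly like a field line crossing equipotential surfaces.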

This is the iterative equation of gradient descent:
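
    \theta_{n+1} = \theta_n - \alpha \, \nabla J(\theta_n)

Here \theta denotes the parameters 𝚹 at iteration n, \alpha is the learning rate ⍺, and \nabla J(\theta_n) is the gradient of the loss at the current parameters.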

And this is the formula to compute derivatives in a numerical way:
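
    \frac{df}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}

For small but finite h, the quotient on the right-hand side is the usual numerical approximation of the derivative.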

What happens when the learning rate ⍺ tends to zero?

Let’s make another valid analogy. Consider this generalized equation of movement, in which v is the velocity vector, a vector field that depends on the location x.
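
Written in discrete form, with a small time step \Delta t, it reads

    x(t + \Delta t) = x(t) + v(x(t)) \, \Delta t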

This generalized equation of movement is analogous to the iterative equation of gradient descent: the two updates have exactly the same structure.

The parameters to optimize 𝚹 are analogous to the location x. The learning rate ⍺ is analogous to the time step Δt. And the negative gradient of the loss function is analogous to the velocity vector v.

Let’s algebraically manipulate both equations at the same time. By moving terms, we obtain:
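
    \frac{x(t + \Delta t) - x(t)}{\Delta t} = v(x(t)),
    \qquad
    \frac{\theta_{n+1} - \theta_n}{\alpha} = -\nabla J(\theta_n)

The left-hand sides are exactly finite-difference quotients of the kind shown above.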

What happens when the time step Δt and the learning rate ⍺ tend to zero? We obtain two differential equation systems, shown below: one for generalized movement and another for the learning dynamics of deep learning. They look simple. But movement can be three-dimensional, and deep learning systems can have millions of dimensions. Hence, each differential equation system can have many equations.
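
In the limit, the two systems read

    \frac{dx}{dt} = v(x),
    \qquad
    \frac{d\theta}{dt} = -\nabla J(\theta)

where, in the learning system, t stands for the accumulated learning rate playing the role of training time. The second equation is the gradient-flow dynamics whose exact solutions, for deep linear networks, are derived in the paper by Saxe, McClelland, and Ganguli cited above.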

Behind the simplicity and elegance of this differential equation system for deep learning, you can find an enormous amount of complexity. A deep learning system could have hundreds, thousands, or millions of dimensions. Hence, this differential equation system could also have hundreds, thousands, or millions of equations. Moreover, the loss function J(𝚹) could be incredibly complex, non-convex, high-dimensional, and bizarre. It could have lots of local minima. Even the slightest experience (a single sample) could modify its topology in a significant way, and a minimal change in the neural architecture could modify its topology in unexpected ways. You only perceive this huge complexity when you try to solve the differential equation system.

We can easily prove that deep learning is a generalization of linear regression. So, I suggest you do the following experiment: using least squares as the loss function for linear regression, solve this differential equation system and try to obtain the exact solutions for linear regression that we all know:
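
For instance, for a one-dimensional model y ≈ m x + b fitted by least squares, the well-known closed-form solutions are

    m = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
    \qquad
    b = \bar{y} - m \, \bar{x}

or, in matrix form, \theta = (X^T X)^{-1} X^T y. Solving the gradient-flow equation for the least-squares loss should converge to exactly these values.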

This is an alternative way to derive these formulae, but it works. It is surprising and reassuring that both methods, direct algebraic manipulation and solving this differential equation system, converge to the same formulae.

In the paper “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks” (by Andrew M. Saxe, James L. McClelland, Surya Ganguli), they found the exact solutions for the differential equation system of deep linear neural networks. However, even McClelland, one of the godfathers of neural networks, could not solve the general problem, which is finding the exact solutions for the differential equation system of deep nonlinear neural networks.

What about recurrent neural networks, convolutional neural networks, and deep reinforcement learning? Their differential equation systems are even more complicated. Moreover, finding exact solutions would allow us to overcome the vanishing gradient problem of recurrent neural networks, which are effectively very deep once unrolled through time. And enabling one-shot learning would make deep reinforcement learning even more powerful.

How relevant is the solution to this problem? By analogy, we can compare the performance of computing the exact solution of linear regression against minimizing the least-squares loss iteratively with gradient descent. The exact method is many orders of magnitude faster than iterative numerical methods. Moreover, its solutions are exact, not approximations like the ones iterative numerical methods generate.
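
As a rough illustration (the problem size, learning rate, and iteration count below are arbitrary assumptions, and this is a sketch rather than a rigorous benchmark), we can solve the same linear regression problem both ways with NumPy:

    import time
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 20))          # synthetic design matrix
    true_w = rng.normal(size=20)
    y = X @ true_w + 0.1 * rng.normal(size=10_000)

    # Exact solution of the least-squares problem via a stable solver.
    t0 = time.perf_counter()
    w_exact = np.linalg.lstsq(X, y, rcond=None)[0]
    t_exact = time.perf_counter() - t0

    # Iterative solution: plain gradient descent on the mean squared error.
    t0 = time.perf_counter()
    w = np.zeros(20)
    lr = 0.01
    for _ in range(5_000):
        grad = (2.0 / len(y)) * X.T @ (X @ w - y)   # gradient of the MSE loss
        w -= lr * grad
    t_iter = time.perf_counter() - t0

    print(f"exact:     {t_exact:.4f} s, error {np.linalg.norm(w_exact - true_w):.2e}")
    print(f"iterative: {t_iter:.4f} s, error {np.linalg.norm(w - true_w):.2e}")

The closed-form path does far less arithmetic here, and it returns the exact minimizer instead of an approximation whose quality depends on the learning rate and the number of iterations.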

In like manner, today’s deep learning is a numerical method that is hungry for the power of GPUs and TPUs. Finding exact solutions for deep learning could lower its computational complexity in a dramatic way. We would need only the small computational power of cellphones to learn from complex data, instead of relying on the cloud. By solving this complex problem, we would save a lot of the electricity required to run the many iterations of deep learning algorithms, which execute daily on billions of computational devices around the world. So, the solution to this complex problem would be eco-friendly. I also predict that one-shot learning would become possible by algebraically manipulating the exact solutions.

I’m actually trying to solve this really complicated problem by using functional analysis. My artificial mathematician exploits patterns of functional analysis in infinite-dimensional Banach spaces. Taylor series dissolve mathematical problems into infinitely small grains of sand, which are easier to handle algebraically. Artificial mathematicians are now a reality, and they can already outperform the analytical power of human mathematicians at certain tasks:

Google AI system proves over 1200 mathematical theorems
https://mathscholar.org/2019/04/google-ai-system-proves-over-1200-mathematical-theorems/

I will make a final analogy. When Évariste Galois and Niels Henrik Abel studied the general quintic equation and proved that it cannot be solved in radicals, they created a new kind of mathematics that grew into abstract algebra. I predict Lie groups will play a key role in understanding the fields of convergence, the Banach fixed points. Lie groups sit at the heart of a branch of abstract algebra that deals with continuous, differentiable manifolds, like the ones the loss function of deep learning generates. I created this Facebook group many years ago in order to talk about these topics:

Categories & the Geometry of Mind (Facebook Group)
https://www.facebook.com/groups/categories.and.geometry.of.mind/

There is probably no neuroscientific evidence that the brain uses gradient descent. But what if evolution found a way to implement exact solutions inside brains instead? After all, with sufficient time and experimentation, evolution by natural selection can converge to all kinds of causal geometries, no matter how complex they are. For further investigation, you can read this paper:

Towards Biologically Plausible Deep Learning
By Yoshua Bengio, Dong-Hyun Lee, Jorg Bornschein, Thomas Mesnard, Zhouhan Lin
https://arxiv.org/abs/1502.04156

P.S. The ideas of this essay were taken from this specific part (1:03:51) of my Deep RL webinar:

Webinar on Deep Reinforcement Learning (Secure & Private AI Scholarship Challenge)
https://youtu.be/oauLZG9nAX0?t=3831

--

Juan Carlos Kuri Pinto
Secure and Private AI Math Blogging Competition

I hold a Master of Science in Computer Science, with a specialization in Machine Learning, from the Georgia Institute of Technology.