How and Why Deep Learning Works

Deep learning, i.e. deep neural networks, is a very useful family of machine learning algorithms that has become enormously popular thanks to its wide range of applications and to the AI breakthroughs it has produced.

However, most people think neural networks are black boxes whose contents cannot be understood. Hence, it is worth explaining to the audience how and why deep learning works.

Neural networks are homeomorphisms

Let’s begin by explaining homeomorphisms, the most difficult concept behind the inner workings of deep learning. Please read this article:

Neural Networks, Manifolds, and Topology
https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

As a summary of this article: a homeomorphism is a continuous, invertible mapping from one multidimensional space to another multidimensional space that preserves the topological structure of the first space, so neighboring points stay neighbors.

In the article, we see two interesting graphs: the first graph, whose 2 classes (blue and red) are not linearly separable, is projected by a 2D neural network into a second graph in which the classes become linearly separable. This projection is a homeomorphism from a Euclidean space toward a deformed space whose metric is no longer Euclidean. This very simple homeomorphism between two 2-dimensional spaces lets us understand how neural networks deform spaces: their purpose is to project patterns into new spaces in which the classes to classify become linearly separable.

First graph: 2 classes of patterns (blue curve and red curve) which are not linearly separable. This space is Euclidean because uniform increments in (X, Y) produce straight lines and squares.
Second graph: the same 2 classes (blue curve and red curve), now linearly separable in this deformed space, which is a projection or homeomorphism generated by a neural network whose purpose is to warp the first space until the classification becomes linearly separable.

Look at how the squares of the first graph are deformed in the second graph in order to make the two classes, blue points and red points, linearly separable.

This is a very simple example, but you can use your imagination to generalize how this process occurs in neural networks with thousands of dimensions.
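
To make this concrete, here is a minimal sketch (using scikit-learn; the dataset, network size, and hyperparameters are my own illustrative choices, not from the article above). It trains a tiny network on two concentric circles, then shows that a linear classifier fails on the raw coordinates but succeeds on the hidden-layer representation, i.e. in the deformed space:

import numpy as np
from sklearn.datasets import make_circles
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

# Two classes that are NOT linearly separable: a circle inside a circle.
X, y = make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=0)

# A linear classifier on the raw 2D space fails (roughly 50% accuracy).
linear = LogisticRegression().fit(X, y)
print("linear accuracy on raw space:", linear.score(X, y))

# A tiny network learns a deformation of the space.
net = MLPClassifier(hidden_layer_sizes=(8,), activation='tanh',
                    max_iter=5000, random_state=0).fit(X, y)

# Recover the hidden representation: h = tanh(X W + b).
H = np.tanh(X @ net.coefs_[0] + net.intercepts_[0])

# In the deformed (hidden) space, the classes become linearly separable.
linear_h = LogisticRegression().fit(H, y)
print("linear accuracy in deformed space:", linear_h.score(H, y))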

Deep abstractions as funnels

Deep learning is like a funnel that extracts the common features of patterns. This is a new definition of abstraction. We can make an analogy with finding the common factor of a mathematical expression: A*B + A*C = A*(B+C). In a similar manner, deep learning finds the features that patterns have in common.

Deep learning makes homeomorphisms toward spaces of fewer and fewer dimensions that are mapped in a continuum. This allows the network to make ANALOGIES in continuous spaces of FEATURES. Those features are generated automatically, unlike in other machine learning algorithms, which require hand-engineered features.
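
A classic illustration of analogies in a continuous feature space is word-vector arithmetic (king - man + woman ≈ queen). The sketch below uses tiny, hand-made toy vectors purely for illustration; real systems learn embeddings with hundreds of dimensions:

import numpy as np

# Toy 4-dimensional embeddings (hypothetical values, for illustration only).
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.6]),
    "queen": np.array([0.9, 0.1, 0.8, 0.6]),
    "man":   np.array([0.1, 0.9, 0.0, 0.2]),
    "woman": np.array([0.1, 0.2, 0.7, 0.2]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Analogy as vector arithmetic: king - man + woman ≈ queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # "queen": the analogy lands nearest the right concept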

This is more alchemy than science, because there is no complete mathematical proof yet. It is also a bit of cargo-cult science, because biological neural networks were imitated without being fully understood.

Douglas Hofstadter (with Emmanuel Sander) wrote an interesting book about analogy-making called “Surfaces and Essences: Analogy as the Fuel and Fire of Thinking”.

Analogies are a flexible way of thinking. Analogies extrapolate patterns and avoid type-2 errors (false negatives). However, analogies increase the number of type-1 errors (false positives). Life is too complex and the space of possible patterns is so vast that it is impossible to deal with every individual pattern. Hence, we need to cluster patterns by similarity, because similar patterns produce similar outcomes. In the context of Deep Reinforcement Learning (which we will study shortly), complex states cluster individual patterns in order to make the network of transitions between complex states more computationally tractable. So similar complex states and similar actions produce similar outcomes.

Mathematics professors at technological institutes encourage students to find exact solutions to mathematical problems. If the answer to a numerical approximation is slightly different, you get zero points. That’s not the way brains work. Brains are flexible and use intuition. For example, robots should be able to handle a huge variety of doors. If a robot concludes “this pattern is not a door because it is not exactly equal to my model of a door,” then that robot commits a type-2 error and will not be able to handle that door appropriately.

Hofstadter summarized it well in a quote:

“The entire effort of artificial intelligence is essentially a fight against computers’ rigidity.” — Douglas Hofstadter

In the last chapter of Hofstadter’s book, he argues at length that analogy-making is the same exact thing as categorization! And we already have formal methods of categorization (or classification) like deep learning. So deep learning is a computational tool for making machines capable of making analogies.

The 2 purposes of non-linear activation functions

Non-linear activation functions have 2 purposes:
- to make the mapping non-linear, like the real-world problems it must model;
- and to carve out complex regions of the pattern space.

Real-world problems are chaotic, non-linear, multidimensional, multicausal, complex, and bizarre. Hence, neural networks cannot be linear, because they are supposed to adapt to non-linear causal geometries.

Non-linear activation functions usually have 2 very different parts, separating an activation region from a non-activation region. So non-linear activation functions cut the pattern space into complex states and serve to model the frontiers of patterns.
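
A quick sketch of the first purpose: without a non-linear activation, stacking layers is pointless, because the composition of linear maps is itself a single linear map. A minimal numpy check (random weights, chosen only for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))

# Two linear layers collapse into one linear layer: W2 (W1 x) = (W2 W1) x.
two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))  # True: no expressive power gained

# A non-linearity (here ReLU) breaks the collapse and also cuts the space:
# each ReLU unit is active on one half-space and silent on the other.
relu = lambda z: np.maximum(z, 0.0)
print(np.allclose(W2 @ relu(W1 @ x), one_layer))  # False in general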

Cybenko’s Universal Approximation Theorem

Single-layer perceptrons can only separate patterns by using linear frontiers, so they are very limited. In fact, Marvin Minsky (with Seymour Papert) wrote a famous critique, the 1969 book “Perceptrons,” showing that single-layer perceptrons cannot solve the XOR problem. That critique contributed to the first AI winter.
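
To see the limitation and its fix, here is a small sketch: XOR is not linearly separable, but a hand-wired two-layer network (one textbook construction among many; the weights below are chosen by hand, not learned) solves it:

import numpy as np

step = lambda z: (z >= 0).astype(int)   # threshold activation

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
xor = np.array([0, 1, 1, 0])

# No single line separates XOR, but two hidden units can:
# h1 fires when at least one input is on (OR), h2 when both are on (AND).
h1 = step(X @ np.array([1, 1]) - 0.5)   # OR
h2 = step(X @ np.array([1, 1]) - 1.5)   # AND
y = step(h1 - h2 - 0.5)                 # OR and NOT AND = XOR
print(y, (y == xor).all())              # [0 1 1 0] True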

Two-layer neural networks can represent convex open or closed regions of the pattern space, which are more sophisticated. However, they are not versatile enough to characterize the frontier of arbitrary patterns.

In 1989, Cybenko published the paper behind the Universal Approximation Theorem, which states that a three-layer neural network (one hidden layer) with sufficiently many hidden neurons can approximate any continuous function, and can therefore characterize the frontier of arbitrary patterns. In other words, sufficiently large neural networks are adaptive and shapeless like amoebas: they can take any shape through adaptation.

Neural networks are both discriminative classifiers and generative classifiers

Discriminative classifiers model the frontier of patterns. They have fewer parameters and occupy less memory. An example of a discriminative classifier is the perceptron:

How the perceptron works as a classifier whose frontier is linear.

Generative classifiers model the stereotype of patterns. They have more parameters and occupy more memory. An example of a generative classifier is k-nearest neighbors, which memorizes every training sample and classifies a new point by looking at its nearest stored neighbors, so that each class is represented by a cloud of examples, a stereotype of the class:

Example of k-nearest neighbors
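
A minimal k-nearest-neighbors sketch (scikit-learn; the data and the choice of k = 5 are illustrative):

from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Two Gaussian blobs: each class is represented by its stored samples.
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# k-NN "memorizes everything": fit() essentially just stores the training set.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# A new point is classified by a vote among its 5 nearest stored neighbors.
print(knn.predict([[0.0, 2.0]]))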

Neural networks are both discriminative and generative classifiers. The most famous example of this is the auto-encoder, which acts as a pattern compressor. Auto-encoders have an encoding part that performs dimensionality reduction (or compression) and a decoding part that reconstructs the original pattern, showing that neural networks are also generative classifiers. Auto-encoders model both the frontiers and the stereotypes of patterns. They have a moderate number of parameters and occupy a moderate amount of memory.

Example of an auto-encoder.
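
A minimal auto-encoder sketch (PyTorch; the 784-to-32 sizes are my illustrative choices, e.g. for MNIST-like inputs):

import torch
import torch.nn as nn

# Encoder: a funnel toward fewer dimensions (compression / abstraction).
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
# Decoder: reconstructs the original pattern from the 32-dim code (generation).
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.rand(16, 784)                 # a batch of flattened "images"
code = encoder(x)                       # compressed representation
reconstruction = decoder(code)          # generated pattern

# Training would minimize the reconstruction error, e.g.:
loss = nn.functional.mse_loss(reconstruction, x)
print(code.shape, reconstruction.shape, loss.item())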

Neural networks, PCA, ICA, eigenvectors, and eigenvalues belong to the same family of algorithms. You can google “eigenfaces” and you will notice that the stereotype is more beautiful than the individual samples. Why? Because beauty can be explained as a consequence of the least-action principle: patterns that produce the least neuronal resistance, i.e. the least difference when compared to the neural stereotypes, are perceived as more beautiful by brains. In the case of faces, classical faces are more beautiful. All artists know this.

An eigenface
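
A minimal eigenfaces sketch (scikit-learn; the dataset and the number of components are illustrative choices):

from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

# 400 grayscale face images, each flattened to a 4096-dim vector (64x64).
faces = fetch_olivetti_faces()

# PCA finds the principal directions of variation among the faces.
pca = PCA(n_components=50).fit(faces.data)

mean_face = pca.mean_.reshape(64, 64)            # the "stereotype" face
first_eigenface = pca.components_[0].reshape(64, 64)
print(mean_face.shape, first_eigenface.shape)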

Is deep learning inspired by biology?

The Hodgkin–Huxley model of the squid neuron was the first serious attempt to characterize the properties and dynamics of biological neurons. However, one of the most realistic yet computationally efficient models of neurons is the Izhikevich model, which reproduces spikes, bursts, and the low and high frequencies of brain activity.

Izhikevich model
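
A minimal simulation sketch of the Izhikevich model (numpy; the a, b, c, d values are Izhikevich’s published “regular spiking” parameters, while the input current and time step are illustrative):

import numpy as np

# Izhikevich model: v' = 0.04 v^2 + 5 v + 140 - u + I,  u' = a (b v - u),
# with the reset rule: if v >= 30 mV then v <- c and u <- u + d.
a, b, c, d = 0.02, 0.2, -65.0, 8.0      # "regular spiking" parameters
v, u = -65.0, b * -65.0                  # initial membrane state
dt, I = 0.5, 10.0                        # time step (ms), input current

spikes = []
for step in range(2000):                 # 1 second of simulated time
    v += dt * (0.04 * v * v + 5 * v + 140 - u + I)
    u += dt * a * (b * v - u)
    if v >= 30.0:                        # spike: record it and reset
        spikes.append(step * dt)
        v, u = c, u + d
print(f"{len(spikes)} spikes in 1 s of simulation")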

Spiking neural networks are more sophisticated than artificial neural networks because they can model other aspects of cognition, e.g. neuronal synchronization. However, artificial neural networks are more practical and work well because they are more computationally tractable than spiking neural networks.

Artificial neural networks are based on the McCulloch-Pitts perceptron. Each neuron (or unit) has synapses (or weights) that represent the strength of its connections to other neurons. Synapses are the parameters that vary when using gradient descent; the adaptation of synapses produces learning. Synapses (or weights) are multiplied by the inputs, the signals coming from other neurons. Large positive weights excite the neuron, whereas large negative weights inhibit it; small or zero weights are indifferent. The resulting excitatory and inhibitory signals are summed inside the neuron body, and finally an activation function computes the output of the neuron, which is connected to other neurons, and so on.

McCulloch-Pitts perceptron
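
That paragraph, as a sketch in numpy (the weights and inputs are toy values chosen by hand for illustration):

import numpy as np

x = np.array([1.0, 0.0, 1.0])        # inputs: signals from other neurons
w = np.array([0.7, -1.2, 0.4])       # synapses: positive excite, negative inhibit
b = -0.5                             # bias (the neuron's threshold)

z = w @ x + b                        # excitation and inhibition summed in the body
output = 1.0 / (1.0 + np.exp(-z))    # activation function (here a sigmoid)
print(output)                        # signal passed on to the next neurons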

In the online course Computational Neuroscience at Coursera, Prof. Rajesh Rao made some assumptions and algebraic manipulations to show that there is indeed a mathematical connection between spiking neural networks and the McCulloch-Pitts perceptron. Perceptrons operate in the frequency domain: the numbers they exchange represent frequencies of activation. Bigger numbers represent higher frequencies of activation, and smaller numbers represent lower frequencies. Remember that Fuster defined consciousness as the high-frequency activation of cognits, beyond the threshold of awareness. Cognits with lower frequencies of activation constitute the subconscious: the parts of the brain that are silent at a given moment, waiting for the proper sensory stimuli to activate them so they can represent thoughts corresponding to phenomena in the real world.

A slide of a lecture in the online course Computational Neuroscience at Coursera.

“Now you know that this equation that people in the artificial intelligence community have been using for a very long time is, in fact, a simplification of the rich dynamics that one has in the synaptic current, as well as the dynamics of the output firing rate.” — Rajesh Rao, professor of computational neuroscience, in the Coursera course

Deep Reinforcement Learning

Deep Reinforcement Learning (or Deep RL) is the perfect synergy between 2 worlds: reinforcement learning and deep learning.

Reinforcement learning comes from the family of search algorithms, which are general problem solvers. These algorithms are very powerful, versatile, and expressive. They represent the world as a network of states. At each state, the agent can take one of many actions, each of whose outcomes is another state with an associated probability of occurrence and a reward (R) or cost (C). A cost is simply a negative reward.

A network of states (S), actions (a), probabilities (P), and costs (c).

Deep RL attaches a deep neural network that senses the world through sensors and learns the most appropriate actions, the ones that maximize rewards and minimize costs. It is a pretty elegant and powerful solution to many problems.

Deep RL can learn from the world by exploring it in an unsupervised way. However, the environment acts as the teacher or supervisor of the deep neural network (which is a supervised method): it punishes or rewards the actions taken by the agent at each specific state. The agent constantly operates in the environment in real time through the perception-action cycle.

The neural network used by Deep RL can be very complex and very deep, such as the convolutional neural network used by DeepMind Technologies to make a Deep RL agent play many Atari videogames. Notice that the sensed state is as complex as the pixels on the screen of an Atari videogame, and the actions are all the possible inputs of an Atari joystick.

Graph taken from a paper of DeepMind Technologies.

So the Deep RL agent learns how to maximize the reward points by seeing the Atari screen and experimenting with actions of the joystick. At the beginning, the exploration of actions is mostly random; over time, exploration is increasingly driven by a greedy policy (e.g. epsilon-greedy) which seeks to maximize the reward.
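
A minimal tabular sketch of these ideas (Q-learning with an epsilon-greedy policy on a toy 5-state chain; all numbers are illustrative). Deep RL replaces the Q-table below with a deep neural network:

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2              # chain of states; actions: 0=left, 1=right
Q = np.zeros((n_states, n_actions))     # estimated value of each (state, action)
alpha, gamma, epsilon = 0.1, 0.9, 1.0   # learning rate, discount, exploration

for episode in range(500):
    s = 0
    while s != n_states - 1:            # reward of +1 only at the last state
        # Epsilon-greedy: explore randomly at first, exploit more and more later.
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
    epsilon = max(0.05, epsilon * 0.99) # decay exploration toward greediness

print(Q.argmax(axis=1))                 # learned policy: 1 (right) in every non-terminal state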

Learning inverses instead of deducing them

That introduction to Deep RL was necessary to explain an important strategy of deep learning: Learning inverses instead of deducing them.

For example, in my undergraduate thesis I modeled a robotic arm by using Lagrangian mechanics. Here is the physical model:

And here are the Lagrangian, the partial derivatives, and the differential equation system generated from this physical model.
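
The thesis equations themselves are in the figure; for orientation, the general recipe (standard Lagrangian mechanics, not the specific equations of my thesis) looks like this in LaTeX:

L = T - V
\frac{d}{dt}\left( \frac{\partial L}{\partial \dot{q}_i} \right) - \frac{\partial L}{\partial q_i} = \tau_i

where the q_i are the joint angles of the arm and the \tau_i are the applied torques. For a multi-link arm, these equations become a coupled, non-linear system of differential equations.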

This differential equation system is so complex that we cannot derive a symbolic solution. We must use numerical methods like Runge-Kutta to generate approximations of the inverse.
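
A minimal sketch of the classical fourth-order Runge-Kutta (RK4) method (numpy; a simple damped pendulum stands in for the arm’s much more complex dynamics):

import numpy as np

def f(t, y):
    """Damped pendulum dynamics: y = [angle, angular velocity]."""
    theta, omega = y
    return np.array([omega, -9.81 * np.sin(theta) - 0.1 * omega])

def rk4_step(f, t, y, h):
    """One classical fourth-order Runge-Kutta step of size h."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

t, y, h = 0.0, np.array([1.0, 0.0]), 0.01   # start at 1 rad, at rest
for _ in range(1000):                        # integrate 10 seconds
    y = rk4_step(f, t, y, h)
    t += h
print(y)                                     # approximate state at t = 10 s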

In the real world, animals know neither mathematics nor Lagrangian mechanics. However, they can move and walk. They learn to move by intuition. They learn the inverse to control their movements rather than deducing the inverse.

Perception is also an inverse problem. We perceive by using causal induction that goes from effects (sensations) to causes (percepts). Almost everything in the field of artificial intelligence is causal induction. That’s why Solomonoff induction is so important for solving Artificial General Intelligence (AGI).

An example of perception is vision. Animals know neither mathematics nor projective geometry nor linear algebra. However, they can see. They learn to see by intuition. They learn the inverse directly to see rather than deducing the mathematics of computer vision. They exploit the abilities of neural networks to characterize both the stereotypes of objects (generative models) and the frontiers of complex patterns (discriminative models).

In like manner, only a handful of world experts are capable of modeling human speech and room acoustics mathematically. However, animals who know nothing about mathematics can hear and infer things from sounds. They learn to hear by intuition.

So we have a heated debate: reductionism (rationality) versus holism (intuition), the approach championed by Hinton. Jitendra Malik is an expert in computer vision who uses reductionism to crack computer vision. He is the opposite of Geoffrey Hinton, who uses the artificial intuition of neural networks to solve computer vision. And the winner was intuition. The subconscious, reflexes, and intuition are the opposite of deduction and rationality.

Correlation does not imply causation; reductionist scientists repeat it all the time. However, intuition suggests to us that correlation helps to infer causation. If some events are highly correlated, we can start to suspect they are causally connected. And if they are correlated in every case we observe, we have strong grounds to suspect a causal connection.

Moreover, neural networks are not black boxes. Experts in causation like Judea Pearl often criticize deep learning by saying that it cannot do causal induction. But we can trace and track the neural activations of a Deep RL agent in order to infer the weird causal pathways inside it. So Deep RL can do causal induction as well, in its own bizarre way.

P.S. These ideas were taken from my webinar on Deep RL:

Webinar on Deep Reinforcement Learning (Secure & Private AI Scholarship Challenge)
https://youtu.be/oauLZG9nAX0


Juan Carlos Kuri Pinto
Secure and Private AI Math Blogging Competition

I hold a Master of Science in Computer Science, with a specialization in machine learning, from the Georgia Institute of Technology.