Machine learning has managed some impressive feats, beating human minds at tasks we never thought machines would be able to master, all by focusing on function optimisation. Neuroscience, on the other hand, has made progress through detailed analysis of how our minds work, discovering a wide range of brain areas, cell types and mechanisms. It seems as though neuroscience and machine learning are on completely separate trajectories. However, they have much in common, and this article explores their relationship.
This article is based on a research paper by DeepMind, Toward an Integration of Deep Learning and Neuroscience.
Machine learning, and neural networks in particular, are obviously inspired by the neurons in our brains, but many of the major modern developments have been driven by breakthroughs in the mathematics of efficient optimisation and in algorithmic capability. Neural networks have progressed from simple linear systems to complex architectures, such as deep and recurrent networks that mimic memory function in the human mind. Gradient descent algorithms have also improved to include momentum terms, efficient weight initialisation and conjugate gradients. These improvements are not directly related to neuroscience.
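As a toy illustration of one such optimisation advance, here is a minimal sketch of gradient descent with a momentum term; the cost function, learning rate and momentum coefficient below are illustrative choices, not taken from the paper.

```python
import numpy as np

def gradient_descent_momentum(grad, w0, lr=0.1, beta=0.9, steps=300):
    """Minimise a function given its gradient, using a momentum term."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v - lr * grad(w)  # accumulate a velocity from past gradients
        w = w + v                    # step along the accumulated velocity
    return w

# Example: minimise the ill-conditioned quadratic C(w) = w1^2 + 10 * w2^2,
# where momentum damps the oscillation along the steep w2 direction.
grad_C = lambda w: np.array([2.0 * w[0], 20.0 * w[1]])
w_star = gradient_descent_momentum(grad_C, [5.0, 5.0])
```

The velocity term averages recent gradients, which is what lets the iterate keep moving along shallow directions while oscillations along steep directions cancel out.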
In particular, there are three aspects of modern neural networks that show similarity to the human brain. Firstly, neural networks are focused on the optimisation of cost functions. Secondly, modern machine learning uses complex cost functions that are not uniform across the layers of a network, even accounting for interactions between layers. Thirdly, the architectures themselves have evolved to become more complex, now containing memory cells with multiple states. We argue that each of these has a counterpart in how the human brain works, which we formalise in the following hypotheses:
Hypothesis 1: The brain optimises cost functions
We understand what it means for a mathematical function to be optimised, but what exactly would it mean for a brain to be optimising cost functions? Many processes can be classed as optimisations: the laws of physics minimise the action functional, and evolution optimises the fitness of replicators over long timescales. The hypothesis can be condensed into two key claims:
- The brain has mechanisms for credit assignment during learning, allowing it to optimise global functions in multi-layer networks by adjusting the properties of each neuron to contribute to the global outcome.
- The brain has mechanisms to form highly tunable cost functions, shaped by evolution and matched to the ethological needs of the animal. So, the brain uses cost functions as a key driving force of its development, akin to neural networks in machine learning.
Many theories of the cortex explore self-organising and unsupervised learning properties that could operate without multi-layer backpropagation in the brain. Well-established Hebbian plasticity adjusts a synaptic weight according to the correlation of pre-synaptic and post-synaptic activity. This can give rise to different forms of correlation and competition between neurons, forming self-organising maps.
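The Hebbian idea can be sketched in a few lines. Below is Oja's stabilised variant of the rule (the decay term keeps the weight norm bounded); the input statistics and constants are made up for illustration, not taken from the paper.

```python
import numpy as np

def oja_update(w, x, lr=0.01):
    """One Hebbian-style update: the weight change is driven by the product
    of pre-synaptic activity x and post-synaptic activity y. Oja's decay
    term -y^2 * w keeps the weights from growing without bound."""
    y = w @ x                          # post-synaptic activity
    return w + lr * y * (x - y * w)    # Hebbian term y*x minus decay y^2*w

# Driven by correlated inputs, the weights self-organise towards the
# principal direction of the input statistics.
rng = np.random.default_rng(0)
cov = np.array([[3.0, 1.0], [1.0, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], cov, size=5000)
w = np.array([0.0, 1.0])
for x in X:
    w = oja_update(w, x)
```

This is the sense in which purely local correlation-based rules can produce structured maps: each weight vector drifts towards the dominant correlations in its input, with no global error signal involved.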
However, this local self-organisation is insufficient to account for the brain's powerful learning abilities, so there may be some form of gradient descent that allows the brain to learn more efficiently. Several biologically plausible approximations of gradient descent have been proposed, such as Contrastive Hebbian Learning and Generalised Recirculation. All of these involve some form of feedback connections that physically carry error information. An example is spike-timing-dependent plasticity (STDP), where, in some neurons, the sign of the synaptic weight change depends on the relative timing of the pre-synaptic and post-synaptic spikes. Although this is conventionally interpreted as Hebbian learning, an alternative interpretation, more appropriate in the context of neural networks, is that neurons can encode the error derivatives needed for backpropagation in the temporal derivatives of their firing rates.

Another way neurons could approximate backpropagation is feedback alignment. Here, the feedback pathway of backpropagation is replaced by a set of random feedback connections with no dependence on the forward weights. This mechanism of computing error derivatives works well on most tasks, subject to the existence of a synaptic normalisation mechanism and approximate sign-concordance between the feedforward and feedback connections. In effect, the forward weights adapt to bring the network into a regime in which the random backward weights actually carry information useful for approximating the gradient.

We should note that there are some differences in the way biological and artificial neurons behave. A biological neuron is either excitatory or inhibitory, whereas an artificial neuron can excite some of its targets and inhibit others. Furthermore, biological neurons are highly recurrent and most often communicate via spikes.
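Feedback alignment admits a compact numerical sketch: the backward pass sends the output error through a fixed random matrix B instead of the transpose of the forward weights. The network sizes, learning rate and random linear teacher below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# A two-layer network learns to imitate a random linear teacher mapping T.
n_in, n_hid, n_out = 10, 20, 5
W1 = rng.normal(0.0, 0.1, (n_hid, n_in))
W2 = rng.normal(0.0, 0.1, (n_out, n_hid))
B = rng.normal(0.0, 0.1, (n_hid, n_out))   # fixed random feedback weights
T = rng.normal(0.0, 1.0, (n_out, n_in))    # teacher to be imitated

def mse(W1, W2, X):
    return float(np.mean((W2 @ np.tanh(W1 @ X) - T @ X) ** 2))

X_test = rng.normal(0.0, 1.0, (n_in, 500))
loss_before = mse(W1, W2, X_test)

lr = 0.02
for _ in range(5000):
    x = rng.normal(0.0, 1.0, (n_in, 1))
    h = np.tanh(W1 @ x)
    e = W2 @ h - T @ x                     # output error
    # Backprop would send the error back through W2.T; feedback
    # alignment routes it through the fixed random matrix B instead.
    dh = (B @ e) * (1.0 - h ** 2)
    W2 -= lr * e @ h.T
    W1 -= lr * dh @ x.T

loss_after = mse(W1, W2, X_test)
```

Even though B is never trained, the loss drops: as described above, the forward weights drift into a regime where the randomly projected error carries usable gradient information.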
Backpropagation through time (BPTT) is usually used to train recurrent networks, but no similar biological mechanism is known, and it seems implausible that the brain could unfold its recurrent circuits over time steps to perform gradient calculations. However, given an appropriate form of memory storage and representation, biological neurons might not need BPTT at all.
There are multiple general strategies for addressing whether and how the brain optimizes cost functions. A first strategy is based on observing the endpoint of learning. If the brain uses a cost function, and we can guess its identity, then the final state of the brain should be close to optimal for that cost function. If we know the statistics of natural environments and know the cost function, we can compare receptive fields optimized in simulation with measured ones. This strategy is only beginning to be used because, until recently, it has been difficult to measure receptive fields or other representational properties across a large population of neurons; the situation is now improving with the emergence of large-scale recording methods.
A second strategy could directly quantify how well a cost function describes learning. If the dynamics of learning minimize a cost function then the underlying vector field should have a strong gradient descent type component and a weak rotational component. If we could somehow continuously monitor the synaptic strengths, while externally manipulating them, then we could, in principle, measure the vector field in the space of synaptic weights, and calculate its divergence as well as its rotation. For at least the subset of synapses that are being trained via some approximation to gradient descent, the divergence component should be strong relative to the rotational component. This strategy has not been developed yet due to experimental difficulties with monitoring large numbers of synaptic weights.
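A toy version of that vector-field test, with made-up fields rather than measured synaptic dynamics: a pure gradient-descent field has zero curl (and here a negative divergence), while a purely rotational field has zero divergence and nonzero curl, and both quantities can be estimated from sampled dynamics by finite differences.

```python
import numpy as np

# Sample two candidate "weight update" fields on a 2-D grid of weights.
h = 2.0 / 49.0
w1, w2 = np.meshgrid(np.linspace(-1, 1, 50), np.linspace(-1, 1, 50),
                     indexing="ij")

# Gradient-descent field for the cost C(w) = w1^2 + w2^2, i.e. F = -grad C.
Fg = (-2.0 * w1, -2.0 * w2)
# Purely rotational field: circles the origin and minimises nothing.
Fr = (-w2, w1)

def curl_and_div(Fx, Fy, spacing):
    """Finite-difference curl (dFy/dw1 - dFx/dw2) and divergence."""
    dFx_d1, dFx_d2 = np.gradient(Fx, spacing, spacing)
    dFy_d1, dFy_d2 = np.gradient(Fy, spacing, spacing)
    return dFy_d1 - dFx_d2, dFx_d1 + dFy_d2

curl_g, div_g = curl_and_div(*Fg, h)   # gradient field: curl 0, divergence -4
curl_r, div_r = curl_and_div(*Fr, h)   # rotational field: curl 2, divergence 0
```

On real data the fields would be noisy samples rather than closed-form expressions, but the diagnostic is the same: learning that descends a cost function should show a strong irrotational component and a weak rotational one.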
A third strategy is based on perturbations: cost function based learning should undo the effects of perturbations which disrupt optimality, i.e., the system should return to local minima after a perturbation, and indeed perhaps to the same local minimum after a sufficiently small perturbation. If we change synaptic connections, e.g., in the context of a brain machine interface, we should be able to produce a reorganization that can be predicted based on a guess of the relevant cost function. This strategy is starting to be feasible in motor areas.
Lastly, if we knew structurally which cell types and connections mediated the delivery of error signals vs. input data or other types of connections, then we could stimulate specific connections so as to impose a user-defined cost function. In effect, we would use the brain's own networks as a trainable deep learning substrate, and then study how the network responds to training. Brain machine interfaces can be used to set up specific local learning problems, in which the brain is asked to create certain user-specified representations, and the dynamics of this process can be monitored. In order to do this properly, we must first understand more about how the system is wired to deliver cost signals. Much of the structure that would be found in connectomic circuit maps, for example, would not just be relevant for short-timescale computing, but also for creating the infrastructure that supports cost functions and their optimization.
Hypothesis 2: Cost functions are diverse across brain areas and time
In this section, we propose that the brain optimises cost functions that are not the same everywhere: they can vary across brain areas and change over time.
Clearly, we can map differences in structure, dynamics and representation across brain areas. When we find such differences, the question remains as to whether we can interpret these as resulting from differences in the internally generated cost functions, as opposed to differences in the input data, or from differences that reflect other constraints unrelated to cost functions. For example, we can use methods from inverse reinforcement learning to decode the cost functions at different parts of the structure.
Moreover, as we begin to understand the “neural correlates” of particular cost functions (perhaps encoded in particular synaptic or neuromodulatory learning rules, genetically-guided local wiring patterns, or patterns of interaction between brain areas), we can also begin to understand when differences in observed neural circuit architecture reflect differences in cost functions.
We expect that, for each distinct learning rule or cost function, there may be specific molecularly identifiable types of cells and/or synapses. Moreover, for each specialized system there may be specific molecularly identifiable developmental programs that tune it or otherwise set its parameters. This would make sense if evolution has needed to tune the parameters of one cost function without impacting others.
How many different types of internal training signals does the brain generate? When thinking about error signals, we are not just talking about dopamine and serotonin, or other classical reward-related pathways. The error signals that may be used to train specific sub-networks in the brain, via some approximation of gradient descent or otherwise, are not necessarily equivalent to reward signals. It is important to distinguish between cost functions that may be used to drive optimization of specific sub-circuits in the brain, and what are referred to as “value functions” or “utility functions”, i.e., functions that predict the agent’s aggregate future reward. In both cases, similar reinforcement learning mechanisms may be used, but the interpretation of the cost functions is different.
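To make the distinction concrete, here is a toy temporal-difference sketch of a value function, i.e., a prediction of aggregate future reward. The five-state chain, discount and learning rate are illustrative assumptions; a cost function for a specific sub-circuit would instead score that circuit's output directly, with no notion of future reward.

```python
import numpy as np

# Toy chain of 5 states; reward 1.0 arrives only on entering the last state.
n_states, gamma, alpha = 5, 0.9, 0.1
V = np.zeros(n_states)   # value function: predicted discounted future reward

for _ in range(2000):    # repeatedly walk the chain left to right
    for s in range(n_states - 1):
        s_next = s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        # TD(0) rule: nudge V(s) towards the bootstrapped target
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
```

The learned values decay geometrically with distance from the reward, which is exactly the "prediction of aggregate future reward" property that separates value functions from generic cost functions.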
Hypothesis 3: Optimisation occurs in the context of specialised structures
Optimisation of blank-slate neural networks will not be sufficient to generate complex cognition in the brain, even given a powerful combination of genetically specified cost functions and local learning rules. Therefore, we propose that the brain has different specialised structures performing different forms of optimisation that together optimise a joint goal defined by the cost functions discussed above.
If different brain structures are performing distinct types of computations with a shared goal, then optimization of a joint cost function will take place with different dynamics in each area. If we focus on a higher-level task, e.g., maximizing the probability of correctly detecting something, then we should find that basic feature detection circuits learn when the features were insufficient for detection, that attentional routing structures learn when a different allocation of attention would have improved detection, and that memory structures learn when items that matter for detection were not remembered.
We now look at predictive control. We often have to plan and execute complicated sequences of actions on the fly, in response to a new situation. At the lowest level, that of motor control, our body and our immediate environment change all the time, so it is important for us to maintain knowledge about this environment in a continuous way. The deviations between our planned movements and the movements we actually execute continuously provide information about the properties of the environment. It therefore seems important to have a specialized system that takes all our motor errors and uses them to update a dynamical model of our body and our immediate environment, one that can predict the delayed sensory results of our motor actions. The cerebellum appears to be such a structure, and lesions to it abolish our ability to deal successfully with a changing body. Newer research shows that the cerebellum is involved in a broad range of cognitive problems as well, potentially because those problems share computational structure with motor control. For example, when subjects estimate time intervals, which are naturally important for movement, it appears that the brain uses the cerebellum even if no movements are involved.
Importantly, many of the control problems we appear to be solving are hierarchical, so we propose some form of hierarchical control in the brain. We have a spinal cord, which deals with the fast signals coming from our muscles and proprioception. Within neuroscience, it is generally assumed that this system deals with fast feedback loops and that this behavior is learned by optimizing its own cost function. The nature of cost functions in motor control is still under debate. In particular, the timescale over which cost functions operate remains unclear: motor optimization may occur via real-time responses to a cost function that is computed and optimized online, or via policy choices that change more slowly in response to the cost function. Nevertheless, the effect is that central processing in the brain has an effectively simplified physical system to control, e.g., one that is far more linear. So the spinal cord itself already suggests the existence of two levels of a hierarchy, each trained using different cost functions.
Much of neuroscience has focused on the search for “the neural code”, i.e., it has asked which stimuli are good at driving activity in individual neurons, regions, or brain areas. But, if the brain is capable of generic optimization of cost functions, then we need to be aware that rather simple cost functions can give rise to complicated stimulus responses. This potentially leads to a different set of questions. Are differing cost functions indeed a useful way to think about the differing functions of brain areas? How does the optimization of cost functions in the brain actually occur, and how is this different from the implementations of gradient descent in artificial neural networks? What additional constraints are present in the circuitry that remain fixed while optimization occurs? How does optimization interact with a structured architecture, and is this architecture similar to what we have sketched?
On the other hand, much of machine learning has focused on finding ever faster ways of doing end-to-end gradient descent in neural networks. Neuroscience may inform machine learning at multiple levels. The optimization algorithms in the brain have undergone a couple of hundred million years of evolution. Moreover, the brain may have found ways of using heterogeneous cost functions that interact over development so as to simplify learning problems by guiding and shaping the outcomes of unsupervised learning. Lastly, the specialized structures evolved in the brain may inform us about ways of making learning efficient in a world that requires a broad range of computational problems to be solved over multiple timescales. Looking at the insights from neuroscience may help machine learning move towards general intelligence in a structured, heterogeneous world with access to only small amounts of supervised data.