Intelligence is not primarily about optimization or spikes, but about how information is represented
A debate emerged last week between Yann LeCun and Mike Davies (Intel, Loihi chip). Yann said he’s not convinced spikes are essential, and Mike said the current state-of-the-art (SOTA) machine learning (ML) models, i.e., the Deep Learning (DL) models, are not really learning but rather just doing optimization. Yann has a thoughtful Facebook post recapping it. I want to present my view of the themes in the debate.
First, as I think all would agree, in the brain, at the most fundamental level, learning = synaptic change. So all of the SOTA DL models are in fact learning. They are doing what I will call “learning-as-optimization” (LAO). This can be contrasted with what I will call “learning-as-memory” (LAM), which my own approach, Sparsey, exemplifies. LAM is really synonymous with “associative memory”, with roots going back to Marr, Willshaw, Palm, Kanerva, and others. I’m not including Hopfield here because Hopfield nets, like Boltzmann machines, i.e., energy-based models, are doing a form of optimization. Interestingly, I saw a recent talk by Geoff Hinton, in which he describes how the ML community basically settled en masse on LAO, instead of LAM, by the early 90s, and basically never looked back. [“Memory Networks”, “Neural Episodic Control”, etc., are essentially attempts to bring memory back in, but it’s being brought back into an LAO framework.]
First, regarding optimization-based learning.
Why is distinguishing LAO from LAM important? Well, to cut to the chase, I claim that the brain’s cortex is NOT doing optimization. The brain’s cortex is where all of one’s long-term memories, both specific (episodic) and general (semantic), reside. I claim that the cortex IS doing memory and further that most of the learning by which memories accrue in cortex is non-optimization-based unsupervised learning. However, while the cortex per se is not doing optimization, the brain as a whole does do a form of optimization, namely, reinforcement learning (RL) (which involves many other structures besides cortex). Error-driven supervised learning probably accounts for just a very small percentage of what people actually learn.
A central issue at the core of the debate is power. It is true that LAO Deep Learning (DL) approaches have been dominating virtually all the ML benchmarks for quite some time. But it is also true that LAO DL approaches require massive power. ImageNet can now be learned in ~4 minutes, which is great, but it requires 2,176 GPUs, which is not great! I’m not a hardware guy, but I assume that’s in the neighborhood of 400–500 kilowatts. The brain uses 20 watts. [I would like to know how long the DL model referenced would take to learn ImageNet if it ran as a single CPU process. Such a metric would be helpful for comparing the actual algorithms, i.e., a non-machine-parallelism baseline.]
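To put those figures on one scale, here is a back-of-the-envelope calculation. It uses only the numbers quoted above; the 400–500 kW value is a rough guess rather than a measurement, and the per-GPU wattage it prints is simply what that guess implies.

```python
import math

# Figures quoted in the text; the cluster power is a rough guess, not a measurement.
CLUSTER_KW_LOW, CLUSTER_KW_HIGH = 400, 500
BRAIN_WATTS = 20
NUM_GPUS = 2176


def power_ratio(cluster_kw, brain_watts=BRAIN_WATTS):
    """How many times more power the cluster draws than the brain."""
    return cluster_kw * 1000 / brain_watts


for kw in (CLUSTER_KW_LOW, CLUSTER_KW_HIGH):
    ratio = power_ratio(kw)
    per_gpu = kw * 1000 / NUM_GPUS  # wattage per GPU implied by the guess
    print(f"{kw} kW: {ratio:,.0f}x the brain "
          f"(~{math.log10(ratio):.1f} orders of magnitude), "
          f"implying ~{per_gpu:.0f} W per GPU")
```

Under those assumptions the ratio falls between 20,000x and 25,000x, i.e., a bit more than four orders of magnitude.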
So, why does LAO DL need so much power? The core problem with LAO is that, to a first approximation, it requires computing, for each exemplar, for each epoch (or SGD minibatch):
A) the gradient of a global objective function, i.e., the derivative, or “error”, for each weight (this is true even if only the sign of the gradient is ultimately used, as in RProp, or only a magnitude-normalized gradient, as in RMSProp), and
B) the amount by which to change each weight, i.e., the “delta”.
So, if a model has on the order of billions of weights (which is quite reasonable, going forward), that’s a lot of computation. Is there an alternative?
Yes: with a LAM approach, and more specifically, a binary LAM approach that uses a purely local, Hebbian-type, learning law, neither A nor B is needed. Clearly, A is not needed since there is no global objective. But also important is that B is not needed. That is, if weights are binary and all weights are initially 0, then the only possible change that will ever be made to a weight is to change it from 0 to 1. In that case, there is no need to ever compute the value of a delta. So, no computation, and therefore no time or power, is expended on A or B. To be clear, in both cases, LAO and LAM, the actual physical change to the weight is still needed. BUT, LAO actually requires compute cycles to determine that change, for each weight, for each exemplar, for each epoch or minibatch, which gives a huge computational advantage to a binary LAM scheme. Ultimately, and probably soon, neuromorphic algorithms will be running on “neural fabric” architectures, i.e., essentially 2D (or 3D) crossbars with analogs of synapses, e.g., memristors, spintronic elements, at the junctions. In that case, the actual physical weight change will be done locally and via “the physics”, i.e., not requiring computation, and so again, the energy consumed by the actual physical weight changes will be a wash between LAO and LAM.
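A minimal sketch of the contrast, in plain Python. The function names, the toy linear unit, and the squared-error objective are illustrative assumptions, not Sparsey's actual learning law. The only point is that the gradient step has to compute an error and a delta for every weight, while the binary Hebbian step just sets the weights between co-active units to 1.

```python
def lao_step(weights, x, target, lr=0.01):
    """One SGD step for a single linear unit.

    Hypothetical toy objective: 0.5 * (w.x - target)**2.
    """
    y = sum(w * xi for w, xi in zip(weights, x))
    error = y - target                    # (A) derivative of the objective w.r.t. y
    grads = [error * xi for xi in x]      # (A) one gradient term per weight
    deltas = [-lr * g for g in grads]     # (B) one delta per weight
    return [w + d for w, d in zip(weights, deltas)]


def lam_step(weights, pre_active, post_active):
    """Binary Hebbian step: set weights[i][j] = 1 for co-active units i and j.

    No gradient and no delta is computed; a weight can only go from 0 to 1.
    """
    for i in pre_active:
        for j in post_active:
            weights[i][j] = 1
    return weights


# Usage: 4 input units, 3 code units, all binary weights start at 0.
w = [[0] * 3 for _ in range(4)]
lam_step(w, pre_active={0, 2}, post_active={1})
print(w)

# The gradient step, by contrast, does arithmetic for every weight.
print(lao_step([0.0, 0.0], x=[1.0, 2.0], target=1.0, lr=0.1))
```

In the binary case the work per input is bounded by the number of co-active unit pairs, and none of it involves evaluating an objective.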
I believe the above observation provides strong motivation for pursuing LAM over LAO. But I think the following is also worth emphasizing. I’m not aware of any hard neuroscience evidence proving that the learning that occurs while any organism (or portion of an organism) learns some naturalistic, higher-level (e.g., symbolic) input-output relation, actually occurs in the form of small synaptic weight deltas, i.e., small relative to their whole range. Stated differently, based on experimental evidence, we cannot rule out the possibility that most, even all, learning that we would consider to underlie intelligence, actually occurs as large deltas, or more to the point, as effectively maximal, i.e., binary, deltas, as in the LAM framework described above. Yet, virtually all LAO approaches use a tiny learning constant (which may change during training), and thus, tiny weight deltas. If the preponderance of neuroscience evidence eventually establishes that in the brain, synaptic weight changes are effectively large (or maximal) with respect to their ranges, then how does that square with this long-standing fundamental characteristic of optimization-based learning?
This prior point is of course not a proof that LAO is not occurring in cortex, but it is an important point to keep in mind. However, there are stronger reasons why the current SOTA LAO approaches are not biologically plausible. Of course, there are the perennial arguments against Backprop. But if one steps back and really thinks about it, convolution is surely biologically implausible, as I’ve pointed out here and in this web essay. However, the most important reason for thinking LAO is not occurring in cortex is that there is an existence proof, Sparsey, of a far more efficient, non-optimization-based, on-line, unsupervised learning method, which learns via single trials and extremely efficiently (properties a, b, and c, below), measured in terms of either time or power.
Second, what about spikes?
First, I think most computational neuroscientists would agree that spikes are the essential currency underlying communication and computation in the brain. (Explicit modeling of sub-threshold dynamics may not be needed for a final human-level AI.) In this case, we can say that the most efficient way that nature has thus far figured out to perform intelligent computation is via spikes. But that’s in wetware. It does not mean that a more efficient method cannot be found for silicon (or other hardware technology).
In fact, I think there is a reasonable first principles argument why we might not expect spikes to be the most efficient currency. And here it is. A spike is a short event, which seems an advantage for achieving fast processing. In the brain, a spike is ~1–2 ms long; in silicon, it can be orders of magnitude shorter. So that’s great. But a spike is defined by two state changes, an upward deflection from a baseline, followed by a downward deflection. Detecting a spike requires recognizing both changes, and this recognition functionality must somehow be realized in the circuitry, either by operations, which take time and power, or by the spatial design of the circuit, which does not take time or power (or by some combination of the two). However, note that it is possible to send the same information contained in any individual spike, i.e., a pure binary event, with only one deflection, either upward from some low baseline, or downward from some high baseline. (By “same information” I mean the binary event alone, not including additional temporal information, i.e., when that spike occurred relative to a temporal landmark, e.g., a particular phase of gamma (spike time, latency code), or a count of the number of spikes in some surrounding time window (rate code).) So, perhaps there exists a computational framework (data structure + algorithm) that can realize the same power and generality as any given spike-based framework, but requiring only single deflections to send any given binary event. Again, I’m not a hardware guy, but I bet there is. And, if such a framework does exist, it seems it would likely be more power efficient than using spikes.
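The counting argument can be made concrete with a toy signal model. This is only an illustration of the "two state changes versus one" point, not a proposal for actual circuitry. The toggle scheme below is one assumed reading of "single deflection": each event flips the line once and leaves it at the new level.

```python
def count_transitions(signal):
    """Number of state changes between consecutive samples."""
    return sum(1 for a, b in zip(signal, signal[1:]) if a != b)


def pulse_encode(events):
    """Spike-like coding: each event is an upward deflection, then a return to baseline."""
    signal = [0]
    for e in events:
        signal += [1, 0] if e else [0, 0]
    return signal


def toggle_encode(events):
    """Single-deflection coding: each event flips the line once and leaves it there."""
    signal, level = [0], 0
    for e in events:
        if e:
            level = 1 - level
        signal += [level, level]
    return signal


events = [1, 0, 1, 1]  # three binary events in four time slots
print(count_transitions(pulse_encode(events)))   # two state changes per event
print(count_transitions(toggle_encode(events)))  # one state change per event
```

For the three events above, the pulse coding makes six state changes and the toggle coding makes three.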
In fact, I think the belief that spikes are essential stems from thinking that all spike codes must fundamentally be temporal, i.e., some form of rate or spike-time (latency) code. But, as I explain in my rejected Cosyne 2019 abstract, this is not true. Once one entertains the possibility that information in the brain is coded as cell assemblies, i.e., sets of simultaneously active neurons, and thus, that the messages sent from one coding field to another take the form of bundles of spikes simultaneously traversing bundles of axons from the source field to the destination field, much more efficient atemporal spike codes become possible. Wulfram Gerstner and colleagues describe one such atemporal scheme in Fig. 7.8 of Neuronal Dynamics, wherein “the population activity is defined as the fraction of neurons that are active in a short interval [my italics]”. But, since only one particular fraction can be active at any particular instant, this scheme can transmit only a single value, e.g., a likelihood of one represented item (hypothesis) in the source field, at any particular instant. My Cosyne abstract describes a far more powerful atemporal spike coding scheme in which the likelihoods of ALL represented items in the source coding field are simultaneously transmitted to the destination field. To my knowledge, Sparsey is the first model to exploit this principle for increasing the efficiency of communication and computation. In particular, none of the mainstream Probabilistic Population Coding (PPC) models, nor Capsules (discussed here), possesses this capability. The underlying representation scheme that makes this atemporal spike coding scheme possible is further elaborated, and empirical results given, in my 2017 arXiv paper, “A Radically New Theory of how the Brain Represents and Computes with Probabilities”. The figure below, from the Cosyne abstract, shows the new coding principle in the context of prior notions.
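The difference between the two atemporal schemes can be illustrated with sets. In this sketch, which is a simplification and not code from Sparsey, each stored item's code is a set of unit indices, and an item's likelihood is read out as the fraction of its code that is currently active. The item names, codes, and field size are made up for the example.

```python
def fraction_active(active, field_size):
    """Population-activity readout: a single scalar for the whole field."""
    return len(active) / field_size


def likelihoods(active, stored_codes):
    """Intersection readout: one value per stored item, all from the same active set."""
    return {name: len(active & code) / len(code)
            for name, code in stored_codes.items()}


# Hypothetical codes: 4 active units chosen from a field of 12.
stored = {
    "A": {0, 3, 6, 9},
    "B": {0, 3, 6, 10},  # shares 3 of 4 units with A
    "C": {1, 4, 7, 11},  # shares no units with A
}
active = stored["A"]  # the field currently holds A's code

print(fraction_active(active, field_size=12))
print(likelihoods(active, stored))
```

The first readout yields one number no matter how many items are stored. The second yields a graded value for every stored item from the same set of simultaneously active units (here 1.0 for A, 0.75 for B, and 0.0 for C).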
Interestingly, the Loihi effort referenced in the debate actually implemented an LAO approach on their spike-based platform. I would suggest a Loihi effort based on LAM (and I bet you can all guess what model I think they should try :). Though I don’t think spikes per se are truly needed for intelligence (as discussed above), they certainly are central to the brain’s implementation of intelligence. So, Loihi might possibly be a viable platform for super-efficient, human-level AI. But, to realize its full potential, it should leverage the atemporal spike coding scheme described here, which essentially requires that information be represented in the form of sparse distributed representations (see below), and which in turn suggests a LAM approach.
Finally, the most important thing: how information is represented.
I’ve long been certain that the single most essential thing needed for intelligence is that information be represented in the form of sparse distributed representations (SDRs), i.e., relatively small sets of binary units chosen from a much larger field, e.g., 100 chosen from 5,000. [N.b.: SDR is a completely different concept than “sparse coding”, cf. Olshausen & Field.] This is because SDR makes possible a simple, single-trial, unsupervised learning algorithm:
a) which preserves similarity from input space to code space (i.e., maps more similar inputs to more highly intersecting SDRs); and
b) for which the number of steps needed to store an item remains constant as the number of stored items increases.
In fact, a corollary of possessing these two properties is that the model is also able to:
c) retrieve the closest-matching stored item with a number of steps that also remains constant as the number of stored items increases.
Amongst other things, this means that such a model, e.g., Sparsey, constitutes a more efficient implementation of the functionality of adaptive locality sensitive hashing (LSH), i.e., effectively constant time complexity for both learning (storage) and best-match retrieval (probabilistic inference). Moreover, the use of SDR has profound implications regarding “learned multidimensional indexes”, as discussed in this article.
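Properties a, b, and c can be sketched with a toy binary associative memory. This is a heavily simplified illustration under stated assumptions, not Sparsey's actual algorithm: the class name, the modular winner-take-all layout (q modules of k units), and especially the rule "reuse the best-supported unit with probability equal to familiarity, otherwise pick at random" are stand-ins chosen for brevity.

```python
import random


class SdrMemory:
    """Toy binary associative memory with a modular SDR code field.

    The field has q winner-take-all modules of k units each, so every code
    has exactly q active units. All names and rules are illustrative.
    """

    def __init__(self, n_inputs, q, k, seed=0):
        self.n_inputs, self.q, self.k = n_inputs, q, k
        self.rng = random.Random(seed)
        # w[i][m][j] is the binary weight from input i to unit j of module m.
        self.w = [[[0] * k for _ in range(q)] for _ in range(n_inputs)]

    def _support(self, x):
        """Normalized input sum per unit; the cost depends on field sizes only."""
        return [[sum(self.w[i][m][j] for i in x) / len(x)
                 for j in range(self.k)] for m in range(self.q)]

    def _familiarity(self, u):
        """Mean over modules of the best support; 1.0 means a full match exists."""
        return sum(max(row) for row in u) / self.q

    def store(self, x):
        """Single-trial storage of x, a set of active input indices."""
        u = self._support(x)
        g = self._familiarity(u)
        code = []
        for m in range(self.q):
            # Familiar input: reuse the best-supported unit. Novel input: pick at random.
            if self.rng.random() < g:
                j = max(range(self.k), key=lambda j: u[m][j])
            else:
                j = self.rng.randrange(self.k)
            code.append(j)
            for i in x:
                self.w[i][m][j] = 1  # binary Hebbian update, 0 -> 1 only
        return code

    def retrieve(self, x):
        """Best-match retrieval: the winner in each module, with no scan over stored items."""
        u = self._support(x)
        return [max(range(self.k), key=lambda j: u[m][j]) for m in range(self.q)]


mem = SdrMemory(n_inputs=20, q=5, k=4, seed=1)
a = {0, 1, 2, 3, 4, 5}
code_a = mem.store(a)
noisy = {0, 1, 2, 3, 4, 19}  # one input unit differs from a
print(mem.retrieve(a) == code_a)
print(mem.retrieve(noisy) == code_a)
```

Both `store` and `retrieve` touch each (active input, code unit) pair once, so their step count depends on the sizes of the input and code fields and not on how many items have been stored. Because more familiar inputs reuse more of an existing code, similar inputs tend to receive more highly intersecting codes in this toy version.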
Despite the impressive success of the LAO approach, its power requirements (or time requirements if machine parallelism is not used) for learning are too great by 4–5 orders of magnitude. This and other reasons, e.g., the biological implausibility of Backprop and of convolution, should be pushing the community to explore other approaches. The largely ignored alternative is LAM, wherein the primary function of the brain’s cortex is memory, and more specifically, episodic memory, i.e., unsupervised, on-line, single/few-trial creation of memory traces (engrams) of experienced inputs. Though this episodic memory function is primary, if the right representation is used, i.e., SDR, and if the learning process preserves similarity, then the progressive accumulation of those engrams in superposition automatically yields semantic memory, i.e., knowledge of the statistical (similarity, class) structure over the inputs, as a by-product. Specifically, that knowledge emerges in the patterns of intersections over those engrams. And, the fact that episodic and semantic information reside in superposition in this LAM approach provides a further key source of computational efficiency: it precludes having to transfer information back and forth between a semantic and an episodic module, as is the case in more recent “Memory Networks”-type approaches, which have been cast as extensions to the LAO framework.
Information is communicated and processed in the brain via spikes. But spikes are just the best way, so far, that nature (wetware) has figured out to implement intelligence. While a spike-based scheme in silicon may be successful, spikes per se are not central to intelligence. It’s the way information is represented that is most essential. There is mounting evidence that the cortex (and other structures) represents information in the form of SDRs, a.k.a. “cell assemblies”. The core concepts relevant to the debate (LAO, spikes, power efficiency, and my addition, LAM) are all clarified when considered in the more fundamental context of how information is represented. As it happens, this core issue of how information is represented in the brain is also at the center of a soon-coming massive paradigm shift in neuroscience, in which the 100+ year old “functional neuron doctrine” will be replaced by what I will call the “functional cell assembly” doctrine (cf. Rafael Yuste, Josselyn & Frankland, Tonegawa, and others).