Learning Without Gradient Descent Encoded by the Dynamics of a Neurobiological Model
The tremendous success of the many flavors of machine learning (ML) rests on variations of gradient descent algorithms that minimize some version of a cost or loss function. At their core, existing algorithms take advantage of the stochastic convergence of weights in neural networks, with individual nodes typically expressed as some version of the classical perceptron model, and of the network’s ability to capture latent, non-trivial statistical associations that encode inputs. A fundamental limitation, however, is the need to train these systems, in either supervised or unsupervised ways, by exposing them to large numbers of training examples. In some situations these requirements pose significant problems: when there simply is not sufficient data (or enough high-quality data) for training, or when tasks such as learning and classification need to be done ‘on the fly’, in near real time, to support just-in-time inference or decision making.
In addition to requiring large amounts of labeled data for training, state-of-the-art ML models such as GPT-3, at 175 billion parameters, require a huge compute infrastructure. The compute cost of the GPT-3 training cycle alone is estimated to be $4.6M. Current large-data, large-compute, and large-model trends in ML will not scale. And while some existing ML can perform near real-time learning, it still requires expensive pre-trained models. Clearly, a more data-, compute-, and energy-efficient paradigm for ML is needed.
In a recent conference paper we introduced a fundamentally new conceptual approach to ML that begins to address these limitations. We took advantage of a construction and theoretical framework developed by our group, derived from an abstraction and analysis of the canonical neurophysiological principles of spatial and temporal summation. We showed that when artificial neural networks (ANN) are constructed with a defined geometric and connectivity structure, the interplay between this structure and dynamic variables (conduction velocities and node refractory states) allows information (inputs) to be encoded by the resultant dynamics of the network. Learning, in the traditional sense of adjusting weights, can still occur, but in a much more efficient manner; no a priori training of the network is required. The dynamics capture the characteristics of the inputs. As a proof of concept, we used these methods for unsupervised classification of MNIST digits. An expanded follow-up paper later this year will discuss the approach and methods in more detail and provide a number of additional results.
Conceptually, the framework models the competing interactions of signals incident on a target downstream node (e.g. a neuron) along directed edges from upstream nodes that connect into it. The model takes into account how the physical geometry of the edges, in addition to the connectivity, produces temporal latencies that offset the timing of the summation of incoming discrete events, and how this results in the activation of the target node. It captures how differently timed signals compete to ‘activate’ the nodes they connect into. At the core of the model is the notion of a refractory state for each node, reflecting a period of internal processing at the individual node level; the model assumes nothing about the internal dynamics that produce this refractory state. These results yielded an extension of the classical perceptron: a geometric dynamic perceptron, a generalization of integrate-and-fire models in neuroscience. This model imposes a timing constraint on the summation of arriving signals, with the resulting edge weights a function of edge path lengths and the neuron’s refractory period. See the main theoretical paper for full details and mathematical proofs, and this related paper for additional work. In the results and discussion here, we took advantage of the activation paths generated by the dynamics of the model, induced by inputs, i.e. activated pixels from MNIST digits. MNIST is a popular database of thousands of handwritten digits between 0 and 9, often used as a test set for ML applications, albeit a rather simple one these days compared to other standardized test sets used by the ML community.
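To make the idea concrete, here is a minimal sketch of a single target node under delayed temporal summation with a refractory period. All names, the summation window, and the parameter values are illustrative assumptions, not the paper’s actual implementation; the point is only that edge delays (geometry) and the refractory state jointly decide whether competing signals activate the node.

```python
def simulate_target_node(spikes, delays, weights, threshold=1.0,
                         window=2.0, refractory=5.0):
    """spikes[i]: emission times of upstream node i; delays[i]/weights[i]:
    conduction delay and signed weight of the edge i -> target.
    Returns the target node's firing times."""
    # Each upstream event arrives at the target offset by its edge's delay,
    # so the network's geometry (path lengths) sets the timing of summation.
    arrivals = sorted(
        (t + delays[i], weights[i])
        for i, times in enumerate(spikes) for t in times
    )
    firings, last_fire = [], float("-inf")
    for idx, (t, _) in enumerate(arrivals):
        if t - last_fire < refractory:   # node busy with internal processing
            continue
        # Temporal summation: only signals landing inside the summation
        # window (and after the refractory period ended) count.
        total = sum(w for ta, w in arrivals[:idx + 1]
                    if t - window <= ta <= t and ta - last_fire >= refractory)
        if total >= threshold:
            firings.append(t)
            last_fire = t
    return firings
```

With two sub-threshold inputs, the node fires only if their delayed arrivals land within the same summation window; stretching one edge’s delay past the window silences the node, even though connectivity and weights are unchanged.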
The rest of this article summarizes the technical details and results. The resultant computed paths were used directly, with or without spike-timing-dependent plasticity (STDP) (next section), or STDP was applied to the weights themselves (last section), to encode features for back-end classifiers.
Unsupervised Classification from Structural Paths Derived from Network Dynamics with No Training
We first explored whether patterns of activation using the model could effectively separate input patterns without any associated learning. In this experiment, the underlying geometric structure of our model consisted of a network with two blocks from a stochastic block model (SBM): a 784 node input block, matching the resolution of the MNIST data, connected to a 200 node hidden block. We applied 60,000 training examples from the MNIST dataset as input stimuli. For each stimulation we ran our model for 10,000 steps, generating 60,000 unique activation patterns, or temporal graphs. We then generated embeddings for these graphs using their temporal sequences (paths). We visualized each embedding of a temporal graph in Euclidean space (Fig. 1) by plotting the first three components of a principal component analysis (PCA). Using cosine distance as the distance metric, we constructed a k-nearest neighbor (kNN) classifier to infer input class labels based on the majority labels of the closest neighbors. Using this unsupervised method, we achieved an average accuracy of 72% on the task of inferring the correct MNIST input class, with class 1 achieving the highest accuracy (93%) and class 8 the lowest (63%).
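The classification step can be sketched as follows. The graph-embedding function itself is the substantive part of the method and is not reproduced here; in this illustrative sketch the embeddings are simply rows of a matrix, and the function name and parameters are assumptions.

```python
import numpy as np
from collections import Counter

def cosine_knn_predict(train_emb, train_labels, query_emb, k=5):
    """Majority-vote kNN using cosine distance (1 - cosine similarity)."""
    # Normalize rows so a dot product gives cosine similarity directly.
    a = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    dist = 1.0 - q @ a.T            # (n_query, n_train) cosine distances
    preds = []
    for row in dist:
        nearest = np.argsort(row)[:k]          # k closest embeddings
        votes = Counter(train_labels[i] for i in nearest)
        preds.append(votes.most_common(1)[0][0])
    return preds
```

No weights are fit at this stage: the label of a held-out temporal graph is inferred purely from the geometry of the embedding space that the network dynamics produced.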
In a next experiment, we compared embeddings generated by the model with STDP as a learning rule to embeddings generated by the model without STDP (Fig. 2).
STDP is a plausible biological algorithm that adjusts the strength of synaptic associations between connected neurons based on the timing of arriving signaling events. Starting from the same underlying geometric network structure, we randomly selected sets of 1000 instances of MNIST images of classes 1 and 5 as input stimuli. With STDP enabled, we stimulated the model with these 1000 images and then took a snapshot of the graph with the updated weights. Comparing the weights before and after stimulation, we observed that 29% of the edges had higher weights, 40% had lower weights, 20% became inhibitory, and 11% were unchanged. We then ran two sets of simulations, both with a single additional stimulus of either class 1 or 5; one took advantage of STDP and the other did not. We extracted the dynamic paths from these additional stimulations and embedded them using the same methods described above. Using the same unsupervised kNN method, we inferred class labels. For embeddings generated from dynamic paths without STDP, we achieved 61% accuracy separating the two input classes; with STDP as a learning rule, we achieved 82%. By comparison, if we trained a support vector machine (SVM), we achieved 91.1% accuracy with embeddings from non-STDP paths and 97.6% for those with STDP.
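For readers unfamiliar with STDP, a minimal pairwise form of the rule looks like the sketch below. The exponential kernel and the constants are conventional illustrative assumptions; the paper’s exact kernel and parameters are not reproduced here.

```python
import math

def stdp_delta_w(t_pre, t_post, a_plus=0.05, a_minus=0.055, tau=20.0):
    """Weight change for one pre/post spike pair (times in ms).
    Pre before post -> potentiation; post before pre -> depression."""
    dt = t_post - t_pre
    if dt >= 0:
        return a_plus * math.exp(-dt / tau)    # causal pair: strengthen edge
    return -a_minus * math.exp(dt / tau)       # anti-causal pair: weaken edge
```

Applied repeatedly over many spike pairs, depression can drive an edge’s weight below zero, i.e. the edge becomes effectively inhibitory, which is consistent with the fraction of edges reported above as turning inhibitory after stimulation.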
Unsupervised Classification from State Space Trajectories of Edge Weights with No Training
In the previous section and in related work, activation patterns generated unique temporal sequences, carved by the dynamics of the network, that classify the inputs they represent relatively well (given that the network is never trained to do so). Building on this work, we tested whether STDP-imposed changes on the weights themselves could encode inputs as a function of the resultant dynamics of our theoretical model. In contrast to the traditional training and testing phases of an ANN, we instead observed the state-space trajectory of the evolving weights in the recurrent layer. Inputs of the same class of MNIST digits resulted in similar edge weight changes. Finally, we used a simple Euclidean distance weighted kNN (w-kNN, k = 5) to quantify the similarity of the edge weight state-space trajectories.
To initialize the networks, we used neurobiologically relevant parameter values. Weights were chosen from a uniform distribution such that 70% of the edges were excitatory and 30% were inhibitory. The network consisted of an input layer and a recurrent layer. The input layer consisted of the same 784 nodes, activated by the non-zero pixels of a particular MNIST image. Each input node connected to every recurrent node, but the outgoing signals from each input node arrived at different times at each recurrent node due to the variability in edge delays. We varied the number of recurrently connected neurons up to 400; we did not test bigger networks because classification accuracy did not significantly change beyond 200 nodes, while computational demands increased. We observed the resultant global dynamics of the network, i.e. the edge weight states, after a single stimulation. Individual simulations were carried out for 600 ms, a window that empirically exceeded the period of convergence to maximal classification accuracy. For each simulation, we sampled all the weights in the recurrent layer every 100 ms of simulated time and used the resulting vectors in a w-kNN algorithm to determine the input class of held-out samples. Specifically, we randomly chose 9000 edge weight vectors to set up the weight space and then classified the remaining 1000 vectors using w-kNN (k = 5). We repeated this ten times to avoid any selection bias in the weight space and in the classified vectors. Using a Euclidean w-kNN classifier, the model accurately predicted the correct digit 96.49% of the time using a 200 node network after 300 ms of simulated time, with no training (Fig. 3). We tested networks ranging from 5 to 400 nodes in the recurrent layer. A five node network achieved a peak classification accuracy of 71.66%, while a 200 node network achieved the highest accuracy (96.49%) at the most economical size.
For larger networks (300 nodes and greater), classification accuracy peaked at 96.48% and did not improve further with the size of the recurrent layer. Furthermore, all the networks achieved their highest respective accuracy at about 300 ms of simulated time.
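The distance-weighted classification of weight snapshots can be sketched as below: each sampled weight vector is a point in state space, and a query snapshot is labeled by its nearest neighbors voting with weight 1/distance. Variable names are illustrative; in the experiment the snapshots would come from flattening the recurrent layer’s weights at each 100 ms sample.

```python
import numpy as np
from collections import defaultdict

def weighted_knn_predict(snapshots, labels, query, k=5, eps=1e-12):
    """snapshots: (n, d) array of flattened recurrent weight vectors."""
    dists = np.linalg.norm(snapshots - query, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[labels[i]] += 1.0 / (dists[i] + eps)      # closer -> stronger vote
    return max(votes, key=votes.get)
```

The inverse-distance weighting means a single very close trajectory can outvote several more distant ones, which suits the observation that same-class inputs push the weights in similar directions through state space.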
These initial proof-of-concept results show how images can be uniquely encoded in two different ways by the dynamics of geometric networks, achieving relatively high-accuracy unsupervised classification without any training of the network. We achieved this by taking advantage of temporal sequences of activation patterns, with and without STDP, and of dynamic STDP-mediated structural edge weight changes.
To the best of our knowledge, these results are the first of their kind. Conceptually, what the dynamics of the network capture, or encode, is the fact that similar images cause similar firing patterns, which result in similar weight changes when STDP is applied. In other words, vectors for the same digit class are pushed through the state space in similar directions. The resultant evolving weight state-space dynamics are sufficient to encode the latent information that characterizes the input images.
Many open questions about how and why this approach is successful remain to be fully explored. However, the functional constraints imposed by the geometric construction of the networks on the dynamic model appear to be the key to how information can be encoded and separated without the need for training. From a practical perspective, this fundamentally new non-gradient-descent approach to machine learning and inference opens up completely new applied directions and uses.
This piece is part of the collection ‘The Technical Paper Reboot’ — Short adaptations of some of our technical papers that in particular highlight the limits and boundaries of neuroscience. We refer the reader to the technical papers in the links in the article for full details and citations.