I’m an AI optimist: one of those lads with no neuroscience background who confidently claim that “the brain is just a carbon-based computer that can rewire itself”. I also believe that the answer to difficult questions usually begins with what feels right in your gut; complexity is often solved with surprising simplicity.
This sometimes leads me to write sentences like “Neural Networks Allow AI to Learn Like Your Brain Does”. This is the tech journalism equivalent of “Doctors Hate This One Easy Trick”. Modern deep learning models are stupendous and complex, but they’re far from being truly biological.
Yet we remain fixated on the idea of mechanically recreating the human brain, because from a computational standpoint, it’s so damn impressive.
A Fleshy Pattern Recognition Machine
Your brain is really good at doing more with less work. Individual neurons fire anywhere from 0.1 to 2 times per second, and roughly 90% rarely fire at all. Only a small percentage of your neurons are active at any given moment; a figure of around 2% is thrown around a lot.
The brain’s core rule: Context matters.
Analyzing new sights and experiences is relatively taxing, so the brain saves a lot of effort by learning sequential inputs & constantly predicting the next input. This is why we can be surprised.
If you walk down the cereal aisle while you’re not looking for cereal, your brain doesn’t spend energy analyzing every box on the shelf, because you’ve already learned the pattern. But you’ll notice a raccoon snarling at you between the corn flakes, because it deviates from the sequence you’re used to.
One model on the fringe of deep learning hopes to approach the brain’s pattern recognition capability. Hierarchical Temporal Memory (HTM) is a relatively new approach to both general intelligence and unsupervised machine learning that seeks to mimic the architecture of the neocortex, and it shows great promise thanks to its inherent noise resistance and its ability to train on less data.
More importantly, the idea feels quite right to me. It’s a simple concept (with complex implementation); if you want the ability of the brain, copy its structure. Function follows form.
The Limits of Deep Learning
Contemporary standard neural nets usually look something like this:
Layers of multiple nodes are stacked. Each node is fully connected to every node in the adjacent layers.
Each node has a value, and each connection has a weight, which determines how strongly one node influences another through that connection.
A common example is binary classification: Convert 10,000 cat and dog pictures to arrays of numbers and feed them into the input layer for training, and eventually the net can predict quite well which of the two animals is in a picture it’s never seen before.
Classic neural nets ‘learn’ by updating the weights of each synapse (connection) based on each training sample. If the net I described receives a cat, analyzes it and guesses ‘cat’, the synaptic weights that lead to the correct answer will be strengthened (and weakened if it guessed wrong).
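That weight-update loop can be sketched in a few lines of Python/NumPy. This is a toy one-layer “net” with made-up sizes, inputs, and learning rate, purely for illustration:

```python
import numpy as np

# Toy fully connected layer: 4 input nodes -> 1 output node.
# All sizes, inputs, and the learning rate are invented for demonstration.
rng = np.random.default_rng(0)
weights = rng.normal(0, 0.1, size=(4, 1))

def predict(x):
    # Sigmoid squashes the weighted sum into a 0..1 "cat-ness" score
    return 1 / (1 + np.exp(-x @ weights))

x = np.array([[0.2, 0.8, 0.1, 0.5]])  # one image, flattened to numbers
y = np.array([[1.0]])                 # label: 1 = cat, 0 = dog

for _ in range(100):
    p = predict(x)
    # Cross-entropy gradient for a sigmoid output: connections that pushed
    # toward the right answer are strengthened, the others weakened
    weights -= 0.5 * (x.T @ (p - y))

print(predict(x)[0, 0])  # close to 1.0: the net now "knows" this cat
```

Real classifiers do this across thousands of samples and many stacked layers, but the core move is the same: nudge every weight a little after every example.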
NNs are a powerful tool for all sorts of pattern recognition, and are used today in everything from computer vision to genomic sequencing. However, they face three persistent challenges:
- They need lots of samples for training (100 cats isn’t enough)
- They don’t deal well with sudden changes in input data (show it a jaguar)
- They can be fooled by ‘noise’ (a grainy/low-light photo)
Some architectural variations attempt to tackle these issues: one-shot learning can work well with less data, and LSTM networks show promising noise resistance. Neural nets also remain the top performer in most areas where they’re employed.
But if these problems are inherent to the classical “fully connected stacked layers” architecture, then they can be sidestepped with an entirely different structure.
Imagine a thin sheet of flesh ~1.6 feet on each side, filled with billions of neurons and folded around your brain.
This is the neocortex: the part of you responsible for cognition, most sensory perception, language, spatial thinking and movement. Pretty important stuff. It does all this through billions of neurons, each with thousands of synapses.
Neurons are stacked into tightly packed cortical columns, where they receive input through synapses (connected to nerves in your fingers and eyes, to other neurons, and so on).
This is where it all gets a bit complicated, involving dendrites and proximal/distal synapses (Rotbart gives a great explanation of this). A very simplified overview:
Each neuron has thousands of synaptic connections to other neurons, but only ~10% (proximal dendrites) can actually cause a neural spike. If one of these activates, the neuron itself fires up and sends signals through its outbound synapses.
The other 90% (distal dendrites) handle pattern recognition; if ~10 of one neuron’s tightly-clustered distal dendrites fire together, it causes a dendritic spike. This puts the neuron in a ‘predictive state’. A predictive neuron will react to proximal stimulation faster than non-predictive neurons, and when it fires it also prevents neighboring (distally connected) neurons from firing. This is the key to your brain’s “work smarter, not harder” policy.
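The dendritic-spike rule above can be sketched in a few lines of Python/NumPy. The threshold of ~10 comes from the paragraph above; the segment sizes and activity patterns are invented:

```python
import numpy as np

DENDRITIC_SPIKE_THRESHOLD = 10  # ~10 co-active distal synapses, per the text

def is_predictive(distal_segments):
    """A neuron enters the predictive state if any one tightly-clustered
    segment of its distal synapses has enough active inputs to spike."""
    return any(int(seg.sum()) >= DENDRITIC_SPIKE_THRESHOLD
               for seg in distal_segments)

# A segment of 20 synapses with 12 active -> dendritic spike -> predictive
active_pattern = [np.array([1] * 12 + [0] * 8), np.array([1] * 3 + [0] * 17)]
quiet_pattern = [np.array([1] * 4 + [0] * 16)]

print(is_predictive(active_pattern))  # True
print(is_predictive(quiet_pattern))   # False
```

Note that each segment votes independently: one strongly activated cluster is enough, no matter how quiet the rest of the dendrites are.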
Notice how there are no ‘weights’ in real synapses: neurons work with binary logic. It’s not about the strength of individual connections, but whether connections exist and how many there are.
Here’s a neuron compared to the Hierarchical Temporal Memory representation:
We can see that both types of dendrites are represented; green is proximal, blue is distal. These artificial neurons attempt to recreate both the proximal neural spike and the distal dendritic spike, so the distal dendrites are clustered together to determine whether or not to trigger a predictive state by collective input.
These HTM neurons are organized into a structure mimicking the cortical column. “Sequence Memory” in the middle below indicates multiple micro-columns of neurons (colored circles are active neurons).
When looking at sequence memory from above, we can generate a Sparse Distributed Representation. An SDR is a bit array in which a 1 means at least one neuron in that micro-column is active; most of the array is 0s.
Many HTM models set a ‘fixed sparseness’ around 2%; a model with SDR sized at 2048 will have about 41 active microcolumns represented as 1. This is remarkably close to our current understanding of how the brain transmits information, and SDRs can be measured against each other with computationally-efficient bit comparison.
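That arithmetic is easy to check in Python/NumPy, and the same sketch shows the cheap bit-comparison used to score two SDRs against each other (the size and sparsity come from the text; everything else is illustrative):

```python
import numpy as np

N, SPARSITY = 2048, 0.02
rng = np.random.default_rng(42)

def random_sdr():
    # ~2% of 2048 microcolumns active (int() truncates 40.96 down to 40 here)
    sdr = np.zeros(N, dtype=np.uint8)
    sdr[rng.choice(N, size=int(N * SPARSITY), replace=False)] = 1
    return sdr

a, b = random_sdr(), random_sdr()
print(int(a.sum()))        # 40 active bits out of 2048
print(int(np.sum(a & b)))  # overlap score: how many active bits are shared
```

Two unrelated SDRs share almost no active bits, while similar inputs produce large overlaps, so a single AND-and-count is enough to measure similarity.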
The existence and configuration of the HTM synapses themselves are calculated in a binary manner consistent with the fleshy version:
During training (or ‘feeding’) the net adjusts each potential or existing synapse’s “permanence”, and grows or severs the connection when the permanence crosses a set threshold. The actual permanence value doesn’t matter for computation, only whether or not the synapse exists.
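A minimal sketch of that permanence rule, in Python/NumPy. The threshold and increments are hypothetical values for illustration, not Numenta’s actual defaults:

```python
import numpy as np

CONNECTED_THRESHOLD = 0.5   # hypothetical cutoff: the synapse "exists" above this
PERM_INC, PERM_DEC = 0.05, 0.02

def learn(permanence, presynaptic_active):
    # Nudge permanence up where the presynaptic cell was active, down elsewhere
    delta = np.where(presynaptic_active, PERM_INC, -PERM_DEC)
    return np.clip(permanence + delta, 0.0, 1.0)

permanence = np.array([0.48, 0.30, 0.55])          # three potential synapses
permanence = learn(permanence, np.array([True, False, True]))

# Only this binary fact is used for computation, never the value itself
connected = permanence >= CONNECTED_THRESHOLD
print(connected.tolist())  # [True, False, True]: the first synapse just grew
```

The permanence value is scaffolding for learning; once it crosses the threshold in either direction, downstream computation only ever sees a connection appear or vanish.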
Sequential Pattern Recognition
HTM models ‘learn’ by receiving a series of SDRs, adjusting synapses accordingly, and eventually predicting the next input from any SDR.
These boxes filled with columns of dots are vertical slices of the Sequence Memory from the last image — you’re looking at it from the side, seeing the X and Z axes of a 3D neuron array. The slice is organized by “micro-columns”, which mimics distal interconnection of neurons.
On the top half we feed the HTM net the SDRs A B C D in order, and then we feed it X B C Y. Input feeding uses ‘bursting’, where we activate all neurons in a microcolumn.
On the bottom half we only feed in one SDR and the net predicts the rest. When faced with A, the net puts some neurons in a predictive state: notably, certain neurons in the same columns that the B input burst. This combination of predicted neurons is labeled B1.
The predictive-state neurons of B1 activate, inhibiting all other neurons in their columns; this is the biological mimicry, the “lazy efficiency” of our own brains.
The activation of B1 puts the neurons of C1 in a predictive state, and it continues to D1.
But we also fed it X B C Y. So if we give it X, it predicts B2, or what I call “B_from_X” distinct from “B_from_A”. Notice how B1 and B2 predict different neurons within the same column — that’s the magic behind the curtain. They predict the same style of input, but with different specifics based on context. Following this, C2 (C_from_B_from_X) differs from C1 (C_from_B_from_A).
This is still fairly straightforward. But since there are two “paths” after B, what if we just tell the trained net to start at B?
B predicts C1 and C2, since it “knows” B follows two different inputs and can branch out accordingly. So HTM effectively uses the brain’s “grouping” and “collective voting” structural mechanisms, making it extremely resilient.
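The branching behavior can be sketched as a toy higher-order sequence memory in plain Python. This is a drastic simplification: real HTM tracks contextual states as sets of neurons, not string labels; the names like `B1` and `C2` just follow this article’s convention:

```python
from collections import defaultdict

transitions = defaultdict(set)   # contextual state -> states it predicts
contexts = defaultdict(dict)     # symbol -> {previous state: contextual state}

def train(sequence):
    prev = None  # None plays the role of a context-free (bursting) start
    for sym in sequence:
        # Same symbol + different predecessor = distinct state (B1 vs B2)
        state = contexts[sym].setdefault(prev, f"{sym}{len(contexts[sym]) + 1}")
        if prev is not None:
            transitions[prev].add(state)
        prev = state

def predict(sym):
    """Bursting input: with no context, every contextual state of `sym`
    activates, so all of their predictions are returned (like C1 and C2)."""
    preds = set()
    for state in contexts[sym].values():
        preds |= transitions[state]
    return preds

train("ABCD")
train("XBCY")
print(predict("B"))  # {'C1', 'C2'}: both branches are predicted at once
print(predict("A"))  # {'B1'}: unambiguous context, one prediction
```

Feeding a context-free B predicts both continuations simultaneously, exactly the branching described above, while an unambiguous input like A narrows the prediction to a single path.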
These grouping and voting mechanisms are what give HTM its noise resistance: most HTM models can run at similar accuracy even when 40% of neurons are randomly destroyed.
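A quick NumPy sketch of why sparse representations tolerate that kind of damage (illustrative numbers, not a benchmark):

```python
import numpy as np

rng = np.random.default_rng(7)
sdr = np.zeros(2048, dtype=np.uint8)
active = rng.choice(2048, size=40, replace=False)  # ~2% sparsity, as before
sdr[active] = 1

# Randomly "destroy" 40% of the active cells
damaged = sdr.copy()
damaged[rng.choice(active, size=16, replace=False)] = 0

overlap = int(np.sum(sdr & damaged))   # 24 of the original 40 bits survive
print(overlap / int(sdr.sum()))        # 0.6: still easily recognizable
```

Because two unrelated sparse patterns share close to zero active bits, a 60% overlap is still an unambiguous match, so losing nearly half the active neurons barely degrades recognition.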
ABCD and XBCY is a bit abstract, so let’s talk about something more concrete.
Imagine you’re a loner who only eats chicken tenders, and you’ve only ever been to your new friend’s house twice: last week, on Monday afternoon and Wednesday evening.
Monday afternoon: they served you lunch, you ate the burger, you watched a movie on their couch.
Wednesday evening: they served you dinner, you ate the spaghetti, you had indigestion.
Now if you suddenly think of Monday afternoon, you might think of eating lunch at your friend’s place and watching a movie, and thinking of Wednesday evening would yield its own train of thought.
If you think of burgers you might think of movies, and you might equate spaghetti with bad times.
But what if you just think of “your friend serving you food” in general? You might think of burgers and spaghetti, and then a movie and/or indigestion.
Thinking of ‘eating at your friend’s place’ without further details is like receiving a bursting input of B after knowing B_from_A and B_from_X. It’s the general version of an experience without specific context.
Similarly, thinking of ‘going to your friend’s place’ without thinking of afternoon or evening could mean thinking of lunch or dinner.
Just like B1 and B2 are distinct combinations of neurons in the same column-arrangement, you have similar experiences (movie or indigestion?) following different contexts (burger or spaghetti?). Context matters.
As the name implies, Hierarchical Temporal Memory is a natural fit for data that has a temporal or sequential element. Some companies are already running it under the hood.
Cortical uses HTM for Natural Language Processing and sentiment analysis. This makes sense to me; sentences are more than the sum of their parts, and stringing the same words in different sequences can have completely different meanings.
Intelletic runs Cortical Learning algorithms (another way of saying HTM) to predict stock prices based on prior price & date sequences. This also seems quite level-headed; neural nets are already employed for time series financial modeling, so adding a temporal understanding could lead to a greater sense of ‘what kind of crash or run this is’ based on context.
However, there’s plenty of work still to be done. Despite the name, there’s still no actual hierarchical element implemented in HTM — it’s very much a work in progress, and evolves as new developments are made in neuroscience. However, initial implementations already show great promise in noise resilience and lower-data training.
The primary developer of HTM libraries is Numenta; they’ve developed a whole open-source community with many sub-libraries for different tech like PyTorch and clustering models. I’m currently trying to convert image data into an SDR & messing with environment control.
If you’re looking to get started, head to their website to check out the tutorials or have a look at their examples on GitHub.
And remember to take care of your brain. It’s doing an awful lot of work for your sake.