A Roadmap to Human-like AGI

Examining Human intelligence to understand what we’re missing and what the future of AGI might look like

Malcolm Lett
33 min read · Mar 25, 2024
Source: author + Dall-E

It’s early 2024 and the world is abuzz with talk of Artificial General Intelligence (AGI). ChatGPT has stirred up and challenged many previous assumptions about the limitations of AI, leading some into a frenzy over the possibility that AGI may be just around the corner, while others bemoan the fundamental limitations of the architecture behind LLMs. Even Sam Altman says that we need another major breakthrough.

How do we make sense of all this, and if we were actually to build AGI, what would it look like? I’m going to put some perspective on all the buzz and, hopefully, show a path towards AGI.

Human Intelligence
The 4 Layers of Human Intelligence
Reflexes
Habitual Behavioral Control
Goal-oriented Behavior
Habitual vs Goal-oriented Behavior
Culture
Modern ML
Classifying Deep Learning Systems
Listening to your own Speech
Model-based Learning that actually learns
Model-based without Model-free
Online Learning
LLMs Today
The Roadmap
An Inspiration for Model-based Cognition
Learning the Model
Learning to Navigate
LLMs to the Rescue
Taking the first Steps
References

Source: author + Dall-E

Human Intelligence

…the definition of AGI that I shall follow for this article is loosely “human-like intelligence”

Before we can get anywhere, we need to address the elephant in the room: what is AGI? There is no agreed definition. Many dislike the term itself because it is based on what turned out to be a mistake: we previously thought that humans have “general” intelligence, and later realized that humans are just very good at using a mixture of many “specialized” capabilities. We aren’t true generalists, so it’s probably unrealistic (and maybe impossible) to target that in artificial intelligence.

We have always used human intelligence as the model of intelligence. The idea of artificial neural networks (ANNs) was inspired by our understanding of the brain. So I think when most people talk about AGI, they are again referring to human intelligence and using it as a model for something that is at least more general than what we’ve currently got with state-of-the-art deep learning.

Thus the definition of AGI that I shall follow for this article is loosely “human-like intelligence”. AHlI anyone? Artificial Human-like Intelligence.

With that out of the way, let’s take some time to understand what human intelligence is before we go on.

The 4 Layers of Human Intelligence

To understand human intelligence, we need to think about it in terms of behavioral control and learning. The only point of having any form of intelligence is to elicit behavior. Behavior exists to keep the organism alive and procreating. The reason for having better intelligence is to elicit more effective, and more adaptive behavior. And learning enables behavior to be adapted based on experience, rather than hard-coded through evolution.

Better intelligence gives an organism an advantage in an environment that is complex and dangerous. For example, the reasons for humans to have the form and level of intelligence that we do can be traced to evolutionary pressures on our long lineage of ancestors.

Human behavior is the product of not just one mechanism, but at least four behavioral control mechanisms (Bennet, 2023; Dayan, 2008; Sloman, 1998):

Four layers of behavioral control. Source: author*

Reflexes are genetically hard-wired. They provide the backbone of basic survival behavior for all other adaptive and learning behaviors to develop upon. Habitual behavior is learned from experience and carried out relatively automatically based on some trigger. Goal-oriented behavior is the most adaptive form of behavior and is associated with conscious experience. Culture applies a top-down constraint against the range of possible learned behaviors. It might seem strange to add that last entry, but I’ll explain that shortly.

Each of these layers is in constant reciprocal interaction with the others, and each changes over time. Changes at each layer occur over different timescales, providing an optimum balance between adaptability and the costs of learning. For example, over eons, evolution tweaks the hard-coded reflexes, slowly maximizing the way that they help the organism stay alive. This occurs in response to environmental changes and also to changes in the behaviors of the organisms themselves.

To better understand this, let’s dive into each of these layers in a bit more detail…

Reflexes

In an organism so adaptive that most of its behavior is learned through experience, how does that organism control itself when it hasn’t learned anything yet?

We don’t often think much about our reflexes, but if it wasn’t for them we wouldn’t survive past a few days after birth.

A common theme found in the lineage of evolution from nematode to human is that behaviors that are genetically hard-wired in our earlier ancestors became learnable in our later ancestors and ourselves (Bennet, 2023). In other words, our earliest ancestors responded in fixed ways, not requiring any learning, while much of our behavior is adaptive and has to be learned. In humans, even the most basic things like feeding ourselves need to be learned.

But this leaves a conundrum. In an organism so adaptive that most of its behavior is learned through experience, how does that organism control itself during its earliest development period when it hasn’t learned anything yet? How does that organism do the things necessary for it to thrive, and to avoid doing things that get it killed?

A big part of that is answered by the mother’s care of the infant. But the rest is thanks to our reflexes. One of the first reflexes that matter for a newly developing human is known as the rooting reflex. Place a newborn on a person’s chest (male or female, lactating or not, it doesn’t matter) and the newborn will involuntarily “root around” searching for the nipple to suck on. Likewise, most of the crying of a newborn is likely a reflexive response to any form of discomfort, for example, due to hunger or being at the wrong temperature.

Other examples abound. Reflexes trigger the eyelids to shut to prevent damage to one of our most delicate organs. Pain makes us flinch away from things that could damage our other soft and hard tissues. Sleepiness forces us to rest. Hunger drives us to seek sustenance. The list goes on.

Reflexes provide the most basic set of behaviors while we don’t know anything better. Many of those reflexes can be partially controlled in the later stages of a child’s development when their adaptive systems have learned enough to be trusted. For example, we become capable of controlling our response to hunger or emotion. Some reflexes even disappear completely due to automatic developmental processes, such as the rooting reflex, which only lasts for the first 4 to 6 months of an infant’s life.

Reflexes provide something else that is important in an organism that learns: they provide the first learning signals for the habitual layer. Initially, the habitual control system probably learns to replicate what the organism would do reflexively anyway. In that way, reflexes provide a supervised learning signal to the habitual system, initializing it to a sensible baseline of behavior upon which more advanced behaviors can be added.

Habitual Behavioral Control

Habitual behavior accounts for most actions by humans. That’s not surprising given how successful it is.

The term “habit” often connotes something negative — something that we need to improve upon. But habits are actually a very good thing. Habitual behavior, also known as automatized behavior, is any learned behavior that can be easily repeated without much thought. It accounts for most actions by humans. That’s not surprising given how successful it is. Habitual behavior likely first evolved in the vertebrates during the Cambrian explosion, around 540 million years ago (Bennet, 2023).

Habitual control wasn’t the first form of learned behavior. So-called conditioned responses were probably the very first form of learning, appearing about 100 million years earlier, in the earliest animals. They enable an organism to associate reflexive responses with novel stimuli. For example, the experience of eating something with a particular taste, followed by stomach pain and vomiting, leads to an association between that otherwise innocuous taste and the need to vomit. However, conditioned responses are extremely limited because they only associate with existing reflexive responses. There is no avenue for the development of novel response behaviors that weren’t already evolutionarily hard-wired.

That’s where habitual control comes in and why it provides such a step up in adaptation. It enables the organism to experiment with different behaviors (ie: spontaneous generation of novel behaviors) and to associate the most effective behaviors with stimuli. It enables, for example, lab fish to learn to push buttons to obtain food (Adron et al, 1973).

If we were to attempt to define intelligence, the most primitive form of intelligence is surely the ability to learn from experience and adapt behavior accordingly. That is what habitual behavior enables.

The processes of the habitual control layer. Source: author

A key part of that is learning; and for learning to occur, you need a learning mechanism that i) selects what to learn and ii) controls how learning takes place. While learned habitual behavior is by definition generated by a cognitive process that is tuned from experience, the learning mechanism is likely predominantly genetically hard-wired. I’ll give two examples.

Firstly, to learn you need to explore the space of possible behaviors, including the generation of behaviors not previously experimented with. One way to produce that is curiosity: acting to investigate or interact with unknown environments and objects. Evolution hard-wires a mechanism that makes us curious.

Secondly, once new behaviors and experiences have been produced, you need some algorithm for incorporating that into the synaptic weights across the billions of neurons potentially involved. It turns out that the learning mechanisms employed in humans (and all vertebrates) are closely akin to the Temporal Difference algorithm and reinforcement learning methods used by the machine learning community today (Bennet, 2023). I won’t go into the details here as there are many excellent blogs describing those algorithms better than I can. The important message is that we have a good understanding of how the human brain learns habitual behavior — good enough that we can build similar systems.
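To make that concrete, here’s a minimal sketch of tabular TD(0) value learning — the flavor of algorithm I’m referring to. The environment interface (reset/step) and the action list are hypothetical stand-ins for whatever the agent actually interacts with.

```python
import random
from collections import defaultdict

ALPHA = 0.1    # learning rate
GAMMA = 0.95   # discount factor

def td0_episode(env, value, actions):
    """One episode of tabular TD(0): nudge value estimates toward observed outcomes."""
    state = env.reset()
    done = False
    while not done:
        action = random.choice(actions)                # crude exploration ("curiosity")
        next_state, reward, done = env.step(action)    # hypothetical env interface
        # TD error: how much better or worse things went than the current estimate
        td_error = reward + GAMMA * value[next_state] * (not done) - value[state]
        value[state] += ALPHA * td_error
        state = next_state

value = defaultdict(float)   # state -> estimated long-run value
# for _ in range(500):       # with a real `env`, repeat over many episodes
#     td0_episode(env, value, actions=[0, 1, 2])
```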

So even habitual behavior is elicited by a combination of learned and hard-wired mechanisms, working together.

Goal-oriented Behavior

The human brain uses model-based control for situations where we don’t have a well-practiced habitual sequence of actions that would be appropriate for a given situation.

Now that we know that habitual behavior in humans is similar to that of our modern reinforcement learning algorithms, we have been able to use that ML knowledge to better understand the functioning of the human brain. One such takeaway is that behavior can be classified as model-free and model-based.

Model-free behavior is where the neural network learns a direct mapping from state to action. Each action causes a new state, and that new state is mapped to a new action, and so forth. In so doing, you elicit a sequence of actions that, in a well-trained system, produce a useful outcome. So, for example, you hear the electric jug finish boiling (sensory state) and you proceed to carry out the steps required to make coffee; repeating a sequence that you have practiced countless times before.

Habitual action is model-free. Model-free behavior is simpler than model-based behavior, and it is the most efficient strategy for oft-repeated sequences of actions, or for responses that need to occur urgently (eg: hitting the brakes when a child runs in front of your car).

The ML community has also devised mechanisms for model-based learning. There are still debates, but the human brain likely uses model-based control for situations where we don’t have a well-practiced habitual sequence of actions that would be appropriate for a given situation. Model-based systems separate the problem into two parts:

  1. a model that represents the structure of the world and how its different parts behave and interact, and
  2. a planning mechanism for searching that model to devise a sequence of actions that would reach a certain goal.

Model-based control is often thought of as goal-oriented. The agent or organism takes in sensory information and infers a current state. From that state, it decides upon a goal. It then deduces and carries out a planned sequence of actions that are optimized to transition the agent or organism toward that desired goal.

The processes of the goal-oriented control layer. Source: author

This is a) more complex than habitual control, and b) significantly more adaptive. Via goal-oriented behavior, an individual can figure out how to reach a goal that they have never tried to reach before. Likewise, they can respond correctly to situations that they have never previously encountered, provided that they have a good enough model of the world.

The learning mechanisms in goal-oriented behavior are also more complex. Firstly, you need to learn the model. Secondly, you need to learn how to use it.

The good thing is that the split between model and usage has some benefits. For one thing, all experiences contribute to the same model, regardless of what the goal was at the time, and regardless of which form of behavioral control was in force at the time. Thus, in humans (and other mammals), the learning of the model is a sort of automatic passive process. For any observation or experience, mechanisms (that are probably genetically hard-wired) automatically take the information and incorporate it into the model.

As this process can benefit from any experience regardless of the control system in use at the time, humans can start to learn models of the world and of themselves from the moment of their birth, or even earlier. Thus the learning of the model is separated from the system that uses the model. This enables the model learning to start immediately, while the development of goal-oriented behavior control can be delayed. This is very likely the case in humans. Reflexes bootstrap learning of habitual behavior, and then reflexes and habitual behavior together bootstrap learning of goal-oriented behavior.

We are still trying to understand the mechanisms employed by the brain for the action half of goal-oriented behavior. One likely contender can be seen in the Active Inference theory of motor control (Adams et al, 2013). Active Inference proposes a mechanism by which a current state, a goal, and a model, can be iteratively processed by a population of neurons such that the system converges upon an appropriate action to meet that goal given the current state. It has been suggested (contentiously) that the motor cortex in the human brain functions on that very mechanism (Bennet, 2023). In other words, rather than human motor control being executed via a model-free system, it actually uses a model-based system; even for some habitual actions.

This blurs the line between model-free and model-based. Likewise, the distinction between habitual and goal-oriented behavior is blurry: given that I know how to make coffee and have done it many times before, and given that my goal is to make coffee, would you consider that my actions carried out are habitual or goal-oriented?

Habitual vs Goal-oriented Behavior

We will likely be arguing for decades to come about the distinction between habitual and goal-oriented behaviors

The deeper you look at these distinctions, the more blurry they become. For example, when consciously working through a problem we are presumably using our model-based systems for our mental activity. However, many individual steps seem to involve passively holding onto a sub-problem and waiting for the answer to appear. There are many possible interpretations, but one is that the processing in the interim is carried out by habitually controlled processes.

Perhaps a better example is that of the selection of a “problem-solving strategy”. We rarely think much about which particular mental strategy to use when faced with a particular cognitive problem, but there are numerous strategies that we use. For example, when given a trivial maths problem like 3 + 3 we may select the strategy of holding onto the problem and letting the answer pop out of our memory. When given a slightly harder problem of 27–18, we might pick a strategy of subtracting 17 first, bringing us to a round 10, then subtracting the remaining 1 afterward. In contrast, given something much harder like 5323–3437, we might pull out a notebook and start scribbling or simply reach for the calculator. While we might have to work on the individual steps associated with the particular strategy, the selection of the strategy itself is automatic — or in other words, habitual. This extends to all sorts of more practical “in the world” scenarios.

We will likely be arguing for decades to come about the distinction between habitual and goal-oriented behaviors, but I’d like to propose a simplistic way out for now:

  • Firstly, treat the terms “habitual” and “goal-oriented” as keyword terms that are used to refer to particular groups of mechanisms and behaviors, and may not literally mean what they say — eg: habitual action is not literally devoid of goals.
  • Recognize that any given action is orchestrated by a mixture of model-free and model-based systems.
  • Habitual actions are those where the largest part of their orchestration is done by model-free systems.
  • Goal-oriented actions are those where the largest part of their orchestration is done by model-based systems.

In terms of actual brain systems involved, one suggestion is that the basal ganglia provide much of the model-free control, the cortex provides the model-based control, and the two work in concert (Bennet, 2023).

Culture

The behavioral control systems of an individual are not just their own, but also of their group.

So, why did I include culture in this mix?

There are many things that culture adds to our learning, some of which I will discuss shortly. But I think there is one key thing provided by culture that our goal-oriented behavioral control system depends on for valid functioning: constraint.

Any learning system must be constrained. The number of parameters must be constrained, or there will be too many to ever optimize sufficiently. The learning speed must be constrained, or the system will suffer catastrophic forgetting. And to some extent, the class of problems exposed to the learning system must be constrained, or it will be unable to find parameter values that are of any use for any given specific problem.

In biological organisms, most of these constraints are imposed by their biology. Evolution naturally favors the fewest neurons that meet the needs of the organism, due to their high energy consumption. The amount of information exposed to brains is limited by the sensory organs — organisms with simpler brains tend to have simpler sensory organs.

Similarly, the range of possible effective configurations matters. A learning system tries to converge upon a single optimum set of parameters. If there are many almost-equally optimum configurations, then it is harder for the system to converge. In the extreme case, the learning system may even oscillate between configurations.

The range of possible habitual behaviors is first constrained by the reflexes. It is then constrained by the level of curiosity and by the goal-oriented system, as the habitual system will only learn those behaviors that the agent chooses to carry out. This provides a sort of sandwich of constraint: the reflexes provide a bottom-up constraint by preventing the most destructive behaviors from being learned (eg: by preventing us from stepping off high-up ledges), and the curiosity and goal-oriented systems provide a top-down constraint by preferentially presenting certain behaviors to be learned.

The goal-oriented system is equally constrained in a bottom-up way by the reflexes. But where does it get its top-down constraint from? Through experience, the environment itself provides a certain amount of information about which behaviors are better than others. However, the search space of behaviors is too large for a single individual to efficiently explore.

Here’s an example. A hairless ape-like creature is placed on the edge of a jungle. To stay warm, it needs to find something to cover itself with. To stay dry, it needs shelter. There are many different ways to find or build a shelter, some better than others, but all are basically effective. The solution space is large. An individual can find some of those solutions, but it may take a long time.

The way that humans (and indeed all primates) have solved that is to share knowledge from individual to individual. Knowledge communicated from one individual to another provides a constraint on that second individual’s behavior. For example, they may be instructed that to meet the need for shelter they should obtain leafy branches of a particular type, overlap them in a particular way, and erect them supported by a particular other kind of branch. Alternatively, they may be instructed to pay a certain rent to a local motel and to just stay there because it’s a lot less effort.

The choice of bush tent vs motel is partly based on practical things like hygiene and exposure to the elements. But it is also based on arbitrary preferences, such as that clothes covered in dirt are not considered socially acceptable in a restaurant, and so you better sleep in the motel. Again, this provides a constraint on the range of goal-oriented behaviors. Importantly, in a search space that would otherwise put the bush tent and motel as equally effective behaviors, social norms work to narrow the space of optimum behavior.

This latter reason is the example more commonly associated with the word “culture”, but for this post I shall use culture in a broad sense, encompassing any form of knowledge shared between individuals.

Thus culture provides important constraints against the search space of possibly optimum goal-oriented behaviors in two key ways:

  1. By conveying knowledge about constraints that truly exist in nature but which the individual’s model does not yet include (eg: that a bush tent is an effective method for shelter), and
  2. By further narrowing the range of otherwise equal behaviors, so that the learning system can stably converge to one set of behaviors.

Before I wrap up this section I will briefly mention one further reason why I include culture as the fourth level of human behavioral control. The interaction between culture and the goal-oriented system is not just from culture to individual. It also flows in reverse. The continual actions of all individuals within a group continually shape that group’s culture. As individuals and groups of individuals learn new skills and obtain new knowledge those new skills and knowledge feed back into the group, changing their culture. The learning of an individual occurs over days to weeks. The culture also learns, over years to decades. And that changed culture then applies a newly adjusted constraint against the individuals within the group.

Thus the behavioral control systems of an individual are not just their own, but also of their group.

Source: author + Dall-E

Modern ML

The first question that, in my opinion, does not receive enough discussion is: which behavioral layer (or layers) do these modern deep learning systems emulate?

Now that we have a better understanding of the example of intelligence that we would like to emulate, we can take stock of what we’ve achieved so far in machine learning. To do that, I’m going to focus on our recent advances in deep learning.

We’ve all been amazed at the success of recent Large Language Models (LLMs) like GPT-3.5 and GPT-4 at producing human-like language responses. The successes are even more impressive when considering that we don’t know why they work so well. Added to that, there are several problems faced by the architectures used by modern state-of-the-art deep learning models in general, and LLMs specifically:

  • They require vast amounts of data to train to a sufficient level to do anything useful.
  • There are repeated predictions that we’re going to plateau at some point.
  • Experts in the field are saying that we can’t solve general intelligence with the current architectures, especially not that of LLMs.

All of the successes of deep learning and LLMs, and their failings, can be understood better when considered in light of how the human brain produces intelligent behavior — through its four layers.

Classifying Deep Learning Systems

The first question that, in my opinion, does not receive enough discussion is: which behavioral layer (or layers) do these modern deep learning systems emulate?

Let’s focus on LLMs. We can understand them by looking at i) how they are trained, ii) how they operate at runtime, and iii) how much is hard-coded by humans vs learned from experience.

Firstly, their operation at runtime. The most recent LLMs include many optimizations to work around limitations on the maximum context length, but to keep things simple I’ll focus on the original architecture outlined in Attention is All You Need (Vaswani et al, 2017). The entire request sentence is fed into the system in one go. That entire sentence is processed through the first network layer of the first encoder unit, followed by the second layer of that unit, and so on, followed by the layers of the second, third, fourth, etc. encoder units. The final output from the last encoder unit is then fed into the first layer of the first decoder unit, and so on, until the final layer of the final decoder unit produces a sequence of embeddings, which are then converted into human text. All of that is done through a single pass.

Input → state → response.

Habitual action.
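To illustrate that single-pass nature, here’s a minimal sketch using the Hugging Face transformers library. The checkpoint name is purely illustrative; the point is that inference is one feedforward mapping from input to response, with no weight updates and no persistent world model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# One request in, one response out -- a single feedforward mapping.
# "gpt2" is just an illustrative checkpoint.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The kettle has boiled, so next I will", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)  # input -> state -> response
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```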

LLMs are trained through a mixture of methods — self-supervised next-token prediction, supervised fine-tuning, and reinforcement learning from human feedback. In each case, for every response produced by the LLM, a loss function is computed that quantifies how good the response was, and through stochastic gradient descent the mappings from input to response are adjusted. There’s no explicit modeling of the world that is independent of action production. For example, when “teacher” training is used in a supervised manner, the LLM does not update some internal part of its model based on the observation. It can only be trained from that observation by producing its own response to the same input, and then having the behavior response parameters re-optimized to better approximate the teacher example.

Habitual action.

Any “goal” is an intrinsic nature of the loss function employed by the training algorithm. And that “goal” is hard-coded by the ML engineers training it. The LLM is completely unaware of the goal during runtime. The original approach used to solve the alignment problem (Ouyang et al, 2022) trained a second model that could be said to capture the goal of “alignment”. But it was merely used to further train the LLM. The LLM had no access to that model at runtime.

Habitual action.

So LLMs are a form of habitual behavioral control. The training algorithm and the loss functions employed are equivalent to the genetically hard-wired learning mechanisms employed in brains. To some extent, we could argue that the loss function amounts to the training pressure exerted by the reflexes. The extra training from the alignment model is a particularly good example for that case — lie or be rude to someone and you’ll feel the pain of guilt.

However, the training lacks the more complete and complex set of reflexes that humans employ to bootstrap their own learning. More significantly, the learning process of this LLM habitual system completely lacks the benefits of a goal-oriented system. A goal-oriented system would a) naturally drive the system to explore different behaviors in a way that is optimized by knowledge of the structure of the world that it needs to interact with, b) step in when the habitual system doesn’t know what to do, and c) employ knowledge of the world to reduce hallucinations.

Many common-sense mistakes are made by the LLMs of today because they don’t have a model of the world. For example, it is well known that they struggle with basic maths, and can’t even count.

Let’s now look at some specific areas for improvement in LLMs and in deep learning more broadly.

Listening to your own Speech

As already mentioned, LLMs would benefit from mixing their existing habitual control system with a goal-oriented system. One particular way in which LLMs could benefit immensely is in automatic mistake detection. Humans spontaneously detect and correct themselves when they make a mistake while speaking. The mechanism behind this self-correction is fascinating and best understood by looking at individuals who lack it. Individuals with Wernicke’s aphasia (also known as fluent aphasia or receptive aphasia) produce fluent-sounding speech, except that the statements may make no sense. In general, the sentences spoken by such individuals are syntactically and grammatically correct, but often include irrelevant or made-up words, or are just logically invalid.

Sound a bit like LLMs? It’s no coincidence. Individuals with Wernicke’s aphasia, usually due to stroke, struggle with the interpretation of other people’s speech and their own. Consequently, they cannot listen to their own speech and interpret it. If they say something meaningless, it’s as if they don’t hear themselves saying it, and so they don’t pick up on the mistake. In neurotypical individuals, the process of re-interpreting their speech means that they form a new mental model of the meaning of the spoken speech, and then compare that against the meaning that they were attempting to convey.

This doesn’t just help with mistake detection. It helps with learning too. As a pre-vocal child first begins to understand the language of those around them, they can attempt to produce similar sounds until what they hear triggers the same mental model as what would be triggered by an adult speaking. Then they can gradually improve their repertoire of vocalizations without direct feedback from others. It’s no wonder that babies babble so much. Perhaps our LLMs would benefit from doing the same.
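Here’s a rough sketch of what such a “listen to your own speech” loop might look like for an artificial system, where generate(), interpret(), and meanings_match() are hypothetical stand-ins for a production model, a comprehension model, and a semantic comparison.

```python
# Sketch of the "listen to your own speech" loop: produce an utterance,
# re-interpret it with the same comprehension model, and compare the
# recovered meaning with the meaning we intended to convey.
# generate(), interpret(), and meanings_match() are hypothetical stand-ins.

def speak_with_self_monitoring(intended_meaning, generate, interpret,
                               meanings_match, max_attempts=3):
    utterance = generate(intended_meaning)
    for _ in range(max_attempts):
        recovered = interpret(utterance)              # listen to our own speech
        if meanings_match(intended_meaning, recovered):
            return utterance                          # it says what we meant
        # mismatch detected: try again, conditioning on the failed attempt
        utterance = generate(intended_meaning, previous_attempt=utterance)
    return utterance                                  # best effort after max_attempts
```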

Model-based Learning that actually learns

I greatly dislike the way that the term “model-based learning” is applied to ML today. Most model-based learning systems are a hand-coded planning algorithm that uses standard human-generated algorithmic techniques for searching the space of possible action trajectories. In the earliest days of model-based control, even the model was hand-rolled by humans, based on knowledge of the physical properties of whatever system was being operated against.

Nowadays we do what should be called “model-learning”: we use ANNs to learn a model of the world, represented as a state transition matrix from current state S to new state S’ given some action A, P(S’|S,A). And then we plug that model into our hand-rolled planning algorithm.
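To make the critique concrete, here’s a toy sketch of that standard recipe: fit a transition model from experience, then hand it to a fixed, human-written planner (random shooting, in this case). Every name, size, and rule here is illustrative, not a reference implementation.

```python
import numpy as np

def fit_transition_model(transitions, n_states, n_actions):
    """Count-based estimate of P(s' | s, a) from (s, a, s') tuples -- the 'model-learning' half."""
    counts = np.ones((n_states, n_actions, n_states))   # +1 smoothing
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    return counts / counts.sum(axis=2, keepdims=True)

def plan(model, reward, start, horizon=5, n_rollouts=200, rng=None):
    """Hand-rolled random-shooting planner: simulate action sequences in the learned
    model and return the first action of the best-scoring sequence.
    `reward` is a per-state reward vector."""
    rng = rng or np.random.default_rng()
    n_states, n_actions, _ = model.shape
    best_score, best_first_action = -np.inf, 0
    for _ in range(n_rollouts):
        s, score = start, 0.0
        actions = rng.integers(0, n_actions, size=horizon)
        for a in actions:
            s = rng.choice(n_states, p=model[s, a])      # imagined next state
            score += reward[s]
        if score > best_score:
            best_score, best_first_action = score, int(actions[0])
    return best_first_action
```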

Human model-based learning doesn’t work that way. We learn both.

Different problems call for different solutions. Different kinds of problems call for completely different kinds of solutions. A planning algorithm might be the best way to use a model for navigation or construction. But it’s probably not the best for other kinds of problems. I mentioned earlier how humans automatically pick different mental strategies for different problems. I see a planning algorithm as one form of strategy. Another obvious strategy is trial and error. I’m certain that there are many other strategies that we employ, and that all of them are learned.

Unfortunately, we don’t know how humans learn to do planning; or any other strategy for that matter. We certainly don’t know how to combine model-learning with planner-learning.

In the last part, I’m going to offer my ideas of how we might begin to solve that problem.

Model-based without Model-free

I’ve been following the work of Karl Friston for some years. His Active Inference theory of brain function proposes that almost all functioning of the brain is in a model-based generative fashion. For example, the active inference interpretation of motor control is that the basal ganglia propose a desired goal state, and the various motor cortex layers then predict the best sequence of motor actions to reach that goal. This doesn’t immediately result in observable behavior. Instead, what happens next is that those planned actions are validated against the model to predict what the outcome would actually be. Finally, the error between the predicted outcome and the desired outcome is then used to revise the action plan. Taken to its logical extension, even that first forward pass to predict the initial action plan is itself just an error signal. Thus, all computation is of two forms: i) using an error signal to update a current action plan, and ii) testing that action plan against the model to predict the expected outcome.

Action through prediction. Source: author*

The proposal is elegant in the way that this single simple mechanism can be applied against potentially any problem, not just for motor control. Active Inference takes that logic one step further and concludes that the goal state too is decided through the same mechanism. Given the current state, what’s the best goal state? The one that minimizes the current “free energy”: the sum of all knowledge areas that the individual has the greatest uncertainty about, plus any ways in which the individual is currently exposed to the likelihood of negative outcomes (such as due to danger or lack of sustenance). How to find the goal state that will minimize it? By modeling the free energy space, trying out different goals, and using the model to predict how well the goals will minimize the free energy.
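A toy sketch of that “action through prediction” loop is below: hold a candidate action plan, predict its outcome with a forward model, and iteratively revise the plan to shrink the error between prediction and goal. This captures only the spirit of the idea, not the full free-energy formalism, and forward_model is a hypothetical stand-in for a learned predictive model.

```python
import numpy as np

def refine_plan(plan, goal, forward_model, step=0.1, iters=50, eps=1e-4):
    """Iteratively revise a continuous action plan so its predicted outcome approaches the goal."""
    plan = plan.copy()
    for _ in range(iters):
        predicted = forward_model(plan)          # what would happen if we acted?
        error = goal - predicted                 # prediction error vs. desired state
        if np.linalg.norm(error) < 1e-3:
            break
        # finite-difference estimate of how each plan element moves the outcome
        grad = np.zeros_like(plan)
        for i in range(plan.size):
            bumped = plan.copy()
            bumped[i] += eps
            grad[i] = (forward_model(bumped) - predicted) @ error / eps
        plan += step * grad                      # revise the plan, not the world
    return plan
```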

Karl Friston has recently teamed up with a for-profit organization to use Active Inference to solve real-world problems. Friston has claimed that Active Inference is the future of AGI, instead of the current deep neural-network models.

Now, I think the principles behind Active Inference are going to play a big part in AGI. But it’s not the only part. Active Inference is a pure model-based solution. That’s great for flexibility and dealing with novel situations. But the problem with model-based solutions is that they’re not efficient for fine-detailed, highly repeated actions.

To put that into context, we can consider a theory of how the cerebellum is used. The motor cortex may well be involved with planning out courses of action, but the exact detailed, and carefully timed orchestration of individual muscles to carry out those actions is probably controlled by the cerebellum (Carter et al, 2019). The cerebellum is a very simple feedforward network and is thought to use only model-free supervised learning (Raymond & Medina, 2018).

The point is that for the most efficient learning with the fewest neurons, and the most accurate yet flexible control, we require a combination of model-based and model-free control.

Online Learning

People often discuss how to define AGI. Given that we here have taken it to mean Artificial Human-like Intelligence, we can be clear in one thing: an AGI must learn, from experience, and it must do so continually.

Another way of thinking about AGI is that it is a general-purpose, adaptive, computational machine. And it can hardly be called adaptive if it can’t learn, from experience, in a short amount of time.

This is known as online learning, and unfortunately, it’s still largely unsolved in ML. Many techniques are being worked on (see this excellent list for some reading material), such as sparse representations and predictive coding. But all are in their infancy as yet (Prabhu et al, 2020).

Online learning should not be confused with the so-called “few-shot” and “zero-shot learning” in LLMs (Brown et al, 2020). Those work by providing a cue to the LLM and then having it generate a novel response based on the cue. No learning parameters are updated. The LLM is simply using the context provided to it within the request in the construction of the response. This is adaptive (in the short term). But it is not learning.
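A tiny sketch makes the distinction obvious: “few-shot” prompting is just string construction. The model’s parameters stay frozen; the examples only exist inside the request. The task and the generate call below are hypothetical.

```python
# "Few-shot learning" in LLMs is prompt construction, not learning: the model's
# weights are frozen; the examples only live in the request context.
examples = [
    ("great movie, loved it", "positive"),
    ("total waste of time", "negative"),
]
query = "the plot dragged but the ending was superb"

prompt = "\n".join(f"Review: {text}\nSentiment: {label}" for text, label in examples)
prompt += f"\nReview: {query}\nSentiment:"

# response = some_llm.generate(prompt)   # hypothetical call; no gradient step occurs
print(prompt)
```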

LLMs Today

What does this all mean for our understanding of LLMs today?

Firstly, while we’re stuck in habit-land at the moment, we can still achieve super-human results on certain tasks. LLMs are trained on a backlog of text far greater than any human will ever read in their lifetime. It seems increasingly clear that with that much information, the LLMs can generalize to the structure of the world and genuinely produce entirely novel responses that are consistent with our world.

Humans produce the largest portion of their behavior through habitual action. While human habitual behavior is trained online through lived experience and in conjunction with goal-oriented behavior, with sufficient data it is easy to artificially emulate that training environment and to produce a useful system that uses solely habitual action. I see LLMs as that very system.

And given their super-human level of training, LLMs can do a lot more with habitual behavior alone than humans can.

But they still lack the adaptive capabilities of humans.

And then there’s efficiency. As discussed earlier, model-free habitual control is very efficient for regularly repeated behaviors. However, it scales poorly with the range of behavior: as more behaviors need to be encoded, the size of the model-free network must grow rapidly. In contrast, within a model-based system, an unbounded range of behaviors can be produced from a single model. Thus, to a certain extent, the model of the world only needs to grow to a certain fixed size, after which it is sufficiently detailed to support an exponential increase in behavioral repertoire.

Thus, I suspect that it is true to say that our existing habit-driven architecture of LLMs will indeed plateau. At some point, we will find that it’s too costly to keep increasing its capacity exponentially while we only obtain a linear gain in the behavioral repertoire. We need to solve for true model-based behavior.

Source: author + Dall-E

The Roadmap

My goal at this point is twofold. Firstly, to provide some perspective on where we are by highlighting what steps remain for us to reach anything that could generally be agreed to be AGI. Secondly, to propose the beginnings of a possible solution to one of the larger problems.

To achieve human-like AGI, we need to take on those improvements already mentioned and some more:

  • Online learning. It’s not AGI if it can’t learn on the spot.
  • Examine human reflexes that contribute to our pro-social behaviors (eg: guilt, compassion, the weird instinct to smile back at others) and our anti-social behaviors (eg: desire, anger), and also how emotions contribute. We need to incorporate more of those into our training algorithms. And they’re going to be even more important once we have online learning.
  • This should be extended to investigating other hard-wired mechanisms in human cognition. For example, we found that curiosity provides an excellent tool for exploring the behavior space. What other innate drivers would benefit a learning agent, particularly an online learning agent?
  • Combine model-free with model-based behavioral control. True model-based control — with both the model and the strategies being learned.

An Inspiration for Model-based Cognition

I’m going to now share some thoughts on how to solve the model-based control problem. This will need a little backstory, so bear with me for a minute.

I’ve been studying the underpinnings of consciousness for about a decade, and I have come to realize that a big part of why it exists is to perform meta-management on our cognition. Specifically, to monitor and adjust our cognitive state during deliberation.

During deliberation, there is no environmental feedback other than the opportunity costs and risks of deliberating for too long. Internal senses in our body provide no other feedback. The cognitive processes must provide their own feedback as to whether a particular course of deliberation is heading in a productive direction or not. That is meta-management.

In practice, the process of deliberation is best seen as a form of navigation through cognitive state space. It turns out that cognitive state space is just as complex as the real world, and navigation through it requires some of the same strategies that we apply in the real world. There are regions equivalent to dead-ends. Some regions result in fruitless busyness, such as those that cause the cognitive process to trail off into unrelated mind-wandering. The solution is to map the (cognitive) world in which we (cognitively) exist, and to use that map to aid navigation.

Comparison of navigation in (a) real-world and (b) cognitive state space. Source: author.

Initially, such a cognitive map would relate very closely to concrete real-world problems, like how to find food. But the power of such a map is that it isn’t restricted to concrete material needs — it can also be used to map out abstract ideas. And the same cognitive navigation abilities can be used to navigate through those abstract ideas.

Notice how this is starting to sound like model-based control: there’s a model that represents something about the world and a planner that uses that model and different strategies to decide upon the best course of action. Importantly, there is a natural way in which both can be learned. The model can be built up from experience as discussed earlier. The planner learns to navigate.

This provides an idea for how we might be able to build a model-based system that can operate against both real-world interaction problems and cognitive deliberation problems, and which can learn both the model and how to plan against it. I think that that alone would be a huge step towards AGI.

Let’s take a closer look at how that might work…

Learning the Model

There is something fascinating to me about the way that the brain incorporates a mixture of hard-wired mechanisms with processes that are learned from experience. As discussed, while we may consciously control our actions while learning a new skill, the process that takes the observations of those actions and converts them into learning appears to be automatic and hard-wired. There’s a lot of debate, but one theory is that the hippocampus stores up events over the day, like a buffer. Then, during sleep, it replays those events multiple times against the action systems, leading to the necessary synaptic weight changes (Carter et al, 2019). That process is complex. And it is almost certainly entirely hard-wired.
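For readers coming from the ML side, that buffer-and-replay story is strongly reminiscent of experience replay in reinforcement learning. Here’s a minimal sketch of the analogy, with learner.update() as a hypothetical stand-in for whatever slow weight-change process consumes the replayed events.

```python
import random
from collections import deque

class ReplayBuffer:
    """Store the day's events, then replay them offline -- the RL analogue of
    hippocampal buffering and sleep replay. `learner` is a hypothetical stand-in."""

    def __init__(self, capacity=10_000):
        self.events = deque(maxlen=capacity)

    def store(self, event):                      # called while "awake"
        self.events.append(event)

    def replay(self, learner, passes=3, batch_size=32):
        for _ in range(passes):                  # each "night", replay several times
            batch = random.sample(list(self.events), min(batch_size, len(self.events)))
            learner.update(batch)                # drives the slow weight changes
```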

The cognitive map can be learned similarly. Any action of cognitive processing, regardless of the problem being undertaken at the time, leads to sequences of observations that can be used to update the cognitive map. As we learn about strange abstract concepts such as numbers, equations, atoms, and junk food, those concepts and their relationships to others get laid out against cognitive state space, further informing the modeling of that space.

Now, if we were to attempt to build an AGI based on this, we’re struck with the problems of how to represent this cognitive map, how to identify the things that should be modeled into it, and which algorithm to use for model learning. I propose a simplistic starting point:

  • Use an online parameter-free clustering algorithm (ie: no pre-set K) to group observations (eg: following Scherreik, 2020, or Rigoli et al, 2017).
  • Use the information within groups and between pairs of such groups to establish relationships. Thus forming a knowledge graph, with clustered groups as the nodes, and the relationships between groups as the edges.
  • A useful representation of group-to-group relationships may well be that of probability distributions. Thus the knowledge graph is a form of Bayesian network.
  • The clustering algorithm needs discrete events as its source. That can be achieved through the use of predictive mechanisms and surprisal: if an event is surprising, then add it into the model.

Thus, from a constant stream of external, internal, and cognitive sensory input a graphical map can be constructed that captures the key components and how they relate to each other:

Knowledge graph generated from clustering algorithm. Source: author.
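Here’s a toy sketch of the proposal above: a threshold-based online clustering (standing in for the cited nonparametric methods, not reproducing them) gated by surprisal, with graph edges accumulated between consecutively active clusters. The threshold and update rules are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

class CognitiveMap:
    """Toy online clustering + knowledge graph: surprising observations become new
    nodes; consecutive events accumulate edge weights between their clusters."""

    def __init__(self, novelty_threshold=1.0):
        self.centroids = []                      # cluster centres = graph nodes
        self.edges = defaultdict(int)            # (node_i, node_j) -> transition count
        self.threshold = novelty_threshold
        self._last_node = None

    def observe(self, x):
        x = np.asarray(x, dtype=float)
        if not self.centroids:
            self.centroids.append(x)
            node = 0
        else:
            dists = [np.linalg.norm(x - c) for c in self.centroids]
            node = int(np.argmin(dists))
            if dists[node] > self.threshold:     # surprising: add it to the model
                self.centroids.append(x)
                node = len(self.centroids) - 1
            else:                                # familiar: refine the existing node
                self.centroids[node] = 0.9 * self.centroids[node] + 0.1 * x
        if self._last_node is not None:          # relate consecutive events
            self.edges[(self._last_node, node)] += 1
        self._last_node = node
        return node
```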

One very cool feature of this is worth mentioning. A common issue with clustering algorithms is that they don’t produce perfect groupings. A further problem with online clustering algorithms is that they are tremendously sensitive to noise and to the order in which data is received. Thus the resultant map will be in a constant state of inaccuracy and flux. As will be discussed in the next section, the system that uses this graph is itself a learning system, and so it doesn’t matter that the clustering is imperfect. The wonderful thing about this solution is that almost any clustering is good enough, and it will get better over time as the individual obtains new observations.

Learning to Navigate

Here’s the cool part.

The ML/AI community has trained cars to navigate roads and has trained robot arms to plan sequences of actions for putting blocks on top of each other. Now it’s time to train artificial cognitive systems to navigate their own cognitive maps.

So we have a knowledge graph that is in the form of a Bayesian network. What strategy should we use to search this graph when solving a particular problem? Breadth first? Depth first? Heuristic A* search? Answer: we don’t. We let the AI learn to follow whatever strategy works.

A model-based planner. Source: author

We set up a policy network that uses associative memory lookup for querying the knowledge graph. We allow that policy network to deliberate: to iteratively search the graph however it sees fit and to choose when to produce a final result.

Then we train it against whatever problem domains we want, in a reinforcement learning setting, with a suitable loss function so that it learns to use that knowledge graph to efficiently produce results.
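In skeleton form, the deliberation loop might look something like the following, where the policy, the graph lookup, and the reward scheme are all hypothetical placeholders to be filled in by a real RL setup.

```python
# Skeleton of the "learn to navigate your own map" idea: at each deliberation
# step the policy either issues another query against the knowledge graph or
# commits to an answer. The policy, graph, and action objects are hypothetical
# stubs; in practice the policy would be trained with a standard RL algorithm.

def deliberate(policy, graph, problem, max_steps=20):
    state = {"problem": problem, "retrieved": []}
    trajectory = []
    for _ in range(max_steps):
        action = policy.act(state)                    # either "query" or "answer"
        trajectory.append((dict(state), action))
        if action.kind == "answer":
            return action.payload, trajectory         # chose to stop deliberating
        retrieved = graph.lookup(action.payload)      # associative memory lookup
        state["retrieved"].append(retrieved)
    return None, trajectory                           # ran out of patience

# Training loop (sketch): reward correct answers, penalize each deliberation step
# slightly so the policy learns not to ruminate forever, then update the policy
# from (trajectory, reward) with your RL method of choice.
```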

We can take several further steps towards human-like intelligence. Firstly, instead of running the network in “epochs” as we do in ML today, we would leave this policy network running indefinitely — able to produce results whenever it wants, based on whatever it’s thinking about at the time. Secondly, we would likely need to leverage my ideas around meta-management and have this system monitor and model its computational state trajectories.

Additional steps toward human-like intelligence would grant such a system the ability to find its own interests and problem domains by using some form of curiosity-driven learning, such as Active Inference.

LLMs to the Rescue

Knowledge-graphs, query mechanisms. It’s starting to sound like LLMs again.

Can we use the power of LLMs and Retrieval Augmented Generation (RAG) to build our first AGIs? I don’t know. But it’s certainly worth investigating.

Taking the first Steps

It’s my dream to build the system that I describe above. And I will be taking the first steps toward building parts of it in the months that come. I’m planning to start small, with a very simple knowledge graph, and a very simple policy network that just learns to do deliberative processing. The idea of using LLMs is an interesting one that I might follow up on too, but I think the future needs a new kind of algorithm so I think there’s value in starting afresh.

So stay tuned.

What are your thoughts about all this? I’m certain that there are many flaws in my simplistic proposal. What are they, and how might we overcome them? Let’s hear the result of your cognitive deliberations in the comments.

References

Adams, R. A., Shipp, S., and Friston, K. J. 2013. Predictions Not Commands: Active Inference in the Motor System. Brain Structure and Function, 3, 218. https://doi.org/10.1007/s00429-012-0475-5.

Adron, J. W., Grant, P. T., and Cowey, C. B. (1973). A System for the Quantitative Study of the Learning Capacity of Rainbow Trout and Its Application to the Study of Food Preferences and Behaviour. Journal of Fish Biology, 5. https://doi.org/10.1111/j.1095-8649.1973.tb04497.x.

Bennet, M. (2023). A Brief History of Intelligence. Mariner Books.

Brown, T. B., Mann, B., Ryder, N., et al (2020). Language Models are Few-Shot Learners. ArXiv. https://arxiv.org/abs/2005.14165

Carter, R., Aldridge, S., Page, M., & Parker, S. (2019). The Brain Book: An Illustrated Guide to its Structure, Function, and Disorders (3rd ed.). DK Publishing.

Dayan, P. (2008). The role of value systems in decision making. In C. Engel and W. Singer, eds, Better Than Conscious? Implications for Performance and Institutional Analysis (pp. 51–70). MIT Press. https://psycnet.apa.org/record/2007-18938-003 (full text).

Ouyang, L., Wu, J., Jiang, X., et al (2022). Training language models to follow instructions with human feedback. ArXiv. https://arxiv.org/abs/2203.02155

Prabhu, A., Torr, P. H. S., & Dokania, P. K. (2020). GDumb: A simple approach that questions our progress in continual learning. ECCV. https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123470511.pdf

Raymond, J. L., & Medina, J. F. (2018). Computational Principles of Supervised Learning in the Cerebellum. Annual review of neuroscience, 41, 233–253. https://doi.org/10.1146/annurev-neuro-080317-061948

Rigoli, F., Pezzulo, G., Dolan, R., & Friston, K. (2017). A Goal-Directed Bayesian Framework for Categorization. Frontiers in Psychology, 8, 408. https://doi.org/10.3389/fpsyg.2017.00408

Scherreik, M. D. (2020). Online Clustering with Bayesian Nonparametrics. Wright State University. PhD thesis. https://corescholar.libraries.wright.edu/etd_all/2409/

Sloman, A. (1998). Damasio, Descartes, Alarms and Meta-management. SMC’98 Conference Proceedings. 1998 IEEE International Conference on Systems, Man, and Cybernetics, 3, 2652–2657. https://doi.org/10.1109/ICSMC.1998.725060.

Vaswani, A., Shazeer, N., Parmar, N., et al (2017). Attention Is All You Need. ArXiv. https://arxiv.org/abs/1706.03762

Wheatley, T., and Wegner, D. M. (2001). Automaticity of Action, Psychology of. International Encyclopedia of the Social & Behavioral Sciences. Elsevier.
https://scholar.harvard.edu/dwegner/files/wheatleywegner.pdf

(* — some icons used within diagrams came from flaticon.com, with modifications)


Malcolm Lett

Software engineer, consciousness enthusiast, reskilling as a ML engineer. Originally from NZ, and now based in Chennai, India.