A Path Towards Autonomous Machine Intelligence (Part-1)

Pradeep Das
AI & ML Explainer
Jun 11, 2023 · 9 min read

Artificial Intelligence, Machine Common Sense, Cognitive Architecture, Deep Learning, Self-Supervised Learning, Energy-Based Model, World Models, Joint Embedding Architecture, Intrinsic Motivation

Below is my attempt to read and understand a position paper:

A Path Towards Autonomous Machine Intelligence
Version 0.9.2, 2022–06–27
By Yann LeCun

This is part 1 of a multi-part series. Links to future posts will be added here.

What is this position paper about?

This is Yann LeCun’s vision for the future of AI, in which machines learn and behave more like humans and animals, driven by intrinsic objectives rather than hard-wired programs, external supervision, or external rewards.

Abstract (taken verbatim)

How could machines learn as efficiently as humans and animals? How could machines learn to reason and plan? How could machines learn representations of percepts and action plans at multiple levels of abstraction, enabling them to reason, predict, and plan at multiple time horizons? This position paper proposes an architecture and training paradigms with which to construct autonomous intelligent agents. It combines concepts such as configurable predictive world model, behavior driven through intrinsic motivation, and hierarchical joint embedding architectures trained with self-supervised learning.

What will you gain by reading this explainer?

  1. Understand the proposed architecture and training paradigms for constructing autonomous intelligent agents
  2. Understand concepts such as
    configurable predictive world models,
    behavior driven through intrinsic motivation, and
    the Joint Embedding Predictive Architecture (JEPA) and Hierarchical JEPA (H-JEPA)

With the goals set, let’s get rolling.

Introduction

Even with substantial supervised data, extensive reinforcement-learning trials, and pre-programmed behaviors, machine learning systems remain far less reliable and efficient than human learning.

The answer may lie in the ability of humans and many animals to learn world models, internal models of how the world works.

Challenges

Three primary challenges for AI to become more human-like in its capabilities:

How can machines learn about the world through observation alone, without direct interaction, to minimize the costly and risky trials needed for task learning (since many interactions in the real world are expensive and dangerous)?

How can machines reason and plan in ways that are compatible with gradient-based learning? Our current best learning strategy is to estimate the gradient of a loss and use it to adjust a model’s internal settings, such as the weights and biases of a neural network, which requires differentiable architectures. Despite its effectiveness, this process does not blend easily with traditional logic-based problem-solving.
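To make the gradient-based side of this tension concrete, here is a minimal PyTorch sketch of what “estimating and using the gradient of a loss” looks like in practice. The model, data, and hyperparameters are toy placeholders, not anything from the paper:

```python
import torch

# Toy illustration of gradient-based learning: the model and the loss are both
# differentiable, so autograd can tell us how a small change to each weight
# would change the loss, and we nudge the weights in the direction that
# reduces it.
model = torch.nn.Linear(3, 1)                          # weights and a bias
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(128, 3)                                # toy inputs
y = x.sum(dim=1, keepdim=True)                         # toy targets to fit

for step in range(200):
    loss = ((model(x) - y) ** 2).mean()                # scalar, differentiable loss
    optimizer.zero_grad()
    loss.backward()                                    # gradient of the loss w.r.t. parameters
    optimizer.step()                                   # adjust parameters to reduce the loss
```

Nothing in this loop looks like symbolic logic, which is exactly the tension the question points at: discrete, rule-based reasoning does not expose gradients that this kind of optimizer can follow.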

How can machines learn hierarchical representations and action plans across different levels and time scales? Humans and animals can break down complex actions into simpler sequences of low-level ones using multilevel abstractions, allowing for long-term predictions and planning.

Terminology

Cognitive Architecture: design or structure of a system that imitates the human brain — a system that can process, learn, and understand information similar to how humans do.

Differentiable: in ML, something is differentiable when small changes to its inputs or parameters produce smooth, predictable changes in its output, which means a gradient can be computed. This is crucial for learning from errors and improving the model’s performance over time.

Trainable Modules: parts of a system that can be improved or “trained” by feeding them data and allowing them to adjust and learn from it.

Non-generative Architecture for Predictive World Models: a predictive system that does not try to generate or reconstruct raw data (for example, every detail of a future observation) but instead makes its predictions in an abstract representation space.

Hierarchy of Representations: a way of organizing data or information at different levels, from basic to complex, similar to how humans process information.

Non-contrastive Self-supervised Learning: a particular strategy used in self-supervised learning that does not rely on comparing or contrasting different pieces of data.
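To make “non-contrastive” concrete, here is a minimal sketch in the spirit of VICReg, one of the non-contrastive criteria associated with this line of work: instead of pushing apart negative pairs, collapse is avoided with variance and covariance terms. The loss weights and constants below are illustrative, not the paper’s values.

```python
import torch
import torch.nn.functional as F

def non_contrastive_loss(z_a, z_b, inv_w=25.0, var_w=25.0, cov_w=1.0):
    """VICReg-style criterion on two batches of embeddings z_a, z_b
    (two views of the same inputs). No negative pairs are compared."""
    # Invariance: embeddings of the two views should agree.
    inv = F.mse_loss(z_a, z_b)

    # Variance: keep each embedding dimension from collapsing to a constant.
    std_a = torch.sqrt(z_a.var(dim=0) + 1e-4)
    std_b = torch.sqrt(z_b.var(dim=0) + 1e-4)
    var = F.relu(1.0 - std_a).mean() + F.relu(1.0 - std_b).mean()

    # Covariance: decorrelate embedding dimensions so they carry
    # complementary (informative) content.
    def cov_penalty(z):
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (z.shape[0] - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return (off_diag ** 2).sum() / z.shape[1]

    cov = cov_penalty(z_a) + cov_penalty(z_b)
    return inv_w * inv + var_w * var + cov_w * cov
```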

Informative and Predictable Representations: ways that the system can understand and present data that provide useful information and can be used to accurately predict future outcomes.

Hierarchical Planning Under Uncertainty: making a series of decisions or plans at different levels or stages, even when there is uncertainty or incomplete information about the future. This could be likened to planning a multi-stage project with unknown variables.

Proposal

The author proposes a single, configurable world model engine, enabling the sharing of knowledge across tasks and reasoning by analogy, which is a departure from having a separate model for each situation.

The document outlines an overall cognitive architecture in which all modules are differentiable and many are trainable; a non-generative architecture for predictive world models that learns a hierarchy of representations; and a self-supervised learning paradigm.

  • It lays out a novel architecture, the Joint Embedding Predictive Architecture (JEPA) and its hierarchical variant (H-JEPA), centered on non-generative predictive world models that learn a hierarchy of representations (a minimal sketch follows this list).
  • It proposes a novel, non-contrastive self-supervised learning paradigm designed to produce representations that are simultaneously informative and predictable.
  • It demonstrates how Hierarchical JEPA can serve as the basis for predictive world models used in hierarchical planning under uncertainty.
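Here is a hypothetical PyTorch sketch of the JEPA idea: rather than generating y itself, the model predicts the representation of y from the representation of x, with a latent variable z standing in for whatever about y cannot be predicted from x. The layer sizes and MLPs are placeholders, not the paper’s actual networks.

```python
import torch
import torch.nn as nn

class JEPASketch(nn.Module):
    """Illustrative Joint Embedding Predictive Architecture:
    prediction happens in representation space, not in input space."""

    def __init__(self, x_dim=64, y_dim=64, s_dim=32, z_dim=8):
        super().__init__()
        self.enc_x = nn.Sequential(nn.Linear(x_dim, s_dim), nn.ReLU(), nn.Linear(s_dim, s_dim))
        self.enc_y = nn.Sequential(nn.Linear(y_dim, s_dim), nn.ReLU(), nn.Linear(s_dim, s_dim))
        self.pred = nn.Sequential(nn.Linear(s_dim + z_dim, s_dim), nn.ReLU(), nn.Linear(s_dim, s_dim))

    def energy(self, x, y, z):
        s_x = self.enc_x(x)                               # representation of the observed x
        s_y = self.enc_y(y)                               # representation of the y to be predicted
        s_y_hat = self.pred(torch.cat([s_x, z], dim=-1))  # prediction conditioned on latent z
        return ((s_y_hat - s_y) ** 2).mean(dim=-1)        # low energy = compatible (x, y) pair
```

In training, a non-contrastive criterion like the one sketched in the Terminology section would keep the encoders from collapsing to a constant output, and the information carried by z would be kept limited. Stacking such modules, with higher levels predicting coarser, longer-range representations, gives the hierarchical variant (H-JEPA).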

Learning World Models

The key tenet of this proposal is the “world model”, which is an internal mental model that intelligent beings (both humans and animals) develop to represent their understanding of how the world operates.

Humans and animals have a remarkable ability:

  • they learn vast amounts of background knowledge about the world through observation, requiring only a small number of interactions;
  • this accumulated knowledge, often referred to as common sense, enables them to make predictions, reason, plan, explore, and imagine solutions;
  • world models are central to this process, allowing them to learn new skills with few trials, predict consequences, and avoid dangerous mistakes.

The challenge is to develop machine learning paradigms and architectures that enable machines to learn world models and use them for prediction, reasoning, and planning. A key technical hurdle is how to create trainable world models that can handle the complex uncertainty inherent in predictions.

Learning Hierarchies of Models

Infants build up their knowledge of the world in a hierarchical fashion, with simpler concepts providing the foundation for more complex ones.

This process starts with learning basic principles, like object permanence, the concept of three dimensions, and the properties of objects and their movement. As these foundational concepts are established, more complex ideas are developed on top of them, such as understanding gravity, inertia, and other intuitive physics.

There are certain responses and drives that are innate or “hard-wired” into our system. These could be physiological needs, like hunger or thirst, or instinctual behaviors that are typically linked to survival. These intrinsic motivators and behaviors guide learning and interactions with the environment.

The steps listed below provide a framework for how humans acquire fundamental knowledge about the world around them in the early stages of their lives.

  1. Perception of the World (0–3 months): From the moment they are born, infants begin to understand that the world is three-dimensional and that every source of light, sound, and touch lies at some distance from them. Parallax motion, the change in perspective we experience when moving our head or viewing a scene from the slightly different positions of our two eyes, makes depth apparent and helps us perceive the world in 3D.
  2. Recognition of Objects (3–4 months): Depth perception leads to the concept of objects, identifiable entities that can occlude more distant ones. Infants learn that objects exist in the world and can be assigned to broad categories based on their appearance or behavior.
  3. Object Permanence (4–7 months): Children learn that objects do not spontaneously appear or disappear, change shape, or teleport. They move smoothly and can only be in one place at any one time. This understanding is called “object permanence.”
  4. Understanding Object Dynamics (~7 months onward): Once object permanence is established, children begin to learn that objects have different properties. Some objects are static, some have predictable trajectories (like a ball thrown in the air), some behave unpredictably (like leaves in the wind), and some seem to obey different rules (like animals).
  5. Grasping Basic Physics (8 months onward): As their understanding of objects grows, infants start to grasp basic principles of physics such as stability, gravity, and inertia. They learn that objects fall down, not up, that a stably placed object won’t fall over, etc. They apply these principles to understand the behavior of objects in the world.
  6. Cause-and-Effect Relationships (~12 months onward): As they see the effects of animate objects (including their own actions) on the world, children learn about cause-and-effect relationships. For example, they learn that if they knock a cup off the table, it will fall and potentially spill or break.
  7. Linguistic and Social Knowledge (~12 months onward): With the understanding of cause-and-effect relationships, infants start to acquire linguistic and social knowledge. They start to understand spoken language, recognize social cues, and eventually begin to speak themselves.

This corresponds to the learning progression shown in the chart below by Emmanuel Dupoux (Figure-1).

This acquired knowledge, combined with hard-wired behaviors and intrinsic motivations/objectives, helps in predicting consequences, planning actions, and avoiding potential dangers.

While we do accumulate a vast array of these models, the brain’s capacity isn’t infinite. Furthermore, the world is complex and constantly changing, meaning it’s impossible to have a model for every potential situation.

So can a human or animal brain contain all the world models that are necessary for survival?

Hypothesis: Animals and humans have only one world model engine somewhere in their prefrontal cortex.

The world model engine (Figure-2) is dynamically configurable for the task at hand. With a single, configurable world model engine, rather than a separate model for every situation, knowledge about how the world works may be shared across tasks. This may enable reasoning by analogy, by applying the model configured for one situation to another situation.

The proposed model consists of several interconnected modules. Here’s a summary of each:

  1. Configurator: This module takes inputs from all other modules and adjusts them to suit the current task. It’s key for the dynamic reconfigurability of the world model engine. It could be thought of as the meta-level controller, tuning the rest of the system based on the current task.
  2. Perception: This module provides an estimate of the current state of the world based on sensory input, essentially how the organism perceives its environment at a given moment.
  3. World Model: This is the core of the system. This module predicts potential future states of the world based on imagined action sequences proposed by the actor module. It forms a kind of internal simulation or prediction of how different actions will affect future states and represents our understanding of how actions will impact the world around us.
  4. Cost: This module calculates a single scalar output called “energy,” which represents the discomfort level of the agent. It includes two parts: the intrinsic cost, which computes the immediate energy of the current state (such as pain, pleasure, hunger, etc.), and the critic, a trainable module that predicts future values of the intrinsic cost. This represents our evaluation of the current situation and its likely outcomes.
  5. Short-term memory: This module keeps track of the current and predicted world states and associated intrinsic costs, allowing the system to recall recent information and helping us keep track of our current situation and what we expect to happen next.
  6. Actor: The actor module proposes sequences of actions that might achieve our goals. The world model and the critic modules compute the likely outcomes of these proposed actions. The actor, our decision-making process, can then find an optimal action sequence that minimizes the estimated future cost and output the first action in that sequence (a sketch of this planning loop follows the list).
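To see how these modules fit together, here is a hypothetical sketch of one planning step. Every interface below (perceive, world_model, cost, actor.propose) is an assumption made for illustration; the paper envisions optimizing the action sequence by gradient descent through the differentiable world model, whereas this sketch simply scores a handful of sampled candidates to keep the loop explicit.

```python
import torch

def plan_one_step(perceive, world_model, cost, actor, horizon=5, n_candidates=64):
    """One deliberate planning step over the modules described above:
    imagine futures with the world model, score them with the cost module,
    and act on the first action of the cheapest imagined future."""
    s0 = perceive()                                    # Perception: current state estimate
    candidates = actor.propose(n_candidates, horizon)  # Actor: candidate action sequences

    energies = torch.zeros(n_candidates)
    for i, actions in enumerate(candidates):
        s = s0
        for a in actions:
            s = world_model(s, a)                      # World model: imagined next state
            energies[i] += cost(s)                     # Cost: intrinsic cost + critic estimate

    best = torch.argmin(energies)                      # lowest predicted "energy" (discomfort)
    return candidates[best][0]                         # execute only the first action
```

The configurator would sit above this loop, tuning the other modules for the task at hand, and the short-term memory would store the predicted states and costs, for instance so the critic can later be trained on them.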

Predicting, Planning, and Learning New Tasks: With the foundational knowledge and flexibility of the single “world model engine”, humans and animals are capable of predicting the consequences of their actions, planning ahead, and learning new tasks quickly. This means not only reacting to the present but also anticipating and preparing for future scenarios, which is crucial for survival.


Pradeep Das
AI & ML Explainer

making sense of human consciousness and connection through data and technology