The Dawn of Synthetic Brains: How Large Language Models Are Shaping AI Agents

Deepak Babu P R
10 min read · Dec 29, 2023


Recent advancements in Large Language Models (LLMs) have demonstrated a significant capability for reasoning and planning, a trait predominantly seen in models with over 100 billion parameters (considered an emergent ability). While opinions differ on whether LLMs genuinely possess intrinsic reasoning abilities, it’s becoming increasingly evident that, through sophisticated prompting strategies, LLMs can be equipped with the necessary ‘tools’ to collaboratively tackle complex problems in a manner akin to human problem-solving. This has reignited interest in the development of Agents, a field that had largely remained in the realm of academic research for decades. In this article, we delve into the role of LLMs in advancing agent technology, examine their inherent limitations, and explore how current agent research is striving to overcome these challenges.

What exactly defines an agent? In simple terms, an agent is an entity capable of perceiving its environment and acting to alter that environment’s state. A crucial aspect of an agent is autonomy, or the ability to act independently without human supervision. The concepts of ‘state’ and ‘environment’ are central to this definition, and we’ll explore these shortly. Agents have their foundation in Reinforcement Learning, a branch of AI/ML focused on machines that learn from their own experiences. These experiences might come from self-play, as in AlphaGo’s mastery of Go, or from imitating human interactions with the world. Take, for instance, a home security system: if a locked house represents a certain state, then the system’s action of unlocking the door represents a change in that state, with the home being the environment. It’s important to note that the concept of agents extends beyond the physical realm to include software agents. These digital agents can perform tasks like ordering a ride on Uber with a simple command, such as ‘order Uber to Seatac airport for 5 people,’ automating actions that would otherwise require manual input.
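To make the state, environment, and action vocabulary concrete, here is a minimal sketch of that perceive-decide-act loop for the home-security example. All names (`SmartLockEnv`, `rule_based_policy`) are illustrative and not taken from any particular framework:

```python
from dataclasses import dataclass

@dataclass
class SmartLockEnv:
    """Toy environment for the home-security example: the state is whether the door is locked."""
    locked: bool = True

    def step(self, action: str) -> bool:
        # The agent's action mutates the environment's state.
        if action == "unlock":
            self.locked = False
        elif action == "lock":
            self.locked = True
        return self.locked  # the new state is returned as the next observation

def rule_based_policy(observation: bool, owner_nearby: bool) -> str:
    """A trivial 'brain': choose an action from the perceived state."""
    if observation and owner_nearby:
        return "unlock"
    return "lock"

env = SmartLockEnv()
obs = env.locked                                     # perceive the environment
action = rule_based_policy(obs, owner_nearby=True)   # decide
obs = env.step(action)                               # act; the environment state changes
print(f"action={action}, door locked={obs}")
```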

[Figure: A 2-D chart with the range of tasks (general to specific) on the y-axis and accuracy (incompetent to super-human) on the x-axis; red dots show where agents fall along these dimensions.]

The classification of agents can be understood along two key dimensions: (i) the breadth of tasks they can handle and (ii) the accuracy or performance level in executing these tasks. Historically, most agents have been specialized, adept at narrowly defined tasks, such as playing chess or Go. For instance, AlphaGo demonstrated super-human performance in this realm, famously defeating world champion Lee Sedol in March 2016 with a score of four games to one. On the other hand, voice assistants like Alexa or Siri represent a broader scope, performing tasks ranging from home automation, like turning lights on or off, to providing weather updates or navigating screens. While their performance can be considered on par with human capabilities, they often come across as somewhat templated in their responses. To realize the goal of Artificial General Intelligence (AGI), we require systems that not only cover a wide array of tasks but also match or exceed human performance levels in these tasks.

Drawing inspiration from human behavior, the critical components of an intelligent agent can be conceptualized as follows (a minimal code sketch showing how these modules fit together appears right after the list):

  1. Brain: This is the central processing unit of the agent, tasked with interpreting environmental stimuli received through the perception module. It is responsible for decision-making, reasoning, and planning actions. Key functions include logical reasoning, forming analogies, and engaging in second-order thinking. The brain is the core of the agent, seamlessly integrating with perception, memory, and action modules.
  2. Perception: This module is akin to the human sensory system, responsible for sensing the environment. It includes vision (similar to eyes), speech and language processing (akin to ears), and can extend to other senses such as touch (comparable to skin) and olfaction (like the nose). Perception is crucial for providing the brain with necessary environmental information.
  3. Memory: Serving as the agent’s repository of experiences and facts, memory is essential for future retrieval of information. Often regarded as an integral part of the brain, it encompasses various aspects like habitual memory (for routine tasks like driving), short-term memory (for recent events), and long-term memory (for more distant recollections, such as childhood memories).
  4. Action (Actuator): This component can be likened to human actuators, such as hands and legs, which interact with and modify the environment. It can involve physical movements or communication methods like speech to convey thoughts and information.
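Below is a minimal sketch of how these four modules might be composed in code. The class and method names are illustrative assumptions rather than an existing framework's API, and the LLM is stubbed with a lambda so the example runs offline:

```python
from typing import Callable, List

class Perception:
    def observe(self, raw_input: str) -> str:
        # A real agent would run vision/speech models here; this just passes text through.
        return raw_input.strip()

class Memory:
    def __init__(self) -> None:
        self.events: List[str] = []            # episodic / long-term store
    def remember(self, event: str) -> None:
        self.events.append(event)
    def recall(self, query: str, k: int = 3) -> List[str]:
        # Naive keyword recall; a real system would use embedding search.
        return [e for e in self.events if query.lower() in e.lower()][:k]

class Brain:
    def __init__(self, llm: Callable[[str], str]) -> None:
        self.llm = llm                         # the LLM acts as the reasoning core
    def decide(self, observation: str, memories: List[str]) -> str:
        prompt = f"Observation: {observation}\nRelevant memories: {memories}\nNext action:"
        return self.llm(prompt)

class Agent:
    def __init__(self, brain: Brain, perception: Perception, memory: Memory) -> None:
        self.brain, self.perception, self.memory = brain, perception, memory
    def act(self, raw_input: str) -> str:
        obs = self.perception.observe(raw_input)
        action = self.brain.decide(obs, self.memory.recall(obs))
        self.memory.remember(f"saw '{obs}', did '{action}'")
        return action

# Stub LLM so the sketch runs without external services.
agent = Agent(Brain(llm=lambda prompt: "turn_on_lights"), Perception(), Memory())
print(agent.act("It is getting dark in the living room"))
```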

So how do LLM-powered agents work? Which parts of this agent architecture remain unaddressed or require workarounds?

[Figure: Components of the agent architecture, from the Fudan NLP group's agent survey paper, showing the brain, perception, and action modules.]

LLM as the brain

LLMs pretrained on large internet-scale corpora of text, audio, and images can be viewed as world models that reason and plan much like a human brain, and they form the central component of the agent architecture. Instruction fine-tuning provides the alignment needed for natural, human-like interaction, turning the LLM into an efficient interface for human-machine (HCI) interaction; synthetically generated data has become a force multiplier in this effort, a topic I recently covered in an in-depth blog post here. Prompting has become the standard way to elicit the desired response from an LLM, giving rise to the emerging field of Prompt Engineering. The auto-regressive nature of LLM generation, however, has been shown to be a bottleneck for complex problems that require multi-step reasoning. We work around this limitation with advanced prompting techniques such as ReAct (Reason + Act), Self-Refine, Reflexion, CoT (Chain-of-Thought), and ToT (Tree-of-Thought). The fundamental idea behind these techniques is to have the LLM break the solution down into steps and then fill in that template step by step to form a coherent solution to a complex problem. While the techniques overlap with minor variations, they fall broadly into two classes of prompting: (i) sequential problem solving, where the problem is broken into sequential steps and the LLM conditions each generation on the previous step, and (ii) plan-ahead solving, which decomposes the problem into sub-problems that are executed independently in parallel, speeding up inference. BabyAGI and AutoGPT are two example frameworks that use these strategies to solve complex decision-making and reasoning problems. Another class of techniques focuses on automated prompt engineering (APE) to identify the best way to prompt an LLM for a desired task.
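As a concrete illustration, here is a stripped-down ReAct-style loop: the model is prompted to interleave 'Thought' and 'Action' steps, the agent executes each action with a tool, and the observation is appended to the prompt for the next step. The prompt template, the `calculator` tool, and the hard-coded `fake_llm` responses are simplifying assumptions, not the exact format from the ReAct paper:

```python
import re

def calculator(expression: str) -> str:
    # Toy math tool; a production agent would sandbox this instead of calling eval.
    return str(eval(expression))

TOOLS = {"calculator": calculator}

def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    if "Observation:" not in prompt:
        return "Thought: I need to compute the total.\nAction: calculator[12 * 7]"
    return "Thought: I now know the answer.\nFinal Answer: 84"

def react_loop(question: str, max_steps: int = 5) -> str:
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):
        response = fake_llm(prompt)
        prompt += response + "\n"
        if "Final Answer:" in response:
            return response.split("Final Answer:")[-1].strip()
        match = re.search(r"Action: (\w+)\[(.+)\]", response)
        if match:
            tool, arg = match.group(1), match.group(2)
            observation = TOOLS[tool](arg)
            prompt += f"Observation: {observation}\n"   # feed the result back to the model
    return "No answer within step budget"

print(react_loop("What is 12 * 7?"))
```

In a real agent, `fake_llm` would be a call to a hosted or local model, and the loop terminates either on a final answer or on the step budget.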

Images, Text, and Audio: Multimodal Perception in LLMs

The evolution of Large Language Models (LLMs) has progressed from solely focusing on extensive text corpora to integrating multimodal learning that encompasses text, images, and audio. This advancement is exemplified by models like Google DeepMind’s Gemini, which processes these modalities in unison. Multimodality is essential for creating an accurate representation of the state, crucial for effective reasoning. For example, in activity detection with a smartwatch, relying only on wrist movement data could be misleading. Incorporating additional sensory inputs like acceleration and GPS readings can more accurately confirm activities like walking. I recently covered this in some depth in a blog post here.

At their core, multimodal LLMs convert each modality into a dense representation, either through pretrained embeddings or representations learned on the fly. This allows the conditional generation of outputs in various modalities in an interleaved manner. These models typically employ self-supervised objectives, such as the masked-token prediction used in BERT, applied across mixed modalities. Cross-modal alignment is achieved through supervised learning on paired modality data. For instance, CLIP (Contrastive Language-Image Pretraining) aligns image and text by contrastively matching images with their captions, and CLAP (Contrastive Language-Audio Pretraining) aligns audio and text in the same way using paired audio-text data.
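To make the alignment idea concrete, here is a minimal sketch of a CLIP-style contrastive objective in PyTorch. The linear layers stand in for the real vision and text encoders, and the feature dimensions are arbitrary placeholders:

```python
import torch
import torch.nn.functional as F

# Stand-in encoders: real CLIP uses a vision transformer and a text transformer.
image_encoder = torch.nn.Linear(2048, 512)   # image features -> shared embedding space
text_encoder = torch.nn.Linear(768, 512)     # text features  -> shared embedding space

def clip_style_loss(image_feats: torch.Tensor, text_feats: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    # Project and L2-normalize both modalities into the shared space.
    img = F.normalize(image_encoder(image_feats), dim=-1)
    txt = F.normalize(text_encoder(text_feats), dim=-1)
    logits = img @ txt.t() / temperature          # pairwise image-text similarities
    targets = torch.arange(len(img))              # the i-th image matches the i-th caption
    # Symmetric cross-entropy over image->text and text->image directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

batch_images = torch.randn(8, 2048)   # e.g. pooled vision-encoder features
batch_texts = torch.randn(8, 768)     # e.g. pooled text-encoder features
print(clip_style_loss(batch_images, batch_texts))
```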

Unlike text, which is tokenized into discrete units like characters or sub-words, audio and image data are continuous signals. Therefore, they require discretization and the development of a unified codebook to be effectively integrated into the LLM framework.
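That discretization step can be pictured as vector quantization: each continuous frame (say, an encoded audio window or image patch) is snapped to its nearest entry in a codebook, and the resulting sequence of indices becomes the 'tokens' the LLM consumes. The sketch below uses a random, fixed codebook purely for illustration; real tokenizers (e.g. VQ-VAE-style models) learn the codebook jointly with the encoder:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 64))   # 1024 discrete codes, 64 dims each (illustrative sizes)

def quantize(frames: np.ndarray) -> np.ndarray:
    """Map each continuous 64-dim frame to the index of its nearest codebook entry."""
    # Pairwise squared distances between frames and codebook entries.
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)           # one integer "token" per frame

audio_frames = rng.normal(size=(50, 64))  # e.g. 50 encoded windows of an audio clip
tokens = quantize(audio_frames)
print(tokens[:10])                        # discrete tokens, ready to interleave with text
```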

A great lecture on multimodal LLMs from Stanford CS224n.

Memory in LLMs: Context Windows and Vector Databases

In Large Language Models (LLMs), the context window — utilized for providing instructions, examples, and contextual cues — acts as a form of short-term memory. This memory spans a few dialogue turns but is relatively limited in size, ranging from 2K to 100K tokens in state-of-the-art models. For long-term memory and fact retrieval, vector databases like Pinecone have been employed, leveraging dense embedding retrieval methods. Additionally, the parametric memory within LLMs’ weights, acquired during pretraining, represents a form of knowledge storage. However, this knowledge can be less reliable, sometimes leading to hallucinated facts and outputs, due to the varying quality of pretraining data.
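The division of labor is straightforward to sketch: recent dialogue turns live verbatim in the prompt (short-term memory), while older facts are embedded and fetched by similarity when needed (long-term memory). The `embed` function below is a stub returning deterministic random vectors; a real system would use a sentence-embedding model and a vector database such as Pinecone or FAISS:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stub embedding: a hash-seeded random vector. Replace with a real sentence encoder."""
    return np.random.default_rng(abs(hash(text)) % (2**32)).normal(size=128)

# fact -> embedding; in production this lives in a vector database.
long_term_memory = {}
for fact in ["User's home airport is SEA",
             "User prefers vegetarian restaurants",
             "User visited Tokyo in 2019"]:
    long_term_memory[fact] = embed(fact)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k facts with highest cosine similarity to the query embedding."""
    q = embed(query)
    scored = sorted(long_term_memory.items(),
                    key=lambda kv: -float(q @ kv[1] /
                                          (np.linalg.norm(q) * np.linalg.norm(kv[1]))))
    return [fact for fact, _ in scored[:k]]

recent_turns = ["User: book me a flight", "Assistant: to which city?"]   # short-term memory
prompt = "\n".join(recent_turns + retrieve("flight booking preferences"))
print(prompt)
```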

Memory in LLMs is a complex topic. Retrieving relevant information from extensive knowledge bases or past transactions poses significant challenges, especially when multiple historical events may be pertinent to a given query. For instance, the ‘Generative Agents’ paper from Stanford proposes a retrieval method based on a weighted combination of relevance and recency for simulated agent societies. Meanwhile, the ‘Cognitive Architectures for Language Agents’ paper by Sumers, Yao, and colleagues highlights the necessity for different memory types: long-term procedural memory, semantic memory for facts, and episodic memory for past interactions. There are ongoing efforts to understand human brain and memory functioning and to apply or augment these insights to agent architecture, developing efficient data structures for storing facts and events.
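In the spirit of the Generative Agents retrieval function, a memory's score can be computed as a weighted combination of its relevance to the query and how recently it was accessed (the paper also folds in an importance term). The weights and decay rate below are arbitrary placeholders:

```python
import time

def memory_score(relevance: float, last_access_ts: float,
                 w_relevance: float = 1.0, w_recency: float = 1.0,
                 decay_per_hour: float = 0.99) -> float:
    """Score a memory as a weighted sum of query relevance and an exponential recency decay."""
    hours_since_access = (time.time() - last_access_ts) / 3600.0
    recency = decay_per_hour ** hours_since_access   # 1.0 if just touched, decays toward 0
    return w_relevance * relevance + w_recency * recency

# A highly relevant but stale memory vs. a mildly relevant recent one.
old = memory_score(relevance=0.9, last_access_ts=time.time() - 72 * 3600)
new = memory_score(relevance=0.4, last_access_ts=time.time() - 1 * 3600)
print(f"old-but-relevant: {old:.2f}  recent-but-weak: {new:.2f}")
```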

A critical issue in this domain is determining the appropriate size and boundaries for information chunks. Overly large chunks may exceed LLMs’ context length limits, while incorrectly defined or too small chunks can result in information loss or the need for extensive processing. When faced with multiple relevant passages for a query, a common approach involves summarizing or applying a secondary level of filtering as a preliminary step to fit within the LLMs’ context window.
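A common baseline is fixed-size chunking with overlap, plus a summarization fallback when the retrieved passages would overflow the context window. The word budgets below are placeholders, and `summarize` would itself be an LLM call in practice:

```python
def chunk(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word windows so facts straddling a boundary survive."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        start += max_words - overlap          # slide the window, keeping `overlap` words of context
    return chunks

def fit_to_context(passages: list[str], budget_words: int, summarize) -> str:
    """If the retrieved passages exceed the context budget, compress them first."""
    joined = "\n\n".join(passages)
    if len(joined.split()) <= budget_words:
        return joined
    return summarize(joined)                  # secondary filtering / summarization step

# `summarize` would normally be another LLM call; stubbed here.
print(fit_to_context(chunk("word " * 1000), budget_words=300,
                     summarize=lambda t: t[:200] + " ...[summarized]"))
```

Chunk sizes and overlaps, however, only govern how much text fits into the prompt; they say nothing about how experiences should be organized for retrieval.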

For example, consider the question, “Have I ever visited this place?” This inquiry involves several layers: (i) identifying the current location (‘this’), (ii) determining the subject (‘I’), and (iii) searching through spatial information for past visits (‘ever visited’). Simple storage of experiences as text blobs is inadequate; a more structured approach, potentially involving a taxonomy or hierarchy, is necessary for effective information organization and retrieval.

Robotic Arms and APIs as actuators

In the realm of software agents, actuators can take the form of APIs and tools. For instance, a calculator API may be used for mathematical tasks, and knowledge graph queries can address factual questions. Additionally, software agents may simulate interactions like clicking, typing, and touch for seamless web and app navigation. In contrast, embodied agents, such as robots equipped with arms and legs, have physical counterparts to these APIs and tools, functioning as their actuators.

Utilizing tools and APIs is particularly crucial in addressing hallucination issues common in Large Language Models (LLMs). This is especially relevant for tasks like complex calculations, where, akin to human reliance on calculators, LLMs can benefit from API assistance. An example of this is HuggingGPT’s approach of providing LLMs access to every model in the Hugging Face hub as APIs for multi-step, higher-order tasks. Imagine a scenario where an LLM can access and utilize a full suite of AWS or Azure services as tools and APIs. This integration would enable the LLM to parse inputs and outputs for each tool, orchestrating them to accomplish complex tasks. The Toolformer paper discusses teaching LLMs to use tools, and more recent work involves creating tools on the fly and using them to solve complex problems.
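A bare-bones version of tool use looks like this: the LLM is shown a registry of tool descriptions, it emits a structured call, the agent executes it, and the result is fed back for a final answer. The JSON call format, the tools, and the hard-coded `fake_llm` are assumptions for illustration, not HuggingGPT's or Toolformer's actual protocol:

```python
import json

def calculator(expression: str) -> str:
    """Evaluate a math expression."""
    return str(eval(expression))              # toy tool; sandbox this in production

def weather(city: str) -> str:
    """Look up the weather for a city."""
    return f"Sunny in {city} (stubbed response)"

TOOL_REGISTRY = {"calculator": calculator, "weather": weather}

def fake_llm(prompt: str) -> str:
    """Stand-in for an LLM that decides which tool to call for the user's request."""
    return json.dumps({"tool": "calculator", "args": {"expression": "1299 * 1.1"}})

def run_with_tools(user_request: str) -> str:
    tool_descriptions = {name: fn.__doc__ for name, fn in TOOL_REGISTRY.items()}
    prompt = f"Tools: {tool_descriptions}\nRequest: {user_request}\nRespond with a JSON tool call."
    call = json.loads(fake_llm(prompt))
    result = TOOL_REGISTRY[call["tool"]](**call["args"])
    # A second LLM call would normally turn `result` into a natural-language answer.
    return f"Tool '{call['tool']}' returned: {result}"

print(run_with_tools("What is the price of a $1299 laptop with 10% tax?"))
```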

For instance, in digitizing paper bills or receipts, an LLM could invoke an OCR (Optical Character Recognition) tool for image-to-text conversion, followed by entity extraction for items and prices, and subsequently format the data into a desired layout, like a table, before logging it into a spreadsheet.
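Sketched as code, that receipt workflow is simply a chain of tool calls orchestrated around the LLM. Both `run_ocr` and `call_llm` below are stubs standing in for a real OCR engine (e.g. Tesseract) and a hosted model:

```python
import csv
import io

def run_ocr(image_path: str) -> str:
    """Stub OCR tool; swap in pytesseract or a cloud OCR API."""
    return "Coffee 4.50\nBagel 3.25\nTotal 7.75"

def call_llm(prompt: str) -> str:
    """Stub LLM; a real call would extract item/price entities from the OCR text."""
    return "item,price\nCoffee,4.50\nBagel,3.25"

def digitize_receipt(image_path: str) -> list[dict]:
    raw_text = run_ocr(image_path)                                        # step 1: image -> text
    table_csv = call_llm(f"Extract item,price rows as CSV:\n{raw_text}")  # step 2: entity extraction
    rows = list(csv.DictReader(io.StringIO(table_csv)))                   # step 3: structured layout
    # step 4: log to a spreadsheet or database (here we just return the rows)
    return rows

print(digitize_receipt("receipt.jpg"))
```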

LangChain is an innovative framework that provides developers and scientists with the necessary abstractions to build agents. It integrates various LLMs, memory databases, and both first-party (1P) and third-party (3P) APIs, along with essential tools like file readers and writers. LangChain also incorporates popular reasoning frameworks, such as ReACT, AutoGPT, and BabyAGI, facilitating the easy integration of different LLMs and APIs for diverse applications.
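For instance, a ReAct-style LangChain agent with a calculator tool can be wired up in a few lines. This uses the pre-1.0 LangChain API that was current when this post was written (the library has since evolved), and it assumes an OpenAI API key is configured in the environment:

```python
# pip install langchain openai   (pre-1.0 LangChain interface shown)
from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)                       # the LLM "brain"
tools = load_tools(["llm-math"], llm=llm)         # a calculator tool backed by the LLM
agent = initialize_agent(
    tools, llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,  # ReAct-style reasoning loop
    verbose=True,
)
agent.run("If a flight costs $420 per person, what do 5 people pay in total?")
```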

Concluding Thoughts

While Large Language Models (LLMs) have significantly advanced the prospects of Artificial General Intelligence (AGI) agents, key challenges remain, particularly in learning efficiency (or sample efficiency) and long-term memory management. The current training methods for LLMs, which involve trillions of tokens, stand in stark contrast to the comparatively limited data humans are exposed to by the age of 12. This disparity has sparked studies drawing parallels between LLM training and human learning, exploring the amount of data exposure and its implications. The issue of efficiently retrieving facts and experiences from memory is another area under active investigation. Although Retrieval-Augmented Generation (RAG) presents a promising approach, achieving precise retrieval continues to be a research focus.

In upcoming posts, I plan to delve into advanced topics such as multi-agent interactions and why they are the future of software architecture. Big tech is increasingly shifting from traditional software frameworks and architectures to those powered by LLMs. Andrej Karpathy likens this shift to the emergence of an ‘LLM OS’. This new paradigm integrates LLMs as the ‘brain’, databases for memory, sensors for perception, and motors or displays as actuators, culminating in a hardware-optimized LLM operating system. Additionally, Karpathy emphasizes the need for robust security measures within LLM applications, particularly to safeguard against prompt injection and hacking attempts aimed at manipulating LLMs to perform unintended actions.

