Building Robust AI Agents: Insights from the CoALA Paper

Vaibhav Pandey
Published in Zoom in Zoom-out · Oct 11, 2023

Prototyping AI agents is straightforward, but building AI agent systems that are scalable and reliable is a much harder venture, largely because designing an agent system is akin to crafting an actual thinking system.

Over the past six months, my endeavours in building AI agents with libraries like Langchain and Autogen, and with no-code tools like n8n and Fine Tuner, have reinforced my belief that decisions about how you organise your design and code are crucial.

The recent paper titled Cognitive Architectures for Language Agents (CoALA) from the team at Princeton, some of whom were part of the ReAct agent paper, addresses this challenge and presents invaluable insights on designing language agents.

This paper resonated deeply with my experiences, especially since we had run into many of the issues it discusses during our own projects.

What CoALA offers is a clear, systematic, and modular representation of a thinking mechanism that can be built from existing LLM (large language model) components. It organises the architecture around four fundamental components: memory, grounding, decision making, and learning, each of which can then be engineered separately.

Here’s a closer examination of these ideas:

1. Working Memory:
Working memory maintains active and readily available information as symbolic variables for the current decision cycle. In most projects where multiple calls to an LLM are made, the working memory comprises the conversation history, context retrieved from long-term memory, and any other state that must persist through the current decision cycle.
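
As a minimal sketch (not code from the paper), working memory can be a plain container whose per-cycle fields are cleared between decision cycles while the conversation persists; the field names here are my own assumptions:

```python
# A minimal sketch of working memory as a plain dataclass; field names are
# illustrative, not prescribed by the CoALA paper.
from dataclasses import dataclass, field

@dataclass
class WorkingMemory:
    conversation: list[str] = field(default_factory=list)       # recent dialogue turns
    retrieved_context: list[str] = field(default_factory=list)  # items pulled from long-term memory
    scratchpad: dict = field(default_factory=dict)               # intermediate variables for this cycle

    def reset_cycle(self) -> None:
        """Clear per-cycle state while keeping the conversation."""
        self.retrieved_context.clear()
        self.scratchpad.clear()

wm = WorkingMemory()
wm.conversation.append("user: Review this CV for ATS friendliness.")
wm.retrieved_context.append("Target company uses an ATS that prefers plain-text CVs.")
```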

2. Past Memory:
CoALA categorises past memory into episodic and semantic memories. Episodic memory acts as the logs of previous decision cycles, while semantic memory holds the facts learned in the process. Although no specific data structure is recommended, a vector database, a graph database, or a combination of the two could be employed depending on the use case and the type of information.
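
A hedged sketch of that split, using plain in-memory lists where a real system might use a vector or graph database; the class and method names are mine, not the paper's:

```python
# Episodic entries log past decision cycles; semantic entries hold distilled facts.
from dataclasses import dataclass, field
import time

@dataclass
class Episode:
    timestamp: float
    observation: str
    action: str
    outcome: str

@dataclass
class LongTermMemory:
    episodic: list[Episode] = field(default_factory=list)
    semantic: list[str] = field(default_factory=list)  # learned facts

    def log_episode(self, observation: str, action: str, outcome: str) -> None:
        self.episodic.append(Episode(time.time(), observation, action, outcome))

    def add_fact(self, fact: str) -> None:
        if fact not in self.semantic:
            self.semantic.append(fact)

    def search(self, query: str) -> list[str]:
        # naive keyword match standing in for a vector-similarity lookup
        return [f for f in self.semantic if query.lower() in f.lower()]
```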

3. Procedural Memory:
This framework subdivides procedural memory into implicit and explicit memory. Implicit procedural memory is built into the language model and can be altered by modifying its weights or the model itself. Conversely, explicit knowledge resides in the agent’s code, like the prompt templates or decision logic. This is something I learned after building a few AI agent apps: it is very difficult to improve agent code when the decision logic is intertwined with items that belong in procedural memory, such as prompt templates.
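
One simple way to keep them apart, sketched below with illustrative template names of my own: decision logic only selects a template key, while the wording lives in one place that can be edited independently.

```python
# Explicit procedural memory (prompt templates) kept separate from decision logic.
PROMPTS = {
    "ats_check": "You are a recruiting assistant. Assess whether this CV is ATS friendly:\n{cv_text}",
    "summarise": "Summarise the following CV in three bullet points:\n{cv_text}",
}

def build_prompt(task: str, **kwargs) -> str:
    """Decision logic only picks a template key; the wording lives in PROMPTS."""
    return PROMPTS[task].format(**kwargs)

print(build_prompt("ats_check", cv_text="...CV contents..."))
```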

4. Decision Logic/Main.py:
Language agents choose what to do through a repeated decision cycle. In each cycle they use reasoning and retrieval steps to plan, select a specific action, and carry it out; the action may change something in the external world or in the agent’s long-term memory. CoALA’s suggestion, as I understand it, is to keep this decision logic in one place, such as the main.py of your project. The loop runs continuously, taking in new information and deciding on actions accordingly. Each cycle is split into two phases: planning, where the program proposes possible actions and weighs them up, and execution, where it carries out the chosen action before starting the next loop.
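
Here is a minimal sketch of such a loop. The `propose_actions`, `score_action`, and `execute` functions are hypothetical stand-ins for your own reasoning/retrieval, evaluation, and grounding code, not an interface defined by CoALA:

```python
def propose_actions(working_memory: dict) -> list[str]:
    # planning phase, step 1: propose candidate actions
    return ["retrieve_more_context", "answer_user"]

def score_action(action: str, working_memory: dict) -> float:
    # planning phase, step 2: weigh the candidates (toy scoring)
    return 1.0 if action == "answer_user" else 0.5

def execute(action: str, working_memory: dict) -> None:
    # execution phase: carry out the chosen action (external or internal)
    working_memory.setdefault("log", []).append(action)

def decision_loop(working_memory: dict, max_cycles: int = 3) -> None:
    for _ in range(max_cycles):
        candidates = propose_actions(working_memory)
        best = max(candidates, key=lambda a: score_action(a, working_memory))
        execute(best, working_memory)

wm = {"conversation": ["user: review my CV"]}
decision_loop(wm)
print(wm["log"])
```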

5. Library of Agents:
Instead of creating multiple agents for each use case, the paper advocates building an enterprise-wide library of agents and then instantiating the right agent whenever it is needed.
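
A toy sketch of what such a library could look like, assuming a simple registry pattern of my own choosing (the names are illustrative):

```python
# Agents are registered once and instantiated per use case.
AGENT_REGISTRY: dict[str, type] = {}

def register(name: str):
    def wrap(cls):
        AGENT_REGISTRY[name] = cls
        return cls
    return wrap

@register("cv_reviewer")
class CVReviewAgent:
    def run(self, cv_text: str) -> str:
        return f"Reviewed CV of length {len(cv_text)}"

# instantiate an agent on demand rather than rebuilding it per project
agent = AGENT_REGISTRY["cv_reviewer"]()
print(agent.run("...CV contents..."))
```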

Moreover, the framework discusses intriguing ideas for future directions:

1. Integrating Decision Making and Retrieval:
The connection between the quality and relevance of retrieval and the resulting decision making remains understudied; it was a key challenge I encountered while designing a CV review agent. For instance, depending on the task item retrieved, how should the agent decide whether to enquire further or start executing (exploration vs exploitation)? When reviewing a CV, if the retrieved task is “evaluate if the CV is ATS friendly”, how does the agent know it may ask for more knowledge before evaluating, and how deep should it go?
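
One crude heuristic I have experimented with, sketched here with a placeholder scoring function (not from the paper): act only when the best retrieved item clears a relevance threshold, otherwise go and ask for more knowledge.

```python
def relevance(item: str, task: str) -> float:
    # stand-in for a similarity score from your retriever
    return 0.9 if "ATS" in item and "ATS" in task else 0.3

def choose_next_step(task: str, retrieved: list[str], threshold: float = 0.7) -> str:
    if not retrieved or max(relevance(i, task) for i in retrieved) < threshold:
        return "ask_for_more_knowledge"   # exploration
    return "start_executing"              # exploitation

print(choose_next_step("evaluate if the CV is ATS friendly",
                       ["ATS systems parse plain-text CVs best"]))
```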

2. Updating Memory:
The notion of updating the episodic, semantic, and even the procedural memory with collected experience could be a significant improvement over current implementations. While I am not aware of any LLM observability solution provider offering this, Langchain could extend the agent module to enable some automated learning through Langsmith.
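
As a hedged sketch of what an end-of-episode update might look like in plain code (the function and its naive distillation rule are my own, standing in for an LLM-driven reflection step):

```python
def end_of_episode_update(episodic: list, semantic: list,
                          observation: str, action: str, outcome: str) -> None:
    # append the raw experience to episodic memory
    episodic.append({"observation": observation, "action": action, "outcome": outcome})
    # naive distillation rule: promote successful experience into a semantic fact
    if "succeeded" in outcome:
        fact = f"'{action}' worked for: {observation}"
        if fact not in semantic:
            semantic.append(fact)

episodic, semantic = [], []
end_of_episode_update(episodic, semantic,
                      "CV missing a skills section", "flag_missing_section",
                      "user succeeded in fixing it")
print(semantic)
```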

3. Beyond Prompt Engineering:
The paper encourages moving beyond prompt engineering to enhance the reasoning ability of LLMs, suggesting structured output parsing as a way to update working memory and to design better prompts. It also notes that reasoning use cases could reveal where fine-tuning LLMs is needed.
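
A small sketch of structured output parsing feeding working memory: ask the model for JSON, validate it, and merge it in. The `call_llm` function and the field names are assumptions for illustration, not an API from the paper.

```python
import json

def call_llm(prompt: str) -> str:
    # placeholder for a real model call; returns a canned JSON response here
    return '{"is_ats_friendly": false, "missing_sections": ["skills"]}'

def update_working_memory(working_memory: dict, prompt: str) -> dict:
    raw = call_llm(prompt + "\nRespond with JSON only.")
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = {"parse_error": raw}   # keep the raw text so the next cycle can retry
    working_memory.update(parsed)
    return working_memory

print(update_working_memory({}, "Evaluate if the CV is ATS friendly."))
```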

4. Mixing Language-based Reasoning and Code-based Planning:
One compelling example is an agent that dynamically writes code for a simulation in order to acquire new knowledge, which is then used in the decision cycle and added to memory.
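
A toy illustration of that pattern, with the “generated” code hard-coded for simplicity and run with no builtins as a very crude sandbox; a real agent would need proper review and isolation of model-written code:

```python
generated_code = """
def simulate(offer_count, accept_rate):
    return offer_count * accept_rate
"""

namespace: dict = {}
exec(generated_code, {"__builtins__": {}}, namespace)  # crude sandbox: no builtins available
expected_hires = namespace["simulate"](offer_count=10, accept_rate=0.6)
print(expected_hires)  # new knowledge the agent can add to memory
```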

5. Evolution of GPT/LLMs:
Though no specific ideas are proposed, the paper highlights that a CoALA design could facilitate system updates and removal of obsolete components as new capabilities are unlocked in LLMs.

In summation, I think that the CoALA framework can be a strong foundation for systematically thinking about and engineering robust language agents. It is a great way to organise your project and I am genuinely excited about the work this team is doing.
