A Torrid Love Affair: From Frustration with OpenAI’s Black Box to Freedom in Open Cognitive Architectures

George Vetticaden
Sema4.ai
Published Aug 28, 2024 · 14 min read

In this second installment of the Agentic Mindset Journey blog series, we continue the story of Vitality Health AI, tracing its evolution from an early infatuation with OpenAI’s CustomGPT and Assistants API to the growing frustration that led me down a path of discovery. This post focuses on the pivotal moments captured in the first three milestones of my Agentic Mindset Journey: “From Love to Disillusionment,” “Beyond the Black Box,” and “Navigating Cognitive Architectures.” These stages, depicted in the timeline below, marked significant turning points as I transitioned from the initial success and eventual limitations of CustomGPT to embracing open cognitive architectures and, ultimately, exploring Vitality AI’s first open architecture, ReAct.

Initial Milestones in the Agentic Mindset Journey: This timeline captures the key phases of Vitality AI’s evolution, from the initial promise and limits of CustomGPT (Event 1) to embracing open architectures with LangGraph (Event 2) and exploring cognitive architectures with ReAct (Event 3)

In the first part of this two-part series, we delve into the initial phase of this journey — starting with the allure of CustomGPT, the challenges of its closed nature, and the eventual realization that a more transparent, flexible approach was needed. This realization set the stage for my exploration of the ReAct cognitive architecture, an initial attempt to overcome the limitations I faced.

The story of Vitality AI is one of evolution — from the promise of a simple solution to the realization that true advancement requires more. In Part 2 of this series, we will explore how these challenges led to the development of the Plan & Execute Cognitive Architecture, a structured approach that addresses the shortcomings of ReAct, setting the foundation for even more sophisticated, multi-agent systems.

A Promising Start with OpenAI’s CustomGPT

My journey with Vitality AI began with what seemed like the perfect solution — OpenAI’s CustomGPT. The problem I aimed to solve was straightforward: create an intuitive, conversational AI interface to analyze my Apple Health workout data. Frustrated by the Apple Health App’s limitations, I sought a more natural, seamless way to interact with my health metrics.

To bring this vision to life, I built the first incarnation of the Vitality Agent using CustomGPT. I crafted a data pipeline to transfer workout data from my Apple Health app to a health data lake powered by Atlas MongoDB. Leveraging Robocorp’s (now Sema4.ai) Action Server and Python, I developed a set of Workout and Health Metrics OpenAPI endpoints.
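To make the idea of these endpoints concrete, here is a toy sketch of what a workout-metrics query function might look like. This is purely illustrative, not the actual Action code from the project: the names are hypothetical, an in-memory list stands in for the MongoDB-backed health lake, and a real version would be exposed as an OpenAPI endpoint via the Action Server.

```python
from datetime import date

# Stand-in for the MongoDB health lake (hypothetical records).
HEALTH_LAKE = [
    {"date": date(2023, 5, 1), "type": "Running", "miles": 6.2, "calories": 610},
    {"date": date(2023, 6, 9), "type": "Cycling", "miles": 20.0, "calories": 540},
]

def list_workouts(workout_type: str, start: date, end: date) -> list[dict]:
    """Return workouts of one type within a date range."""
    return [
        w for w in HEALTH_LAKE
        if w["type"] == workout_type and start <= w["date"] <= end
    ]
```

An agent tool call such as `list_workouts("Running", date(2023, 1, 1), date(2023, 12, 31))` would then return the matching workout records for the LLM to reason over.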

As the system evolved, so did its scope. I expanded it to encompass a broader range of health data, including FHIR-based medication records, health conditions, surgeries, clinical vitals, lab results, immunizations, and allergies. Within three months, my data pipelines were fully operational, transforming my Health Lake into a comprehensive repository of my health information — all accessible via OpenAPI endpoints powered by Robocorp Actions (now Sema4.ai Actions).

As illustrated in the image below, this expanded system allowed for seamless access and interaction with my health data, turning my initial project into a robust health data analysis platform.

Building the Foundation: The integration of OpenAPI endpoints with Sema4.ai Actions laid the groundwork for Vitality AI, enabling seamless interaction with a comprehensive health data lake.

I vividly recall the first time I watched Vitality seamlessly execute a multi-hop query that triggered 8 different action endpoint calls, all perfectly orchestrated with precise pre-processing and post-processing steps, including generating Python code with Pandas using the Code Interpreter to create insightful graphs. The image below captures this experience, which was nothing short of mesmerizing.

Vitality AI in Action: The initial implementation of Vitality AI with OpenAI’s CustomGPT showcased how conversational AI could revolutionize health data analysis. It provided intuitive interactions and immediate insights, as demonstrated by using the Code Interpreter to generate visual representations like the graph above.

The interface was intuitive, and the results were immediate. I no longer needed to wrestle with dashboards; I had a conversational partner that could provide the insights I needed with a simple query. For the first time, I felt in control of my health data.

This new approach felt revolutionary — offering a seamless experience where complex data analysis was reduced to a simple conversation.

I was in love and eager to show everyone I knew — this felt like a game changer, or so I thought.

The Honeymoon Ends: Encountering Limitations

As the honeymoon phase faded, reality set in. My excitement turned to frustration as I pushed the system to handle more complex queries that required cross-entity analysis. I began asking more valuable, holistic questions — queries that spanned multiple health entities, such as medications, labs, workout metrics, and clinical notes from doctor visits. The questions were deeper and more meaningful:

  • “Can you analyze how my medication changes, lab results, and fitness activities influenced my overall health over the past decade?”
  • “Based on the top 5 major diseases that men in their mid-40s face, can you review my health over the last two years and identify areas of concern?”
  • “How have my exercise routines, dietary changes, and medication adjustments over the past 10 years influenced my cardiovascular health?”

Unfortunately, as I pushed the boundaries, I began to encounter significant limitations. The more actions and tools I provided to CustomGPT, the more it became prone to hallucinations and inconsistent results. For instance, when a query required cross-entity analysis across medications, lab results, procedures, and workout metrics, ChatGPT would claim to call all the relevant entities but would only actually engage with two or three. The rest of the data seemed to be fabricated, merged with real information in a way that made it difficult to trust the results.

Consider the following health query:

Based on the top 5 major diseases that men in their mid-40s face, can you review my health over the last two years and tell me what I should be most worried about and talk to my new doctor about?

ChatGPT responded with:

To provide a comprehensive review, I’ll begin by identifying the top five major diseases men in their mid-40s commonly face. Then, I’ll analyze your health data over the last two years to determine any relevant concerns. Based on the major diseases that men in their mid-40s typically face, I reviewed your recent lab results to identify any potential areas of concern. Here’s a summary of your lab results and what you should consider discussing with your new doctor…

The problem was clear: I needed ChatGPT to not just analyze my lab results but to integrate data across all the health entities it had access to — medications, workout metrics, clinical notes, and more. However, I found myself asking, “How can I effectively communicate my needs to ChatGPT? What control do I have over the LLM’s behavior?”

ChatGPT exposed only two levers: Instructions and Knowledge. I started with detailed instructions but quickly hit OpenAI’s character limits. To work around this, I created two key knowledge documents: the Comprehensive Health Data Action Guide, which detailed the rationale and objectives for each health-related action, and the Runbook for Vitality AI, which provided specific, step-by-step processes for executing these actions across various health modules like lab tests, medications, and workouts.

Limited Control, Limited Flexibility: Despite the detailed instructions and knowledge documents provided, OpenAI CustomGPT’s reliance on just two control knobs — Knowledge and Instructions — failed to meet the complex and flexible needs of Vitality AI.

As you can see in the above image, my instructions referenced these documents extensively, but after five days and a 40+ page runbook, I was no better off.

Even attempts to use OpenAI’s Assistants API, which offered the flexibility to build AI agents within applications, failed to resolve the core issue: the lack of fine-grained control over how the LLM planned, reasoned, and made decisions.

I had hit a wall — the very simplicity and ease of use that drew me to CustomGPT were now the relationship killers. It was a black box that traded off control for convenience, and this lack of transparency and flexibility meant it was no longer the right fit for my increasingly complex needs. I knew it was time to move on.

Discovering Open Cognitive Architectures

As I wrestled with the limitations of OpenAI’s CustomGPT, I realized that the very factors that made it easy to use were also its greatest weaknesses. The lack of transparency, the inability to fine-tune the underlying processes, and the unpredictability in handling complex queries were all major obstacles. My frustrations led me to search for an alternative — something that would offer the flexibility and control I needed to develop a more robust and reliable AI system. The closed nature of CustomGPT, operating as a black box, provided convenience but at the cost of flexibility and precision, which were essential for the increasingly complex needs of Vitality AI.

It was during this search that I stumbled upon Harrison Chase’s blog, “OpenAI’s Bet on a Cognitive Architecture.” In it, Harrison defines cognitive architecture as the orchestration framework for an LLM application, focusing on how context is provided to the application and how the application performs reasoning. His arguments against closed cognitive architectures resonated deeply with me, as they mirrored the challenges I was facing with Vitality AI.

One particular statement from Harrison’s blog stood out: “Much of the discourse around open-source vs closed-source revolves around the models themselves. But there is another element to consider: open-source cognitive architectures vs closed-source cognitive architectures.”

This insight was a turning point. Reflecting on Harrison’s comments, I realized that the debate between open and closed cognitive architectures could become even more significant for large enterprises than the current conversation around open vs. closed LLMs. My experience with Vitality AI underscored this, especially in the context of complex healthcare applications where the handling of sensitive health data is paramount.

The control over cognitive architecture is crucial because it directly influences how an AI system plans, reasons, and makes decisions. Throughout the development of Vitality AI, I encountered numerous challenges in ensuring that agents could accurately analyze and integrate data from various health entities. The closed nature of OpenAI’s CustomGPT and Assistants API meant that I couldn’t fine-tune or optimize the architecture to meet my specific needs. This lack of control led to unreliable and sometimes incomplete outcomes — frustrating, to say the least.

Moreover, the risks associated with being locked into a closed cognitive architecture extend far beyond those of cloud resources or LLMs. While cloud services and LLMs operate at the infrastructure level, cognitive architectures are integral to the functionality of an application. For large enterprises, relying on a closed, external provider for their cognitive architecture restricts their ability to adapt, customize, and evolve their systems to meet specific business needs. It also complicates integration with existing systems and can pose significant security and compliance risks, leading to strategic and operational vulnerabilities.

This realization marked a critical turning point in my journey towards embracing an Agentic Mindset. I understood that in the rapidly evolving landscape of AI applications, openness and control weren’t just desirable — they were essential. With this newfound clarity, I set out to find a solution that offered the flexibility and control I needed for Vitality AI, leading me to the LangGraph framework and the OpenAI GPT-like experience built using LangGraph, known as OpenGPTs.

Exploring Three Cognitive Architectures for Vitality AI

When building Vitality AI, a critical step was determining the right cognitive architecture to power the application’s capabilities. This journey involved creating three distinct agents, each utilizing a different cognitive architecture to identify the best fit for the task at hand. These architectures were built using OpenGPTs and LangGraph, which provided the necessary tools and flexibility to explore and refine each model.

Exploring Cognitive Architectures: This image illustrates the three cognitive architectures tested within Vitality AI: ReAct, Plan & Execute, and Multi-Agent Plan & Specialize, each providing unique approaches to optimize decision-making and task execution.

The image above shows the three Vitality AI agents created to evaluate different cognitive architectures.

  1. Single-Agent Assistant — ReAct: This was the default cognitive architecture in OpenGPTs and, at the time I was building Vitality AI, the only one available. This architecture allows arbitrary tool use with an LLM-driven decision-making process. It’s powerful due to its flexibility but can be unreliable because of its lack of structured planning.
  2. Single-Agent — Plan & Execute: This architecture introduces a more structured approach, where the agent plans its actions based on user-specific data and executes them in a sequential manner. This structure provides more reliability but might still struggle with highly complex tasks that require multi-faceted reasoning.
  3. Hierarchical Multi-Agent — Plan & Specialize: As the most sophisticated of the three, this setup coordinates multiple specialized agents to execute tasks collaboratively. Each agent operates under a central planner, ensuring that complex queries are dissected and addressed comprehensively. This architecture is designed to maintain contextual awareness and adapt to new information effectively. This was a custom agent architecture I developed.

OpenGPTs offers the comprehensive stack necessary for building sophisticated AI applications. This includes LLMs, knowledge management, instructions and prompts, and tools for interacting with external systems. What makes OpenGPTs particularly powerful is its flexibility — it allows developers to choose the most suitable LLMs and cognitive architecture for their specific use cases, whether it’s the out-of-the-box ReAct model or a custom-built hierarchical multi-agent system.

This flexibility is clearly evident in the following image, showcasing how these cognitive architectures and LLMs can be selected and tailored to meet the specific requirements of Vitality AI:

Comprehensive AI Customization: This image highlights the flexibility of OpenGPT, allowing developers to tailor their AI systems by selecting the most suitable LLM providers, cognitive architectures, and action endpoints. Whether using the out-of-the-box ReAct model or a custom hierarchical multi-agent system, developers can optimize for specific use cases, cost, and requirements.

Cracking Open the Black Box: Unlocking the Potential of Open Cognitive Architectures with ReAct

With this context set, I began my exploration of open cognitive architectures, starting with ReAct. The ReAct cognitive architecture integrates reasoning and action execution into a cohesive process. It operates by calling an LLM repeatedly in a loop. At each iteration, the agent determines which tools to call and the exact inputs to pass them. After executing the selected tools, the outputs are fed back into the LLM as observations for the next iteration. This process continues until the agent decides that no further tool calls are necessary. The following diagram illustrates the ReAct cognitive architecture used for Vitality.
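The loop just described can be sketched in a few lines of Python. This is a minimal illustration, not the actual Vitality AI implementation: the tool and LLM are stand-in stubs with hypothetical names, whereas the real system used LangGraph and a live model.

```python
def get_workouts(year):
    """Hypothetical tool: return workout summaries for a year."""
    return [{"type": "run", "miles": 412}] if year == 2023 else []

TOOLS = {"get_workouts": get_workouts}

def fake_llm(messages):
    """Stand-in for an LLM call: request a tool once, then answer."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_calls": [{"name": "get_workouts", "args": {"year": 2023}}]}
    return {"content": "You ran 412 miles in 2023."}

def react_loop(question, max_iters=5):
    messages = [{"role": "user", "content": question}]
    for _ in range(max_iters):
        response = fake_llm(messages)
        tool_calls = response.get("tool_calls")
        if not tool_calls:  # the agent decides no further tool calls are needed
            return response["content"]
        for call in tool_calls:  # execute the selected tools
            observation = TOOLS[call["name"]](**call["args"])
            # feed tool outputs back as observations for the next iteration
            messages.append({"role": "tool", "content": str(observation)})
    return "Stopped: iteration limit reached."
```

The essential shape is the same regardless of framework: call the LLM, execute whatever tools it requests, append the observations, and repeat until the model answers directly.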

Optimizing Task Execution with ReAct: This diagram illustrates how Vitality AI utilizes the ReAct cognitive architecture to break down and execute complex health queries. By integrating reasoning and action execution in a loop, the system can make parallel calls to multiple tools, optimizing performance with OpenAPI endpoints powered by Sema4.ai Actions.

This architecture allowed Vitality to break down complex tasks into manageable actions, which it executed in parallel if the LLM responded with multiple tool calls for a given prompt. This was a significant optimization over OpenAI’s CustomGPT, which only supported sequential tool invocations even if the LLM suggested multiple actions.
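The parallel dispatch described above can be sketched as follows. This is an assumed illustration using Python's standard thread pool, with hypothetical tool-call payloads; it is not the LangGraph internals, which handle this dispatch for you.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tool calls, as an LLM might return several in one response.
tool_calls = [
    {"name": "get_workouts", "args": {"workout_type": t}}
    for t in ["Running", "Cycling", "Swimming", "Hiking", "Rowing"]
]

def get_workouts(workout_type):
    """Hypothetical action endpoint: fetch workouts of one type."""
    return {"type": workout_type, "count": 0}

# Execute all of the LLM's tool calls concurrently rather than one at a time.
with ThreadPoolExecutor() as pool:
    observations = list(
        pool.map(lambda c: get_workouts(**c["args"]), tool_calls)
    )
```

When each tool call is a network round-trip to an action endpoint, running them concurrently rather than sequentially cuts the wall-clock time for a multi-tool step roughly to that of the slowest single call.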

To illustrate this optimization, consider the following query:

“How many miles did I run last year? Which was the longest from a distance perspective, and how many active calories did it burn? Did this run burn the most calories across ALL my workouts last year?”

Handling Multi-Hop Queries with ReAct: Vitality AI’s response to a complex query, analyzing workout types to find the most calorie-intensive activity.

The video demonstrates how Vitality AI handles this multi-hop query with both the CustomGPT and ReAct architectures, contrasting sequential action calls in CustomGPT with parallel execution in ReAct and illustrating the transparency and efficiency gained through open cognitive architectures.

The trace below highlights the optimization, showing the parallel execution of multiple calls to different workout types:

Optimizing Parallel Execution with ReAct: This LangSmith trace shows how the ReAct architecture enables Vitality AI to break down a multi-hop query into parallel calls across different workout types, highlighting the transparency and efficiency gained with open cognitive architectures.

This trace demonstrates how the ReAct architecture breaks down the query into five parallel calls to various workout types. This capability showcases the power of open architectures, allowing developers to debug, trace, and optimize the AI’s decision-making process. This level of transparency and control is impossible to achieve with OpenAI’s CustomGPT, which operates as a closed black box.

By using an open architecture like LangGraph, developers can gain insights into the internal workings of their AI systems. They can see exactly how decisions are made, how actions are executed, and where optimizations can be applied. This ability to “crack open the black box” is crucial for developing sophisticated, reliable AI applications like Vitality.

Limitations of the ReAct Architecture

Despite demonstrating parity with the results obtained using ChatGPT and even optimizing processing time, running Vitality AI on the ReAct architecture revealed significant limitations. Although the parallel processing capability was a powerful optimization, it did not address more complex cross-entity queries effectively. The architecture’s lack of formal planning and reasoning structures became apparent when attempting to analyze and correlate data across multiple health entities.

The following two sections highlight these issues.

Inadequate Planning for Complex Queries:

Consider the following query using the ReAct cognitive architecture with the OpenAI GPT-4 Turbo model:

Based on the top 5 major diseases that men in their mid-40s face, can you review my health over the last two years and tell me what I should be most worried about and discuss with my new doctor?

The image below shows that GPT-4 Turbo creates a plan but focuses primarily on lab results, neglecting other critical aspects such as medication adherence, fitness data, and lifestyle changes. This narrow focus demonstrates the ReAct architecture’s limitation in providing a comprehensive health review.

Limitations in Complex Query Planning: This image shows GPT-4 Turbo’s plan for a complex health query, highlighting its narrow focus on lab results while neglecting other critical aspects like medication adherence and fitness data, demonstrating the limitations of the ReAct architecture in comprehensive health analysis.

The LangSmith trace confirmed that the LLM used only lab results to answer this query, resulting in an incomplete and less insightful response.

Incomplete Query Resolution with ReAct: This LangSmith trace illustrates how the ReAct architecture, using GPT-4 Turbo, focused solely on lab results for a complex health query, leading to an incomplete and less insightful response.

Here is Vitality’s response to the query above, which highlights how inadequate planning focused only on lab results resulted in an incomplete and less insightful response.

Incomplete Health Analysis Due to Narrow Focus: This image illustrates how Vitality AI’s response, limited to lab results, resulted in an incomplete and less insightful health summary, missing key aspects like medications and lifestyle factors.

Improved Planning but Execution Challenges with GPT-4o

Would updating to OpenAI’s latest model, GPT-4o, while still using the ReAct architecture, improve the response to complex health queries? The following diagram provides the answer to this question.

Improved Planning but Execution Gaps with GPT-4o: Updating to GPT-4o in the ReAct architecture led to better planning, incorporating more comprehensive health data, but execution challenges remained, including incomplete data retrieval and deviations from the plan.

Upon updating to GPT-4o, the initial plan generated was indeed more comprehensive compared to the earlier version, GPT-4 Turbo. The newer model produced a plan that incorporated a broader range of health data, indicating a significant improvement in the planning stage.

Despite the improved planning, the LangSmith trace logs revealed two issues in the execution phase: incomplete data retrieval and deviation from the initial plan.

For example, the LLM decided to call two actions:

  1. get_entire_medication_history with the specified date range.
  2. list_lab_tests_by_category.

While the medication history was successfully retrieved, the lab test codes fetched were not followed up with the necessary calls to retrieve the actual lab results. The ReAct architecture often fails to fully execute the initial plan for complex cross-entity health queries, especially those requiring multiple steps. The lack of an explicit planning step in the ReAct architecture caused the LLM to deviate from its initial plan, resulting in incomplete analysis and recommendations.

These examples underscore the limitations of the ReAct architecture when handling more sophisticated, multi-step queries. While the updates with GPT-4o improved some aspects of planning, the overall cognitive framework remained a bottleneck. This highlights the pressing need for a more robust solution — one that incorporates structured planning and execution to address these challenges comprehensively.

The Next Step in the Agentic Mindset Journey: Towards a More Structured AI Future

As I explored the ReAct cognitive architecture, I was impressed by its ability to break down complex tasks and execute them in parallel. However, its limitations became evident when dealing with intricate, multi-entity queries. While ReAct was a significant step forward from the closed systems I had previously worked with, it became clear that a more structured approach was necessary to meet the growing demands of Vitality AI.

ReAct provided valuable insights, but the increasing complexity of Vitality AI highlighted the need for an architecture that could plan more effectively and execute with greater precision. In the next part of this journey, we’ll delve into the Plan & Execute cognitive architecture — a more structured approach designed to address the challenges that ReAct encountered. This architecture introduces explicit planning and replanning steps, ensuring that even the most complex health queries are handled with accuracy and thoroughness.

I’m excited to share that Part 2 of this blog will be coauthored with my engineering counterpart at Sema4.ai, Sunil Govindan, who will bring an engineering perspective on the lessons learned during our transition. One key takeaway, as Sunil will discuss, was the realization that as we moved beyond OpenGPT, we needed to build an enterprise AI agent platform — one that is not just another AI playground for toy applications, but a robust, scalable system engineered for mission-critical, enterprise-scale AI deployments, capable of handling the scale, complexity, and customization required by enterprises. Vitality AI, with its realistic use case complexity, provided the insights necessary to guide the development of this platform, ensuring it meets the needs of enterprise customers.

Stay tuned for Part 2, “From Frustration to Clarity: Embracing the Plan & Execute Cognitive Architecture,” which we’ll publish next week.
