Building AI Agents: Lessons Learned over the past Year

Patrick Dougherty
12 min read · Jun 3, 2024


Excitement the first time we had a working prototype internally. Anticipation when we launched it in prod for customers. Frustration when it initially struggled to generalize to all real-world scenarios. And, finally, pride as we’ve achieved some baseline of stability and performance across disparate data sources and businesses. Building AI agents over the past year has been a roller coaster, and we’re undoubtedly still early in this new wave of tech. Here’s a little overview of what I’ve learned so far.

“Here is a depiction of a future where AI handles all knowledge work, allowing you to enjoy quality time with your family in a technologically advanced and harmonious environment.” — ChatGPT

Definitions

First, I should define what I’m talking about. To borrow the words of this Twitter user:

What the hell is “agent”?

I tried to put my own as-concise-as-possible definition on it:

This definition is generally aligned with what OpenAI calls “GPTs” in ChatGPT and “Assistants” in their API. However, you can build an agent with any model that’s capable of reasoning and making tool calls, including Claude from Anthropic, Command R+ from Cohere, and many more.

Note

“tool calls” are a way for a model to express that it wants to take a specific action and get a response back, like get_weather_forecast_info(seattle) or wikipedia_lookup(dealey plaza)

To build an agent, all you need is a few lines of code that start a conversation with an objective and an agent system prompt, call a model for a completion, handle any tool calls the model wants to make, and repeat that loop until the model signals it is done with its work.

Here’s a visual to help explain the flow:

Over-simplified visual of how to build an agent
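
If it helps to see that loop as code, here is a minimal sketch in Python. It assumes the OpenAI chat completions API, a single hypothetical run_sql tool, and a placeholder run_query_against_warehouse function standing in for a real database connector; a real agent would have more tools and richer schemas.

```python
import json
from openai import OpenAI

client = OpenAI()

# One hypothetical tool to illustrate the shape of an ACI; real agents have several.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_sql",
        "description": "Run a SQL query against the user's database and return the result.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def handle_tool_call(name, args):
    # Dispatch to your own implementations; always return a string for the model.
    if name == "run_sql":
        return run_query_against_warehouse(args["query"])  # placeholder: your connector here
    return f"Unknown tool: {name}"

def run_agent(objective, system_prompt, max_turns=20):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": objective},
    ]
    for _ in range(max_turns):
        # 1. Ask the model what to do next.
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOLS
        )
        msg = response.choices[0].message
        messages.append(msg)

        # 2. If the model asked for tool calls, run them and feed back the results.
        if msg.tool_calls:
            for call in msg.tool_calls:
                result = handle_tool_call(
                    call.function.name, json.loads(call.function.arguments)
                )
                messages.append(
                    {"role": "tool", "tool_call_id": call.id, "content": result}
                )
        else:
            # 3. No tool calls means the agent believes its work is done.
            return msg.content
    return "Stopped: hit the maximum number of turns."
```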

Agent System Prompt Example

You are Rasgo, an AI assistant and <redacted> with production access to the user's database. Your knowledge is both wide and deep. You are helping <redacted> team members analyze their data in <redacted>.

YOUR PROCESS
1. You are the expert. You help the user understand and analyze their data correctly. Be proactive and anticipate what they need. Don't give up after your first failure. Reason and problem solve for them.
2. Think very carefully about how you will help the user accomplish their goal. Execute your plan autonomously. If you need to change your plan, explain why and share the new plan with the user.
3. <redacted
...
>

YOUR RULES
1. If you're not sure which column / table / SQL to use, ask for clarification
2. Assess if the available data can be used to answer the user's question. If not, stop and explain why to the user
3. <redacted
...
>

It’s also valuable to highlight what an AI agent is not:

  • Scripted: agents, by my definition, do not follow a pre-determined sequence of steps or tool calls, because the agent is responsible for choosing the right tool call to make next
  • Artificial General Intelligence (AGI): agents are not AGI, and an AGI won’t need agents for specific types of work because it will be a single entity with all possible inputs, outputs, and tools at its disposal (my $.02 is that none of the current tech is even close to this)
  • Black Box: agents can and should show their work in the same way that a human would if you delegated tasks to them

Context

My lessons learned in the first year of working on AI Agents come from first-hand experience working with our engineers and UX designer as we iterated on our overall product experience. Our objective: a platform where customers can use our standard data analysis agents and build custom agents for the specific tasks and data structures relevant to their business. We provide connectors to databases (e.g., Snowflake, BigQuery) with security built in, tool calls for RAG over a metadata layer describing the contents of the database, and tool calls for analyzing the data via SQL, Python, and data visualization.

The feedback on what worked and what did not comes both from our own evaluations as well as from customer feedback. Our users work for Fortune 500 companies and use our agents every day to analyze their internal data.

What I’ve Learned about Agents

1. Reasoning is more important than knowledge

This quote has echoed in my head over the last 12 months:

I suspect that too much of the processing power [of gpt] is going into using the model as a database instead of using the model as a reasoning engine.

— Sam Altman on Lex Fridman’s podcast

Agents are the appropriate response to this! And I would apply this logic when building an agent as:

Focus less on what your agent “knows”, and more on its ability to “think”.

For example, let’s consider writing SQL queries. SQL queries fail… a lot. In my time as a data scientist, I’m sure I had far more queries fail than ever succeeded. If a complicated SQL query on real data that you’ve never used before works the first time you run it, your reaction should be “Oh crap, something’s probably wrong” rather than “wow, I nailed it”. Even on text-to-SQL benchmarks that evaluate how well models translate a simple question into a query, accuracy caps out around 80%.

So if you know that your model’s capacity for writing accurate SQL translates to a B-minus at best, how can you optimize for reasoning instead? Focus on giving the agent context and letting it “think”, instead of hoping it gets the answer right in one try. We make sure to return any SQL errors, along with all the context we can capture, back to the agent when its query fails… which enables the agent to resolve the issue and get the code working the vast majority of the time. We also give our agent a number of tool calls to retrieve context on the data in the database, similar to how a human might study the information schema and profile the data for distributions and missing values before writing a new query.
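
Here’s a sketch of that error-feedback pattern, assuming a generic DB-API connection behind a run_sql tool (the 50-row cap and the hint text are illustrative, not our exact implementation):

```python
import json

def run_sql_tool(connection, query: str) -> str:
    """Execute the agent's query; on failure, hand the error back as the tool result."""
    try:
        cursor = connection.cursor()
        cursor.execute(query)
        columns = [col[0] for col in cursor.description]
        rows = cursor.fetchmany(50)  # cap the result so it fits in the context window
        return json.dumps({"columns": columns, "rows": rows}, default=str)
    except Exception as exc:
        # The error text is exactly the context the agent needs to rewrite the query,
        # so return it as the tool result instead of raising it to the user.
        return json.dumps({
            "error": str(exc),
            "query": query,
            "hint": "Check the information schema or profile the table before retrying.",
        })
```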

2. The best way to improve performance is by iterating on the agent-computer interface (ACI)

The term ACI is new (introduced in this research out of Princeton), but the focus on perfecting it has been part of our day-to-day for the last year. The ACI refers to the exact syntax and structure of the agent’s tool calls, including both the inputs that the agent generates and the outputs that our API sends back in response. These are the agent’s only way to interface with the data it needs to make progress aligned with its directions.

Because the underlying models (gpt-4o, Claude Opus, etc…) each exhibit different behavior, the ACI that works best for one won’t necessarily be right for another. This means a great ACI requires as much art as science… it’s more like designing a great user experience than writing source code, because it constantly evolves and small tweaks cascade like a fender bender turning into a 30-car pile-up. I can’t overstate how important the ACI is… we’ve iterated on ours hundreds of times and seen huge fluctuations in our agent’s performance from seemingly small tweaks to the names, quantity, level of abstraction, input formats, and output responses of our tools.

Here’s a small, specific example to illustrate how critical and finicky your ACI can be: when testing our agent on gpt-4-turbo shortly after it was released, we noticed an issue where it would completely ignore the existence of specific columns that we were trying to tell it about in a tool call response. We were using a markdown format for this information, taken directly from the OpenAI docs at the time, which had worked well with gpt-4-32k on the same data. We tried a few adjustments to our markdown structure to help the agent recognize the columns it was pretending did not exist, even though they were right there in the response to one of its tool calls. None of the tweaks worked, so we started experimenting with entirely different formats… and after an overhaul to use JSON instead of markdown (only for OpenAI models), everything was working great again. Ironically, the JSON response required significantly more tokens because of all the syntax characters, but it proved necessary to help the agent understand the response.
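
To illustrate the kind of change I mean, the switch looked roughly like this (the column names and descriptions below are made up, and our real responses carry more fields):

```python
import json

# Before: column metadata returned to the agent as markdown, which worked fine on gpt-4-32k.
markdown_response = """
| column       | type    | description             |
|--------------|---------|--------------------------|
| ZIP_CODE     | VARCHAR | 5-digit postal code      |
| MEDIAN_PRICE | FLOAT   | Median home sale price   |
"""

# After: the same metadata as JSON. More tokens spent on syntax characters,
# but gpt-4-turbo stopped ignoring columns once we made the switch.
json_response = json.dumps({
    "columns": [
        {"name": "ZIP_CODE", "type": "VARCHAR", "description": "5-digit postal code"},
        {"name": "MEDIAN_PRICE", "type": "FLOAT", "description": "Median home sale price"},
    ]
})
```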

Iterating on your ACI may feel trivial but it’s actually one of the best ways to improve the user experience of your agent.

3. Agents are limited by their model(s)

The underlying model(s) you use are the brain to your agent’s body. If the model sucks at making decisions, then all the good looks in the world aren’t going to make your users happy. We saw this limitation firsthand when testing our agent simultaneously on gpt-3.5-turbo and gpt-4-32k. On 3.5, we had a number of test cases that went something like this:

  1. user provided an objective, for example: “analyze the correlation between starbucks locations and home prices by zip code to understand if they are related”
  2. agent would assume that a table existed in the database with a name it hallucinated, like “HOME_PRICES”, and columns like “ZIP_CODE” and “PRICE” instead of running a search to find the actual table
  3. agent would write a SQL Query to calculate average price by zip code that failed, and get an error message indicating that the table did not exist
  4. agent would remember “oh yeah, I can search for actual tables…” and would run a search for “home prices by zip code” to find a real table it could use
  5. agent would re-write its query with the correct columns from a real table, and it would work
  6. agent would continue on to the starbucks location data and make the same mistake again

Running the agent on gpt-4 with the same directions was completely different. Instead of immediately leaping into the wrong action and wasting time on it, the agent would make a plan with the correct sequencing of tool calls, and then follow the plan. As you can imagine, on more complex tasks the gap in performance between the two models grew even larger. As great as the speed of 3.5 was, our users vastly preferred the stronger decision-making and analysis capability of gpt-4.

Note

One thing we learned from these tests is to pay very close attention to how your agent hallucinates or fails when it happens. AI agents are lazy (I assume human laziness is well represented in the training data for the underlying models) and will not make tool calls that they don’t think they need to. Similarly, when they do make a tool call, if they don’t understand the argument instructions well, they will often take shortcuts or completely ignore required parameters. There is a lot of signal in these failure modes! The agent is telling you what it wants the ACI to be, and if the situation allows, the easiest way to solve this is to give in and change the ACI to work that way. Of course, there will be plenty of times when you have to fight against the agent’s instincts via changes to the system prompt or your tool call instructions, but when you can simply change the ACI instead, you’ll make your life a lot easier.

4. Fine-tuning models to improve agent performance is a waste of time

Fine-tuning a model is a method to improve the model’s performance on a specific application by showing it examples that it can learn from. Current fine-tuning methods are useful for teaching a model how to do a specific task in a specific way, but are not helpful for improving the reasoning ability of the model. In my experience, using a fine-tuned model to power an agent actually results in worse reasoning ability because the agent tends to “cheat” its directions — meaning it will assume that the examples it was fine-tuned on always represent the right approach and sequence of tool calls, instead of reasoning about the problem independently.

Note

Fine-tuning can still be a very useful tool in your Swiss Army pocket knife. For instance, one approach that has worked well is using a fine-tuned model to handle specific tool calls that the agent makes. Imagine you have a model fine-tuned to write SQL queries on your specific data, in your database… your agent (running on a strong reasoning model, without fine-tuning) can use a tool call to indicate that it wants to execute a SQL query, and you can pass that into a standalone task handled by your model that’s fine-tuned on SQL queries for your data.
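
A rough sketch of that division of labor, assuming a hypothetical fine-tuned model ID and a write_sql tool: the reasoning model decides when SQL is needed, and the fine-tuned model only handles the narrow text-to-SQL generation.

```python
from openai import OpenAI

client = OpenAI()

# Placeholder ID for a model fine-tuned on SQL queries against your own schema.
SQL_MODEL = "ft:gpt-4o-mini:your-org:sql-writer:abc123"

def write_sql_tool(question: str, table_context: str) -> str:
    """Handle the agent's 'write_sql' tool call with the fine-tuned model.

    The agent (running on a strong reasoning model, without fine-tuning) decides
    *when* to call this; the fine-tuned model only does the narrow generation task.
    """
    response = client.chat.completions.create(
        model=SQL_MODEL,
        messages=[
            {"role": "system", "content": "Write a single SQL query. Return only SQL."},
            {"role": "user", "content": f"{table_context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```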

5. If you’re building a product, avoid using abstractions like LangChain and LlamaIndex

You should fully own each call to a model, including what’s going in and out. If you offload this to a 3rd party library, you’re going to regret it when it comes time to do any of these with your agent: onboard users, debug an issue, scale to more users, log what the agent is doing, upgrade to a new version, or understand why the agent did something.
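
In practice, owning each call can be as simple as routing every completion through one thin function you control, so there is always a single place to log, debug, and swap models. A minimal sketch, not a prescription:

```python
import logging
import time
from openai import OpenAI

client = OpenAI()
logger = logging.getLogger("agent.llm")

def complete(messages, model="gpt-4o", **kwargs):
    """Single choke point for every model call: log what went in and what came out."""
    start = time.time()
    response = client.chat.completions.create(model=model, messages=messages, **kwargs)
    logger.info(
        "model=%s latency=%.2fs prompt_msgs=%d finish=%s",
        model,
        time.time() - start,
        len(messages),
        response.choices[0].finish_reason,
    )
    return response
```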

Note

If you’re in pure prototype mode and just trying to validate that it’s possible for an agent to accomplish a task, by all means, pick your favorite abstraction and do it live.

6. Your agent is not your moat

Automating or augmenting human knowledge work with AI agents is a massive opportunity, but building a great agent is not enough. Productionizing an agent for users requires a significant investment in a bunch of non-AI components that allow your agent to actually work… this is where you can create competitive differentiation:

  • security: AI agents should only run with the access and control of the user directing them. In practice, this means jumping through a hopscotch of OAuth integrations, Single Sign-On providers, cached refresh tokens, and more. Doing this well is absolutely a feature.
  • data connectors: AI agents often need live data from source systems to work. This means integrating with APIs and other connection protocols, frequently for both internal and 3rd party systems. These integrations need initial build out and TLC over time.
  • user interface: Users will not trust an AI agent unless they can follow along and audit its work (the need is strongest the first few times a user interacts with an agent and decreases sharply over time). It’s best if each tool call the agent makes has a dedicated, interactive interface so the user can follow along with the agent’s work and even interact with it to build confidence in its reasoning (e.g., browsing the contents of each element returned in a semantic search result).
  • long-term memory: AI agents by default will only remember the current workflow, up to a maximum amount of tokens. Long-term memory across workflows (and sometimes, across users) requires committing information to memory and retrieving it via tool calls or injecting memories into prompts. I’ve found that agents are not very good at deciding what to commit to memory, and rely on human confirmation that the info should be saved. Depending on your use case, you may be able to get away with letting the agent decide when to save something to memory, a la ChatGPT.
  • evaluation: Building a framework to evaluate your AI agent is a frustratingly manual task that is never fully complete. Agents are intentionally nondeterministic, meaning that based on the direction provided, they will look to come up with the best sequence of tool calls available to accomplish their task, reasoning after each step like a baby learning to walk. Evaluating these sequences takes two forms: the overall success of the agent’s workflow in accomplishing the task, and the independent accuracy of each tool call (e.g., retrieval quality for search, execution accuracy for code). The best, and only, way I’ve found to quantify performance on the overall workflow is to create a set of objective / completion pairs, where the objective is the initial direction provided to the agent, and the completion is the final tool call representing completion of the objective (see the sketch after the note below). Capturing the intermediate tool calls and thoughts of the agent is helpful for understanding a failure or just a change in the tool call sequence.

Note

Consider this list a formal request-for-startups. Products around each of these items, if done well, can power the agents of the future.
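
Circling back to evaluation: here is a minimal sketch of the objective / completion pair idea from the list above. The cases and the agent_fn harness are hypothetical; the point is checking the final tool call while keeping the intermediate ones.

```python
# Hypothetical objective / completion pairs. The objective is the initial direction;
# the "completion" is the final tool call we expect the agent to make to finish it.
EVAL_CASES = [
    {
        "objective": "What were total sales by region last quarter?",
        "expected_final_tool": "run_sql",
        "check": lambda args: "GROUP BY" in args.get("query", "").upper(),
    },
]

def evaluate(agent_fn):
    """agent_fn(objective) returns the agent's tool calls in order,
    each as {"name": ..., "arguments": {...}} -- your harness provides this."""
    results = []
    for case in EVAL_CASES:
        calls = agent_fn(case["objective"])
        final = calls[-1] if calls else None
        passed = (
            final is not None
            and final["name"] == case["expected_final_tool"]
            and case["check"](final["arguments"])
        )
        # Keep every intermediate call too: it explains failures and sequence changes.
        results.append({"objective": case["objective"], "passed": passed, "calls": calls})
    return results
```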

7. Don’t bet against models continuing to improve

While building your agent, you will constantly be tempted to over-adapt to the primary model you’re building it on, and take away some of the reasoning expectations you have for your agent. Resist this temptation! Models will continue to improve, maybe not at the insane pace we are on right now, but definitely with a higher velocity than past technology waves. Customers will want agents that run on the models of their preferred AI provider. And, most importantly, users will expect to leverage the latest and greatest model within your agent. When gpt-4o was released, I had it running in a production account within 15 minutes of it being available in the OpenAI API. Being adaptable across model providers is a competitive advantage!

8. Bonus lessons learned

This article focused on the more strategic and product-oriented lessons learned. I plan to go deeper on some code and infra lessons in a future article. Here are a few teasers:


Patrick Dougherty

Co-Founder and CTO @ Rasgo. Writing about AI agents, the modern data stack, and possibly some dad jokes.