The Challenges of Building Robust AI Agents

An introduction to the challenges of building AI agents on top of LLMs.

Andrew Berry
7 min readFeb 18, 2024

It seems that everyone these days are trying to hobble something together on top of an existing LLM, possibly leveraging a framework like LangChain, sprinkle in some prompt engineering, and call it a product.

Truly there are some impressive and innovative products out there. Among these, products such as AutoGPT, GPT Pilot, Ask PDF, CopyAI, and of course ChatGPT.

These are some of the more renowned ones, but doing a quick search on GitHub, you can see all kinds of attempted products. From the extremely polished to hobbyists’ tinkering around. This is great, I’m loving the innovation and these products. Some of them are great time savers.

The thing is, building great products on top of existing LLMs are hard enough and there are a few challenges that go with it. Especially if complex agents are involved with sophisticated tools.

1. How do LLMs reason?

2. How do we properly prompt engineer our LLMs?

3. Working with non-deterministic models and ensuring proper data flow.

4. Managing Hallucinations

5. Establishing Sensible Guardrails

6. Evaluation Challenges

I’ll try to explore some of these challenges and present some ideas on potential solutions. But before we move one, I think it is important to define what an Agent is? Here is my attempt at a simple definition.

An agent is defined as a system capable of interpreting and/or reasoning about a user’s intent, striving to serve the user’s wants and needs, while understanding it’s capabilities and limitations.

Now here comes the first problem. How do LLMs make decisions?

Without really going into the weeds here, we should understand that whatever we feed to an LLM, based on what is has been trained and based on how it has been fine tuned. It will output the most probable outcome.

This process is what makes the technology so amazing and magic like.

Now, it does bring up the question. Do these language models really know what they are talking about? Do they really understand what it is being said to them. Now this is already a debate amongst the community. It’s more philosophical at most. I’m still conflicted about it and on the side it does not yet.

Let’s refine our first problem, how can we instruct our LLM to make the right decision? As a start, we can prompt engineer a reasoning prompt. There are many techniques out there. One of the more robust ones are the ReAct Prompting [1] method. Frameworks like LangChain already integrates it. You can also combine it with techniques such as chain of thought or few shot prompting. There really isn’t a right answer. But for most the part, normally, you provide some form of context of what the task the agent needs to do, some form of acceptance criteria, and most often a specific method of output it needs to be in.

We craft a prompt, experiment, and then we move on. Now the key question and 2nd challenge is, was our prompt the optimal choice? This speaks to the fact that the art of prompt engineering often feels like a guesswork game. It can also be quite time consuming. Yet, there are promising strategies to overcome the challenge, such as meta-prompting and using reinforcement learning.

Meta-prompting, quite simply a prompt to engineer prompts. If you have access to the paid ChatGPT version, there is the “create your own GPT” feature which is quite good.

Meta-prompting can be combined with reinforcement learning to improve outcomes as well. [2]

We could also employ reinforcement learning with human feedback (RLHF) to further support an LLMs reasoning capabilities on your specific Agent that you’re building. ChatGPT and the reason why it is a magical product employed RLHF techniques during its development. [3] [4]

Additional challenges come up when we have multi-turn interactions. The complexity increases with interactive or conversational applications, where the context of past interactions needs to be taken in account. Another challenge is working with other LLMs that are not OpenAI’s. Majority of resources out there in terms of best practices are normally from using OpenAI’s tools. A prompt that works withone LLM does not guarantee the same success with another one. I’m sure as time goes, this will change as alternative LLMs become more accessible to folks. For example, Perplexity AI and Google’s Gemini are quite accessible now! But there are other ones such as Mixstral or LLama that are available but just not as accessible if you are not a developer.

The 3rd challenge is managing the data-flow and decisions.

After prompting your LLM and receiving a response back in plain text. Quite straightforward for simple exchanges. It is not very practical for more complex applications and tasks. What we will need is some form of structured output. A common practice within the community and among various tools is to instruct the LLM to output in a JSON format, with specified keys where values can be parsed. These instructions are specified in your prompt. Here is an example from LangChain [5]. It doesn’t have to be in JSON format, it can be even a simpler structure that you may want to engineer that can be parsed with Regex.

After getting that sorted, this follows to our 4th challenge, which is dealing with hallucinations.

Unlike traditional applications where we can expect and guarantee specific data structures. With LLMs we must deal with hallucinations, and it is no secret that LLMs can hallucinate a lot, with some funny examples.

Hallucinations are when a response from an LLM was not as expected, made-up, or plain inaccurate.

This goes back to the point that LLMs are non-deterministic models at the end of the day. This must be understood, and risk of a hallucination is always there.

A simple strategy is by reducing the temperature when calling our LLM. However, doing so can make an LLM lose its magic, but for some use cases such as reasoning agents and tools, it may be beneficial to ensure our desired outcome. This is a part of prompt engineering, and it also shares the same issues. How to find the right temperature?

Temperature is a parameter that regulates the unpredictability of the model, or think of it as a creative-o-meter.

Prompt engineering itself is often used as a mechanism to reduce hallucinations. Such as encouraging it or adding emotion can improve performance [6]. Anecdotally, I have found this to be true. Here is a funny example where threatening it led to a desirable outcome as well [7].

The 5th challenge is establishing sensible guardrails. Guardrails are essentially mechanisms that ensure that the interaction between an LLM and user is within the context of the product.

These guardrails can come in many forms. Out of the box, a lot of LLMs have some sort of fine-tuning that removes a lot of harmful content. This is a form of a guardrail.

Once again, we can also employ prompt engineering to manage this. Take this system prompt from ChatGPT [8], I verified it myself [9]. As part of their DALLE capabilities a guardrail was written in place to ensure it does not create images in style of artists post 1912.

We can also employ validators to ensure our structured outputs and desired value is what we want. It can enforce a stop gap by raising an error, which you could pass back to an LLM to correct itself. This technique also helps minimize the hallucinations in building custom agents. I have found using pydantic validators as a good start to build into your products. [10]

The 6th challenge is evaluating your agents.

Now, this is very hard. I’m sure there are many teams out there and hobbyist who evaluate their own agents with a simple excel sheet. Input Question → Did it get it right? → Yes/No. Rinse and repeat as they iterate with their product.

For simple input and output types of agents or tools, it is possible to scale it up and run it automatically. The real challenge comes in evaluating multi-turn models. Did it choose the right tool? Did it consider the previous context? Was the response good?

I am not sure and aware of a good scalable solution for evaluating multi-turn agents, besides A/B testing. It’s hard to automatically do it at scale and quite subjective at times. Say you are a building a chatbot. Indeed, it is important it stays within its context, knows the desired outputs, but what is also important is making sure the product feels good. The response make sense, it is sensible, and has the right amount of spark that makes it feel like magic. Subjective experiences are hard to evaluate, and we humans are not very good at doing that as well.

Overall, building robust LLM powered agents has its complex challenges. It requires a lot of creativity, experimentation, and quality engineering to do so. As these models get better, it will unlock even greater potential products, easier development, but for the foreseeable future the AI engineer will need to have a strong intimate understanding of the capabilities of the LLMs they are using. Being aware of its strengths and weaknesses. Most importantly a good sense of the feel of it.

Sources:

1. ReACT: Synergizing Reasoning and Acting in Language Models [https://arxiv.org/abs/2210.03629]

2. Prewrite: Prompt Rewriting with Reinforcement Learning [https://arxiv.org/pdf/2401.08189.pdf]

3. https://huggingface.co/blog/rlhf

4. https://openai.com/blog/chatgpt

5. https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/output_parsers/format_instructions.py

6. Large Language models Understand and Can be Enhance by Emotional Stimuli [https://arxiv.org/pdf/2307.11760.pdf]

7. https://twitter.com/goodside/status/1657396491676164096?t=Jd1yywsBQHJPRHGy2ioRHA

8. https://twitter.com/krishnanrohit/status/1755125297895416062

9. https://chat.openai.com/share/ccbe38ae-b25f-45cc-a4ad-14e723aa43ef

10. https://blog.pydantic.dev/blog/2024/01/04/steering-large-language-models-with-pydantic/

--

--