Hard-Earned Lessons from a Year of Building AI Agents
A little background first — I work as part of a team at IBM Research with the mission to incubate disruptive technologies. Over the past year, we have been laser-focused on enabling a diverse set of builders to harness the power of AI agents to solve their unique use cases.
In this whirlwind of AI development, I thought it would be valuable to pause and distill key learnings from the past year. I hope these insights prove helpful!
Starting Premises
Early in 2024, we made two key observations from our prior work incubating Generative AI models:
- LLMs can be harnessed for higher complexity problem-solving. By combining clever engineering with advanced prompting techniques, models could go beyond what few-shot learning could achieve. Retrieval-Augmented Generation (RAG) was an early example of how models could interact with documents and retrieve factual information. LLMs can also dynamically interact with a designated environment via tool calling. When combined with chain-of-thought prompting, these capabilities laid the foundation for what would later be known as AI agents (a minimal sketch of this loop follows this list).
- Non-experts struggled to capture the promised productivity gains of generative AI. While generative AI offers vast potential, many users struggled to translate its capabilities into solving high-value problems. On the other hand, we observed that the teams that were successful at unlocking AI’s potential had deep expertise in both LLMs and systems engineering.
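To make the tool-calling pattern above concrete, here is a minimal TypeScript sketch of the loop: the model alternates between chain-of-thought style reasoning and tool calls until it is ready to answer. The `ChatModel` interface, `ModelTurn` shape, and `runAgent` helper are illustrative assumptions for this post, not any particular framework’s API.

```typescript
// A minimal sketch of the reasoning + tool-calling loop described above.
// The model client, tool set, and turn format are illustrative assumptions.

interface ToolCall {
  tool: string;
  input: string;
}

interface ModelTurn {
  thought: string;      // chain-of-thought style reasoning text
  toolCall?: ToolCall;  // present while the model still needs information
  finalAnswer?: string; // present once the model is ready to answer
}

interface ChatModel {
  // Hypothetical: takes the transcript so far, returns the next structured turn.
  next(transcript: string[]): Promise<ModelTurn>;
}

type Tool = (input: string) => Promise<string>;

async function runAgent(
  model: ChatModel,
  tools: Record<string, Tool>,
  question: string,
  maxSteps = 8,
): Promise<string> {
  const transcript = [`Question: ${question}`];

  for (let step = 0; step < maxSteps; step++) {
    const turn = await model.next(transcript);
    transcript.push(`Thought: ${turn.thought}`);

    if (turn.finalAnswer !== undefined) {
      return turn.finalAnswer;
    }
    if (turn.toolCall) {
      const tool = tools[turn.toolCall.tool];
      const observation = tool
        ? await tool(turn.toolCall.input)
        : `Unknown tool: ${turn.toolCall.tool}`;
      // Feed the observation back so the next turn can reason over it.
      transcript.push(`Observation: ${observation}`);
    }
  }
  throw new Error("Agent did not reach an answer within the step budget");
}
```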
This led us to an early hypothesis: Could we empower a broader user base to harness AI to solve their everyday problems?
Kanju Qiu from Imbue captured this vision perfectly:
“We want to make it much easier for relatively technical people to make software. And then much easier for non-technical people to make software. And then eventually much easier for everyone to make software. And then we won’t think of what we’re doing as making software.”
Early Proof Points
To test our hypothesis, we rapidly prototyped an AI agent that — at the time — outperformed commercially available search solutions. This agent could break down complex queries into sub-steps, navigate multiple web pages to gather supporting information, execute code, and synthesize a well-reasoned response.
What made this implementation especially interesting was that it relied on an open-source model, Llama-3-70B-Chat, which, unlike OpenAI’s o1 and DeepSeek-R1, did not have built-in reasoning capabilities. We were able to harness agentic capabilities from this model by implementing the architecture detailed in this article.
To make this agent useful for a broad audience, we had to innovate beyond the traditional Chat GUI experience. A key addition we made was the trajectory explorer, a visual tool that allowed users to investigate the steps taken by the agent to generate its response. This feature significantly improved transparency and user trust.
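To picture what the trajectory explorer renders, here is a hedged sketch of the kind of step record such a tool could consume. The field names and the `TrajectoryRecorder` class are hypothetical, not our actual implementation.

```typescript
// Illustrative sketch of a trajectory record that a UI could render step by step.
// Field names and the recorder class are hypothetical.

type TrajectoryKind = "plan" | "tool_call" | "observation" | "final_answer";

interface TrajectoryStep {
  index: number;
  kind: TrajectoryKind;
  label: string;   // e.g. "Searching the web", "Running code"
  detail: string;  // raw reasoning text, tool input, or tool output
  startedAt: Date;
}

class TrajectoryRecorder {
  private steps: TrajectoryStep[] = [];

  record(kind: TrajectoryKind, label: string, detail: string): void {
    this.steps.push({
      index: this.steps.length,
      kind,
      label,
      detail,
      startedAt: new Date(),
    });
  }

  // The explorer UI consumes this to let users inspect how an answer was produced.
  toTimeline(): readonly TrajectoryStep[] {
    return this.steps;
  }
}
```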
We tested this prototype with early adopters and uncovered several key insights:
- Agent trajectory exploration is a vector of trust. Users didn’t just want a correct answer — they wanted to understand how the agent arrived at it. The ability to inspect reasoning steps increased confidence in the system.
- How the problem is solved matters. Users demonstrated a strong preference for reasoning methods that aligned with their expectations. For example, when solving math problems, they trusted calculator-based solutions over results scraped from search engines. This underscored the importance of aligning agent reasoning with user expectations.
- Agents amplify existing security challenges. AI agents introduce new security risks by interacting dynamically with external systems. Potential vulnerabilities range from unintended system modifications to incorrect or irreversible actions.
- Enterprise adoption requires enforceable business rules. While prompting can guide AI behavior, it does not guarantee strict adherence to rules. This highlights the need for a comprehensive set of rule enforcement mechanisms for AI agents; a simple sketch of this idea follows this list.
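On that last point, one way to make business rules enforceable rather than merely prompted is to check every proposed action against an explicit policy before it executes. The sketch below is illustrative only; the `ProposedAction` shape and the example rules are assumptions, not a description of a shipped guardrail system.

```typescript
// Illustrative sketch: enforcing business rules outside the prompt by gating
// every proposed action through an explicit, auditable policy check.

interface ProposedAction {
  tool: string;          // e.g. "sql", "web_browser", "email"
  input: string;
  irreversible: boolean; // e.g. sending an email, deleting a record
}

interface PolicyRule {
  description: string;
  violates(action: ProposedAction): boolean;
}

// Hypothetical example rules; real deployments would encode their own.
const rules: PolicyRule[] = [
  {
    description: "Irreversible actions require human approval",
    violates: (a) => a.irreversible,
  },
  {
    description: "SQL tool may only run read-only queries",
    violates: (a) => a.tool === "sql" && !/^\s*select\b/i.test(a.input),
  },
];

function enforce(action: ProposedAction): { allowed: boolean; reasons: string[] } {
  const reasons = rules.filter((r) => r.violates(action)).map((r) => r.description);
  return { allowed: reasons.length === 0, reasons };
}

// Usage: block or escalate before the tool actually runs.
const verdict = enforce({ tool: "sql", input: "DELETE FROM orders", irreversible: true });
if (!verdict.allowed) {
  console.warn("Action blocked:", verdict.reasons.join("; "));
}
```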
From Prototype to Open-Source Launch
With these lessons in mind, we decided to double down on empowering the everyday builder, using their needs to drive requirements throughout the stack.
One of the first implementation decisions we made was to develop our own agent framework to power our agents. We evaluated existing tools for building agents and found a clear gap in addressing the needs of full-stack developers. This led to the development of the BeeAI Framework, a TypeScript-based library built specifically to fill this gap.
Following advice from our mentors, we chose to open-source the entire stack to see what resonated with the community. This allowed us to reach a diverse range of users and measure real-world traction.
We quickly found an audience with TypeScript developers, who leveraged BeeAI’s capabilities to build innovative applications. Notably, Bee Canvas and UI Builder emerged as standout community implementations that showcased the potential of agent-based systems and new human-agent interaction paradigms.
In addition, our most vocal users were developers engaging with the open-source UI, which was initially designed for non-technical users. This surprising discovery highlighted the importance of creating delightful and intuitive ways for developers to interact with, consume, and demo their AI agents.
The Hard-Earned Lessons
On creating something that users love:
- Focus on one persona at a time. Trying to serve multiple user types too soon can dilute impact. While we were excited about reaching everyday builders, our initial traction came from developers — this is the persona we are doubling down on going forward. Additionally, Python is key to unlocking broader adoption, as it remains the dominant language for AI development and allows us to reach a wider technical audience. Bringing the Python library to feature parity with TypeScript is our top priority.
- Deliver a clean developer experience. A great developer experience makes it easy for newcomers to get value while providing flexibility for advanced users. Initially, our spread of repositories led to friction; we are now working to consolidate them and create a more seamless experience.
- Iterate quickly and track what resonates. The AI agent space is evolving rapidly, with no clear leader yet. The best approach is to stay agile — focus on user needs, experiment fast, and refine based on real-world usage. We are still in the early days, and thoughtful execution can set strong ideas apart.
And when it comes to building production-grade agents:
- Agent architectures exist on a spectrum. There is no one-size-fits-all approach to multi-agent architectures. Teams that successfully take agents to production typically design custom architectures tailored to their specific use cases and acceptance criteria. A robust framework should accommodate this diversity, allowing users to implement architectures that best align with their unique requirements. Our initial implementation was opinionated — we designed BeeAgent with our own needs in mind. However, as we gained insights, we renamed it to ReActAgent to acknowledge that no single agent design is definitive, and we plan to add more out-of-the-box single agents. Additionally, we introduced workflows, a flexible multi-actor orchestration system that enables users to design and implement agent architectures tailored to their specific requirements.
- Consumption and interaction modalities matter. The way end users interact with agents can be just as important as the technical implementation. For example, Anthropic’s Artifacts feature made it significantly easier for users to reap the benefits of highly iterative LLM-powered workflows, such as editing documents or schemas. As our learnings have signaled, agent-to-agent and agent-to-human interaction is a space ripe for innovation, and this is a key area we plan to focus on in the future. Expect more updates soon!
- Evaluation is key — and more challenging than ever. Without existing benchmark datasets and tooling that fit our needs, we had to develop our own processes from scratch. Our approach involved defining every feature our agent should support and constructing a custom benchmark dataset. Features ranged from simple aspects like controlling the tone or verbosity of responses to the accuracy of more complex reasoning tasks. Whenever we introduced a new capability, we rigorously tested that it did not negatively impact existing functionality. Beyond assessing outputs, we also analyzed the agent’s reasoning trajectory — how it arrived at answers. This remains a complex, evolving challenge, and accelerating the path from prototyping to production represents a major opportunity in the field. A simplified sketch of this harness follows below.
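Here is that simplified sketch of a feature-based regression harness. The case format and checks are illustrative assumptions rather than our actual benchmark tooling; the idea is simply that each capability gets its own cases, and every change is scored against all of them so regressions surface immediately.

```typescript
// Simplified, illustrative sketch of a feature-based regression harness.

interface EvalCase {
  feature: string;                                // e.g. "verbosity control", "math reasoning"
  prompt: string;
  checkOutput: (answer: string) => boolean;       // did the final answer pass?
  checkTrajectory?: (steps: string[]) => boolean; // was it solved the expected way?
}

interface AgentRun {
  answer: string;
  steps: string[]; // serialized trajectory, e.g. tool names in the order they were used
}

type Agent = (prompt: string) => Promise<AgentRun>;

async function evaluate(agent: Agent, cases: EvalCase[]): Promise<Map<string, number>> {
  const tallies = new Map<string, { passed: number; total: number }>();

  for (const c of cases) {
    const run = await agent(c.prompt);
    const ok =
      c.checkOutput(run.answer) &&
      (c.checkTrajectory ? c.checkTrajectory(run.steps) : true);

    const tally = tallies.get(c.feature) ?? { passed: 0, total: 0 };
    tally.total += 1;
    if (ok) tally.passed += 1;
    tallies.set(c.feature, tally);
  }

  // Pass rate per feature: a drop on any row after a change flags a regression.
  const passRates = new Map<string, number>();
  for (const [feature, t] of tallies) {
    passRates.set(feature, t.passed / t.total);
  }
  return passRates;
}

// Example case tying output checks to trajectory checks (see the earlier insight
// about users trusting calculator-based solutions over scraped search results).
const cases: EvalCase[] = [
  {
    feature: "math reasoning",
    prompt: "What is 17% of 2,340?",
    checkOutput: (answer) => answer.includes("397.8"),
    checkTrajectory: (steps) => steps.includes("calculator"),
  },
];
```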
What’s Next?
✨ If you have been following BeeAI’s journey and want to help shape its future, we invite you to explore our public roadmap and contribute to the discussion thread on upcoming changes. We would love your input!