An Agent-based approach to GenAI
Co-written with Ali Hill, with support from GlobalLogic UK&I
At GlobalLogic we have been experimenting with an agentic approach to Generative AI. One use case we have been focusing on is the headless generation of unit tests for legacy applications. This blog provides an introduction to GenAI agents, the problems they can solve and their use cases, discusses our solution in more detail, and covers the limitations of this approach.
What is an AI Agent?
Generative AI agent workflows let you perform complex tasks by distributing them across multiple agents, each of which performs a particular role. This removes the need for direct human interaction with a Large Language Model (LLM), such as ChatGPT. The agents ask and answer questions autonomously, coordinating themselves and improving their solutions, potentially generating a better answer with each iteration. They do this by accessing different tools which are enabled and created by developers.
As an example, you could have a software developer agent build a website and a UX agent provide feedback on the solution, over several iterations. Once the necessary feedback cycles have occurred and the solution has been improved, it could then be passed to a QA agent to perform testing.
Agents can collaborate sequentially, hierarchically and asynchronously. The automation capabilities provided by AI agents offer a new way of developing software. In the past, automation has provided a route from A to B with predictable inputs and outputs. With AI agents, a range of different conditions and solutions can be handled. This is also the value of using AI agents: in the example we’ve been working on at GlobalLogic, using AI agents to write unit tests, different codebases can be provided to our tool and different unit tests will be produced, meaning that the tool is adaptable. With CrewAI, the two main execution processes are sequential and hierarchical.
Sequential:
In a sequential process, the order in which agents run is predefined when assembling the crew.
The output from previous tasks is wired to the input of the next agent. In the diagram we show the task output from the immediately preceding agent being used, but a task can also draw on the output of older tasks. This process is a good fit when you want to guarantee a specific order of execution.
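As a minimal sketch of what assembling a sequential crew looks like in code, reusing the website example above (the roles, goals and task descriptions here are illustrative rather than production prompts):

```python
from crewai import Agent, Task, Crew, Process

developer = Agent(
    role="Web Developer",
    goal="Build a clean, working landing page",
    backstory="A front-end developer with several years of experience.",
)
ux_reviewer = Agent(
    role="UX Reviewer",
    goal="Critique pages and suggest usability improvements",
    backstory="A UX specialist focused on accessibility and clarity.",
)

build = Task(
    description="Build a landing page for the product.",
    expected_output="HTML and CSS for the page.",
    agent=developer,
)
review = Task(
    description="Review the landing page and list improvements.",
    expected_output="Prioritised UX feedback.",
    agent=ux_reviewer,
    context=[build],  # explicitly wire this task to the build task's output
)

# Tasks execute in list order; each task can also see earlier tasks' output.
crew = Crew(
    agents=[developer, ux_reviewer],
    tasks=[build, review],
    process=Process.sequential,
)
result = crew.kickoff()
```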
Hierarchical:
In a hierarchical process, a manager agent directs the order of execution, validation and delegation. Tasks are not explicitly assigned to agents; instead, the manager decides which agent to assign a task to. The manager can be auto-generated by CrewAI, or it can be custom created and have an LLM assigned to it. The manager reviews the output of the agents and decides on the next course of action.
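A sketch of the hierarchical equivalent, assuming the agents from the previous example; note that the tasks are deliberately not pinned to agents, and the manager LLM name is illustrative:

```python
from crewai import Task, Crew, Process

# No agent is assigned to these tasks; the manager decides who does what.
build = Task(
    description="Build a landing page for the product.",
    expected_output="HTML and CSS for the page.",
)
review = Task(
    description="Critique the landing page for usability.",
    expected_output="Prioritised UX feedback.",
)

crew = Crew(
    agents=[developer, ux_reviewer],  # as defined in the previous sketch
    tasks=[build, review],
    process=Process.hierarchical,
    manager_llm="gpt-4o",  # illustrative; the manager needs an LLM of its own
)
result = crew.kickoff()
```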
By utilising multiple AI agents, each with an assigned role, AI can reason through complex problems, create plans to solve these problems and finally execute these plans for developers. These agents are implemented using a framework, such as CrewAI or Semantic Kernel, to develop workflows on any LLM and cloud provider.
Why do we use AI agents?
We believe that AI agent workflows will be a key driver of AI progress in the near future. AI agents provide a number of benefits. We’ve outlined a few of the key ones below:
- Having agents provide fast feedback is invaluable. Although role descriptions and agent permissions need fine-tuning, and the output needs analysis, the speed with which a solution can be produced accelerates the development process and provides a solid foundation for developers to build upon.
- Collaboration and iteration between agents improves solutions. Agents will cooperate and collaborate through an iterative feedback process. They simulate a conversation between two different roles — for example a developer agent and a QA agent.
- Viewing the conversation can provide valuable insights. Tools such as CrewAI allow you to view the conversation flow between agents. This can provide developers insight into the way the solutions are being produced.
- The flexibility to define roles that are valuable to your team. For example, if a senior developer is required, the role description provided can include almost any programming language and required skills, giving the developer control over the type of solution produced.
- Specify which tools the agents use to improve solutions. By writing functions that provide access to tools, you can ensure that your agents use only the tools that you define. You can also limit the actions that an agent can perform with these tools to avoid it taking any unexpected actions (see the sketch after this list).
- Separation of concerns with each agent. With each agent having its own defined role, the tools and access each requires may differ. By using multiple agents, you can limit agent access following the principle of least privilege.
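To make the last two points concrete, here is a hedged sketch of a narrowly scoped tool (the decorator import path varies between CrewAI versions, and the file-reading helper is our own illustrative example, not a CrewAI built-in):

```python
from pathlib import Path

from crewai.tools import tool  # in older versions: from crewai_tools import tool

ALLOWED_ROOT = Path("/workspace/project").resolve()

@tool("Read project file")
def read_project_file(relative_path: str) -> str:
    """Read a file from the project, refusing paths outside the project root."""
    target = (ALLOWED_ROOT / relative_path).resolve()
    if not target.is_relative_to(ALLOWED_ROOT):
        return "Error: access outside the project root is not allowed."
    return target.read_text()

# Only this read-only tool is handed to the agent: it cannot write or delete.
# developer = Agent(..., tools=[read_project_file])
```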
When Not to Use GenAI Agents
While we believe agentic GenAI can solve a number of development problems, there are some areas where we feel it might not be useful or may be overkill to implement.
For small applications, where AI is not the main driver of results, agents will probably add little value. Think of an application that makes simple predictions of future values based on previous values. It is unlikely that multiple agents talking to each other to make predictions are required. A simple AI or ML algorithm would probably suffice.
Applications required to run at high speed may not be suitable for a multi-agent approach, as they can take some time to initialise and run. Agents will also simulate human conversation which might needlessly complicate an application where time is of the essence. A high frequency trading application for example might not be able to tolerate multiple agents feeding back information to each other and waiting for responses. An ML algorithm that makes predictions more quickly with a more focused task may be more appropriate.
When computing resources are limited and cost is an important factor, running multiple agents might also be a bad choice. Applications running on hardware with limited CPU and RAM may not be suitable for a multi-agent approach.
What we’ve learned about agents
There are a number of considerations when attempting to create a useful AI agent.
Provide a role for the AI agent to play
The role that you define for your agent makes a huge difference to the responses you get from your agents. You need to provide context for the agent to stick to. As a basic example, asking ChatGPT about a coding concept would provide you a different answer than if you asked ChatGPT about a coding concept and provided it the context that the answer should come from a senior developer.
Give a focus for the AI agent’s role through tooling
Providing focus reduces the risk of hallucination (false or misleading information). As an example, you can provide each agent access to only the tools required to do the job they are being asked to do. Adding more agents is a better solution than overwhelming one agent with access to too many tools.
Ensure agents can collaborate with each other
The ability for agents to talk to each other in their roles is extremely important. The kind of back-and-forth you would have as a human with a tool such as ChatGPT can be simulated between AI agents. The simulated conversation produces a better result because the agents take feedback from each other. You should ensure your agents are set up to collaborate, either through a one-way communication stream or a feedback loop.
Define scope for each AI agent
As AI agent output is fuzzy and difficult to predict, it is important to explicitly define the scope of each agent. You can provide specific prompts to prevent tools, such as CrewAI, from taking a long time to run, for example. Being explicit with defining scope can also help to ensure you receive reliable and consistent results from your multi-agent tools.
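For example, a task description can pin down exactly what should come back (the wording here is illustrative, not our exact prompt):

```python
from crewai import Task

write_tests = Task(
    description=(
        "Write JUnit tests for the class provided. "
        "Return ONLY compilable Java test code: no commentary, "
        "no suggestions, and no changes to production code."
    ),
    expected_output="A single Java test file.",
    agent=developer,  # agent as defined in the earlier sketches
)
```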
Set the memory for AI agents
By memory we mean the ability for the agent to remember the tasks it has performed in the past. You can then use that memory/data to inform new decisions and new executions. Similar to humans, AI agents can remember what they did in the past, learn from it and apply this knowledge to future executions. It is important to choose wisely between agents using short-term (run only) and long-term (stored in a database after runs) memory as this can affect agent behaviour.
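In CrewAI, for example, memory can be switched on at the crew level (a minimal sketch, assuming a recent CrewAI version; the storage backends and embedder settings are configurable and vary by version):

```python
from crewai import Crew, Process

crew = Crew(
    agents=[developer, reviewer],       # as defined elsewhere
    tasks=[write_tests, review_tests],
    process=Process.sequential,
    memory=True,  # short-term memory within a run plus long-term storage across runs
)
```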
GlobalLogic’s Work with AI Agents using CrewAI
With all the hype around AI, we find that our clients are more frequently asking about its uses. Specifically, how and where it can be applied to solve real business problems.
One particular Financial Services client expressed a desire for help in writing unit tests for their legacy code base, which lacks unit test coverage. We decided to apply our knowledge of multi-agent AI to this problem and create a unit test generator tool.
The multi-agent aspect of this solution is built using CrewAI. CrewAI is a framework and platform that breaks the different concepts down into simple structures, making it easy for developers to pick up and use. It provides opinionated patterns for composing these structures, a library of ready-made tools that agents can use, and modules for building custom tools and agents. It also offers a choice of platforms for deploying your code.
We have recently been researching CrewAI and have learned how agent based GenAI can autonomously increase unit test coverage for legacy code bases. We will now give an in-depth overview of our solution.
Unit Test Generator
- Using CrewAI we create a crew of two agents, a developer agent and a code review agent.
- The developer agent is provided the role of an application developer, and the goal is to write clean, readable unit test code.
- The developer agent is also provided the context of being a developer with over 5 years experience. We state experience because it helps build the character of a developer and encourages the use of best practices as opposed to just writing functional code. This is a useful context that the LLM uses to understand how to approach the task.
- The code review agent is essentially a senior developer that will review the work of the developer agent. We characterise the senior developer in the agent’s backstory and profile in CrewAI. It will review the code and provide feedback to the developer agent.
- The developer agent will then take this feedback and improve on the tests written. This process will follow several iterations until either all tests are passing or until the agents cannot provide any further feedback.
- Each agent is assigned an LLM which will be the AI driving the actions of the agents. We use Amazon Bedrock as an LLM platform and invoke it via Python’s boto3 library. By using Bedrock we avoid managing the infrastructure of hosting an LLM — defining memory usage etc. We get the added benefit of having a choice of different LLMs to experiment with, both AWS created and third party.
- To provide a simple UI we used Streamlit. Streamlit is written in Python and provides a chat style interface. It allows rapid prototyping in order to quickly demo solutions.
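A minimal sketch of that kind of front end (run_crew is a hypothetical helper wrapping the crew kickoff):

```python
import streamlit as st

def run_crew(path: str) -> str:
    """Hypothetical helper: assemble the two-agent crew and kick it off."""
    # e.g. return str(crew.kickoff(inputs={"path": path}))
    return f"(generated tests for {path} would appear here)"

st.title("Unit Test Generator")

# Chat-style input: the user points the tool at the code needing tests.
prompt = st.chat_input("Which class or file should I write tests for?")
if prompt:
    with st.chat_message("user"):
        st.write(prompt)
    with st.chat_message("assistant"):
        st.write(run_crew(prompt))
```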
With this design we have been able to produce unit tests for an existing Java Spring Boot project. The unit tests reference project specific classes and perform assertions using the JUnit framework.
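For illustration, this is roughly what the underlying Bedrock invocation looks like through boto3 (the model ID, region and prompt are illustrative; in practice the agent framework drives these calls):

```python
import json

import boto3

# Bedrock's managed runtime: no model hosting, memory sizing or scaling to manage.
client = boto3.client("bedrock-runtime", region_name="eu-west-2")

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "messages": [
        {"role": "user", "content": "Write a JUnit test for the Calculator class."}
    ],
}

response = client.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # illustrative model ID
    body=json.dumps(body),
)
print(json.loads(response["body"].read())["content"][0]["text"])
```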
Lessons Learned and Limitations
LLM Selection
Deciding which LLM each agent will use can have a big impact on the behaviour of the software you are building using Agentic AI. One lesson we have learned is that not all LLMs will behave the same way, even if the prompts are identical. The nature of AI means that LLMs behave in a non-deterministic way, so it is important to do a comparison between the different LLMs available.
Our unit test generator tool was initially built using Anthropic Claude 3.5 Sonnet and was running well, producing good unit test coverage. A Financial Services client had access to GPT-4 running on Azure OpenAI and wanted to see whether it could achieve comparable results to Anthropic Claude. When we switched, we saw a noticeable change in behaviour.
We found that GPT-4 was more unpredictable than Anthropic Claude. For example, the agent using GPT-4 began changing production code in order to fix what it perceived to be bugs. This behaviour was not desired: the agent lacked context, and its changes could have posed a high risk to the application. Explicitly telling it not to do this in the prompt fixed the issue.
We are in the process of performing a comparison between different LLMs and their strengths and weaknesses. We believe that by understanding the personalities and goals of our different agents we can utilise the strengths of different LLMs to produce better outcomes.
CPU, Memory and Energy Usage
When creating multiple AI agents the CPU, memory and energy requirements can be considerable. This can lead to issues around cost, scalability and environmental impact.
Figure 1 shows a projection of energy consumption (in TWh) for GenAI hardware. If data centre scaling and demand continue on their current trajectory, GenAI energy consumption will exceed the total European energy consumption of 2022 by 2037. It is therefore important to weigh the necessity of GenAI, and whether it will bring productivity benefits for an organisation’s specific needs, against its energy consumption. Agent-based AI compounds this issue because of the multiple interactions taking place in each workflow.
Due to the above costs and impact, it’s worth considering what the application is going to be doing and whether or not it needs an agent-based approach. Are feedback loops needed? Is there a need to perform multiple tasks that all require different AI roles? Do you want an application to decide for itself which tool to use, or can these be defined up-front?
If, after considering the above questions, you decide the application doesn’t need multiple agents, a single-agent GenAI library will likely demand fewer computing resources, reducing both cost and impact.
High energy usage isn’t specific to multi-agent AI: GenAI requires models to be trained, and for LLMs this demands a large amount of computational effort and energy. As organisations face increasing scrutiny of their carbon footprint, it is important to consider the environmental cost of agent-based GenAI and whether it is required for an application.
Long Running Feedback Loops
If agents delegate and provide feedback to each other, a particular task can bounce back and forth in a long-running feedback loop. This can make runs take longer or cause your agents to time out.
In our unit test generator, we faced this issue when we created a developer agent and a code review agent. The developer agent wrote unit test code, taking an existing code repository as input, and the review agent gave feedback. The developer agent would then refactor the unit test code based on this feedback.
What we observed initially was a back and forth in which the developer constantly made improvements and the reviewer constantly made more suggestions. This occasionally caused the agents to time out. It is important to understand why prolonged feedback happens and how it can be mitigated.
We mitigated this by turning off tools caching in CrewAI, preventing generated code that contained errors from being repeated and triggering another round of feedback. We also try to strike a balance between agent autonomy and giving specific instructions on how to perform tasks. As an example, we explicitly specify that only code should be returned. This ensures the crew (team of agents) avoids wasting time correcting failed unit tests caused by comments and suggestions the LLM may add around the tests. This is a reasonable restriction, as a human programmer would do the same.
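A hedged sketch of the relevant knobs (parameter names as in recent CrewAI versions; the limit shown is illustrative):

```python
from crewai import Agent

developer = Agent(
    role="Application Developer",
    goal="Write clean, readable unit tests",
    backstory="A developer with over 5 years of experience.",
    cache=False,  # don't reuse cached tool results, so faulty output isn't repeated
    max_iter=10,  # cap the agent's internal loop to avoid long runs and timeouts
)
```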
Application of Principle of Least Privilege
If an AI agent is assigned tools and given the autonomy to use those tools as they see fit, it is possible that the agent may use them in ways that cause damage to your environment or corrupt data.
As an example, giving an agent the ability to use pip (the Python package installer) may lead it to install packages that are not required, or to uninstall important ones. Similarly, a tool that connects to a database should have its permissions limited to the tasks it needs to perform; otherwise it may corrupt data.
It is important to make sure agents cannot perform dangerous actions. This can be managed by making sure tools adhere to the single responsibility principle. Rather than giving an agent pip as a tool, give it a specific pip command such as “pip install -r requirements.txt”. If the agent only needs to read the database, giving it read-only access to specific tables or a view is safer.
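A hedged sketch of such a single-responsibility tool (the decorator import path varies by CrewAI version):

```python
import subprocess

from crewai.tools import tool  # in older versions: from crewai_tools import tool

@tool("Install pinned dependencies")
def install_requirements() -> str:
    """Run exactly `pip install -r requirements.txt`; the agent cannot vary it."""
    result = subprocess.run(
        ["pip", "install", "-r", "requirements.txt"],
        capture_output=True,
        text=True,
    )
    return result.stdout or result.stderr
```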
Make sure the service principal or user account that the tool uses has the minimal access needed to perform the application’s functions.
Conclusion
We believe that Agent based GenAI has many valuable contributions to make to software development in the near future.
Agents that can generate the foundations of entire applications, which a human can then build on, have the potential to not only save time but also utilise best practices and design patterns which a human developer may not immediately apply. As best practices change, the LLM is updated, and so are the applications being generated by the multi-agent GenAI tool.
As different LLMs from multiple vendors have a range of benefits and drawbacks, an agent based approach can make the most of a selection of them. As an example, if you have an LLM that specialises in website development you could have a developer agent assigned to utilise it. A tester agent can then be assigned to an LLM that is more suited to QA. Another agent can then be assigned an LLM that is more focused on business analyst tasks. These can then all work together to produce an optimal solution by leveraging the strengths of each LLM.
This could facilitate the rise of agent based teams which can assist in providing a range of services such as application development, report writing, product design etc.
We see the agent approach providing a separation of concerns for much larger GenAI applications that automate or assist with thinking tasks and designing solutions. Domain-specific models can also increase efficiency, as they are less likely to hallucinate results that are irrelevant to the industry space.
It is important that developers apply GenAI agents wisely, given the high CPU, memory and energy usage, and consider other options where appropriate.
We’re continuing to investigate and explore where Generative AI can enhance quality and productivity at all stages in the Software Development Lifecycle and are excited to see what the future holds. If you’d like to hear more about our work please get in touch.