MotleyCoder: A code interaction toolkit for AI agents
Part 1.
It now seems to be common knowledge that LLMs can write code. Often, even good code. In recent months, we’ve seen a surge of AI-powered coding assistants and editors making their mark and quickly gaining traction among developers. Andrej Karpathy recently claimed that most of his code is now written by an LLM.
At MotleyCrew, we too decided to dive into the world of code generation. First, we wanted to get acquainted with a large and rapidly developing AI domain that offers plenty of agent use cases. Second, our approach has always been rooted in addressing real-world challenges, and these use cases give us exactly that. In the end, it all led us to create a versatile set of tools that form a kind of code editor for AI agents, enabling LLMs to navigate and edit code in a way they can comprehend.
What does a coding agent consist of? First, we need a prompt. To have the LLM suggest edits to a particular project, we need an entry point: a view of the project that we can feed into the LLM.
Another essential part is to provide a way to suggest edits to the code.
Some projects, like Aider (which inspired us), keep it simple and, in a sense, agentless: they provide only the initial view (which depends on the user message) and a description of the format for proposing edits. The LLM decides which files require modification (the user can also specify them explicitly), and their full text is fed into the LLM so it can propose the necessary edits.
Our goal, unlike Aider’s, was not to build an interactive coding assistant (there are plenty of those already, and they’re quite good). Instead, we wanted to explore the idea of autonomous coding agents that can be embedded into various systems, and provide developers with a starting point for building such agents.
We settled on the following essential elements of a coding agent:
Repo map: the initial high-level view of the codebase. With a combination of static code analysis and retrieval techniques, which we’ll discuss in depth in the next part of the series, we detect parts of the code that are relevant to the current task and add them to the prompt. We used Aider’s codemap implementation as a foundation, with some changes in graph building and retrieval.
Just to give an idea of how it looks, here’s a part of a repo map for the TaskUnit class in motleycrew:
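The map is essentially a list of file paths with the signatures of the relevant classes and functions, with their bodies collapsed. An illustrative approximation (paths and signatures are simplified here, and the exact rendering differs) looks something like this:

```
motleycrew/tasks/task_unit.py:
│class TaskUnit(BaseModel):
│    status: str
│    output: Optional[Any]
│    def set_pending(self):
│    def set_running(self):
│    def set_done(self):
⋮...

motleycrew/tasks/task.py:
│class Task(ABC, Generic[TaskUnitType]):
│    def get_next_unit(self) -> TaskUnitType | None:
⋮...
```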
Code inspection tool: a tool that the LLM can call to inspect a part of the code it’s interested in. It can specify a filename, the name of an entity, or both.
If the output is very large, sub-entities are collapsed so that the LLM can request them separately if it wants to.
Using this tool, the agent can also list directories and read whole files.
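To make this concrete, here is a minimal sketch of what the core of such a tool could look like for Python files. This is not the actual MotleyCoder implementation (which uses tree-sitter and works across languages); the names are illustrative:

```python
import ast
from pathlib import Path


def inspect_entity(file_path: str, entity_name: str | None = None) -> str:
    """Return the source of a named entity, or an outline of the file.

    A simplified sketch: the real tool also lists directories, collapses
    large sub-entities, and handles multiple languages via tree-sitter.
    """
    source = Path(file_path).read_text()
    tree = ast.parse(source)
    definitions = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)

    if entity_name is None:
        # No entity requested: return a collapsed outline of top-level definitions
        names = [node.name for node in tree.body if isinstance(node, definitions)]
        return f"{file_path} defines: " + ", ".join(names)

    for node in ast.walk(tree):
        if isinstance(node, definitions) and node.name == entity_name:
            return ast.get_source_segment(source, node) or ""
    return f"Entity '{entity_name}' not found in {file_path}"
```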
File editing tool: a tool for actually modifying code files. Though the idea is simple, the actual design can be tricky: what format should the model use?
The most straightforward option would be to have the model write out the whole content of the new version of the edited file. An obvious downside of this approach, besides the high token cost, is that the LLM can make mistakes in parts of the file it didn’t intend to modify.
A great editing format devoid of these problems is diff format, or edit block format, where the model proposes edits in a search-replace way, providing the code block that needs to be replaced in a file, and the block that should replace it. This is the format Aider uses for GPT-4o by default, and the one we’ve settled on. It seems reliable because the model needs to reproduce the entire block of code it is replacing, thus minimizing the risk of missing some important logic.
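For illustration, an edit in this style looks roughly like this (a hypothetical snippet; the exact markers are defined in the prompt and may differ):

```
motleycoder_demo/greeting.py
<<<<<<< SEARCH
def greet(name):
    print("Hello " + name)
=======
def greet(name):
    print(f"Hello, {name}!")
>>>>>>> REPLACE
```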
Another option would be to specify a line range and the code that should replace it. We did not test this one, but if the LLM makes mistakes in the line numbers, they would be hard to detect.
Using a linter is also crucial for catching bad edits. MotleyCoder provides basic linting by parsing the code with tree-sitter, and more advanced linting for Python using flake8. Custom linters for other languages are easy to add.
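As a rough sketch, the Python part of such a linting step can be as simple as invoking flake8 and reporting whatever it finds back to the agent. The error-code selection below is an illustrative subset, not MotleyCoder’s exact configuration:

```python
import subprocess


def lint_python(file_path: str) -> str | None:
    """Run flake8 on a file; return its report, or None if the file is clean.

    A minimal sketch: select only codes that indicate broken code
    (syntax errors, undefined names) rather than style nits.
    """
    result = subprocess.run(
        ["flake8", "--select=E9,F821,F823,F831", file_path],
        capture_output=True,
        text=True,
    )
    return result.stdout or None
```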
Running tests after the agent is done with the edits can also give its reliability an immediate boost. This is a natural usage pattern for motleycrew’s output handler: the agent calls a special tool to signal that it has finished editing, and the tests are run inside that tool. If the tests fail, their output is fed back to the agent so it can fix them.
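Sketched in isolation, the pattern looks like this (the wiring into motleycrew’s output handler is omitted, and the test command is just an example):

```python
import subprocess


def finish_editing() -> str:
    """Tool the agent calls when it believes the edits are complete.

    If the test suite fails, the failure output is raised so the agent
    loop can feed it back to the LLM for another round of fixes.
    """
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    if result.returncode != 0:
        # The exception message becomes the agent's next observation
        raise ValueError(f"Tests failed, please fix them:\n{result.stdout[-3000:]}")
    return "All tests passed. Editing is complete."
```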
Together, all these tools form a kind of code editor for AI agents. They provide means to navigate and modify the codebase, as well as linting and testing on the go, like an IDE would do for human developers.
By their nature, these tools can be given to any agent. We found that motleycrew’s ReAct tool calling agent handles coding tasks really well, thanks to its great reasoning capability. Careful prompting is also crucial, so we included a set of prompts that we’ve found to work well with the tools.
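Assembling such an agent then boils down to handing it the tools and prompts. Roughly (the class, import path, and argument names below are assumptions; the demo notebook in the repo shows the actual wiring):

```python
# All names below are illustrative assumptions rather than the exact
# MotleyCoder/motleycrew API; see the demo notebook for the real wiring.
from motleycrew.agents.langchain import ReActToolCallingMotleyAgent  # assumed import path

agent = ReActToolCallingMotleyAgent(
    tools=[inspection_tool, file_edit_tool],  # the tools described above
    prompt_prefix=coding_prompt,              # one of the bundled prompts, plus the repo map
)

print(agent.invoke({"prompt": "Fix the off-by-one error in TaskUnit.set_done"}))
```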
There are downsides to this kind of agentic approach, too, of course. First, it can be token-hungry. Every tool call is followed by another request to the LLM with all the previous history included, and it gets quite heavy if it contains many code listings. This can be addressed, for example, with summarizing or otherwise collapsing older history.
Another flaw is that nothing really stops the agent from wandering through the codebase forever. Although unlikely, this is possible if no restrictions are set on the number of tool calls. For now, we partly address this by blocking repeated inspection of the same entity and limiting the total number of agent iterations, but more sophisticated criteria will probably be required for production applications.
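A sketch of these two guards, kept deliberately simple and with hypothetical names:

```python
class InspectionGuard:
    """Blocks repeated inspections and caps the number of agent iterations."""

    def __init__(self, max_iterations: int = 30):
        self.seen: set[tuple[str, str | None]] = set()
        self.iterations = 0
        self.max_iterations = max_iterations

    def check(self, file_path: str, entity_name: str | None) -> str | None:
        """Return an error message if the call should be blocked, else None."""
        self.iterations += 1
        if self.iterations > self.max_iterations:
            return "Iteration limit reached, please finalize your edits."
        key = (file_path, entity_name)
        if key in self.seen:
            return "You have already inspected this entity; use your previous observation."
        self.seen.add(key)
        return None
```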
The result
Our work resulted in a convenient set of prompts and tools for interacting with code. Together, they can form a fully autonomous coding agent, but they can also be used separately. For example, the repo map and the inspection tool alone can be combined to create a code explainer or a bug-finder agent.
Check out the repo, which also contains a demo notebook with the whole thing assembled: https://github.com/ShoggothAI/motleycoder.
We are going to continue improving MotleyCoder in our spare time to make it more reliable and capable. Also, we plan to evaluate our agent against the SWE-bench benchmark to see how it compares to the variety of coding assistants out there.
Stay tuned for the next part of the series, in which we’ll take a look at how we construct the codebase map using static code analysis and some tricks.