Building AI-powered software engineering tools: Essential technical considerations for founders

Harpinder Singh
Innovation Endeavors
15 min read · May 30, 2024

By Harpinder Singh and Dhruv Iyer

In 2022, GitHub caught lightning in a bottle by releasing GitHub Copilot, its AI coding assistant. Today, over 37,000 businesses, including a third of the Fortune 500, use the product, which can reportedly help developers code as much as 55% faster. This, however, only scratches the surface of what’s possible when AI is applied to software engineering. Many aspiring founders, themselves engineers, want to get in on the ground floor of this industry revolution by bringing products to market that drive productivity gains across the SDLC.

We spent several months meeting researchers and companies pushing the boundaries of AI-enabled software engineering, from code generation and testing to migrations and beyond. In this two-part series, we will share our guide for building AI-powered developer tools and discuss some areas in particular that we are most excited to see disrupted by this technology.

In this post, a handbook for founder CTOs, we’re covering some common design patterns and their tradeoffs, as well as key engineering challenges and current methods for addressing them.

If you’d like to dive deeper into the business model questions that influence how companies are built in this space, and a few opportunities we think could lead to big companies, check out “Building AI-powered software engineering tools: Essential commercial considerations for founders.”

Common design patterns and tradeoffs

We’ve seen a number of design patterns emerge that serve as building blocks for bringing AI into software engineering tools, each with its own benefits and tradeoffs.

Solo programming versus pair programming interaction model

One core design decision founders will make is regarding the interaction model between humans and AI in their product. We see two common modalities, which we call the solo programming approach and the pair programming approach. In the solo programming model, AI acts independently and receives feedback and guidance as if it were just another human engineer. The typical implementation we see is an AI agent opening pull requests or issues into a repo while engaging with and responding to other contributors. See here and here for examples of an AI bot working collaboratively but independently to close issues.

In the pair programming model, AI works hand-in-hand with the user to achieve a shared goal, which usually means working simultaneously on the same file. This is the interaction model you experience in AI-enabled IDEs and coding assistants like Replit, Sourcegraph, and GitHub Copilot. You can see an example below from Replit AI. Another potential implementation of pair programming is a chatbot, where the user converses with the AI to refine a generated snippet of code.

Image: Replit

In our view, the solo programming model has greater upside potential for creating developer productivity gains because full autonomy implies offloading 100% of planning, structuring, and writing code to the AI. The drawback is that providing feedback to an agent like this is cumbersome, given that the primary channel for this feedback is typically pull request comments. By contrast, feedback to a pair programming “autocompletion” happens in the flow of composition with quick iteration cycles. People are accustomed to giving feedback in this format, as evidenced by new research suggesting that users are satisfied as long as the bot gives them a good starting point. The drawback of this approach is that there is a ceiling on how much productivity improvement a pair programming form factor like this can bring, as humans will always need to be kept in the loop.

How do you decide, then, whether your product should leverage solo or pair programming? The answer lies in the value proposition you want to offer. If your core pitch to customers is that you can handle a task they otherwise would not undertake (like migrations or tech-debt cleanup), then it is essential that the product experience feel like a magical AI bot that pushes code to solve the problem. Customers may have little patience for providing in-depth feedback and human-in-the-loop supervision; after all, they didn’t want to do that task in the first place, so the bot should not create more work for them than absolutely necessary.

If instead your value proposition is providing a speed boost to workflows that engineers inevitably must perform (such as optimization or unit testing), we think a pair programming experience is likely the right fit. Engineers may tolerate higher error rates and more hands-on coaching of the bot as long as they get a quantifiable productivity uplift on these frequent tasks.

Deterministic versus probabilistic code mutation

Most AI developer tools need to perform code mutation: making edits to lines, functions, modules, and files. Companies can approach this either deterministically or probabilistically.

Let’s consider the deterministic approach first. Under the hood, this involves sophisticated pattern-matching programs (called codemods) that replace one piece of code with another the same way every time. For example, here is a snippet from Grit’s documentation that shows how to use GritQL to replace all console log messages with an alert.

Image: Grit

Although we describe deterministic code changes as string matching plus replacement, there is a great deal of technical nuance involved. Grit, for example, deeply understands the abstract syntax tree structure of your code, allowing it to ignore matching patterns inside, say, string literals or comments. While deterministic code changes are reliable, they require some upfront effort to program and configure, especially if there is branching or conditional logic. They are also not as “creative” or “adaptable” as LLMs, since they are designed to perform a particular transformation on a particular pattern with high reliability and consistency.
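
To make the underlying idea concrete, here is a minimal sketch in Python (not GritQL) of an AST-aware, deterministic transform using the standard library’s ast module; the specific rewrite (print to logging.info) is just an illustration. Note that ast.unparse discards comments and formatting, which production codemod engines like Grit take care to preserve.

```python
import ast

class PrintToLogging(ast.NodeTransformer):
    """Deterministically rewrite print(...) calls into logging.info(...)."""

    def visit_Call(self, node: ast.Call) -> ast.Call:
        self.generic_visit(node)
        # We match syntax-tree nodes, not raw text, so the word "print"
        # inside a string literal is never rewritten.
        if isinstance(node.func, ast.Name) and node.func.id == "print":
            node.func = ast.Attribute(
                value=ast.Name(id="logging", ctx=ast.Load()),
                attr="info",
                ctx=ast.Load(),
            )
        return node

source = 'print("hello")\nnote = "print stays untouched inside this string"\n'
tree = ast.fix_missing_locations(PrintToLogging().visit(ast.parse(source)))
print(ast.unparse(tree))
```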

Probabilistic approaches use AI to author code directly. LLMs are token-prediction machines, and their ability to code stems from the inclusion of source code in their training corpus. Hence, given an appropriate prompt, they can produce text that looks like code and often compiles. But because the model selects the next token probabilistically, it can also generate incorrect, vulnerable, or nonsensical code. New coding models are released frequently and benchmarked against various evaluation suites; leaderboards such as EvalPlus are helpful for keeping up with the state of the art.
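
As a hedged illustration of this failure mode and one simple guardrail, the sketch below asks a model for a function and keeps only output that parses; `complete` is a stand-in for whatever chat-completion client you use, not a specific vendor API.

```python
import ast
from typing import Callable

def generate_function(complete: Callable[[str], str], spec: str, max_attempts: int = 3) -> str:
    """Sample code from an LLM, keeping only output that is at least parseable.

    `complete` is a placeholder for any chat-completion client; because the
    model samples tokens probabilistically, we re-prompt when the returned
    text is not even syntactically valid Python.
    """
    prompt = f"Write a single Python function that {spec}. Return only code, no prose."
    for _ in range(max_attempts):
        candidate = complete(prompt)
        try:
            ast.parse(candidate)  # cheap syntactic sanity check, not a correctness proof
            return candidate
        except SyntaxError:
            continue
    raise RuntimeError("model did not produce parseable code")
```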

Most products, we believe, will converge to using a combination of deterministic and probabilistic approaches to perform code mutation. Teams will need to make a decision about how much of each to leverage based on a few factors:

  1. What reliability can you achieve with probabilistic methods alone?
  2. Which customers, if any, will pay for a product with that level of reliability?
  3. To what extent are you willing and able to use deterministic methods to assist AI and increase reliability?
  4. Is there sufficient market opportunity for the universe of use cases bounded by the above decisions?

It will be interesting to see how the industry balances the percentage of code mutations that happen via codemods versus AI across different use cases. We believe that the choice of deterministic versus probabilistic architectures for a particular application can lead to vastly different outcomes for competing companies.

Zero-shot versus agent-driven architecture

In a zero-shot (or few-shot) approach, an LLM receives a prompt, perhaps one enriched using RAG or other in-context learning methods, and produces an output, which might be a code mutation, a docstring, or an answer to a question. The diagram below, from Google’s recent blog post on Vertex AI Codey, offers a great visualization of how RAG works in a few-shot LLM approach (more on chunking, embedding, and indexing later in this post).

Image: Google
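
As a rough sketch of the retrieval-augmented flow in the diagram, the function below stitches retrieved snippets into a prompt; both `retrieve` and `complete` are placeholders for your own index and model client, not Vertex AI’s API.

```python
from typing import Callable, List

def answer_with_context(
    complete: Callable[[str], str],             # placeholder LLM client
    retrieve: Callable[[str, int], List[str]],  # placeholder top-k snippet search
    question: str,
    k: int = 5,
) -> str:
    """Enrich a zero/few-shot prompt with retrieved code snippets (RAG)."""
    snippets = retrieve(question, k)
    context = "\n\n".join(f"--- snippet {i + 1} ---\n{s}" for i, s in enumerate(snippets))
    prompt = (
        "Answer the question using only the repository snippets below.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    return complete(prompt)
```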

Agents, by contrast, are multi-step reasoning engines that use a combination of zero-shot and few-shot LLM calls to iterate toward a goal. This may involve any number of intermediate planning and self-reflection steps. For example, you could ask an agent to plan out a debugging investigation, including which root causes it should explore and which functions or files it should dig into. Then, at each step, you could ask the AI to reflect on whether it has identified the bug or needs to explore further. Agents become particularly powerful when they can also leverage tools to either mutate the codebase (e.g., codemods) or build additional context by pulling from external sources like a language server. More on this idea of an agent-computer interface can be found in the SWE-agent paper by Yang et al., which highlights some of the considerations to make when designing an “IDE” intended for use by AI systems rather than humans.
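
A hedged sketch of such a loop is below: the model chooses a tool, observes the result, and reflects on whether it is done. The `complete` client and the tool functions are placeholders, and real agent frameworks add structured planning, memory, and richer tool schemas.

```python
from typing import Callable, Dict

def run_agent(
    complete: Callable[[str], str],          # placeholder LLM client
    tools: Dict[str, Callable[[str], str]],  # e.g. {"search_code": ..., "run_codemod": ...}
    goal: str,
    max_steps: int = 10,
) -> str:
    """Minimal act/observe/reflect loop for a tool-using agent."""
    transcript = f"Goal: {goal}\nAvailable tools: {', '.join(tools)}\n"
    instruction = "\nReply with 'CALL <tool> <argument>' to use a tool, or 'DONE <answer>' when finished."
    for _ in range(max_steps):
        decision = complete(transcript + instruction)
        if decision.startswith("DONE"):
            return decision[4:].strip()
        parts = decision.split(maxsplit=2)
        if len(parts) == 3 and parts[0] == "CALL" and parts[1] in tools:
            observation = tools[parts[1]](parts[2])
        else:
            observation = "Unrecognized action; use the CALL or DONE format."
        # Append the step so the next turn can reflect on what it learned.
        transcript += f"\n{decision}\nObservation: {observation}\n"
    return "Step budget exhausted without a final answer."
```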

One example of an agentic workflow can be found in the paper “AutoCodeRover: Autonomous Program Improvement” by Zhang et al. In this case, the AI agent leverages both self-reflection and a toolkit of functional primitives, such as code and AST search, to build context and suggest repairs for GitHub issues. The agent is first prompted to use its tools to retrieve the functions that could be the source of the bug. After retrieving context, the AI is asked to reflect on whether it has enough information to identify the root cause. Once it answers affirmatively, it generates a patch and is asked to validate whether the patch can be applied to the program. If it cannot, the bot tries again. A diagram of the workflow is shown below.

Image: Zhang et al.

To see this idea taken even further, you might be interested in reading “Communicative Agents for Software Development” by Qian et al., which offers a deeper dive into a multi-agent system in which “CTO”, “designer”, “tester”, and “programmer” agents chat with each other to decompose a task and accomplish a shared goal.

Whether a zero/few-shot approach is sufficient or you need to introduce LLM agents is more of a technical implementation detail than a design tradeoff. Agents may be harder to steer, but may lead to greater overall success given their ability to zoom in and out on particular subproblems within a complex task.

Human-directed versus independent planning

Planning is a critical part of the agentic workflow. AI agent products can be segmented along two product directions depending on whether humans assist with the planning or the AI composes the plan independently.

As an example on the human-directed side, we can look at Momentic in the end-to-end testing space. In Momentic, the user writes a higher-level test plan with detailed instructions on testing procedures, which are then executed with the help of AI. This is useful because testers care more about verifying that the application follows the correct intent (e.g., displaying the weather) and would prefer not to get hung up on particular assertions (“the value of the weather html element is 72”), which create brittleness. AI is quite good at implementing this sort of fuzziness. For example, in the workflow below, the AI can assert that we logged in successfully whether the page says so explicitly or simply displays a landing page that can only be accessed by an authenticated user.

Image: Momentic
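
Momentic’s actual product is far richer, but a minimal sketch of the human-directed pattern might look like the following, where the plan is plain English written by a tester and an LLM acts as a fuzzy assertion oracle; `perform_step` and `judge` are placeholders, not Momentic’s API.

```python
from typing import Callable, List

def run_test_plan(
    perform_step: Callable[[str], str],  # placeholder: executes one step, returns visible page text
    judge: Callable[[str], str],         # placeholder LLM used as a fuzzy assertion oracle
    plan: List[str],
) -> bool:
    """Execute a human-written, natural-language test plan; the model checks
    intent ("the user is logged in") rather than brittle exact values."""
    for step in plan:
        page_text = perform_step(step)
        verdict = judge(
            f"Test step: {step}\nThe page now shows:\n{page_text}\n"
            "Does the page satisfy the intent of this step? Answer PASS or FAIL."
        )
        if not verdict.strip().upper().startswith("PASS"):
            return False
    return True

# A plan authored by a human tester, executed with the help of AI:
login_plan = [
    "Log in with the staging test account",
    "Confirm the dashboard shows today's weather",
]
```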

For an example of an independent planning approach, we can look at Goast.ai, which seeks to automate root cause analysis and debugging workflows by ingesting observability data and dynamically searching a codebase to perform auto-remediation. As the diagram below shows, Goast uses a multi-agent architecture that involves (1) a context engine that retrieves semantic information from the codebase useful for exploring an RCA, (2) root cause analysis and implementation agents that perform the investigation and remediation, and, most critically for this section, (3) a solution planning agent.

Image: Goast

The question of whether to use independent or human-directed planning in your product closely parallels our earlier discussion on solo vs pair programming approaches: it depends on the value proposition you want to offer. In the planning phase, the decision about whether to incorporate humans-in-the-loop for plan generation should largely be based on the delta between the effort required to create a plan for your use case versus the time required to implement it.

For example, in testing, it takes significantly longer to implement a test case in a browser automation framework like Selenium than to articulate the higher-level steps in natural language. In fact, QA teams generally want some level of control over the system and the actions it takes, so a completely AI-generated and AI-executed test plan may run counter to the goals of the user.

By contrast, in debugging/RCA, the plan is the core task to be automated. While it can be challenging to figure out how to explore and prune the search space to zero in on the root cause of a bug, many times the fix itself is quite simple. And, if one seeks to create a fully autonomous SRE as the core value proposition, a human-assisted planning approach is counter to the goals of the product.

Technical challenges

Most companies building products in this category encounter the same technical roadblocks. Here, we outline a few of them and some common strategies that companies have been employing to address them.

Preprocessing and indexing

Although there have been many recent advances in expanding model context windows to hundreds of thousands or even millions of tokens, many codebases are still far too large to fit in a single context window. Even when they do fit, it’s not clear that this approach actually helps, as models can struggle to use long context windows effectively (a problem called “lost in the middle”). Hence, companies face a significant challenge in preprocessing codebases so that, at inference time, the AI can parsimoniously retrieve the context it needs to answer a prompt. Because of how impactful preprocessing is to the rest of the AI stack, this is one of the key areas where companies in this space can differentiate technically.

Many companies will start by chunking the codebase into usable snippets and generating embeddings for them using a model like Voyage Code. These embeddings are then stored in a vector database so they can be queried later, essentially allowing the AI to pull the K most relevant code snippets for a given prompt and perform re-ranking. There is a lot of nuanced complexity in crafting a chunking and retrieval strategy, because one could generate embeddings for files, modules, logical blocks, or individual lines. The more granular your embeddings, the more fine-grained but also the more myopic your retrieval. Pinecone provides some useful guidelines for thinking about chunking strategies. At a high level, you can think about chunking using a few different approaches (a minimal sketch of the index-and-retrieve flow follows the list below):

  • Size/length: embedding a consistent number of characters or lines per document
  • Structure-based: embedding based on blocks (if/else statements, functions, for-loops), modules, or classes
  • File-based: embedding entire files
  • Component-based: embedding based on discrete components that work together in a logical unit, such as front-end components (auth, feed, profile page, etc.) or microservices
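
Here is a minimal, in-memory sketch of the size-based variant of this flow: chunk, embed, store, then pull the top-k chunks for a prompt. The `embed` callable is a placeholder for whatever embedding model you use, and a real system would store the vectors in a vector database and add re-ranking.

```python
import math
from typing import Callable, List, Tuple

Embedding = List[float]

def cosine(a: Embedding, b: Embedding) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class CodeIndex:
    """Toy vector index: size-based chunking, embedding, and top-k retrieval."""

    def __init__(self, embed: Callable[[str], Embedding], lines_per_chunk: int = 40):
        self.embed = embed                      # placeholder embedding client
        self.lines_per_chunk = lines_per_chunk
        self.entries: List[Tuple[str, Embedding]] = []

    def add_file(self, path: str, source: str) -> None:
        lines = source.splitlines()
        for start in range(0, len(lines), self.lines_per_chunk):
            chunk = "\n".join(lines[start:start + self.lines_per_chunk])
            # Keep file/line provenance with the chunk so retrieved context
            # can be cited or located later.
            self.entries.append((f"{path}:{start + 1}\n{chunk}", self.embed(chunk)))

    def query(self, prompt: str, k: int = 5) -> List[str]:
        query_vec = self.embed(prompt)
        ranked = sorted(self.entries, key=lambda e: cosine(query_vec, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```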

If you’re interested in a concrete example of how chunking can be done, take a look at this blog post from Sweep.dev, which talks about how tree-sitter can be used to recursively chunk codebases based on the abstract syntax tree. Updating or refreshing this index can potentially be a drag for large codebases, but Cursor has done some interesting work leveraging Merkle trees to efficiently update their index.
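
The Sweep post works with tree-sitter so it can chunk any language; as a simpler, Python-only illustration of the same structure-based idea, the sketch below chunks a file by its top-level definitions using the standard library’s ast module.

```python
import ast
from typing import List

def chunk_python_source(source: str, max_lines: int = 60) -> List[str]:
    """Structure-based chunking: one chunk per top-level statement (function,
    class, import block), with oversized definitions split by line count."""
    lines = source.splitlines()
    chunks: List[str] = []
    for node in ast.parse(source).body:
        start, end = node.lineno - 1, node.end_lineno
        for block_start in range(start, end, max_lines):
            chunks.append("\n".join(lines[block_start:min(block_start + max_lines, end)]))
    return chunks
```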

In addition to code, you can also think about embedding other kinds of files, like documentation, product specs, and release notes. A knowledge graph that captures semantic relationships between architectural concepts can be particularly helpful for this kind of material. The best strategy, though, might be to combine multiple indexing strategies for different granularities of search or retrieval. A component-level vector DB might be helpful for understanding how broad parts of the system function at a high level, whereas a block-level embeddings database might be helpful for identifying targets for code mutations. If you expose these as tools to an agent, it may be able to dynamically reason about which database provides the most relevant context depending on the current state of its exploration.

One way to augment your chunking strategy is to pair it with a non-AI-based map of the codebase, including files, functions, dependencies, and call graphs. This context can be very valuable if you need to provide the AI with a list of libraries or functions available in the codebase, pull additional relevant or dependent code into the context window even if it didn’t appear in a vector similarity search, or locate and mutate files in the repo.
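
As a small illustration of what such a map can contain, the sketch below extracts a function-to-callees mapping for a single Python file with the ast module; real systems typically also track imports, file paths, and cross-file references.

```python
import ast
from typing import Dict, Set

def call_map(source: str) -> Dict[str, Set[str]]:
    """Map each top-level function to the simple names it calls."""
    mapping: Dict[str, Set[str]] = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            mapping[node.name] = {
                call.func.id
                for call in ast.walk(node)
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
            }
    return mapping
```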

Validation and assurance

Success for most AI developer tools depends on their ability to contribute code that gets accepted into their customers’ repositories. Trust in the tool is a critical prerequisite for this, and it is greatly improved when the product can make assurances about its code’s safety, functionality, performance, and accuracy. There are several techniques companies employ to this end.

Linters and static analyzers

One of the most basic steps an AI tool can take before proposing a pull request is to run a static analyzer or linter on the code. These tools help check syntax, style, security issues, memory leaks, and more. The AI can be prompted to use the linter as a tool for self-reflection, iterating until the linter no longer complains.
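
A hedged sketch of that self-reflection loop is below: write the candidate code to a temporary file, run a linter on it (`ruff check` is shown, but any linter that prints findings and returns a non-zero exit code works), and feed the findings back to a placeholder `complete` client.

```python
import subprocess
import tempfile
from typing import Callable, Sequence

def lint_and_fix(
    complete: Callable[[str], str],                 # placeholder LLM client
    code: str,
    linter_cmd: Sequence[str] = ("ruff", "check"),  # any linter CLI works here
    max_rounds: int = 3,
) -> str:
    """Iterate until the linter stops complaining or the round budget runs out."""
    for _ in range(max_rounds):
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run([*linter_cmd, path], capture_output=True, text=True)
        if result.returncode == 0:
            return code  # linter is satisfied
        code = complete(
            "Fix the linter findings below and return the full corrected file.\n"
            f"Findings:\n{result.stdout}{result.stderr}\n\nFile:\n{code}"
        )
    return code  # best effort after max_rounds
```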

Testing

Comprehensive test coverage has long been a core acceptance criterion for development projects. Hence, it would make sense to require that any time AI modifies the codebase, it ensures all existing tests pass and proposes additional tests covering any new functionality being introduced. While this is great in theory, there can be complications and limitations in practice.

First, there may not be an existing regression test suite for the part of the code you are touching. This could be due to poor development practices and hygiene, gaps in current test coverage, a completely greenfield feature, or the need to rewrite the entire existing test suite in a new language or framework (for example, during migration tasks). In these cases, the AI may first need to generate a test suite that accurately captures the intended behavior of the system, a difficult problem known as specification inference.

Second, test coverage itself is an expensive way of gaining a limited view into the correctness of your code. Test suites are not free to run, and metrics like code coverage alone are not enough to guarantee that the code has been thoroughly tested or is 100% functionally correct.

Finally, if AI is asked to respond to a failing test, it must generate a patch that provides a general solution to the bug instead of a hacky workaround for the particular test.
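
Mechanically, the gate itself is straightforward; the complications above are about what to do when it cannot be applied cleanly. A minimal sketch, assuming pytest as the runner, returns both the pass/fail signal and the output so a failing run can be fed back to the model for another attempt.

```python
import subprocess
from typing import Sequence, Tuple

def run_regression_suite(test_cmd: Sequence[str] = ("pytest", "-q")) -> Tuple[bool, str]:
    """Run the project's test suite; the caller can block the PR on failure or
    hand the failure output back to the model for another patch attempt."""
    result = subprocess.run(list(test_cmd), capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr
```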

Formal methods

Another less explored avenue for software validation is to employ formal methods like Model Checking, which provide strong logical guarantees on the behavior of finite state machines. Although these techniques provide more completeness and rigor, they become computationally very expensive and challenging to implement as your state machine gets larger. Hence, it may be impractical to use this technique (without significant levels of abstraction) for large codebases.

Human feedback

The last line of defense against bad AI-generated code making its way into a repo is human review and feedback. As discussed in the previous section, AI tools can receive feedback through PR comments if taking a solo programming approach or in-line if leveraging pair programming. In some cases, a sandbox environment will need to be spun up to allow humans to click through applications end-to-end.

Conclusion

In this piece, we covered some of the core technical design decisions, tradeoffs, and challenges that product-builders in this space will face. One way to summarize these learnings is to think in terms of two guiding questions:

#1: How much do you want humans to be in the loop?

Do you want to adopt the solo programming or pair programming model? If using an agentic approach, will humans or AI be doing the planning? How much human attention is needed to validate the code?

#2: How will you ensure the reliability and accuracy of your system?

Do you need to build deterministic codemods, or are your AI guardrails and validation enough to ensure accuracy? If you need to rely on deterministic methods, does this limit your breadth of use cases or scalability? Is an agentic architecture worth the added complexity? How does your ability to index or preprocess the codebase lead to better results?

Although the technical design principles discussed in this post form some of the building blocks of great products, they alone are not enough to create a big company, as market dynamics play a big role in the outcome of any startup. For that reason, we shared a companion guide for founder CEOs that covers how to form a rigorous hypothesis about whether an idea or problem statement can become the basis for a large, independent company, and how to craft a business model to bring that product to market.

If you’re building in this space, or would like to discuss more about some of the ideas in this post, we would love to hear from you. Reach out to us at diyer [at] innovationendeavors [dot] com

We would like to thank the many founders and experts who have helped us in this research, in particular: Momentic, Goast, CamelQA, PolyAPI, Nova, Flytrap, Second and Mutable, and Davis Treybig.

Cover photo by James Harrison on Unsplash
