Building with LLMs: Beyond prompt engineering

James Murdza
4 min readMay 19, 2023

--

Around the world, thousands of apps and services now depend on OpenAI’s APIs to run large language models such as GPT-3 and GPT-4. These apps use prompt engineering, which consists of combining and manipulating strings to provide to the LLM as input, as shown below:

Prompt engineering with a code sample. (example.py, example.ts)

This type of prompt engineering is the first step in developing LLM-based applications. It can produce impressive results, and there are many great resources on this topic, but if the goal is to get the best quality output, eventually prompt engineering gives diminishing returns. What other techniques can we use when we hit this wall?

LLM prompting paradigms

Improving output quality from LLMs can be achieved by making multiple prompts with an appropriate control flow. For our first example, say our goal is to make a program that writes an article or an essay. An initial prompt can be used to create the structure of the essay, and successive prompts can generate the essay in chunks.

Parallelization of LLM prompts

This example can produce more text than a single pass approach. Additionally, because prompts can be run in parallel, it is faster. If a first try does not result in the desired continuity between chapters, one can experiment with changes to the initial prompt that produces the outline.

Now let’s take more complex case. Say we want to complete a task such as writing a report which requires information on the internet. This task requires undefined number of steps such as using a search engine, clicking links, etc. An approach to this is to use the LLM to generate a list of tasks, and to carry out those tasks until the objective is achieved.

A simple LLM-based agent

As seen in both examples, through parallel requests and loops, LLMs can be used to assist in relatively complex tasks.

A word of caution: LLM safety

Because LLMs are single-input and single-output, there is currently no way to safely compartmentalize parts of a prompt. For example, if I ask an LLM to summarize an email, the LLM could be tricked into thinking that text in the email is actually a command from myself.

Even more dangerously—any commands, links, or URLs taken as output from LLMs could be also compromised. Therefore, we should always assume that prompt injections will be successful, URLs provided by LLMs could be tainted, and all code generated from LLMs must be run in a robust security sandbox.

Prompt injection in the wild

LLMs for code generation

The current de facto reference for code generation is HumanEval, a list of 163 Python test questions which created by OpenAI for the Codex model. The most important operating condition for evalulating any model with HumanEval is that the LLMs has not been trained on any of test questions. That’s why this dataset was created manually by OpenAI employees.

The chart below shows the HumanEval benchmarks using of a number of current code generation models:

Benchmarks of common models using the HumanEval framework

A key observation from this table is the large gap in benchmarks between open source and closed source models—a gap of over 30%! Not taken into account in this model is speed, which can vary greatly between models.

Regardless of the model, there are a number of techniques that have been shown to improve the quality of, such as:

  • Generate a solution, then look for potential issues and iterate on the solution.
  • Increase the model’s temperature setting while generating multiple solutions in parallel. Evaluate the results and pick the best one.
  • Use prompt engineering to nudge the behavior of the model, or to provide it with useful context for solving the problem.

In isolation, each of these improvements has been shown to increase performance considerably. The challenge is wrapping them together in a way that solves the big picture.

Using LLMs to automate software application development

All of the above leads to the challenge that I have been working on for the past several months—using an LLM to contextualize and write code within a larger codebase. In coming years, with the right combination of tools, a large amount of software development can be sped up by using LLM-based automation. If you’re interested to try out an early beta version of this tool, head to gitwit.dev!

--

--