Generative AI Prompt Engineering: A Balancing Act

Lexxi Reddington
Slalom Build
May 21, 2024

How to improve your prompts for higher-quality outputs


As the popularity of generative artificial intelligence (GenAI) has exploded, so too have the examples of popular GenAI models producing rather unexpected responses. In one well-known case, a lawyer relied on a GenAI tool to prepare a legal brief and found themselves in hot water when the brief cited case law that did not actually exist. In another, tweets were used as training data for a Microsoft GenAI chatbot, which quickly began returning racist, anti-Semitic, and misogynistic responses.

Clearly, GenAI is in its infancy, and we must exercise great caution to avoid these kinds of AI disasters. To achieve more desirable results, prompt engineering can be a powerful tool.

What Really is Prompt Engineering?

Prompt engineering is a process by which humans guide GenAI models by providing specific instructions (input text) that enable these models to produce higher quality and more relevant responses.

For example, when using a GenAI solution to learn about healthy food, one might ask, “What foods are healthy?” The response may be helpful but will likely be generic and lengthy. In this case, the output might be a list of foods with different nutritional benefits. Prompt engineering can help to improve on this.

By providing a more detailed and explicit input, the output can be tailored to be more relevant. Instead, one might ask, “How do you determine if a food is healthy?” Here, the response may be more detailed with information about nutrient density, fiber content, and processed ingredients. This shows how the process of creatively changing the inputs with trial and error can yield different outputs from the same model.

In the context of this article, I am specifically referring to prompt template engineering, where one can define and compose AI functions using plain text. An example of this is Jinja, a templating engine whose templates are written in plain text with placeholders for prompts and support a syntax similar to Python. This is especially useful for integrating natural language prompts for GenAI into typical code infrastructure. That means the software development lifecycle still applies: prompt templates can be kept under version control, undergo code review, and receive more robust testing. It is a powerful way to add new GenAI techniques to the already familiar software development process.

Using Jinja and the example of determining healthy foods, it could look like this:

A Jinja template example
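
Here is a minimal sketch, assuming the Jinja2 Python package; the variable names (audience, food) and the surrounding rendering code are illustrative rather than taken from a real project:

    # pip install jinja2
    from jinja2 import Template

    # The prompt lives in plain text with {{ ... }} placeholders, so it can be
    # version-controlled and reviewed like any other source file.
    PROMPT_TEMPLATE = Template(
        "How do you determine if a food is healthy? "
        "Explain the answer for a {{ audience }} audience, "
        "and use {{ food }} as a concrete example."
    )

    # Rendering fills in the placeholders before the text is sent to the model.
    prompt = PROMPT_TEMPLATE.render(audience="general", food="oatmeal")
    print(prompt)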

In the realm of software development, prompt engineering serves as a crucial bridge between human guidance and the capabilities of GenAI models.

Understanding how prompt engineering operates is essential for harnessing the full potential of these models. Take, for example, the integration of template engines like Jinja2 into the prompt engineering process. By using such tools, developers can seamlessly embed natural language prompts into their code infrastructure, ensuring that the software development lifecycle, including version control, code review, and rigorous testing, remains intact.
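
As a hedged illustration of what that lifecycle can look like, a prompt template kept in the repository can be covered by an ordinary unit test. The file name, template wording, and assertions below are hypothetical:

    # test_prompt_templates.py -- run with: pytest
    from jinja2 import Template

    # In a real project this template would live in its own versioned file.
    SIMPLIFY_TEMPLATE = Template(
        "Given the following text, rewrite it for a {{ audience }} audience. "
        "Maintain the original format.\n\n{{ text }}"
    )

    def test_template_includes_audience_and_text():
        rendered = SIMPLIFY_TEMPLATE.render(audience="general", text="Some article body.")
        # Cheap regression checks that catch accidental edits to the template.
        assert "general audience" in rendered
        assert "Some article body." in rendered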

This approach not only acknowledges the tangible work involved in prompt engineering but also underscores its significance in the broader context of software development practices. This is because prompt engineering and template utilization extend beyond mere text manipulation (or just typing and pasting into ChatGPT, for example); they serve as foundational components for building robust software products that leverage large language models and GenAI effectively.

For instance, within chatbot development, templates facilitate the creation of predefined responses based on specific user inputs, ensuring consistent and contextually relevant interactions. Similarly, in content generation applications, prompt engineering guides the AI model to produce tailored outputs by providing structured prompts that encapsulate the desired content parameters and stylistic preferences. These practices not only streamline the development process but also enhance the quality and coherence of the final product.

By leveraging prompt engineering methodologies, software developers can navigate the complexities of integrating AI technologies seamlessly into their applications, ultimately delivering more personalized and engaging user experiences. In essence, prompt engineering acts as a linchpin for translating user requirements into actionable instructions for AI models, thereby fostering innovation and driving tangible value creation in software development endeavors.

However, prompt engineering is not always so straightforward.

I experienced this firsthand during the 2023 Slalom Build Hackathon. Our team developed a product called Easy Reader, a Chrome extension that allows users of various levels of literacy to simplify text on a webpage using GenAI. The product vision was to create an easy-to-use Chrome extension that processes text and images on a website through Azure OpenAI to simplify it based on a pre-defined set of complexity levels. This vision was guided by our mission to make websites more accessible to audiences who might otherwise have been unable to understand the information on a page due to content or language complexities.

Users self-identify with an audience to get more targeted prompts and relevant results. The audience options are:

  • Grade School: Kids or teenagers who want to make content easier according to their expected grade level.
  • General Population: Adults who are looking to simplify content into terms that most can understand.
  • Learning English: People who are learning to speak English as an additional language.
  • Professional: People looking to summarize and simplify complex text but keep advanced words or industry jargon.

In the initial stages of development, once we enabled the Chrome extension to process text and images, we passed along the processed data to OpenAI with a prompt like:

“Given this text body, simplify the information for professional people looking to summarize and simplify complex text but keep advanced words or industry jargon.”
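
For context, the call itself can be as simple as the sketch below. This is not the actual Easy Reader code; it assumes the openai Python SDK (v1+) configured against an Azure OpenAI deployment, and the environment variable names, API version, and deployment name are placeholders:

    import os
    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-02-01",
    )

    def simplify_for_professionals(text_body: str) -> str:
        # The professional-audience prompt quoted above, with the page text appended.
        prompt = (
            "Given this text body, simplify the information for professional people "
            "looking to summarize and simplify complex text but keep advanced words "
            f"or industry jargon.\n\n{text_body}"
        )
        response = client.chat.completions.create(
            model="gpt-35-turbo",  # placeholder Azure deployment name
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content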

This prompt actually worked rather well for simplifying materials for the professional audience. The outputs were reasonable and little tweaking was required on our end to improve the summaries for this specific audience; however, the same cannot be said for the other three audience options…

Several Pitfalls of Prompt Engineering

For clarity, Easy Reader was created as a proof of concept, so testing began as purely manual. In the future, it would be ideal to greatly expand on our tests, especially in the realm of automated regression testing in a CI/CD pipeline. Unfortunately, for now, testing the prompts and responses returned for Easy Reader means manual testing by our developers.

For the Learning English audience option, we began by passing in a simple prompt like:

“Given this text body, simplify the information for people who are learning to speak English as an additional language.”

When we compared the original articles to the summaries output by OpenAI, we noticed little difference. The articles were often not summarized, or if they were, they were not useful to people learning English because they retained complex words and syntax structures. The problem here is that GenAI has contextual understanding limitations.

GenAI uses data-driven algorithms to recognize patterns within its training datasets, which allows it to produce new data based on those recognized patterns. This results in difficulties understanding the context around scenarios outside of the training parameters because GenAI lacks the ability to genuinely understand concepts. Currently, this is normal: it should be expected that these systems struggle with nuanced or context-dependent queries and may provide answers based on superficial patterns in the data that humans wouldn't expect.

To mitigate this deficiency, the prompts should provide their own context to inform the desired responses. For example, to improve the prompt for the Learning English audience option, we needed to provide context around what it means to be a non-native English speaker and what modifications are helpful to them in understanding English materials. We found the best output by changing the prompt to:

“Given the following text, summarize and rewrite it so that someone who does not know English as their first language can understand it. Match the text’s complexity to the C1 CEFR level. Maintain the original format.”

CEFR, short for Common European Framework of Reference for Languages, is an international benchmark for assessing language proficiency. The CEFR organizes language proficiency into six levels: A1, A2, B1, B2, C1, C2. The two A levels are classified as Basic User; the two B levels are classified as Independent User; the two C levels are classified as Proficient User. By requesting that the text complexity of the returned summary match a specific, pre-existing standard, OpenAI was able to successfully reference the standard and produce better summaries for people learning English.

We encountered another GenAI pitfall with the General Population option. We began by passing in a simple prompt like:

“Given this text body, simplify the information for adults who are looking to simplify content into terms that most can understand.”

This frequently caused unexpected interpretations. Depending on our phrasing, we got wildly different results. For example, small differences such as using the word “people” instead of “adults” changed the summaries produced. Using “most can understand” instead of “someone in the general public can understand” also changed the summaries produced. Sometimes these summaries were simplified versions of the originals, sometimes not. Sometimes these summaries had extremely complex words left unchanged, sometimes simple words were changed.

Overall, in testing the prompts for the General Population option, it became clear that GenAI is quite sensitive to phrasing. This sensitivity is an artifact of using large language models as the backbone of GenAI to produce new data with characteristics similar to its training data. OpenAI's models are complex systems trained on massively broad datasets; in that data, words like "people" and "adults" often mean different things and cannot always be substituted for each other, so swapping one for the other in a prompt can completely change the output.

The solution here was trial and error. Different phrasings were tried and tested on various articles until we were satisfied with the outputs and saw consistent results. The process was time-consuming, but we ultimately settled on using a combination of the Federal Plain Language Guidelines and the Dale-Chall list. The Federal Plain Language Guidelines are the official guidelines for the Plain Writing Act of 2010 in the United States. This act requires government documents issued to the public to be written clearly, and the Federal Plain Language Guidelines establish what that means. The Dale-Chall list contains approximately 3,000 familiar words that at least 80% of fourth-grade American students are expected to understand. The resulting prompt was updated to be:

“Given the following text, summarize and rewrite it so that the general population can read it according to the Federal plain language guidelines. Maintain the original format. Replace any words not found in the Dale-Chall list. Reduce sentence complexity and length for readability.”

This combination of frameworks worked well to guide OpenAI in producing the summaries we were looking for — but it was only through trial and error that we discovered this. Other framework combinations did not work well, nor did using the Federal Plain Language Guidelines or the Dale-Chall list alone.

The final pitfall we encountered during our prompt engineering journey was with GenAI handling complex requests. We first attempted to summarize articles for the Grade School option using a prompt like:

“Given this text body, simplify the information for kids or teenagers who want to make content easier according to their expected grade level.”

The user could select their grade level, and the complexity of the summary produced would adjust accordingly. Unfortunately, the output was undesirable. We found that this audience option was the most difficult to prompt engineer because of its complexity. There were misunderstandings about what grade levels were, what language ability was expected at each level, and to what extent the summaries should be vague or verbose. The solution required providing context and using trial and error, as with the other audience options, while also describing the specific format we expected for the output. We modified the prompt to:

“Given the following text, rewrite it so that a 7th grader could read it and understand it according to Flesch-Kincaid level 7 to 8. Try to match the original format.”

Flesch-Kincaid Grade Level is a readability metric that scores a text based on the length of the words and sentences it uses. We also told OpenAI to respond as though it were role playing as a teacher in a classroom who is trying to explain course materials to students. This context, mixed with specificity on the format of the responses, achieved better summaries than our original prompt.
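
Putting the four audiences together, the finished prompts can be kept side by side as plain data that the extension selects from at run time. The sketch below simply collects the prompts quoted in this article into a dictionary; the key names, structure, and exact wording of the teacher system message are illustrative:

    # Hypothetical mapping from the Easy Reader audience options to their prompts.
    AUDIENCE_PROMPTS = {
        "professional": (
            "Given this text body, simplify the information for professional people "
            "looking to summarize and simplify complex text but keep advanced words "
            "or industry jargon."
        ),
        "learning_english": (
            "Given the following text, summarize and rewrite it so that someone who "
            "does not know English as their first language can understand it. Match "
            "the text's complexity to the C1 CEFR level. Maintain the original format."
        ),
        "general_population": (
            "Given the following text, summarize and rewrite it so that the general "
            "population can read it according to the Federal plain language guidelines. "
            "Maintain the original format. Replace any words not found in the Dale-Chall "
            "list. Reduce sentence complexity and length for readability."
        ),
        "grade_school": (
            "Given the following text, rewrite it so that a 7th grader could read it and "
            "understand it according to Flesch-Kincaid level 7 to 8. Try to match the "
            "original format."
        ),
    }

    # The Grade School option also benefits from a role-playing system message.
    TEACHER_SYSTEM_MESSAGE = (
        "You are a teacher in a classroom explaining course materials to your students."
    )

Keeping the prompts in one place like this also makes them easy to version, review, and test alongside the rest of the codebase.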

Advice for Other GenAI Developers

Developing with GenAI systems has highlighted the importance of balance in drafting prompts. Prompts that are too simplistic or overly complicated can reduce the quality of the outputs generated. From our Hackathon experience, we saw contextual understanding limitations, sensitivity to phrasing, and difficulty understanding complex requests. To improve the prompts:

1. Provide the Prompt Context

To clarify the prompt’s meaning, use existing standards for reference and provide examples for which the GenAI system will already have data.

Various tools are also available to assist in mitigating GenAI risks and improving prompts. For example, LlamaIndex is a data framework for large language model (LLM) based applications that require context augmentation. LlamaIndex offers abstractions to process domain-specific or private data for use in LLMs to produce better-generated outputs. This tool can provide additional context from desired data sources to GenAI systems (even if the system in question has no prior awareness of said data sources).
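
A minimal sketch of that kind of context augmentation, assuming the llama-index package (module paths vary between versions) and an OpenAI API key configured in the environment; the folder name and query are placeholders:

    # pip install llama-index
    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

    # Load domain-specific documents (e.g., internal style guides) from a local folder.
    documents = SimpleDirectoryReader("./reference_docs").load_data()

    # Build an index over the documents and query it; by default this calls
    # OpenAI models under the hood to answer with the loaded context.
    index = VectorStoreIndex.from_documents(documents)
    query_engine = index.as_query_engine()
    response = query_engine.query("What does the C1 CEFR level mean for vocabulary?")
    print(response)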

2. Test! Test! Test!

Things that end up working might not be what humans always expect first, so don’t hesitate to test, revise, and repeat until you’re happy.
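
One lightweight way to make that loop repeatable is to score the model's outputs against the same readability standards referenced in the prompts. The sketch below assumes the textstat package and a hypothetical get_summary() helper that wraps the model call:

    # pip install textstat
    import textstat

    def check_grade_school_summary(original_text: str) -> None:
        # get_summary() is a hypothetical helper that calls the model with the
        # Grade School prompt and returns the generated summary.
        summary = get_summary(original_text, audience="grade_school")
        grade = textstat.flesch_kincaid_grade(summary)
        # Flag outputs that drift above the 7th-8th grade target range.
        assert grade <= 8.5, f"Summary reads at grade {grade}, above the target range"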

Another useful tool that can assist in this trial-and-error prompt testing process is PromptPerfect. PromptPerfect is a plugin that works with text generation models to improve prompt quality and output consistency: you provide a prompt and adjust settings such as prompt length, output quality, and number of iterations. Within those constraints, PromptPerfect produces refined prompts that can be edited further, which speeds up the prompt engineering process.

3. Describe the Output’s Expectations

It may not always be clear to the GenAI system what type of response is expected, so detailing the desired format and length of the output, especially based on pre-existing frameworks, is helpful.

One example of a useful tool to help with this is Amazon Comprehend, a natural language processing service that provides sentiment analysis, language detection, topic modeling, and more. This tool can also be used for toxicity classification, which enables prompts to be examined for harmful content and helps prevent the generated output from containing undesirable or unexpected language.
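
As a rough sketch of how such a check might look, assuming boto3 with AWS credentials configured and access to Comprehend's toxicity detection API; the threshold is an arbitrary example:

    import boto3

    comprehend = boto3.client("comprehend", region_name="us-east-1")

    def is_output_safe(generated_text: str, threshold: float = 0.5) -> bool:
        # Each text segment receives a toxicity score between 0 and 1.
        result = comprehend.detect_toxic_content(
            TextSegments=[{"Text": generated_text}],
            LanguageCode="en",
        )
        return all(segment["Toxicity"] < threshold for segment in result["ResultList"])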

In addition to Amazon Comprehend, there’s another layer of defense available for securing LLM applications from potential risks and vulnerabilities: the concept of a “GenAI firewall.” Tools like Llama Guard or third-party LLM security vendors such as lakera.ai and arthur.ai specialize in safeguarding LLM applications against toxicity, jailbreaks, PII leakage, and other security threats.

While AWS services like Amazon Comprehend offer valuable features for evaluating outputs, dedicated LLM security tools provide tailored solutions specifically designed to fortify applications utilizing GenAI. Incorporating a GenAI firewall into the development pipeline adds an extra layer of protection, ensuring that AI-generated outputs adhere to necessary security protocols.

Overall, GenAI has exciting capabilities; avoiding a few common pitfalls and knowing how to tweak prompts with the aid of helpful tools can dramatically increase its utility.

Lexxi Reddington
Slalom Build

Lexxi is a senior software engineer at Slalom Build.