Evaluating and Improving GenAI-Based Products

Michał Koźmiński
Beekeeper Technology Blog
6 min read · Aug 29, 2024

You’re a software engineer, and your task is to write an article about prompt fine-tuning. That sentence is itself an example of a prompt you might use to generate an article similar to this one. In a real application, a prompt looks more like this:

Summarize the text by highlighting key points and main ideas in a concise manner. Prioritize clarity, include essential details and examples, and eliminate redundancy. Adjust the summary length based on the text’s complexity, ensuring a balanced and accurate overview. Use bullet points or numbered lists for clarity if needed.

In this article, I’ll explain our approach to evaluating the results returned by LLMs and the methods we use to test them and improve the user experience.

Adjusting prompts to yield the best results, often referred to as “prompt engineering,” involves a range of techniques for making prompts more effective. Among the most notable are “chain-of-thought” prompting and “X-shot” (few-shot) prompting. For a deeper dive into these strategies, I recommend further reading available here.

However, gauging the quality of a prompt’s output is not straightforward. We can’t check it with a simple assert(condition). So, how should we approach testing its efficacy?

LLM and Testing

To begin, let’s discuss common testing methodologies and why traditional procedures may prove challenging in this context. Typically, testing involves verifying if, given different inputs, the output satisfies our business assumptions, formulated as:

f(x) ⇒ y

In scenarios involving databases or other external factors over which we lack complete control, we create snapshots of the state. These elements might be integrated into the testing framework as follows:

f(x, state_of_the_world) ⇒ y

Here, the “state of the world” is predefined, allowing us to observe how the result changes depending on the external state. The system is still deterministic, making it easy to test.
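As a toy illustration of this kind of deterministic test (the function and snapshot below are invented purely for the example), consider:

import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.Map;
import org.junit.jupiter.api.Test;

class DeterministicFunctionTest {

    // f(x) => y: a pure function, fully determined by its input.
    static int wordCount(String text) {
        return text.isBlank() ? 0 : text.trim().split("\\s+").length;
    }

    // f(x, state_of_the_world) => y: the "state" is a snapshot we control in the test.
    static String greetingFor(String userId, Map<String, String> nameByUserId) {
        return "Hello, " + nameByUserId.getOrDefault(userId, "stranger") + "!";
    }

    @Test
    void outputIsFullyDeterminedByInputAndSnapshot() {
        assertEquals(3, wordCount("prompt engineering rocks"));

        Map<String, String> snapshot = Map.of("u-1", "Michał");
        assertEquals("Hello, Michał!", greetingFor("u-1", snapshot));
        assertEquals("Hello, stranger!", greetingFor("u-2", snapshot));
    }
}

Given the same input and the same snapshot, these assertions pass every single time.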

With Large Language Models (LLMs), we don’t have the luxury of testing every conceivable edge case. We also encounter two additional variables: temperature and top_p.

f(x, temperature, top_p) ⇒ y

The temperature setting, which can be any number but is typically restricted to a low range like 0–2, plays a crucial role. A temperature of 0 leads to more deterministic model behavior, ensuring consistent responses to identical inputs. In contrast, a higher temperature value encourages creativity, making less predictable outcomes more likely by flattening the probability distribution of potential responses.

Top_p, also known as nucleus sampling, restricts sampling to the smallest set of tokens whose cumulative probability exceeds p. A higher top_p value lets the model sample from a larger pool of candidate tokens, making the output more creative and less predictable.

Graphic representation of Top P and Temperature
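Both knobs are simply request parameters. Here is a minimal sketch of setting them on a raw OpenAI chat-completions call (the model name and prompt are placeholders, and any HTTP client would do):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SamplingParamsExample {
    public static void main(String[] args) throws Exception {
        // temperature and top_p travel with every request; the same prompt can
        // yield very different outputs depending on these two knobs.
        String body = """
            {
              "model": "gpt-4o-mini",
              "messages": [{"role": "user", "content": "Summarize: ..."}],
              "temperature": 0.0,
              "top_p": 1.0
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://api.openai.com/v1/chat/completions"))
            .header("Authorization", "Bearer " + System.getenv("OPENAI_API_KEY"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}

With temperature at 0 and top_p at 1.0 the output is close to deterministic; raising either value widens the pool of tokens the model will actually sample from.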

Starting with GenAI test automation

The first need for automating LLM testing arose from our unit tests. We started by mocking the LLM adapter and checking the entire workflow with a fixed response from the AI. This allowed us to test all layers, from the REST controller down to the LLM adapter. As you can imagine, though, this left us with blind spots further down our processing pipeline.
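That first, mock-based stage looked roughly like this (the adapter interface and service below are illustrative stand-ins, not our production classes):

import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

class SummaryWorkflowTest {

    // Hypothetical stand-ins for the real LLM adapter and the service built on top of it.
    interface LlmAdapter { String complete(String prompt); }

    static class SummaryService {
        private final LlmAdapter llm;
        SummaryService(LlmAdapter llm) { this.llm = llm; }
        String summarise(String text) { return llm.complete("Summarize: " + text); }
    }

    @Test
    void workflowRunsEndToEndWithStubbedAdapter() {
        // The adapter is replaced with a fixed response, so every layer above it
        // can be exercised without calling a real model.
        LlmAdapter stub = prompt -> "Fixed canned summary.";
        SummaryService service = new SummaryService(stub);

        assertTrue(service.summarise("Long meeting notes ...").contains("summary"));
    }
}

Every layer above the adapter is exercised, but nothing tells us whether the real model’s output is any good.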

So how did we solve this problem? The answer came from NeMo Guardrails, which uses LLMs to evaluate the inputs or outputs of a chat. With this approach, we could start testing whether the actual evaluation of our prompt and context yielded satisfactory results. Using the following prompt, we evaluated whether the summarization response contained all necessary facts, allowing us to add it to our project’s test suite and remove the blind spots:

You are given a task to identify if the hypothesis is grounded and entailed to the evidence.
You will only use the contents of the evidence and not rely on external knowledge.
Answer with yes/no.
"evidence": {{ evidence }}
"hypothesis": {{ response }}
"entails":

This prompt was used inside a JUnit test to determine whether the prompt and model we chose returned valuable results. The resulting test looks like this:

@Test
void createdSummaryContainsAllTheDetails() {
    // Summarize with the real prompt and model under test.
    LLMServiceResponse response = summaryService.summariseText(TEXT_TO_SUMMARY);

    // Ask the entailment prompt above whether the summary is grounded in the original text.
    boolean passesCheck = llmValidateHelper.checkSummary(response.text, TEXT_TO_SUMMARY);
    assertTrue(passesCheck);
}
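For completeness, here is a rough sketch of what a helper like llmValidateHelper.checkSummary could do under the hood: render the entailment prompt shown above and interpret the model’s yes/no answer. In our project this evaluation is delegated to NeMo Guardrails; the adapter interface and parsing below are illustrative assumptions.

import java.util.Locale;

class LlmValidateHelper {

    interface LlmAdapter { String complete(String prompt); }

    private final LlmAdapter llm;

    LlmValidateHelper(LlmAdapter llm) { this.llm = llm; }

    // Returns true when the model judges the summary to be entailed by the original text.
    boolean checkSummary(String summary, String originalText) {
        String prompt = """
            You are given a task to identify if the hypothesis is grounded and entailed to the evidence.
            You will only use the contents of the evidence and not rely on external knowledge.
            Answer with yes/no.
            "evidence": %s
            "hypothesis": %s
            "entails":
            """.formatted(originalText, summary);

        String answer = llm.complete(prompt);
        return answer.trim().toLowerCase(Locale.ROOT).startsWith("yes");
    }
}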

Further LLM response evaluation

Evaluating features based on large language models (LLMs) often involves collecting user feedback through simple indicators such as thumbs up or thumbs down. At Beekeeper, our approach to prompt optimization was guided by several key assumptions:

  1. Maximizing automation;
  2. Compatibility with multiple LLMs (for us, OpenAI GPT, Claude, and Mistral);
  3. Incorporation of user feedback into the refinement process.

Our strategy revolved around evolving the initial prompt based on customer feedback, aiming for an automated process that could enhance user experience at scale. This involved analyzing responses beyond simple approval or disapproval; we wanted to understand what users felt was missing.

Free-text feedback lets us better understand what people didn’t like about the response they received. Perhaps we missed something important in the summary, or some tasks were assigned to the wrong person. Binary yes/no feedback alone would lose that detail.

This feedback was then stored in a central database, allowing us to refine responses through a combination of automated processing using LLMs and human verification, ensuring that the feedback accurately reflected user needs. The general workflow oscillated between automated and manual stages, underscoring the importance of both in achieving optimal outcomes.
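The shape of a single stored feedback record might look something like this (the field names are assumptions for illustration, not our actual schema):

import java.time.Instant;

// Illustrative shape of one stored feedback record.
record PromptFeedback(
        String feature,          // e.g. "chat summary"
        String promptVersion,    // which prompt produced the response
        String model,            // OpenAI GPT, Claude, or Mistral
        boolean thumbsUp,        // the binary signal
        String freeTextComment,  // e.g. "the summary missed the agreed deadline"
        Instant createdAt) {
}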

This method, while effective, is not foolproof and can generate false positives. To mitigate this, we ensured our data pool was sufficiently large, reducing the impact of outliers on our overall results.

In the next post in this series, we will look at exactly how we address these challenges.

Journey Towards Prompt Optimization

Understanding what constitutes a “good” response paves the way for automated prompt evaluation. The challenge lies in building a mechanism that can impartially assess prompt quality, particularly for tasks like summarization, which require the model to focus on the most pertinent information. What counts as pertinent varies widely with the use case and context, so quality judgments are easily swayed by the personal preferences of the user the output was presented to. So how do we handle that?

Our approach began with establishing a clean dataset, further enriched by user-contributed data. Each LLM feature was then specified in terms of expected outcomes, allowing us to formulate heuristics for assessment. For instance, a summary should not exceed a certain length relative to the original text and must include specific details relevant to the target audience.

Creating an “ideal” response involved both LLM-generated drafts and manual refinement to avoid biasing the process toward the model’s intrinsic tendencies. This balanced approach facilitated the development of a set of heuristics, such as maintaining a 500-character limit for summaries with a 5-to-1 compression ratio. These rules could be implemented using straightforward Python scripts, demonstrating their practical applicability.
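A minimal sketch of such a heuristic check, shown in Java for consistency with the earlier test example (the thresholds are the ones named above; everything else is illustrative):

class SummaryHeuristics {

    // A summary should stay under 500 characters and compress the original by at least 5:1.
    static final int MAX_SUMMARY_LENGTH = 500;
    static final double MIN_COMPRESSION_RATIO = 5.0;

    static boolean passes(String original, String summary) {
        boolean shortEnough = summary.length() <= MAX_SUMMARY_LENGTH;
        boolean compressedEnough =
                (double) original.length() / Math.max(summary.length(), 1) >= MIN_COMPRESSION_RATIO;
        return shortEnough && compressedEnough;
    }
}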

Once established, these heuristics served as the basis for evaluating responses, allowing us to score them and compare the outcomes across different models and prompts. This comparison shed light on the relative quality of the responses, informing further optimization efforts.
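Building on the heuristic sketch above, the comparison itself can be as simple as computing a pass rate per model-and-prompt combination (EvalCase and the grouping key are illustrative):

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class PromptComparison {

    // One evaluated response: which model and prompt produced which summary of which text.
    record EvalCase(String model, String promptVersion, String original, String summary) {}

    // Fraction of responses passing the heuristics, keyed by "model / promptVersion".
    static Map<String, Double> passRateByVariant(List<EvalCase> cases) {
        return cases.stream().collect(Collectors.groupingBy(
                c -> c.model() + " / " + c.promptVersion(),
                Collectors.averagingDouble(
                        c -> SummaryHeuristics.passes(c.original(), c.summary()) ? 1.0 : 0.0)));
    }
}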

What’s next: Automated prompt generation

Having established a feedback loop and generated synthetic data for comparison, the next challenge lies in automating the creation of prompts. This step promises to streamline the optimization process, reducing reliance on manual intervention and facilitating more efficient experimentation.

The journey toward automated prompt engineering is a testament to the evolving interaction between humans and AI, highlighting the potential for continuous improvement in our quest for optimal language model performance. The insights gained from this process not only enhance our understanding of prompt design but also pave the way for future advancements in AI-driven communication solutions.
