The Art of Validating Non-deterministic AI Responses

Navigating the uncharted waters of large language model (LLM) testing

Swetha Veluri
Slalom Build
7 min read · Aug 27, 2024


This article is co-authored by Maithili Dilip Rane and Swetha Veluri.

As quality engineers, we build quality into a product. In traditional software testing, outputs are predictable and we are familiar with the tools and technologies to use. But in today’s fast-paced digital landscape, generative AI (GenAI), particularly large language models (LLMs), is becoming integral to various industries. From customer service bots to automated content creation, the applications are vast and transformative. This surge in GenAI use brings forth unique challenges, especially for those of us tasked with ensuring that these AI systems perform reliably and securely.

Quality Engineering Meets GenAI

When we were assigned to our first GenAI project, it was exciting and challenging, as it presented an opportunity to learn something new! The first step was to look at the application under test: an LLM-powered chatbot with front-end and back-end components. The back end interfaced with an LLM (GPT-3.5 Turbo, an Azure OpenAI Service model). In any LLM-powered application, content is generated dynamically, so response validation is not a straightforward pass-or-fail exercise. Furthermore, the content generated depends on the context of the conversation. This means that even with the same input, the output can vary depending on subtle factors like phrasing or prior interactions.

Because of these dynamic, non-deterministic responses, UI and API validations alone weren’t sufficient for this application. We needed to add a layer of validations and success metrics to evaluate the responses generated by the LLM, since the primary challenge with any LLM is its inherent unpredictability. This brought up two important questions:

What does this new layer of testing look like?

Can we leverage existing tools and technologies to do these validations?

To answer these questions, let’s dive deeper into the approach. Typically, to test a GenAI chatbot, we would manually run a scenario through the chatbot and mark it as pass or fail based on the context of the response and background knowledge of the application under test.

But this approach has some drawbacks. A user can interact with this chatbot in multiple ways, which means there are numerous scenarios for testing. As a result, automation testing was the best option.

Creating a Custom Validation Framework

While building the automation framework, we figured out that the following approach worked best for us. These steps could be generally applied to any project where this kind of testing must be done. Like any other framework, this is an iterative process where we adapt as we learn. Let’s walk through the process.

To build our automation, we had to decide what tool or framework to use. We could either choose something available in the market or build something in-house. The tools available in the market were new and didn’t give us a tight integration with the technology stack of our application. Hence, we began designing and building our custom validation framework in C#, .NET, and xUnit, which fit better with our application. Using xUnit gave us the added advantage of writing simpler tests that can be executed in a CI/CD (continuous integration/continuous deployment) pipeline. This way, we could run tests on every code commit, and any issues with the LLM response would be caught immediately.
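
As an illustration, a scenario test in this style might look like the sketch below. ChatbotClient is a hypothetical wrapper around the application’s back end, not the actual framework code, and the assertion is only a placeholder for the evaluator checks described later.

using System.Threading.Tasks;
using Xunit;

public class AccidentClaimScenarioTests
{
    // ChatbotClient is a hypothetical wrapper around the chatbot's back-end API;
    // the real framework would call its own service layer here.
    private readonly ChatbotClient _chatbot = new ChatbotClient();

    [Fact]
    public async Task SafetyCheckTurn_ReturnsAResponse()
    {
        // Drive one turn of a scripted training scenario through the chatbot.
        string response = await _chatbot.SendAsync(
            "Can you confirm that you're in a safe location?");

        // Evaluator-based assertions (similarity, PII, sentiment, ...) slot in
        // here; those evaluators are described in the sections that follow.
        Assert.False(string.IsNullOrWhiteSpace(response));
    }
}

Because these are ordinary xUnit tests, they run on every code commit in the CI/CD pipeline just like the rest of the suite.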

After finalizing the framework, the next step was to decide which scenarios to automate. In our case, the chatbot would train customer service representatives to process insurance claims. Given the purpose of the chatbot, example scenarios include a representative being trained to process a claim after an accident, handle a policy renewal, and so on.

Designing Effective Evaluation Criteria

As we started writing our scenarios, we knew that the bot would not always give us the same response to a given question. This meant we had to figure out a new way to evaluate a given text. The most important question was: on what parameters do we determine whether a response is correct? Let’s consider a scenario for processing an insurance claim after an accident. Here’s a sample conversation:

Customer service representative input: Can you confirm that you’re in a safe location?

Chatbot response: I’m safe.

To evaluate the conversation, we need to understand whether the response is complete and accurate, whether it maintains the sentiment of the conversation, whether it includes any sensitive data that is not required, whether it is biased, whether the grammar is correct, and so on. These aspects become the parameters on which the conversation/response is evaluated to reach a pass/fail decision. Each parameter of success becomes an “evaluator” in our test: if we want to measure the similarity between two sentences, that becomes the Similarity evaluator. Different parameters may be required depending on the application. We came up with this mind map for evaluators (a minimal sketch of the evaluator shape follows the list):

  • Relevance: Evaluate if the AI responses are relevant to the user’s queries or statements. Irrelevant responses can lead to a poor user experience.
  • Bias & fairness: Check for biases in the AI’s responses and ensure fairness in its interactions with users. AI systems should treat all users impartially and avoid perpetuating stereotypes or discrimination.
  • Privacy & security: Ensure that the AI respects user privacy and handles sensitive information securely. It should adhere to relevant data protection regulations and best practices.
  • Coherence: Assess the coherence and logical flow of the AI responses. Responses should make sense in the context of the conversation and should follow a logical structure.
  • Similarity: Measure how similar the AI responses are to the ground truth or expected responses.
  • Sentiment: Measure if AI responses to a customer message are positive/neutral or negative.
  • Tone & politeness: Consider the tone and politeness of the AI responses. Responses should be respectful and appropriate for the context of the conversation.
  • Timeliness: Evaluate the timeliness of the AI responses. Responses should be provided in a reasonable time frame to maintain the flow of the conversation.
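
To make the “one parameter, one evaluator” idea concrete, here is a minimal sketch of a shape such evaluators could share. The interface and record below are illustrative assumptions, not the actual framework code:

using System.Threading.Tasks;

// A common shape for all evaluators: take the bot's response (plus any
// reference text) and return a normalized score between 0 and 1.
public interface IEvaluator
{
    string Name { get; }
    Task<double> ScoreAsync(string botResponse, string? expected = null);
}

// The pass/fail decision is the score compared against a per-evaluator threshold.
public sealed record EvaluationResult(string Evaluator, double Score, double Threshold)
{
    public bool Passed => Score >= Threshold;
}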

The principle behind the evaluator design is that it takes the output the bot generates as input and runs an LLM assessment using a prompt. The prompt is written to return a score or a binary result, depending on the type of check we want to make. In our case, we added another check, custom C# code that evaluates and scores the bot response, because the LLM used by our test framework was the same as the one our chatbot used. This additional check gave us confidence in the score and helped us mitigate the risk of using the same model for both the test framework and the application. Alternatively, a different model (such as GPT-4) can be used for the test framework to get more accurate results.
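
The sketch below illustrates that pattern, building on the IEvaluator shape above: the bot’s response is embedded in a judging prompt and the model is asked to reply with a number only. The IChatCompletionClient abstraction and the prompt wording are assumptions for illustration; they stand in for whichever Azure OpenAI client wrapper a given framework actually uses.

using System;
using System.Threading.Tasks;

// Stand-in for the actual LLM client wrapper used by the test framework.
public interface IChatCompletionClient
{
    Task<string> CompleteAsync(string prompt);
}

public sealed class RelevanceEvaluator : IEvaluator
{
    private readonly IChatCompletionClient _judge;

    public RelevanceEvaluator(IChatCompletionClient judge) => _judge = judge;

    public string Name => "Relevance";

    public async Task<double> ScoreAsync(string botResponse, string? expected = null)
    {
        // The prompt instructs the judging model to answer with a number only,
        // so its reply can be parsed directly into a score.
        string prompt =
            "On a scale from 0 to 1, how relevant is the following chatbot reply " +
            "to a customer-service conversation about insurance claims? " +
            $"Reply with the number only.\n\nReply: \"{botResponse}\"";

        string raw = await _judge.CompleteAsync(prompt);
        return double.TryParse(raw.Trim(), out double score)
            ? Math.Clamp(score, 0.0, 1.0)
            : 0.0;
    }
}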

Let’s look at some of the evaluators that were important in our case and why.

Security: We do not want any sensitive data to be accidentally shared. The evaluator parses the response for personally identifiable information (PII) and generates a score for each check. The minimum of these scores is then validated against the threshold.
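
As an example of the kind of check this can involve, the sketch below scans a response for a few common PII patterns with regular expressions and takes the minimum across checks. The patterns are simplified assumptions to illustrate the idea, not the full evaluator:

using System.Linq;
using System.Text.RegularExpressions;

public static class PiiChecks
{
    // Each check yields 0 when a PII pattern is found and 1 when it is not;
    // the minimum across all checks becomes the evaluator's score.
    private static readonly (string Name, Regex Pattern)[] Checks =
    {
        ("US SSN", new Regex(@"\b\d{3}-\d{2}-\d{4}\b")),
        ("Email",  new Regex(@"\b[\w.+-]+@[\w-]+\.[\w.]+\b")),
        ("Phone",  new Regex(@"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")),
    };

    public static double Score(string botResponse) =>
        Checks.Min(c => c.Pattern.IsMatch(botResponse) ? 0.0 : 1.0);
}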

Sentiment: When talking to a customer, it is important to maintain a positive or neutral tone. Thus, the sentiment evaluator takes the bot-generated response and checks whether the sentiment of a message is positive/neutral or negative.
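
A binary check like this could be phrased roughly as follows, reusing the hypothetical IChatCompletionClient abstraction from the earlier sketch; the prompt text is only illustrative.

using System.Threading.Tasks;

public sealed class SentimentEvaluator
{
    private readonly IChatCompletionClient _judge;

    public SentimentEvaluator(IChatCompletionClient judge) => _judge = judge;

    // Returns true when the response reads as positive or neutral, false when negative.
    public async Task<bool> IsAcceptableAsync(string botResponse)
    {
        string prompt =
            "Classify the sentiment of the following customer-service reply as " +
            "positive, neutral, or negative. Reply with one word only.\n\n" +
            $"\"{botResponse}\"";

        string label = (await _judge.CompleteAsync(prompt)).Trim().ToLowerInvariant();
        return label != "negative";
    }
}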

Similarity: We need to ensure that the response is contextually close to an expected output. The chatbot response is sent to the model, which evaluates the contextual similarity of the two texts and returns a score. A second check uses the FuzzySharp library to produce another score. The average of the two scores is computed and compared against a set threshold value.
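
Combining the two checks could look roughly like the sketch below; it assumes the LLM-judged score is already normalized to 0–1, and the 0.75 threshold is an illustrative placeholder rather than the value used in the project.

using FuzzySharp; // NuGet package for fuzzy string matching (Fuzz.Ratio)

public static class SimilarityScoring
{
    // Averages the LLM-judged contextual similarity (0-1) with a lexical
    // fuzzy-match score from FuzzySharp (0-100, scaled down to 0-1).
    public static bool Passes(string botResponse, string expected,
                              double llmScore, double threshold = 0.75)
    {
        double fuzzyScore = Fuzz.Ratio(botResponse, expected) / 100.0;
        double combined = (llmScore + fuzzyScore) / 2.0;
        return combined >= threshold;
    }
}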

The Future of Testing Frameworks

As LLMs continue to advance, our testing frameworks must evolve to keep pace. With the rapid development of GenAI-based applications, ensuring trust in these systems is crucial. Our innovative approach has enabled us to build automated tests that can validate non-deterministic outputs, a significant step forward in maintaining the reliability of AI-driven applications. An additional advantage is the seamless integration of this framework into the code promotion pipeline. The possibilities are endless, and this is just the beginning of our journey!
