It takes Generative AI to test Generative AI

Gurashish Brar
Transforming testing with Generative AI
10 min read · Jun 21, 2024

With the latest advancements in Generative AI, chatbots are becoming increasingly common. Almost every application now integrates a natural language interface, enabling users to interact with the application and its content seamlessly. However, while developing high-quality chatbots has become relatively straightforward, the testing infrastructure has not kept pace. Traditional testing methods are highly deterministic, relying on carefully crafted datasets with predictable responses. In contrast, Generative AI is inherently non-deterministic, making testing more challenging.

Traditional Evaluation Metrics

Traditional metrics such as BLEU, ROUGE, and Perplexity are commonly used to evaluate generative AI models:

  • BLEU (Bilingual Evaluation Understudy): Measures the n-gram overlap between generated text and reference text, providing a quantitative measure of similarity.
  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Focuses on the overlap of n-grams, particularly useful for evaluating summarization tasks.
  • Perplexity: Measures how well a probability model predicts a sample. Lower perplexity indicates better performance and greater fluency of the generated text.

While these metrics are standardized and provide a repeatable quantitative assessment, they have limitations. They often capture surface-level similarities and may not fully reflect the semantic meaning, coherence, or relevance of the generated content. Additionally, implementing these metrics requires significant preparation of reference datasets and technical expertise.
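To make that overhead concrete, here is a minimal sketch of what computing these metrics typically involves, assuming the nltk and rouge-score Python packages and a set of token log-probabilities for perplexity; the example texts and numbers are invented for illustration:

# Minimal sketch of computing BLEU, ROUGE-L, and perplexity for one response.
import math
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

reference = "a horse is a mammal with fur and mammary glands"
candidate = "yes, a horse is a mammal because it has fur and mammary glands"

# BLEU: n-gram precision of the candidate against one or more tokenized references.
bleu = sentence_bleu([reference.split()], candidate.split())

# ROUGE-L: longest-common-subsequence overlap, often used for summarization.
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure

# Perplexity: exp of the average negative log-probability the model assigned to
# each generated token (the log-probs must come from the model being evaluated).
token_logprobs = [-0.21, -1.35, -0.08, -0.67]  # illustrative values
perplexity = math.exp(-sum(token_logprobs) / len(token_logprobs))

print(f"BLEU={bleu:.3f}  ROUGE-L={rouge_l:.3f}  perplexity={perplexity:.2f}")

Even this toy version needs curated reference text and model-supplied log-probabilities, which is exactly the preparation burden noted above.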

The Relicx AI Assertions Approach

This is where Relicx AI Assertions come in, with an easy-to-use approach that requires no prior experience or technical knowledge. Relicx AI Assertions leverage cutting-edge large language models (LLMs) such as GPT-4o and Claude 3.5 to evaluate assertions authored by users. These assertions (demonstrated later in the article) are crafted to assess the quality of chatbot responses. Relicx AI Assertions simplify the process of testing chatbot responses by validating them against various criteria, including the following (a minimal sketch of the underlying pattern appears after the list):

  • Valid responses versus errors
  • Factual information versus misinformation
  • Appropriate versus offensive tone or language
  • Code bugs in generated code
  • Logical errors
  • Hallucinations in answers to mathematical questions
  • Custom verification through a one-shot sample
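Under the hood, this style of check is essentially an LLM-as-judge call: the chatbot transcript and the user-authored assertion are handed to a strong model, which answers true or false and explains why. The sketch below shows only the general pattern; the prompt wording, the call_llm helper, and the JSON response format are assumptions for illustration, not Relicx's actual implementation.

# A generic LLM-as-judge assertion, sketched for illustration only.
# call_llm is a hypothetical helper that sends a prompt to a strong model
# (e.g., GPT-4o or Claude 3.5) and returns its text completion.
import json

JUDGE_PROMPT = (
    "You are a strict test evaluator.\n"
    "Chat transcript:\n{transcript}\n"
    "Assertion to verify: {assertion}\n"
    'Respond with JSON: {{"answer": true or false, "explanation": "..."}}'
)

def evaluate_assertion(transcript: str, assertion: str, call_llm) -> dict:
    """Ask a stronger model whether the assertion holds for the transcript."""
    raw = call_llm(JUDGE_PROMPT.format(transcript=transcript, assertion=assertion))
    return json.loads(raw)

# Hypothetical usage with the article's first assertion:
# evaluate_assertion(
#     "User: Hello\nAva: Hi there! How can I help you today?",
#     "Did Ava respond to the user's query appropriately?",
#     call_llm=my_llm_client,
# )
# -> {"answer": True, "explanation": "...an appropriate reply to a greeting..."}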

The elephant in the room: why wouldn’t the Relicx Agent also hallucinate and exhibit the same problems as the chatbots it is testing?

While hallucinations can never be fully eliminated, they can be significantly mitigated. The answer lies in the approach the Relicx Agent takes:

Smarter models to test smaller models: Relicx can cost-effectively use the strongest available models because the testing load is a fraction of production load. A smarter model (e.g., GPT-4o, Claude 3.5) can validate the responses of smaller models more effectively, reducing the chance of hallucinations in the testing itself.

Code generation to evaluate mathematical answers: One might wonder why Relicx’s LLM wouldn’t also hallucinate on a mathematical question. The answer lies in Relicx’s use of code generation and evaluation for mathematical operations. This approach enables Relicx AI Assertions to verify mathematical questions with high confidence.
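The idea is that instead of asking the verifying model to do arithmetic in its head, it is asked to write a small program that computes the expected value, and that program is executed to produce a ground truth to compare against. The sketch below is a simplified illustration of that pattern; the prompt wording, the hypothetical call_llm helper, and the use of eval on a generated expression are assumptions, not a description of Relicx internals.

# Sketch: verify a numeric claim by generating and executing code,
# rather than trusting the judge model's own arithmetic.
def verify_numeric_claim(question: str, claimed_answer: float, call_llm) -> bool:
    """Ask the model for a Python expression that computes the true answer,
    evaluate it, and compare it with the chatbot's claimed answer."""
    expr = call_llm(
        "Write a single Python expression (no prose) that computes the answer to: "
        + question
    )
    expected = eval(expr, {"__builtins__": {}})  # illustrative; sandbox properly in practice
    return abs(expected - claimed_answer) < 1e-6

# Hypothetical usage:
# verify_numeric_claim("What is 15% of 240?", claimed_answer=36, call_llm=my_llm_client)
# -> True only if the generated expression evaluates to 36.0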

Augmenting with specific content: The Relicx Agent has a memory that can be augmented with user-specific content, allowing it to validate the correctness of responses that involve proprietary data unknown to general models.

Contextual priming: Users can provide specific context through memory or one-shot examples. This minimizes the need for the model to infer or assume, making evaluation easier than the chatbot’s own task, which must first retrieve relevant data (for example, via Retrieval-Augmented Generation, or RAG) and then summarize it. The job of testing an LLM’s output is inherently simpler than that of the chatbot itself.
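In judge-prompt terms, contextual priming simply means prepending the trusted context (memory content or a one-shot example) to the evaluation prompt, so the verifier checks the response against supplied facts rather than its own world knowledge. A minimal extension of the earlier sketch, with the prompt wording again assumed for illustration:

# Sketch: prime the judge with trusted context before it evaluates the assertion.
import json

PRIMED_PROMPT = (
    "You are a strict test evaluator.\n"
    "Trusted context (treat as ground truth):\n{context}\n"
    "Chat transcript:\n{transcript}\n"
    "Assertion to verify: {assertion}\n"
    'Respond with JSON: {{"answer": true or false, "explanation": "..."}}'
)

def evaluate_with_context(context: str, transcript: str, assertion: str, call_llm) -> dict:
    """Same judge call as before, but grounded in user-supplied context."""
    prompt = PRIMED_PROMPT.format(context=context, transcript=transcript, assertion=assertion)
    return json.loads(call_llm(prompt))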

Comparison with Chain-of-Verification (CoVe)

Relicx AI Assertions are similar to the Chain-of-Verification (CoVe) prompt-engineering technique, in which verification questions are used to catch hallucinations. This technique has been shown to reduce hallucinations in LLM responses (Chain-of-Verification Reduces Hallucination in Large Language Models, arXiv:2309.11495). Whereas in CoVe the LLM devises the verification questions itself, with Relicx AI Assertions a human in the loop devises the verification assertion. This human involvement can enhance the accuracy and relevance of the verification process.
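For comparison, here is a compressed sketch of the CoVe loop described in the paper: draft an answer, plan verification questions, answer them independently, then revise. The prompt wording and the call_llm helper are illustrative assumptions, not the paper's exact prompts.

# Sketch of the Chain-of-Verification (CoVe) loop; prompts are illustrative.
def chain_of_verification(question: str, call_llm) -> str:
    draft = call_llm(f"Answer the question: {question}")
    # 1. The model plans its own verification questions for the draft.
    checks = call_llm(
        f"List short verification questions to fact-check this answer:\n{draft}"
    ).splitlines()
    # 2. Each verification question is answered independently of the draft,
    #    so a hallucination in the draft cannot contaminate the checks.
    verified = [call_llm(q) for q in checks if q.strip()]
    # 3. The draft is revised in light of the independently verified answers.
    return call_llm(
        "Revise this draft so it is consistent with the verified facts.\n"
        f"Draft: {draft}\nVerified facts: {verified}"
    )

With Relicx AI Assertions, the question-planning step is authored by a human instead of the model, which is the key difference noted above.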

Practical Application

In this article, we explore a sample chatbot that delivers both high-quality responses and occasional errors or misinformation. For demonstration purposes, we have instrumented the chatbot to include erroneous responses to certain questions. We use Relicx AI Assertions to validate the chatbot directly through its user interface (UI). There is no need for backend API integration; all Relicx requires is access to the web application containing the chatbot.

We begin by creating a test that opens the chatbot interface in the Relicx Interactive Test Authoring page.

Valid responses

We write a test command to enter the simple query “Hello” into the chatbot’s user text box.

We can see that the chatbot has responded, but how do we determine if the response is correct or appropriate as part of the test?

To validate this, we author an AI Assertion (a natural language question validated for truthfulness):

Assertion: “Did Ava respond to the user’s query appropriately?”

As we can see, the assertion passed.

The logs tab shows more details on how Relicx Agent evaluated the assertion:

Question: Did ava respond to the user’s query appropriately?
Answer true.
Explanation: The user greeted Ava with ‘Hello’, and Ava responded
with ‘Hi there! How can I help you today?’.
This is an appropriate response to a greeting, indicating that Ava
is ready to assist the user.

That was a simple query; let’s move on to something more interesting.

We are going to ask Ava: “is horse a mammal?”

Now we assert whether the response is correct and factual.

Assertion: Did ava respond with correct and factual answer?

Again the assertion passed:

Let’s look at the logs to see what Relicx Agent did.

Question: Did ava respond with correct and factual answer?
Answer true.
Explanation: Ava responded with a correct and factual answer.
The response states that a horse is a mammal and provides accurate information
about the characteristics of mammals, such as the presence of mammary glands,
neocortex, fur or hair, and three middle ear bones.
This is consistent with known biological facts.

Catching hallucinations

Ava, like most chatbots, is prone to hallucinations as the model gets tuned and updated.

Let’s ask Ava a trick question: “How did Lincoln leverage smartphones during the Revolutionary War?”

Ava responded, but is it correct and factual?

With Relicx, it’s as simple as stating an obvious question:

Did Ava respond to the user’s question about Lincoln with a correct and factual answer?

This time, the assertion failed.

Let’s again look at the logs for Relicx Agent’s reasoning behind failing the assertion:

Question: Did Ava respond to user’s question about lincoln with correct and factual answer?
Answer false.
Explanation: Ava’s response to the user’s question about Lincoln is not correct or factual. The response states that Abraham Lincoln revolutionized the Revolutionary War by introducing smartphones, which is historically inaccurate as smartphones did not exist during Lincoln’s time or the Revolutionary War.

The Relicx Agent can catch hallucinations and misinformation generated by a chatbot. But can it detect offensive or inappropriate language?

Detecting inappropriate language or tone

What if the chatbot becomes offensive due to bad data in the fine-tuning process (e.g., using Slack conversations for training)?

Let’s ask this question:

“Can you help me with my account issue?” (For this demonstration, we’ve intentionally made Ava’s response offensive.)

Now let’s use Relicx to assert whether the answer or its tone is offensive or inappropriate. Since an assertion passes only when the answer is true, we frame the question accordingly.

Assert: Is Ava’s response appropriate?

Relicx Agent caught the inappropriate response, and the logs show the reasoning:

Question: Is Ava’s response appropriate?
Answer false.
Explanation: Ava’s response ‘Figure it out yourself. I’m not here to do everything for you.’ is inappropriate for a customer service or help chat. This response is unprofessional and does not provide any assistance to the user.

Catching bugs in generated code

We’ll ask Ava to generate code to sort an array: “Please write a JavaScript function that sorts an array of numbers.”

Ava generated valid-looking code, but is it bug-free?

Let’s assert this with Relicx Agent:

Assert: Is the generated code for user’s question correct and bug free?

The assertion failed, and we can see the reason in the log:

Question: Is the generated code for user’s question correct and bug free?
Answer false.
Explanation: The generated JavaScript code for sorting an array of numbers
contains a bug. Specifically, in the nested for loop, the condition should be
‘j < n - i - 1’ instead of ‘j < n - i’. This ensures that the inner loop does not go out of bounds when accessing ‘sortedArray[j + 1]’.

Therefore, the answer to the question is NO

Relicx Agent can catch bugs in generated code and fail the assertion.
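To make the failure mode concrete, here is the same class of off-by-one recreated in Python (an illustrative assumption; Ava’s original JavaScript is not reproduced in the article). With the loose bound, the inner loop reads one element past the end of the list, which in Python surfaces as an IndexError:

# Bubble sort with the off-by-one bound described in the log, recreated in Python.
def buggy_sort(nums):
    a = list(nums)
    n = len(a)
    for i in range(n):
        for j in range(n - i):       # bug: should be range(n - i - 1)
            if a[j] > a[j + 1]:      # reads a[n] on the last inner iteration
                a[j], a[j + 1] = a[j + 1], a[j]
    return a

# buggy_sort([3, 1, 2]) raises IndexError: list index out of range.
# Tightening the inner bound to range(n - i - 1) fixes it.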

Handling proprietary data

We have demonstrated that Relicx Agent-based Assertions can catch a wide variety of issues in chatbot responses. However, many chatbots rely on custom data to answer user questions. How can the Relicx Agent help with that?

To illustrate this, we will expand the Ava chatbot’s capabilities. Let’s say this chatbot is a customer support bot that answers queries from users of an e-commerce application. In this case, Ava would have access to the product catalog and each user’s order history.

One-shot learning

In this approach, the assertion itself supplies the context needed for verification. For example, we will ask Ava a specific question from a test user about the status of their order for shoes.

For the test user, we provided order history and active orders to Ava.

We will now ask Ava a question about a specific order described below:

Sample: Order #: 894562, Description: Loafer Shoes, price $52, status: Shipped from warehouse, ETA: 06/29/2024

Knowing this is based on test data, you can provide the correct context in the assertion itself. We now ask the Relicx AI Agent, including the ETA in the assertion. Even though the Relicx Agent has no prior knowledge of the order history, the context within the assertion provides enough detail to verify correctness.

Assert: “Is Ava’s response correct given order delivery date is 06/29/2024?”

Relicx Agent logs show:

Question: Is Ava’s response correct given order delivery date is 06/29/2024?
Answer false.

Explanation: Ava’s response states that the order can be expected in about 12 days. Given today’s date is June 22, 2024, adding 12 days would result in a delivery date of July 4, 2024. Since the order delivery date is June 29, 2024, Ava’s response is incorrect.
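This check reduces to a few lines of date arithmetic, which is exactly the kind of thing that can be verified deterministically rather than estimated. A minimal reproduction of the reasoning in the log above:

# Reproduce the date check from the log: "in about 12 days" versus the known ETA.
from datetime import date, timedelta

today = date(2024, 6, 22)                 # the date the log reasons from
claimed_eta = today + timedelta(days=12)  # Ava's "about 12 days" -> 2024-07-04
actual_eta = date(2024, 6, 29)            # ETA from the test order data

print(claimed_eta, actual_eta, claimed_eta == actual_eta)
# 2024-07-04 2024-06-29 False -> the assertion fails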

Using Relicx Memory

The Relicx Agent has a feature called memory, allowing customers to upload custom content and make the Relicx Agent aware of this relevant content. This enables the Relicx Agent to assert and validate the correctness of responses based on proprietary data.

Ava has access to all users’ history through a RAG model, but for testing purposes we can pick a particular test user and provide the order history of just that user:

Here is a sample of the test user’s order history, for illustration purposes.

Now we will ask Ava a set of questions based on this test user’s order history.

Relicx Logs:

Question: Did Ava correctly answer user’s question?

Answer true.

Explanation: Yes, Ava correctly answered the user’s question. The order for the wireless earbuds was cancelled, so they were not delivered.

Let’s try another question, one that is typically a weakness of chatbots.

Relicx logs:

Question: Did Ava correctly answer user’s question?

Answer false.

Explanation: Ava did not correctly answer the user’s question. The correct total amount spent on the orders is $1930, not $1462

Relicx’s use of code generation and evaluation for mathematical operations enables high-confidence verification.

Conclusion

Testing Generative AI applications presents unique challenges that traditional frameworks often struggle to address. Industry-standard evaluation metrics like BLEU, ROUGE, and Perplexity require deep expertise and the creation of extensive datasets, which can be time-consuming and complex. However, with Relicx AI Assertions, testing becomes as straightforward as asking natural language questions to verify responses across a range of issues. By leveraging advanced LLMs and techniques like code generation and user-authored assertions, Relicx provides a robust and reliable testing solution for generative AI-based applications.

Are you building a generative AI application and facing testing challenges? Reach out to us — we’re here to help!
