Evaluate your RAG QnA apps with Azure ML PromptFlow

Ozgur Guler
Microsoft Azure
Sep 11, 2023
midjourney — LLM performance

Evaluation helps us understand whether an LLM has the required task-specific performance (accuracy) and reliability, with the correct tone and alignment. It also helps us understand whether the LLM is robust against adversarial attacks and “provocations” to go off the rails, minimising risk to the business…

Evaluating LLMs poses unique challenges. The infinite cardinality of both the input and output spaces makes it hard to establish a well-defined set of desired model outputs, rendering traditional static evaluation datasets and metrics insufficient for comparative analysis. LLMs therefore necessitate novel frameworks for developing, testing, and evaluating models that can cope with “LLM physics”, i.e. LLMs’ intrinsic variability, fluidity, brittleness and opacity.

Finally, a good evaluation needs to be correlated with outcomes, task specific, built on a small set of metrics, and fast and programmatic so it can be embedded into development workflows. In this post we will cover how you can evaluate your LLMs with Azure ML PromptFlow while adhering to these requirements and addressing the challenges listed above.

Automatic vs Human Evaluations

It is difficult to embed humans into development processes. One particular crowdsourced human-evaluation method, Chatbot Arena, is interesting: the same input is given side by side to two different models and a human chooses the better response. (The LMSYS org maintains such a crowdsourced chatbot arena environment where visitors rate LLMs, and the results feed an LLM leaderboard.)

However, LLM development is an empirical art: a developer typically works concurrently with several different models, their variants and prompt versions, and therefore has to run many iterations. A good evaluation should be programmatic, enabling faster experimentation. That is why automated evaluation methods, albeit imperfect, have the upper hand, and they are the focus of this post. PromptFlow can run such automated evaluations, which we will cover below.

There are other challenges with human evaluations too, e.g. humans tend to judge surface qualities rather than the property being evaluated, misrating the factuality of LLM generations [2].

Building your own Evaluation Flow in AzureML PromptFlow

Evaluation has two major components: the evaluation dataset and a metric… Below is what you can do with PromptFlow with a ground-truth evaluation dataset for a RAG QnA use case…

Azure ML — PromptFlow Evaluation Flows

Below are some of the automated evaluation capabilities Azure ML PromptFlow offers. With these evaluation flows, we can use prompts for Azure OpenAI GPT models to evaluate different characteristics of model generations, e.g. fluency, relevance, coherence or groundedness, assigning ratings from 1 to 5 to each characteristic using specially constructed flows.

  1. QnA Ada Similarity Evaluation: Computes the cosine similarity between the LLM answer and the ground truth, each embedded with the ada embedding model. ada_similarity is a value in the range [0, 1] (see the sketch after this list).
  2. QnA GPT fluency: Measures how grammatically and linguistically correct the model’s predicted answer is. Following is part of the prompt used to rate the fluency of model answers. “Fluency measures the quality of individual sentences in the answer, and whether they are well-written and grammatically correct. Consider the quality of individual sentences when evaluating fluency. Given the question and answer, score the fluency of the answer between one to five stars using the following rating scale:”
  3. QnA GPT groundedness: Measures how grounded the model’s predicted answers are against the context. Even if the LLM’s responses are true, they are considered ungrounded if they cannot be verified against the context.
  4. QnA GPT coherence: Can be used with RAG systems and lets gpt-35-turbo measure the quality of all sentences in a model’s predicted answer and how they fit together naturally. Following is part of the prompt used to rate the coherence of model answers. “Coherence of an answer is measured by how well all the sentences fit together and sound naturally as a whole. Consider the overall quality of the answer when evaluating coherence. Given the question and answer, score the coherence of answer between one to five stars using the following rating scale:”
  5. QnA GPT relevance: GPT measures how relevant the answers are to the questions asked. Following is part of the prompt used to rate the relevance of model answers. “Relevance measures how well the answer addresses the main aspects of the question, based on the context. Consider whether all and only the important aspects are contained in the answer when evaluating relevance. Given the context and question, score the relevance of the answer between one to five stars using the following rating scale:”
  6. QnA GPT similarity: GPT measures how similar the model generations are to the ground-truth answers… Following is part of the prompt used to rate the GPT similarity of model answers. “Equivalence, as a metric, measures the similarity between the predicted answer and the correct answer. If the information and content in the predicted answer is similar or equivalent to the correct answer, then the value of the Equivalence metric should be high, else it should be low. Given the question, correct answer, and predicted answer, determine the value of Equivalence metric using the following rating scale:”
  7. QnA F1 score: Computes an F1 score based on the word overlap between the predicted answer and the ground truth (also sketched after this list).
  8. QnA Relevance Scores Pairwise Evaluation: This is a flow assessing the quality of answers generated by a question answering system. It involves assigning relevance scores to each answer based on how well it matches the user question, comparing different answers to a baseline answer, and aggregating the results to produce metrics such as averaged win rates and relevance scores.
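To make the two non-GPT metrics above concrete, here is a minimal sketch (not the PromptFlow implementation) of the word-overlap F1 score and the ada embedding cosine similarity, assuming the current openai Python SDK and an Azure OpenAI embedding deployment named text-embedding-ada-002:

import os
from collections import Counter

import numpy as np
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2023-05-15",
)

def f1_word_overlap(prediction: str, ground_truth: str) -> float:
    """SQuAD-style F1 over whitespace-separated tokens."""
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

def ada_similarity(prediction: str, ground_truth: str) -> float:
    """Cosine similarity between ada embeddings of the two answers."""
    resp = client.embeddings.create(
        model="text-embedding-ada-002",  # assumed embedding deployment name
        input=[prediction, ground_truth],
    )
    a, b = (np.array(d.embedding) for d in resp.data)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))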

Building the Ground Truth dataset

I used GPT-4 to create question-answer pairs from the IMF World Economic Outlook July 2023 Update document. (You can download the document here.) Below are 5 samples of the “Ground Truth” dataset I created; a rough sketch of how such pairs can be generated follows the samples.

Describe the global economic growth projection for 2023 according to the IMF World Economic Outlook Update for July 2023.,"The IMF World Economic Outlook Update for July 2023 projects a global economic growth of 3.0 percent for the year 2023. This projection takes into account various factors such as the easing of COVID-19 restrictions, the performance of the services sector, and other high-frequency indicators."
Explain the expected global headline inflation for 2023 and 2024.,"The report anticipates a global headline inflation rate of 6.8 percent for 2023 and 5.2 percent for 2024. These figures suggest that while inflation is expected to remain high, it is likely to ease somewhat in the following year."
What recent developments have mitigated the immediate risks of financial sector turmoil?,"The immediate risks of financial sector turmoil have been reduced due to the resolution of the U.S. debt ceiling standoff and strong actions taken by authorities to contain turbulence in U.S. and Swiss banking sectors. These measures have provided a more stable environment for financial markets."
Discuss the downside risks to global growth as outlined in the report.,"The report outlines several downside risks to global growth, including the possibility of inflation remaining high, the intensification of the war in Ukraine, extreme weather-related events, financial sector turbulence, China’s slowing recovery, and issues related to sovereign debt distress. These factors could collectively or individually impact the pace of global economic recovery."
What is the IMF's priority recommendation for most economies in 2023?,"The IMF recommends that the priority for most economies in 2023 should be to achieve sustained disinflation while ensuring financial stability. This involves a balanced approach that takes into account both inflationary pressures and the need for economic growth."
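The following is a hedged sketch of how pairs like these could be generated programmatically with a GPT-4 deployment on Azure OpenAI; the deployment name "gpt-4", the "question|answer" output format and the source file name are my assumptions, not part of the original workflow:

import os

from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2023-05-15",
)

def generate_qa_pairs(document_text: str, n_pairs: int = 5) -> list[tuple[str, str]]:
    """Ask GPT-4 for question/answer pairs grounded strictly in the document."""
    prompt = (
        f"Generate {n_pairs} question and answer pairs based strictly on the "
        "document below. Return one pair per line in the form 'question|answer'.\n\n"
        "Document:\n" + document_text
    )
    resp = client.chat.completions.create(
        model="gpt-4",  # assumed GPT-4 deployment name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    pairs = []
    for line in resp.choices[0].message.content.splitlines():
        if "|" in line:
            question, answer = line.split("|", 1)
            pairs.append((question.strip(), answer.strip()))
    return pairs

pairs = generate_qa_pairs(open("imf_weo_july_2023.txt", encoding="utf-8").read())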

To create a QnA application for your documents with RAG automation, please refer to my earlier post “Create an LLM app to query your Data in Azure PromptFlow”. In summary, you create a vector index for your document, which automatically launches a RAG pipeline that chunks the document, embeds the chunks, indexes the chunk embeddings and stores them in an Azure Cognitive Search vector store. The pipeline is executed automatically and you are given an LLM flow.
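For intuition, here is a simplified manual sketch of what that automated pipeline does (chunk, embed, index); the index name, field names and deployment names are assumptions, and in practice the PromptFlow vector-index wizard handles all of this for you:

import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI

aoai = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2023-05-15",
)
search = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="imf-weo-index",  # assumed to already exist with a vector field
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)

def chunk(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Naive fixed-size character chunking with overlap."""
    return [text[i : i + size] for i in range(0, len(text), size - overlap)]

def embed(text: str) -> list[float]:
    """Embed one chunk with the ada embedding deployment."""
    resp = aoai.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding

text = open("imf_weo_july_2023.txt", encoding="utf-8").read()
docs = [
    {"id": str(i), "content": c, "content_vector": embed(c)}
    for i, c in enumerate(chunk(text))
]
search.upload_documents(documents=docs)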

A RAG automated QnA flow on PromptFlow…

Once you confirm your QnA flow works correctly, choose “Bulk Test”…

Upload your eval.csv data…

upload your eval dataset

If you have used “,” as your delimiter, the question and ground_truth columns will be identified automatically. You can now proceed and submit your evaluation job… (There is no need to create datasets under Azure ML. If you use any other delimiter such as “;”, the Azure ML dataset will recognize it, but under PromptFlow you will end up with corrupted columns, so stick to uploading straight from the PromptFlow UI.)
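If you assemble eval.csv yourself, writing it with Python's csv module keeps the comma delimiter and quotes any field that contains commas, so the question and ground_truth columns survive the PromptFlow upload intact (the sample row below is truncated for brevity):

import csv

rows = [
    ("Describe the global economic growth projection for 2023 ...",
     "The IMF World Economic Outlook Update for July 2023 projects ..."),
]

with open("eval.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)  # "," delimiter by default; fields containing commas are quoted
    writer.writerow(["question", "ground_truth"])
    writer.writerows(rows)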

We will choose the “QnA GPT Similarity Evaluation” method to find out how similar the answers are to the GPT-4-generated ground truth. (Our flow currently uses gpt-35-turbo.) You can also deploy an open-source LLM such as Llama 2 and repeat the same exercise. See my earlier post on how you can use the Azure ML model catalog to deploy open-source models on Azure ML and use them like any other LLM in an Azure ML PromptFlow flow.

Creating a QnA GPT Similarity evaluation in PromptFlow
Choose the question set, ground truth answers and the output…

The evaluation flow uses your LLM (in this case gpt-35-turbo) to assess the similarity between your flow LLM’s answers and your ground-truth answers (in our case the ground-truth answers were generated with GPT-4).

Evaluation Flow

The similarity score LLM node in the flow simply uses the prompt below to rate the similarity on a 1-to-5 scale using an LLM of your choice, in this case gpt-35-turbo; a sketch of the downstream parsing and aggregation steps follows the prompt.

System:
You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.

User:
Equivalence, as a metric, measures the similarity between the predicted answer and the correct answer. If the information and content in the predicted answer is similar or equivalent to the correct answer, then the value of the Equivalence metric should be high, else it should be low. Given the question, correct answer, and predicted answer, determine the value of Equivalence metric using the following rating scale:
One star: the predicted answer is not at all similar to the correct answer
Two stars: the predicted answer is mostly not similar to the correct answer
Three stars: the predicted answer is somewhat similar to the correct answer
Four stars: the predicted answer is mostly similar to the correct answer
Five stars: the predicted answer is completely similar to the correct answer

This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5.

The examples below show the Equivalence score for a question, a correct answer, and a predicted answer.

question: What is the role of ribosomes?
correct answer: Ribosomes are cellular structures responsible for protein synthesis. They interpret the genetic information carried by messenger RNA (mRNA) and use it to assemble amino acids into proteins.
predicted answer: Ribosomes participate in carbohydrate breakdown by removing nutrients from complex sugar molecules.
stars: 1

question: Why did the Titanic sink?
correct answer: The Titanic sank after it struck an iceberg during its maiden voyage in 1912. The impact caused the ship's hull to breach, allowing water to flood into the vessel. The ship's design, lifeboat shortage, and lack of timely rescue efforts contributed to the tragic loss of life.
predicted answer: The sinking of the Titanic was a result of a large iceberg collision. This caused the ship to take on water and eventually sink, leading to the death of many passengers due to a shortage of lifeboats and insufficient rescue attempts.
stars: 2

question: What causes seasons on Earth?
correct answer: Seasons on Earth are caused by the tilt of the Earth's axis and its revolution around the Sun. As the Earth orbits the Sun, the tilt causes different parts of the planet to receive varying amounts of sunlight, resulting in changes in temperature and weather patterns.
predicted answer: Seasons occur because of the Earth's rotation and its elliptical orbit around the Sun. The tilt of the Earth's axis causes regions to be subjected to different sunlight intensities, which leads to temperature fluctuations and alternating weather conditions.
stars: 3

question: How does photosynthesis work?
correct answer: Photosynthesis is a process by which green plants and some other organisms convert light energy into chemical energy. This occurs as light is absorbed by chlorophyll molecules, and then carbon dioxide and water are converted into glucose and oxygen through a series of reactions.
predicted answer: In photosynthesis, sunlight is transformed into nutrients by plants and certain microorganisms. Light is captured by chlorophyll molecules, followed by the conversion of carbon dioxide and water into sugar and oxygen through multiple reactions.
stars: 4

question: What are the health benefits of regular exercise?
correct answer: Regular exercise can help maintain a healthy weight, increase muscle and bone strength, and reduce the risk of chronic diseases. It also promotes mental well-being by reducing stress and improving overall mood.
predicted answer: Routine physical activity can contribute to maintaining ideal body weight, enhancing muscle and bone strength, and preventing chronic illnesses. In addition, it supports mental health by alleviating stress and augmenting general mood.
stars: 5

question: {{question}}
correct answer: {{ground_truth}}
predicted answer: {{answer}}
stars:
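After the LLM node returns its “stars” completion, the evaluation flow parses the rating and aggregates it across rows. The sketch below is a rough, illustrative equivalent of those Python tool nodes, assuming the promptflow SDK’s tool decorator and log_metric helper; the function and variable names are mine, not the exact ones shipped with the built-in flow.

import re

from promptflow import log_metric, tool

@tool
def parse_score(llm_output: str) -> int:
    """Pull the 1-5 rating out of the LLM's 'stars' completion; default to 1."""
    match = re.search(r"\d", llm_output)
    return int(match.group()) if match else 1

@tool
def aggregate(scores: list[int]) -> float:
    """Average the per-row ratings and log the run-level gpt_similarity metric."""
    avg = sum(scores) / len(scores) if scores else 0.0
    log_metric("gpt_similarity", avg)
    return avg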
Check model outputs

Once the evaluation job is submitted you can monitor its execution: PromptFlow goes over each individual row in your eval dataset and submits it to the LLM defined in the flow. It then uses gpt-35-turbo to assess the similarity of your LLM’s outputs with the ground-truth answers in the dataset, which were created with GPT-4.

Check evaluation metrics

Finally, the evaluation job reports a score from 1 to 5 for the gpt_similarity metric. On this 1-to-5 scale, GPT rates the similarity between the ground-truth and LLM answers at 4.1.

(In the row above you also see a metric that belongs to another evaluation run, ada_similarity, which measures the cosine similarity between the ground-truth and model answer embeddings.)

Conclusion

Azure ML PromptFlow facilitates the critical task of evaluating Large Language Models (LLMs) by employing GPT models for semantic comparison against a ground truth. It also offers Ada Similarity, an embedding-based metric, to quantitatively assess the model’s alignment with the ground-truth content. This dual-layered evaluation is essential for ensuring the model’s reliability and accuracy, particularly in high-stakes contexts where even minor inaccuracies can have significant consequences.

Ozgur Guler

I am a Solutions Architect at Microsoft, where I work with Startups & Digital Natives, focusing on app development with Azure OpenAI.

Subscribe to my AzureOpenAI Builders Newsletter on LinkedIn here, where we cover the latest on building with #AzureOpenAI.

Azure Builders Newsletter

References:

  1. Chang, Yupeng, et al. “A Survey on Evaluation of Large Language Models.” arXiv preprint, 2023.
  2. Gudibande, Arnav, et al. “The False Promise of Imitating Proprietary LLMs.” UC Berkeley, Preprint, Under review, n.d.
  3. [MS Learn] Submit bulk test and evaluate a flow (preview) [link]
  4. [MS Learn] Develop an evaluation flow (preview) [link]
  5. Mosaic ML — LLM Evaluation Scores [link]
