A Deep Dive into Evaluation in Azure Prompt Flow

Shahzeb Naveed · Published in The Deep Hub · 9 min read · Apr 4, 2024
LLM-as-a-judge. Source: Adobe Firefly

In this article, we’ll do a deep dive into the underlying implementation of various built-in evaluation methods offered in Azure Machine Learning Prompt Flow using the official “Chat with Wikipedia” demo.

First, set up your OpenAI credentials in the Connections tab, which is a secure way of storing secrets in Prompt Flow.

Then create a new flow by using the Clone option on the Chat with Wiki template:

After updating the Connection in the LLM nodes, add another Output named contexts with value ${process_search_result.output}. We’ll need this for input mapping while configuring our Evaluation flows.

Prompt Flow accepts various input file formats, but I decided to format mine as .jsonl. The Chat with Wiki demo expects the initial conversation and a question as input. We’ll also add a ground_truth field to specify the true answer.

{
  "chat_history": [{
    "inputs": {
      "question": "What is the Big Bang Theory?"
    },
    "outputs": {
      "answer": "The Big Bang Theory is a widely accepted cosmological model that explains the observable universe's origin and development. It proposes that the universe began as a hot, dense point approximately 13.8 billion years ago and has been expanding ever since."
    }
  }],
  "question": "What is the evidence supporting this theory?",
  "ground_truth": "The Big Bang Theory is supported by various pieces of evidence, including the cosmic microwave background radiation, the abundance of light elements in the universe, and the large-scale structure of the cosmos."
}

I created the following sample dataset. It’s not exhaustive, but it’s good enough to understand how the underlying methods work. This is what the complete samples.jsonl file looks like:

{"chat_history": [{"inputs": {"question": "What is the Big Bang Theory?"}, "outputs": {"answer": "The Big Bang Theory is a widely accepted cosmological model that explains the observable universe's origin and development. It proposes that the universe began as a hot, dense point approximately 13.8 billion years ago and has been expanding ever since."}}], "question": "What is the evidence supporting this theory?", "ground_truth": "The Big Bang Theory is supported by various pieces of evidence, including the cosmic microwave background radiation, the abundance of light elements in the universe, and the large-scale structure of the cosmos."}
{"chat_history": [{"inputs": {"question": "Who is Marie Curie?"}, "outputs": {"answer": "Marie Curie was a pioneering physicist and chemist known for her groundbreaking research on radioactivity. She was the first woman to win a Nobel Prize and remains the only person to have won Nobel Prizes in two different scientific fields."}}], "question": "What were Marie Curie's major contributions to science?", "ground_truth": "Marie Curie's discoveries laid the groundwork for many developments in physics and chemistry. She coined the term 'radioactivity' and discovered the elements polonium and radium."}
{"chat_history": [{"inputs": {"question": "What is Climate Change?"}, "outputs": {"answer": "Climate change refers to long-term shifts in temperature, precipitation, and other atmospheric conditions on Earth. It is primarily driven by human activities such as burning fossil fuels, deforestation, and industrial processes."}}], "question": "What are the consequences of climate change?", "ground_truth": "Climate change poses significant challenges, including rising global temperatures, sea level rise, extreme weather events, and disruptions to ecosystems and biodiversity. Addressing climate change requires collective action at local, national, and international levels."}
{"chat_history": [{"inputs": {"question": "What is the Theory of Evolution?"}, "outputs": {"answer": "The Theory of Evolution, proposed by Charles Darwin, explains how species evolve over time through natural selection. It suggests that organisms with favorable traits are more likely to survive and reproduce, leading to changes in populations over successive generations."}}], "question": "How does the Theory of Evolution explain biodiversity?", "ground_truth": "The Theory of Evolution provides a framework for understanding the diversity of life on Earth. It has withstood scientific scrutiny and is supported by a vast body of evidence from various fields such as genetics, paleontology, and comparative anatomy."}
{"chat_history": [{"inputs": {"question": "What is Quantum Mechanics?"}, "outputs": {"answer": "Quantum mechanics is the branch of physics that describes the behavior of particles at the smallest scales, such as atoms and subatomic particles. It is characterized by phenomena such as wave-particle duality, superposition, and entanglement."}}], "question": "What are the practical applications of Quantum Mechanics?", "ground_truth": "Quantum mechanics revolutionized our understanding of the microscopic world and laid the foundation for modern technologies such as semiconductors, lasers, and quantum computing."}

Now, start an Automatic runtime and click on Evaluate. Upload the samples.jsonl file you just created and confirm that the JSON was parsed correctly by observing the preview section.

Then, select all the options except Classification Accuracy Evaluation since this is not a classification problem.

Each of the metrics is computed in its own Evaluation flow:

Below the list of runs, specify the input mapping of columns and OpenAI connections. For metrics that require a context, map it to the contexts output variable that we created earlier.
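Conceptually, each evaluation-flow input maps either to a column of samples.jsonl or to an output of the chat run. For a context-aware metric such as Groundedness, the mapping ends up looking roughly like this (the exact input names vary per evaluation flow; the dictionary below is only illustrative):

# Illustrative column mapping for a context-aware metric.
# ${data.*} pulls from samples.jsonl; ${run.outputs.*} pulls from the chat run.
column_mapping = {
    "question": "${data.question}",          # column from samples.jsonl
    "answer": "${run.outputs.answer}",       # output of the Chat with Wiki run
    "context": "${run.outputs.contexts}",    # the extra output we added earlier
    "ground_truth": "${data.ground_truth}",  # column from samples.jsonl
}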

Once the evaluation is completed, go to Batch Runs and open the latest run. Then click View outputs.

Under Traces, you can track the inputs/outputs of various nodes in your pipeline:

Under the Metrics tab, you can open the individual flow runs that calculate the corresponding evaluation metrics.

  1. Groundedness is the degree to which a response from a language model is based on, and informed by, the context provided in the prompt. Grounding, particularly with Retrieval-Augmented Generation (RAG), is a commonly used technique to reduce hallucination in LLMs and to supplement their internal knowledge base (developed during training) with new or use-case-specific information supplied as part of the prompt.

It is commonly calculated using an LLM-as-a-judge approach, whereby a strong LLM such as GPT-4 scores the output. The evaluation flow inserts our contexts variable into the context placeholder and uses the following prompt to generate a score:

System:
You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.
User:
You will be presented with a CONTEXT and an ANSWER about that CONTEXT. You need to decide whether the ANSWER is entailed by the CONTEXT by choosing one of the following rating:
1. 5: The ANSWER follows logically from the information contained in the CONTEXT.
2. 1: The ANSWER is logically false from the information contained in the CONTEXT.
3. an integer score between 1 and 5 and if such integer score does not exists, use 1: It is not possible to determine whether the ANSWER is true or false without further information.

Read the passage of information thoroughly and select the correct answer from the three answer labels. Read the CONTEXT thoroughly to ensure you know what the CONTEXT entails.

Note the ANSWER is generated by a computer system, it can contain certain symbols, which should not be a negative factor in the evaluation.
Independent Examples:
## Example Task #1 Input:
{"CONTEXT": "The Academy Awards, also known as the Oscars are awards for artistic and technical merit for the film industry. They are presented annually by the Academy of Motion Picture Arts and Sciences, in recognition of excellence in cinematic achievements as assessed by the Academy's voting membership. The Academy Awards are regarded by many as the most prestigious, significant awards in the entertainment industry in the United States and worldwide.", "ANSWER": "Oscar is presented every other two years"}
## Example Task #1 Output:
1
## Example Task #2 Input:
{"CONTEXT": "The Academy Awards, also known as the Oscars are awards for artistic and technical merit for the film industry. They are presented annually by the Academy of Motion Picture Arts and Sciences, in recognition of excellence in cinematic achievements as assessed by the Academy's voting membership. The Academy Awards are regarded by many as the most prestigious, significant awards in the entertainment industry in the United States and worldwide.", "ANSWER": "Oscar is very important awards in the entertainment industry in the United States. And it's also significant worldwide"}
## Example Task #2 Output:
5
## Example Task #3 Input:
{"CONTEXT": "In Quebec, an allophone is a resident, usually an immigrant, whose mother tongue or home language is neither French nor English.", "ANSWER": "In Quebec, an allophone is a resident, usually an immigrant, whose mother tongue or home language is not French."}
## Example Task #3 Output:
5
## Example Task #4 Input:
{"CONTEXT": "Some are reported as not having been wanted at all.", "ANSWER": "All are reported as being completely and fully wanted."}
## Example Task #4 Output:
1

Reminder: The return values for each task should be correctly formatted as an integer between 1 and 5. Do not repeat the context.

## Actual Task Input:
{"CONTEXT": {{context}}, "ANSWER": {{answer}}}

Actual Task Output:

It then concatenates and aggregates the scores across variants and samples, and also computes a pass rate based on a threshold score of 3, as follows:

variant_level_result[item_name + '_pass_rate'] = 1 if item["score"] > 3 else 0
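Put together, the per-variant aggregation is conceptually something like the sketch below (simplified and illustrative; only the 3-point threshold comes from the built-in flow):

def aggregate_groundedness(scores: list[float], threshold: int = 3) -> dict:
    # Average the per-sample judge scores and compute the share of samples
    # that clear the pass threshold (score strictly greater than 3).
    avg_score = sum(scores) / len(scores)
    pass_rate = sum(1 for s in scores if s > threshold) / len(scores)
    return {"gpt_groundedness": round(avg_score, 2),
            "gpt_groundedness_pass_rate": round(pass_rate, 2)}


print(aggregate_groundedness([5, 4, 2, 5, 3]))
# {'gpt_groundedness': 3.8, 'gpt_groundedness_pass_rate': 0.6}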

2. Coherence, Relevance, Fluency and GPT Similarity are also calculated using the same LLM-as-a-judge approach, and by default they use the same pass-rate threshold of 3 as Groundedness.

Coherence refers to the quality of text reading naturally and resembling human-like language. Low coherence implies sentences that appear disconnected and lack a logical flow, making it difficult for readers to understand the intended message.

Relevance measures the extent to which the model addresses the user’s question in light of the context. Low relevance implies that the model’s response is off-topic and fails to address the user’s query effectively.

Fluency evaluates language proficiency, and GPT Similarity measures the semantic resemblance between the ground_truth and the model’s answer.
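Concretely, each of these judge metrics follows the same pattern as the Groundedness example above: fill a metric-specific prompt template, ask a strong model for an integer between 1 and 5, and parse the reply. A stripped-down sketch of that loop is shown below; the client settings, the deployment name and the prompt text are placeholders, not the built-in templates:

from openai import AzureOpenAI

# Assumed connection details; use the values from your own Azure OpenAI resource.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-02-01",
)

# Stand-in prompt; the built-in flows ship their own, much longer templates.
COHERENCE_PROMPT = (
    "Rate how coherent the ANSWER to the QUESTION is on an integer scale of 1 to 5. "
    "Respond with the number only.\n"
    "QUESTION: {question}\nANSWER: {answer}"
)

def judge_coherence(question: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder deployment name
        messages=[{"role": "user",
                   "content": COHERENCE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    # The judge is instructed to reply with a bare integer score.
    return int(response.choices[0].message.content.strip())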

3. Ada Similarity:

This is another approach to calculating the similarity between the ground truth and the model’s response. While configuring your evaluation metrics, you will have specified an OpenAI connection to a text-embedding endpoint. The flow uses that endpoint to embed the prompt flow’s output answer and the provided ground_truth, and then calculates the cosine similarity between them as follows:

from promptflow import tool
import numpy as np
from numpy.linalg import norm

@tool
def compute_ada_cosine_similarity(a, b) -> float:
    return np.dot(a, b) / (norm(a) * norm(b))
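End to end, the ada-similarity node boils down to embedding both strings and feeding the vectors to the function above. A rough usage sketch follows; the endpoint, key and the text-embedding-ada-002 deployment name are placeholders for whatever your connection points to:

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-02-01",
)

def embed(text: str) -> list[float]:
    # One embedding call per string; returns the raw embedding vector.
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return response.data[0].embedding

answer_vec = embed("Quantum mechanics underpins semiconductors and lasers.")
truth_vec = embed("Quantum mechanics laid the foundation for semiconductors, lasers, and quantum computing.")
print(compute_ada_cosine_similarity(answer_vec, truth_vec))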

4. F1-Score is calculated as the harmonic mean of Precision and Recall, which are determined from the number of common tokens between ground_truth and answer. Unlike typical tokenizers, where a token can refer to a subword, the tokens used in this calculation are simply individual words separated by spaces in the normalized form of the input text.

if num_common_tokens == 0:
    f1 = 0
else:
    precision = 1.0 * num_common_tokens / len(prediction_tokens)
    recall = 1.0 * num_common_tokens / len(reference_tokens)

    f1 = (2 * precision * recall) / (precision + recall)

return f1
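For intuition, here is a self-contained sketch of this word-overlap F1 computed from raw strings; the helper name and the exact normalization rules are illustrative, not the built-in flow’s code:

import re
from collections import Counter


def simple_f1(answer: str, ground_truth: str) -> float:
    # Hypothetical helper: lowercase, strip punctuation, split on whitespace.
    def normalize(text: str) -> list[str]:
        return re.sub(r"[^\w\s]", "", text.lower()).split()

    prediction_tokens = normalize(answer)
    reference_tokens = normalize(ground_truth)

    # Count tokens that appear in both lists, respecting duplicates.
    common = Counter(prediction_tokens) & Counter(reference_tokens)
    num_common_tokens = sum(common.values())
    if num_common_tokens == 0:
        return 0.0

    precision = num_common_tokens / len(prediction_tokens)
    recall = num_common_tokens / len(reference_tokens)
    return (2 * precision * recall) / (precision + recall)


print(simple_f1("The evidence includes cosmic microwave background radiation.",
                "Cosmic microwave background radiation is key evidence."))  # ≈ 0.71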

Lastly, if you’d like to create a custom metric, you can develop a new evaluation flow from scratch as mentioned here.
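To give a flavour of what that involves, a custom evaluation flow boils down to a per-line scoring node plus an aggregation node that logs the final metric. A minimal sketch is below (the exact-match metric itself is just an example, and the aggregation node also needs aggregation: true set in the flow definition):

from promptflow import tool, log_metric


@tool
def exact_match(answer: str, ground_truth: str) -> int:
    # Per-line node: 1 if the normalized strings match exactly, else 0.
    return int(answer.strip().lower() == ground_truth.strip().lower())


@tool
def aggregate(match_scores: list[int]) -> float:
    # Aggregation node: runs once over all lines and logs the overall rate.
    rate = sum(match_scores) / len(match_scores) if match_scores else 0.0
    log_metric(key="exact_match_rate", value=rate)
    return rate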

Thanks for reading!

Related Article:

https://medium.com/@shahzebnaveed/develop-a-ui-for-azure-prompt-flow-with-streamlit-f425342029ce
