Ragas metrics for Thai language

Published in ibm-watsonx-th · Aug 27, 2024

Authors: tanyaaton, Mew Leenutaphong, Chananop W, Pongsasit Thongpramoon

As RAG applications are built and moved into production, comprehensive monitoring and evaluation of NLP models becomes increasingly important.

Introduction

  1. Using BLEU and ROUGE scores to evaluate
  • BLEU Score assesses the similarity between machine-translated text and reference translations using n-grams, which are contiguous sequences of n words: unigrams (single words), bigrams (two-word sequences), trigrams (three-word sequences), and so on.
  • ROUGE Score assesses the similarity between a machine-generated summary and reference summaries using overlapping n-grams.

2. Using Custom Ragas metrics to evaluate

  • Ragas is a framework that helps evaluate your Retrieval Augmented Generation (RAG) pipelines. It provides several metrics, for example Faithfulness, Answer relevancy, Context recall, and Context precision (https://docs.ragas.io/en/stable/). For capturing semantic meaning this is the preferred approach, because it evaluates with an LLM rather than with n-gram overlap. A minimal usage sketch follows below.
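
As a rough illustration, here is a minimal sketch of how a Ragas evaluation might look, assuming ragas 0.1.x with its default OpenAI-based judge; the question, answer, and context below are illustrative placeholders.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Illustrative single-row dataset; real evaluations would use many rows.
data = {
    "question": ["What is the annual leave entitlement?"],
    "answer": ["Employees are entitled to 15 days of annual leave per year."],
    "contexts": [["The HR policy grants 15 days of paid annual leave per calendar year."]],
}
dataset = Dataset.from_dict(data)

result = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(result)  # e.g. {'faithfulness': ..., 'answer_relevancy': ...}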

How do these metrics perform on the Thai language?

Thai syntax differs from English: Thai words are written without spaces between them (‘thaiwordsarewrittenlikethis’), which makes n-gram-based evaluation methods like BLEU and ROUGE a poor fit. These methods rely on n-gram overlap to assess text and struggle to evaluate Thai accurately because they cannot separate words that are packed together. Consequently, BLEU may misjudge meaning and fluency in Thai, and ROUGE, which also depends on n-grams, may not evaluate Thai content reliably.
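
To make this concrete, here is a small sketch (assuming the nltk package; the Thai sentences are illustrative) of how BLEU collapses when Thai text is not word-segmented first:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Two Thai sentences with essentially the same meaning ("employees may take
# fifteen days of annual leave per year"), written without spaces as usual.
reference = "พนักงานมีสิทธิ์ลาพักร้อนสิบห้าวันต่อปี"
candidate = "พนักงานลาพักร้อนได้สิบห้าวันต่อปี"

# Whitespace tokenization sees each sentence as a single, different "word",
# so there is no n-gram overlap at all.
score = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)
print(score)  # essentially zero, even though the meanings match

# A Thai word segmenter (e.g. pythainlp.word_tokenize) would be needed before
# BLEU/ROUGE become meaningful, and even then synonyms are still penalized.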

Currently the standard RAGAS metrics are built on the GPT-4o model, and from our initial testing in Thai the evaluation quality is not as good as in English. Hence, we built a custom RAGAS repository based on the llama3.1 models in IBM watsonx.ai (smaller but more cost-effective models).
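
For reference, this is roughly how the judge model can be swapped. This is a sketch only: it assumes ragas 0.1.x (which accepts LangChain LLMs and embeddings) and the langchain-ibm package; the model IDs, URL, and project ID are placeholders, and credentials are read from the environment (e.g. WATSONX_APIKEY).

from langchain_ibm import WatsonxLLM, WatsonxEmbeddings
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

judge_llm = WatsonxLLM(
    model_id="meta-llama/llama-3-1-70b-instruct",   # placeholder model ID
    url="https://us-south.ml.cloud.ibm.com",
    project_id="YOUR_PROJECT_ID",
    params={"max_new_tokens": 512, "temperature": 0.0},
)
embeddings = WatsonxEmbeddings(
    model_id="intfloat/multilingual-e5-large",       # placeholder embedding model
    url="https://us-south.ml.cloud.ibm.com",
    project_id="YOUR_PROJECT_ID",
)

# `dataset` built as in the earlier sketch
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy],
    llm=judge_llm,           # replaces the default GPT-based judge
    embeddings=embeddings,   # used by answer relevancy
)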

Using Custom Ragas metrics to evaluate Thai language content

[Figure: overview of Ragas metrics. Source: https://docs.ragas.io/en/latest/concepts/metrics/index.html]

We focus on two scores in this repository: Faithfulness and Answer Relevancy.

  1. Faithfulness
https://docs.ragas.io/en/latest/concepts/metrics/faithfulness.html

The Faithfulness score measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context.
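
As described in the Ragas documentation, the score is the fraction of claims in the answer that can be inferred from the retrieved context:

\[
\text{Faithfulness} = \frac{\left|\text{claims in the answer that can be inferred from the context}\right|}{\left|\text{total claims in the answer}\right|}
\]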

2. Answer Relevancy

https://docs.ragas.io/en/latest/concepts/metrics/answer_relevance.html

Answer Relevancy evaluates how relevant the generated answer is to the given prompt.
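
Per the Ragas documentation, the judge LLM generates N artificial questions from the answer, and the score is the mean cosine similarity between their embeddings and the embedding of the original question:

\[
\text{Answer Relevancy} = \frac{1}{N}\sum_{i=1}^{N}\cos\!\big(E_{g_i},\, E_{o}\big)
\]

where \(E_{g_i}\) is the embedding of the i-th generated question, \(E_{o}\) is the embedding of the original question, and \(N\) is the number of generated questions (3 by default in Ragas).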

We implemented an open-sourced repo integrated with IBM watsonx.ai. Traditionally we used GPT-4 as the evaluator, but in this case we decided to use llama3.1 to optimize cost.

Once we had chosen llama3.1, we pre-generated 80 HR-policy-related questions covering different areas such as numerical questions, out-of-context questions, and suggestions. [repo link to be released]

You can directly modify the model used for evaluation and the paths used to evaluate model performance in a settings.yaml file.
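
Since the repository is not yet released, the keys below are hypothetical; the point is simply that the judge model and evaluation paths are configured in settings.yaml rather than hard-coded.

import yaml  # assumes the PyYAML package

# Hypothetical illustration only; the real settings.yaml keys may differ once
# the repository is released. It might look roughly like:
#   evaluation_model_id: meta-llama/llama-3-1-70b-instruct
#   questions_path: data/hr_questions.csv
#   results_path: results/
with open("settings.yaml") as f:
    settings = yaml.safe_load(f)

judge_model_id = settings["evaluation_model_id"]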

Results

[Chart: using the Faithfulness score to evaluate models]
[Chart: using the Answer Relevancy score to evaluate models]
Using custom metrics for evaluation
def prompt_generation(reference, question, written_answer, answer):
    """Build the chat messages used to grade a model response against the human-written answer."""
    messages = {
        "messages": [
            {
                "content": "You are a careful grader tasked with evaluating the differences between a response from a model and an answer written by a human.",
                "role": "system"
            },
            {
                "content": f"""User question: {question}

Sources that was retrieved from vector database (Could be right or wrong):
{reference}

Model Answer written by human which is always right:
{written_answer}

Response from the model:
{answer}

Model Grading Criteria:

Accuracy of Information (0-4 points):
4 points: The response is perfectly accurate, matching the model answer without any discrepancies.
2-3 points: Minor differences from the model answer are present, but they do not significantly distort the informational content.
0-1 point: Major discrepancies from the model answer, or content not found in the model answer, greatly affecting the response's credibility.

Relevance and Completeness (0-3 points):
3 points: The response completely aligns with the model answer in terms of relevance, addressing the query in full.
1-2 points: The response generally aligns with the model answer but misses some important nuances or aspects.
0 points: Significant deviation from the model answer in terms of relevance or completeness, missing key portions of the query or altering the intended information.

Clarity and Coherence (0-2 points):
2 points: The response maintains the clarity and structure of the model answer, well-structured and coherent.
1 point: Minor deviations in clarity or coherence from the model answer, with some grammatical/style issues or logical inconsistencies.
0 points: Major divergence from the model answer in terms of structure or coherence, difficult to understand or poorly articulated.

Traceability (0-1 point):
1 point: All information in the response can be directly traced to and aligns with the model answer.
0 points: The response introduces conjectures or contains untraceable pieces of information not aligned with the model answer.

Qualitative Feedback:
Compare the model response with the human written response as the ground truth, to grade the model response out of 10.

Response Format:
Score out of 10: []

Response:""",
                "role": "user"
            },
            {
                # The assistant turn is pre-filled so the model continues with just the grade
                "content": "Score out of 10: [",
                "role": "assistant"
            },
        ]
    }
    return messages["messages"]

The above is the prompt used for custom metrics.
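
Note that the final assistant turn ends with "Score out of 10: [", priming the model to continue with just the numeric grade. Below is a minimal sketch of how the messages might be used; `chat` is a hypothetical helper (not part of the released code) that sends the messages to llama3.1 on watsonx.ai and returns the generated continuation as a string.

import re

messages = prompt_generation(reference, question, written_answer, answer)
completion = chat(messages)  # e.g. "7] The response matches the policy ..."

# Pull the first number out of the continuation as the custom-metric score.
match = re.search(r"\d+(?:\.\d+)?", completion)
custom_score = float(match.group()) if match else None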

We can see that, using these methods to evaluate the models, the model that performs best in terms of faithfulness is Meta-Llama-3.1-70B, the best for answer relevancy is Meta-Llama-3-8B, and the best on the custom metric is Meta-Llama-3.1-405B.

Conclusion

In conclusion, since Thai characters are written tightly next to each other without spaces in between, we suggest using Custom Ragas metrics, as they evaluate the output with an LLM's capabilities. This means we can ignore the constraints of Thai syntax and assess the output more effectively than with the BLEU and ROUGE scoring methods, which rely heavily on n-grams.

On the other hand, because BLEU and ROUGE rely heavily on n-grams, these scores may be more suitable for evaluating content generated in languages that do have spaces between words (e.g. English, Spanish, etc.).

References:
https://medium.com/@sthanikamsanthosh1994/understanding-bleu-and-rouge-score-for-nlp-evaluation-1ab334ecadcb
https://docs.ragas.io/en/latest/concepts/metrics/answer_relevance.html
https://docs.ragas.io/en/latest/concepts/metrics/faithfulness.html
