Custom Evaluators for LLM using Langchain with codes and example

vTeam.ai
Data Science in your pocket
5 min read · Nov 22, 2023

We have been talking a lot about GenAI in our past blogs. We also covered how to evaluate the performance of LLMs in one of our previous posts. In this post, going a step further, we will talk about how to build custom evaluators for judging LLM results.

Our previous post on predefined string evaluators provided by Langchain

String evaluators for LLMs using Langchain

In this post, we will be discussing custom evaluators for both supervised and unsupervised problem statements. But before that, let's create the LLMChain that we will be evaluating: a teacher that gives an answer to anything asked.

For the full code, follow this post: VTeam | Custom Evaluators for LLM using Langchain with codes and example
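Below is a minimal sketch of what such a "teacher" LLMChain could look like. The prompt wording and the api_key variable are illustrative assumptions, not the exact code from the linked post.

from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

# illustrative "teacher" prompt; adjust the wording to taste
teacher_prompt = PromptTemplate.from_template(
    "You are a helpful teacher. Answer the student's question clearly.\n\nQuestion: {question}\nAnswer:"
)

# api_key is assumed to hold your OpenAI API key
chain = LLMChain(llm=ChatOpenAI(openai_api_key=api_key), prompt=teacher_prompt)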

Now that our chain is ready, let's talk about custom evaluators for supervised problems.

Supervised

from langchain.chat_models import ChatOpenAI
from langchain.evaluation import load_evaluator

# 'accuracy' is the metric name; the value is the custom scorecard
score_criteria = {
    "accuracy": """Score 1: The answer is completely unrelated to the reference.
Score 3: The answer has minor relevance but does not align with the reference.
Score 5: The answer has moderate relevance but contains inaccuracies.
Score 7: The answer aligns with the reference but has minor errors or omissions.
Score 10: The answer is completely accurate and aligns perfectly with the reference."""
}

prompt = 'How to make tea?'

evaluator = load_evaluator("labeled_score_string", criteria=score_criteria, llm=ChatOpenAI(model="gpt-4", openai_api_key=api_key))

prediction = chain.run(prompt)

eval_result = evaluator.evaluate_strings(
    prediction=prediction,
    reference="To make tea, boil fresh water, pour it over tea leaves or a tea bag in a teapot or teacup, \
let it steep for the recommended time (varies by tea type), and then remove the tea leaves or tea bag",
    input=prompt,
)

print('LLMs answer:', '\n'.join(prediction.split('.')))
print('\n'.join(eval_result['reasoning'].split('.')))

The above code snippet creates a metric called 'accuracy', which acts as a scoring system that assigns different scores to the prediction depending on its relevance to the ground truth. Let's understand this line by line:

  • score_criteria is a dictionary that has the metric name 'accuracy' as the key and the custom scorecard (describing the quality of the prediction) as the value
  • Using load_evaluator(), we load an evaluator, passing score_criteria as the criteria to evaluate against
  • Using the evaluator object, we pass the prediction made and the ground truth as 'reference'
  • The results go in the 'reasoning' key of the eval_result variable (see the snippet right after this list)
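For reference, LangChain's scoring evaluators return the numeric rating under a 'score' key alongside 'reasoning'; a quick sketch of inspecting the result, reusing eval_result from the snippet above:

# eval_result is a plain dictionary returned by evaluate_strings()
print(eval_result.keys())                  # e.g. dict_keys(['reasoning', 'score'])
print('Rating:', eval_result['score'])     # the 1-10 rating from the custom scorecard
print('Why:', eval_result['reasoning'])    # the evaluator's descriptive explanation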

See the results for yourself for a clear understanding

As you can see, the evaluator gives a descriptive assessment of the quality of the prediction against the ground truth, alongside a rating from the custom scorecard we designed.

Let’s try one more with the same custom accuracy metric

prompt = 'Which team has won 2nd most Cricket world cups, including both T20 and ODI and how many in each category?'

As you can see, since the LLM gave a wrong answer, the rating went down!

Moving on to

Unsupervised

Since an unsupervised problem statement doesn't have a ground truth, we will instead define a custom criterion (or a group of custom criteria) and a rating based on it.

custom_criteria = {
    "Spiritual": "The assistant's answer should be on a deeper sense of purpose and meaning in life.",
}

prompt = 'Which team has won 2nd most Cricket world cups, including both T20 and ODI and how many in each category?'

evaluator = load_evaluator("score_string", criteria=custom_criteria, llm=ChatOpenAI(openai_api_key=api_key))

prediction = chain.run(prompt)

eval_result = evaluator.evaluate_strings(
    prediction=prediction,
    input=prompt,
)

print('LLMs answer:', '\n'.join(prediction.split('.')))
print('\n'.join(eval_result['reasoning'].split('.')))

As you can see, we have slightly changed the code compared to the one used for the supervised problem:

  • In load_evaluator(), we pass "score_string" and not "labeled_score_string" as in the supervised case
  • No ground truth is passed, so the evaluation won't be fully objective

Everything else remains the same.

As you can see, since the answer given is factual (rather than spiritual), the rating is quite low.

Let's try this custom evaluator with a combination of custom criteria, keeping everything else the same. With multiple criteria, the final rating is based on the average rating across all the criteria mentioned. Let's have a look.

custom_criteria = {
    "Spiritual": "The assistant's answer should be on a deeper sense of purpose and meaning in life.",
    "Involves numbers": "The assistant's answer should have numbers",
    "requires internet": "The assistant's answer should have facts and require external resources"
}
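The rest of the flow stays the same; here is a quick sketch of re-running the evaluation with the combined criteria, reusing chain, api_key and prompt from the earlier snippets:

# reload the unsupervised evaluator, now with three criteria
evaluator = load_evaluator("score_string", criteria=custom_criteria, llm=ChatOpenAI(openai_api_key=api_key))

eval_result = evaluator.evaluate_strings(
    prediction=chain.run(prompt),
    input=prompt,
)

print('\n'.join(eval_result['reasoning'].split('.')))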

For the same question

Here you can read how 2 of the 3 criteria were met; hence the rating is not as bad as the previous one.

Concluding this code-heavy post, we've uncovered the exciting possibilities this framework brings to both supervised and unsupervised problem-solving. By delving into the LangChain ecosystem, we've empowered ourselves to tailor evaluations to our specific needs, fostering a more nuanced understanding of language models. LangChain's user-friendly approach makes coding custom evaluators a breeze. As we bid adieu to this exploration, let's celebrate the newfound ability to create evaluators that align seamlessly with our unique requirements. LangChain's commitment to simplifying the coding process has undoubtedly made our journey enjoyable and productive. Here's to the exciting world of custom evaluators and the endless possibilities they unlock!

Disclaimer: The views and opinions expressed in this blog post are solely those of the authors and do not reflect the official policy or position of any of the mentioned tools. This blog post is not a form of advertising and no remuneration was received for the creation and publication of this post. The intention is to share our findings and experiences using these tools and is intended purely for informational purposes.

Originally published at https://vteam.ai.
