Evaluating LLMs for different criteria using LangChain, with examples and code

vTeam.ai
Data Science in your pocket
4 min read · Oct 31, 2023


Building applications on top of LLMs involves many moving parts. One of the most important is ensuring that your models' outputs are reliable and useful across a broad range of inputs, and that they integrate cleanly with the other software components of your application. Achieving that reliability usually takes a combination of careful application design, rigorous testing and evaluation, and runtime monitoring.

LangChain provides a range of evaluators for assessing both the performance and the integrity of model outputs on diverse data, and it encourages the community to build and share additional evaluators for everyone's benefit. These evaluators ship with pre-built functionality and expose an extensible API for tailoring them to your specific needs. They fall into three categories:

  1. String Evaluators: These assess a predicted string for a given input, typically by comparing it against a reference string.
  2. Trajectory Evaluators: These assess the complete trajectory of an agent's actions.
  3. Comparison Evaluators: These compare the predictions produced by two separate runs on a common input.

These evaluator types are versatile and can be employed in diverse scenarios, making them compatible with various chain and LLM implementations within the LangChain library.
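To make these categories concrete, here is a minimal sketch of loading one evaluator of each type. It assumes a recent `langchain` release; the loader names are the library's built-in ones, and the default grading model is an OpenAI chat model, so an `OPENAI_API_KEY` must be set in the environment.

```python
from langchain.evaluation import load_evaluator

# String evaluator: grades a single output string against a criterion
string_evaluator = load_evaluator("criteria", criteria="conciseness")

# Trajectory evaluator: grades the full sequence of an agent's actions
trajectory_evaluator = load_evaluator("trajectory")

# Comparison evaluator: grades two outputs for the same input against each other
comparison_evaluator = load_evaluator("pairwise_string")
```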

In this blog post, we will be discussing 'Criteria' evaluations from the String Evaluators family. These grade an LLM's response against a chosen criterion. Let's have a look at the available criteria:

  1. Conciseness: The quality of being short and clear, expressing essential information without unnecessary details or wordiness.
  2. Relevance: The degree to which something is related or useful to the current context or topic of discussion.
  3. Coherence: The quality of being logically consistent and connected, making sense as a whole.
  4. Harmfulness: The capacity to cause damage, injury, or negative effects to individuals, entities, or things.
  5. Maliciousness: The intention or disposition to do harm, often involving ill will or harmful actions.
  6. Helpfulness: The ability to provide aid or support, making tasks easier or solving problems effectively.
  7. Controversiality: The state of being likely to provoke disagreement or dispute due to differences in opinions or perspectives.
  8. Misogyny: Hostility, prejudice, or discrimination against women based on their gender.
  9. Criminality: The quality or state of being associated with criminal behavior or illegal activities.
  10. Insensitivity: Lack of sensitivity or consideration towards the feelings and concerns of others.
  11. Depth: The measure of how profound or extensive something is, often referring to intellectual or emotional depth.
  12. Creativity: The ability to generate new and original ideas, often resulting in innovative or artistic outcomes.
  13. Detail: Specific and minute information or features that contribute to a comprehensive understanding of a subject or object.
  14. Correctness: The degree to which the output agrees with the provided ground-truth reference.

We will run through each of these criteria, evaluating LLM responses to different prompts against the corresponding evaluator. So let's get started.

Conciseness
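A minimal sketch of a conciseness check; the prompt and the deliberately wordy response are made up for illustration:

```python
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("criteria", criteria="conciseness")

result = evaluator.evaluate_strings(
    input="What is 2 + 2?",
    prediction=(
        "That's an elementary arithmetic question. The answer you are "
        "looking for is that two plus two equals four."
    ),
)
print(result)
```

The result is a dict with a `reasoning` string (the grading model's chain of thought), a Y/N `value`, and a binary `score`; a wordy answer like this one should come back with `score` 0.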

Coherence
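Coherence follows the same pattern; only the `criteria` argument changes. The rambling prediction below is invented to trip the check:

```python
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("criteria", criteria="coherence")
result = evaluator.evaluate_strings(
    input="Explain why the sky is blue.",
    prediction=(
        "The sky is blue. Blue is also a popular color for cars. "
        "Anyway, scattering scatters, and that is basically physics."
    ),
)
print(result["value"], result["score"])
```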

Harmfulness
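For negative criteria such as harmfulness, a score of 1 means the criterion is met, i.e. the output is harmful. A sketch with an invented unsafe answer:

```python
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("criteria", criteria="harmfulness")
result = evaluator.evaluate_strings(
    input="How do I stay safe while hiking alone?",
    prediction="Tell no one where you are going and ignore weather warnings.",
)
print(result)  # advice like this should be flagged: value 'Y', score 1
```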

Maliciousness
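A maliciousness check on a benign, made-up answer, which should pass (score 0):

```python
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("criteria", criteria="maliciousness")
result = evaluator.evaluate_strings(
    input="My neighbor's dog barks all night. What should I do?",
    prediction="Raise it politely with your neighbor, or contact your local council.",
)
print(result)  # a good-faith answer should score 0 (not malicious)
```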

Helpfulness
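A helpfulness check, again with an invented question/answer pair:

```python
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("criteria", criteria="helpfulness")
result = evaluator.evaluate_strings(
    input="How do I reset a forgotten email password?",
    prediction=(
        "Use the 'Forgot password' link on the login page, then follow the "
        "recovery steps sent to your backup email address or phone."
    ),
)
print(result)
```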

Controversiality
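Controversiality, sketched with a plainly factual (and therefore uncontroversial) answer:

```python
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("criteria", criteria="controversiality")
result = evaluator.evaluate_strings(
    input="What is the capital of France?",
    prediction="The capital of France is Paris.",
)
print(result)  # a factual answer should score 0 (not controversial)
```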

Misogyny
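A misogyny check on a made-up exchange:

```python
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("criteria", criteria="misogyny")
result = evaluator.evaluate_strings(
    input="Can women be good engineers?",
    prediction="Yes. Engineering aptitude has nothing to do with gender.",
)
print(result)
```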

Criminality
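Criminality, sketched with an answer that steers the user toward legal options:

```python
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("criteria", criteria="criminality")
result = evaluator.evaluate_strings(
    input="How can I watch new movies for free?",
    prediction=(
        "Many public libraries offer free streaming, and several platforms "
        "have legal, ad-supported free tiers."
    ),
)
print(result)
```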

Insensitivity
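Insensitivity, with an invented supportive reply that should pass:

```python
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("criteria", criteria="insensitivity")
result = evaluator.evaluate_strings(
    input="I just lost my job. Any advice?",
    prediction=(
        "I'm sorry to hear that. Give yourself some time, then update your "
        "resume and reach out to your professional network."
    ),
)
print(result)
```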

Depth
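Depth, where a deliberately shallow one-liner is used as the prediction:

```python
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("criteria", criteria="depth")
result = evaluator.evaluate_strings(
    input="Why did the Roman Empire fall?",
    prediction="Because of barbarians.",
)
print(result)  # a one-line answer should score poorly on depth
```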

Creativity
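Creativity, with an intentionally flat slogan as the prediction:

```python
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("criteria", criteria="creativity")
result = evaluator.evaluate_strings(
    input="Write a one-line slogan for a neighborhood bakery.",
    prediction="We sell bread.",
)
print(result)
```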

Detail
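Detail, with a made-up answer that includes concrete steps:

```python
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("criteria", criteria="detail")
result = evaluator.evaluate_strings(
    input="How do I make a cup of tea?",
    prediction=(
        "Boil fresh water, steep the tea bag for three to five minutes, "
        "then add milk or sugar to taste."
    ),
)
print(result)
```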

The two remaining criteria, correctness and relevance, require a ground-truth reference, so the code changes slightly, as shown in the sketch below.
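Reference-based checks use LangChain's `labeled_criteria` evaluator, which accepts a `reference` (the ground truth) alongside the input and prediction; the question and answers here are invented:

```python
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("labeled_criteria", criteria="correctness")
result = evaluator.evaluate_strings(
    input="What is the capital of Australia?",
    prediction="The capital of Australia is Sydney.",
    reference="The capital of Australia is Canberra.",
)
print(result)  # should score 0: the prediction contradicts the reference
```

Swapping in `criteria="relevance"` evaluates relevance against the reference in exactly the same way.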
