Evaluating an answer with Generative AI & LangChain

Adnan Malik
6 min read · Mar 28, 2024


Questions are easy. Answers are hard.

LLMs love to generate answers to questions. However, they can also evaluate the quality of existing answers, whether those come from humans or machines.

We’ll use answer evaluation as an example of how LangChain and an LLM can be put to work together. You can find the code for this experiment at: https://github.com/adnantium/answerEvallm

How it works

The Large Language Model (LLM) will serve as an evaluator, assessing answers based on a specific “persona” and predefined evaluation criteria. Each answer will receive a score. The output will be a well-structured JSON file that includes the scores, comments, and definitions for each evaluation criterion, along with the name of the evaluating model.

Implementing flexible answer evaluations for real-world questions across multiple subject areas using traditional human-written code would be incredibly challenging. It would require Natural Language Processing (NLP), a pre-processed data knowledge base for reference, a complex rules engine, detailed specifications from business users, and extensive testing. Despite these efforts, the quality of results may not surpass those produced by a GenAI approach.

LangChain

LangChain is a framework designed to build and deploy applications, such as chatbots and agents, that utilize language models and integrate with external tools and services.

It’s one of the most popular tools for building AI-integrated data flows, and it continues to expand daily with the addition of more advanced features and improvements. It truly is moving very fast, and keeping up-to-date on its evolution requires regular reading and experimentation.

An essential ingredient of its success is its ability to flexibly create complex prompts that can define the context and guide the language model’s responses in the right direction.

This article will cover some key aspects of LangChain:

  • Prompt creation
  • Interaction with a Large Language Model (LLM)
  • Output parsing

Additionally, I will guide you through the process of setting up the codebase and running it locally.

The Prompt

Key components of the prompt include:

  • Persona: The role that the AI should take when generating the response. e.g. a high school teacher, a college professor, a scientist, etc.
  • Mission: The primary objective, e.g. to provide a comprehensive answer to a question, to provide a summary of a text, etc.
  • Evaluation Criteria: The specific aspects of the answer that should be evaluated. e.g. completeness, correctness, grammar, etc.
  • Output Format: A specification for the response’s structure. e.g. a json object schema.
  • Question: The question that was asked.
  • Answer: The answer to the question that should be evaluated for quality.

An example of evaluation criteria:

  • Completeness: The extent to which the answer covers all aspects of the question.
  • Correctness: The accuracy of the information provided in the answer.
  • Grammar: Proper use of grammar, punctuation, and spelling; the answer must be written in full sentences.
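To make these components concrete, here is a rough sketch of what such a template could look like. It is illustrative only, not copied from the repository; the persona, mission wording, and evaluator line are assumptions, though the {criteria_list_text}, {format_instructions}, {question}, and {answer} slots mirror the variables used later in the article.

# Hypothetical template in the spirit of the prompt components above.
# The curly-brace slots are filled in later by LangChain's PromptTemplate.
EXAMPLE_TEMPLATE = """You are a high school teacher grading student work.
Your mission is to evaluate the quality of the answer to the question below.

Evaluate the answer against each of these criteria:
{criteria_list_text}

{format_instructions}

Report the evaluator name as: {evaluator_name}

Question: {question}
Answer: {answer}
"""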

The LLM’s Output

Output for each criterion will include:

  • Comments: The feedback on the quality of the answer. e.g. The answer is accurate but incomplete. Include more differences and similarities.
  • Score: The score based on the evaluation criteria. e.g. 80/100
  • Criteria Name: e.g. Correctness
  • Criteria Definition: e.g. The accuracy of the information provided in the answer
  • Evaluator Name: The name of the AI model that is evaluating the answer. e.g. gpt-4
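Putting that together, the evaluation for a single criterion might come back looking roughly like this; the field names below are indicative of the idea rather than the repository’s exact schema.

# Hypothetical shape of one evaluation result, using the example values above.
example_evaluation = {
    "evaluator_name": "gpt-4",
    "evaluations": [
        {
            "criteria_name": "Correctness",
            "criteria_definition": "The accuracy of the information provided in the answer.",
            "score": 80,
            "comments": "The answer is accurate but incomplete. Include more differences and similarities.",
        },
    ],
}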

Quality of the results

The responses from ChatGPT and pretty much all other top models are non-deterministic. This creates some surprising responses and inconsistencies. These are not hallucinations; the variation comes from the LLM following a slightly different “path” to construct its response for each request. Because of this, it will occasionally give different evaluations to the same question and answer when run multiple times, sometimes scoring a submission 80 when the identical submission previously scored 90.

That's not good. Consistency is important. Some ways to improve this:

Lower the temperature

ChatGPT uses this parameter to control the model’s “randomness and creativity”. Turning it down to 0 improved the consistency of the results significantly, but still not to 100%.
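In LangChain this is just a constructor argument on the chat model. A minimal sketch, assuming the langchain_openai package and the gpt-4 model named in the output example:

from langchain_openai import ChatOpenAI

# temperature=0 asks for the most deterministic output the model can give;
# responses can still vary between runs, just less often.
model = ChatOpenAI(model="gpt-4", temperature=0)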

Be very specific about the objective

Give the model a good level of detail on the context of the problem and the success criteria for it to work towards. This was done by specifying a “persona” and “mission statement”.

Give clear definitions

Terms like “score” can have many different interpretations and meanings. Adding a detailed definition of it significantly improved the reliability of the results:

Use the following scale to score the answer for each criteria:
* 100: Excellent. The answer fully meets the criteria's definition.
* 90-99: Very Good. The answer is very good but not perfect.
* 80-89: Good. The answer has some minor issues.
* 70-79: Fair. The answer has some major issues.
* 60-69: Poor. The answer has many issues.
* 0-59: Very Poor. The answer does not meet the criteria's definition.

Give examples

This example chain is designed to be open-ended and not specialized for a particular domain or subject area. It can evaluate answers on a wide range of topics, from flowers and cars to impressionist painters and stoic philosophers. Essentially, it can handle any subject area for which you can pose a question and an answer.

Surprisingly, providing any specific examples causes more harm than good in this use case. It ends up making the model over-focus on generating responses that look similar to the examples given to it. The responses get “polluted” by the examples, which lowers the quality of the final results. I’m sure there are ways to get around this by asking or tricking it in the right way, but that requires more experimentation.

Setup

Setting up the codebase and running the server is pretty straightforward.

You will need Python, Poetry, and Git to get started.

> git clone https://github.com/adnantium/answerEvallm.git
...
> cd answerEvallm/
> poetry install
...

Start up a LangServe server, which is essentially a wrapper around FastAPI (I love FastAPI):

> export OPENAI_API_KEY='abc123-your-key-from-openai-keep-it-safe'
> python answer_eval_server.py
...
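For orientation, answer_eval_server.py is presumably little more than a FastAPI app with the chain mounted through LangServe’s add_routes. Here is a sketch under that assumption; the module and variable names are hypothetical, not the repo’s actual layout:

from fastapi import FastAPI
from langserve import add_routes

# Hypothetical import: wherever the repo builds its evaluation chain.
from answer_eval_chain import chain

app = FastAPI()
# Mounting under /answer_eval is what produces the /answer_eval/playground/ URL below.
add_routes(app, chain, path="/answer_eval")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)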

Next we’ll hit the demo page and confirm that all is good: http://localhost:8000/qa. You should see a basic form with question and answer fields. Give it a try. Note that getting a response from OpenAI takes time, usually about 5 seconds but sometimes 10, so be patient.

You can also go to http://localhost:8000/answer_eval/playground/ where you can enter the same info to get a debug-level view into the chain’s activity and its intermediate inputs and outputs.
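LangServe also exposes a programmatic endpoint for each route, so you can call the chain from Python as well. The input keys below are an assumption based on the input variables discussed later; adjust them to the chain’s actual schema.

from langserve import RemoteRunnable

# Points at the /answer_eval route served by the local LangServe server.
evaluator = RemoteRunnable("http://localhost:8000/answer_eval/")

result = evaluator.invoke({
    "question": "What is the difference between a comet and an asteroid?",
    "answer": "Comets are icy bodies that grow tails near the sun; asteroids are mostly rocky.",
})
print(result)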

Under the hood

We utilized the PromptTemplate class to build the final text for the prompt that will be sent to the LLM. The essential bit of code is:

from langchain.prompts import PromptTemplate

# question and answer are filled in per request; the partial variables are bound up front.
prompt = PromptTemplate(
    template=template,
    input_variables=["question", "answer"],
    partial_variables={
        "format_instructions": output_parser.get_format_instructions(),
        "criteria_list_text": build_criterion_list_text(criterion),
        "evaluator_name": model.model_name,
    },
)

input_variables vs partial_variables

LangChain distinguishes between two types of variables for good reasons.

  • input_variables (such as question and answer) are supplied with each request, right before the prompt is executed. They are the data the caller provides and are expected to be accurate and complete at invocation time.
  • In contrast, partial_variables are filled in ahead of time, when the PromptTemplate is constructed, because their values do not depend on the incoming request. Here that covers the format instructions, the criteria text, and the evaluator’s model name (see the sketch just below).
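Reusing the prompt, model, and output_parser names from the snippet above, a minimal sketch of the difference in practice; the exact wiring in the repo may differ.

# The partial variables were already bound when the PromptTemplate was built,
# so at request time the caller only supplies the input variables.
chain = prompt | model | output_parser
evaluation = chain.invoke({"question": "...", "answer": "..."})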

The evaluations will contain scores for each criterion and comments on each aspect of the answer. Other evaluation criteria can be added as needed, such as relevance, coherence, conciseness, humor, etc.

Parsing the Output

The LLM produces JSON data as plain text, which is used to create a Pydantic object. This object contains the evaluation results, including criteria scores, comments, and definitions. The schema for this object was specified as an input into the PromptTemplate using the parser’s get_format_instructions() method.
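A hedged sketch of what that parser setup could look like; the actual Pydantic model in the repository may use different class and field names.

from typing import List

from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field

class CriterionEvaluation(BaseModel):
    criteria_name: str = Field(description="Name of the criterion, e.g. Correctness")
    criteria_definition: str = Field(description="Definition of the criterion")
    score: int = Field(description="Score from 0 to 100")
    comments: str = Field(description="Feedback on the quality of the answer")

class AnswerEvaluation(BaseModel):
    evaluator_name: str = Field(description="Name of the evaluating model, e.g. gpt-4")
    evaluations: List[CriterionEvaluation] = Field(description="One entry per criterion")

output_parser = PydanticOutputParser(pydantic_object=AnswerEvaluation)

# This is the text injected into the prompt via the format_instructions partial variable.
print(output_parser.get_format_instructions())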

The LangChain framework, when used in conjunction with a Large Language Model, provides a highly capable solution for evaluating the quality of an answer to a question in almost any domain or subject area that the model has been trained on. It allows for a clear definition of evaluation criteria, easy configuration, a simplified connection to the LLM, structured output formatting, and integration with supporting tools.

The takeaway from this experiment is that LLMs still face challenges with consistency and reliability in their responses. Although this is rapidly improving with newer models, we have yet to reach a stage where an LLM (off the shelf, without fine-tuning) can be used for critical decision-making. This primarily rules out its use for anything that impacts people’s finances or health. Humans love money and fear death, so we still need to build more trust in LLMs, and I’m sure they will continue to earn it.
