A LangChain implementation of Chain of Verification (CoVe) to reduce hallucination in LLMs

James Li
Oct 9, 2023


The CoVe 4-step process implemented using LangChain, showing the input and output variables

Update (2023-10-22): added a search agent to the verification step

Background

In the recently released paper Chain-of-Verification Reduces Hallucination in Large Language Models, the authors show how Chain-of-Verification (CoVe) can reduce hallucination through a 4-step process:

  1. Generate baseline response (query LLM)
  2. Plan verifications (given the query and baseline response, generate a list of questions that help verify any mistakes in the original response)
  3. Execute verifications (answer each verification question, check against original response for inconsistency / mistakes)
  4. Generate final verified response (generate a revised response incorporating the results from the verification step if there are any inconsistencies)

This blog post shows how it can be implemented step-by-step using LangChain. The complete Google Colab can be accessed here.

Implementation

Input, output and LLM calls for the Chain of Verification 4-step process
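In summary, the data flows through four LLM stages (using the variable names from the code below):

# Data flow of the 4-step CoVe process (variable names match the code that follows):
#   query                                   --LLM-->  base_response
#   query + base_response                   --LLM-->  facts_and_verification_questions
#   each verification question              --LLM-->  verification answer (verify_results_str)
#   query + base_response + verify_results  --LLM-->  final_response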

0. Import libraries

import os
from langchain import PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.chains import SequentialChain, LLMChain
from langchain.output_parsers import PydanticOutputParser
from langchain.pydantic_v1 import BaseModel, Field

os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
llm = ChatOpenAI(temperature=0, model_name="gpt-4")

query = "List 5 UK politicians born in London"

1. Generate baseline response

The baseline response is a simple question-and-answer LLM call:

input_variables = ["query"]
base_response_output_key = "base_response"
base_response_template = """Question: {query} Answer:"""
base_response_prompt_template = PromptTemplate(
    input_variables=input_variables, template=base_response_template
)
base_response_chain = LLMChain(
    llm=llm, prompt=base_response_prompt_template, output_key=base_response_output_key
)

2. Plan verifications

Given the query and baseline response, generate verification questions that allow us to test the factual claims of the baseline response.

For ease of processing in the next steps, I included format_instructions in the prompt and used PydanticOutputParser from LangChain to get a structured answer.

plan_verifications_template = """
Given the below Question and answer, generate a series of verification questions that test the factual claims in the original baseline response.
For example if part of a longform model response contains the statement “The Mexican–American War
was an armed conflict between the United States and Mexico from 1846 to 1848”, then one possible
verification question to check those dates could be “When did the Mexican American war start and
end?”

Question: {query}
Answer: {base_response}

<fact in passage>, <verification question, generated by combining the query and the fact>

{format_instructions}
"""

class PlanVerificationsOutput(BaseModel):
    query: str = Field(description="The user's query")
    base_response: str = Field(description="The response to the user's query")
    facts_and_verification_questions: dict[str, str] = Field(
        description="Facts (as the dictionary keys) extracted from the response and verification questions related to the query (as the dictionary values)"
    )

plan_verifications_output_parser = PydanticOutputParser(
    pydantic_object=PlanVerificationsOutput
)
plan_verifications_prompt_template = PromptTemplate(
    input_variables=input_variables + [base_response_output_key],
    template=plan_verifications_template,
    partial_variables={
        "format_instructions": plan_verifications_output_parser.get_format_instructions()
    },
)
plan_verifications_chain = LLMChain(
    llm=llm,
    prompt=plan_verifications_prompt_template,
    output_key="output",
    output_parser=plan_verifications_output_parser,
)

The two chains (base_response_chain and plan_verifications_chain) can then be run sequentially:

answer_and_plan_verification = SequentialChain(
    chains=[base_response_chain, plan_verifications_chain],
    input_variables=["query"],
    output_variables=["output"],
    verbose=True,
)

intermediate_result = answer_and_plan_verification.run(query)

Output

intermediate_result.base_response
# 1. Boris Johnson 2. David Cameron 3. Sadiq Khan 4. Jeremy Corbyn 5. Theresa May

intermediate_result.facts_and_verification_questions
"""
{'Boris Johnson is born in London': 'Was Boris Johnson born in London?',
'David Cameron is born in London': 'Was David Cameron born in London?',
'Sadiq Khan is born in London': 'Was Sadiq Khan born in London?',
'Jeremy Corbyn is born in London': 'Was Jeremy Corbyn born in London?',
'Theresa May is born in London': 'Was Theresa May born in London?'}
"""

3. Execute verifications

The paper mentions four variants for executing verifications: Joint, 2-Step, Factored, and Factored + Revise.

I opted for the Factored method, answering each verification question in a separate LLM call, instead of the Joint or 2-Step methods. This ensures the verification answers are not influenced by a) the baseline response and b) the answers to the other verification questions.

In this step, you may choose to use internet search via a LangChain agent with a search tool (e.g. the Serper API) to improve accuracy (see 3a). The code below uses the same LLM to answer the verification questions.

claimed_facts = list(intermediate_result.facts_and_verification_questions.keys())
verification_questions = list(
    intermediate_result.facts_and_verification_questions.values()
)
verify_results_str = ""
verify_input_variables = ["question"]
verify_output_key = "answer"
verify_template = """{question}"""

verify_prompt_template = PromptTemplate(
    input_variables=verify_input_variables, template=verify_template
)

verify_chain = LLMChain(
    llm=llm, prompt=verify_prompt_template, output_key=verify_output_key
)

# Answering verification questions independently
for i in range(len(verification_questions)):
    claimed_fact = claimed_facts[i]
    question = verification_questions[i]
    answer = verify_chain.run(question)
    answer = answer.lstrip("\n")
    verify_results_str += f"Question: {question}\nAnswer: {answer}\n\n"

Output

print(verify_results_str)

"""
Question: Was Boris Johnson born in London?
Answer: No, Boris Johnson was born in New York City, United States.

Question: Was David Cameron born in London?
Answer: Yes, David Cameron was born in London, England.

Question: Was Sadiq Khan born in London?
Answer: Yes, Sadiq Khan was born in London, England.

Question: Was Jeremy Corbyn born in London?
Answer: No, Jeremy Corbyn was not born in London. He was born in Chippenham, England.

Question: Was Theresa May born in London?
Answer: No, Theresa May was born on October 1, 1956 in Eastbourne, Sussex, England.
"""

3a. Execute verifications using a search agent

You can also use internet search to get answers to the verification questions. Here’s an example of how you can use an agent with the DuckDuckGo search tool.

I have added a custom system message to encourage the LLM to use the search tool.

As an alternative, you can pass the verification question directly to a search API (instead of using an agent). The reason I used a search agent here is that the agent rephrases the question into a search query and keeps searching, up to a maximum number of iterations, until it finds an answer.
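For illustration, here is a minimal sketch of that direct-search alternative, reusing the DuckDuckGoSearchResults tool that the agent below relies on (the raw snippets would still need an extra LLM call to condense them into answers):

from langchain.tools import DuckDuckGoSearchResults

# Sketch only: query the search tool directly for each verification question.
# The tool returns raw result snippets as a string, so a further LLM call would
# still be needed to turn them into a concise answer per question.
direct_search = DuckDuckGoSearchResults()
for question in verification_questions:
    snippets = direct_search.run(question)
    print(f"Question: {question}\nSearch snippets: {snippets}\n")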

from langchain.agents import ConversationalChatAgent, AgentExecutor
from langchain.tools import DuckDuckGoSearchResults

search = DuckDuckGoSearchResults()
tools = [search]
custom_system_message = "Assistant assumes no knowledge and relies on internet search to answer user's queries."
max_agent_iterations = 10

chat_agent = ConversationalChatAgent.from_llm_and_tools(
    llm=llm, tools=tools, system_message=custom_system_message
)
search_executor = AgentExecutor.from_agent_and_tools(
    agent=chat_agent,
    tools=tools,
    return_intermediate_steps=True,
    handle_parsing_errors=True,
    max_iterations=max_agent_iterations,
)

verify_results_with_agent = []
verify_results_with_agent_str = ""
for i in range(len(verification_questions)):
    claimed_fact = claimed_facts[i]
    question = verification_questions[i]

    chain_input = {"input": question, "chat_history": []}
    search_result = search_executor(chain_input)

    answer = search_result["output"]
    search_result_intermediate_steps = search_result["intermediate_steps"]
    verify_results_with_agent.append(answer)
    verify_results_with_agent_str += f"Question: {question}\nAnswer: {answer}\n\n"

Output

print(verify_results_with_agent_str)
"""
Question: Was Boris Johnson, the UK politician, born in London?
Answer: Boris Johnson, the UK politician, was not born in London. He was born in New York City on June 19, 1964.

Question: Was David Cameron, the UK politician, born in London?
Answer: Yes, David Cameron, the UK politician, was born in Marylebone, London.

Question: Was Sadiq Khan, the UK politician, born in London?
Answer: Yes, Sadiq Khan, the UK politician, was born in London. Specifically, he was born in Tooting, South London.

Question: Was Jeremy Corbyn, the UK politician, born in London?
Answer: Jeremy Corbyn, the UK politician, was not born in London. He was born in Chippenham, Wiltshire, England.

Question: Was Theresa May, the UK politician, born in London?
Answer: Theresa May, the UK politician, was not born in London. She was born in Eastbourne, Sussex.
"""

4. Generate final response

The final response is generated by giving the LLM:

  • Original query
  • Baseline response
  • The verification questions and answers

and asking it to revise the response, taking into account whether the verification questions and answers are consistent with it or not.

final_response_input_variables = ["query", "base_response", "verify_results"]
final_response_template = """Given the ORIGINAL_QUESTION and the ORIGINAL_RESPONSE,
revise the ORIGINAL_RESPONSE (if applicable) such that it is consistent with information in VERIFIED_SOURCE.
Only keep consistent information.

<ORIGINAL_QUESTION>
{query}

<ORIGINAL_RESPONSE>
{base_response}

<VERIFIED_SOURCE>
{verify_results}

Final response:
"""
final_response_prompt_template = PromptTemplate(
    input_variables=final_response_input_variables,
    template=final_response_template,
)

final_response_chain = LLMChain(llm=llm, prompt=final_response_prompt_template)

final_response = final_response_chain.run(
    query=intermediate_result.query,
    base_response=intermediate_result.base_response,
    # verify_results=verify_results_str,
    # Update 2023-10-22: use results from internet search
    verify_results=verify_results_with_agent_str,
)

As we can see, by incorporating the verification answers that Boris Johnson was born in New York, Theresa May in Eastbourne, and Jeremy Corbyn in Chippenham, the LLM revised its final answer to exclude those three politicians when answering the user’s query.

Output

intermediate_result.base_response
# 1. Boris Johnson 2. David Cameron 3. Sadiq Khan 4. Jeremy Corbyn 5. Theresa May

final_response
# 1. David Cameron 2. Sadiq Khan

Conclusion and remarks

This blog post shows how we can implement the Chain of Verification 4-step process using LangChain.

We first generate a baseline response and verification questions using a sequential chain.

We then generate the verification answers with a separate LLM call for each question, to lower the chance of the questions and answers influencing each other.

To generate the final response, we give the LLM the original question and the baseline response, and ask it to revise its answer given the verification questions and answers, cross-checking whether they are consistent with each other.

Remarks

When implementing this paper, I ran into some practical challenges that I thought would be useful to share.

  • It is hard to generate an “appropriate” level of verification questions (there is a lot of room to interpret what is “common sense” and does not need verification).
  • After the baseline response was revised, the user’s question was no longer fully answered: the user asked for 5 politicians, and we gave 2 (after removing 3 factually incorrect answers).
  • When using a cheaper OpenAI model (gpt-3.5-turbo-instruct), it was difficult to generate a final answer that removes inconsistent / partially consistent answers; a lot more guidance was needed.
  • Several adjustments to the prompts were required to get the model to generate sensible verification questions. Even though they work well for this particular query, they may not work well for other queries.

My implementation follows the 4-step process with some deviations from the paper:

  • The paper used Llama 65B; for simplicity I used GPT-4.
  • Prompts are modified to give additional guidance (few-shot examples, format instructions) to the LLM so that it works with LangChain / GPT-4.
  • In the execute verifications step, the Joint, 2-Step, and Factored + Revise variants are not implemented.

Appendix

Original prompts from the paper

1. Generate baseline response (query LLM)

Q: Tell me a bio of <person>
A: <bio of person>

2. Plan verifications

Context: Q: Tell me a bio of <person>.
A: <passage about person>
Response:
<fact in passage>, Verification Question
<fact in passage>, Verification Question

3. Execute verifications

Q: Verification Question
A: Answer

4. Generate final verified response

Context: <Original Passage>.
From another source,
<output of execute verification step: Q + A>
<output of execute verification step: Q + A>
Response: <revised and consistent Passage>

5. (Variant) As a variant to step 4, the authors proposed an alternative prompt (Factored + Revise) to identify which facts are consistent.

Context: <Original Fact>.
From another source,
<output of execute verification step: Q + A>
Response: CONSISTENT. <Consistent fact>
Context: <Original Fact>.
From another source,
<output of execute verification step: Q + A>
Response: INCONSISTENT.
Context: <Original Fact>.
From another source,
<output of execute verification step: Q + A>
Response: PARTIALLY CONSISTENT. <Consistent part>
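
For reference, a rough sketch of how this Factored + Revise check could be expressed as a LangChain prompt (this is not part of my implementation, and the variable names are hypothetical):

# Hypothetical sketch of a Factored + Revise consistency check (not implemented above).
revise_check_template = """Context: {original_fact}.
From another source,
{verification_question_and_answer}
Is the original fact CONSISTENT, INCONSISTENT or PARTIALLY CONSISTENT with the source?
Response:"""
revise_check_prompt = PromptTemplate(
    input_variables=["original_fact", "verification_question_and_answer"],
    template=revise_check_template,
)
revise_check_chain = LLMChain(llm=llm, prompt=revise_check_prompt)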
