
E12 : Chain-of-Verification

Praveen Thenraj
Research Papers Summarized
Oct 21, 2023


Asking an LLM questions about the facts in its own response helps it verify the truthfulness of that response and thus reduces hallucination.

Paper Name : Chain-of-Verification Reduces Hallucination in Large Language Models

Paper URL : https://arxiv.org/abs/2309.11495

Authors : Meta AI - Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, Jason Weston

Please find the annotated paper here

Hallucination :

  • Generating factually incorrect answers to questions is termed hallucination.

Problem Statement :

  • Though LLMs have improved considerably at generating coherent, factual responses to human questions, hallucination still remains an unsolved problem.

Solution :

CoVe = baseline response + plan verifications + execute verifications + final verified response
  • The overall solution consists of four steps : generate baseline response, plan verifications, execute verifications, and generate the final verified response (a minimal sketch of the full pipeline follows this list)
  • Generate baseline response - given a question, the LLM is prompted to generate a baseline response
  • Plan verifications - generate intermediate verification questions from the baseline response
  • Execute verifications - prompt the LLM to answer the verification questions that were generated
    There are 4 variants of executing verifications : joint, 2-step, factored, factor+revise
    Joint - the verification questions and their answers are generated together, conditioned on both the baseline response and the intermediate questions
    2-step - the answers to the intermediate questions are generated in a separate step that considers only those questions, not the baseline response
    Factored - the model answers each intermediate question in its own prompt, without any interference from the previous question-answer context
    Factor+revise - similar to factored, but adds a prompt that cross-checks the consistency of each intermediate answer against the baseline response before generating the final response
  • Generate final verified response - generate the final response to the original question from the baseline response and all the intermediate verification answers
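Below is a minimal sketch of how the four CoVe steps could be wired together with the factored execution variant. The llm helper, prompt wording, and line-based parsing are assumptions made for illustration; they are not the paper's exact prompts.

```python
# Hypothetical sketch of the CoVe pipeline (factored execution variant).
# `llm(prompt)` is a placeholder for whatever chat/completion API you use
# (the paper prompts a few-shot Llama 65B); prompt texts are illustrative.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model call here")

def chain_of_verification(question: str) -> str:
    # 1. Generate baseline response
    baseline = llm(f"Answer the question.\nQ: {question}\nA:")

    # 2. Plan verifications: derive fact-checking questions from the baseline
    plan = llm(
        "Write verification questions, one per line, that check each fact "
        f"stated in this answer.\nQ: {question}\nA: {baseline}"
    )
    verification_questions = [q.strip() for q in plan.splitlines() if q.strip()]

    # 3. Execute verifications (factored): answer every question in its own
    #    prompt so the model cannot simply repeat errors from the baseline
    qa_pairs = [(q, llm(f"Answer the question.\nQ: {q}\nA:"))
                for q in verification_questions]

    # 4. Generate the final verified response from the question, the draft,
    #    and the verification evidence
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    return llm(
        f"Original question: {question}\n"
        f"Draft answer: {baseline}\n"
        f"Verification Q&A:\n{evidence}\n"
        "Rewrite the draft answer, keeping only facts that are consistent "
        "with the verification Q&A."
    )
```

The joint and 2-step variants differ only in how steps 2 and 3 are prompted (in a single prompt, or with all questions answered together in one follow-up prompt), while factor+revise inserts an extra cross-checking prompt between steps 3 and 4.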

Experimentation :

  • CoVe - Llama 65B in a few-shot setting (not instruction-tuned) + CoVe-based verification of facts
  • CoVe was evaluated on 4 benchmarks - Wikidata, Wiki-Category list, MultiSpanQA, and longform generation of biographies
  • The open-source pre-trained Llama 65B model was used for the experimentation of the CoVe technique.
  • Wikidata, Wiki-Category list - Llama 2 70B Chat (instruction-tuned, zero-shot), Llama 2 70B Chat (instruction-tuned, CoT), Llama 65B (not instruction-tuned, few-shot), Llama 65B with few-shot CoVe (joint, 2-step, factored)
  • MultiSpanQA - Llama 2 70B Chat (zero-shot), Llama 2 70B Chat (CoT), Llama 65B (few-shot), Llama 65B with few-shot CoVe (joint, factored)
  • Longform generation of biographies - ChatGPT, InstructGPT, Perplexity AI, Llama 2 70B Chat (zero-shot), Llama 2 70B Chat (CoT), Llama 65B (few-shot), Llama 65B with few-shot CoVe (joint, factored, factor+revise)

Observations :

  • Llama 65B with few-shot CoVe (2-step, factored) was able to outperform the precision of instruction-tuned Llama 2 70B Chat in both zero-shot and CoT settings on the Wikidata and Wiki-Category list benchmarks
  • Llama 65B with few-shot CoVe (joint, factored) was able to outperform the precision of instruction-tuned Llama 2 70B Chat in both zero-shot and CoT settings on the MultiSpanQA benchmark
Results table: Wikidata and Wiki-Category list
Results table: MultiSpanQA
  • Llama 65B with few-shot CoVe (factored, factor+revise) was able to outperform InstructGPT (instruction-tuned), ChatGPT (instruction-tuned), and Llama 2 70B Chat in both zero-shot and CoT settings on the longform generation of biographies benchmark
  • Overall, the results show that the performance of a plain pre-trained Llama 65B model with few-shot examples improved, with fewer hallucinations, simply by including CoVe-based verification in the pipeline.
  • The above point supports the hypothesis that LLMs benefit more from a self-verification style of approach to reducing hallucination than from instruction fine-tuning or CoT approaches alone
  • In longform generation of biographies, Llama 65B with few-shot CoVe (factor+revise) was even able to outperform Perplexity AI, which uses retrieval-augmented generation (RAG) in its pipeline to retrieve information before generating the final response
Results table: Longform generation of biographies
  • The longform generation of biographies questions were split into 5 categories based on how frequently similar questions occur in the training distribution - very rare, rare, medium, frequent, very frequent
  • Results show that Perplexity AI was able to outperform CoVe-enabled Llama only when questions came from the very rare end of the training distribution. For such tail-end training data, an external retrieval tool like RAG is needed to help reduce hallucination
Results table: InstructGPT, ChatGPT, Perplexity AI vs Llama 65B with CoVe, by rarity of data
  • In the longform generation task, ChatGPT was never able to outperform Llama 65B with CoVe. This suggests that model size or instruction tuning alone does not determine how well hallucinations are reduced
  • The verification questions generated by the LLM in CoVe's execute-verifications step were compared against yes/no template questions framed from the responses using heuristics. Results clearly show that the LLM-generated questions helped reduce hallucinations more than the yes/no questions and heuristics did (see the illustrative contrast below).
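To make that contrast concrete, here is a made-up example of the two question styles; the claim and the question wordings are illustrative assumptions, not examples taken from the paper.

```python
# Illustrative contrast between templated yes/no checks and CoVe-style
# open verification questions. The draft claim below is invented (and wrong:
# the Eiffel Tower was completed in 1889).

draft_claim = "The Eiffel Tower was completed in 1887."

# Rule-based template: restate the draft claim as a yes/no question, which
# nudges the model toward simply agreeing with its own draft.
yes_no_check = f"Is the following statement true? {draft_claim}"

# LLM-generated open question: asks for the fact directly, so the model
# answers from its knowledge instead of validating the draft's wording.
open_check = "In which year was the Eiffel Tower completed?"
```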

Conclusion :

  • Given the rapid pace of improvement in LLMs, steering models to provide more factually valid responses plays a major role in the quality of the output they generate.
  • One major advantage of CoVe is that it does not require any additional training or fine-tuning of the model to reduce hallucinations; it can be handled entirely via prompting.
  • It is a simple self-verification technique that improves the factual validity of LLM responses and is also more interpretable, since the intermediate questions and responses are accessible to users who want to understand model behaviour.
  • CoVe combined with RAG-like tools could be even more powerful, since models assisted only by CoVe are still limited by the training data the model has seen.
