
E12 : Chain-of-Verification

Praveen Thenraj
Research Papers Summarized
Oct 21, 2023


Asking an LLM questions about the facts in its own response helps it verify the truthfulness of that response and thus reduces hallucination.

Paper Name : Chain-of-Verification Reduces Hallucination in Large Language Models

Paper URL : https://arxiv.org/abs/2309.11495

Authors : Meta AI - Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, Jason Weston

Please find the annotated paper here

Hallucination :

  • Generating factually incorrect answers to questions is termed hallucination.

Problem Statement :

  • Though LLMs have improved considerably at generating coherent, factual responses to human questions, hallucination still remains an unsolved problem.

Solution :

CoVe = baseline response + plan verifications + execute verifications + final verified response
  • The overall solution consists of four steps : generate baseline response, plan verifications, execute verifications, and generate the final verified response (a minimal sketch of the full pipeline follows this list)
  • Generate baseline response - given a question, the LLM is prompted to generate a baseline response
  • Plan verifications - generate intermediate verification questions from the baseline response
  • Execute verifications - prompt the LLM to answer the verification questions that were generated
    There are 4 variants of executing verifications : joint, 2-step, factored, factor+revise
    Joint - the verification questions and their answers are generated together, conditioned on both the baseline response and the intermediate questions
    2-step - the answers to the intermediate questions are generated in a separate step that considers only those questions, not the baseline response
    Factored - the model answers each intermediate question in its own prompt, without any interference from the previous question-answer context
    Factor+revise - similar to factored, but adds a prompt that cross-checks the consistency of each intermediate answer against the baseline response before generating the final response
  • Generate final verified response - generate the final response to the original question from the baseline response and all the intermediate verification answers
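Below is a minimal sketch of how the four CoVe steps could be wired together with the factored execution variant. The llm helper, prompt wording, and line-based parsing are assumptions made for illustration; they are not the paper's exact prompts.

```python
# Hypothetical sketch of the CoVe pipeline (factored execution variant).
# `llm(prompt)` is a placeholder for whatever chat/completion API you use
# (the paper prompts a few-shot Llama 65B); prompt texts are illustrative.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model call here")

def chain_of_verification(question: str) -> str:
    # 1. Generate baseline response
    baseline = llm(f"Answer the question.\nQ: {question}\nA:")

    # 2. Plan verifications: derive fact-checking questions from the baseline
    plan = llm(
        "Write verification questions, one per line, that check each fact "
        f"stated in this answer.\nQ: {question}\nA: {baseline}"
    )
    verification_questions = [q.strip() for q in plan.splitlines() if q.strip()]

    # 3. Execute verifications (factored): answer every question in its own
    #    prompt so the model cannot simply repeat errors from the baseline
    qa_pairs = [(q, llm(f"Answer the question.\nQ: {q}\nA:"))
                for q in verification_questions]

    # 4. Generate the final verified response from the question, the draft,
    #    and the verification evidence
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    return llm(
        f"Original question: {question}\n"
        f"Draft answer: {baseline}\n"
        f"Verification Q&A:\n{evidence}\n"
        "Rewrite the draft answer, keeping only facts that are consistent "
        "with the verification Q&A."
    )
```

The joint and 2-step variants differ only in how steps 2 and 3 are prompted (in a single prompt, or with all questions answered together in one follow-up prompt), while factor+revise inserts an extra cross-checking prompt between steps 3 and 4.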

Experimentation :

  • CoVe - Llama 65B in a few-shot setting (not instruction-tuned) + CoVe-based verification of facts
  • CoVe was evaluated on 4 benchmarks - Wikidata, Wiki-Category list, MultiSpanQA, and longform generation of biographies
  • The open-source pre-trained Llama 65B model was used for the experimentation of the CoVe technique.
  • Wikidata, Wiki-Category list - Llama 2 70B Chat (instruction-tuned, zero-shot), Llama 2 70B Chat (instruction-tuned, CoT), Llama 65B (not instruction-tuned, few-shot), Llama 65B with few-shot CoVe (joint, 2-step, factored)
  • MultiSpanQA - Llama 2 70B Chat (zero-shot), Llama 2 70B Chat (CoT), Llama 65B (few-shot), Llama 65B with few-shot CoVe (joint, factored)
  • Longform generation of biographies - ChatGPT, InstructGPT, Perplexity AI, Llama 2 70B Chat (zero-shot), Llama 2 70B Chat (CoT), Llama 65B (few-shot), Llama 65B with few-shot CoVe (joint, factored, factor+revise)

Observations :

  • Llama 65B with few-shot CoVe (2-step, factored) was able to outperform the precision of instruction-tuned Llama 2 70B Chat in both zero-shot and CoT settings on the Wikidata and Wiki-Category list benchmarks
  • Llama 65B with few-shot CoVe (joint, factored) was able to outperform the precision of instruction-tuned Llama 2 70B Chat in both zero-shot and CoT settings on the MultiSpanQA benchmark
Results table: Wikidata and Wiki-Category list
Results table: MultiSpanQA
  • Llama 65B with few-shot CoVe (factored, factor+revise) was able to outperform InstructGPT (instruction-tuned), ChatGPT (instruction-tuned), and Llama 2 70B Chat in both zero-shot and CoT settings on the longform generation of biographies benchmark
  • Overall, the results show that the performance of a plain pre-trained Llama 65B model with few-shot examples improved, with fewer hallucinations, simply by including CoVe-based verification in the pipeline.
  • The above point supports the hypothesis that LLMs benefit more from a self-verification style of approach to reducing hallucination than from instruction fine-tuning or CoT approaches alone
  • In longform generation of biographies, Llama 65B with few-shot CoVe (factor+revise) was even able to outperform Perplexity AI, which uses retrieval-augmented generation (RAG) in its pipeline to retrieve information before generating the final response
Results table: Longform generation of biographies
  • The longform generation of biographies questions were split into 5 categories based on how frequently similar questions occur in the training distribution - very rare, rare, medium, frequent, very frequent
  • Results show that Perplexity AI was able to outperform CoVe-enabled Llama only when questions came from the very rare end of the training distribution. For such tail-end training data, an external retrieval tool like RAG is needed to help reduce hallucination
Results table: InstructGPT, ChatGPT, Perplexity AI vs Llama 65B with CoVe, by rarity of data
  • In the longform generation task, ChatGPT was never able to outperform Llama 65B with CoVe. This suggests that model size or instruction tuning alone does not determine how well hallucinations are reduced
  • The verification questions generated by the LLM in CoVe's execute-verifications step were compared against yes/no template questions framed from the responses using heuristics. Results clearly show that the LLM-generated questions helped reduce hallucinations more than the yes/no questions and heuristics did (see the illustrative contrast below).
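To make that contrast concrete, here is a made-up example of the two question styles; the claim and the question wordings are illustrative assumptions, not examples taken from the paper.

```python
# Illustrative contrast between templated yes/no checks and CoVe-style
# open verification questions. The draft claim below is invented (and wrong:
# the Eiffel Tower was completed in 1889).

draft_claim = "The Eiffel Tower was completed in 1887."

# Rule-based template: restate the draft claim as a yes/no question, which
# nudges the model toward simply agreeing with its own draft.
yes_no_check = f"Is the following statement true? {draft_claim}"

# LLM-generated open question: asks for the fact directly, so the model
# answers from its knowledge instead of validating the draft's wording.
open_check = "In which year was the Eiffel Tower completed?"
```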

Conclusion :

  • Given the rapid pace of improvement in LLMs, steering models to provide more factually valid responses plays a major role in the quality of the output they generate.
  • One major advantage of CoVe is that it does not require any additional training or fine-tuning of the model to reduce hallucinations; it can be handled entirely via prompting.
  • It is a simple self-verification technique that improves the factual validity of LLM responses and is also more interpretable, since the intermediate questions and responses are accessible to users who want to understand model behaviour.
  • CoVe combined with RAG-like tools could be even more powerful, since models assisted only by CoVe are still limited by the training data the model has seen.
