Hallucinations: a word with a whole new meaning

Daniel Hoyos
Blue Orange Digital
15 min read · Mar 7, 2024

Introduction

LLM hallucinations and falsely generated facts have made it hard for enterprises to fully trust these solutions, as reflected in Menlo Ventures’ 2023 report, where less than 1% of total enterprise cloud expenditure went to GenAI technologies. One of the most famous cases of hallucination with worldwide repercussions was that of a lawyer who cited ChatGPT-fabricated cases in court, fully believing they were real. Another is Google’s February 2023 promotional video for Bard, in which the chatbot makes an untrue claim about the James Webb Space Telescope. When unaccounted for, these errors have caused significant disturbances across companies, either by changing “small” details that are critical to the meaning or intent of a phrase, or by weaving “colorful” narratives around false events that look true to the inexperienced reader.

These concerns, errors, and otherwise dangerous effects have shifted how research is done in the space. There are two well-defined branches of hallucination control: one revolves around Prompt Engineering through multi-step processes, and the other around Developing Models, intervening in the models themselves through new decoding strategies, knowledge grounding, training objectives, and fine-tuning. As the number of techniques is quite large, I will focus on a few that I consider interesting, briefly describing each methodology and proposed approach. Let’s talk about the two branches.

From the paper “A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models”

Prompt Engineering

Retrieval Augmented Generation Approaches

Retrieval-Augmented Generation (RAG) improves the responses of Large Language Models (LLMs) by leveraging external, reliable knowledge sources instead of depending solely on the model’s pre-existing, possibly outdated built-in knowledge. This tackles the significant challenge of ensuring the accuracy and timeliness of LLM outputs. By producing relevant, up-to-date, and verifiable answers, RAG significantly reduces the occurrence of fabrications, boosting users’ trust and giving developers a cost-effective way to increase the precision and applicability of LLMs in various settings. Some of the techniques used for reducing errors are the following:

Before Generation

In this category, the information retrieval happens before the generation of AI text.

LLM-Augmenter

The paper, titled “Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback,” proposes a three-stage approach to improve the factual accuracy and reliability of Large Language Models (LLMs):

Stage 1: Bootstrapping from a rule-based policy:

Domain experts encode their knowledge and business logic into “IF-THEN” rules, which guide the LLM’s initial behavior and response generation.

Stage 2: Knowledge consolidation:

A “Knowledge Consolidator” module is introduced. This module connects the LLM to external knowledge sources like databases and knowledge graphs. The LLM grounds its responses on this external knowledge to avoid hallucinations and factual errors.

Examples include answering questions about recent news based on real-time data or booking a restaurant table with accurate information.

Stage 3: Automated feedback and reinforcement learning:

The LLM’s performance is continuously monitored and evaluated. Automated feedback mechanisms identify factual errors or inconsistencies in the LLM’s outputs. This feedback trains the LLM through reinforcement learning, improving its factual accuracy and reliability over time.

In essence, this approach combines expert-provided knowledge, external knowledge sources, and automated feedback to mitigate factual errors and improve the overall reliability of LLMs.
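
To make that loop concrete, here is a minimal sketch of a generate–check–revise cycle in this spirit. It is not the paper’s implementation; the `generate`, `retrieve`, and `groundedness` callables are hypothetical stand-ins for the LLM, the knowledge consolidator, and the utility/fact-checking module.

```python
from typing import Callable

def answer_with_feedback(
    question: str,
    generate: Callable[[str], str],            # wraps the LLM call
    retrieve: Callable[[str], str],            # knowledge consolidator: DBs, KGs, search
    groundedness: Callable[[str, str], float], # utility score: how well the evidence supports the draft
    max_attempts: int = 3,
    threshold: float = 0.8,
) -> str:
    evidence = retrieve(question)
    feedback = ""
    draft = ""
    for _ in range(max_attempts):
        prompt = (
            f"Evidence:\n{evidence}\n\n"
            f"Question: {question}\n"
            f"{feedback}"
            "Answer using only the evidence above."
        )
        draft = generate(prompt)
        if groundedness(draft, evidence) >= threshold:
            return draft                       # grounded enough: stop here
        # Otherwise, feed automated feedback back into the next attempt.
        feedback = f"Your previous answer was not supported by the evidence:\n{draft}\nRevise it.\n"
    return draft                               # best effort after max_attempts
```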

FRESHPROMPT

The paper “FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation” proposes a method to address the limitation of static knowledge in large language models (LLMs). It introduces FreshLLMs, a system that leverages search engine results to dynamically refresh LLMs and improve their performance on questions requiring current knowledge.

Here’s the breakdown of the proposed approach:

  1. Question Analysis: FreshLLMs analyze the incoming question to identify if it requires access to current information.
  2. Search Engine Query Formulation: Based on the question, FreshLLMs formulate a query for a search engine (like Google).
  3. Retrieval and Evidence Selection: The system retrieves relevant search results and selects evidence pieces (snippets of text) that are likely to contain the answer.
  4. LLM Prompting: The selected evidence is used to create a prompt for an LLM. This prompt guides the LLM in focusing its generation on the retrieved information.
  5. Answer Generation: The LLM generates an answer based on the prompt and its internal knowledge.
  6. Answer Filtering and Confidence Estimation: FreshLLMs filter out any irrelevant or unreliable information generated by the LLM and assess the confidence score of the final answer.

This approach allows FreshLLMs to access and integrate up-to-date information from the real world, leading to potentially more accurate and relevant answers to questions about current events, factual updates, and other constantly evolving topics.
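
As an illustration, a search-augmented prompt in this style can be assembled in a few lines. This is a hedged sketch, not the paper’s FreshPrompt implementation; the `search` and `generate` callables, and the snippet fields (`date`, `source`, `text`), are assumptions.

```python
from typing import Callable, Dict, List

def fresh_prompt(
    question: str,
    search: Callable[[str], List[Dict]],  # returns snippets, e.g. {"date": ..., "source": ..., "text": ...}
    generate: Callable[[str], str],       # wraps the LLM call
    top_k: int = 5,
) -> str:
    # Keep the top-k snippets, ordered so the most recent evidence sits closest to the question.
    results = sorted(search(question), key=lambda r: r["date"])[-top_k:]
    evidence = "\n".join(f"[{r['date']}] {r['source']}: {r['text']}" for r in results)
    prompt = (
        f"Search results (oldest first):\n{evidence}\n\n"
        f"Question: {question}\n"
        "Reason over the results, prefer the most recent reliable source, and answer concisely."
    )
    return generate(prompt)
```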

During Generation

In this category, knowledge retrieval happens at a sentence-by-sentence level, where the model goes through information retrieval while generating each sentence.

Knowledge Retrieval

The paper “A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation” proposes an approach to addressing hallucinations (factually incorrect outputs) generated by large language models (LLMs).

Here’s a breakdown of the proposed methodology:

Detection:

  1. Confidence Estimation: The model estimates its confidence in each generated sentence.
  2. Low-Confidence Threshold: Sentences with confidence below a certain threshold are flagged as potentially hallucinated.
  3. External Knowledge Source: The flagged sentence is compared against an external knowledge source (e.g., knowledge base) to verify its factual accuracy.

Mitigation:

  1. Fact-based Reranking: If the verification fails, the model reranks the remaining candidate sentences based on their factual consistency with the external knowledge source.
  2. Diverse Reranking: Even if verification succeeds, the model diversifies the candidate sentences to reduce the chance of relying on the same patterns that led to the initial hallucination.
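
A rough sketch of the detection-and-repair loop described above might look like the following; the `token_probs`, `retrieve`, `verify`, and `rewrite` callables are hypothetical placeholders, and using the minimum token probability as the sentence confidence is a simplification of the paper’s scoring.

```python
from typing import Callable, List, Tuple

def flag_low_confidence(
    sentences: List[str],
    token_probs: Callable[[str], List[float]],  # per-token probabilities from the generating LLM
    threshold: float = 0.3,
) -> List[Tuple[str, bool]]:
    # A sentence is suspicious if its least-confident token falls below the threshold.
    return [(s, min(token_probs(s), default=1.0) < threshold) for s in sentences]

def validate_and_repair(
    flagged: List[Tuple[str, bool]],
    retrieve: Callable[[str], str],      # fetches supporting evidence for a sentence
    verify: Callable[[str, str], bool],  # True if the evidence supports the sentence
    rewrite: Callable[[str, str], str],  # rewrites the sentence to agree with the evidence
) -> List[str]:
    repaired = []
    for sentence, suspicious in flagged:
        if suspicious:
            evidence = retrieve(sentence)
            if not verify(sentence, evidence):
                # Mitigate the hallucination before generation continues downstream.
                sentence = rewrite(sentence, evidence)
        repaired.append(sentence)
    return repaired
```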

Decompose and Query Framework (D&Q)

The paper, titled “A Step Closer to Comprehensive Answers: Constrained Multi-Stage Question Decomposition with Large Language Models,” proposes a novel approach to improve the ability of large language models (LLMs) to answer complex, open-ended questions comprehensively.

Here’s the gist of the proposed methodology:

  1. Question Decomposition: The approach first decomposes the complex question into a series of sub-questions that are easier for the LLM to understand and answer. Then, constraints, such as factual consistency and relevance to the original question, guide this decomposition.
  2. Multi-Stage Answering: The LLM then answers each sub-question in a multi-stage process. In each stage, the LLM leverages its knowledge and the information gathered from previous stages to refine its answer. This iterative process allows the LLM to build a comprehensive response to the original question gradually.
  3. Answer Concatenation and Summarization: Finally, the individual answers to the sub-questions are concatenated and summarized to form the final response to the original question. This ensures that the final answer covers all the essential aspects of the original question while maintaining coherence and consistency.

The authors argue that this constrained multi-stage decomposition approach overcomes the limitations of traditional LLM-based question-answering methods, which often struggle to provide comprehensive and informative answers to complex questions. They present experimental results demonstrating the effectiveness of their approach compared to existing methods.
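
The decompose–answer–summarize flow can be sketched as a simple prompting loop. This is an illustrative approximation, not the authors’ D&Q implementation; the `generate` callable wraps whatever LLM is available, and the prompt wording is invented.

```python
from typing import Callable, List

def decompose_and_answer(question: str, generate: Callable[[str], str]) -> str:
    # Stage 1: decompose the question into simpler, relevant sub-questions.
    sub_questions: List[str] = [
        q for q in generate(
            "Break this question into at most four simpler sub-questions, "
            f"one per line, each relevant to the original:\n{question}"
        ).splitlines() if q.strip()
    ]

    # Stage 2: answer each sub-question, feeding earlier answers forward.
    notes = ""
    for sq in sub_questions:
        answer = generate(f"Known so far:\n{notes}\nAnswer briefly: {sq}")
        notes += f"Q: {sq}\nA: {answer}\n"

    # Stage 3: concatenate the findings and summarize into one coherent response.
    return generate(
        f"Original question: {question}\nFindings:\n{notes}\n"
        "Write one comprehensive, consistent answer based only on the findings."
    )
```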

After Generation

In this category, knowledge retrieval happens after the LLM generates its entire output.

High Entropy Word Spotting and Replacement

The paper proposes a comprehensive framework for characterizing and mitigating hallucinations in large language models (LLMs):

Hallucination Categorization:

  1. Introduces two orientations of hallucination: Factual Mirage (FM) and Silver Lining (SL).
  2. Further categorizes hallucinations into six types: Acronym Ambiguity, Numeric Nuisance, Generated Golem, Virtual Voice, Geographic Erratum, and Time Wrap.
  3. Annotates the degree of hallucination (mild, moderate, alarming) based on its impact on the text.

HallucInation eLiciTation (HILT) Dataset:

  1. Creates a publicly available dataset of 75,000 text snippets generated by 15 contemporary LLMs.
  2. Annotates each snippet for hallucination orientation, category, and degree.

Hallucination Vulnerability Index (HVI):

  1. Proposes a quantitative measure to evaluate and rank LLMs based on their vulnerability to hallucination.
  2. Calculates HVI based on the occurrence and severity of different hallucination types.

Automatic Mitigation Strategy:

It starts by identifying high-entropy words in text generated by an LLM with a high HVI, then replaces those words with text generated by an LLM with a lower HVI, as sketched below.
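
A minimal sketch of this spot-and-replace idea follows, assuming access to the per-position next-token distributions of the high-HVI model and a `regenerate` callable backed by the lower-HVI model; the entropy threshold is arbitrary.

```python
import math
from typing import Callable, List

def token_entropy(probs: List[float]) -> float:
    """Shannon entropy of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def spot_and_replace(
    tokens: List[str],
    distributions: List[List[float]],             # next-token distribution at each position (high-HVI model)
    regenerate: Callable[[List[str], int], str],  # lower-HVI model proposes the token at position i
    entropy_threshold: float = 3.0,               # arbitrary cutoff for "high entropy"
) -> List[str]:
    repaired = list(tokens)
    for i, dist in enumerate(distributions):
        if token_entropy(dist) > entropy_threshold:
            # High-entropy position: the uncertain detail is swapped for the
            # lower-HVI model's suggestion, given the preceding tokens.
            repaired[i] = regenerate(repaired[:i], i)
    return repaired
```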

Self-Refinement Through Feedback and Reasoning

Following the response to a given prompt by an LLM, providing appropriate feedback on the output can enhance the accuracy and quality of the model’s future responses. Here are a few detailed techniques for reducing hallucinations that align with this approach.

Chain of Verification

The paper titled “Chain-of-Verification Reduces Hallucination in Large Language Models” proposes a method called Chain-of-Verification (CoVe), which involves the following steps:

Initial Response: The LLM generates a response to the user’s query as usual.

Chain of Verification:

The LLM is prompted with the generated response and asked to verify its factual accuracy. This involves:

  1. Fact-checking: The LLM searches for evidence supporting the claims made in the response using various sources, including internal knowledge bases and external search engines.
  2. Confidence estimation: The LLM estimates its confidence in the veracity of each claim in the response.
  3. Explanation generation: If necessary, the LLM explains its confidence level.

Revised Response:

Based on the verification chain results, the LLM may revise the initial response to correct factual errors. If the verification process is inconclusive, the response may be flagged as potentially unreliable.

The authors argue that this chain-of-verification approach helps LLMs deliberate on their responses and identify and correct factual inaccuracies. They evaluate their method on various tasks, including question answering and factual summarization, and report improvements in factual accuracy compared to baseline LLM models.

Here are some additional key points from the paper:

  • The chain-of-verification process can be iterative, with the LLM refining its response based on successive verification steps.
  • The method is flexible and can be adapted to different LLM architectures and verification tasks.
  • The authors acknowledge that the approach’s effectiveness depends on the availability and quality of external knowledge sources.
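
A compact prompting sketch of such a verification chain is shown below. It loosely follows the plan–verify–revise pattern; the prompt texts are invented, and `generate` is a hypothetical wrapper around the LLM.

```python
from typing import Callable

def chain_of_verification(query: str, generate: Callable[[str], str]) -> str:
    # 1. Baseline response.
    draft = generate(query)

    # 2. Plan short verification questions that probe the factual claims in the draft.
    questions = [
        q for q in generate(
            f"Draft answer:\n{draft}\n"
            "List short questions, one per line, that would verify each factual claim."
        ).splitlines() if q.strip()
    ]

    # 3. Answer each verification question independently of the draft,
    #    so errors in the draft do not leak into the checks.
    checks = "\n".join(f"Q: {q}\nA: {generate(q)}" for q in questions)

    # 4. Produce a revised answer consistent with the verification answers.
    return generate(
        f"Original question: {query}\nDraft answer:\n{draft}\n"
        f"Verification Q&A:\n{checks}\n"
        "Rewrite the answer, correcting anything the verification contradicts."
    )
```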

Chain of Natural Language Inference (CoNLI)

CoNLI is a two-stage framework that uses Chain-of-Thought (CoT) prompting to detect and mitigate ungrounded hallucinations. It involves the following:

Detection Agent:

  1. Hypothesis Selection: Splits raw responses into individual sentences and selects hypotheses for detection.
  2. Sentence-Level Detection: Uses CoT to judge each hypothesis against the source text as entailment, contradiction, or neutral. Hallucinations are defined as contradictions or neutral judgments.
  3. Entity-Level Detection: For non-hallucinated hypotheses, conducts further CoT-based reasoning on tagged entities to identify more subtle hallucinations.

Mitigation Agent:

  1. Mitigation Instructions: Uses detection results to generate instructions for rewriting the raw response.
  2. Hallucination Reduction: Rewrites or removes hallucinated sentences while preserving the format and coherence of the original response.

Key Features:

  1. Utilizes CoT prompting for interpretable and domain-agnostic hallucination detection.
  2. Employs a hierarchical approach to handle both sentence-level and entity-level hallucinations.
  3. Prioritizes preserving the original response while reducing hallucinations.
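
The paper performs the entailment judgments with CoT prompts; purely as an illustration, the sentence-level detection step can be approximated with an off-the-shelf NLI classifier (here the `roberta-large-mnli` checkpoint from Hugging Face, whose labels are CONTRADICTION, NEUTRAL, and ENTAILMENT), treating anything not entailed by the source as a candidate hallucination.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"  # stand-in NLI judge, not the paper's CoT-based detector
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def detect_ungrounded_sentences(source: str, sentences: list[str]) -> list[str]:
    """Return response sentences that the source text does not entail."""
    ungrounded = []
    for hypothesis in sentences:
        inputs = tokenizer(source, hypothesis, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        label = model.config.id2label[int(logits.argmax(dim=-1))]
        if label != "ENTAILMENT":   # contradiction or neutral counts as a candidate hallucination
            ungrounded.append(hypothesis)
    return ungrounded
```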

Developing Models

Recognizing that mitigating hallucinations requires both algorithmic advances and improved data quality, several recent papers intervene in the models themselves, through new decoding strategies, knowledge grounding, loss functions, and fine-tuning regimes, rather than relying on prompting alone. The techniques categorized below offer a more holistic approach to tackling the challenge.

Introducing New Decoding Strategies

Context-aware Decoding

The paper proposes a method called Context-aware Decoding to address hallucinations.

Context-aware Decoding:

Introduces a context-aware attention contrastive ensemble during the decoding process. This mechanism:

  1. Dynamically weights the attention given to different parts of the input context based on their relevance to the word being generated.
  2. Focuses more on the relevant context that supports the current generation step, ensuring consistency with the given information by adjusting the model’s output probabilities using the pointwise mutual information (PMI) between the context and the generated answer.
  3. Reduces the influence of irrelevant or misleading context, preventing hallucinations.

The architecture does the following:

  1. Context embedding: The model learns a vector representation of the entire input context.
  2. Adaptive weighting: The attention mechanism assigns weights to different context parts based on this embedding and the current generation state.
  3. Knowledge-aware attention: The model can optionally incorporate knowledge bases to guide the attention mechanism further.
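
The PMI-style adjustment can be sketched as a contrast between the next-token logits computed with and without the context. This is an illustrative greedy-decoding toy (GPT-2 chosen only for size), not the paper’s implementation, and `alpha` is an assumed contrast strength.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"                      # small model chosen only for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

def cad_generate(context: str, question: str, alpha: float = 0.5, max_new_tokens: int = 50) -> str:
    with_ctx = tok(context + "\n" + question, return_tensors="pt").input_ids
    no_ctx = tok(question, return_tensors="pt").input_ids
    generated = []
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits_ctx = model(with_ctx).logits[:, -1, :]    # next-token logits given the context
            logits_plain = model(no_ctx).logits[:, -1, :]    # next-token logits without the context
        # Amplify what the context adds and damp what the model would say anyway (PMI-style contrast).
        adjusted = (1 + alpha) * logits_ctx - alpha * logits_plain
        next_id = adjusted.argmax(dim=-1, keepdim=True)      # greedy decoding for simplicity
        if next_id.item() == tok.eos_token_id:
            break
        with_ctx = torch.cat([with_ctx, next_id], dim=-1)
        no_ctx = torch.cat([no_ctx, next_id], dim=-1)
        generated.append(next_id.item())
    return tok.decode(generated)
```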

Decoding by Contrasting Layers (DoLa)

DoLa aims to address hallucination issues by leveraging the different knowledge sources embedded within the layers of an LLM.

Critical aspects of DoLa:

  1. Contrasting layers: DoLa contrasts predictions from different layers of the LLM during decoding. The “mature” layer (higher layers) is assumed to encode more factual knowledge, while the “immature” layer (lower layers) is more focused on linguistic patterns.
  2. Dynamic weighting: The model dynamically weighs the predictions from each layer based on their estimated correctness. This is achieved by comparing the predicted token probabilities across layers and adjusting them based on a plausibility constraint.
  3. Adaptive techniques: DoLa incorporates additional techniques to improve its efficacy:
  • Repetition penalty: Discourages the model from repeating the same word consecutively, promoting diversity in the generated text.
  • Adaptive temperature: Adjusts the decoding temperature dynamically to control the exploration vs. exploitation trade-off.

Expected benefits:

  1. Reduced hallucinations: By downplaying the influence of the immature layer and amplifying factual knowledge from the mature layer, DoLa aims to generate less factually incorrect or nonsensical text.
  2. Improved factual consistency: The contrasting and weighting mechanism helps ensure the generated text aligns with factual knowledge.
  3. Retention of fluency: DoLa incorporates techniques like repetition penalty to maintain the grammatical correctness and fluency of the generated text.
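
A simplified sketch of the layer-contrasting step follows; it fixes the premature layer instead of selecting it dynamically by Jensen–Shannon divergence as the paper does, uses GPT-2 only for illustration, and approximates the plausibility constraint with a 0.1 cutoff.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"   # small model for illustration; the paper reports results on much larger LLMs
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

def dola_next_token_logits(input_ids: torch.Tensor, premature_layer: int = 6) -> torch.Tensor:
    """Contrast the mature (final) layer with a fixed premature layer for the next token."""
    with torch.no_grad():
        hidden = model(input_ids).hidden_states       # tuple: embedding output + one entry per layer
    lm_head = model.get_output_embeddings()           # projection to the vocabulary, shared across layers
    mature = F.log_softmax(lm_head(hidden[-1][:, -1, :]), dim=-1)
    premature = F.log_softmax(lm_head(hidden[premature_layer][:, -1, :]), dim=-1)

    # Plausibility constraint: only keep tokens the mature layer already finds reasonably likely.
    keep = mature >= mature.max(dim=-1, keepdim=True).values + torch.log(torch.tensor(0.1))
    contrast = mature - premature                     # log-ratio favors knowledge added by depth
    return torch.where(keep, contrast, torch.full_like(contrast, float("-inf")))
```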

Utilization of Knowledge Graphs

Knowledge graphs (KGs) are structured collections of data entities (e.g., people, places, objects) along with their attributes and the relationships that bind them together. This organization facilitates comprehension of semantic meaning and relationships within the data, enabling sophisticated reasoning, data analysis, and information retrieval.

Consequently, KGs have attracted interest in the context of hallucination mitigation due to the potential of:

  • Fact-checking: KGs can be used to verify the factual accuracy of information, potentially preventing the generation of false or misleading statements.
  • Reasoning and inference: By leveraging the interconnectedness of entities, KGs can help draw logical conclusions and make informed predictions, reducing the likelihood of hallucinations.
  • Contextual understanding: KGs provide a richer understanding of the context surrounding information, allowing models to generate more relevant and meaningful outputs.

RHO: Reducing Hallucination in Open-domain Dialogues with Knowledge Grounding

RHO tackles the hallucination issue by leveraging knowledge grounding and response reranking techniques.

Knowledge Grounding:

  1. Local: Entities and relations mentioned in the dialogue are linked to corresponding entries in a knowledge graph (KG). Their representations are then fused with the textual embeddings of the dialogue tokens, ensuring consistency between dialogue and knowledge.
  2. Global: An attention mechanism analyzes the entire context-related sub-graph of the KG, not just the linked entities. This allows the dialogue tokens to “attend” to relevant information within the sub-graph, enabling multi-hop reasoning beyond direct connections.

Response Reranking:

A dedicated conversational reasoning model evaluates generated responses. Based on the dialogue context and response, it simulates traversing the relevant KG sub-graph.

Each response receives a probability score reflecting the likelihood of its traversal path matching the sub-graph. The response with the highest score is chosen as the final output, ensuring faithfulness to the knowledge base.

Implementation:

RHO is built on the BART architecture, a powerful sequence-to-sequence language model. KG embeddings are obtained with TransE, a well-established knowledge-graph embedding technique, and the conversational reasoning model employs an LSTM-based decoder.
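
For reference, TransE scores a knowledge-graph triple by how closely the head embedding, translated by the relation, lands on the tail embedding. A toy version is below; the 128-dimensional embeddings are placeholders, and RHO’s actual fusion of these embeddings with BART token embeddings is more involved.

```python
import torch

def transe_score(head: torch.Tensor, relation: torch.Tensor, tail: torch.Tensor) -> torch.Tensor:
    # TransE treats relations as translations: a triple (h, r, t) is plausible when h + r ≈ t,
    # so the score is the negative L2 distance between h + r and t.
    return -torch.norm(head + relation - tail, p=2, dim=-1)

# Toy usage with placeholder 128-dimensional embeddings.
h, r, t = torch.randn(128), torch.randn(128), torch.randn(128)
plausibility = transe_score(h, r, t)   # higher means the fact fits the KG better
```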

Introducing Faithfulness-based Loss Functions

Loss Weight Method

The methodology described in the paper encompasses several critical points aimed at reducing hallucinations in cross-lingual summarization:

Cross-lingual Transfer with MAD-X:

The paper adopts the Multiple Adapters framework (MAD-X) for cross-lingual transfer, a state-of-the-art method incorporating independent language and task adapters. It transfers the summarization capability from a source language to a target language in a sequence of steps: language adapters are trained on Wikipedia corpora, a task adapter is trained on annotated data in the source language, and the trained task adapter is then combined with the target-language adapter for zero-shot inference.

Expert and Anti-Expert Approaches:

To mitigate hallucinations, the paper employs expert and anti-expert models trained on subsets of faithful and hallucinated samples. This approach steers the model towards generating more faithful summaries by tuning a base adapter with these subsets, thereby promoting positive behavior (expert models) and discouraging negative behavior (anti-expert models). Several methods within this family are explored, including Task Vector Negation, Contrastive Parameter Ensembling (CAPE), and DExpert Decoding, each manipulating the model differently to blend the influence of expert and anti-expert knowledge.

Weighted Loss Approach:

Addressing the limitations of directly filtering out hallucinated data, which might compromise summarization performance, the paper introduces a “soft” data filtering method. This approach involves weighting the training loss based on each sample’s faithfulness score, a novel strategy aiming to improve the model’s faithfulness without significantly affecting its summarization capabilities. By relying on a faithfulness metric for the source language, this method adjusts the training parameters to prioritize learning from more faithful examples, thereby reducing the propensity for generating hallucinations during cross-lingual transfer.
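
A minimal sketch of such a faithfulness-weighted loss, assuming a per-sample faithfulness score in [0, 1] is already available from some metric, could look like this; it is an illustration of the idea, not the paper’s exact objective.

```python
import torch
import torch.nn.functional as F

def faithfulness_weighted_loss(
    logits: torch.Tensor,        # [batch, seq_len, vocab]
    labels: torch.Tensor,        # [batch, seq_len]
    faithfulness: torch.Tensor,  # [batch], per-sample scores in [0, 1] from a faithfulness metric
) -> torch.Tensor:
    # Per-token cross-entropy, kept per sample instead of averaged over the batch.
    per_token = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")  # [batch, seq_len]
    per_sample = per_token.mean(dim=1)
    # Down-weight samples the metric judged unfaithful, up-weight faithful ones.
    return (faithfulness * per_sample).mean()
```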

These components collectively form a comprehensive methodology for enhancing the faithfulness of multilingual summarization models, aiming to reduce hallucinations without sacrificing performance. The approach combines cross-lingual transfer techniques, expert and anti-expert models, and an innovative loss-weighting mechanism to tackle the challenge of maintaining high-quality, faithful summaries across languages.

Supervised Finetuning

Supervised fine-tuning (SFT) is crucial in preparing Large Language Models (LLMs) for specific applications using annotated datasets. This process ensures that the models can accurately interpret and execute human instructions for targeted functions, improving the reliability of their outputs. Within SFT, the quality of the dataset is paramount, since it directly determines how effective the tuned model will be. Throughout the SFT process, the LLM’s parameters are adjusted based on feedback from a loss function that measures the discrepancy between the model’s predictions and the actual labels. SFT has been notably successful in improving LLMs’ ability to adapt to and perform well on tasks they have not encountered before.

Fine-Tuning Large Language Models for Factuality

The paper proposes a methodology for improving the factual accuracy of large language models (LLMs) through fine-tuning with factual supervision. Key points include:

Factual Supervision:

  1. Leverage a large factual knowledge base (e.g., Wikidata, knowledge graphs) to generate training data.
  2. Construct factual triples (subject, predicate, object) representing true statements.
  3. Utilize textual entailment methods to assess the factual consistency of LLM outputs with these triples.

Fine-tuning:

Fine-tune pre-trained LLMs on the factual supervision data, optimizing for entailment accuracy. This encourages the LLM to align its internal representations with factual knowledge.

Evaluation:

Factual accuracy is assessed on benchmark datasets containing factual prompts and responses, and the fine-tuned LLM’s performance is compared with the original model and other fact-aware language models.

Key Technical Aspects:

  • Factual Knowledge Base: The quality and coverage of the knowledge base are crucial for generating accurate supervision.
  • Textual Entailment Methods: The choice of entailment method impacts the sensitivity to different types of factual errors.
  • Fine-tuning Objective: The specific objective function used for fine-tuning can influence the LLM’s learning process and factual grounding.
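
As one hedged reading of the supervision step above, factual triples can be verbalized and used to filter model drafts through an entailment check before fine-tuning; the `generate` and `entails` callables and the prompt template below are hypothetical.

```python
from typing import Callable, List, Tuple

def build_factual_finetuning_data(
    triples: List[Tuple[str, str, str]],  # (subject, predicate, object) from a knowledge base
    generate: Callable[[str], str],       # drafts statements with the base LLM
    entails: Callable[[str, str], bool],  # textual entailment check: does the premise support the hypothesis?
) -> List[Tuple[str, str]]:
    data = []
    for subject, predicate, obj in triples:
        reference = f"{subject} {predicate} {obj}."                  # verbalized factual triple
        prompt = f"State a fact about {subject} regarding {predicate}."
        draft = generate(prompt)
        if entails(reference, draft):                                # keep only factually consistent drafts
            data.append((prompt, draft))
    return data                                                      # (prompt, target) pairs for fine-tuning
```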

Hallucination Augmented Recitations (HAR)

The Hallucination Augmented Recitations (HAR) methodology encompasses a structured approach to generate counterfactual datasets to enhance attribution in language models. This methodology is mainly designed for open-book question-answering (QA) scenarios, focusing on generating high-quality, attributable counterfactuals that cannot be derived from the model’s pre-existing knowledge. Here are the key steps:

  1. Recitation Generation: A recitation-augmented language model approach generates multiple document and answer pairs for a given question. A 5-shot prompt induces reasoning through recitation and produces an attributable document for each open-book QA example.
  2. Factuality Filtering: Applies a filtering process that excludes factually accurate answers and keeps counterfactual ones. This refines the dataset toward counterfactual information that forces the model to rely on the provided documents rather than on its pre-existing knowledge.
  3. Attribution Filtering: Further refines the dataset by removing document and answer pairs where the answer is not grounded in the document provided. This ensures that the remaining data points in the dataset require the model to attribute the answer specifically to the information in the document, enhancing the model’s ability to use attribution effectively.

This methodology results in the creation of a counterfactual dataset named CF-TriviaQA, containing 19,000 examples. The dataset is designed to improve text grounding and attribution in language models by focusing on counterfactual data. It has demonstrated significant improvements in open-book QA performance when models are fine-tuned on this dataset compared to traditional factual datasets.

Models fine-tuned with CF-TriviaQA significantly outperformed those fine-tuned with the factual dataset TriviaQA, even with a dataset four times smaller and a model size four times smaller. This improvement was observed across various out-of-domain open-book QA tasks, including multi-hop biomedical and adversarial questions.

Conclusion

The evolving landscape of Large Language Models (LLMs) in 2023 and now 2024 has brought to everybody’s attention the pressing issue of hallucinations — instances where these advanced technologies generate factually incorrect or misleading information. As highlighted through various innovative research efforts and methodologies, the field is actively seeking solutions to enhance the reliability and accuracy of LLMs. The efforts are multifaceted, from implementing Retrieval-Augmented Generation approaches and novel Prompt Engineering techniques to developing groundbreaking model architectures and knowledge-grounding strategies. These advancements aim to mitigate the risks associated with LLM-generated hallucinations and pave the way for more trustworthy and efficient AI systems. As we continue to refine these technologies, the focus remains on balancing the immense potential of LLMs with the imperative of ensuring their outputs are factual and beneficial. The journey toward achieving this balance is complex and ongoing, yet essential for the future of AI-driven innovation and its integration into diverse sectors of society.
