Hidden Roots Of LLM & RAG Hallucinations

Deltaaruna · Published in Effectz.AI · 13 min read · Jun 7, 2024

1. Introduction

What do we know about hallucinations? Why do LLMs and RAG systems hallucinate? Is it a bug, or a feature? When we think about hallucinations, it is very important to understand why they occur. Let's dig deep into hallucinations and discover their origins!

Based on the paper Unfamiliar Finetuning Examples Control How Language Models Hallucinate, when large language models (LLMs) encounter unfamiliar inputs at test time, their predictions mimic the responses associated with the unfamiliar examples in their finetuning data. In other words, the model learns to make an intelligent "blind guess" for unfamiliar examples during finetuning, and it defaults to that guess when faced with unfamiliar queries at test time.

So now you have some theoretical background about why LLMs hallucinate. Now let's see what is really going on with these models in detail. For that, the Survey of Hallucination in Natural Language Generation is a great reference. Let's see what it says.

According to the paper, there are two types of hallucinations:

  1. Intrinsic Hallucinations: The generated content contradicts the source content. Here the LLM is providing false information. For example, an LLM generating "The first Ebola vaccine was approved in 2021" contradicts source content which states it was approved in 2019.
  2. Extrinsic Hallucinations: The LLM generates content that cannot be verified from the source content. Here the LLM is inventing new content that might be either true or false. For example, an LLM outputs "China has already started clinical trials of the COVID-19 vaccine" when the source does not mention this.

Now we are aware of the types of hallucinations. Let's consider the reasons. We can categorize the origins of hallucinations into two groups:

  1. Hallucinations from data
  2. Hallucinations from training

2. Hallucinations from data

According to the paper, the main cause of hallucination from data is source-reference divergence.

2.1. Source-Reference Divergence

Source-reference divergence occurs when there is a mismatch or discrepancy between the source input data and the reference data used during the training phase. The source input is the original data from which the LLM generates text; it could be any structured or unstructured input the model uses as the basis for its output. The reference is the target text the LLM is trained to produce given the source input; this may be human-written text that serves as the ground truth during training.

2.1.1. Causes of Source-Reference Divergence

Heuristic Data Collection

When we construct huge datasets, heuristic methods are often used to pair source and reference textual data, which might lead to mismatches. For example, in the WIKIBIO dataset, the source is the infobox of a Wikipedia page (the table found on the right-hand side of an article, used to present a summary of information about the subject), and the reference is the first sentence of the page. But 62% of the first sentences in the dataset contain additional information that is not present in the infobox, which causes significant source-reference divergence.
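To make this concrete, here is a minimal sketch of how such divergence could be surfaced: compare the reference sentence against the infobox values and flag any tokens with no support in the source. The infobox format and the simple token-overlap check are assumptions for illustration, not the actual WIKIBIO preprocessing.

```python
# Minimal sketch: flag reference content that has no support in the source infobox.
# The infobox format and tokenization are simplifying assumptions.
import re

def divergent_tokens(infobox: dict, reference: str) -> set:
    """Return reference tokens that never appear in any infobox value."""
    source_tokens = set()
    for value in infobox.values():
        source_tokens.update(re.findall(r"\w+", value.lower()))
    reference_tokens = set(re.findall(r"\w+", reference.lower()))
    return reference_tokens - source_tokens

infobox = {"name": "Ada Lovelace", "born": "1815", "occupation": "mathematician"}
reference = "Ada Lovelace (1815-1852) was an English mathematician and writer."

print(divergent_tokens(infobox, reference))
# Tokens such as '1852', 'english', 'writer' signal information the model is asked
# to produce without any support in the source, which invites hallucination.
```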

Intrinsic Nature of Certain Language Generation Tasks

Some NLG tasks inherently involve source-reference divergence in their datasets. For example, tasks that prioritize diversity in the generated output might produce responses that include additional information or context not available in the source. In an open-domain dialogue system, for instance, responses are sometimes generated to improve user engagement by including relevant facts or subjective opinions not directly derived from the dialogue history or external knowledge bases. However, such dataset characteristics might lead to hallucinations.

Duplicates in the dataset

When duplicates in the dataset are not properly filtered out, the duplicated examples from the pretraining corpus bias the model toward generating repeats of the memorized phrases from those examples.
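As a rough illustration, the sketch below filters exact duplicates from a corpus by hashing a normalized form of each document. Real pipelines usually add near-duplicate detection (e.g., MinHash), which this sketch omits.

```python
# Minimal sketch: exact-duplicate filtering for a pretraining/finetuning corpus
# using content hashes of a normalized document form.
import hashlib

def deduplicate(documents):
    seen = set()
    unique_docs = []
    for doc in documents:
        # Normalize whitespace and case so trivially reformatted copies hash identically.
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique_docs.append(doc)
    return unique_docs

corpus = [
    "The first Ebola vaccine was approved in 2019.",
    "The first  Ebola vaccine was approved in 2019.",  # duplicate with extra space
    "China has started clinical trials of a COVID-19 vaccine.",
]
print(len(deduplicate(corpus)))  # 2
```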

2.1.2. Example in Summarization

Source Input Data: An article about the approval of the Ebola vaccine.

Example: “The first Ebola vaccine was approved by the FDA in 2019 in the US, five years after the initial outbreak in 2014. To produce the vaccine, scientists had to sequence the DNA of the Ebola virus.”

Reference Output Data: A human-written summary of the article.

Example: “The first Ebola vaccine was approved in 2019.”

If the reference output data was: “The first Ebola vaccine was approved in 2021,” there would be a divergence because the reference contains a fact (the year 2021) not supported by the source input (which states 2019).

3. Hallucination from Training and Inference

3.1. Imperfect Representation Learning

Imperfect representation learning refers to the inability of the encoder in an LLM to accurately understand and encode the input text into meaningful and correct representations. This imperfection can arise from several factors, including the quality of the input data, the architecture of the model, and the training procedures. When the encoder fails to capture the necessary details and context from the input data, the learned representations are flawed or incomplete. These flawed representations then lead to erroneous outputs when passed through the decoder during text generation. When the encoder produces imperfect representations, the generated output may include hallucinations: information that is not present in the source input. This can manifest as factual inaccuracies, irrelevant details, or completely fabricated content.

Example

Consider a source input: "The first Ebola vaccine was approved by the FDA in 2019 after rigorous testing." The encoder should capture the key details: "first Ebola vaccine," "approved by FDA," and "2019." Assume that the encoder fails to accurately capture the year of approval or the entity involved in the approval process. The LLM might then output "The first Ebola vaccine was approved in 2021 by the CDC."

3.2. Erroneous decoding

Erroneous decoding refers to errors that occur during the decoding phase of an LLM. The decoder's role is to generate the target sequence (output text) from the encoded input representations. Errors in this phase can arise from multiple factors, including issues with the attention mechanism, the decoding strategy, and the overall architecture of the decoder. We can categorize them as follows.

Attention Mechanism Errors

The attention mechanism allows the decoder to focus on specific parts of the input sequence when generating each word of the output. If the attention mechanism incorrectly identifies which parts of the input to focus on, it can lead to errors in the generated text. In a translation task, if the attention mechanism incorrectly focuses on less relevant parts of the input sentence, the resulting translation might mix up subjects, objects, or other critical details.

Decoding Strategy Issues

Different strategies are used to generate text during decoding, such as greedy search, beam search, or sampling methods like top-k sampling. Each strategy has its strengths and weaknesses. Some strategies, particularly those that introduce randomness (e.g., top-k sampling), can lead to hallucinations if not properly managed. Top-k sampling introduces variability by randomly selecting from the top k probable next tokens, which can sometimes lead to unexpected or irrelevant text.
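The toy sketch below shows how top-k sampling prunes the tail of a next-token distribution but still leaves room for a plausible-looking wrong token to be sampled. The vocabulary and probabilities are made up for illustration; in a real model they come from the softmax over the decoder logits.

```python
# Minimal sketch of top-k sampling over a toy next-token distribution.
import random

def top_k_sample(token_probs: dict, k: int) -> str:
    # Keep only the k most probable tokens, renormalize, then sample.
    top = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    tokens = [t for t, _ in top]
    weights = [p / total for _, p in top]
    return random.choices(tokens, weights=weights, k=1)[0]

next_token_probs = {"2019": 0.55, "2021": 0.20, "recently": 0.15, "banana": 0.10}
print(top_k_sample(next_token_probs, k=3))
# With k=3 the implausible "banana" is pruned, but "2021" can still be sampled,
# which is exactly how sampling-based decoding can surface a factual error.
```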

Exposure bias

Exposure bias arises from the discrepancy between training and inference. During training, the decoder is typically trained using teacher-forced maximum likelihood estimation (MLE), where it predicts the next token based on the ground-truth prefix sequences. However, during inference, the decoder generates the next token based on the sequences it has generated so far. This discrepancy can lead to increasingly erroneous generation over time, especially for longer sequences.
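The sketch below illustrates this mismatch with a toy decoder: under teacher forcing a wrong prediction does not contaminate later steps, because the next step is conditioned on the gold prefix, while under free-running generation the same mistake propagates. The `predict_next` function is a hypothetical stand-in for a real decoder.

```python
# Minimal sketch of the training/inference mismatch behind exposure bias.
# `predict_next` is a toy stand-in for a trained decoder, not a real model.
def predict_next(prefix: list) -> str:
    # The toy decoder misreads the sentiment after "reported", and what it says
    # next depends entirely on what it is conditioned on.
    if prefix[-1] == "reported":
        return "losses"
    if prefix[-1] == "profits":
        return "of 20% in the third quarter."  # consistent with the source
    return "of 20% in the third quarter, hurting the stock."  # error compounds

gold = ["The", "company", "reported", "profits"]

# Teacher forcing (training): every step conditions on the gold prefix, so the
# wrong prediction at step 3 does not contaminate step 4.
step3 = predict_next(gold[:3])  # 'losses' (a mistake)
step4 = predict_next(gold[:4])  # conditioned on the gold token 'profits'

# Free running (inference): each step conditions on the model's own output,
# so the early mistake propagates into the rest of the summary.
generated = ["The", "company", "reported"]
for _ in range(2):
    generated.append(predict_next(generated))

print(step3, "|", step4)
print(" ".join(generated))  # "... reported losses of 20% ... hurting the stock."
```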

3.2.1. Examples

Training Phase: The model learns to generate summaries by using the actual previous words from the training data.

Example Source text: “The company’s profits increased by 20% in the third quarter.” The model learns to generate the next word in the summary based on the previous correct words.

Inference Phase: The model generates a summary based on its previous output.

Example: If the model starts with “The company reported,” but then incorrectly generates “losses” instead of “profits,” it might continue with “of 20% in the third quarter,” resulting in a summary that contradicts the source text.

3.3 Parametric Knowledge Bias

Parametric knowledge bias occurs when a pre-trained language model relies too heavily on the knowledge it has internalized during pre-training, rather than focusing on the specific context provided by the input data during generation. This can lead to the generation of text that, while factually correct based on general world knowledge, may not be relevant or faithful to the specific input. This bias arises because large language models are trained on vast corpora of text, which allows them to memorize a significant amount of factual information.

LLMs are pre-trained on massive datasets that include a wide range of topics and facts. This extensive training allows the models to internalize a lot of general knowledge. While this general knowledge can be useful for many tasks, it can sometimes overshadow the specific context of the input data, leading the model to generate text based on what it “knows” rather than what it should infer from the input.
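One common mitigation is to constrain the model to the provided context at the prompt level. Below is a minimal sketch of such a grounded prompt template; the wording and the `call_llm` client are assumptions for illustration, not a prescribed recipe.

```python
# Minimal sketch: a prompt template that pushes the model toward the provided
# context instead of its parametric knowledge. `call_llm` is a placeholder for
# whichever chat/completions client you actually use.
GROUNDED_PROMPT = """Answer the question using ONLY the context below.
If the context does not contain the answer, reply "I don't know".

Context:
{context}

Question: {question}
Answer:"""

def build_grounded_prompt(context: str, question: str) -> str:
    return GROUNDED_PROMPT.format(context=context, question=question)

prompt = build_grounded_prompt(
    context="The 2022 Nobel Prize in Literature was awarded to Annie Ernaux.",
    question="Who won the 2022 Nobel Prize in Literature?",
)
print(prompt)
# response = call_llm(prompt)  # hypothetical client call
```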

3.3.1. Example

Source Input: “The 2022 Nobel Prize in Literature was awarded to Annie Ernaux for her courage and clinical acuity in uncovering the roots, estrangements, and collective restraints of personal memory.”

Generated Output(with Parametric Knowledge Bias): “The Nobel Prize in Literature has been awarded to authors such as Gabriel García Márquez and Toni Morrison for their significant contributions to literature.”

The model correctly mentions past Nobel laureates, but it fails to focus on the specific context of the 2022 prize awarded to Annie Ernaux.

4. RAG hallucinations

Based on the above details, you might think that it is not possible to fix hallucinations in your RAG systems. But according to the paper Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts, there are things you can do so that the LLM can be convinced to rely on external data.

The paper investigates how LLMs respond when presented with a single piece of counter-memory (external evidence that contradicts their parametric memory). The study contrasts two methods of constructing counter-memory:

  1. Entity Substitution: A heuristic method where specific entities in the parametric memory are substituted with other entities.
  2. Generation-based: A method where coherent counter-memory is directly generated to ensure high quality and naturalness.

4.1 Entity Substitution-based Counter-memory

Here, entity substitution is used to create counter-memory by replacing the entities in the parametric memory with different entities of the same type. However, this method can result in incoherent evidence.

Let’s consider an example.

  • Parametric Memory: “Washington D.C. is the capital of the USA. It has the Washington Monument.”
  • Entity Substitution-based Counter-memory: “London, USA’s capital, has the Washington Monument.”
  • Question: “What is the capital city of USA?”
  • LLM’s Answer (based on parametric memory): “Washington D.C.”

The results showed that LLMs tend to stick to their parametric memory when faced with such counter-memory. This is attributed to the low coherence of the substituted evidence, as the context still contains strong associations with the original entity (e.g., Washington Monument and USA).
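Below is a minimal sketch of entity-substitution counter-memory, roughly in the spirit of the heuristic described above (the paper's exact construction may differ):

```python
# Minimal sketch of entity-substitution counter-memory: swap one entity for
# another of the same type to contradict the parametric memory.
def substitute_entity(parametric_memory: str, original: str, replacement: str) -> str:
    return parametric_memory.replace(original, replacement)

memory = "Washington D.C. is the capital of the USA. It has the Washington Monument."
counter = substitute_entity(memory, "Washington D.C.", "London")
print(counter)
# "London is the capital of the USA. It has the Washington Monument."
# The leftover "Washington Monument" betrays the edit, which is why this
# counter-memory reads as incoherent and is easy for the LLM to reject.
```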

4.2 Generation-based Counter-memory

An LLM is asked to generate the counter-memory from scratch. This method results in a more coherent and convincing counter-memory.

Let’s consider an example.

  • Parametric Memory: “Paris is the capital of France.”
  • Generation-based Counter-memory: “Néma is the capital of France. This is confirmed by the official government website of France, where it is listed as the capital city.”
  • Question: “What is the capital of France?”
  • LLM’s Answer (based on counter-memory): “Néma.”
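Here is a minimal sketch of how generation-based counter-memory could be produced, assuming a hypothetical `call_llm` client; the prompt wording is an assumption, not the exact prompt used in the paper.

```python
# Minimal sketch of generation-based counter-memory: instead of string surgery,
# an LLM is prompted to write coherent evidence for the counter-answer.
COUNTER_PROMPT = (
    "Write a short, fluent passage of evidence supporting the claim below, "
    "as if it came from a reliable source.\n\nClaim: {claim}\nPassage:"
)

def build_counter_memory_prompt(claim: str) -> str:
    return COUNTER_PROMPT.format(claim=claim)

prompt = build_counter_memory_prompt("Néma is the capital of France.")
print(prompt)
# counter_memory = call_llm(prompt)  # hypothetical call; the output is the kind of
# coherent counter-evidence the study found LLMs are far more receptive to
```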

4.3. Experimental Results

The paper presents results showing the distribution of LLM answers (parametric memory-based vs. counter-memory-based) for both entity substitution and generation-based methods:

  • Entity Substitution: LLMs mostly adhere to their parametric memory.
  • Generation-based: LLMs are more likely to accept the counter-memory as the correct answer.
  • Big closed-source models (ChatGPT, GPT-4, PaLM2): They show a significant reliance on their parametric memory when using substitution-based counter-memory but exhibit a notable increase in receptiveness to generation-based counter-memory.
  • Smaller open-source models like Qwen-7B and Llama2–7B: They are more likely to accept counter-answers with both methods.
  • Larger open-source models within the same series (Llama2–70B and Vicuna-33B): They show a higher tendency to stick to their parametric memory when substitution-based counter-memory is used but are more open to generation-based counter-memory.

For all LLMs, the memorization ratio is higher for more popular questions. This indicates that LLMs are more likely to rely on their parametric memory when answering questions about more popular entities. The higher the popularity, the stronger the reliance on internal knowledge.

  • GPT-4: It has the highest memorization ratio across all popularity levels, suggesting it has a stronger confirmation bias compared to other models.
  • ChatGPT and PaLM2: They also show high memorization ratios but with slightly more fluctuation compared to GPT-4.
  • Llama2–7B: It has a lower memorization ratio compared to the others, indicating it is more flexible and less biased towards its internal memory.

We can summarize the findings as follows:

  1. Receptiveness to Coherent Counter-memory: LLMs are highly receptive to coherent and convincing counter-memory, even if it conflicts with their parametric memory. This contradicts prior conclusions that LLMs are generally stubborn and stick to their parametric memory.
  2. Vulnerability to Misinformation: The effectiveness of generated counter-memory in misleading LLMs highlights a significant concern. LLMs can be easily deceived by well-crafted misinformation.
  3. Comparison between Methods: The generation-based method outperforms the entity substitution method in terms of convincing LLMs to accept counter-memory.

4.4 Multi-source evidence

How do LLMs behave when presented with multiple pieces of evidence, some of which may support and some of which may conflict with their parametric memory? This scenario is representative of real-world applications where LLMs are augmented with information from diverse sources, such as search engines or RAG systems. In this setup, multiple pieces of evidence are provided to the LLMs. The key focus is on understanding how the LLMs prioritize and integrate these pieces of evidence, especially when they contain conflicting information. The study looks into three main aspects:

  • The popularity of evidence
  • The order of evidence presentation
  • The quantity of supporting versus conflicting evidence.

The study identifies a confirmation bias in LLMs, particularly towards more popular knowledge. This bias is stronger for facts that the LLMs have encountered more frequently during training.

4.5 Experimental Results

  1. Evidence Preference and Popularity:

LLMs show a strong confirmation bias towards their parametric memory, especially for popular questions. For instance, GPT-4 demonstrates an 80% memorization ratio for the most popular questions, indicating a strong preference for its internal knowledge over conflicting external evidence.

Let’s consider an example.

Question: “Who is the current president of the United States?”

  • Parametric Memory: “Joe Biden is the current president of the United States.”
  • Counter-memory: “John Doe is the current president of the United States, as confirmed by the latest news articles.”

For a popular and well-known fact, such as the current president, LLMs like GPT-4 are more likely to stick to their parametric memory (Joe Biden), demonstrating a high memorization ratio due to the frequent exposure to this fact during training.

2. Order Sensitivity:

LLMs exhibit sensitivity to the order in which evidence is presented. When parametric memory is presented first, LLMs are more likely to stick to it. This order sensitivity varies across models. For example, while GPT-4 shows less fluctuation, models like PaLM2 and Llama2–7B exhibit significant changes in response depending on the order of evidence presentation. Such order sensitivity for evidence in the context may not be a desirable property for RAG.

Question: “What is the capital of Australia?”

  • First Evidence (Parametric Memory): “Canberra is the capital of Australia.”
  • Second Evidence (Counter-memory): “Sydney is the capital of Australia.”

When the parametric memory (Canberra) is presented first, LLMs are more likely to choose it as the answer. If the counter-memory (Sydney) is presented first, the response might shift, especially in less robust models like PaLM2.
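To probe this behavior in your own setup, you could present the same two pieces of evidence in both orders and compare the answers, as in the rough sketch below (the prompt format and the `call_llm` client are assumptions for illustration):

```python
# Minimal sketch: test order sensitivity by presenting the same evidence in
# both orders and comparing the model's answers.
from itertools import permutations

EVIDENCE = [
    "Canberra is the capital of Australia.",  # aligned with parametric memory
    "Sydney is the capital of Australia.",    # counter-memory
]
QUESTION = "What is the capital of Australia?"

def build_prompt(evidence_order):
    numbered = "\n".join(f"Evidence {i + 1}: {e}" for i, e in enumerate(evidence_order))
    return f"{numbered}\n\nQuestion: {QUESTION}\nAnswer based on the evidence above:"

for order in permutations(EVIDENCE):
    prompt = build_prompt(order)
    # answer = call_llm(prompt)  # hypothetical; compare answers across the two orders
    print(prompt, end="\n---\n")
```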

3. Quantity of Evidence:

When presented with more pieces of counter-memory, LLMs can be swayed to accept the conflicting information. Conversely, when parametric memory is presented alongside counter-memory in a balanced manner, LLMs demonstrate a noticeable confirmation bias towards the parametric memory.

Question: “Is Pluto a planet?”

  • Evidence 1 (Parametric Memory): “Pluto is classified as a dwarf planet.”
  • Evidence 2 (Counter-memory): “Pluto is considered the ninth planet in the solar system.”
  • Additional Evidence (Supporting Counter-memory): Several older astronomy texts and articles also refer to Pluto as a planet.

With multiple pieces of counter-memory, LLMs might be swayed to accept the outdated classification of Pluto as a planet, especially if the counter-memory is coherent and convincingly presented.

So it is clear that evidence order has an impact on LLM behavior in multi-source evidence scenarios. The results show the importance of considering evidence presentation order when designing systems that integrate external information with LLMs. Models like GPT-4 show robustness and a strong confirmation bias, while others like PaLM2 and Llama2–7B are more influenced by the order of evidence, which could be leveraged in applications to manage information reliability and bias.

5. Important points about RAG systems

  1. Larger models are more likely to resist disinformation presented through your RAG system, but they can be inflexible in accepting information from your RAG system.
  2. Smaller LLMs are easily convinced by misinformation, but they are very flexible in accepting information from external RAG systems.
  3. Present your RAG data in a consistent and coherent way, so the information will be accepted by the LLM (see the prompt-assembly sketch after this list).
  4. LLMs can even generate misinformation that can mislead themselves.
  5. LLMs like GPT-4 have a strong confirmation bias towards their internal knowledge. In that context, smaller models are more flexible and rely less on internal knowledge (when presented with information that conflicts with the internal memory).
  6. For popular entities, LLMs like GPT-4 have a very strong confirmation bias towards internal knowledge.
  7. LLMs show considerable sensitivity to the order in which evidence is presented, and different models prefer different orders.
  8. LLMs tend to choose the side with more evidence. If irrelevant information is provided, LLMs can be distracted by it and become less likely to answer based on the parametric memory. This is especially the case for smaller models.
  9. If both irrelevant and relevant data are present, LLMs can filter out the irrelevant information to a certain extent.
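For point 3, here is a minimal sketch of assembling retrieved chunks into one consistently formatted context block rather than dumping raw fragments; the chunk structure and labels are assumptions for illustration.

```python
# Minimal sketch: build a coherent, consistently labeled context block from
# retrieved RAG chunks before handing it to the LLM.
def build_context(chunks: list) -> str:
    sections = []
    for i, chunk in enumerate(chunks, start=1):
        sections.append(f"[Source {i}: {chunk['title']}]\n{chunk['text'].strip()}")
    return "\n\n".join(sections)

chunks = [
    {"title": "Company FAQ", "text": "Refunds are processed within 14 days."},
    {"title": "Policy v2.1", "text": "Refund requests require an order number."},
]
print(build_context(chunks))
```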

I feel like larger models are like the Java language: easy to use, easy to learn, with automatic garbage collection and so on. Smaller models are like C: a bit more difficult to learn, but more flexible. Both languages have great use cases, and I feel LLMs are the same.

When you need to augment an LLM with some private data, you should consider both fine-tuning and RAG carefully. If you manage to fine-tune the LLM with your data, it is less likely to make "blind guesses", as mentioned in the Introduction section. You can read this post to learn how to fine-tune ML models even on your MacBook Pro!

If you go with RAG, make sure you present the information in a very coherent way. This paper from Microsoft discusses generating chain-of-thought prompts on the fly based on retrieved embeddings. I think it is a good way of presenting data in a "coherent" manner.
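For intuition, here is a minimal sketch of that idea: retrieve previously solved examples that are closest to the query in embedding space and prepend their reasoning as few-shot context. The embedding vectors, example bank, and prompt format are assumptions for illustration, not the exact pipeline from the paper.

```python
# Minimal sketch: retrieve similar solved examples by embedding similarity and
# use their chain-of-thought as few-shot context for a new question.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_cot_examples(query_emb, bank, k=2):
    """bank: list of {'question', 'chain_of_thought', 'embedding'} dicts."""
    scored = sorted(bank, key=lambda ex: cosine(query_emb, ex["embedding"]), reverse=True)
    return scored[:k]

def build_cot_prompt(question, examples):
    shots = "\n\n".join(
        f"Q: {ex['question']}\nReasoning: {ex['chain_of_thought']}" for ex in examples
    )
    return f"{shots}\n\nQ: {question}\nReasoning:"

# Toy bank with made-up 2-d embeddings; in practice these come from an embedding model.
bank = [
    {"question": "Is 17 prime?", "chain_of_thought": "Check divisors up to sqrt(17)...",
     "embedding": np.array([0.9, 0.1])},
    {"question": "What is the capital of France?", "chain_of_thought": "Recall geography facts...",
     "embedding": np.array([0.1, 0.9])},
]
query_emb = np.array([0.85, 0.2])  # would come from embedding the new question
print(build_cot_prompt("Is 19 prime?", retrieve_cot_examples(query_emb, bank, k=1)))
```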

⭐️ Follow me on LinkedIn or Twitter for updates on AI ⭐️

I’m currently the Co-Founder & CEO @ Effectz.AI. We specialize in Privacy Preserving AI Solutions & AI Consulting.

6. References

  1. Unfamiliar Finetuning Examples Control How Language Models Hallucinate: https://arxiv.org/abs/2403.05612
  2. Survey of Hallucination in Natural Language Generation: https://arxiv.org/abs/2202.03629
  3. Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts: https://arxiv.org/abs/2305.13300
  4. https://arxiv.org/abs/2311.16452
  5. https://medium.com/rahasak/fine-tune-llms-on-your-pc-with-qlora-apple-mlx-c2aedf1f607d
