Hallucination in Large Language Models (2023)

Ashesh Nath Mishra
Oct 22, 2023


In the context of large language models like GPT-3, “hallucination” refers to a phenomenon where the model generates content that is not accurate, factual, or contextually relevant.

Hallucinations occur when the model produces information or responses that seem plausible on the surface but are actually incorrect, fictional, or not grounded in reality. This can be unintentional and often results from the limitations and biases in the training data and the model architecture. Here’s a more detailed explanation:

Why does hallucination happen in large language models?

Hallucination can occur for several reasons:

1. Training Data Bias: Large language models are trained on vast amounts of text from the internet, which can contain misinformation, stereotypes, and biases. These biases may lead the model to generate content that aligns with those biases but is factually incorrect.

2. Over-optimization: During training, models like GPT-3 are optimized to produce coherent and contextually relevant text. This optimization sometimes leads them to make up information that fits the context, even if it’s not true.

3. Absence of External Verification: The models lack the ability to verify information from external sources. They rely on the training data and don’t have access to real-time, fact-checking databases.

4. Contextual Inference: Language models infer context from preceding text, but they might misinterpret or extrapolate incorrectly, leading to hallucinations.

Types of Hallucination in LLMs:

LLM hallucinations can be categorized into several types:

1. Factual Hallucination:
Factual hallucination occurs when an LLM generates content that is factually incorrect or fictional. LLMs may produce information that seems accurate but is not based on real-world facts or data.

Example 1:
In response to a question about historical events, an LLM might inaccurately state, “World War II started in 1776.” This is a factual hallucination because it presents a factually incorrect date for an important historical event.
Example 2:
When asked about the chemical composition of water, the LLM might generate the response, “Water is composed of carbon and nitrogen.” This is factually incorrect, as water is actually composed of hydrogen and oxygen.

2. Contextual Hallucination:
Contextual hallucination happens when an LLM generates content that appears relevant to the context but is still incorrect. The generated content may fit the conversation but contains inaccuracies.

Example 1:
Suppose a user asks, “Tell me about the life of Albert Einstein.” The LLM could create a detailed but entirely fictional biography of a non-existent person, despite the request for information about the real historical figure.
Example 2:
When asked about tourist destinations in Paris, the LLM may describe an imaginary place, such as “The Enchanted Chocolate Forest of Paris,” instead of providing real tourist spots.

3. Stereotype Reinforcement:
Stereotype reinforcement occurs when an LLM unintentionally perpetuates stereotypes or biases present in its training data.
Example 1:
In response to a question about career choices, the LLM might unintentionally reinforce a gender stereotype by stating, “Nursing is a job more suitable for women,” even though this is an outdated and inaccurate stereotype.
Example 2:
When asked about the leadership qualities of individuals from certain ethnic backgrounds, the LLM might generate a response that perpetuates racial biases.

4. Parroting:
Parroting is when the LLM simply repeats information or biases present in its training data without critical analysis. Instead of generating original or thoughtful content, the model echoes what it has learned from its training data. An LLM might echo a biased statement it encountered during training without challenging its accuracy or ethical implications.

Example: If a controversial statement present in the model’s training data is repeated without critical analysis, the LLM may affirm a falsehood, such as “Vaccines cause autism.”

5. Misinformation Propagation:
Misinformation propagation involves the dissemination of false or misleading information. LLMs may inadvertently spread misinformation due to the biases in their training data.

Example 1: If a user inquires about health advice, the LLM might provide information endorsing a pseudo-scientific cure for a serious disease, potentially putting the user’s health at risk.
Example 2: In response to a query about climate change, the LLM might spread misinformation about the causes or effects of climate change.

6. Self-contradiction:
Self-contradiction occurs when the LLM generates responses that contradict its own statements. The model might generate content that contradicts what it previously stated in the same response.

Example 1: An LLM might say, “The sky is always blue,” and, in the same response, “The sky is often gray.” This is self-contradiction.

Example 2: In a single response, the LLM might assert, “The Earth is flat,” and immediately contradict itself with, “The Earth is a sphere.” This self-contradictory behavior can confuse users and undermine the model’s credibility.

7. Over-extrapolation:
Over-extrapolation happens when the LLM makes unwarranted predictions or generalizations. The model may draw conclusions that go beyond the scope of the provided information.
Example: If an LLM predicts future events with unwarranted certainty, such as “The stock market will crash next week” or “There will be 10 inches of rain precisely on July 7, 2023,” it is over-extrapolating.

Addressing these types of hallucinations in LLMs is a challenging task that involves refining training data, implementing better fine-tuning practices, and encouraging critical thinking and ethical AI development to reduce the occurrence of these issues.

How to Solve LLM Hallucination:

Solving hallucination in large language models is a challenging task, but there are steps that can be taken to mitigate it:

1. Fine-tuning:

In this context, fine-tuning an LLM’s response means configuring the various generation parameters that shape its output. These parameters can be adjusted when calling the OpenAI API for text generation. Here are some commonly used parameters, followed by Python code examples:

Max Tokens: This parameter limits the length of the generated response in tokens.

Temperature: Temperature affects the randomness of the output. A higher value (e.g., 0.8) makes the output more random, while a lower value (e.g., 0.2) makes it more deterministic.

Top-p (nucleus) Sampling: This parameter controls the diversity of the output by restricting sampling to the smallest set of tokens whose cumulative probability reaches p.

Frequency Penalty: This penalizes tokens in proportion to how often they have already appeared in the text, discouraging repetitive output.

Presence Penalty: This penalizes tokens that have appeared at least once, encouraging the model to introduce new topics rather than repeat existing ones.

Engine: Specify the language model engine to use, e.g., “text-davinci-002” for GPT-3.

N (Number of Responses): You can request multiple responses and choose the best one.

User: An optional identifier for the end user making the request; it helps OpenAI monitor for misuse and does not change the generated text.

Here are a few examples of how to use these parameters with the OpenAI API:

import openai

# Set your API key before making requests
openai.api_key = "YOUR_OPENAI_API_KEY"

prompt = ("Translate the following English text to French: "
          "'The quick brown fox jumped over the lazy dog.'")

max_tokens = 50
temperature = 0.7
top_p = 0.8
freq_penalty = 0.5
presence_penalty = 0.2
engine = "text-davinci-002"
n = 1

response = openai.Completion.create(
    engine=engine,
    prompt=prompt,
    max_tokens=max_tokens,
    temperature=temperature,
    top_p=top_p,
    frequency_penalty=freq_penalty,
    presence_penalty=presence_penalty,
    n=n,
)

print(response.choices[0].text)

Example 2

import openai

# Set the prompt to explain the concept of gravity
prompt = "Explain the concept of gravity."

# Configure various parameters
max_tokens = 60         # Limit the response to 60 tokens
temperature = 0.5       # Balance randomness and determinism
top_p = 0.7             # Sample from the top 70% of the probability mass
freq_penalty = 0.2      # Penalize frequently repeated tokens
presence_penalty = 0.3  # Encourage new topics over repeated ones
engine = "text-davinci-002"  # GPT-3 engine
n = 3                   # Request three different responses

# One API call per response; alternatively, pass n=n in a single call
# and read every element of response.choices
responses = []
for _ in range(n):
    response = openai.Completion.create(
        engine=engine,
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        frequency_penalty=freq_penalty,
        presence_penalty=presence_penalty
    )
    responses.append(response.choices[0].text)

for i, response in enumerate(responses):
    print(f"Response {i + 1}:\n{response}\n")

Example 3

import openai

# Set a detailed prompt about climate change and its impact
prompt = ("Compose an in-depth article on the causes and consequences "
          "of climate change, focusing on its effects on global ecosystems.")

# Configure various parameters
max_tokens = 600        # Allow for a substantial amount of text
temperature = 0.5       # Balance randomness and determinism
top_p = 0.8             # Sample from the top 80% of the probability mass
freq_penalty = 0.2      # Penalize repetitive phrases
presence_penalty = 0.3  # Encourage new topics over repeated ones
engine = "text-davinci-002"  # GPT-3 engine

# Generate a comprehensive article on climate change and its impact
response = openai.Completion.create(
    engine=engine,
    prompt=prompt,
    max_tokens=max_tokens,
    temperature=temperature,
    top_p=top_p,
    frequency_penalty=freq_penalty,
    presence_penalty=presence_penalty
)

generated_article = response.choices[0].text

# Print the generated article
print(generated_article)

By adjusting these parameters, you can customize the behavior of the language model so that its responses meet your specific requirements and preferences.

2. Prompt Engineering:

Crafting clear, specific prompts and questions can help guide the model to provide more accurate and contextually relevant responses. This can reduce the likelihood of hallucination. A few guidelines:

  • Specify the context: Providing context to the model can help it generate more accurate responses. For example, if you want the model to summarize a news article, provide the title and author of the article as part of the prompt, and introduce constraints as per your requirements.
  • Ask specific questions: Asking specific questions can help the model focus on generating factual information. For example, instead of asking “What do you think about climate change?”, ask “What are the causes of climate change?”.
  • Use structured prompts: Structured prompts can help guide the model towards generating more accurate responses. For example, if you want the model to generate a recipe, use a structured prompt that includes ingredients, cooking time, and instructions. In other words, give an outline of what you want generated.
  • Use adversarial examples: Adversarial examples are inputs specifically designed to cause an AI model to make a mistake. Using adversarial examples during training can help the model learn to avoid hallucinations. In prompting, give the model multiple examples by both asking a question and answering it yourself in the prompt.

Example 1

import openai

# Set the prompt with various prompt engineering techniques
prompt = """
Compose a well-researched article on the causes, consequences, and current scientific consensus regarding climate change.
Please ensure that the content is factual, unbiased, and backed by credible sources such as the latest reports from the Intergovernmental Panel on Climate Change (IPCC) and peer-reviewed studies.
Consider presenting different perspectives on climate change, including arguments for and against, to offer a balanced view.
Adhere to ethical guidelines, avoiding any language or content that may perpetuate stereotypes or biases.
The article should not exceed 800 words.
"""

# Configure OpenAI API for generating the article
engine = "text-davinci-002"
max_tokens = 800

# Generate the requested article
response = openai.Completion.create(
    engine=engine,
    prompt=prompt,
    max_tokens=max_tokens,
)

generated_article = response.choices[0].text

# Print the generated article
print(generated_article)

Example 2

import openai

# Multiple examples for sentiment analysis
prompt = """
Analyze the sentiment of the following customer reviews:

1. Positive Review: "I absolutely loved the product! It exceeded my expectations."
2. Negative Review: "The product was a total disappointment. I regret buying it."
3. Positive Review: "This is the best service I've ever received. Highly recommended."
4. Negative Review: "I had a terrible experience with their customer support. Avoid!"
5. Positive Review: "Outstanding quality and exceptional value for the price."
"""

# Configure the OpenAI API
engine = "text-davinci-002"
max_tokens = 150 # Limit the response length

# Generate sentiment analysis based on the provided examples
response = openai.Completion.create(
    engine=engine,
    prompt=prompt,
    max_tokens=max_tokens
)

# Print the sentiment analysis results
print(response.choices[0].text)

Example 3

import openai

# Multiple examples for text classification
prompt = """
Classify the following texts into categories: technology, sports, or food.

1. Technology: "The latest smartphone from XYZ Corporation features a powerful processor and a stunning display."
2. Sports: "In a thrilling match, the home team secured a 3-2 victory with a last-minute goal."
3. Food: "This restaurant serves exquisite dishes, with a diverse menu that includes both local and international cuisines."
4. Technology: "A breakthrough in quantum computing technology promises to revolutionize data processing."
5. Sports: "The tennis championship finals took place, with intense rallies and a nail-biting tie-breaker."
6. Food: "Explore the world of gourmet chocolate with these delectable, handcrafted truffles."

Text to Classify: "A new restaurant opened downtown, offering a fusion of global cuisines."
"""

# Configure the OpenAI API
engine = "text-davinci-002"
max_tokens = 150 # Limit the response length

# Generate the classification for the provided text
response = openai.Completion.create(
    engine=engine,
    prompt=prompt,
    max_tokens=max_tokens
)

# Print the classification result
print(response.choices[0].text)

3. Post-processing and User Awareness:

Implement human or automated review processes to filter out hallucinated content. Fact-checking or external verification can be part of this process.
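
As a simple sketch of what external verification might look like, the snippet below cross-checks a model-generated claim against the public Wikipedia REST summary endpoint using a naive word-overlap heuristic. The helper names (fetch_wikipedia_summary, verify_against_wikipedia), the topic, and the overlap threshold are illustrative assumptions, not part of any library.

import requests

def fetch_wikipedia_summary(title):
    # Fetch the introductory summary of a Wikipedia page (illustrative helper)
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title.replace(' ', '_')}"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.json().get("extract", "")

def verify_against_wikipedia(model_answer, topic, threshold=0.2):
    # Naive check: flag the answer if it shares too few content words with the reference text
    reference = fetch_wikipedia_summary(topic).lower()
    answer_words = {w.strip(".,") for w in model_answer.lower().split() if len(w) > 4}
    if not answer_words:
        return False
    overlap = sum(1 for w in answer_words if w in reference)
    return (overlap / len(answer_words)) >= threshold

# Review a model-generated claim before showing it to users
claim = "Albert Einstein developed the theory of relativity and won the Nobel Prize in Physics."
if verify_against_wikipedia(claim, "Albert Einstein"):
    print("Claim overlaps with the reference source.")
else:
    print("Claim has little support in the reference source; flag it for review.")

This word-overlap check is deliberately coarse; the point is the structure of the pipeline: generate, look up a reference, compare, and flag anything unsupported for human review.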

Make users aware that the model may generate incorrect information and encourage critical thinking. Users should not blindly trust the model’s responses.

You can introduce automated review and filtering techniques, such as AI moderation or custom filters:

1. AI Moderation:

To implement AI moderation, you can use OpenAI’s moderation endpoint to automatically flag content that may be harmful or that violates content guidelines.

Here’s a Python example of how to use OpenAI’s moderation endpoint:

import openai

# Define your OpenAI API key and the content to be moderated
api_key = "YOUR_OPENAI_API_KEY"
content = "Content to be moderated goes here."

# Configure the OpenAI API
openai.api_key = api_key

# Use OpenAI's moderation endpoint to detect problematic content
response = openai.Moderation.create(input=content)
result = response["results"][0]

# Check the moderation result and take action
if result["flagged"]:
    flagged_categories = [name for name, hit in result["categories"].items() if hit]
    print(f"Content flagged for: {', '.join(flagged_categories)}")
    # You can choose to filter out or review the content further
else:
    print("Content passed moderation.")

2. Custom Filter:

Implementing a custom filter involves creating your own filtering logic to identify and flag potential hallucinations based on specific criteria. Here’s a Python example of a simple custom filter:

import spacy

# Load the spaCy English NER model (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Custom filter function to check for unexpected named entities in a news article
def custom_news_filter(news_article):
    # Process the news article using spaCy for NER
    doc = nlp(news_article)

    # Define a list of expected named entities for a news article
    expected_entities = ["CNN", "New York Times", "World Health Organization", "Washington", "2022"]

    # Check for unexpected named entities
    unexpected_entities = [ent.text for ent in doc.ents if ent.text not in expected_entities]

    if unexpected_entities:
        return True  # Flag the article for review due to unexpected named entities

    return False  # Article is considered coherent

# Example news article to be filtered
news_article_to_filter = """
In a recent report by XYZ News, it was claimed that the Washington-based World Health Organization (WHO) predicts a dramatic economic shift in 2022.
This information, however, contradicts the New York Times article published last week.
"""

# Apply the custom news filter
if custom_news_filter(news_article_to_filter):
    print("News article flagged for review due to unexpected named entities.")
else:
    print("News article is considered coherent.")

4. Improving Training Data:

If you have access to the data the LLM is trained on, you can improve the data itself to get better responses. Use more diverse and reliable training data, with a focus on minimizing biases and inaccuracies. Datasets should be carefully curated.

Here are several strategies to enhance training data:

Curate High-Quality Data and Domain-Specific Data: If the LLM is used for domain-specific tasks, include domain-specific training data. This helps the model generate accurate content within a particular field.

Contemporary Data: Ensure that the training data is up-to-date. Outdated information can lead to inaccuracies in generated content. Regularly update the dataset to reflect current knowledge.

Negative Examples: Include negative examples that highlight incorrect or misleading information. This helps the model learn to differentiate between accurate and inaccurate content.

Disputed Content: Include content that is known to be disputed or controversial. Training on such data can help the model recognize nuanced or contentious topics.

Bias Mitigation: Be vigilant about mitigating biases in the training data. Use techniques to identify and reduce biases that may lead to hallucinations.

Regular Retraining: Periodically retrain the model with updated and improved training data. This helps the model adapt to changes in language and knowledge.

External Feedback Loops: Establish feedback mechanisms to receive input from users and external reviewers. They can help identify hallucinated content and provide corrections or clarifications.

Adversarial Testing: Conduct adversarial testing to challenge the model with deliberately misleading or hallucinated inputs. Use the results to identify weaknesses and improve the training data.
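
As a minimal illustration of adversarial testing, the sketch below runs a handful of deliberately misleading prompts through the model and checks each response for an expected key phrase. The prompts, the expected phrases, and the keyword check are illustrative assumptions rather than a complete evaluation harness.

import openai

# Deliberately misleading prompts paired with a phrase a correct answer should contain
adversarial_cases = [
    {"prompt": "Everyone knows World War II started in 1776, right?", "expected": "1939"},
    {"prompt": "Explain why water is made of carbon and nitrogen.", "expected": "hydrogen"},
    {"prompt": "Tell me the exact week the stock market will crash.", "expected": "cannot"},
]

engine = "text-davinci-002"

failures = []
for case in adversarial_cases:
    response = openai.Completion.create(
        engine=engine,
        prompt=case["prompt"],
        max_tokens=100,
        temperature=0,  # deterministic output makes regressions easier to spot
    )
    answer = response.choices[0].text
    # Naive keyword check: flag the case if the expected phrase is missing
    if case["expected"].lower() not in answer.lower():
        failures.append((case["prompt"], answer.strip()))

print(f"{len(failures)} of {len(adversarial_cases)} adversarial cases failed.")
for prompt, answer in failures:
    print(f"\nPrompt: {prompt}\nModel answer: {answer}")

The failing cases point to weaknesses that can be fed back into the training or fine-tuning data as corrected examples.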

Improving the training data is an ongoing process that requires continuous monitoring, feedback, and refinement. It is a critical step in enhancing the accuracy and reliability of LLM-generated content while reducing hallucinations.

It’s important to note that complete eradication of hallucination in language models is challenging, and there may still be instances where the models generate incorrect or biased content. Continuous research and development are essential to address these issues and improve the performance of large language models.

Please follow me for more such content on Python, Machine Learning and Generative AI.

Refer to other helpful articles

Machine Learning Interview Questions 2023
