LLMs Safety Essentials: Hallucinations

Ahmad Alismail
10 min read · Dec 24, 2023



In today’s rapidly evolving tech landscape, the safety and quality of Large Language Model (LLM) applications have become paramount. This article is the first in a series that aims to navigate the complexities of LLM Apps Safety and Quality. We will examine various methods for monitoring LLM systems, focusing on the detection of hallucinations, data leakage, toxicity, refusals, and prompt injections. Additionally, we will discuss strategies for developing robust monitoring systems to continuously assess the safety and security of apps. Join us on this journey to deepen your understanding and enhance the integrity of LLM applications.

Acknowledgements: the main inspiration for this article comes from the course Quality and Safety for LLM Applications by DeepLearning.AI. All other references are listed at the end of the article.

As we embark on this exploration, our first stop is the complex world of hallucination detection in LLMs.

Table of Contents:

1. Hallucinations: Background

2. Prompt-Response Comparison
2.1. BLEU Score
2.2. BERT Score

3. Response-Response Comparison
3.1. Sentence Embeddings Cosine Distance
3.2. LLM Self-Evaluation

4. Conclusion

1. Hallucinations: Background

Hallucination in the context of Large Language Models (LLMs) presents a unique and pressing challenge. It occurs when an LLM generates content that is fictional, misleading, or unrelated to the question that was asked. This issue stems from the model’s ability to create plausible-sounding text based on patterns learned from its training data, regardless of the content’s alignment with reality.

The occurrence of hallucinations can be unintentional, often resulting from factors such as biases in training data, the model’s lack of access to current information, or its inherent limitations in understanding and generating contextually accurate responses.

In applications where factual accuracy is critical — such as in journalism, healthcare, and legal sectors — addressing hallucinations is of utmost importance. Researchers and developers are actively seeking methods to mitigate these inaccuracies.

To detect hallucinated outputs in the form of irrelevant responses, we can employ two different techniques:

1. Prompt-Response Comparison: Measure the similarity between the prompt and the response generated by the LLM. If the response is not relevant to the prompt, it might represent a hallucination.

2. Response-Response Comparison for a Given Prompt: Here, the focus is on comparing different responses generated for the same prompt.

> Note: Semantic similarity is related to, but not the same as, relevance. A response may be semantically similar to the prompt because it uses many related words, yet still fail to answer the question directly. Such a response could be considered irrelevant even though it looks similar to the prompt, as the figure below shows.

Source: Quality and Safety for LLM Applications by Deeplearning.ai

Various metrics can be utilized for these comparisons, including BLEU, BERT, Sentence Embedding, and LLM Self-Similarity.

Metrics for Prompt-Response and Response-Response comparison. Source: Quality and Safety for LLM Applications by Deeplearning.ai

Moving forward, we will now illustrate the discussed concepts through a practical example:

Full code available on my GitHub here.

2. Prompt-Response Comparison:

2.1. BLEU Score:

In prompt-response comparisons, the BLEU score gauges how closely a machine-generated response matches the original prompt. It checks for the presence of tokens (unigrams, bigrams, n-grams) from the prompt in the response.

A higher score indicates that the response accurately reflects the prompt’s content, while a lower score suggests the response may have missed the mark or veered off topic.

import helpers
# Hugging Face evaluate library
import evaluate
import pandas as pd

pd.set_option('display.max_colwidth', None)
chats = pd.read_csv("./data/chats.csv")

bleu = evaluate.load("bleu")

# Inspect the row we are about to score
chats[2:3]

# BLEU score of the response against the prompt, up to bigrams
bleu.compute(predictions=[chats.loc[2, "response"]],
             references=[chats.loc[2, "prompt"]],
             max_order=2)
{'bleu': 0.05872202195147035,
'precisions': [0.1, 0.034482758620689655],
'brevity_penalty': 1.0,
'length_ratio': 6.0,
'translation_length': 30,
'reference_length': 5}

We have seen how to calculate a single BLEU score; let’s now turn it into a metric. We need to import a specific decorator function from whylogs, which registers a function as a new metric to use in whylogs. Our function here will be bleu_score, and its output will be a list of scores for the data we see.

Here’s the Python code to implement this:

from whylogs.experimental.core.udf_schema import register_dataset_udf

@register_dataset_udf(["prompt", "response"],
                      "response.bleu_score_to_prompt")
def bleu_score(text):
    scores = []
    # Score each response against its prompt, as in the single example above
    for prompt_text, response_text in zip(text["prompt"], text["response"]):
        scores.append(
            bleu.compute(
                predictions=[response_text],
                references=[prompt_text],
                max_order=2
            )["bleu"]
        )
    return scores

We’ve created a new metric. Let’s now proceed to visualize this metric using the helper functions:

helpers.visualize_langkit_metric(
    chats,
    "response.bleu_score_to_prompt",  # must match the metric name used in the decorator
    numeric=True)

The BLEU scores are heavily tailed: in our case, most scores are quite low, with only a few approaching 0.5. Now, let’s examine the examples with the lowest BLEU scores:

helpers.show_langkit_critical_queries(
    chats,
    "response.bleu_score_to_prompt",
    ascending=True)

The BLEU score focuses on comparing the exact text of the tokens. Now let’s conduct a similar exercise with the BERT score, which uses embeddings to find semantic matches between words.

BLEU score vs. BERT score. Source: Quality and Safety for LLM Applications by Deeplearning.ai

2.2. BERT Score

BERT Score utilizes pre-trained contextual embeddings from BERT, comparing words in the prompt and response based on the pairwise cosine similarity (i.e., each word in our prompt is compared to each word in our response).

The aggregation also differs from BLEU: instead of counting exact n-gram matches, each token is greedily matched to the most similar token in the other text, and these maximum similarities are averaged into precision, recall, and F1 scores, often with importance weighting.

Illustration of the computation of BERT score. Source: BERTSCORE: EVALUATING TEXT GENERATION WITH BERT
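To make this greedy matching concrete, here is a minimal sketch of the core computation (not the implementation used by the evaluate library, and without importance weighting); the function name and array shapes are illustrative assumptions:

import numpy as np

def bert_score_from_embeddings(cand_emb, ref_emb):
    # cand_emb, ref_emb: (num_tokens, dim) arrays of contextual token
    # embeddings, assumed to be precomputed with a BERT-style model.
    # Normalize rows so that dot products become cosine similarities.
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)

    sim = cand @ ref.T                  # pairwise cosine similarity matrix
    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1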

We load the BERT score module and then we can call it with a prompt and response:

bertscore = evaluate.load("bertscore")

bertscore.compute(
    predictions=[chats.loc[2, "prompt"]],
    references=[chats.loc[2, "response"]],
    model_type="distilbert-base-uncased")
 {'precision': [0.8160363435745239],
'recall': [0.7124581336975098],
'f1': [0.7607377767562866],
'hashcode': 'distilbert-base-uncased_L5_no-idf_version=0.3.12(hug_trans=4.35.2)'}

High BERT scores indicate a response is semantically similar to the prompt, suggesting that the model has produced a relevant and contextually appropriate reply. Conversely, low BERT scores indicate that the response is not semantically similar to the prompt, suggesting that the model has produced an irrelevant and contextually inappropriate reply.

Let’s go ahead and create a new metric for BERT scores. First, we’ll add our decorator, then our new BERT score function, and we’ll make sure to return the list of F1 scores as our metric.

The BERT score function takes in the lists of predictions and references differently than the BLEU score does: we can pass whole columns at once instead of looping over rows.

@register_dataset_udf(["prompt", "response"], "response.bert_score_to_prompt")
def bert_score(text):
    return bertscore.compute(
        predictions=text["prompt"].to_numpy(),
        references=text["response"].to_numpy(),
        model_type="distilbert-base-uncased"
    )["f1"]

# Visualize this new metric
helpers.visualize_langkit_metric(
    chats,
    "response.bert_score_to_prompt",
    numeric=True)

You can see here that the BERT score distribution looks quite different from the BLEU score distribution.

This one looks much more like a bell curve, with the highest frequency values being in the middle.

Let’s look at some of the queries that give us low BERT scores:
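In the notebook this is done with the same helper used earlier for the BLEU scores; here is a sketch of that call, assuming it accepts our registered BERT metric in the same way:

helpers.show_langkit_critical_queries(
    chats,
    "response.bert_score_to_prompt",  # metric registered above
    ascending=True)                   # lowest scores first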

A low BERT score indicates that the response is not semantically similar to the prompt, suggesting that the response might represent a hallucination.

However, we can see a couple of flaws with using the BERT score for finding hallucinations. Looking at the first row above, the response is still valid although it is not semantically similar to the prompt.

Now, let’s check out the evaluation for our BERT score metric. We’ll use the udf_schema from whylogs to capture all of the metrics we have created and registered as UDFs:

from whylogs.experimental.core.udf_schema import udf_schema

annotated_chats, _ = udf_schema().apply_udfs(chats)

annotated_chats.head(2)

Using the helper functions, we can check whether our metrics were able to capture the hallucinations in our data, given a threshold of 0.75:

Remember that a low BERT score indicates that the response is not semantically similar to the prompt, suggesting that the response might represent a hallucination.

helpers.evaluate_examples(
    annotated_chats[annotated_chats["response.bert_score_to_prompt"] <= 0.75],
    scope="hallucination")

Using the BERT score with a threshold of 0.75, we can see that we have not captured any hallucinations in our data. Let’s use a lower threshold of 0.6:

helpers.evaluate_examples(
    annotated_chats[annotated_chats["response.bert_score_to_prompt"] <= 0.6],
    scope="hallucination")

Although we still did not capture all of the easier or more advanced hallucinations, the BERT score incorrectly flagged at most two valid responses as hallucinations. In other words, the system worked better than before, with minimal errors in flagging valid responses as hallucinations.

Moving on, we’ll shift from comparing the prompt-response pair to evaluating various responses from an LLM to the same prompt. This approach gained recognition from the SelfCheckGPT paper, which compares a single response to multiple others using several metrics, including BLEU and BERT scores.

3. Response-Response Comparison

3.1. Sentence Embeddings Cosine Distance:

In response self-comparison, we evaluate the reliability of an LLM’s responses by comparing the sentence embeddings of different responses to the same prompt for semantic consistency. This helps identify, and weed out, responses that may be misleading or incorrect:

Source: Quality and Safety for LLM Applications by Deeplearning.ai

To use this multiple response paradigm, we need to download some new data:

chats_extended = pd.read_csv("./data/chats_extended.csv")
chats_extended.head(2)

For this metric, we’ll compute the cosine distance between sentence embeddings of our responses. To calculate these embeddings, we’ll use a model from the sentence-transformers package:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Get sentence embedding
model.encode("This is a sentence to encode.")
 array([ 1.96422078e-02,  5.68652041e-02, -2.34455187e-02,  9.43348836e-03,
-4.11827825e-02, 3.55802514e-02, 7.15688244e-03, -7.61956945e-02,
2.62088589e-02, -3.05646472e-02, 5.38816266e-02, -3.52195464e-02....

To compare two embeddings, we’ll compute the cosine similarity using the pairwise_cos_sim utility function from the sentence-transformers package. Then we’ll add our decorator, comparing the original response with the two additional responses (response2 and response3), and register a metric called response.sentence_embedding_selfsimilarity:

from sentence_transformers.util import pairwise_cos_sim

@register_dataset_udf(["response", "response2", "response3"],
                      "response.sentence_embedding_selfsimilarity")
def sentence_embedding_selfsimilarity(text):
    response_embeddings = model.encode(text["response"].to_numpy())
    response2_embeddings = model.encode(text["response2"].to_numpy())
    response3_embeddings = model.encode(text["response3"].to_numpy())

    # Compare the original response to each of the new responses
    cos_sim_with_response2 = pairwise_cos_sim(
        response_embeddings, response2_embeddings
    )
    cos_sim_with_response3 = pairwise_cos_sim(
        response_embeddings, response3_embeddings
    )
    # Average the two scores
    return (cos_sim_with_response2 + cos_sim_with_response3) / 2

# Show self-similarity scores
sentence_embedding_selfsimilarity(chats_extended)
tensor([0.8013, 0.8560, 0.9625, 1.0000, 1.0000, 0.9782, 0.9865, 0.9120, 0.7757,
0.8061, 0.8952, 0.5663, 0.8726, 0.9194, 0.7059, 0.8018, 0.7968, 0.7786,
0.8699, 0.8510, 0.7966, 0.3910, 0.9413, 0.2194, 0.7589, 0.5235, 0.8022,
0.8541, 0.7416, 0.7622, 0.9660, 0.8943, 0.9103, 0.8404, 0.9034, 0.9181,
0.3976, 0.8086, 0.7563, 0.2019, 0.8313, 0.9141, 0.7838, 0.7083, 0.1625,
0.6854, 0.5801, 0.6107, 0.9375, 0.8514, 0.1297, 0.7228, 0.9454, 0.9441,
0.7593, 0.7788, 0.8971, 0.9896, 0.9128, 0.9158, 0.9337, 0.5688, 0.6978,
0.8412, 0.9177, 0.9533, 0.0768, 0.8114])

helpers.visualize_langkit_metric(
    chats_extended,
    "response.sentence_embedding_selfsimilarity",
    numeric=True)

Most of the scores range between 0.7 and 1, indicating that the responses are semantically similar to each other. This suggests a low incidence of hallucinations in our dataset.

The few low scores on the left, however, could correspond to true hallucinations. Let’s check out the examples with the lowest self-similarity scores:
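One way to surface them is with the same helper we used earlier, assuming it works with the extended dataset and the new metric name:

helpers.show_langkit_critical_queries(
    chats_extended,
    "response.sentence_embedding_selfsimilarity",
    ascending=True)  # lowest self-similarity first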

3.2. LLM Self-Evaluation:

Let’s allow the LLM to assess its own output. Rather than applying a formula or another model, we’ll feed the three responses back into an LLM, either the original one or a different model dedicated to comparison.

Source: Quality and Safety for LLM Applications by Deeplearning.ai

First, we’ll see how to prompt the LLM for the similarity metric:

import openai

def prompt_single_llm_selfsimilarity(dataset, index):
    return openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "system",
            "content": f"""You will be provided with a text passage \
and your task is to rate the consistency of that text to \
that of the provided context. Your answer must be only \
a number between 0.0 and 1.0 rounded to the nearest two \
decimal places where 0.0 represents no consistency and \
1.0 represents perfect consistency and similarity. \n\n \
Text passage: {dataset['response'][index]}. \n\n \
Context: {dataset['response2'][index]} \n\n \
{dataset['response3'][index]}."""
        }]
    )

# Run for a single example
prompt_single_llm_selfsimilarity(chats_extended, 0)
ChatCompletion(id='chatcmpl-8SjgTV0D3cQfKfhR2Mh0SsaTNAI89', 
choices=[Choice(finish_reason='stop', index=0,
message=ChatCompletionMessage(content='1.00',
role='assistant',
function_call=None,
tool_calls=None))],
created=1701859485,
model='gpt-3.5-turbo-0613',
object='chat.completion',
system_fingerprint=None,
usage=CompletionUsage(completion_tokens=3, prompt_tokens=130, total_tokens=133))

The model returned 1.00 for the first row (content='1.00'), indicating perfect consistency between the responses. Let’s check out these responses:
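A quick way to inspect them, using the columns of the extended dataset shown earlier:

# Prompt, original response, and the two extra responses for the first row
chats_extended.loc[0, ["prompt", "response", "response2", "response3"]]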

We can filter to look at self-similarity scores that are less than 0.8:
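One way to do this, sketched under the assumption that we score every row with the prompt above and parse the model’s reply as a float; the column name response.prompted_selfsimilarity is hypothetical:

# Ask the LLM to rate each row (one API call per row), then filter.
llm_scores = []
for i in range(len(chats_extended)):
    completion = prompt_single_llm_selfsimilarity(chats_extended, i)
    llm_scores.append(float(completion.choices[0].message.content))

# "response.prompted_selfsimilarity" is a hypothetical column name
chats_extended["response.prompted_selfsimilarity"] = llm_scores
chats_extended[chats_extended["response.prompted_selfsimilarity"] < 0.8]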

The final example illustrates a hallucination well. We requested a translation of Python code into the fictional programming language ‘Parker’. One response was a refusal, stating it couldn’t provide the translation. However, other responses presented code, which varied greatly since the language is nonexistent. The self-similarity score for these varied responses was 0.00, which is appropriate given the context.

Concluding Insights: Hallucination Detection in LLM Applications

We’ve now explored four metrics for detecting hallucinations in LLM applications, a critical aspect of our series on LLM Apps Safety and Quality. Hallucination detection is an active and important area of research, underpinning the reliability of these systems. This marks the end of the first article in our series. In the forthcoming article, we will delve into issues of data leakage and toxicity. Stay tuned.


Ahmad Alismail

Data Scientist on a lifelong learning journey. Passionate about democratizing AI via open-source. Deeply interested in AI safety and security.