Mathematically Evaluating Hallucinations in LLMs like GPT-4

Freedom Preetham
Published in Autonomous Agents
Mar 18


Mathematically evaluating hallucinations in Large Language Models (LLMs) like GPT-4 (used in ChatGPT Plus) is challenging because it requires quantifying the extent to which generated outputs diverge from the ground truth or contain unsupported information.

It is essential to note that even humans confabulate, hallucinate, or make things up when presented with a prompt, even when there is no intrinsic or extrinsic motive to lie. It’s almost like an innate feature (or bug) of all intelligent (or complex dynamical) systems.

I have written several past accounts of how LLMs hallucinate or break down on logical reasoning.

In this blog, I present a comprehensive guide to evaluating LLMs across many dimensions and provide insights into mathematical frameworks that can help explain and better understand hallucinations.

GPT-4 Limitations: Hallucinations

The GPT-4 research page states the following limitations:

Despite its capabilities, GPT-4 has similar limitations as earlier GPT models. Most importantly, it still is not fully reliable (it “hallucinates” facts and makes reasoning errors). Great care should be taken when using language model outputs, particularly in high-stakes contexts, with the exact protocol (such as human review, grounding with additional context, or avoiding high-stakes uses altogether) matching the needs of a specific use-case.

“While still a real issue, GPT-4 significantly reduces hallucinations relative to previous models (which have themselves been improving with each iteration). GPT-4 scores 40% higher than our latest GPT-3.5 on our internal adversarial factuality evaluations.”

What are Hallucinations?

Hallucinations in LLMs occur when they produce responses that do not accurately reflect the given context, are not supported by evidence, or deviate from the expected behavior based on their training data.

Here are some examples of hallucinations in LLM-generated outputs:

  1. Factual Inaccuracies: The LLM produces a statement that is factually incorrect.
  2. Unsupported Claims: The LLM generates a response that has no basis in the input or context.
  3. Nonsensical Statements: The LLM produces a response that doesn’t make sense or is unrelated to the context.
  4. Improbable Scenarios: The LLM generates a response that describes an implausible or highly unlikely event.

Before understanding the mathematical models, let’s understand the basic evaluation metrics for LLMs.

Technical Evaluation Metrics

Large Language Models (LLMs) are typically evaluated on a wide range of tasks, reflecting their ability to understand and generate natural language across diverse applications. While the specific evaluation metrics and tests may vary depending on the task, here are some common metrics and tests LLMs are often evaluated on:

Language Modeling:

  • Perplexity: Measures how well the model predicts the probability distribution of the given test data. Lower perplexity indicates a better language model.
  • Cross-Entropy Loss: Measures the average negative log-likelihood that the model’s predicted probability distribution assigns to the true data.
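As a concrete illustration of how these two metrics relate, here is a minimal sketch: perplexity is simply the exponential of the cross-entropy. The token probabilities below are invented for illustration.

```python
import math

def cross_entropy(token_probs):
    """Average negative log-likelihood of the observed tokens.

    token_probs: probabilities the model assigned to each token
    that actually occurred in the test data.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """Perplexity is the exponential of the cross-entropy."""
    return math.exp(cross_entropy(token_probs))

# A model that assigns higher probability to the observed tokens
# achieves lower cross-entropy, and therefore lower perplexity.
confident = [0.9, 0.8, 0.95, 0.85]
uncertain = [0.3, 0.2, 0.4, 0.25]
assert perplexity(confident) < perplexity(uncertain)
```

A perfect model that assigns probability 1.0 to every observed token has cross-entropy 0 and perplexity 1, the theoretical minimum.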

Text Classification and Sentiment Analysis:

  • Accuracy: The proportion of correctly classified instances out of the total instances.
  • Precision, Recall, and F1-score: Precision and recall capture the trade-off between false positives and false negatives, respectively; F1 is their harmonic mean.
  • AUC-ROC (Area Under the Receiver Operating Characteristic Curve): Measures the trade-off between true positive rate and false positive rate at various classification thresholds.
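A minimal sketch of how accuracy, precision, recall, and F1 are computed from the binary confusion counts (the labels below are invented examples):

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
acc, prec, rec, f1 = classification_metrics(y_true, y_pred)
# 3 true positives, 1 false positive, 1 false negative, 1 true negative
```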

Machine Translation:

  • BLEU (Bilingual Evaluation Understudy): Measures the similarity between the model-generated translations and reference translations by computing n-gram precision.
  • METEOR (Metric for Evaluation of Translation with Explicit Ordering): Considers n-gram matches and alignments between the translation and reference, including synonyms and stemming.
  • TER (Translation Edit Rate): Measures the number of edits (insertions, deletions, substitutions) required to transform the model-generated translation into the reference translation.
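The core quantity inside BLEU, clipped n-gram precision, can be sketched in a few lines. The sentences are invented examples; full BLEU additionally combines several n-gram orders and applies a brevity penalty.

```python
from collections import Counter

def ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision, the building block of BLEU.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference ("clipping"), so repeating a correct
    word cannot inflate the score.
    """
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref = Counter(tuple(reference[i:i + n])
                  for i in range(len(reference) - n + 1))
    clipped = sum(min(c, ref[g]) for g, c in Counter(cand).items())
    return clipped / len(cand) if cand else 0.0

reference = "the cat is on the mat".split()
candidate = "the the the cat mat".split()
# Without clipping, "the" x3 would count 3 matches; clipping caps it at 2,
# since "the" appears only twice in the reference.
p1 = ngram_precision(candidate, reference, n=1)
```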

Text Summarization:

  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation): A set of metrics (ROUGE-N, ROUGE-L, ROUGE-S) that measure the overlap of n-grams, longest common subsequences, and skip-bigrams between the generated summary and reference summaries.

Named Entity Recognition:

  • Precision, Recall, and F1-score: These metrics are used to evaluate named entity recognition tasks, considering exact matches of entity boundaries and entity types.

Question Answering:

  • F1-score: The harmonic mean of precision and recall, considering exact token matches between the model-generated answer and the reference answer.
  • EM (Exact Match): A binary metric that measures whether the model-generated answer exactly matches the reference answer.
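A minimal sketch of EM and token-level F1 as used in question answering. Normalization here is simplified to lowercasing and whitespace splitting; real evaluation scripts also strip punctuation and articles.

```python
from collections import Counter

def exact_match(prediction, reference):
    """EM: 1 if the normalized answers are identical, else 0."""
    return int(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction, reference):
    """Token-level F1 between predicted and reference answers."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

em = exact_match("Paris", "paris")           # identical after normalization
f1 = token_f1("the city of Paris", "Paris")  # partial credit for overlap
```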

Linguistics, Logic and Common Sense Reasoning

Large Language Models (LLMs) are often evaluated on a variety of linguistic and logic tasks to assess their ability to understand and reason about natural language. Some of the common linguistic and logic evaluations include:

  1. Pronoun Disambiguation: Pronoun disambiguation is a natural language processing task that involves determining the correct antecedent (the noun or noun phrase to which a pronoun refers) for a given pronoun in a sentence or text. Pronouns, such as “he,” “she,” “it,” “they,” “his,” “hers,” and “theirs,” are used to avoid repetition and maintain coherence in language. However, pronouns can be ambiguous, and understanding which noun or noun phrase they refer to is essential for proper interpretation of the text.
  2. Winograd Schema Challenge (WSC): Winograd Schema is similar to pronoun disambiguation but is a specific type of linguistic test that is designed to evaluate the common sense reasoning and natural language understanding capabilities of AI systems. It often involves pronoun disambiguation, but the primary focus of the test is to challenge AI systems with scenarios that require a deeper understanding of context and common sense knowledge. For example: “The city councilmen refused the demonstrators a permit because they feared violence.” The challenge in this sentence is to determine whether “they” refers to the city councilmen or the demonstrators.
  3. Textual Entailment: The task of determining if a given hypothesis can be inferred from a given premise. The model is evaluated based on its ability to classify the relationship between pairs of sentences as entailment, contradiction, or neutral.
  4. Semantic Role Labeling: This evaluation involves identifying the semantic roles (e.g., agent, patient, instrument) of words or phrases in a sentence. It requires understanding the predicate-argument structure and relationships between entities.
  5. Cloze Tasks: These tasks test the model’s ability to fill in missing information in a sentence or paragraph. They often involve predicting a missing word or phrase that completes the meaning of the text.
  6. Abductive Reasoning: This evaluation tests the model’s ability to generate the most plausible explanation for a given set of observations. It requires the model to reason about possible causes and effects, as well as background knowledge.
  7. Logical Reasoning: Tasks that involve evaluating a model’s ability to reason about logical relationships, such as syllogisms (e.g., “All A are B. All B are C. Therefore, all A are C.”) or mathematical word problems.
  8. Commonsense Reasoning: These evaluations assess the model’s ability to reason about everyday situations and make inferences based on general knowledge or common sense. Examples include the CommonsenseQA dataset and the CODAH dataset.
  9. Analogical Reasoning: This task requires the model to identify relationships between pairs of words or concepts and apply those relationships to a new pair of words or concepts. For example, given the analogy “man:king::woman:x,” the model should predict “x = queen.”
  10. Ambiguity Resolution: Evaluating the model’s ability to disambiguate words with multiple meanings based on the context in which they appear. For example, understanding that “bank” can refer to a financial institution or the side of a river, depending on the context.
  11. Temporal Reasoning: Assessing the model’s ability to reason about events and their order in time. This may involve understanding the sequence of events in a story or predicting the chronological order of historical events.
  12. Spatial Reasoning: Evaluating the model’s ability to reason about spatial relationships and understand descriptions of physical layouts, such as directions or the arrangement of objects in a scene.

Evaluations for Hallucinations

Evaluations for LLMs to ensure they do not hallucinate, i.e., generate plausible but incorrect or unsupported information, typically involve comparing generated outputs to ground truth data or using human judgments. Here are some evaluation methods to minimize hallucination:

  1. Fact-checking Evaluation: Compare the generated outputs to a knowledge base or a set of trusted sources to ensure that the facts generated by the model are accurate and supported by evidence.
  2. Groundedness Evaluation: Assess the model’s ability to generate outputs that are well-supported by the given context, input data, or a known knowledge base. This can involve creating evaluation datasets that specifically test the model’s ability to stick to facts and avoid producing information that is not grounded in the input or context.
  3. Reference-based Evaluation: For tasks like machine translation or text summarization, compare the model-generated output with one or more reference outputs created by humans or other trusted sources. Metrics like BLEU, ROUGE, and METEOR can help in these evaluations.
  4. Human Evaluation: Employ human evaluators to assess the quality, relevance, and correctness of the generated outputs. Humans can be asked to rate the generated outputs on various criteria, such as factuality, coherence, and relevance.
  5. Adversarial Evaluation: Create evaluation datasets with adversarial examples designed to challenge the model’s ability to avoid hallucination. These datasets can contain examples with subtle changes, incorrect information, or contradictions that may cause the model to generate incorrect outputs.
  6. Contrastive Evaluation: Present the model with a set of alternative completions or responses, where some options may include hallucinated information. Evaluate the model’s ability to select the correct or most plausible output among the alternatives.
  7. Counterfactual Evaluation: Generate alternative inputs by modifying the original input in various ways (e.g., negating a fact, changing an entity, or rephrasing a statement) and evaluate the model’s ability to maintain groundedness and safety across these alternative inputs.
  8. Negative Training Examples: During training, include examples with hallucinated information labeled as incorrect in the training data. This approach helps the model learn to avoid generating similar hallucinations during inference.
  9. Evaluation Metrics that Penalize Hallucination: Develop or use evaluation metrics that specifically penalize the model for generating hallucinated information. For example, metrics that consider the overlap between the generated output and the ground truth data may be more sensitive to hallucination.
  10. Fine-grained Evaluation: Break down the evaluation of generated outputs into smaller, more specific components to identify where hallucinations might occur. For example, in a question-answering task, evaluate the model’s ability to extract specific facts, reason about them, and provide accurate answers without introducing unsupported information.
  11. Safety Evaluation: Though this is not strictly part of hallucination evaluation, it is important to add safety checks to ensure the model does not cause harm. Here we assess the model’s ability to handle unsafe or harmful content, such as offensive language, misinformation, or biased outputs. This can involve evaluating the model on safety benchmark datasets, such as the RealToxicityPrompts dataset or the AI Incident Database, which contain examples that may trigger unsafe outputs.
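As a toy illustration of the first method, fact-checking against trusted sources, here is a sketch in which claims extracted from a model’s output are represented as (subject, relation, object) triples and checked against a small knowledge base. The knowledge base, the claims, and the triple representation are all invented for illustration; real fact-checking pipelines must first extract claims from free text and query far larger sources.

```python
# Invented toy knowledge base of trusted facts.
knowledge_base = {
    ("paris", "capital_of", "france"),
    ("water", "boils_at_celsius", "100"),
}

def factual_accuracy(claims):
    """Fraction of extracted claims supported by the knowledge base."""
    if not claims:
        return 1.0  # nothing asserted, nothing hallucinated
    supported = sum(1 for claim in claims if claim in knowledge_base)
    return supported / len(claims)

# Hypothetical claims extracted from a model's generated answer.
model_claims = [
    ("paris", "capital_of", "france"),    # supported
    ("water", "boils_at_celsius", "90"),  # hallucinated
]
score = factual_accuracy(model_claims)  # half the claims check out
```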

Mathematical Frameworks to Understand Hallucinations


Although hallucination in LLMs is an active area of research, and complex mathematical theories are still being developed to explain and control this phenomenon, some theoretical frameworks can provide insights into the underlying causes and potential mitigations. Here are a few of them:

  1. Overfitting and Memorization: Overfitting occurs when a model learns to fit the training data too closely, capturing noise instead of the underlying patterns. This can lead to hallucination in LLMs when they generate outputs that are not well-grounded in the input or context. Techniques such as dropout, weight decay, and early stopping can help mitigate overfitting and potentially reduce hallucination.
  2. Distribution Shift: Hallucination in LLMs can be partially attributed to the differences between the training data distribution and the test data distribution. When the model encounters inputs that are significantly different from the training data, it may hallucinate to generate outputs. Domain adaptation, transfer learning, and meta-learning are techniques that can help address distribution shift and mitigate hallucination.
  3. Maximum Likelihood Estimation (MLE) Bias: LLMs are typically trained using maximum likelihood estimation, which encourages models to assign high probability to observed data. However, this can lead to a bias towards generating outputs that are high probability under the training distribution, even if they are not grounded in the input or context. Techniques like Minimum Risk Training (MRT) or Reinforcement Learning from Human Feedback (RLHF) can help address MLE bias and potentially reduce hallucination.
  4. Model Uncertainty and Calibration: LLMs can sometimes generate hallucinated outputs with high confidence, even though they are incorrect or unsupported by evidence. Developing methods to estimate and calibrate model uncertainty can help identify cases where the model is likely to hallucinate and provide more reliable outputs. Bayesian modeling and temperature scaling are examples of approaches that can help estimate and calibrate model uncertainty.
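Temperature scaling, mentioned in the last point, can be sketched in a few lines: the logits are divided by a temperature T before the softmax, and a T > 1 (in practice fitted on a held-out validation set) softens overconfident predictions. The logits below are invented.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature parameter that rescales the logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.5]
overconfident = softmax(logits, temperature=1.0)
calibrated = softmax(logits, temperature=2.0)
# T > 1 softens the distribution: the top probability drops,
# reflecting lower, better-calibrated confidence.
assert max(calibrated) < max(overconfident)
```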

Developing mathematical theories to better understand and model hallucinations in Large Language Models (LLMs) is an ongoing research area. Some of the mathematical and theoretical frameworks that can potentially help in this direction include:

  1. Bayesian Modeling: Bayesian models provide a probabilistic framework for reasoning about uncertainty, which can be useful for modeling and controlling hallucinations. By incorporating prior knowledge about the data-generating process and updating beliefs based on observed data, Bayesian methods can potentially reduce the likelihood of generating hallucinated content.
  2. Information Theory: Information-theoretic concepts, such as mutual information and conditional entropy, can be used to measure the degree of dependence between generated outputs and input data. By encouraging models to maximize the mutual information between inputs and outputs, it might be possible to reduce hallucination.
  3. Causal Inference: Causal reasoning provides a framework for understanding the relationships between variables, which can help identify when a generated output is not causally grounded in the input. By incorporating causal models into LLMs, it might be possible to better understand and control hallucination.
  4. Game-theoretic Adversarial Training: Adversarial training is a technique that involves training a model in the presence of adversarial examples. This approach can be used to encourage LLMs to generate outputs that are more robust to perturbations in the input data and less likely to hallucinate. Game-theoretic concepts can be employed to develop adversarial training methods that specifically target hallucination.
  5. Regularization Techniques: Regularization methods add constraints or penalties to the model’s objective function to encourage desired properties in the learned model. For instance, incorporating penalties that discourage divergence from the input data or encourage outputs to be well-grounded in the training data might help reduce hallucination.
  6. Explainable AI (XAI): Explainable AI techniques aim to make model predictions more understandable and interpretable. By developing methods that can provide explanations for the generated outputs of LLMs, it might be possible to identify and mitigate cases of hallucination.
  7. Graph Theory: Graph-based representations of language can help capture complex relationships between entities and concepts in a more structured way. By incorporating graph-based reasoning into LLMs, it might be possible to better model groundedness and reduce the likelihood of hallucination.
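As a concrete illustration of the information-theoretic idea above (item 2), here is a plug-in estimate of mutual information for discrete variables, in nats. The paired observations are invented; applying this to real input and output token streams would require further design choices.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in nats from observed (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x,y) / (p(x) * p(y)) expressed with raw counts.
        mi += p_xy * math.log(p_xy * n * n / (px[x] * py[y]))
    return mi

# Perfectly dependent variables carry maximal information about each
# other; independent ones carry none.
dependent = [(0, 0), (1, 1)] * 50
independent = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25
assert mutual_information(dependent) > mutual_information(independent)
```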

Borrowing from other Probabilistic Domains


I have been part of some conversations that involve borrowing from Copula Theory and Extreme Value Theory. While these are not directly applicable to curbing hallucinations in LLMs, they can inspire ideas for developing new methods.

Copula Theory deals with modeling dependencies between random variables, while Extreme Value Theory focuses on modeling the tails of distributions and rare events. Here are some ways in which these ideas could potentially be adapted for LLMs:

Modeling Dependencies: Copula Theory can inspire the development of methods that explicitly model dependencies between input and output tokens in LLMs. By better capturing the relationship between input and output tokens, it may be possible to encourage the model to generate outputs that are more grounded in the input, reducing hallucination.

For example, one could develop a modified training objective that incorporates a term measuring the dependency between the input and generated tokens, such as mutual information or some other measure inspired by Copula Theory. By optimizing this new objective, the model may learn to generate outputs that are more closely tied to the input and less likely to hallucinate.
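A deliberately simplified sketch of such a modified objective: the usual cross-entropy is combined with a bonus term for input-output dependence. The weight `lam` and the scalar values are invented, and in a real system the dependence term would itself have to be estimated differentiably during training.

```python
def regularized_loss(cross_entropy, input_output_mi, lam=0.1):
    """Hypothetical training objective: cross-entropy minus a bonus
    for input-output dependence (e.g. a mutual-information estimate).
    Minimizing this loss rewards outputs that are grounded in the
    input. `lam` is an invented hyperparameter controlling the
    strength of the grounding bonus.
    """
    return cross_entropy - lam * input_output_mi

# A generation well-grounded in the input (high dependence) achieves a
# lower loss than an ungrounded one with the same cross-entropy.
grounded = regularized_loss(cross_entropy=2.0, input_output_mi=1.5)
ungrounded = regularized_loss(cross_entropy=2.0, input_output_mi=0.1)
assert grounded < ungrounded
```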

Modeling Tail Behavior: Extreme Value Theory focuses on the tails of distributions, where rare events occur. Hallucinations in LLMs can be seen as a kind of rare event, where the model generates an output that deviates significantly from the expected behavior.

One possible approach inspired by Extreme Value Theory is to create a training objective that penalizes the model for generating extreme or unlikely outputs. By developing a measure of extremeness for LLM-generated outputs, it may be possible to encourage the model to avoid generating hallucinations by penalizing these extreme cases.

Another possibility is to create an adversarial training dataset, where the input-output pairs are designed to challenge the model’s ability to avoid hallucination. The model could then be fine-tuned on this adversarial dataset, with the goal of improving its robustness to hallucination.

While these ideas are inspired by Copula Theory and Extreme Value Theory, it’s important to note that they are not direct applications of these theories. Adapting these concepts to LLMs requires further research and development, as well as rigorous evaluation to determine their effectiveness in curbing hallucinations.


It is important to note that these methods provide only indirect or proxy measures of hallucination, as quantifying hallucination in LLMs is a complex and open research problem. Combining multiple evaluation methods, mathematical modeling, and human judgment can help obtain a more comprehensive assessment of hallucination in LLMs like ChatGPT.