Analysis of Severe Hallucinations In the Context of AI-Generated Clinical Working Summaries: Introducing A Novel Classification Framework

By Daniel Kreitzberg, PhD and Ruben Amarasingham, MD

Introduction

As AI models are increasingly deployed in healthcare, one of the critical challenges is ensuring the accuracy and reliability of AI-generated clinical summaries. Large language models (LLMs), such as Generative Pre-trained Transformer (GPT) models, are state-of-the-art AI models that are especially useful for processing textual information.[1] In lay terms, LLMs are trained on very large amounts of textual data, from which the model learns connections between concepts. These learned connections (parameters) allow the model to understand natural language and perform abstractive summarization (summarizing the important components of a text rather than extracting chunks of it verbatim). Despite their capabilities, these models have important limitations. In this blog post we turn our attention to one error type, “hallucinations,” and to how Pieces classifies and grades this type of error in the context of the Pieces working summary.

Hallucinations occur when an AI model generates new information that is inaccurate or unrelated to the question at hand or to the known source information.[2] The reasons LLMs generate hallucinations are multifaceted, but at the simplest level these errors occur because the model is predicting what the correct output should be rather than knowing and representing the output definitively. It does its best to construct an answer from its trained knowledge (i.e., the LLM overweights its trained knowledge relative to the information input into the model).

It’s important to acknowledge that human errors in clinical documentation are also a significant issue. For example, one study of over 22,000 patients found that 21.2% (n=4,830) reported an error in their clinical documentation, and 42.3% (n=2,043) of these patients reported the error was either somewhat or very serious.[3] Given Pieces’ use of LLMs for clinical hand-off summarization, the goal is to present clinicians with accurate and concise summaries that give a snapshot of the status of each patient at any given time. We are therefore particularly concerned with hallucinations, as they have the potential to break our clients’ trust and, worse, cause harm to hospital patients. To address this, Pieces clinical experts developed a robust classification system used by SafeRead’s board-certified physician reviewers to identify and mitigate hallucinations in Pieces working summaries. This blog post delves into the why, what, and how of our classification system, providing a comprehensive understanding of our methodology and the confidence we have in the Pieces AI severe hallucination rate.

“Trabeculae” | Abstract Pencil & Watercolor. Brushes: old brush, oil paint, and master oil 2019

1. State of the Literature

The AI community has long recognized hallucinations as a significant challenge.[4,5,6] Researchers have worked on classifying AI-generated hallucinations, distinguishing intrinsic hallucinations, which inaccurately summarize information that can be verified within the input text, from extrinsic hallucinations, which cannot be verified against the source text at all.[1] Furthermore, researchers have applied domain-specific definitions of hallucination severity. For example, a meteorologist reviewed AI-generated hallucinations within summaries of weather pattern data and categorized each as either basic (extrinsic) or spatial (intrinsic hallucinations related to geographical information), and also rated its severity.[7]

While there is extensive research on AI-generated hallucinations in general, there has been limited work on classifying these hallucinations in clinical contexts, specifically for generating hand-off summaries from clinical notes within electronic medical record systems. Pieces clinical experts therefore developed a new classification system that categorizes hallucinations by the possible impact on the patient if their care team believes and responds to the hallucinated content. Of course, our classification system is not the definitive framework for classifying hallucinations in AI-generated clinical text, and we welcome others to comment on and improve upon our system.

2. Description of the Classification System

How we define ‘Hallucinations’ and severity categories for SafeRead physician reviewers

We define hallucinations specific to the Pieces AI system’s hand-off summary generation use case as “new and non-factual information that is not part of the patient’s medical history.” This definition was developed by the Pieces clinical and AI teams because it is pertinent to the content the Pieces AI system summarizes: clinical notes. Further, this definition is actionable for our expert SafeRead reviewers as they validate the AI-generated summaries against the content of the clinical notes. Additionally, we ask SafeRead reviewers to indicate the level of severity of the hallucination by choosing one of four severity levels (what we call the “Pieces Hallucination Classification”):

  1. Severe: “New and non-factual information presented that if treated as fact would require emergent (less than 60 min.) AND potentially irreversible interventions that would not otherwise be justified”
  2. Major: “New and non-factual information presented that if treated as fact would require urgent (within the next 1 to 24 hours) but not emergent actions to prevent serious harm to the patient that would not otherwise be justified”
  3. Moderate: “New and non-factual information presented or details wrongly emphasized that if treated as fact would require actions in 1–2 days”
  4. Minor: “New and non-factual information presented or details wrongly emphasized that if treated as fact would still not require any treatment”

The severity categories for hallucinations incorporate a temporal component to reflect the potential urgency of treatment associated with the hallucinated content. In severe cases, the need for emergent intervention within 60 minutes may prevent physicians from verifying the information before acting, which increases the risk of irreversible harm to the patient. Major hallucinations involve conditions that require action or treatment within 1 to 24 hours, which may give clinical teams sufficient time to verify the condition mentioned in the Pieces working summary. These severity categories ensure that hallucinations are assessed and addressed according to their real-world clinical urgency and potential impact on patients.
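
To make the taxonomy concrete, here is a minimal sketch of how the Pieces Hallucination Classification could be represented in code. The class and field names are our own illustration for this post, not part of the Pieces codebase.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Severity(Enum):
    SEVERE = "severe"      # emergent action needed in < 60 minutes
    MAJOR = "major"        # urgent action needed within 1 to 24 hours
    MODERATE = "moderate"  # action needed within 1 to 2 days
    MINOR = "minor"        # no treatment would be required

@dataclass
class HallucinationFinding:
    """A single hallucination annotated by a SafeRead physician reviewer (illustrative)."""
    summary_id: str                       # identifier of the reviewed working summary
    hallucinated_text: str                # the new, non-factual statement
    severity: Severity                    # Pieces Hallucination Classification level
    action_window_hours: Optional[float]  # e.g. 1.0 for severe, 24.0 for major
    reviewer_note: str = ""               # free-text justification from the reviewer
```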

3. SafeRead Review Flagging System

We utilize a combination of adversarial AI, stratified sampling, and client/end user feedback to flag summaries for SafeRead hallucination reviews. Specifically:

  1. Adversarial AI: Pieces AI and clinical experts developed an adversarial AI system to flag summaries for possible hallucinations. This system reviews 100% of AI-generated summaries and searches for mentions of any of hundreds of clinical conditions that require emergent treatment (in line with our definition of “severe” hallucinations). If one of these clinical conditions is detected, the system reviews the patient’s clinical documentation for supporting evidence that the condition is not hallucinated. If the adversarial AI system finds insufficient evidence to support the condition in question, the summary is flagged for review by Pieces SafeRead physicians. The summary is withheld from the client until it has either been reviewed or new information confirms that the condition was not hallucinated. (A simplified sketch of this flagging flow appears after this list.)
  2. Stratified Sampling: This method involves flagging summaries for SafeRead review using distinct strata based on units, service lines, and length of stay. This approach differs from simple random sampling by addressing specific characteristics and variations within the patient population, thus enhancing the reliability and validity of our assessments across different clinical contexts.
  3. Client/End User Feedback: The Pieces client success team gathers direct feedback from clients during routine performance reviews with hospital clinical personnel. This feedback includes any concerns or issues related to the summaries generated by our AI system, including whether any severe hallucinations are present. Some clients allow end users to edit Pieces working summaries; these edits are tracked by Pieces staff and reviewed for potential severe hallucinations.
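
The sketch below is a minimal, hypothetical rendering of the adversarial flagging logic described in item 1. The function names, the condition lexicon, and the keyword-based evidence check are illustrative placeholders, not the actual Pieces implementation, which uses a far richer condition set and evidence model.

```python
# Hypothetical sketch of adversarial-AI flagging (not the actual Pieces implementation).
EMERGENT_CONDITIONS = {"pulmonary embolism", "septic shock", "stroke"}  # illustrative subset

def find_emergent_mentions(summary_text: str) -> set[str]:
    """Return emergent conditions mentioned in the summary (naive keyword match here)."""
    text = summary_text.lower()
    return {cond for cond in EMERGENT_CONDITIONS if cond in text}

def has_supporting_evidence(condition: str, clinical_notes: list[str]) -> bool:
    """Placeholder for the evidence check against the patient's documentation.
    In practice this would be a more robust retrieval / verification step."""
    return any(condition in note.lower() for note in clinical_notes)

def should_flag_for_saferead(summary_text: str, clinical_notes: list[str]) -> bool:
    """Flag (and withhold) the summary if an emergent condition lacks supporting evidence."""
    for condition in find_emergent_mentions(summary_text):
        if not has_supporting_evidence(condition, clinical_notes):
            return True  # withhold summary and route to SafeRead physician review
    return False
```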

4. Preventive Measures to Protect Against Hallucinations

In addition to monitoring for hallucinations within the AI-generated working summaries, the Pieces AI team has designed the system to prevent hallucinations in the first place. Specifically, the team implements several prevention strategies, including grounding the data, limiting the context window, output concision, and using adversarial AI systems to flag potential issues.

  • Grounding the Clinical Data: Grounding involves anchoring the model’s responses in internally derived, verifiable sources of information, such as verified clinical documents and clinical knowledge graphs. By referencing these sources, the Pieces working summary AI system can provide responses that are more accurate and factually consistent, reducing the chances of the model generating false or fabricated information and thereby substantially lowering the rate of severe hallucinations. Grounded data also provides the necessary context for the model to understand and respond accurately to input statements, generating clinical summaries that are relevant and contextually appropriate.
  • Limiting the Context Window of the Clinical Input Data: Limiting the context window means the AI system processes a smaller, more relevant portion of the health record data at any given time. This focused processing helps reduce the complexity and noise that can lead to hallucinations. By processing a smaller context, the model generates more precise and relevant outputs, reducing the likelihood of severe hallucinations. Additionally, smaller context windows make the model’s processing more efficient, reducing computational load and allowing quicker and more accurate generation of hand-off summaries. (A rough sketch of this idea appears after this list.)
  • Output Concision: The Working Summary produced by Pieces adheres to the principle of “output concision,” typically ranging from 100 to 200 words, which is consistent with handoff summaries used by clinicians. This brevity significantly reduces the likelihood of hallucinations, because the chance of hallucinating tends to grow with the length of the response. Research in natural language processing (NLP) and AI has shown that longer text generations are more prone to errors, inconsistencies, and hallucinations.[8] By focusing on providing succinct and relevant information, we minimize the opportunity for the AI to generate hallucinations, further ensuring the reliability and accuracy of the summaries provided to healthcare professionals.
  • Continuous Improvement: We continually refine the Pieces AI pipeline based on direct client feedback, SafeRead reviewer annotations, and end user edits. SafeRead reviewer annotations regularly undergo a secondary review by Pieces clinical and AI experts to dissect and locate the origin of errors found in the AI output. For example, SafeRead reviewers found that some pediatric summaries did not include past medical history. After thorough review, the Pieces clinical and AI experts determined that these patients had been misclassified as neonatal patients, a population that typically does not have a past medical history. The AI team remedied the misclassification, and the error did not recur.
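
As a rough illustration of the context-window limiting described above, the sketch below selects only the most recent, relevant note sections that fit within a fixed token budget before summarization. The section filter, the token budget, and the token estimate are assumptions made for this example, not Pieces’ actual configuration.

```python
# Hypothetical sketch of context-window limiting (illustrative values, not Pieces' configuration).
from dataclasses import dataclass

@dataclass
class NoteSection:
    timestamp: str    # e.g. "2024-05-01T08:30"
    section: str      # e.g. "Assessment and Plan"
    text: str

RELEVANT_SECTIONS = {"Assessment and Plan", "Hospital Course", "Active Problems"}  # assumed filter
MAX_INPUT_TOKENS = 4000  # assumed budget for the summarization prompt

def estimate_tokens(text: str) -> int:
    """Crude token estimate (roughly 4 characters per token)."""
    return max(1, len(text) // 4)

def build_context(sections: list[NoteSection]) -> str:
    """Keep the most recent relevant sections that fit within the token budget."""
    selected, used = [], 0
    for sec in sorted(sections, key=lambda s: s.timestamp, reverse=True):
        if sec.section not in RELEVANT_SECTIONS:
            continue
        cost = estimate_tokens(sec.text)
        if used + cost > MAX_INPUT_TOKENS:
            break
        selected.append(sec)
        used += cost
    # Restore chronological order for the model input
    return "\n\n".join(s.text for s in reversed(selected))
```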

5. Analysis of Severe Hallucination Rates within Pieces Summaries

Estimating the rate of severe hallucination in AI-generated clinical summaries is crucial for ensuring the reliability and safety of these tools in healthcare settings. A precise estimation of severe hallucination rates enhances trust among our end users and informs our SafeRead review strategies to detect any severe hallucinations in the future. To accomplish this, Pieces AI safety and clinical effectiveness team focused on calculating A) the observed rate of severe hallucinations among SafeRead reviews and B) the likelihood of detecting a severe hallucination across SafeRead reviews, summary utilization in production, and end user edits assuming severe hallucination rates of 3%, 1% and 0.1%.

From January to May 2024, Pieces generated approximately 1 million summaries across our client health systems and observed 0 severe hallucinations out of 12,440 manual SafeRead reviews including 8,734 summaries flagged by stratified sampling and 3,706 summaries flagged by the adversarial AI system for review. Furthermore, throughout this time period Pieces has not received any direct client feedback nor any end user edits indicating observations of severe hallucinations. Using these data points, we can show how the chance of finding at least one severe hallucination changes with different numbers of reviews, assuming a low rate of hallucinations in the population.

For example, Figure 1 shows this relationship if the true rate is 1 in 100,000. If we assume that only 30% of the summaries are reviewed by end users (about 300,000 summaries) and combine this with our SafeRead reviews, we would have an over 95% chance of finding at least one severe hallucination. This graph also helps explain why our confidence that the hallucination rate is low has grown over time, moving the rate we can plausibly rule out from 0.1% to 0.001%, as we have reviewed more summaries through SafeRead while end users utilize the working summaries in production.

Figure 1. Probability of finding at least one severe hallucination by sample size assuming a true rate of severe hallucination of 1 in 100,000.
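
The curve in Figure 1 follows directly from the binomial probability of observing zero findings. The short Python sketch below (with illustrative sample sizes) shows how the probability of seeing at least one severe hallucination grows with the number of reviews when the true rate is 1 in 100,000:

```python
# Probability of observing at least one severe hallucination, assuming a true rate p.
p = 1 / 100_000  # assumed true rate of severe hallucinations (1 in 100,000)

# Illustrative sample sizes; 312,440 = 12,440 SafeRead reviews + ~300,000 end-user-reviewed summaries
for n in [12_440, 100_000, 312_440]:
    p_at_least_one = 1 - (1 - p) ** n
    print(f"n = {n:>7,}: P(at least one) = {p_at_least_one:.1%}")
# n = 312,440 yields roughly 96%, consistent with the "over 95%" figure quoted above.
```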

Now, focusing solely on the 12,440 SafeRead reviews:

While it’s true that we’ve found zero severe hallucinations in these 12,440 reviews, suggesting a rate of less than 1 per 100,000, we can also estimate the chance of finding at least one severe hallucination if the actual rate is higher.

Here’s how it works:

p = assumed rate in the population

n = sample size

Calculate the probability of no severe hallucinations: P(no severe hallucinations) = (1 - p)^n

Calculate the probability of finding at least one severe hallucination: P(at least one severe hallucination) = 1 - P(no severe hallucinations)

Using this method, we can conclude:

  • If the true rate of severe hallucinations is 3% or 1% (3,000 or 1,000 per 100,000 summaries), we have an essentially 100% chance of finding at least one severe hallucination across the 12,440 SafeRead reviews completed.
  • If the rate is 100 per 100,000 (0.1%), there is a greater than 99.99% chance.
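
These figures can be reproduced directly from the formula above; here is a minimal Python sketch:

```python
# Probability of at least one severe hallucination in n reviews, for several assumed rates p.
n = 12_440  # SafeRead reviews completed, January through May 2024

for p in [0.03, 0.01, 0.001]:  # assumed true rates: 3%, 1%, 0.1%
    p_at_least_one = 1 - (1 - p) ** n
    print(f"p = {p:.1%}: P(at least one in {n:,} reviews) = {p_at_least_one:.6f}")
# At p = 3% or 1% the probability is indistinguishable from 1;
# at p = 0.1% it is about 0.999996, i.e. greater than 99.99%.
```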

In sum, given the 12,440 SafeRead reviews completed from January through May 2024 with zero severe hallucinations found, it is extremely unlikely that we would have observed no findings if the true severe hallucination rate were 0.1% or higher.

Limitations and Future Directions

While we feel that this analysis is informative, it is not without limitations. For example, not all end users may recognize or report severe hallucinations, leading to underreporting and an underestimation of the true severe hallucination rate. Further, the analysis is based on data from January to May 2024; changes in the AI system or in clinical documentation practices over time could affect the hallucination rate, making these findings potentially outdated for future periods. Lastly, our analysis focused on what we consider the highest-risk hallucinations. Expanding the focus beyond severe hallucinations to include less severe but still significant hallucinations will help ensure comprehensive safety and trust in AI-generated clinical summaries.

Pieces clinical and AI teams monitor the latest academic research on hallucinations, such as approaches to classification, adversarial AI systems, and mitigation efforts. Pieces design and client success teams continue to advance our end user feedback mechanisms so that feedback can be given conveniently; these additional channels will inform our AI system updates and the potential discovery of severe hallucinations. We welcome questions or suggestions on refinements or improvements to discovering and mitigating hallucinations in our AI system. As numerous regulatory bodies consider approaches to clinical AI risk management frameworks, we call for standardization and greater research on the use of LLMs for summarization of clinical notes. We will continue to elaborate on this work in future blog posts focused on less severe hallucinations, as Pieces Technologies is in a unique position to share findings on the accuracy and safety of LLMs used with real-world clinical data.

[1]: Zhang, Tianyi, et al. “Benchmarking large language models for news summarization.” Transactions of the Association for Computational Linguistics 12 (2024): 39–57.

[2]: Ji, Ziwei, et al. “Survey of hallucination in natural language generation.” ACM Computing Surveys 55.12 (2023): 1–38.

[3]: Bell, Sigall K., et al. “Frequency and types of patient-reported errors in electronic health record ambulatory care notes.” JAMA network open 3.6 (2020): e205867-e205867.

[4]: Zhou, Chunting, et al. “Detecting hallucinated content in conditional neural sequence generation.” arXiv preprint arXiv:2011.02593 (2020).

[5]: Maynez, Joshua, et al. “On faithfulness and factuality in abstractive summarization.” arXiv preprint arXiv:2005.00661 (2020).

[6]: Martindale, Marianna, et al. “Identifying fluently inadequate output in neural and statistical machine translation.” Proceedings of Machine Translation Summit XVII: Research Track. 2019.

[7]: González-Corbelle, Javier, et al. “Dealing with hallucination and omission in neural Natural Language Generation: A use case on meteorology.” Proceedings of the 15th International Conference on Natural Language Generation. 2022.

[8]: Ji, Ziwei, et al. “Survey of hallucination in natural language generation.” ACM Computing Surveys 55.12 (2023): 1–38.
