The Double-Edged Scalpel: Balancing Innovation and Safety in AI Healthcare

malcolmp · Published in Healthcare AI SIG
Jul 24, 2023 · 15 min read

Technologies like ChatGPT, Bard and Med-PaLM are tremendous improvements in AI, but it’s too early to trust them for medical advice. This article outlines the problems with using generative and predictive AI in clinical settings, and some ways to use them safely.

Introduction

The current discourse on AI strangely mirrors this weekend’s big movie releases. On the one hand, Oppenheimer reminds us of the ethical and societal responsibility that comes with developing powerful and potentially dual-use technologies. On the other hand, venture capitalists and start-ups align more with Barbie’s “Let’s go party” message: AI companies raised US$25 billion in the first half of 2023, representing 18% of global funding.

Recent advances in generative AI, particularly large language models (LLMs)[1], have demonstrated potential benefits to healthcare in various areas, including diagnosis, summarization, clinical noting, disease prediction, and patient communication.

Products incorporating generative AI are coming to market faster than previous AI technologies, such as image recognition and predictive models, which had barriers to implementation including data integration and “bothersome” processes like testing for clinical safety. In contrast, a clinician can cut and paste data into ChatGPT to get advice or write notes, ignoring privacy concerns.

Understanding the strengths and weaknesses of this technology is essential and urgent.

The problem is that we don’t know how LLMs store and process knowledge. We have created complexity we don’t fully understand, so we resort to experimenting on these models, akin to examining a new biological creature; we poke them (with prompts) and see how they respond.

Thanks to research from around the world and the proliferation of smaller open-source LLMs that are easier to experiment on, our knowledge of the strengths and limitations of this technology is rapidly growing. This article highlights some of the limitations of generative AI and machine learning (ML) which we should seek to address when considering their use in healthcare, and some areas where we can use them safely.

Hallucinations & Omissions

In the realm of health, an AI’s hallucination can be a captivating illusion; it is our duty to distinguish the light of evidence from the shadows of algorithmic imagination

ChatGPT on AI hallucination in healthcare, in the style of Aldous Huxley

LLMs like ChatGPT generate a new sequence of text based on some input prompt. LLMs are trained on vast quantities of textual data (we will stick to text-based LLMs for this article) and are fine-tuned for specific tasks or better alignment with human discourse.

Let’s say we wanted an LLM to summarise a patient’s medical history. In the prompt, we provide as context a series of discharge summaries and a request to summarise them. The following types of errors could occur:

  • Critical omissions of important facts, such as ignoring a significant diagnosis documented in the discharge summaries;
  • Hallucinations, where the LLM inserts a diagnosis or treatment that is not present in the provided discharge summaries; this may also include citing sources that don’t exist (also called extrinsic hallucination);
  • Incorrect facts, where, say, a medication dose is copied incorrectly from a discharge summary (also called intrinsic hallucination).

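To make these failure modes concrete, here is a minimal sketch of the kind of summarisation call being described, assuming the OpenAI Python client (v1.x); the model name, the toy discharge summaries and the crude string-matching check are illustrative assumptions, not a recommended clinical workflow.

```python
# Minimal sketch: summarising discharge summaries with an LLM (illustrative only).
# Assumes the OpenAI Python client v1.x and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

discharge_summaries = [
    "Admitted with community-acquired pneumonia. Treated with IV ceftriaxone.",
    "Elective admission for knee arthroplasty. Past history: type 2 diabetes.",
]

prompt = (
    "Summarise the patient's history using ONLY the discharge summaries below. "
    "List every diagnosis and medication you find; do not add anything that is "
    "not explicitly stated.\n\n" + "\n\n".join(discharge_summaries)
)

response = client.chat.completions.create(
    model="gpt-4o-mini",   # assumption: substitute your approved model
    temperature=0,         # reduces, but does not eliminate, variability
    messages=[{"role": "user", "content": prompt}],
)
summary = response.choices[0].message.content

# Crude guard against critical omissions: check that key source facts survived.
# Real systems would need far more robust verification than string matching.
for required in ["pneumonia", "ceftriaxone", "diabetes"]:
    if required not in summary.lower():
        print(f"WARNING: possible omission of '{required}'")
print(summary)
```
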
Initially, researchers thought hallucinations occurred when an LLM did not know a fact — it would make it up! However, hallucinations still happen when the relevant facts are present in the prompt [2]. To make matters worse, LLMs can generate hallucinations that appear plausible, and they will attempt to justify the hallucination with plausible explanations [3].

LLMs are particularly prone to hallucination when presented with a query on topics not part of their training data, known as an out-of-distribution (OOD) query. Smaller models naturally have less capacity and are usually trained on less data than massive models such as GPT-4, so they are more prone to hallucinations [4]. In production, smaller models require better guardrails to ensure we use them within the scope of their training.

Another concern is the need for Australian-created models. OOD issues and subsequent hallucinations can arise if, for example, a clinical model is trained on US data and then used on Australian clinical notes that may use different idioms and terms (tokens) not present in the original model. Fine-tuning alone may not be enough to resolve these issues.

Hallucination Mitigation

Hallucination rates and the effect of prompt structure [3]

As the chart above shows, prompt structure can lead to high rates of hallucination. The research shows we can dramatically reduce the hallucination rate by modifying prompts, for example by asking for step-by-step reasoning or specifying an output format.

Another method that reduces the hallucination rate is retrieving the most pertinent information required for the LLM to answer the question and including it as context in the prompt. This approach, called retrieval augmented generation (RAG), reduces hallucinations but is not a cure [5].
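
As a rough illustration of the idea, the sketch below retrieves the passages most similar to a question and builds a prompt from only those passages. It assumes the sentence-transformers package; the embedding model, the toy passages and the prompt wording are my own illustrative choices.

```python
# Minimal retrieval-augmented generation (RAG) sketch (illustrative only).
# Assumes the sentence-transformers package; the passages are toy examples.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

passages = [
    "Metformin is first-line therapy for most adults with type 2 diabetes.",
    "Beta-blockers reduce mortality after myocardial infarction.",
    "Annual retinal screening is recommended for people with diabetes.",
]
question = "What is first-line drug therapy for type 2 diabetes?"

# Embed passages and the question, then rank passages by cosine similarity.
p_vecs = encoder.encode(passages, normalize_embeddings=True)
q_vec = encoder.encode([question], normalize_embeddings=True)[0]
scores = p_vecs @ q_vec
top_k = np.argsort(scores)[::-1][:2]

context = "\n".join(passages[i] for i in top_k)
augmented_prompt = (
    "Answer using ONLY the context below. If the answer is not in the context, "
    f"say you do not know.\n\nContext:\n{context}\n\nQuestion: {question}"
)
print(augmented_prompt)  # pass this to the LLM of your choice
```

In a real pipeline, the retrieved evidence would come from a curated, versioned knowledge base rather than a hard-coded list.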

On Tests and Testability

The task of examining generative AI is not unlike attending a lively ball. Every dance, though guided by the same music, bears unique turns and steps.

ChatGPT on non-deterministic generative AI, in the style of Jane Austen

In most jurisdictions, software is a medical device that invokes a set of regulations for a product’s development and testing. For a software product to be testable, it must be deterministic: the output will always be the same for a given set of inputs. LLMs used for chat applications are non-deterministic. The same input will not result in the same output.

How can we guarantee a product is safe if we cannot test it? Especially when we know it can hallucinate and omit information.
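
One pragmatic (partial) answer is to test statistically rather than exactly: run the same prompt many times and require that every run preserves a defined set of must-have facts. The sketch below assumes a placeholder `call_llm` function and made-up facts; it checks consistency of content, not exact wording.

```python
# Sketch of a statistical check for a non-deterministic LLM feature (illustrative).
# call_llm is a placeholder; replace it with your real model call.
from collections import Counter

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

# Hypothetical facts the summary must always contain for this test case.
MUST_CONTAIN = ["amoxicillin 500 mg", "penicillin allergy: none"]

def check_summary_preserves_critical_facts(prompt: str, runs: int = 20) -> None:
    failures = Counter()
    for _ in range(runs):
        output = call_llm(prompt).lower()
        for fact in MUST_CONTAIN:
            if fact not in output:
                failures[fact] += 1
    # Non-determinism means outputs vary; what we assert is that critical
    # facts appear in every run, not that the wording is identical.
    assert not failures, f"Critical facts missing in some runs: {dict(failures)}"
```
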

Always test the AI, no matter how cute and clever it appears.

Software built on large commercial LLMs, such as ChatGPT (OpenAI) and Bard (Google), incurs further challenges to reproducibility. LLM vendors can update models at any time as they refine them with fine-tuning and reduce unwanted behaviour (censorship). Each model update changes its behaviour, requiring re-testing and re-validation of any downstream systems that use these foundation models [6].

Clinical software vendors can reduce the risks incurred by version change by having their own instance of an LLM, so updates and testing are under their control. Running a dedicated instance of a massive LLM such as GPT-4 may be very expensive, but there are now smaller open-source models that make this option more realistic. As mentioned in the previous section, smaller models have separate safety considerations.

We’ll see later how LLMs can be used in non-generative ways to improve testability.

The Complex Issue of Explainability

Should clinical AI be explainable? That will be a topic for another article. Here, I propose that AI will be most helpful when we face complex decisions in which our pattern recognition fails and we must consider many variables. In these situations, a computer can quantify the interactions of more variables than we can keep in our heads and draw on complex multivariate distributions from the data.

Would we even understand if the computer tried to explain how it found the optimal projection in n-dimensional space with non-linear hierarchical interactions? Probably not. (But it might be good to know what variables it did consider and what evidence it used, if any.)

LLMs are black box models: they cannot explain their workings. The big problem with LLMs is that they give great explanations! Unfortunately, these explanations have nothing to do with how they produce an answer [7]. We make it worse by fine-tuning models with Reinforcement Learning from Human Feedback (RLHF), which makes a model more likely to give the explanations it thinks we want to see [8].

Generative AI training allows the models to create eloquent explanations that don’t reflect the complex computations behind their recommendations. This ability creates the risk of automation bias, where clinicians will be more likely to follow bad advice and ignore contradictory evidence when presented with convincing but incorrect explanations [9].

Most machine learning (ML) models are black box models. However, few are as seductive with their explanations as LLMs. For non-LLM models, numerous tools attempt to generate explanations by fitting a simplified model and generating explanations from it. These simplified explanations (a) do not faithfully reflect the underlying ML model and (b) can amplify bias against minority subgroups in the data [10].
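
Tools such as LIME work in exactly this way, fitting a simple local surrogate around one prediction; the sketch below shows the mechanics on synthetic data, and the caveats above about faithfulness and subgroup bias apply to whatever it outputs. The model, features and data are made up for illustration.

```python
# Sketch: a local surrogate explanation with LIME on a synthetic tabular model.
# The data and feature names are made up; the caveats in the text apply to the output.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
model = GradientBoostingClassifier().fit(X, y)

explainer = LimeTabularExplainer(
    X, feature_names=feature_names, class_names=["no_event", "event"], mode="classification"
)
# Explain a single prediction with a simple local linear surrogate model.
explanation = explainer.explain_instance(X[0], model.predict_proba, num_features=4)
print(explanation.as_list())  # (feature condition, weight) pairs from the surrogate
```
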

Updating Models and Prompt Engineering

LLMs are very expensive to create. GPT-4, reported to have around 1.8 trillion parameters, is estimated to have cost OpenAI over US$100m to train, ignoring all the work that has gone into fine-tuning with RLHF, which requires actual humans. LLaMA, Meta’s 65-billion-parameter open-source model, probably cost US$4–5m to train, assuming they got it right the first time.

The high dollar and energy cost of training models precludes frequent re-training to keep models up to date with the latest evidence from the literature.

Prompt engineering is essentially trial and error.

Fortunately, it’s much cheaper to fine-tune models for specific tasks, often thousands of dollars and much less for smaller models. However, fine-tuning can cause performance degradation and forgetting previously learned knowledge [11].

A more straightforward way to use up-to-date knowledge is to include the information as context in the prompt sent to the LLM. The amount of text an LLM can consider is called the context window. We measure context capacity in tokens, where about 3 words equal 4 tokens. Initially, GPT-3 allowed 2,048 tokens (about 1,500 words), which includes both the prompt and the output from the model. If information falls outside the context window, the LLM can’t use it for reasoning, so models forget earlier parts of a conversation.
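
A quick way to see how text maps to tokens, and whether a prompt will fit a given window, is a tokeniser library such as tiktoken. The sketch below assumes the tiktoken package and the cl100k_base encoding used by several OpenAI models; check the correct encoding and window size for your own model.

```python
# Sketch: counting tokens to check whether a prompt fits the context window.
# Assumes the tiktoken package; "cl100k_base" is an encoding used by several
# OpenAI models; check the right encoding for your model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

note = "Patient admitted with chest pain, troponin negative, discharged on aspirin."
tokens = enc.encode(note)
print(len(note.split()), "words ->", len(tokens), "tokens")

CONTEXT_WINDOW = 2048          # e.g. the original GPT-3 window
RESERVED_FOR_OUTPUT = 512      # the window covers the prompt AND the completion
if len(tokens) > CONTEXT_WINDOW - RESERVED_FOR_OUTPUT:
    print("Prompt will not fit; it must be truncated or summarised first.")
```
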

LLMs are surprisingly good at in-context learning (ICL); they can integrate information provided within the prompt (context) with the knowledge and reasoning skills they have learned during training. So having more prompt capacity is a significant advantage, but before 2023 increasing the context window drove up the cost of training and running LLMs.

The last few months have seen tremendous breakthroughs in expanding the context length [12] to potentially millions of words without the previous disadvantages. A huge context window could allow a health record system to incorporate years of clinical notes and reports into the prompt, letting the LLM reason across a patient’s whole health history.

Large context windows can also turbo-charge the retrieval augmented generation (RAG) technique mentioned previously [4].

Recent analysis has demonstrated some emerging problems with the expanding context window. LLMs with long context windows ignore information in the middle of the prompt and are more accurate with information at the beginning and end [13].

LLMs use information at the beginning and end of their context effectively; documents in the middle are not used effectively [13].

Another fly in the context window ointment is that LLM performance is brittle to slight modifications to prompts, including the order of examples within the prompt and the structure of the information and query [14]. Small shifts in content within the prompt can cause performance degradation and hallucinations. Frustratingly, lessons on optimal prompt construction do not transfer between models. The need to keep up with prompt idiosyncrasies has spawned a new career path called prompt engineering.
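
A cheap way to probe this brittleness on your own task is to permute the documents or examples in the prompt and check whether the answer changes. The sketch below assumes a placeholder `call_llm` function and toy snippets.

```python
# Sketch: checking sensitivity to the order of context documents (illustrative).
# call_llm is a placeholder for your actual model call.
from itertools import permutations

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

snippets = [
    "2019: diagnosed with hypertension.",
    "2021: started on ramipril 5 mg daily.",
    "2023: admitted with TIA, commenced on aspirin.",
]
question = "Which antihypertensive is the patient taking?"

orderings = list(permutations(snippets))
answers = {
    call_llm("\n".join(order) + f"\n\nQuestion: {question}").strip().lower()
    for order in orderings
}

# A robust pipeline should give one consistent answer regardless of ordering.
print(f"{len(answers)} distinct answer(s) across {len(orderings)} orderings")
```
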

Equity

“AI, whispered into existence by human hands, can sing the unsung biases; it is our responsibility to tune its melody towards harmony”

ChatGPT on AI model inequity, in the style of Maya Angelou

While not specific to LLMs, all machine learning models are descriptive systems — they learn from the past including its biases and inefficiencies.

Guaranteeing fairness in LLMs introduces complexities beyond those of regular ML models. Bias can appear before the main model processing begins, surfacing in the initial word-encoding stage (embeddings) [15]. Bias can then emerge within the neural network itself, depending on the characteristics of the training data used to build it.

It is tough to remove bias from training data [16], and ML models have been able to learn racial characteristics from text and even X-ray images [17]. These models reflect the disparity in our system, and it takes work to compensate for these problems.

Specific ML models may be more prone to bias than others, even when trained on the same data. For example, I contributed to a recent study [18] showing that a Logistic Regression model displayed more bias with respect to gender and Indigenous status than an XGBoost model trained on the same data (chart below).

Performance on gender of 2 models trained on the same data: XGBoost (left) vs Logistic Regression (right) [18].
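
The subgroup comparison behind a chart like this is straightforward to reproduce: compute the same performance metric separately for each demographic group and compare. The column names and synthetic data below are illustrative assumptions; substitute your own evaluation set and the subgroups relevant to your population.

```python
# Sketch: comparing model performance across subgroups (illustrative only).
# Column names and data are made up; substitute your own evaluation set.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
eval_df = pd.DataFrame({
    "y_true": rng.integers(0, 2, size=1000),
    "y_score": rng.random(1000),
    "gender": rng.choice(["female", "male"], size=1000),
})

# AUC per subgroup; large gaps between groups warrant investigation.
for group, sub in eval_df.groupby("gender"):
    auc = roc_auc_score(sub["y_true"], sub["y_score"])
    print(f"{group}: AUC = {auc:.3f} (n = {len(sub)})")
```
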

Ensuring fair models is a complex subject requiring considerations across the machine-learning process. When using foundation models from other sources, it’s hard to understand how inbuilt assumptions may affect the patients under your care.

Other Issues to Consider

Commercial Bias Towards Unnecessary Complexity

For every large investment in an AI startup, one or more investors are waiting for a significant return on their money, usually within 1–4 years. This commercial pressure pushes into the market solutions that may be more complex than necessary. If a product is too simple, companies can’t easily protect intellectual property (IP) and can’t price it at a premium. So they prefer complex models over simpler explainable ones that, in many cases, can function just as well or better [19].

With the excitement around LLMs and ChatGPT, it is easy to forget that the biggest problems in healthcare are unwarranted clinical variation and patient safety. The solution to these problems requires prescriptive systems (rules, decision trees, pathways, statistical models) that assist clinicians in implementing what we already know in ways that reduce cognitive burden (workflow integration). Descriptive/predictive ML models can help, but not all problems require them.

Uncertainty Management

As discussed earlier, LLMs, and deep learning models more generally, don’t do well with out-of-distribution (OOD) inputs. Ideally, a model should be able to identify when the inputs it receives fall outside its training distribution and express low confidence in its output. That doesn’t happen. LLMs and other neural networks will confidently make predictions on OOD inputs.

Other models better at assessing OOD data may be required to function as guardrails for LLMs to ensure they operate within the specified parameters.
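
One simple guardrail pattern is to embed each incoming input and withhold the model’s output when the input is too dissimilar from the data the model was validated on. The sketch below assumes the sentence-transformers package; the reference notes and similarity threshold are illustrative and would need tuning and validation.

```python
# Sketch: a simple out-of-distribution (OOD) guardrail for text inputs (illustrative).
# Assumes sentence-transformers; the threshold and reference notes are made up.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Notes representative of the data the downstream model was validated on.
reference_notes = [
    "Presented with shortness of breath, started on furosemide.",
    "Post-operative day 2 after hip replacement, mobilising well.",
]
ref_vecs = encoder.encode(reference_notes, normalize_embeddings=True)
SIMILARITY_THRESHOLD = 0.3  # assumption: tune and validate on held-out data

def within_scope(note: str) -> bool:
    """Return True if the note looks similar enough to the validation data."""
    vec = encoder.encode([note], normalize_embeddings=True)[0]
    return float(np.max(ref_vecs @ vec)) >= SIMILARITY_THRESHOLD

incoming = "Veterinary record: canine presented with lameness of left forelimb."
if not within_scope(incoming):
    print("Input flagged as out of scope; withholding model output.")
```
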

Model Drift

All predictive models can suffer degradation in performance over time, as the underlying data changes or as process changes alter how labelled data is generated and used [20]. Teams should monitor models over time for drift, which is relatively straightforward for predictive models, but methods for doing this with generative AI still need to be developed.

Drift may not occur only in top-level performance. Overall model performance can remain acceptable while a model becomes increasingly biased, which is why ongoing subgroup analysis is required.
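
For predictive models, a common drift check is the population stability index (PSI) between a baseline score distribution and recent scores. The sketch below uses synthetic data and equal-width bins; the 0.1 and 0.25 thresholds are the usual rules of thumb, not hard limits.

```python
# Sketch: monitoring score drift with the population stability index (PSI).
# Data is synthetic; the 0.1 / 0.25 thresholds are common rules of thumb.
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    # Shared equal-width bins over the combined range, then compare proportions.
    edges = np.histogram_bin_edges(np.concatenate([baseline, current]), bins=bins)
    expected = np.histogram(baseline, bins=edges)[0] / len(baseline)
    actual = np.histogram(current, bins=edges)[0] / len(current)
    expected = np.clip(expected, 1e-6, None)  # avoid log(0)
    actual = np.clip(actual, 1e-6, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

rng = np.random.default_rng(0)
baseline_scores = rng.beta(2, 5, size=5000)   # scores at deployment
current_scores = rng.beta(2.5, 4, size=5000)  # scores this month

value = psi(baseline_scores, current_scores)
print(f"PSI = {value:.3f}")  # < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate
```
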

Other Issues Not Covered

While I’ve covered the main issues relating to LLMs, there are many other generally applicable issues to consider when looking at deploying any ML or AI into production:

  • Privacy and security
  • Regulatory oversight
  • Integration with existing systems and data quality
  • Training and evaluation
  • Workflow integration and monitoring
  • Guardrails for defined use and blocking adversarial inputs
  • And I’m sure there’s more…let me know what I’ve missed!

Safe Uses for LLMs

Predictive Tasks

Numerous researchers have shown LLMs can be effective for predictive (non-generative) tasks. The transformer architecture that underlies LLMs comprises encoder and decoder components. The decoder generates text sequences, but if we don’t need that functionality, we can use the encoder on its own, applying natural language understanding to prediction tasks. Google’s BERT, released in 2018, is an encoder-only transformer [21]; many of today’s predictive models are based on BERT.

Encoder models are several orders of magnitude smaller than the largest decoder models and perform well at predicting outcome measures (mortality, length of stay, etc.) from clinical notes [22] and at clinical disease prediction [23].
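
A sketch of the encoder-only pattern using the Hugging Face transformers library is below. The checkpoint name is one commonly used clinical BERT variant and is an assumption (check its suitability and licensing for your setting), and the classification head it adds is untrained until fine-tuned on labelled outcomes.

```python
# Sketch: an encoder-only (BERT-style) model for prediction from clinical text.
# Assumes the transformers and torch packages; the checkpoint is an example and
# the classification head is randomly initialised until fine-tuned on labels.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "emilyalsentzer/Bio_ClinicalBERT"   # assumption: pick your own checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

note = "78F admitted with sepsis secondary to UTI, background of CKD stage 3."
inputs = tokenizer(note, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)
print(probs)  # meaningless until the head is fine-tuned, e.g. with transformers.Trainer
```
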

Feature Extraction from Clinical Notes and Reports

Sometimes we want to know specific information from the medical history. For example, does a patient have a family history of heart disease? While this seems a simple question, it’s hard to automate because clinical notes are usually unstructured (there is no specific field for family history). Furthermore, clinicians express the same information in many ways, such as “Father died from stroke”, “Mother had stent for ACS”, or “FHx: NA”, and so on.

LLMs understand the nuances of language and are good at handling varying expressions and negations to extract specific clinical information of interest [24]. Once extracted, we can use these features in risk scores, clinical trial eligibility algorithms, or explainable models.
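
A minimal sketch of this extraction pattern is below, assuming the OpenAI Python client (v1.x); the model name, JSON schema and example note are illustrative assumptions, and any real deployment would need validation, error handling and privacy controls.

```python
# Sketch: extracting a structured feature (family history of heart disease) from
# free text (illustrative only; model name and schema are assumptions).
import json
from openai import OpenAI

client = OpenAI()

note = "Social Hx: non-smoker. FHx: father had MI at 54, mother well."
prompt = (
    "Does this note indicate a family history of heart disease? "
    "Reply with JSON only, in the form "
    '{"family_history_heart_disease": "yes" | "no" | "unknown", "evidence": "<quote>"}.\n\n'
    + note
)

response = client.chat.completions.create(
    model="gpt-4o-mini",   # assumption: substitute your approved model
    temperature=0,
    messages=[{"role": "user", "content": prompt}],
)
try:
    feature = json.loads(response.choices[0].message.content)
except json.JSONDecodeError:
    feature = {"family_history_heart_disease": "unknown", "evidence": ""}

print(feature)  # e.g. usable as an input to a risk score or eligibility rule
```
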

Pre-Prepared Generative Content

Not all generative AI uses need to be ‘live’ and direct to the patient. Recent research indicates that patients like the empathetic output from ChatGPT for answering medical questions [25], but the risk of inaccurate information remains a barrier.

Another approach is to map patient queries to validated information. We often know the kinds of questions patients ask, and we can generate content addressing these questions for different health literacy levels, cultural backgrounds and languages using one or more LLMs. Domain specialists can vet content to ensure it is accurate. In a chatbot, a patient query that requires clinical information can be mapped to the closest approved content, similar to the methods used in retrieval augmented generation.
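
A sketch of this “map to approved content” pattern is below: embed a library of clinician-vetted answers, return the closest match to a patient query, and escalate to a human when nothing is similar enough. It assumes the sentence-transformers package; the content, threshold and escalation message are illustrative.

```python
# Sketch: routing patient questions to clinician-vetted answers (illustrative only).
# Assumes sentence-transformers; threshold and content are made up.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

approved_answers = {
    "How should I prepare for my colonoscopy?": "Vetted preparation instructions...",
    "Is it normal to feel tired after chemotherapy?": "Vetted fatigue advice...",
}
questions = list(approved_answers)
q_vecs = encoder.encode(questions, normalize_embeddings=True)

MIN_SIMILARITY = 0.6  # assumption: below this, hand off to a human

def answer(patient_query: str) -> str:
    v = encoder.encode([patient_query], normalize_embeddings=True)[0]
    scores = q_vecs @ v
    best = int(scores.argmax())
    if scores[best] < MIN_SIMILARITY:
        return "I'm not sure about that one; let me connect you with the care team."
    return approved_answers[questions[best]]

print(answer("What should I eat before my colonoscopy?"))
```
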

Conclusions

You may think I’m against AI innovation by raising all these issues. However, I genuinely believe that we risk slowing innovation if researchers, vendors and customers do not consider what it takes to deploy this new AI technology into clinical settings safely.

It’s not clear that LLMs alone can avoid hallucinations and the other shortcomings mentioned; it may be that a combination of different technologies working together is needed to make AI systems robust for clinical use. You may rest assured that many people are working on solutions to these problems.

There is a ‘gold-rush’ mentality to publish articles and launch products which do not provide potential users with the information required to make responsible decisions on how (and if) to use AI tools for clinical use. I hope this has helped to inform you on what questions to ask.

References

18. Rodriguez J, Padilla D, Bruce L, Thow B, Pradhan M. Equitable Machine Learning for Hypoglycaemia Risk Management. MedInfo 2023, Sydney, Australia. In press.

malcolmp · Healthcare AI SIG

Adjunct Professor of Digital Health, University of Sydney. Entrepreneur. MBBS, PhD, FAIDH.