The Double-Edged Scalpel: Balancing Innovation and Safety in AI Healthcare

malcolmp · Published in Healthcare AI SIG
Jul 24, 2023 · 15 min read

Technologies like ChatGPT, Bard and Med-PaLM are tremendous improvements in AI, but it’s too early to trust them for medical advice. This article outlines the problems with using generative and predictive AI in clinical settings, and some ways to use them safely.

Introduction

The current discourse on AI strangely mirrors this weekend’s big movie releases. On the one hand, Oppenheimer reminds us of the ethical and societal responsibility that comes with developing powerful and potentially dual-use technologies. On the other hand, venture capitalists and start-ups align more with Barbie’s “Let’s go party” message: AI companies raised US$25 billion in the first half of 2023, representing 18% of global funding.

Recent advances in generative AI, particularly large language models (LLMs)[1], have demonstrated potential benefits to healthcare in various areas, including diagnosis, summarization, clinical noting, disease prediction, and patient communication.

Products incorporating generative AI are coming to market faster than previous AI technologies, such as image recognition and predictive models, which had barriers to implementation including data integration and “bothersome” processes like testing for clinical safety. In contrast, a clinician can cut and paste data into ChatGPT to get advice or write notes, ignoring privacy concerns.

Understanding the strengths and weaknesses of this technology is essential and urgent.

The problem is that we don’t know how LLMs store and process knowledge. We have created complexity we don’t fully understand, so we resort to experimenting on these models, akin to examining a new biological creature; we poke them (with prompts) and see how they respond.

Thanks to research from around the world and the proliferation of smaller open-source LLMs that are easier to experiment on, our knowledge of the strengths and limitations of this technology is rapidly growing. This article highlights some of the limitations of generative AI and machine learning (ML) which we should seek to address when considering their use in healthcare, and some areas where we can use them safely.

Hallucinations & Omissions

In the realm of health, an AI’s hallucination can be a captivating illusion; it is our duty to distinguish the light of evidence from the shadows of algorithmic imagination

ChatGPT on AI hallucination in healthcare, in the style of Aldous Huxley

LLMs like ChatGPT generate a new sequence of text based on some input prompt. LLMs are trained on vast quantities of textual data (we will stick to text-based LLMs for this article) and are fine-tuned for specific tasks or better alignment with human discourse.

Let’s say we wanted an LLM to summarise a patient’s medical history. In the prompt, we provide as context a series of discharge summaries and a request to summarise them. The following types of errors could occur:

  • Critical omissions of important facts, such as ignoring a significant diagnosis documented in the discharge summaries;
  • Hallucinations, where the LLM inserts a diagnosis or treatment that is not present in the provided discharge summaries; this may also include citing sources that don’t exist (also called extrinsic hallucination);
  • Incorrect facts, where, say, a medication dose is copied incorrectly from a discharge summary (also called intrinsic hallucination).

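To make these failure modes concrete, here is a minimal sketch of the kind of summarisation call being described, assuming the OpenAI Python client (v1.x); the model name, the toy discharge summaries and the crude string-matching check are illustrative assumptions, not a recommended clinical workflow.

```python
# Minimal sketch: summarising discharge summaries with an LLM (illustrative only).
# Assumes the OpenAI Python client v1.x and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

discharge_summaries = [
    "Admitted with community-acquired pneumonia. Treated with IV ceftriaxone.",
    "Elective admission for knee arthroplasty. Past history: type 2 diabetes.",
]

prompt = (
    "Summarise the patient's history using ONLY the discharge summaries below. "
    "List every diagnosis and medication you find; do not add anything that is "
    "not explicitly stated.\n\n" + "\n\n".join(discharge_summaries)
)

response = client.chat.completions.create(
    model="gpt-4o-mini",   # assumption: substitute your approved model
    temperature=0,         # reduces, but does not eliminate, variability
    messages=[{"role": "user", "content": prompt}],
)
summary = response.choices[0].message.content

# Crude guard against critical omissions: check that key source facts survived.
# Real systems would need far more robust verification than string matching.
for required in ["pneumonia", "ceftriaxone", "diabetes"]:
    if required not in summary.lower():
        print(f"WARNING: possible omission of '{required}'")
print(summary)
```
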
Initially, researchers thought hallucinations occurred when an LLM did not know a fact — it would make it up! However, hallucinations still happen when the relevant facts are present in the prompt [2]. To make matters worse, LLMs can generate hallucinations that appear plausible, and they will attempt to justify the hallucination with plausible explanations [3].

LLMs are particularly prone to hallucination when presented with a query on topics not part of their training data, known as an out-of-distribution (OOD) query. Smaller models naturally have less capacity and are usually trained on less data than massive models such as GPT-4, so they are more prone to hallucinations [4]. In production, smaller models require better guardrails to ensure we use them within the scope of their training.

Another concern is the need for Australian-created models. OOD issues and subsequent hallucinations can arise if, for example, a clinical model is trained on US data and then used on Australian clinical notes that may use different idioms and terms (tokens) not present in the original model. Fine-tuning alone may not be enough to resolve these issues.

Hallucination Mitigation

Hallucination rates and the effect of prompt structure [3]

As the chart above shows, prompt structure can lead to high rates of hallucination. The research shows we can dramatically reduce the hallucination rate by modifying prompts, for example by asking for step-by-step reasoning or specifying an output format.

Another method that reduces the hallucination rate is retrieving the most pertinent information required for the LLM to answer the question and including it as context in the prompt. This approach, called retrieval augmented generation (RAG), reduces hallucinations but is not a cure [5].
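
As a rough illustration of the idea, the sketch below retrieves the passages most similar to a question and builds a prompt from only those passages. It assumes the sentence-transformers package; the embedding model, the toy passages and the prompt wording are my own illustrative choices.

```python
# Minimal retrieval-augmented generation (RAG) sketch (illustrative only).
# Assumes the sentence-transformers package; the passages are toy examples.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

passages = [
    "Metformin is first-line therapy for most adults with type 2 diabetes.",
    "Beta-blockers reduce mortality after myocardial infarction.",
    "Annual retinal screening is recommended for people with diabetes.",
]
question = "What is first-line drug therapy for type 2 diabetes?"

# Embed passages and the question, then rank passages by cosine similarity.
p_vecs = encoder.encode(passages, normalize_embeddings=True)
q_vec = encoder.encode([question], normalize_embeddings=True)[0]
scores = p_vecs @ q_vec
top_k = np.argsort(scores)[::-1][:2]

context = "\n".join(passages[i] for i in top_k)
augmented_prompt = (
    "Answer using ONLY the context below. If the answer is not in the context, "
    f"say you do not know.\n\nContext:\n{context}\n\nQuestion: {question}"
)
print(augmented_prompt)  # pass this to the LLM of your choice
```

In a real pipeline, the retrieved evidence would come from a curated, versioned knowledge base rather than a hard-coded list.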

On Tests and Testability

The task of examining generative AI is not unlike attending a lively ball. Every dance, though guided by the same music, bears unique turns and steps.

ChatGPT on non-deterministic generative AI, in the style of Jane Austen

In most jurisdictions, software is a medical device that invokes a set of regulations for a product’s development and testing. For a software product to be testable, it must be deterministic: the output will always be the same for a given set of inputs. LLMs used for chat applications are non-deterministic. The same input will not result in the same output.

How can we guarantee a product is safe if we cannot test it? Especially when we know it can hallucinate and omit information.
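
One pragmatic (partial) answer is to test statistically rather than exactly: run the same prompt many times and require that every run preserves a defined set of must-have facts. The sketch below assumes a placeholder `call_llm` function and made-up facts; it checks consistency of content, not exact wording.

```python
# Sketch of a statistical check for a non-deterministic LLM feature (illustrative).
# call_llm is a placeholder; replace it with your real model call.
from collections import Counter

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

# Hypothetical facts the summary must always contain for this test case.
MUST_CONTAIN = ["amoxicillin 500 mg", "penicillin allergy: none"]

def check_summary_preserves_critical_facts(prompt: str, runs: int = 20) -> None:
    failures = Counter()
    for _ in range(runs):
        output = call_llm(prompt).lower()
        for fact in MUST_CONTAIN:
            if fact not in output:
                failures[fact] += 1
    # Non-determinism means outputs vary; what we assert is that critical
    # facts appear in every run, not that the wording is identical.
    assert not failures, f"Critical facts missing in some runs: {dict(failures)}"
```
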

Always test the AI, no matter how cute and clever it appears.

Software built on large commercial LLMs, such as ChatGPT (OpenAI) and Bard (Google), incurs further challenges to reproducibility. LLM vendors can update models at any time as they refine them with fine-tuning and reduce unwanted behaviour (censorship). Each model update changes its behaviour, requiring re-testing and re-validation of any downstream systems that use these foundation models [6].

Clinical software vendors can reduce the risks incurred by version change by having their own instance of an LLM, so updates and testing are under their control. Running a dedicated instance of a massive LLM such as GPT-4 may be very expensive, but there are now smaller open-source models that make this option more realistic. As mentioned in the previous section, smaller models have separate safety considerations.

We’ll see later how LLMs can be used in non-generative ways to improve testability.

The Complex Issue of Explainability

Should clinical AI be explainable? That will be a topic for another article. Here, I propose that AI will be most helpful when we face complex decisions in which our pattern recognition fails and we must consider many variables. In these situations, a computer can quantify the interactions of more variables than we can keep in our heads and draw on complex multivariate distributions from the data.

Would we even understand if the computer tried to explain how it found the optimal projection in n-dimensional space with non-linear hierarchical interactions? Probably not. (But it might be good to know what variables it did consider and what evidence it used, if any.)

LLMs are black box models: they cannot explain their workings. The big problem with LLMs is that they give great explanations! Unfortunately, these explanations have nothing to do with how they produce an answer [7]. We make it worse by fine-tuning models with Reinforcement Learning from Human Feedback (RLHF), which makes a model more likely to give the explanations it thinks we want to see [8].

Generative AI training allows the models to create eloquent explanations that don’t reflect the complex computations behind their recommendations. This ability creates the risk of automation bias, where clinicians will be more likely to follow bad advice and ignore contradictory evidence when presented with convincing but incorrect explanations [9].

Most machine learning (ML) models are black box models. However, few are as seductive with their explanations as LLMs. For non-LLM models, numerous tools attempt to generate explanations by fitting a simplified model and generating explanations from it. These simplified explanations (a) do not faithfully reflect the underlying ML model and (b) can amplify bias against minority subgroups in the data [10].
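
Tools such as LIME work in exactly this way, fitting a simple local surrogate around one prediction; the sketch below shows the mechanics on synthetic data, and the caveats above about faithfulness and subgroup bias apply to whatever it outputs. The model, features and data are made up for illustration.

```python
# Sketch: a local surrogate explanation with LIME on a synthetic tabular model.
# The data and feature names are made up; the caveats in the text apply to the output.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
model = GradientBoostingClassifier().fit(X, y)

explainer = LimeTabularExplainer(
    X, feature_names=feature_names, class_names=["no_event", "event"], mode="classification"
)
# Explain a single prediction with a simple local linear surrogate model.
explanation = explainer.explain_instance(X[0], model.predict_proba, num_features=4)
print(explanation.as_list())  # (feature condition, weight) pairs from the surrogate
```
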

Updating Models and Prompt Engineering

LLMs are very expensive to create. GPT-4, reported to have around 1.8 trillion parameters, is estimated to have cost OpenAI over US$100m to train, ignoring all the work that has gone into fine-tuning with RLHF, which requires actual humans. LLaMA, Meta’s 65-billion-parameter open-source model, probably cost US$4–5m to train, assuming they got it right the first time.

The high dollar and energy cost of training models precludes frequent re-training to keep models up to date with the latest evidence from the literature.

Prompt engineering is essentially trial and error.

Fortunately, it’s much cheaper to fine-tune models for specific tasks, often thousands of dollars and much less for smaller models. However, fine-tuning can cause performance degradation and forgetting previously learned knowledge [11].

A more straightforward way to use up-to-date knowledge is to include the information as context in the prompt sent to the LLM. The amount of text an LLM can consider is called the context window. We measure context capacity in tokens, where about 3 words equal 4 tokens. Initially, GPT-3 allowed 2,048 tokens (about 1,500 words), which includes both the prompt and the output from the model. If information falls outside the context window, the LLM can’t use it for reasoning, so models forget earlier parts of a conversation.
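
A quick way to see how text maps to tokens, and whether a prompt will fit a given window, is a tokeniser library such as tiktoken. The sketch below assumes the tiktoken package and the cl100k_base encoding used by several OpenAI models; check the correct encoding and window size for your own model.

```python
# Sketch: counting tokens to check whether a prompt fits the context window.
# Assumes the tiktoken package; "cl100k_base" is an encoding used by several
# OpenAI models; check the right encoding for your model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

note = "Patient admitted with chest pain, troponin negative, discharged on aspirin."
tokens = enc.encode(note)
print(len(note.split()), "words ->", len(tokens), "tokens")

CONTEXT_WINDOW = 2048          # e.g. the original GPT-3 window
RESERVED_FOR_OUTPUT = 512      # the window covers the prompt AND the completion
if len(tokens) > CONTEXT_WINDOW - RESERVED_FOR_OUTPUT:
    print("Prompt will not fit; it must be truncated or summarised first.")
```
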

LLMs are surprisingly good at in-context learning (ICL); they can integrate information provided within the prompt (context) with the knowledge and reasoning skills they have learned during training. So having more prompt capacity is a significant advantage, but before 2023 increasing the context window drove up the cost of training and running LLMs.

The last few months have seen tremendous breakthroughs in expanding the context length [12] to potentially millions of words without the previous disadvantages. A huge context window could allow a health record system to incorporate years of clinical notes and reports into the prompt, letting the LLM reason across a patient’s whole health history.

Large context windows can also turbo-charge the retrieval augmented generation (RAG) technique mentioned previously [4].

Recent analysis has demonstrated some emerging problems with the expanding context window. LLMs with long context windows ignore information in the middle of the prompt and are more accurate with information at the beginning and end [13].

LLMs use information at the beginning and end of their context effectively; documents in the middle are not used effectively [13].

Another fly in the context window ointment is that LLM performance is brittle to slight modifications to prompts, including the order of examples within the prompt and the structure of the information and query [14]. Small shifts in content within the prompt can cause performance degradation and hallucinations. Frustratingly, lessons on optimal prompt construction do not transfer between models. The need to keep up with prompt idiosyncrasies has spawned a new career path called prompt engineering.
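
A cheap way to probe this brittleness on your own task is to permute the documents or examples in the prompt and check whether the answer changes. The sketch below assumes a placeholder `call_llm` function and toy snippets.

```python
# Sketch: checking sensitivity to the order of context documents (illustrative).
# call_llm is a placeholder for your actual model call.
from itertools import permutations

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

snippets = [
    "2019: diagnosed with hypertension.",
    "2021: started on ramipril 5 mg daily.",
    "2023: admitted with TIA, commenced on aspirin.",
]
question = "Which antihypertensive is the patient taking?"

orderings = list(permutations(snippets))
answers = {
    call_llm("\n".join(order) + f"\n\nQuestion: {question}").strip().lower()
    for order in orderings
}

# A robust pipeline should give one consistent answer regardless of ordering.
print(f"{len(answers)} distinct answer(s) across {len(orderings)} orderings")
```
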

Equity

“AI, whispered into existence by human hands, can sing the unsung biases; it is our responsibility to tune its melody towards harmony”

ChatGPT on AI model inequity, in the style of Maya Angelou

While not specific to LLMs, all machine learning models are descriptive systems — they learn from the past including its biases and inefficiencies.

Guaranteeing fairness in LLMs introduces complexities beyond those of regular ML models. Bias can appear before the main model processing begins, surfacing in the initial word-encoding stage (embeddings) [15]. Bias can then emerge within the neural network itself, depending on the characteristics of the training data used to build it.

It is tough to remove bias from training data [16], and ML models have been able to learn racial characteristics from text and even X-ray images [17]. These models reflect the disparity in our system, and it takes work to compensate for these problems.

Specific ML models may be more prone to bias than others, even when trained on the same data. For example, I contributed to a recent study [18] showing that a Logistic Regression model displayed more bias with respect to gender and Indigenous status than an XGBoost model trained on the same data (chart below).

Performance on gender of 2 models trained on the same data: XGBoost (left) vs Logistic Regression (right) [18].
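
The subgroup comparison behind a chart like this is straightforward to reproduce: compute the same performance metric separately for each demographic group and compare. The column names and synthetic data below are illustrative assumptions; substitute your own evaluation set and the subgroups relevant to your population.

```python
# Sketch: comparing model performance across subgroups (illustrative only).
# Column names and data are made up; substitute your own evaluation set.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
eval_df = pd.DataFrame({
    "y_true": rng.integers(0, 2, size=1000),
    "y_score": rng.random(1000),
    "gender": rng.choice(["female", "male"], size=1000),
})

# AUC per subgroup; large gaps between groups warrant investigation.
for group, sub in eval_df.groupby("gender"):
    auc = roc_auc_score(sub["y_true"], sub["y_score"])
    print(f"{group}: AUC = {auc:.3f} (n = {len(sub)})")
```
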

Ensuring fair models is a complex subject requiring considerations across the machine-learning process. When using foundation models from other sources, it’s hard to understand how inbuilt assumptions may affect the patients under your care.

Other Issues to Consider

Commercial Bias Towards Unnecessary Complexity

For every large investment in an AI startup, one or more investors are waiting for a significant return on their money, usually within 1–4 years. This commercial pressure pushes into the market solutions that may be more complex than necessary. If a product is too simple, companies can’t easily protect intellectual property (IP) and can’t price it at a premium. So they prefer complex models over simpler explainable ones that, in many cases, can function just as well or better [19].

With the excitement around LLMs and ChatGPT, it is easy to forget that the biggest problems in healthcare are unwarranted clinical variation and patient safety. The solution to these problems requires prescriptive systems (rules, decision trees, pathways, statistical models) that assist clinicians in implementing what we already know in ways that reduce cognitive burden (workflow integration). Descriptive/predictive ML models can help, but not all problems require them.

Uncertainty Management

As discussed earlier, LLMs, and deep learning models more generally, don’t do well with out-of-distribution (OOD) inputs. Ideally, a model should be able to identify when the inputs it receives fall outside its training distribution and express low confidence in its output. That doesn’t happen. LLMs and other neural networks will confidently make predictions on OOD inputs.

Other models better at assessing OOD data may be required to function as guardrails for LLMs to ensure they operate within the specified parameters.
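
One simple guardrail pattern is to embed each incoming input and withhold the model’s output when the input is too dissimilar from the data the model was validated on. The sketch below assumes the sentence-transformers package; the reference notes and similarity threshold are illustrative and would need tuning and validation.

```python
# Sketch: a simple out-of-distribution (OOD) guardrail for text inputs (illustrative).
# Assumes sentence-transformers; the threshold and reference notes are made up.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Notes representative of the data the downstream model was validated on.
reference_notes = [
    "Presented with shortness of breath, started on furosemide.",
    "Post-operative day 2 after hip replacement, mobilising well.",
]
ref_vecs = encoder.encode(reference_notes, normalize_embeddings=True)
SIMILARITY_THRESHOLD = 0.3  # assumption: tune and validate on held-out data

def within_scope(note: str) -> bool:
    """Return True if the note looks similar enough to the validation data."""
    vec = encoder.encode([note], normalize_embeddings=True)[0]
    return float(np.max(ref_vecs @ vec)) >= SIMILARITY_THRESHOLD

incoming = "Veterinary record: canine presented with lameness of left forelimb."
if not within_scope(incoming):
    print("Input flagged as out of scope; withholding model output.")
```
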

Model Drift

All predictive models can suffer degradation in performance over time, as the underlying data changes or as process changes alter how labelled data is generated and used [20]. Teams should monitor models over time for drift, which is relatively straightforward for predictive models, but methods for doing this with generative AI still need to be developed.

Drift may not occur only in top-level performance. Overall model performance can remain acceptable while a model becomes increasingly biased, which is why ongoing subgroup analysis is required.
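
For predictive models, a common drift check is the population stability index (PSI) between a baseline score distribution and recent scores. The sketch below uses synthetic data and equal-width bins; the 0.1 and 0.25 thresholds are the usual rules of thumb, not hard limits.

```python
# Sketch: monitoring score drift with the population stability index (PSI).
# Data is synthetic; the 0.1 / 0.25 thresholds are common rules of thumb.
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    # Shared equal-width bins over the combined range, then compare proportions.
    edges = np.histogram_bin_edges(np.concatenate([baseline, current]), bins=bins)
    expected = np.histogram(baseline, bins=edges)[0] / len(baseline)
    actual = np.histogram(current, bins=edges)[0] / len(current)
    expected = np.clip(expected, 1e-6, None)  # avoid log(0)
    actual = np.clip(actual, 1e-6, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

rng = np.random.default_rng(0)
baseline_scores = rng.beta(2, 5, size=5000)   # scores at deployment
current_scores = rng.beta(2.5, 4, size=5000)  # scores this month

value = psi(baseline_scores, current_scores)
print(f"PSI = {value:.3f}")  # < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate
```
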

Other Issues Not Covered

While I’ve covered the main issues relating to LLMs, there are many other generally applicable issues to consider when looking at deploying any ML or AI into production:

  • Privacy and security
  • Regulatory oversight
  • Integration with existing systems and data quality
  • Training and evaluation
  • Workflow integration and monitoring
  • Guardrails for defined use and blocking adversarial inputs
  • And I’m sure there’s more…let me know what I’ve missed!

Safe Uses for LLMs

Predictive Tasks

Numerous researchers have shown LLMs can be effective for predictive (non-generative) tasks. The transformer architecture that underlies LLMs comprises encoder and decoder components. The decoder generates text sequences, but if we don’t need that functionality, we can use the encoder on its own, applying natural language understanding to prediction tasks. Google’s BERT, released in 2018, is an encoder-only transformer [21]; many of today’s predictive models are based on BERT.

Encoder models are several orders of magnitude smaller than the largest decoder models and perform well at predicting outcome measures (mortality, length of stay, etc.) from clinical notes [22] and at clinical disease prediction [23].
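
A sketch of the encoder-only pattern using the Hugging Face transformers library is below. The checkpoint name is one commonly used clinical BERT variant and is an assumption (check its suitability and licensing for your setting), and the classification head it adds is untrained until fine-tuned on labelled outcomes.

```python
# Sketch: an encoder-only (BERT-style) model for prediction from clinical text.
# Assumes the transformers and torch packages; the checkpoint is an example and
# the classification head is randomly initialised until fine-tuned on labels.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "emilyalsentzer/Bio_ClinicalBERT"   # assumption: pick your own checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

note = "78F admitted with sepsis secondary to UTI, background of CKD stage 3."
inputs = tokenizer(note, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)
print(probs)  # meaningless until the head is fine-tuned, e.g. with transformers.Trainer
```
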

Feature Extraction from Clinical Notes and Reports

Sometimes we want to know specific information from the medical history. For example, does a patient have a family history of heart disease? While this seems a simple question, it’s hard to automate because clinical notes are usually unstructured (there is no specific field for family history). Furthermore, clinicians express the same information in many ways, such as “Father died from stroke”, “Mother had stent for ACS”, or “FHx: NA”, and so on.

LLMs understand the nuances of language and are good at handling varying expressions and negations to extract specific clinical information of interest [24]. Once extracted, we can use these features in risk scores, clinical trial eligibility algorithms, or explainable models.
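
A minimal sketch of this extraction pattern is below, assuming the OpenAI Python client (v1.x); the model name, JSON schema and example note are illustrative assumptions, and any real deployment would need validation, error handling and privacy controls.

```python
# Sketch: extracting a structured feature (family history of heart disease) from
# free text (illustrative only; model name and schema are assumptions).
import json
from openai import OpenAI

client = OpenAI()

note = "Social Hx: non-smoker. FHx: father had MI at 54, mother well."
prompt = (
    "Does this note indicate a family history of heart disease? "
    "Reply with JSON only, in the form "
    '{"family_history_heart_disease": "yes" | "no" | "unknown", "evidence": "<quote>"}.\n\n'
    + note
)

response = client.chat.completions.create(
    model="gpt-4o-mini",   # assumption: substitute your approved model
    temperature=0,
    messages=[{"role": "user", "content": prompt}],
)
try:
    feature = json.loads(response.choices[0].message.content)
except json.JSONDecodeError:
    feature = {"family_history_heart_disease": "unknown", "evidence": ""}

print(feature)  # e.g. usable as an input to a risk score or eligibility rule
```
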

Pre-Prepared Generative Content

Not all generative AI uses need to be ‘live’ and direct to the patient. Recent research indicates that patients like the empathetic output from ChatGPT for answering medical questions [25], but the risk of inaccurate information remains a barrier.

Another approach is to map patient queries to validated information. We often know the kinds of questions patients ask, and we can generate content addressing these questions for different health literacy levels, cultural backgrounds and languages using one or more LLMs. Domain specialists can vet content to ensure it is accurate. In a chatbot, a patient query that requires clinical information can be mapped to the closest approved content, similar to the methods used in retrieval augmented generation.
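
A sketch of this “map to approved content” pattern is below: embed a library of clinician-vetted answers, return the closest match to a patient query, and escalate to a human when nothing is similar enough. It assumes the sentence-transformers package; the content, threshold and escalation message are illustrative.

```python
# Sketch: routing patient questions to clinician-vetted answers (illustrative only).
# Assumes sentence-transformers; threshold and content are made up.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

approved_answers = {
    "How should I prepare for my colonoscopy?": "Vetted preparation instructions...",
    "Is it normal to feel tired after chemotherapy?": "Vetted fatigue advice...",
}
questions = list(approved_answers)
q_vecs = encoder.encode(questions, normalize_embeddings=True)

MIN_SIMILARITY = 0.6  # assumption: below this, hand off to a human

def answer(patient_query: str) -> str:
    v = encoder.encode([patient_query], normalize_embeddings=True)[0]
    scores = q_vecs @ v
    best = int(scores.argmax())
    if scores[best] < MIN_SIMILARITY:
        return "I'm not sure about that one; let me connect you with the care team."
    return approved_answers[questions[best]]

print(answer("What should I eat before my colonoscopy?"))
```
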

Conclusions

You may think I’m against AI innovation by raising all these issues. However, I genuinely believe that we risk slowing innovation if researchers, vendors and customers do not consider what it takes to deploy this new AI technology into clinical settings safely.

It’s not clear that LLMs alone can avoid hallucinations and the other shortcomings mentioned; it may be that a combination of different technologies working together is needed to make AI systems robust for clinical use. You may rest assured that many people are working on solutions to these problems.

There is a ‘gold-rush’ mentality to publish articles and launch products which do not provide potential users with the information required to make responsible decisions on how (and if) to use AI tools for clinical use. I hope this has helped to inform you on what questions to ask.

References

18. Rodriguez J, Padilla D, Bruce L, Thow B, Pradhan M. Equitable Machine Learning for Hypoglycaemia Risk Management. MedInfo 2023, Sydney, Australia. In press.

malcolmp · Healthcare AI SIG

Adjunct Professor of Digital Health, University of Sydney. Entrepreneur. MBBS, PhD, FAIDH.