We took ChatGPT in for a Clinical NLP checkup

Ranjani Ramamurthy
llmed.ai
Mar 6, 2023

In a prior article, I experimented with ChatGPT and prompt engineering to summarize clinical narratives into a clinical note, a document that records a patient’s clinical status.

In this article, I explore the use of ChatGPT “out of the box” (i.e., without any fine-tuning or re-training) for a broad class of clinical NLP applications that require the extraction of clinical content from text. With a robust clinical NLP engine, a variety of clinical applications (e.g., applications that detect patients eligible for clinical trials) and administrative applications (e.g., revenue cycle management systems that enable payment for healthcare services) become easy to automate.

ChatGPT was impressive in its ability to handle clinical information extraction. Out of the box, however, it did somewhat better at summarization (the topic of the previous article). Information extraction, on the other hand, needed thoughtful prompt engineering and experimentation. I anticipate that fine-tuning the underlying model on a limited corpus of tailored, clean data (such as health-system clinical notes) will be required to improve accuracy and address hallucinations in a clinical NLP setting.

Goals of testing:

In order to explore this topic deeply, I collaborated with Senthil Nachimuthu, MD, a clinical informaticist and Chief Medical Officer of Nightingale Open Science. This is a summary of our joint explorations.

We tested some core information extraction tasks:

(a) Named entity recognition (recognizing clinically relevant entities)

(b) Temporal information extraction (association of the concept of time to the clinical entity)

(c) Pronoun co-reference resolution (assigning the correct noun or noun phrase to the pronoun)

(d) Negation detection (detecting positive/negative/neutral/unclear attributions to the entity)

(e) Medication extraction (interpreting name/dosing and mode of administration)

We had stretch goals for each of the core tasks, detailed below in the examples. While testing these core tasks, we also tested disambiguation (of acronyms and jargon) as well as disjoint mentions (entity phrases whose words are not contiguous in the text).

Methodology:

Our input to the system was a corpus of de-identified clinical notes. Information extraction required experimentation with prompt engineering (i.e., repeated trial and error with different prompts to elicit the best behavior from the ChatGPT engine). We also asked the system to share the probability that its output was “correct” and to explain the reasoning behind the output. The output was adjudicated by a team of clinical subject matter experts.

This was a manual exercise. The experiments were creative and iterative, and the assessment of the output was subjective rather than objective (no F1 scores here). ChatGPT’s output is probabilistic. So, YMMV!
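
We ran everything interactively through the ChatGPT interface. For readers who want to script a similar loop, here is a minimal sketch assuming the OpenAI Python client as it existed at the time of writing (openai < 1.0); the model name, API key, prompt wording, and note text are placeholders, not our exact setup.

```python
# A minimal sketch of the kind of extraction request we issued. We worked in
# the ChatGPT web interface; this uses the OpenAI Python client (< 1.0) as an
# assumed stand-in. The model name, key, prompt, and note are placeholders.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

DEIDENTIFIED_NOTE = "..."  # a de-identified clinical note goes here

prompt = (
    "Extract all clinically relevant diagnoses ONLY from this blurb of text. "
    "Give me the probability of how sure you are about the results "
    "(low/medium/high), and explain the reasoning behind your output.\n\n"
    + DEIDENTIFIED_NOTE
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # assumed model; not necessarily what ChatGPT's UI runs
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # reduce run-to-run variation; output is still probabilistic
)

print(response["choices"][0]["message"]["content"])
```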

Conclusions:

  1. Named entity recognition: This involved the identification of clinical entities like diagnoses, symptoms, and medications. The performance of the system was good. In some cases, we did not agree with what the system chose as a “diagnosis” (e.g., an allergy to a medication), but our team of subject matter experts agreed that the output of the system was within the threshold of expected outputs.
  2. Temporal information extraction: The task here was to accurately associate the concept of time with the clinical entity: for example, while interpreting a diagnosis, determining whether it is the patient’s current active complaint or part of the past medical history (PMH). The stretch goal was to see if the system could infer whether a past condition was chronic or potentially resolved (say, hypertension vs. pharyngitis). The system did well.
  3. Pronoun co-reference resolution: A simple but important example in clinical notes is distinguishing whether a diagnosis pertains to the patient or to a family member. The results were good.
  4. Negation detection: This helps distinguish between positive, negative, neutral, and unclear statements, for example deciding whether a particular symptom was present or absent. The output was accurate.
  5. Medication extraction: We looked at extraction of all medications, mode of administration, dosing, etc. The system nailed this component of the testing exercise.

Most experiments tested disambiguation: the system’s ability to correctly expand acronyms and interpret jargon and abbreviations. The system was more right than wrong.

As mentioned before, information extraction required explicit step-wise prompts to work well. The extent of experimentation and prompt engineering was proportional to the complexity of the clinical notes and the task at hand.
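
To make “explicit step-wise prompts” concrete, here is a sketch of how the instructions accreted across iterations. The rule strings paraphrase prompts that appear in the examples that follow; the helper itself is illustrative.

```python
# A sketch of how the prompts grew step-wise: after adjudicating each output,
# we appended another explicit rule. The rule strings paraphrase prompts shown
# in the examples below; the helper itself is illustrative.
BASE_TASK = "Extract all clinically relevant diagnoses ONLY from this blurb of text."

REFINEMENTS = [
    "Differentiate between PMH and current chief complaint.",
    "Note if the patient has had a procedure to resolve a pathology.",
    "Note if the same diagnosis has been mentioned multiple times with a resolution.",
    "Presenting symptoms are not a diagnosis.",
    "Allergies tend not to resolve over time.",
    "Give me the probability of how sure you are about the results (low/medium/high).",
]

def build_prompt(rounds: int) -> str:
    """Return the prompt as it looked after `rounds` refinement iterations."""
    return " ".join([BASE_TASK] + REFINEMENTS[:rounds])

print(build_prompt(3))
```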

Experiments and their outputs:

In each example below, we’ll share the input (clinical note), the prompt, and the output from ChatGPT. We’ll also share the process involved in engineering the prompts appropriately to get the results we expected, and an explanation of the output.

Example 1: Extraction of entities (diagnosis) with disambiguation and temporal co-reference resolution

This was a rather complicated example of a patient with multiple chronic conditions. Diagnoses and pathology were mentioned multiple times (and sometimes with a resolution).

We’ve highlighted some interesting parts of the clinical note.

Input:

Prompt:

Extract all clinically relevant diagnosis ONLY from this blurb of text. Differentiate between PMH and current chief complaint. Note if the patient has had a procedure to solve a pathology. Note if the same diagnosis has been mentioned multiple times with a resolution. Presenting symptoms are not a diagnosis. Allergies tend to not resolve over time. Give me the probability of how sure you are about the results (low/medium/high). Output the information as a table. Schema: Diagnosis | PMH Chronic / PMH resolved / Active- Resolved / Active | Probability

Output:

Explanation of the output:

The output is as we’d expect.

We were challenging the system not only to extract all clinical diagnoses but to get the status of each diagnosis correct. A diagnosis could be in the past and either resolved or now a chronic condition for the patient. The diagnosis could also be part of the current set of problems the patient is dealing with, yet be transient.

We also asked the system to give us a probability of how “right” it thought the answer was, and to output the results as a table (with a schema we proposed).
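
One practical benefit of asking for a pipe-delimited table with a fixed schema is that the output is easy to post-process. Here is a minimal parsing sketch; the sample rows are illustrative, not actual model output.

```python
# A minimal sketch for post-processing the pipe-delimited table we asked for.
# The sample text below is illustrative, not actual model output. Note that
# the middle column header is the full status vocabulary from our schema.
SAMPLE_OUTPUT = """\
Diagnosis | PMH Chronic / PMH resolved / Active-Resolved / Active | Probability
Hypertension | PMH Chronic | High
Pharyngitis | PMH resolved | Medium
"""

def parse_table(text: str) -> list[dict]:
    """Split each pipe-delimited row into a dict keyed by the header cells."""
    rows = [line for line in text.strip().splitlines() if "|" in line]
    header = [cell.strip() for cell in rows[0].split("|")]
    return [dict(zip(header, (c.strip() for c in row.split("|")))) for row in rows[1:]]

for record in parse_table(SAMPLE_OUTPUT):
    print(record)
```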

This is a long but very interesting example and took many iterations of prompt engineering to get right. It was also used in testing an aspect of co-reference resolution.

Some examples of prompts:

“Note if the patient has had a procedure to resolve a pathology”

“Recognize if the same diagnosis has been mentioned multiple times, with or without a resolution”

“Allergies do not tend to resolve over time”

“Presenting symptoms should not be confused with diagnosis”

Here’s the explanation of the output from ChatGPT.

Note that we do not know if the CABG and the AAA were during the same hospital course. But, with the context provided in the note, it is a reasonable explanation.

The explanation for probabilities is interesting.

One could argue that “the patient’s renal function returned to normal over several days” should give the system high confidence that this condition was resolved. I tried that, and this is the output.

Example 2: Extracting symptoms + a curveball on disambiguation

This exercise was to extract symptoms where there was some uncertainty.

Input:

Prompt:

Extract all symptoms only from the text below. Differentiate between symptoms that are present vs. absent. Give me the probability (high/ medium/ low) of how sure you are about the result. Add a note on the probabilities and why you think so. Output as a table. Schema: Symptoms | Present/Denies | Probability. Expand all acronyms. Output as a separate table.

Output:

Explanation of the output:

Expansion of acronyms:

A curveball:

“SBP” was not part of the clinical note we started with.

Senthil and I wanted to test disambiguation for the acronym “SBP”. We were trying to see whether SBP would be expanded to “Spontaneous Bacterial Peritonitis” or “Systolic Blood Pressure”, and to understand the reasoning, given that SBP appears in the past medical history and hypertension (HTN) is also mentioned in the clinical note.

Here’s an interesting explanation from the system on the reasoning for the disambiguation.

We went back and forth a bit, but the system steadfastly maintained that the probability of this being Systolic Blood Pressure was higher. Given that Spontaneous Bacterial Peritonitis is a common infection in patients with cirrhosis of the liver, which is not in this patient’s history, we moved on.
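
For readers who want to replicate this kind of probe, here is a sketch of the pattern: drop the same acronym into note variants with different context and compare the expansion and reasoning. The note text and variants are illustrative placeholders, not our actual notes.

```python
# A sketch of the disambiguation probe: the same acronym is inserted into two
# note variants, and the expansion plus reasoning is compared. All note text
# and variants here are illustrative placeholders.
CANDIDATES = ("Spontaneous Bacterial Peritonitis", "Systolic Blood Pressure")

NOTE_VARIANTS = {
    "htn_context": "PMH: HTN, hyperlipidemia. SBP noted on prior visits.",
    "cirrhosis_context": "PMH: cirrhosis of the liver. SBP noted on prior admission.",
}

def probe(note: str) -> str:
    """Build a prompt asking the model to expand 'SBP' using only the note."""
    return (
        f"Expand the acronym 'SBP' in the note below. Choose between "
        f"'{CANDIDATES[0]}' and '{CANDIDATES[1]}', give a probability "
        "(low/medium/high), and explain your reasoning using only the note.\n\n"
        + note
    )

for name, note in NOTE_VARIANTS.items():
    print(f"--- {name} ---\n{probe(note)}\n")
```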

Example 3: Pronoun co-reference resolution

This exercise tested whether the system could correctly separate the patient’s own history from the family history.

Input:

Prompt:

Extract the diagnoses only. Differentiate between PMH, FamHx and chief current complaint. Output the information as a table with the following Suggested Schema: Diagnosis | Status (PMH/ FamHx/ Current Complaint| Relationship to patient ). Tell me the reasoning for why you’ve output the diagnoses.

Output:

Explanation of the output:

We were investigating difficulties in identifying the individual associated with a specific diagnosis. A patient’s family medical history can have relevance to their present condition. The system was able to accurately handle this.

Example 4: Medication extraction

This was the simplest of all examples we’ve shared.

Input:

Prompt:

Extract all medication information from the snippet below. Expand abbreviations. Output as a table: Medications | Strength | Dosage. Make sure that you understand what qam, qpm and qid mean.

Explanation of the output:

Medication extraction has always been an interesting problem in clinical NLP. In this experiment, there was a complex set of medication instructions on discharge. The system needed a bit of a reminder for a few of the acronyms but otherwise nailed this.
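
For reference, a lookup table as small as the one below is enough to verify the system’s expansion of the dosing abbreviations our prompt called out; the expansions are standard Latin sig abbreviations, and the sample instruction is illustrative.

```python
# A small reference table for verifying the system's expansion of the Latin
# dosing abbreviations our prompt called out explicitly. The sample
# medication instruction below is illustrative.
DOSING_ABBREVIATIONS = {
    "qam": "every morning",
    "qpm": "every evening",
    "qid": "four times a day",
}

def expand_sig(sig: str) -> str:
    """Replace known dosing abbreviations in a medication instruction."""
    words = [DOSING_ABBREVIATIONS.get(w.lower(), w) for w in sig.split()]
    return " ".join(words)

print(expand_sig("Lisinopril 10 mg qam"))  # -> Lisinopril 10 mg every morning
```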

Stretch tasks: Event Extraction, Q&A

We tried two other tasks on involved, complex longitudinal summaries. The input was either a patient’s history across many years or a prolonged hospital course.

The tasks were: clinical event extraction and Q&A. In the former, we had the system extract the most important clinical events that guided the patient’s clinical journey. In the latter, we asked the system specific questions to test for accuracy.
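
To give a flavor of the Q&A framing, here is a hedged sketch of one way to pose a question per prompt against the full record; the questions and the note placeholder are illustrative, not our exact prompts.

```python
# A sketch of the Q&A framing over a longitudinal record: each question is
# posed against the full note, and the answers are adjudicated by clinicians.
# The questions and note text here are illustrative placeholders.
LONGITUDINAL_NOTE = "..."  # multi-year history or prolonged hospital course

QUESTIONS = [
    "When was the patient first diagnosed with this condition?",
    "What procedures did the patient undergo during this admission?",
    "Which medications were changed at discharge, and why?",
]

def qa_prompt(question: str) -> str:
    """Frame one question against the note, asking the model not to guess."""
    return (
        "Answer the question below using ONLY the clinical note provided. "
        "If the note does not contain the answer, say so.\n\n"
        f"Question: {question}\n\nNote:\n{LONGITUDINAL_NOTE}"
    )

for q in QUESTIONS:
    print(qa_prompt(q), "\n")
```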

These experiments required significant prompt engineering. We plan a separate article on this topic, as we think that, with domain-specific fine-tuning, these tasks could activate some very impactful clinical use cases.

In summary, clinical NLP tasks using ChatGPT perform well out of the box. This makes life easier for companies looking to build useful applications, as they do not have to build their own large language models for clinical text. Scientists will need less data for custom training and tuning their models, which in turn reduces the cost of cleaning and labeling. In the past, I’ve seen that building valuable applications is not enough: being transparent about the source of the data and building explainable systems are what gain trust and traction with customers, as does answering that all-important question of how all this technology is going to be paid for. Those challenges do not change. But it sure will be easier and quicker to get going on providing value.

Thanks for reading. I plan more articles with physician collaborators on other interesting applications of LLMs in health.

Ranjani Ramamurthy
llmed.ai

Product Management, MD, Cancer Research, Engineer, Health-Tech advisor, GH Labs, ICGA, Fred-Hutch, LLS, ex-Microsoft, pediatric cancer research advocate.