Summarizing patient histories with GPT-4

Ranjani Ramamurthy
Published in llmed.ai
10 min read · Apr 13, 2023
Photo by National Cancer Institute on Unsplash

This is the third in my series of articles on using ChatGPT for clinical use cases. It is a follow-up to my last article, on experiments with ChatGPT for clinical natural language processing, where I tested the accuracy and ease of extracting clinical entities from individual free-text notes.

In this article, my clinical collaborator, Dr. Sravanthi Parasa, and I conclude our discussions on the topic of information extraction. We discuss how ChatGPT Plus (i.e., GPT-4) can be used to extract clinically relevant information (aka “events”) from a set of clinical notes.

We define an event as an observation by a clinician, or a lab, procedure, or imaging finding, that informs care for the patient.

Goals of Testing:

Patients with chronic diseases, cancers, prolonged hospital stays, etc., can have long and complex medical histories. If clinically relevant events can be succinctly and accurately extracted and summarized from their “longitudinal” health record, it can help multiple stakeholders.

For clinicians:

  • An application integrating event extraction, embedded within the electronic health record (EHR), can provide a condensed chronological summary of the patient’s health. Physicians can use this information ahead of clinic visits, and it can help enable safer transfers of care between providers. Customized solutions, for example, can help flag gaps in care or deviations from evidence-based protocols.

For patients and caregivers:

  • An application that can ingest a patient’s clinical records and summarize the relevant clinical events will help both patients and caregivers by giving them a shared understanding of the patient’s health.

Dr. Parasa and I investigated both the extraction of clinical events and Q&A on the input set of clinical notes.

Methodology:

The input to our system was a set of de-identified clinical notes. Prompt engineering was significantly more involved than in our prior experiments with conversational summarization and clinical NLP, because we had to deal with a set of notes that was sometimes very large.
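
We ran our experiments in the ChatGPT Plus interface, but the same setup can be reproduced programmatically. Here is a minimal sketch using the OpenAI Python client; the note titles, the “===” delimiter convention, and the instruction wording are our own illustration, not a prescribed format:

```python
# Minimal sketch: assemble several de-identified notes into one prompt and
# send it to GPT-4. We actually used the ChatGPT Plus UI; the client call,
# note titles, and delimiters below are our own illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

instruction = (
    "Input is a sequence of notes with the longitudinal clinical journey "
    "of a patient. Generate relevant clinical events in the journey of "
    "this patient, with pertinent findings for each."
)

notes = {
    "Admission note": "...",        # de-identified free text goes here
    "GI consult note #1": "...",
    "EGD procedure note": "...",
}

# Delimit each note clearly so the model can attribute events to a source.
prompt_body = "\n\n".join(
    f"=== {title} ===\n{text}" for title, text in notes.items()
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": instruction + "\n\n" + prompt_body}],
)
print(response.choices[0].message.content)
```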

We share results from two experiments.

Our first experiment involved a hospital discharge summary of a patient, from which we extracted relevant clinical events.

The second was a set of clinical notes describing the hospital course of a patient. We had the system generate a “discharge summary” and compared it with the original clinician-written discharge summary.

We also asked the system to share the probability of “correctness” of the output and to explain the reasoning behind it. Once we were satisfied with the output from the system, we engaged in Q&A, and we were specific in our prompts in requiring GPT-4 to answer our questions based only on the text input that we shared.

In some cases, the system would lose context midway through experiments (and we were aware of the 25-messages-per-3-hours limit set by OpenAI). In those situations, we waited until we regained access to GPT-4 and re-ran our experiments.
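
One practical tip: you can estimate ahead of time whether a batch of notes will fit in the model’s context window using OpenAI’s tiktoken tokenizer. A rough sketch (the 8,192-token window matches the base GPT-4 model we had access to; the headroom reserved for the completion is our own assumption):

```python
# Rough pre-flight check: will this input fit in the model's context window?
import tiktoken

GPT4_CONTEXT_TOKENS = 8192   # base gpt-4; other variants differ
RESERVED_FOR_OUTPUT = 1500   # headroom for the completion (an assumption)

def fits_in_context(text: str, model: str = "gpt-4") -> bool:
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text)) <= GPT4_CONTEXT_TOKENS - RESERVED_FOR_OUTPUT
```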

Key takeaways:

  • Clinical event extraction worked well. We were able to get the system to output information at varying levels of detail, in various styles, etc.
  • Evidence for the clinical events was accurate, and the confidence the system expressed in the accuracy of its output was good.
  • The Q&A was interesting and accurate.
  • When our input exceeded GPT-4’s token limit, we had trouble retaining context between prompts and sometimes encountered hallucinations. We were able to figure out a work-around (Experiment 2, Prompt 4). In the future, however, we’d use an orchestration engine to ensure that context is transferred reliably (see the sketch after this list).
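
As a sketch of the kind of orchestration we have in mind, one option is map-reduce summarization: extract events from each note independently, then merge the per-note results, so that no single call depends on the chat session remembering earlier turns. The function names and prompt wording below are illustrative, not a specific library:

```python
# Illustrative map-reduce orchestration: summarize each note on its own,
# then merge the per-note event lists. No step relies on the chat session
# remembering earlier turns. Names and prompt wording are our own.
from openai import OpenAI

client = OpenAI()

def _complete(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def summarize_longitudinal(notes: list[str]) -> str:
    # Map: extract clinically relevant events from each note independently.
    per_note = [
        _complete(
            "Extract the clinically relevant events from this note, "
            "with evidence for each:\n\n" + note
        )
        for note in notes
    ]
    # Reduce: merge the per-note event lists into one chronological summary.
    merged = "\n\n".join(per_note)
    return _complete(
        "Merge these event lists into a single chronological summary "
        "of the patient's hospital course:\n\n" + merged
    )
```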

We think that a production-level clinical event extraction application, with guardrails to detect and address hallucinations, and in combination with a visualization tool, could provide much value to providers, patients and their families.
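
One simple guardrail, assuming the model is prompted to quote its evidence verbatim, is to flag any extracted event whose quoted evidence cannot be located in the source notes. A toy sketch (the event schema is our own; a real system would need fuzzier matching and clinical review):

```python
# Toy grounding check: if the model is prompted to quote its evidence
# verbatim, flag any event whose quote cannot be found in the source notes.
# The event schema here is our own assumption.
def flag_ungrounded(events: list[dict], source_text: str) -> list[dict]:
    """events: [{"title": ..., "evidence": ...}, ...]"""
    normalized_source = " ".join(source_text.lower().split())
    suspect = []
    for event in events:
        quote = " ".join(event["evidence"].lower().split())
        if quote not in normalized_source:
            suspect.append(event)  # evidence not found verbatim; review it
    return suspect
```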

Experiments and their outputs:

In each example below, we’ll share the input (clinical notes), the prompts, and the output from GPT-4. We’ll also share the process involved in engineering the prompts appropriately to get the results we expected, and an explanation of the output.

Example 1: Extraction of clinical events for an ICU course

This was a fairly simple text for the system. The clinical course, while complex, had been summarized prior to input into GPT-4. The evidence (i.e., procedures, findings, and management) was clearly articulated.

In the first couple of iterations, the system merely echoed the input, just reformatted as bullet points. We had to nudge the system toward a logical grouping of information. In this example, we did not encounter any hallucinations.

In this experiment, we show examples of event extraction with varying levels of detail, evidence for every clinical event along with a confidence of “correctness,” and some Q&A.

Input:

To help you as you look at the input below, we’ve highlighted (in red) a few important events as well as details for Q&A.

Prompts & Outputs:

Prompt 1: Input is the hospital course of a patient. Generate relevant clinical events in the journey of this patient. For each event, put in an appropriate title. For example, Step 1: Admission and Findings. For each clinical event, include pertinent lab/procedure/imaging findings as well as pertinent negatives. I do not need information on management. The intended audience is a doctor. Tag this output as “LLM Output A”

Output 1:

GPT-4 generated a very accurate output here, though probably more verbose than needed for a busy physician. To help you as you look at the output below, we’ve highlighted (in green) a few important events.

Prompt 2: Condense LLM Output A, retaining clinical steps. Include information of important lab and imaging results (including pertinent negatives). Tag this new output as LLM Output C. Intended audience is a doctor.

Output 2:

The output here is similar in content to output 1, but more succinct. Some highlights in green.

Prompt 3: In the context of this 82 year old patient, provide a very succinct explanation for the 4 clinical events generated for his hospital journey. Supply a couple of sentences of evidence for each clinical event. For each generated event, share confidence (high/medium/low).

Output 3:

Prompt 4: Q&A: Did the patient have pneumonia at presentation? Restrict the response to content input in this thread. If you do not know, respond “I do not know”.

Output 4:

Prompt 5: Q&A: Did the patient have sinus tachycardia on presentation? Restrict the response to content input in this thread. If you do not know, respond “I do not know”.

Output 5:

Prompt 6: Q&A: Did the patient have an advance directive on file? Restrict the response to content input in this thread. If you do not know, respond “I do not know”.

Output 6:
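
The restriction clause we appended to each of the three Q&A prompts above is easy to factor into a reusable template. A small sketch (the helper name is ours):

```python
# The restriction clause from the Q&A prompts above, as a reusable template.
def restricted_question(question: str) -> str:
    return (
        f"Q&A: {question} "
        "Restrict the response to content input in this thread. "
        'If you do not know, respond "I do not know".'
    )

# e.g. restricted_question("Did the patient have pneumonia at presentation?")
```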

Example 2: Extraction of clinical events for a multi-day inpatient hospital stay

The input for this experiment was a set of clinical notes covering a multi-day hospital course for a patient. The input consisted of an admission note, preliminary labs, a specialist consult, hospital progress notes, an endoscopy report, and a repeat specialist consult note.

We also had access to the clinician-written discharge summary for the patient, which we “held back”. As part of our experiment, we compared the discharge summary generated by GPT-4 against the clinician-written discharge summary (prompt 3).

In this experiment, we show examples of event extraction with varying levels of detail, comparison of a GPT-4 generated discharge summary vs. clinician-written discharge summary, as well as one Q&A.

Input:

The input content is too involved to reproduce verbatim. We’ve included succinct summaries of each clinical note we input to the system. We’ve bolded pertinent clinical events.

Also, to address the token limit that we were encountering with GPT-4, we deleted some of the content in the notes. We re-surfaced this content in the Q&A section (prompt 4).
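
The trimming we did by hand can also be approximated in code: order the note sections by importance and drop from the end until the prompt fits a token budget. A sketch using tiktoken (the budget value and the priority ordering are our own assumptions):

```python
# Sketch of the trimming we did by hand: keep sections in priority order and
# drop from the end until the assembled prompt fits a token budget.
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")
TOKEN_BUDGET = 6500  # leave room for the completion within an 8K window

def trim_to_budget(sections: list[str]) -> list[str]:
    """sections are ordered most- to least-important."""
    kept = list(sections)
    while kept and len(encoding.encode("\n\n".join(kept))) > TOKEN_BUDGET:
        kept.pop()  # drop the least important remaining section
    return kept
```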

Clinical note 1: Admission note : 75 y.o. female with a history of COPD, CHF, DM II, HTN, HLD, TIA, DVT, PE presents with ongoing nausea and 10 lbs weight loss over the past 2 weeks. Recently off oxycodone, which caused constipation. No new medications. Hiatal hernia previously diagnosed. Started on omeprazole by PCP. Vitals: BP 146/71, Pulse 86, Temp 36.8°C, Resp 17, SpO2 91%, Weight 68.5 kg.

Clinical note 2: GI consult Note #1

75 yo female with postprandial nausea, dysphagia, and 8lb weight loss in 2 weeks. Suspected narcotic-induced gastroparesis. EGD planned to rule out reflux esophagitis, esophageal dysmotility, malignancy.

Clinical note 3: EGD procedure note

EGD report shows normal mucosa in the entire esophagus, stomach, and duodenum. A 4 cm hiatal hernia and Hill Grade IV gastroesophageal flap valve were identified. No specimens collected. Recommendations include continuing current medications, resuming previous diet and anticoagulant, and returning to hospital ward for ongoing care.

Clinical note 4: GI consult Note #2

75 yo female with postprandial nausea, dysphagia, and 8lb weight loss over 2 weeks. Suspected narcotic-induced gastroparesis. Normal EGD and UGI. Speech recommends bite-size diet due to ill-fitting dentures. Continue home PPI dose, minimize narcotic use, encourage p.o. intake per speech recommendations, and up and active as safely tolerated. Consider GES and/or esophageal manometry as outpatient if symptoms persist. Continue MiraLAX and consider short-term use of Reglan if needed. Patient to connect with maxillofacial surgeon for better fitting dentures. GI follow-up as an outpatient.

Clinical note 5: Discharge summary from physician.

75 yo female with COPD, CHF, DM II, HTN, HLD, TIA, DVT, PE, presented with ongoing nausea and weight loss. GI consulted; normal UGI series and EGD. Speech recommended bite-size diet due to ill-fitting dentures. Continued MiraLAX for stool burden. Goals of care not discussed.

Prompts & Outputs:

Prompt 1: Input is a sequence of notes with the longitudinal clinical journey of a patient. Generate relevant clinical events in the journey of this patient, i.e. generate a discharge summary. For each event, put in an appropriate title. For example, Step 1: Admission and Findings. For each clinical event, include pertinent lab/procedure/imaging findings as well as pertinent negatives. I do not need information on management. The intended audience is a doctor. Tag this output as “LLM Output A”

Output 1:

Here, we’ve highlighted events in green.

Prompt 2:

Condense LLM Output A, but include important findings, lab and imaging results. Please retain the clinical steps. Tag this output as LLM Output B.

Output 2:

Prompt 3: I am a busy doctor and I’ve written down the following summary that follows (Tagged: “Orig Discharge Summary”). How accurate do you think “Orig Discharge Summary” is as compared to LLM Output B. <clinician-written discharge summary included>

Here, we also see an example of how GPT-4 edits the clinician-written discharge summary to include some missing details.

Output 3:

GPT-4’s revisions highlighted in green.

Prompt 4: Q&A: The 75 y.o patient was on the following medications. Use content from the previous chat only. Could what she presented with have been caused by any of these medications? < included a list of 25 medications>

Note that in this example, we re-surfaced the content that had been deleted for prompt 1, when we were running into GPT-4’s input token limit. We were not able to reliably transfer context from the previous chat. Often, the system hallucinated presenting symptoms (drowsiness, dizziness, and confusion in this case). But interestingly, and as expected, the system then picked medications with a side-effect profile matching the hallucinated presenting symptoms!

Output 4 (with hallucinations):

Prompt 4 (adjusted): Q&A: Use the clinical journey included in the prompt (tagged LLM Output A). The patient was on the following medications. Could what she presented with have been caused by any of these medications? If so, give me that information in order of priorities, with the first medication to be the MOST likely at causing her presenting symptoms.

Here, instead of saying “use the context from the previous chat”, we input the clinical journey generated by the system. This reliably produced an output with no hallucinations! And you can see that it got the presenting symptoms and other details right.
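
In code terms, the adjustment amounts to pasting the model’s own earlier output into the new prompt rather than pointing at the chat history. A sketch (variable names are illustrative):

```python
# The adjusted prompt, sketched: embed the model's earlier output verbatim
# instead of referring to the chat history. Variable names are illustrative.
llm_output_a = "..."     # the clinical journey GPT-4 generated earlier
medication_list = "..."  # the 25 medications (elided here, as in the article)

adjusted_prompt = (
    "Use the clinical journey included in this prompt (tagged LLM Output A).\n"
    f"--- LLM Output A ---\n{llm_output_a}\n--- end ---\n\n"
    "The patient was on the following medications:\n"
    f"{medication_list}\n\n"
    "Could what she presented with have been caused by any of these "
    "medications? If so, list them in order, with the MOST likely first."
)
```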

Output 4 (adjusted):

Conclusions:

Physicians have limited time with their patients. They would be helped by context-aware, succinct, and relevant information about their patients, presented to them at point of care, and embedded into their workflow.

Large language models can be immensely valuable in extracting the significant clinical events of a patient’s journey in chronological order. Such an application can help provide context about the patient ahead of clinic visits, enable safer transfer of care between providers, and even assist the physician as she responds to patient messages. It would also be very helpful for patients with complex medical histories to understand their health journey.

For sure, we’d need guardrails to detect and address hallucinations. And presenting the information visually, as a timeline, would be an added bonus!

We plan more articles with physician collaborators on other interesting applications of LLMs in medicine. Thanks for reading!


Ranjani Ramamurthy
llmed.ai

Product Management, MD, Cancer Research, Engineer, Health-Tech advisor, GH Labs, ICGA, Fred-Hutch, LLS, ex-Microsoft, pediatric cancer research advocate.