GPT for GPs? An AI-assisted scribe

Ranjani Ramamurthy
llmed.ai
8 min read · Feb 15, 2023
Scribe, 2500 BCE, Grand Egyptian Museum, Cairo, Egypt

Would it be easy to build a high-quality AI-assisted scribe with the current generation of LLMs?

To test, I used a “system” that combines ChatGPT with prompt engineering. I used English-only synthetic text narratives between a doctor and patient (validated by US-based physicians), and I had the system generate a standard clinical note (aka SOAP note).

Before I get into the details, here are my findings at a glance. The experiments themselves follow later in the article.

  1. The out-of-the-box summarization of a doctor-patient conversation into a standard clinical note (aka a SOAP note) is really good. When I asked for an “HPI” (a summary of what brought the patient into clinic), I noticed some hallucinations (most often, the age or gender of the patient).
  2. In my experiments, I did not encounter any biased or inappropriate output.
  3. The system handled uncertainty in statements very well.
  4. It did quite well with temporal relationships (patient’s narrative of how long a certain symptom had lasted) and cleanly differentiated between a current problem versus a problem that occurred in the past (aka “past medical history”).
  5. The system did very well on co-reference resolution, and was spot on with pronoun resolution. It made few, if any, errors in differentiating between a reference to the patient’s father and one to a next-door neighbor (relevant for medical history).
  6. The system did well on inferring meaning from context. A statement like “I was huffing and puffing in the garden” was interpreted to mean that the patient was experiencing “dyspnea on exertion”.
  7. The system was able to incorporate jargon and standard clinical abbreviations/acronyms while generating notes. I tested many different prompts here to be inclusive of the different styles that clinicians use while authoring notes.
  8. My physician friends found the system-generated differential-diagnosis to be pretty thorough (but sometimes naïve).
  9. Extraction of clinically relevant concepts (or entities) and their mapping to standard clinical ontologies, was interesting. I think this requires a post in itself, with an explanation of the common challenges with clinical NLP and how well the current generation of an LLM (with no custom data or training) performs. I plan to author a post with the input of an expert clinical informaticist.

Given that the system is probabilistic, I did see some better and some average outputs. But overall, I was pretty happy with what I saw!

Background on EmpowerMD’s AI-assisted scribe (2017–2020)

In 2017, I founded a team at Microsoft Research to build EmpowerMD, an AI-assisted scribe for physicians.

A lot of creativity went into building Microsoft’s first product in this space. We then landed a partnership with Nuance Communications and, within a year, co-developed DAX (Dragon Ambient eXperience) for Microsoft Teams. Nuance is now part of Microsoft, and DAX continues to be used by clinicians. It was an amazing journey and one of my best professional experiences.

We built the AI-assisted scribe using a mix of older-generation LLMs, custom ML models, and licensed de-identified data.

Let’s start with some context: What is an AI-assisted scribe?

An AI-assisted scribe starts with audio of a conversation between a doctor and patient and runs it through an automatic speech recognition engine. After some processing, this output is converted into a clean textual dialog. An AI-based system then extracts semantic information from this narrative (what’s clinically relevant for a physician), pulls context from the patient’s health record (their past history), and generates a clinical note for the doctor. The doctor can then edit the clinical note as she sees fit. The AI scribe is a learning system: it learns from the doctor’s edits and improves over time. Here’s a short video of the first version of the product we built at Microsoft.
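The stages above can be sketched end to end as a toy pipeline. To be clear, this is a minimal illustration and not the actual EmpowerMD system: every function name is hypothetical, and the keyword-based “extraction” stands in for what would really be an ML model.

```python
# A toy sketch of the AI-scribe pipeline: ASR -> dialog -> semantic
# extraction -> draft note. All names and logic here are hypothetical
# illustrations, not the actual EmpowerMD implementation.

def transcribe(audio_turns: list) -> list:
    """Stand-in for automatic speech recognition (ASR); here the
    'audio' is already text, so we just tidy whitespace."""
    return [t.strip() for t in audio_turns]

def to_dialog(turns: list) -> str:
    """Label alternating turns as Doctor/Patient to form a clean
    textual dialog."""
    speakers = ["Doctor", "Patient"]
    return "\n".join(f"{speakers[i % 2]}: {t}" for i, t in enumerate(turns))

def extract_findings(dialog: str) -> list:
    """Toy stand-in for semantic extraction: keep lines that mention
    a symptom keyword. A real system would use an ML model."""
    keywords = ("pain", "drainage", "hearing", "fever")
    return [line for line in dialog.splitlines()
            if any(k in line.lower() for k in keywords)]

def generate_note(findings: list) -> str:
    """Assemble extracted findings into a draft note for the
    physician to edit."""
    return "Subjective:\n" + "\n".join(f"- {f}" for f in findings)

turns = ["How can I help you today?",
         "I have pain in both ears and some gunky drainage."]
note = generate_note(extract_findings(to_dialog(transcribe(turns))))
print(note)
```

The key product point survives even in this toy version: the output is a draft, and the physician is its editor, not its author.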

Why did we build an AI-assisted scribe?

2017 was 8 years after the passage of the HITECH Act, which mandated the digitization of medical records. During this time, literature emerged showing that physicians were spending more time interacting with computers than with their patients, a concerning trend and one cause of burnout.

Our conversations with physicians revealed a clear need for products where technology could help. We homed in on a set of tools that could reduce the time they spent on administrative tasks, letting them get back to why they became physicians in the first place: to care for patients.

Physicians wanted tools to summarize longitudinal patient information succinctly, to manage their overflowing inboxes, and to assist with creating clinical notes after a patient visit. There was consensus that an “intelligent scribe” (an AI-assisted scribe) was what they wanted first. We chose to build that product.

The thought was that, with the advances in speech recognition and natural language processing at the time, it made sense to build “ambient intelligence” technology for the clinic. That is, a system that unobtrusively listened to the clinician-patient conversation, transcribed the narrative, extracted semantic meaning from the discussion, and generated a clinical note. The product idea was to make the clinician the editor, rather than the creator, of the clinical note.

The pitch was simple: “If this product is in clinic, the doctor should be looking at you (the patient), rather than at her computer.”

My recent experiments with LLMs

With all the genuine enthusiasm around the latest generation of LLMs, I wanted to understand whether the underlying technology for an AI-assisted scribe would now be easier to build and, perhaps, more accurate. Before I get going, here are some videos by Steve Seitz that introduce LLMs (part-1, part-2).

I was curious whether our ‘cold-start’ problem would be any easier today, and whether one could build a high-quality AI scribe.

For sure, we’d still have to build a compelling user experience, embed the product within an EHR that the clinician uses in her workflow, and solve the many business and engineering challenges to make it a successful product.

Here are the steps in the process:

  1. The input to the system was a dialog (in text) of a conversation between a doctor and patient. The dialogs had anywhere from 20–50 ‘turns’ (back and forth) between the doctor and patient.
  2. Summarization of these dialogs into a clinical note. I played with formats here, from basic summarization (“HPI”) to structuring the content into a “SOAP” note (the standard format of a clinical note).
  3. Enhancement of the clinical note (in 2) with more contextual information. I did this via a series of prompts, giving the system more, or more refined, information at each step (for example, I might feed in information about lab tests or the findings from an imaging study). My goal was to see if the content of the clinical note improved. I analyzed the generated clinical note at each step, looking for hallucinations and incorrect or vague information.
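Steps 2 and 3 boil down to wrapping the dialog in a summarization prompt and then appending context to it. Here is a minimal sketch; the prompt wording is my own invention for illustration, not the exact prompts used in these experiments.

```python
def build_soap_prompt(dialog: str, context: str = "") -> str:
    """Construct an LLM prompt that asks for a SOAP-format note.
    The wording is illustrative; in practice each prompt was
    refined over several iterations."""
    prompt = (
        "Summarize the following doctor-patient conversation into a "
        "clinical SOAP note with Subjective, Objective, Assessment, "
        "and Plan sections. Use standard clinical abbreviations.\n\n"
        f"Conversation:\n{dialog}\n"
    )
    if context:
        # Step 3: enrich the note with labs, imaging findings, etc.
        prompt += f"\nAdditional context:\n{context}\n"
    return prompt

prompt = build_soap_prompt(
    "Doctor: What brings you in today?\nPatient: My ears hurt.",
    context="Labs: WBC 11.2",
)
print(prompt)
```

Feeding the enriched prompt back at each step, rather than starting over, is what lets you watch whether the generated note actually improves as context accumulates.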

Here are the details

I worked with some practicing primary care physicians to generate gnarly and very realistic conversations covering the kinds of cases they encounter on a daily basis: complex medication management, chronic diseases, acute infections, vague/non-specific reports of symptoms requiring some investigation, musculoskeletal issues, etc. I tested with both “clean” dialogs and dialogs with disfluencies (e.g. ‘umm’, ‘hmmm’) and discourse markers (e.g. ‘like’). Often I’d have to refine the prompts several times to get the system to generate the output I was expecting. I used both the OpenAI playground and ChatGPT. The outputs appended below are from ChatGPT.
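For the disfluency variants, a simple filter is enough to derive the “clean” counterpart of a noisy dialog. A sketch, with a filler-word list of my own choosing (real disfluency handling in an ASR pipeline is considerably more involved):

```python
import re

# Hypothetical filler/discourse-marker list, used only to generate the
# "clean" counterpart of a noisy test utterance.
FILLERS = r"\b(?:umm+|hmm+|uh+|er+|like)\b"

def clean_disfluencies(utterance: str) -> str:
    """Strip filler words, then tidy the punctuation they leave behind."""
    text = re.sub(FILLERS, "", utterance, flags=re.IGNORECASE)
    text = re.sub(r"\s*,\s*,", ",", text)   # collapse doubled commas
    text = re.sub(r",\s*\.", ".", text)     # drop a comma before a period
    text = re.sub(r"\s{2,}", " ", text)     # collapse extra spaces
    return text.strip(" ,")

print(clean_disfluencies("Umm, it hurts, like, when I chew, hmm."))
# → "it hurts, when I chew."
```

A word-boundary blocklist like this is deliberately naive (it would happily delete a meaningful “like”), which is exactly why comparing clean vs. noisy inputs is a useful robustness test for the LLM itself.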

Example 1:

This example illustrates simple summarization, temporal relationships, and expansion of some medical terms. It also shows two different styles of reporting the information.

Patient is young, a good historian and not on any medications. No co-morbidities.

Gist of the conversation (included in lieu of the full narrative):

A doctor and a patient had a conversation about the patient’s ear pain. The patient reported having pain in both ears, fullness, and gunky drainage. The doctor asked if the patient had any changes in hearing, pain, drainage, or history of ear problems, and if the patient had tried any treatment. The patient reported temporary relief from taking ibuprofen.

Here’s the generated clinical note:

I then gave the system some contextual information. It took a few prompts to get this right.

The same note, in long-form.

Example 2:

A more complex narrative: 40+ turns. The patient has co-morbidities, takes many prescription medications, and is not linear in their communication. This was a good example of dealing with uncertainty in statements, disambiguation, and summarization of a rather long and complex conversation.

Here’s the generated version (subjective and assessment sections)

Here’s the example transcript where the doctor thinks the patient has jaw pain on the right, but it is on the left. The system does well in disambiguating between right (the side) and right (correct). This would have been a bit of a struggle before.

… (after approximately another 10 turns)

Next my clinician friend and I fed some very realistic contextual information: labs as well as physical exam findings. This was the output of the system. The plan was not exactly what my friend says they’d have done — but close enough.

And the most succinct note? This shows language generation in different styles.

Example 3:

A long narrative with a patient who has a genetic disease, multiple co-morbidities, and difficulty breathing. The prompt was to get the system to create an HPI (History of Present Illness), to be succinct, and to use acronyms that a physician would use.

Another example of experimenting with different styles of note writing.

In reality, a wealth of information is contained in a clinical note. In my next article, I plan to test clinical NLP out of the box. I’d like to share my findings on how an out-of-the-box system handles extracting relevant information, negation and uncertainty in statements, synonyms and variations in wording, temporal expressions, etc. I will also try to cover how well these entities are mapped to standard ontologies.

Thanks for reading!


Product Management, MD, Cancer Research, Engineer, Health-Tech advisor, GH Labs, ICGA, Fred-Hutch, LLS, ex-Microsoft, pediatric cancer research advocate.