I trained a healthcare LLM over a single weekend!

A medical bot for Ob-Gyns. But does it show that the current breed of clinical LLMs is on the wrong path?

Nihar Ganju MD
6 min read · Jun 28, 2023

In the spirit of rapid prototyping, I trained a clinical LLM in a weekend. Yes, you read that correctly.

Specifically, I took a knowledge set from the field of Obstetrics & Gynecology and encoded it into a large language model (LLM). The result: an LLM aligned to women’s health. I’ll describe my process, the tools, and the performance of this LLM. Ultimately, this project sheds light on an important question: which datasets yield the best practical use cases in clinical AI?

Introduction

In the wake of ChatGPT’s release, there have been countless examples (and warnings) of generative AI used in the clinical context. These demonstrate both the use cases doctors and patients want and how eagerly everyone is awaiting AI of clinical caliber. Thus far, Google has been in the game with its research on Med-PaLM, and startups like Hippocratic.AI are stepping in with massive financing rounds to develop proprietary models.

Google and Hippocratic both demonstrate the performance of their models by emphasizing success on medical boards and specialty exams. Frankly, as a doctor, that benchmark means very little to me. (I’ve never met a doctor who highlights their USMLE scores to impress me.)

So instead of waiting to get my hands on one of these systems, I decided to build one myself in the area of my expertise — Obstetrics and Gynecology.

Method

Data

For Ob-Gyns in the US, ACOG (the American College of Obstetricians and Gynecologists) publishes the highest-quality and most trusted practice guidance for women’s health. ACOG issues a series of articles as Practice Bulletins, Committee Opinions, Clinical Practice Guidelines, etc.

You will not pass your oral boards if you don’t know the Practice Bulletins cold, so every resident reviews and references this material. Needless to say, every freshly minted Ob-Gyn is expected to know this stuff. So this is what I used.

Here is a summary of everything available from ACOG. I already had a previous version of most of these files (as text and PDFs) from 2015–2016 when I took my boards.

Model

In recent months, the AI tech stack for LLMs has coalesced into a reliable, repeatable method for LLM development: take an existing off-the-shelf LLM, then extend it with the dataset you want, mapped into a vector database. This was my framework as well:

  1. Convert your source dataset into embeddings
  2. Store your embeddings in a vector database
  3. Apply a query/response interface to interact with the LLM
  4. Engineer and test prompts until you’re getting the desired results
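
To make steps 1 and 2 concrete before getting into the tooling, here is a minimal sketch using the raw OpenAI and Pinecone Python clients as they existed in mid-2023 (the index name and chunk text are placeholders, not my actual data):

import openai
import pinecone

# Connect to an existing Pinecone index (hypothetical name).
pinecone.init(api_key="...", environment="...")
index = pinecone.Index("acog-guidance")

# Pre-split guideline text; in practice these chunks come from the source PDFs.
chunks = [
    "Magnesium sulfate is recommended for fetal neuroprotection before ...",
    "...",
]

for i, chunk in enumerate(chunks):
    # Step 1: convert each chunk into an embedding vector.
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=chunk)
    vector = resp["data"][0]["embedding"]
    # Step 2: store the vector, keeping the original text as metadata.
    index.upsert(vectors=[(f"chunk-{i}", vector, {"text": chunk})])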

Less than a week ago, Matt and Rajko at a16z succinctly described this architecture (see figure). Going forward, this summary should be essential reading for anyone looking to build their own prototypes.

Source: a16z

I used:

  • LangChain as my orchestration framework. Their off-the-shelf chains make it really easy to get started.
  • OpenAI’s text-embedding-ada-002 as my embedding model.
  • Pinecone as my vector store.
  • OpenAI’s text-davinci-003 (a completion model) and gpt-3.5-turbo (a chat model), both tested with different prompts. The same Pinecone index backed both models.
  • A Telegram bot as my chat UI.
  • Pipedream to connect the Telegram bot to my app.
  • Render to host the app.

I did not perform any data labeling or reinforcement learning in this project.
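
Here is a rough sketch of how those pieces fit together, covering steps 3 and 4 as well, assuming LangChain’s mid-2023 APIs (the file name, index name, and question are placeholders):

import pinecone
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone

pinecone.init(api_key="...", environment="...")

# Load and chunk one ACOG document (repeat for the full set).
docs = PyPDFLoader("practice_bulletin_placeholder.pdf").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)

# Embed the chunks (defaults to text-embedding-ada-002) and index them.
embeddings = OpenAIEmbeddings()
vectorstore = Pinecone.from_documents(chunks, embeddings, index_name="acog-guidance")

# Off-the-shelf query/response chain over the vector store.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
    retriever=vectorstore.as_retriever(),
)
print(qa.run("What tocolytic is preferred before 32 weeks of gestation?"))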

UI/UX

Telegram makes it really easy to create a bot, and the interaction is as familiar as any instant messaging app. I wanted to share my app with close physician and nurse peers for feedback on its performance, and this was going to be the most accessible way to do it.
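
For illustration, one plausible shape for the app endpoint that Pipedream forwards Telegram updates to might look like this (a hypothetical Flask sketch, not my exact code; it assumes the qa chain from earlier and a bot token in the TELEGRAM_TOKEN environment variable):

import os

import requests
from flask import Flask, request

app = Flask(__name__)
TELEGRAM_API = f"https://api.telegram.org/bot{os.environ['TELEGRAM_TOKEN']}"

@app.route("/telegram", methods=["POST"])
def handle_update():
    # Telegram updates arrive as JSON: message.chat.id and message.text.
    update = request.get_json()
    message = update.get("message", {})
    chat_id = message.get("chat", {}).get("id")
    text = message.get("text", "")
    if chat_id and text:
        answer = qa.run(text)  # query the retrieval chain built earlier
        requests.post(
            f"{TELEGRAM_API}/sendMessage",
            json={"chat_id": chat_id, "text": answer},
        )
    return "ok"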

Results

While I had no intention of testing this model against any Ob-Gyn boards question bank, I wanted to test a common real-world scenario: texting your chief resident.

So I prompt-engineered a chief resident/intern role-play. I wanted to know: could the model perform at the caliber of a chief resident if I were the intern?

The Prompt:

Your name is Sami, an ObGyn chief resident. 
In your response, you are snarky and insult my intelligence because I am an intern.
All answers are based on the information given.
If the answer is not included, say "I don't know that." and stop.
Refuse to answer any question not about the info. Never break character.
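
For illustration, here is one plausible way to attach a persona prompt like this to the retrieval chain, so that the retrieved ACOG passages fill the {context} slot (a sketch assuming LangChain’s mid-2023 “stuff” chain and the vectorstore from earlier; not necessarily the exact wiring I used):

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

# Persona instructions from above, condensed; the retrieved passages are
# injected as {context} and the user's message as {question}.
template = """Your name is Sami, an ObGyn chief resident.
In your response, you are snarky and insult my intelligence because I am an intern.
All answers are based on the information given. If the answer is not
included, say "I don't know that." and stop. Never break character.

Information: {context}

Question: {question}
Answer:"""

sami_prompt = PromptTemplate(
    template=template, input_variables=["context", "question"]
)

sami_qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.3),
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": sami_prompt},
)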

I gave the bot a human likeness: Dr. Sami Muffarij. Sami is a dear friend and a real-life doctor. The mythical Sami-AI bot may be snarky, but in real life, Sami is a genuine gentleman physician, legendary as an instructor at The George Washington University, with an honest dedication to the field of Ob-Gyn. So he was the first doctor I thought of cloning for this scenario. Thank you, Sami, for letting me use your name.

So how did Sami-AI perform? Here are some examples:

Sami was always smart.
Not bad.

In my first iteration of testing, Sami-AI gave very textbook-style answers. The responses were long-winded, yet still incomplete and generic. But Sami-AI could be nudged toward the complete answer, as above, and the bot even understood that “mag” is shorthand for magnesium sulfate.

I first attempted to improve performance with prompt engineering, encouraging the bot to ask me clarifying questions, if necessary, before responding.

Notice here that Sami-AI again gave quite a verbose answer. Let’s compare this with an untrained, off-the-shelf ChatGPT.

Performance against ChatGPT

Comparing responses, ChatGPT gives an OK answer: it did not offer a harmful medication, but it’s not the best answer. By contrast, Sami-AI does give the best answer: cephalosporins, with a risk assessment for anaphylaxis. This is the type of answer they look for on the oral Ob-Gyn boards.

Also, Sami-AI did not ask me a clarifying question the way ChatGPT did, and I ran out of time to troubleshoot. So Sami-AI’s response, while correct, is exhaustingly complete and not what an intern needs to hear.

Conclusion

This weekend project gave me the following takeaways:

  • Vector semantic search and embeddings are incredibly powerful for retrieving knowledge from a clinical dataset.
  • Selecting the right original data source makes all the difference. If the raw data comes from a textbook or a stuffy academic source, the responses will adopt the same language and tone. This limits use cases significantly, even if the information is accurate! The model behind Sami-AI is great in an academic setting, where a senior resident may be reviewing for their board exams or an attending is teaching on rounds. However, the bot’s language would be terrible for communicating with a patient, or even a medical student or intern.
  • The dataset should be matched to the product. Meaning: know your use case first, then find the data source accordingly. Google’s Med-PaLM 2 and Hippocratic.AI built medical LLMs that appear to lack clear real-world use cases. A book-smart LLM does not solve any problems by itself. LLMs trained on lectures in residency, discussions on rounds, and patient encounters may be just as reliable and safe as medical society publications, while their communication style may be far more accessible.


Nihar Ganju MD

Nihar blends his expertise as a software engineer and a doctor. He develops technology for healthcare and leads physicians in digital care. He also practices medicine.