AI-Generated Medical Notes — Part 1

Leonardo Lara
6 min read · Sep 19, 2022


Synthetic medical documents from free text

Yes, this is an AI-generated image (OpenAI DALL·E). Credit: Leonardo Lara

When engaging in a conversation with the latest "Large Language Models" (LLMs), they can feel almost sentient to your demands, answering questions about any subject. Although these models often produce impressive outputs, they still fail to reason and stay factual on difficult real-world questions without adding a few "shots" of labeled examples to fine-tune them for a specific task.

An LLM named BLOOM, larger than OpenAI's GPT-3 with about 176 billion parameters, was announced last year by BigScience, an international community-powered project coordinated by the AI startup Hugging Face 🤗 and built by over 1,000 volunteer researchers. Unlike some of its competitors, they made this cutting-edge AI widely available to researchers around the world.

In a sequence of two articles, I'll demonstrate the MedtriX project, which generates new AI-powered medical notes from free text, based on BLOOM and the MIMIC-III (Medical Information Mart for Intensive Care) dataset, containing 59,652 freely accessible discharge summaries.

1- Part one: recognition of medical entities in the input text, methods to measure similarity against medical records, and finally the replacement of values with synthetic data.

2- Part two: AI-generated medical sections (Hospital Course, Present Illness, Social History and Past Medical History) through fine-tuning BLOOM, showing some MLOps practices and guidance on deploying LLMs.

The MedtriX application is live for your own experiments on Streamlit:

MedtriX Streamlit

All the content of this article is part of my GitHub repository:

Entities Detection

Personal Data

To recognize particular entities (patient and doctor names, age, admission date, hospital) with the best precision on medical notes, I built a DE-ID (De-IDentification) transformer model fine-tuned on RoBERTa to detect sensitive data. This kind of approach is used to anonymize real patient documents against data exposure.
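A minimal sketch of how such a de-identification pass can be run with the Hugging Face transformers pipeline. The checkpoint name below is a publicly available de-identification model and is only illustrative; the exact fine-tuned RoBERTa checkpoint used in MedtriX may differ:

```python
from transformers import pipeline

# Illustrative checkpoint: any RoBERTa-style model fine-tuned for
# de-identification (token classification over PHI labels) can be plugged in.
deid = pipeline(
    "token-classification",
    model="obi/deid_roberta_i2b2",
    aggregation_strategy="simple",  # merge sub-word pieces into whole entities
)

text = "Anne, 35F, was attended in Naval Hospital Beaufort on 10/01/2021 by Dr. Straus."

for entity in deid(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```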
The results were:

To get the best gender guess, I applied the Gender Computer script to infer the person's sex.
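A minimal sketch of that step, assuming the genderComputer package from the original Gender Computer repository (depending on the version, the constructor may need the path to its name lists):

```python
from genderComputer import GenderComputer

# Some versions require GenderComputer(nameListsPath="nameLists/") instead.
gc = GenderComputer()

# resolveGender(name, country) returns 'male', 'female', 'unisex' or None.
print(gc.resolveGender("Anne", None))      # expected: 'female'
print(gc.resolveGender("Leonardo", None))  # expected: 'male'
```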

Problems, Clinical Attention Words and Allergens

In fact, matching these medical terms to discharge summaries goes beyond just detecting patient problems or health-related words; it also means checking whether a term refers to past medical history or to a negated (denied) condition. Sometimes it's not even the patient's own problem, but rather family history.

1- After some research, I found the best-performing model for extracting problems and diseases to be the StanfordNLP Stanza biomedical models trained on the i2b2 dataset from the Department of Biomedical Informatics (DBMI) at Harvard Medical School. For the best compatibility, I wrapped it in a spaCy pipeline using a super useful package, spacy-stanza (see the sketch after this list).

2- Attention is all you need! For general medical words, the scispaCy package is the most recommended way to recognize them.

3- To have clearly distinguishable boundaries between these entities, I used medspaCy context rules based on the ConText algorithm to recognize the relationship between the detected conditions and other terms in the sentence.

spaCy displaCy Visualization

4- Regarding allergies and allergens, I created a function that extracts chemicals with spaCy Med7, plus foods and other substances based on the COMPARE (COMprehensive Protein Allergen REsource) allergen database.
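As referenced in item 1, here is a minimal sketch of how the Stanza i2b2 NER model and the medspaCy ConText rules can be chained in a single spaCy pipeline (factory and extension names follow the packages' public documentation; the full MedtriX pipeline also adds the scispaCy, Med7 and COMPARE steps from items 2 and 4):

```python
import stanza
import spacy_stanza
import medspacy  # importing medspaCy registers the "medspacy_context" factory

# One-time download of the Stanza clinical package with the i2b2 NER model.
stanza.download("en", package="mimic", processors={"ner": "i2b2"})

# Wrap Stanza in a spaCy pipeline so other spaCy components can run after it.
nlp = spacy_stanza.load_pipeline("en", package="mimic", processors={"ner": "i2b2"})

# ConText rules flag negation, family history and historical mentions.
nlp.add_pipe("medspacy_context")

doc = nlp("The patient has a history of gastritis. Denies chest pain. Mother has diabetes.")

for ent in doc.ents:
    flags = [
        name
        for name, value in [
            ("negated", ent._.is_negated),
            ("family", ent._.is_family),
            ("historical", ent._.is_historical),
        ]
        if value
    ]
    print(ent.text, ent.label_, flags)
```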

The final result of this step was a full free text with well-defined entities:

Detected Clinical Entities
Full pipeline to get entities

Similarity Document Selection

Medical notes contain several sections, such as "Chief Complaint" and "Hospital Course", as well as patient history sections like "Social History" and "Medical History". To match the entities found in the free text against the discharge summaries, let's play with some similarity measures:

1- Jaccard Similarity

Credit: fast data science

It is a really basic similarity measure for healthcare text, but it comes in handy to shrink the pool of almost 60 thousand candidate documents as fast as possible, improving the overall performance.
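A minimal sketch of this pre-filtering step; the idea is to turn the detected terms and each candidate summary into sets and keep only the top-scoring documents for the heavier ranking that follows (the set construction and cut-off value here are illustrative):

```python
def jaccard_similarity(terms_a: set, terms_b: set) -> float:
    """Jaccard similarity = |intersection| / |union| of the two term sets."""
    if not terms_a and not terms_b:
        return 0.0
    return len(terms_a & terms_b) / len(terms_a | terms_b)


def prefilter(query_terms: set, documents: dict, k: int = 100) -> list:
    """Keep only the k most similar candidates (documents: name -> term set)."""
    scored = sorted(
        documents.items(),
        key=lambda item: jaccard_similarity(query_terms, item[1]),
        reverse=True,
    )
    return scored[:k]
```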

2- UmlsBERT Similarity

With a reduced set of documents, we move on to a neat approach to match the detected terms with the best document. UmlsBERT is a contextual embedding model that integrates the Unified Medical Language System (UMLS) Metathesaurus; it was trained with an updated Masked LM objective that takes into consideration the associations between related medical words.

Credit: UmlsBERT

The UmlsBERT Similarity code:
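In essence, the detected terms and each remaining candidate document are embedded and ranked by cosine similarity. A simplified sketch of that idea follows; the checkpoint name and the mean pooling are illustrative placeholders, not necessarily the exact choices in the repository:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; swap in the UmlsBERT weights you are using.
MODEL_NAME = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()


@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    """Mean-pool the last hidden states into a single sentence vector."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state      # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)   # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)


def similarity(query_terms: str, document: str) -> float:
    """Cosine similarity between the detected terms and a candidate document."""
    return torch.cosine_similarity(embed(query_terms), embed(document)).item()
```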

Replacing, Generating and Faking Data

Each MIMIC medical record contains several PHI (Protected Health Information) labels that anonymize the original sensitive personal data.

But for medical text generation purposes, it is necessary to put each item back in its proper place. A script was written to locate these placeholders in the selected document, substituting names and numbers and generating new dates sequentially from the Admission Date. For the remaining labels that require fake terms, I could count on a helpful tool with many fake lists of names, locations and contact information, plus plenty of compatible regexes to finish the document.

Credit: KART
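A minimal sketch of the replacement idea, using the Faker library as a stand-in for the fake-data tool and simplified, hypothetical placeholder patterns (the real MIMIC PHI labels follow a more varied [** ... **] bracket format):

```python
import re
from datetime import datetime, timedelta

from faker import Faker

fake = Faker()


def fill_phi(document: str, admission_date: datetime) -> str:
    """Replace simplified, hypothetical PHI placeholders with synthetic values."""
    # Names, hospitals and contacts come from Faker's fake lists.
    document = re.sub(r"\[\*\*Name\*\*\]", lambda _: fake.name(), document)
    document = re.sub(r"\[\*\*Hospital\*\*\]", lambda _: fake.company(), document)
    document = re.sub(r"\[\*\*Phone\*\*\]", lambda _: fake.phone_number(), document)

    # Dates are generated sequentially, starting from the Admission Date.
    current = [admission_date]

    def next_date(_match):
        date = current[0]
        current[0] = date + timedelta(days=1)
        return date.strftime("%Y-%m-%d")

    return re.sub(r"\[\*\*Date\*\*\]", next_date, document)
```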

Final Document Form

As the final result, let's look at an example:

Input Text:

“Anne 35F was attended in Naval Hospital Beaufort in 10/01/2021 by Dr. Straus. The patient was presenting abdominal pain together with vomiting and diarrhea, following high fever and chills. The patient has history of gastritis and acute esophagitis. Patient makes constant use of alcohol and tobacco Patient has hx of gallstones. There are reports about a diagnosis of his parent with diabetes. A possible stroke was not evidencied in your last report.”

Detected Entities:

Selected Document (highlighted attention words):

Extras and where to go

This work continues in the next article, about new synthetic patient reports and medical sections created by AI. This first part is available for your own free-text experiments on MedtriX Streamlit.

I hope you enjoy this first part. Thank you for reading through!!

About me

I'm Leonardo, currently focused on helping Machine Learning play an important role in healthcare. Feel free to contact me on LinkedIn, and I'd be happy if you follow me on Medium to check how all this is going to end.

If you have questions or observations, please leave them in the comments!


