Medical Entity Extraction on Google Cloud: A Comprehensive Guide
In the vast landscape of healthcare data, finding meaningful medical entities is akin to finding needles in a haystack. Yet extracting entities such as medication names, dosages, and diagnoses from unstructured text is a cornerstone of healthcare data analysis.
In this blog post, we’ll delve into three distinct approaches for medical entity extraction: using the Healthcare Natural Language API, harnessing Large Language Models (LLMs), and employing open-source models. Let’s explore the strengths and considerations of each method.
Entity Extraction using Healthcare NL API
The Healthcare Natural Language API is a specialized machine learning tool within Google Cloud’s Healthcare API that understands and processes unstructured medical text (like doctor’s notes, clinical reports, etc.). It produces organized, structured data about the medical concepts found in the analyzed text. This data is much easier to use for further analysis and automation tasks than the original raw text.
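As a rough sketch of how such a request could be sent, the snippet below builds a call to the API's analyzeEntities REST method using only the Python standard library. The project ID, location, and access token are placeholders you would supply yourself (for example, a token from `gcloud auth print-access-token`), and the helper names `build_analyze_request` and `analyze_entities` are illustrative, not part of any Google client library.

```python
import json
import urllib.request

API_ROOT = "https://healthcare.googleapis.com/v1"


def build_analyze_request(project_id, location, text, access_token):
    """Build (but do not send) a POST request for the analyzeEntities method."""
    url = (
        f"{API_ROOT}/projects/{project_id}/locations/{location}"
        "/services/nlp:analyzeEntities"
    )
    body = json.dumps({"documentContent": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
    )


def analyze_entities(request):
    """Send the request and return the parsed JSON response as a dict."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder values: replace with your own project, region, and token.
request = build_analyze_request(
    "my-project", "us-central1", "Patient denies chest pain.", "ACCESS_TOKEN"
)
print(request.full_url)
# response = analyze_entities(request)  # requires valid credentials
```

Keeping request construction separate from the network call makes it easy to inspect the URL and payload before sending any patient text.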
Capabilities:
Entity Extraction: Identifies and extracts key medical concepts:
- Diseases
- Medications
- Medical procedures
- Medical devices
- Important attributes of these concepts (e.g., dosage, frequency, side effects)
Concept Mapping: Links identified entities to standard medical vocabularies. This is crucial for making the extracted information compatible with other healthcare systems. Supported vocabularies include:
- RxNorm (medications)
- ICD-10 (diseases)
- MeSH (broad medical terms)
- SNOMED CT
Limitations:
- Splits compound concepts into separate single-label mentions. For example, the symptom "leg pain" is returned as "leg" (anatomical structure) and "pain" (problem).
- Does not work well with brand or local medicine names such as Crocin or Zandu Balm.
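One possible workaround for the first limitation is to merge adjacent mentions after the fact. The sketch below is a hypothetical post-processing step, not a feature of the API: it joins an anatomical-structure mention with a problem mention that starts exactly one character (a space) later. It assumes `beginOffset` counts characters and that mentions follow the `entityMentions` shape shown in this post.

```python
def merge_adjacent_mentions(mentions,
                            first_type="ANATOMICAL_STRUCTURE",
                            second_type="PROBLEM"):
    """Merge a first_type mention directly followed by a second_type mention."""
    ordered = sorted(mentions, key=lambda m: m["text"].get("beginOffset", 0))
    merged = []
    i = 0
    while i < len(ordered):
        current = ordered[i]
        nxt = ordered[i + 1] if i + 1 < len(ordered) else None
        if (nxt is not None
                and current["type"] == first_type
                and nxt["type"] == second_type):
            start = current["text"].get("beginOffset", 0)
            end = start + len(current["text"]["content"])
            # Adjacent means the next mention starts right after a single space.
            if nxt["text"].get("beginOffset", 0) == end + 1:
                merged.append({
                    "type": second_type,
                    "text": {
                        "content": current["text"]["content"] + " "
                                   + nxt["text"]["content"],
                        "beginOffset": start,
                    },
                })
                i += 2
                continue
        merged.append(current)
        i += 1
    return merged


mentions = [
    {"type": "ANATOMICAL_STRUCTURE", "text": {"content": "Leg", "beginOffset": 10}},
    {"type": "PROBLEM", "text": {"content": "pain", "beginOffset": 14}},
]
print(merge_adjacent_mentions(mentions))
```

The merged mention drops the linked entity IDs, because neither original ID describes the combined concept.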
Example:
Input:
sample_text = "Course in Hospital: Mr. Johnson arrived in the ER from nursing home with a three-day history of worsening shortness of breath, yellow-green sputum, and increased sputum production. He was subsequently diagnosed with a COPD exacerbation and was satting at 84% on 4L O2 by nasal prongs. Medical presciptions : TAB PARACIP 500 MG two TABLETS PER ORAL THRICE DAILY AFTER FOOD FOR 5 DAYS INJ. AUGMENTIN 1.2 GM, INTRAVENOUS, THREE TIMES A DAY X 4 DAYS"
Truncated output:
'entityMentions': [{'mentionId': '1',
'type': 'SEVERITY',
'text': {'content': 'worsening', 'beginOffset': 97},
'linkedEntities': [{'entityId': 'UMLS/C1457868'},
{'entityId': 'UMLS/C4084902'}],
'confidence': 0.9495142102241516},
{'mentionId': '2',
'type': 'PROBLEM',
'text': {'content': 'shortness of breath', 'beginOffset': 107},
'linkedEntities': [{'entityId': 'UMLS/C0013404'}],
'temporalAssessment': {'value': 'CLINICAL_HISTORY',
'confidence': 0.9956560730934143},
'certaintyAssessment': {'value': 'LIKELY',
'confidence': 0.9996985197067261},
'subject': {'value': 'PATIENT', 'confidence': 0.9996469616889954},
'confidence': 0.9954680800437927},
{'mentionId': '3',
'type': 'PROBLEM',
'text': {'content': 'yellow-green sputum', 'beginOffset': 128},
'linkedEntities': [{'entityId': 'UMLS/C3812802'}],
'temporalAssessment': {'value': 'CLINICAL_HISTORY',
'confidence': 0.9959474205970764},
'certaintyAssessment': {'value': 'LIKELY',
'confidence': 0.9996469616889954},
'subject': {'value': 'PATIENT', 'confidence': 0.9996917247772217},
'confidence': 0.6370829343795776}
}
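A response shaped like the truncated output above can be flattened into simple rows for downstream use. The following is a minimal sketch, assuming the response has already been parsed into a Python dict; `summarize_mentions` and its `min_confidence` threshold are illustrative names, and the sample dict is abbreviated from the output shown.

```python
def summarize_mentions(response, min_confidence=0.0):
    """Flatten entityMentions into rows, skipping low-confidence mentions."""
    rows = []
    for mention in response.get("entityMentions", []):
        confidence = mention.get("confidence", 0.0)
        if confidence < min_confidence:
            continue
        rows.append({
            "type": mention["type"],
            "text": mention["text"]["content"],
            "offset": mention["text"].get("beginOffset", 0),
            "confidence": confidence,
            # IDs such as 'UMLS/C0013404' link the mention to a vocabulary.
            "linked_ids": [
                link["entityId"] for link in mention.get("linkedEntities", [])
            ],
            "certainty": mention.get("certaintyAssessment", {}).get("value"),
        })
    return rows


# Abbreviated copy of the response shown above.
sample_response = {
    "entityMentions": [
        {"mentionId": "1", "type": "SEVERITY",
         "text": {"content": "worsening", "beginOffset": 97},
         "linkedEntities": [{"entityId": "UMLS/C1457868"},
                            {"entityId": "UMLS/C4084902"}],
         "confidence": 0.9495142102241516},
        {"mentionId": "2", "type": "PROBLEM",
         "text": {"content": "shortness of breath", "beginOffset": 107},
         "linkedEntities": [{"entityId": "UMLS/C0013404"}],
         "certaintyAssessment": {"value": "LIKELY",
                                 "confidence": 0.9996985197067261},
         "confidence": 0.9954680800437927},
        {"mentionId": "3", "type": "PROBLEM",
         "text": {"content": "yellow-green sputum", "beginOffset": 128},
         "linkedEntities": [{"entityId": "UMLS/C3812802"}],
         "confidence": 0.6370829343795776},
    ]
}

for row in summarize_mentions(sample_response, min_confidence=0.9):
    print(row["type"], "|", row["text"], "|", row["linked_ids"])
```

With a threshold of 0.9, the "yellow-green sputum" mention (confidence of about 0.64) is filtered out, which is one simple way to trade recall for precision.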
Try the Healthcare Natural Language API
Entity Extraction using LLMs
Large Language Models (LLMs) offer a compelling solution for entity extraction tasks. Their ability to understand complex language patterns makes them particularly adept at identifying concepts like medicine names, dosages, and strengths within discharge notes and other medical text.
Med-PaLM: A Specialized LLM for Healthcare
Med-PaLM is a version of Google’s PaLM large language model that has been adapted to the medical domain. Its specialized training, which draws on medical exams, medical research, and consumer health queries, improves its understanding of medical terminology and context.
Pros
- Ease of Implementation: LLMs can be customized for entity extraction through prompt engineering, offering a more streamlined development process.
- Adaptability: Few-shot learning allows LLMs to rapidly learn new entity types, even with limited training data.
- Accuracy: Pre-training on vast medical text corpora gives specialized LLMs like Med-PaLM a robust understanding of medical language.
Cons
- Hallucination: LLMs can generate entities that do not appear in the source text, so safeguards and rigorous validation are essential.
- Output Inconsistency: The structure of the output can vary from one response to the next, so it has to be standardized to stay compatible with downstream systems.
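One simple safeguard against hallucinated entities is to check that every extracted value actually occurs in the source text. The sketch below shows that idea with a plain substring check after normalizing case and whitespace; `find_ungrounded` is an illustrative helper name, and a production system would likely need stricter matching.

```python
import re


def _normalize(text):
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_ungrounded(values, source_text):
    """Return the extracted values that do not appear in the source text."""
    haystack = _normalize(source_text)
    return [
        value for value in values
        if value is not None and _normalize(value) not in haystack
    ]


source = "TAB PAN 40 MG ONE TABLET PER ORAL ONCE DAILY BEFORE BREAKFAST FOR 7 DAYS"
extracted = ["TAB PAN", "40 MG", "ONE TABLET", "ONCE DAILY", "10 DAYS"]
print(find_ungrounded(extracted, source))  # ['10 DAYS']
```

Because this is a substring test, a wrong value such as "0 MG" would still pass against "40 MG", so it catches only the most obvious fabrications.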
Prompt:
You are a medical expert. Extract every medicine with its strength, dose, frequency, and duration from the following text, in the format
[Medicine Name, Medicine strength, Medicine Dose, Medicine frequency, Medicine duration].
If a medicine attribute is not found, fill in None. Each medicine should be strictly a list containing 5 items.
Examples:
Medicine strength = ['10mg', '90MG', '40 MG']
Medicine dose = ['ONE TABLET', 'two doses', '3 TSF', '2 PUFFS', '1 TAB']
Frequency = ['THRICE DAILY', 'ONCE AT BED TIME', 'TWICE A DAY', 'ONCE A DAY']
Duration = ['21 days / 28 days', '15 days', 'one week']
Input : "1. TAB PARACIP 500 MG two TABLETS PER ORAL THRICE DAILY AFTER FOOD FOR 5 DAYS (PAIN)
2. TAB PAN 40 MG ONE TABLET PER ORAL ONCE DAILY BEFORE BREAKFAST FOR 7 DAYS (ACIDITY)"
Expected output: [['TAB PAN', '40 MG', 'ONE TABLET', 'ONCE DAILY', '7 DAYS'], ['TAB PARACIP', '500 MG', 'two TABLETS', 'THRICE DAILY', '5 DAYS']]
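Since the model's reply comes back as text, it helps to parse and validate it before passing it downstream. The following sketch assumes the reply is a Python-style list literal like the expected output above; the field names in `FIELDS` and the helper `parse_medication_rows` are illustrative choices, not part of any LLM SDK.

```python
import ast

# Illustrative field names for the five positions requested in the prompt.
FIELDS = ("name", "strength", "dose", "frequency", "duration")


def parse_medication_rows(reply):
    """Parse an LLM reply into dicts, enforcing exactly five items per row."""
    try:
        parsed = ast.literal_eval(reply.strip())
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"Reply is not a valid list literal: {exc}") from exc
    if not isinstance(parsed, list):
        raise ValueError("Reply must be a list")
    # Accept a single flat row as well as a list of rows.
    if parsed and not isinstance(parsed[0], (list, tuple)):
        parsed = [parsed]
    rows = []
    for row in parsed:
        if not isinstance(row, (list, tuple)) or len(row) != len(FIELDS):
            raise ValueError(f"Expected {len(FIELDS)} items per row, got: {row!r}")
        cleaned = []
        for item in row:
            # Treat both None and the string 'None' as a missing attribute.
            if item is None or str(item).strip().lower() == "none":
                cleaned.append(None)
            else:
                cleaned.append(str(item).strip())
        rows.append(dict(zip(FIELDS, cleaned)))
    return rows


reply = ("[['TAB PAN', '40 MG','ONE TABLET', 'ONCE DAILY', '7 DAYS'], "
         "['TAB PARACIP ', '500 MG','two TABLETS','THRICE DAILY', '5 DAYS']]")
for row in parse_medication_rows(reply):
    print(row)
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for untrusted model output. Rejecting rows of the wrong length turns silent format drift into an explicit error that can trigger a retry.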
Entity Extraction using Open Source Models
MedSpaCy is a Python library built on top of the popular natural language processing (NLP) framework, spaCy. It’s specifically designed to handle the complexities of clinical and medical text.
Why Use MedSpaCy?
- Medical Domain Expertise: MedSpaCy’s components and rule-based systems are designed for clinical text, which can make it more accurate than general-purpose NLP tools when working with clinical notes, research papers, and other medical documents.
- Seamless with spaCy: If you’re already familiar with spaCy, integrating MedSpaCy is easy. It adds specialized pipelines to your existing spaCy workflow.
- Customization: You can fine-tune MedSpaCy’s models or add custom rules for your specific medical text processing needs.
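import medspacy

# Load spaCy's general-purpose en_core_web_sm model as the base pipeline;
# medspacy.load() adds MedSpaCy's clinical components on top of it
nlp = medspacy.load("en_core_web_sm")
text = "Patient reports headache and nausea. Denies fever. Prescribed ibuprofen 200mg."
doc = nlp(text)
# Print each detected entity with its label
for entity in doc.ents:
    print(entity.text, entity.label_)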
Dive into the code here
References:
- https://cloud.google.com/healthcare-api/docs/how-tos/nlp
- https://sites.research.google/med-palm/
- https://github.com/medspacy/medspacy
Thanks for reading.
Your feedback and questions are highly appreciated. You can connect with me via LinkedIn.