NLP & Healthcare:
Understanding the Language of Medicine

Xavier Amatriain
Published in Curai Health Tech
16 min read · Nov 5, 2018

At Curai we have a mission to scale the world’s best healthcare for every human being. We are building an Augmented Intelligence capability to scale doctors and lower the barrier to entry for primary care. There are many components to such a system, but medicine is, at its core, conversational, so a very important piece is being able to understand the language of patient-doctor communications.

A sneak peek into some Curai NLP efforts

Building an AI-powered primary care service involves solving many NLP tasks. Below are some of the concrete projects we are tackling. These projects use different sources of text, ranging from doctor notes in EHRs, which we access through our research partnerships, to real patient-doctor conversations from the Curai Health service.

  • Medical entity recognition and resolution. Given an arbitrary piece of text, we are interested in extracting different medical entities including symptoms, diseases, or treatments. We want to do so in all kinds of documents, ranging from patient conversations to medical texts. Each use case has unique requirements with its own precision/recall tradeoff. For example, if we want to extract symptoms from a patient utterance to directly kick off a diagnosis algorithm, we will want to favor precision to avoid feeding incorrect symptoms. On the other hand, if we are generating candidate labels for physicians to later confirm if these are relevant or not to the current case, we could be favoring recall.
  • Medical knowledge discovery. While we do have access to structured medical knowledge, these sources are far from complete or exhaustive. Therefore, we need to complement them by processing different kinds of unstructured medical texts and extracting patterns, entities, and the relationships between them. Eventually, we want to be able to extract a knowledge graph from a collection of text documents (see 2D t-SNE representation of our current knowledge graph below).
2D t-SNE representation of our current knowledge graph
  • Question answering. We want to respond to medical questions from both patients and physicians. Examples of this include, “Can I take Mucinex while on a Z-Pack?” or “My doctor just upped my dosage of _____ by 50mg and now I’ve been feeling lightheaded, is that normal?” In this case, we want to extract the relevant medical terms and surrounding context, and use them to retrieve the documents most responsive to the question terms from a repository of curated answers.
  • Medically relevant response/question suggestion. Another one of our target applications is to suggest responses or questions to physicians who are having a conversation with a patient. The suggestion not only needs to be linguistically and contextually meaningful, but it also needs to have medical validity.
  • Medically relevant auto-complete. In a similar way, we are developing auto-complete functionality that suggests ways to complete a sentence or a conversation to both patients and doctors, taking into account context, personal situation, and medically relevant patterns. We’ve seen a number of developments in this area, such as Google’s SmartReply feature, and are imagining what these could look like in the medical domain.
  • Intent classification. Given a patient conversation, we want to infer whether the patient intent is, for example, to seek information about a known condition, to figure out an unknown diagnosis, to find a second opinion, or to find an alternate treatment, among many others.
  • Multi-modal medical classification. We are working on combining text with other modalities (e.g. images) to build better classifiers.
  • Medically-aware dialogue system. Many of the components above have the ultimate goal of developing a dialogue system that can lead a medically sound conversation with a patient. One very important requirement is for the dialogue system to elicit the right information from the patient in order to offer answers and/or to summarize the interaction and make suggestions to a physician to make the final call.
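
The precision/recall tradeoff mentioned in the entity recognition item above can be made concrete with a small sketch. It assumes a hypothetical extractor that attaches a confidence score to every candidate symptom. The candidates, scores, and thresholds are invented for illustration and do not describe our production system.

```python
# Hypothetical scored candidates:
# (extracted symptom, model confidence, is the extraction actually correct?)
CANDIDATES = [
    ("headache", 0.95, True),
    ("fever", 0.80, True),
    ("nausea", 0.60, False),
    ("cough", 0.55, True),
    ("rash", 0.30, False),
]


def precision_recall(candidates, threshold):
    """Keep candidates scoring at or above `threshold`; return (precision, recall)."""
    kept = [c for c in candidates if c[1] >= threshold]
    true_positives = sum(1 for c in kept if c[2])
    total_correct = sum(1 for c in candidates if c[2])
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / total_correct if total_correct else 1.0
    return precision, recall


# A strict threshold suits feeding a diagnosis algorithm directly (favor precision).
strict = precision_recall(CANDIDATES, threshold=0.75)
# A loose threshold suits generating labels for a physician to confirm (favor recall).
loose = precision_recall(CANDIDATES, threshold=0.50)
```

Raising the threshold discards the incorrect "nausea" candidate but also loses the correct "cough"; lowering it keeps every correct symptom at the cost of one wrong one.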

In order to develop all these functionalities, we are using many different data sources, including doctor notes in electronic hospital records, medical literature, and transcripts from years of patient-doctor conversations. And, very importantly, we build on years of existing research on NLP in general, and at the intersection of NLP and Healthcare/Medicine. The rest of this post examines this existing research and provides a glimpse into what we should expect in the near future.

Some history

When we started working on the projects outlined above, the first thing we did was to analyze the current state of the art. You might not be surprised to hear that we are not the first ones working at the intersection of NLP and healthcare, but you might be surprised to hear that there is a wealth of research in this area, going back as far as 50 years. As a matter of fact, medicine and healthcare have been a preferred area of focus for AI in general (and NLP in particular) since the inception of the field. This is not a mere coincidence. The medical conversation offers a perfect testbed for language understanding and conversational systems. It is a relatively constrained domain that lends itself to somewhat structured and predictable conversational patterns. On the other hand, it is broad and complex enough that it cannot be captured with simple rules.

A typical doctor-patient conversation tends to follow this template:

Doctor: “How can I help?”

Patient: <Chief complaint>

Doctor: “Anything else?”

Patient: <Optional secondary complaints>

Start of a doctor-led Q&A:

E.g. “Do you have X?”

Doctor communicates actionable recommendation (diagnosis + treatment, triage, referral…)

Note though that a template like this can still have lots of complexity and multiple variations. Not surprisingly, physicians are taught interview techniques as part of their regular training.

It is not surprising that researchers turned their attention to this domain when looking for ways to exercise their early experimental systems. As a matter of fact, the most popular application of Eliza, usually cited as the earliest example of a conversational agent, was re-creating the conversation between a psychotherapist and a patient, even if under the hood it used a very primitive and hardly generalizable rule-based approach.

Just a few years later (1971), Internist-1 became not only the first realistic example of a complete medical decision support system, but also a prime showcase of state-of-the-art AI and dialogue agents (so much so that the system was originally called DIALOG). Internist-1 was considered to have very high quality medical information. Jack Myers, who led the project, was considered one of the best clinical diagnostic experts of his time. Besides, adding a single disease to the knowledge base required 2–4 weeks of full-time effort from knowledgeable doctors reading anywhere from 50 to 250 publications.

To be clear, and as you can see in the transcript below (see here for the full example of the transcript), Internist-1 did not provide a full-fledged advanced NLP dialogue system. The conversation was very structured and heuristic-driven. However, you can already envision how the possibility of combining the quality of medical knowledge in Internist-1 with the “naturalness” of Eliza created high expectations in the 70’s.

Internist-1 Transcript

The present

While we can trace interest in medical conversational systems 50 years back, it is interesting to see how this area has seen a sudden spike in interest in very recent years. If you search for medical chatbots, you will find dozens of companies working in this area. This article, for example, compares 5 chatbots, including Sensely, Buoy Health, Infermedica, and Florence, and there are many more (e.g. Isabel, Babylon Health, Ada and, of course, Curai).

Health “chatbots” in the news

Some recent work comes from companies figuring out a way to use older technologies in a modern context, but the research field has also seen a recent surge of publications on conversational agents for healthcare-related applications. This recent meta-study reports on 14 recent healthcare-related chatbots. You can find research on dialogue systems or chatbots for diabetes, primary care, or pediatrics. There is also a long list of experimental chatbots for mental health such as Woebot.

Why now? On the availability of medical data

What is bringing this space back into the spotlight precisely now? An obvious answer is that AI in general, and NLP technology in particular, have improved dramatically over the past few years. I will get into that in the next section. Another not-so-obvious but related reason is the availability of data that enables better systems to be developed.

One imperfect source of data is the so-called Electronic Health Record (EHR) or Electronic Medical Record (EMR). These systems store large-scale patient-level information about encounters between patients and the healthcare system. While much of it is centered around billing and has limited medical quality, there is also a wealth of valuable medical textual information in the form of medical notes (which are captured in the EMR as unstructured text objects). It is interesting to see how these notes differ from the medical conversations we described before. Still, and as we will see later, medical notes have been used for several NLP applications.

Example of state-of-the-art EHR software including medical note from EMR-EHRs

While doctor notes can be in principle completely unstructured free-form text, most doctors are encouraged to use the SOAP template where SOAP stands for Subjective, Objective, Assessment, Plan. See full description in image below for more details.
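
To make the SOAP structure concrete, here is a minimal sketch that splits a note into its four sections. It assumes each section is introduced by a labeled header such as "Subjective:" at the start of a line, which real notes often do not follow this cleanly. The sample note and the `split_soap` helper are invented for illustration.

```python
import re

SOAP_SECTIONS = ("Subjective", "Objective", "Assessment", "Plan")

# Matches a line that starts with one of the four section names followed by a colon.
HEADER = re.compile(r"^(%s):\s*(.*)$" % "|".join(SOAP_SECTIONS), re.IGNORECASE)


def split_soap(note):
    """Return a dict mapping each SOAP section name to its free text."""
    sections = {name: [] for name in SOAP_SECTIONS}
    current = None
    for line in note.splitlines():
        stripped = line.strip()
        match = HEADER.match(stripped)
        if match:
            # A new section starts; remember it and keep any text after the colon.
            current = match.group(1).capitalize()
            if match.group(2):
                sections[current].append(match.group(2))
        elif current and stripped:
            # Continuation line belonging to the section seen most recently.
            sections[current].append(stripped)
    return {name: " ".join(parts) for name, parts in sections.items()}


# Invented example note, for illustration only.
note = """Subjective: Patient reports a sore throat for three days.
Objective: Temperature 38.1 C.
Throat appears red.
Assessment: Likely viral pharyngitis.
Plan: Rest and fluids; follow up if symptoms persist."""

parsed = split_soap(note)
```

Even with the template in place, the text inside each section is still free-form, which is why the NLP techniques discussed later are needed to extract anything structured from it.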

While getting access to electronic medical records or medical notes in general can be very challenging, it is worth mentioning that there are some Open Data initiatives that are trying to address the issue and give researchers around the world the opportunity to do meaningful research. Probably the most well-known one is the MIMIC initiative developed by the MIT Lab for Computational Physiology. MIMIC-III, the most current dataset, includes de-identified health data associated with ~40,000 critical care patients, including demographics, vital signs, laboratory tests, and medications. Another important initiative is i2b2, a broad initiative that has published datasets such as NLP #5, a complete set of annotated and unannotated, de-identified patient discharge summaries. Finally, there is hNLP (Health Natural Language Processing Center), a recent initiative that provides datasets to its affiliates (ping me if you want to hear about our experience with the center before paying the somewhat hefty fee).

Another source of large-scale medical text is the existing databases of medical research publications such as PubMed. Having access to such a corpus enables, for example, the automation of knowledge extraction that used to be done by hand. However, it is not only about traditional publications. Crowdsourced medical online material such as Wikidoc or HumanDX can sometimes compete in quality and breadth of coverage.

A final source of relevant information that sets the medical field apart from other domains is the availability of different ontologies, vocabularies, or knowledge bases. While none of them might be perfect for any given application, they do include a lot of very valuable information that can be combined or built upon. The most relevant initiatives include ICD, Snomed-CT, and UMLS. Below is a brief description of each. A more detailed analysis with pros/cons is beyond the scope of this post, but it is important to note that none of them is the holy grail that solves all possible requirements of complex use cases.

ICD10 is the 10th, and most current, revision of the International Statistical Classification of Diseases and Related Health Problems (ICD), although if you work with most medical systems you are likely to find a combination of ICD9 and ICD10 codes, which are not fully compatible. This is a rather simple collection of medical codes for medical concepts such as diseases, symptoms, or complaints. Interestingly, it is an evolution of the classical Bertillon Classification of Causes of Death (1893) and it is currently managed by the World Health Organization, so the hope is that it is well standardized across most countries.

Snomed CT (Clinical Terms) is a collection of computer-processable terms used in clinical documentation and reporting. Its goal is to be comprehensive and include any medical term, including clinical findings, symptoms, diagnoses, procedures, body structures, organisms, substances, pharmaceuticals, or devices. It is much more ambitious in scope than ICD and can be considered a full-fledged ontology that includes term relations, hierarchies, and composability. However, it is complex and sometimes inconsistent, especially if you consider its multiple mutations since it started in 1965, so its usage is not widespread.

Finally, UMLS (Unified Medical Language System) is a meta-ontology maintained by the U.S. National Library of Medicine. It is a compendium of many controlled vocabularies and it includes a Metathesaurus, a Semantic Network, and the SPECIALIST Lexicon and Lexical Tools which provide, for example, different ways to measure semantic similarity between medical concepts or to translate among the various terminology systems. One of the most useful (and used) features provided by the Metathesaurus is the notion of Concept Unique Identifier (CUI).
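
The idea behind a concept identifier is that many surface strings, possibly from different vocabularies, resolve to one and the same concept. The sketch below illustrates this with a hand-written lookup table. The identifiers are placeholders rather than real UMLS CUIs, and the code does not call any UMLS tool or API.

```python
# Placeholder identifiers for illustration; these are NOT real UMLS CUIs.
SYNONYM_TO_CUI = {
    "heart attack": "CUI_A",
    "myocardial infarction": "CUI_A",
    "mi": "CUI_A",
    "headache": "CUI_B",
    "cephalalgia": "CUI_B",
}


def normalize(term):
    """Map a surface term to its concept identifier, or None if it is unknown."""
    return SYNONYM_TO_CUI.get(term.strip().lower())


def same_concept(term_a, term_b):
    """Two terms refer to the same concept when they share an identifier."""
    cui_a, cui_b = normalize(term_a), normalize(term_b)
    return cui_a is not None and cui_a == cui_b
```

With this table, `same_concept("Heart attack", "myocardial infarction")` is true even though the two strings share no words, which is the property that makes concept identifiers so convenient as a common key across sources.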

Medical dialogue systems: components, state-of-the-art, and current research directions

In order to understand the following section, you would need to have some understanding of NLP and dialogue systems in general. If you don’t, I recommend you take a look at Jurafsky and Martin’s “Speech and Language Processing” online book, particularly chapters 25 and 26 on dialogue systems, but also 24 on question answering.

As you will read in any of the material recommended above, and all the associated references, medical dialogue systems such as the ones referenced in previous sections are not called “chatbots” in the literature. This term is reserved for dialogue systems that have no particular purpose and whose only goal is to appear natural in open, undirected chat or conversations. Most, if not all, medical dialogue systems fall under the category of so-called “Task-oriented Dialogue Systems”. Examples of other task-oriented dialogue systems are Apple’s Siri or Google Assistant. The blueprint for such systems is outlined in the diagram below (adapted from Young, 2000):

I will now focus only on research pertaining to the Language Understanding component since it is the one that generally deserves most attention. Note that this component includes three main sub-components: domain identification, intent classification, and slot filling.

For most healthcare applications we can safely skip the first one since the domain (healthcare) is known a priori and jump right into intent classification.

Even when you constrain your domain to healthcare, a user engaging with a system can have many different kinds of intents. They could be interested in figuring out a diagnosis given some symptoms, finding a treatment given a diagnosis, finding a nearby doctor, getting a second opinion, asking a question about diet or drug side effects, or requesting a prescription. For an in-depth analysis of intents in healthcare information searching, I would recommend reading “From health search to healthcare: explorations of intention and utilization via query logs and user surveys” by White and Horvitz from MSR. Among the many interesting observations, there is the graph below, which highlights that user intent varies over time and depends on the user interactions with the medical system.

From health search to healthcare: explorations of intention and utilization via query logs and user surveys

At its core, intent classification is nothing more than a text classification task. Therefore, any approach used for text classification (from SVMs to CRFs) can work. More recently, approaches using vector models and Deep Learning supervised classifiers have shown good results. Of course, which approach works best depends on the availability and quality of labeled training data.
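
To show what treating intent classification as plain text classification looks like, here is a minimal sketch. It uses a bag-of-words multinomial naive Bayes classifier, chosen only because it fits in a few lines of standard-library Python, not because it is one of the approaches named above. The intents and training utterances are invented and far too few for real use.

```python
import math
from collections import Counter, defaultdict

# Tiny invented training set; a real system needs far more labeled data.
TRAIN = [
    ("what could be causing my chest pain", "diagnosis"),
    ("why do i have a fever and a cough", "diagnosis"),
    ("how do i treat a migraine", "treatment"),
    ("what is the best treatment for acne", "treatment"),
    ("can i get a refill of my prescription", "prescription"),
    ("i need a prescription for my inhaler", "prescription"),
]


def train(examples):
    """Count words per intent for a multinomial naive Bayes classifier."""
    word_counts = defaultdict(Counter)
    intent_counts = Counter()
    for text, intent in examples:
        intent_counts[intent] += 1
        word_counts[intent].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, intent_counts, vocab


def classify(text, model):
    """Return the intent with the highest log-probability (add-one smoothing)."""
    word_counts, intent_counts, vocab = model
    total = sum(intent_counts.values())
    best_intent, best_score = None, -math.inf
    for intent, n in intent_counts.items():
        score = math.log(n / total)  # log prior of the intent
        denom = sum(word_counts[intent].values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((word_counts[intent][word] + 1) / denom)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent


model = train(TRAIN)
predicted = classify("how should i treat my acne", model)
```

Swapping in an SVM or a neural classifier changes the model but not the framing: utterance in, intent label out, with quality bounded by the labeled data available.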

While intent classification in the healthcare domain can be treated like any other intent classification problem, some recent research has tried to make use of higher-order semantics by applying knowledge graphs and vocabularies such as the ones described above (see “Bringing Semantic Structures to User Intent Detection in Online Medical Queries”). In any case, just like the next step in the dialogue flow (Slot Filling), these approaches require extracting some structure from the text.

Learning Structure from Text

The “traditional” approach to task-oriented dialogue systems is based on so-called Slot Filling. The idea is rather simple: you start with a pre-determined frame that defines the different elements (slots) that need to be obtained for a task, and then you apply different techniques to drive the dialogue to the goal of obtaining those pieces of information. If the task is booking an airplane ticket, the slots will be things like “departure airport”, “destination airport”, “day”, “preferred time”, etc… In the case of a health-related dialogue, the frame will depend on the intent. For example, if the intent is to obtain a diagnosis, the frame could include things like patient demographics, symptoms, signs, or family history. Note how this is very much the same as saying we want to extract structure from text as illustrated in the example below. Of course, defining a complete frame for each healthcare intent is complex and depends on many things including context and personal information.
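
A frame and the dialogue loop that fills it can be sketched in a few lines. The slot names and prompts below are a hypothetical frame for a diagnosis intent, invented for illustration; the `Frame` class is not taken from any particular dialogue framework.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical frame for a "diagnosis" intent; slot names and prompts are illustrative.
DIAGNOSIS_PROMPTS = {
    "age": "How old are you?",
    "symptoms": "What symptoms are you experiencing?",
    "duration": "How long have you had these symptoms?",
    "family_history": "Does anyone in your family have a similar condition?",
}


@dataclass
class Frame:
    """A frame is a set of named slots; None marks a slot that is not filled yet."""
    prompts: Dict[str, str]
    slots: Dict[str, Optional[str]] = field(default_factory=dict)

    def __post_init__(self):
        # Start with every slot defined by the frame, all empty.
        for name in self.prompts:
            self.slots.setdefault(name, None)

    def fill(self, name: str, value: str) -> None:
        if name not in self.prompts:
            raise KeyError(f"unknown slot: {name}")
        self.slots[name] = value

    def missing(self) -> List[str]:
        return [name for name, value in self.slots.items() if value is None]

    def next_question(self) -> Optional[str]:
        """Drive the dialogue: ask about the first unfilled slot, if any."""
        missing = self.missing()
        return self.prompts[missing[0]] if missing else None


frame = Frame(DIAGNOSIS_PROMPTS)
frame.fill("age", "34")
frame.fill("symptoms", "headache, nausea")
question = frame.next_question()
```

Here the dialogue policy is simply "ask for the first empty slot". In practice, the values passed to `fill` come from the entity extraction step discussed next, and the choice of which slot to ask about is itself a hard problem.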

It is also interesting to note that the concept of Frames has been floating around in medical informatics for many years. See, for example, “An interlingua for electronic interchange of medical information: using frames to map between clinical vocabularies” (1990), where frames for generic concepts such as chest pain were defined (see example):

One of the most important tasks to go from natural text to some form of structured frame-like representation is to extract entities. We need not only to be able to recognize named entities from a lookup, but be able to reason about the fact that “my head is about to burst” should be mapped to “severe headache”. Other challenges include word sense disambiguation (a muscle tear vs. a tear falling from your eye), implicit symptoms, or complex negations and modifiers. Entity recognition has become one of the most studied tasks in the health NLP research community. An important reason for that is the fact that the i2b2 initiative mentioned above has been promoting challenges that are directly or indirectly related to this task. The recent “Entity recognition from clinical texts via recurrent neural network” did an experimental comparison of different ML approaches to medical entity recognition. The conclusion is that LSTMs perform only slightly better than Structured Support Vector Machines on the task of concept extraction.

Besides RNN/LSTMs, another recent work worth mentioning is “Disease named entity recognition by combining conditional random fields & bidirectional recurrent neural networks”, an approach that combines bidirectional RNNs with CRFs. It is also important to note that despite Deep Learning approaches showing some advantages, traditional methods such as CRFs or even dictionary lookups can be competitive in some practical settings (see “Named Entity Recognition Over Electronic Health Records Through a Combined Dictionary-based Approach” or “Knowledge-driven Entity Recognition and Disambiguation in Biomedical Text”).
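
As a point of reference for the dictionary-lookup baseline, here is a minimal sketch that also handles the simplest form of negation. The lexicon, the negation cues, and the three-token window are assumptions made for illustration and are not taken from any of the papers cited above.

```python
import re

# Hypothetical lexicon mapping patient phrasings to canonical symptom names.
LEXICON = {
    "headache": "headache",
    "head is about to burst": "severe headache",
    "fever": "fever",
    "throwing up": "vomiting",
}

# Cue words that negate a mention when they appear shortly before it.
NEGATION_CUES = ("no", "not", "without", "denies")


def extract_entities(text, window=3):
    """Return (canonical name, negated?) pairs found by dictionary lookup."""
    lowered = text.lower()
    found = []
    for phrase, canonical in LEXICON.items():
        for match in re.finditer(r"\b%s\b" % re.escape(phrase), lowered):
            # Inspect the few tokens right before the mention for a negation cue.
            preceding = re.findall(r"\w+", lowered[:match.start()])[-window:]
            negated = any(token in NEGATION_CUES for token in preceding)
            found.append((match.start(), canonical, negated))
    # Report entities in the order in which they appear in the text.
    return [(name, negated) for _, name, negated in sorted(found)]


entities = extract_entities("My head is about to burst but I have no fever.")
```

The fixed window is deliberately naive: in "no fever but a headache" it would wrongly mark the headache as negated too, which is exactly the kind of complex negation that motivates the learned approaches above.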

In order to fill slots in a frame, it is not enough to detect entities; they also need to be related and connected to the generic slot they refer to. The techniques, though, are not much different from the ones outlined so far. As a matter of fact, some of the i2b2 challenges, such as the Event Detection one, require this kind of output. See “Bidirectional Recurrent Neural Networks for Medical Event Detection in Electronic Health Records” for a good example of this.

Bidirectional Recurrent Neural Networks for Medical Event Detection in Electronic Health Records

Finally, when trying to infer structure from text, it is not only important to extract entities, but also to be able to reason about semantic similarity of concepts. This gets us to the idea of vector spaces, since approaches such as word2vec have proved to be very powerful in doing semantic operations between concepts. There have been different attempts at developing medical or healthcare-focused word vectors. The most recent one, “Clinical Concept Embeddings Learned from Massive Sources of Medical Data”, used an impressive collection of insurance claims from a database of 60 million members, 20 million clinical notes, and 1.7 million full text biomedical journal articles to map 108,477 medical concepts.
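
Semantic similarity in a vector space usually boils down to cosine similarity between concept vectors. The sketch below uses toy three-dimensional vectors invented for illustration; real embeddings are learned from large corpora and have hundreds of dimensions.

```python
import math

# Toy vectors invented for illustration; they are not learned embeddings.
EMBEDDINGS = {
    "headache": [0.9, 0.1, 0.0],
    "migraine": [0.8, 0.2, 0.1],
    "fracture": [0.0, 0.1, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(term):
    """Return the other term whose vector is closest to `term` by cosine similarity."""
    others = (t for t in EMBEDDINGS if t != term)
    return max(others, key=lambda t: cosine_similarity(EMBEDDINGS[term], EMBEDDINGS[t]))
```

With these toy vectors, `most_similar("headache")` returns `"migraine"`, since their vectors point in nearly the same direction while the "fracture" vector is almost orthogonal to both.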

The Deep Learning promise: from better classification to end-to-end dialogue systems

It should be clear from the previous section that Deep Learning is also making a dent in the traditional NLP tasks that are needed even when using a slot filling approach to dialogue systems. But there is more: deep learning promises to disrupt even the way whole dialogue systems are designed, end to end.

In order to better understand how Deep Learning can disrupt Medical NLP and Healthcare Dialogue Systems, you would benefit from a general understanding of the field. I’d recommend two tutorials: Vivian Chen’s wonderful “Deep Learning for Dialog Systems”, and “End-to-end goal-oriented question answering systems” by the LinkedIn team. Seb Ruder’s recent post “A Review of the Neural History of Natural Language Processing” and Haixun Wang’s “An Annotated Reading List of Conversational AI” are also great reads with many pointers.

As far as I know, there are no published results of end2end conversational models applied to medicine or healthcare. That said, as Jeremy Howard explains in this video (minute 1), one of their fellows (Christine Payne) did use a pre-trained language model to develop a medical question answering system. While Jeremy does point out many challenges and shortcomings of these approaches, it is clear to me that we will see them flourish in the near future, especially when combined with some more structured medical knowledge bases.


Conversational systems and medicine have been closely connected for many years. Recent advances in AI in general and NLP in particular have brought back the promise of enabling truly “intelligent” medical applications. However, there is still a lot to do and very exciting research ahead of us. If you are interested in working in this space, please check Curai’s job page or reach out to me directly.