AI system design: NLP for doctors

Uday Pulleti
Jun 7 · 9 min read
Source: Getty Images

I have to admit, the futuristic AI doctor above is a bit scary! But no worries: in this article we discuss how NLP can help present-day doctors in their day-to-day work. As an AI/ML practitioner, it is important to think holistically about the final product even when implementing a small subsystem or algorithm. In some cases you may not need a trillion-parameter model if a slight tweak is made to the way input data is collected. A complex problem that seems to require a deep neural network can sometimes be broken into two small decision-tree subproblems. We try to demonstrate this through the system design of a product that helps doctors keep track of patient-doctor conversations.

One of the most tedious parts of a doctor’s day is taking notes on conversations with patients and filing them in electronic health records (EHR). Doctors spend as much as 35% of their time on this type of documentation. Though this is not a productive use of a doctor’s time, documenting every doctor-patient interaction is very important because it is integral to patient care. Traditionally, doctors type up these interactions themselves or use professional medical transcription services, which are costly and error-prone. We present an AI system that captures the doctor-patient interaction and produces a SOAP report.

What is a SOAP report?
SOAP is an acronym for subjective, objective, assessment, and plan. This report captures everything that is important in a doctor-patient interaction.

Subjective: Patient description of symptoms, history of illness and any other relevant information about the reason for the hospital visit.
Objective: Information that the doctor observes or measures like weight, height, pulse, temperature, respiration, swelling, skin color etc. Results of completed diagnostic tests are also included in the objective.
Assessment: The doctor’s initial diagnosis, synthesized from the “subjective” and “objective” information. It is subject to change pending results from recommended tests.
Plan: This section details the need for additional testing and consultation with other clinicians to address the patient’s illnesses. It also addresses any additional steps being taken to treat the patient. This section helps future physicians understand what needs to be done next.
Detailed SOAP format:

The objective of this product is to capture doctor-patient conversation and generate a SOAP report out of it.

To better understand the complexity of the problem, below is an example audio file, raw transcription and the SOAP report generated by the doctor.
[Embedded example: raw audio file, transcription, and SOAP report]

Given the complexity of the problem, it is prudent to break the system down into smaller, manageable subsystems with clearly defined interfaces. The SOAP generation system can be divided into the following subsystems:
1. Data capture
2. Data pre-processing
3. Model pipelines for each of the output sections
a. Subjective
b. Objective
c. Assessment
d. Plan
4. Inference and continuous learning

Figure 1: Data capture and pre-processing

1. Data capture

Because we are analyzing conversations between people in uncontrolled environments, it is important to capture information as precisely as possible. From the transcription given above, it is evident that speech-to-text accuracy is very low. Taking this into consideration, Figure 1 shows some of the important aspects of multi-modal data capture and pre-processing.
Segmented offering with layered accuracy: Customers could be offered product variations depending on the sophistication of their data collection devices, providing different QoS for different pricing plans. A web app, a smartphone app, or smart devices can be deployed, with data aggregated from multiple devices. The smart-device option captures audio and video, with the option of multiple connected cameras and microphone arrays placed at various locations in the physician’s office (for example, one device at the physician’s consultation desk, one at the examination bed, and four devices in the four corners of the room).

2. Data pre-processing

From the captured multi-modal, multi-stream data, the most important information to extract is speaker-separated text. This can be done more accurately by using both audio and video information, as in (1); (2) can be used if only audio data is available. Other pre-processing tasks include suppressing audio from other speakers, filtering out non-word sounds, and transforming ill-formed sentences into well-formed sentences. Other information, such as body parts indicated by the patient or the patient’s emotion, can also be extracted and used for downstream tasks.
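As a rough illustration of one of these pre-processing steps, the sketch below filters non-word sounds out of a transcript that has already been separated by speaker. It assumes the filtering is applied to text after speech-to-text, that utterances arrive as (speaker, text) pairs, and that a small hand-picked filler list is good enough; the names FILLERS, clean_utterance and clean_transcript are hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical filler list; a real system would tune this on actual transcripts.
FILLERS = {"um", "uh", "er", "ah", "hmm", "mm"}


def clean_utterance(text: str) -> str:
    """Drop whitespace-separated tokens that are fillers once punctuation is stripped."""
    kept = [w for w in text.split() if re.sub(r"\W", "", w).lower() not in FILLERS]
    return " ".join(kept)


def clean_transcript(utterances: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Clean every utterance and drop the ones that become empty."""
    cleaned = []
    for speaker, text in utterances:
        text = clean_utterance(text)
        if text:
            cleaned.append((speaker, text))
    return cleaned
```

Punctuation attached to a removed filler is removed with it, so the output may still need the sentence repair step mentioned above.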

3. Model pipelines for each of the output sections

We can take three approaches for output section generation.
a. End-end learning based approach
b. Linguistic rule based approach
c. Hybrid (rule based + learning) approach

In an ideal world, an end-end learning based solution tends to be the best approach, as it involves building a learning model that generalizes to different variations of the data and handles even the most difficult corner cases. But for many practical problems it may not be feasible to build an end-end learning model, mainly because enough annotated training data is unavailable and because it is difficult to architect sophisticated deep learning models that can efficiently learn the problem at hand. End-end learning approaches tend to work well when at least 50K-100K annotated examples are available, and very well when there are more than ~1M.

Traditional linguistic rule-based approaches in NLP tend to work well for well-defined data that does not have unexpected corner cases. They require a high level of human expertise to design the rules and tend to make the system complex and difficult to scale.

For the output summary generation problem, both end-end learning approach and hybrid approach are worth experimenting with.

End-end learning approach:
Analyzing the available data, a doctor-patient conversation transcription is about 1500 words long, which translates to ~2000 tokens. The output sections add roughly 750 more tokens: subjective ~250, objective ~250, assessment ~100 and plan ~150. So we need a seq-seq model that works efficiently at ~3K-4K token lengths. One of the main limitations of current SOTA models like T5 (4) and GPT2 (5) is the maximum sequence length they can process: they work very well at 512–1024 token lengths. For long sequences (close to 4K tokens), models like Transformer XL, Reformer, Big Bird (7) and Longformer (6) have been shown to work well. If we have enough training data (>100K examples), we can train a seq-seq model by generating training examples as follows:
1. <Transcription> …. <Subjective> …
2. <Transcription> …. <Objective> …
3. <Transcription> …. <Assessment> …
4. <Transcription> …. <Plan> …
The accuracy of these models needs to be assessed, as they are still evolving. The key to better accuracy is to localize attention for each output section being generated. Methods to localize attention in an end-end learning algorithm would be a worthy direction to pursue.
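The training-example format listed above can be sketched in a few lines of Python. This is a minimal sketch, not a definitive pipeline: it assumes each annotated conversation is available as a transcription string plus a dict of SOAP sections, and it estimates token counts from the ~2000 tokens per 1500 words ratio quoted earlier rather than from a real tokenizer. The function names and the 4096-token default are hypothetical.

```python
from typing import Dict, List

SECTIONS = ["Subjective", "Objective", "Assessment", "Plan"]


def build_examples(transcription: str, soap: Dict[str, str]) -> List[str]:
    """Return one '<Transcription> ... <Section> ...' string per non-empty SOAP section."""
    examples = []
    for section in SECTIONS:
        target = soap.get(section, "").strip()
        if not target:
            continue  # skip sections with no annotation
        examples.append(f"<Transcription> {transcription.strip()} <{section}> {target}")
    return examples


def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 tokens per 3 words (2000 tokens / 1500 words)."""
    n_words = len(text.split())
    return (n_words * 4 + 2) // 3  # ceil(n_words * 4 / 3) in integer arithmetic


def fits_budget(example: str, max_tokens: int = 4096) -> bool:
    """Check an example against an assumed model limit of about 4K tokens."""
    return estimate_tokens(example) <= max_tokens
```

Running fits_budget over the whole dataset gives an early signal of how many conversations a ~4K-token model could take without truncation.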

Hybrid approach (end-end learning + rules):
Another approach is to use rule-based classifiers to break the input data down into contextual segments and then train seq-seq models for each segment. Here we can formulate NLP rules that split the physician-patient transcription into segments, each of which is fed to a seq-seq model with a different output prompt. We can further convert the unstructured input into structured text using methods like entity extraction and classification, assertion extraction (positive/negative intent extraction) and relation extraction, which are described in the summary generation sections.

Figure 2: Transcription segmentation

Transcript segmentation: Analyzing the transcription data, the conversation can be split into 3 parts as shown in Figure 2:
1. Segment 1: Interactive question and answers between physician and the patient.
2. Segment 2: Narration of the physician during physical examination.
3. Segment 3: Assessment plan narration by the doctor.

The 3-part segmentation could be done using simple heuristics. The first part is identified by the interactivity of the conversation. The third part starts when the physician says “assessment plan” and continues talking for a while. The second part lies between these two boundaries. We can add more heuristics, such as the utterance of body parts, to strengthen the boundaries of part 2.
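A minimal sketch of these heuristics follows. It assumes the transcript is already a list of (speaker, text) pairs labeled "doctor" or "patient", treats the first doctor utterance containing both "assessment" and "plan" as the start of Segment 3, and ends Segment 1 after the last patient turn before that point. The function name and the exact cue check are hypothetical, and the body-part heuristic is left out.

```python
from typing import List, Tuple

Utterance = Tuple[str, str]  # (speaker, text); speaker is "doctor" or "patient"


def segment_transcript(
    utterances: List[Utterance],
) -> Tuple[List[Utterance], List[Utterance], List[Utterance]]:
    """Split a conversation into (Segment 1, Segment 2, Segment 3)."""
    n = len(utterances)

    # Segment 3 starts at the first doctor utterance mentioning assessment and plan.
    seg3_start = n
    for i, (speaker, text) in enumerate(utterances):
        lowered = text.lower()
        if speaker == "doctor" and "assessment" in lowered and "plan" in lowered:
            seg3_start = i
            break

    # Segment 1 (interactive Q&A) ends after the last patient turn before Segment 3.
    seg2_start = 0
    for i in range(seg3_start - 1, -1, -1):
        if utterances[i][0] == "patient":
            seg2_start = i + 1
            break

    return (
        utterances[:seg2_start],
        utterances[seg2_start:seg3_start],
        utterances[seg3_start:],
    )
```

If the patient speaks during the physical examination, this simple rule pushes that stretch into Segment 1, which is exactly the case the body-part heuristic is meant to catch.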

3.a. Subjective part generation:
This part is generated mostly from the interactive questions and answers between the physician and the patient. If the length of Segment 1 + subjective summary is less than 1K tokens for a large percentage of conversations, training examples can be generated directly as follows, and a generative model like GPT2/GPT3 can be trained to generate the subjective summary:

Training example: <Segment 1> …. <Subjective> …

If the length of Segment 1 + subjective summary is greater than 1K tokens for a large percentage of conversations, we need to explore ways to compress the Segment 1 data. A few options:
1. Extract only relevant structured data from the conversation. Using a medical entity recognition model like the one described in (8), extract tagged entities (along with assertions) in a structured format as shown in Figure 3.

Figure 3: Medical Named Entity recognition model with tags

2. Use just the patient responses from Segment 1.
3. Transform each doctor-patient question-answer pair of the conversation into a single informative sentence. Example: Doctor: Where does it hurt? Patient: In my nose. Transformed sentence: It hurts in my nose.
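Options 2 and 3 both start from simple bookkeeping over Segment 1, which can be sketched as below. The sketch assumes Segment 1 is a list of (speaker, text) pairs labeled "doctor" or "patient"; the function names are hypothetical. It only extracts patient responses and pairs each one with the preceding doctor utterance. Rewriting a pair into a single sentence (option 3) would need a separate model or rule set and is not shown.

```python
from typing import List, Tuple

Utterance = Tuple[str, str]  # (speaker, text)


def patient_responses(segment1: List[Utterance]) -> List[str]:
    """Option 2: keep only what the patient said."""
    return [text for speaker, text in segment1 if speaker == "patient"]


def qa_pairs(segment1: List[Utterance]) -> List[Tuple[str, str]]:
    """Pair each patient response with the most recent doctor utterance.

    These pairs are the input that an option-3 rewriting step would consume.
    """
    pairs = []
    last_question = None
    for speaker, text in segment1:
        if speaker == "doctor":
            last_question = text
        elif speaker == "patient" and last_question is not None:
            pairs.append((last_question, text))
            last_question = None  # each question is paired at most once
    return pairs
```

Option 2 is the cheapest form of compression, but short answers such as "Yes" lose their meaning without the question, which is the case option 3 addresses.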

3.b. Objective part generation:
This part is generated mostly from Segment 2, the physician’s narration during the physical examination. The SOAP report has clear subsections such as head, cardiovascular etc. If the length of Segment 2 + objective summary is greater than 1K tokens, we can explore sub-segmenting Segment 2 into the sentences that belong to each body part. We can also follow an approach similar to the one presented in section 3.a to generate compressed structured text from Segment 2 and use that to construct training examples for a seq-seq model.
Training example: <Segment 2> …. <Objective> …
Training example: <Segment 2:Head> …. <Objective: Head> …
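One simple way to sub-segment Segment 2 is keyword matching, sketched below. The subsection names and keyword lists are hypothetical placeholders, not taken from any real SOAP template; in practice they would come from the report format and from clinical vocabularies. Whole-word matching is used so that, for example, "appears" does not match "ears".

```python
import re
from collections import defaultdict
from typing import Dict, List

# Hypothetical keyword lists; illustrative only.
BODY_PART_KEYWORDS = {
    "Head": ["head", "eyes", "ears", "nose", "throat"],
    "Cardiovascular": ["heart", "pulse", "murmur"],
    "Respiratory": ["lungs", "breath", "wheezing"],
}


def subsegment(sentences: List[str]) -> Dict[str, List[str]]:
    """Group Segment 2 sentences by the subsection whose keywords they mention."""
    buckets = defaultdict(list)
    for sentence in sentences:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        matched = [
            name for name, keywords in BODY_PART_KEYWORDS.items()
            if words & set(keywords)
        ]
        # A sentence may land in several subsections; unmatched ones go to "Other".
        for name in matched or ["Other"]:
            buckets[name].append(sentence)
    return dict(buckets)
```

Each bucket can then be used as the input side of a "<Segment 2:Head> …. <Objective: Head> …" style training example.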

3.c. Assessment generation:
This part is generated mostly from Segment 3 and is a short medical diagnosis. Since all possible diagnoses could potentially be present in the training data, this can be framed as a multi-label classification problem with Segment 3 as input and predefined diagnoses as class labels. Segment 3 can also first be converted into structured text, as mentioned in 3.a, and then used to build a classifier. An obvious issue with this approach is that a diagnosis not seen in the training data will never be predicted.
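The label side of this multi-label setup, including the unseen-diagnosis limitation, can be sketched as follows. The sketch assumes each training conversation comes with a list of diagnosis strings; it builds the label vocabulary and encodes targets as multi-hot vectors, and leaves the classifier itself out. The function names are hypothetical.

```python
from typing import Dict, List, Tuple


def build_label_index(training_diagnoses: List[List[str]]) -> Dict[str, int]:
    """Assign a fixed position to every diagnosis seen in the training data."""
    labels = sorted({d for diagnoses in training_diagnoses for d in diagnoses})
    return {label: i for i, label in enumerate(labels)}


def encode(diagnoses: List[str], label_index: Dict[str, int]) -> Tuple[List[int], List[str]]:
    """Return a multi-hot target vector plus any diagnoses that have no class."""
    vector = [0] * len(label_index)
    unseen = []
    for d in diagnoses:
        if d in label_index:
            vector[label_index[d]] = 1
        else:
            unseen.append(d)  # the classifier has no output for these
    return vector, unseen
```

Tracking how often the unseen list is non-empty on held-out data gives a direct measure of how much this limitation matters in practice.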

3.d. Plan generation:
This part is also generated mostly from Segment 3 and can follow approaches similar to those described in 3.a.

Figure 4: Generalized SOAP section generation pipeline

Below are a few other aspects to pay attention to during system design.
Data analysis:
1. Check whether the data can consistently be split into the 3 well-defined segments described under Transcript segmentation.
2. Maximum and percentile values of the token length of each of the 3 transcription segments.
3. Maximum and percentile values of the token length of each section of the output summary.
4. Maximum and percentile values of the token length of patient responses in the transcription.
5. Maximum and percentile values of the number of questions asked by the doctor and responses given by the patient.
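The length statistics in this checklist can be computed with a small helper like the one below. It is a sketch under two assumptions: token length is approximated by whitespace splitting (a model tokenizer would give higher counts), and percentiles use the nearest-rank definition. The function names and the default percentiles are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def percentile(values: List[int], p: int) -> int:
    """Nearest-rank percentile of a non-empty list, for integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def length_report(texts: List[str], percentiles: Sequence[int] = (50, 90, 99)) -> Dict[str, int]:
    """Max and percentile token lengths for a collection of texts."""
    lengths = [len(t.split()) for t in texts]  # whitespace tokens as a proxy
    report = {"max": max(lengths)}
    for p in percentiles:
        report[f"p{p}"] = percentile(lengths, p)
    return report
```

Running this per segment and per output section shows directly whether the 1K-token thresholds used in sections 3.a and 3.b hold for a large percentage of conversations.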

Assessing the accuracy of generated output: Despite the prevalence of various automated metrics like perplexity, BLEU, ROUGE, BLEURT, GEM, GENIE etc., manual assessment (which is prohibitively time-consuming and requires domain expertise) is the most reliable way to measure the accuracy of generated output.
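To see why automated metrics need a manual check alongside them, consider a simplified unigram-overlap score in the spirit of ROUGE-1. The sketch below is not the official ROUGE implementation (no stemming, no multiple references), and the function name is hypothetical.

```python
import re
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 style F1 based on clipped unigram overlap."""
    ref = Counter(re.findall(r"\w+", reference.lower()))
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    overlap = sum((ref & cand).values())  # per-word minimum of the two counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

With this function, the reference "patient has fever" and the candidate "patient has no fever" share three of four words and score above 0.8, even though the two sentences mean the opposite clinically. This is the kind of error that only a reviewer with domain expertise would reliably catch.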


1. VISUALVOICE: Audio-Visual Speech Separation with Cross-Modal Consistency

2. Wavesplit: End-to-End Speech Separation by Speaker Clustering

3. Biomedical Named Entity Recognition at Scale

4. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

5. GPT2

6. Longformer: The Long-Document Transformer

7. Big Bird: Transformers for Longer Sequences

8. Improving Clinical Document Understanding on COVID-19 Research with Spark NLP

Nerd For Tech
