Automated Text Extraction from Medical Documents with Natural Language Processing: Rule Based.

Sanghamitra Deb
Mar 21, 2016

Reading and interpreting medical documents and combining them with external data sources is a necessary part of drug discovery. For example, the design of new clinical trials requires an understanding of older trials, laboratory notes and FDA data to ensure procedures are not repeated. Sometimes researchers are interested in questions such as “What age group was studied for Ritalin?”, “What are the common side effects of gentamicin?”, and so on.

Answering these questions requires extracting specific medical entities. For example, to understand side effects we need to extract drug names and disease names, and ascertain the side-effect relationship between drugs and diseases. In a different case we might need to ascertain the treatment relationship between a drug name and a gender or age group. Armed with these relationships between entities, we will be able to build a queryable knowledge graph.

Graphical Representation of disease, genes and drugs.

In this blog post I will cover extracting the drug-disease treatment relationship from FDA drug labels. This is a typical FDA drug label.

FDA Drug Labels

Most of the drug labels have an “Indications and Usage” section, and I will use this section to extract the treatment relationship. The drug labels are available in XML format. I will be using Python to parse the XML.

Extraction of xml data using the xmltodict module in python
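As a rough illustration of this step, here is a minimal sketch that pulls the “Indications and Usage” text out of a label. It uses the standard library's xml.etree.ElementTree in place of xmltodict so that it runs without extra installs. The sample document and its element names (label, drug, section, title, text) are made up for the example and are not the real FDA label schema.

```python
import xml.etree.ElementTree as ET

# Toy label for illustration only; the element names are assumptions,
# not the actual structure of an FDA drug label file.
SAMPLE_LABEL = """
<label>
  <drug>exampledrug</drug>
  <section>
    <title>INDICATIONS AND USAGE</title>
    <text>Exampledrug is indicated for the relief of chronic asthma.</text>
  </section>
  <section>
    <title>WARNINGS</title>
    <text>May cause drowsiness.</text>
  </section>
</label>
"""


def extract_indications(xml_string):
    """Return (drug_name, indications_text); text is None if no such section."""
    root = ET.fromstring(xml_string)
    drug = root.findtext("drug", default="").strip()
    for section in root.iter("section"):
        title = section.findtext("title", default="")
        if "indications" in title.lower():
            text_node = section.find("text")
            # itertext() also collects text nested inside child elements
            raw = "".join(text_node.itertext()) if text_node is not None else ""
            return drug, " ".join(raw.split())
    return drug, None


drug, indications = extract_indications(SAMPLE_LABEL)
print(drug, "->", indications)
```

Real labels are more deeply nested, so the section lookup would need to be adapted to the actual files.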

After extracting the text that will reveal the treatment relationship, the first step is the usual text cleaning (removing special characters, decoding unicode, lowercasing the text, etc.). The next step is to parse the data. I have used the Stanford NLP parser for this purpose. I use the resulting lemmatized and parsed text for further analysis.
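The cleaning step could look something like the sketch below. The exact set of characters to keep is an assumption made for the example; the post does not say which special characters were removed.

```python
import re
import unicodedata


def clean_text(raw):
    """Normalize unicode, lowercase, and drop special characters."""
    # NFKC folds compatibility characters (e.g. non-breaking spaces)
    # into their plain equivalents.
    text = unicodedata.normalize("NFKC", raw)
    text = text.lower()
    # Assumed whitelist: letters, digits, whitespace and basic punctuation.
    text = re.sub(r"[^a-z0-9\s.,;:()\-/%]", " ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("Indicated for the\u00a0RELIEF of  pain \u2022 in adults."))
```

Sentence punctuation is kept here because a parser generally needs it to find sentence boundaries.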

Lemma and POSTAG output from Stanford NLP Parser

The next step is to extract all nouns and any adjectives preceding the nouns. This way we capture disease words such as “pain” and “chronic asthma”. This process limits us to bi-grams only. It is possible to expand on this work to include tri-grams and longer disease names. A more rigorous path would be to use disease ontologies from UMLS or SNOMED to identify the disease lexicons. Without them, words that are not disease related will pollute our results. The other important steps are removing stop words and normalizing in a disease-specific way. That is, “infection due to HIV” should not be treated as the same term as “infection from respiratory disorder”, since the same drug will almost certainly not treat both conditions.
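The noun and adjective-noun extraction can be sketched as follows. The input is assumed to be a list of (lemma, POS tag) pairs with Penn Treebank style tags (NN* for nouns, JJ* for adjectives), which is the kind of output a parser such as Stanford NLP produces; the helper name candidate_phrases is hypothetical.

```python
def candidate_phrases(tagged):
    """Return nouns, joined with a directly preceding adjective if present.

    tagged: list of (lemma, pos) pairs using Penn Treebank style tags.
    """
    phrases = []
    for i, (lemma, pos) in enumerate(tagged):
        if not pos.startswith("NN"):
            continue
        # Look one token back only, which limits the output to bi-grams.
        if i > 0 and tagged[i - 1][1].startswith("JJ"):
            phrases.append(tagged[i - 1][0] + " " + lemma)
        else:
            phrases.append(lemma)
    return phrases


tagged_sentence = [
    ("indicate", "VBN"), ("for", "IN"), ("the", "DT"), ("relief", "NN"),
    ("of", "IN"), ("chronic", "JJ"), ("asthma", "NN"), ("and", "CC"),
    ("pain", "NN"),
]
print(candidate_phrases(tagged_sentence))
```

Note that “relief” is picked up as a candidate too, which is exactly the kind of non-disease noun the stop word step has to deal with.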

To get a better stop word list, I do a word count over the extracted disease words and manually curate a stop word list by checking the 100 most commonly occurring words. This makes the disease words cleaner, and we no longer have words such as “tablet” and “solution” in our results.
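A small sketch of this curation loop: count the words across all candidate phrases, inspect the most frequent ones by hand, and then filter with the curated list. The phrases and the stop word set below are invented for the example.

```python
from collections import Counter


def most_common_words(phrases, n=100):
    """Word counts across all candidate phrases, most frequent first."""
    counts = Counter(word for phrase in phrases for word in phrase.split())
    return counts.most_common(n)


def remove_stop_words(phrases, stop_words):
    """Drop stop words from each phrase; discard phrases left empty."""
    cleaned = []
    for phrase in phrases:
        kept = [w for w in phrase.split() if w not in stop_words]
        if kept:
            cleaned.append(" ".join(kept))
    return cleaned


phrases = ["tablet", "chronic asthma", "oral solution", "pain", "tablet", "asthma"]
print(most_common_words(phrases, n=3))

# In practice this set is curated by hand from the frequency list above.
curated_stop_words = {"tablet", "solution", "oral"}
print(remove_stop_words(phrases, curated_stop_words))
```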

The next step is to look for anchor words such as “indication”, “treatment”, “relief”, “reduce”, etc., and find the position of each disease word relative to them in the sentence. We use the rule that the words closer to the anchor words are more likely to be the disease treated by the drug.
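One way to express the proximity rule is to rank each candidate by its token distance to the nearest anchor word, as in the sketch below. The anchor set contains the four words named above plus the lemma forms “indicate” and “treat”, which are additions made for this example. Measuring distance in tokens from the head noun is also an assumption.

```python
# Anchors from the text, plus "indicate" and "treat" (assumed additions so
# that lemmatized verbs such as "indicated" also match).
ANCHORS = {"indication", "treatment", "relief", "reduce", "indicate", "treat"}


def rank_by_anchor_distance(lemmas, candidates, anchors=ANCHORS):
    """Return (candidate, distance) pairs, closest to an anchor first."""
    anchor_positions = [i for i, w in enumerate(lemmas) if w in anchors]
    if not anchor_positions:
        return []
    scored = []
    for cand in candidates:
        head = cand.split()[-1]  # use the noun, i.e. the last word
        positions = [i for i, w in enumerate(lemmas) if w == head]
        if not positions:
            continue
        dist = min(abs(p - a) for p in positions for a in anchor_positions)
        scored.append((dist, cand))
    scored.sort()
    return [(cand, dist) for dist, cand in scored]


lemmas = "exampledrug be indicate for the relief of chronic asthma in patient".split()
print(rank_by_anchor_distance(lemmas, ["patient", "chronic asthma"]))
```

A cutoff on the distance, or keeping only the top-ranked candidates, would then decide which words are reported as treated diseases.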

Using these rules we narrow down the treatment relationship between drugs and diseases. Here is a snapshot of the results.

If we look carefully, we still have noise in the disease column. As a matter of fact, this is a high-recall solution. For approximately 82% of the data the extraction is correct. For 7% of the data the extraction is correct but includes too many extra words. For 11% of the data the extraction is incorrect.

Future Work

This is a preliminary rule-based study of FDA drug labels, and there are several directions for improvement. The drugs with incorrect disease extraction have a different data structure, and the “Indications and Usage” section has not been captured in the text. This needs further investigation. The drugs with too many extra words require machine learning to keep only the correct disease words.

One solution is to create features based on the relative position of words with respect to anchor words and the distribution of words around a specific disease noun. We can then feed these features to a classifier such as logistic regression or Support Vector Machines to predict whether a disease is treated by a specific drug.
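As a sketch of what such features might look like, the function below builds a feature dictionary for one candidate noun: its signed offset from the nearest anchor word and counts of the words in a small window around it. The feature names, the window size and the function name build_features are all hypothetical choices for illustration.

```python
from collections import Counter


def build_features(lemmas, noun_index, anchors, window=3):
    """Feature dict for the candidate noun at position noun_index."""
    anchor_positions = [i for i, w in enumerate(lemmas) if w in anchors]
    if anchor_positions:
        nearest = min(anchor_positions, key=lambda a: abs(noun_index - a))
        offset = noun_index - nearest  # positive: noun follows the anchor
    else:
        offset = None
    start = max(0, noun_index - window)
    context = lemmas[start:noun_index] + lemmas[noun_index + 1:noun_index + 1 + window]
    features = {
        "has_anchor": offset is not None,
        "anchor_offset": offset if offset is not None else 0,
        "after_anchor": offset is not None and offset > 0,
    }
    # Distribution of words around the candidate noun.
    for word, count in Counter(context).items():
        features["ctx=" + word] = count
    return features


lemmas = "exampledrug be indicate for the relief of chronic asthma in patient".split()
print(build_features(lemmas, noun_index=8, anchors={"indicate", "relief"}, window=2))
```

Dictionaries like this could be turned into vectors (for example with scikit-learn's DictVectorizer) and passed to a logistic regression or SVM, given labeled drug-disease pairs for training.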

Furthermore, the labeling of disease lexicons can be improved by using data from the LabeledIn study. Training data can also be generated from this dataset.

Annotated Data from LabeledIn Study


A quick rule-based technique provides moderate results for extracting the drug-disease treatment relationship from FDA labels. There is room to improve these results, and there are opportunities to extract other relations, such as drug-disease side-effect relationships, to create a knowledge ontology in this space. Some of these problems will be addressed in future blogs.


(1) Automatic extraction of drug indications from FDA drug labels. Khare R, Wei CH, Lu Z.

(2) LabeledIn: Cataloging Labeled Indications for Human Drugs. Khare R, Li J, Lu Z.

(3) Large-scale extraction of accurate drug-disease treatment pairs from biomedical literature for drug repurposing. Xu R, Wang Q.

(4) Graph Theory Enables Drug Repurposing: How a Mathematical Model Can Drive the Discovery of Hidden Mechanisms of Action. Gramatica R, Di Matteo T, Giorgetti S, Barbiani M, Bevec D, Aste T.

Written by Sanghamitra Deb

I am a Data Scientist at Chegg Inc, an Astrophysicist, Ph.D in my prior life. My day is spent working with data, NLP, machine learning, statistics, …
