3. Two ways we used NLP grammar functions (and three ways they fail)

Using spaCy to break down complex clinical text

Amy Gottsegen
Clinical Trial NLP Challenge
6 min read · May 9, 2018


Photo: O.H. Designs

This week, we’re sharing our efforts to develop a grammar-based approach to extracting information from clinical trial descriptions for potential patients. We’ve concentrated on developing two functions that rely on grammar to grab the information we want:

Tool 1: Burden Scheduling

The first type of information we aimed to extract using grammar was scheduling information. The questions “at what times?”, “how often?”, and “for how long?” are of primary concern for patients, and we previously noticed reliable indicator words for this sort of information, such as “minutes”, “hours”, “days”, “months”, etc., in the trial descriptions.

Here’s a typical passage that contains info about what we’re referring to as “burden scheduling”:

“Patients receive paclitaxel IV over 3 hours and cisplatin IV on day 1, followed by topotecan IV over 30 minutes on days 1–3.”

In cases like this one, simple use of indicator words is sufficient to extract the information we need. Just using the indicator words mentioned above, paired with a regular expression to capture the adjacent numerical information, our system yields the temporal phrases in the passage: “over 3 hours”, “on day 1”, “over 30 minutes”, and “on days 1–3”. From this information alone, we aim to tell the patient that this trial will require at least 3 days of active treatment, with 3 hours of their time on day 1 and 30 minutes on each of the remaining days.
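As a concrete illustration, here’s a minimal sketch of that first pass in Python. The pattern names and exact regular expressions are our own simplification for this post, not the system’s actual code:

```python
import re

# "over 3 hours", "for 14 days", "30 minutes", ...
DURATION = re.compile(
    r"\b(\d+)\s*(minutes?|hours?|days?|weeks?|months?)\b", re.IGNORECASE)

# "on day 1", "on days 1-3", ...
SCHEDULE = re.compile(
    r"\bdays?\s+(\d+(?:\s*[-–]\s*\d+)?)\b", re.IGNORECASE)

def find_temporal_spans(text):
    """Return every (start, end, matched text) temporal hit, in order."""
    hits = [m.span() + (m.group(0),) for m in DURATION.finditer(text)]
    hits += [m.span() + (m.group(0),) for m in SCHEDULE.finditer(text)]
    return sorted(hits)

sentence = ("Patients receive paclitaxel IV over 3 hours and cisplatin IV "
            "on day 1, followed by topotecan IV over 30 minutes on days 1-3.")
for start, end, span in find_temporal_spans(sentence):
    print(span)  # -> 3 hours / day 1 / 30 minutes / days 1-3
```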

But getting there is not going to be simple. In theory, these indicators function inside prepositional phrases that describe how the scheduling of events in the trial will proceed. Grammatically, each of these prepositional objects can be traced back to a parent verb (in our sentence, “receive”). Thus, collecting the words up from the temporally-indicated prepositional objects back to their parent verbs became the target action of our first grammar function, which we call extract_temporal.

To build extract_temporal we turned to spaCy, an NLP library whose parser tells us about the dependency relations between words. Unfortunately, it’s not always correct. Complex clinical writing can throw off the parser, and grammatical errors in the writing (which are common in noisy free text) make the corresponding parse information invalid. For example, the challenging text of our example sentence leads spaCy to miscategorize “…over 3 hours…” as a direct object. To manage this issue, we’ve relaxed our code to simply collect the words up from each indicator to its nearest ancestor verb, ignoring the dependency labels along the way. This helps extract more information, but reduces the tool’s precision (a topic we will address directly in our next post).
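Here’s a minimal sketch of that relaxed traversal. The function shares its name with ours, but the body is a simplified reconstruction built on spaCy’s token API, not our production implementation:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

TEMPORAL_INDICATORS = {"minute", "hour", "day", "week", "month", "year"}

def extract_temporal(sentence):
    """For each temporal indicator, walk up the dependency tree to the
    nearest ancestor verb, then keep the verb, the words on the path,
    and the indicator's own modifiers (e.g. "3" in "3 hours")."""
    doc = nlp(sentence)
    results = []
    for token in doc:
        if token.lemma_.lower() not in TEMPORAL_INDICATORS:
            continue
        path, node = [token], token
        # the relaxed climb: ignore dependency labels, just follow heads
        while node.head is not node and node.pos_ != "VERB":
            node = node.head
            path.append(node)
        if node.pos_ != "VERB":
            continue  # no verb ancestor; skip this indicator
        keep = set(path) | set(token.subtree)
        results.append(" ".join(t.text for t in sorted(keep, key=lambda t: t.i)))
    return results

print(extract_temporal(
    "Patients receive paclitaxel IV over 3 hours and cisplatin IV on day 1."
))
# e.g. ['receive over 3 hours', 'receive on day 1'] (output depends on the parse)
```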

So, what can we do with extract_temporal’s output? Well, the ideal output from our example would be:

“… receive … over 3 hours … on day 1, … over 30 minutes on days 1–3.”

Unfortunately, this is still not something ready to put in front of information-seeking patients. What we extract is almost certainly a fragment, confusing if presented out of context. It quantifies the scheduling information we want, but further analysis is required to summarize it. Our next step here will be to build word-based rules that transform the quantities according to the prepositions that govern them (over, on, for, until, etc.) into final patient-friendly output. This is a challenging task, a lot like date-time parsing, that will unfortunately take more time than our whirlwind pilot allows. However, in an upcoming blog post we’ll be discussing a solution that uses supervised machine learning and should produce workable output.
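To give a flavor of what those word-based rules might look like, here is a hypothetical mapping from governing prepositions to structured relations. The relation names are invented for illustration:

```python
# Hypothetical word-based rules: map the preposition governing a temporal
# quantity onto a structured relation. The relation names are our own
# invention, not a finished rule set.
PREP_RULES = {
    "over":  "duration",    # "over 3 hours"  -> treatment lasts 3 hours
    "for":   "duration",    # "for 14 days"   -> treatment lasts 14 days
    "on":    "schedule",    # "on days 1-3"   -> occurs on days 1 through 3
    "every": "frequency",   # "every 2 weeks" -> repeats every 2 weeks
    "until": "endpoint",    # "until day 28"  -> stops at day 28
}

def interpret(prep, value, unit):
    """Turn ("over", 3, "hours") into a small structured record."""
    return {"relation": PREP_RULES.get(prep.lower(), "unknown"),
            "value": value, "unit": unit}

print(interpret("over", 3, "hours"))
# -> {'relation': 'duration', 'value': 3, 'unit': 'hours'}
```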

Tool 2: Intervention

For this topic, we noticed early on that strong indicators were often verbs like “receive” and “undergo”. These appeared to have very high precision in our analysis of example sentences across the trial description data. But unlike the temporal indicators for scheduling, it’s difficult to enumerate all of the possible indicator verbs.

In thinking through how grammar can be used to bolster our indicators, we noticed that patient interventions are frequently described with proper nouns. For instance:

“Patients will undergo an MRI scan with a maximal duration of 45 minutes.”

From specific procedures to drug names, we found this pattern of proper nouns naming interventions again and again in the clinical trial data. Moreover, we noticed that these treatments usually have an indicator verb nearby: in this case, “undergo.”

To test our hypothesis, we again enlisted the help of spaCy. We scanned every sentence in our data set for proper nouns, and used spaCy’s dependency parser to capture what we’re calling the “parent verb”: the verb at the root of the proper noun’s clause. This experiment yielded a whopping 14,000 verbs! We’re currently going down this list and marking off good candidates, but it shouldn’t take long to get good coverage: analyzing the results, we found that the 56 most frequent verbs accounted for 50% of the captured sentences!
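A scaled-down version of that experiment might look like the following sketch. The parent_verb helper and the two sample sentences are stand-ins for our full pipeline and data set:

```python
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

def parent_verb(token):
    """Climb the dependency tree from a proper noun to the verb at the
    root of its clause (what we call the 'parent verb')."""
    node = token
    while node.head is not node and node.pos_ != "VERB":
        node = node.head
    return node.lemma_ if node.pos_ == "VERB" else None

# stand-ins for the full set of trial-description sentences
sentences = [
    "Patients will undergo an MRI scan with a maximal duration of 45 minutes.",
    "Patients receive paclitaxel IV over 3 hours on day 1.",
]

verb_counts = Counter()
for sent in sentences:
    for token in nlp(sent):
        if token.pos_ == "PROPN":
            verb = parent_verb(token)
            if verb:
                verb_counts[verb] += 1

print(verb_counts.most_common(10))  # candidate indicator verbs by frequency
```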

With our tested indicators on the way, we’re constructing a function that extracts the clearest and most concise information about what types of interventions patients will experience. After some experimenting with how much of each sentence to grab, we settled on our second function, extract_intervention. Starting at the indicator verb, this function captures any dependent subjects (e.g., “Patients”) and direct objects (e.g., “MRI scan”).
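Here’s a minimal sketch of that behavior, assuming a small seed set of indicator verbs. As with the earlier sketches, this illustrates the idea rather than reproducing our exact implementation:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

INDICATOR_VERBS = {"receive", "undergo"}  # a small seed set for illustration

def extract_intervention(sentence):
    """Starting at an indicator verb, keep its subject(s) and direct
    object(s), plus any compound modifiers (e.g. "MRI" in "MRI scan")."""
    doc = nlp(sentence)
    for token in doc:
        if token.pos_ == "VERB" and token.lemma_ in INDICATOR_VERBS:
            kept = [token]
            for child in token.children:
                if child.dep_ in ("nsubj", "nsubjpass", "dobj"):
                    kept.append(child)
                    kept.extend(c for c in child.children
                                if c.dep_ == "compound")
            return " ".join(t.text for t in sorted(kept, key=lambda t: t.i))
    return None

print(extract_intervention(
    "Patients will undergo an MRI scan with a maximal duration of 45 minutes."
))  # -> "Patients undergo MRI scan"
```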

With this function, the example sentence above gets cut down to the most minimal information about intervention: “Patients undergo MRI scan.”

While the sentence above was fairly interpretable to begin with, the real power of this algorithm manifests in sentences such as this one:

“Participants in group A will undergo eradication of H-pylori using triple attack therapy according to O’Connor et al, 2013 with Proton pump inhibitor (eg, omeprazole 20 mg BID), Clarithromycin 500 mg BID, metronidazole 500 mg BID for 14 days, followed by confirmation of eradication by repeating the H-pylori stool antigen test .”

Which simply becomes: “Participants group A undergo eradication H attack therapy.”

The output cuts the sentence down to our topic, but once again it is difficult to interpret, since it is just an extracted fragment. And since this information, unlike the scheduling information, is not quantified and numeric, we’ll have to focus on standardizing extract_intervention’s output syntax to complete the job. Going further with grammar functions would mean writing more syntax-based rules, which would take a lot more time and special attention to case examples. Otherwise, our options fall back to supervised machine learning. If applicable pre-existing tools for syntactic simplification exist, we might be able to apply them quickly.

The bigger picture on rule-based systems

While these tools have made it easy to get moving on our patient-facing feature extraction, we’ve found the grammar-based approach to have some major drawbacks. This won’t come as a surprise to anyone familiar with the NLP community: from 2003 to 2012, 75% of academic papers published on NLP used machine learning, 21% used a hybrid system, and only 3.5% used rule-based systems. The distribution has likely only grown more skewed towards machine learning since.

While we’ve learned a great deal from trying to construct the simplest possible mechanism for extracting information from unstructured text, these are the drawbacks that we found limit this approach:

  1. Improper grammar: A grammar-based approach can only work on sentences that… well, adhere to proper grammar. This was the biggest and least navigable drawback, as we found that typos, fragments, and irregular grammar are endemic to our data set of trial descriptions. These irregularities inevitably cause a grammar-based system to fail.
  2. Tendency towards overfitting: Because each of these grammar-based rules is hypothesized and tested individually by our team, there is a natural tendency towards overfitting to the examples we see the most. A machine learning system can better account for the full diversity of a data set, while balancing the strategic use of frequent patterns.
  3. Interpretability of output: Extraction alone doesn’t necessarily generate descriptions that are easy to understand. One can see this in the sample output from our second function: “Participants group A undergo eradication H attack therapy”. While this is an improvement on the original sentence, it’s still not quite what we want: high-quality output that is easy for patients to understand. For this we need a system that can not only extract relevant information, but also present that information in an accessible rhetorical style.

Moving towards machine learning

With this rule-based system requiring more time to build, we’re looking towards machine learning to quickly develop a product that extracts patient-friendly information. These techniques ignore grammar and don’t require intimate subject-matter knowledge. That’s not to say that we’re discarding the rule-based system: we anticipate that the valuable information it extracts will constitute the first step in our NLP pipeline for the text simplification moonshot. Stay tuned for another update soon about this effort!
