1. How to make sense of 270k clinical trial descriptions

(Hint: start with regular expressions)

Amy Gottsegen
Clinical Trial NLP Challenge
4 min read · Apr 12, 2018


The team at Drexel University is now one month into our experiment in extracting information for patients from esoteric clinical trial descriptions. For this first phase of the project, we’ve focused on retrieving as much information as possible with the simplest techniques. Here’s a look at what we’ve been up to:

The data

We started by collecting descriptions of 268,018 clinical trials from clinicaltrials.gov, which amounted to 5.5 GB of data. The descriptions come in XML format: some information, such as patient age range, has its own tags, but most of the information we want sits in unstructured text blocks. For example:

<eligibility>
  <criteria>
    <textblock> Inclusion Criteria: - Patients with a clinical indication for EVAR/FEVAR of AAA and meeting anatomic inclusion criteria on preoperative enhanced CT-scan compatible with an endovascular repair...
    </textblock>
  </criteria>
  <gender>All</gender>
  <minimum_age>N/A</minimum_age>
  <maximum_age>N/A</maximum_age>
  <healthy_volunteers>No</healthy_volunteers>
</eligibility>

Our first task was to build preprocessing tools in Python for this large dataset. After writing a parser to convert the XML into JSON, we used regular expressions to standardize whitespace, extract URLs, and so on.
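As a rough illustration of what that preprocessing stage can look like, here is a minimal sketch using Python’s standard library; the function names and the exact fields kept are illustrative assumptions, not our production code:

import re
import xml.etree.ElementTree as ET

URL_PATTERN = re.compile(r"https?://\S+")

def normalize_whitespace(text):
    # Collapse the runs of spaces and newlines left over from the XML text blocks.
    return re.sub(r"\s+", " ", text).strip()

def trial_to_dict(xml_path):
    # Flatten one trial description into a JSON-serializable dict of tag -> text.
    root = ET.parse(xml_path).getroot()
    record = {}
    for element in root.iter():
        if element.text and element.text.strip():
            record[element.tag] = normalize_whitespace(element.text)
    # Pull any URLs out of the collected text so they can be surfaced separately.
    record["urls"] = URL_PATTERN.findall(" ".join(record.values()))
    return record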

First pass: extraction via keyword indicators

The first approach we’re testing uses keywords to find important information in the unstructured text. The idea behind this approach is that certain kinds of information have a set of words that almost always appear within or alongside them. For instance, if we want to tell patients how long they’ll be actively participating in a study, we know we’re looking for a number plus a temporal unit: “Patients will receive daily injections for 5 weeks.”
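For predictable cases like this one, a single regular expression is often enough. Here’s a minimal sketch (the pattern below is an illustrative assumption, not the exact one we use):

import re

# One possible pattern for "<number> <temporal unit>" phrases such as "5 weeks".
DURATION_PATTERN = re.compile(r"\b(\d+)\s*(days?|weeks?|months?|years?)\b", re.IGNORECASE)

text = "Patients will receive daily injections for 5 weeks."
for quantity, unit in DURATION_PATTERN.findall(text):
    print(quantity, unit)  # -> 5 weeks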

To implement this idea, we developed a hierarchical schema of categories of information most relevant for patients. We designed the categories from our manual analysis of 100 sample trial descriptions. The categories include the information most vital to patients’ understanding of a trial: active participation period, follow up period, and eligibility criteria, to name a few.

Next we labelled each category with its datatype, and tagged each with keywords that appear in or near the information we want.
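To give a sense of its shape, here is a small, hypothetical slice of that schema; the category names, datatypes, and keywords shown are illustrative stand-ins rather than our full list:

# A hypothetical fragment of the hierarchical category schema.
CATEGORY_SCHEMA = {
    "patient_burden": {
        "active_participation_period": {
            "datatype": "duration",
            "indicators": [r"\d+\s*(days?|weeks?|months?|years?)", "duration of"],
        },
        "treatments_administered": {
            "datatype": "text",
            "indicators": ["will receive", "administered"],
        },
    },
    "eligibility_criteria": {
        "inclusion": {"datatype": "text", "indicators": ["Inclusion Criteria"]},
        "exclusion": {"datatype": "text", "indicators": ["Exclusion Criteria"]},
    },
}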

To assess the validity of the keywords we chose, we conducted frequency evaluations of each indicator. Here’s a sample of what the output looks like:

[Image: sample output showing the type of information, its containing XML tag, the indicator, and its frequency]

You’ll note this example contains indicators written as regular expressions. The frequency figure represents the percentage of descriptions in our dataset of 268K trials that contain that indicator. From this analysis we found that indicators associated with the category of patient burden (which includes active period, treatments administered, etc.) occurred with relatively high frequency, suggesting it could be the most fruitful category of information for us to focus on.
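Computing these frequencies is straightforward once the descriptions have been preprocessed. A sketch, assuming the trials’ free-text blocks have already been collected into a list of strings:

import re

def indicator_frequency(descriptions, pattern):
    # Fraction of trial descriptions whose free text matches an indicator regex.
    regex = re.compile(pattern, re.IGNORECASE)
    hits = sum(1 for text in descriptions if regex.search(text))
    return hits / len(descriptions)

# e.g. indicator_frequency(all_textblocks, r"\d+\s*weeks?")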

Next up: extraction via parts of speech

In cases where the positioning of information and keyword is highly predictable, such as the “5 weeks” example, we can easily use regular expressions to extract the target information. Unfortunately, most indicator–information pairs do not occur in such a predictable pattern. Our next step for finding information based on the presence of keywords is therefore to parse sentences into syntax trees. This is motivated by the idea that even if a given noun keyword doesn’t always occur adjacent to the information we want, the indicator’s part of speech will have a stable relationship with the part of speech of the target information. Take, for example, the sentence fragment:

“ …the nurse completes the evaluation with a mouth inspection in order to count missing, coloured, injured teeth, and evaluate dental plaque, halitosis, inflammatory gums, mucosal injuries, low masticatory surface.”

Suppose “evaluate” is one of our indicator words. Because this keyword is a transitive verb, we know we’re interested in extracting its object(s). In this case, we need a flexible method of extraction that can gather more than one object being evaluated; simple extraction based on word positioning would fall short.

To implement this more complicated analysis, we’ve already begun using the Python module spaCy to output syntax trees. This type of analysis focuses on words’ parts of speech and dependency relations, which, for example, lets us compute subject/object relations between nouns and verbs. While available tools make accessing this information straightforward, determining reliable grammatical rules that extract the target information is a challenging task.
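As a taste of what this looks like in practice, here is a minimal sketch with spaCy; the helper name and the choice of dependency labels are our own illustrative assumptions, and real criteria sentences need considerably more care:

import spacy

nlp = spacy.load("en_core_web_sm")

def objects_of(verb_lemma, text):
    # Return the head nouns of the direct object(s) of a given indicator verb,
    # following conjunctions so that lists like "X, Y, and Z" are all captured.
    doc = nlp(text)
    results = []
    for token in doc:
        if token.pos_ == "VERB" and token.lemma_ == verb_lemma:
            for child in token.children:
                if child.dep_ == "dobj":
                    results.append(child)
                    results.extend(child.conjuncts)
    return [t.text for t in results]

print(objects_of("evaluate",
                 "The nurse will evaluate dental plaque, halitosis, and mucosal injuries."))
# Expected to print something like ['plaque', 'halitosis', 'injuries']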

In our next post we’ll detail progress along this grammatical approach.
