4. How we got started with supervised machine learning

Our classifier caught 60% of the false positives—without human intervention

Amy Gottsegen
Clinical Trial NLP Challenge
5 min read · Jun 18, 2018


Photo: Tiia Monto

So far, we’ve found two major problems with using rule-based systems for converting complex clinical trial descriptions into accessible, patient-facing information.

The first is that rule-based systems are prone to imprecision. For every rule of grammar, there seem to be ten exceptions! Constructing a rule-based system that can account for all of these exceptions thus requires an enormous investment of time and energy — something clearly outside the scope of this project (though a great example of such a system, one that has been personally useful to your author, is Passyunk, an address-parsing system from Philadelphia’s Office of Innovation and Technology).

The second drawback we’ve learned of is that extraction does not equal readability. While we’ve been able to extract valuable information, it’s often not formatted in a way that’s easy to understand.

In both of these cases, the most reliable fix would be…well, to have a human help! Humans are pretty great at understanding context and summarizing, the two tasks our rule-based system fell short on.

The point of supervised machine learning is to have a computer mimic what a human can do. The “supervised” part means that a human first shows the model how they would complete the task, and the computer learns from those examples. Below, we’ll give two examples of how we’re using supervised machine learning to tackle these NLP obstacles.

Statistical classifier for precision

We’ve noted in previous posts that precision is a recurring problem when trying to extract reliable information from unreliable free text. While our rule-based system has very good recall — it catches most instances of information about scheduling — it also produces many false positives.

To bolster the precision of the system, we’re adding to it a topic relevance classifier. Remember our discussion of annotating false positives last week?


“Topic relevance classifier” is simply a fancy way of saying that we’re having a machine do that exercise for us: we feed the annotations from the exercise into a statistical classifier as training data.

A quick and painless taxonomy of supervised machine learning techniques

This exercise required a classifier, because we were seeking a label as our output: a sentence either is relevant to scheduling, or it’s not. This is what classifiers are built for, as opposed to the other type of machine learning algorithm: regression. Jason Brownlee from Machine Learning Mastery explains the difference between these two types of algorithms best:

“Fundamentally, classification is about predicting a label and regression is about predicting a quantity.”

Each of these two categories has many different kinds of algorithms within it. There are also other categories, such as clustering and dimensionality reduction — machine learning is a wide and wonderful world! This diagram from scikit-learn, the most popular Python package for machine learning, gives a more complete picture of all these different types of algorithms:

For this blog post, though, you just have to know about classification and regression!

Back to our regularly scheduled programming

So we want to use a classifier to decide whether a suspected instance of scheduling information caught by our indicator words is actually something we want to show patients. We chose a support vector machine (also known as a support vector classifier, or SVC, as in the diagram above) for this task. An SVM was well-suited to this task for two reasons: data abundance and memory efficiency. While SVMs do not perform well in data-sparse environments, we have plenty of clinical trial descriptions on which to train the algorithm. SVMs are also very memory efficient, because they only use a subset of the training points in their decision function — these points are called support vectors, which is where the algorithm gets its name.
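
To make this concrete, here is a minimal sketch of how a topic relevance classifier like this could be trained with scikit-learn. The file name, column names, and TF-IDF features are illustrative assumptions, not a description of our actual pipeline.

```python
# Minimal sketch of a topic relevance classifier, assuming a labeled CSV of
# candidate sentences with columns "sentence" and "is_relevant" (1 = about
# scheduling, 0 = false positive). File and column names are placeholders.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

data = pd.read_csv("scheduling_annotations.csv")
X_train, X_test, y_train, y_test = train_test_split(
    data["sentence"], data["is_relevant"], test_size=0.2, random_state=0
)

# Bag-of-words TF-IDF features feeding a linear support vector classifier
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(X_train, y_train)

# Precision is the number we care about: of the sentences the system keeps,
# how many are actually about scheduling?
print("precision:", precision_score(y_test, model.predict(X_test)))
```

A linear SVC over simple text features is a common starting point for short-text classification like this; the precision score at the end is exactly the number we are trying to push up.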

Before adding the topic relevance classifier to our system for detecting scheduling information, 44% of the sentences it caught were found to be irrelevant. After adding in supervised machine learning, we cut this number down to 17.5%!

Creating a scheduling gold standard

Supervised machine learning requires a “gold standard” — a dataset that contains examples of what we want our algorithm to produce. Usually, these have to be created by manually annotating data from the original dataset; the exercise above of marking false positives is one such example.

Doing a more involved annotation, beyond just a yes/no classification, would have been overly ambitious at the very beginning of our project. But now that we’ve narrowed down the type of information we want to extract, annotation has become a very useful process that fits within the project’s scope.

We’re focusing our annotations on capturing comprehensive scheduling information from the clinical trial descriptions. To do this, our team is going through 1,000 trial descriptions from our original dataset, and annotating each with our interpretation of:

  1. the duration of the trial,
  2. the duration of the follow up period,
  3. the number of visits, and
  4. the duration of each visit

Doing this will allow us to build a supervised machine learning model that can do exactly what our annotators did: interpret the information in the trial description text and present it in a neat, readable format.
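
To make the annotation targets concrete, here is one way a single annotated trial could be represented as a structured record. The field names, units, and example values are purely illustrative, not a finalized schema.

```python
# One possible record structure for a single annotated trial description.
# Field names, units, and the example values are illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SchedulingAnnotation:
    trial_id: str                              # identifier from the source dataset
    trial_duration_weeks: Optional[float]      # 1. duration of the trial
    followup_duration_weeks: Optional[float]   # 2. duration of the follow-up period
    num_visits: Optional[int]                  # 3. number of visits
    visit_duration_hours: Optional[float]      # 4. duration of each visit
    notes: str = ""                            # free-text caveats from the annotator

# Example: a 12-week trial with 4 weeks of follow-up and six one-hour visits
example = SchedulingAnnotation("NCT00000000", 12, 4, 6, 1.0)
```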

The annotation interface

So far, we’ve had a bunch of sample clinical trial descriptions listed in a Google Doc, with our interpretations of these four pieces of information written beneath each one. For now, a simple Google Doc is enough while we work through a few examples to make sure these four pieces of information are usually present and paint a reasonably comprehensive picture of scheduling burden.

To do 1,000 annotations, though, we’ll have to develop a better interface for annotating than Google Docs. This is a step in most supervised machine learning projects.

Since 1,000 is still a relatively small number and we’re on a tight schedule, we’ll likely end up using a spreadsheet with one column for each of the four pieces of information, filled in alongside the sentences from each trial description that our topic relevance classifier flagged as relevant to scheduling. For larger projects, however, there are pre-built data annotation tools such as Dataturks.

Using regression

Because our output in this task is quantitative, we want a regression model instead of a classifier. Once we’ve completed our annotations, we’ll train a model such as a decision tree or random forest to estimate the four pieces of scheduling information for each trial description.
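
As a rough sketch of what this might look like, here is how a random forest regressor could be trained with scikit-learn to predict, say, the number of visits. The CSV layout and column names are assumptions for illustration, and each of the four quantities would get its own model trained the same way.

```python
# Minimal sketch of one of the regression models described above: predicting
# the number of visits from the scheduling-relevant sentences of each trial.
# The CSV layout and column names are assumptions, not our actual files.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

data = pd.read_csv("scheduling_gold_standard.csv")   # one row per annotated trial
X_train, X_test, y_train, y_test = train_test_split(
    data["scheduling_sentences"], data["num_visits"], test_size=0.2, random_state=0
)

# Text features in, a quantity out -- the defining shape of a regression task
model = make_pipeline(
    TfidfVectorizer(), RandomForestRegressor(n_estimators=200, random_state=0)
)
model.fit(X_train, y_train)
print(model.predict(X_test.iloc[:5]))   # estimated visit counts for five held-out trials
```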

An important part of building a model that predicts quantities is communicating the margin of error to patients. As we annotate, we’re thinking about error quantification, with the goal of providing patients with a confidence range for each quantitative value. This could be as simple as showing a range, such as “5–7 visits”, or perhaps a pictorial representation of the range.
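
One simple option we might explore, continuing the random forest sketch above, is to derive a range from the spread of the individual trees’ predictions. The helper below is illustrative only and assumes the fitted pipeline from the previous sketch.

```python
# One way to turn the random forest above into a range ("5-7 visits"): look at
# the spread of the individual trees' predictions. Illustrative only; `model`
# is the fitted pipeline from the previous sketch.
import numpy as np

vectorizer = model.named_steps["tfidfvectorizer"]
forest = model.named_steps["randomforestregressor"]

def visit_range(scheduling_sentences: str, low_q: float = 10, high_q: float = 90):
    """Return a (low, high) percentile range over the per-tree predictions."""
    features = vectorizer.transform([scheduling_sentences])
    per_tree = np.array([tree.predict(features)[0] for tree in forest.estimators_])
    return np.percentile(per_tree, low_q), np.percentile(per_tree, high_q)

low, high = visit_range("Participants will attend weekly clinic visits for 12 weeks.")
print(f"{low:.0f}-{high:.0f} visits")
```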

Stay tuned for more updates!
