Solving Fundamental Problems with CS Fundamentals

Flatiron Health
Mar 12, 2019

By Dan Ziring and Alex Padmos


Identifying patients who are eligible for clinical trials is one of the fundamental challenges the cancer research community faces. While there are personal reasons that may dissuade a patient from participating in research, there are also many logistical barriers. Chief among them is timing: a patient must be identified at just the right moment, when they are ready to be put on a therapy but have not yet started one. That is hard when a practice may have dozens of trials open, each with a dozen or more inclusion/exclusion criteria, and hundreds of patients coming into the practice each day.

This is where we think technology can help. However, people with software experience from other industries often assume the answer is one of two extremes: either a complex ensemble of ML approaches or a simple SQL statement comparing the patient data to the trial's requirements. Unfortunately, the data available is typically dirty, incomplete, unlabeled, and complex. There is inherently no 100% right answer, no matter how sophisticated your approach, because you're never given 100% of the data needed to solve the problem. The "right" answer is any one that performs better than the current solution and provides visibility into its workings, and it could even be one you learned in CS 101.

When looking at a clinical trial, there are many different potential eligibility criteria: diagnosis, stage, age, prior therapies, and biomarkers are among the most common. As we’ve matched patients in the past, we’ve thought about this as a linear pathway:

  1. Does the patient match on disease? If yes, go to step 2. If not, they are not eligible.
  2. Does the patient have the correct biomarker mutations? If yes, go to step 3. If not, they are not eligible.

As we started to look at adding more criteria, and specifically biomarkers — measurable, biological indicators related to disease — we started to run into an issue: most trials were not so straightforward, and our current process for adding criteria wasn’t going to hold up. Take the very simplified criteria for a real multi-tumor study (across several diseases) with a few different biomarkers (in this case, ER, PR and KRAS):

  1. Is the patient over 18? If yes, go to step 2. If not, they are not eligible.
  2. Does the patient have breast cancer? If yes, go to step 3. If not, go to step 4.
  3. Is the patient ER- and PR-? If yes, they are eligible. If not, they are not.
  4. Does the patient have colorectal cancer? If yes, go to step 5. If not, they are not eligible.
  5. Does the patient have a KRAS mutation? If yes, they are eligible. If not, they are not.
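Written out in code, that pathway is just a chain of conditionals (the patient record and its field names here are hypothetical, for illustration only):

```python
# A hypothetical patient record; the field names are illustrative.
patient = {"age": 54, "disease": "breast", "er": "negative", "pr": "negative", "kras": None}

def eligible(p):
    if p["age"] <= 18:                       # step 1: must be over 18
        return False
    if p["disease"] == "breast":             # step 2: breast arm
        return p["er"] == "negative" and p["pr"] == "negative"  # step 3
    if p["disease"] == "colorectal":         # step 4: colorectal arm
        return p["kras"] == "positive"       # step 5
    return False                             # no arm applies

print(eligible(patient))  # True: over 18, breast, ER- and PR-
```

This works, but every new trial means hand-writing a new chain of branches, which is exactly the scaling problem described above.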
A visual representation of the criteria above, showing a slightly more complicated trial

As we started to run into this problem, we participated in the TOP Health sprint to improve our trial-to-patient matching capabilities. Through this, we had access to a mock dataset that had been hand-curated by the National Cancer Institute. In this dataset, trial criteria were structured as boolean expressions.

For the one above, it might be represented as:

(Age > 18) AND ((ER = Negative AND PR = Negative AND Disease = Breast) OR (KRAS = Positive AND Disease = Colorectal))
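A note on precedence: AND binds tighter than OR, so explicit grouping is needed for the age check to apply to both disease arms. Translated directly into Python (the patient fields here are hypothetical), the expression reads:

```python
def matches(p):
    # Explicit parentheses: the age requirement applies to both disease arms.
    return (p["age"] > 18) and (
        (p["er"] == "negative" and p["pr"] == "negative" and p["disease"] == "breast")
        or (p["kras"] == "positive" and p["disease"] == "colorectal")
    )

print(matches({"age": 44, "disease": "colorectal", "er": None, "pr": None, "kras": "positive"}))  # True
```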

This led to a breakthrough: instead of storing each type of eligibility criterion in its own silo ("Here are the disease criteria, here are the biomarker criteria"), what if they could all be stored and evaluated in a common data structure? If we treat the boolean operators above as if they were mathematical operators, we can create a classic expression tree (internally, we've been calling this a decision tree, but that's something else).
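As a sketch of the idea (the class names here are ours, not the production code), the same expression becomes a tree of operator nodes with individual criteria at the leaves:

```python
class And:
    def __init__(self, *children):
        self.children = children
    def evaluate(self, patient):
        # Matches only if every child subtree matches.
        return all(c.evaluate(patient) for c in self.children)

class Or:
    def __init__(self, *children):
        self.children = children
    def evaluate(self, patient):
        # Matches if any child subtree matches.
        return any(c.evaluate(patient) for c in self.children)

class FieldIs:
    """Leaf: a single criterion comparing one patient field to an expected value."""
    def __init__(self, field, value):
        self.field, self.value = field, value
    def evaluate(self, patient):
        return patient.get(self.field) == self.value

class AgeOver:
    """Leaf: a numeric threshold criterion."""
    def __init__(self, years):
        self.years = years
    def evaluate(self, patient):
        return patient["age"] > self.years

# The trial above as a tree: AND at the root, OR over the two disease arms.
trial = And(
    AgeOver(18),
    Or(
        And(FieldIs("disease", "breast"), FieldIs("er", "negative"), FieldIs("pr", "negative")),
        And(FieldIs("disease", "colorectal"), FieldIs("kras", "positive")),
    ),
)

print(trial.evaluate({"age": 60, "disease": "breast", "er": "negative", "pr": "negative"}))  # True
```

Adding a new kind of criterion is just adding a new leaf class; the operator nodes never change.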

Taking the example above, we can translate that into the tree on the right below:

On the left, a classic expression tree ((4 - 5) * 3 + 9) from CS fundamentals. On the right, clinical trial criteria represented with a modified expression tree.

When we evaluate each node, we bubble up its result to the node above it, eventually getting to a result of whether a patient is eligible for this trial.

In the example below, the patient has breast cancer and is ER- and PR-, but doesn’t have colorectal cancer and hasn’t tested KRAS+. Despite that, they are eligible for the trial because of the left subtree.

An example of the expression tree evaluated for a patient who is over 18, has breast cancer, and is ER- and PR-, but doesn't have colorectal cancer and hasn't tested KRAS+.
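That bubble-up evaluation can be written as a short recursion. The tuple encoding below is ours, for illustration: the first element names an operator, and anything else is a leaf comparing one patient field.

```python
def evaluate(node, patient):
    op, args = node[0], node[1:]
    if op == "AND":
        return all(evaluate(a, patient) for a in args)  # every subtree must match
    if op == "OR":
        return any(evaluate(a, patient) for a in args)  # any one subtree may match
    field, expected = op, args[0]                       # leaf: compare one field
    return patient.get(field) == expected

tree = ("AND", ("age_over_18", True),
        ("OR",
         ("AND", ("disease", "breast"), ("er", "negative"), ("pr", "negative")),
         ("AND", ("disease", "colorectal"), ("kras", "positive"))))

# The patient from the caption: over 18, ER-/PR- breast cancer, no KRAS result.
patient = {"age_over_18": True, "disease": "breast", "er": "negative", "pr": "negative"}
print(evaluate(tree, patient))  # True: the left subtree alone makes them eligible
```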

Each leaf node in the Flatiron clinical trials decision tree represents a single inclusion or exclusion criterion. The nodes (and their criteria) are mixed and matched into different trees to form the criteria for different trials. In the TOP Health sprint, we parsed experimental criteria expressions to form these trees, and supplemented them with data from clinicaltrialsapi.cancer.gov. As we move this into production, the eligibility criteria aren't clean enough for us to confidently use them in these mappings, so we are going to work with our partner sites to input the criteria for the trials open at their sites.

However, we've found that one of the biggest impediments to effective trial matching is the lack of trial criteria information, meaning that we've been limited in our ability to suggest trials for which a patient is actually eligible. With the comprehensive trial information provided for the sample dataset in TOP Health, we were able to greatly improve our ability to match attributes on the mock patients. If the cancer.gov trials API included something similar to eligibility expressions for each trial, it could create a template for both patient-to-trial and trial-to-patient matching that developers across the industry could follow, and would have an enormous benefit for patients.

Using an expression tree preserves the visibility and explainability that comes with a deterministic matching approach while allowing us to query entirely disparate data sources, with match rankings, through a unified interface. Moving forward, this also enables us to incorporate machine learning into our approach, having nodes return probabilities instead of absolutes.
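As a hypothetical illustration of that direction (a sketch of the idea, not our production scoring), leaves could return a probability of matching instead of a boolean, with AND multiplying child probabilities (assuming independence) and OR taking the most likely branch:

```python
def evaluate_prob(node, patient):
    """Like boolean evaluation, but each leaf returns P(match) in [0, 1]."""
    if not isinstance(node, tuple):
        # Leaf: here, a hard-coded probability standing in for a model's
        # P(match) for this patient; a real leaf would inspect `patient`.
        return node
    op, args = node[0], node[1:]
    if op == "AND":
        p = 1.0
        for a in args:
            p *= evaluate_prob(a, patient)  # assumes child criteria are independent
        return p
    if op == "OR":
        return max(evaluate_prob(a, patient) for a in args)  # best branch wins
    raise ValueError(f"unknown operator {op!r}")

tree = ("AND", 0.9, ("OR", 0.4, 0.8))
print(evaluate_prob(tree, {}))  # 0.9 * max(0.4, 0.8)
```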

When written out, the approach is nothing complex. Each leaf node has a job; for instance, it knows how to take a patient's clinical information and return whether the patient matches on disease. Adding a new type of criterion involves only adding a new type of leaf node. When you consider that each node has the freedom to do something as complex as calling out to a machine learning model for help making its decision, it's tempting to use words like "system" or "ensemble coordinator," but it's still just a tree: a simple data structure that can help solve complex problems with the right nurturing.

Appendix: Runnable Code

In an effort to make this post more concrete, we've included a simplified version of our matching tree code below to show how it all fits together. There are two "Operator" classes (AND and OR), which can have children added to them, and two "Leaf" classes.

For the sake of simplicity, our equivalent of a patient is MockMatchable, a namedtuple that has a number and a series of letters associated with it. One leaf class, LetterMatchLeaf, only allows a certain subset of letters. The other, NumberMatchLeaf, only allows numbers below a given threshold. In our production system, we have leaves such as DiseaseMatchLeaf and BiomarkerMatchLeaf, but the effect is the same.

At the bottom of the code, we build a simple tree and evaluate different MockMatchable objects against it. The tree could return probabilities, but in our simple example, it only returns 1 or 0. Our nonsensical tree looks like this:

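The listing below is a simplified sketch in the shape described above. The names MockMatchable, LetterMatchLeaf, and NumberMatchLeaf come from the description; the exact constructors and method signatures are illustrative.

```python
from collections import namedtuple

# Our stand-in for a patient: a number and a series of letters.
MockMatchable = namedtuple("MockMatchable", ["number", "letters"])

class Operator:
    """Base for AND/OR nodes; children are added after construction."""
    def __init__(self):
        self.children = []
    def add_child(self, node):
        self.children.append(node)
        return self

class AndOperator(Operator):
    def evaluate(self, matchable):
        # 1 only if every child returns 1.
        return 1 if all(c.evaluate(matchable) for c in self.children) else 0

class OrOperator(Operator):
    def evaluate(self, matchable):
        # 1 if any child returns 1.
        return 1 if any(c.evaluate(matchable) for c in self.children) else 0

class LetterMatchLeaf:
    """Allows only a certain subset of letters."""
    def __init__(self, allowed):
        self.allowed = set(allowed)
    def evaluate(self, matchable):
        return 1 if set(matchable.letters) <= self.allowed else 0

class NumberMatchLeaf:
    """Allows only numbers below a given threshold."""
    def __init__(self, below):
        self.below = below
    def evaluate(self, matchable):
        return 1 if matchable.number < self.below else 0

# Build a small, nonsensical tree and evaluate patients against it.
root = OrOperator()
root.add_child(NumberMatchLeaf(below=10))
branch = AndOperator()
branch.add_child(LetterMatchLeaf("abc"))
branch.add_child(NumberMatchLeaf(below=100))
root.add_child(branch)

print(root.evaluate(MockMatchable(5, "xyz")))   # 1: number is below 10
print(root.evaluate(MockMatchable(50, "ab")))   # 1: letters in {a,b,c} and below 100
print(root.evaluate(MockMatchable(500, "ab")))  # 0: neither branch matches
```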

We'd encourage you to play around with building different trees, new leaves, and MockMatchable objects to see how you can build upon this.

Disclaimer: There are many ways to improve this code, but in the interest of brevity, we’ve kept it short — please ignore the lack of real inheritance and other things you might expect in production code.

Flatiron Engineering

Thoughts from the Engineering team at Flatiron Health.

Flatiron Health

Written by

Our mission is to improve lives by learning from the experience of every cancer patient.
