Challenges of Translational Data Science

Shannon Quinn
Published in The Quarks
Mar 9, 2018

Two weeks ago, I had the pleasure of attending the Georgia Clinical & Translational Science Alliance (CTSA) 2018 Statewide Conference at Château Élan.

Source: http://georgiactsa.org/images/news-images/stateconf18.JPG

This meeting gathered about 250 of Georgia’s foremost clinical and translational researchers from all of its major universities, featuring keynote speakers, poster sessions, accepted talks, and panels. Much of it was admittedly beyond my immediate area of expertise, but translational work was why I wanted to get into academia in the first place: to blur the lines between theory and practice and create things that are useful to people outside my field, and even outside my profession.

Since early in my undergraduate studies, I’ve built a career on being interdisciplinary and working with people whose expertise is distinct from my own. I’ve published with clinicians and other translational researchers, and have helped generate initial findings for R01 grants. And yet, as someone who still spends most of his time working with cleaned and standardized data sets, I always have to remind myself of one thing when I dive back into truly, honest-to-deity translational research:

It’s messy.

One of my group’s biggest projects is developing models of ciliary motion.

Source: http://dx.doi.org/10.1126/scitranslmed.aaa1233

Cilia are microscopic hairs that line the exteriors of pretty much every cell of your body. Their coordinated motion is implicated in a variety of processes and functions, including keeping the lungs clear of particulates and irritants, cell-cell sensory signaling, early neural development, and other signaling mechanisms; as a result, disordered ciliary motion is implicated in numerous disease pathologies.

Suffice it to say, recognizing disordered motion is important. Unfortunately, there’s really no objective definition of “disordered” motion. Part of the problem is institutional: protocols for acquiring cilia biopsies, recording the motion, and assessing it are not standardized, and the assessment process is entirely dependent on the individuals — their training, background, and biases — actually performing the assessment.

The other part, however, is a feature of clinical and translational research: “disordered” motion, at least as it is currently understood, is more than what is directly observed in biopsies. The diagnostic process consists of additional metrics beyond motion assessment of biopsies: transmission electron microscopy can examine the underlying structure of cilia for deformities (if there actually are deformities), nasal nitric oxide levels correlate with certain pathologies (except when they don’t), and ciliary beat frequency can be an effective indicator (sometimes).

Each of these metrics has weaknesses on its own; we discovered this early in the work that would ultimately become our 2015 publication, when we attempted to design a simple classifier based on ciliary beat frequency. It would work… sometimes. Initially we wrote off the misclassifications as the result of insufficient preprocessing, as products of recording artifacts, or simply as outliers that were to be expected. It had to be pointed out to us that this was how the system worked — “it’s a feature, not a bug” — and that there were real biological reasons why this metric told only a very small fraction of the full story.
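To make that idea concrete, here is a minimal sketch of what a frequency-only classifier could look like: estimate a dominant beat frequency from a video’s pixel intensity time series, then threshold it against a plausible “normal” band. This is my own illustration, not the pipeline from our paper; the function names, the 9–17 Hz band, and the NumPy implementation are all assumptions.

```python
import numpy as np

def dominant_beat_frequency(video, fps):
    """Estimate a dominant beat frequency (Hz) from a grayscale video.

    video: ndarray of shape (frames, height, width)
    fps: frames per second of the recording
    """
    frames = video.shape[0]
    pixels = video.reshape(frames, -1).astype(float)
    # Remove each pixel's mean so the DC component doesn't dominate the FFT.
    pixels -= pixels.mean(axis=0)
    # Power spectrum of every pixel's intensity time series.
    power = np.abs(np.fft.rfft(pixels, axis=0)) ** 2
    freqs = np.fft.rfftfreq(frames, d=1.0 / fps)
    # Average over pixels, skip the DC bin, and take the spectral peak.
    mean_power = power.mean(axis=1)
    return freqs[1:][np.argmax(mean_power[1:])]

def classify_by_cbf(video, fps, normal_band=(9.0, 17.0)):
    """Call a biopsy video 'normal' if its dominant frequency falls inside
    an (illustrative) healthy beat-frequency band, 'abnormal' otherwise."""
    cbf = dominant_beat_frequency(video, fps)
    return "normal" if normal_band[0] <= cbf <= normal_band[1] else "abnormal"

# Hypothetical usage: a 2-second recording at 200 fps.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_video = rng.random((400, 64, 64))
    print(classify_by_cbf(fake_video, fps=200))
```

A single scalar like this is exactly the kind of metric that “works… sometimes”: it says nothing about coordination, waveform, or the broader clinical context.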

The take-away is that the broader clinical context of the patient — each metric in combination, and the patient’s entire health history — will paint the most complete and accurate picture of the pathology for the clinician to make a diagnosis.

A few months ago, Andrew Ng’s lab at Stanford made waves in multiple disciplines with the release of a paper purporting to show superhuman algorithmic performance in assessing clinical conditions from chest X-rays. It’s a great example of application, similar to the problem of ciliary motion analysis: with no quantitative, objective baseline, why not see if a computational pipeline can at least reproduce human results? Even better if it exceeds them!

Source: https://arxiv.org/abs/1711.05225

I encourage you to read the paper. My first impression was that it was a pretty standard application of well-understood convolutional deep network principles to images with discrete, categorical labels. In my view (and that of others), the biggest and most meaningful contribution of this work is actually the dataset; the task of assembling such a large, clinically-validated chest X-ray image dataset was nothing short of Herculean (over 112,000 frontal chest X-ray images from over 30,000 patients).
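For readers wondering what that “standard application” roughly looks like, here is a sketch in PyTorch: a DenseNet-121 backbone (the architecture the paper reports) with its ImageNet head swapped for a 14-way multi-label output and a per-finding binary cross-entropy loss. This is my own approximation, not the authors’ code; the learning rate, input size, and the torchvision weight-loading API are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_FINDINGS = 14  # the ChestX-ray14 label categories

# DenseNet-121 backbone with the ImageNet classification head replaced
# by a 14-way multi-label output layer.
model = models.densenet121(weights="IMAGENET1K_V1")
model.classifier = nn.Linear(model.classifier.in_features, NUM_FINDINGS)

# Multi-label setup: one sigmoid / binary-cross-entropy term per finding,
# since a single X-ray can carry several labels at once.
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(images, labels):
    """images: (batch, 3, 224, 224) float tensor; labels: (batch, 14) in {0, 1}."""
    optimizer.zero_grad()
    logits = model(images)
    loss = criterion(logits, labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```

Nothing in that sketch is exotic, which is rather the point: the modeling machinery is off the shelf, and the hard, valuable work lives in the data and its labels.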

From a translational standpoint, though, things are a little murkier. There were some questions about the method through which the labels were obtained (a small number were assigned manually, but the majority were extracted through natural language processing of the associated radiology reports). There were concerns that the real-life gradations between certain diagnostic categories weren’t reflected in the model. Even the model itself came under some criticism. I encourage you to check out the various critiques in detail.

The authors, to their credit, did a great job embracing open and transparent peer review. While the clinical findings of the paper may be a bit overstated — at best, it seems to perform at about the same level as highly-trained clinicians; obviously it outperforms humans without the same level of training — that doesn’t mean it’s not worthy of consideration. As stated, the dataset by itself is a huge achievement, and we shouldn’t expect to “solve” clinical diagnostics in one paper.

This hits on a lot of common themes, but if there’s one take-away from this post, it is that two components are essential:

When working in clinical or translational research, you need a thorough clinical understanding of the problem and interpretable models.

Source: https://imgs.xkcd.com/comics/machine_learning.png

Without getting too into the weeds here, one of the current swirling debates in the machine learning community is about the role of interpretability. There are plenty of articles and blog posts out there, but this one hit home for me (emphasis mine):

What I am against is a tendency of the “deep-learning community” to enter into fields (NLP included) in which they have only a very superficial understanding, and make broad and unsubstantiated claims without taking the time to learn a bit about the problem domain.

One of the reasons I’m so proud of our 2015 Science Translational Medicine paper is the learning experience that came after all the science was done (of course it was extraordinarily annoying at the time, but in retrospect I appreciate it much more). The paper spent over a year in review, during which time we got hammered by the reviewers on our use of the term “diagnose,” a term with deep clinical relevance that is not at all synonymous with “classification accuracy.” If you read through the paper now, you won’t find that word used in any context directly related to our work.

We in Data Science / Machine Learning / Artificial Intelligence have more powerful tools at our disposal than ever. The temptation, then, is to find some data X and an accompanying target y, close our eyes, point to a deep network architecture, and press PLAY. Not only does this not work in a clinical or translational setting, but I would argue it has little to no educational benefit, either. Aren’t you curious why you got 82% accuracy, and not 96% like the leader on Kaggle? Explore the structure of the data; view it from different angles. Maybe there’s a clever trick with an ensemble model that would boost your accuracy above 90%. Maybe some of the labels are wrong!
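As one small example of what “viewing the data from different angles” can mean in practice, here is a sketch of a cheap label audit with scikit-learn: fit a simple baseline, score every sample out-of-fold, and pull up the examples the model is most confidently “wrong” about for a human to re-examine. The synthetic X and y, the logistic-regression baseline, and the cutoff of 20 suspects are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Placeholder features and labels so the sketch runs end to end;
# in practice X and y come from your actual problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = rng.integers(0, 2, size=500)

baseline = LogisticRegression(max_iter=1000)

# Out-of-fold probabilities: every sample is scored by a model that
# never saw it during training.
proba = cross_val_predict(baseline, X, y, cv=5, method="predict_proba")[:, 1]

# Samples the baseline gets confidently "wrong" deserve a manual look;
# they are often mislabeled, recording artifacts, or genuinely interesting cases.
confidence_in_given_label = np.where(y == 1, proba, 1.0 - proba)
suspects = np.argsort(confidence_in_given_label)[:20]
print("Indices worth re-examining with a domain expert:", suspects)
```

Ten minutes of this kind of poking around, ideally with a clinician looking over your shoulder, tends to teach you more about the problem than another hyperparameter sweep.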

The real world is messy. That admittedly makes our lives harder. But rather than ignoring the messiness and working with an oversimplified problem, view it as an opportunity to really flex those data science muscles and embrace the messiness as a challenge! Establish a dialogue with the clinicians. Iterate on a few simple models. Diagnose what went wrong and what insights the models revealed. Talk more with the clinicians. Iterate on more models, increasing the complexity. Discuss the results and the mistakes. Learn more about the data: how it was gathered, where it might be flawed, and how that could be showing up in subtle ways in the model. Try more advanced strategies. Keep the dialogue going.

I don’t know about you, but that kind of challenge really sounds like fun.
