Mining Health Data: The New Gold Rush

Kelvin Chan
Unraveling Healthcare

--

“The really important thing about data is the more things you have to connect together, the more powerful it is.” — Tim Berners-Lee, founder of the World Wide Web, TED Talk 2009

Startups are generating types of data never before available in health. What’s exciting is not just the growing quantity of data (the buzzword being “big data”) but the possible connections between datasets. It’s in these data linkages that significant health innovation may lie.

Unfortunately, healthcare data today is still largely disconnected. In this void, health companies are rushing to find their role in the “connected” data sprint.

In 2016 alone, Apple and Google both announced mobile enhancements to facilitate native health data transfer for developers. Self-described data-driven insurance startup Clover Health raised a $160M Series C round. In M&A activity, health IT company NantHealth acquired NaviNet in January; IBM acquired Truven Health Analytics in February; and most recently, in May, data provider IMS Health announced a merger with Quintiles.

While it would be an oversimplification to suggest that these corporate transactions are solely driven by a mission to connect data, the strategic potential of data interconnectivity is hard to ignore.

To fully appreciate the potential of “connected data” and what these companies might be doing, I’ll start with what the term means.

What Does Connected Data Mean?

Links between data are valuable because data can describe other data. As a quick analogy, when you search Wikipedia for a drug (or any topic for that matter), you’ll encounter hundreds of links to other articles. Think of each linked article as another data set. Take “Advil” for example: you search [Advil] and find that it’s a brand name for [ibuprofen], which is an [NSAID], a class of [analgesics] that work by inhibiting [COX-2 enzymes]. Each [object] has its own rich Wikipedia article of information/data, with even more links to other [objects]. Only by reading through it all can you gain a full appreciation for what Advil is.

Realistically, none of us have time to read through every associated article. We read only enough to answer the question we’re asking. For example, if we wanted to know what Advil does, we’d learn immediately from reading the first paragraph that it treats pain. But what if we were investigating other potential COX-2 inhibitors that could treat pain beyond Advil? What if we wanted to discover other enzymes responsible for pain in the human body?

While we may have neither the time nor the speed to do all that reading, computers do. Hypothetically, a computer trained to recognize all the links across data could uncover answers to questions you didn’t even know you could ask.
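To make the idea concrete, here is a minimal Python sketch of a computer “reading” links. The link graph is a hand-built toy based on the Advil example above. The `LINKS` table, the `reachable` and `linked_to` helpers, and the extra celecoxib entry are my own illustrative assumptions, not real Wikipedia data. The sketch walks outward from [Advil] and then asks the reverse question: what else links to [COX-2 enzymes]?

```python
from collections import deque

# Hypothetical link graph mirroring the Advil example above.
# Each key is an [object]; each value lists the [objects] it links to.
LINKS = {
    "Advil": ["ibuprofen"],
    "ibuprofen": ["NSAID"],
    "NSAID": ["analgesic", "COX-2 enzyme"],
    "analgesic": ["pain"],
    "COX-2 enzyme": ["pain", "celecoxib"],
    "celecoxib": ["COX-2 enzyme"],
    "pain": [],
}


def reachable(start, links):
    """Breadth-first walk: map every object reachable from `start`
    to the number of links followed to get there."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in links.get(current, []):
            if neighbor not in hops:
                hops[neighbor] = hops[current] + 1
                queue.append(neighbor)
    return hops


def linked_to(target, links):
    """Reverse lookup: which objects link directly to `target`?"""
    return sorted(src for src, targets in links.items() if target in targets)


hops_from_advil = reachable("Advil", LINKS)
cox2_neighbors = linked_to("COX-2 enzyme", LINKS)
```

On this toy graph, the reverse lookup surfaces celecoxib as another object tied to COX-2, which is the kind of answer the “other COX-2 inhibitors” question is after. The catch is that the links have to exist in the first place.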

What Questions Can Connected Health Data Answer?

It’s hard to predict all the questions we can ask of connected health data; however, a few key areas are emerging.

*I’ll be referring to my categorization of health startups for this explanation, which you can read more about here.

Health Data Hubs. To see the original startup categorization map, click here.

[Patient Empowerment] Question 1: Can we understand and analyze patient health in the real world when they’re outside a setting of care?

Historically, there has been little insight into a patient’s life outside of physician care. Yet these periods of a patient’s life may be the most important. By the time patients end up at a doctor’s office or hospital, the ability to dramatically improve care is already severely limited. Tracking diet, exercise, and drug adherence before and after acute care captures the lifestyle factors that can keep patients out of risky and costly hospitalizations.

With the emergence of new health data sources from apps, wearables, and other “Patient Empowerment” startups, we are slowly able to paint a picture of a patient’s pre- and post-physician journey. The only way to paint a complete picture, however, is to unify and link the data produced across these various vendors.

  • [Examples] Current data may tell us: number of steps taken, nightly sleep habits, drug adherence, daily blood pressure
  • [Examples] Linked data could tell us: impact of medication or medication adherence on sleep and exercise, efficacy of medication on reducing blood pressure with or without exercise

Such “real world” data is unavailable in the controlled environments of clinical trials. Linking such data can provide insights into how diseases, treatments, or drugs manifest in a real world setting and impact health/lifestyle indicators for a patient.
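As a rough sketch of what “linking” means here, the Python below joins three made-up vendor exports on a shared (patient, date) key and then compares blood pressure across medication and activity groups. Everything in it is an assumption for illustration: the data values, the `link` and `mean_bp_by_group` helpers, and the 5,000-step cutoff for an “active” day. Real vendor data rarely shares a clean patient identifier, which is a large part of the problem.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical exports from three separate vendors, keyed by (patient_id, date).
steps = {("p1", "2016-05-01"): 9000, ("p1", "2016-05-02"): 2000,
         ("p1", "2016-05-03"): 8500, ("p1", "2016-05-04"): 1500}
took_medication = {("p1", "2016-05-01"): True, ("p1", "2016-05-02"): False,
                   ("p1", "2016-05-03"): True, ("p1", "2016-05-04"): True}
systolic_bp = {("p1", "2016-05-01"): 122, ("p1", "2016-05-02"): 141,
               ("p1", "2016-05-03"): 124, ("p1", "2016-05-04"): 130}


def link(*sources):
    """Inner join: keep only the (patient, date) keys present in every source."""
    shared_keys = set(sources[0]).intersection(*sources[1:])
    return {key: tuple(src[key] for src in sources) for key in sorted(shared_keys)}


def mean_bp_by_group(linked, active_threshold=5000):
    """Average systolic pressure grouped by (took medication, was active)."""
    groups = defaultdict(list)
    for day_steps, medicated, bp in linked.values():
        groups[(medicated, day_steps >= active_threshold)].append(bp)
    return {group: mean(values) for group, values in groups.items()}


linked = link(steps, took_medication, systolic_bp)
summary = mean_bp_by_group(linked)
```

Four days from one invented patient prove nothing, of course. The point is only that the grouped comparison in `summary` cannot be computed from any single vendor’s data on its own.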

[Healthcare Coordination] Question 2: Can we unify clinical insights from EMRs or claims data to understand how patients transition through different settings of care?

Provider-friendly platforms are only one facet of “Healthcare Coordination.” Another crucial and complementary aspect of coordinating care is unifying patient data.

Patient medical records are often fragmented across various settings of care (e.g., inpatient, outpatient, family physician). Only by unifying these datasets can one start to uncover a patient’s true medical history.

  • [Examples] Current data may tell us: medical charts with family doctor, hospitalization data, blood testing lab results, drug side effects data
  • [Examples] Linked data could tell us: medical insights sharing across healthcare professionals, gaps in care or unnecessary procedures, drug side effects or safety issues identified across patients (e.g., FDA Sentinel Initiative)

[Payment Reform] Question 3: Can we start measuring the true utilization and costs of healthcare by drug or procedure?

As clinical records unify, a parallel effort has been made to link these records to healthcare utilization and costs. Such metrics have historically been difficult to understand or calculate. Are procedures too expensive? Underpriced? How often do patients get a given procedure, and in what scenarios?

Continued adoption of value-based payment (VBP) models, which tie payments to treatment value, is often hindered by inaccurate estimates of what the value of a procedure or drug should be (e.g., how does one accurately calculate a bundled rate for an episode of care, or decide which services to include in that bundle?). The only way to advance these calculations is through comprehensive, linked data.

  • [Examples] Current data may tell us: claims data by members within their network, cost range of drugs or procedures for patient, outcomes metrics by patients of certain procedures
  • [Examples] Linked data could tell us: identification of high risk/high cost patients based on prior procedures, real world costs of different episodes-of-care, cost-effectiveness or value frameworks for drugs or therapies (see Memorial Sloan Kettering’s Drug Abacus or ASCO’s Value Framework for early iterations).

While CMS (the Centers for Medicare & Medicaid Services) has been leading a government charge to open up such claims data, much of this data today remains locked up with private payers.
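To illustrate the kind of calculation involved, here is a simplified Python sketch that rolls claim lines into an episode-of-care cost. The claim records, billing codes, dollar amounts, the `episode_costs` helper, and the 90-day window are all assumptions I made up for the example. Real bundled-payment definitions are far more involved.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical claim lines: (patient_id, service_date, billing_code, paid_amount).
# Codes and dollar amounts are invented for illustration.
CLAIMS = [
    ("p1", date(2016, 1, 4), "SURGERY", 12000.0),
    ("p1", date(2016, 1, 6), "INPATIENT_DAY", 3000.0),
    ("p1", date(2016, 2, 1), "PHYSIO", 400.0),
    ("p1", date(2016, 6, 15), "OFFICE_VISIT", 150.0),
    ("p2", date(2016, 3, 10), "SURGERY", 11000.0),
    ("p2", date(2016, 3, 20), "READMISSION", 7000.0),
]


def episode_costs(claims, trigger_code="SURGERY", window_days=90):
    """Sum every claim that falls within `window_days` of a patient's trigger
    claim. Returns {(patient_id, trigger_date): total_cost}."""
    by_patient = defaultdict(list)
    for patient_id, service_date, code, amount in claims:
        by_patient[patient_id].append((service_date, code, amount))

    totals = {}
    for patient_id, lines in by_patient.items():
        triggers = [d for d, code, _ in lines if code == trigger_code]
        for start in triggers:
            end = start + timedelta(days=window_days)
            totals[(patient_id, start)] = sum(
                amount for d, _, amount in lines if start <= d <= end
            )
    return totals


costs = episode_costs(CLAIMS)
```

Even in this toy version, the answer depends on judgment calls: which claim starts an episode, how long the window runs, and whether a readmission belongs in the bundle. Those choices can only be tested against linked, comprehensive data.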

[Personalized Medicine] Question 4: How can we leverage academic research data and emerging consumer genomics/diagnostics data to find targeted, individualized therapies?

Medicine has historically been built on population health guidelines — that is, if it works for most people, it will probably work for you. Statistically, there will always be edge cases where a particular drug or treatment does not work for you. And ineffective treatments are wasteful and possibly dangerous.

Personalized medicine is about identifying the factors that determine whether a treatment works for a particular patient, by analyzing new, previously unavailable data indicators. With greater access to academic and personalized data, there has been a growing focus on uncovering more targeted options for drug treatment and clinical guidelines.

  • [Examples] Current data may tell us: genomic sequencing data, proteomics pathway data, consumer genomics or diagnostics data, drug interactions data, drug safety data
  • [Examples] Linked data could tell us: biomarker detection by gene or protein to target drug usage, novel pathways for drug interaction and drug discovery, links between drug efficacy and safety to biomarkers

The Barriers to Linking Today’s Health Data

Unfortunately, training a computer to identify links between data is not easy. Imagine if there were no links in the “Advil” Wikipedia article. Would you know which related articles to look for, or how to find them? Possibly. We may recognize from contextual clues in the article that the “COX-2 enzyme” is an object that’s relevant to our question. But could a computer?

Many technical challenges exist in linking data, and it’s a problem that plagues all industries. But these challenges are further complicated in health where the bulk of today’s traditional data is amassed from a) indirect measurements and b) imprecise inputs.

a) Indirect measurements reduce the specificity of the data produced. They are proxies used when measuring at the source would prove difficult or impractical.

  • A good example is today’s industry gold standard for drug sales data, from IMS Health. Such data is collected from “wholesalers,” which buy drugs in bulk from pharmaceutical companies. Activity below the wholesaler level (e.g., at the patient level), however, is often obscured. Deeper, more granular analysis of other drug distribution channels, real-time drug utilization, net discounts or rebates by hospital or pharmacy, and so on is difficult to surface.
  • Another example is claims data, which lists the billable procedures performed on a patient. Unfortunately, this is only an abstraction of the services a patient receives. Office admins need to translate procedures into billable codes, which is often a source of error. Claims data also omits non-billable procedures, loses track of patients who switch insurance plans, and lacks more specific diagnostic-level information.

b) Imprecise inputs are inputs that elicit unstructured and inconsistent responses. Computers, unfortunately, don’t possess the same level of semantic understanding that we do. Rather, they are best at interpreting consistent, structured data (à la rows and columns), which reduces the need for the semantic understanding that free-form text requires. The majority of health data today lies somewhere between “structured” and “free-form text.”

  • One example is clinical EMR data. EMR data is largely unstructured, a result of free-flowing medical records, nurse or physician notes, hospital discharge data, and non-standardized diagnostics or lab data.
  • Although standards for clinical data collection are emerging, there is no consensus on one that gives providers the ease and flexibility to record patient notes in their own way without impeding the workflows through which they provide care.
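A small Python sketch shows why free-form text is hard for a computer. It tries to pull a blood pressure reading out of three invented clinical notes using a regular expression. The notes, the `BP_PATTERN` expression, and the `extract_blood_pressure` helper are my own assumptions for illustration, not how any real EMR vendor does it.

```python
import re

# Matches readings such as "BP 142/91", "bp: 120/80" or "blood pressure 135 / 85".
BP_PATTERN = re.compile(
    r"\b(?:bp|blood pressure)\b\s*:?\s*(\d{2,3})\s*/\s*(\d{2,3})",
    re.IGNORECASE,
)


def extract_blood_pressure(note):
    """Return the first (systolic, diastolic) pair found in `note`, or None."""
    match = BP_PATTERN.search(note)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Hypothetical free-text notes, written the way three different people might.
notes = [
    "Pt seen for follow-up. BP 142/91, advised to reduce salt.",
    "blood pressure: 118 / 76. No complaints.",
    "Pressure a bit high today, recheck in two weeks.",
]
readings = [extract_blood_pressure(note) for note in notes]
```

The first two notes yield structured values. The third yields nothing, even though any human reader understands that the pressure was elevated. That gap is the semantic understanding a pattern like this lacks.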

With these barriers in mind, not only has it been tough to find links between health datasets, but the utility of doing so has also been unclear.

Note: Many of these data obstacles are not unique to healthcare. One interesting parallel can be seen in TV Nielsen ratings. To read about this analogy and how this industry approached such challenges, click here.

The Health Data Connectivity Arms Race

Despite the obstacles, health companies and startups are diving in head first to get ahead in the connected health data arms race. The first ones to demonstrate accurate data links and surface repeatable insights will find the most success.

Two (not mutually exclusive) overarching strategies have taken shape:

Health Data Connection Strategies
  • A) Data first: Aggregate and clean existing health data. Only after messy data has been cleaned can insightful data connections be surfaced.

Companies like IMS and IBM are making land-grab mergers and acquisitions to aggregate health data. Even if the data’s not clean, there’s a mass of health data that already exists, and companies need to own it before they can start playing with it and building links.

Many of the computing buzzwords you hear likely fall in this bucket (Natural Language Processing, Cognitive Computing, Data Wrangling/Munging, etc.). These are all effectively tools that help computers structure or make sense of messy, indirect, and imprecise data.

  • B) Platform first: Build and integrate platforms that encourage clean, structured data output. As the platform scales, so will the clean data it spits out. Only then can direct, accurate data links be routinely and consistently made.

Existing health data is so messy that perhaps it’s easier to start at the platform level. The imprecise inputs of old claims and EMR systems have produced a disarray of messy, unintelligible data.

The alternative is to enhance the interoperability of platforms that encourage clean data. For Apple and Google, this means promoting developer-friendly SDKs or APIs that encourage standardized health data exchanges across the apps and devices that sit on their platforms. Building and scaling platforms with friendly UI and precise inputs may be the key to a consistent output of integrated, clean health data.
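As a sketch of what “precise inputs” could look like in practice, the Python below defines a structured record that rejects bad values at the moment of entry. The `BloodPressureReading` type, its field names, the allowed unit, and the plausibility range are all assumptions of mine. None of it is taken from Apple’s or Google’s SDKs.

```python
from dataclasses import dataclass
from datetime import datetime

ALLOWED_UNITS = {"mmHg"}  # assumed unit vocabulary for this sketch


@dataclass(frozen=True)
class BloodPressureReading:
    """A hypothetical structured record; not taken from any real SDK."""
    patient_id: str
    taken_at: datetime
    systolic: int
    diastolic: int
    unit: str = "mmHg"

    def __post_init__(self):
        # Reject imprecise input at entry instead of cleaning it up later.
        if not self.patient_id:
            raise ValueError("patient_id is required")
        if self.unit not in ALLOWED_UNITS:
            raise ValueError(f"unsupported unit: {self.unit!r}")
        # Assumed plausibility range, chosen only for illustration.
        if not 40 <= self.diastolic < self.systolic <= 300:
            raise ValueError("blood pressure values out of plausible range")


def try_record(**fields):
    """Return a validated reading, or the validation error message as a string."""
    try:
        return BloodPressureReading(**fields)
    except (TypeError, ValueError) as error:
        return str(error)


ok = try_record(patient_id="p1", taken_at=datetime(2016, 5, 1, 8, 30),
                systolic=122, diastolic=81)
# Systolic and diastolic swapped by mistake: caught at entry.
bad = try_record(patient_id="p1", taken_at=datetime(2016, 5, 1, 8, 30),
                 systolic=81, diastolic=122)
```

The trade-off is the one described earlier: the stricter the input, the cleaner the output, but the more the platform constrains how people record information.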

There are challenges to both strategies, and most health companies are likely attempting some form of both. For a truly connected world of health data to exist, both pathways will need to be pursued. As new mediums of health data emerge, health data technology will need to sprint to make sense of them. In the end, only the healthcare companies that can make sense of the expanding universe of old and new mediums of health data will prosper.

Kelvin is a healthcare professional working at Enigma.io. All views are his own.

Healthcare professional working on how data can help solve many of today’s current health problems. Former consultant in drug strategy. All views are my own.