Mining Gold: Healthcare Data and Its Potential to Unlock Patient Benefit and Create Value

Lusi Chien
Published in Outlier Ventures
8 min read · Jun 16, 2021

With the advent of big data and AI, health data can allow for the next generation of personalization of treatment, development of new drugs, and possibly prediction and prevention of chronic diseases.

This three-part series looks at:

  1. An overview of health data
  2. How companies have used it to move from personalization to prediction to prevention
  3. How to build a data business as a start-up
Su Huang from Datavant

For part I, I got a chance to interview Su Huang, head of data strategy at Datavant. Founded in 2017, Datavant connects healthcare data to eliminate the silos of healthcare information that hold back innovative medical research and improved patient care. They help data owners manage the privacy, security, compliance and trust required to enable safe data sharing.

Lusi: Hi Su, thanks for helping me give my readers a look into all the different types of data out there. Can you describe the main types of healthcare data?

Su: The “traditional” data types that have been used in healthcare research for years are:

  • claims data — which is the bread and butter of healthcare research
  • EHR data
  • labs data

Lusi: Let’s go into each of these. Tell me first about claims data.

Su: Claims data has been used for decades and tends to be the most longitudinal in nature, since you can follow patients from provider to provider even if they use different EHR systems. Example players in this space include IQVIA, IBM MarketScan, Optum Claims (a division of UnitedHealth Group), and Definitive Healthcare.

Lusi: I’ve used Definitive Healthcare for several of my start-ups, but it seems that the bulk of its offerings are Medicare data, which tends to be a couple of years old. I know they’ve started to introduce commercial data recently as well. What are the limitations of the current state of data?

Su: Some of the differences to keep in mind come from how these companies source their underlying claims data. There are two broad kinds of claims databases. Some, such as those provided by IQVIA and Symphony Health, aggregate data from many different claims clearinghouses and provider billing systems. These large data aggregators will cover more patient lives, but may not have 100% claims capture for each patient. Others, such as those provided by IBM MarketScan, Optum, Blue Health Intelligence, and Inovalon Insights, have claims data from health plans or self-insured employer groups. These databases tend to cover fewer patient lives, but will have 100% claims capture for each patient. It’s important to understand that all these providers de-identify the data first, stripping away identifiers (e.g. name, date of birth, phone number, and address) so that patient privacy is not compromised but the value of the health data is retained for analytic purposes.

Other factors to keep in mind as you evaluate claims data providers are data latency, granularity of data fields, permitted use cases and pricing.
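To make the de-identification step Su describes more concrete, here is a minimal Python sketch. The field names and the sample record are hypothetical, and the logic is only an illustration of stripping direct identifiers while keeping analytic fields. It does not claim to meet any regulatory de-identification standard.

```python
# Hypothetical field names. Real de-identification has to follow the
# applicable regulation; this toy example makes no such claim.
DIRECT_IDENTIFIERS = {"name", "date_of_birth", "phone_number", "address"}


def deidentify(record: dict) -> dict:
    """Return a copy of the record with direct identifiers removed."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Keep only a coarse birth year so age-based analysis is still possible.
    dob = record.get("date_of_birth")
    if dob:
        cleaned["birth_year"] = int(dob[:4])  # assumes ISO "YYYY-MM-DD"
    return cleaned


# Made-up claim record for illustration.
claim = {
    "name": "Jane Doe",
    "date_of_birth": "1980-04-12",
    "phone_number": "555-0100",
    "address": "1 Main St",
    "diagnosis_code": "E11.9",
    "procedure_code": "83036",
}

deidentified_claim = deidentify(claim)
```

The analytic fields (diagnosis and procedure codes) survive, while the name, phone number, address, and exact date of birth do not.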

Lusi: How about EHR data? Does it address some of these concerns?

Su: EHR data is less longitudinal than claims data, but you can obtain greater detail on a patient’s condition, symptoms, and disease progression by using EHR data. The challenge with EHR data is that as much as 80% of the data could be in physician notes which sit in unstructured data fields. Thus, curating EHR data first will maximize the utility of the EHR record. Companies like Flatiron, Ontada, and ConcertAI do this with oncology EHR data. Cerner, one of the largest EHR providers for hospitals, is starting to develop real-world data offerings for researchers. Start-ups like OMNY Health are creating platforms for hospitals to be able to commercialize their data and make it available for analytics.

Lusi: What are some ways start-ups can add value to EHR data, given that so much of it is unstructured?

Su: There is a lot of opportunity to apply natural language processing and machine learning techniques to EHR data. Many of the aforementioned companies still rely on a process called clinical abstraction, which means deploying teams of people with clinical backgrounds to curate data from unstructured fields into structured fields. It is a heavily manual process today. Start-ups can put data scientists and engineers to work on solving this big problem and making it more seamless to organize and use unstructured health data. Start-ups like Mendel.ai are focused on automating the translation of unstructured data into structured, analytics-ready data, as are big companies like Amazon, through their HealthLake product.
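As a rough illustration of what turning unstructured notes into structured fields means, here is a toy Python sketch. It uses simple regular expressions as a stand-in, not a real NLP or machine learning model, and it is not how any company named above does it. The note text, patterns, and output field names are all hypothetical.

```python
import re

# Toy patterns for illustration only; real clinical notes are far messier
# and need proper NLP models plus clinical review.
STAGE_PATTERN = re.compile(r"\bstage\s+(IV|III|II|I)\b", re.IGNORECASE)
ECOG_PATTERN = re.compile(r"\bECOG\s*(?:PS)?\s*[:=]?\s*([0-4])\b", re.IGNORECASE)


def abstract_note(note: str) -> dict:
    """Pull two structured fields out of a free-text physician note."""
    stage = STAGE_PATTERN.search(note)
    ecog = ECOG_PATTERN.search(note)
    return {
        "cancer_stage": stage.group(1).upper() if stage else None,
        "ecog_status": int(ecog.group(1)) if ecog else None,
    }


# Made-up note text.
note = "Pt with stage III NSCLC, ECOG 1, tolerating chemo well."
structured = abstract_note(note)
```

Fields the patterns cannot find come back as `None`, which is where a human abstractor or a stronger model would have to step in.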

Lusi: Finally, let’s talk about labs.

Su: Approximately 70% of diseases are diagnosed through lab testing, making it a vital part of understanding a patient’s journey. Therapeutics are also becoming increasingly biomarker driven, which makes lab data, especially genomic lab data, a key part of selecting the right treatment for a patient. You can obtain lab data from analytic platform companies like Prognos Health, which aggregates and normalizes data from many different labs. Or you can work with individual labs, which may have teams focused on licensing de-identified data or conducting analytic projects on their data.

Lusi: I know Tempus started out as a lab service but has really hit it out of the park with a $1B+ valuation. There are also various new entrants looking for easier ways to test at home such as Nephrosant with urine-based tests and Qvin using menstrual blood.

Lusi: You mentioned traditional sources of data. What are some emerging ones?

Su: There are so many other sources of healthcare data, such as genomics data, remote monitoring device data, fitness tracking data, telehealth data, disease-specific registries, and imaging data. There’s also a broad category called Social Determinants of Health (SDoH) data, which describes the conditions in which people live, learn, work, and play (e.g. where you live, your education level, and your access to transportation). Health researchers are increasingly interested in SDoH data because these conditions impact health outcomes. All of these data types are broadly referred to as Real World Data (RWD), which is defined as any data relating to a patient’s health and delivery of care that is routinely collected in a person’s regular life (and outside of the clinical trial setting).

Lusi: It’s interesting that you mention imaging data. Having just come from the imaging space, I am aware of several players who have a lot of imaging data, but it seems like the primary use case for the data is to train AI algorithms for disease detection, and it is not leveraged much beyond that.

Some existing imaging data players include:

  • American College of Radiology
  • PACS vendors such as Change Healthcare or LifeImage
  • Start-ups such as Segmed
  • Hospitals and imaging centers such as Stanford, Partners Health, Radnet, and RadPartners
  • AI in Medical Imaging players such as AIDOC, Subtle Medical, Clarify and Zebra, though they tend to use the data for their own research rather than share it commercially

Su: There is definitely interest in imaging data, but typically, RWD needs to be de-identified first (unless you have patient consent to use their identified data for research) and ideally turned into structured data. So, I think that is one of the challenges to broader adoption of imaging data in RWD research. Additionally, as with other complex types of big data such as genomics, there can be challenges to finding the right people with clinical expertise to process and interpret this kind of data. Lastly, there are technical challenges to storing, querying and analyzing imaging data for RWD use cases.

Lusi: It seems that previously data had been used more for sales and marketing (e.g. determining market share and market size) but now it’s moving more towards influencing personalized treatment and even future drug discovery?

Su: Data for commercial use cases is still a large area of spend for pharmaceutical companies. However, use of RWD in the clinical trial space is accelerating, especially in areas like oncology, where the populations that need to be recruited for clinical trials are becoming increasingly precise and there are more long-term follow-up requirements. The 21st Century Cures Act, enacted in 2016, helped accelerate the acceptance of using RWD to modernize clinical trial design and speed up the clinical development process. A very tangible way in which RWD is accelerating trials is by enabling synthetic control arms (SCAs). Instead of recruiting patients for a traditional control arm, you design the control arm using the real-world health records of a cohort of patients. This reduces recruitment time and the overall cost of a trial. BCG has a great article about SCAs.
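The cohort-selection idea behind a synthetic control arm can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the eligibility criteria, record fields, and patients are all made up, and real SCAs rely on much more careful statistical methods (for example, propensity score matching) to make the external cohort comparable to the treated arm.

```python
from dataclasses import dataclass


@dataclass
class PatientRecord:
    patient_id: str
    age: int
    diagnosis: str
    prior_lines_of_therapy: int
    responded: bool  # outcome observed in routine care


def is_eligible(p: PatientRecord) -> bool:
    # Hypothetical criteria mirroring an imaginary trial protocol.
    return (
        p.diagnosis == "NSCLC"
        and 18 <= p.age <= 75
        and p.prior_lines_of_therapy <= 1
    )


def build_control_arm(records: list[PatientRecord]) -> list[PatientRecord]:
    """Select real-world patients who would have met the trial criteria."""
    return [p for p in records if is_eligible(p)]


def response_rate(arm: list[PatientRecord]) -> float:
    if not arm:
        raise ValueError("control arm is empty")
    return sum(p.responded for p in arm) / len(arm)


# Made-up real-world records.
real_world_records = [
    PatientRecord("rw-001", 64, "NSCLC", 1, True),
    PatientRecord("rw-002", 58, "NSCLC", 0, False),
    PatientRecord("rw-003", 81, "NSCLC", 1, True),  # excluded: age
    PatientRecord("rw-004", 49, "CRC", 0, True),    # excluded: diagnosis
]

control_arm = build_control_arm(real_world_records)
control_rate = response_rate(control_arm)
```

The resulting response rate is what the treated arm of the trial would be compared against, in place of a separately recruited control group.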

Lusi: It is clear that one source of data is no longer sufficient, and the “smartest” solutions are combining EHR, imaging, genomics and labs data all together. Who is using multi-modal data effectively, and what challenges remain in stitching everything together?

Su: Yes, that is the smart approach given the wealth of data we have today. Enabling data to be stitched together while maintaining patient privacy is what Datavant’s technology offers. There are many companies that are doing this effectively, from big pharma manufacturers like Janssen to analytic companies like Komodo Health, TriNetX, or IPM.ai. There has been a particular explosion of healthcare analytic companies that are applying advanced analytic techniques to massive amounts of aggregated data. Many start off with one or two specific focus areas and then expand into other use cases for their platform. For instance, Komodo Health has built a real-time Healthcare Map to help with patient journey understanding. TriNetX has a platform to help sponsors with clinical trial site selection and recruitment. IPM.ai can build models on aggregated data that help rare disease patients achieve earlier diagnoses. There are many others as well.
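One general way to stitch de-identified datasets together without exchanging names is to replace identifiers with a keyed hash, or token, and join on that token. The Python sketch below illustrates this general idea only. It is an assumption for illustration and not a description of Datavant’s actual technology; the key, field names, and records are hypothetical.

```python
import hashlib
import hmac

# Hypothetical shared secret; a real system would manage keys securely.
SECRET_KEY = b"demo-key-not-for-production"


def make_token(first_name: str, last_name: str, date_of_birth: str) -> str:
    """Derive a stable, non-reversible token from normalized identifiers."""
    normalized = "|".join(
        part.strip().lower() for part in (first_name, last_name, date_of_birth)
    )
    return hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def link(left: list[dict], right: list[dict]) -> list[dict]:
    """Join two tokenized datasets on their shared token."""
    index = {row["token"]: row for row in right}
    return [{**row, **index[row["token"]]} for row in left if row["token"] in index]


# Each data owner tokenizes locally, then shares only tokens plus clinical fields.
claims = [{"token": make_token("Jane", "Doe", "1980-04-12"), "diagnosis_code": "E11.9"}]
labs = [{"token": make_token(" jane ", "DOE", "1980-04-12"), "hba1c": 7.2}]

linked = link(claims, labs)
```

Because both sources normalize the identifiers the same way before hashing, the same patient yields the same token in each dataset, so the claims and lab rows can be joined without either side revealing a name.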

Thanks so much for sharing this overview with me today. There definitely seems to be a lot going on in the healthcare data space. We will talk in Part II about how all of this data is making a difference in patient lives by moving treatment from personalization to prediction to prevention.

Lusi is a global commercial leader in the Healthcare Life Sciences space, launching the latest AI and medical device technologies to help patients.