Privacy landscape for healthcare data sharing

Cody Sedler
9 min read · Oct 24, 2022


Thank you to Jonah Leshin, Head of Privacy Research at Datavant, for helping to author this post & Jake Plummer, Head of Customer Success at Datavant, for his review/feedback.

I’ve worked with dozens of companies entering the health data space — each requiring a custom approach to privacy based on their use case, datasets, company makeup, etc. Each of them, however, requires a fundamental understanding of the healthcare data privacy landscape.

The growth of the real-world data industry has created and accelerated an ecosystem of organizations providing de-identification and privacy-related products & services. In general, there are several ways to operationalize de-identification (redaction, aggregation, hashing, etc.) — though the mechanics are not the focus of this article.

At a high level, this article intends to outline some privacy considerations for companies starting to review their health data strategy.

It’s worth noting that there are immense use cases for identified data (when patients have consented, under TPO, etc.), but this article focuses on use cases for de-identified health data.

Overview of the de-identified real-world data industry

The real-world data industry is a strong and growing field fueled by the exponential increase in health data collection and sharing.

De-identified real-world data can help to understand patient outcomes on particular therapies, improve clinical decision-making, optimize clinical trial decision-making, and countless other use cases. Hundreds of companies are building data products, analytical offerings, and services on top of these use cases.

These real-world data use cases rely on healthcare organizations that have de-identified data rights through their work with hospitals, payers, pharmacies, and other organizations that interface directly with the patient. However, some companies may be twice removed (i.e., a claims aggregator may work with a clearinghouse, which works with the hospital to process claims).

As the landscape & use cases of de-identified real-world data continue to evolve, the need for privacy-preserving technology & services must follow suit to protect patient privacy and enable compliant data exchange. Use cases for real-world data will continue to require more robust patient data, pushing the envelope for privacy technologies.

Under HIPAA, there are two ways to de-identify data: Safe Harbor and Expert Determination.

Under Safe Harbor, all 18 categories of personal identifiers must be removed (e.g., patient names, dates, record numbers). Expert Determination, on the other hand, requires an expert’s statistical analysis to certify that the risk of re-identifying the data is very small. Once that risk has been determined to be very small, the dataset is considered “de-identified.”

Safe Harbor requires several fields to be redacted, which often limits the utility of the underlying datasets. Researchers and analytics teams often use the Expert Determination method to retain certain critical fields. For example, suppose a researcher is looking to understand 30-day readmission rates by diagnosis for a certain patient population. Safe Harbor would require the removal of any dates more granular than the year — limiting the ability to run the analysis.
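To make the tradeoff concrete, here is a minimal sketch of Safe Harbor-style redaction. The record layout and field names are hypothetical, and a real implementation would need to cover all 18 identifier categories:

```python
# Illustrative sketch of Safe Harbor-style redaction. The field names are
# hypothetical; a real implementation must handle all 18 identifier types.
DIRECT_IDENTIFIERS = {"patient_name", "mrn", "ssn", "email", "phone"}

def redact_safe_harbor(record: dict) -> dict:
    """Drop direct identifiers and generalize dates to the year."""
    out = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue  # remove the identifier entirely
        if field.endswith("_date"):
            out[field] = str(value)[:4]  # keep only the year, e.g. "2022"
        else:
            out[field] = value
    return out

record = {"patient_name": "Jane Doe", "mrn": "889123",
          "admit_date": "2022-03-14", "diagnosis_code": "E11.9"}
print(redact_safe_harbor(record))
# {'admit_date': '2022', 'diagnosis_code': 'E11.9'}
```

Once admit_date is truncated to the year, the 30-day readmission analysis above becomes impossible, which is exactly the utility loss that pushes teams toward Expert Determination.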

Expert Determination may be considered a more favorable approach since it lays out the operations required to bring a dataset to an acceptable threshold of re-identification risk, whereas Safe Harbor unconditionally requires the removal of the 18 identifiers without contemplating the underlying risk of any particular dataset.

Two methods to achieve de-identification in accordance with the HIPAA Privacy Rule.

Expert Determination “as a Service”

Again, there are countless use cases for real-world data within the healthcare and life science space — you can read up on the trends here.

Several use cases require de-identified datasets to be shared with external stakeholders. Given the fragmentation of health data within the US, data & analytics vendors will typically need to collaborate across the industry to have a complete understanding of the patient.

Meeting the requirements under Expert Determination typically involves working with an external vendor with expertise in health data, although some larger organizations may have internal compliance experts.

These organizations perform statistical analysis to understand the risk of re-identification of the underlying data. Organizations like Privacy Hub by Datavant (I work in partnerships at Datavant) & Privacy Analytics by IQVIA are the leading commercial organizations in the space.

These organizations produce an in-depth report outlining the dataset’s underlying risk & the transformations required to mitigate that risk. The service provider, the data steward (the owner or a licensee doing a data linkage), or another third party performs the necessary alterations/remediations to the dataset. The risk can then be measured again, and if it falls below the predefined risk threshold, the dataset is determined “de-identified.”
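As a toy illustration of the measurement step, the sketch below uses equivalence-class sizes over quasi-identifiers as a crude proxy for re-identification risk. The k = 11 cutoff is a rule of thumb borrowed from the literature, not a HIPAA requirement, and real Expert Determination analyses model attacker knowledge and population uniqueness far more carefully:

```python
from collections import Counter

def reident_risk(records, quasi_identifiers, k=11):
    """Crude risk proxy: the share of records whose combination of
    quasi-identifiers is shared by fewer than k records overall."""
    def key(r):
        return tuple(r[q] for q in quasi_identifiers)
    classes = Counter(key(r) for r in records)
    risky = sum(1 for r in records if classes[key(r)] < k)
    return risky / len(records)

records = [
    {"zip3": "100", "birth_year": 1980, "sex": "F"},
    {"zip3": "100", "birth_year": 1980, "sex": "F"},
    {"zip3": "900", "birth_year": 1955, "sex": "M"},
]
# Every class here has fewer than 11 members, so the proxy flags 100% risk;
# in a real dataset, generalizing fields merges classes and shrinks that share.
print(reident_risk(records, ["zip3", "birth_year", "sex"]))  # 1.0
```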

Automating aspects of the Expert Determination process (e.g., creating automated data validation checks) will streamline & reduce the time it takes to perform the necessary analysis. Still, there is a requirement under HIPAA that an “expert” be involved in the certification process, so this process cannot be fully automated.
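A hypothetical example of such a check: scan outbound records for values a de-identified dataset generally should not carry, like full dates or 5-digit ZIP codes, and surface the flags for the expert to review:

```python
import re

# Illustrative automated pre-certification checks (not an exhaustive list).
FULL_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # dates finer than year
ZIP5 = re.compile(r"\b\d{5}\b")                   # full 5-digit ZIP codes

def validate(record: dict) -> list:
    """Flag fields whose values look riskier than the target output spec;
    a human expert still reviews the flags and signs off."""
    issues = []
    for field, value in record.items():
        if FULL_DATE.search(str(value)):
            issues.append((field, "full date present"))
        elif ZIP5.search(str(value)):
            issues.append((field, "5-digit ZIP present"))
    return issues

print(validate({"admit_date": "2022-03-14", "zip": "10011"}))
# [('admit_date', 'full date present'), ('zip', '5-digit ZIP present')]
```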

Since Expert Determination doesn’t provide strict guidelines (e.g., “very low” risk is not numerically defined in HIPAA), the industry has relied on thought leadership and academic literature to establish best practices in this space.

Thought leaders such as Dr. Daniel Barth-Jones, Dr. Colin Moffatt, Dr. Patrick Baier, Dr. Khaled El Emam, and Dr. Brad Malin offer consultative services, and have published academic papers that have helped define the standards of Expert Determination risk.

Dr. Brad Malin leads Privasense, a health data privacy consultancy, and Dr. Colin Moffatt, Dr. Daniel Barth-Jones, and Dr. Patrick Baier have joined the Privacy Hub team at Datavant. Dr. Khaled El Emam is the founder of Replica Analytics, focusing on synthetic data technologies.

Typically, the largest aggregators/analytics companies within the real-world data space leverage Expert Determination services or internal teams to de-identify their datasets for aggregation, analysis, or data sharing.

Synthetic Data technologies

Synthetic data is artificially generated data that replicates the statistical characteristics of the source data.

A compelling use case for synthetic data technology is when the dataset & research questions are too granular to meet the certifier’s risk threshold.

For example, suppose a researcher is looking to better understand any correlation between oncology patients’ survival rates and socioeconomic factors. Several public sources of mortality data, with varying degrees of accessibility, include the Social Security Administration’s Death Master File & public obituary records. According to HHS, PHI that can be linked to public or accessible datasets can introduce greater risk — since it can be exploited by anyone who receives the information. With that in mind, combining the patients’ clinical information with socioeconomic and mortality data could hypothetically introduce “too much” risk to the dataset.

Generating a synthetic version of the dataset could enable the researcher to analyze a dataset with the same attributes and statistical properties as the source but without the risk associated with the “real” data. Furthermore, it can offer advantages such as a decreased risk of over-sharing real data and speed to analysis (data synthesis is an automated process). There are considerations, however, as standalone synthetic data has not been widely accepted for use in regulatory submissions.
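For intuition, here is a deliberately simple generator: fit a mean and covariance to a toy numeric dataset, then sample new “patients” from the fitted model. Commercial tools use far richer models (sequential trees, deep generative networks), but the principle of releasing samples from a fitted distribution rather than real rows is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" data: age and a lab value with built-in correlation.
real = rng.multivariate_normal([60, 5.5], [[100, 8], [8, 1]], size=1000)

# Fit a simple parametric model (mean + covariance), then sample from it.
mu, cov = real.mean(axis=0), np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mu, cov, size=1000)

print(np.corrcoef(real, rowvar=False)[0, 1])       # ~0.8 in the source
print(np.corrcoef(synthetic, rowvar=False)[0, 1])  # roughly preserved
```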

Synthetic data organizations seem to focus on a couple of different business models. One is leveraging synthetic data technologies to de-identify datasets. Replica Analytics, now a part of Aetion, focuses on offering its product as a means for compliant data exchange & de-identification. Founded by Dr. Khaled El Emam, Replica has published an overview of synthetic data generation for implementing the Expert Determination method.

Other organizations in the space, such as Syntegra & MDClone, focus on leveraging synthetic data technology to provide data or analytics based on models trained on real-world data. Syntegra offers an API that gives users access to synthetic claims/EHR records for analytics. MDClone offers its ADAMS Platform to allow users to create their own synthetic datasets for analytics purposes.

This data could be used to train AI models, map patient journeys, conduct natural history studies, etc.

This may be my bias, given that Datavant focuses on connecting the world’s health data, but it’s worth noting the difficulty of finding a single dataset with a 360-degree view of the patient. The completeness of synthetic data hypothetically relies on the fidelity of the training dataset, and biases/missingness in that source data would likely carry through to the model.

It’s also important to understand the properties of the synthetic output in order to determine whether it can be used to answer the intended research question.

Working with Unstructured Data

Unstructured data typically refers to free-form text fields in EHRs / clinical notes, lab reports, images, etc. While countless use cases would benefit from the accessibility of unstructured data, it’s often challenging to de-identify the unstructured text. Since unstructured data implies the dataset is not in a typical delimited format, the statistical analysis needed for Expert Determination review is challenging to perform.

As a result, the data owner has several options for de-identifying the dataset. First, some companies like John Snow Labs or Mendel offer an NLP service to identify the 18 categories of attributes in the Safe Harbor method and redact them. Alternatively, NLP/AI could be used to abstract information from the free text to append to a structured dataset.
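As a rough sketch of the first option, a rule-based redactor might look like the following. Production services rely on trained NER models rather than regexes, and these patterns are illustrative and nowhere near exhaustive:

```python
import re

# Illustrative patterns for a few identifier types in free text.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d+", re.IGNORECASE),
}

def redact(note: str) -> str:
    """Replace each matched identifier with a category placeholder."""
    for label, pattern in PATTERNS.items():
        note = pattern.sub(f"[{label}]", note)
    return note

print(redact("Pt seen 03/14/2022, MRN: 889123, call 555-867-5309."))
# Pt seen [DATE], [MRN], call [PHONE].
```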

Additional privacy technologies

There are quite a few other privacy technologies that could play a role in privacy-preserving data sharing. Travis May, the co-founder of Datavant, has written a great article on the advantages and challenges of certain privacy-preserving technologies in healthcare — like Differential Privacy, Federated Learning, Multi-Party Computing, and Homomorphic Encryption. I’ll define a few of the other technologies below, but I’d defer to more holistic summaries & pros/cons — here.

Differential Privacy involves adding statistical noise to the data so that individual patients’ privacy is protected while patterns in the dataset remain.
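A toy example of that idea: release a cohort count with Laplace noise whose scale is calibrated to the query’s sensitivity, since adding or removing one patient changes a count by at most one. The epsilon value here is arbitrary, and production systems also track a cumulative privacy budget across queries:

```python
import numpy as np

def dp_count(true_count: int, epsilon: float, rng) -> float:
    """Release a count with Laplace noise calibrated to sensitivity 1."""
    return true_count + rng.laplace(scale=1.0 / epsilon)

rng = np.random.default_rng(42)
print(dp_count(1204, epsilon=0.5, rng=rng))  # a noisy cohort count
```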

Federated Learning involves a decentralized approach to model development that doesn’t require the aggregation of patient data into a central repository. Instead, the model generation can happen locally to the data source — like at the hospital itself. Companies like Owkin and Rhino Health are building products around federated learning within healthcare.
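A minimal sketch of one federated round, assuming each hospital has already trained a local copy of the same model: only the weights travel, never patient records, and the server combines them weighted by site size (the FedAvg approach):

```python
import numpy as np

def federated_average(site_weights, site_sizes):
    """Combine locally trained weights, weighted by each site's data size.
    Patient records never leave the sites; only the weights are shared."""
    total = sum(site_sizes)
    return sum(w * (n / total) for w, n in zip(site_weights, site_sizes))

# Pretend two hospitals trained the same model architecture locally.
hospital_a = np.array([0.9, -0.2])  # weights learned from 800 records
hospital_b = np.array([0.7, 0.1])   # weights learned from 200 records
print(federated_average([hospital_a, hospital_b], [800, 200]))  # [0.86 -0.14]
```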

Like Federated Learning, Multi-Party Computing allows for analysis across distributed datasets without each party needing to share its data with the others. Multi-Party Computing leverages multiple parties to answer a question jointly while keeping each input secret from the other parties. TripleBlind is building MPC offerings within the healthcare space.
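Here is a stripped-down illustration of one core MPC building block, additive secret sharing: each hospital splits its private count into random-looking shares, and only the recombined total is ever revealed. Real MPC protocols are far more involved, but the arithmetic below captures the idea:

```python
import secrets

P = 2**61 - 1  # a large prime modulus

def share(value: int, n_parties: int) -> list:
    """Split a value into n additive shares; any n-1 shares alone
    look like random numbers and reveal nothing about the value."""
    shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

# Three hospitals jointly compute a total patient count without
# revealing their individual counts to one another.
counts = [120, 340, 95]
all_shares = [share(c, 3) for c in counts]
# Party i sums the i-th share from every hospital, then the
# partial sums are combined to reveal only the total.
partials = [sum(s[i] for s in all_shares) % P for i in range(3)]
print(sum(partials) % P)  # 555, the joint total
```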

Commercial Considerations

Ensuring patient privacy is as much an ethical obligation as a legal one. There are hypothetical differences in risk, complexity, and feasibility between vendors. It would likely require privacy & statistical experts to fully evaluate the risks and benefits of each approach to decide which ones an organization should implement.

On top of privacy, logistical, and security considerations, commercial questions need to be answered regarding data sharing.

A few examples:

Example #1: a real-world data provider wants to share de-identified health data with its life science customers. That provider could leverage data synthesis (i.e., creating a synthetic version of the data) to enable rapid data sharing with potential partners. This approach may, however, limit the types of use cases it can enable, particularly where those use cases require linking the data with complementary datasets (e.g., linking lab data with closed claims).

Example #2: an oncology analytics vendor (let’s name that vendor… Onco-Analytics) needs the insights that come from the free text of an EMR. It may focus on advanced NLP/AI models to extract attributes from unstructured clinical data, build structured datasets, and de-identify them using the services/technologies mentioned above. Onco-Analytics could also consider leveraging synthetic data technology should the dataset become too granular to de-identify under traditional Expert Determination implementation methods.

Example #3: perhaps in a similar scenario to #2, Onco-Analytics needs data from inpatient hospitals & academic medical centers — and these organizations may be reluctant or unable to share data. Onco-Analytics may look to install Federated Learning / Multi-Party Computing technology at the site of the data. The data would never need to leave the four walls of the hospital, and the analysis could happen in a distributed way.

There are countless scenarios in which real-world data can be accessed and used, so the privacy technology must fit the organization’s respective commercial and operational considerations.
