Securing Life Sciences Data with Snowflake and Skyflow Vaults

Published in

Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

10 min read · Apr 16, 2024

By Sean Falconer from Skyflow and Harini Gopalakrishnan in collaboration with Emma Gentry and Lisa Arbogast

Note: Many solutions fall under the realm of data protection. This blog describes one such approach using Skyflow, a Snowflake partner and data privacy vault provider, with examples of how it can be applied in situations that require data residency because of geographical restrictions.

Context

In the rapidly evolving technological landscape of health and life sciences, with its focus on AI/ML and LLMs, managing vast amounts of identifiable information is extremely important. Pharmaceutical companies generate copious amounts of identifiable data through activities such as clinical trials and patient engagement, and face stringent guardrails concerning the “use and access” of this data. Analysis of data captured as part of a routine study, backed by consent and a well-described intent-of-use statement, is called primary analysis. Once the data has been anonymized or stripped of identifiable information, it can be used for other inferences to drive insights; we call this secondary analysis. Navigating this data landscape for both primary and secondary analysis has remained complex in spite of technological advancements, for two main reasons:

Restriction of data access based on geo-location to comply with data residency requirements, including stringent privacy regulations that mandate specific practices for governance, secure storage, and usage
Manual effort involved in anonymizing data before it can be analyzed, which becomes a productivity bottleneck

These factors hinder progress in the industry by preventing the unlocking of insights hidden in the institutional data collected over the years.

This blog post introduces a solution combining the Skyflow Data Privacy Vault with Snowflake to address two major challenges in the health and life sciences sector: complying with data residency laws and streamlining the data de-identification process for analysis.

What Is Skyflow and How Does It Help?

Skyflow is a data privacy vault that isolates, protects, and governs access to sensitive customer data (e.g., PII, PHI, and PCI data). This data can be structured or unstructured. Sensitive data is isolated and protected by the vault, while non-exploitable tokens that serve as references to the sensitive data can be stored in Snowflake and other downstream systems. A token is an obfuscated string that represents other, more sensitive data. You can think of tokens as “stand-ins” or “pointers” for the actual, plaintext sensitive data, such as a social security number or a passport number, which is stored in the vault. Much like an identity provider offers authorization and authentication services across your stack, Skyflow Data Privacy Vault offers PII security, management, and use across your entire stack.
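To make the “stand-in” idea concrete, here is a minimal sketch of a token vault in Python. It is purely illustrative: `ToyVault`, `tokenize`, and `detokenize` are hypothetical names invented for this example and are not the Skyflow API, and a real vault would add encryption, persistence, and access control.

```python
import secrets


class ToyVault:
    """In-memory stand-in for a privacy vault (illustrative only, not the Skyflow API)."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so the same value always maps to the same token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        # The token is random, so it carries no information about the value.
        token = "tok_" + secrets.token_hex(8)
        self._value_to_token[value] = token
        self._token_to_value[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # Only the vault holds the mapping back to the original value.
        return self._token_to_value[token]


vault = ToyVault()
ssn_token = vault.tokenize("123-45-6789")  # made-up sample value
```

Downstream systems would store only `ssn_token`; the plaintext value never leaves the vault unless `detokenize` is called by something allowed to do so.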

Because of HIPAA and PCI DSS, collecting, managing, and processing patient and payment data is challenging: organizations must handle this data with care across their entire tech stack, including their CRM, marketing automation tools, backend, and all data systems. Each independent system or service may provide some built-in support for compliance and data governance. Skyflow helps by providing a HIPAA and PCI DSS compliant shared service that transforms sensitive customer data into de-identified data. The de-identified data can be safely shared across all systems. The compliance scope is greatly reduced across the stack and analytical operations continue to work.

*Skyflow Data Privacy Vault Managing PII Across the Stack*

If data needs to be re-identified, the service that needs the data exchanges the de-identified data with the Skyflow service for a representation of the original data. Skyflow tightly controls access through a zero trust model where no user account or process has access to data unless it’s granted by explicit access control policies. The policies are built from the bottom up, granting access to specific columns and rows of PII. This allows the customer to control who sees what, when, where, for how long, and in what format (see image below).

*Showing Different Views of Sensitive Data Based on Role*
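The policy-driven views described above can be sketched as a lookup from role and column to a redaction level. This is an assumption-laden illustration: the roles, the `POLICIES` table, and the `view` and `mask` helpers are all hypothetical and do not reflect how Skyflow actually defines or evaluates policies.

```python
from enum import Enum


class Redaction(Enum):
    PLAIN_TEXT = "plain_text"
    MASKED = "masked"
    REDACTED = "redacted"


# Hypothetical policy table: (role, column) -> allowed format.
# Anything not listed here is denied.
POLICIES = {
    ("support_agent", "email"): Redaction.MASKED,
    ("physician", "email"): Redaction.PLAIN_TEXT,
    ("physician", "ssn"): Redaction.MASKED,
}


def mask(value: str, visible: int = 4) -> str:
    """Keep the last `visible` characters and replace the rest with '*'."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def view(role: str, column: str, value: str) -> str:
    # Zero-trust default: with no explicit policy, nothing is revealed.
    level = POLICIES.get((role, column), Redaction.REDACTED)
    if level is Redaction.PLAIN_TEXT:
        return value
    if level is Redaction.MASKED:
        return mask(value)
    return "[REDACTED]"
```

The important design choice is the default: a role with no matching policy falls through to `REDACTED`, which mirrors the bottom-up, explicit-grant model described in the text.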

In the following sections, we cover two real-world use cases that affect health and life science customers, data residency and de-identification for device analytics, and look at how Snowflake with Skyflow can be used to address these challenges.

Scenario 1: Data Residency and Global Analytics

Operating globally poses challenges for organizations, particularly when conducting global analytics. Data residency restrictions require data to remain within specific regions to comply with regulations, which hinders the ability to run queries at a global level.

Skyflow provides a way to store regulated data in vaults specific to each region. The vaults transform the regulated data into de-identified data, no longer subject to regulation, that can be stored and processed from one centralized Snowflake account as well as used by other customer applications. This approach supports worldwide analytics and compliance, enabling companies’ analytics teams to work from anywhere. Snowflake users can work directly within Snowflake, handling data without accessing the regulated information, even when the data originates in regions where Snowflake is unavailable or has limited support, such as China and South Africa.

Example Customer Use Case

Consider a CPAP machine manufacturer who collects sleep data globally, using IoT devices to gather personal health information. Due to local laws, the data must be stored and processed in the same region where the customer is located, which hampers the company’s ability to fully take advantage of Snowflake to analyze data on a global scale.

Skyflow helps by allowing the company to deploy region-specific vaults to keep regulated data within the region, de-identify it, and load their Snowflake account with de-identified data. The vaults’ shared service transforms sensitive customer data in a way that satisfies compliance regulations, doesn’t break analytical operations, and supports fully encrypted operations through Skyflow’s proprietary polymorphic encryption.

In the image below, data is being collected in both Germany and China, while Snowflake is running within the AWS us-west-1 region. The identifiable information such as name and email has been de-identified. The vault-generated tokens representing fields like name and email are consistently generated so that analytical operations like joins, group bys, and counts still perform as expected.

*Global Analytics with Snowflake and Skyflow*
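To see why consistently generated tokens keep joins, group bys, and counts working, consider the sketch below. The article does not describe how Skyflow generates its tokens, so this example swaps in a keyed HMAC as one common way to get deterministic tokens; `SECRET_KEY`, `deterministic_token`, and the sample rows are all assumptions made for illustration.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical key; in practice it would stay inside the vault, never next to the tokens.
SECRET_KEY = b"demo-key-kept-inside-the-vault"


def deterministic_token(value: str) -> str:
    """Same input always yields the same token, so equality-based operations still work."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]


# Two tables that only ever hold tokens, as they might after loading into a warehouse.
patients = [
    {"email": deterministic_token("ana@example.com"), "country": "DE"},
    {"email": deterministic_token("li@example.com"), "country": "CN"},
]
readings = [
    {"email": deterministic_token("ana@example.com"), "usage_hours": 7.5},
    {"email": deterministic_token("ana@example.com"), "usage_hours": 6.0},
    {"email": deterministic_token("li@example.com"), "usage_hours": 8.0},
]

# GROUP BY / COUNT on the token column.
readings_per_patient = Counter(r["email"] for r in readings)

# JOIN patients to readings on the token column.
country_by_token = {p["email"]: p["country"] for p in patients}
joined = [(country_by_token[r["email"]], r["usage_hours"]) for r in readings]
```

Because equal inputs produce equal tokens, the count and the join give the same answers they would on plaintext, while neither table ever contains an email address.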

Access to re-identified data depends on the country where the query is executed. For example, consider the query below, which selects all data from the patients table, whose rows contain JSON objects representing patients and captured CPAP data. A UDF named skyflow_reidentify is used to transform the vault-generated tokens into redacted, masked, or plaintext values depending on the permissions associated with the person executing the query.

-- Flatten the JSON in the var column into one row per element.
SELECT
    -- Tokenized fields: re-identified according to the caller's permissions.
    skyflow_reidentify(Y.value:"name"::VARCHAR) AS name,
    skyflow_reidentify(Y.value:"email_address"::VARCHAR) AS email_address,
    skyflow_reidentify(Y.value:"country"::VARCHAR) AS country,
    skyflow_reidentify(Y.value:"date_of_birth"::VARCHAR) AS date_of_birth,
    -- Non-sensitive device readings: returned as stored.
    Y.value:"date"::VARCHAR AS date,
    Y.value:"usage_hours"::VARCHAR AS usage_hours,
    Y.value:"cmH20"::VARCHAR AS cmH20,
    Y.value:"leak_rate"::VARCHAR AS leak_rate,
    Y.value:"ahi"::VARCHAR AS ahi,
    Y.value:"events_per_hour"::VARCHAR AS events_per_hour,
    Y.value:"oxygen_saturation"::VARCHAR AS oxygen_saturation,
    Y.value:"mask_fit"::VARCHAR AS mask_fit
FROM patients, LATERAL FLATTEN(input => var) Y;

An analyst in China might see the following representation of the records.

*Example of Results when Query Executed from Within China*

Alternatively, an analyst in Germany sees the following.

*Example of Results when Query Executed from Within Germany*

And finally, someone in the US sees all identifiable data as non-exploitable vault-generated tokens.

*Example of Results when Query Executed from Within United States*

All data and operations can be localized, while global queries and analytics are still possible using the de-identified data.
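One way to picture region-dependent re-identification is a function that resolves a token only when the caller's region owns it. The sketch below is an assumption about the behavior, since the result images are not reproduced here: `REGION_VAULTS`, `reidentify`, and the sample tokens and names are hypothetical, and the real skyflow_reidentify UDF may also return masked or redacted values.

```python
# Hypothetical per-region vaults: token -> original value. Each lives in its own region.
REGION_VAULTS = {
    "DE": {"tok_de_01": "Anna Schmidt"},
    "CN": {"tok_cn_01": "Li Wei"},
}


def reidentify(token: str, caller_region: str) -> str:
    """Return the original value only if the caller's region owns the token."""
    vault = REGION_VAULTS.get(caller_region, {})
    # Tokens from other regions (or callers with no vault) pass through unchanged.
    return vault.get(token, token)


rows = ["tok_de_01", "tok_cn_01"]
seen_from_germany = [reidentify(t, "DE") for t in rows]
seen_from_china = [reidentify(t, "CN") for t in rows]
seen_from_us = [reidentify(t, "US") for t in rows]
```

Under this assumed policy, each regional analyst sees plaintext only for records from their own region, and a caller in the US sees tokens for every row.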

Scenario 2: Personalized De-Identification and Re-identification for Analytics

As mentioned earlier, life sciences customers often gather personal information with consent through various methods; this is classified as primary data collection. One important category of such data is patient engagement data.

Patient engagement can be defined as the interactions between healthcare companies and patients in the hopes of providing more positive experiences that result in higher standards of care for patients with improved satisfaction and outcomes.

This article offers a clear overview of how patient engagement data is important for pharmaceuticals in unlocking new insights with respect to care management and measuring treatment effectiveness. These data points can be collected through websites or mobile apps, with sufficient guardrails to ensure that appropriate consent has been obtained for data collection and that the details are securely stored. The goal of collecting data is to gather information on specific health conditions, monitor health indicators, and send medication reminders, such as insulin alerts. With the rise of digital tech, patient engagement has become even more pivotal. Anonymizing this data and making it available in a centralized data lake can unlock other use cases for pharmaceuticals beyond its intended collection need. For example, the engagement information, in conjunction with other sources like EHR or claims, can help identify when patients stop following a prescribed therapy or medication, or show the effectiveness of a line of therapy with respect to the standard of care in the market.

Skyflow helps by anonymizing the data, regardless of source, while still allowing it to be useful for data analysis. Life science customers can use Skyflow to transform regulated data into non-sensitive de-identified data to help eliminate data duplication across Snowflake and other upstream services. Instead of duplicating PII, Snowflake and all other systems that touch PII keep a de-identified vault-generated token that acts as a stand-in for the original value. The vault ensures the data is stored securely and access is controlled through fine-grained access policies, adhering to privacy and security standards.

Example Customer Use Case

Let’s consider a hypothetical scenario where a pharmaceutical company is conducting research to understand the impact of a new medication on the physical activity levels and health signals of patients with a specific medical condition. To gather relevant data, the company decides to collect information from wearable devices such as fitness trackers or smartwatches worn by study participants, collecting data such as sleep patterns, heart rate, and steps throughout the day.

Patients using the new medication, along with established drugs, share their experiences via a web application. The pharmaceutical company aims to combine this feedback with tracker data to identify which drug receives the most positive reviews and reactions from participants.

Here’s how data anonymization could be crucial in this context:

Study participants wear wearable devices that track various metrics such as steps taken, heart rate, and sleep patterns. These devices continuously collect data throughout the day.
Before the collected wearable data and web application data can be used for analysis, it needs to undergo anonymization. This involves removing any PII associated with the participants, such as names, addresses, or contact details.
Any unique identifiers present in the wearable data, such as device serial numbers or user IDs, are replaced with randomly generated codes or pseudonyms. This ensures that individual participants cannot be identified based on the data alone.
In order to further anonymize the data, individual data points may be aggregated or grouped together. For example, instead of recording the exact number of steps taken by each participant at specific times, the data might be summarized as daily or weekly averages for each participant.
Throughout the anonymization process, strict measures are implemented to safeguard the security and integrity of the data. This includes encryption of data during transmission and storage, as well as access controls to limit who can view or manipulate the data.
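The first steps above (removing PII, replacing identifiers with pseudonyms, and aggregating) can be sketched as a small pipeline. This is a simplified illustration under stated assumptions: the field names, the `PII_FIELDS` set, and the `anonymize` function are hypothetical, and encryption and access control are left out for brevity.

```python
import secrets
from collections import defaultdict
from statistics import mean

PII_FIELDS = {"name", "address", "email"}  # assumed list of direct identifiers


def anonymize(records):
    """Drop PII, pseudonymize device IDs, and average steps per participant per day."""
    pseudonyms = {}                    # device_id -> random code; stays out of the output
    steps_by_day = defaultdict(list)   # (pseudonym, date) -> step counts

    for rec in records:
        # Step 1: remove direct identifiers.
        clean = {k: v for k, v in rec.items() if k not in PII_FIELDS}
        # Step 2: replace the device serial number with a random pseudonym.
        device_id = clean.pop("device_id")
        if device_id not in pseudonyms:
            pseudonyms[device_id] = "p_" + secrets.token_hex(4)
        steps_by_day[(pseudonyms[device_id], clean["date"])].append(clean["steps"])

    # Step 3: aggregate individual readings into daily averages.
    return [
        {"participant": pid, "date": day, "avg_steps": mean(values)}
        for (pid, day), values in steps_by_day.items()
    ]


# Made-up sample readings for a single participant on one day.
raw = [
    {"name": "A. Patient", "email": "a@example.com", "device_id": "SN-001",
     "date": "2024-04-01", "steps": 4000},
    {"name": "A. Patient", "email": "a@example.com", "device_id": "SN-001",
     "date": "2024-04-01", "steps": 6000},
]
result = anonymize(raw)
```

The output contains only a pseudonym, a date, and an average, so neither the participant's name nor the device serial number reaches the analysis environment.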

Once the sensitive data has been anonymized, it can be safely used for analysis by researchers. They can explore correlations between medication usage and changes in physical activity levels, helping to assess the effectiveness of the medication and identify any potential side effects.

*Anonymizing Data to Snowflake from Various Sources with Analysis Against Anonymized Data*

For example, in the image above, data is anonymized from both the wearable and web application. De-identified data, aggregated data, and non-sensitive data are stored within Snowflake. Researchers are able to carry out analysis, such as determining the performance of the drug from a mixture of results like physical reactions and qualitative data collected from participants through the website, without access to any PII.

By anonymizing data through Skyflow’s de-identification services, life science companies can analyze treatment patterns, medication adherence, and derive other valuable insights without accessing personal information directly. This approach ensures compliance with data protection regulations and ethical standards, fostering trust and confidence in the research process.

Key Takeaways

The combination of technological innovation and regulatory complexities underscores the critical need for robust solutions that ensure compliance, data security, and seamless data access. In summary, here are the key takeaways from this article.

Data Management in Pharma: In the health and life sciences sectors, especially within pharmaceuticals, managing vast amounts of patient data — ranging from clinical trials to patient engagement activities — is crucial. This data, collected with patient consent, needs to be handled with care to ensure privacy and compliance with strict data regulations.
Navigating Data Privacy Challenges: Two major challenges in data privacy include adhering to data residency requirements across different regions and the manual effort required in data anonymization. These hurdles can impede the industry’s progress by limiting the analysis and insights that can be derived from collected data.
The Role of Anonymization: Anonymization of patient data is essential for secondary analysis, which can unlock valuable insights for the industry. However, the process, especially for legacy data, is often time-consuming and expensive, leading many companies to underutilize valuable data assets.
Skyflow’s Solution: Skyflow offers a Data Privacy Vault, providing a solution for secure data anonymization and management. This tool helps in complying with data residency laws and simplifies data de-identification, enabling more efficient and compliant data analysis.
Innovation and Progress: The collaborative approach of Skyflow and Snowflake not only streamlines data governance but also fosters innovation and drives insights in the health and life sciences sector.

Conclusion

There are many ways to address privacy preservation and data residency, and this blog outlines one such approach with a solution by Skyflow. The adoption of tools like Skyflow, which assist in navigating the complex landscape of data privacy and anonymization, is useful for unlocking the full potential of data in health and life sciences. This approach not only ensures compliance with stringent privacy standards but also paves the way for valuable insights that can advance patient care and treatment outcomes.

Written by Harini Gopalakrishnan