How Life Sciences Companies Are Optimizing Analytics with Snowflake

The volume of data in healthcare is projected to grow a whopping 36% annually over the next five years — faster than in any other major industry. Healthcare and life sciences organizations can gain significant benefits by unlocking the value of their data: they can deliver quality care outcomes by building comprehensive patient views, facilitate deeper health insights with secure and seamless data sharing, drive innovation with large-scale genomic analytics, reap the benefits of the growing Internet of Medical Things (IoMT), and build the next generation of healthcare applications by embracing a cloud data platform. Snowflake’s innovative technology is enabling life sciences companies to unlock the value of their data.

I have had the opportunity to work with several customers to implement their roadmaps using Snowflake, especially in the areas of commercial data, managed markets, gross-to-net, RWE, genomics, and COVID-19 data.

Through this post, I would like to share my learnings with the architect community on how Snowflake features can best be utilized to address some of these use cases, along with best practices and key considerations. Think of a scenario where a person wins 10 million dollars in a lottery and still needs a guide on how to best spend or invest it. This post makes an earnest attempt to bridge that gap. 😃

Note: This post does not cover how to use Snowflake features, as the Snowflake documentation already covers that in depth.

Reference Architecture:

Data Exchange Use Cases:

  1. Data scientists would like to get better insights (ex: next best customer, channel, etc.) and need the ability to quickly integrate external data (ex: weather, COVID-19, RWE) with internal data (ex: sales, claims, etc.).
  2. How do I improve my time to market for product launches?
  3. How do I reduce the time it takes to ingest data from data aggregators (usually involving ETL, SFTP, QC, and reconciliation)?

Snowflake Solution: Data Exchange:

Snowflake has changed the dynamics of this operating model by providing Data Exchanges. Case in point: data analysts can import the COVID-19 data share within a matter of minutes and can therefore focus on the core task of delivering deeper insights. A data aggregator (ex: IQVIA) shares the data as a Snowflake share, and pharma IT consumes it within minutes, saving at least 25% of ingestion costs and making data available to end users more quickly.
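
To make this concrete, here is a minimal sketch of what consumption looks like on the consumer side; the account locator, share, role, schema, and table names below are placeholders:

```sql
-- Hypothetical example: create a database from an inbound share and query it.
-- The provider account locator and share name are placeholders.
CREATE DATABASE covid19_external FROM SHARE xy12345.covid19_share;

-- Grant read access to an analyst role (role name is illustrative).
GRANT IMPORTED PRIVILEGES ON DATABASE covid19_external TO ROLE analyst_role;

-- The shared data is immediately queryable; no ETL or file movement needed.
SELECT * FROM covid19_external.public.cases LIMIT 10;
```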

Solution Considerations:

1. Though Snowflake simplifies data exchange, it does not replace internal security approvals, so plan for them.

2. When you are sharing or consuming data, think about data monetization and who will pay for the reader account, especially when exchanging data with a different cloud provider or with a customer in a different region.

3. Pay attention to the security aspects of a Snowflake share. There are differences in the level of security you can offer on a regular database vs. a share (ex: you cannot grant access to individual objects within a share to specific consumers; every account with access to the share can access all of the shared objects).

4. Read the documentation for some of the understandable watch-outs (ex: only secure views should be shared so that logic is protected), and consider this while planning your data models. A sketch of a secure-view share follows.
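
As a rough sketch of the provider side, the pattern below shares only a secure view so the underlying logic and base tables stay protected; all database, view, share, and account names are illustrative:

```sql
-- Share only a secure view so the view definition and base tables stay hidden.
CREATE SECURE VIEW commercial.public.v_sales_summary AS
    SELECT region, product, SUM(net_sales) AS net_sales
    FROM commercial.public.sales
    GROUP BY region, product;

CREATE SHARE sales_share;
GRANT USAGE ON DATABASE commercial TO SHARE sales_share;
GRANT USAGE ON SCHEMA commercial.public TO SHARE sales_share;
GRANT SELECT ON VIEW commercial.public.v_sales_summary TO SHARE sales_share;

-- Add the consumer account (placeholder account locator).
ALTER SHARE sales_share ADD ACCOUNTS = ab67890;
```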

Virtual Warehouse Use cases: Snowflake Solution:

Snowflake’s innovative technology (virtual warehouses) solves this problem by separating compute from storage. Create different warehouses for different personas or departments, and customize the compute and scaling requirements of each based on its needs.
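
A minimal sketch of this pattern is shown below, assuming one warehouse per persona; the names, sizes, and cluster counts are illustrative choices, not recommendations (multi-cluster warehouses require Enterprise Edition or above):

```sql
-- Separate warehouses per persona/department so workloads never compete for compute.
CREATE WAREHOUSE IF NOT EXISTS bi_reporting_wh
    WAREHOUSE_SIZE    = 'MEDIUM'
    AUTO_SUSPEND      = 60       -- seconds of inactivity before suspending
    AUTO_RESUME       = TRUE
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 3;       -- multi-cluster scale-out for reporting concurrency

CREATE WAREHOUSE IF NOT EXISTS data_science_wh
    WAREHOUSE_SIZE    = 'XLARGE' -- larger size for heavy ad hoc workloads
    AUTO_SUSPEND      = 60
    AUTO_RESUME       = TRUE;
```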

Solution Considerations:

Clone Use cases:

The following are some of the use cases that require a lot of ETL and QC checks.

Snowflake Solution:

Snowflake’s built-in clone feature lets you create a point-in-time copy of a database, schema, or table without incurring any additional storage cost at the time of cloning. It happens almost instantly, hence the name Zero-Copy Clone. Some of these use cases can be solved by using Time Travel or clones.
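
For illustration, a clone is a single statement; the object names below are placeholders:

```sql
-- Clone a production database for a QC/UAT cycle. No data is physically copied;
-- storage is only consumed as the clone diverges from the source.
CREATE DATABASE commercial_qc CLONE commercial;

-- Individual schemas and tables can be cloned the same way.
CREATE TABLE sales_backup CLONE sales;
```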

Solution Considerations:

  1. A cloned table is a static, point-in-time copy. It will not be updated when the source table is updated.
  2. For incremental data, programmatically update the cloned database. It is recommended to prefix the table name with an as-of date for easier understanding (see the sketch after this list).
  3. Refrain, if possible, from executing DML transactions on the source object (or any of its children) until the cloning operation completes.
  4. Individual external named stages can be cloned; internal named stages cannot.
  5. When a database or schema that contains source tables and streams is cloned, any unconsumed records in the streams (in the clone) are inaccessible. This behavior is consistent with Time Travel for tables: when a table is cloned, its historical data begins at the point when the clone was created.
  6. When a database or schema that contains tasks is cloned, the tasks in the clone are suspended by default. They can be resumed individually using ALTER TASK ... RESUME.
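
Here is the as-of snapshot pattern from consideration 2, sketched with a placeholder table name and timestamp (it combines cloning with Time Travel):

```sql
-- Create an "as of" snapshot of a table, named with the snapshot date.
-- Table name and timestamp are placeholders.
CREATE TABLE sales_asof_20240101 CLONE sales
    AT (TIMESTAMP => '2024-01-01 00:00:00'::TIMESTAMP_LTZ);
```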

External Tables Use cases:

The amount of data that a typical commercial IT department stores is increasing rapidly. Architects usually get questions like the following from the data scientist community:

1. I have a lot of clickstream data related to multi-channel marketing. Is there a way I can view and understand the data (we are talking about terabytes of data)?

2. I am planning to change data aggregators (ex: from IQVIA to Symphony). Is there a way I can quickly analyze the new data?

3. I have a lot of data in the archive. What is the best way to make that data available to end users on occasions such as audits?

Snowflake Solution: Snowflake offers a lake-plus-warehouse architecture through external tables, which lets you query data directly from your AWS or Azure storage without actually loading it, and also combine that data with existing warehouse data.
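
A minimal sketch of the setup is shown below, assuming a storage integration already exists; the stage, integration, column, and file-format details are placeholders:

```sql
-- External stage over an S3 location (assumes my_s3_integration is pre-created).
CREATE STAGE clickstream_stage
    URL = 's3://my-bucket/clickstream/'
    STORAGE_INTEGRATION = my_s3_integration;

-- External table: data stays in S3, queried in place via virtual columns.
CREATE EXTERNAL TABLE clickstream_ext (
    event_ts TIMESTAMP AS (value:event_ts::TIMESTAMP),
    channel  STRING    AS (value:channel::STRING)
)
LOCATION = @clickstream_stage
FILE_FORMAT = (TYPE = PARQUET)
AUTO_REFRESH = FALSE;

-- After new files land, refresh the metadata (see consideration i below).
ALTER EXTERNAL TABLE clickstream_ext REFRESH;
```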

Solution Considerations:

i. After uploading a new file to the S3 location, we need to run ALTER EXTERNAL TABLE ... REFRESH to refresh the external table metadata with the new file (if the old file still exists alongside the new file, data from both files will be shown).

ii. Avoid using external tables if the data volume is huge and query performance matters; queries against external tables are slower than against native tables.

Time Travel Use cases: Snowflake Solution:

Time Travel is an innovative feature from Snowflake with which users can go back to a snapshot of data up to 90 days old (depending on edition and the configured retention period). The following are the high-level use cases that users benefit from (without any coding).
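
For illustration, a few typical Time Travel statements are sketched below; the table names, offsets, and timestamps are placeholders:

```sql
-- Query a table as it looked one hour ago.
SELECT * FROM claims AT (OFFSET => -3600);

-- Query a table as of a specific timestamp.
SELECT * FROM claims AT (TIMESTAMP => '2024-01-01 00:00:00'::TIMESTAMP_LTZ);

-- Recover an accidentally dropped table.
UNDROP TABLE claims;
```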

Solution Considerations:

Tasks, Streams, and Pipes Use cases:

  1. How can I provide self-service analytics to users?
  2. How can I automatically ingest streaming data without worrying about coding, reliability, scalability, and agility?
  3. I have many super users. Is there a way they can schedule their own jobs?
  4. How can I detect changes to data (CDC) without doing a lot of coding?

Snowflake Solution:

  1. Tasks allow end users to schedule several ETL functions (SQL statements, clones, and stored procedures), truly putting the power of scheduling into the hands of end users.
  2. Tasks can be combined with table streams for continuous ELT workflows to process recently changed table rows. Streams ensure exactly-once semantics for new or changed data in a table.
  3. Snowpipe is Snowflake’s continuous data ingestion service. Snowpipe loads data within minutes after files are added to a stage and submitted for ingestion.
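
Putting these together, the sketch below shows one common pattern: Snowpipe lands raw files, a stream tracks changes, and a task processes them. Every object name, schedule, and column here is a placeholder assumption:

```sql
-- Continuous ingestion: assumes @claims_stage is an external stage wired to
-- cloud event notifications for auto-ingest.
CREATE PIPE raw_claims_pipe AUTO_INGEST = TRUE AS
    COPY INTO raw_claims
    FROM @claims_stage
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Track changes on the raw table.
CREATE STREAM raw_claims_stream ON TABLE raw_claims;

-- Process new rows every five minutes, but only when the stream has data.
CREATE TASK merge_claims_task
    WAREHOUSE = etl_wh
    SCHEDULE  = '5 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('raw_claims_stream')
AS
    INSERT INTO curated_claims
    SELECT claim_id, member_id, amount
    FROM raw_claims_stream
    WHERE METADATA$ACTION = 'INSERT';

-- Tasks are created suspended; resume to start the schedule.
ALTER TASK merge_claims_task RESUME;
```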

Solution Considerations:

  1. There could be multiple reasons why tasks or Snowpipe may not perform as intended (SNS subscription issues, incorrect SQL, warehouse sizing, authentication issues, etc.). It is always recommended to put a process in place to capture errors, monitor runs, and ensure restartability. Snowflake offers extensive metadata for this (the TASK_HISTORY and COPY_HISTORY table functions are a couple of examples), as shown in the sketch after this list.
  2. There is a 60-minute default limit on a single run of a task. This limitation was implemented to safeguard against non-terminating tasks. Query the TASK_HISTORY table function: if a task was canceled or exceeded its scheduled window, the cause is often an undersized warehouse.
  3. Snowflake uses file-loading metadata to prevent reloading the same files (and duplicating data) in a table. Snowpipe prevents loading files with the same name even if they were later modified (i.e., have a different eTag).
  4. This load metadata expires after 14 days, so Snowpipe may reload a file with the same name (even with modified data) after that point, potentially resulting in duplicate records in the target table.
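
As a starting point for the monitoring process mentioned in consideration 1, the queries below lean on the TASK_HISTORY and COPY_HISTORY table functions; the task and table names are placeholders:

```sql
-- Recent task runs, including failures and cancellations.
SELECT name, state, error_message, scheduled_time, completed_time
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(TASK_NAME => 'MERGE_CLAIMS_TASK'))
ORDER BY scheduled_time DESC;

-- Files loaded (or skipped) by Snowpipe/COPY over the last 24 hours.
SELECT file_name, status, row_count, first_error_message
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
        TABLE_NAME => 'RAW_CLAIMS',
        START_TIME => DATEADD('hour', -24, CURRENT_TIMESTAMP())));
```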

Security Use cases: Snowflake Solution:

Snowflake provides industry-leading features to ensure the utmost security for your data and users. It holds PCI DSS, SOC 1 and SOC 2 Type 2, and HITRUST CSF compliance. Data is encrypted at rest and in transit.

Solution Considerations:

  1. Enable private connectivity (ex: AWS PrivateLink or Azure Private Link) so that traffic does not traverse the public Internet.
  2. Restrict access to your accounts to valid IP addresses only, using network policies (see the sketch after this list).
  3. Enable MFA and SSO.
  4. Snowflake has strong role-based access control, and the Snowflake role hierarchy allows us to customize security in a way that is unique to Snowflake.
  5. Some dynamic security requirements (ex: enabling TPA access) need coding to ensure that security roles and grants are updated dynamically.
  6. If your account is logically separated into Dev, SIT, and Prod, create environment-specific DBA and security admin roles so that the principles of multi-tenancy are achieved.
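
As an example of consideration 2, a network policy can be sketched as follows; the policy name and CIDR ranges are placeholders:

```sql
-- Restrict account access to approved corporate IP ranges (placeholder CIDRs).
CREATE NETWORK POLICY corp_only_policy
    ALLOWED_IP_LIST = ('203.0.113.0/24', '198.51.100.0/24');

-- Apply the policy at the account level (requires SECURITYADMIN or higher).
ALTER ACCOUNT SET NETWORK_POLICY = corp_only_policy;
```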

Others:

  1. Use temporary and transient tables where necessary to avoid storage costs (they carry no Fail-safe period).
  2. Control costs and avoid unexpected usage by setting up resource monitors (see the sketch after this list).
  3. Create functions to answer frequently asked customer questions and to invoke programs written outside of Snowflake.
  4. Conduct periodic health checks (Snowflake offers many metadata queries) to ensure that warehouses are correctly sized and no queries are getting queued, and look at options to change clustering keys and remove unused tables and users.
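
For point 2, a resource monitor sketch might look like the following; the quota, thresholds, and warehouse name are assumptions to adapt to your own usage:

```sql
-- Notify at 80% of the monthly credit quota and suspend warehouses at 100%.
CREATE RESOURCE MONITOR monthly_etl_monitor
    WITH CREDIT_QUOTA = 500
    FREQUENCY = MONTHLY
    START_TIMESTAMP = IMMEDIATELY
    TRIGGERS ON 80 PERCENT DO NOTIFY
             ON 100 PERCENT DO SUSPEND;

-- Attach the monitor to a warehouse (placeholder name).
ALTER WAREHOUSE etl_wh SET RESOURCE_MONITOR = monthly_etl_monitor;
```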

Originally published at https://www.linkedin.com.
