How Pfizer Achieved Self-Service Data Mesh with Snowflake and Azure

Introduction

By: Adam Neel, Sales Engineering Manager @ Snowflake

In recent years, data mesh has been a hot topic in the data world. While many organizations aspire to enable true self-service, it often remains an aspiration rather than a reality. In this article, we will hear from key contributors who have achieved an enterprise-grade self-service data mesh on Snowflake and Azure.

The foundations of data mesh are predicated on four main principles:

  1. Domain-oriented ownership
  2. Data as a product
  3. Federated computational governance
  4. Self-service data platform

Sounds easy enough, right? However, the first step in this journey requires defining these principles in the context of your tech stack. For example, the idea of data as a product, and the definition of Data Products, can be ambiguous and open-ended. In addition, these definitions can vary greatly between enterprises.

As the Data Strategy, Science, and Solutions (DSS&S) team at Pfizer grew from a small but mighty data team to a force of 80+, these were the concepts that needed to be defined in order to scale the business. Let’s dive into their story, and how they approached the data mesh journey.

Project Goals

It is important to first understand the program’s goals that led DSS&S to implement a data mesh architecture:

  • Consistent patterns and standards
  • Automated data integration
  • Core data management
  • Integrated file and data quality checks
  • Granular security
  • Improvements to delivery speed and sustainability

With this charter in mind, DSS&S identified six core changes which would be required to scale the strategy:

  • Separating Snowflake Dev/Test/Prod accounts — Previously, DSS&S managed its environments logically and physically through RBAC and database object separation within a single tenant. In the new world, there would be a separate Snowflake account per environment.
  • Reorganizing the Snowflake topology — They needed to shift to granular separation of Data Products and decided to isolate Data Products at the database level.
  • Creating domain-dedicated compute nodes — This required automated deployment of domain-specific warehouses (Snowflake compute nodes) as a component of the data product.
  • Revamping Role-Based Access Control — They implemented data-product-specific roles at the database/schema level and leveraged future grants to automate provisioning around views (see the sketch after this list).
  • Increasing the integration with Azure — Many of the automations would leverage services hosted in the organization’s Azure tenant. As such, there needed to be tighter alignment with Azure subscriptions, for example for ingesting data via Azure Data Factory.
  • Building CICD automations — DSS&S leveraged GitHub Actions and Terraform to automate the deployment of their data product templates.
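
To make the role-and-future-grant pattern concrete, here is a minimal Snowflake SQL sketch for a hypothetical data product called DP with a PUBLISHED schema; the names are illustrative assumptions, not DSS&S's actual conventions.

```sql
-- Hypothetical data product "DP": consumer role scoped to the database/schema,
-- with future grants so newly deployed views are covered automatically.
CREATE ROLE IF NOT EXISTS DP_PROD_PUBLISHED;

GRANT USAGE ON DATABASE DP TO ROLE DP_PROD_PUBLISHED;
GRANT USAGE ON SCHEMA DP.PUBLISHED TO ROLE DP_PROD_PUBLISHED;
GRANT SELECT ON ALL VIEWS IN SCHEMA DP.PUBLISHED TO ROLE DP_PROD_PUBLISHED;
GRANT SELECT ON FUTURE VIEWS IN SCHEMA DP.PUBLISHED TO ROLE DP_PROD_PUBLISHED;
```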

So, this brings us back to the earlier question: what defines a data product? For DSS&S, the definition is this:

The data product would be the new unit of topology in their environment, consisting of: ingest pipelines (ADF) into Snowflake, a dedicated database per data product, a standard set of schemas within it, RBAC schemes for authorization, the underlying database objects (tables, views, functions, etc.), and dedicated Snowflake compute resources to avoid resource contention and enable cost governance at the product level.

Or, in summary:

Let’s zoom in on what these Data Products look like in the Snowflake world. At its core, the data product is automated ingest into a landing schema (Raw), from which data is processed through a few schemas over its lifecycle: PostDQ (data quality rules applied, if relevant), Curated (the first layer where business rules are applied in the domain’s context), and lastly Published (the consumption layer for end-user access).
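
As a rough sketch, that schema lifecycle could be laid out like this in Snowflake SQL (database and schema names assumed from the description above):

```sql
-- Hypothetical data product database with the standard schema lifecycle.
CREATE DATABASE IF NOT EXISTS DP;
CREATE SCHEMA IF NOT EXISTS DP.RAW;        -- landing schema for automated ingest
CREATE SCHEMA IF NOT EXISTS DP.POSTDQ;     -- data quality rules applied, if relevant
CREATE SCHEMA IF NOT EXISTS DP.CURATED;    -- business rules applied in the domain's context
CREATE SCHEMA IF NOT EXISTS DP.PUBLISHED;  -- consumption layer for end-user access
```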

Data Product in Snowflake

I’d be remiss if I didn’t mention the role of Snowflake’s Data Collaboration features in this story. Consider an example where a data product is re-using a subset of dimensions contained within another data product. Should this new data product re-ingest and process that dimension within its own scope? That would lead to data redundancy and duplicated processing, which we always want to avoid.

There’s the old saying: “A person who wears two watches never knows what time it is”. I often remind clients of this adage, as it is just as relevant to building sustainable data architecture. Instead of managing the lifecycle of this data set in many different Data Products, we should instead leverage Snowflake data sharing to provide secure, virtualized access to this data, without redundant storage.
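
A minimal sketch of that pattern, assuming a hypothetical SALES product sharing a customer dimension to another account (the share, database, and account names are placeholders):

```sql
-- Provider account: expose the dimension through a secure share; no data is copied.
CREATE SHARE IF NOT EXISTS DIM_CUSTOMER_SHARE;
GRANT USAGE ON DATABASE SALES TO SHARE DIM_CUSTOMER_SHARE;
GRANT USAGE ON SCHEMA SALES.PUBLISHED TO SHARE DIM_CUSTOMER_SHARE;
GRANT SELECT ON VIEW SALES.PUBLISHED.DIM_CUSTOMER TO SHARE DIM_CUSTOMER_SHARE;
ALTER SHARE DIM_CUSTOMER_SHARE ADD ACCOUNTS = <consumer_account>;

-- Consumer account: mount the share as a read-only database for the new data product.
CREATE DATABASE SALES_SHARED FROM SHARE <provider_account>.DIM_CUSTOMER_SHARE;
```

The consuming data product reads the dimension in place, so its lifecycle stays owned by the producing domain.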

Data product leveraging Snowflake Data Sharing

Alright, enough of the high level conceptual talk. Now that we have set the stage, let’s hear from the key contributors on the DSS&S team to deep dive into the underlying technologies that power the solution.

Automating CICD via Azure, Git, and Terraform

By: Christopher Witcher, Principal Platform Engineer @ Pfizer

Introduction/Stage Setting

Early on in the development of the self-service data mesh, two questions were asked. The first: what does it mean to be self-service? The second: how far can we move the needle of self-service to give Data Engineers the autonomy to develop and build their own Data Products? Our organization already leveraged Snowflake, Azure, Terraform, GitHub, and GitHub Actions for various use cases. However, expecting our Data Engineers to simply use these tools and platforms together, with little to no experience across all of them, was not a workable solution.

The answer became clear to us early on: we needed to deliver a solution that could hand Data Engineers a Data Product they could develop on within minutes. Anyone who has had to create an Azure Data Factory via the Azure Portal will tell you it takes a few minutes to provision, then more time to set up your repo, and then you still need to set up your Snowflake database with roles and grants. That is already a lot when, as a Data Engineer, you want to work with data and create pipelines, not grind through the many laborious setup tasks.

Architecture

GitHub Repository Structure

Terraform Modules

  • Snowflake Module
  • Data Product Module

<Data Product Name>-iaas GitHub repository

  • Deployed via Terraform Cloud

Shared Snowflake Code Repository

  • Referenced as a submodule from all code repositories

<Data Product Name>-code GitHub repository

  • ADF folder deploys code to Azure Data Factory
  • Snowflake folder leverages SchemaChange to update Snowflake objects

Defining the infrastructure of a Data Product

When we give our Data Engineers a Data Product, they don’t get just a database or just an Azure Data Factory. We wanted to give them a pre-assembled product they could start using immediately. The *-iaas repo deploys the Snowflake infrastructure, the Data Factory, a service account for Snowflake, and an Azure Key Vault. We also wire up the Data Factory to the Snowflake database. The database comes pre-loaded with a “Hello Database” pipeline, which works out of the box after the first deployment in the *-code repository. This gives our Data Engineers a working example to get them started.

Git Branching vs. Data Product Environment

One of the requirements we had to adhere to was company policy requiring segregated development, staging, and production environments. Both the *-iaas and *-code repos follow the same structure. GitHub has three branches for Continuous Integration/Continuous Deployment (CICD): “development”, “staging”, and “main”. The Azure environment is separated into different subscriptions: “dev”, “stg”, and “prd”. Finally, Snowflake is segregated with a separate account for each environment: “DEV”, “TEST”, and “PROD”.

Integration between services details

To understand the integration of Azure and Snowflake, we must understand how Terraform handles this crucial step. At the center of all our automation are GitHub and GitHub Actions, so let us look at the structure and automation of both the *-iaas and *-code repositories.

IaaS Repo Structure

The infrastructure is provisioned using Terraform, an essential tool for any DevOps engineer working in the cloud. We are going to start by looking at the Data Product Module and the Snowflake Module. These are created within our organization’s private Terraform registry and referenced in the main.tf of the *-iaas repository.

Data Product Module
In this example, we provide our users with ten lines of code which are customized for their Data Product. Each data product provisions and configures a Key Vault and an Azure Data Factory. It also has the bonus of provisioning the Snowflake database (described below) associated with the Data Product.

Snowflake Module
As stated above, the Snowflake module is called to provision a Snowflake database for each Data Product, and it configures everything a Data Product owner needs to get started. It provisions a service account for Azure Data Factory to use and stores its password in the Key Vault as a secret. It then creates a warehouse for the Data Product, a default set of schemas for engineers to leverage, and any roles and grants required to enforce least privilege on the database.
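
For illustration, the rough SQL equivalent of what the module provisions might look like the following. The actual module, naming, and defaults are internal to Pfizer, so treat this strictly as a sketch under assumed names.

```sql
-- Sketch of the objects provisioned per Data Product (all names assumed).
CREATE DATABASE IF NOT EXISTS DP;
CREATE WAREHOUSE IF NOT EXISTS DP_WH
  WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;

CREATE SCHEMA IF NOT EXISTS DP.RAW;
CREATE SCHEMA IF NOT EXISTS DP.CURATED;
CREATE SCHEMA IF NOT EXISTS DP.PUBLISHED;

-- Service account for Azure Data Factory; the generated password is stored in Key Vault.
CREATE USER IF NOT EXISTS SVC_DP_ADF
  PASSWORD = '<generated-and-stored-in-key-vault>'
  DEFAULT_ROLE = DP_WRITE
  DEFAULT_WAREHOUSE = DP_WH;

-- Least-privilege roles and grants for the Data Product.
CREATE ROLE IF NOT EXISTS DP_READ;
CREATE ROLE IF NOT EXISTS DP_WRITE;
GRANT USAGE ON DATABASE DP TO ROLE DP_READ;
GRANT USAGE ON WAREHOUSE DP_WH TO ROLE DP_WRITE;
GRANT ROLE DP_WRITE TO USER SVC_DP_ADF;
```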

Code Repo Structure

Image: Interactivity flow between Terraform, Snowflake, and GitHub to achieve automation.

Lessons learned — challenges/solutions

Data Mesh and self-service are complicated concepts, and there are different ways to move from concept to architecture and implementation. We followed three main principles during implementation: first, treat Data Product development like application development; second, leverage GitHub Actions not only for CICD but for any automation we may require, up to and including using our own build server; third, treat all code as reusable where possible to become more efficient in development. This empowered our Data Engineers to begin developing their Data Products within 15 minutes and with just a few clicks.

Our biggest challenge was leveraging command-line tools in GitHub Actions. Although Terraform has actions available for deployment, it was challenging to test Terraform deployments prior to checking code into the git repository, so deployment cycles during development were slow. By running SchemaChange from our own build server we were able to overcome this barrier: we installed both SchemaChange and Python on the build server, which allowed for faster SchemaChange deployments instead of the repeated effort of installing them on every run.

Automating RBAC, Provisioning with SCIM and Terraform, and CICD with SchemaChange

By: Jennifer Yoakum, Senior Platform Engineer @ Pfizer

Introduction/Stage Setting

When the self-service project began, we were already in the process of refactoring the Snowflake security model for the five-year-running single-tenant solution (where environments remain separated by database). The team had grown very big, very fast, and had reorganized from a single centralized ingestion team to many domain-oriented Squads. We needed to institute least privilege and role-based environment separation as soon as possible without impacting ongoing development, access, or analytics. The self-service project meant that we now also needed to design a second security model for the multi-tenant solution (where environments are separated by account), and to ensure both security models would work seamlessly together in the production tenant (where the new production Data Products would exist side by side with the existing dev, test, and production data marts and stores).

Architecture & Integrations

Pattern-based RBAC automation

Over time, the database schema had become our primary security hook. Most of our database schemas are dedicated to providing the raw data from a single source system or vendor where full access to all views was typically considered appropriate by data owners. The database schema is also the closest correspondence to the concept of a Data Mesh data product that we found in our existing data stores. This led us to develop a convention for consistent Snowflake role naming that would self-describe the access provided and support automation (roles and grants).

Image: Single v Multi-tenant Snowflake Role naming conventions

In the image above, elements 1, 2, and 3 (database, environment, and schema respectively) combine to form end-user consumer roles. Roles like ODS_PROD_AAD or AAD_PROD_PUBLISHED provide use of the database and schema and select on views only. The single-tenant roles rely on future grants to ensure consistent access to these views even as modifications may be deployed. Elements 4, 5, and 6 are variations on the root; single-tenant roles always include database, environment and schema, and multi-tenant roles always include the database and environment.

The READ and WRITE variations (elements 4 and 5) are privileged for those who are responsible for maintaining the data product. Read roles like ODS_PROD_AAD_READ or AAD_PROD_READ expand access beyond just views; and in the multi-tenant roles this extends to include all schemas in the Data Product database — this is why schema is not part of the root of multi-tenant roles. Elevated roles like ODS_PROD_AAD_WRITE or AAD_PROD_WRITE serve as the owning role of all schema-level objects contained within the data product. Here, we rely on future ownership grants to avoid accidental conflicts since our data engineers have a default functional role corresponding to their Squad; these in turn inherit data access roles relevant to the Squad. Finally, for single-tenant Data Products, there is an SDS (Secure Data Share — element 6) database role variation. A role like ODS.ODS_PROD_AAD_SDS is used to make this data available to the new Dev and Test tenants to support reusability of data in the development of new Data Products.
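
To illustrate how these conventions translate into grants, here is a hedged sketch using the example role names above; the share name and the exact privilege lists are assumptions, not the production scripts.

```sql
-- Multi-tenant consumer role: select on views only, kept current via future grants.
GRANT SELECT ON FUTURE VIEWS IN SCHEMA AAD.PUBLISHED TO ROLE AAD_PROD_PUBLISHED;

-- READ expands beyond views and spans every schema in the Data Product database.
GRANT SELECT ON FUTURE TABLES IN DATABASE AAD TO ROLE AAD_PROD_READ;

-- WRITE owns the schema-level objects; future ownership grants prevent accidental conflicts.
GRANT OWNERSHIP ON FUTURE TABLES IN SCHEMA AAD.CURATED TO ROLE AAD_PROD_WRITE;

-- Single-tenant SDS database role, granted to the share consumed by the Dev/Test tenants.
CREATE DATABASE ROLE IF NOT EXISTS ODS.ODS_PROD_AAD_SDS;
GRANT DATABASE ROLE ODS.ODS_PROD_AAD_SDS TO SHARE ODS_SDS_SHARE;
```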

While the multi-tenant roles are all provisioned by Terraform (see Snowflake + Terraform below), and the single-tenant consumer roles by SCIM (see Snowflake + Azure below), everything else is managed via Snowflake automation (SQL stored procedures and tasks). We leverage Snowflake metadata to power the automation logic. Information Schema and Account Usage provide everything we need to know about the existence of schemas in each database-environment, warehouses, roles, and grants. We combine this source data with the scope and status of expected warehouses, roles, and grants to produce the set of missing elements for automation to act on. Additional automation secures views for the database roles and grants these to the shares where they will be used.
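
As an example of the metadata-driven approach, a query along these lines (a simplification of the real procedures, with the single-tenant naming pattern assumed) can surface schemas that are missing their expected consumer role:

```sql
-- Hypothetical gap check: schemas with no matching DATABASE_PROD_SCHEMA consumer role.
SELECT s.catalog_name, s.schema_name
FROM snowflake.account_usage.schemata s
LEFT JOIN snowflake.account_usage.roles r
       ON r.name = s.catalog_name || '_PROD_' || s.schema_name
      AND r.deleted_on IS NULL
WHERE s.deleted IS NULL
  AND r.name IS NULL;
```

A stored procedure can then generate the missing CREATE ROLE and GRANT statements, and a task can run it on a schedule.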

Snowflake + Azure

In addition to the single-tenant consumer roles previously described, we rely on SCIM for user provisioning and the creation and management of functional (team or squad-based) roles. Azure SCIM provides two powerful advantages to the Snowflake security administrator:

  1. Ensuring internal data is accessible only by active employees, and
  2. Decentralized and reusable functional access management

Azure SCIM enables us to leverage Entra (or Active Directory) identities when creating Snowflake users. This relationship between a user’s corporate network identity and their Snowflake account means that data access is revoked as soon as network access is disabled. This means less time spent cross-referencing staff departure lists against Snowflake user accounts.

Azure SCIM also translates security groups into Snowflake roles and security group membership into Snowflake user-role grants (both adding/granting and removing/revoking). Security groups are often created and managed by Service Desk technicians who are responsible for obtaining and documenting the necessary approvals required prior to adding new users to the membership. We leverage this by requesting the creation of one security group for each production database schema, and identifying the data owner as group owner. Similarly, we request the creation of one security group for each functional, team-based role we manage, and we set the team supervisor or lead as group owner.

These latter security groups (describing teams with similar data needs) are reusable by other systems integrated with Entra, including other Snowflake tenants, Azure resources (e.g., storage accounts, data factories, and key vaults), and in some cases on-prem resources too (e.g., file shares). This reusability is key to how we manage consistent access to all of the Snowflake and Azure resources Terraform provisions for each new self-service Data Product. At least eighty percent of a new team member’s access is handled by simply adding them to their Squad’s role.

Snowflake + Terraform

As described above, Terraform powers automated provisioning of new Data Products. Given a small number of simple inputs (e.g., the data product name, which in the image below is ‘DP’), Terraform takes care of almost everything else in Snowflake. The image below diagrams the structure of the new multi-tenant Data Products, specifically in the Dev tenant. The two green boxes at the top are security groups translated into Functional Snowflake roles by Azure SCIM, which we reuse in all three tenants. Similarly, the three data shares on the left are those production SDS database roles from the production tenant described previously.

Image: Diagram of the Dev tenant Data Product security model

Everything contained entirely within the gray box is provisioned by Terraform. This includes the new database, the standardized set of schemas, the warehouse, a service user (complete with Snowflake credentials saved immediately to Azure Key Vault), three roles, and most privileges needed to enable development. Also included, but not visible here, is a storage integration to a dedicated ADLS (Azure Data Lake Storage) Gen 2 location for staging files. Now that Snowflake is taking a more active role in the management of the Snowflake Terraform provider, we look forward to expanding our use of this functionality in the future.
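
A sketch of that ADLS Gen2 staging piece is shown below; the tenant ID, storage account, container, and object names are all placeholders.

```sql
-- Storage integration and stage for the Data Product's dedicated ADLS Gen2 location.
CREATE STORAGE INTEGRATION IF NOT EXISTS DP_ADLS_INT
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'AZURE'
  ENABLED = TRUE
  AZURE_TENANT_ID = '<azure-tenant-id>'
  STORAGE_ALLOWED_LOCATIONS = ('azure://<storageaccount>.blob.core.windows.net/dp-staging/');

CREATE STAGE IF NOT EXISTS DP.RAW.ADLS_STAGE
  STORAGE_INTEGRATION = DP_ADLS_INT
  URL = 'azure://<storageaccount>.blob.core.windows.net/dp-staging/';
```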

Snowflake + SchemaChange (CICD)

SchemaChange enabled the environment separation and source control management we needed. The Snowflake service user created by Terraform during Data Product provisioning is the user SchemaChange operates through. This user shares use of the Data Product WRITE role — the same role used by the data engineers in the Dev tenant. In the Test and Prod tenants, only the Snowflake service user, through SchemaChange, has access to the role and can modify the Data Product.

SchemaChange is not what we thought we wanted initially. The available documentation is on the light side, and we had significant concerns around source control; most of us who read it through were left with the mistaken impression that everything in the repository would have to be one of the three kinds of change script, perhaps because of the expected folder structure. Indeed, you often see the example of a table definition that starts with ten columns but after some time acquires an eleventh, represented by an ALTER statement following the original ten-column table definition.

Our existing source control process always included two kinds of scripts — release scripts, yes, but also source scripts which presented the most up-to-date definition of any object. So, based on the previous example of a table gaining a new column, our script would start out as a table definition with ten columns and would subsequently be rewritten as a table definition with eleven columns. These source scripts were stored in a folder structure not unlike the way Snowflake depicts data navigation (database > schema > schema object type > object name). This served us very well. Not only did source control enable us to easily go back in time to see how any given object had changed, but also who changed it, when, and what else was part of that change. And should a table contain historical data that needed to be retained, the release scripts provided the freedom to shift data and manipulate objects as needed.

It turned out that SchemaChange was completely compatible with our preferred form of source control with an added option to ‘release’ a source script in place (the R__ type scripts which are well suited to view definitions). We can keep a source script in the correct location within the folder hierarchy and allow SchemaChange to recognize when it has been changed and ignore the script when it has not. It doesn’t always work though, because you cannot guarantee sequencing outside of the numbered V__ type scripts. So while SchemaChange took some adjustment, it is worth the assurance that only what has been approved in code review and stored in source control gets deployed to upper environments.
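
For readers unfamiliar with SchemaChange, the two script styles discussed above look roughly like this; the file names and objects are invented for illustration.

```sql
-- V1.1.2__add_status_to_orders.sql : versioned change script, applied exactly once.
ALTER TABLE DP.CURATED.ORDERS ADD COLUMN STATUS VARCHAR;

-- R__published_vw_orders.sql : repeatable script holding the current full definition;
-- SchemaChange redeploys it whenever its checksum changes and skips it otherwise.
CREATE OR REPLACE VIEW DP.PUBLISHED.VW_ORDERS AS
SELECT ORDER_ID, CUSTOMER_ID, ORDER_DATE, STATUS
FROM DP.CURATED.ORDERS;
```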

Lessons learned — challenges/solutions

Automating role-based environment separation and database cloning
Back in our early days of using Snowflake, we developed a process by which we would regularly clone the production database-environment to create the test database-environment. When it came time to implement role-based environment separation in our single-tenant solution, we learned how roles acquire grants to all cloned versions. This means a role like ODS_PROD_AAD_READ, which started with usage on the indicated schema, ODS.AAD, and select on all views and tables therein, following a clone now also includes the privilege to use schema ODS_TEST.AAD and all views and tables there too. Initially, such a role also included use of the database ODS. And because cloning only keeps privileges below the cloned object, this was OK until we started looking at how to manage access to the test database-environments.

The straightforward instinct was to create and manage test roles just like for production and dev roles. But following a fresh cloning process, these test roles would be empty of all grants. It seemed inefficient to re-privilege these roles when the production versions — both access and elevated — already had the grants we needed. Combining these with a grant to the freshly-cloned test database was all that was needed to reinstate access to test database schemas. Unfortunately, the presence of a grant to the production database included with the schema access roles meant that test roles would have access to both production and test environments. Our solution involved the creation of dedicated database access roles and the removal of database grants from schema roles like ODS_PROD_AAD_READ. In this way, environment access could be controlled by the relevant database role combined with the production schema roles.
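
A simplified sketch of the problem and the eventual fix, reusing the role names from above; the dedicated database access role and the functional role are assumed names.

```sql
-- Refreshing TEST from PROD: schema-level grants travel with the clone, so
-- ODS_PROD_AAD_READ now also covers ODS_TEST.AAD objects.
CREATE OR REPLACE DATABASE ODS_TEST CLONE ODS;

-- Fix: database access comes from a dedicated role, not from the schema roles.
CREATE ROLE IF NOT EXISTS ODS_TEST_DB_ACCESS;
GRANT USAGE ON DATABASE ODS_TEST TO ROLE ODS_TEST_DB_ACCESS;
REVOKE USAGE ON DATABASE ODS FROM ROLE ODS_PROD_AAD_READ;

-- Test access = test database role + the existing production schema role.
GRANT ROLE ODS_TEST_DB_ACCESS TO ROLE TEST_SQUAD_FUNCTIONAL_ROLE;
GRANT ROLE ODS_PROD_AAD_READ  TO ROLE TEST_SQUAD_FUNCTIONAL_ROLE;
```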

Terraform and Snowflake conventions
Terraform's assumptions can contradict Snowflake conventions in unexpected ways. For example, Terraform-managed resources tend to be atomic, designed to be destroyed or modified if they are found not to match the currently running definition; this makes sense considering that Terraform specializes in infrastructure management and makes it easy to change and redeploy that infrastructure as needs change. So in hindsight, it makes some sense that when Terraform is asked to create a grant and sees that it already exists, but not exactly as the current configuration dictates, it destroys (revokes) the old in order to create the new.

In our case, we wanted the Data Product READ role to have the ability to monitor the Data Product warehouse. Upon deploying the first Data Product, everything looked exactly how we expected. But after deploying the next Data Product, we found the first READ role no longer had the monitor warehouse privilege while the new Data Product READ role did. From a Snowflake standpoint this makes little sense, since privileges like monitor warehouse are intended to be grantable to any role with the need and the approval to monitor any number of warehouses. If Terraform had been deploying our Snowflake resources into different Snowflake accounts, we likely would not have noticed. Thankfully, once we determined what was going on, the solution was an optional parameter we had initially overlooked: enable_multiple_grants. When set to true, the behavior returned to Snowflake-normal.

Terraform and SCIM do not play well together inside Snowflake
Just as we maintained security groups for managing decentralized access to production schemas for individual users in the single-tenant solution, we intended to offer a similar pathway for decentralized access, for individual users, to the production Data Products created via the multi-tenant solution. Ideally, Terraform would provide the grants between data and role, and SCIM would provide the grants between user and role. Unfortunately, both start by attempting to create the role, and neither supports the use of CREATE ... IF NOT EXISTS, a powerful feature we use often in the automated RBAC procedures.
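
For contrast, the idempotent pattern our automated RBAC procedures rely on looks like this (role name illustrative):

```sql
-- Safe to re-run: the role is only created when it does not already exist,
-- so any grants already held by the role are left untouched.
CREATE ROLE IF NOT EXISTS DP_PROD_CONSUMER;
GRANT ROLE DP_PROD_CONSUMER TO ROLE SYSADMIN;
```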

Powering Data Ingestion with Azure Data Factory and Data Consumption in Snowflake and Power BI

By: Matthew Massey, Senior Platform Engineer @ Pfizer

Legacy Data Architecture and Its Limitations
Our previous data architecture was primarily built around an on-premises server file management system, with Pentaho Data Integration used to process data into Snowflake. This system allowed us to consolidate all our data sources into a single Snowflake account, creating a unified, monolithic data repository. However, this approach presented several challenges.

Monolithic data stores often encounter scalability issues, struggling to efficiently store, process, and retrieve data as the volume increases. Real-time or near-real-time data processing is typically not a strength of such systems, leading to delays in data availability. They can also be rigid and inflexible, making integrations with other systems or data sources difficult and potentially requiring significant development effort. Accessing specific information from the consolidated data can be slow and challenging, especially as data volume grows. Moreover, the complexity of monolithic systems often results in high implementation and maintenance costs.

Transition to a New, Integrated Data Architecture
To overcome these challenges, we have adopted a new architecture that integrates Azure Data Factory, Snowflake’s cloud data warehouse, and Power BI. This powerful combination provides an efficient end-to-end data analytics solution, enabling our organization to manage, process, and derive insights from its data.

Key Benefits of the Integrated Approach
This integrated solution offers several benefits. Azure Data Factory facilitates seamless data movement between various sources, while Snowflake serves as a highly scalable and elastic data warehouse. Power BI is used to visualize and analyze the data. Both Azure Data Factory and Snowflake are designed to handle large volumes of data efficiently, and Power BI leverages this scalability for robust analytics.

Scalability and Cost Efficiency
The new architecture offers flexibility with Azure Data Factory supporting diverse data sources, Snowflake allowing for the storage of structured, semi-structured, and unstructured data, and Power BI providing flexible reporting and dashboard creation. The data movement capabilities of Azure Data Factory, the diverse data transformations supported by Snowflake, and the ability of Power BI to transform and model data for business insights, offer a comprehensive solution for various Data Products. Both Azure Data Factory and Snowflake operate on a pay-as-you-go model, enabling cost-efficient scaling based on actual usage.

Cloud Security Best Practices
Security integration between the services is enabled by Snowflake SCIM provisioning via Azure Enterprise Apps, Snowflake storage and notification integrations for Azure ADLS Gen 2 and Blob storage, Snowflake KeyPair provisioning for programmatic data access, and Snowflake OAuth for Power BI for end-user SSO. Azure Key Vault is used for safeguarding keys and secrets used by our cloud applications and services.
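
As one example of the programmatic access piece, key-pair provisioning for a service account can be sketched as follows; the user name and key value are placeholders, and the private key itself stays in Azure Key Vault.

```sql
-- Register the public key on the Snowflake service user for key-pair authentication.
ALTER USER SVC_DP_ADF SET RSA_PUBLIC_KEY = 'MIIBIjANBgkqh...<placeholder>...';

-- Confirm the registered key fingerprint that clients will authenticate against.
DESCRIBE USER SVC_DP_ADF;
```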

Unified Data Platform and Enhanced User Experience
When combined and integrated, these services offer a unified architecture for data storage, processing, and visualization, streamlining the entire data lifecycle. Power BI can utilize the AI and machine learning features of Snowflake for sophisticated analytics, delivering predictive and prescriptive insights. The end-to-end process is approachable for data engineers, analysts, and business users alike, thanks to Snowflake’s user-friendly environment, the visual design of Azure Data Factory, and the intuitive interface of Power BI.

Conclusion

By: Sean Morris, Sales Engineer @ Snowflake

DSS&S’ journey to a self-service data mesh, powered by Snowflake and Azure, showcases the strategic alignment of principles and meticulous implementation of core changes. The six key transformations, including separating Snowflake environments and revamping RBAC, played a crucial role in achieving goals like consistent patterns, automated data integration, and granular security.

Insights from team members, such as Christopher Witcher’s emphasis on empowering Data Engineers with automation and Jennifer Yoakum’s exploration of RBAC and SchemaChange, provided a deep understanding of the intricacies involved. Matthew Massey highlighted the shift from a legacy monolithic architecture to a modern approach integrating Azure Data Factory, Snowflake, and Power BI, offering seamless integration, scalability, and cost efficiency.

In summary, Pfizer’s success lies in the strategic adoption of modern technologies, meticulous implementation of changes, and a commitment to security and governance. The collaboration between Snowflake and Azure has propelled DSS&S toward offering a robust self-service data mesh, marking a significant achievement in the data analytics landscape.

Special thanks to the DSS&S team for inspiring this amazing work.

  • Samia Rahman — Enterprise Data Strategy
  • Marty Hall — Enterprise Data and Analytics
  • Gary Kretzschmar — Data Analytics Platforms
  • Christopher Witcher — Principal Platform Engineer
  • Jennifer Yoakum — Senior Platform Engineer
  • Matthew Massey — Senior Platform Engineer

Disclaimers:

- This presentation includes examples of solutions that we have used or evaluated, but they do not imply any endorsement or recommendation by us or our company.

- Please be aware that the references to specific tools or vendors in this blog article are for informational purposes only and do not constitute an endorsement or preference by our team or organization. We encourage you to explore other alternatives and decide what works best for your platform needs.
