Cross-Cloud Resource Management with Terraform + Snowflake

Prasanth Kommini
HashiCorp Solutions Engineering Blog
6 min read · Apr 21, 2022

--

Terraform modules to aid Snowflake native data engineering.

Overview

In a previous article, we introduced the Generic External Function Framework (GEFF), an extensible backend API framework implementing generic remote services for Snowflake External Functions (EFs). In this article, we demonstrate how HashiCorp Terraform enables consistent deployment of a Snowflake API Integration backed by GEFF running on AWS Lambda, alongside a Storage Integration Terraform module that integrates with S3 to ingest the responses from asynchronous GEFF invocations.

Even with pre-provided templates for AWS CloudFormation, the deployment of EFs is a 5-step process that requires an operator to manually orchestrate cross-cloud resource creation and linking. This creates ample opportunity for missed steps, dangerous shortcuts, and even simple “typo” errors. It also adds decisions to the operators’ plates, which could lead to redundant threat modeling and a plurality of monitoring systems and mitigations.

Thanks to Terraform’s multi-cloud resource creation capabilities, we can deploy the API & Storage Integration and supporting Infrastructure resources quickly, repeatedly, and consistently, within a variety of organizations. Not only does Terraform provide a lingua franca for infrastructure development, but thanks to the Terraform Registry, we also have a place to collaborate, share, version control and extend modules that build infrastructure and Snowflake functionality.

Additionally, we encourage users who operate on Microsoft Azure or Google Cloud to contribute to our modules with compatible interfaces, further reducing the maintenance costs of cross-cloud work and migrations. Finally, because Terraform is infrastructure as code, we have the option to build automated testing infrastructure that can deploy and verify changes to our code instead of a manual QA process.

Problem

We needed to accomplish the following:

  1. Simplify the EF base infrastructure creation process to a turnkey deployment in a consistent, reusable, and secure manner
  2. Use a single tool capable of creating resources across AWS Infrastructure Cloud and Snowflake Data Cloud
  3. Introduce and set up trust relationships between the resources across clouds so that they can securely communicate with each other

Solution

We addressed the requirements by using Terraform, which allowed us to:

  1. Implement Terraform modules to create a Snowflake API Integration with a GEFF backend via AWS API Gateway and Lambda for compute, and Snowflake Storage Integration with AWS S3 for storage. Additionally, we published the modules in the HashiCorp Registry where they are version controlled
  2. Use multiple providers in a single Terraform module to create resources across multiple clouds and adhere to Snowflake RBAC best practices within our modules
  3. Use Terraform outputs and resource creation order to perform cross-cloud resource introduction and trust

Snowflake API and Storage Integration Terraform Modules

Figure 1: API & Storage Integration Modules along with the GEFF API Backend

Creating the infrastructure required by a Snowflake Storage (or API) Integration becomes as simple as running:

Commands to create the Storage Integration and corresponding AWS Resources.
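With the module files in place, the workflow is the standard Terraform one; a sketch of the commands:

```shell
terraform init    # download the AWS and Snowflake providers and referenced modules
terraform plan    # preview the cross-cloud resources to be created
terraform apply   # create the AWS resources and Snowflake integrations
```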

These are the files required to create the base infrastructure (API Integration + AWS resources) for a simple Snowflake native ETL pipeline:

  1. main.tf
  2. geff.auto.tfvars
  3. variables.tf
  4. provider.tf
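As a rough illustration, a main.tf wiring up the integration module might look like the sketch below. The registry source path, version, and input names are assumptions, not the modules’ actual published interfaces; consult the modules’ Terraform Registry pages for the real ones.

```hcl
# main.tf — illustrative sketch; source path and inputs are assumptions.
module "geff_api_integration" {
  # Hypothetical registry path for the API Integration + GEFF module.
  source  = "Snowflake-Labs/api-integration-with-geff-aws/snowflake"
  version = "~> 0.1" # pin an appropriate released version

  providers = {
    aws       = aws
    snowflake = snowflake
  }

  prefix     = var.prefix     # shared name prefix for AWS and Snowflake objects
  aws_region = var.aws_region # region for the Lambda, API Gateway, and S3 bucket
}
```

The variable values themselves would live in geff.auto.tfvars, with their declarations in variables.tf.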

Terraform sets up the following resources across the two clouds: the API Gateway, Lambda function, IAM roles, and S3 bucket in AWS, and the API and Storage Integrations in Snowflake.

The API Integration Terraform module in turn leverages the Storage Integration module for the storage layer and enables us to build more complex pipelines, such as the one mentioned in [1]. The power of these Terraform modules lies in automating the infrastructure-creation portions of the following key components of a Snowflake native ETL, which are intricate and time-consuming to do manually:

  1. EF creation with AWS
  2. Snowpipe creation: Option 2

This section described the two Terraform modules we implemented and how they can be useful to you. The next section talks about how we used Terraform to perform cross-cloud resource creation.

Single Module, Multiple Providers, Multiple Clouds

Snowflake EFs (as well as External Tables and Snowpipes) all require the following steps to integrate with AWS:

  1. Create AWS Infrastructure Cloud IAM, Compute, Network, and Storage resources
  2. Create Snowflake Data Cloud account-level objects such as API Integrations and Storage Integrations
  3. Introduce and set up trust between the resources created in (1) and (2)
  4. Create the Snowflake DB-level objects like EFs, ETs, Stages & Snowpipes

Our API Integration with GEFF Backend and Storage Integration Terraform modules automate steps (1), (2), and (3) above. We intend to implement separate Terraform modules that build on top of the API and Storage Integration modules to create the pipelines of step (4).

Terraform allows us to initialize multiple providers, so we create separate providers for AWS and Snowflake. The following example shows how to create the providers and pass them into the Snowflake API Integration Terraform module:
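A minimal sketch of that setup, assuming the CZI Snowflake provider and placeholder values for the account, role, and module path:

```hcl
# provider.tf — one provider per cloud, passed explicitly into the module.
provider "aws" {
  region = var.aws_region
}

# CZI Snowflake provider; credentials can also be supplied via environment variables.
provider "snowflake" {
  account = var.snowflake_account
  role    = "ACCOUNTADMIN" # or a narrower role with integration privileges
}

module "geff_api_integration" {
  source = "./modules/api_integration" # illustrative path

  providers = {
    aws       = aws
    snowflake = snowflake
  }
}
```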

Additionally, Terraform allows us to create multiple providers for the same cloud using the alias attribute and reference them independently. Our integration modules map each provider to the appropriate cloud resource by passing the corresponding provider attribute. Thus, we can instantiate multiple Snowflake providers, each with a different role attribute, and use them to create different types of Snowflake objects. For example, we can create account-level and database-level objects with different Terraform providers configured with granular grants that leverage the Snowflake RBAC system. This is a powerful feature, and we use it to enforce RBAC in our modules by requiring api_integration and storage_integration providers to be passed into the module via the required_providers block, as seen below:

Code to enforce RBAC in our integration Terraform modules.
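A sketch of such a declaration, using the alias names from the text (this uses configuration_aliases, available in Terraform 0.15+; the surrounding details are illustrative):

```hcl
# Inside the module: callers MUST pass aliased Snowflake providers,
# one per RBAC role, or terraform init/plan fails.
terraform {
  required_providers {
    snowflake = {
      source = "chanzuckerberg/snowflake"
      configuration_aliases = [
        snowflake.api_integration,
        snowflake.storage_integration,
      ]
    }
  }
}

# Callers satisfy the requirement with providers configured for distinct roles:
#
#   providers = {
#     snowflake.api_integration     = snowflake.api_admin
#     snowflake.storage_integration = snowflake.storage_admin
#   }
```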

This section described how we accomplished cross-cloud resource creation and enforced RBAC using Terraform providers. The next section touches on setting up cross-cloud communication.

Cross-Cloud Resource Trust Relationship

When a Snowflake API Integration is created, Snowflake generates two key attributes that we need in order to perform cross-cloud resource linking:

  1. API_IAM_USER_ARN looks like arn:aws:iam::12345:user/user-id
  2. API_AWS_EXTERNAL_ID looks like <instance_name>_SFCRole_<randomID>

Snowflake leverages the external ID to set up trust relationships between IAM entities across clouds, as documented here. We use CZI’s Terraform Snowflake provider to create the Storage Integration and then use the resource’s output attributes to set up the trust relationship via the Terraform AWS provider’s aws_iam_role resource and its assume_role_policy property.
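A sketch of the trust relationship, shown here for the API Integration (the Storage Integration exposes analogous storage_aws_* attributes). The snowflake_api_integration resource and its api_aws_iam_user_arn and api_aws_external_id attributes come from the CZI Snowflake provider; the role and resource names are placeholders.

```hcl
resource "aws_iam_role" "snowflake_caller" {
  name = "${var.prefix}-snowflake-api-caller"

  # Trust only the IAM user Snowflake generated for this integration,
  # and only when it presents the integration's external ID.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = snowflake_api_integration.geff.api_aws_iam_user_arn }
      Condition = {
        StringEquals = {
          "sts:ExternalId" = snowflake_api_integration.geff.api_aws_external_id
        }
      }
    }]
  })
}
```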

We use Terraform’s string interpolation to construct the aws_iam_role_arn, allowed_prefixes, and allowed_locations values required to create the API/Storage Integration. We can do this because Snowflake doesn’t require the aws_iam_role_arn, allowed_prefixes, or allowed_locations to actually exist at the time the integrations are created, which lets us break what would otherwise be a cyclic dependency without actually having one in the Terraform code. Coupled with Terraform’s ability to derive resource dependencies as a DAG, this allows the resources to be created efficiently and repeatably, setting up a robust cross-cloud trust relationship in a turnkey deployment.
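A sketch of breaking the cycle for the Storage Integration: the integration references the role ARN and bucket URL as interpolated strings, so neither has to exist yet. The resource attributes follow the CZI Snowflake provider; the names themselves are placeholders.

```hcl
data "aws_caller_identity" "current" {}

locals {
  # ARN of a role that Terraform will create *after* the integration.
  storage_role_arn = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/${var.prefix}-storage-role"
}

resource "snowflake_storage_integration" "this" {
  name                      = upper("${var.prefix}_storage_integration")
  type                      = "EXTERNAL_STAGE"
  storage_provider          = "S3"
  enabled                   = true
  storage_aws_role_arn      = local.storage_role_arn              # role created afterwards
  storage_allowed_locations = ["s3://${var.prefix}-geff-bucket/"] # bucket created afterwards
}
```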

Conclusion

To summarize, we introduced our Terraform modules:

  1. API Integration with GEFF API as backend [3]
  2. Storage Integration [4]

We also demonstrated features of Terraform that we use in our modules, such as:

  1. Terraform registry and Terraform module versioning
  2. Providers, provider aliases, required_providers
  3. Outputs and derivation of creation order through resource DAGs

And finally went into some detail as to how the above features allowed us to:

  1. Create resources across multiple clouds, then introduce and securely link them
  2. Enforce RBAC in API and Storage Integration modules
  3. Transfer concerns of cyclical resource creation order from the infrastructure operators to Terraform
