Snowflake: Integration with Google Cloud Storage
In this blog post, we will learn how a Snowflake storage integration can be used to read data from and write data to a Google Cloud Storage bucket.
Snowflake is one of the most popular cloud-agnostic Software-as-a-Service (SaaS) cloud data warehouses, bringing simplicity without compromising on features. It is a multi-purpose platform that can serve as a data lake, operational data store, data warehouse, or data mart. Snowflake enables data storage, processing, and analytics that are faster, easier to use, and far more flexible than traditional offerings, and its automatic vertical and horizontal scalability, built on a decoupled compute-and-storage architecture, helps balance performance against operational cost.
Google Cloud Storage is the object storage service offered by Google Cloud. It is designed for at least 99.999999999% (11 nines) annual durability, which means that even with one billion objects, you would likely go a hundred years without losing a single one. It also provides out-of-the-box features such as object versioning and fine-grained permissions that make development hassle-free and help reduce operational costs.
Before deep-diving into the Snowflake integration with Google Cloud Storage, let's walk through the integration flow with GCP and familiarize ourselves with the terminology used throughout the integration steps.
Integration Flow
A storage integration object connects external storage such as GCP Cloud Storage with Snowflake. The GCP IAM service can be used to grant fine-grained access to the external storage, as shown in Figure 1. Multiple external stage objects can reference different buckets and paths while using the same storage integration for authentication.
Terminology Used
- Google Cloud Storage Bucket: A container to store objects in the cloud
- IAM: GCP Identity and Access Management (IAM) is a web-based service that controls access to GCP resources.
- IAM Policy: Mechanism to define and enforce permissions on identities such as users, groups, and roles
- Roles: Set of permissions that can be delegated to users
- User: An entity that interacts with GCP
- Storage Integration: A Snowflake object that stores a generated service account along with a set of allowed or blocked storage locations
- Snowflake External Stage: Stores metadata about external data files, such as the Google Cloud Storage bucket URL and the file format
Integration Steps
1. Create a Cloud Storage Bucket
- Log in to a GCP account → View all products
- Select Storage → Cloud Storage
- Click CREATE BUCKET
- Enter a name under Name your bucket → Click CREATE
2. Create a Cloud Storage Integration in Snowflake
A cloud storage integration holds external storage metadata such as the storage provider, the role (generated through the IAM service), and the object storage (GCS bucket) location. Creating this integration means you don't need to provide credentials when creating stages or loading data. To access Cloud Storage buckets, Snowflake creates a service account that can be granted permission to access the bucket(s) that store your data files. Below is the command to create an integration:
CREATE STORAGE INTEGRATION <integration_name>
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = GCS
  ENABLED = TRUE
  STORAGE_ALLOWED_LOCATIONS = ('gcs://<bucket>/<path>');
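As a concrete illustration, here is what a filled-in version might look like; the integration name `gcs_int` and bucket path are hypothetical placeholders, not names from this post's environment. `STORAGE_ALLOWED_LOCATIONS` accepts a comma-separated list, so one integration can cover several buckets or prefixes:

```sql
-- Hypothetical example: integration name, bucket, and paths are placeholders
CREATE STORAGE INTEGRATION gcs_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = GCS
  ENABLED = TRUE
  STORAGE_ALLOWED_LOCATIONS = ('gcs://my-data-bucket/landing/', 'gcs://my-data-bucket/exports/');
```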
3. Retrieve the Cloud Storage Service Account
After creating storage integration in Step 2, run the following command to retrieve the Cloud Storage service account for your Snowflake account.
DESC STORAGE INTEGRATION <integration_name>;
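In the DESC output, the service account to authorize appears in the STORAGE_GCP_SERVICE_ACCOUNT property as an email-style identifier (something like `service-account-id@<project>.iam.gserviceaccount.com`). As a sketch, you can pull out just that value by scanning the DESC result set; the integration name `gcs_int` is a hypothetical placeholder:

```sql
DESC STORAGE INTEGRATION gcs_int;  -- hypothetical integration name

-- Fetch only the service account from the previous statement's result set.
-- DESC output column identifiers are lowercase, so they must be double-quoted.
SELECT "property_value"
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
WHERE "property" = 'STORAGE_GCP_SERVICE_ACCOUNT';
```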
4. Grant the Service Account Permissions to Access Bucket Objects
To read data from and write data to the Cloud Storage bucket, let's configure IAM access permissions for Snowflake in GCP by following the steps below:
- Create a Custom IAM Role
- Log into the GCP Console as a project editor → IAM & admin → Roles → Create Role
- Enter the Title and Description of the role → ADD PERMISSIONS
- Filter the permission list and assign the required permissions to the role → ADD
- Click CREATE
- Assign the Custom Role to the Cloud Storage Service Account
- Go to Cloud Storage → Browser → Select a bucket to configure access → SHOW INFO PANEL in the upper-right corner
- Add the Snowflake service account under New principals → assign the custom role created above → SAVE
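The post does not reproduce the permission list here, so as a reference point: per Snowflake's documentation for GCS integrations, a custom role used for both loading and unloading data typically needs the object permissions below (verify against the current Snowflake docs for your use case):

```
storage.objects.get      # read staged files (required for loading)
storage.objects.list     # list files in the bucket (required for loading)
storage.objects.create   # write files (required only for unloading data)
storage.objects.delete   # remove files (required for PURGE or overwriting on unload)
```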
5. Create an External Stage
Finally, we will create an external stage that holds the cloud storage integration metadata, such as the storage integration name, the Cloud Storage bucket URL, and the file format. Before creating the external stage, however, we need to create a file format in Snowflake so that it can be referenced during stage creation. Here's the snippet to create a file format:
CREATE OR REPLACE FILE FORMAT my_csv_format
  TYPE = CSV
  FIELD_DELIMITER = '|'
  SKIP_HEADER = 1
  NULL_IF = ('NULL', 'null')
  EMPTY_FIELD_AS_NULL = TRUE
  COMPRESSION = GZIP;
To verify the file format, run the following command:
DESC FILE FORMAT <file_format_name>;
Next, let's create the external stage using the following snippet:
USE SCHEMA SNOWFLAKE_GCS_INTEGRATION.PUBLIC;

CREATE STAGE gcs_stage
  STORAGE_INTEGRATION = <integration_name>
  URL = '<GCS bucket URL>'
  FILE_FORMAT = <file_format_name>;
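As a filled-in sketch, assuming the integration from step 2 was named `gcs_int` and the bucket is `my-data-bucket` (both hypothetical; `my_csv_format` is the file format created above):

```sql
USE SCHEMA SNOWFLAKE_GCS_INTEGRATION.PUBLIC;

CREATE STAGE gcs_stage
  STORAGE_INTEGRATION = gcs_int                 -- hypothetical integration name from step 2
  URL = 'gcs://my-data-bucket/landing/'         -- must fall within STORAGE_ALLOWED_LOCATIONS
  FILE_FORMAT = my_csv_format;                  -- file format created above
```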
Finally, we can run the following command to list the files in the GCS bucket through the stage (note the @ prefix when referencing a stage):
LIST @<stage_name>;
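Listing files only confirms that access works. To actually move data through the stage, you can use COPY INTO in both directions — a minimal sketch, assuming a hypothetical table `my_table` whose columns match the staged CSV files:

```sql
-- Load staged CSV files into a table (table name and definition are hypothetical)
COPY INTO my_table
  FROM @gcs_stage
  PATTERN = '.*[.]csv[.]gz';   -- optional: restrict the load to gzipped CSVs

-- Unload query results back to the bucket through the same stage;
-- the stage's file format and integration are reused automatically
COPY INTO @gcs_stage/exports/my_table_
  FROM (SELECT * FROM my_table)
  OVERWRITE = TRUE;
```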
Briefly, this post outlined the steps to integrate Snowflake with Google Cloud Storage (GCS), along with all the required code snippets. Screenshots of each step are also included to help readers understand and compare their outputs.
Learning never exhausts the mind — Leonardo da Vinci