Export Snowflake query results to GCS using Dataproc Serverless

Vanshaj Bhatia
Google Cloud - Community
3 min read · Dec 7, 2022

Dataproc Serverless runs batch workloads without requiring you to provision and manage a cluster; all infrastructure provisioning and management is handled behind the scenes.

Key Benefits:

  1. It can dynamically scale workload resources, such as the number of executors, to run your workload efficiently.
  2. Fully managed and serverless solution.
  3. Ephemeral: resources are released as soon as the job ends.

Dataproc Serverless Templates: ready-to-use, open-source, customisable templates built on Dataproc Serverless for Spark. They simplify development for data engineers, who can consume the existing templates as-is or customise them to their requirements.

In this post we will explore how to export data from a Snowflake table to GCS using Dataproc Serverless.

Snowflake to GCS template

The template reads data from a Snowflake table or a query result and writes it to a Google Cloud Storage location. It uses the Snowflake Connector for Spark, enabling Spark to read data from Snowflake.

The template allows the following parameters to be configured through the execution command:

  1. snowflake.gcs.sfurl: Snowflake account URL.
    Format: <account-identifier>.snowflakecomputing.com
  2. snowflake.gcs.sfuser: Snowflake username
  3. snowflake.gcs.sfpassword: Snowflake user password
  4. snowflake.gcs.sfdatabase: Snowflake database name
  5. snowflake.gcs.sfschema: Snowflake schema name
  6. snowflake.gcs.sfwarehouse: Snowflake warehouse
    Note:
    ◦ This is an optional property
    ◦ If not specified explicitly through the execution command, the default warehouse configured in Snowflake is used.
  7. snowflake.gcs.autopushdown: Snowflake query pushdown feature
    Note:
    ◦ This is an optional property
    ◦ If not specified explicitly through the execution command, the default value is on.
  8. snowflake.gcs.table: Snowflake input table
  9. snowflake.gcs.query: Snowflake select query
    Note: Either one of the template properties snowflake.gcs.table and snowflake.gcs.query must be provided.
  10. snowflake.gcs.output.location: GCS output location
    Format: gs://<bucket-name>/<dir>
  11. snowflake.gcs.output.format: GCS output file format.
    Accepted values: csv, avro, orc, json or parquet

Run the Template

1. Ensure the subnet has Private Google Access enabled. Even if you are using the "default" VPC created by GCP, you still have to enable private access, as shown below.
gcloud compute networks subnets update default --region=us-central1 --enable-private-ip-google-access
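
To confirm the change, you can describe the subnet and check the Private Google Access flag (the format expression below assumes the current gcloud output fields):

gcloud compute networks subnets describe default --region=us-central1 --format="value(privateIpGoogleAccess)"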

2. Create a GCS bucket and staging location for jar files.
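
For example, a regional bucket can be created with gsutil (the bucket name below is a placeholder):

gsutil mb -l us-central1 gs://<your-staging-bucket>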

3. Clone the git repo in Cloud Shell, which comes pre-installed with the required tools. Alternatively, use any machine with JDK 8+, Maven and Git installed.

git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/java

4. Authenticate the gcloud CLI

gcloud auth login
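
If gcloud is not already pointed at the right project, you can also set it as the default (the project ID below is a placeholder):

gcloud config set project <gcp-project-id>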

5. Execute the SnowflakeToGCS template

Sample execution command:

export GCP_PROJECT=<gcp-project-id>
export REGION=<gcp-project-region>
export GCS_STAGING_LOCATION=<gcs-bucket-staging-folder-path>
export SUBNET=<gcp-project-dataproc-clusters-subnet>

bin/start.sh \
-- \
--template SNOWFLAKETOGCS \
--templateProperty snowflake.gcs.sfurl <snowflake-account-url> \
--templateProperty snowflake.gcs.sfuser <snowflake-user> \
--templateProperty snowflake.gcs.sfpassword <snowflake-user-password> \
--templateProperty snowflake.gcs.sfdatabase <snowflake-database> \
--templateProperty snowflake.gcs.sfschema <snowflake-schema> \
--templateProperty snowflake.gcs.sfwarehouse <snowflake-warehouse> \
--templateProperty snowflake.gcs.query <snowflake-select-query> \
--templateProperty snowflake.gcs.output.location <gcs-output-location> \
--templateProperty snowflake.gcs.output.format <csv|avro|orc|json|parquet> \
--templateProperty snowflake.gcs.output.mode <Overwrite|ErrorIfExists|Append|Ignore> \
--templateProperty snowflake.gcs.output.partitionColumn <gcs-output-partitionby-columnname> \
--templateProperty snowflake.gcs.autopushdown <on|off>
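
For illustration, a filled-in invocation could look like the following; the account URL, credentials, warehouse, query and bucket are all hypothetical values:

bin/start.sh \
-- \
--template SNOWFLAKETOGCS \
--templateProperty snowflake.gcs.sfurl myaccount.snowflakecomputing.com \
--templateProperty snowflake.gcs.sfuser etl_user \
--templateProperty snowflake.gcs.sfpassword ****** \
--templateProperty snowflake.gcs.sfdatabase SALES_DB \
--templateProperty snowflake.gcs.sfschema PUBLIC \
--templateProperty snowflake.gcs.sfwarehouse COMPUTE_WH \
--templateProperty snowflake.gcs.query "SELECT * FROM ORDERS WHERE ORDER_DATE >= '2022-01-01'" \
--templateProperty snowflake.gcs.output.location gs://my-output-bucket/snowflake-export \
--templateProperty snowflake.gcs.output.format parquet \
--templateProperty snowflake.gcs.output.mode Overwrite \
--templateProperty snowflake.gcs.autopushdown on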

6. After submitting the job, you can monitor logs and view metrics in the Dataproc Batches UI.
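
Alternatively, the batch can also be listed and inspected from the CLI (the batch ID below is a placeholder):

gcloud dataproc batches list --region=<gcp-project-region>
gcloud dataproc batches describe <batch-id> --region=<gcp-project-region>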

Other advanced job configuration

HISTORY_SERVER_CLUSTER: An existing Dataproc cluster to act as a Spark History Server. This property can be used to specify a dedicated server where you can view the status of running and completed Spark jobs. Example:

export HISTORY_SERVER_CLUSTER=projects/<project_id>/regions/<region>/clusters/<cluster_name>

SPARK_PROPERTIES: Use this when you need to specify Spark properties supported by Dataproc Serverless, such as adjusting the number of drivers, cores, executors, etc. It gives you more control over the Spark configuration. Example:

export SPARK_PROPERTIES=spark.executor.instances=50,spark.dynamicAllocation.maxExecutors=200


For any queries or suggestions, reach out to: dataproc-templates-support-external@googlegroups.com
