Export Snowflake query results to GCS using Dataproc Serverless
Dataproc Serverless runs batch workloads without the need to provision and manage a cluster; all infrastructure provisioning and management happens behind the scenes.
Key Benefits:
- It can dynamically scale workload resources, such as the number of executors, to run your workload efficiently.
- Fully managed and serverless solution.
- Ephemeral: resources are released once the job ends.
Dataproc Serverless Templates: Ready-to-use, open-sourced, customisable templates based on Dataproc Serverless for Spark. These templates help data engineers further simplify development on Dataproc Serverless, by consuming and customising the existing templates to fit their requirements.
In this post we will explore how to export data from a Snowflake table to GCS using Dataproc Serverless.
Snowflake to GCS template
The template reads data from a Snowflake table or a query result and writes it to a Google Cloud Storage location. It uses the Snowflake Connector for Spark, which enables Spark to read data from Snowflake.
The template allows the following parameters to be configured through the execution command:
- snowflake.gcs.sfurl: Snowflake account URL.
Format: &lt;account-identifier&gt;.snowflakecomputing.com
- snowflake.gcs.sfuser: Snowflake username
- snowflake.gcs.sfpassword: Snowflake user password
- snowflake.gcs.sfdatabase: Snowflake database name
- snowflake.gcs.sfschema: Snowflake schema name
- snowflake.gcs.sfwarehouse: Snowflake warehouse
Note:
◦ This is an optional property
◦ If not specified explicitly through the execution command, the default warehouse configured in Snowflake is used
- snowflake.gcs.autopushdown: Snowflake query pushdown feature
Note:
◦ This is an optional property
◦ If not specified explicitly through the execution command, the default value is on
- snowflake.gcs.table: Snowflake input table
- snowflake.gcs.query: Snowflake select query
Note: Exactly one of the template properties snowflake.gcs.table and snowflake.gcs.query must be provided.
- snowflake.gcs.output.location: GCS output location
Format: gs://&lt;bucket-name&gt;/&lt;dir&gt;
- snowflake.gcs.output.format: GCS output file format.
Accepted values: csv, avro, orc, json or parquet
- snowflake.gcs.output.mode: GCS output write mode.
Accepted values: Overwrite, ErrorIfExists, Append or Ignore
- snowflake.gcs.output.partitionColumn: column name to partition the GCS output by
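The table/query mutual exclusivity above can be sketched as a small pre-flight check before launching the template. The variable names and the sample query are illustrative placeholders, not part of the template itself:

```shell
# Provide exactly one of TABLE or QUERY (mirrors the snowflake.gcs.table /
# snowflake.gcs.query rule above). Values below are hypothetical.
TABLE=""
QUERY="SELECT * FROM ORDERS WHERE O_ORDERDATE >= '2024-01-01'"

if [ -n "${TABLE}" ] && [ -n "${QUERY}" ]; then
  echo "error: set only one of snowflake.gcs.table / snowflake.gcs.query" >&2
elif [ -z "${TABLE}" ] && [ -z "${QUERY}" ]; then
  echo "error: one of snowflake.gcs.table / snowflake.gcs.query is required" >&2
else
  INPUT_OK=1
  echo "input source OK"
fi
```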
Run the Template
1. Ensure the subnet you will use has Private Google Access enabled. Even if you are using the “default” VPC created by GCP, you still have to enable private access as below.
gcloud compute networks subnets update default --region=us-central1 --enable-private-ip-google-access
2. Create a GCS bucket and staging location for jar files.
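Step 2 can look like the following. The project and bucket names are hypothetical placeholders; the actual bucket creation command is shown as a comment because it requires an authenticated gcloud session:

```shell
# Hypothetical project/bucket names; substitute your own.
PROJECT=my-gcp-project
BUCKET="gs://${PROJECT}-dataproc-staging"
GCS_STAGING_LOCATION="${BUCKET}/jars"

# Creating the bucket requires an authenticated gcloud; the command would be:
#   gcloud storage buckets create "${BUCKET}" --location=us-central1
echo "Staging location: ${GCS_STAGING_LOCATION}"
```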
3. Clone git repo in a cloud shell which is pre-installed with various tools. Alternatively use any machine pre-installed with JDK 8+, Maven and Git.
git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/java
4. Authenticate the gcloud CLI
gcloud auth login
5. Execute SnowflakeToGCS template
Sample execution command:
export GCP_PROJECT=<gcp-project-id>
export REGION=<gcp-project-region>
export GCS_STAGING_LOCATION=<gcs-bucket-staging-folder-path>
export SUBNET=<gcp-project-dataproc-clusters-subnet>
bin/start.sh \
-- \
--template SNOWFLAKETOGCS \
--templateProperty snowflake.gcs.sfurl <snowflake-account-url> \
--templateProperty snowflake.gcs.sfuser <snowflake-user> \
--templateProperty snowflake.gcs.sfpassword <snowflake-user-password> \
--templateProperty snowflake.gcs.sfdatabase <snowflake-database> \
--templateProperty snowflake.gcs.sfschema <snowflake-schema> \
--templateProperty snowflake.gcs.sfwarehouse <snowflake-warehouse> \
--templateProperty snowflake.gcs.query <snowflake-select-query> \
--templateProperty snowflake.gcs.output.location <gcs-output-location> \
--templateProperty snowflake.gcs.output.format <csv|avro|orc|json|parquet> \
--templateProperty snowflake.gcs.output.mode <Overwrite|ErrorIfExists|Append|Ignore> \
--templateProperty snowflake.gcs.output.partitionColumn <gcs-output-partitionby-columnname> \
--templateProperty snowflake.gcs.autopushdown <on|off>
6. Monitor the Spark batch job. After submitting the job, you can monitor logs and view the metrics in the Dataproc Batches UI.
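Besides the Batches UI, the same information is available from the gcloud CLI. The batch ID below is a placeholder (copy the real one from the UI or from the list command); both commands require an authenticated gcloud, so they are composed here as strings for reference:

```shell
# Hypothetical batch ID and region; replace with your own values.
BATCH_ID=snowflake-to-gcs-export
REGION=us-central1

# List recent batches, then inspect one in detail:
LIST_CMD="gcloud dataproc batches list --region=${REGION}"
DESCRIBE_CMD="gcloud dataproc batches describe ${BATCH_ID} --region=${REGION}"
echo "${LIST_CMD}"
echo "${DESCRIBE_CMD}"
```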
Other advanced job configuration
HISTORY_SERVER_CLUSTER: An existing Dataproc cluster to act as a Spark History Server. Use this property to specify a dedicated server where you can view the status of running and completed Spark jobs. Example:
export HISTORY_SERVER_CLUSTER=projects/<project_id>/regions/<region>/clusters/<cluster_name>
SPARK_PROPERTIES: Use this when you need to set Spark properties supported by Dataproc Serverless, such as adjusting the number of drivers, cores, or executors, to gain more control over the Spark configuration. Example:
export SPARK_PROPERTIES=spark.executor.instances=50,spark.dynamicAllocation.maxExecutors=200
References
- Dataproc Serverless Documentation
- Dataproc Templates GitHub Repository
- Snowflake Connector for Spark
- Medium — Cloud Spanner export query results using Dataproc Serverless
For any queries or suggestions reach out to: dataproc-templates-support-external@googlegroups.com