GCS to Spanner: Dataproc Serverless Template creation using UI

Megha GD
Google Cloud - Community
Oct 19, 2023

Dataproc is a managed Spark and Hadoop service in GCP that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning.

Introduction

Dataproc Serverless can run Spark batch workloads without provisioning and managing a cluster, thereby reducing operational overhead and development time.

GCP provides a collection of pre-implemented Dataproc templates, written in Spark using Java, PySpark, and Jupyter notebooks, that serve as references and are easy to customize for developers who want to extend their functionality. Each template can be configured by passing the required arguments and then triggered as a Spark batch workload. Running one this way involves several steps: setting up environment variables, building the Maven project or Python package, and executing the template using the gcloud CLI, as sketched below.
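
For comparison, here is a minimal sketch of that CLI flow for the Java GCSToSpanner template, following the conventions of the GoogleCloudPlatform/dataproc-templates repository. All resource names are placeholders, and the template property keys should be verified against the template's README for your version:

```bash
# Environment expected by the dataproc-templates launcher (placeholder values).
export GCP_PROJECT=my-project
export REGION=us-central1
export SUBNET=projects/my-project/regions/us-central1/subnetworks/my-subnet
export GCS_STAGING_LOCATION=gs://my-staging-bucket

git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/java

# start.sh builds the Maven project and submits a Dataproc Serverless batch.
bin/start.sh -- --template GCSTOSPANNER \
  --templateProperty project.id="${GCP_PROJECT}" \
  --templateProperty gcs.spanner.input.location=gs://my-bucket/data/ \
  --templateProperty gcs.spanner.input.format=parquet \
  --templateProperty gcs.spanner.output.instance=my-spanner-instance \
  --templateProperty gcs.spanner.output.database=my-db \
  --templateProperty gcs.spanner.output.table=employee \
  --templateProperty gcs.spanner.output.primaryKey=id \
  --templateProperty gcs.spanner.output.saveMode=Append
```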

Now GCP has made it even easier: you can set up a Dataproc Serverless job without writing a single line of code. Using the Google Cloud console UI, you simply select the required template from the drop-down menu and provide the necessary arguments; GCP takes care of the necessary infrastructure in the background and builds the required code from the template in the backend.

Objective

This blog post walks through, end to end, how to use the "Create from template" option in the Google Cloud console UI for Dataproc Serverless to set up a batch job that imports data from GCS to Spanner, just by selecting the required option in the drop-down menu. You can check this blog post to run the same template using the gcloud CLI and see the difference in the mode of execution.

Prerequisites

Set up your GCP infrastructure, storage bucket configuration, and Spanner table; a gcloud sketch of these steps follows the list below.

  • Log in to your GCP project and enable the Dataproc API (if it is disabled).
  • Make sure the subnet has Private Google Access enabled.
  • Create a GCS bucket and upload the sample data in Avro, Parquet, or ORC format.
  • Create a Spanner instance, a Spanner database, and the destination table.
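
A minimal gcloud sketch of these prerequisites; every resource name, region, and the table schema below is a hypothetical placeholder:

```bash
# Enable the Dataproc API (no-op if it is already enabled).
gcloud services enable dataproc.googleapis.com

# Turn on Private Google Access for the subnet the batch job will use.
gcloud compute networks subnets update my-subnet \
  --region=us-central1 --enable-private-ip-google-access

# Create a GCS bucket and upload sample data (Avro/Parquet/ORC).
gcloud storage buckets create gs://my-bucket --location=us-central1
gcloud storage cp employee.parquet gs://my-bucket/data/

# Create the Spanner instance, database, and destination table.
gcloud spanner instances create my-spanner-instance \
  --config=regional-us-central1 --nodes=1 --description="Demo instance"
gcloud spanner databases create my-db --instance=my-spanner-instance
gcloud spanner databases ddl update my-db --instance=my-spanner-instance \
  --ddl='CREATE TABLE employee (id INT64, name STRING(100)) PRIMARY KEY (id)'
```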

Configure the Template

To create a batch job that imports data from GCS to Spanner and configure it with the required arguments, navigate to Dataproc → Serverless → Batches in the Cloud console and click the "Create from template" option.

1. Select the required template from the drop-down menu, in our case GCS to Spanner.

2. Once you select the template, the UI renders a form below it where you provide the required parameters.

3. Additionally, you can set the save mode and the batchInsertSize as per your requirements, just like the arguments you would pass to the template on the CLI.

4. Monitor the Spark batch job. After submitting it, you will be able to see it in the Dataproc Batches UI, where you can check both its metrics and logs.

5. Once the job succeeds, query the Spanner table to check the row count and data, as sketched below.
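
A quick way to verify the load from the command line, reusing the placeholder names from the prerequisites sketch:

```bash
# Check that the expected number of rows landed in the destination table.
gcloud spanner databases execute-sql my-db \
  --instance=my-spanner-instance \
  --sql='SELECT COUNT(*) AS row_count FROM employee'

# Spot-check a few rows.
gcloud spanner databases execute-sql my-db \
  --instance=my-spanner-instance \
  --sql='SELECT * FROM employee LIMIT 10'
```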

A few things to consider

  • Input GCS path location: provide the absolute path where the files are placed; you can also use a wildcard pattern such as *.parquet. Supported formats are Avro, Parquet, and ORC.
  • Output Spanner batch insert size: you can fine-tune the insert batch size for the GCSToSpanner job.
  • History server cluster: you can provide an existing Dataproc cluster to act as a Spark History Server, which stores logs and lets you view the status of running and completed Spark jobs.
  • Properties: use this section when you need to specify Spark properties supported by Dataproc Serverless, such as the number of drivers, cores, and executors, to gain more control over the Spark configuration. For example, to improve execution speed by increasing parallelism, set a higher number of executors: under Properties, add the keys spark.executor.instances and spark.dynamicAllocation.maxExecutors with the values 50 and 200 respectively, as shown below.
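
For reference, the same tuning expressed as a gcloud flag when submitting a batch from the CLI; the two property keys are standard Spark settings and the values are purely illustrative:

```bash
# Increase parallelism for a Dataproc Serverless batch
# (other required template arguments omitted for brevity).
gcloud dataproc batches submit spark \
  --region=us-central1 \
  --properties='spark.executor.instances=50,spark.dynamicAllocation.maxExecutors=200'
```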

Advantages / Usage

  • Simplified development: developers can create and manage batch jobs without writing any code, which can significantly reduce development time and effort.
  • Reusability: developers can focus on more complex tasks; if business logic needs to be incorporated into a template, they can use it as a foundation, clone the Git repository, and add custom logic to it.
  • Ad hoc use cases: for example, a business analyst who wants to push a lot of large files from GCS into Spanner can use this UI-based feature to load them without having to write any code.

This feature is currently in preview and is expected to reach General Availability (GA) soon.

Thank you for taking the time to go through this blog.
