Using Dataproc Serverless to migrate your Dataplex GCS data to BigQuery

Vasu Mittal
Google Cloud - Community
4 min read · Dec 14, 2022

We can use Dataproc Serverless to run Spark batch workloads without provisioning and managing our own cluster. We can specify workload parameters, and then submit the workload to the Dataproc Serverless service.

Dataproc Serverless takes the whole job of infrastructure management off the user's plate: to execute Apache Spark workloads, users are not required to create a cluster first. They can simply pick the template that fits their use case and run their job with a few clicks and commands.

Objective

This blog post shares complete end-to-end details on how you can use the "Dataplex GCS to BigQuery" Dataproc Serverless template for data migration. This template incrementally moves data from Dataplex GCS tables to BigQuery: it identifies new partitions in Dataplex GCS and loads them into BigQuery.

Note: if the Dataplex GCS table has no partitions, the whole table will be read from GCS and the target BigQuery table will be overwritten.

Set up your GCP Project and Infra

  1. Log in to your GCP project and enable the Dataproc API (if it is disabled); it can also be enabled from the CLI, as shown in the snippet after step 2.
  2. Make sure the subnet has Private Google Access enabled. Even if you are using the "default" VPC network generated by GCP, you will still have to enable Private Google Access, as shown below:
gcloud compute networks subnets update default --region=us-central1 --enable-private-ip-google-access
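The following is a minimal sketch (using the same default subnet and region as above) that enables the Dataproc API from the CLI and then verifies that Private Google Access is turned on:

# Enable the Dataproc API for the current project (a no-op if it is already enabled)
gcloud services enable dataproc.googleapis.com

# Confirm Private Google Access is enabled on the subnet (should print True)
gcloud compute networks subnets describe default --region=us-central1 --format="get(privateIpGoogleAccess)"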

3. Create a GCS bucket and staging location for jar files.

export GCS_STAGING_BUCKET="my-gcs-staging-bucket"
gsutil mb gs://$GCS_STAGING_BUCKET
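As an optional sanity check, confirm the bucket was created:

gsutil ls -b gs://$GCS_STAGING_BUCKET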

4. To configure the Dataproc Serverless job, you need to export the following variables:

GCP_PROJECT : GCP project id to run Dataproc Serverless on.

REGION : Region to run Dataproc Serverless in.

GCS_STAGING_LOCATION : GCS staging bucket location, where Dataproc will store staging assets (See Step 3).

Steps to execute the Dataproc Template

  1. Clone the Dataproc Templates repository and navigate to the Java template folder.
git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/java

2. Get authentication credentials (to submit the job).

gcloud auth application-default login

3. Configure the Dataproc Serverless job by exporting the variables needed for submission (as explained in Step 4 of "Set up your GCP Project and Infra").

export GCP_PROJECT=<project_id> # your Google Cloud project
export REGION=<region> # your region for ex: us-central1
export SUBNET=<subnet> # optional if you are using default
# export GCS_STAGING_LOCATION=<gcs-staging-bucket-folder> # already done in Step 3 (under "Set up your GCP Project and Infra")

4. Download the jar and properties files to the staging location.

export GCS_STAGING_LOCATION=gs://bucket/path/to/staging/folder/
gsutil -u <billing-project-id> cp gs://dataplex-dataproc-templates-artifacts/dataproc-templates-1.0-SNAPSHOT.jar ${GCS_STAGING_LOCATION}
gsutil -u <billing-project-id> cp gs://dataplex-dataproc-templates-artifacts/log4j-spark-driver-template.properties ${GCS_STAGING_LOCATION}
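Optionally, verify that both artifacts landed in the staging location:

gsutil ls ${GCS_STAGING_LOCATION}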

5. Gather the values for the following parameters:

project.id : ID of the GCP project where the target BigQuery dataset and custom SQL file are located.

dataplex.gcs.bq.target.dataset : Name of the target BigQuery dataset where the Dataplex GCS asset will be migrated.

gcs.bigquery.temp.bucket.name : The GCS bucket that temporarily holds the data before it is loaded to BigQuery.

dataplex.gcs.bq.save.mode : Specifies how to handle existing data in BigQuery (if present). Can be any of the following: errorifexists, append, overwrite, ignore. Defaults to errorifexists.

dataplex.gcs.bq.incremental.partition.copy : Specifies whether the template should copy only new partitions or all partitions. If set to no, all partitions are copied and any existing partitions found in BigQuery are overwritten. Can be either yes or no. Defaults to yes.

6. Supply the right arguments:

--dataplexEntity : Dataplex GCS table to be loaded to BigQuery
Example: --dataplexEntity "projects/{project_number}/locations/{location_id}/lakes/{lake_id}/zones/{zone_id}/entities/{entity_id_1}"

--partitionField : If the field is specified together with partitionType, the table is partitioned by this field. This field should be a top-level TIMESTAMP or DATE field.

--partitionType : Supported types are: HOUR, DAY, MONTH, YEAR

--targetTableName : Specifies the name of the table where data will be written. If this argument is not specified, the name of the entity will be used as the table name.

7. Execute the following command:

gcloud dataplex tasks create <task-id> \
--project=<project-id> \
--location=<region> \
--vpc-sub-network-name=<subnet> \
--lake=<dataplex-lake> \
--trigger-type=ON_DEMAND \
--execution-service-account=<execution service account> \
--spark-main-class="com.google.cloud.dataproc.templates.main.DataProcTemplate" \
--spark-file-uris="${GCS_STAGING_LOCATION}log4j-spark-driver-template.properties" \
--container-image-java-jars="${GCS_STAGING_LOCATION}dataproc-templates-1.0-SNAPSHOT.jar" \
--execution-args=^::^TASK_ARGS="--template=DATAPLEXGCSTOBQ,\
--templateProperty=project.id=<project-id>,\
--templateProperty=dataplex.gcs.bq.target.dataset=<dataset_name>,\
--templateProperty=gcs.bigquery.temp.bucket.name=<temp-bucket-name>,\
--templateProperty=dataplex.gcs.bq.save.mode=append,\
--templateProperty=dataplex.gcs.bq.incremental.partition.copy=yes,\
--dataplexEntity=projects/{project_number}/locations/{location_id}/lakes/{lake_id}/zones/{zone_id}/entities/{entity_id_1},\
--partitionField=<partition_field>,\
--partitionType=<DAY>,\
--targetTableName=<table_name>,\
--customSqlGcsPath=<gs://bucket/path/to/custom_sql.sql>"

Note: Optionally, a custom SQL statement can be provided to filter the data that will be copied to BigQuery. The template reads the custom SQL string from a GCS file; the path to this file must be provided with the --customSqlGcsPath option. The custom SQL must reference __table__ in the FROM clause, as shown in the following example:

SELECT 
col1, col2
FROM
__table__
WHERE
id > 100
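For example, assuming the filter above is saved locally as custom_sql.sql (the bucket and path below are placeholders matching the --customSqlGcsPath value in the command above), it can be uploaded with:

# Upload the custom SQL file to the GCS path passed via --customSqlGcsPath
gsutil cp custom_sql.sql gs://bucket/path/to/custom_sql.sql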

Also, in case you need to specify Spark properties supported by Dataproc Serverless (for example, to adjust the number of drivers, cores, executors, etc.), you can edit the OPT_PROPERTIES values in the start.sh file.
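As a rough sketch (the exact OPT_PROPERTIES syntax in start.sh may differ; the property names below are standard Spark settings supported by Dataproc Serverless), such an edit could look like:

# In dataproc-templates/java/start.sh: tune driver/executor resources for the batch
OPT_PROPERTIES="--properties=spark.driver.cores=4,spark.executor.cores=4,spark.executor.instances=10"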

8. Monitor the Spark batch job

After submitting the job, we will be able to see it in the Dataproc Batches UI. From there, we can view both metrics and logs for the job.
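If you prefer the CLI over the UI, the same batch can also be inspected with gcloud (the region and project values are placeholders):

# List recent Dataproc Serverless batches in the region
gcloud dataproc batches list --region=<region> --project=<project-id>

# Show the details and status of a specific batch
gcloud dataproc batches describe <batch-id> --region=<region> --project=<project-id>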

References

For any queries/suggestions please reach out to: dataproc-templates-support-external@googlegroups.com
